Forum: Managing VoltDB

Post: VoltDB sharding key hash function

VoltDB sharding key hash function
seo01
Sep 28, 2010
I want a two-node VoltDB install. The column I plan to shard on is our client ID. At the moment, two very large clients make up the majority of our data, and I want them to end up on different nodes. Can someone point me toward the hash function for the sharding key in the source code so I can double-check that this will happen? The data type of this column is VARCHAR(16).
More Info
jhugg
Sep 28, 2010
As for the partitioning...

First, here is the code:

https://source.voltdb.com/browse/Engineering/trunk/src/frontend/org/volt...
You want to look at the hashinate() method.

If you use an integer for the partition key, then we simply use a modulo hash to assign values to partitions (for now). In a two partition cluster, odd values go to one partition and even values go to another partition. If you have more partitions, you can predict which values go where based on simple modulo math.
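To make the modulo rule concrete, here is a minimal sketch. It is not the hashinate() method itself: the class and method names are hypothetical, and the handling of negative keys (floorMod) is an assumption, since the post only says "simple modulo".

```java
final class ModuloPartitionSketch {
    // Hypothetical helper, NOT VoltDB's hashinate(): plain modulo on the key.
    static int partitionFor(long key, int partitionCount) {
        // ASSUMPTION: floorMod keeps the result in [0, partitionCount) for
        // negative keys; the thread does not say how VoltDB treats them.
        return (int) Math.floorMod(key, (long) partitionCount);
    }

    public static void main(String[] args) {
        // With two partitions, even keys map to 0 and odd keys map to 1.
        for (long key = 0; key < 6; key++) {
            System.out.println(key + " -> partition " + partitionFor(key, 2));
        }
    }
}
```

With partitionCount set to 2 this reproduces the odd/even split described above; with 4 partitions, key 10 would map to partition 2.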

For strings, we convert the string to its UTF-8 encoded bytes and mash them together, then take the result modulo the number of partitions. This makes it much harder to guarantee that two different values will map to different partitions. If you know the two values in advance, you could try the math out and see.
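"Trying the math out" could look roughly like the sketch below. The UTF-8 conversion and the final modulo follow the description above, but the combining step (31 * h + byte) is only a stand-in, because the post does not say how the bytes are mashed together. To get a real answer, replace it with the logic from hashinate(). The class name, method names, and client IDs are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class StringPartitionSketch {
    // Hypothetical helper: UTF-8 bytes -> combined int -> modulo.
    static int partitionFor(String key, int partitionCount) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        int h = 0;
        for (byte b : utf8) {
            // ASSUMPTION: stand-in combining step, not VoltDB's actual one.
            h = 31 * h + (b & 0xff);
        }
        // floorMod keeps the result non-negative even if h overflowed.
        return Math.floorMod(h, partitionCount);
    }

    // True when the two keys would be assigned to different partitions.
    static boolean landOnDifferentPartitions(String a, String b, int partitionCount) {
        return partitionFor(a, partitionCount) != partitionFor(b, partitionCount);
    }

    public static void main(String[] args) {
        // Hypothetical client IDs; substitute the two large clients' real IDs.
        String first = "ACME";
        String second = "GLOBEX";
        System.out.println(first + " -> " + partitionFor(first, 2));
        System.out.println(second + " -> " + partitionFor(second, 2));
        System.out.println("different partitions: "
                + landOnDifferentPartitions(first, second, 2));
    }
}
```

The check only tells you something about VoltDB once the combining step matches the real code, so treat the output of this version as illustrative.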

So if you have two partitions, and one partition per node, then it's easy to tell which node a customer will map to. If you have more partitions than nodes, it's trickier.

In the future, we would like to give the user more control over how values are mapped to partitions, but for now, if the current method doesn't meet your needs, we suggest creating a map from the complex type to integers yourself and partitioning on the integers.
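One way to build such a map is sketched below, assuming the application assigns the integer keys itself. The class name and client IDs are hypothetical, and a real application would persist the mapping (for example in a lookup table) rather than keep it in memory. Registering the two large clients first gives them keys 0 and 1, which a two-partition modulo scheme sends to different partitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClientKeyMapSketch {
    // Hypothetical application-side map: client ID string -> integer partition key.
    private final Map<String, Integer> keys = new LinkedHashMap<>();
    private int nextKey = 0;

    // Returns the existing key for a client, or assigns the next unused integer.
    int keyFor(String clientId) {
        Integer existing = keys.get(clientId);
        if (existing != null) {
            return existing;
        }
        int assigned = nextKey++;
        keys.put(clientId, assigned);
        return assigned;
    }

    public static void main(String[] args) {
        ClientKeyMapSketch map = new ClientKeyMapSketch();
        // Register the two large clients first so they receive keys 0 and 1.
        int bigA = map.keyFor("BIG-CLIENT-A");
        int bigB = map.keyFor("BIG-CLIENT-B");
        int small = map.keyFor("SMALL-CLIENT");
        System.out.println("BIG-CLIENT-A key " + bigA + ", partition " + (bigA % 2));
        System.out.println("BIG-CLIENT-B key " + bigB + ", partition " + (bigB % 2));
        System.out.println("SMALL-CLIENT key " + small + ", partition " + (small % 2));
    }
}
```

The table would then be partitioned on the integer key column instead of the VARCHAR(16) client ID.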

As for the two nodes...

Note that VoltDB partitions on a CPU core level. A machine with 4 cores might have 4 partitions of data (user configurable). If you put one partition on each box, then you are probably not effectively using all the cores in your machine. You may want to consider running two partitions per box and using the second box for redundancy.

Another thing to consider when the cardinality of your partition column is low: you can't usefully scale the database beyond as many nodes as there are distinct partition key values.

Still, none of this may matter. VoltDB is very very fast and you may have lots of headroom on a single machine with a redundant backup node. We encourage you to try it out and see how it performs for your app. Let us know if you have any questions along the way.

==
EDIT: We've since moved our code to GitHub. The new location of the hashing function is here: https://github.com/VoltDB/voltdb/blob/master/src/frontend/org/voltdb/The...
Are VARCHARs stored internally as Strings?
monster
Oct 2, 2010
I looked at the hashing code you linked to and saw that the key can be either a Number or a String, and that a String is converted to UTF-8 to find the hash value. It is unclear to me whether that code is meant to run on the server or the client. So I was wondering whether you store VARCHARs internally as Strings in memory. Storing them as UTF-8 byte[] would probably save about 50% of the memory for "bigger" Strings, since a String is backed by a char[] and a char is two bytes, while most chars in Strings are going to be 7-bit ASCII anyway.
String storage format
rbetts
Oct 2, 2010
At the storage layer, VARCHARs are stored as a char[] of UTF-8 encoded data, preceded by its length in bytes.
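As an illustration of what a length-prefixed UTF-8 layout means, here is a minimal Java sketch. It is not VoltDB's storage code: the class name is hypothetical, and the 4-byte width and byte order of the length prefix are assumptions, since the thread does not specify them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedUtf8Sketch {
    // ASSUMPTION: a 4-byte big-endian length prefix; the real width is not given.
    static byte[] encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + utf8.length);
        buf.putInt(utf8.length); // length in bytes, not in characters
        buf.put(utf8);
        return buf.array();
    }

    // Reads the length prefix, then decodes exactly that many UTF-8 bytes.
    static String decode(byte[] stored) {
        ByteBuffer buf = ByteBuffer.wrap(stored);
        int length = buf.getInt();
        return new String(stored, buf.position(), length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String clientId = "CLIENT-000000042"; // hypothetical 16-character ASCII ID
        byte[] stored = encode(clientId);
        int utf8Payload = stored.length - Integer.BYTES;
        int utf16Payload = clientId.length() * Character.BYTES;
        System.out.println("UTF-8 payload bytes:  " + utf8Payload);
        System.out.println("UTF-16 payload bytes: " + utf16Payload);
        System.out.println("round trip ok: " + decode(stored).equals(clientId));
    }
}
```

For a 16-character ASCII value, the UTF-8 payload is 16 bytes, versus 32 bytes for the same text held as two-byte Java chars, which is the roughly 50% saving raised in the question.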