Forum: Building VoltDB Applications

Post: Maximum/optimal number of partitions?

DennisP
May 10, 2013
I'm wondering if there is a maximum or optimal number of partitions for VoltDB?

I can either partition my data by year, which would result in 20 partitions (there are 20 years' worth of data in my database), or I can partition by year and customerID, which would result in about 2000 partitions (100 customers * 20 years). All my stored procedures use year and customerID, so they would all be single-partitioned.

Which case would give the best performance in terms of throughput and latency? Can VoltDB even support that many partitions (and likely more in the future)?
bballard
May 10, 2013
The number of partitions is actually a function of hostcount, sitesperhost, and kfactor, which are set in the deployment file. For example, if you had 3 hosts, 12 sitesperhost, and kfactor=1, there would be 36 total "sites". With kfactor=1 there are 2 copies of each partition, so there would be 18 unique partitions.
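
The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not anything VoltDB ships; the function name `partition_count` is hypothetical, and the divisibility check is an assumption about how sites are shared among copies.

```python
def partition_count(hostcount: int, sitesperhost: int, kfactor: int) -> int:
    """Return the number of unique partitions for a cluster layout."""
    total_sites = hostcount * sitesperhost  # every execution site in the cluster
    copies = kfactor + 1                    # kfactor=1 means 2 copies of each partition
    if total_sites % copies != 0:
        # Assumption: the sites must divide evenly among the copies.
        raise ValueError("total sites must be divisible by kfactor + 1")
    return total_sites // copies


# The example from this post: 3 hosts, 12 sitesperhost, kfactor=1.
print(partition_count(hostcount=3, sitesperhost=12, kfactor=1))  # 18
```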

Your data is spread across these partitions using a deterministic function, so the database knows which partition should store each record, or run each stored procedure call, based on the partition key value or the input parameter value. So you can have an unlimited cardinality of values, but the number of partitions will not change until you modify the deployment of the cluster.
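
As a rough illustration of what "deterministic function" means here, the sketch below maps any key value onto a fixed number of partitions. The hash used (SHA-256 reduced modulo the partition count) is a stand-in chosen for the example and is not VoltDB's actual hashing scheme; `partition_for` is a hypothetical name.

```python
import hashlib


def partition_for(key, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    # Stand-in hash, NOT VoltDB's real function: the only property that
    # matters for this sketch is that the same key always gives the same result.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Any number of distinct key values still lands on the same 18 partitions.
for key in (1994, 2013, "customer-42"):
    print(key, "->", partition_for(key, 18))
```

Calling it twice with the same key returns the same partition, which is what lets a record and the procedure call that touches it be routed to the same place.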

With a cardinality of only 20, year alone may not spread the data as evenly as the other approach. If you had >20 partitions, some would be unused, and if you had <20 partitions, some would hold 2 years of data while others would hold 1 year. You might also consider using customerID alone, or combining the two columns into a single partition key as you mentioned.
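
To put numbers on that, here is a small sketch of the best case, assuming the distinct key values are dealt out to partitions as evenly as possible. A real hash function can only do as well as this or worse. The helper `ideal_spread` is hypothetical, and the 18 and 36 partition counts are just example cluster sizes.

```python
def ideal_spread(distinct_keys: int, num_partitions: int) -> list:
    """Best-case count of distinct key values held by each partition."""
    base, extra = divmod(distinct_keys, num_partitions)
    # The first `extra` partitions carry one more key value than the rest.
    return [base + 1 if i < extra else base for i in range(num_partitions)]


# 20 years on 18 partitions: some partitions hold twice as much as others.
year_only = ideal_spread(20, 18)
print(max(year_only), min(year_only))  # 2 1

# 20 years on 36 partitions: 16 partitions hold nothing at all.
print(ideal_spread(20, 36).count(0))   # 16

# 2000 (customerID, year) combinations on 18 partitions: nearly even.
combined = ideal_spread(2000, 18)
print(max(combined), min(combined))    # 112 111
```

This counts distinct key values only. If some years or customers have far more rows than others, the actual data volume per partition can still be uneven.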
DennisP
May 10, 2013
Ah, ok. Thank you for the clear explanation. I will do some benchmarking to determine the best partition key for my data.
aweisberg
May 10, 2013
Hi Dennis,

For your use case, where you always query by customer + year, you would be best off partitioning on that combination. That will expose more parallelism and allow for better data distribution without introducing any distributed transactions.

Ariel