Forum: Building VoltDB Applications

Post: Partitioning scheme IoT

Partitioning scheme IoT
Oct 23, 2016
Not sure if this belongs here or in Architecture..

I've read the docs, but I'm struggling to understand this.
Use case:
IoT multi-tenant SaaS
20,000 clients
300,000 devices transmitting complex data every 30 seconds.

The candidate partition keys are clientId, deviceId, or a combined column (clientId:deviceId); date/time would not work because it would create a "hot" partition.
If I run ten 8-core machines I'd have about 60 partitions available, and that isn't anywhere near enough for 20,000 clients.
I know there is no ranged partitioning but there must be a way to handle this...

What am I missing here?
Oct 24, 2016
You may be thinking that you need at least 20,000 partitions if you were to partition by clientId. You don't. Think of the partitions as buckets. If you chose clientId as the partitioning key, each clientId value would be assigned (using a consistent hashing algorithm) to one of the partitions, but many other clientIds would share the same partition. You don't know which ids land in which partition (it's somewhat arbitrary), but you can be sure that for any given clientId value ALL of the records will be in the same partition. That allows you to join different tables that are partitioned on the same key, or to query and aggregate multiple records associated with the same value. Likewise, if you chose deviceId, all the data for a given device would be in the same partition.
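The bucketing idea can be sketched in a few lines of Python. This is only an illustration: VoltDB uses its own internal consistent-hashing scheme, not the SHA-256 stand-in below, and the partition count of 60 is just the hypothetical cluster size from the question.

```python
import hashlib

NUM_PARTITIONS = 60  # hypothetical: the ~60-partition cluster from the question

def partition_for(client_id: int) -> int:
    """Map a clientId to one of a small, fixed set of partitions.

    Stand-in hash for illustration only; VoltDB's real hashing differs.
    """
    digest = hashlib.sha256(str(client_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# 20,000 clients all land somewhere in the 60 buckets,
# so every partition ends up holding many clientIds...
partitions = {partition_for(cid) for cid in range(20_000)}
print(len(partitions))

# ...and the same clientId always maps to the same partition,
# which is what keeps all of one client's rows together.
print(partition_for(12345) == partition_for(12345))
```

The key property is the second one: determinism. Many keys per bucket is fine; what matters is that a given key never moves.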

A common problem with range-based partitioning is, as you mentioned, hot spots, as well as the need to maintain and rebalance the ranges. Hashing tends to distribute the data more evenly, so you don't need to actively manage the partitioning.
Oct 25, 2016
Aha, thank you so much. I knew I was missing something.
Is this hashed key mentioned in the docs?
Also, does this mean the other tables (device config, users, etc.) must either be replicated or partitioned by the same key? Otherwise I may end up with multi-partition queries?
Oct 25, 2016
Yes, there's a section on partitioning in the docs here:

If you only need to do single table lookups or single table operations, you can choose whatever keys work best for even distribution.

If you need to join tables in a query or access them within the same transaction, they need to either be replicated or share a common partition key. The name of the column doesn't matter, but the values need to be the same. Sometimes it helps to denormalize the schema somewhat in order to have a common partition key shared by various tables that work together.

For example, suppose you have three tables: CUSTOMERS, ORDERS, and LINE_ITEMS, and you typically join the first two by CUSTOMER_ID and the last two by ORDER_ID. To make this work in VoltDB, especially if you have many customers, you might add a CUSTOMER_ID column to the LINE_ITEMS table. Third normal form would say that's unnecessary, but it provides a common partitioning key, and adding one more numeric column is simple and doesn't take much memory. Then, as long as you have the CUSTOMER_ID as input, you can access all of a customer's orders and associated line items within a single partition.

On the other hand, if you rarely have the CUSTOMER_ID, and most interactions have only the ORDER_ID as input (stock market orders, for example, may be much more order-centric), you might partition the last two tables by ORDER_ID and use a separate transaction when you need to look up the CUSTOMER record. Or, if you have thousands rather than millions of customers, you might make CUSTOMERS a replicated table.
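That denormalization can be sketched in VoltDB DDL. The table and column names here are the hypothetical ones from the example above, not a prescribed schema; note the primary keys include the partitioning column, since unique indexes on partitioned tables must contain it.

```sql
CREATE TABLE customers (
  customer_id BIGINT NOT NULL,
  name        VARCHAR(100),
  PRIMARY KEY (customer_id)
);
PARTITION TABLE customers ON COLUMN customer_id;

CREATE TABLE orders (
  customer_id BIGINT NOT NULL,
  order_id    BIGINT NOT NULL,
  PRIMARY KEY (customer_id, order_id)
);
PARTITION TABLE orders ON COLUMN customer_id;

CREATE TABLE line_items (
  customer_id  BIGINT NOT NULL,  -- denormalized copy, added for co-partitioning
  order_id     BIGINT NOT NULL,
  line_item_id BIGINT NOT NULL,
  PRIMARY KEY (customer_id, order_id, line_item_id)
);
PARTITION TABLE line_items ON COLUMN customer_id;
```

With all three tables partitioned on CUSTOMER_ID, a stored procedure partitioned on that same value can join or modify them in a single-partition transaction.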