Forum: Building VoltDB Applications

Post: VoltDB bulk loader

gideon caller
Oct 14, 2015
Hi everyone,

I'm currently working on a Java application using Spark Streaming, where each Spark worker saves about 30-40k rows into VoltDB.
I've already seen this thread regarding VoltDB bulk insert. I recently came across the VoltDB bulk loader and was wondering whether I can use it for bulk insertion as well, and whether it is better than just inserting in a tight loop. I'm asking because the class is not well documented, and I couldn't tell whether it performs better or worse than the tight-loop approach.

Thanks in advance :)
Oct 14, 2015
Yes, the VoltDB bulk loader was added to increase the speed of CSV loading into VoltDB.

Rather than create one single-partition insert transaction per row, it performs partitioning client side. It collects buffers of rows for each partition and then inserts them in larger single-transaction batches. This reduces transactional overhead and some wire-size overhead.
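To make that concrete, here is a minimal, self-contained sketch of the batching idea (this is not the actual VoltBulkLoader code; the partition count, batch size, and hash-based routing are simplified stand-ins for VoltDB's real partitioning):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: rows are grouped into per-partition buffers on the
// client, and each buffer is flushed as one multi-row batch instead of
// issuing one transaction per row.
class PartitionBatcher {
    static final int PARTITIONS = 4;  // assumed partition count
    static final int BATCH_SIZE = 3;  // tiny for illustration; real batches are much larger

    final List<List<Object[]>> buffers = new ArrayList<>();
    final List<List<Object[]>> flushed = new ArrayList<>();  // stands in for sending a batch transaction

    PartitionBatcher() {
        for (int i = 0; i < PARTITIONS; i++) buffers.add(new ArrayList<>());
    }

    // Route a row to its partition's buffer; flush when the buffer fills.
    void insertRow(Object partitionKey, Object... fields) {
        int p = Math.floorMod(partitionKey.hashCode(), PARTITIONS);
        buffers.get(p).add(fields);
        if (buffers.get(p).size() >= BATCH_SIZE) flush(p);
    }

    void flush(int p) {
        if (!buffers.get(p).isEmpty()) {
            flushed.add(new ArrayList<>(buffers.get(p)));  // one "transaction" per batch
            buffers.get(p).clear();
        }
    }

    // Flush all remaining partial buffers (analogous to the loader's drain()).
    void drain() {
        for (int p = 0; p < PARTITIONS; p++) flush(p);
    }
}
```

The point is that N rows produce roughly N / BATCH_SIZE transactions per partition instead of N transactions total, which is where the overhead savings come from.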

If you have enough rows that can share one instance of a VoltBulkLoader, performance can be significantly faster. If you are tearing down and rebuilding a VoltBulkLoader for each Spark Streaming batch of 30-40k rows, I'm not sure how much benefit there would be. It might be a bit faster or it might not.

One thing to be aware of is that error handling is more complex with VoltBulkLoader. A single insert can fail due to, for example, a unique constraint violation, and this causes the whole batch insert to abort and roll back, even if many of the other rows could have been inserted just fine. VoltBulkLoader handles this by retrying failed batches on a row-by-row basis. If you have a lot of failures, performance can be impacted.
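The retry strategy can be sketched like this (a simplified stand-in, not VoltDB code: a Set plays the role of a table with a unique column, and a boolean return plays the role of a rollback):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of batch-then-retry: a batch insert is all-or-nothing, so one
// constraint violation rolls back the whole batch; the loader then falls
// back to inserting that batch row by row, so only the genuinely bad rows
// are rejected.
class BatchRetry {
    final Set<Integer> table = new HashSet<>();  // stands in for a table with a unique column

    // All-or-nothing batch insert: returns false ("rolled back") if any row conflicts.
    boolean insertBatch(List<Integer> rows) {
        for (int r : rows) if (table.contains(r)) return false;
        table.addAll(rows);
        return true;
    }

    // Fallback path: if the batch fails, retry each row individually
    // and report only the rows that still fail.
    List<Integer> insertWithRetry(List<Integer> rows) {
        List<Integer> failed = new ArrayList<>();
        if (insertBatch(rows)) return failed;  // fast path: whole batch succeeded
        for (int r : rows) {
            if (table.contains(r)) failed.add(r);
            else table.add(r);
        }
        return failed;
    }
}
```

This is why a high failure rate hurts: every failed batch costs one rejected batch round-trip plus one round-trip per row in that batch.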

You may want to look at the code for the CSV loader for an example use of VoltBulkLoader.
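For reference, a minimal (untested here) sketch of what using the VoltBulkLoader client API looks like; the host, table name, batch size, and column values below are placeholders:

```java
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.VoltBulkLoader.BulkLoaderFailureCallBack;
import org.voltdb.client.VoltBulkLoader.VoltBulkLoader;

public class BulkLoadExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");  // placeholder host

        // Invoked for rows that still fail after the row-by-row retry.
        BulkLoaderFailureCallBack onFailure = (rowHandle, fieldList, response) ->
                System.err.println("Row failed: " + rowHandle + " - " + response.getStatusString());

        // Table name and max batch size are placeholders.
        VoltBulkLoader loader = client.getNewBulkLoader("MYTABLE", 500, onFailure);

        for (int i = 0; i < 40000; i++) {
            // First argument is an opaque row handle echoed back on failure.
            loader.insertRow(i, i, "value-" + i);
        }

        loader.drain();  // flush any partially filled batches
        loader.close();
        client.close();
    }
}
```

Note that insertRow returns immediately; rows are batched and sent asynchronously, so you must drain() before tearing the loader down or rows can be lost.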

Let us know if you have more questions. Also, why go through Spark Streaming into VoltDB rather than ingesting into VoltDB directly? What is Spark Streaming doing for you that VoltDB can't do on its own?
gideon caller
Oct 18, 2015
First, thanks for answering so quickly.
Second, I'm using Spark Streaming because I want distributed reads from Kafka and distributed writes to VoltDB (I don't want to coordinate the distribution myself by juggling Kafka offsets across VoltDB nodes). Also, I'm not quite sure how to actually use the VoltDB Kafka importer. Could you share a link?

Thanks again :)
Oct 18, 2015
You should check out the VoltDB Kafka Importer. If it fits your use case (one topic => one procedure, for now), it will manage offsets, failure semantics, and M-to-N clustering for you.

Our standalone "kafkaloader" has been around for a while, but the importer is something we released in September, IIRC. The docs are a bit thin, but see section 15.12, "Understanding Import", in the user guide; the Kafka-specific details are at the bottom.
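To give a feel for it, the importer is configured in the deployment file with something roughly like the following (the broker address, topic, and procedure name are placeholders; check the section above for the exact property names):

```xml
<import>
  <configuration type="kafka" enabled="true">
    <!-- placeholders: point these at your brokers, topic, and insert procedure -->
    <property name="brokers">kafka-broker:9092</property>
    <property name="topics">mytopic</property>
    <property name="procedure">MYTABLE.insert</property>
  </configuration>
</import>
```

With this in place, the cluster itself consumes the topic and applies each message via the named procedure, with no external loader process to run.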

Let us know if you have more questions.