Forum: Building VoltDB Applications

Post: Volt vs Postgres

Bob Larrick
Jun 10, 2013
I've been tasked with exploring VoltDB for work, where we hope to use it to help us do ad-hoc queries on the daily logs from our applications.

To see if it was worth further investigation, I loaded both VoltDB and Postgres with a fraction (approximately 2.7GB) of our log data, which was in CSV format.
All tests were done on my laptop, a quad-core MacBook Pro with 8GB of RAM.

I used the csvloader provided with VoltDB; it took a little over 50 minutes to load :(
I used the COPY functionality of Postgres; it took less than 3 minutes.

"SELECT COUNT (*) from logs"
VoltDB: 14,960,452 rows, couldn't figure out how to time query execution (seemed fast, less than a second)
Postgres: 14,960,452 rows, ~2000ms

"SELECT SUM(price) from logs"
Postgres: 4342552.21367791, ~5000ms
VoltDB: VOLTDB ERROR: SQL ERROR
More than 100 MB of temp table memory used while executing SQL. Aborting.
at org.voltdb.sysprocs.AdHoc_RO_SP.run(AdHoc_RO_SP.java:53)

:(

1) Is there a way to load the data into VoltDB faster? csvloader is an order of magnitude away from good enough.
2) What do I have to do to make summing a column work? For a 2.7GB dataset on a machine with 8GB of RAM, I expect this to work out of the box.


We were very excited when first researching VoltDB, but this first hands-on experience with it has been very disappointing. Am I doing things the worst way possible or what?
bballard
Jun 10, 2013
Hi Bob,

We'd like to help make your evaluation of VoltDB a success. Hopefully we can resolve your present issue here in the forum, but we'd invite you to contact us at info@voltdb.com so that we can discuss your use case and success criteria and help you with your evaluation.

The performance you saw from CSVLoader is not typical. Is the logs table partitioned? For example, after the CREATE TABLE statement in the DDL, a partitioned table would have a command like this:
PARTITION TABLE example_table ON COLUMN id;

If the logs table is not partitioned, that may explain the unusually slow loading time you're seeing.
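For example, if there is a column in your logs table that would make a reasonable partitioning key (the column names below are only placeholders, since we don't know your schema), the DDL might look something like this:

CREATE TABLE logs (
    id BIGINT NOT NULL,
    price FLOAT
    -- ... the rest of your columns ...
);
PARTITION TABLE logs ON COLUMN id;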

Regarding the query, it should also benefit from the table being partitioned, even though this query needs to get the sum from the entire table. For a partitioned table, each site in the cluster will compute a sub-total SUM from the portion of the records located at that site, and the coordinating site will add these together to produce the response. Otherwise, each site would be adding up all of the records.

Even with partitioning, it is still possible that this query could hit the 100MB temp table limit. That limit is a protection to keep longer-running queries from running too long and preventing the high-velocity workload from proceeding. The best way to get a sum of the entire table like this in VoltDB is to define a VIEW and query the VIEW rather than the table.

A VIEW in VoltDB is always materialized and updated on commit, but without significantly impacting the performance of inserts, updates, or deletes. This is because the records of a VIEW are stored locally at the same site as the corresponding records of the table, and the supported functionality is kept basic. There is a convention for the structure of a VIEW: it must be defined with GROUP BY keys, followed by COUNT(*), followed by any additional COUNT() or SUM() aggregations. Even though you want to SUM() the entire table, a VIEW needs at least one GROUP BY key, but you can still query the VIEW to get the overall sum. For example, in the DDL:
CREATE VIEW logs_view AS
SELECT foo, COUNT(*) AS total_logs, SUM(bar) AS BAR
FROM logs
GROUP BY foo;

Then to query it:
SELECT SUM(bar) FROM logs_view;

This view will be maintained in real time, and the query of the view will only need to scan a small number of summary rows to produce the total, so it should be instantaneous.
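In your case, assuming the logs table has some natural grouping column (I'm using log_date purely as a placeholder, since I don't know your schema), the equivalent might look like:

CREATE VIEW logs_by_day AS
SELECT log_date, COUNT(*) AS total_logs, SUM(price) AS total_price
FROM logs
GROUP BY log_date;

SELECT SUM(total_price) FROM logs_by_day;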

Best regards,
Ben
Bob Larrick
Jun 10, 2013
Hi Ben,

Thanks for the quick reply.
The table is not partitioned. I have only just read http://blog.voltdb.com/understanding-voltdb-data-partitioning/ , so I don't really know why I should partition or what to partition on.
For a single table on a single machine, how could partitioning help? Unless I have my terminology mixed up, I think there is only one site and one node.


The CREATE TABLE statements are identical except that Postgres uses int2 where VoltDB uses tinyint (no indexes or anything fancy), which leads me to believe the difference in load time is entirely due to the loading mechanism, csvloader vs. COPY.
I would expect the VoltDB data loading to be much faster than Postgres, since Postgres has to write to the actual hard drive and VoltDB is entirely in memory.

I'm sure there are lots of other ways to load the data into VoltDB, but this is day one for me, so maybe you can suggest some alternatives to csvloader?


I don't think creating views for every query we want to execute will be a good solution, since the whole idea is to allow easy ad-hoc querying of the data. Is VoltDB well suited to ad-hoc querying or does it rely on the user tuning the table and views ahead of time?

Thanks,

Bob
aaTesting
Jun 15, 2013
Maybe I am wrong, but this is how I understand VoltDB.

You partition your table on one column. VoltDB then distributes the rows of your table across the sites per host (sitesperhost) defined in the deployment.xml file.

If you don't partition a table, a copy of that table will be on every site on your host. This scenario is OK for small tables which don't get updated and are mostly used for reading.

The sites-per-host number is the number of partitions on each machine, and it should be close to the number of cores (roughly 75% of the cores; you can test what gives you the best performance).
So if you have 4 cores you can set sitesperhost to 3.

<deployment>
<cluster hostcount="1" sitesperhost="3">
</cluster>
</deployment>

This means that your table rows will be distributed across 3 sites.
When you use stored procedures, those that get data from only one site are single-partitioned, and they execute fast because they can run in parallel. If a procedure uses data from all partitions, then it is a multi-partition stored procedure and can be slow.
Also, for high performance it is recommended to use stored procedures rather than ad hoc queries.
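For example (I am not 100% sure of the syntax, and the column foo here is just a placeholder), a single-partitioned procedure can be declared in the DDL something like this:

-- hypothetical single-partitioned procedure; foo is a placeholder column
CREATE PROCEDURE CountLogsForKey AS
    SELECT COUNT(*) FROM logs WHERE foo = ?;
PARTITION PROCEDURE CountLogsForKey ON TABLE logs COLUMN foo;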
Maybe these links can help you:

http://www.slideshare.net/VoltDB/building-voltdb-applications
http://blog.voltdb.com/programming-voltdb-easy-flexible-and-ultra-fast/

Also, I tried to import a .sql file exported from Oracle; I had Java heap problems and the import was very slow.
I know you can increase the temp table size in the deployment.xml file, but I don't know exactly how.
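I think it is a systemsettings element in the deployment file, something like this, but I have not tested it myself (maxsize should be in MB, and the attribute names may not be exact):

<deployment>
    <cluster hostcount="1" sitesperhost="3">
    </cluster>
    <systemsettings>
        <temptables maxsize="256"/>
    </systemsettings>
</deployment>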