Forum: Building VoltDB Applications

Post: Can voltdb satisfy my app ?

Can voltdb satisfy my app ?
ddorian43
Aug 23, 2014
I want to create a real-time tracking app. Basically, you visit a site, and a javascript function sends an event every 15 seconds with some info about that visitor to the db.

So i have :

create table (project_id, session_id, timestamp, a,b,c,d,e, tags) partition by project_id, primary_key=project_id+session_id;


~97 percent of queries are upserts in the form of:
insert or update timestamp=current_timestamp where project_id=x and session_id=y (i also insert/update other info like (a,b,c,d), but i always upsert in 1 partition / 1 row)
This is inserted whenever an event comes from the javascript function in the user's browser.


2% of queries are when customers login and ask questions about their users. This is a dashboard that does ~10 queries , always to 1 partition.
Queries are the form of:
SELECT COUNT(dimension), dimension GROUP BY dimension ORDER BY count(dimension) LIMIT 10;
SELECT COUNT(DISTINCT(dimension1)), DISTINCT(COUNT(dimension2));


<0.1% of queries, is when i(admin), want to know info about all the sites, so, multi-partition procedure. This is the same as the queries above, except no "where project_id=x" filtering is done.


Questions:


  1. In the table, tags needs to be an array of text, or something similar that lets me anwser:
    select count(explode(tags)), explode(tags) FROM online GROUP BY explode(tags).
    Is there a way ?
    In postgresql this can be done by using unnest array: http://www.postgresql.org/docs/9.2/static/functions-array.html

  2. I don't need rows that are older than 30 seconds.
    Should i do a multi-partition delete every 30 seconds, or should i create a delete_query that is executed every random(100) inside the upsert_procedure()?
    Can I do a Delete query for the whole partition inside a procedure? Basically filtering by partition_id or something ?

  3. When I insert events, is it possible to send a 'fire and forget' query like mongodb.
    Basically i dont need a response from the db if the query succeeded or not.
    http://stackoverflow.com/questions/11563627/what-is-exactly-meant-by-fire-and-forget-write-in-mongodb
    I'll be using the python driver

  4. When I query for stats, can I put multiple queries inside one procedure?
    Each query will return different rows with different columns, but all will select on 1 partition

  5. When i query for stats, I need to filter on all the dimensions, and they will not be indexed. This works right, filtering on non-indexed-columns ?

  6. Is it possible to run some tables with disk logging while other tables can just be in-memory and get wiped on restart? This will greatly simplify my application.
    If not, can I use a large aync commit-log checkpointing, and force checkpointing on some queries ? These will rarely happen, <0.1 % of queries.

  7. What happens if the data grows larger than memory ?


    Thanks
rmorgenstein
Aug 25, 2014
VoltDB sounds like a good fit for a workload that is 99+% single partition.

We added a default procedure for UPSERT in V4.4 (see the Default Procedures section ). If you haven't gotten a new version in a while, I suggest you download the newest (we just release V4.6).




Questions:


  1. In the table, tags needs to be an array of text, or something similar that lets me anwser:
    select count(explode(tags)), explode(tags) FROM online GROUP BY explode(tags).
    Is there a way ?
    In postgresql this can be done by using unnest array: http://www.postgresql.org/docs/9.2/static/functions-array.html

    I am researching this and will get back to you soon.

  2. I don't need rows that are older than 30 seconds.
    Should i do a multi-partition delete every 30 seconds, or should i create a delete_query that is executed every random(100) inside the upsert_procedure()?
    Can I do a Delete query for the whole partition inside a procedure? Basically filtering by partition_id or something ?

    See the examples/windowing sample application in the kit. It shows 2 methods for deleting. V4.6 has some significant improvements to DELETE performance for tables with low-cardinality indexes.

  3. When I insert events, is it possible to send a 'fire and forget' query like mongodb.
    Basically i dont need a response from the db if the query succeeded or not.
    http://stackoverflow.com/questions/11563627/what-is-exactly-meant-by-fire-and-forget-write-in-mongodb
    I'll be using the python driver

    The Python client doesn't have an async mode, but Python's threading has a global lock, so probably isn't the best choice for a high performance client anyways. I suggest using any other client to do this (java, php, c++, node.js).

  4. When I query for stats, can I put multiple queries inside one procedure?
    Each query will return different rows with different columns, but all will select on 1 partition

    You can put multiple queries in a procedure. You can also query multiple tables. The examples/voter Vote procedure for how to queue and execute multiple queries.

  5. When i query for stats, I need to filter on all the dimensions, and they will not be indexed. This works right, filtering on non-indexed-columns ?

    >>You can run queries on any columns, regardless of indexing. You can look at the SQL plans for all your procedures in the schema report at http://<yourserver>:8080

  6. Is it possible to run some tables with disk logging while other tables can just be in-memory and get wiped on restart? This will greatly simplify my application.
    If not, can I use a large aync commit-log checkpointing, and force checkpointing on some queries ? These will rarely happen, <0.1 % of queries.

    Command logging is for all tables. See https://voltdb.com/docs/UsingVoltDB/ChapCmdLog.php for information on sizing and tuning.

  7. What happens if the data grows larger than memory ?

    The OOM killer will come into action. Some customer handle this with monitoring (e.g. Nagios, New Relic) to alert when this happens. You can also use the "LIMIT PARTITION ROWS" constraint on a table to prevent this. See https://voltdb.com/docs/UsingVoltDB/ddlref_createtable.php for more information


    Thanks
xin
Aug 25, 2014
Answer for question 1:

VoltDB currently does not support ARRAY type. The work around for your problem can be:

You create a separate table to contain the primary key of the original table and insert the element of tags array as a row in the new table. Then you can join the new table with the original table on the primary key and run GROUP BY queries.

Please let us know if you have more questions.

Thanks,
Xin