Forum: Building VoltDB Applications

Post: VoltDB for realtime statistics/analytics and long-time historical stats

ravlio
Mar 15, 2016
Hello. We are looking for a database suitable for building realtime statistics over an event stream, with optional analytics capabilities. We need to aggregate by several dimensions and constantly update historical data with delayed events and corrections.

Our stream carries a volume of 100-300M records daily:

timestamp | partner_id | campaign_id | site_id | spot_id | ad_id | country_id | clicks_count | views_count


clicks_count and views_count are metrics and the rest are dimensions. Our usual queries are:

select hour(timestamp), sum(clicks_count), sum(views_count), country_id
from events
where day(timestamp) = day(current_timestamp)
  and partner_id = 4 and country_id = 'US'
group by hour(timestamp), partner_id, country_id

select hour(timestamp), sum(clicks_count), sum(views_count), campaign_id
from events
where day(timestamp) = day(current_timestamp)
  and partner_id = 4 and campaign_id = 3
group by hour(timestamp), partner_id, campaign_id


As you can see, the queries always contain grouping by several columns (date, partner_id) plus an additional grouping by a "needle" column that depends on which section of the dashboard the user is in. The maximum granularity of the roll-up segments is one hour.

I see two strategies for ingesting data here. First, let Storm/Spark/Flink do the aggregations and update the groupings. Second, do it via materialized views. As far as I understand, VoltDB has incremental materialized views, which is very cool. In theory we can create materialized views such as stats_by_country, stats_by_site, and stats_by_spot, and have them populate automatically just by adding raw event data to the fact table. The final queries would then look like:

select clicks_count, views_count from stats_by_country where day(timestamp)=day(current_timestamp) and hour(timestamp)=hour(current_timestamp) and partner_id=4 and country_id='US'

Much lighter, with no grouping at all.
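Something like this is what I have in mind (just a sketch; ts stands in for our timestamp column since TIMESTAMP is a type name, and all table/column names here are illustrative):

```sql
-- Sketch only: ts stands in for our timestamp column, names are illustrative
CREATE VIEW stats_by_country
  (hour_ts, partner_id, country_id, event_count, clicks, views)
AS SELECT TRUNCATE(HOUR, ts), partner_id, country_id,
          COUNT(*), SUM(clicks_count), SUM(views_count)
   FROM events
   GROUP BY TRUNCATE(HOUR, ts), partner_id, country_id;
```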

We do not need to store each individual event, just roll them up. So at the end of the day I plan to merge the materialized views into a real table holding the previous aggregates and clear the events table.
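Roughly something like this at midnight (sketch; stats_history is an assumed history table with the same columns as a stats_by_country view, and hour_ts / ts are assumed names for the view's truncated-hour column and the raw timestamp column):

```sql
-- Sketch: stats_history is a regular table with the same columns as the view;
-- hour_ts is assumed to be the view's truncated-hour column
INSERT INTO stats_history
  SELECT * FROM stats_by_country
  WHERE hour_ts < TRUNCATE(DAY, CURRENT_TIMESTAMP);

-- Deleting the raw events afterwards also removes those groups from the view
DELETE FROM events
  WHERE ts < TRUNCATE(DAY, CURRENT_TIMESTAMP);
```

(As far as I understand, the view shrinks together with the base table, so the INSERT ... SELECT must run before the DELETE.)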

The most important part, and the reason we want to try VoltDB, is stats updates. We can receive an event that should have been recorded, for instance, a month ago, and we need to update the roll-ups. With VoltDB this looks very simple, unlike other databases with an append-only mechanism.

My questions are:

Does VoltDB fit such a task, with materialized views for hot roll-ups that then move to historical table(s), considering there are about 500k roll-up rows each day? We need storage for years with quick access to any date period.
What if we make just one table: timestamp | partner_id | campaign_id | site_id | spot_id | ad_id | country_id | clicks_count | views_count? Then, while ingesting, group all dimensions into one group and update the metrics, make an index on each column, and run queries on arbitrary dimensions in the columnar-database manner. How will VoltDB handle such ad-hoc queries?

VoltDB is a great product for fast-moving data, but we are interested in how it works with ad-hoc queries on large tables. Do we need manually partitioned tables (like in PostgreSQL, where you make a table for each day of stats and then query multiple tables if you need more than one day) to keep query performance from degrading over time as the total index size grows? Or is it enough to make one table partitioned by time, write all data to it, and have query speed stay the same today and after a year of data ingestion?

Thank you.
bballard
Mar 15, 2016
Your basic use case is exactly what materialized views are perfect for. They automatically increment the group by results so that the queries are very simple and fast. Materialized views are also very scalable and have very little impact on ingestion rate. For example, I've seen cases where a table with 10 views can handle ingestion at 90+% of the rate with no views, and if you need more speed, you can always expand the cluster.

Until recently, your strategy of copying records from the views into a table would have been the recommended way to handle longer-term summary data, where you want to delete the records from the base table but keep the aggregated results.

In VoltDB v6.1, there is now a new feature that would be ideal for what you're trying to do. What we previously called "export tables" are now called "streams", and you can now define views on streams. The data is streamed to a downstream system for archival storage and longer-term analytics, but the aggregated views of the records persist indefinitely. You can maintain aggregations of petabytes of raw data that flowed through VoltDB. You can also delete records from the views at some point if you need to. See the release notes here: https://docs.voltdb.com/ReleaseNotes/
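As a rough illustration, a stream with a view on it might look like this (sketch only; the column names are illustrative, the exact DDL should be checked against the documentation, and the export target itself is configured in the deployment file):

```sql
-- Sketch: raw events flow through a stream instead of a table,
-- so nothing is stored for the raw records themselves
CREATE STREAM events PARTITION ON COLUMN partner_id (
  ts           TIMESTAMP  NOT NULL,
  partner_id   INTEGER    NOT NULL,
  country_id   VARCHAR(2) NOT NULL,
  clicks_count INTEGER    NOT NULL,
  views_count  INTEGER    NOT NULL
);

-- The view persists and keeps aggregating even though
-- the stream itself stores no rows
CREATE VIEW stats_by_country
  (hour_ts, partner_id, country_id, event_count, clicks, views)
AS SELECT TRUNCATE(HOUR, ts), partner_id, country_id,
          COUNT(*), SUM(clicks_count), SUM(views_count)
   FROM events
   GROUP BY TRUNCATE(HOUR, ts), partner_id, country_id;
```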
ravlio
Mar 17, 2016
Thank you for such a quick answer.

How do you think VoltDB will handle OLAP? I mean, I have 10 dimensions and I need to run ad-hoc queries in arbitrary combinations as fast as possible. For this I can make a materialized view grouped by all the dimensions and cover all possible combinations of dimensions, like it is done in Druid, Apache Kylin, and Apache Lens. Is this possible in VoltDB? In this case the benefit of VoltDB over the others is updating historical data. But can I make an index on each dimension and query them in various combinations over constantly growing data? That is the question.
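What I have in mind is roughly this (just a sketch with our column names; ts stands in for our timestamp column, and I am assuming indexes on views are allowed):

```sql
-- Sketch: one view grouped by every dimension...
CREATE VIEW stats_all
  (hour_ts, partner_id, campaign_id, site_id, spot_id, ad_id, country_id,
   event_count, clicks, views)
AS SELECT TRUNCATE(HOUR, ts), partner_id, campaign_id, site_id,
          spot_id, ad_id, country_id,
          COUNT(*), SUM(clicks_count), SUM(views_count)
   FROM events
   GROUP BY TRUNCATE(HOUR, ts), partner_id, campaign_id, site_id,
            spot_id, ad_id, country_id;

-- ...plus a secondary index per dimension for ad-hoc filtering
CREATE INDEX idx_stats_all_campaign ON stats_all (campaign_id);
CREATE INDEX idx_stats_all_country  ON stats_all (country_id);
```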