Forum: Building VoltDB Applications

Post: Limitaion of 50MB for the result set of a query

Limitaion of 50MB for the result set of a query
Feb 17, 2016
My application is currently querying voltDB,which is returning a set of data points on which I am doing some computation(like percentile,median,variance,etc). However when the data grows in size(millions of rows) I get the following error "Output from SQL stmt overflowed output/network buffer of 50mb (-32k for message headers). Try a "limit" clause or a stronger predicate.". Any workaround on how to increase the buffer size?
Feb 17, 2016
That is a fundamental limit, there is no way to override or extend it. The maximum column size is 1MB, and the maximum row size is 2MB. There is no limit on the number of rows in a VoltTable.
The maximum message size that can be returned from a stored procedure or adhoc query is 50MB.

If you are running into this known limitation, the workaround that might apply is not to use OFFSET but instead to add a "WHERE" clause that filters out rows already seen -- those with ordered column value(s) < the maximum previously returned.
Peter Zhao
Feb 18, 2016
Thanks Peter for the response. I do have another query if you don't mind. Is it possible to do a batch insert in VoltDB? If yes, do you mind sharing some code on how it can be done. I tried looking it up in VoltDB docs but couldn't find it
Feb 18, 2016
If you're writing a java client, please take a look at our java example voter stored procedure. Below is a list of queued sql statements prior to its execution in line 67.
In sqlcmd, you can pipe a file into sqlcmd, sqlcmd < foo.ddl, where your foo.ddl may look like this (this is utilizing our voter example):
exec VOTES.insert 9999,'ma',1;
exec VOTES.insert 9999,'ma',1;
Further, you can batch them into one transaction for speed by utilizing file keyword. Please take a look at this example and where it ends at line 85.

Let me know if this answers your question.
Peter Zhao
Feb 19, 2016
Hi Peter,
Let me paint a clear picture of the problem I am trying to solve. I have written a java thread(writer) which inserts data(random real values) into voltdb. In parallel one of my other threads reads data continuously from the voltDB database and computes percentile on the dataset(as voltdb doesn't support Percentile sql directly so I am doing it on Java). Now the data that is being fetched from the DB keeps on increasing as the writer thread pushes more data in the DB. This poses a problem for me as I am unable to fetch more than 1.6 Million rows.Was wondering if you have any ideas on howto circumvent this hurdle.One idea I had was whether I could create a view such as

CREATE VIEW selectdata (id,counting) AS SELECT id,COUNT(*) FROM utable GROUP BY id;

Problem is how to invoke a view from java. There is no API call for view in Client.

The agenda for this exercise is to understand how voltDB fairs against other DB's such as oracle,aerospike,cassandra,etc. for real time streaming data.
Feb 19, 2016
This is a clearer picture. The best answer is in this post here:
Specifically the last post, where some sample code is provided and a link to the java client for VoltDB NBBO Example App.
The idea is to create an index and loop through that ordered index by offset defined in the where clause.
Peter Zhao
Feb 19, 2016
Hi raviojha23,

I discussed this with Peter, and want to suggest something a little different. I wrote the other post he mentioned, and it is a sound method if your end goal is to extract a larger than 50MB data set once in a while. But your goal is to calculate percentiles, and I think you are already on the right track to use a VIEW.

The benefit of the VIEW is to get values and their counts, rather than aggregating this at runtime. This will especially help if the data has a smaller count of discrete values, and would help less if the data is very precise and continuous such that most values don't appear more than once or a few times.

You could then write a stored procedure to read the results of querying this VIEW and compute percentiles. For example

CREATE TABLE student_grades (
student_id integer not null,
grade integer not null, -- out of 100
primary key student_id

CREATE VIEW grade_counts AS
SELECT grade, count(*) as num
FROM student_grades
GROUP BY grade;

Stored procedure logic:

SELECT sum(num) from grade_counts;
SELECT grade, num FROM grade_counts ORDER BY grade ASC;

int count=0
int total = (the total number of students or grades given)

While loop iterating through the view results:
count += num
count / total = percentile

This keeps all of the data on the database, and sends only the percentile results back to the client, which is much more efficient than pulling down all of the raw data or even pre-aggregated data to do the calculation on the client side.

Feb 23, 2016
Hi Ben,
Thanks for pitching in. The dataset that I am working on are random real values, hence they are quite unique and can't be grouped as you suggested.

I tried using view to extract data however view as well has the same limitation of 50Mb.

One other approach that comes to the top of my mind is to compute the percentile using a query such as this:

select top 0.05*count(*) * from my_table order by value descending where I can select only the top 5 % of the values in descending order and my percentile would be the last row of the returned resultset. This approach will work ok only till a few million rows(my 5 % has to be less than 1.6 million rows- threshold at which 50Mb is exhausted). Maybe as a solution to the above problem I can write a nested query to return only the required percentile row.

Could you help me in formulating the above query as VoltDB is throwing some error.

If the above approach doesn't fly I can do something on the lines that you have suggested in your other thread.

On another note would the limitation of 50 Mb be done away with in the future release of VoltDB?