Forum: Managing VoltDB

Post: Volt seems fragile

Volt seems fragile
pmarch
Feb 14, 2011
I have volt running on a machine that matches the suggested specs. I'm part of a team working on different shifts. Each day, I leave volt running with a fresh import of sample data to facilitate the other developers' work. Each day, I find volt has stopped running for various reasons.

1 - too many open files (we fixed that)
2 - jvm errors (can't fix that)
3 - a user submitted an adhoc query with a '?' left in it
etc...

I would expect volt db to be running 24/7 except for uncontrollable hardware or system errors. So far, it's appearing to be pretty fragile.

Is my expectation too high? Is there something I must do to get the more stable behavior I expect?
more details on jvm errors?
alexlzl
Feb 15, 2011
I am also evaluating VoltDB in my company environment.

- too many open files. Is that related to "ulimit" settings? That should have nothing to do with VoltDB or Java; it is an OS setting, and in a server environment your SA should have set it to a higher value (instead of the default 1024).

- jvm errors. Can you give a stack trace of such errors? or core-dump?

- adhoc queries should never be allowed in production mode anyway. Still, crashing the server should never happen; maybe you can paste an example of your query?
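
For the "too many open files" point above, here is a minimal sketch of how the open-file limit could be inspected from Python using the standard `resource` module (Unix only). The helper names `open_file_limits` and `soft_limit_at_least` are made up for illustration and are not part of VoltDB. The values reported are those of the Python process itself, so it would need to be run as the same user and from the same shell that starts the server for the numbers to be comparable.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(required):
    """Return True if the soft open-file limit is unlimited or >= required."""
    soft, _hard = open_file_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    # 1024 is the default mentioned in this thread; anything above it
    # means the limit has been raised for this user/shell.
    print("raised above the 1024 default:", soft_limit_at_least(1025))
```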
Your expectations are
rbetts
Feb 15, 2011
Your expectations are certainly not too high. If you can provide additional information or examples of these errors, we will fix the underlying causes and add regression tests for them. We take all reported errors very seriously.

We run Volt on internal clusters 24x7 for many weeks at a time. VoltDB does trigger some known JVM defects; workarounds for those errors (JVM settings) are described in the release notes:

http://community.voltdb.com/docs/ReleaseNotes/index
Thank you,
Ryan.
Thanks for the responses...
pmarch
Feb 15, 2011
We've only just begun our work with volt (1 week +), so we attribute much of our trouble to being newbies. I will more carefully gather info regarding failures as we go and submit them.

1 - We changed the open files from the default of 1024 to 10240 - is there a suggested number?
2 - I will post the jvm error when it happens again. (sorry, there were at least two last I looked)
3 - Here is an example adhoc query that causes volt to terminate with the following error: terminate called after throwing an instance of 'voltdb::FatalException'. We used the provided python tool for adhoc queries and see this in its output as well: something really bad just happened len() of unsized object

SELECT U.User_ID, COUNT(F.Favorite_User_ID) as favorites FROM User U, Group_User_Map G, Favorite F WHERE G.User_Group_ID = ? AND U.Status=1 AND U.User_ID = G.User_ID AND U.User_ID = F.User_ID GROUP BY U.User_ID;

In this case, the query was copied from a procedure and placed in the adhoc tool to check results - notice the ? was (mistakenly) not replaced with a value. Not a real world production example, but still surprising that it causes termination.
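
Until the server rejects such queries on its own, one possible client-side guard is to refuse to submit ad hoc SQL that still contains a `?` placeholder. The sketch below is only an illustration: `has_unbound_placeholder` is a hypothetical helper, not part of the VoltDB client or the adhoc tool, and it assumes standard SQL single-quoted string literals with `''` as the escaped quote.

```python
def has_unbound_placeholder(sql):
    """Return True if sql contains a '?' outside single-quoted string literals."""
    in_string = False
    i = 0
    while i < len(sql):
        ch = sql[i]
        if ch == "'":
            # Inside a literal, '' is an escaped quote, not the end of the literal.
            if in_string and i + 1 < len(sql) and sql[i + 1] == "'":
                i += 2
                continue
            in_string = not in_string
        elif ch == "?" and not in_string:
            return True
        i += 1
    return False


if __name__ == "__main__":
    query = (
        "SELECT U.User_ID, COUNT(F.Favorite_User_ID) as favorites "
        "FROM User U, Group_User_Map G, Favorite F "
        "WHERE G.User_Group_ID = ? AND U.Status=1 "
        "AND U.User_ID = G.User_ID AND U.User_ID = F.User_ID "
        "GROUP BY U.User_ID;"
    )
    if has_unbound_placeholder(query):
        print("refusing to submit: query still contains a '?' placeholder")
```

A check like this would only catch the copy-and-paste mistake described here; it is not a substitute for the planner validating the statement.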
re: Thanks for the responses
rbetts
Feb 15, 2011
Issue 1: file descriptors
Do you know how many concurrent client connections you are using? Are you using the JSON / HTTP API? We are aware of a file descriptor leak that affects the JSON interface during a node rejoin: https://issues.voltdb.com/browse/ENG-954. Your total fd requirement will be higher if using JSON (for legitimate reasons, not because of the defect). Volt requires a couple of fds for its intra-cluster mesh and a few fds for logging, snapshot writing and other utilities. The bulk of the file descriptors will be for connected clients.
Issue 2: We appreciate your time in collecting and posting these if they re-occur.
Issue 3: All adhoc queries are multi-partition. We don't support single-partition adhoc queries. If the stored procedure that usually contains that query is single partition (perhaps partitioned on U.User_ID?), the query would be planned differently when used in the procedure. I tested some simple adhoc queries that contained "?", and none were allowed by the adhoc query planner. If you can send your schema and tell me your partitioning scheme for User, Group_User_Map and Favorite (you can email rbetts at voltdb.com to keep this private), I'll reproduce exactly this query. Otherwise, I can re-create a guess based on your SQL and try to reproduce this error.

The "something really bad just happened..." message is generated by the adhoc browser utility, not by the database itself. Perhaps this was a consequence of the database terminating.

Ryan.
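
To see how close a running server is to its file descriptor limit, one option on Linux is to count the entries under `/proc/<pid>/fd`. This is a rough sketch under that assumption (Linux with `/proc` mounted, and permission to read the target process's directory); `count_open_fds` is a hypothetical helper, not a VoltDB utility.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors for pid, or None if unavailable.

    Relies on the Linux /proc filesystem; returns None when the directory
    does not exist (other platforms, no such process) or cannot be read.
    """
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except (FileNotFoundError, NotADirectoryError, PermissionError):
        return None


if __name__ == "__main__":
    # Replace os.getpid() with the pid of the server's JVM process to
    # watch its descriptor usage as clients connect and disconnect.
    print("open fds:", count_open_fds(os.getpid()))
```

Sampling this count periodically while clients connect and disconnect would show whether usage grows steadily (suggesting a leak) or tracks the number of connected clients.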
Parameterized adhoc query defect reproduced
rbetts
Feb 15, 2011
Seb was able to reproduce the adhoc parameterized query crash. You can track progress on that defect here:
https://issues.voltdb.com/browse/ENG-993
Latest
pmarch
Feb 16, 2011
Thanks for all the info.

We have implemented the changes recommended in the Release Notes.

Issue 1: Currently, we are using Java only - no JSON. I would expect no more than 5 concurrent connections as we are a small team. We have verified that we are properly cleaning up connection resources. For production use, it is possible we could have concurrent connections in the hundreds (maybe thousands).

Issue 2: No unexpected termination yet (following changes) - figures :-) - I will post if/when they happen

Issue 3: Glad you were able to reproduce. For what it's worth, we can reproduce with the simplest query, e.g. SELECT last_name FROM name WHERE first_name = ? (the name table can be as simple as ID INT NOT NULL, FIRST_NAME VARCHAR(32) NOT NULL, LAST_NAME VARCHAR(32) NOT NULL, PRIMARY KEY (ID)).
Issue 3 fixed on trunk
jhugg
Feb 21, 2011
Regarding Issue 3, we've closed ENG-993, and in revision r1535 we have significantly beefed up some of the SQL handling around this issue. The upcoming 1.3 release should not crash on this kind of AdHoc SQL and should provide a better error message.