Forum: Managing VoltDB

Post: VoltDB rejoin lets other node crash

Monika
Dec 2, 2013
Hi,

we are testing our 4-node cluster running VoltDB 3.7 Community Edition.
Currently we are testing rejoining a cluster: we kill one node and then let it rejoin while keeping some load on the cluster (~150 procedure calls/s).
While the killed node rejoins the cluster successfully, another node dies with the log output below.
There is a replicated table called Domains, and it looks like one of the nodes has lost its copy of it. When connecting with the command line tool to the rejoined node, one can actually read the data from the Domains table, so the data should still be in the cluster.
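For example, a quick check like the following against the rejoined node does return rows (illustrative sqlcmd session; the hostname is a placeholder):

$ sqlcmd --servers=node4
1> SELECT COUNT(*) FROM Domains;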

Does anybody have an idea what is happening here?
Many thanks,

Monika

2013-12-02 13:43:44,636 FATAL [Volt Network - 1] TM: Stored procedure AddUser generated different SQL queries at different partitions. Shutting down to preserve data integrity.
2013-12-02 13:43:44,637 ERROR [Volt Network - 1] TM: HASH MISMATCH COMPARING: 0 to 1785107999
PREV MESSAGE: INITITATE_RESPONSE FOR TXN 3619634828853250
SP HANDLE: 3619634828853250
INITIATOR HSID: 8:1
COORDINATOR HSID: 8:1
CLIENT INTERFACE HANDLE: 562949953421312
CLIENT CONNECTION ID: 8
READ-ONLY: false
RECOVERING: false
MISPARTITIONED: false
COMMIT
CLIENT RESPONSE:
{"status":1,"appstatus":-128,"statusstring":null,"appstatusstring":null,"exception":null,"results":[{"status":0,"schema":[{"name":"modified_tuples","type":6}],"data":[[1]]},{"status":0,"schema":[{"name":"modified_tuples","type":6}],"data":[[1]]}]}
CURR MESSAGE: INITITATE_RESPONSE FOR TXN 3619634828853250
SP HANDLE: 3619634828853250
INITIATOR HSID: 8:1
COORDINATOR HSID: 8:1
CLIENT INTERFACE HANDLE: 562949953421312
CLIENT CONNECTION ID: 8
READ-ONLY: false
RECOVERING: false
MISPARTITIONED: false
ROLLBACK/ABORT,
CLIENT RESPONSE:
{"status":-1,"appstatus":-128,"statusstring":"VOLTDB ERROR: USER ABORT
Expectation failing in procedure: AddUser
Running SQL: SELECT id FROM Domains WHERE name = ?;
Error message: Expected one row, but got 0
at com.gehrs.userdb.procedures.AddUser.run(null:-1)","appstatusstring":null,"exception":null,"results":[]}

2013-12-02 13:43:44,642 FATAL [Volt Network - 1] HOST: HASH MISMATCH: replicas produced different results.
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: Stack trace from crashLocalVoltDB() method:
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: java.lang.Thread.dumpThreads(Native Method)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: java.lang.Thread.getAllStackTraces(Thread.java:1546)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.VoltDB.printStackTraces(VoltDB.java:672)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.VoltDB.crashLocalVoltDB(VoltDB.java:757)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.iv2.SpScheduler.handleInitiateResponseMessage(SpScheduler.java:659)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.iv2.SpScheduler.deliver(SpScheduler.java:359)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.iv2.InitiatorMailbox.deliverInternal(InitiatorMailbox.java:271)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltdb.iv2.InitiatorMailbox.deliver(InitiatorMailbox.java:247)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.messaging.ForeignHost.deliverMessage(ForeignHost.java:288)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.messaging.ForeignHost.handleRead(ForeignHost.java:332)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.messaging.ForeignHost.access$000(ForeignHost.java:41)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.messaging.ForeignHost$FHInputHandler.handleMessage(ForeignHost.java:74)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.network.VoltPort.run(VoltPort.java:183)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:389)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.network.VoltNetwork.invokeCallbacks(VoltNetwork.java:417)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: org.voltcore.network.VoltNetwork.run(VoltNetwork.java:296)
2013-12-02 13:43:44,643 FATAL [Volt Network - 1] HOST: java.lang.Thread.run(Thread.java:679)
jhugg
Dec 2, 2013
"Stored procedure AddUser generated different SQL queries at different partitions. Shutting down to preserve data integrity."

This message is key to understanding what happened. VoltDB requires all stored procedure logic to be deterministic. When redundancy is enabled, we achieve high parallel throughput by running the same procedure logic on the same state in the same order.

In this case, we detected (after the fact) that the same procedure had a different outcome when run at two different replicas. Specifically, it appears to have succeeded on one replica and rolled back after failing an expectation on another. Since there is no way to know what was desired and the data is now likely divergent between replicas, we shut down the cluster to prevent further corruption.

I would examine the AddUser procedure for sources of non-determinism, including, but not limited to, using random values or reading the time directly from Java. See more here: http://voltdb.com/docs/UsingVoltDB/DesignProc.php.
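For example, here is a minimal sketch of the pattern to look for (a hypothetical procedure, not your actual AddUser; the commented-out calls are the unsafe ones):

import java.util.Date;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class AddUserSketch extends VoltProcedure {
    public final SQLStmt insert = new SQLStmt(
        "INSERT INTO Users (name, created_at) VALUES (?, ?);");

    public VoltTable[] run(String name) {
        // Unsafe: wall-clock time and Math.random() can differ between
        // replicas, so each replica would execute the INSERT with
        // different parameter values:
        //   Date now = new Date(System.currentTimeMillis());
        //   double r = Math.random();

        // Safe: getTransactionTime() returns the same timestamp at every
        // replica of this transaction. (Likewise, use
        // getSeededRandomNumberGenerator() instead of Math.random().)
        Date now = getTransactionTime();

        voltQueueSQL(insert, name, now);
        return voltExecuteSQL(true);
    }
}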

Let us know if you can't find a solution and we'll try to figure out some next steps. Thanks.
Monika
Dec 3, 2013
Hi,

thanks for your answer.
The log file also says:

{"status":-1,"appstatus":-128,"statusstring":"VOLTDB ERROR: USER ABORT
Expectation failing in procedure: AddUser
Running SQL: SELECT id FROM Domains WHERE name = ?;
Error message: Expected one row, but got 0
at com.gehrs.userdb.procedures.AddUser.run(null:-1)","appstatusstring":null,"exception":null,"results":[]}

This is confusing: it says that one replica has no data in the table in question, while another replica apparently does. The table is only ever read within the procedure, and in fact within all procedures of the application. And the table is a replicated table. Data is loaded into it once, at application startup.
Reading the log file, I would conclude that one replica has lost its data.
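For reference, the failing statement is queued with an expectation, roughly like this (simplified sketch, not the full procedure):

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class AddUser extends VoltProcedure {
    public final SQLStmt getDomain = new SQLStmt(
        "SELECT id FROM Domains WHERE name = ?;");

    public VoltTable[] run(String userName, String domainName) {
        // EXPECT_ONE_ROW makes VoltDB abort the transaction (the
        // "USER ABORT ... Expected one row, but got 0" in the log)
        // if the SELECT returns anything other than exactly one row.
        voltQueueSQL(getDomain, EXPECT_ONE_ROW, domainName);
        VoltTable[] lookup = voltExecuteSQL();
        long domainId = lookup[0].asScalarLong();
        // (the INSERT of the user row, using domainId, is omitted here)
        return lookup;
    }
}

On a healthy cluster this expectation never fires, because Domains is replicated and its rows should exist at every replica.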

I have tried a number of rejoins of different nodes within the cluster. Everything looks fine until I start running procedures; then every node except the newly rejoined one fails with messages like these.

Do you have any further idea of what could have happened? The procedure is deterministic and only uses getTransactionTime().

Many thanks,
Monika
Monika
Dec 4, 2013
Hi,

I just tried rejoining while running the voter example and got the same error.
I ran the voter application on 4 nodes, killed one node, and let it rejoin.
The client was running all the time. After the node rejoined the cluster, the whole cluster crashed with the following error:

2013-12-04 11:42:52,040 INFO [Iv2ExecutionSite: 1:2] REJOIN: Initializing snapshot stream processor for source site id: 4:6, and with processorid: 0
2013-12-04 11:42:56,562 FATAL [Volt Network - 1] TM: Stored procedure Vote generated different SQL queries at different partitions. Shutting down to preserve data integrity.
2013-12-04 11:42:56,562 ERROR [Volt Network - 1] TM: HASH MISMATCH COMPARING: 0 to -894730862
PREV MESSAGE: INITITATE_RESPONSE FOR TXN 3619650615738369
SP HANDLE: 3619650615738369
INITIATOR HSID: 1:1
COORDINATOR HSID: 1:1
CLIENT INTERFACE HANDLE: 281474977851543
CLIENT CONNECTION ID: 4
READ-ONLY: false
RECOVERING: false
MISPARTITIONED: false
COMMIT
CLIENT RESPONSE:
{"status":1,"appstatus":-128,"statusstring":null,"appstatusstring":null,"exception":null,"results":[{"status":-128,"schema":[{"name":"","type":6}],"data":[[0]]}]}
CURR MESSAGE: INITITATE_RESPONSE FOR TXN 3619650615738369
SP HANDLE: 3619650615738369
INITIATOR HSID: 1:1
COORDINATOR HSID: 1:1
CLIENT INTERFACE HANDLE: 281474977851543
CLIENT CONNECTION ID: 4
READ-ONLY: false
RECOVERING: false
MISPARTITIONED: false
COMMIT
CLIENT RESPONSE:
{"status":1,"appstatus":-128,"statusstring":null,"appstatusstring":null,"exception":null,"results":[{"status":-128,"schema":[{"name":"","type":6}],"data":[[1]]}]}
jhugg
Dec 4, 2013
Hmm... I'm really sorry about this. You may have found a bug. We'll certainly look into it. Can you send us the exact deployment configuration and any other information needed to reproduce this? Do you have any other server logs and/or crash dump files you can send us from the 4 nodes? Send to jhugg at voltdb.com and I'll get them to the right place.

Finally, have you tried the same thing with our Enterprise trial edition?

Thank you.
Monika
Dec 4, 2013
Many thanks for your answer.
We will send you the required information.
Regarding your last question: yes, we have tried the same thing with the Enterprise trial edition. There, the (blocking) rejoin failed as well, but live rejoin did work.

Monika
jhugg
Dec 4, 2013
At this point we have reproduced this issue here and have made it a high priority. We see the issue using blocking rejoins under certain scenarios, but we haven't reproduced the issue with non-blocking rejoins (available in the Enterprise Edition).

Our goal will be to fix the bug and put out a patch release to 3.7 as soon as possible. We're also going to figure out how our tests didn't catch this. Thanks for bringing this to our attention.

If you'd like to continue your tests in the interim, we suggest using the Enterprise trial with non-blocking rejoins.
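For reference, the two modes are invoked differently on the command line (3.x syntax from memory; please double-check the Using VoltDB manual for your exact release):

# Blocking rejoin (Community and Enterprise):
voltdb rejoin --host=voltsvr1 --deployment=deployment.xml

# Live, non-blocking rejoin (Enterprise only):
voltdb live rejoin --host=voltsvr1 --deployment=deployment.xml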
jhugg
Dec 4, 2013
Question: Is the table "Domains" replicated or partitioned?
Monika
Dec 5, 2013
Many thanks for your response!

The "Domains" table is a replicated one.

Monika