Forum: Building VoltDB Clients

Post: The whole cluster crashes when I kill 1 of 3 nodes in cluster

ruhruhroy
Jan 26, 2011
I have servers vvoltftldev01 (site 1), vvoltftldev02 (site 101), and vvoltftldev03 (site 201) running.
When I stop the node vvoltftldev03, the whole cluster (the other 2 servers) shuts down with these errors:
Error on vvoltftldev02:
[PeriodicWork] INFO org.voltdb.messaging.impl.HostMessenger - Attempted delivery of message to failed site: 201
[Fault Distributor] ERROR HOST - Host failed, hostname: vvoltftldev03
[Fault Distributor] ERROR HOST - Host ID: 2
[Fault Distributor] ERROR HOST - Removing sites from cluster: [200, 201]
[ExecutionSite:0101] ERROR HOST - DEAD HOST DETECTED, hostname: vvoltftldev01
[ExecutionSite:0101] INFO HOST - current time: 1296056797123
[ExecutionSite:0101] INFO HOST - last message: 1296056786040
[ExecutionSite:0101] INFO HOST - delta: 11083
....
....
....
[Fault Distributor] ERROR HOST - Host failed, hostname: vvoltftldev01
[Fault Distributor] ERROR HOST - Host ID: 0
[Fault Distributor] ERROR HOST - Removing sites from cluster: [0, 1]
[ExecutionSite:0101] INFO RECOVERY - Sending fault data [201, 200] to [101] survivors with lastKnownGloballyCommitedMultiPartTxnId 0
[ExecutionSite:0101] INFO RECOVERY - Sent fault data. Expecting 1 responses.
[ExecutionSite:0101] INFO RECOVERY - Received failure message 1 of 1 from 1 for failed sites [201, 200] with commit point 0 safe txn id 812949932276187139
[ExecutionSite:0101] INFO HOST - Partition detected for 50/50 failure. This survivor set is continuing execution.
[ExecutionSite:0101] INFO RECOVERY - Handling node faults 2 with globalMultiPartCommitPoint 0 and globalInitiationPoint 812949932276187139
[ExecutionSite:0101] INFO RECOVERY - Sending fault data [0, 1, 201, 200] to [101] survivors with lastKnownGloballyCommitedMultiPartTxnId 0
[ExecutionSite:0101] INFO RECOVERY - Sent fault data. Expecting 2 responses.
[ExecutionSite:0101] INFO RECOVERY - Discarding failure message from 101 because it was missing failed sites [0, 1]
[ExecutionSite:0101] INFO RECOVERY - Received failure message 1 of 2 from 101 for failed sites [0, 1, 201, 200] with commit point 0 safe txn id 812949935757459457
[ExecutionSite:0101] INFO RECOVERY - Received failure message 2 of 2 from 101 for failed sites [0, 1, 201, 200] with commit point 0 safe txn id 812949932276187139
[ExecutionSite:0101] INFO HOST - Partition detection triggered for 50/50 cluster failure. This survivor set is shutting down.
[ExecutionSite:0101] INFO RECOVERY - Handling node faults 0 with globalMultiPartCommitPoint 0 and globalInitiationPoint 812949935757459457
[ExecutionSite:0101] INFO RECOVERY - Scheduling snapshot after txnId 812949935757459457 for cluster partition fault. Current commit point: 0
[ExecutionSite:0101] INFO RECOVERY - Delivering roadblock action: org.voltdb.ExecutionSite$ExecutionSiteLocalSnapshotMessage@45c3e9ba for txnId: 812949935757459457
[ExecutionSite:0101] INFO HOST - Executing local snapshot. Completing any on-going snapshots.
[ExecutionSite:0101] INFO HOST - Executing local snapshot. Creating new snapshot.
[Snapshot terminator] INFO HOST - Snapshot miniclip finished at 1296056799431 and took -8.1294863970066E14 seconds
[ExecutionSite:0101] INFO HOST - Executing local snapshot. Finished final snapshot. Shutting down. Result: header size: 63
status code: -128 column count: 5
cols (HOST_ID:INTEGER), (HOSTNAME:STRING), (SITE_ID:INTEGER), (RESULT:STRING), (ERR_MSG:STRING),
rows -
1, vvoltftldev02, 101, SUCCESS, ,
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Thread.java:1503)
org.voltdb.VoltDB.crashVoltDB(VoltDB.java:343)
org.voltdb.ExecutionSite.handleMailboxMessage(ExecutionSite.java:1210)
org.voltdb.ExecutionSite.run(ExecutionSite.java:910)
org.voltdb.RealVoltDB.run(RealVoltDB.java:866)
org.voltdb.VoltDB.main(VoltDB.java:368)
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.
Error on vvoltftldev01:
[PeriodicWork] INFO org.voltdb.messaging.impl.HostMessenger - Attempted delivery of message to failed site: 201
[Fault Distributor] ERROR HOST - Host failed, hostname: vvoltftldev03
[Fault Distributor] ERROR HOST - Host ID: 2
[Fault Distributor] ERROR HOST - Removing sites from cluster: [200, 201]
[ExecutionSite:0001] INFO RECOVERY - Sending fault data [201, 200] to [1, 101] survivors with lastKnownGloballyCommitedMultiPartTxnId 0
[ExecutionSite:0001] INFO RECOVERY - Sent fault data. Expecting 2 responses.
[ExecutionSite:0001] INFO RECOVERY - Received failure message 1 of 2 from 1 for failed sites [201, 200] with commit point 0 safe txn id 812949932276187139
[PeriodicWork] ERROR HOST - DEAD HOST DETECTED, hostname: vvoltftldev02
[PeriodicWork] INFO HOST - current time: 1296056796042
[PeriodicWork] INFO HOST - last message: 1296056786039
[PeriodicWork] INFO HOST - delta: 10003
[Fault Distributor] ERROR HOST - Host failed, hostname: vvoltftldev02
[Fault Distributor] ERROR HOST - Host ID: 1
[Fault Distributor] ERROR HOST - Removing sites from cluster: [100, 101]
[ExecutionSite:0001] INFO RECOVERY - Detected a concurrent failure from FaultDistributor, new failed sites [100, 101, 201, 200]
[ExecutionSite:0001] INFO RECOVERY - Sending fault data [100, 101, 201, 200] to [1] survivors with lastKnownGloballyCommitedMultiPartTxnId 0
[ExecutionSite:0001] INFO RECOVERY - Sent fault data. Expecting 2 responses.
[ExecutionSite:0001] INFO RECOVERY - Received failure message 1 of 2 from 1 for failed sites [100, 101, 201, 200] with commit point 0 safe txn id 812949932318130178
[ExecutionSite:0001] INFO RECOVERY - Received failure message 2 of 2 from 1 for failed sites [100, 101, 201, 200] with commit point 0 safe txn id 812949932276187139
[ExecutionSite:0001] INFO HOST - Partition detection triggered. This minority survivor set is shutting down.
[ExecutionSite:0001] INFO RECOVERY - Handling node faults 1 2 with globalMultiPartCommitPoint 0 and globalInitiationPoint 812949932318130178
[ExecutionSite:0001] INFO RECOVERY - Scheduling snapshot after txnId 812949932318130178 for cluster partition fault. Current commit point: 0
[ExecutionSite:0001] INFO RECOVERY - Delivering roadblock action: org.voltdb.ExecutionSite$ExecutionSiteLocalSnapshotMessage@7e0b6ef8 for txnId: 812949932318130178
[ExecutionSite:0001] INFO HOST - Executing local snapshot. Completing any on-going snapshots.
[ExecutionSite:0001] INFO HOST - Executing local snapshot. Creating new snapshot.
[Snapshot terminator] INFO HOST - Snapshot miniclip finished at 1296056797171 and took -8.12948636261333E14 seconds
[ExecutionSite:0001] INFO HOST - Executing local snapshot. Finished final snapshot. Shutting down. Result: header size: 63
status code: -128 column count: 5
cols (HOST_ID:INTEGER), (HOSTNAME:STRING), (SITE_ID:INTEGER), (RESULT:STRING), (ERR_MSG:STRING),
rows -
0, vvoltftldev01, 1, SUCCESS, ,
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Thread.java:1503)
org.voltdb.VoltDB.crashVoltDB(VoltDB.java:343)
org.voltdb.ExecutionSite.handleMailboxMessage(ExecutionSite.java:1210)
org.voltdb.ExecutionSite.run(ExecutionSite.java:910)
org.voltdb.RealVoltDB.run(RealVoltDB.java:866)
org.voltdb.VoltDB.main(VoltDB.java:368)
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.
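A note for anyone reading the two logs above: each "DEAD HOST DETECTED" entry reports a delta that appears to be simply current time minus last message, in milliseconds. The short sketch below rechecks that arithmetic with the values copied from the logs. The 10-second timeout is an assumption inferred from the deltas shown (10003 and 11083), not a confirmed VoltDB setting, and the function names are hypothetical, not part of any VoltDB API.

```python
# Assumed threshold: both hosts were declared dead at deltas just over
# 10000 ms, which suggests (but does not confirm) a timeout near 10 seconds.
ASSUMED_TIMEOUT_MS = 10_000


def heartbeat_delta_ms(current_time_ms, last_message_ms):
    """Milliseconds elapsed since the last message from a host."""
    return current_time_ms - last_message_ms


def looks_dead(current_time_ms, last_message_ms, timeout_ms=ASSUMED_TIMEOUT_MS):
    """True when the silence is longer than the assumed timeout."""
    return heartbeat_delta_ms(current_time_ms, last_message_ms) > timeout_ms


# (current time, last message) pairs copied from the DEAD HOST DETECTED entries.
observations = {
    "vvoltftldev02 reporting on vvoltftldev01": (1296056797123, 1296056786040),
    "vvoltftldev01 reporting on vvoltftldev02": (1296056796042, 1296056786039),
}

for label, (now_ms, last_ms) in observations.items():
    delta = heartbeat_delta_ms(now_ms, last_ms)
    print(f"{label}: delta={delta} ms, dead={looks_dead(now_ms, last_ms)}")
```

Both computed deltas match the logged values. The two "last message" timestamps are also within a millisecond of each other, which may mean the two surviving nodes stopped hearing from each other at about the same moment vvoltftldev03 went away.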
tcallaghan
Jan 26, 2011
Can you please paste the contents of your deployment.xml file and the server banner from the console window when VoltDB started (so I can see exact version information)?
Thanks,
Tim