Forum: Managing VoltDB

Post: Is Community Edition cluster supposed to survive a node crash?

lgielgud
Mar 1, 2012
Using VoltDB 2.2.1 Community Edition (build voltdb-2.2.1-0-g7894422) on two machines. My deployment.xml is:
<?xml version="1.0"?>
<deployment>
    <cluster hostcount="2"
             sitesperhost="5"
             kfactor="0"
    />
    <httpd port="8090" enabled="true">
        <jsonapi enabled="true" />
    </httpd>
</deployment>
Both instances appear to start normally. When I Ctrl-C either one, I see this in the other instance's output:
ERROR 00:44:47,463 [Fault Distributor] HOST: Host failed, host id: 1 hostname: beater2
ERROR 00:44:47,463 [Fault Distributor] HOST: Removing sites from cluster: [100, 101, 102, 103, 104, 105]
WARN 00:44:47,466 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,472 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,478 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,484 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,493 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,500 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,506 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,512 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,518 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,524 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
WARN 00:44:47,530 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
FATAL 00:44:47,533 [Fault Distributor] HOST: Failure of host 1 has rendered the cluster unviable. Shutting down...
FATAL 00:44:47,533 [Fault Distributor] HOST: Unexpected crash
FATAL 00:44:47,535 [Fault Distributor] HOST: Stack trace from crashVoltDB() method:
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Thread.java:1530)
org.voltdb.VoltDB.crashLocalVoltDB(VoltDB.java:472)
org.voltdb.VoltDB.crashVoltDB(VoltDB.java:448)
org.voltdb.VoltDBNodeFailureFaultHandler.handleNodeFailureFault(VoltDBNodeFailureFaultHandler.java:92)
org.voltdb.VoltDBNodeFailureFaultHandler.faultOccured(VoltDBNodeFailureFaultHandler.java:63)
org.voltdb.fault.FaultDistributor.processPendingFaults(FaultDistributor.java:311)
org.voltdb.fault.FaultDistributor.run(FaultDistributor.java:391)
java.lang.Thread.run(Thread.java:662)

VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.
WARN 00:44:47,536 [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 100
I also tried with 3 machines, with different kfactors, and with Community Edition 2.1.3 - all with the same result.
Is it expected behavior?
aweisberg
Mar 1, 2012
Hi,

With a k-factor of 0 this is expected behavior. Do you have the full output from when this occurs with a kfactor > 0? k is the number of node failures the cluster is guaranteed to tolerate; with a larger cluster you might get lucky and survive more than k failures.

-Ariel
rbetts
Mar 1, 2012
A kfactor of 0 means there is no replication. Set the kfactor to > 0 and the cluster will survive "k" failures. The message "FATAL 00:44:47,533 [Fault Distributor] HOST: Failure of host 1 has rendered the cluster unviable. Shutting down..." means the cluster no longer has a copy of every partition.
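
For example, your original deployment.xml with only the kfactor changed to 1 (untested here, but with 2 hosts x 5 sites the partition count divides evenly by k+1 = 2 copies) would look like:

```xml
<?xml version="1.0"?>
<deployment>
    <!-- kfactor="1" keeps 2 copies of each partition,
         so the cluster survives one node failure -->
    <cluster hostcount="2"
             sitesperhost="5"
             kfactor="1"
    />
    <httpd port="8090" enabled="true">
        <jsonapi enabled="true" />
    </httpd>
</deployment>
```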

Ryan.
lgielgud
Mar 5, 2012
Ah - I should have read the "Recovering From System Failures" chapter more carefully.

Also, it turns out I was only using kfactor=0, and now I understand why the result I saw is expected. You're right: with kfactor > 0 the cluster survives a node failure, and I can rejoin the failed node as described in the docs.

Thanks for the replies, and keep up the nice work!