Forum: Managing VoltDB

Post: Nodes stop talking to each other and form independent clusters

Nodes stop talking to each other and form independent clusters
malnahlawi
Jan 28, 2013
We're running into this issue pretty frequently now. We have a two-node cluster. The two nodes seem to stop talking to each other, resulting in two independently operational clusters!

Our application is pretty simple: we have data producers that insert data into the cluster and reports that read data from the cluster. The implication of the issue above is pretty significant, since the inserted data gets round-robined across the two clusters and the reports show roughly half of what they should.

We have snapshots enabled on both machines. We take a snapshot every 30 minutes and keep the last two.
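
For reference, the deployment.xml entry driving that schedule looks roughly like the following (the prefix is just what we happen to use; frequency and retain are the attributes as I understand the deployment schema):

    <!-- automated snapshots: every 30 minutes, keep the two most recent -->
    <snapshot prefix="autosnap" frequency="30m" retain="2"/>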

Any ideas as to why this is happening?

Below is the log.

Thanks!
Mahmoud

2013-01-28 08:22:27,783 ERROR [Heartbeat] HOST: DEAD HOST DETECTED, hostname: UNKNOWN_HOSTNAME
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: current time: 1359361347781
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: last message: 1359361337779
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: delta (millis): 10002
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: timeout value (millis): 10000
2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 2:-1 to 8:-1 survivors
2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 1 responses.
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 8:-1 for failed sites 2:-1 safe txn id 1343987019583324162 failed site 2:-1
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, handling site faults for newly failed sites 2:-1 initiatorSafeInitPoints {2:-11343987019583324162}
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x1255d74c8c000002
2013-01-28 08:22:27,799 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x1255d74c8c000002
2013-01-28 08:22:27,800 INFO [main-EventThread] LOGGING: Detected the snapshot truncation leader's ephemeral node deletion
2013-01-28 08:22:27,800 INFO [SnapshotDaemon] LOGGING: Starting leader election for snapshot truncation daemon
2013-01-28 08:22:27,802 INFO [Leader elector-/db/leaders/globalservice] HOST: Host 8 promoted to be the global service provider
2013-01-28 08:22:27,817 INFO [SnapshotDaemon] LOGGING: This node was selected as the leader for snapshot truncation
2013-01-28 08:22:27,819 ERROR [Fault Distributor] HOST: Sites failed, site ids: 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7
2013-01-28 08:22:27,819 INFO [Thread-14] EXPORT: Attempting to boot export client due to rejoin or other cluster topology change
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:0] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:5 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
rbetts
Jan 28, 2013
The cluster is reporting that the two nodes cannot communicate. After 10 seconds of this condition, each node assumes the other has died and enters failure recovery. You can prevent the split-brain by enabling the partition detection feature. See: http://voltdb.com/docs/UsingVoltDB/KsafeNetPart.php. (Note -- this is the 3.0 documentation; in 3.0 partition detection is enabled by default. In prior releases this is not the case.)
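
In deployment.xml it looks roughly like this (a sketch -- check the docs linked above for the exact schema in your release, and note the snapshot prefix is only an example):

    <!-- when the cluster splits, the non-viable segment snapshots with this prefix and shuts down -->
    <partition-detection enabled="true">
        <snapshot prefix="netfault"/>
    </partition-detection>

With that enabled, only one segment keeps running after a split; the other writes a final snapshot and shuts itself down, so you never end up with two independent clusters both accepting writes. The 10-second window in your log is the heartbeat timeout; if I recall correctly, newer releases also let you tune it via a <heartbeat timeout="..."/> element in deployment.xml.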

To figure out why this is happening, we will need more information. Can you describe where you are running this cluster? EC2? Back-room? Do you have any monitoring of the network on these machines (that would correlate network partitions with the timeouts)? What version of VoltDB are you running? Finally, do you do any GC logging?
rbetts
Jan 31, 2013
Hey - wanted to follow up and see if you had resolved your problem?
Ryan.
malnahlawi
Feb 5, 2013
Hi Ryan,

We're using 2.8.4 and we're running on EC2.

I looked at the network logs and there were no dropped packets.

We're going to change the cluster to use 3 machines instead of 2 so that the brain never gets split in half. Would that work? We'll keep the k-factor at 2.

thanks again,
Mahmoud
rbetts
Feb 5, 2013
Three machines is a good place to start, defensively. Note that the k-factor is the number of node failures you can tolerate (replication factor minus 1). With a two-node cluster, your current maximum k-factor is 1.
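
In deployment.xml terms, a three-node layout with full replication would look something like this (sitesperhost is only an example value -- size it for your hardware):

    <!-- 3 hosts, every partition replicated on all three nodes -->
    <cluster hostcount="3" sitesperhost="6" kfactor="2"/>

With kfactor="2" each partition has three copies, so the cluster can survive losing any two nodes; kfactor="1" would tolerate a single failure and use less memory per node.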

This isn't a very satisfying answer, though -- if you don't suspect an actual network partition, would you be willing to work with us to debug what might be happening in this cluster? Feel free to contact me directly: rbetts@voltdb.com.

Thanks,
Ryan.
rbetts
Feb 5, 2013
After reading my forum reply, another of our engineers asked me to follow up with a few more questions for you.

> What size instances? In EC2, smaller instance types can have CPU cycles taken away (starving the cluster), or the VM can be migrated.
>
> In addition, time can run backwards (which would alert with this error message: ERROR 21:07:12,539 [Heartbeat] HOST: Initiator time moved backwards from:
> 1329253632736 to 1329253632539, a difference of 0.20 seconds.)
>
> The workaround to keep time from running backwards is: echo 1 | sudo tee /proc/sys/xen/independent_wallclock

Additionally, Ariel recently wrote a mesh connectivity tool; this is a separate Java program that monitors connections between nodes independently of the VoltDB process. If we provide you this tool and instructions, would you be able to run it on your VoltDB nodes to gather some information that would help us reach a clearer resolution?

Thanks.
Ryan.