
Thread: Nodes stop talking to each other and form independent clusters

  1. #1
    New Member
    Join Date
    Dec 2012
    Posts
    5

    Nodes stop talking to each other and form independent clusters

    We're running into this issue pretty frequently now. We have a two-node cluster, and the two nodes seem to stop talking to each other, resulting in two independently operating clusters!

    Our application is pretty simple: we have data producers that insert data into the cluster and reports that read data from the cluster. The implication of the issue above is pretty significant, since the inserted data gets round-robined between the two clusters and the reports show roughly half of what they should.

    We have snapshots enabled on both machines; we take a snapshot every 30 minutes and keep the last two.
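
    For reference, the snapshot section of our deployment.xml looks roughly like the following (the prefix here is just a placeholder, not our real name):

        <snapshot prefix="clustersnap"
                  frequency="30m"
                  retain="2"/>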

    Any ideas as to why this is happening?

    Below is the log.

    Thanks!
    Mahmoud

    2013-01-28 08:22:27,783 ERROR [Heartbeat] HOST: DEAD HOST DETECTED, hostname: UNKNOWN_HOSTNAME
    2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: current time: 1359361347781
    2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: last message: 1359361337779
    2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: delta (millis): 10002
    2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: timeout value (millis): 10000
    2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 2:-1 to 8:-1 survivors
    2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 1 responses.
    2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 8:-1 for failed sites 2:-1 safe txn id 1343987019583324162 failed site 2:-1
    2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, handling site faults for newly failed sites 2:-1 initiatorSafeInitPoints {2:-11343987019583324162}
    2013-01-28 08:22:27,788 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x1255d74c8c000002
    2013-01-28 08:22:27,799 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x1255d74c8c000002
    2013-01-28 08:22:27,800 INFO [main-EventThread] LOGGING: Detected the snapshot truncation leader's ephemeral node deletion
    2013-01-28 08:22:27,800 INFO [SnapshotDaemon] LOGGING: Starting leader election for snapshot truncation daemon
    2013-01-28 08:22:27,802 INFO [Leader elector-/db/leaders/globalservice] HOST: Host 8 promoted to be the global service provider
    2013-01-28 08:22:27,817 INFO [SnapshotDaemon] LOGGING: This node was selected as the leader for snapshot truncation
    2013-01-28 08:22:27,819 ERROR [Fault Distributor] HOST: Sites failed, site ids: 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7
    2013-01-28 08:22:27,819 INFO [Thread-14] EXPORT: Attempting to boot export client due to rejoin or other cluster topology change
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:0] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
    2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:5 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808

  2. #2
    Super Moderator
    Join Date
    Feb 2010
    Posts
    186
    The cluster is reporting that the two nodes cannot communicate. After 10 seconds of this condition, each node assumes the other has died and enters failure recovery. You can prevent the split-brain by enabling the partition detection feature. See: http://voltdb.com/docs/UsingVoltDB/KsafeNetPart.php. (Note -- this is the 3.0 documentation; in 3.0 partition detection is enabled by default. In prior releases it is not.)
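
    On releases before 3.0 you have to turn it on explicitly in the deployment file. Roughly speaking, the deployment.xml entry looks like the following (the snapshot prefix is just an example; check the linked docs for the exact syntax in your release):

        <partition-detection enabled="true">
            <snapshot prefix="partition_detection"/>
        </partition-detection>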

    To figure out why this is happening, we will need more information. Can you describe where you are running this cluster? EC2? A back-room server? Do you have any monitoring of the network on these machines that would correlate network partitions with the timeouts? What version of VoltDB are you running? Finally, do you do any GC logging?

  3. #3
    Super Moderator
    Join Date
    Feb 2010
    Posts
    186
    Hey - just wanted to follow up and see whether you had resolved your problem.
    Ryan.

  4. #4
    New Member
    Join Date
    Dec 2012
    Posts
    5
    Hi Ryan,

    We're using 2.8.4 and we're running on EC2.

    I looked at the network logs and there were no dropped packets.

    We're going to change the cluster to use 3 machines instead of 2 so that the brain never gets split in half. Would that work? We'll keep the k-factor at 2.

    thanks again,
    Mahmoud

  6. #6
    Super Moderator
    Join Date
    Feb 2010
    Posts
    186
    Three machines is a good place to start, defensively. Note that k-factor is actually the number of node failures you can tolerate (replication_factor - 1). With a 2-node cluster, your current maximum k-factor is 1.
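
    As a rough example, a 3-node cluster with k-factor 1 (two copies of each partition, tolerating one node failure) is declared in deployment.xml with something along these lines (sitesperhost is just an example value; check the docs for your version):

        <cluster hostcount="3"
                 sitesperhost="6"
                 kfactor="1"/>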

    This isn't a very satisfying answer, though -- if you don't suspect an actual network partition, would you be willing to work with us to debug what might be happening in this cluster? Feel free to contact me directly: rbetts@voltdb.com.

    Thanks,
    Ryan.

  8. #8
    Super Moderator
    Join Date
    Feb 2010
    Posts
    186
    After reading my forum reply, another of our engineers asked me to follow up with a few more questions for you.

    > What size instances are these? In EC2, smaller instance types can have CPU cycles taken away (starving the cluster), or the VM can be migrated.
    >
    > In addition, time can run backwards (which would alert with this error message: ERROR 21:07:12,539 [Heartbeat] HOST: Initiator time moved backwards from:
    > 1329253632736 to 1329253632539, a difference of 0.20 seconds.)
    >
    > A workaround to keep time from running backwards is: echo 1 | sudo tee /proc/sys/xen/independent_wallclock

    Additionally, Ariel recently wrote a mesh connectivity tool; this is a separate Java program that monitors connections between nodes independently of the VoltDB process. If we provide you this tool and instructions, would you be able to run it on your VoltDB nodes to gather some information so we can get to a clearer resolution?
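
    To give a rough idea of what it does (this is only an illustrative sketch, not the actual tool): a small standalone Java program that loops over the peer host:port pairs, attempts a TCP connection to each with a timeout, and logs successes, failures, and connect latency, independently of the VoltDB process. Something along these lines:

        // Illustrative sketch only -- not the actual tool.
        // Loops forever, attempting a TCP connect to each peer and logging
        // connect latency or the failure reason.
        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.net.Socket;

        public class MeshCheck {
            public static void main(String[] args) throws InterruptedException {
                if (args.length == 0) {
                    System.err.println("usage: java MeshCheck host:port [host:port ...]");
                    return;
                }
                while (true) {
                    for (String peer : args) {
                        String[] parts = peer.split(":");
                        InetSocketAddress addr =
                                new InetSocketAddress(parts[0], Integer.parseInt(parts[1]));
                        long start = System.currentTimeMillis();
                        try (Socket s = new Socket()) {
                            s.connect(addr, 5000); // 5 second connect timeout
                            System.out.printf("%d OK   %s (%d ms)%n",
                                    start, peer, System.currentTimeMillis() - start);
                        } catch (IOException e) {
                            System.out.printf("%d FAIL %s: %s%n", start, peer, e.getMessage());
                        }
                    }
                    Thread.sleep(1000); // probe once per second
                }
            }
        }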

    Thanks.
    Ryan.
