We're running into this issue pretty frequently now. We have a two-node cluster. The two nodes seem to stop talking to each other, resulting in two separate operational clusters!
Our application is pretty simple: we have data producers that insert data into the cluster and reports that view data from the cluster. The implication of the issue above is pretty significant, since the inserted data gets round-robined across the two clusters and the reports show roughly half of what they should.
We have snapshots enabled on both machines. We take a snapshot every 30 minutes and keep the last two.
Any ideas as to why this is happening?
Below is the log.
Thanks!
Mahmoud
2013-01-28 08:22:27,783 ERROR [Heartbeat] HOST: DEAD HOST DETECTED, hostname: UNKNOWN_HOSTNAME
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: current time: 1359361347781
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: last message: 1359361337779
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: delta (millis): 10002
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: timeout value (millis): 10000
2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 2:-1 to 8:-1 survivors
2013-01-28 08:22:27,787 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 1 responses.
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 8:-1 for failed sites 2:-1 safe txn id 1343987019583324162 failed site 2:-1
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] JOIN: Agreement, handling site faults for newly failed sites 2:-1 initiatorSafeInitPoints {2:-11343987019583324162}
2013-01-28 08:22:27,788 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x1255d74c8c000002
2013-01-28 08:22:27,799 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x1255d74c8c000002
2013-01-28 08:22:27,800 INFO [main-EventThread] LOGGING: Detected the snapshot truncation leader's ephemeral node deletion
2013-01-28 08:22:27,800 INFO [SnapshotDaemon] LOGGING: Starting leader election for snapshot truncation daemon
2013-01-28 08:22:27,802 INFO [Leader elector-/db/leaders/globalservice] HOST: Host 8 promoted to be the global service provider
2013-01-28 08:22:27,817 INFO [SnapshotDaemon] LOGGING: This node was selected as the leader for snapshot truncation
2013-01-28 08:22:27,819 ERROR [Fault Distributor] HOST: Sites failed, site ids: 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7
2013-01-28 08:22:27,819 INFO [Thread-14] EXPORT: Attempting to boot export client due to rejoin or other cluster topology change
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:0] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] JOIN: Sending fault data 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 to 8:0, 8:1, 8:2, 8:3, 8:4, 8:5 survivors with lastKnownGloballyCommitedMultiPartTxnId 0
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:5] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:0 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
2013-01-28 08:22:27,820 INFO [ExecutionSite: 8:4] HOST: Received failure message from 8:5 for failed sites 2:0, 2:1, 2:2, 2:3, 2:4, 2:5, 2:6, 2:7 for initiator id 2:2 with commit point 0 safe txn id -9223372036854775808
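
To make the heartbeat numbers in the log easier to read: the snippet below is only a rough sketch that pulls the values out of the [Heartbeat] lines above and recomputes the delta against the timeout. The helper name heartbeat_fields is made up for illustration and is not part of the database; the log lines are copied from the log in this post.

```python
import re

# Heartbeat lines copied verbatim from the log in this post.
LOG = """\
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: current time: 1359361347781
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: last message: 1359361337779
2013-01-28 08:22:27,784 INFO [Heartbeat] HOST: timeout value (millis): 10000
"""


def heartbeat_fields(log_text):
    """Hypothetical helper: map each '[Heartbeat] HOST: <label>: <number>' line to {label: number}."""
    pattern = re.compile(r"\[Heartbeat\] HOST: (.+?): (\d+)\s*$")
    fields = {}
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


fields = heartbeat_fields(LOG)
delta = fields["current time"] - fields["last message"]  # millis since the last heartbeat
timeout = fields["timeout value (millis)"]
overshoot = delta - timeout  # how far past the timeout the heartbeat was

print(f"delta={delta} ms, timeout={timeout} ms, overshoot={overshoot} ms")
```

So, going by these numbers, the other host was declared dead after missing the 10-second heartbeat timeout by only a couple of milliseconds.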

