Forum: Managing VoltDB

Post: Community Edition: Error while Setting up a Cluster with 4 Nodes

revdev
Aug 1, 2012
Hi,

I have a cluster of 4 nodes, but I cannot get it to initialize properly. The details are below. Please let me know if I am doing anything wrong. Thanks a lot!

Command run on all 4 machines:
voltdb catalog helloWorld.jar deployment deployment.xml host voltest1 &


Deployment.xml:
<?xml version="1.0"?>
<deployment>
<cluster hostcount="4" sitesperhost="3" kfactor="1"/>
<paths>
<voltdbroot path="/tmp" />
<snapshots path="/opt/voltdbsaves" />
</paths>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>
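
As a sanity check on the cluster settings above, the number of logical partitions can be worked out from hostcount, sitesperhost and kfactor. The sketch below assumes the usual rule that each partition is stored kfactor + 1 times; the function name `logical_partitions` is made up for illustration and is not part of VoltDB.

```python
def logical_partitions(hostcount: int, sitesperhost: int, kfactor: int) -> int:
    """Return the number of unique partitions for a cluster.

    Assumption: every partition is kept in (kfactor + 1) copies, so the
    total number of sites must divide evenly by that copy count.
    """
    total_sites = hostcount * sitesperhost
    copies = kfactor + 1
    if total_sites % copies != 0:
        raise ValueError("hostcount * sitesperhost must be divisible by kfactor + 1")
    return total_sites // copies


# Values taken from the <cluster> element in the deployment file.
print(logical_partitions(hostcount=4, sitesperhost=3, kfactor=1))  # prints 6
```

With 4 hosts, 3 sites per host and K = 1 this gives 12 sites and 6 partitions with 2 copies each, so the settings themselves appear consistent.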

Error on exit on the leader host (voltest1):
Build: 2.8 voltdb-2.8-0-gd88e671-local Community Edition
Connecting to VoltDB cluster as the leader...
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at org.voltcore.network.VoltPort.lockForHandlingWork(VoltPort.java:169)
at org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:387)
at org.voltcore.network.VoltNetwork.access$300(VoltNetwork.java:85)
at org.voltcore.network.VoltNetwork$3.run(VoltNetwork.java:274)
at org.voltcore.network.VoltNetwork.run(VoltNetwork.java:303)
at java.lang.Thread.run(Thread.java:679)
ERROR: Sites failed, site ids: 3:3, 2:2, 1:1, 2:3, 3:2, 1:0, 1:3, 3:1, 2:0, 3:0, 2:1, 1:2
FATAL: Node fault detected before all nodes finished initializing. Cluster will not start.
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.

Error on exit on secondary machines:
Build: 2.8 voltdb-2.8-0-gd88e671-local Community Edition
Connecting to the VoltDB cluster leader voltest1/10.178.111.37:3021
WARN: Joining primary failed: Connection refused retrying..
WARN: Joining primary failed: Connection refused retrying..
3 Notified of host 0
3 Notified of host 1
3 Notified of host 2
FATAL: Failed to initialize site tracker with all hosts before timeout
FATAL: Stack trace from crashLocalVoltDB() method:
FATAL: java.lang.Thread.dumpThreads(Native Method)
FATAL: java.lang.Thread.getAllStackTraces(Thread.java:1546)
FATAL: org.voltdb.VoltDB.crashLocalVoltDB(VoltDB.java:529)
FATAL: org.voltdb.RealVoltDB.initialize(RealVoltDB.java:478)
FATAL: org.voltdb.VoltDB.initialize(VoltDB.java:677)
FATAL: org.voltdb.VoltDB.main(VoltDB.java:661)
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.

Logs on the leader host (voltest1):
2012-08-01 21:41:57,181 INFO [main] CONSOLE: Initializing VoltDB...

 _    __      ____  ____  ____
| |  / /___  / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __  |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/

--------------------------------

2012-08-01 21:41:57,201 INFO [main] CONSOLE: Build: 2.8 voltdb-2.8-0-gd88e671-local Community Edition
2012-08-01 21:41:57,217 INFO [main] NETWORK: Default network thread count: 2
2012-08-01 21:41:57,241 INFO [main] HOST: Beginning inter-node communication on port 3021.
2012-08-01 21:41:57,242 INFO [main] HOST: Attempting to bind to leader ip voltest1/10.178.111.37:3021
2012-08-01 21:41:57,246 INFO [main] CONSOLE: Connecting to VoltDB cluster as the leader...
2012-08-01 21:41:57,297 INFO [main] ZK-SERVER: binding to port /127.0.0.1:2181
2012-08-01 21:41:57,313 INFO [main] ZK-SERVER: Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
2012-08-01 21:41:57,387 INFO [main] ZK-SERVER: Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=2000 watcher=org.voltcore.zk.ZKUtil$1@b07f45d
2012-08-01 21:41:57,396 INFO [main-SendThread()] ZK-CLIENT: Opening socket connection to server /127.0.0.1:2181
2012-08-01 21:41:57,400 INFO [NIOServerCxn.Factory:/127.0.0.1:2181] ZK-SERVER: Accepted socket connection from /127.0.0.1:42151
2012-08-01 21:41:57,402 INFO [main-SendThread(localhost:2181)] ZK-CLIENT: Socket connection established to localhost/127.0.0.1:2181, initiating session
2012-08-01 21:41:57,405 INFO [NIOServerCxn.Factory:/127.0.0.1:2181] ZK-SERVER: Client attempting to establish new session at /127.0.0.1:42151
2012-08-01 21:41:57,421 INFO [ZooKeeperServer] ZK-SERVER: Established session 0x10d8bec90f000000 with negotiated timeout 6000 for client /127.0.0.1:42151
2012-08-01 21:41:57,421 INFO [main-SendThread(localhost:2181)] ZK-CLIENT: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x10d8bec90f000000, negotiated timeout = 6000
2012-08-01 21:41:57,472 INFO [Socket Joiner] HOST: Attempting to bind to internal ip 0.0.0.0/0.0.0.0:3021
2012-08-01 21:41:57,499 INFO [main] HOST: URL of deployment info: deployment.xml
2012-08-01 21:41:57,794 INFO [main] HOST: Cluster has 4 hosts with leader hostname: "voltest1". 3 sites per host. K = 1.
2012-08-01 21:41:57,794 INFO [main] HOST: The entire cluster has 2 copies of each of the 6 logical partitions.
2012-08-01 21:41:57,794 INFO [main] HOST: Detection of network partitions in the cluster is not enabled.
2012-08-01 21:41:57,795 INFO [main] HOST: Using "/tmp" for voltdbroot directory.
2012-08-01 21:41:58,123 INFO [main] HOST: hsql loaded
2012-08-01 21:42:08,524 INFO [Socket Joiner] HOST: Received request type REQUEST_HOSTID
2012-08-01 21:42:08,546 INFO [Socket Joiner] HOST: Heartbeat timeout to host: /10.178.96.253:41004 is 10000 milliseconds
2012-08-01 21:42:08,560 INFO [ZooKeeperServer] JOIN: Joining site 1:-1 known active sites 0:-1, 1:-1
2012-08-01 21:42:08,639 INFO [ZooKeeperServer] JOIN: Shipping ZK snapshot from 0:-1 to 1:-1
2012-08-01 21:42:13,072 INFO [Socket Joiner] HOST: Received request type REQUEST_HOSTID
2012-08-01 21:42:16,118 INFO [Socket Joiner] HOST: Heartbeat timeout to host: /10.178.111.2:60026 is 10000 milliseconds
2012-08-01 21:42:16,208 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930002932826114 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:16,213 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930003008323586 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:16,224 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930003092209666 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:16,229 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930003142541314 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:16,235 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930003192872962 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:16,246 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 2:-1) FOR TXN 1213930003285147650 and LAST SAFE -1 because it is not from a known up site
<ABOVE REPEATS A FEW HUNDRED MORE TIMES>
2012-08-01 21:42:19,157 INFO [ZooKeeperServer] JOIN: Joining site 2:-1 known active sites 0:-1, 1:-1, 2:-1
2012-08-01 21:42:19,157 INFO [Socket Joiner] HOST: Received request type REQUEST_HOSTID
2012-08-01 21:42:22,198 INFO [ZooKeeperServer] JOIN: Shipping ZK snapshot from 0:-1 to 2:-1
2012-08-01 21:42:22,714 INFO [Socket Joiner] HOST: Heartbeat timeout to host: /10.178.111.3:43066 is 10000 milliseconds
2012-08-01 21:42:22,808 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 3:-1) FOR TXN 1213930057215508483 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:22,815 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 3:-1) FOR TXN 1213930057299394563 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:22,821 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 3:-1) FOR TXN 1213930057349726211 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:22,827 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 3:-1) FOR TXN 1213930057400057859 and LAST SAFE -1 because it is not from a known up site
2012-08-01 21:42:22,838 INFO [ZooKeeperServer] JOIN: Dropping message HEARTBEAT (FROM 3:-1) FOR TXN 1213930057483943939 and LAST SAFE -1 because it is not from a known up site
<ABOVE REPEATS A FEW HUNDRED MORE TIMES>
2012-08-01 21:42:25,753 INFO [ZooKeeperServer] JOIN: Joining site 3:-1 known active sites 0:-1, 1:-1, 2:-1, 3:-1
2012-08-01 21:42:25,762 INFO [ZooKeeperServer] JOIN: Shipping ZK snapshot from 0:-1 to 3:-1
2012-08-01 21:42:44,856 INFO [main] HOST: Registering stats mailbox id 0:-2
2012-08-01 21:42:47,415 WARN [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: Attempted delivery of message to failed site: 1:-1
2012-08-01 21:42:47,417 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 1:-1 to 3:-1, 2:-1, 0:-1 survivors
2012-08-01 21:42:47,418 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 3 responses.
2012-08-01 21:42:47,418 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 0:-1 for failed sites 1:-1 safe txn id 1213930261536833537 failed site 1:-1
2012-08-01 21:42:47,575 INFO [ZooKeeperServer] JOIN: Agreement, Detected a concurrent failure from FaultDistributor, new failed site 3:-1
2012-08-01 21:42:47,575 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 3:-1, 1:-1 to 2:-1, 0:-1 survivors
2012-08-01 21:42:47,576 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 4 responses.
2012-08-01 21:42:47,576 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 0:-1 for failed sites 3:-1, 1:-1 safe txn id 1213930262174367747 failed site 3:-1
2012-08-01 21:42:47,576 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 0:-1 for failed sites 3:-1, 1:-1 safe txn id 1213930261536833537 failed site 1:-1
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, Detected a concurrent failure from FaultDistributor, new failed site 2:-1
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, Sending fault data 3:-1, 2:-1, 1:-1 to 0:-1 survivors
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, Sent fault data. Expecting 3 responses.
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 0:-1 for failed sites 3:-1, 2:-1, 1:-1 safe txn id 1213930262174367747 failed site 3:-1
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, Received failure message from 0:-1 for failed sites 3:-1, 2:-1, 1:-1 safe txn id 1213930263843700738 failed site 2:-1
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] JOIN: Agreement, handling site faults for newly failed sites 3:-1, 2:-1, 1:-1 initiatorSafeInitPoints {3:-11213930262174367747, 2:-11213930263843700738, 1:-11213930261536833537}
2012-08-01 21:42:47,656 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x10d8befaea000003
2012-08-01 21:42:47,657 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x10d8bef424000002
2012-08-01 21:42:47,657 INFO [ZooKeeperServer] ZK-SERVER: Initiating close of session 0x10d8beda19000001
2012-08-01 21:42:47,658 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x10d8befaea000003
2012-08-01 21:42:47,674 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x10d8bef424000002
2012-08-01 21:42:47,675 INFO [ZooKeeperServer] ZK-SERVER: Processed session termination for sessionid: 0x10d8beda19000001
2012-08-01 21:42:47,689 ERROR [Fault Distributor] HOST: Sites failed, site ids: 3:3, 2:2, 1:1, 2:3, 3:2, 1:0, 1:3, 3:1, 2:0, 3:0, 2:1, 1:2
2012-08-01 21:42:48,198 FATAL [Fault Distributor] HOST: Node fault detected before all nodes finished initializing. Cluster will not start.
rbetts
Aug 2, 2012
The deployment file and command line look sane. Can you include the full logs from the nodes other than voltest1? You can email them to me (rbetts@voltdb.com) if you don't want to prune them for the forum.
Ryan.
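
If pruning the logs by hand is tedious, a small filter along these lines might help. It is only a sketch: it assumes every log line follows the "timestamp LEVEL [thread] message" layout seen in the leader's log above, and the names `LOG_LINE` and `prune_log` are hypothetical helpers, not anything shipped with VoltDB.

```python
import re

# Assumed line layout, based on the log excerpt in this thread:
# 2012-08-01 21:42:47,689 ERROR [Fault Distributor] HOST: Sites failed, ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]*)\]\s+(?P<rest>.*)$"
)


def prune_log(lines, keep_levels=("WARN", "ERROR", "FATAL")):
    """Keep only lines whose log level is in keep_levels.

    Lines that do not match the assumed layout (banner art, blank
    lines, stack trace continuations) are dropped.
    """
    kept = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("level") in keep_levels:
            kept.append(line.rstrip("\n"))
    return kept


sample = [
    "2012-08-01 21:41:57,181 INFO [main] CONSOLE: Initializing VoltDB...",
    "2012-08-01 21:42:47,415 WARN [ZooKeeperServer] org.voltdb.messaging.impl.HostMessenger: "
    "Attempted delivery of message to failed site: 1:-1",
    "2012-08-01 21:42:48,198 FATAL [Fault Distributor] HOST: Node fault detected before "
    "all nodes finished initializing. Cluster will not start.",
]
for line in prune_log(sample):
    print(line)
```

To run it on a real file, pass the open file object instead of `sample`, for example `prune_log(open(path))` with `path` pointing at the node's log. Note that this drops the repeated INFO heartbeat messages too, so keep the full log around in case it is needed.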