Two node cluster rejoin issue
Nov 5, 2013
We have a two-node cluster. When we started the cluster, node01 was the leader.
We tested two scenarios:
1. With node1 running (it was the leader when we started the cluster one month ago), node2 can join successfully.
2. However, after that (with both nodes containing the same data), we brought down node1 and then tried to rejoin it to node2. This test failed with errors like the following:
2013-11-05 01:14:43,197 INFO [main] CONSOLE: Build: 3.2.1 voltdb-3.2.1-0-gcaca22e-local Community Edition
2013-11-05 01:14:43,208 INFO [main] NETWORK: Default network thread count: 4
2013-11-05 01:14:43,239 INFO [main] HOST: Beginning inter-node communication on port 3021.
2013-11-05 01:14:43,239 INFO [main] HOST: Attempting to bind to leader ip volt-n02.addsrv.com/10.84.121.153:3021
2013-11-05 01:14:43,241 INFO [main] CONSOLE: Connecting to the VoltDB cluster leader volt-n02.addsrv.com/10.84.121.153:3021
2013-11-05 01:14:43,243 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
2013-11-05 01:14:43,493 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
.....
Eventually we had to reboot the cluster. Does anyone have suggestions as to why node1 failed to rejoin node2?
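The `Connection refused` warnings mean node1 reached 10.84.121.153, but the TCP connection to port 3021 was actively rejected. That usually means nothing was listening on that port, or a firewall rejected the connection. You can check this independently of VoltDB by running the following from node1. This is a minimal sketch using only the Python standard library. The host and port come from the log above, and the function name is hypothetical. A successful connection only shows that something is listening on 3021. It does not show that the running node will accept a rejoin.

```python
import socket

def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (nothing listening or actively rejected),
        # timeouts and name-resolution errors are all subclasses of OSError.
        return False

if __name__ == "__main__":
    # Leader address and internal port as they appear in the rejoin log.
    leader_host, leader_port = "10.84.121.153", 3021
    ok = port_accepts_connections(leader_host, leader_port)
    print(f"{leader_host}:{leader_port} accepting connections: {ok}")
```

If this prints `False` while node2 is up, check the process and network on node2's side (is it still listening on 3021, and is a firewall in the way?) before looking at the rejoin itself.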