Forum: Managing VoltDB

Post: Two node cluster rejoin issue

hchen
Nov 5, 2013
We have a two-node cluster. When the cluster was started, node1 was the leader.

We tested 2 scenarios:

1. With node1 running (it was the leader when the cluster was started one month ago), node2 can join successfully.
2. After that, with both nodes holding the same data, we brought down node1 and then tried to rejoin it through node2. This test failed with errors like these:

2013-11-05 01:14:43,197 INFO [main] CONSOLE: Build: 3.2.1 voltdb-3.2.1-0-gcaca22e-local Community Edition
2013-11-05 01:14:43,208 INFO [main] NETWORK: Default network thread count: 4
2013-11-05 01:14:43,239 INFO [main] HOST: Beginning inter-node communication on port 3021.
2013-11-05 01:14:43,239 INFO [main] HOST: Attempting to bind to leader ip volt-n02.addsrv.com/10.84.121.153:3021
2013-11-05 01:14:43,241 INFO [main] CONSOLE: Connecting to the VoltDB cluster leader volt-n02.addsrv.com/10.84.121.153:3021
2013-11-05 01:14:43,243 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
2013-11-05 01:14:43,493 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
.....

We eventually had to reboot the cluster. Any suggestions as to why node1 failed to join node2?
nshi
Nov 5, 2013

2013-11-05 01:14:43,493 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..


The error message above means that node1 could not establish a connection to node2 on port 3021, most likely because node2 was no longer running.
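
One quick way to confirm this from node1 is to test whether node2 accepts TCP connections on the internal port shown in the log. The following is a minimal sketch, not part of VoltDB; the helper name can_connect is hypothetical, and the host name and port are the ones from the log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name-resolution failures.
        return False


# Example (host and port taken from the log above):
#   can_connect("volt-n02.addsrv.com", 3021)
```

A False result is consistent with the VoltDB process on node2 having exited, although a firewall rule or a wrong host name would look the same from node1.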

When you brought down node1, node2 may have shut itself down because network partition detection was triggered. If you are only testing rejoin in a dev environment, you can disable network partition detection and then bring down either of the two nodes, but this is not recommended in production. In production, you can switch to an odd number of nodes with ksafety=1 so that losing one node still leaves a majority of the cluster running. To learn more about network partition detection, please follow the link below.

http://voltdb.com/docs/UsingVoltDB/KsafeNetPart.php
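
For the dev-environment test described above, a deployment file with network partition detection turned off might look roughly like this. It is only a sketch: hostcount and kfactor are assumed from the two-node, ksafety=1 setup discussed in this thread, sitesperhost is a placeholder, and the element names should be checked against the documentation for your VoltDB version.

```xml
<?xml version="1.0"?>
<deployment>
   <!-- Assumed values: two hosts, K-safety of 1; sitesperhost is a placeholder -->
   <cluster hostcount="2" sitesperhost="2" kfactor="1" />
   <!-- Dev/test only: do not disable partition detection in production -->
   <partition-detection enabled="false" />
</deployment>
```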