
Thread: Two node cluster rejoin issue

  1. #1
    New Member
    Join Date
    Dec 2012
    Posts
    4

    Two node cluster rejoin issue

    We have a two-node cluster. When the cluster was first started, node1 was the leader.

    We tested 2 scenarios:

    1. With node1 running (it has been the leader since the cluster was started one month ago), node2 can join successfully.
    2. After that, with both nodes holding the same data, we brought down node1 and then tried to rejoin it to the cluster through node2. This test failed with errors like these:

    2013-11-05 01:14:43,197 INFO [main] CONSOLE: Build: 3.2.1 voltdb-3.2.1-0-gcaca22e-local Community Edition
    2013-11-05 01:14:43,208 INFO [main] NETWORK: Default network thread count: 4
    2013-11-05 01:14:43,239 INFO [main] HOST: Beginning inter-node communication on port 3021.
    2013-11-05 01:14:43,239 INFO [main] HOST: Attempting to bind to leader ip volt-n02.addsrv.com/10.84.121.153:3021
    2013-11-05 01:14:43,241 INFO [main] CONSOLE: Connecting to the VoltDB cluster leader volt-n02.addsrv.com/10.84.121.153:3021
    2013-11-05 01:14:43,243 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
    2013-11-05 01:14:43,493 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
    .....

    We eventually had to restart the whole cluster. Any suggestions as to why node1 failed to rejoin through node2?

  2. #2
    Super Moderator
    Join Date
    Feb 2010
    Posts
    79
    Originally posted by hchen:
    2013-11-05 01:14:43,493 WARN [main] org.voltcore.messaging.SocketJoiner: Joining primary failed: Connection refused retrying..
    The error message above means that node1 could not establish a connection to node2, most likely because node2 was not running any more.
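
    To confirm that before retrying the rejoin, you can check from node1 whether anything is listening on node2's internal port. The snippet below is only a rough sketch: the host name and port 3021 are taken from the log above, while the function name `can_connect` and the 2-second timeout are my own choices and not part of VoltDB.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host and port as they appear in the log; adjust for your environment.
    host, port = "volt-n02.addsrv.com", 3021
    state = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

    If this reports the port as not reachable, the VoltDB process on node2 is most likely down (or a firewall is blocking the port), and there is no running cluster for node1 to rejoin.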

    node2 might have terminated itself when you brought down node1, because losing node1 triggered network partition detection. If you are testing rejoin in a dev environment, you can disable network partition detection and then bring down either of the two nodes. This is not recommended in production. In production, you can switch to an odd number of nodes with a K-safety value of 1, so that a network partition can never split the cluster into two equal halves. To find out more about network partition detection, please follow the link below.

    http://voltdb.com/docs/UsingVoltDB/KsafeNetPart.php
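
    For a dev-only test, partition detection is switched off in the deployment file. The fragment below is a sketch, not your actual configuration: `hostcount="2"` matches the two nodes in this thread, but `sitesperhost`, `kfactor`, and the snapshot prefix are placeholder values I am assuming, so please check them against your own deployment file and the documentation for your VoltDB version.

```xml
<?xml version="1.0"?>
<deployment>
    <!-- hostcount matches the two-node cluster; sitesperhost and kfactor are assumed values -->
    <cluster hostcount="2" sitesperhost="2" kfactor="1"/>
    <!-- Dev/test only: with detection disabled, the surviving node keeps running -->
    <partition-detection enabled="false">
        <!-- The prefix is a placeholder name -->
        <snapshot prefix="netpart"/>
    </partition-detection>
</deployment>
```

    Remember to set `enabled` back to `"true"` before using the same deployment file in production.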
    Ning
