[possible serious bug] about Only one host can rejoin at a time. Host 8 is still rejoining error


  • #1

    We are testing VoltDB 7.4. The cluster consists of two nodes, volt1 and volt2. If we restart the VoltDB process on one node (volt1) while the clocks of the two nodes are not synchronized, volt1 reports clock skew on startup and aborts. That is expected. However, even after we synchronize the clocks and start VoltDB on volt1 again, it continuously reports the error messages below:

    2017-07-18 16:13:09,754 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 10 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:13:19,773 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 29 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:13:48,792 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 63 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:14:51,810 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 147 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:16:31,269 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 10 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:16:41,287 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 43 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:17:18,831 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 300 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
    2017-07-18 16:17:24,300 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 109 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
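
The blocking host ID and the retry delays can be pulled out of log lines like the ones above with a small script. This is only a sketch: the regular expression is written against the exact message format quoted in this post, and the names `Rejection`, `REJECTION_RE`, and `parse_rejections` are made up for illustration, not part of any VoltDB tooling.

```python
import re
from typing import Iterable, List, NamedTuple


class Rejection(NamedTuple):
    """One 'join rejected' warning extracted from the log (hypothetical type)."""
    timestamp: str
    retry_seconds: int
    blocking_host: int


# Pattern assumed from the log lines quoted in this thread only.
REJECTION_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN \[main\] JOINER: "
    r"Request to join cluster mesh is rejected, retrying in (?P<retry>\d+) seconds\. "
    r"Only one host can rejoin at a time\. Host (?P<host>\d+) is still rejoining"
)


def parse_rejections(lines: Iterable[str]) -> List[Rejection]:
    """Return every rejection warning found in the given lines, in order."""
    found = []
    for line in lines:
        match = REJECTION_RE.search(line)
        if match:
            found.append(
                Rejection(
                    timestamp=match.group("ts"),
                    retry_seconds=int(match.group("retry")),
                    blocking_host=int(match.group("host")),
                )
            )
    return found


# Two lines copied from the post, plus a placeholder line that must be skipped.
SAMPLE = [
    "2017-07-18 16:13:09,754 WARN [main] JOINER: Request to join cluster mesh is rejected, "
    "retrying in 10 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.",
    "2017-07-18 16:13:19,773 WARN [main] JOINER: Request to join cluster mesh is rejected, "
    "retrying in 29 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.",
    "(placeholder for any unrelated log line)",
]

rejections = parse_rejections(SAMPLE)
blocking_hosts = sorted({r.blocking_host for r in rejections})
print("blocking host IDs:", blocking_hosts)
print("retry delays:", [r.retry_seconds for r in rejections])
```

If the same host ID keeps showing up as the blocker long after that process has exited, that matches the symptom described here.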

    To get back to normal we have to restart volt2, which is undesirable because it makes the whole cluster unavailable. It seems that volt1's aborted first join attempt left some state behind on volt2. Is there a way to tell volt2 that no node is currently rejoining?

    BTW, network partition detection is turned off for this two-node cluster.

    Regards,
    -Xiang

  • #2
    By the way, this is the command we use to start VoltDB:

    voltdb start -B --host=volt1,volt2 --count=2 --dir=$VOLTDBROOT_PARENT --license=$HOME/conf/license.xml



    • #3
      Hi Xiang,

      Are you using any scripts to execute the "voltdb start" command, or are you starting it manually?

      Normally when you start a two-node cluster like this, one will be Host 0 and the other will be Host 1. They will start together as a cluster, so there won't be any node rejoining. Afterwards, if a node fails, you would start it again with the same command and it would rejoin the cluster as Host 2. The Host ID numbers are never re-used as long as the cluster remains available.

      How is it that you are up beyond Host 8? I think you may have run some sequence of incorrect commands that put the cluster into a confused state. Please attach the volt.log files for volt1 and volt2, and we can piece together the history and see if this is a bug.

      The easiest way to fix this would be to shut down volt2 and restart both nodes together.

      Thanks,
      Ben



      • #4
        Hi Ben,

        Thanks for your quick response.

        We just use the command "voltdb start -B --host=volt1,volt2 --count=2 --dir=$VOLTDBROOT_PARENT --license=$HOME/conf/license.xml" to start voltdb.

        From my observation, each time the VoltDB process restarts, it gets a new host ID.

        I ran a quick test again, and the issue is reproducible:
        I started VoltDB on volt1 and volt2. From voltadmin status, I can see that the host IDs are 0 and 1.
        Then I stopped VoltDB on volt2 and restarted it. It rejoined the cluster successfully, and this time the host ID of volt2 was 2.
        Next I stopped VoltDB on volt2 again and intentionally set its clock 2 minutes ahead of the current time. VoltDB on volt2 then failed to start (expected, since it complained about clock skew; I guess this attempt was assigned host ID 3).
        Finally I synced the time and started VoltDB on volt2 again. This time the log message below appears continuously; that is, VoltDB on volt1 still thinks the last failed join attempt (by host ID 3) is still rejoining.

        Request to join cluster mesh is rejected, retrying in 25 seconds. Only one host can rejoin at a time. Host 3 is still rejoining

        I waited for at least several hours, and it seems VoltDB on volt1 never timed out the failed join attempt by host ID 3.

        So I think it's a bug.
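
The sequence above can be illustrated with a toy model. This is not VoltDB code and makes no claim about how VoltDB is implemented; `ToyMesh` and all of its methods are hypothetical. It only assumes what this thread describes: host IDs are handed out from an increasing counter and never reused, only one rejoin may be in flight, and (the suspected problem) an aborted join never clears the "rejoining" slot.

```python
from typing import Optional, Set


class ToyMesh:
    """Hypothetical model of the surviving node's view of the cluster mesh."""

    def __init__(self, initial_hosts: int) -> None:
        self.live: Set[int] = set(range(initial_hosts))
        self.next_host_id = initial_hosts      # IDs are never reused
        self.rejoining: Optional[int] = None   # at most one rejoin in flight

    def fail(self, host_id: int) -> None:
        self.live.discard(host_id)

    def request_rejoin(self) -> int:
        if self.rejoining is not None:
            raise RuntimeError(
                "Only one host can rejoin at a time. "
                f"Host {self.rejoining} is still rejoining"
            )
        host_id = self.next_host_id
        self.next_host_id += 1
        self.rejoining = host_id
        return host_id

    def complete_rejoin(self, host_id: int) -> None:
        self.live.add(host_id)
        self.rejoining = None

    def abort_rejoin(self, cleanup: bool) -> None:
        # Assumed failure mode: without cleanup the slot stays occupied forever.
        if cleanup:
            self.rejoining = None


mesh = ToyMesh(initial_hosts=2)        # volt1 = host 0, volt2 = host 1
mesh.fail(1)                           # stop volt2
second_id = mesh.request_rejoin()      # volt2 comes back as host 2
mesh.complete_rejoin(second_id)

mesh.fail(second_id)                   # stop volt2 again
third_id = mesh.request_rejoin()       # attempt with skewed clock gets host 3
mesh.abort_rejoin(cleanup=False)       # joiner dies, slot is not cleared

try:
    mesh.request_rejoin()              # clock fixed, but the join is rejected
    error_message = ""
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

In this model a later join succeeds only if `abort_rejoin` is called with `cleanup=True` or the surviving node is restarted, which is consistent with restarting the other node being the only workaround found so far.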

        -Regards,
        Xiang



        • #5
          Hi Xiang,

          Thank you for testing that this is reproducible. I've filed a bug: https://issues.voltdb.com/browse/ENG-12876

          We will fix this in a future release.

          Best regards,
          Ben
