Problems rejoining - dropped Heartbeat Messages
Jul 11, 2012
Hi,
I'm trying VoltDB CE in a two-node cluster. When I stop the VoltDB
process on one of the hosts (after both have initialised successfully)
and try to rejoin it, the rejoin fails and the logs complain about
dropped heartbeat messages:
-----------------------------------------------------------------
HOST2:
-----------------------------------------------------------------
[voltdb@volt2 VoltCollect]$ voltdb catalog collect.jar deployment deployment.xml leader volt1
Initializing VoltDB...
_ __ ____ ____ ____
| | / /___ / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __ |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/
--------------------------------
Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition
Connecting to the VoltDB cluster leader volt1/10.241.57.139:3021
1 Notified of host 0
Initializing initiator ID: 1, SiteID: 1:2
WARN: Running without redundancy (k=0) is not recommended for production use.
Server completed initialization.
-----------------------------------------------------------------
It started fine. I killed it and tried rejoining...
-----------------------------------------------------------------
[voltdb@volt2 VoltCollect]$ voltdb rejoinhost volt1 deployment deployment.xml
Initializing VoltDB...
_ __ ____ ____ ____
| | / /___ / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __ |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/
--------------------------------
Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition
Connecting to the VoltDB cluster leader volt1/10.241.57.139:3021
2 Notified of host 0
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@781205b8
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@6735b09d
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@75de485a
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@5460492a
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@7d638fac
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@4afc818c
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@79d34ca
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@61f4bdad
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@ad0db19
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@15e04bdb
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@38942215
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@549adb8
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@282c0dbe
Initializing initiator ID: 2, SiteID: 2:2
WARN: Running without redundancy (k=0) is not recommended for production use.
Server completed initialization.
FATAL: Timed out waiting for connection from source partition
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.
-----------------------------------------------------------------
HOST1:
-----------------------------------------------------------------
[voltdb@volt1 VoltCollect]$ voltdb catalog collect.jar deployment deployment.xml leader volt1
Initializing VoltDB...
_ __ ____ ____ ____
| | / /___ / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __ |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/
--------------------------------
Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition
Connecting to VoltDB cluster as the leader...
Initializing initiator ID: 0, SiteID: 0:2
WARN: Running without redundancy (k=0) is not recommended for production use.
Server completed initialization.
-----------------------------------------------------------------
When the other host is disconnected:
-----------------------------------------------------------------
ERROR: Sites failed, site ids: 1:1, 1:0, 1:3, 1:2
Failure delta is 1:1, 1:0, 1:3, 1:2 with failures 1:1, 1:0, 1:3, 1:2
Failure delta is 1:1, 1:0, 1:3, 1:2 with failures 1:1, 1:0, 1:3, 1:2
-----------------------------------------------------------------
On the rejoin attempt:
-----------------------------------------------------------------
ERROR: Sites failed, site ids: 2:2, 2:3, 2:0, 2:1
-----------------------------------------------------------------
and several seconds later lots of:
-----------------------------------------------------------------
WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346540261378 and LAST SAFE 1198388306753093634 because it is from a unknown site id 2:2
WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346708033538 and LAST SAFE 1198388306753093634 because it is from a unknown site id 2:2
WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346875805698 and LAST SAFE 1198388306753093634 because it is from a unknown site id 2:2
...
Surprisingly, rejoining works without any problems when I try it the
other way round (stopping volt1 and rejoining it via volt2). I have
switched off the firewall etc. My deployment file:
<?xml version="1.0"?>
<deployment>
<cluster hostcount="2" sitesperhost="2" kfactor="1"/>
<paths>
<voltdbroot path="/tmp" />
</paths>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>
I can also see the kfactor warning even though it is set to 1:
WARN: Running without redundancy (k=0) is not recommended for production use.
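To rule out a typo in the file itself, here is a minimal sketch (Python, standard library only) that reads the attributes of the <cluster> element. The function name cluster_settings is only for illustration, and the script shows what the file declares, not what the server actually loaded:

```python
import xml.etree.ElementTree as ET

# Contents of deployment.xml as posted above.
DEPLOYMENT = """<?xml version="1.0"?>
<deployment>
<cluster hostcount="2" sitesperhost="2" kfactor="1"/>
<paths>
<voltdbroot path="/tmp" />
</paths>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>"""


def cluster_settings(deployment_xml: str) -> dict:
    """Return the attributes of the <cluster> element as integers."""
    root = ET.fromstring(deployment_xml)
    cluster = root.find("cluster")
    if cluster is None:
        raise ValueError("no <cluster> element in deployment file")
    # Every attribute on <cluster> in this file is a plain integer.
    return {name: int(value) for name, value in cluster.attrib.items()}


if __name__ == "__main__":
    for name, value in cluster_settings(DEPLOYMENT).items():
        print(f"{name} = {value}")
```

For the file above this reports kfactor = 1 and sitesperhost = 2, so the file itself declares what I intended.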
Appreciate any help on this.
Cheers,
Radek