Forum: Managing VoltDB

Post: Problems rejoining - dropped Heartbeat Messages

Problems rejoining - dropped Heartbeat Messages
radek1st
Jul 11, 2012
Hi,

I'm trying VoltDB CE in a two-node cluster. When I stop the VoltDB
process on one of the nodes (after both have initialised successfully)
and try to rejoin, the rejoin fails and complains about dropped heartbeat
messages:

-----------------------------------------------------------------

HOST2:

-----------------------------------------------------------------

[voltdb@volt2 VoltCollect]$ voltdb catalog collect.jar deployment deployment.xml leader volt1

Initializing VoltDB...

 _    __      ____  ____  ____
| |  / /___  / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __  |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/

--------------------------------

Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition

Connecting to the VoltDB cluster leader volt1/10.241.57.139:3021

1 Notified of host 0

Initializing initiator ID: 1, SiteID: 1:2

WARN: Running without redundancy (k=0) is not recommended for production use.

Server completed initialization.

-----------------------------------------------------------------

It started fine. I killed it and tried rejoining...

-----------------------------------------------------------------

[voltdb@volt2 VoltCollect]$ voltdb rejoinhost volt1 deployment deployment.xml

Initializing VoltDB...

 _    __      ____  ____  ____
| |  / /___  / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __  |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/

--------------------------------

Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition

Connecting to the VoltDB cluster leader volt1/10.241.57.139:3021

2 Notified of host 0

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@781205b8

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@6735b09d

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@75de485a

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@5460492a

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@7d638fac

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@4afc818c

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@79d34ca

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@61f4bdad

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@ad0db19

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@15e04bdb

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@38942215

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@549adb8

WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@282c0dbe

Initializing initiator ID: 2, SiteID: 2:2

WARN: Running without redundancy (k=0) is not recommended for production use.

Server completed initialization.

FATAL: Timed out waiting for connection from source partition

VoltDB has encountered an unrecoverable error and is exiting.

The log may contain additional information.

-----------------------------------------------------------------

HOST1:

-----------------------------------------------------------------

[voltdb@volt1 VoltCollect]$ voltdb catalog collect.jar deployment deployment.xml leader volt1

Initializing VoltDB...

 _    __      ____  ____  ____
| |  / /___  / / /_/ __ \/ __ )
| | / / __ \/ / __/ / / / __  |
| |/ / /_/ / / /_/ /_/ / /_/ /
|___/\____/_/\__/_____/_____/

--------------------------------

Build: 2.7 voltdb-2.7-0-g7722ff9 Community Edition

Connecting to VoltDB cluster as the leader...

Initializing initiator ID: 0, SiteID: 0:2

WARN: Running without redundancy (k=0) is not recommended for production use.

Server completed initialization.

-----------------------------------------------------------------

When the other host is disconnected:

-----------------------------------------------------------------

ERROR: Sites failed, site ids: 1:1, 1:0, 1:3, 1:2

Failure delta is 1:1, 1:0, 1:3, 1:2 with failures 1:1, 1:0, 1:3, 1:2

Failure delta is 1:1, 1:0, 1:3, 1:2 with failures 1:1, 1:0, 1:3, 1:2

-----------------------------------------------------------------

On the rejoin attempt:

-----------------------------------------------------------------

ERROR: Sites failed, site ids: 2:2, 2:3, 2:0, 2:1

-----------------------------------------------------------------

and several seconds later lots of:

-----------------------------------------------------------------

WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN
1198388346540261378 and LAST SAFE 1198388306753093634 because it is from
a unknown site id 2:2

WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346708033538
and LAST SAFE 1198388306753093634 because it is from a unknown site id
2:2

WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346875805698
and LAST SAFE 1198388306753093634 because it is from a unknown site id
2:2

...

Surprisingly, rejoining works without any problems when I try it the
other way round. I have switched off the firewall etc. My deployment file:


<?xml version="1.0"?>
<deployment>
<cluster hostcount="2" sitesperhost="2" kfactor="1"/>
<paths>
<voltdbroot path="/tmp" />
</paths>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>

I can also see the kfactor warning even though it is set to 1:


WARN: Running without redundancy (k=0) is not recommended for production use.

Appreciate any help on this.

Cheers,

Radek
The dropped messages warning can be ignored
rmorgenstein
Jul 14, 2012
Radek,

The dropped-message warnings indicate a temporary condition when a new
node has started to rejoin but is not yet fully participating in the
cluster. As you mentioned, they make no difference and the rejoin
succeeds. In the upcoming July release, they have been reclassified as
INFO-level messages.
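
To check that warnings like these all come from the rejoining node, one option is to tally them by site id. What follows is only a minimal sketch in Python, not a VoltDB tool: the two patterns are taken from the WARN lines quoted in this thread, the function name count_dropped_heartbeats is made up for illustration, and the sample lines are copied from the post (with the wrapped HEARTBEAT line joined back onto one line).

```python
import re
from collections import Counter

# Patterns for the two WARN formats quoted in this thread.
NOOP_RE = re.compile(r"No-op mailbox\((\d+:\d+)\) dropped message")
UNKNOWN_RE = re.compile(r"Dropping message HEARTBEAT \(FROM (\d+:\d+)\)")

def count_dropped_heartbeats(log_text):
    """Return a Counter mapping site id -> number of dropped-message warnings."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NOOP_RE.search(line) or UNKNOWN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the logs in this thread.
sample = """\
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@781205b8
WARN: No-op mailbox(2:2) dropped message org.voltdb.messaging.CoalescedHeartbeatMessage@6735b09d
WARN: Dropping message HEARTBEAT (FROM 2:2) FOR TXN 1198388346540261378 and LAST SAFE 1198388306753093634 because it is from a unknown site id 2:2
Server completed initialization.
"""

print(count_dropped_heartbeats(sample))
```

If every count belongs to a site on the rejoining host (2:2 here), that is consistent with the warnings being a side effect of the rejoin rather than a problem on the surviving node.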

The k-factor warning was firing incorrectly. This was fixed in V2.7.1.
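
For anyone who wants to confirm what the deployment file itself requests, independent of the warning, here is a minimal Python sketch that reads the cluster settings from the XML posted above. The helper names cluster_settings and unique_partitions are made up for illustration, the default of k=0 for a missing kfactor attribute is an assumption, and the partition count uses the usual sizing rule of (hosts x sites per host) / (k + 1).

```python
import xml.etree.ElementTree as ET

# Deployment file as posted in this thread.
DEPLOYMENT_XML = """<?xml version="1.0"?>
<deployment>
<cluster hostcount="2" sitesperhost="2" kfactor="1"/>
<paths>
<voltdbroot path="/tmp" />
</paths>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>"""

def cluster_settings(xml_text):
    """Read hostcount, sitesperhost and kfactor from the <cluster> element."""
    cluster = ET.fromstring(xml_text).find("cluster")
    return {
        "hostcount": int(cluster.get("hostcount")),
        "sitesperhost": int(cluster.get("sitesperhost")),
        # Assumption: a missing kfactor attribute is treated as k=0.
        "kfactor": int(cluster.get("kfactor", "0")),
    }

def unique_partitions(settings):
    """Sizing rule: (hosts x sites per host) / (k + 1)."""
    total_sites = settings["hostcount"] * settings["sitesperhost"]
    return total_sites // (settings["kfactor"] + 1)

settings = cluster_settings(DEPLOYMENT_XML)
print(settings)
print(unique_partitions(settings))
```

With the file from this thread the parsed kfactor is 1, so the k=0 warning does not reflect what the deployment file asks for.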
Send me your logfiles
rmorgenstein
Jul 15, 2012
I misread your post. Send me the command you used to rejoin
and logfiles from both nodes and I'll take a look. rmorgenstein at
voltdb.com.

Ruth
Hi Ruth
radek1st
Jul 18, 2012
We've already changed the setup, so I can't get the logs to you at the moment. When we test the cluster setup again, I'll get back to you. Thanks.

Radek