Forum: Other

Post: "Voltport died", seeing 'java.nio.channels.CancelledKeyException' and 'java.nio.BufferOverflowException'

"Voltport died", seeing 'java.nio.channels.CancelledKeyException' and 'java.nio.BufferOverflowException'
davidr
Jul 18, 2013
Folks,

I'm seeing 2 seperate errors when I try and run more than about 40K TPS on a 3 node cluster of 16 core IBM boxes. Each box is:

Linux f3sn05 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux


The first thing we see is a sudden application latency spike we we try to drain a connection, followed by nio.BufferOverflowException:

3079 [Volt Network - 0] INFO StatsUnloader11 - StatsUnloader:11: Opening stats file /home/voltdb/voltRuntime/SWBQGenerated/stats/volt_11_20130718_0453.stat
3483 [VoltSpace11] WARN Partition11 - Call to drain() took 3331ms
[voltdb@f3sn03 voltLocal]$ java.nio.BufferOverflowException
at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:182)
at org.voltdb.VoltTable.flattenToBuffer(VoltTable.java:1582)
at org.voltdb.ClientResponseImpl.flattenToBuffer(ClientResponseImpl.java:279)
at org.voltdb.messaging.InitiateResponseMessage.flattenToBuffer(InitiateResponseMessage.java:210)
at org.voltcore.messaging.ForeignHost$1.serialize(ForeignHost.java:202)
at org.voltcore.network.NIOWriteStream.swapAndSerializeQueuedWrites(NIOWriteStream.java:356)
at org.voltcore.network.VoltPort.drainWriteStream(VoltPort.java:233)
at org.voltcore.network.VoltPort.run(VoltPort.java:194)
at org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:385)
at org.voltcore.network.VoltNetwork.invokeCallbacks(VoltNetwork.java:413)
at org.voltcore.network.VoltNetwork.run(VoltNetwork.java:292)
at java.lang.Thread.run(Thread.java:662)
ERROR: VoltPort died due to an unexpected exception
java.nio.BufferOverflowException
at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:182)
at org.voltdb.VoltTable.flattenToBuffer(VoltTable.java:1582)
at org.voltdb.ClientResponseImpl.flattenToBuffer(ClientResponseImpl.java:279)
at org.voltdb.messaging.InitiateResponseMessage.flattenToBuffer(InitiateResponseMessage.java:210)
at org.voltcore.messaging.ForeignHost$1.serialize(ForeignHost.java:202)
at org.voltcore.network.NIOWriteStream.swapAndSerializeQueuedWrites(NIOWriteStream.java:356)
at org.voltcore.network.VoltPort.drainWriteStream(VoltPort.java:233)
at org.voltcore.network.VoltPort.run(VoltPort.java:194)
at org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:385)
at org.voltcore.network.VoltNetwork.invokeCallbacks(VoltNetwork.java:413)
at org.voltcore.network.VoltNetwork.run(VoltNetwork.java:292)
at java.lang.Thread.run(Thread.java:662)


Things then get a lot worse:

WARN: Attempted delivery of message to failed site: 1:-1
WARN: Attempted delivery of message to failed site: 1:-1
WARN: Attempted delivery of message to failed site: 1:-1
WARN: Attempted delivery of message to failed site: 1:-1
WARN: Received remote hangup from foreign host 129.40.80.12
WARN: Attempted delivery of message to failed site: 1:8
WARN: Host 1 failed
java.nio.BufferOverflowException
at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:182)
at org.voltdb.VoltTable.flattenToBuffer(VoltTable.java:1582)
at org.voltdb.ClientResponseImpl.flattenToBuffer(ClientResponseImpl.java:279)
at org.voltdb.messaging.InitiateResponseMessage.flattenToBuffer(InitiateResponseMessage.java:210)
at org.voltcore.messaging.ForeignHost$1.serialize(ForeignHost.java:202)
at org.voltcore.network.NIOWriteStream.swapAndSerializeQueuedWrites(NIOWriteStream.java:356)
at org.voltcore.network.VoltPort.drainWriteStream(VoltPort.java:233)
at org.voltcore.network.VoltPort.run(VoltPort.java:194)
at org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:385)
at org.voltcore.network.VoltNetwork.invokeCallbacks(VoltNetwork.java:413)
at org.voltcore.network.VoltNetwork.run(VoltNetwork.java:292)
at java.lang.Thread.run(Thread.java:662)


The second error we see is different but also related to networking. It may or may not happen at the same time as the first one:

WARN: Attempted delivery of message to failed site: 2:-1
WARN: Attempted delivery of message to failed site: 0:7
WARN: Attempted delivery of message to failed site: 2:7
WARN: Attempted delivery of message to failed site: 0:4
WARN: Attempted delivery of message to failed site: 2:4
WARN: Attempted delivery of message to failed site: 0:7
WARN: Attempted delivery of message to failed site: 2:7
WARN: Attempted delivery of message to failed site: 0:10
WARN: Attempted delivery of message to failed site: 2:10
WARN: Attempted delivery of message to failed site: 0:7
WARN: Attempted delivery of message to failed site: 2:7
WARN: Attempted delivery of message to failed site: 0:10
WARN: Attempted delivery of message to failed site: 2:10
WARN: Attempted delivery of message to failed site: 0:10
WARN: Attempted delivery of message to failed site: 2:10
WARN: Attempted delivery of message to failed site: 0:-1
WARN: Attempted delivery of message to failed site: 2:-1
WARN: Received remote hangup from foreign host 129.40.80.13
WARN: Received remote hangup from foreign host 129.40.80.15
WARN: Host 2 failed
WARN: Host 0 failed
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:69)
at org.voltcore.network.VoltPort.lockForHandlingWork(VoltPort.java:164)
at org.voltcore.network.VoltNetwork.callPort(VoltNetwork.java:383)
at org.voltcore.network.VoltNetwork.access$200(VoltNetwork.java:85)
at org.voltcore.network.VoltNetwork$3.run(VoltNetwork.java:252)
at org.voltcore.network.VoltNetwork.run(VoltNetwork.java:301)
at java.lang.Thread.run(Thread.java:662)
FATAL: Partition detection snapshot completed. Shutting down.
VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.


Now, my suspicion is that my network is a bit iffy, but what should I do to prove that?

Note that AFAIK we're not doing anything unusual at the VoltDB level other than more work. Memory is fine and CPU is around 12% when we see this.

David Rolfe
davidr
Jul 18, 2013
We've established that the chain of events was not provoked by our use of drain(). What's really odd is that the drain() in question was called against a client that wasn't doing any work, but still took ages to complete...
rbetts
Jul 18, 2013
David,

Can you send the full log file to support@voltdb.com? It isn't quite obvious to me what is happening here and which exceptions appeared where / when. The last error is reporting that the cluster is network partitioned, that it made a snapshot to record its last consistent state and then terminated because of the partition.

We have a network mesh tool that measures the health of a network. That is our usual way of helping to discover wonkiness.

If you could follow up to support@voltdb.com with the log files, I will dig in to this further and will provide you the network mesh testing tool.

Thanks,
Ryan.
davidr
Jul 19, 2013
Ryan,

Will send data via email today.

DR