Forum: Managing VoltDB

Post: Java GC causing dead hosts

paulp
Mar 5, 2012
What's the best way to troubleshoot a long-running GC that seems to cause dead hosts? I just experienced a rather long minor GC that lasted 9.93 sec, and I suspect that's what caused the voltdb process on this node to quit. I've since rejoined the node with -Xmx1024m -XX:NewSize=768m -XX:+UseConcMarkSweepGC, and it seems to have decreased both the frequency and duration of GCs. Is there more that can be done?

Here's a snippet from my log file:

358669.103: [GC [PSYoungGen: 120128K->1312K(140032K)] 179409K->60825K(212032K), 0.0028090 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
358669.335: [GC [PSYoungGen: 140000K->960K(162880K)] 199513K->60737K(234880K), 0.0028260 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
358669.592: [GC [PSYoungGen: 160704K->832K(162880K)] 220481K->60713K(234880K), 0.0035880 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
358669.934: [GC [PSYoungGen: 160576K->1344K(189632K)] 220457K->61273K(261632K), 9.9296460 secs] [Times: user=57.77 sys=11.86, real=9.93 secs]
2012-03-05 12:34:10,960 [ZooKeeperServer] ERROR HOST [] - DEAD HOST DETECTED, hostname: vdb002
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - current time: 1330950850958
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - last message: 1330950840956
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - delta (millis): 10002
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - timeout value (millis): 10000
2012-03-05 12:34:10,960 [ZooKeeperServer] ERROR HOST [] - DEAD HOST DETECTED, hostname: vdb010
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - current time: 1330950850958
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - last message: 1330950840956
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - delta (millis): 10002
2012-03-05 12:34:10,960 [ZooKeeperServer] INFO HOST [] - timeout value (millis): 10000
2012-03-05 12:34:10,960 [Fault Distributor] ERROR HOST [] - Host failed, host id: 1 hostname: vdb002
2012-03-05 12:34:10,960 [Fault Distributor] ERROR HOST [] - Removing sites from cluster: [100, 101, 102, 103, 104, 105, 106, 107, 108]
2012-03-05 12:34:10,966 [ZooKeeperServer] WARN org.voltdb.messaging.impl.HostMessenger [] - Attempted delivery of message to failed site: 100
2012-03-05 12:34:11,011 [ZooKeeperServer] ERROR HOST [] - DEAD HOST DETECTED, hostname: vdb010
2012-03-05 12:34:11,011 [ZooKeeperServer] INFO HOST [] - current time: 1330950851009
2012-03-05 12:34:11,011 [ZooKeeperServer] INFO HOST [] - last message: 1330950840956
2012-03-05 12:34:11,011 [ZooKeeperServer] INFO HOST [] - delta (millis): 10053
2012-03-05 12:34:11,011 [ZooKeeperServer] INFO HOST [] - timeout value (millis): 10000
aweisberg
Mar 6, 2012
Hi Paul,

A 9 second GC time does not jibe with the type of GC and the amount of young gen data retained during the GC. That 9 second GC was a young gen collection, so it used the copying collector, and only 1344K was retained. With young gen GC you only pay a penalty for the amount of live data that has to be copied, and a megabyte is not a lot, so it should be quick. On faster systems that kind of GC will be sub-millisecond.
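
To spot this pattern in a longer log, something like the following Python sketch could help. It assumes every line of interest has exactly the PSYoungGen format shown in the snippet above; the function name `suspicious_pauses`, the `SAMPLE_LOG` constant, and the 1 second threshold are made up for illustration.

```python
import re

# Matches young gen lines in the format pasted in this thread (an assumption;
# other collectors and JVM flags produce different layouts).
GC_LINE = re.compile(
    r"(?P<ts>\d+\.\d+): \[GC \[PSYoungGen: (?P<before>\d+)K->(?P<after>\d+)K\(\d+K\)\] "
    r"\d+K->\d+K\(\d+K\), (?P<pause>[\d.]+) secs\] "
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

# Two lines copied from the log snippet in the original post.
SAMPLE_LOG = """\
358669.592: [GC [PSYoungGen: 160704K->832K(162880K)] 220481K->60713K(234880K), 0.0035880 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
358669.934: [GC [PSYoungGen: 160576K->1344K(189632K)] 220457K->61273K(261632K), 9.9296460 secs] [Times: user=57.77 sys=11.86, real=9.93 secs]
"""


def suspicious_pauses(lines, max_pause_secs=1.0):
    """Return young gen collections whose pause exceeds max_pause_secs.

    Each hit records how much data survived the collection, so a long pause
    with only a few KB of survivors stands out.
    """
    hits = []
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # not a young gen line in the expected format
        pause = float(match.group("pause"))
        if pause > max_pause_secs:
            hits.append({
                "timestamp": float(match.group("ts")),
                "survivors_kb": int(match.group("after")),
                "pause_secs": pause,
                "sys_secs": float(match.group("sys")),
            })
    return hits


if __name__ == "__main__":
    for hit in suspicious_pauses(SAMPLE_LOG.splitlines()):
        print(hit)
```

Run against the two sample lines, this flags only the second collection. A long pause with so few survivors, together with a large sys time, often points at something outside the collector itself, such as the kernel paging memory back in.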

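I think looking at the VoltDB process's RSS (resident set size) and swap usage is a good idea. If part of the heap has been swapped out, the collector has to wait for it to be paged back in, which could explain a pause this long. Monitoring with Nagios and maintaining a history is helpful. Some flavors of Linux will swap out application memory in favor of cached filesystem data even when there is free memory available.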
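
A minimal sketch of one way to sample those two numbers, assuming a Linux host where `/proc/<pid>/status` exposes the `VmRSS` and `VmSwap` fields (`VmSwap` is only present on reasonably recent kernels). The function names `parse_status_kb` and `read_process_memory` are hypothetical, not part of VoltDB or Nagios.

```python
def parse_status_kb(status_text, fields=("VmRSS", "VmSwap")):
    """Extract the requested fields (reported in kB) from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields and rest.strip():
            # Lines look like "VmRSS:     812345 kB"; keep the number only.
            values[name] = int(rest.split()[0])
    return values


def read_process_memory(pid):
    """Read RSS and swap usage for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_status_kb(status_file.read())


if __name__ == "__main__":
    import os

    # Samples this script's own process; pass the VoltDB JVM's pid instead.
    print(read_process_memory(os.getpid()))
```

Logging this output at a regular interval, or wrapping it in a monitoring check, gives the history needed to see whether `VmSwap` was nonzero around the time of a long pause.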

-Ariel