Forum: Other

Post: key_value benchmark network latency

key_value benchmark network latency
alexlzl
Feb 7, 2011
Hello,

I am running the key_value benchmark program that comes with the VoltDB 1.2.1.02 package. When running both client and server on the same box, the average latency is 1.06ms; however, when I put the client on a different machine, the average latency jumped to 60ms+. The two machines are on the same switch/subnet, and the ping response is 0.06ms. We also tried using the IP address directly and DNS caching (nscd) to rule out DNS as a factor, but neither helped.

Should I expect 60ms+ latency in such a simple benchmark environment? Frankly, I am very disappointed. Any other tips or tricks?

I am using the default "ant" to start the server and "ant client" to launch the ClientKV client. The ant version is 1.7.1, the Java version is 1.6.0_21, and the OS is CentOS 5.5 64-bit.

Thank you.
alex
test output
alexlzl
Feb 8, 2011
To eliminate possible issues with our OS or network, we also ran the Redis benchmark on the same two machines, and it was blazing fast. Even at a 12KB value size, the average latency was around 1.7ms over the network (we ran both the native Redis benchmark client and our own simple Java client).

[updated 2/8/11 12 PM PST]
* Just ran the same tests on our production boxes; again, the same latency over the network. Running client-threaded didn't help either.

* Also followed the release notes here (http://community.voltdb.com/docs/ReleaseNotes/index) since we are using CentOS 5.5 with kernel 2.6.18. Turning off TCP window scaling on both client and server made no difference. We also tried turning off TCP TSO and GRO; that actually made the latency even worse (250ms average).

I understand VoltDB is optimized for throughput and that I should benchmark against a cluster of machines. However, 60ms+ latency over the network on such a simple configuration just doesn't sound right. Is there any trick I missed?

Here is some test output from the VoltDB ClientKV test. I ran everything from the example/key_value directory with the defaults, except for changing the server address for the ant "client" target in build.xml:

localhost testing (client and server on same host, using "hostname" instead of "localhost" in build.xml)
[java] INFO 2011-02-07 17:21:47,943 [main] com.ClientKV: *************************************************************************
[java] INFO 2011-02-07 17:21:47,943 [main] com.ClientKV: Checking Results - Get/Put Benchmark
[java] INFO 2011-02-07 17:21:47,943 [main] com.ClientKV: *************************************************************************
[java] INFO 2011-02-07 17:21:47,943 [main] com.ClientKV: - System ran for 120.0130 seconds
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - SP Calls / GETS / PUTS = 4,404,634 / 3,335,644 / 1,068,990
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - SP calls per second = 36,701.31
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - GETS per second = 27,794.02
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - PUTS per second = 8,907.29
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - PUTS Uncompressed Bytes / Compressed Bytes / Compressed Size / Avg Value Size Bytes = 12,827,880,000 / 12,827,880,000 / 100.00% / 12,000.00
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - GETS Uncompressed Bytes / Compressed Bytes / Compressed Size / Avg Value Size Bytes = 40,027,728,000 / 40,027,728,000 / 100.00% / 12,000.00
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Average Latency = 1.04 ms
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 0ms - 25ms = 4,330,062
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 25ms - 50ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 50ms - 75ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 75ms - 100ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 100ms - 125ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 125ms - 150ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 150ms - 175ms = 0
[java] INFO 2011-02-07 17:21:47,944 [main] com.ClientKV: - Latency 175ms - 200ms = 0
[java] INFO 2011-02-07 17:21:47,945 [main] com.ClientKV: - Latency 200ms+ = 0

Remote host result:
[java] INFO 2011-02-07 16:03:41,772 [main] com.ClientKV: *************************************************************************
[java] INFO 2011-02-07 16:03:41,772 [main] com.ClientKV: Checking Results - Get/Put Benchmark
[java] INFO 2011-02-07 16:03:41,772 [main] com.ClientKV: *************************************************************************
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - System ran for 120.1600 seconds
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - SP Calls / GETS / PUTS = 740,644 / 561,308 / 179,336
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - SP calls per second = 6,163.81
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - GETS per second = 4,671.34
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - PUTS per second = 1,492.48
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - PUTS Uncompressed Bytes / Compressed Bytes / Compressed Size / Avg Value Size Bytes = 2,152,032,000 / 2,152,032,000 / 100.00% / 12,000.00
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - GETS Uncompressed Bytes / Compressed Bytes / Compressed Size / Avg Value Size Bytes = 6,735,696,000 / 6,735,696,000 / 100.00% / 12,000.00
[java] INFO 2011-02-07 16:03:41,773 [main] com.ClientKV: - Average Latency = 61.09 ms
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 0ms - 25ms = 0
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 25ms - 50ms = 211,220
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 50ms - 75ms = 479,156
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 75ms - 100ms = 1,800
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 100ms - 125ms = 155
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 125ms - 150ms = 0
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 150ms - 175ms = 0
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 175ms - 200ms = 0
[java] INFO 2011-02-07 16:03:41,774 [main] com.ClientKV: - Latency 200ms+ = 28,412
You just need a little tuning
sebc
Feb 8, 2011
To eliminate possible issues with our OS or network, we also ran the Redis benchmark on the same two machines, and it was blazing fast. Even at a 12KB value size, the average latency was around 1.7ms over the network (we ran both the native Redis benchmark client and our own simple Java client)...


Alex,

You're totally right: 60ms is horrendous - you just need a little tuning.

The sample's configuration is set to "fire-hose" the VoltDB server, attempting to push 999,999,999 SP Calls/s - clearly unrealistic.
In the case of this specific application, because of the heavy network usage, you hit several potential bottlenecks:
- Capacity of your CPU to manage the network traffic
- Actual bandwidth available

Running on localhost is biased (and arguably not realistic: you wouldn't run your web/app server on your database server).
The bias, very specific to this network-intensive benchmark, is clear from the first log: 40GB in GETs over a 2-minute period, when your Gigabit link would only give you 15GB. The traffic is going through the loopback interface!

In your second, more realistic run, the hardware limits show: while you are not maxing out your network bandwidth, you are apparently maxing out your network adapter's (or CPU's) ability to manage that traffic (visible in the log: 6.75GB in GETs, roughly half your 15GB link capacity).
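(For the back-of-envelope numbers: 1 Gbit/s is roughly 125 MB/s, and 125 MB/s × 120 s is about 15 GB. So the 40GB of GETs in the localhost run could never have crossed a real Gigabit link, while the remote run's 6.7GB of GETs is a bit under half of that ceiling.)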

So, how do you tune up the application so you don't get that 60ms latency?

Take a look at the second log. The reported actual SP Calls/s is at 6,164 (vs. 999,999,999 requested, first line of the log).

One rule of thumb is to run at around 80%-90% of capacity (try out different values), where "capacity" is the actual rate you observe after running the app in "fire-hose" mode.

So, here, in build.xml change the client call's first argument from 999999999 to 5100 - this will deal with your latency issue: you should see no more than 2ms with optimal settings.
Ultimately, while you might get 10% fewer transactions through in the 2-minute test (compared to the "fire-hosed" run), your engine will purr along with 1-2ms latency instead of crawling, forcing 200ms transactions through!
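If it helps to picture what that rate cap is doing, here is a minimal, generic sketch of fixed-rate pacing in Java. This is not the ClientKV source; RatePacer and submitGetOrPut() are made-up names purely for illustration:

public final class RatePacer {
    private final long nanosPerCall;
    private long nextSlot = System.nanoTime();

    public RatePacer(int targetCallsPerSecond) {
        this.nanosPerCall = 1000000000L / targetCallsPerSecond;
    }

    // Blocks just long enough to keep the caller at or below the target rate.
    public void acquire() throws InterruptedException {
        long now = System.nanoTime();
        if (nextSlot > now) {
            java.util.concurrent.TimeUnit.NANOSECONDS.sleep(nextSlot - now);
        }
        nextSlot = Math.max(nextSlot, now) + nanosPerCall;
    }
}

// Hypothetical driver loop:
//   RatePacer pacer = new RatePacer(5100);   // ~80-90% of the ~6,100 SP calls/s observed above
//   while (benchmarkRunning) {
//       pacer.acquire();
//       submitGetOrPut();                    // fire the async VoltDB procedure call
//   }

Keeping submissions paced like this prevents requests from piling up in the server's queues, which is essentially what the 5100 setting in build.xml achieves.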

In general - pretty much regardless of the application - you should expect the following latency:
- Single-node VoltDB: 1-2ms
- Multi-node VoltDB Cluster: 7-9ms

Have a go at it and let us know if that does the trick!
You might want to check out the Key-Value Benchmarking blog post for more on this sample and what you should expect from it.
Problem solved!
alexlzl
Feb 8, 2011
Alex,

You're totally right: 60ms is horrendous - you just need a little tuning.

The sample's configuration is set to "fire-hose" the VoltDB server, attempting to push 999,999,999 SP Calls/s - clearly unrealistic.
In the case of this specific application, because of the heavy netw...


Thank you so much for the detailed explanation (we did spend a whole day trying everything possible). After changing the client's first parameter from 999999999 to 5100, the average latency dropped to 7-10ms for a single-node cluster in this specific key-value test. Lowering it further yields even better latency (3500 -> 4ms, 1000 -> 2ms). We can see that the Gigabit link is fully saturated. This is great!

Thank you again. One note, though: you guys probably want to ship the code in a "ready for benchmarking" shape (not necessarily fully optimized, just without big surprises like this one). I am pretty sure most people will start off with those wonderful examples. :-)
Thanks for your suggestion :-)
sebc
Feb 10, 2011
Thank you so much for the detailed explanation (we did spend a whole day trying everything possible). After changing the client's first parameter from 999999999 to 5100, the average latency dropped to 7-10ms for a single-node cluster in this specific key-value test. Lowering it further yields even better latency (3500 -> 4ms, 1000 -> 2ms). We can see that the Gigabit link is fully saturated. This is great!...


Excellent, Alex. Glad it worked out.

You did have an excellent suggestion, so we just added a quick "auto-tuning" feature to the "Key Value" and "Voter" samples, both often used for benchmarking. You should now be able to run the sample as-is and see it self-optimize the transaction rate to reach a desired latency. Feel free to grab the updated samples with source code from our SVN repository: http://svnmirror.voltdb.com/eng/trunk/examples/ so you can see it in action and how we implemented this basic "throttling" functionality.
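If you are curious about the general idea before grabbing the source: a throttle like that can be as simple as nudging the rate cap up or down once per measurement window based on the latency observed in that window. A rough, hypothetical sketch in Java (not the actual sample code; the class and method names are made up):

public final class LatencyThrottle {
    private final double targetLatencyMs;
    private double ratePerSecond;  // current cap on SP calls/s

    public LatencyThrottle(double targetLatencyMs, double initialRatePerSecond) {
        this.targetLatencyMs = targetLatencyMs;
        this.ratePerSecond = initialRatePerSecond;
    }

    // Called once per measurement window (e.g. every second) with the observed
    // average latency; returns the rate cap to use for the next window.
    public double adjust(double observedLatencyMs) {
        if (observedLatencyMs > targetLatencyMs) {
            ratePerSecond *= 0.9;   // over the latency goal: back off
        } else {
            ratePerSecond *= 1.05;  // under the goal: probe for more throughput
        }
        return ratePerSecond;
    }
}

The back-off/probe loop converges near the highest rate that still meets the latency goal, which is the same sweet spot you found by hand-tuning the first argument.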