Forum: Other

Post: Numactl --performance improvement

Aug 16, 2012
I am doing some VoltDB benchmarks.

There are two factors I found, as below:

1. I pinned the VoltDB instance to one CPU with numactl --cpunodebind=0, and got almost twice the READ performance (select * from table) compared to before.

Has anyone tried that before?

2. I disabled hyper-threading on an 8-core Intel E5 and got some performance improvement.
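
For reference, the pinning in #1 looks like this on the command line (a sketch; "voltdb start" stands in for however the server is actually launched, and --membind is an optional extra that also forces memory allocations onto the local node):

```shell
# Run the server with all threads on NUMA node 0 and all memory local to it.
numactl --cpunodebind=0 --membind=0 voltdb start

# Show the policy numactl will apply, without running the server:
numactl --cpunodebind=0 --membind=0 numactl --show
```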
Aug 17, 2012
Hi Bill,

Re #1

It's safe to assume you are doing select * from a very small table?

If you pin all the threads to a socket they can communicate via local cache and all memory allocations are local to that NUMA node. In the current release that is a big help because transaction initiation contains some global state that is highly contended. In the upcoming version there is no more global or shared state around transaction initiation. The dominant cost for most transactions is networking, the actual execution is trivial.

I have had similar findings. It is much faster to run two Volt processes each bound to a socket than to run one process. We completely rewrote transaction initiation and networking to eliminate global locks and shared state, and it was a 2-4x improvement, but still not enough when you have 10 GigE. There is plenty of compute time available, profiling doesn't show contention, yet everything is mysteriously idle waiting on network messages.

My next step is to add node-aware replication so that you can run multiple processes per node with replication and then use numactl to bind a Volt process to each socket. It's a lot of work to make it self-configuring across a variety of hardware and software (numactl isn't installed everywhere), and Java is no help when it comes to thread/memory placement.

Re #2
We haven't played much with disabling hyper-threading, but this doesn't surprise me for most workloads where transaction sizes are small. For TPC-C, which has very complex transactions and a low number of total transactions, we found that hyper-threading was slightly faster in terms of throughput. I think there is still a place for hyper-threading in Volt, but there is more work to do to accurately pair the right threads together on the same core.

Results are heavily affected by the choice of sites per host and may change. If you twiddle that number you may find that performance with and without hyper-threading is equivalent.

Aug 19, 2012
Hi Ariel,
Thank you very much for your response. I can provide some details; maybe they have some reference value.

We are doing some benchmarks and comparisons of several databases on the Intel x86 platform, such as HBase, MongoDB, memcached, and Coherence. For comparison we use a unified workload, YCSB; of course we will do some TPC-C in the future.
For the VoltDB benchmark, the test schema is one table with 11 columns: 1 primary key plus 10 data columns. In our test each data column contains 100 bytes, so the 10 columns hold 1 KB of data.

The read transaction is like SELECT * FROM table WHERE primary_key = ?
The update transaction is like UPDATE table SET data_column1 = ? WHERE primary_key = ?
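
For concreteness, a schema and statements matching the description above could look like this (table and column names here are made up for illustration; in VoltDB the table would be partitioned on the primary key column):

```sql
-- Hypothetical YCSB-style table: 1 primary key + 10 data columns of ~100 bytes each.
CREATE TABLE usertable (
  pkey   VARCHAR(100) NOT NULL,
  field0 VARCHAR(100), field1 VARCHAR(100), field2 VARCHAR(100),
  field3 VARCHAR(100), field4 VARCHAR(100), field5 VARCHAR(100),
  field6 VARCHAR(100), field7 VARCHAR(100), field8 VARCHAR(100),
  field9 VARCHAR(100),
  PRIMARY KEY (pkey)
);

-- Read transaction
SELECT * FROM usertable WHERE pkey = ?;

-- Update transaction
UPDATE usertable SET field0 = ? WHERE pkey = ?;
```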

I got some results, as below.

For the numactl test:

Nodes  Sites/Node  numactl setting   Read    Update  Load
2      8           cpunodebind=0     182943  261178  115278
2      16          cpunodebind=all   132001  184207  93935
4      8           cpunodebind=0     314579  725554  204192
4      16          cpunodebind=all   320448  564935  194286
In our tests NUMA binding provided an improvement for both read and update, so all our next tests will use NUMA.

I have another question. In our workload each thread does createConnection() to the cluster, and we've tried different thread counts. I found this:
For reads: one thread gets good performance; more than 10 decreases TPS.
For updates: more threads perform better; I use twice the number of total sites.
Have you tried this?

Looking forward to your new version
Thank you again!

Aug 22, 2012
Hi Bill,

When you say createConnection, are you talking about having separate client instances for each thread, or one client instance where you call createConnection multiple times? The Java client library is basically a connection pool and defaults to a single network thread. If you end up doing work in the procedure callbacks, that can hurt performance because the callback thread has finite (although very large) capacity. We recommend sharing client instances between threads.
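
To make the sharing pattern concrete, here is a minimal stand-in sketch using plain java.util.concurrent (no real VoltDB classes; the Client class and callAsync method are invented for illustration): many worker threads share one client, the client owns a single network thread, and callbacks run on that network thread, so they must stay cheap or they stall every other callback.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a pooled client: one shared instance, one network thread,
// callbacks invoked on that network thread.
public class SharedClientSketch {
    static class Client {
        // Single network thread, mirroring the client library's default.
        private final ExecutorService networkThread = Executors.newSingleThreadExecutor();

        // Async procedure call: the callback runs on the network thread.
        void callAsync(String proc, Runnable callback) {
            networkThread.submit(callback);
        }

        void close() throws InterruptedException {
            networkThread.shutdown();
            networkThread.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    // Many worker threads sharing ONE client instance.
    static int runWorkload(int workers, int callsPerWorker) throws Exception {
        Client shared = new Client();
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                for (int i = 0; i < callsPerWorker; i++) {
                    // Cheap callback: just count the completion.
                    shared.callAsync("Read", completed::incrementAndGet);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        shared.close(); // drains the remaining queued callbacks
        return completed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed=" + runWorkload(8, 1000));
    }
}
```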

Normally the latency in Volt is such that I wouldn't expect much performance out of 10 threads unless you were generating a lot of work asynchronously. I think you will also find better performance with a lower number of sites per host.

The current version of Volt has somewhat strange performance curves at different load levels because the global transaction ordering mechanism actually requires load to propagate information; without that load, latency is worse. If you aren't generating load asynchronously and have a low thread count you will see poor performance.

The performance profile of Volt is going to change dramatically in the next few months in terms of latency and throughput. The original transaction initiation system was written and benchmarked against TPC-C which has large complex transactions where the cost of scheduling the transaction was small. You can expect to get good throughput with 10-40 threads instead of hundreds.

We have since found that for most workloads (such as the YCSB one you describe) the cost of scheduling and replicating a transaction is >50% of the cost of actually executing it, and the serial portions of code that did transaction scheduling were a bottleneck.

We have been calling this IV2 and the improvement is in the 2-4x range. I also think that you will see true scale up with IV2 as opposed to scale up until you reach the serial bottleneck as in pre-IV2.

The fundamental change is that there is no longer a per-process transaction initiator guarded by a per-process lock. There is now an initiator per partition, so the lock around transaction initiation is split.
It is also looking like the initiators should have dedicated threads instead of critical sections. This is where I think hyper-threading will prove useful, because it will allow the initiator, network, and command log threads to run concurrently with transaction execution.
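
As a rough illustration of what splitting the initiation lock buys (my sketch, not VoltDB code; the class and method names are invented): give each partition its own single-threaded initiator, so there is no global lock, while transactions within a partition still run in submission order.

```java
import java.util.*;
import java.util.concurrent.*;

// Per-partition initiators instead of one globally locked initiator.
public class PartitionedInitiators {
    private final ExecutorService[] initiators;

    public PartitionedInitiators(int partitions) {
        initiators = new ExecutorService[partitions];
        for (int p = 0; p < partitions; p++) {
            // One dedicated thread per partition: no shared lock, and
            // transactions for a partition still execute in order.
            initiators[p] = Executors.newSingleThreadExecutor();
        }
    }

    public void initiate(int partitionKey, Runnable txn) {
        initiators[Math.floorMod(partitionKey, initiators.length)].submit(txn);
    }

    public void shutdown() throws InterruptedException {
        for (ExecutorService e : initiators) e.shutdown();
        for (ExecutorService e : initiators) e.awaitTermination(10, TimeUnit.SECONDS);
    }

    // Demo: per-partition ordering is preserved without any global lock.
    public static List<Integer> runDemo(int partitions, int txnsPerPartition) throws Exception {
        PartitionedInitiators init = new PartitionedInitiators(partitions);
        List<Integer> partition0Order = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < txnsPerPartition; i++) {
            final int seq = i;
            for (int p = 0; p < partitions; p++) {
                final int part = p;
                init.initiate(part, () -> { if (part == 0) partition0Order.add(seq); });
            }
        }
        init.shutdown();
        return partition0Order;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo(4, 5)); // prints [0, 1, 2, 3, 4]
    }
}
```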

With IV2 I think you will find that binding to a single socket will not improve performance, because IV2 will be able to take advantage of the extra capacity much better than pre-IV2 Volt.

The remaining bottleneck is turning out to be networking, especially in small clusters with replication where there is a single socket between each pair of Volt processes. Volt is using NIO; a single thread is used to read/write each socket, and the path for queueing a message to a socket has a critical section that is highly contended. Performance-wise you can't beat blocking IO, but that requires separate sockets for sending and receiving. Java BIO doesn't allow concurrent writes/reads, and blocking-mode NIO is very slow when you use separate blocking threads for send/receive.

It's easy to add multiple sockets between cluster nodes, although that raises some interesting failure scenarios, but using multiple sockets for client connections requires a change to the wire protocol and client libraries. Another issue is that some customers rely on NIO to support large numbers of concurrent connections, so we can't stop supporting that even though it is slower for use cases without many connections. It's a real engineering headache.

If you want to learn more about how to play with IV2 you can shoot me an email,