When you say createConnection, are you talking about having separate client instances for each thread, or one client instance on which you call createConnection multiple times? The Java client library is basically a connection pool and defaults to a single network thread. If you end up doing heavy work in the procedure callbacks, that can hurt performance because the callback threads have finite (although very large) capacity. We recommend sharing client instances between threads.
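To illustrate the pattern (this is a self-contained sketch, not the real VoltDB client API; `callProcedureAsync`, the pool sizes, and the procedure name are all made up): many application threads share one client instance, submit calls asynchronously, and keep the callbacks cheap so the client's finite callback capacity stays free.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedClientSketch {
    // Stand-in for the shared client's network/callback threads: a small,
    // bounded pool, which is why callbacks must stay cheap.
    static final ExecutorService callbackPool = Executors.newFixedThreadPool(2);
    static final AtomicInteger completed = new AtomicInteger();

    // Illustrative stand-in for an async procedure invocation.
    static void callProcedureAsync(String proc, Runnable callback) {
        callbackPool.submit(() -> {
            // ... network round trip would happen here ...
            callback.run(); // keep this cheap; hand heavy work to another pool
        });
    }

    public static void main(String[] args) throws Exception {
        // Ten application threads all sharing the one client instance.
        ExecutorService appThreads = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 100; i++) {
            appThreads.submit(() -> callProcedureAsync("MyProc",
                    completed::incrementAndGet));
        }
        appThreads.shutdown();
        appThreads.awaitTermination(5, TimeUnit.SECONDS);
        callbackPool.shutdown();
        callbackPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(completed.get()); // 100
    }
}
```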
Normally the latency in Volt is such that I would not expect much throughput out of 10 threads unless you were generating a lot of work asynchronously. I think you will also find better performance with a lower number of sites per host.
The current version of Volt has somewhat strange performance curves at different load levels because the global transaction ordering mechanism actually requires load to propagate information, and without that load, latency is worse. If you aren't generating load asynchronously and have a low thread count, you will see poor performance.
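A common way to generate load asynchronously without letting requests pile up unboundedly is to cap the number of outstanding invocations with a semaphore. This is a hedged sketch of that pattern only; the single-thread executor stands in for the client's network thread, and none of the names are real VoltDB API.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncLoadSketch {
    public static void main(String[] args) throws Exception {
        final int maxOutstanding = 50;              // illustrative limit
        Semaphore permits = new Semaphore(maxOutstanding);
        ExecutorService network = Executors.newSingleThreadExecutor();
        AtomicInteger done = new AtomicInteger();

        for (int i = 0; i < 1000; i++) {
            permits.acquire();          // back-pressure: block once 50 are in flight
            network.submit(() -> {      // async invocation, no blocking per call
                done.incrementAndGet(); // "response" arrives
                permits.release();      // free a slot for the next request
            });
        }
        network.shutdown();
        network.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(done.get()); // 1000
    }
}
```

One submitting thread driven this way keeps the pipeline full in a way that dozens of synchronous threads cannot.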
The performance profile of Volt is going to change dramatically in the next few months in terms of latency and throughput. The original transaction initiation system was written and benchmarked against TPC-C which has large complex transactions where the cost of scheduling the transaction was small. You can expect to get good throughput with 10-40 threads instead of hundreds.
We have since found that for most workloads (such as the YCSB one you describe) the cost of scheduling and replicating a transaction is >50% of the cost of actually executing the transaction, and the serial portions of code that did transaction scheduling were a bottleneck.
We have been calling this IV2 and the improvement is in the 2-4x range. I also think that you will see true scale up with IV2 as opposed to scale up until you reach the serial bottleneck as in pre-IV2.
The fundamental change is that there is no longer a per-process transaction initiator holding a per-process lock. There is now an initiator per partition, so the lock around transaction initiation is split.
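Structurally this is lock striping. The sketch below shows the shape of the change only; the class, field names, and transaction-ID scheme are all illustrative, not Volt internals.

```java
import java.util.concurrent.locks.ReentrantLock;

public class PerPartitionInitiators {
    static final int PARTITIONS = 4;
    // One lock per partition instead of one process-wide lock, so
    // initiations on different partitions never contend with each other.
    static final ReentrantLock[] initiatorLocks = new ReentrantLock[PARTITIONS];
    static final long[] nextTxnId = new long[PARTITIONS];
    static {
        for (int i = 0; i < PARTITIONS; i++) initiatorLocks[i] = new ReentrantLock();
    }

    static long initiate(int partition) {
        ReentrantLock lock = initiatorLocks[partition];
        lock.lock();
        try {
            return nextTxnId[partition]++; // per-partition transaction ordering
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        for (int p = 0; p < PARTITIONS; p++)
            for (int i = 0; i < 3; i++) initiate(p);
        System.out.println(nextTxnId[0] + "," + nextTxnId[3]); // 3,3
    }
}
```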
It is also looking like the initiators should have dedicated threads instead of critical sections. This is where I think hyper-threading will prove useful, because it will allow the initiator, network, and command log threads to run concurrently with transaction execution.
With IV2 I think you will find that binding to a single socket will not improve performance, because IV2 will be able to take advantage of the extra capacity much better than pre-IV2 Volt.
The remaining bottleneck is turning out to be networking, especially in small clusters with replication, where there is a single socket between each pair of Volt processes. Volt is using NIO: a single thread reads/writes each socket, and the path for queueing a message to a socket has a critical section that is highly contended. Performance-wise you can't beat blocking IO, but that requires separate sockets for sending and receiving. Java BIO doesn't allow concurrent writes/reads on one socket, and the blocking (BIO-style) mode of NIO is very slow when you use separate blocking threads for send/receive.
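The contended path looks roughly like this single-writer sketch (illustrative only, no real sockets or Volt internals): many threads hand messages to the one writer thread that owns a socket, and that hand-off is the hot spot.

```java
import java.util.concurrent.*;

public class SingleWriterQueue {
    public static void main(String[] args) throws Exception {
        // Outbound queue for one socket; every sending thread contends here.
        BlockingQueue<String> outbound = new LinkedBlockingQueue<>();
        int messages = 200;
        CountDownLatch written = new CountDownLatch(messages);

        Thread writer = new Thread(() -> {   // the one writer for this socket
            try {
                while (true) {
                    outbound.take();         // a real socket.write(...) would go here
                    written.countDown();
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        writer.start();

        ExecutorService senders = Executors.newFixedThreadPool(8);
        for (int i = 0; i < messages; i++)
            senders.submit(() -> outbound.add("msg")); // contended enqueue
        senders.shutdown();

        boolean ok = written.await(5, TimeUnit.SECONDS);
        writer.interrupt();
        System.out.println(ok && outbound.isEmpty()); // true
    }
}
```

With eight senders funneling into one queue, the enqueue path serializes exactly the way the message-queueing critical section does, which is why blocking IO with dedicated send/receive sockets would be faster if Java allowed it cleanly.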
It's easy to add multiple sockets between cluster nodes, although that raises some interesting failure scenarios; using multiple sockets for client connections, however, requires a change to the wire protocol and the client libraries. Another issue is that some customers rely on NIO to support large numbers of concurrent connections, so we can't stop supporting that even though it is slower for use cases without many connections. It's a real engineering headache.
If you want to learn more about how to play with IV2, you can shoot me an email: email@example.com