Forum: Managing VoltDB

Post: Scale VoltDB on a NUMA machine

beilei sun
Jan 19, 2017
Hi,

I scaled VoltDB from a 2-socket machine (4 cores per CPU) to a 16-socket machine (18 cores per CPU), and measured it with a "one INSERT followed by ten SELECTs" benchmark.
The performance only scales from 60,000+ TPS up to 180,000+ TPS. I am quite disappointed by the result.
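(For reference, each benchmark transaction looks roughly like the following, sketched here as ad-hoc SQL through sqlcmd against a hypothetical kv table; the real benchmark issues it as a single stored procedure call:)

# Shape of one benchmark transaction against a hypothetical kv table;
# the real benchmark wraps this in one stored procedure call.
echo "INSERT INTO kv VALUES (42, 'payload');
SELECT v FROM kv WHERE k = 42;  -- this SELECT is issued ten times
" | sqlcmd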

On the 16-socket machine, I found that SpinPause() and ParallelTaskTerminator::offer_termination() in libjvm.so and _raw_spin_lock() in the kernel are the most frequently called functions (44.36%, 26.21%, and 8.32% respectively, obtained with "perf top"). None of these functions is hot on the 2-socket machine.
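(The percentages above came from something like the following; this assumes the server JVM's command line contains org.voltdb.VoltDB:)

# Live view of the hottest symbols in the running VoltDB JVM:
perf top -p $(pgrep -f org.voltdb.VoltDB | head -1)

# Or record ~30 seconds of samples and inspect offline:
perf record -F 99 -p $(pgrep -f org.voltdb.VoltDB | head -1) -- sleep 30
perf report --sort=symbol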

Since VoltDB is designed to be lock-free, I thought locks shouldn't be taken this frequently.
Is this caused by the NUMA architecture, which spreads partitions across different NUMA nodes?
Or is there some other reason I am missing?
How can I create multiple VoltDB processes and bind them onto different sockets?


I saw that Stonebraker wrote "So far, we have primarily worried about “scale out” onto multiple nodes in a cluster. However, our approach (dividing memory shared over K cores on a node into K non-shared “chunks” and assigning each to a specific core) will work fine on NUMA; it just won’t particularly leverage the NUMA architecture to advantage" at https://www.voltdb.com/blog/qa-from-march-21-webinar .
I can't find a way to do this in the available documentation. Please help...

Thank you very much
rmorgenstein
Jan 20, 2017
It's hard to say whether your performance is expected or not without knowing more about how you're testing. Do you have code we can look at?

If you want to try multiple sockets, here is the magic (I'm assuming you are at or near master - if not, let me know and I will amend with older-version instructions):

1) Make a myconfig.xml with the sitesperhost you want and any other settings.
2) Set up 4 separate voltdb roots (the init loop in the script below creates node0 through node3).
3) Start 4 different voltdb processes, all with different port numbers so that they don't collide.
4) numactl the processes or the threads, however you want (a numactl sketch follows the start script below).
5) Make sure that your client processes connect to all of the processes by listing all of them, for example: myhost:10200,myhost:10201, ...
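A quick way to check that all four processes answer on their client ports (using the $CLIENT$i port scheme from the script below; myhost is a placeholder) is something like:

# Smoke test: ask each process for its system overview via sqlcmd.
for p in 10200 10201 10202 10203; do
  echo "exec @SystemInformation OVERVIEW;" | sqlcmd --servers=myhost --port=$p
done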

Myconfig.xml just has sitesperhost - you may want to play with this in your multi-process setup.
<?xml version="1.0"?>
<deployment>
  <cluster sitesperhost="4" />
</deployment>


The following bash will init and start 4 nodes.


# Port bases: process i listens on ${BASE}i, e.g. client ports 10200-10203.
ADMIN=1010
CLIENT=1020
ZK=1030
INTERNAL=1040
HTTP=808
NODES=4

# Initialize a separate voltdb root directory (node0..node3), all
# sharing the same deployment file.
for i in $(seq 0 $(($NODES - 1)))
do
  voltdb init --force -D node$i -C myconfig.xml
done


# Start the 4 processes, each on its own ports; the -H flag points
# every process at node0's internal port (10400) as the cluster leader.
for i in $(seq 0 $(($NODES - 1)))
do
  voltdb start -D node$i --http=$HTTP$i --internal=$INTERNAL$i --admin=$ADMIN$i --zookeeper=$ZK$i --client=$CLIENT$i -H `hostname`:${INTERNAL}0 -c $NODES &
  sleep 5
done
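For step 4, here is a sketch of one binding policy - pinning process i's CPUs and memory to NUMA node i (check the node numbering on your box with numactl --hardware first):

# Same start loop, but each process is confined to one NUMA node.
for i in $(seq 0 $(($NODES - 1)))
do
  numactl --cpunodebind=$i --membind=$i \
    voltdb start -D node$i --http=$HTTP$i --internal=$INTERNAL$i --admin=$ADMIN$i --zookeeper=$ZK$i --client=$CLIENT$i -H `hostname`:${INTERNAL}0 -c $NODES &
  sleep 5
done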
beilei sun
Jan 24, 2017
Thanks very much.

It did work to bind VoltDB instances onto different NUMA nodes. I am still trying different binding policies, and will let you know if there is any exciting news.

Still, I have an open question. VoltDB cannot take the NUMA topology into consideration while partitioning automatically with its hash algorithm. I think this is the major reason the performance of VoltDB is throttled on a NUMA architecture. Is there any possibility of modifying the partitioning algorithm to exploit the NUMA characteristics, so that VoltDB can scale linearly on a NUMA machine? The JVM has an option for NUMA-awareness. Can we use this option while partitioning?
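(The option I mean is -XX:+UseNUMA. Assuming the voltdb launcher passes the VOLTDB_OPTS environment variable through to the server JVM - please check the bin/voltdb wrapper in your version - it could be tried like this, though I understand it only affects the Java heap, not the C++ execution engine where the tuples live:)

# Hypothetical: enable the HotSpot NUMA-aware allocator (effective with
# the Parallel collector). Whether VOLTDB_OPTS is honored depends on
# the bin/voltdb wrapper in your version.
export VOLTDB_OPTS="-XX:+UseParallelGC -XX:+UseNUMA"
# ...then start the processes exactly as in the loop above.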
rmorgenstein
Jan 30, 2017
NUMA-aware partitioning is not on our immediate roadmap. We've recently improved our partitioning between machines in an HA (i.e. k-safe) cluster to improve fault tolerance (see the V6.9 release notes point 1.4 https://docs.voltdb.com/v6docs/ReleaseNotes/) and are interested in further advances in partition layout. Are you interested in doing this work? This would be a great feature and we are always interested in contributions from the community.

Ruth
jhugg
Jan 30, 2017
NUMA may have something to do with it, but it's perhaps even more likely that the networking and transaction management layers of VoltDB haven't been optimized to scale to that many sockets and cores. One of the benefits of running more than one VoltDB process on a machine is that the locks and coordination from network connections and transaction overhead are spread across processes instead of contended within one.

We've actually implemented "rack-awareness" in VoltDB, which allows you to tell VoltDB that a group of VoltDB processes are likely to fail together, and the system will assign data such that replicas are not kept on the same machine. We've also made some of our clients able to discover VoltDB processes and ports. Both of these features make it easier to run 2 or more VoltDB processes on a single piece of hardware.

Going forward, we'd like to make it simpler to deploy, monitor and maintain multiple VoltDB processes on one OS, and expect this will allow us to scale to high socket count machines.

It would be nicer in some ways if we just used one process with much better NUMA awareness, but that's a lot more work for less additional benefit.

Still, as Ruth said, we are always interested in suggestions or help.
beilei sun
Feb 4, 2017
Thanks Ruth. I am not sure what you mean by "further advances in partition layout". Will the partitioning algorithm keep improving availability, or will it try to take other situations into consideration?
I am quite interested in working on a NUMA-aware partitioning algorithm, whether for performance or for availability.
beilei sun
Feb 4, 2017
Thanks.
From my observation, the lock and coordination problem still exists when running multiple VoltDB processes on one machine.
In fact, I think it is even more complicated when scaling on a NUMA machine, since the memory access latencies to different sockets may vary greatly.
