We recommend setting sitesperhost to between 1/2 and 3/4 of the total number of cores (counting each core twice on CPUs with hyperthreading), but it's always good to experiment to see where you get the best results. In many cases sitesperhost=3 is 3x faster than sitesperhost=1.
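As a rough illustration of that rule of thumb, here is a small Python sketch that turns a core count into a starting range. The function name and the choice to round down are my own assumptions, not part of any database tooling; treat the result as a place to start experimenting, not a final setting.

```python
def sitesperhost_range(physical_cores, hyperthreading=False):
    """Return (low, high) starting values for sitesperhost.

    Applies the 1/2 to 3/4 rule of thumb to the core count, counting each
    core twice when hyperthreading is enabled. Hypothetical helper for
    illustration only; rounding down is an assumption.
    """
    cores = physical_cores * 2 if hyperthreading else physical_cores
    low = max(1, cores // 2)            # half the counted cores, rounded down
    high = max(low, (cores * 3) // 4)   # three quarters, rounded down
    return low, high


print(sitesperhost_range(8, hyperthreading=True))  # 16 counted cores
print(sitesperhost_range(4))                       # 4 counted cores
```

For example, a host with 8 physical cores and hyperthreading counts as 16 cores, so you would try values from 8 to 12 and benchmark each.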
The client side can often be the bottleneck too, especially if you use synchronous calls. They are blocking, so the calling thread cannot send another request until the response to the previous one has been received, which puts the entire round-trip time in the critical path. You can add more threads to parallelize the work, but you may need hundreds of threads to generate requests as fast as the database can process them. Another option is a single-threaded client that uses asynchronous calls. Because these calls do not block, a single thread can send many requests per second, and the responses are received on another thread. This can generally generate requests faster than the database can process them, until you get to high levels of throughput (>200K/sec), where you may need more threads or more client instances.
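The difference between the two calling styles can be sketched with a small, self-contained Python simulation. This does not use the real database client: fake_call, the 10 ms round trip, and the thread pool that stands in for the network and database are all assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP_SECONDS = 0.01  # assumed round-trip time, for illustration only


def fake_call(request_id):
    """Stand-in for a database call: waits one round trip, then echoes the id."""
    time.sleep(ROUND_TRIP_SECONDS)
    return request_id


def run_synchronous(num_requests):
    """One thread, blocking calls: each request waits for the previous response."""
    start = time.perf_counter()
    responses = [fake_call(i) for i in range(num_requests)]
    return len(responses), time.perf_counter() - start


def run_asynchronous(num_requests):
    """One sending thread, non-blocking calls: responses arrive via callbacks."""
    responses = []
    lock = threading.Lock()

    def on_response(future):
        # Normally runs on a pool thread rather than on the sending thread.
        with lock:
            responses.append(future.result())

    start = time.perf_counter()
    # The pool stands in for the network and database working in the background.
    with ThreadPoolExecutor(max_workers=max(1, num_requests)) as pool:
        for i in range(num_requests):
            # submit() returns immediately, so the loop never waits on a response.
            pool.submit(fake_call, i).add_done_callback(on_response)
    # Leaving the "with" block waits for all outstanding responses.
    return len(responses), time.perf_counter() - start


if __name__ == "__main__":
    print("synchronous: ", run_synchronous(20))
    print("asynchronous:", run_asynchronous(20))
```

In the synchronous version the total time grows with the number of requests times the round trip, because every call sits in the critical path. In the asynchronous version the sending loop finishes almost immediately and the round trips overlap, so the total time stays close to a single round trip.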
If you were to use synchronous calls and start with just 1 thread, then add threads incrementally, you would see each thread add 1x to the throughput until you reached a point of diminishing returns, where throughput flattens to a constant rate: the full capacity of the database. But if you use asynchronous calls, in most cases you immediately jump to the full capacity of the database. We call it "fire-hosing" the database.
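The shape of that curve can be approximated with a very simple model. This is a sketch under stated assumptions: each synchronous thread completes one request per round trip, the database has a fixed capacity, and the transition is a hard knee rather than the gradual flattening you would see in practice. The 1 ms round trip and the 200,000/sec capacity are hypothetical numbers, not measurements.

```python
import math


def sync_throughput(threads, round_trip_ms, capacity_per_sec):
    """Estimated requests/sec for a synchronous client with the given threads.

    Each thread contributes 1000 / round_trip_ms requests/sec until the
    total reaches the database capacity, after which it stays flat.
    """
    offered = threads * 1000 / round_trip_ms
    return min(offered, capacity_per_sec)


def threads_to_saturate(round_trip_ms, capacity_per_sec):
    """Smallest thread count whose offered load reaches the capacity."""
    return math.ceil(capacity_per_sec * round_trip_ms / 1000)


if __name__ == "__main__":
    # Hypothetical numbers: 1 ms round trip, 200,000 requests/sec capacity.
    for threads in (1, 10, 100, 200, 400):
        print(threads, sync_throughput(threads, 1, 200_000))
    print("threads needed:", threads_to_saturate(1, 200_000))
```

Under these assumed numbers a single synchronous thread manages about 1,000 requests per second, and it takes about 200 threads to reach the flat part of the curve. An asynchronous client skips the ramp because its sending rate is not tied to the round-trip time.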