Forum: VoltDB Architecture

Post: Why does recovery cause throughput to drop to 0 in bench?

sdrobert
Aug 11, 2011
Hi,


I've been looking all over for this. I just checked out the latest build of VoltDB (1.3.6). When I run the TPC-C-like benchmark with 3 hosts, 4 sites per host, and k-safety=1 (that's 6 unique partitions), I kill one host. Then, using the same command that the BenchmarkController uses to start a server (plus 'rejoinhost XXX'), I rejoin the host to the cluster. While tables are in recovery, the throughput for the cluster is 0 txn/s.


I understand that a multi-partition transaction could drop the throughput to 0 txn/s, but why recovery? Yes, the partitions in recovery will not be able to do work, but only 4 out of 6 unique partitions are occupied with recovery. What are the other 2 doing?


Thanks,
Sean
aweisberg
Aug 11, 2011
Hi Robert,


It is true that only the involved partitions are blocked, but if enough work is queued for the blocked partitions, the cluster will stop accepting work in order to avoid running out of memory. Each node accepts a maximum of 64 megabytes or 5000 txns of outstanding work (whichever limit is reached first). Eventually all outstanding requests allowed into the system are for the blocked partitions, so no new work can be admitted for the healthy partitions either.
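
To make the effect concrete, here is a toy simulation of that admission limit. It is only a sketch and not VoltDB code: the function name, the tick-based model, the uniform spread of transactions across partitions, and the offered load are all assumptions made for illustration. It models only the 5000-txn limit (not the 64 megabyte one) and uses the 6 partitions with 4 blocked from the question.

```python
import random
from collections import Counter


def simulate(num_partitions=6, blocked=frozenset({0, 1, 2, 3}),
             max_outstanding=5000, offered_per_tick=1000,
             ticks=200, seed=0):
    """Toy model (hypothetical, not VoltDB internals).

    Each tick the node admits up to offered_per_tick single-partition
    txns, but never holds more than max_outstanding admitted txns.
    Healthy partitions finish all their queued txns every tick;
    blocked partitions finish none. Returns completed txns per tick.
    """
    rng = random.Random(seed)
    queued = Counter()   # partition id -> admitted, not yet completed
    outstanding = 0      # total admitted, not yet completed
    throughput = []
    for _ in range(ticks):
        # Admission: stop accepting work once the limit is reached.
        for _ in range(offered_per_tick):
            if outstanding >= max_outstanding:
                break
            queued[rng.randrange(num_partitions)] += 1
            outstanding += 1
        # Execution: only partitions that are not in recovery drain.
        completed = 0
        for p in range(num_partitions):
            if p not in blocked:
                completed += queued[p]
                queued[p] = 0
        outstanding -= completed
        throughput.append(completed)
    return throughput


if __name__ == "__main__":
    tps = simulate()
    print("first ticks:", tps[:12])
    print("last tick:", tps[-1])
```

In this model the healthy partitions complete work at first, but every tick leaves more txns parked on the blocked partitions. Once those parked txns fill the whole allowance, nothing new is admitted and the completed count per tick falls to zero, even though two partitions are idle and able to work.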


This is described in section 8.3.1 of Using VoltDB (KSafeRecover#KsafeRejoin). This is something we are looking to improve in the future.


-Ariel