Thank you very much for the detailed explanation. It is very clear to me now.
One minor issue is that you mentioned you could get snapshot throughput of 70 MB/sec. With that throughput, could you take a snapshot with a granularity of 1 second for a database whose size is greater than 70 MB?
Snapshotting speeds up linearly as you add nodes, much like regular Volt. Replicas do not speed up snapshotting because each replica records a copy of all tables, so with three replicas you will end up with three copies of your database when you snapshot. In theory you can speed up snapshotting by adding a faster disk subsystem at each node, but there is a tipping point where the current implementation doesn't saturate the disk subsystem.
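To make the scaling concrete, here is a rough back-of-envelope sketch in Java. The per-node throughput, node count, and database size are illustrative assumptions, not benchmarked numbers, and keep in mind that with a copy-on-write snapshot the consistent point-in-time is fixed the moment the snapshot starts; what this estimates is only how long it takes to write that image to disk.

// Back-of-envelope estimate of snapshot wall-clock time, assuming snapshot
// throughput scales linearly with node count (all numbers are illustrative).
public class SnapshotEstimate {
    public static void main(String[] args) {
        double perNodeThroughputMB = 70.0;   // assumed per-node snapshot rate (MB/sec)
        int nodeCount = 4;                   // assumed cluster size
        double databaseSizeMB = 10_000.0;    // assumed 10 GB of table data

        double clusterThroughputMB = perNodeThroughputMB * nodeCount;
        double snapshotSeconds = databaseSizeMB / clusterThroughputMB;

        // The snapshot's point-in-time is set when it begins; this figure is
        // only how long the image takes to reach disk.
        System.out.printf("Estimated snapshot duration: %.1f seconds%n", snapshotSeconds);
    }
}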
For example, in the case of 9.94 million transactions per minute (as noted in your White Paper), if I assume each transaction updates 1 tuple on average, then the number of updates per second will be about 160k. Also assuming the updates are uniformly distributed, you would have to copy about half of them into the temporary table, i.e. about 80k tuples per second.
Also, assume the cost of copying (including locking the COW iterator position to check the relative position, checking the dirty bit, and copying the tuple to the temporary table) is about 1 microsecond. Then for each second, the overhead is about 80k * 1 microsecond = 0.08 seconds. That means taking a snapshot reduces performance by about 8%.
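Spelled out as a quick sketch (these are just my assumed numbers from above, not anything measured):

// Back-of-envelope COW overhead estimate using the assumptions stated above.
public class CowOverheadEstimate {
    public static void main(String[] args) {
        double txPerMinute = 9_940_000;     // ~9.94M transactions per minute
        double updatesPerTx = 1;            // assume one tuple updated per transaction
        double dirtyFraction = 0.5;         // assume half the updates hit not-yet-copied tuples
        double copyCostMicros = 1.0;        // assumed cost to copy one tuple aside

        double updatesPerSecond = txPerMinute * updatesPerTx / 60.0;  // ~165k/sec
        double copiesPerSecond = updatesPerSecond * dirtyFraction;    // ~80k/sec
        double overheadFraction = copiesPerSecond * copyCostMicros / 1_000_000.0;

        // Prints roughly 8%, matching the estimate above.
        System.out.printf("Estimated snapshot overhead: %.1f%%%n", overheadFraction * 100);
    }
}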
Here @ VoltDB we believe that locks belong on bagels and not on tables, indexes, or iterators. There is no locking cost associated with accessing the iterator, because every partition has a dedicated thread with exclusive access to all of the table, index, and iterator data, and that is the only thread that will ever access that data. In VoltDB parlance the partition is the lock, and it is always held by its dedicated thread.
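As a rough illustration of that threading model (the class below is invented for this example and is not VoltDB's actual code): other threads may enqueue work for a partition, but only the partition's own dedicated thread ever touches its table, index, and iterator data, so that data needs no locks.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Sketch of the "partition is the lock" idea; names are illustrative only.
class Partition implements Runnable {
    // Table, index, and snapshot-iterator state would live here. It is only
    // ever touched from run(), i.e. from the partition's dedicated thread.
    private final List<byte[]> tableData = new ArrayList<>();

    private final BlockingQueue<Consumer<List<byte[]>>> workQueue =
            new LinkedBlockingQueue<>();

    // Any thread may enqueue a procedure invocation...
    void submit(Consumer<List<byte[]>> procedureInvocation) {
        workQueue.add(procedureInvocation);
    }

    // ...but the invocation itself only ever runs on the owner thread.
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                workQueue.take().accept(tableData);  // exclusive, lock-free access
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}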
The cost to copy an average-size tuple is less than 1 microsecond, and there are many threads in the system, so there is actually (partition count * 1,000,000 microseconds) of execution time available each second. A tuple copy is already involved in every update and delete due to the undo log; I remember when we first added undo logging, the overhead was much smaller than expected. This is all taking place in memory with a hot cache at the source and destination. I suspect that the real overhead of snapshotting comes from the table scanning (pulling cold data in from memory is slow) and the CRC calculations done to serialize chunks of table data. The serialization work takes execution time away from running stored procedures.
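Here is a minimal sketch of the copy-on-write bookkeeping being discussed, in that single-threaded-per-partition setting. The class and field names are invented for illustration; this is not the actual VoltDB table code.

import java.util.ArrayList;
import java.util.List;

// Copy-on-write sketch: before an update lands on a tuple the snapshot scan has
// not reached yet, the pre-update image is copied aside so the snapshot stays
// consistent. Runs only on the partition's own thread, so no locks are needed.
class CowTable {
    static class Tuple {
        byte[] data;
        boolean dirty;                 // already preserved for the current snapshot
        Tuple(byte[] data) { this.data = data; }
    }

    private final List<Tuple> tuples = new ArrayList<>();
    private final List<byte[]> snapshotCopies = new ArrayList<>();  // the "temporary table"
    private boolean snapshotActive = false;
    private int scanPosition = 0;      // how far the snapshot scan has advanced

    void startSnapshot() {
        snapshotActive = true;
        scanPosition = 0;
        snapshotCopies.clear();
        for (Tuple t : tuples) t.dirty = false;
    }

    void update(int index, byte[] newData) {
        Tuple t = tuples.get(index);
        if (snapshotActive && index >= scanPosition && !t.dirty) {
            snapshotCopies.add(t.data.clone());   // preserve the pre-update image
            t.dirty = true;
        }
        t.data = newData;   // an undo-log copy already happens on this path anyway
    }
}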
If every procedure invocation modifies a single tuple, then the dominant cost in the system is going to be messaging and invoking stored procedures. In that situation the partitions are likely to spend enough time starved for work that snapshotting will not impact throughput, even if the client applications firehose requests.
If every procedure invocation modifies hundreds of tuples or does other time-consuming work, then the partitions will never starve for work while the system is saturated with requests, and any time spent snapshotting is time spent not doing procedure work. If the system is not saturated, then the snapshot work can be interleaved with regular procedure work and it will not have a large impact.
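A simplified version of what that interleaving could look like on a partition's dedicated thread (illustrative scheduling only, not our actual scheduler): procedure work always takes priority, and snapshot chunks are serialized only when the partition is starved.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of interleaving snapshot serialization with procedure work.
class PartitionLoop {
    private final BlockingQueue<Runnable> procedureQueue = new LinkedBlockingQueue<>();

    void runLoop(Runnable snapshotChunk) throws InterruptedException {
        while (true) {
            // Prefer procedure work; wait only briefly before falling back.
            Runnable work = procedureQueue.poll(1, TimeUnit.MILLISECONDS);
            if (work != null) {
                work.run();          // saturated: snapshot work is deferred
            } else {
                snapshotChunk.run(); // starved: spend the idle time on a snapshot chunk
            }
        }
    }
}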
We have yet to come across a workload that does enough work in each stored procedure to saturate a partition 100% without some starvation. The less work done in each procedure, the more likely the partition is to spend time starved for procedure work.