Forum: Other

Post: Recovery

Recovery
chbussler
Apr 4, 2010
Hi,
It seems that the finest granularity of automatic snapshots is 1 second, so I could have a database snapshot every second (I have not yet tried that out). If my database crashes for some reason within that 1-second interval, then, as I understand it, all transactions committed after the last snapshot are lost and cannot be recovered. Is that true?

If this is true, do you have any recommendations, ideas, or plans for making committed transactions recoverable between snapshots?

Thanks,
Christoph

PS: Not sure if it would make sense to introduce a 'concepts' or 'architecture' forum for this type of question. Thanks.
Snapshot granularity
tcallaghan
Apr 5, 2010
Christoph,

The feature you are requesting is covered by VoltDB's k-safety functionality. In a 4-node cluster with k=1, each row of data exists on 2 nodes, and each node commits the transaction before a "success" message is returned to the user. If you require more fault tolerance you can set k=2 or higher.

And yes, you can set the snapshot frequency to 1s, so the database will be creating snapshots continuously. These snapshots would allow you to recover if you had a multi-node failure beyond what your k-safety setting can cover.

-Tim
Snapshot granularity vs. k-safety setting?
chbussler
Apr 6, 2010
Hi Tim,

I assume snapshots slow down the database. So one strategy, it seems, would be to increase the k-safety setting and decrease the snapshot frequency. Does this make sense? Or is an increased snapshot frequency preferable to an increased k-safety setting?

Thanks,
Christoph
snapshotting vs. k-safety
tcallaghan
Apr 6, 2010
Christoph,

If your cluster is properly sized for your transactional workload (and has a decent disk subsystem), then snapshot creation will not dramatically impact your system performance. Remember that you will only use a snapshot for recovery in the event of a catastrophic hardware failure (more failed nodes than your k-safety value can cover).

A higher k-safety value and internally redundant servers will guarantee a more available system and is, in my opinion, a better strategy.

-Tim
Recovery: Single sited, multi-sited
tuancao
Apr 11, 2010
Hi,
In the single-sited case, do you allow concurrent transactions? If so, I wonder how an automated snapshot could obtain a consistent image of the database.

In the multi-sited case, a transaction is distributed among many nodes. As I understand from your document (section 7), snapshots are taken locally on each node and there is no coordination between nodes while taking snapshots. It is not clear to me how you can guarantee that the database can recover to a consistent state from the locally taken snapshots.

Thanks,
Tuan
Recovery: Single sited, multi-sited
tcallaghan
Apr 12, 2010
Tuan,


Our snapshots are consistent across the entire cluster, regardless of the type or quantity of transactions that occur when the snapshot is running. Each node in the cluster (and each partition in each node) is responsible for writing its data to disk. The coordination of a snapshot happens just prior to starting, as we need to ensure that the snapshot begins at the same time on all nodes.

-Tim
Recovery: Single sited, multi-sited
tuancao
Apr 12, 2010
Hi Tim,

Thank you very much for the answer.
You said the snapshot begins at the same time. Does "time" mean the transaction number or real time?

Also, when the system is taking snapshots, does it block transaction processing?
Do you keep versions of tuples in main-memory (as HARBOR does) to run some kinds of historical queries?

Thanks,
Tuan
Recovery: Single sited, multi-sited
tcallaghan
Apr 12, 2010
Tuan,

The snapshot begins at a particular transaction number, which is how we coordinate with all nodes in the cluster.

A scheduled snapshot never blocks transactional work. When you request a snapshot manually you can specify if you want it to block or not.
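
For example, from the Java client a manual blocking snapshot looks something like the sketch below; the @SnapshotSave parameters shown (directory path, unique snapshot ID, blocking flag) follow the documented syntax, but treat the exact types and paths as illustrative:

```java
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class ManualSnapshotExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // @SnapshotSave: directory path, unique ID (nonce), blocking flag.
        // Blocking flag 1 = wait until the snapshot finishes before returning;
        // 0 = return immediately and let the snapshot run in the background.
        ClientResponse response = client.callProcedure(
                "@SnapshotSave", "/tmp/voltdb/snapshots", "manual_backup_1", 1);

        if (response.getStatus() == ClientResponse.SUCCESS) {
            System.out.println("Snapshot request completed");
        }
        client.close();
    }
}
```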

As for the technical implementation, I'll have one of our developers follow up with the details.

-Tim
Recovery: Single sited, multi-sited
tuancao
Apr 12, 2010
Thanks Tim,

I am looking forward to learning how VoltDB can take snapshots without blocking transaction processing.

Tuan
Recovery: Single sited, multi-sited
aweisberg
Apr 12, 2010
Hi Tuan,

This description takes a two-pronged approach: first I describe the copy-on-write scheme used to allow online snapshots, and then I describe how the snapshot data is extracted and stored by the Java topend.

The snapshot mechanism is a fairly straightforward copy-on-write (COW) scheme.

Table storage in Volt uses fixed-size blocks of fixed-size tuples, with variable-size data (VARCHAR) stored indirectly on the heap. These blocks can be iterated and viewed as a linear list of tuples, some of which are deleted or unused.

Each tuple has a 1-byte header that contains a bit indicating whether the tuple is dirty. In normal operation this bit is always set to false when a tuple is created (where creation means copying it into unused storage). When a table is in COW mode it maintains a table-scanning iterator (the COW iterator) that only returns clean tuples. The iterator always unsets a tuple's dirty flag after it has been scanned, and it is an invariant that the dirty flag is unset for all tuples that have been scanned by a COW iterator. This includes tuples that the iterator does not return, such as tuples that were dirty when scanned by the iterator. Each table also maintains a separate temporary table with an identical schema. This temporary table is populated with copies of tuples that are dirtied while COW mode is active.
Tuples inserted while COW is active have their dirty flag set based on their position relative to the COW iterator. If the tuple is being inserted into an area that has already been scanned, then the flag will not be set, in order to maintain the previously mentioned invariant. If the tuple is being inserted into an area that has not yet been scanned by the COW iterator, then the dirty flag will be set so that it will be skipped by the COW iterator.

Tuple updates and deletes have the same position sensitivity as inserts. No special handling is required if the dirty flag is already set, because either the original tuple has already been copied to the temporary table or the tuple was inserted after COW mode was activated. If the dirty flag is not set and the to-be-mutated tuple is stored in an area that has already been scanned by the COW iterator, then the dirty flag is not set and no additional work is required. If the tuple is in an area that has not been scanned, it must be copied to the temporary table before mutation, and the dirty flag must be set in the case of an update.
Eventually the COW iterator will have returned all tuples in regular table storage and the table leaves COW mode. The temporary table containing the copies of dirtied tuples is iterated to retrieve the remaining tuples and then deleted.
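
In rough Java pseudocode (the real engine code is C++, and all the names below are made up for illustration), the position-sensitive bookkeeping looks something like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical sketch of the copy-on-write bookkeeping described above. */
class CowTableSketch {
    static class Tuple {
        boolean dirty;       // the dirty bit from the 1-byte tuple header
        Object[] values;
        Tuple(Object[] values) { this.values = values; }
    }

    final Tuple[] slots;                                   // fixed-size tuple storage, null = unused slot
    final List<Tuple> cowTempTable = new ArrayList<>();    // copies of tuples dirtied during COW
    boolean cowActive = false;
    int cowPosition = 0;                                   // next slot the COW iterator will scan

    CowTableSketch(int capacity) { slots = new Tuple[capacity]; }

    void activateCow() { cowActive = true; cowPosition = 0; cowTempTable.clear(); }

    private boolean scanned(int slot) { return slot < cowPosition; }

    void insert(int slot, Object[] values) {
        Tuple t = new Tuple(values);
        // Ahead of the iterator: mark dirty so the iterator skips it (it is not part
        // of the snapshot image). Behind the iterator: leave clean to keep the invariant.
        t.dirty = cowActive && !scanned(slot);
        slots[slot] = t;
    }

    void update(int slot, Object[] newValues) {
        Tuple t = slots[slot];
        if (cowActive && !t.dirty && !scanned(slot)) {
            cowTempTable.add(new Tuple(t.values.clone()));  // preserve the pre-update image
            t.dirty = true;
        }
        t.values = newValues;  // already dirty or already scanned: mutate in place
    }

    void delete(int slot) {
        Tuple t = slots[slot];
        if (cowActive && !t.dirty && !scanned(slot)) {
            cowTempTable.add(new Tuple(t.values.clone()));  // the snapshot still needs this tuple
        }
        slots[slot] = null;  // the slot becomes unused and can be recycled
    }

    /** COW iterator step: return the next clean tuple, clearing dirty flags as it passes. */
    Tuple nextForSnapshot() {
        while (cowPosition < slots.length) {
            Tuple t = slots[cowPosition++];
            if (t == null) continue;                     // unused slot
            if (t.dirty) { t.dirty = false; continue; }  // skip; invariant: scanned => clean
            return t;                                    // clean tuple goes into the snapshot
        }
        return null;  // regular storage exhausted; remaining tuples are read from cowTempTable
    }
}
```

The key point is that a tuple is copied to the temporary table at most once, and only when a mutation would otherwise destroy data the COW iterator has not yet captured.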

The @SnapshotSave multi-partition system procedure causes the system to go into copy-on-write mode at the same transaction id at every partition. At each host the sysproc creates a target for snapshot data for each table. Currently the only target implemented is flat files on disk, with a dedicated thread doing the blocking IO. Every partition is given a list of targets indicating which tables should go into COW mode and where the tuple data should be sent (via the target). The work of snapshotting replicated tables is round-robined across partitions and is done after all partitioned-table data has been snapshotted. Because the same targets are used in the same order at every partition, it is possible for disk writes to be sequential to a single file. In testing I was able to get snapshot throughput of 70 megabytes/sec on a disk capable of 90 megabytes/sec sequential write.

Each partition has a small buffer pool with fixed-size buffers. In between executing transactions, a partition empties the pool, copies chunks of tuple data into the buffers, and then passes them off to a snapshot target. The snapshot target returns consumed buffers back to the partition they originally came from. This approach provides some parallelism and load balancing across partitions and is responsive to the rate at which the target consumes buffers. Eventually all the partitions finish snapshotting, and the last partition to finish kicks off a thread to close all the targets. For the default target this syncs the files, then sets a flag in each file to indicate the sync completed, and then syncs again to write the flag. A snapshot file will fail a CRC check of the header if the final sync for the completion flag did not finish, ensuring that incomplete files are detected. Every chunk passed to the snapshot target has a CRC prepended to it, and as previously mentioned there is a CRC of the file header that doesn't cover chunk data.
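
For the default file target, the per-chunk CRCs and the finish sequence could be sketched like this in Java; the header and prefix layout here are made up, and only the general pattern (CRC-prefixed chunks, then sync, write a completion flag, and sync again so incomplete files are detectable) follows the description above:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

/** Hypothetical sketch of a snapshot file target. */
class SnapshotFileTargetSketch implements AutoCloseable {
    private final FileChannel channel;

    SnapshotFileTargetSketch(String path) throws Exception {
        channel = new RandomAccessFile(path, "rw").getChannel();
        ByteBuffer header = ByteBuffer.allocate(5);
        header.putInt(0);       // placeholder header CRC, rewritten at close
        header.put((byte) 0);   // completion flag: 0 = incomplete
        header.flip();
        channel.write(header);  // header occupies bytes 0..4; chunks follow sequentially
    }

    /** Append one chunk of serialized tuple data, prefixed by its CRC32 and length. */
    void writeChunk(ByteBuffer chunk) throws Exception {
        CRC32 crc = new CRC32();
        crc.update(chunk.duplicate());                       // CRC over the chunk payload only
        ByteBuffer prefix = ByteBuffer.allocate(8);
        prefix.putInt((int) crc.getValue());
        prefix.putInt(chunk.remaining());
        prefix.flip();
        channel.write(new ByteBuffer[] { prefix, chunk });   // sequential append
    }

    /** Finish: sync the data, write the completion flag and header CRC, sync again. */
    @Override
    public void close() throws Exception {
        channel.force(true);                      // 1) make all chunk data durable
        byte flag = 1;
        CRC32 headerCrc = new CRC32();
        headerCrc.update(flag);                   // header CRC covers the completion flag
        ByteBuffer done = ByteBuffer.allocate(5);
        done.putInt((int) headerCrc.getValue());
        done.put(flag);
        done.flip();
        channel.write(done, 0);                   // 2) mark the file complete
        channel.force(true);                      // 3) persist the flag; a reader that finds a
        channel.close();                          //    mismatched header CRC discards the file
    }
}
```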

The impact of online snapshotting when the cluster is not saturated is not as much as you might think. It's not free, but it certainly doesn't grind things down to a crawl. It steals idle cycles to do the serialization, the memory allocation overhead is amortized in large chunks, a tuple copy is a straight memory copy with the cache hot on one side and warm on the other, reads and inserts are not affected, a tuple can only be dirtied once, and the set of tuples that can be dirtied rapidly shrinks as the snapshot reaches completion at a rate of 70+ megabytes/sec. The only way to know is to try it out on your hardware and workload and see how it goes.
I think that covers it. Let me know if you have any questions.

Ariel
Recovery: Single sited, multi-sited
tuancao
Apr 12, 2010
Hi Ariel,


Thank you very much for a detailed explanation. It is very clear to me now.

One minor issue is that you mentioned you could get snapshot throughput of 70 MB/sec. With that throughput, could you take a snapshot with a granularity of 1 second for a database whose size is greater than 70 MB?

Another thing I am curious about is the overhead when automated snapshots are enabled.

For example, in the case of 9.94 million transactions per minute (as noted in your White Paper), I assume that, on average, each transaction updates 1 tuple; then the number of updates will be about 160k updates per second. Also assuming the updates are uniformly distributed, you would have to copy about half of them into the temporary table, i.e. 80k tuples.

Also, assume the cost of copying (including locking the COW iterator position to check the relative position, checking the dirty bit, and copying the tuple to the temporary table) is about 1 microsecond. Then for each second, the overhead is about 80k * 1 microsecond = 0.08 seconds. That means taking a snapshot reduces performance by about 8%.

Tuan
Recovery: Single sited, multi-sited
aweisberg
Apr 13, 2010
Hi Tuan,

> One minor issue is that you mentioned you could get snapshot throughput of 70 MB/sec. With that throughput, could you take a snapshot with a granularity of 1 second for a database whose size is greater than 70 MB?

Snapshotting speeds up linearly as you add nodes, much like regular Volt; four nodes each writing at ~70 MB/sec, for example, give you roughly 280 MB/sec of aggregate snapshot throughput. Replicas do not speed up snapshotting, because each replica records a copy of all tables, so (with k=2, for instance) you will end up with three copies of your database when you snapshot. In theory you can speed up snapshotting by adding a faster disk subsystem at each node, but there is a tipping point where the current implementation doesn't saturate the disk subsystem.

> For example, in the case of 9.94 million transactions per minute (as noted in your White Paper), I assume that, on average, each transaction updates 1 tuple; then the number of updates will be about 160k updates per second. Also assuming the updates are uniformly distributed, you would have to copy about half of them into the temporary table, i.e. 80k tuples.
> Also, assume the cost of copying (including locking the COW iterator position to check the relative position, checking the dirty bit, and copying the tuple to the temporary table) is about 1 microsecond. Then for each second, the overhead is about 80k * 1 microsecond = 0.08 seconds. That means taking a snapshot reduces performance by about 8%.

Here at VoltDB we believe that locks belong on bagels, not on tables, indexes, or iterators. There is no locking cost associated with accessing the iterator, because every partition has a dedicated thread with exclusive access to all the table, index, and iterator data, and that is the only thread that will ever access that data. In VoltDB parlance the partition is the lock, and it is always held by its dedicated thread.
The cost to copy an average-size tuple is less than 1 microsecond, and there are many threads in the system, so there is actually (partition count * 1,000,000 microseconds) of execution time available each second. A tuple copy is already involved in every update and delete due to the undo log; I remember when we first added undo logging, the overhead was much smaller than expected. This is all taking place in memory with a hot cache at the source and destination. I suspect that the real overhead of snapshotting comes from the table scanning (pulling cold data into the cache is slow) and the CRC calculations done to serialize chunks of table data. The serialization work takes execution time away from running stored procedures.
If every procedure invocation modifies a single tuple, then the dominant cost in the system is going to be messaging and invoking stored procedures. In that situation the partitions are likely to spend enough time starved for work that snapshotting will not impact throughput, even if the client application firehoses requests.

If every procedure invocation modifies hundreds of tuples or does other time-consuming work, then the partitions will never starve for work if the system is saturated with requests, and any time spent snapshotting is time not spent doing procedure work. If the system is not saturated, then the snapshot work can be interleaved with regular procedure work and will not have a large impact.

We have yet to come across a workload that does enough work in each stored procedure to saturate a partition 100% without some starvation. The less work done in each procedure, the more likely the partition is to spend time starved for procedure work.

-Ariel
Thanks :-)
henning
Apr 16, 2010
Thanks for that treat, Ariel. Can't wait to learn more about the inner workings of VoltDB.
So in a nutshell, snapshots are always transaction-consistent?
Yep. Like doing select * from
aweisberg
Apr 18, 2010
Yep. Like doing select * from every table in a multi-partition transaction and then piping it to disk.

-Ariel