Forum: Other

Post: Cluster node failure detection/handling

Cluster node failure detection/handling
chbussler
Apr 4, 2010
Hi,

I might have overlooked this in the documentation. Let's assume that there is a cluster with several nodes (e.g. 3 machines, each with a multi-core processor), and a database that is running on all three nodes. What happens if one of the nodes fails (there can be different types of failures, of course, like network partitioning, disk failures, etc.)?

I'd be interested in learning how the database behaves in this situation, what that means for e.g. automatic snapshot processing and transaction processing of not-yet-committed transactions, and whether all clients observe the same error behavior (or only those clients that connect to the faulty node).
The main focus of this question is data consistency and the potential data loss/recovery that needs to be dealt with.

Thanks,
Christoph
Christoph, VoltDB replicates
rbetts
Apr 5, 2010
Christoph,

VoltDB replicates partitions across nodes. The number of replicas is a user-configurable option provided on the VoltCompiler application command line: the user specifies the "K-Safe" factor, and VoltDB creates k+1 copies of each partition. This allows k concurrent node failures without loss of user data or interruption of VoltDB's availability.
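
To put rough numbers on the k-safety arithmetic, here is a small sketch I wrote for this post (plain Java, not VoltDB code; the cluster size and sites-per-host values are made-up assumptions, and the sizing formula is my own simplification):

    // Illustrative sketch only: rough k-safety arithmetic, not VoltDB code.
    public class KSafetyExample {
        public static void main(String[] args) {
            int hosts = 3;        // hypothetical 3-node cluster
            int sitesPerHost = 4; // hypothetical execution sites (partitions) per node
            int k = 1;            // K-Safe factor chosen by the user

            int totalSites = hosts * sitesPerHost;
            int copiesPerPartition = k + 1; // k+1 copies of each partition
            int uniquePartitions = totalSites / copiesPerPartition;

            System.out.printf("%d unique partitions, each stored as %d copies;%n",
                    uniquePartitions, copiesPerPartition);
            System.out.printf("the cluster tolerates %d concurrent node failure(s).%n", k);
        }
    }

With k=1 on a three-node cluster, every partition exists as two copies, so any single node can fail without loss of user data.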

In the current 0.6.02 beta, VoltDB supports replicating data but does not yet continue operation when a node fails. Detecting node failures and continuing operation is work that is nearing completion and will be part of the next VoltDB release. We're excited!

Most failures devolve to the same detection path: a node "crashes" due to a software fault or detects a local hardware fault and self-terminates execution. The nodes composing the cluster are connected by an all-to-all TCP/IP socket mesh and heartbeat each other periodically (they also piggy-back heartbeat data on transaction initiations and other intra-cluster messages). Failures are typically detected as write errors on this socket mesh.
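
To give a flavor of that detection path, here is an illustrative sketch of the general heartbeat technique (not VoltDB's actual implementation; the class name and the five-second deadline are invented for the example):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative heartbeat monitor, not VoltDB code. A peer is treated as
    // failed either on a write error against its mesh socket or when it has
    // been silent past a deadline.
    public class HeartbeatMonitor {
        private static final long TIMEOUT_MS = 5000; // hypothetical deadline
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        // Record a heartbeat (or any message carrying piggy-backed heartbeat data).
        public void onMessageFrom(String peer) {
            lastSeen.put(peer, System.currentTimeMillis());
        }

        // Send a heartbeat; a write error on the socket is a failure signal.
        public boolean sendHeartbeat(String peer, OutputStream socketOut) {
            try {
                socketOut.write(0); // stand-in for a real heartbeat message
                socketOut.flush();
                return true;
            } catch (IOException e) {
                return false;       // report the peer as failed
            }
        }

        // Periodic check: anyone silent past the deadline is suspected failed.
        public boolean isSuspectedFailed(String peer) {
            Long seen = lastSeen.get(peer);
            return seen == null || System.currentTimeMillis() - seen > TIMEOUT_MS;
        }
    }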

VoltDB's design assumes a relatively small number of nodes (10's, not 1000's) connected by a common network switch. VoltDB chooses to optimize consistency and availability over complex network partition handling. Essentially, we believe that at expected cluster sizes and correct network link redundancy, network partitions are rare.

When a node fails, on-going transactions from a client can be divided into a few buckets. Describing what happens to these different categories of transactions is perhaps a good way to talk about failure handling.
(In the following text, by "transaction initiation," I mean the internal VoltDB message that bundles the user provided stored procedure parameters, the procedure to invoke, and some internal metadata about the transaction.)
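
As a rough picture of that bundle, the fields might look something like this (the names are invented for illustration and are not VoltDB's actual classes):

    // Hypothetical shape of a transaction initiation message, for illustration only.
    public class TxnInitiation {
        final long txnId;           // internal ordering/identity metadata
        final int initiatorNodeId;  // node that accepted the client invocation
        final String procedureName; // stored procedure to invoke
        final Object[] parameters;  // user-provided stored procedure parameters

        TxnInitiation(long txnId, int initiatorNodeId,
                      String procedureName, Object[] parameters) {
            this.txnId = txnId;
            this.initiatorNodeId = initiatorNodeId;
            this.procedureName = procedureName;
            this.parameters = parameters;
        }
    }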

Bucket 1: transactions instantiated by a client and sent to the failed node but not received by the failed node's transaction initiator. These transactions are never seen by the VoltDB cluster and will not occur in VoltDB.

Bucket 2: transactions instantiated by a client, received by the failed node with initiations that have been internally issued to the corresponding partition replicas. Transaction initiations are two-phase committed across replicas. When a node fails, replicas communicate to discover the most recent fully replicated transaction initiation from the failed node. All transactions before and including this transaction are completed. All transactions (initiated by the failed node) after this transaction are failed by the system.
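
One way to picture that agreement step (a sketch under my own simplifying assumptions, not the actual fault-resolution code; I assume each surviving replica can report the highest initiation id it received from the failed node with no gaps):

    import java.util.Collection;
    import java.util.List;

    // Sketch only: choosing the cut-off after an initiator node fails. Each
    // surviving replica reports the latest initiation id it received without
    // gaps; the fully replicated prefix ends at the minimum of those reports.
    public class FailedInitiatorResolution {
        public static long lastFullyReplicatedTxn(Collection<Long> latestIdPerReplica) {
            return latestIdPerReplica.stream()
                    .mapToLong(Long::longValue)
                    .min()
                    .orElseThrow(() -> new IllegalStateException("no surviving replicas"));
        }

        public static void main(String[] args) {
            // Hypothetical reports from three surviving replicas.
            long cutoff = lastFullyReplicatedTxn(List.of(1042L, 1040L, 1041L));
            System.out.println("Complete txns <= " + cutoff + "; fail txns > " + cutoff);
        }
    }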

Bucket 3: in progress multi-partition transactions running on at least one VoltDB execution site. If the failed node was running the Java stored procedure code, in progress work at remote nodes is rolled back and the transaction is failed by the system. If the failed node was only executing SQL and not running the Java stored procedure code, the transaction is run to completion by the remaining nodes of the cluster.
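
Stated as a single decision (just the rule above restated in code form, not VoltDB internals):

    // Sketch of the bucket 3 rule: the fate of an in-flight multi-partition
    // transaction depends on whether the failed node was running the Java
    // stored procedure code or only executing SQL fragments.
    public class MultiPartitionFailureRule {
        enum Outcome { ROLL_BACK_AND_FAIL, RUN_TO_COMPLETION }

        public static Outcome outcomeOnNodeFailure(boolean failedNodeRanProcedureCode) {
            return failedNodeRanProcedureCode
                    ? Outcome.ROLL_BACK_AND_FAIL  // undo in-progress work at remote nodes
                    : Outcome.RUN_TO_COMPLETION;  // surviving nodes finish the transaction
        }
    }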

In fewer words, transactions are either run completely or not at all.

Clients will see the results of transactions at all nodes. There is immediate consistency in VoltDB, not eventual consistency. Clients can detect most node failures via TCP/IP. In near term releases, clients are not directly informed of which transactions were completed or failed as a result of node failure. This must be discovered out of band from the VoltDB operation log (the syslog log, not the internal data journaling log).

The VoltDB system has full and consistent information on the fate of these transactions and typically clients are connected to more than one database node; transaction fate could be broadcast to clients from non-failed nodes in some future release.
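
From the client's perspective the practical consequence is some bookkeeping: remember each invocation until a response arrives, and treat anything still outstanding when a connection drops as "fate unknown" until it is resolved out of band. A minimal sketch (plain Java, independent of the real VoltDB client library; the names are invented for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of client-side bookkeeping, not part of the VoltDB client library.
    // Every outstanding invocation is remembered; anything still pending when
    // the connection is lost has an unknown fate and must be resolved out of
    // band (e.g. from the operation log).
    public class TxnBookkeeper {
        enum Fate { PENDING, COMMITTED, FAILED, UNKNOWN }

        private final Map<Long, Fate> inFlight = new ConcurrentHashMap<>();

        public void sent(long clientTxnId) {
            inFlight.put(clientTxnId, Fate.PENDING);
        }

        public void responded(long clientTxnId, boolean committed) {
            inFlight.put(clientTxnId, committed ? Fate.COMMITTED : Fate.FAILED);
        }

        // Called when the TCP connection to the node is lost: everything still
        // pending becomes UNKNOWN rather than being assumed failed.
        public void connectionLost() {
            inFlight.replaceAll((id, fate) -> fate == Fate.PENDING ? Fate.UNKNOWN : fate);
        }

        public Fate fateOf(long clientTxnId) {
            return inFlight.getOrDefault(clientTxnId, Fate.UNKNOWN);
        }
    }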

Your question gets right to the heart of the matter. I hope this answer is detailed enough to help you picture a little more about VoltDB's design while avoiding too much VoltDB-specific jargon.

Thanks,
Ryan.
Hi Ryan, thanks a lot for
chbussler
Apr 6, 2010
Hi Ryan,

thanks a lot for this extensive reply; I appreciate it. Is this a standard algorithm for in-memory databases? If so, I'd appreciate it if you could give me some pointers to publications so that I can do some more reading. But you answered my question.

Thanks,
Christoph
Christoph, Our algorithm is
rbetts
Apr 6, 2010
Christoph,

Our algorithm is built on standard techniques, like logging for undo and two-phase commit, that are commonly used in distributed systems. The unique architecture of VoltDB leads to unique details in the full implementation, though.

If you'd like to read more about the academic underpinnings of VoltDB, and haven't yet read the H-Store papers, they may interest you: http://db.cs.yale.edu/hstore/.

Ryan.
Partition and Lost Transaction Results
henning
Apr 16, 2010
Thanks for the details, Ryan.

Two questions. First, what happens when a partition occurs, i.e. a network failure that results in, say, two subnets that can't see each other, and then the network comes back up? I can live with the answer 'undefined', but I would like to know. Second, is there a defined time span for how long a network partition VoltDB will survive? Will it actively reset minority nodes that rejoin, at some point?

" In near term releases, clients are not directly informed of which transactions were completed or failed as a result of node failure."

Does this mean that when a transaction was completed, but the procedure-hosting node went down before it could tell the client about the success of the transaction, the client will never know, unless it is taught to scan the logs? It may, with some bookkeeping, realize there was no answer at all for some transaction, not even a failure. But this is the only case where the information will be lost, and the client will know from an exception that the node went down. Correct?