In the case of a single-partition transaction, we run the Java stored procedure on all k+1 copies of the partition. Before returning an answer to the client, we confirm that every copy produced the same result. If the results differ because one of the nodes had a hardware or software failure, the transaction is committed and the bad node is failed out of the cluster. If the results differ but no failure is detected, I think we currently fail the whole system, as this should never happen; we have never seen it on 1.0.01, and we've run many billions of k-safe transactions. We'd like to add support for different behavior when k is 2 or more, since odd failures are easier to detect when nodes can vote.
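To make the decision rule concrete, here is a minimal sketch (not VoltDB source; the function and value names are made up) of comparing the k+1 replica results and choosing between commit, evicting a failed node, and halting:

```python
# Hypothetical sketch of the single-partition k-safety check described
# above: run the procedure on k+1 replicas, compare the answers, and
# decide whether to commit, evict a failed node, or halt the system.

def resolve_replica_results(results):
    """results maps node_id -> ("ok", value) or ("failed", None).

    Returns ("commit", value, evicted_nodes) when the healthy replicas
    agree, or ("halt", None, []) when healthy replicas disagree with no
    detected failure -- the "should never happen" case in the text.
    """
    healthy = {n: v for n, (status, v) in results.items() if status == "ok"}
    failed = [n for n, (status, _) in results.items() if status == "failed"]
    answers = set(healthy.values())
    if len(answers) == 1:
        # Surviving replicas agree: commit, and fail out any bad nodes.
        return ("commit", answers.pop(), failed)
    # Healthy replicas disagree and nothing failed: fail the system.
    return ("halt", None, [])
```

With k >= 2 a majority vote among the disagreeing healthy replicas becomes possible instead of halting, which is the different behavior the text mentions wanting to add.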
In the case of a multi-partition procedure, the Java runs on one node, and the transaction commits only if every participating node commits; otherwise all nodes roll back. Note that the cluster can survive the loss of at most k copies of each partition.
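The two rules above can be sketched as follows (again a hypothetical illustration, not VoltDB code): an all-or-nothing commit decision, plus the survivability condition that every partition must keep at least one of its k+1 copies alive:

```python
# Hypothetical sketch of the multi-partition rules described above.

def multi_partition_outcome(partition_results):
    """partition_results: list of per-partition booleans, True = that
    partition committed its fragment locally.  The transaction commits
    only if every partition succeeds; any failure rolls everything back."""
    return "commit" if all(partition_results) else "rollback"

def cluster_viable(replica_status):
    """replica_status: partition name -> list of booleans, one per copy
    (k+1 entries), True = copy alive.  The cluster keeps running only
    while every partition has at least one live copy, i.e. at most k of
    each partition's copies have failed."""
    return all(any(copies) for copies in replica_status.values())
```

For example, with k=1 a cluster where both copies of one partition are dead is no longer viable, even if every other partition is fully replicated.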
There is additional code to decide what happens when the node running the Java fails (the coordinator, in Volt-speak) or the node handling the client connection fails (the initiator). With the exception of a few known bugs involving incredibly unlucky failure timing, all nodes either commit or roll back in concert. Those bugs are being worked on now.
The VoltDB documentation states:
K-safety involves duplicating database partitions so that if a partition is lost (either due to hardware or software problems) the database can continue to function with the remaining duplicates. In the case of VoltDB, the duplicate partitions are fully functioning members of the cluster, including all read and write operations that apply to those partitions. (In other words, the duplicates function as peers rather than in a master-slave relationship.)
When the duplicates function as peers, will all the duplicates, as well as the partition they are duplicated from, do exactly the same work?
Does that mean that if we enable k-safety, throughput (TPS) will be reduced by a factor of (k+1)?
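The question can be made concrete with back-of-envelope arithmetic. Assuming every write runs on all k+1 copies of its partition (as the peer model above suggests) and the hardware is fixed, the numbers below (made up for illustration) show the (k+1)x cost:

```python
# Back-of-envelope arithmetic for the throughput question above, under
# the assumption that each write executes on all k+1 copies of its
# partition.  The base TPS figure in the test is an invented example.

def ksafe_write_tps(base_tps, k):
    """If a cluster sustains base_tps single-partition writes at k=0,
    each write now consumes (k+1) partitions' worth of work, so the
    upper bound on write throughput drops by a factor of (k+1)."""
    return base_tps / (k + 1)
```

This is only an upper-bound argument for writes on fixed hardware; it does not account for replication coordination overhead, nor for whether reads must also run on every copy.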