Forum: VoltDB Architecture

Post: K-Safety

K-Safety
iggy_fernandez
Jun 6, 2010
With K-Safety, what happens if a transaction aborts in one of the replicas? The abort outcome is acceptable only if all the other replicas also abort. How does an individual replica find out whether or not the other replicas have reached the same abort outcome?


Regards,


Iggy Fernandez
Two cases to worry about.
jhugg
Jun 7, 2010
In the case of a single-partition transaction, we run the Java stored procedure on k+1 nodes. Before returning an answer to the client, we confirm that the results from all replicas match. If the results differ because one of the nodes had a hardware or software failure, the transaction is committed and we fail the bad node. If the results differ but no failure is detected, I think we currently fail the system, as this should never happen. We haven't ever seen this happen on 1.0.01, and we've run many billions of k-safe transactions. We'd like to add support for different behavior when k=2 or more, as it's easier to detect odd failures when nodes can vote.
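
To make that arbitration step concrete, here is a minimal Java sketch of the decision logic. This is not VoltDB source code; the class and field names are invented for illustration, and it only models the behavior described above: commit and evict a replica whose mismatch is explained by a detected failure, and treat an unexplained mismatch as fatal.

import java.util.List;

// Hypothetical sketch, not the actual VoltDB implementation.
final class ReplicaResult {
    final String replicaId;
    final long resultHash;    // hash of the serialized procedure result
    final boolean nodeFailed; // true if a hardware/software fault was detected

    ReplicaResult(String replicaId, long resultHash, boolean nodeFailed) {
        this.replicaId = replicaId;
        this.resultHash = resultHash;
        this.nodeFailed = nodeFailed;
    }
}

final class SinglePartitionArbiter {
    /** Returns true if the transaction can be acknowledged to the client. */
    boolean resolve(List<ReplicaResult> results) {
        Long agreedHash = null;
        for (ReplicaResult r : results) {
            if (r.nodeFailed) {
                // Mismatch explained by a failure: keep the transaction,
                // evict the bad node from the cluster.
                System.out.println("failing node " + r.replicaId);
                continue;
            }
            if (agreedHash == null) {
                agreedHash = r.resultHash;
            } else if (agreedHash != r.resultHash) {
                // Divergence with no detected failure: should never happen,
                // so fail the system rather than return a wrong answer.
                throw new IllegalStateException("replica divergence without failure");
            }
        }
        return agreedHash != null;
    }
}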


In the case of multi-partition procedures, the Java is run on one node, and all participating nodes must agree to commit or the transaction rolls back everywhere. Note that at most k copies of each partition can fail.
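
For concreteness, here is roughly what the single- vs. multi-partition distinction looks like at the procedure level (the table and column names below are made up). A procedure annotated with partition info runs independently on the k+1 replicas of one partition; drop the annotation (or set singlePartition to false) and it becomes a multi-partition procedure coordinated across the cluster as described above.

import org.voltdb.ProcInfo;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Single-partition: routed by the first parameter (ACCT_ID) and executed
// on every replica of that partition under k-safety.
@ProcInfo(partitionInfo = "ACCOUNTS.ACCT_ID: 0", singlePartition = true)
public class GetBalance extends VoltProcedure {
    public final SQLStmt select =
        new SQLStmt("SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = ?;");

    public VoltTable[] run(long acctId) {
        voltQueueSQL(select, acctId);
        return voltExecuteSQL();
    }
}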


There is additional code to decide what happens when the node running the Java fails (the coordinator in Volt-speak) or the node handling the client connection fails (the initiator). With the exception of a few known bugs with incredibly unlucky failure timing, we either commit or roll back in concert at all nodes. Those bugs are being worked on now.
Thank you
iggy_fernandez
Jun 7, 2010
Thank you. Your reply was helpful.


Regards,


Iggy Fernandez
k-safety
tuancao
Jul 9, 2010
Hi,


The VoltDB documentation states:


"
K-safety involves duplicating database partitions so that if a partition is lost (either due to hardware or
software problems) the database can continue to function with the remaining duplicates. In the case of
VoltDB, the duplicate partitions are fully functioning members of the cluster, including all read and write
operations that apply to those partitions. (In other words, the duplicates function as peers rather than in
a master-slave relationship.)
"


When the duplicates function as peers, will all the duplicates and the partition they are copied from do exactly the same work?
Does that mean that if we enable k-safety, the throughput (TPS) will be reduced by a factor of (k+1)?


Thanks,
Tuan
Hi Tuan, Short answer is yes.
aweisberg
Jul 9, 2010
Hi Tuan,


The short answer is yes. I wouldn't call it reduced, because the purpose of replication is to gain availability, not performance. I thought we had the single-partition performance thing under control ;-)


Write transactions have to be executed at all replicas, so there is no performance gain from adding replicas. Currently, read-only transactions are also executed at all replicas and the results are compared to ensure that they match, because we are paranoid. In the future we can add functionality to allow read-only transactions to execute at a subset of replicas in order to improve performance. When that happens depends on whether there is demand for such functionality.
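
As a rough, back-of-the-envelope illustration of the throughput question (the per-site number below is invented, not a benchmark): with a fixed number of execution sites, raising k leaves fewer unique partitions doing useful work, so write throughput per unit of hardware drops by roughly a factor of k+1.

public class KSafetyThroughputModel {
    public static void main(String[] args) {
        int totalSites = 12;             // execution sites across the cluster
        double writeTpsPerSite = 50_000; // assumed, purely illustrative
        for (int k = 0; k <= 2; k++) {
            int uniquePartitions = totalSites / (k + 1);
            double clusterWriteTps = uniquePartitions * writeTpsPerSite;
            System.out.printf("k=%d -> %d unique partitions, ~%.0f write TPS%n",
                    k, uniquePartitions, clusterWriteTps);
        }
    }
}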


There was an interesting blog post a while back that pointed out that VoltDB doesn't scale reads. It makes them really, really fast, and with replication and the previously mentioned enhancement it can scale reads across all partitions, but it doesn't solve the problem of accessing a single key a brazillion times per second, à la the articles on the NY Times front page. Replication scales all partitions equally and doesn't target the specific partitions with the hot data (the ones holding the front-page articles).


VoltDB will never be memcached (memcached + 10 gig-E is really, really fast), but it is interesting to think about what you could do to replicate data at the application level inside VoltDB to scale reads, and whether that would be more or less painful than maintaining a separate memcached layer. It's also interesting to think about what facilities VoltDB could provide to make this possible/easier.
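
As a purely illustrative sketch of that application-level idea (this is not a VoltDB feature, and GetArticleCopy is a hypothetical procedure): a hot row could be stored N times under N different partition keys, with writes updating every copy and reads picking one copy at random, so read load on the hot key spreads across N partitions.

import java.util.Random;
import org.voltdb.client.Client;
import org.voltdb.client.ClientResponse;

public class HotKeyReader {
    private static final int COPIES = 8;  // app-level copies of each hot row
    private static final Random RAND = new Random();

    // Reads pick one of the COPIES keys at random; writes (not shown) would
    // have to update all COPIES keys, e.g. via a multi-partition procedure.
    public static ClientResponse readHotArticle(Client client, long articleId)
            throws Exception {
        long copyKey = articleId * COPIES + RAND.nextInt(COPIES);
        return client.callProcedure("GetArticleCopy", copyKey);
    }
}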


--Ariel
k-safety
tuancao
Jul 9, 2010
Hi Ariel,


Thank you very much. Could you also point me to the blog post you mentioned about scaling reads?


Thanks,
Tuan