Forum: VoltDB Architecture

Post: FailureSiteUpdateMessages question

FailureSiteUpdateMessages question
sdrobert
Jun 14, 2011
Hello,


I was wondering how VoltDB clears any FailureSiteUpdateMessages that are still outstanding once node failures have been resolved. This matters because any such leftover message could cause a rejoined node to fault.


Thank you for your time,
Sean
Failure agreement
aweisberg
Jun 15, 2011
Hi Sean,


When the failure agreement process completes, all of those messages are guaranteed to have been consumed. The agreement process works by having each surviving site broadcast the set of sites it believes have failed until all survivors agree on the full set. Each survivor keeps waiting for messages until it has received one containing the correct set from every site. If a survivor is informed of a new failure, it restarts the process, adding the new failure to its set. TCP ordering guarantees that every message from a site that carries an incomplete set is consumed before that site's message carrying the complete set arrives.
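
To make the argument concrete, here is a minimal Python sketch of the loop described above. It is a toy model, not VoltDB's code: the names `Site`, `run_agreement`, and `initial_views` are hypothetical, a per-pair FIFO deque stands in for each TCP connection, and a "restart" is modelled as re-broadcasting the enlarged set while remembering the most recent set seen from each peer. Failures that occur after agreement has completed are not modelled.

```python
from collections import deque


class Site:
    """One surviving site in a toy model of failure agreement (hypothetical)."""

    def __init__(self, site_id, peers, known_failed, channels):
        self.site_id = site_id
        self.peers = peers                      # ids of the other survivors
        self.failed = frozenset(known_failed)   # sites this survivor thinks failed
        self.channels = channels                # (sender, receiver) -> FIFO deque
        self.latest = {}                        # peer id -> last set that peer sent us
        self.broadcast()

    def broadcast(self):
        # Send our current view to every other survivor.
        for peer in self.peers:
            self.channels[(self.site_id, peer)].append(self.failed)

    def receive(self, sender, failed_set):
        self.latest[sender] = failed_set
        if not failed_set <= self.failed:
            # Learned of a new failure: grow our set and restart by re-broadcasting.
            self.failed = self.failed | failed_set
            self.broadcast()

    def agreed(self):
        # Done once the last message from every peer matches our own set.
        return all(self.latest.get(p) == self.failed for p in self.peers)


def run_agreement(initial_views):
    """initial_views maps survivor id -> iterable of site ids it believes failed."""
    ids = list(initial_views)
    # One FIFO queue per ordered pair, standing in for an ordered TCP stream.
    channels = {(s, r): deque() for s in ids for r in ids if s != r}
    sites = {
        i: Site(i, [p for p in ids if p != i], initial_views[i], channels)
        for i in ids
    }
    while not all(site.agreed() for site in sites.values()):
        progressed = False
        for (sender, receiver), queue in channels.items():
            if queue:
                sites[receiver].receive(sender, queue.popleft())
                progressed = True
        if not progressed:
            raise RuntimeError("no messages in flight but no agreement reached")
    leftover = sum(len(queue) for queue in channels.values())
    return sites, leftover


if __name__ == "__main__":
    # Survivors 1, 2, 3 start with different views of which sites (4, 5) failed.
    sites, leftover = run_agreement({1: {4}, 2: {4, 5}, 3: {5}})
    for site in sites.values():
        print(site.site_id, sorted(site.failed))
    print("messages left unconsumed:", leftover)
```

In this model every site sends its complete set exactly once, as its last message, and a survivor only finishes after consuming that message from each peer. Because each queue is FIFO, everything sent earlier on that queue has already been consumed by then, which is why `leftover` is zero when agreement completes.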


Hope that makes sense.
-Ariel