Forum: Managing VoltDB

Post: About high availability and zero-downtime upgrade

xiangz
Jul 3, 2015
Hi all,

I'm curious about the HA features of VoltDB. I'm evaluating VoltDB for a mission-critical environment and have the following questions:
1). The documentation suggests putting all the nodes on the same switch. If so, wouldn't that switch be a single point of failure? In our usual setup, every host has two NICs connected to two interconnected switches; one NIC is active and the other is standby. Will this be a problem for VoltDB? Suppose there are three servers in the VoltDB cluster and the K-factor is 2: should the active NICs of the three hosts be connected to the same switch or to different switches? If all the active NICs are connected to the same switch and that switch breaks, the standby NICs will become active on the other switch perhaps 2-3 seconds later. Will the whole cluster survive if the heartbeat timeout is set to about 5 seconds?

By the way, what's the default value of the heartbeat timeout? And are there any general suggestions for this parameter in the real world?


2). Is there a way to upgrade VoltDB with near-zero downtime? I'm thinking about using replication: is it possible to upgrade the replica cluster first and keep replicating, or even re-replicate completely from the main cluster running the older version? If so, I could upgrade the replica cluster first, wait for replication to catch up, then promote the replica cluster and switch the clients over to it.

3). Still with the 3-node, K-factor=2 scenario: if two nodes go down, the third node will shut down automatically. If snapshots and command logging are enabled, could I manually change the host count to 1 and start the cluster up on the third node without data loss?

Thanks in advance.
-xiangz
jhugg
Jul 4, 2015
1). The documentation suggests putting all the nodes on the same switch. If so, wouldn't that switch be a single point of failure? In our usual setup, every host has two NICs connected to two interconnected switches; one NIC is active and the other is standby. Will this be a problem for VoltDB? Suppose there are three servers in the VoltDB cluster and the K-factor is 2: should the active NICs of the three hosts be connected to the same switch or to different switches? If all the active NICs are connected to the same switch and that switch breaks, the standby NICs will become active on the other switch perhaps 2-3 seconds later. Will the whole cluster survive if the heartbeat timeout is set to about 5 seconds?


Today, losing the TCP connection between nodes will trigger a fault event. If the switch failover happens in a way that doesn't break the TCP connection (same IP, same interface), it should be OK; TCP is pretty robust. I don't think the connection can survive the hop if the IP or the interface changes, though.

Attempting to reconnect when the connection drops within the heartbeat window has been on our to-do list for a while.

By the way, what's the default value of the heartbeat timeout? And are there any general suggestions for this parameter in the real world?


90s is the default, which is high, but pretty safe. The best number for you depends on a lot of things.
1. What else is running on the machine?
2. What kind of machine is it?
3. What software config?
4. What network config?
5. What's your tolerance for killing a node prematurely?

Many users can set this number lower, and I've seen it work as low as 2s, but it really depends.
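
If it helps: the timeout itself is set in the deployment file (if I remember the schema right, it's the timeout attribute of the heartbeat element), and you can check what a running cluster is actually using with the @SystemInformation system procedure and its DEPLOYMENT selector. A minimal Java sketch of that check, with a placeholder host name:

import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;

public class ShowDeployment {
    public static void main(String[] args) throws Exception {
        // "voltdb-node1" is a placeholder; connect to any node of your cluster.
        Client client = ClientFactory.createClient(new ClientConfig());
        client.createConnection("voltdb-node1");

        // The DEPLOYMENT selector returns the running cluster's settings as
        // property/value rows, including the configured heartbeat timeout.
        VoltTable settings =
                client.callProcedure("@SystemInformation", "DEPLOYMENT").getResults()[0];
        System.out.println(settings);

        client.close();
    }
}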

2). Is there a way to upgrade VoltDB with near-zero downtime? I'm thinking about using replication: is it possible to upgrade the replica cluster first and keep replicating, or even re-replicate completely from the main cluster running the older version? If so, I could upgrade the replica cluster first, wait for replication to catch up, then promote the replica cluster and switch the clients over to it.


We support rolling patch-release upgrades for some of our enterprise customers with special support contracts.

Typically, the best way to do general upgrades is to use replication, as you describe.

Note that depending on your data, upgrading via snapshot (save, shut down, upgrade, restore) could be very quick: seconds to a few minutes for some deployments. It could also take longer for others, depending on data size per CPU and schema complexity.
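
On the client side, the cutover itself can be as simple as draining the old connections and pointing the same code at the promoted cluster's nodes (after promoting the replica on the server side, with voltadmin promote if memory serves). A rough Java sketch of that idea; the host names are placeholders, and it assumes your client version has ClientConfig.setReconnectOnConnectionLoss:

import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;

public class ClusterCutover {
    // Placeholder host lists for the old (primary) and newly promoted clusters.
    static final String[] OLD_CLUSTER = {"olddb-1", "olddb-2", "olddb-3"};
    static final String[] NEW_CLUSTER = {"newdb-1", "newdb-2", "newdb-3"};

    static Client connect(String[] hosts) throws Exception {
        ClientConfig config = new ClientConfig();
        // Keep retrying dropped connections instead of failing immediately.
        config.setReconnectOnConnectionLoss(true);
        Client client = ClientFactory.createClient(config);
        for (String host : hosts) {
            client.createConnection(host);
        }
        return client;
    }

    public static void main(String[] args) throws Exception {
        // Normal operation against the original cluster.
        Client client = connect(OLD_CLUSTER);

        // ... run workload ...

        // At cutover time: drain in-flight work, close, and reconnect to the
        // promoted replica cluster.
        client.drain();
        client.close();
        client = connect(NEW_CLUSTER);

        // ... resume workload ...
        client.drain();
        client.close();
    }
}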

3). Still with the 3-node, K-factor=2 scenario: if two nodes go down, the third node will shut down automatically. If snapshots and command logging are enabled, could I manually change the host count to 1 and start the cluster up on the third node without data loss?


The last node may or may not shut down, depending on a few things.

Recovering automatically (restarting the database from its command logs and automated snapshots) always requires the full cluster. Relaxing this is also on our list. If you just want to restore from a full snapshot manually, you can do that on a cluster of any size if you have the memory available.
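
As a rough sketch of that manual path: take a blocking snapshot while the database is still reachable, start a fresh (smaller) database, and restore the snapshot into it with @SnapshotRestore. The host, directory, and nonce below are placeholders, I'm going from memory on the positional parameters, and depending on your version you may need to load the schema into the new database before restoring.

import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;

// Run with "save" against the old cluster, then "restore" against the freshly
// started single-node database.
public class SnapshotRoundTrip {
    static final String DIR = "/var/voltdb/snapshots";  // placeholder path
    static final String NONCE = "pre_shrink";           // placeholder snapshot ID

    public static void main(String[] args) throws Exception {
        String action = args[0];   // "save" or "restore"
        String host = args[1];     // a node of the cluster to talk to

        Client client = ClientFactory.createClient(new ClientConfig());
        client.createConnection(host);

        if (action.equals("save")) {
            // Blocking snapshot: the trailing 1 makes the call wait until the
            // snapshot is complete on all nodes.
            client.callProcedure("@SnapshotSave", DIR, NONCE, 1);
        } else {
            // Restore the saved snapshot into the new (smaller) database.
            client.callProcedure("@SnapshotRestore", DIR, NONCE);
        }

        client.close();
    }
}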