Forum: Managing VoltDB

Post: Performing a Maintenance Window using the command-line

Aug 17, 2012
A maintenance window is a process you use to safely stop and restart VoltDB. This could be needed either to perform physical server maintenance such as replacing RAM or other hardware, or to make changes to the application catalog or the deployment configuration.

If you are using VoltDB Enterprise Manager (available with VoltDB Enterprise Edition), then you can follow the steps in the Management Guide. This post deals with the command-line process.

Because VoltDB is an in-memory database, you want to take a snapshot before stopping it so you can restore to that state. This is true for the Community Edition, and in some cases also for the Enterprise Edition when you are making changes. Since a snapshot is a point-in-time consistent copy of all the data in memory across the cluster, you want to prevent users from making additional changes before you take this final snapshot. The process for a maintenance window therefore consists of the following steps:

  1. Pause the database (disconnect users)
  2. Take a manual snapshot
  3. Shut down the database
  4. Make changes, which may include:

    1. hardware
    2. a new catalog file
    3. changes to the deployment.xml file

  5. Restart the database in admin mode
  6. Reload the data from the snapshot
  7. Resume the database (allow users to connect)

There is more information about when and why you may need to take a maintenance window for updating the application in Chapter 7 of Using VoltDB. Changes such as updates to stored procedures, new stored procedures, and new tables can be made by updating the catalog on the fly, but other changes require a maintenance window.

The step-by-step for this is in Section 9.1.3, “Changing the Database Schema or Cluster Configuration Using Save and Restore”. There, some code examples are provided, but they involve writing a small Java utility program to perform these actions. Here we are going to show how to perform these steps from the command line.

These commands will use the sqlcmd command-line program, which is located in the bin directory of the VoltDB distribution. You can run this program and then interactively enter the commands that are in quotes below, but another way is to echo the command and pipe it into sqlcmd, which is a convenient way to use sqlcmd within a script. In all of the examples below, sqlcmd is connected to localhost on port 21211. The servers parameter could be the hostname of any one of the nodes in the cluster, and 21211 is the default admin port, but you can check the deployment.xml file to see if another port has been configured instead.
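Before scripting these commands, it can help to confirm which admin port is actually configured. The following is a minimal sketch; the sample deployment.xml written here is made up for illustration, and in practice you would grep your cluster's real deployment.xml.

```shell
# Sketch: find the configured admin port in deployment.xml.
# The sample file written here is illustrative only; in practice
# you would grep the cluster's actual deployment.xml.
cat > /tmp/deployment-sample.xml <<'EOF'
<deployment>
    <cluster hostcount="1" sitesperhost="2"/>
    <admin-mode port="21211" adminstartup="false"/>
</deployment>
EOF

# Extract the admin-mode port attribute; prints: admin-mode port="21211"
grep -o 'admin-mode port="[0-9]*"' /tmp/deployment-sample.xml
```

If no admin-mode element is present, the default admin port (21211) applies.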

1. Pause the database

echo "exec @Pause" | sqlcmd --servers=localhost --port=21211

-or-

sqlcmd --servers=localhost --port=21211
sqlcmd> exec @Pause

2. Take a snapshot

echo "exec @SnapshotSave /path/to/save/dir snapshot_name 1" | sqlcmd --servers=localhost --port=21211

-or-

sqlcmd> exec @SnapshotSave /path/to/save/dir snapshot_name 1

Look at the output to verify that the snapshot was successful.
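If a script needs to confirm programmatically that the snapshot is present, VoltDB also provides the @SnapshotScan system procedure, which lists the snapshots found in a given directory. The sketch below is one way to use it; verify_snapshot is a hypothetical helper name, and the directory and snapshot name are the same placeholders as above.

```shell
# Sketch: confirm a named snapshot is present in the save directory.
# verify_snapshot is a hypothetical helper; the path and name are
# placeholders to replace with your own.
verify_snapshot() {
    local dir="$1" name="$2"
    # @SnapshotScan lists all snapshots found under the given directory;
    # grep -q succeeds only if the expected name appears in the output.
    echo "exec @SnapshotScan $dir" \
        | sqlcmd --servers=localhost --port=21211 \
        | grep -q "$name"
}

# Usage: verify_snapshot /path/to/save/dir snapshot_name && echo "snapshot found"
```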

3. Shut down the database

echo "exec @Shutdown" | sqlcmd --servers=localhost --port=21211

Note: This command always outputs an error message, because sqlcmd loses its connection to the database when the database shuts itself down. This error can be ignored.

4. Make changes

This may include hardware changes, replacing the application catalog jar file with a new one, or edits to the deployment.xml file.

5. Restart in admin mode

Before you restart the database, to have it start up in admin mode you need to make the following edit to the deployment.xml file:

<deployment>
    ...
    <admin-mode port="21211" adminstartup="true"/>
</deployment>

Then to start the database, use the voltdb command-line program, which is in the bin directory. This command takes a "startup action" parameter, which is CREATE, START, or RECOVER. CREATE starts the database with no data (i.e. empty). RECOVER recovers data from the command log (first by loading the latest snapshot taken automatically by the command log feature). START checks whether a command log is present; if so, it does a RECOVER, otherwise it does a CREATE.

In this case, we always want to use the CREATE action. This may seem counter-intuitive. The reason is that we want to avoid reloading data from the command log: if we did, the database would use the older catalog file that was in place before the database stopped. If we're taking this maintenance window to change the catalog, that would defeat the purpose.

The following is an example command:

$ voltdb create \
    host voltsvr1 \
    catalog mycatalog.jar \
    deployment deployment.xml \
    license /opt/voltdb/voltdb/license.xml

If you start the database manually like this, you want to use nohup so that it runs in the background and won't stop when the shell is closed. You could redirect STDERR and STDOUT to a log file, but voltdb already logs the same output to a file per the log4j configuration (read the voltdb startup script for more on this).

$ nohup voltdb create \
    host voltsvr1 \
    catalog mycatalog.jar \
    deployment deployment.xml \
    license /opt/voltdb/voltdb/license.xml \
    &> /dev/null < /dev/null &

This command should be run on each of the servers in the cluster. Please note that the "HOST" parameter is new in VoltDB 2.8, replacing the previous "LEADER" parameter; they are equivalent. Use the same hostname for this parameter on all the servers in the cluster.

6. Reload the snapshot

echo "exec @SnapshotRestore /path/to/save/dir snapshot_name" | sqlcmd --servers=localhost --port=21211

Look at the output to verify that the restore was successful.

7. Resume the database

echo "exec @Resume" | sqlcmd --servers=localhost --port=21211

Now the maintenance window is complete and users can connect.

Automating the Process

To minimize the maintenance window, steps 1-3 can be automated into a single script to safely take the cluster down, and steps 5-7 can be scripted to bring it back up after the changes. Depending on the changes in step 4, it may be possible to automate that step as well.
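As one possible shape for the take-down half, here is a minimal sketch of steps 1-3 as shell functions. The host, port, directory, and snapshot name are placeholders taken from the examples above; adapt them to your cluster, and script steps 5-7 the same way with @SnapshotRestore and @Resume.

```shell
#!/bin/bash
# Sketch: automate steps 1-3 (pause, snapshot, shutdown).
# ADMIN_HOST, ADMIN_PORT, SNAPSHOT_DIR, and SNAPSHOT_NAME are
# placeholder defaults; override them in the environment as needed.
ADMIN_HOST=${ADMIN_HOST:-localhost}
ADMIN_PORT=${ADMIN_PORT:-21211}
SNAPSHOT_DIR=${SNAPSHOT_DIR:-/path/to/save/dir}
SNAPSHOT_NAME=${SNAPSHOT_NAME:-maintenance}

run_admin() {
    # Pipe a single system-procedure call into sqlcmd on the admin port.
    echo "$1" | sqlcmd --servers="$ADMIN_HOST" --port="$ADMIN_PORT"
}

maintenance_down() {
    run_admin "exec @Pause"
    run_admin "exec @SnapshotSave $SNAPSHOT_DIR $SNAPSHOT_NAME 1"
    # @Shutdown drops the connection, so sqlcmd reports an error; ignore it.
    run_admin "exec @Shutdown" || true
}
```

Call maintenance_down once the functions are defined; checking the @SnapshotSave output before issuing @Shutdown is a worthwhile addition in a production script.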

The attached scripts show a basic example of automating these steps.
Aug 17, 2012
Here are the script attachments.