Forum: Managing VoltDB

Post: issue on changing sites per host

Sep 21, 2016
The manual states that we can change "the sites per host" via save and restore. However, in my experiments it doesn't work. My steps are below; please help check whether I missed something.

VoltDB version is 6.6.

1. Initialize a new database with the deployment.xml below:

<?xml version="1.0"?>
<deployment>
        <cluster hostcount="1" sitesperhost="16" kfactor="0" />
        <commandlog enabled="true" logsize="1024" synchronous="true">
                <frequency time="2" transactions="100"/>
        </commandlog>
        <snapshot enabled="false"/>
        <httpd enabled="true">
                <jsonapi enabled="true" />
        </httpd>
        <paths>
                <commandlog path="/opt/test/voltdbroot/cmdlog/" />
                <commandlogsnapshot path="/opt/test/voltdbroot/cmd_snapshots" />
                <snapshots path="/opt/test/voltdbroot/auto_snapshots" />
        </paths>
</deployment>

2. Start the server and create a sample test table:
create table test(a int);

3. Use "voltadmin save /tmp test" to save a snapshot.

4. voltadmin shutdown

5. Initialize a new database with sitesperhost="8" (the only change from the deployment.xml above).
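For clarity, the only line that changes in step 5's deployment.xml is the cluster element (file name and all other settings are kept as above):

```xml
<cluster hostcount="1" sitesperhost="8" kfactor="0" />
```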

6. Start the database in paused mode.

7. Use "voltadmin restore /tmp test" to restore the snapshot previously saved.

8. voltadmin resume

9. Create another test table (create table test1(a int);), then shut down the server using "voltadmin shutdown".

10. Start the database again. This time it starts OK: it restores the command log snapshot and recovers the command log correctly.

11. Create another test table: create table test2(a int);

12. Run "voltadmin shutdown" to shut down the server again, then start the database again. This time the server won't start, reporting the following error in the log file:

2016-09-21 05:48:07,064 FATAL [main] LOGGING: Command logs are incomplete, expecting 8 partitions, but only have 16
2016-09-21 05:48:07,572 FATAL [main] HOST: No replay plan generated for this host

If I remove the command log folder, the server starts correctly (restoring from the latest cmd_snapshots); however, the test2 table is missing (since that transaction exists only in the command log).

Really strange; I have reproduced the above error many times. Even when I manually saved a snapshot right after the restore step and used that newly saved snapshot to restore a new database, it reported the same error during the SECOND start.

I know there is a limitation that the number of unique partitions must be the same to recover the command log. However, I only used save/restore to change "sites per host". What are the correct steps to change sites per host?

I tried both the old voltdb create/recover commands and the new init/start commands; the result is the same error.

BTW, if this is a bug, it's a critical one: save/restore works, and even the first restart works, but the second restart fails.

Sep 21, 2016
You have to "save" then "restore" when changing sites per host, as you did in steps 3 and 7. You are trying to recover the database from command logs. Topology changes are not allowed during command log replay (recovery) in order to preserve determinism when transactions are replayed (command log recovery replays the transaction stream since the last saved snapshot).
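The save/restore path for resizing can be sketched as a command sequence like the one below (commands and paths follow the poster's setup; exact CLI flags vary by VoltDB release, so treat this as illustrative rather than authoritative):

```shell
# On the old (sitesperhost=16) database:
voltadmin save /tmp test            # take a manual snapshot
voltadmin shutdown

# Re-initialize with the new deployment.xml (sitesperhost=8),
# start in paused mode, then restore and resume:
voltdb create --deployment=deployment8.xml --pause   # flags are illustrative
voltadmin restore /tmp test
voltadmin resume
```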

Sep 21, 2016
Thanks for the quick response. Yes, I know topology changes are not allowed during command log replay (recovery). After step 7 above, I believe I had already changed sites per host from 16 to 8 (which is also confirmed by the system procedure @SystemInformation). The steps after step 7 didn't change "sites per host" anymore; there were just two restarts. Why did it report the error during the SECOND restart? Please note, the first restart after restore always works, but the second restart fails.
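The @SystemInformation check mentioned above can be run from sqlcmd, for example (the exact rows returned vary by release, so this is a sketch):

```shell
# Query cluster overview and confirm the current sites-per-host value:
sqlcmd --query="exec @SystemInformation OVERVIEW"
```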

Sep 21, 2016

We're trying it here and will let you know what we see. Your steps look correct.
Sep 21, 2016

We have reproduced the issue. It is a bug in the system and we are currently tracking down the root cause. We will update the thread once we know the workaround or have a fix.
Sep 22, 2016

I really appreciate your quick response.

Sep 28, 2016

Thank you for reporting this issue. It turns out to be a problem in recovery that occurs only when you restore to a cluster with fewer partitions. This defect will be fixed in our October release.