Issue when changing sites per host

  • Issue when changing sites per host

    The manual states that we can change "sites per host" via save and restore. However, in my experiments it doesn't work. My steps are below; please check whether I missed something.

    The VoltDB version is 6.6.

    1. Init a new database with the following deployment.xml:

    Code:
    <?xml version="1.0"?>
    <deployment>
            <cluster hostcount="1" sitesperhost="16" kfactor="0" />
            <commandlog enabled="true" logsize="1024" synchronous="true">
                    <frequency time="2" transactions="100"/>
            </commandlog>
            <snapshot enabled="false"/>
            <httpd enabled="true">
                    <jsonapi enabled="true" />
            </httpd>
            <paths>
                    <commandlog path="/opt/test/voltdbroot/cmdlog/" />
                    <commandlogsnapshot path="/opt/test/voltdbroot/cmd_snapshots" />
                    <snapshots path="/opt/test/voltdbroot/auto_snapshots" />
            </paths>
    </deployment>
    2. Start the server and create a sample test table:
    create table test(a int);

    3. Use "voltadmin save /tmp test" to save a snapshot.

    4. Run "voltadmin shutdown".

    5. Init a new database with sitesperhost="8" (the only change from the deployment.xml above).

    6. Start the database in paused mode.

    7. Use "voltadmin restore /tmp test" to restore the previously saved snapshot.

    8. Run "voltadmin resume".

    9. Create another test table (create table test1(a int);), then shut down the server using "voltadmin shutdown".

    10. Start the database again. This time it starts fine: it restores the command log snapshot and recovers the command log correctly.

    11. Create another test table: create table test2(a int);

    12. Shut down the server again with "voltadmin shutdown" and start the database again. This time the server won't start and reports the following error in the log file:

    2016-09-21 05:48:07,064 FATAL [main] LOGGING: Command logs are incomplete, expecting 8 partitions, but only have 16
    2016-09-21 05:48:07,572 FATAL [main] HOST: No replay plan generated for this host
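    Step 5 changes a single attribute of the deployment file. As an illustration only, here is a minimal Python sketch of making that one change programmatically with the standard xml.etree.ElementTree module. The helper name set_sites_per_host is hypothetical and is not part of VoltDB. ElementTree does not write the optional <?xml?> declaration back out.

```python
import xml.etree.ElementTree as ET

def set_sites_per_host(deployment_xml: str, sites: int) -> str:
    """Hypothetical helper: return deployment_xml with only <cluster sitesperhost> changed."""
    if sites < 1:
        raise ValueError("sitesperhost must be a positive integer")
    root = ET.fromstring(deployment_xml)
    cluster = root.find("cluster")  # direct child <cluster> of <deployment>
    if cluster is None:
        raise ValueError("deployment has no <cluster> element")
    cluster.set("sitesperhost", str(sites))  # every other attribute is left as-is
    return ET.tostring(root, encoding="unicode")

# Trimmed copy of the deployment from step 1.
ORIGINAL = """<?xml version="1.0"?>
<deployment>
    <cluster hostcount="1" sitesperhost="16" kfactor="0" />
    <commandlog enabled="true" logsize="1024" synchronous="true">
        <frequency time="2" transactions="100"/>
    </commandlog>
    <snapshot enabled="false"/>
</deployment>"""

updated = set_sites_per_host(ORIGINAL, 8)
print(updated)
```

    Editing the parsed tree instead of doing a text search-and-replace guarantees that only the sitesperhost attribute on <cluster> changes.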

    If I remove the command log folder, the server starts correctly (restoring from the latest cmd_snapshots); however, the test2 table is missing, since that transaction exists only in the command log.

    This is really strange; I have reproduced the error many times. Even if I manually save a snapshot right after the restore step and use that new snapshot to restore a fresh database, I get the same error on the SECOND start.

    I know there is a limitation: the number of unique partitions must be the same to recover from the command log. However, I only used save/restore to change "sites per host". What are the correct steps to change sites per host?
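    The FATAL message compares two partition counts. In VoltDB the number of unique partitions is hostcount × sitesperhost ÷ (kfactor + 1), so the step 1 deployment has 16 partitions and the step 5 deployment has 8. Below is a hypothetical Python model of the comparison the message implies. The function names are made up, it is not VoltDB's recovery code, and it does not reproduce the underlying defect; it only shows what "expecting 8 partitions, but only have 16" is comparing.

```python
from typing import Iterable

def unique_partitions(hostcount: int, sitesperhost: int, kfactor: int) -> int:
    """Unique partitions = hosts * sites per host / (kfactor + 1)."""
    return (hostcount * sitesperhost) // (kfactor + 1)

def check_command_log(expected_partitions: int, logged_partitions: Iterable[int]) -> None:
    """Hypothetical check: replay needs log data for exactly the current partition count."""
    found = len(set(logged_partitions))
    if found != expected_partitions:
        # Mirrors the wording of the FATAL message in the report above.
        raise RuntimeError(
            f"Command logs are incomplete, expecting {expected_partitions} "
            f"partitions, but only have {found}"
        )

before = unique_partitions(hostcount=1, sitesperhost=16, kfactor=0)  # step 1 deployment
after = unique_partitions(hostcount=1, sitesperhost=8, kfactor=0)    # step 5 deployment

check_command_log(after, range(after))  # log matching the 8-partition cluster: passes
try:
    check_command_log(after, range(before))  # log still describing 16 partitions
except RuntimeError as err:
    print(err)  # prints: Command logs are incomplete, expecting 8 partitions, but only have 16
```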

    I tried both the old voltdb create/recover commands and the new init/start commands, and got the same error either way.

    BTW, if this is a bug, it is a critical one: save/restore works, and even the first restart works, but the second restart fails.

    Regards,
    -Xiang

  • #2
    When changing sites per host, you have to "save" and then "restore", as you did in steps 3 and 7. Here you are trying to recover the database from command logs. Topology changes are not allowed during command log replay (recovery), in order to preserve determinism when transactions are replayed: command log recovery replays the transaction stream since the last saved snapshot.

    John
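    To see why replay needs an unchanged topology, consider how a partitioned transaction is routed by its key. The Python sketch below uses a toy router (CRC32 of the key modulo the partition count); VoltDB's actual hashing scheme is different, and the function names are hypothetical. Even in this toy, changing the partition count from 16 to 8 sends some keys to a different partition, so a transaction stream recorded under one topology would not execute on the same partitions under another.

```python
import zlib
from typing import Iterable, List

def partition_for(key: str, partitions: int) -> int:
    """Toy router: stable CRC32 hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % partitions

def rerouted_keys(keys: Iterable[str], old_partitions: int, new_partitions: int) -> List[str]:
    """Keys whose partition differs between the old and new topology."""
    return [k for k in keys
            if partition_for(k, old_partitions) != partition_for(k, new_partitions)]

keys = [f"key{i}" for i in range(100)]
moved = rerouted_keys(keys, old_partitions=16, new_partitions=8)
print(f"{len(moved)} of {len(keys)} keys land on a different partition after 16 -> 8")
```

    Modulo hashing is used only because it is short; any scheme that maps keys onto fewer partitions must move at least the keys that lived on the partitions that went away.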


    • #3
      Thanks for the quick response. Yes, I know topology changes are not allowed during command log replay (recovery). After step 7 above, I believe I had already changed sites per host from 16 to 8 (which the system procedure @SystemInformation also confirms). The steps after step 7 didn't change "sites per host" again; they are just two restarts. Why does it report an error on the SECOND restart? Please note that the first restart after the restore always works, but the second restart fails.

      -Xiang


      • #4
        Xiang,

        We're trying it here and will let you know what we see. Your steps look correct.


        • #5
          Xiang,

          We have reproduced the issue. It is a bug in the system and we are currently tracking down the root cause. We will update the thread once we have a workaround or a fix.
          Ning


          • #6
            I really appreciate your quick response.

            -Xiang


            • #7
              Xiang,

              Thank you for reporting this issue. It turns out to be a problem in recovery that occurs only when you restore to a cluster with fewer partitions. This defect will be fixed in our October release.

              Ruth
