VoltDB heap is exceeded when many MP transactions fail

  • VoltDB heap is exceeded when many MP transactions fail

    Hi, I've encountered the following problem while running a simple load test.

    Preconditions
    A single VoltDB node runs with 5 partitions and the following schema:
    Code:
    CREATE TABLE TABLE1 (
        id INTEGER NOT NULL,
        CONSTRAINT pk_TABLE1 PRIMARY KEY (id)
    );
    Please note that the table is replicated. The problem does not appear with a partitioned table!
    Scenario
    The test calls the 'TABLE1.insert' procedure repeatedly with a single key (1), so the calls fail with a 'CONSTRAINT VIOLATION' error.
    Result
    After a short period (a few minutes) VoltDB stops responding to requests.
    Expected Result
    VoltDB continues to operate.

    I've done some investigation, and it appears that the RepairLog structure swells with entries that reference FragmentTaskMessage and CompleteTransactionMessage instances. As a result, the 'Old generation' area of the heap fills up completely with these messages, and the JVM is forced to garbage-collect continuously instead of serving requests.
    I tried modifying the code to disable recording to the RepairLog, and the problem seems to go away.
    Could anyone please help me understand why the RepairLog is needed? Is it possible to disable it without affecting reliability?
    Thank you in advance!

  • #2
    The RepairLog will only hold on to the series of FragmentTaskMessages if the replicated table insertions roll back. If there are successfully committed multi-partition write transactions (e.g. replicated table insertions), the repair log will be truncated regularly.
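
The behaviour described above can be illustrated with a toy Java model. This is only a sketch and not VoltDB's actual RepairLog: the class ToyRepairLog, its methods, and the rule "a committed transaction truncates everything up to its id" are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: NOT VoltDB's RepairLog. All names here are hypothetical.
class ToyRepairLog {
    // One logged message: the transaction it belongs to and a label for its kind.
    static final class Entry {
        final long txnId;
        final String kind;
        Entry(long txnId, String kind) { this.txnId = txnId; this.kind = kind; }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    // Log the two messages of one multi-partition transaction.
    void logTransaction(long txnId, boolean committed) {
        entries.addLast(new Entry(txnId, "FragmentTaskMessage"));
        entries.addLast(new Entry(txnId, "CompleteTransactionMessage"));
        if (committed) {
            // Assumption: only a successful commit advances the truncation point.
            truncate(txnId);
        }
    }

    // Drop every entry whose transaction id is at or below the given id.
    void truncate(long upToTxnId) {
        while (!entries.isEmpty() && entries.peekFirst().txnId <= upToTxnId) {
            entries.removeFirst();
        }
    }

    int size() { return entries.size(); }
}
```

In this model, a run of transactions that all roll back never reaches the truncation branch, so the log keeps growing until one transaction commits. That matches the symptom in the first post, where every insert fails with a constraint violation.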

    I have filed a ticket to optimize this.

    https://issues.voltdb.com/browse/ENG-8888
    Ning

    • #3
      Ok thanks!
      To help me better understand how VoltDB works, could you please explain a bit about what the 'repair' process (the one that uses the RepairLog) does and when it is executed?
      Do I understand correctly that it is related to the synchronization of a rejoining node? Most of the data is synchronized using the snapshot that is sent to the rejoining node, so the RepairLog would only need to hold items from the time the snapshot was made until the rejoining node's synchronization completes. Is that correct?

      • #4
        The RepairLog holds in-flight transactions so that partition replicas can resume operation in case of partition leader failures.

        A rejoining node uses a separate backlog to hold the transactions that are executed while the snapshot is being synchronized. Once the rejoin finishes, the backlog is cleared.
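
The rejoin backlog described above can be sketched in the same toy style. Again, this is not VoltDB code: ToyRejoinBacklog and its methods are hypothetical, and replaying the buffered transactions before clearing the backlog is an assumption about what "holding" them is for.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT VoltDB's rejoin implementation. All names are hypothetical.
class ToyRejoinBacklog {
    private final List<Long> backlog = new ArrayList<>();
    private final List<Long> applied = new ArrayList<>();
    private boolean snapshotInProgress = true;

    // Transactions arriving during the snapshot are buffered; later ones are applied directly.
    void onTransaction(long txnId) {
        if (snapshotInProgress) {
            backlog.add(txnId);
        } else {
            applied.add(txnId);
        }
    }

    // Assumption: when the snapshot is done, buffered transactions are replayed in order,
    // then the backlog is cleared.
    void finishRejoin() {
        applied.addAll(backlog);
        backlog.clear();
        snapshotInProgress = false;
    }

    int backlogSize() { return backlog.size(); }

    List<Long> applied() { return applied; }
}
```

Unlike the repair log in the earlier posts, this buffer has a bounded lifetime in the model: it only exists between the start of the snapshot and the call to finishRejoin().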
        Ning
