Forum: Other

Post: VoltDB heap is exhausted when many MP transactions fail

Dmtry
Aug 24, 2015
Hi, I've encountered the following problem while running a simple load test.

Preconditions
A single VoltDB node is running with 5 partitions and the following schema:

CREATE TABLE TABLE1 (
    id INTEGER NOT NULL,
    CONSTRAINT pk_TABLE1 PRIMARY KEY (id)
);

Please note that the table is replicated - the problem does not appear with a partitioned table!
Scenario
The test calls the 'TABLE1.insert' procedure repeatedly with the same key (1), so the repeated calls fail with a 'CONSTRAINT VIOLATION' error.
Result
After a short period (a few minutes) VoltDB stops responding to requests.
Expected Result
VoltDB continues to operate.

I investigated, and it appears that the RepairLog structure keeps growing with entries that reference FragmentTaskMessage and CompleteTransactionMessage instances. As a result, the JVM's old generation fills up completely with these messages, and the VM is forced to garbage-collect continuously instead of serving requests.
I tried modifying the code to disable recording to the RepairLog, and the problem seems to be gone.
Could anyone please help me figure out why the RepairLog is needed? Is it possible to disable it without impacting reliability?
Thank you in advance!
nshi
Aug 24, 2015
The RepairLog only holds on to the series of FragmentTaskMessages if the replicated table insertions roll back. If multi-partition write transactions (e.g. replicated table insertions) commit successfully, the repair log is truncated regularly.

I have filed a ticket to optimize this.

https://issues.voltdb.com/browse/ENG-8888
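
To illustrate the behaviour described in this reply, here is a minimal toy model in Java. It is not VoltDB's actual RepairLog code: the class ToyRepairLog, its methods, and the rule "a commit truncates everything up to and including that transaction" are all assumptions made for the sketch. It only shows why a stream of rolled-back MP writes makes the log grow, while an occasional commit keeps it short.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only. The names and the truncation rule are hypothetical and
// do not reproduce VoltDB's real RepairLog implementation.
final class ToyRepairLog {
    // One logged message, tagged with the transaction it belongs to.
    private static final class Entry {
        final long txnId;
        final String kind;

        Entry(long txnId, String kind) {
            this.txnId = txnId;
            this.kind = kind;
        }
    }

    private final Deque<Entry> log = new ArrayDeque<>();

    // Log the two messages named in the thread for one MP write, then
    // truncate only if the transaction committed (assumed rule).
    void recordTransaction(long txnId, boolean committed) {
        log.addLast(new Entry(txnId, "FragmentTaskMessage"));
        log.addLast(new Entry(txnId, "CompleteTransactionMessage"));
        if (committed) {
            truncateThrough(txnId);
        }
    }

    // Drop every entry that belongs to a transaction at or before txnId.
    private void truncateThrough(long txnId) {
        while (!log.isEmpty() && log.peekFirst().txnId <= txnId) {
            log.removeFirst();
        }
    }

    int size() {
        return log.size();
    }

    public static void main(String[] args) {
        // Case 1: every insert violates the constraint and rolls back.
        ToyRepairLog onlyRollbacks = new ToyRepairLog();
        for (long txn = 1; txn <= 1000; txn++) {
            onlyRollbacks.recordTransaction(txn, false);
        }
        System.out.println("only rollbacks: " + onlyRollbacks.size());

        // Case 2: every 100th transaction commits and triggers truncation.
        ToyRepairLog mixed = new ToyRepairLog();
        for (long txn = 1; txn <= 1050; txn++) {
            mixed.recordTransaction(txn, txn % 100 == 0);
        }
        System.out.println("with commits:   " + mixed.size());
    }
}
```

In this model the first log ends with 2000 entries and the second with 100, because nothing ever truncates the first one. The real truncation logic is more involved; the sketch only mirrors the shape of the problem.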
Dmtry
Aug 26, 2015
Ok thanks!
To help me better understand how VoltDB works, could you please explain briefly what the main activities of the 'repair' process (the one that uses the RepairLog) are, and when it is executed?
Do I understand correctly that it is related to synchronizing a rejoining node? Most of the data is synchronized using the snapshot that is sent to the rejoining node, so the RepairLog only needs to hold items from the time the snapshot was made until the rejoining node finishes synchronizing. Is that correct?
nshi
Aug 26, 2015
The RepairLog holds in-flight transactions so that partition replicas can resume operation in case of partition leader failures.

A rejoining node uses a separate backlog to hold the transactions that are executed while the snapshot is being synchronized. Once rejoin finishes, the backlog is cleared.
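
As a rough picture of the rejoin backlog described above, here is a small Java sketch. It is a toy model, not VoltDB code: the class ToyRejoinBacklog and its methods are hypothetical, and the step that applies the buffered transactions before clearing the backlog is an assumption, since the reply only says the backlog is held and then cleared.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only. Names and the replay step are hypothetical.
final class ToyRejoinBacklog {
    private final List<Long> backlog = new ArrayList<>();
    private final List<Long> applied = new ArrayList<>();
    private boolean snapshotInProgress = true;

    // While the snapshot is still being synchronized, buffer the
    // transaction; afterwards, apply it directly.
    void onTransaction(long txnId) {
        if (snapshotInProgress) {
            backlog.add(txnId);
        } else {
            applied.add(txnId);
        }
    }

    // Rejoin is done: apply what was buffered (assumed), then clear it.
    void finishRejoin() {
        applied.addAll(backlog);
        backlog.clear();
        snapshotInProgress = false;
    }

    int backlogSize() {
        return backlog.size();
    }

    int appliedCount() {
        return applied.size();
    }

    public static void main(String[] args) {
        ToyRejoinBacklog node = new ToyRejoinBacklog();
        node.onTransaction(1L);
        node.onTransaction(2L);
        System.out.println("buffered during snapshot: " + node.backlogSize());
        node.finishRejoin();
        System.out.println("buffered after rejoin:    " + node.backlogSize());
    }
}
```

The point of the sketch is that this buffer lives only for the duration of a rejoin, which is why it is separate from the RepairLog.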