Forum: Managing VoltDB

Post: Warnings in the log file when Command Log enabled

Warnings in the log file when Command Log enabled
MarcinD
Feb 3, 2014
Hi,

I get the following warnings in the logs when command log is enabled:


WARN: Snapshot save feasibility test failed for host 0 table  with error message SNAPSHOT IN PROGRESS



WARN: Network was more than two seconds late in updating the estimated time


What does these messages mean and what impact they have?

Regards,
Marcin
nshi
Feb 3, 2014

WARN: Snapshot save feasibility test failed for host 0 table  with error message SNAPSHOT IN PROGRESS


When command logging is enabled, the system will try to issue a snapshot to truncate the command log when certain amount of log has been written to disk. If the data is large enough, the previous snapshot may not finish before a new one is requested. The system will retry after a minute or so. Meanwhile, transactions will still be logged.

The impact of this is that you will have a relatively large command log to recover from if the cluster crashes.


WARN: Network was more than two seconds late in updating the estimated time


This could be caused by long Java GCs or system scheduling problems. If you see long GC pauses in the log file, it means that the system is using a lot of Java heap. If you could share your log files, it'll help determining what may have caused it.

If network is being delayed often, it may affect the performance of the system. You may see high latency numbers for transactions.
MarcinD
Feb 4, 2014
Hi Ning,

Thanks for your reply. Please find the log attached.

Thanks,
Marcin
nshi
Feb 4, 2014
Hi Marcin,

The network delays may not be caused by Java GCs. There are some GCs that took about 500ms, but not very often.

The network delay caused one of the client connection to be closed prematurely.


2014-02-04 09:49:12,778   WARN  [Periodic Work] NETWORK: Closing connection to org.voltcore.network.VoltPort$1@6657395e:dub-001784-VM01/10.0.22.6:38842 at Tue Feb 04 09:49:12 GMT+00:00 2014 because it refuses to read responses


The server closed the client connection because the network thread couldn't send anything to that client for more than 4 seconds. It is correlated to the network thread being delayed for more than 2 seconds.

The server started 2 network threads and 2 site threads, so there were at least 4 threads contending for CPU. It seems like that the client was also running on the same server, generating heavy workload. So the network thread might be delayed because of OS scheduling.

Snapshot initiation taking more than 3.6 seconds implies that the disk might be too busy writing command logs. You can consider putting snapshots and command logs on two different disks.

The following steps may help relieve the contention on CPU.

  • Run the client on a separate machine
  • Reduce the client load to 80% of the peak
  • Reduce the number of sites to 1
MarcinD
Feb 4, 2014
Hi Ning,

Could you possibly have a look at the other 4 log files and see if the same issue causes a problem. The previous log was taken from a configuration of just one host, because I only had that one handy as I had deleted all the logs from my 'proper test run' and couldn't reproduce the issues for a while. In this case the cluster consists of 4 nodes and client is running on one of those machines also. Don't mind "No space left on device" ERROR at the end of the logs as in this particular run I was trying command log with the size of 40GB which turned out to be too big for my disk resources.

Thanks,
Marcin
nshi
Feb 4, 2014

Could you possibly have a look at the other 4 log files and see if the same issue causes a problem.


I don't see any files attached to this post. Maybe you forgot to upload them?


Don't mind "No space left on device" ERROR at the end of the logs as in this particular run I was trying command log with the size of 40GB which turned out to be too big for my disk resources.


The larger the size of the command log, the longer it takes to replay. The system will try to truncate the log when half of the segments are full. You can set the log size to a smaller number so that the system will try to snapshot more often to truncate the segments.
MarcinD
Feb 5, 2014
Hi Ning,

Sorry forgot to attach files at first but I did few minutes later so you should see them. I have tried various sizes of 'command log' (1GB, 5GB, 10GB, 15GB, 20GB, 40GB) It didn't change the fact I kept getting the message below:


WARN: Snapshot save feasibility test failed for host 0 table  with error message SNAPSHOT IN PROGRESS


Regards,
Marcin
nshi
Feb 5, 2014
Hi Marcin,

The message you saw is fine. A complete snapshot of your system takes about 466 seconds to finish. Any attempt to snapshot again when a snapshot is already in progress will generate that warning message, telling you that the attempt failed and it will be retried later.

To speed up snapshot, you can separate the snapshot and command log directories onto two different drives.