Forum: Building VoltDB Applications

Post: Question on using CSVLoader of voltdb

Nithya.narasimhan@hp.com
Jan 29, 2014
Hi,

I am using the CSVLoader to load 10,000,000 rows into a VoltDB table. It is a partitioned table.

Command used is - ./csvloader -f test10000000.csv -m 20000000 --limitrows 15000000 customer

Deployment.xml is -

<?xml version="1.0"?>
<deployment>
<cluster hostcount="1"
sitesperhost="3"
/>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>

The performance I got is:

CSVLoader elapsed: 98.32 seconds
Number of input lines skipped: 0
Number of lines read from input: 10000000
Number of rows discovered: 10000000
Number of rows successfully inserted: 9498503
Number of rows that could not be inserted: 501497
CSVLoader rate: 96608.05 row/s

Now I added one more node to the cluster and used the following:

./csvloader -f test10000000.csv -m 20000000 --limitrows 15000000 customer --servers 172.16.251.152,172.16.251.153

My deployment.xml is:

<?xml version="1.0"?>
<deployment>
<cluster hostcount="2"
sitesperhost="3"
/>
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>
</deployment>

However, the elapsed time has doubled. Why is this happening? Should the time taken not be halved? What am I doing wrong?

Thanks in advance

Nithya
anish
Jan 29, 2014
Nithya,

CSVLoader launches a thread per partition for processing. Since you now have more partitions, the loader is not optimally managing threads based on the number of cores on your machine. We are aware of the slowness in this configuration.
To work around this issue, can you run csvloader as below and see if you get better performance?

./csvloader -f test10000000.csv -m 20000000 --limitrows 15000000 -p CUSTOMER.insert --servers 172.16.251.152,172.16.251.153

Thanks
Anish
Few more questions on CSVLoader
Nithya.narasimhan@hp.com
Jan 30, 2014
Hi,

Thank you very much for your response. After using this option, the time taken has reduced further by 16 seconds. The output is now as below:

CSVLoader elapsed: 134.508 seconds
Number of input lines skipped: 0
Number of lines read from input: 10000000
Number of rows discovered: 10000000
Number of rows successfully inserted: 9498503
Number of rows that could not be inserted: 501497
CSVLoader rate: 70616.64 row/s

But my question remains: after adding a new node, why has the time taken increased instead of decreasing? Also, is there any way to suppress the ERROR output that comes on the console?
anish
Jan 30, 2014
Nithya,

Can you send us your DDL?

Also, are you running the 2 nodes on different machines and csvloader on a third?
Can you share system information? Memory, CPU, cores, and such.

Thanks
Anish
energyd
Oct 29, 2015
Hi there,
By using the "-p CUSTOMER.insert" option, I assume you are using the default stored procedure? In that case, how does the procedure know whether the table is partitioned, so that it can act as a single-partition stored procedure?
bballard
Oct 29, 2015
The default procedures such as TABLENAME.insert are generated for each table in the database. The database therefore knows whether each table is partitioned and whether to partition the corresponding procedures as well, and it knows which parameter to use as the partitioning parameter, since the procedure's parameters are in the same order as the columns in the table.
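To make this concrete, here is a minimal DDL sketch (the column names are assumed for illustration; they are not from this thread). Declaring the table partitioned is all that is needed: VoltDB then generates CUSTOMER.insert automatically and partitions it on the same column.

```sql
-- Hypothetical schema: column names are illustrative only.
CREATE TABLE customer (
  customer_id BIGINT NOT NULL,
  name        VARCHAR(64),
  PRIMARY KEY (customer_id)
);

-- Partitioning the table causes the auto-generated default
-- procedures (CUSTOMER.insert, CUSTOMER.select, etc.) to be
-- partitioned on customer_id as well, because the insert
-- procedure's parameters follow the column order.
PARTITION TABLE customer ON COLUMN customer_id;
```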
energyd
Oct 29, 2015
Thank you!