Forum: Building VoltDB Applications

Post: Running voter example on multiple machines

Running voter example on multiple machines
tuancao
May 15, 2010
Hi,

I am trying to benchmark the k-safety feature that VoltDB has just provided. My first step is to run the voter example on a cluster of 4 nodes (intended for 3 servers and 1 client). Unfortunately, the server could not connect to the leader node.
Below is my build.xml:
************************************START build.xml***********************************************************

<?xml version="1.0" ?>
<project default="main" name="build file">

<!-- *************************************** PATHS AND PROPERTIES *************************************** -->

<property name='build.dir'             location='obj/' />
<property name='src.dir'               location='src/' />
<property name='debugoutput.dir'       location='debugoutput/' />

<path id='project.classpath'>
    <fileset dir='../../voltdb' >
        <include name='voltdb*.jar' />
    </fileset>
    <pathelement location='${build.dir}' />
    <pathelement path="${java.class.path}"/>
</path>

<!-- *************************************** PRIMARY ENTRY POINTS *************************************** -->

<target name="main" depends="srccompile, catalog" description="Default. Compile Java clients and stored procedures, then run >

<target name="server" depends="srccompile, catalog" description="Start VoltDB Server.">
    <java fork="yes" classname="org.voltdb.VoltDB">
        <jvmarg value="-Djava.library.path=../../voltdb" />
        <jvmarg value="-server"/>
        <jvmarg value="-Xmx2048m"/>
        <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError" />
        <jvmarg value="-XX:HeapDumpPath=/tmp" />
        <jvmarg value="-XX:-ReduceInitialCardMarks" />
        <arg value="catalog"/>
        <arg value="catalog.jar"/>
        <classpath refid='project.classpath'/>
        <assertions><disable/></assertions>
    </java>
</target>

<target name="client" depends="srccompile" description="Start the client application.">
    <java fork="yes" classname="com.ClientVoter">
        <jvmarg value="-Xmx512m"/>
        <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError" />
        <jvmarg value="-XX:HeapDumpPath=/tmp" />
        <jvmarg value="-XX:-ReduceInitialCardMarks" />
        <arg value="6"/>                                <!-- total number of contestants (maximum 12) -->
        <arg value="2"/>                                <!-- number of votes allowed per phone number -->
        <arg value="100000"/>                           <!-- maximum number of votes per second this client can generate -->
        <arg value="5"/>                                <!-- client application feedback interval (seconds) -->
        <arg value="120"/>                              <!-- client application duration (seconds) -->
        <arg value="3"/>                                <!-- number of seconds to wait before recording latency information ->
        <arg value="wl10.cac.cornell.edu, wl11.cac.cornell.edu, wl13.cac.cornell.edu"/>                        <!-- comma sep>
        <classpath refid='project.classpath'/>
        <assertions><disable/></assertions>
    </java>
</target>

<target name="catalog" depends="srccompile" description="Create the catalog." >
    <java fork="yes" failonerror="true"
          classname="org.voltdb.compiler.VoltCompiler" >
        <jvmarg value="-Djava.library.path=../../voltdb" />
        <arg value="project.xml"/>                  <!-- project file -->
        <arg value="4"/>                            <!-- hosts -->
        <arg value="2"/>                            <!-- sites -->
        <arg value="wl10.cac.cornell.edu"/>         <!-- leader -->
        <arg value="catalog.jar"/>                  <!-- output -->
        <classpath refid='project.classpath' />
        <assertions><disable /></assertions>
    </java>
</target>

<!-- *************************************** CLEANING *************************************** -->

<target name='clean' description="Remove all compiled files.">
    <delete includeemptydirs="true" failonerror='false'>
        <fileset dir="${build.dir}" includes="**/*" />
        <fileset dir="${debugoutput.dir}" includes="**/*" />
        <fileset dir="." defaultexcludes="yes" >
            <include name="catalog.jar" />
        </fileset>
    </delete>
</target>

<!-- *************************************** JAVA COMPILATION *************************************** -->

<target name="srccompile">
    <mkdir dir='${build.dir}' />
    <javac target="1.6" srcdir="${src.dir}" destdir='${build.dir}' debug='true'>
        <classpath refid="project.classpath" />
    </javac>
</target>

</project>
************************************END build.xml***********************************************************
Basically, I kept the server part untouched.
For the client part, I provided a comma-separated list of servers to connect to, i.e. wl10.cac.cornell.edu, wl11.cac.cornell.edu, wl13.cac.cornell.edu

For the catalog part, I changed the hosts field to 4 and set the leader field to one of the servers, i.e. wl10.cac.cornell.edu
My home directory on the cluster is NFS mounted.
On wl10, I compiled the catalog successfully by:

[cat82@wl01 voter]$ ant
Buildfile: build.xml
.........
BUILD SUCCESSFUL
Then, I started the server:

[cat82@wl10 voter]$ ant server
...................
[java] 19 [main] INFO HOST - Build: 0.9.01 https://svn.voltdb.com/eng/trunk?revision=475
[java] 34 [main] INFO HOST - HTTP admin console listening on port 8080
[java] 34 [main] INFO HOST - Loading application catalog jarfile from /home/fs01/cat82/tuandev/voltdb-0.9.01/examples/voter/catalog.jar
[java] 118 [main] INFO HOST - Creating host manager for 4 hosts using leader wl10.cac.cornell.edu/10.84.3.60
[java] 126 [Thread-3] INFO HOST - Connecting to VoltDB cluster as the leader...
but it is stuck at "Connecting to VoltDB cluster as the leader".

Could you please help to solve this problem?

Thanks,
Tuan
Re: Running voter example on multiple machines
aris_sety
May 15, 2010
Hi Tuancao,

Did you start the other servers? The leader waits for the other servers to join it. Because you set the hosts field to 4, you must start 4 nodes as servers (with wl10.cac.cornell.edu as the leader) to form a cluster. As you have 3 servers, set it to 3. Alternatively, in your configuration, you can use the last node as both server and client.
Running voter example on multiple machines
tuancao
May 15, 2010
Hi,

Thank you very much for replying to me on the weekend.

I changed the hosts field to 3 and ran ant to compile it again.

Then on node wl10, I did:
[cat82@wl10 voter]$ ant server
Buildfile: build.xml

srccompile:

catalog:
[java] No logging configuration supplied via -Dlog4j.configuration. Supplying default config that logs to INFO or higher to STDOUT
[java] ** BEGIN PROJECT COMPILE: project.xml **
[java] 17 [main] INFO COMPILER - Catalog leader: wl10.cac.cornell.edu hosts, sites 3, 2
[java] 245 [main] INFO COMPILER - Path to catalog ddl.sql
[java] INFO [Initialize.class]: Compiling Statement: select count(*) from contestants;
[java] INFO [Initialize.class]: Compiling Statement: insert into contestants (contestant_name, contestant_number) values (?, ?);
[java] INFO [Results.class]: Compiling Statement: select a.contestant_name c1, sum(b.num_votes) c2 from v_votes_by_contestant_number b, contestants a where a.contestant_number = b.contestant_number group by a.contestant_name order by a.contestant_name;
[java] INFO [Vote.class]: Compiling Statement: select contestant_number from contestants where contestant_number = ?;
[java] INFO [Vote.class]: Compiling Statement: select num_votes from v_votes_by_phone_number where phone_number = ?;
[java] INFO [Vote.class]: Compiling Statement: insert into votes (phone_number, contestant_number) values (?, ?);

server:
[java] No logging configuration supplied via -Dlog4j.configuration. Supplying default config that logs to INFO or higher to STDOUT
[java] 11 [main] INFO HOST - Loading...

[java] _ __ ____ ____ ____
[java] | | / /___ / / /_/ __ \/ __ )
[java] | | / / __ \/ / __/ / / / __ |
[java] | |/ / /_/ / / /_/ /_/ / /_/ /
[java] |___/\____/_/\__/_____/_____/

[java] Initialization Log Output:
[java] --------------------------------

[java] 19 [main] INFO HOST - Build: 0.9.01 https://svn.voltdb.com/eng/trunk?revision=475
[java] 33 [main] INFO HOST - HTTP admin console listening on port 8080
[java] 33 [main] INFO HOST - Loading application catalog jarfile from /home/fs01/cat82/tuandev/voltdb-0.9.01/examples/voter/catalog.jar
[java] 113 [main] INFO HOST - Creating host manager for 3 hosts using leader wl10.cac.cornell.edu/10.84.3.60
[java] 121 [Thread-3] INFO HOST - Connecting to VoltDB cluster as the leader...

.......... Comment: it waited here until I started the 2 other servers, then it crashed ..........

[java] java.lang.Thread.dumpThreads(Native Method)
[java] 22976 [Thread-3] INFO HOST - Maximum clock/network skew is 2 milliseconds (according to leader)
[java] java.lang.Thread.getAllStackTraces(Thread.java:1487)
[java] 22976 [Thread-3] INFO HOST - Catalog checksums do not match across cluster
[java] org.voltdb.VoltDB.crashVoltDB(VoltDB.java:210)
[java] org.voltdb.messaging.SocketJoiner.runPrimary(SocketJoiner.java:286)
[java] org.voltdb.messaging.SocketJoiner.run(SocketJoiner.java:132)
[java] VoltDB has encountered an unrecoverable error and is exiting.
[java] The log may contain additional information.
[java] Java Result: 255

BUILD SUCCESSFUL
Total time: 25 seconds

It was waiting at : [java] 121 [Thread-3] INFO HOST - Connecting to VoltDB cluster as the leader...

Then I logged in to node wl11, and did:
[cat82@wl11 voter]$ ant server
............................
[java] 32 [main] INFO HOST - Loading application catalog jarfile from /home/fs01/cat82/tuandev/voltdb-0.9.01/examples/voter/catalog.jar
[java] java.lang.Thread.dumpThreads(Native Method)
[java] java.lang.Thread.getAllStackTraces(Thread.java:1487)
[java] org.voltdb.VoltDB.crashVoltDB(VoltDB.java:210)
[java] org.voltdb.messaging.SocketJoiner.runNonPrimary(SocketJoiner.java:413)
[java] org.voltdb.messaging.SocketJoiner.run(SocketJoiner.java:137)
[java] VoltDB has encountered an unrecoverable error and is exiting.
[java] The log may contain additional information.
[java] 114 [main] INFO HOST - Creating host manager for 3 hosts using leader wl10.cac.cornell.edu/10.84.3.60
[java] 121 [Thread-3] INFO HOST - Connecting to the VoltDB cluster leader...
[java] 128 [Thread-3] INFO HOST - Maximum clock/network skew is 2 milliseconds (according to leader)
[java] 128 [Thread-3] INFO HOST - Catalog checksums do not match across cluster
[java] Java Result: 255

BUILD SUCCESSFUL
Total time: 3 seconds

and on node wl13:
[cat82@wl13 voter]$ ant server
..................................
[java] 10293 [Thread-3] INFO HOST - Maximum clock/network skew is 2 milliseconds (according to leader)
[java] 10293 [Thread-3] INFO HOST - Catalog checksums do not match across cluster
[java] java.lang.Thread.dumpThreads(Native Method)
[java] java.lang.Thread.getAllStackTraces(Thread.java:1487)
[java] org.voltdb.VoltDB.crashVoltDB(VoltDB.java:210)
[java] org.voltdb.messaging.SocketJoiner.runNonPrimary(SocketJoiner.java:413)
[java] org.voltdb.messaging.SocketJoiner.run(SocketJoiner.java:137)
[java] VoltDB has encountered an unrecoverable error and is exiting.
[java] The log may contain additional information.
[java] Java Result: 255

BUILD SUCCESSFUL
Total time: 13 seconds

What does that mean: "Catalog checksums do not match across cluster"??

Thanks,
Tuan
This means your catalogs aren't identical.
jhugg
May 15, 2010
Hi Tuan,

This is a new safety check in version 0.9.01. VoltDB reads the compiled application catalog jar files at each node. It computes a checksum of these files and compares with the other nodes. If the files aren't bitwise-identical, then the cluster will fail to start.

This is an easy problem to run into if you compile your application catalog three times, once on each of your three nodes. Our compiler doesn't always generate the exact same file for the exact same input. However, it could also mean you are running an outdated or non-matching application catalog file at one of your nodes. That would be bad.

An easy fix is to generate the application catalog file in one place, then copy the jar file to each of your three nodes.
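
To confirm that every node really sees the same file before starting the servers, one option is to compare a digest of catalog.jar on each node. The sketch below is only an illustration: this thread does not say which checksum VoltDB computes, so SHA-256 is used as a stand-in, and `CatalogDigest` is a hypothetical helper, not part of VoltDB.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of VoltDB: prints a digest of a catalog jar
// so the value can be compared by hand across nodes.
class CatalogDigest {

    // Returns the SHA-256 digest of the file's bytes as lowercase hex.
    // SHA-256 is an assumption; VoltDB's own checksum may differ.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to catalog.jar in the working directory, as in the voter example.
        Path jar = Paths.get(args.length > 0 ? args[0] : "catalog.jar");
        System.out.println(sha256Hex(jar) + "  " + jar);
    }
}
```

Run it on each node against the jar the server will load. If the printed digests differ, the files are not bitwise-identical.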

Here's the related issue if you're curious: ENG-432. I've also created a follow-up issue to improve the way this process works: ENG-543. Thanks for posting.

-John Hugg
VoltDB Engineering
Hi John, You meant the
tuancao
May 15, 2010
Hi John,

You mean the application catalog file is catalog.jar, right?
My home directory on all my nodes is NFS mounted, so I only need to compile once on any one node, and all three nodes should use the same catalog.jar.

I tried ant clean, and then ant to build catalog.jar one more time, but I faced the same problem: when I ran ant server on 2 nodes, wl10 and wl11 (where wl10 is the leader node), these 2 nodes got stuck at:
[java] 119 [Thread-3] INFO HOST - Connecting to VoltDB cluster as the leader...

As soon as I ran ant server on the 3rd node, wl13, all 3 nodes gave the same errors:
[java] 130 [Thread-3] INFO HOST - Maximum clock/network skew is 4 milliseconds (according to leader)

[java] 21198 [Thread-3] INFO HOST - Catalog checksums do not match across cluster

By the way, all my nodes are running Red Hat Enterprise Linux. I hope that does not create any problem.

Any idea?

Thanks,
Tuan
Yes, that file must be identical
aris_sety
May 15, 2010
Tuan,
You must build the catalog on one node and distribute the resulting file (catalog.jar) to the other 2 nodes.
You can read another post at https://community.voltdb.com/node/104; she had the same problem as you.

I don't think it matters that you use Red Hat Enterprise Linux.
-Aris
It looks like the server ant
rbetts
May 15, 2010
It looks like the server ant target might recompile the catalog as a dependency. In this case, three servers using a single NFS mounted home directory are perhaps all writing the same file(s) as they start?

I suggest removing the compilation and catalog dependencies from your server target if these servers are racing to write to the same file system. Build the catalog as an independent step. Start the servers as a separate ant target.

--Ryan.
Hi Ryan, Thank you very much.
tuancao
May 16, 2010
Hi Ryan,

Thank you very much. After removing the catalog dependency from the server target, the servers work fine. I could start 3 servers on my three nodes wl10, wl11 and wl13.

But when I tried to start the client on another node, wl08, it failed:
[cat82@wl08 voter]$ ant client
Buildfile: /home/fs01/cat82/tuandev/voltdb-0.9.01/examples/voter/build.xml

srccompile:
[javac] /home/fs01/cat82/tuandev/voltdb-0.9.01/examples/voter/build.xml:102: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

client:
[java] Allowing 2 votes per phone number
[java] Submitting 100,000 SP Calls/sec
[java] Feedback interval = 5 second(s)
[java] Running for 120 second(s)
[java] Latency not recorded for 3 second(s)
[java] No logging configuration supplied via -Dlog4j.configuration. Supplying default config that logs to INFO or higher to STDOUT
[java] Connecting to server: wl10.cac.cornell.edu
[java] Connecting to server: wl11.cac.cornell.edu
[java] Exception in thread "main" java.nio.channels.UnresolvedAddressException
[java] at sun.nio.ch.Net.checkAddress(Net.java:30)
[java] at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:487)
[java] at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
[java] at org.voltdb.client.ConnectionUtil.getAuthenticatedConnection(ConnectionUtil.java:94)
[java] at org.voltdb.client.Distributer.createConnection(Distributer.java:448)
[java] at org.voltdb.client.Distributer.createConnection(Distributer.java:442)
[java] at org.voltdb.client.ClientImpl.createConnection(ClientImpl.java:113)
[java] at com.ClientVoter.main(ClientVoter.java:165)
[java] Java Result: 1

Here is the client part of my build.xml:

<target name="client" depends="srccompile" description="Start the client application.">
    <java fork="yes" classname="com.ClientVoter">
        <jvmarg value="-Xmx512m"/>
        <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError" />
        <jvmarg value="-XX:HeapDumpPath=/tmp" />
        <jvmarg value="-XX:-ReduceInitialCardMarks" />
        <arg value="6"/>                                <!-- total number of contestants (maximum 12) -->
        <arg value="2"/>                                <!-- number of votes allowed per phone number -->
        <arg value="100000"/>                           <!-- maximum number of votes per second this client can generate -->
        <arg value="5"/>                                <!-- client application feedback interval (seconds) -->
        <arg value="120"/>                              <!-- client application duration (seconds) -->
        <arg value="3"/>                                <!-- number of seconds to wait before recording latency information -->
        <arg value="wl10.cac.cornell.edu, wl11.cac.cornell.edu, wl13.cac.cornell.edu"/>                        <!-- comma separated list of servers to connect to -->
        <classpath refid='project.classpath'/>
        <assertions><disable/></assertions>
    </java>
</target>



BUILD SUCCESSFUL
Total time: 1 second

From node wl08, I could ping all the servers:

[cat82@wl08 voter]$ ping wl11.cac.cornell.edu
PING wl11.cac.cornell.edu (10.84.3.61) 56(84) bytes of data.
64 bytes from wl11.cac.cornell.edu (10.84.3.61): icmp_seq=1 ttl=64 time=0.382 ms
64 bytes from wl11.cac.cornell.edu (10.84.3.61): icmp_seq=2 ttl=64 time=0.124 ms

--- wl11.cac.cornell.edu ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.124/0.253/0.382/0.129 ms
[cat82@wl08 voter]$ ping wl10.cac.cornell.edu
PING wl10.cac.cornell.edu (10.84.3.60) 56(84) bytes of data.
64 bytes from wl10.cac.cornell.edu (10.84.3.60): icmp_seq=1 ttl=64 time=0.118 ms
64 bytes from wl10.cac.cornell.edu (10.84.3.60): icmp_seq=2 ttl=64 time=0.129 ms
64 bytes from wl10.cac.cornell.edu (10.84.3.60): icmp_seq=3 ttl=64 time=0.106 ms

--- wl10.cac.cornell.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.106/0.117/0.129/0.015 ms
[cat82@wl08 voter]$ ping wl13.cac.cornell.edu
PING wl13.cac.cornell.edu (10.84.3.63) 56(84) bytes of data.
64 bytes from wl13.cac.cornell.edu (10.84.3.63): icmp_seq=1 ttl=64 time=1.33 ms
64 bytes from wl13.cac.cornell.edu (10.84.3.63): icmp_seq=2 ttl=64 time=0.110 ms
64 bytes from wl13.cac.cornell.edu (10.84.3.63): icmp_seq=3 ttl=64 time=0.100 ms

--- wl13.cac.cornell.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.100/0.513/1.330/0.577 ms

Any idea why it failed?

Thanks,
Tuan
Reproduced the host resolution error.
rbetts
May 16, 2010
Tuan,

I believe there must be a defect in the parsing of the final hostname. I can reproduce this just using:

I filed https://issues.voltdb.com/browse/ENG-544 to improve the logging in this path to print the unresolved hostname.
I filed https://issues.voltdb.com/browse/ENG-546 to fix the hostname parsing problem.
I filed https://issues.voltdb.com/browse/ENG-547 to figure out how to clean up the build dependencies that caused your catalog errors earlier.

Sorry you are having such a tough time with this example - we appreciate that you are taking the time to report these problems.
Workaround for host resolution error
rbetts
May 16, 2010
The voter example client doesn't trim whitespace from hostnames in the comma-separated list. I fixed this on the trunk. In the meantime, a simple workaround is to not use spaces in your list. For example, the following should work:
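
The trimming fix described above can be sketched in Java as follows. This is not the actual ClientVoter code or the change made on the trunk; the names `HostListParser` and `parseHosts` are hypothetical and only illustrate splitting the list on commas and trimming each entry before connecting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual ClientVoter code.
class HostListParser {

    // Splits a comma-separated host list and trims whitespace around each entry,
    // skipping entries that are empty after trimming.
    static List<String> parseHosts(String commaSeparated) {
        List<String> hosts = new ArrayList<String>();
        for (String part : commaSeparated.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // The list as written in the build.xml above, spaces included.
        String arg = "wl10.cac.cornell.edu, wl11.cac.cornell.edu, wl13.cac.cornell.edu";
        for (String host : parseHosts(arg)) {
            System.out.println("Connecting to server: " + host);
        }
    }
}
```

With trimming in place, a list written with or without spaces after the commas yields the same three hostnames.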
Thanks, that fixed the
tuancao
May 17, 2010
Thanks, that fixed the problem.
Tuan