TPCC performance drop


  • TPCC performance drop

    Hi everyone,

    I'm new to VoltDB, so I thought I'd start by running the standard TPCC benchmark to be able to compare against other systems. I downloaded the source code, compiled it, and ran the benchmark under tests/test_apps/tpcc with 16 threads and 16 warehouses. The client is on a different machine. Everything works out of the box, which is great. However, performance is not very stable. Two issues:

    - Performance drops almost linearly with the number of transactions run. Throughput starts at around 50k SP/s (great!) and then slowly drops to less than 3k SP/s. I realize that the database grows and this affects performance, but a >10x drop seems too much.

    - If I kill the client and reconnect, throughput drops by ~2x.

    Has anyone experienced similar behavior?

    Thank you in advance,
    Radu

  • #2
    I'd like to see the output from the run

    Can you please send me the client output from your run? I'd also like to see the server description. This is in the log files: grep HOST log/volt.log

    Send it to rmorgenstein at voltdb dot com



    • #3
      Hi, That sounds like

      Hi,

      That sounds like swapping. VoltDB requires that your data set fit in memory to see good performance. We usually run with swap disabled because Linux tends to swap stuff out early when there is still free memory. If you run with top, how much swap is in use?
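      A quick way to check this without watching top interactively is to read the swap counters the kernel exposes. The helper below is just an illustrative sketch assuming Linux's standard /proc/meminfo SwapTotal/SwapFree fields, not VoltDB tooling:

```python
# Illustrative helper (not VoltDB tooling): compute swap in use from the
# standard Linux /proc/meminfo fields, avoiding an interactive top session.
def swap_in_use_kb(meminfo_text):
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # kernel reports kB
    return values["SwapTotal"] - values["SwapFree"]

# On a live Linux box you would feed it open("/proc/meminfo").read();
# a sample snippet keeps the example self-contained here.
sample = "SwapTotal:     8388604 kB\nSwapFree:      8123456 kB\n"
print("swap in use (kB):", swap_in_use_kb(sample))
```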

      Ariel



      • #4
        Swapping

        Originally posted by aweisberg View Post
        Hi,

        That sounds like swapping. VoltDB requires that your data set fit in memory to see good performance. We usually run with swap disabled because Linux tends to swap stuff out early when there is still free memory. If you run with top, how much swap is in use?

        Ariel
        Hi aweisberg,

        Sorry for the late answer. I did not get any email about a reply.

        It is not swapping (I was running with swapoff -a anyway) but a query optimization issue with the OrderStat-getLastOrder query. The query and the winning plan:

        SQL: SELECT O_ID, O_CARRIER_ID, O_ENTRY_D FROM ORDERS WHERE O_W_ID = ? AND O_D_ID = ? AND O_C_ID = ? ORDER BY O_ID DESC LIMIT 1

        RETURN RESULTS TO STORED PROCEDURE
        LIMIT 1
        ORDER BY (SORT)
        INDEX SCAN of "ORDERS" using "IDX_ORDERS" (unique-scan covering)

        As you can see, the problem is with the SORT/LIMIT 1 operators. The plan scans all orders, sorts them, and then picks the latest. Obviously, the IDX_ORDERS scan should return a single tuple, the latest order, and there should be no ORDER BY (SORT). As the ORDERS table grows with time/number of transactions, the execution time for the OrderStat-getLastOrder query also grows, because the sort becomes more and more expensive. This looks like a trivial thing to solve; I'm looking at the code now, but it takes time as I'm not familiar with the project.
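        To make the cost difference concrete, here is a toy Python sketch (not VoltDB internals; the rows and helper names are made up) contrasting the sort-then-limit plan with a direct seek on an index ordered like IDX_ORDERS:

```python
import bisect

# Toy model (not VoltDB code). Rows mimic the assumed index order of
# IDX_ORDERS: (O_W_ID, O_D_ID, O_C_ID, O_ID); the data is invented.
index = sorted([
    (1, 1, 7, 10), (1, 1, 7, 42), (1, 1, 7, 99),  # customer 7's orders
    (1, 1, 8, 5),                                  # another customer
])

def last_order_sorting(w, d, c):
    # What the reported plan does: materialize all matches, sort, take last.
    matches = sorted(o for o in index if o[:3] == (w, d, c))
    return matches[-1]  # cost grows with the number of matching orders

def last_order_indexed(w, d, c):
    # What it could do: seek just past (w, d, c, +inf) and step back one.
    pos = bisect.bisect_right(index, (w, d, c, float("inf")))
    return index[pos - 1]  # one index probe, independent of table size

print(last_order_sorting(1, 1, 7), last_order_indexed(1, 1, 7))
```

        The indexed form is roughly what an inlined LIMIT/ORDER BY in the scan would amount to.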

        Regarding swapping, I am actually interested in performance when RAM is insufficient and I've run a bunch of experiments with TPCC. In my experience it works really well for the TPCC workload.



        • #5
          Hi rmorgenstein, Thank you

          Originally posted by rmorgenstein View Post
          Can you please send me the client output from your run? I'd also like to see the server description. This is in the log files: grep HOST log/volt.log

          Send it to rmorgenstein at voltdb dot com
          Hi rmorgenstein,

          Thank you for the reply!

          For the first question, please see my comments below explaining the long term throughput drop trend.

          For the second issue, the throughput drop after client disconnect, I don't have a clear explanation why it happens.

          I am preparing a bunch of throughput-vs-time graphs showing the two issues and will send them by email together with the logs. Hopefully they will explain my questions a bit better.



          • #6
            Hi, I tried to reproduce what

            Originally posted by morcoveata View Post
            Hi aweisberg,

            Sorry for the late answer. I did not get any email about a reply..

            It is not swapping (I was running with swapoff -a anyway) but a query optimization issue with the OrderStat-getLastOrder query. The query and the winning plan: ...
            Hi,
            I tried to reproduce what you are describing on master, and it is locked in at 48k TPS on a 2x8x16 Nehalem server with 48 gigabytes of RAM. I ran it until the server OOMed and locked up, and it did 47k txns/sec.

            You are right that the query isn't being planned well. The limit and order by should be inlined into the scan. Apparently there aren't a lot of rows being returned by the index because it isn't impacting performance much.

            Hopefully the logs will shed more light. What is the CPU and RAM config of the node?

            -Ariel



            • #7
              The machine has 192GB of RAM,

              Originally posted by aweisberg View Post
              Hi,
              I tried to reproduce what you are describing on master, and it is locked in at 48k TPS on a 2x8x16 Nehalem server with 48 gigabytes of RAM. I ran it until it the server OOMed and locked up ant it did 47k txns/sec...
              The machine has 192GB of RAM, using 16 out of 80 cores (160 HW contexts but hyper-threading is disabled).

              But I don't think it is just the size of the memory; I see a drop before the resident working set reaches 48GB. Probably the plan gets implemented differently and it doesn't involve sorting all orders. Are you using a very recent version? I downloaded the VoltDB community edition about a month ago, and maybe changes happened in the meantime.

              It's great that we get almost the same initial performance -- at least that part works well. I didn't finish the graphs yet but will upload them in ~1h.



              • #8
                Actual numbers

                Originally posted by morcoveata View Post
                The machine has 192GB of RAM, using 16 out of 80 cores (160 HW contexts but hyper-threading is disabled).

                But I don't think that is just the size of the memory, I see a drop before the resident working set reaches 48GB. Probably the plan gets implemented differently and it doesn't involve sorting all orders...
                I'm adding a bunch of graphs (in excel form, hope this is ok) to give a better idea of what I am referring to.

                - First spreadsheet shows the long term throughput for TPCC scale factor 16, VoltDB running with 16 threads until it fills up 192GB of RAM. This is the throughput reported by *the client* that is on a separate machine connected to the same switch (1Gb/s link).

                - Second spreadsheet shows the issue with the client disconnect (the second question of my original post). Essentially the throughput measured by the client drops inexplicably after a disconnect/reconnect.

                - Third spreadsheet has a breakdown of the time spent executing each query. It also shows that a client reconnect has no impact on the number of transactions executed server side or on their latency. This would indicate that somehow some results do not make it back to the client.

                I'm also attaching the project.xml, deployment.xml and the volt.log file.

                Thank you again for your time and help!



                • #9
                  Hi, You are running on

                  Originally posted by morcoveata View Post

                  I'm adding a bunch of graphs (in excel form, hope this is ok) to give a better idea of what I am referring to.

                  - First spreadsheet shows the long term throughput for TPCC scale factor 16, VoltDB running with 16 threads until it fills up 192GB of RAM. This is the throughput reported by *the client* that is on a separate machine connected to the same switch (1Gb/s link).
                  Hi,

                  You are running on hardware that is very different than what we have run on before. There are some interesting NUMA questions in these larger quad-socket systems that we haven't gotten around to addressing.

                  Can you give me the output of "numactl --hardware"?

                  If I were going to seriously benchmark Volt on such a machine, I would run 4 Volt processes and use numactl to bind each one's threads and memory allocations to a socket. I would start with 5 execution sites per instance. You can connect the client to all 4 instances; I think you may need to check up on bandwidth.

                  Adding client affinity so requests are routed to the correct process would also help. I would like to automate starting multiple processes and doing the NUMA config and make it work with replication, but it isn't likely to be scheduled soon. There are also some thread pool sizing issues.

                  Java will make a GC thread per hardware thread, and you really want fewer than that, especially when running multiple processes. The network thread pool is also sized to (num hardware threads / 2), and new connections are round-robined across the pool. That could explain the drop in performance on reconnect, since a new connection could go to a different thread that has different scheduling and allocation properties.
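                  The reconnect effect can be sketched with a toy round-robin model. The pool sizing (hardware threads / 2) is taken from this post; the thread names and the 16-thread count are illustrative, not from VoltDB source:

```python
import itertools

# Toy model of round-robin connection assignment across a network thread
# pool sized to (hardware threads / 2). All names here are illustrative.
hardware_threads = 16
pool = [f"net-thread-{i}" for i in range(hardware_threads // 2)]
next_thread = itertools.cycle(pool)

def accept_connection():
    # Each new connection is handed to the next thread in the pool.
    return next(next_thread)

first = accept_connection()      # the client's original connection
for _ in range(3):               # other connections arrive in between
    accept_connection()
reconnect = accept_connection()  # the client reconnects...
print(first, "->", reconnect)    # ...and lands on a different thread
```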

                  I would also run with the latest Sun JDK 7; Java's scalability is always improving. -Xmx2g is probably appropriate for this. In the log it sometimes reports different processors and only 29 gigabytes of RAM?

                  If you check out the trunk and run that, does this still reproduce? I know I have run at similar time scales with better throughput and not seen the drop, so it is worth verifying.

                  Thanks,
                  Ariel



                  • #10
                    Hi Ariel, Thanks for the

                    Originally posted by aweisberg View Post
                    Hi,

                    You are running on hardware that is very different then what we have run on before. There are some interesting NUMA questions in these larger quad socket systems that we haven't gotten to addressing... Ariel
                    Hi Ariel,

                    Thanks for the reply. You are right about the NUMA features - I only got results on multi-socket machines without using processor affinity. I also checked out the latest code version (the previous code run was ~1.5 months old) and I see there have been significant changes in the meantime. I will re-run the experiments and will let you know the results.

                    I monitored the effect of the Java GC a bit, and I can confirm that it adds unpredictability.
                    The exact hardware is: SuperMicro SuperServer 5086B-TRF, 8 CPUs (sockets) x 10-core Intel Xeon CPU E7-L8867 @ 2.13GHz, 192GB memory. Detailed specs at: http://ark.intel.com/products/53577/...7-8867L-%C2%A0(30M-Cache-2_13-GHz-6_40-GTs-Intel-QPI). I actually disabled hyper-threading, so the OS sees only 80 processors (one per core) and not 160 (one per HW context). Below is the output of numactl.

                    Sorry for the messy log; it contains entries from several runs on different hardware configurations, which explains the differences you've seen. The entries where memory is low are from experiments where I was running with swapping (these are unrelated to what I've uploaded). The entry with the 2CPUx6cores hardware is from a debugging run on a different machine. My home directory is an NFS share, so the output got appended to the same log file.

                    One thing I forgot to mention in my previous post is that only ~5% of the total time is spent executing transactions ("CPU time in computation" in sheet 3 of the excel file I've uploaded). I got to this number by using the already existing timers of the ProcedureStatsCollector class. Does this sound right?

                    Thanks,
                    Radu

                    numactl --hardware gives:
                    available: 8 nodes (0-7)
                    node 0 cpus: 1 2 3 4 5 6 7 8 9 10
                    node 0 size: 16384 MB
                    node 0 free: 15830 MB
                    node 1 cpus: 11 12 13 14 15 16 17 18 19 20
                    node 1 size: 16384 MB
                    node 1 free: 15812 MB
                    node 2 cpus: 21 22 23 24 25 26 27 28 29 30
                    node 2 size: 16384 MB
                    node 2 free: 15904 MB
                    node 3 cpus: 31 32 33 34 35 36 37 38 39 40
                    node 3 size: 16320 MB
                    node 3 free: 15876 MB
                    node 4 cpus: 0 41 42 43 44 45 46 47 48 49
                    node 4 size: 32603 MB
                    node 4 free: 31441 MB
                    node 5 cpus: 50 51 52 53 54 55 56 57 58 59
                    node 5 size: 32768 MB
                    node 5 free: 31969 MB
                    node 6 cpus: 60 61 62 63 64 65 66 67 68 69
                    node 6 size: 32768 MB
                    node 6 free: 31979 MB
                    node 7 cpus: 70 71 72 73 74 75 76 77 78 79
                    node 7 size: 32768 MB
                    node 7 free: 31717 MB
                    node distances:
                    node 0 1 2 3 4 5 6 7
                    0: 10 21 21 21 21 21 21 21
                    1: 21 10 21 21 21 21 21 21
                    2: 21 21 10 21 21 21 21 21
                    3: 21 21 21 10 21 21 21 21
                    4: 21 21 21 21 10 21 21 21
                    5: 21 21 21 21 21 10 21 21
                    6: 21 21 21 21 21 21 10 21
                    7: 21 21 21 21 21 21 21 10



                    • #11
                      Hi, Can you also give us the

                      Originally posted by morcoveata View Post
                      Hi Ariel,

                      Thanks for the reply. You are right about the NUMA features - I only got results on multi-socket machines without using processor affinity. I also checked out the latest code version (the previous code run was ~1.5 months old) and I see there have been significant changes in the meantime. I will re-run the experiments and will let you know the results...
                      Hi,

                      Can you also give us the output of @Statistics PROCEDURE? Is there a procedure that takes longer than the others? Normally, if there is a procedure with a linear scan of any kind, it will stick out like a sore thumb.

                      If there isn't one, and the amount of time spent executing procedures is small, then the bottleneck is usually initiating transactions, but that tops out at 200k+ transactions/sec per node on the hardware we have tested with. I don't have a sense of why you would see a gradual decline to 3k txn/sec for the transaction initiation portion of the workload.

                      Ariel



                      • #12
                        Output of @Statistics PROCEDURE

                        Originally posted by aweisberg View Post
                        Hi,

                        Can you also give us the output of @Statistics PROCEDURE? Is there a procedure that takes longer than the others? Normally if there is procedure with a linear scan of any kind it will stick out like a sore thumb...Ariel
                        Hi Ariel,

                        Sorry for the late reply.

                        I checked out the latest code from git today, and I can confirm the wrong execution plan (for getting the last order) is still there. I'm attaching a file with the output of @Statistics PROCEDURE.

                        To pinpoint what I'm referring to in the file: the running time for the com.procedures.ostatByCustomerName procedure varies between 62814 ns (MIN) and 179082035 ns (MAX). In the beginning the runtime is close to 60us, and it grows to 180ms as more orders are added.
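                        As a quick arithmetic check of the spread quoted above:

```python
# Sanity-check the MIN/MAX spread quoted for ostatByCustomerName.
min_ns, max_ns = 62_814, 179_082_035
ratio = max_ns / min_ns
print(f"{min_ns / 1e3:.0f} us -> {max_ns / 1e6:.0f} ms, roughly {ratio:.0f}x slower")
```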

                        I looked a bit at the code, and it seems that fixing the problem should be "easy". VoltDB already seems to support ordered retrieval of tuples, so no new operators are needed and only the planning phase needs some modifications. I'll dig a bit deeper when time allows.

                        Radu



                        • #13
                          Hi,

                          I am trying to run VoltDB on a server with 16 sockets. The performance is disappointing.
                          Would you please tell me how to "run 4 volt processes, and use numactl to bind threads and memory allocations to a socket, and start with 5 execution sites per instance" ?
