Announcement

Collapse
No announcement yet.

VoltDB partitioning

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • VoltDB partitioning

    Lets say I have a table

    id col1 col2 col3
    1 ABC DEF XYZ
    2 AAA BBB CCC

    Most frequent queries is going to be

    SELECT * from XYZ where col1='abc' and col2='def'
    SELECT * from XYZ where col1='abc' and col2='def' and col3='xyz'


    As per VoltDB docs:-

    1. Partition should be done on column on which most searches are going to be performed
    2. Partitioning should be done on one column


    I couldn't find any example where search is performed on multiple columns.

    I wonder what is the best way to partition table for multi column searches?

  • #2
    Hi,

    For your use case, I would partition the table on column 'col1'. As you have multiple filters in WHERE clause, I would suggest building an index on those columns, like
    -- CREATE INDEX index1 ON XYZ (col1,col2,col3).

    In order to check whether your query is optimized or not, you can check the query plan by using Explain for AdHoc queries or ExplainProc for queries in stored procedures. Related documentation links can be found at:
    https://voltdb.com/docs/UsingVoltDB/sysprocexplain.php
    https://voltdb.com/docs/UsingVoltDB/...xplainproc.php

    Xin

    Originally posted by atul.pundhir View Post
    Lets say I have a table

    id col1 col2 col3
    1 ABC DEF XYZ
    2 AAA BBB CCC

    Most frequent queries is going to be

    SELECT * from XYZ where col1='abc' and col2='def'
    SELECT * from XYZ where col1='abc' and col2='def' and col3='xyz'


    As per VoltDB docs:-

    1. Partition should be done on column on which most searches are going to be performed
    2. Partitioning should be done on one column


    I couldn't find any example where search is performed on multiple columns.

    I wonder what is the best way to partition table for multi column searches?

    Comment


    • #3
      "The column on which most searches are going to be performed" is intended to include queries that filter multiple columns.
      So, for your two example queries, either col1 or col2 is a candidate for the partition column.

      If your third most frequent query were " SELECT * from XYZ where col1='abc' and col3='xyz' " (eliminating col2), col1 would be the better choice.
      If your third most frequent query were " SELECT * from XYZ where col1>'abc' and col2='def' " (only range-filtering col1), col2 would be the better choice.
      Only equality filters should be considered when choosing a partitioning column.

      Note that queries that use a column in JOIN filters that equate 2 keys rather than a filter equating a key and a constant can also influence choice of the partitioning column.
      This is particularly true when joining a large table with itself to get pairs of rows with the same column value or when joining multiple large tables with a matching column.
      A join between partitioned tables must always specify an equality filter between the partition column on each table. This will often be a primary key of one table that is referenced by a column in the other table.

      If two candidate partitioning columns seem to be used with about the same frequency, choose the column that has the larger number of unique values.

      Comment


      • #4
        Originally posted by pmartel View Post
        "The column on which most searches are going to be performed" is intended to include queries that filter multiple columns.
        So, for your two example queries, either col1 or col2 is a candidate for the partition column.

        If your third most frequent query were " SELECT * from XYZ where col1='abc' and col3='xyz' " (eliminating col2), col1 would be the better choice.
        If your third most frequent query were " SELECT * from XYZ where col1>'abc' and col2='def' " (only range-filtering col1), col2 would be the better choice.
        Only equality filters should be considered when choosing a partitioning column.

        Note that queries that use a column in JOIN filters that equate 2 keys rather than a filter equating a key and a constant can also influence choice of the partitioning column.
        This is particularly true when joining a large table with itself to get pairs of rows with the same column value or when joining multiple large tables with a matching column.
        A join between partitioned tables must always specify an equality filter between the partition column on each table. This will often be a primary key of one table that is referenced by a column in the other table.

        If two candidate partitioning columns seem to be used with about the same frequency, choose the column that has the larger number of unique values.

        Thanks for the answer. However, What is my queries are "SELECT * from XYZ where col1 IN ('ABC', '....', '...') AND col2 IN ('XYZ', 'DEF')"

        Comment


        • #5
          Originally posted by xin View Post
          Hi,

          For your use case, I would partition the table on column 'col1'. As you have multiple filters in WHERE clause, I would suggest building an index on those columns, like
          -- CREATE INDEX index1 ON XYZ (col1,col2,col3).

          In order to check whether your query is optimized or not, you can check the query plan by using Explain for AdHoc queries or ExplainProc for queries in stored procedures. Related documentation links can be found at:
          https://voltdb.com/docs/UsingVoltDB/sysprocexplain.php
          https://voltdb.com/docs/UsingVoltDB/...xplainproc.php

          Xin
          Thanks for the answer. However, What if my queries are "SELECT * from XYZ where col1 IN ('ABC', '....', '...') AND col2 IN ('XYZ', 'DEF')". Will it make sense to partition the data? I think voltdb will have to go to all partitions to fetch the data.

          Comment


          • #6
            Yes. If your table XYZ is partitioned on col1, VoltDB has to fetch data from multiple partitions for your new query. In generally, if your table XYZ has been read a lot without updating very often, you can design that table to be replicated if you can expect the size of the table is not too big for your cluster.

            For the questions like "Whether a table needs to be partitioned or not", you can find more information on page with link next:
            http://voltdb.com/docs/UsingVoltDB/C...esignPartition

            By the way, VoltDB currently does not support using multiple IN with an index. So for your query with index on (col1,col2), VoltDB only use this index for the first IN filter col1, the second filter for col2 will be as a normal predicate. That means VoltDB has to scan all the records satisfying the first IN filter in the table XYZ to validate the second filters. More query optimization with multiple IN filters are in our backlog with ticket number ENG-5549: https://issues.voltdb.com/browse/ENG-5549

            Please let us know if you have more questions.

            Xin



            Originally posted by atul.pundhir View Post
            Thanks for the answer. However, What if my queries are "SELECT * from XYZ where col1 IN ('ABC', '....', '...') AND col2 IN ('XYZ', 'DEF')". Will it make sense to partition the data? I think voltdb will have to go to all partitions to fetch the data.

            Comment

            Working...
            X