Announcement

Collapse
No announcement yet.

Why use partition on when inserting to VoltDB

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • gideon caller
    started a topic Why use partition on when inserting to VoltDB

    Why use partition on when inserting to VoltDB

    Hi everyone,
    I'm reading the VoltDB performance guide and I've ran into an example where you have a table which is partitioned on the Email column.
    There's an example in the performance guide (https://docs.voltdb.com/PerfGuide/Hello2Tables.php) under "Registering New Users" where it shows an insert stored procedure for user and it also says that it's better to add a "partition on" for the stored procedure so the stored procedure will be transformed to a single partition stored procedure.

    Now my question is: since the table is already partitioned on the Email column and I'm using an insert procedure why should I also need to explicitly add "partition on"? Why cant the relevant node be chosen according to the input Email value and be sent to a single Volt node? Also what happens when you execute this query without the "partition on"? doesn't it still go to only a single node? I'm quite confused on the way these 2 queries will be executed.

    Also, I was wondering if its possible to replicate a partitioned table? It seems like each table can be either partitioned or table but what happens if I replicate a table and lose one of the nodes?

    Thanks in advance for any explanation

    Gideon
    Last edited by gideon caller; 10-19-2015, 05:39 AM.

  • energyd
    replied
    Nice! Thanks a lot. So when I use csvloader's default bulk load mechanism, it's actually faster than using the default insert procedure?
    Last edited by energyd; 10-29-2015, 04:18 PM.

    Leave a comment:


  • bballard
    replied
    Csvloader uses a bulk loader mechanism by default, if you only specify the table name. If you specify a procedure with the "-p" parameter, it will call the procedure instead.

    The default procedures such as TABLENAME.insert are partitioned automatically for partitioned tables. If the table is not partitioned (or "replicated" in our terminology) then the insert procedure will be a multi-partition procedure.

    Leave a comment:


  • energyd
    replied
    Hi there,
    What if I'm using the Csvloader to load data into the a partitioned table? I assume Csvloader will use the default stored procedure "table.insert" that is created automatically when the table is created?

    If that is the case, is the default "tablename.insert" stored procedure partitioned by default?

    Thanks

    Leave a comment:


  • gideon caller
    replied
    Got it
    Thanks :)

    Leave a comment:


  • pmartel
    replied
    Yes, if you do not specify PARTITION ON, the insert will be queued for coordinated execution on all nodes, which will block all other queries on the cluster.
    The insert will only take effect on the one partition that matches its partition key. The transaction will run but have no effect on other partitions.
    In the case of a K-safe cluster, the insert into this one partition needs to take effect consistently at multiple (redundant) sites, each on a different node.

    So, neglecting to override the default partitioning behavior of this procedure will not change the correctness of its behavior, but it could have a measurable effect on the performance of the procedure (more processing, more inter-node communication) and the overall performance of the cluster (more blocking).

    Leave a comment:


  • jhugg
    replied
    If you're using a stored procedure, you always need to explicitly partition. If you're using an ad-hoc query or JDBC, it will detect the partitioning inherent in the actual statement and make it a single-partition call.

    The issue is that VoltDB doesn't really know what you're going to do in a stored procedure, so it plays it safe.

    Leave a comment:


  • gideon caller
    replied
    Thanks for your reply
    Regarding the 1st part of my question, so just making sure I understand this correctly, if I did an insert on a partitioned cluster where the partition is on the email and I didn't add the "partition on" part the insert will lock all nodes (since it's multi partition execution) but will only insert into 1 node?

    And no need to apologize, your answer was clear and useful :)
    thanks again!

    Leave a comment:


  • pmartel
    replied
    The default partitioning behavior for stored procedures is "multi-partition" regardless of the method used to specify the definition.
    This default must always be explicitly overridden to get single-partition execution.
    In your particular example of
    - an insert statement
    - defined "inline" in the ddl (vs. as part of a java stored procedure)
    - that uses a parameter directly to initialize the partitioning column
    the default partitioning behavior is obviously sub-optimal.
    Yet, even as common as this case and a FEW similar cases are, we opted for simplicity -- always defaulting the partitioning the same way, regardless of the stored procedure's specifics.
    The workaround, as you have found, is to use the explicit PARTITION ON syntax to override the default.

    In contrast, when issuing ad hoc SQL statements, there is no easily accessible way to target a single partition. So the system DOES analyze the statement logic to determine whether the statement can be safely and optimally executed on a single partition.

    In the future, since the system is capable of making this determination, we may provide an alternative syntax by which you could specify that you want the system to infer the optimal partitioning behavior from the provided sql statement. We have not determined what the syntax for this option should be.


    The answer to your second question about partitioned and replicated tables and redundancy is that a cluster configured with redundant nodes (K-safety factor of 1 or greater) maintains replicas of ALL tables for this purpose. This causes ALL tables to be redundantly stored on as many hosts as required to maintain the K-safety factor.

    Completely separate from this feature is the ability to configure SPECIFIC tables in the ddl as partitioned or replicated for the purpose of localized access.
    This causes non-partitioned tables (also called "replicated") to be redundantly stored on ALL hosts, keeping it locally available to ALL partitions, regardless of whether the K-safety feature is enabled.

    Also, separate from these two features is a third feature "distributed replication" which involves another level of redundant storage, but this time in a separate operational cluster, often in a remote location, allowing for resilience in the case of a cluster-level outage.

    The common use of the terms "replica", "replicated", and "replication" for all of these three distinct product features can often be a little confusing.
    I'm sorry for any inconvenience this may have caused you. I hope I've been able to clarify.

    Leave a comment:

Working...
X