Forum: Building VoltDB Applications

Post: Database design and performance questions: joins, partitioning schema etc

cer
Feb 5, 2012
I have a few questions about getting my application to have good performance and scalability.
My example application has a database of restaurants, their menus and their opening times.

Here is the schema.

CREATE TABLE restaurant
( id INTEGER PRIMARY KEY NOT NULL,
name VARCHAR(200) NOT NULL
);

CREATE TABLE menu_item
( restaurant_id INTEGER NOT NULL REFERENCES restaurant(id) ,
name VARCHAR(200) NOT NULL,
price FLOAT NOT NULL
);

CREATE TABLE service_area
( restaurant_id INTEGER NOT NULL REFERENCES restaurant(id) ,
zip_code VARCHAR(5) NOT NULL
);

CREATE TABLE time_range
( restaurant_id INTEGER NOT NULL REFERENCES restaurant(id) ,
day_of_week INTEGER NOT NULL,
open_time INTEGER NOT NULL,
close_time INTEGER NOT NULL
);

There are two main operations on the database. The first is CRUD on restaurants, e.g. an AddRestaurant procedure. The second is a query that finds the restaurants that serve a given zip code and are open at a particular time of day.

Here is the FindAvailableRestaurant query that I would like to use:

SELECT r.*
FROM restaurant r, time_range tr, service_area sa
WHERE sa.zip_code = ?
  AND r.id = tr.restaurant_id
  AND r.id = sa.restaurant_id
  AND tr.day_of_week = ?
  AND tr.open_time <= ?
  AND ? <= tr.close_time;

I've experimented with a couple of different partitioning schemes.

Design #1


The first was to partition all the tables by restaurant id.
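Concretely, the partitioning was roughly the following (written here as PARTITION TABLE statements just for readability; the actual declarations go in project.xml):

PARTITION TABLE restaurant ON COLUMN id;
PARTITION TABLE menu_item ON COLUMN restaurant_id;
PARTITION TABLE service_area ON COLUMN restaurant_id;
PARTITION TABLE time_range ON COLUMN restaurant_id;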

The CRUD stored procedures were all single partition and performed extremely well.

However, the FindAvailableRestaurant stored procedure was multi-partition, and the voltcompiler complained about the query: "Unable to plan for statement. Likely statement is joining two partitioned tables in a multi-partition statement. This is not supported at this time."

My solution to this problem was to denormalize by defining an additional table:

CREATE TABLE available_time_range
( restaurant_id INTEGER NOT NULL REFERENCES restaurant(id) ,
day_of_week INTEGER NOT NULL,
open_time INTEGER NOT NULL,
close_time INTEGER NOT NULL,
zip_code VARCHAR(5) NOT NULL,
name VARCHAR(200) NOT NULL
);
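To keep this table populated, AddRestaurant inserts one availability row for every (service zip code, time range) combination of the new restaurant, something like:

-- executed once per zip code x time range pair for the restaurant
INSERT INTO available_time_range
  (restaurant_id, day_of_week, open_time, close_time, zip_code, name)
VALUES (?, ?, ?, ?, ?, ?);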

The CRUD performance was still good, but the FindAvailableRestaurant query was only OK, presumably because it was still a multi-partition procedure.

Design #2


To improve the query performance I then switched partitioning schemes: available_time_range was partitioned by zip code.

The FindAvailableRestaurant single-partition stored procedure had great performance (48K transactions/sec).
But the AddRestaurant stored procedure had terrible performance (<1K transactions/sec, and even worse in a cluster).
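For reference, against the denormalized table the FindAvailableRestaurant query collapses to a single-table lookup, roughly:

SELECT restaurant_id, name
FROM available_time_range
WHERE zip_code = ?
  AND day_of_week = ?
  AND open_time <= ?
  AND ? <= close_time;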

Design #3


As an experiment I used a single-partition server and defined all stored procedures as single-partition. The FindAvailableRestaurant procedure used the original three-way join SELECT statement.
CRUD performance was fine, but FindAvailableRestaurant was terrible (20 transactions/second!) even though the query plan showed it using indexes.

Sorry for the long post, but I was wondering if anyone had any thoughts or suggestions about my experiences.
aweisberg
Feb 6, 2012
Hi Cer,

I don't know if we implement REFERENCES correctly, or whether that will generate a primary key index. I do know that we don't support foreign key constraints. I created https://issues.voltdb.com/browse/ENG-2497 for that issue.

It isn't clear to me from your schema what indexes and primary keys you are ending up with. The reason design #2 was faster is that partitioning on zip code effectively builds an index on zip code, which made the join a lot faster because only restaurants in that zip code had to be considered.

I would start by partitioning everything on zip code. If you partitioned all tables on zip code you would be in better shape when it comes to inserting data into the database and doing CRUD updates, because that would be single-partition. You should also add primary keys on restaurant_id to the restaurant, available_time_range, and menu_item tables. I would create a separate available_time_range table for each day of the week so that the join can do a table scan of the correct day of the week. If you indexed day_of_week it would only cut the number of rows down to 1/7, and then it would be a slower index scan. I don't think an index on open_time or close_time will do much good; you are probably better off with the scan.
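As a sketch of what I mean (the table name is just a placeholder), one of the seven per-day tables might look like:

-- the day is encoded in the table name, so no day_of_week column is needed;
-- each per-day table would be partitioned on zip_code (declared in project.xml)
CREATE TABLE available_time_range_mon
( restaurant_id INTEGER NOT NULL,
open_time INTEGER NOT NULL,
close_time INTEGER NOT NULL,
zip_code VARCHAR(5) NOT NULL,
name VARCHAR(200) NOT NULL
);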

You should be able to get better performance than what you are currently seeing. When you go from one to two nodes it is important to make sure that NTP is giving you good synchronization (see NtpSvcIntro) and that you are submitting load asynchronously (see 3.3.3 DesignAppAsync). If you submit a single procedure at a time you are limited by the round trip time and can only generate work for a single core in the database.

If you go with partitioning on zip code you should see your CRUD performance problems go away, as long as you are generating enough load to keep all cores on the server busy.

Another approach to this design would be to replicate everything except the detail (menu) information associated with a restaurant (which you would partition on restaurant_id). I wouldn't be surprised if the entire data set fit easily into memory. This would allow you to do FindAvailableRestaurant as a single-partition procedure with a dummy partition parameter (dummy since you will be querying replicated tables). Insert, Update, and Delete would be slow because they would all be multi-partition, but once the data set is loaded, how often will these things actually change?
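In that layout only the menu detail table carries a partition declaration; tables with no partition declaration are replicated to every partition. Written as a PARTITION TABLE statement for readability (the declaration itself goes in project.xml):

-- restaurant, service_area and time_range have no partition declaration and stay replicated
PARTITION TABLE menu_item ON COLUMN restaurant_id;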

-Ariel