Forum: Building VoltDB Applications

Post: Problem with a Join

Problem with a Join
rsafaie
Apr 13, 2010
Hi all,

I am having a problem with a join, have reduced it to the following:


Consider these two tables:


CREATE TABLE OBJECT_DETAIL (
OBJECT_DETAIL_ID INTEGER NOT NULL,
NAME VARCHAR(256) NOT NULL,
DESCRIPTION VARCHAR(1024) NOT NULL,
PRIMARY KEY (OBJECT_DETAIL_ID)
);


CREATE TABLE ASSET (
ASSET_ID INTEGER NOT NULL,
OBJECT_DETAIL_ID INTEGER NOT NULL,
PRIMARY KEY (ASSET_ID)
);


Using the ad hoc query browser (the same problem existed using stored procedures), run:


INSERT INTO OBJECT_DETAIL VALUES (1, 'NAME', 'DESCRIPTION');
INSERT INTO ASSET VALUES (1,1);


Then, running:


SELECT * FROM ASSET;


Gives:



  • ASSET_ID = 1
  • OBJECT_DETAIL_ID = 1


SELECT * FROM OBJECT_DETAIL;


Gives:



  • OBJECT_DETAIL_ID = 1
  • NAME = 'NAME'
  • DESCRIPTION = 'DESCRIPTION'


The problem occurs when I try joining on OBJECT_DETAIL_ID columns.


Running:


SELECT
A.ASSET_ID,
A.OBJECT_DETAIL_ID,
OD.OBJECT_DETAIL_ID
FROM
ASSET A,
OBJECT_DETAIL OD
WHERE
A.OBJECT_DETAIL_ID = OD.OBJECT_DETAIL_ID


Gives:



  • ASSET_ID = 1
  • OBJECT_DETAIL_ID = 282177
  • OBJECT_DETAIL_ID = 1296367627


I would expect OBJECT_DETAIL_ID to be 1.


Adding OD.NAME to the SELECT part of the query returns an empty result set (which it should, as should the above query if this is an inner join??)


Interestingly, moving the tables around in the FROM part of the clause to:


SELECT
A.ASSET_ID,
A.OBJECT_DETAIL_ID,
OD.OBJECT_DETAIL_ID
FROM
OBJECT_DETAIL OD,
ASSET A
WHERE
A.OBJECT_DETAIL_ID = OD.OBJECT_DETAIL_ID


Returns (from two executions):



  • 1.
  • ASSET_ID: 0
  • OBJECT_DETAIL_ID: 241
  • OBJECT_DETAIL_ID: 1



  • 2.
  • ASSET_ID: 76
  • OBJECT_DETAIL_ID: 1598637153
  • OBJECT_DETAIL_ID: 1


In this case it always returns a different set of values for the query (no matter how many times I run it), the first join query (using ASSET A, OBJECT_DETAIL OD in the FROM part) always returns the same rows.


Am I doing something wrong here (perhaps the ddl is wrong)?


My only thought might be that it is related to the enivironment I am running this in:


VMWare Workstation on Windows 7 Home Edition,
64-Bit Ubuntu 9.10


I do have some CentOS VM's setup for some other tests, will try this on those tomorrow and let you know what I find.


Thanks,
Rouz
re: Problem with a Join
tcallaghan
Apr 14, 2010
Rouz,

That is truly strange. I just tried reproducing this issue on my Mac, using both the 0.6.01 and 0.6.02 kits and both times I received correct results from the "join" SQL statement. I also appreciate your trying this in both Ad-hoc and stored procedure modes.

I don't think the issue is related to your VM, but you can always try CentOS. I would like you to try a few things:

1. Use the 0.6.02 kit (if you aren't already).
2. Use the CentOS build (even if you are running on Ubuntu, this single Linux kit will most likely be our only distribution going forward).
3. Try the SQL without table aliases. You can fully qualify all your SQL with the proper table name, aliasing can sometimes cause an issue.

SELECT ASSET.ASSET_ID,
ASSET.OBJECT_DETAIL_ID,
OBJECT_DETAIL.OBJECT_DETAIL_ID
FROM OBJECT_DETAIL,
ASSET
WHERE ASSET.OBJECT_DETAIL_ID = OBJECT_DETAIL.OBJECT_DETAIL_ID

Let me know if this resolves your issue. Also, when reporting an issue with this level of detail feel free to attach your entire project (as long as there is nothing proprietary in it). It is always a better test if I use all your code, not just a snippet.

-Tim
Hi Tim, Thanks for the reply
rsafaie
Apr 14, 2010
Rouz,

That is truly strange. I just tried reproducing this issue on my Mac, using both the 0.6.01 and 0.6.02 kits and both times I received correct results from the "join" SQL statement. I also appreciate your trying this in both Ad-hoc and stored procedure modes.

I don't think the issue is related to your VM, but you can always try CentOS. I would like you to try a few things...

-Tim

Hi Tim,

Thanks for the reply (will attach the code future.. I suspect you would of spotted the problem had I done so this time).
In response to your points:

1. I am already using 0.6.02
2. Have tried using the CENTOS build, seems to work exactly the same as the Ubuntu distribution (same things work / don't work).
3. I had already tried it without the aliases and got the same problem

Further: I have tried this on CentOS and I observe the same problem

However, all that said:

I think I have found the problem, it was due to the columns I was partitioning on. Not sure if this is a lack of understanding on my part or a bug.

I had left a partition column definition in the project.xml file for both the ASSET and OBJECT_DETAIL tables as follows:

<partition table='ASSET' column='ASSET_ID' />
<partition table='OBJECT_DETAIL' column='OBJECT_DETAIL_ID' />

In reality I would most definitely want the object_detail table to be replicated across all nodes anyway for performance, by removing the partition information for that table this is achieved and it fixes the problem (the join works now).

I did think though that you could partition on any column provided it was always a NON-NULL column, am I wrong thinking that (Apologies if this is covered in the docs and I haven't seen it)?

When performing a join, does the column that you are joining on need to be either a replicated table or the partition column of the table?

Thanks, Rouz

NOTE:

If both the paritions are defined for both tables (scenario with a problem), if I try to run:

SELECT * FROM ASSET, OBJECT_DETAIL WHERE ASSET.OBJECT_DETAIL_ID = OBJECT_DETAIL.OBJECT_DETAIL_ID
the Volt server exits with the following message:

[java] Loading org.voltdb.VoltTableRow... deferring to parent.
[java] terminate called after throwing an instance of 'std::bad_alloc'
[java] what(): std::bad_alloc
[java] Java Result: 134

I suspect this is a knock on effect of the above problem, but thought I would mention it just in case it causes concern!
Hi Rouz, As far as
ajgent
Apr 15, 2010
Hi Tim,

Thanks for the reply (will attach the code future.. I suspect you would of spotted the problem had I done so this time).
In response to your points:

1. I am already using 0.6.02
2. Have tried using the CENTOS build, seems to work exactly the same as the Ubuntu distribution (same things work / don't work).
3. I had already tried it without the aliases and got the same problem...
Thanks, Rouz

NOTE:

If both the paritions are defined for both tables (scenario with a problem), if I try to run:

SELECT * FROM ASSET, OBJECT_DETAIL WHERE ASSET.OBJECT_DETAIL_ID = OBJECT_DETAIL.OBJECT_DETAIL_ID
the Volt server exits with the following message:

[java] Loading org.voltdb.VoltTableRow... deferring to parent.
[java] terminate called after throwing an instance of 'std::bad_alloc'
[java] what(): std::bad_alloc
[java] Java Result: 134

I suspect this is a knock on effect of the above problem, but thought I would mention it just in case it causes concern!

Hi Rouz,

As far as partitioning goes, whether the tables are partitioned or not should not affect the accuracy of the results. (It may, of course, impact the performance of the query.) The one caveat to that statement is that you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition. In general, if your tables are both partitioned -- or if you are not selecting based on the partitioning key, as in your example -- you will need a multi-partitioned stored procedure

Second, any integer or string column can be a partitioning key. You cannot partition on floats or decimals.

That said, the bad_alloc looks like a bug. As Tim suggested, we would very much appreciate it if you could send us the code for this example (assuming it is non-proprietary) so we can try to reproduce it here,

One last question: when you received this error, approximately how many records were in the database?

Many thanks,
Andrew
Some more details, please?
chbussler
Apr 16, 2010
Hi Rouz,

As far as partitioning goes, whether the tables are partitioned or not should not affect the accuracy of the results. (It may, of course, impact the performance of the query.) The one caveat to that statement is that you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition. In general, if your tables are both partitioned -- or if you are not selecting based on the partitioning key, as in your example -- you will need a multi-partitioned stored procedure...

Many thanks,
Andrew


Hi Andrew,

your post lead to two questions for me:

a) how to I know that all records are in one partition?
b) is a multi-partitioned stored procedure one where single-site == false?

Thanks,
Christoph
Hi Christoph, >>how do I know
ajgent
Apr 16, 2010

Hi Andrew,

your post lead to two questions for me:

a) how to I know that all records are in one partition?
b) is a multi-partitioned stored procedure one where single-site == false?

Thanks,
Christoph

Hi Christoph,

>>how do I know that all records are in one partition?

The three main situations where a join could be single partitioned are:



  1. Both the tables of the join are replicated, so all records are guaranteed to be in all partitions.
  2. One table is partitioned and the other is replicated. For example, if the procedure is partitioned on EMPLOYEE.EMPLOYEE_ID and the table DEPARTMENT is replicated, the following join would be single-partitioned:

    SELECT E.EMP_NAME, D.DEPT_NAME FROM EMPLOYEE AS E, DEPARTMENT AS DWHERE E.DEPT_CODE=D.DEPT_CODE AND E.EMPLOYEE_ID=?
  3. Both tables are partitioned on the same key. For example, if EMPLOYEE is partitioned on EMPLOYEE_ID and DEPARTMENT is partitioned on MANAGER_ID (a silly example, I grant you), then the following could be single-partitioned:

    SELECT E.EMP_NAME, D.DEPT_NAME FROM EMPLOYEE AS E, DEPARTMENT AS DWHERE E.EMPLOYEE_ID=D.MANAGER_ID AND E.EMPLOYEE_ID=?


    >>is a multi-partitioned stored procedure one where single-site == false?

    Yes. Either singlePartition = false or you do not specify any @ProcInfo. (Multi-partitioned is the default.)

    Hope that helps.

    --Andrew
Killer Bug
henning
Apr 16, 2010

Hi Andrew,

your post lead to two questions for me:

a) how to I know that all records are in one partition?
b) is a multi-partitioned stored procedure one where single-site == false?

Thanks,
Christoph

"you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition."

I am not sure if I fully understood, but this implies that VoltDB will basically believe me and follow orders blindly when I tell it that a join will be single partition? And deliver only a subset of data, if I make a mistake, with no warning?

If this is right, it looks like VoltDB's equivalent to C's zero pointers as an error source to me. Like, it will run a decent part of your user base crazy and globally will cost man-years of frustrated debugging efforts. :-)

I think I would be quite grateful if you were to build a very simple asynchronous error checking mechanism in that runs some intelligent check in the 'background' and informs me of a potential error if it finds that a join I just executed single-partition would have yielded different results if it hadn't been flagged single-partition. And that this flagging must have been a mistake in the first place.

I could imagine that I may come around to actually like the feature as is, if it was guaranteed that I can query reliably only the data subset of a single node. But I would be surprised if this would be on offer? It's simply a mistake and can't be consistent in its results I would assume.

If so, then an intelligent, pro-active check will make sure that VoltDB minimizes pitfalls that are very specific to the special ways it introduces, i.e. will rarely be anticipated, and that seem guaranteed to be beginner's favored errors, as well as advanced user's most dreaded.
Separate discussion
ajgent
Apr 16, 2010
"you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition."

I am not sure if I fully understood, but this implies that VoltDB will basically believe me and follow orders blindly when I tell it that a join will be single partition? And deliver only a subset of data, if I make a mistake, with no warning?...


Hi Henning,

This thread is getting rather long and diverging into two separate topics (multi-partition joins and single-partition joins). If you don't mind, I will move the discussion of single-partition joins to a separate topic so it gets the visibility it deserves.

Thanks,

--Andrew
Hi Rouz, We have produced a
ajgent
Apr 15, 2010
Hi Rouz,

As far as partitioning goes, whether the tables are partitioned or not should not affect the accuracy of the results. (It may, of course, impact the performance of the query.) The one caveat to that statement is that you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition. In general, if your tables are both partitioned -- or if you are not selecting based on the partitioning key, as in your example -- you will need a multi-partitioned stored procedure...

Many thanks,
Andrew

Hi Rouz,

We have produced a different but probably related error using your last SQL query. Thanks for pointing this out. I'll let you know as soon as we determine what the cause is (and the workaround).

--Andrew
Hi Rouz, Thanks so much for
ajgent
Apr 16, 2010
Hi Rouz,

As far as partitioning goes, whether the tables are partitioned or not should not affect the accuracy of the results. (It may, of course, impact the performance of the query.) The one caveat to that statement is that you must be careful when doing joins in a single-partitioned stored procedure that all records are within that partition. In general, if your tables are both partitioned -- or if you are not selecting based on the partitioning key, as in your example -- you will need a multi-partitioned stored procedure...

Many thanks,
Andrew

Hi Rouz,

Thanks so much for bringing this to our attention. It turns out there is a problem with joins between two (or more) partitioned tables. As you discovered yourself, joins between one partitioned table and one or more replicated tables operate properly and we are happy to hear this works out for your situation.

For the time being please avoid multi-partition joins between two or more partitioned tables. At the next release, the compiler will catch and report this as an error. Even when we do allow such joins in the future, they run counter to VoltDB's basic
architecture and so their performance is likely to be an issue for most applications. They are not recommended.

Thanks,

Andrew
Hi Andrew, Glad to be of
rsafaie
Apr 16, 2010
Hi Rouz,

Thanks so much for bringing this to our attention. It turns out there is a problem with joins between two (or more) partitioned tables. As you discovered yourself, joins between one partitioned table and one or more replicated tables operate properly and we are happy to hear this works out for your situation.

For the time being please avoid multi-partition joins between two or more partitioned tables. At the next release, the compiler will catch and report this as an error. Even when we do allow such joins in the future, they run counter to VoltDB's basic
architecture and so their performance is likely to be an issue for most applications. They are not recommended.

Thanks,

Andrew

Hi Andrew,

Glad to be of assistance, I was actually putting together a small test app to highlight the problem as I wasn't attempting a single partition query when the problem occured, but spotted this before I got to far in and looks like you have found the problem now.

Just to add (not entirely sure how I didn't see this the other day), when I attempt to run a join that causes this problem inside a stored procedure I get the following exception:

org.voltdb.client.ProcCallException:
===============================================================================
VOLTDB ERROR: UNEXPECTED FAILURE:
Possible Null Pointer Exception
at com.rsafaie.voltdb.procedures.GetAssetProcedure.run(GetAssetProcedure.java:27)
===============================================================================
at org.voltdb.client.SyncCallback.result(SyncCallback.java:82)
at org.voltdb.client.ClientImpl.callProcedure(ClientImpl.java:140)
at com.rsafaie.voltdb.App.main(App.java:25)

GetAssetProcedure.java:27

VoltTable[] voltExecuteSQL = voltExecuteSQL();

At a wild guess it is might be something to do with your code thinking it has results to returns (which it should do if there wasn't this bug) but at the point you come to send the results a null pointer is thrown because the results it thinks are there aren't actually there).

I expect you have probably found this now though!

Let me know if this isn't an obvious consequence of the problem found and I will complete my test app and post!

Thanks,
Rouz
Hi Rouz, Yes. The Null
ajgent
Apr 16, 2010
Hi Andrew,

Glad to be of assistance, I was actually putting together a small test app to highlight the problem as I wasn't attempting a single partition query when the problem occured, but spotted this before I got to far in and looks like you have found the problem now.

Just to add (not entirely sure how I didn't see this the other day), when I attempt to run a join that causes this problem inside a stored procedure I get the following exception...

Thanks,
Rouz

Hi Rouz,

Yes. The Null Pointer Exception is the condition we reproduced and helped us identify the root cause. Thanks!

Andrew