Forum: Building VoltDB Applications

Post: group by a function column

group by a function column
t0v
Mar 4, 2013
i've a SQL statement like
SELECT FLOOR((1362525639563000 - time_stamp) / 1800000000) AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN 1362439239563000 AND 1362525639563000 GROUP BY TS;

but looks like i cannot alias a column that's using an inbuilt function, e.g. just
SELECT FLOOR((1362525639563000 - time_stamp) / 1800000000) AS TS, sample_data AS SSD FROM my_table WHERE time_stamp BETWEEN 1362439239563000 AND 1362525639563000 LIMIT 20;

doesn't output the 'TS' as the alias for the 1st column.

if i run it w/o FLOOR, i can alias the column, but still GROUP BY does not work.

any idea, how to overcome this?

neither does this work
SELECT (1362525639563000 - time_stamp) / 1800000000 AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN 1362439239563000 AND 1362525639563000 GROUP BY TS;


and i cannot add a column with the 1st value to the table, since the divider (1800000000) will change from query to another...

regards
t :)
pmartel
Mar 6, 2013
t,

I believe that I have been able to find or reproduce three issues that somewhat match your descriptions.

An older issue that appears to have been resolved in a recent release is that, as you describe, column aliases do not work with builtin functions and other expressions.
This is documented as issue ENG-4309 which is listed as still open -- though it does seem to be fixed already in the current master version of our upcoming 3.1 release -- available now as open source via github.


A second issue is that although we (now) allow column aliases on all column expressions including built-in functions for the purposes of output labeling, we do not allow aliases to be used in the GROUP BY clause.

select NUM as X, FLOOR(NUM) as Y from R1 where ID < 0;

X Y
-- --


select ID as X, ABS(ID) as Y, SUM(NUM) as Z from R1 group by X where ID < 0;

Unexpected Ad Hoc Planning Error: Error compiling query: org.voltdb.planner.PlanningErrorException: user lacks privilege or object not found: X

The workaround is to duplicate the original column name in the GROUP BY:

select ID as X, ABS(ID) as Y, SUM(NUM) as Z from R1 group by ID where ID < 0;

X Y Z
-- -- --

I have logged this defect into our tracking system as issue ENG-4446 . It seems like a useful short-cut.
That said, I was unable to quickly establish whether this particular use of an alias is SQL standard or possibly a common vendor extension.
I know that an alias can be used in an ORDER BY clause, and VoltDB supports that.
Have you used 'GROUP BY alias' with any other SQL databases?


The third issue is the lack of support for expressions in the GROUP BY clause, with or without aliases. GROUP BY expressions are currently restricted to lists of columns.

You have anticipated the simplest workaround for this -- storing the expression as a column -- but have suggested that it will not work for you because other arguments to the expression will change from one query to another.

Yet it is conceivable that the relatively low cost of write operations in VoltDB may yet allow this workaround. You would simply have to "refresh" the column within each query's transaction (stored proc invocation) for example, in place of:

SELECT FLOOR((1362525639563000 - time_stamp) / 1800000000) AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN 1362439239563000 AND 1362525639563000 GROUP BY TS;

consider passing (1362525639563000, 1800000000, 1362439239563000, 1362525639563000) as parameters (A, B, C, A) into:

UPDATE my_table SET SCALED_TS = FLOOR((? - time_stamp) / ?) WHERE time_stamp BETWEEN ? AND ?;

then passing (1362439239563000, 1362525639563000) as parameters (C, A) into:

SELECT SCALED_TS AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN ? AND ? GROUP BY SCALED_TS;

If you often (sequentially) re-use the same divisor(s), and never or rarely update the time_stamp field, you might even want to benchmark an attempt to keep SCALED_TS in sync and only update it as needed:

consider passing (1800000000, 1362525639563000, 1800000000, 1362439239563000, 1362525639563000, 1800000000) as parameters (B, A, B, C, A) into:

UPDATE my_table SET TS_SCALE = ?, SCALED_TS = FLOOR((? - time_stamp) / ?) WHERE time_stamp BETWEEN ? AND ? AND TS_SCALE <> ?;

Of course, in neither case will these UPDATEs be "cost free", so you would have to benchmark them to determine whether they provide an acceptably performing work-around.

The ability to GROUP BY general expressions is on our current feature roadmap. Its status is tracked as issue ENG-3291

Thank you for reporting these issues.
I hope that you will be able to take advantage of the new version and the workaround(s) I have described.

Paul Martel
VOLTDB Engineer
pmartel
Mar 6, 2013
t,

I believe that I have been able to find or reproduce three issues that somewhat match your descriptions.

An older issue that appears to have been resolved in a recent release is that, as you describe, column aliases do not work with builtin functions and other expressions.
This is documented as issue ENG-4309 which is listed as still open -- though it does seem to be fixed already in the current master version of our upcoming 3.1 release -- available now as open source via github.


A second issue is that although we (now) allow column aliases on all column expressions including built-in functions for the purposes of output labeling, we do not allow aliases to be used in the GROUP BY clause.

select NUM as X, FLOOR(NUM) as Y from R1 where ID < 0;

X Y
-- --


select ID as X, ABS(ID) as Y, SUM(NUM) as Z from R1 group by X where ID < 0;

Unexpected Ad Hoc Planning Error: Error compiling query: org.voltdb.planner.PlanningErrorException: user lacks privilege or object not found: X

The workaround is to duplicate the original column name in the GROUP BY:

select ID as X, ABS(ID) as Y, SUM(NUM) as Z from R1 group by ID where ID < 0;

X Y Z
-- -- --

I have logged this defect into our tracking system as issue ENG-4446 . It seems like a useful short-cut.
That said, I was unable to quickly establish whether this particular use of an alias is SQL standard or possibly a common vendor extension.
I know that an alias can be used in an ORDER BY clause, and VoltDB supports that.
Have you used 'GROUP BY alias' with any other SQL databases?


The third issue is the lack of support for expressions in the GROUP BY clause, with or without aliases. GROUP BY expressions are currently restricted to lists of columns.

You have anticipated the simplest workaround for this -- storing the expression as a column -- but have suggested that it will not work for you because other arguments to the expression will change from one query to another.

Yet it is conceivable that the relatively low cost of write operations in VoltDB may yet allow this workaround. You would simply have to "refresh" the column within each query's transaction (stored proc invocation) for example, in place of:

SELECT FLOOR((1362525639563000 - time_stamp) / 1800000000) AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN 1362439239563000 AND 1362525639563000 GROUP BY TS;

consider passing (1362525639563000, 1800000000, 1362439239563000, 1362525639563000) as parameters (A, B, C, A) into:

UPDATE my_table SET SCALED_TS = FLOOR((? - time_stamp) / ?) WHERE time_stamp BETWEEN ? AND ?;

then passing (1362439239563000, 1362525639563000) as parameters (C, A) into:

SELECT SCALED_TS AS TS, SUM(sample_data) AS SSD FROM my_table WHERE time_stamp BETWEEN ? AND ? GROUP BY SCALED_TS;

If you often (sequentially) re-use the same divisor(s), and never or rarely update the time_stamp field, you might even want to benchmark an attempt to keep SCALED_TS in sync and only update it as needed:

consider passing (1800000000, 1362525639563000, 1800000000, 1362439239563000, 1362525639563000, 1800000000) as parameters (B, A, B, C, A) into:

UPDATE my_table SET TS_SCALE = ?, SCALED_TS = FLOOR((? - time_stamp) / ?) WHERE time_stamp BETWEEN ? AND ? AND TS_SCALE <> ?;

Of course, in neither case will these UPDATEs be "cost free", so you would have to benchmark them to determine whether they provide an acceptably performing work-around.

The ability to GROUP BY general expressions is on our current feature roadmap. Its status is tracked as issue ENG-3291

Thank you for reporting these issues.
I hope that you will be able to take advantage of the new version and the workaround(s) I have described.

Paul Martel
VOLTDB Engineer
t0v
Mar 6, 2013
thank you for the answer.
at least in MySQL you can use 'GROUP BY alias' or you can just do 'GROUP BY 1' where 1 is obv the 1st column of your SELECT.

i'll benchmark, how update performs. the issue might be that we have like 80M records in the table, but if VoltDB is fast, no problem with that :)

regards
t :)
t0v
Mar 6, 2013
ok, i tried the UPDATE workaround, but obviously i hit the 'More than 100 MB of temp table memory used while executing SQL.' limit.

(edit start)
is there any way to find out, how much should i allow for the temptables?
i tried 500 MB and still get the error...
with 5000 MB it finally succeeded when 17M rows were updated :)
but having 23M rows to be updated, it fails again with 'More than 904 MB of temp table memory used while executing SQL.'
so, looks like that 904 MB is some hard limit even tho i specified 5000 MB in the deployment.xml
(end edit)

or should i just keep the scaled columns up to date every time when i insert a record into that table?
since there are 3 different scaling factors and the using those during the INSERT changes the behavior a bit, but i guess we can live with that.

regards
t :)
t0v
Mar 8, 2013
at least with 23M rows, the trick is to update all the rows (i.e. leave out the WHERE time_stamp BETWEEN ? AND ?)