Forum: Building VoltDB Applications

Post: VoltDB and Custom Search

VoltDB and Custom Search
MartinKersten
Feb 26, 2014
Hello,

I understand how VoltDB is architected and the implications that brings. We currently need to implement a search that combines a geo search with an attribute selection. The idea is to store a sorted array of attributes per user and filter on it while searching for users.

Since this involves somewhere around 10k+ entries after index access, it should be easy. The array would simply be a blob that can be accessed byte-wise to extract the integer values(?). Each array might be around 100 bytes or less, so we would access just about 1 MB of data during the search.
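Something like this sketch is what I have in mind for the byte-wise access (plain Java; the name is made up, and I am assuming 4-byte ints in the blob):

import java.nio.ByteBuffer;

public class AttributeCodec {
    // Unpack a sorted array of ints from the attribute blob.
    // ByteBuffer defaults to big-endian byte order.
    public static int[] unpack(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int[] attrs = new int[blob.length / 4]; // 4 bytes per int
        for (int i = 0; i < attrs.length; i++) {
            attrs[i] = buf.getInt();
        }
        return attrs;
    }
}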

Since all those stored procedures are executed in parallel, we would just add more replicas to handle the load and be done.

My question is about insertion versus search. We are trying to use VoltDB for quite a large, high-volume site where maybe 1,000 changes per second are necessary to track the users' geo positions.

VoltDB should drive a complete social application (website + mobile), for which we managed to use isolated tables with little interconnection (mostly no foreign keys, since users cannot be removed and neither can languages).

Is there any problem with that? Is this too much load?

When I run a query via the SQL part of a stored procedure, do I access the actual in-memory object, or do I just get a copy (important for the memory footprint, etc.)?
bballard
Feb 26, 2014
Hi Martin,

The interface between a stored procedure (running in java) and the SQL Execution Engine (running in C++) is when you call voltExecuteSQL(). At that point, the queued statements and parameters are passed across the interface, the SQL is executed, and the results are returned to the java stored procedure as a copy of data within an array of VoltTable objects. For performance, you would want to use indexes as much as possible to reduce the results returned to the java stored procedure, and minimize the amount of filtering you would need to do in java on the contents of the byte array. If there was no way around doing much of the searching in java code, it would certainly help if you could do many such searches in parallel by using a single-partition procedure.
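For example, a minimal single-partition procedure might look like the sketch below. The table and column names are made up for illustration, and you would declare the table and procedure as partitioned on the geo-hash column in your DDL:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class UsersInCell extends VoltProcedure {

    // Hypothetical table and columns, for illustration only.
    public final SQLStmt selectUsers = new SQLStmt(
        "SELECT UserId, UserAttributes FROM Users " +
        "WHERE GeoHashA = ? ORDER BY UserId;");

    public VoltTable[] run(int geoHash) {
        voltQueueSQL(selectUsers, geoHash);
        // The queued statement and parameters cross into the C++ engine here;
        // the results come back as copies inside VoltTable objects.
        return voltExecuteSQL(true);
    }
}

Because each invocation routes to a single partition based on the geo-hash parameter, many such searches can run in parallel across partitions.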

Without knowing more about the schema, procedures, and partitioning, it's hard to say how much is too much load, but if you'd like to share some details I'd be happy to review them and offer suggestions on how to optimize performance. You can contact me directly at bballard (at) voltdb (dot) com.

Thanks,
Ben
MartinKersten
Feb 26, 2014
bballard wrote:
> The interface between a stored procedure (running in java) and the SQL Execution Engine (running in C++) is when you call voltExecuteSQL(). [...] the results are returned to the java stored procedure as a copy of data within an array of VoltTable objects.


The fun part is that it would be a single-partition procedure. Since the result is a copy, I would just use a binary blob hiding a sorted array of integers (maybe even shorts). It is simply a geo-hashed version of a group of people sharing the same area. Those people would be filtered by attributes, but one would also like to sum up the attributes that all the people offer. Simple as that. The sum would be a faceted search.
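The faceted sum would be roughly a sketch like this (plain Java; everything here is hypothetical, and I assume each blob packs 4-byte ints as described above):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetSum {
    // Count how often each attribute occurs across all users' blobs.
    public static Map<Integer, Integer> facets(List<byte[]> blobs) {
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (byte[] blob : blobs) {
            ByteBuffer buf = ByteBuffer.wrap(blob);
            while (buf.remaining() >= 4) {
                int attr = buf.getInt();
                Integer c = counts.get(attr);
                counts.put(attr, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }
}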

The table would not be updated frequently, just every five or ten minutes as people move around, or more often if they move quickly.

So it would require a setup where one can easily replicate those partitions, let's say 20 times or even more, and perform such searches in parallel, since this would be a core aspect of the social app.

The table would be rather simple: UserId, GeoHashA, GeoHashB, etc., PositionX, PositionY, UserAttributes [around 100 bytes].

Since we can only filter through the appropriate geo hash (which would be just an integer of a certain resolution), we would sort the results by UserId (do we have composite indexes, like Index(GeoHashA, UserId)?), return the first 1000+ rows, filter those, calculate the distances, and request the next batch until all users are processed.
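The batching I have in mind would be a sketch like this (assuming the answer to the composite index question is yes; all names are hypothetical):

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class NextUserBatch extends VoltProcedure {

    // Keyset paging over a (GeoHashA, UserId) composite index;
    // the caller passes the last UserId of the previous batch (or -1 to start).
    public final SQLStmt nextBatch = new SQLStmt(
        "SELECT UserId, PositionX, PositionY, UserAttributes FROM Users " +
        "WHERE GeoHashA = ? AND UserId > ? ORDER BY UserId LIMIT 1000;");

    public VoltTable[] run(int geoHash, long lastUserId) {
        voltQueueSQL(nextBatch, geoHash, lastUserId);
        return voltExecuteSQL(true);
    }
}

Each call returns at most 1000 rows, and the client keeps feeding the last UserId back in until a batch comes back short.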

The result would just be UserId x Distance pairs, plus maybe the faceted hull of all attributes (though that is likely to become a second invocation against another table storing the same information, to allow a different pass, since timing might become a problem; but that is part of the benchmarking to be done later on).