Single partiton stored procedures are run serially to completion by each partition replica ("site" in VoltDB parlance). Yes, if you block a stored procedure on IO, you will block all progress at that partition replica.
Note carefully: the java procedure runs at *each* replica. Procedures must be deterministic. We do not encourage users to write procedures with non-database side-effects. However, if such procedures are written, all side-effects must be idempotent.
Multi-partition procedures are run to completion at a single selected partition replica that coordinates the distributed transaction with the participating remote partition replicas. There are existing optimizations to eliminate blocking at the remote replicas (and more being investigated / under development). But in the most general case, multi-partition procedures will stall the cluster until committed.
My architectural suspicion is that if you want to couple a second data store with VoltDB, you may desire a looser coupling than direct access via VoltDB stored procedures.
Firstly, I would like to say that I'm not sure if I understand the sentence "Single partiton stored procedures are run serially to completion by each partition replica ("site" in VoltDB parlance)." properly. Does this apply only to stored procedure that make changes, so that all replica keep in sync, or also to ones that only do selects? I cannot see a reason to perform the same select on the same data in several JVMs...
So, the idea is that stored procedure represents a single transaction. Since some processes are single-partition and others are not, and multi-partition transactions block the whole cluster, one might get a better performance by splitting multi-partition transactions in several single-partition transactions, if possible. Also, by running the whole stored procedure serially in a single-thread, you save yourself synchronization, which makes your code very fast, but only if the stored procedure does nothing that "takes long", and is kept to the absolute minimum.
Having stored procedures run in a (bound) thread pool, and allowing them to perform long-running operations would make stored procedure much more "versatile". It would save having many round-trips to the client, since socket communication is many times slower then in-process communication, even on the same machine. What I'm trying to achieve is some practical way of doing processing inside the VoltDB JVM, but outside of a transaction.
I think this could be best realized with a “second type” of stored procedures which would allow multiple transactions. Those stored procedures would run in a thread pool, outside of the VoltDB thread and the only API call they would be able perform would be to call other stored procedures, which would run in the VoltDB thread. Like that, they could perform pre-processing, post-processing, and even processing between transactions, without blocking the whole cluster. They could take long without hurting performance, but they would look like normal stored procedure to the client, just “slower”. They would also reduce the number of multi-partition transactions, therefore increasing the scalability. They could be defined “implicitly” by simply having no SQL statements in the class. Since all SQL has to be defined statically in the class, having no SQL is a hint that the stored procedure cannot perform anything in the VoltDB thread, and therefore can be run in some other thread pool.
They might be programmed in such a way that they take care of errors themselves, so it might then be interesting to also add the ability to call them without waiting for the results, if they don't have to return anything useful. After that, the next logical step would be to integrate a task scheduler (like Quartz) in VoltDB. It could store it's own internal state in VoltDB itself, and move the processing of tasks to the appropriate VoltDB partition, assuming the VoltDB API offers helpers for that. The tasks would be like agents that move to where the data is.
I think this would be generally useful, and is probably realizable with little code. What do you think?One particular use would be a stored procedure that updates every row in a big table while the system is online. The work could be divided into many small "batches" that are performed by calls to a "real" stored procedure, while keeping the database responsive by allowing other stored procedures to execute between the batches. Is their some kind of cursor implementation that could be used?