Monday, February 23, 2009

Batch Process in Clustered Environment

This week I had to answer some questions about running batch processes in a clustered environment. Our prospective client, a big corporation with a sophisticated infrastructure, wants to make sure that the application to be deployed will run well in their clustered environment. Having a clustered setup for high availability with failover and fault tolerance is really nice. But a single point of failure in the batch process would surely spoil the fun.

It seems they have encountered problems in the past with batch processes that disrupted the production application. That's why they need to make sure that every application to be installed is robust enough to survive some disturbance and provides the same level of resilience as their software components running on the JEE application servers.

When talking about a clustered environment, we usually aim for high availability and failover. The data we are processing should stay consistent, and the application should not be crippled when one of the clustered machines goes down, whether intentionally or unintentionally.

When talking about batch processing, the points we are trying to achieve are:

  • restartability

  • idempotence or rerunability


Since batch processes typically fall into the class of long-running processes, it is usually very expensive if they have to start the whole thing over from the beginning.

For example, if you have a batch process that calculates the billing for a public utility company, the billing generation itself could easily take 10 hours. If after 9 hours the machine crashes, the power cuts out, et cetera, it would be too expensive to restart the whole billing generation from scratch.

Somehow this kind of long-running process should persist its state at certain checkpoints, so that it can pick up at the last checkpoint and continue from there.
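As a rough sketch of that idea in plain JDBC (the billing_checkpoint table, the job name and the generateBills stub below are all hypothetical, they just show the pattern), the job reads its checkpoint on startup and commits each chunk of work together with the new checkpoint:

    import java.sql.*;

    public class CheckpointedBillingJob {

        private static final String JOB = "monthly-billing";
        private static final int CHUNK_SIZE = 1000;

        public static void run(Connection con) throws SQLException {
            con.setAutoCommit(false);
            long lastId = loadCheckpoint(con);          // 0 on a fresh run

            while (true) {
                long newLastId = generateBills(con, lastId, CHUNK_SIZE);
                if (newLastId == lastId) {
                    break;                              // no more accounts to process
                }
                saveCheckpoint(con, newLastId);         // checkpoint in the same transaction
                con.commit();                           // chunk and checkpoint become durable together
                lastId = newLastId;
            }
            con.commit();
        }

        static long loadCheckpoint(Connection con) throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT last_account_id FROM billing_checkpoint WHERE job_name = ?")) {
                ps.setString(1, JOB);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getLong(1) : 0L;
                }
            }
        }

        static void saveCheckpoint(Connection con, long lastAccountId) throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE billing_checkpoint SET last_account_id = ? WHERE job_name = ?")) {
                ps.setLong(1, lastAccountId);
                ps.setString(2, JOB);
                ps.executeUpdate();
            }
        }

        // Bill the next chunk of accounts with id > afterId and return the highest
        // id that was processed, or afterId itself when nothing is left.
        static long generateBills(Connection con, long afterId, int chunkSize) throws SQLException {
            // ... the real billing logic goes here ...
            return afterId;
        }
    }

The important detail is that the checkpoint update rides in the same transaction as the chunk it describes, so after a crash the work done and the recorded progress can never disagree.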

I remember the time when we had to generate 400 MB of dummy data to test our system. The generation itself took more than 3 days to run (running simple SQL INSERT statements). If you put all the rows in one transaction, both the commit and the rollback will take a very long time. The opposite extreme poses its own problem: if you use auto commit and commit after every single INSERT operation, the performance will also be very poor.

We decided to commit the INSERTs after every fixed number of rows (we chose 10,000 rows in PostgreSQL), sacrificing atomicity for performance. We then had to keep track of how far the generation had progressed so that we would not repeat ourselves.
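The original code is long gone, but the pattern is easy to sketch in plain JDBC; the dummy_data table and its columns here are made up purely for illustration:

    import java.sql.*;

    public class DummyDataGenerator {

        private static final int COMMIT_INTERVAL = 10_000;   // commit every 10,000 rows

        // Insert rows [startRow, totalRows) and commit after every COMMIT_INTERVAL rows.
        // On a restart, pass the last committed id + 1 as startRow to avoid repeating rows.
        public static void generate(Connection con, long startRow, long totalRows) throws SQLException {
            con.setAutoCommit(false);                         // we control the transaction boundaries
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO dummy_data (id, payload) VALUES (?, ?)")) {
                long inserted = 0;
                for (long row = startRow; row < totalRows; row++) {
                    ps.setLong(1, row);
                    ps.setString(2, "row-" + row);
                    ps.addBatch();
                    inserted++;

                    if (inserted % COMMIT_INTERVAL == 0) {
                        ps.executeBatch();
                        con.commit();                         // a crash now loses at most one interval of work
                    }
                }
                ps.executeBatch();                            // flush and commit whatever is left over
                con.commit();
            }
        }
    }

After a crash, a simple SELECT max(id) FROM dummy_data tells you the last committed row, so the generator can resume from the next one instead of repeating work or starting over.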

The same applies when we run a long-running batch process in a clustered deployment. We expect the component to be able to pick up where it left off in case a failure happens. We don't want the component to assume it has finished, nor do we want it to restart from the beginning.
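One way to make such a rerun safe is to make the writes themselves idempotent, so that replaying the chunk that was in flight when the failure happened cannot produce duplicates. A small sketch, again with a hypothetical invoice table:

    import java.sql.*;

    public class IdempotentInvoiceWriter {

        // Skip accounts that already have an invoice for this billing period, so
        // replaying a chunk after a crash (or a full rerun) cannot create duplicates.
        // A unique constraint on (account_id, period) is a good extra safety net.
        static void writeInvoice(Connection con, long accountId, String period, long amount)
                throws SQLException {
            try (PreparedStatement check = con.prepareStatement(
                    "SELECT 1 FROM invoice WHERE account_id = ? AND period = ?")) {
                check.setLong(1, accountId);
                check.setString(2, period);
                try (ResultSet rs = check.executeQuery()) {
                    if (rs.next()) {
                        return;  // already billed on a previous run
                    }
                }
            }
            try (PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO invoice (account_id, period, amount) VALUES (?, ?, ?)")) {
                insert.setLong(1, accountId);
                insert.setString(2, period);
                insert.setLong(3, amount);
                insert.executeUpdate();
            }
        }
    }

With a guard like this in place, rerunning the job after a failure is safe: accounts that were already billed are simply skipped.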