ANR-05-CIGC GCPMF (Financial computing on Grid) (Application)
Conducted by: Sebastien Bezzine, Stephane Vialle
Description: Design and experimentation of a multi-paradigm Grid architecture for financial computing, with fault tolerance and respect of time constraints on large-scale architectures.
We decided to build a system for distributing financial computations because:
- Risk computation is a critical issue in finance
- Risk computation is computationally intensive
- Parallelism is needed both to speed up computations and to scale up problem sizes
- Respect of time constraints is mandatory
- Distributed systems frequently encounter failures (or some resources disappear temporarily)
The system runs on ProActive, a Java Grid library for parallel, distributed and concurrent computing developed by INRIA Sophia-Antipolis (near Nice, France). We also experiment with JavaSpaces, which provide a virtual shared memory. The Grid architecture keeps PCs in reserve to compensate for failures.
The different processes composing this architecture need to be deployed and started beforehand on the different processors. No additional deployment time is required after this initialization of the system.
Load balancing is achieved by dividing the work into independent sets of jobs and submitting a new sub-set to a worker as soon as it is done with its previous task.
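The load-balancing scheme described above can be illustrated with a minimal sketch in plain Java threads. It does not use ProActive; the class name PullScheduler, the method names and the placeholder price function are assumptions made for illustration only. Each worker thread pulls a new subset of jobs from a shared queue as soon as it has finished the previous one, so faster workers naturally process more subsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.DoubleAdder;

/** Hypothetical sketch of pull-based load balancing; not ProActive code. */
class PullScheduler {

    /** Splits job indices [0, jobCount) into subsets of at most subsetSize jobs. */
    static Queue<int[]> split(int jobCount, int subsetSize) {
        Queue<int[]> subsets = new ConcurrentLinkedQueue<>();
        for (int start = 0; start < jobCount; start += subsetSize) {
            int end = Math.min(start + subsetSize, jobCount);
            subsets.add(new int[] {start, end}); // half-open range [start, end)
        }
        return subsets;
    }

    /** Runs workerCount threads; each one takes a new subset as soon as it is idle. */
    static double run(int jobCount, int subsetSize, int workerCount) {
        Queue<int[]> subsets = split(jobCount, subsetSize);
        DoubleAdder total = new DoubleAdder();
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            Thread t = new Thread(() -> {
                int[] range;
                while ((range = subsets.poll()) != null) { // null means no work is left
                    double partial = 0.0;
                    for (int job = range[0]; job < range[1]; job++) {
                        partial += price(job);
                    }
                    total.add(partial); // send the partial result back
                }
            });
            workers.add(t);
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return total.sum();
    }

    /** Placeholder for one independent job (assumption: a real job would price an asset). */
    static double price(int job) {
        return job;
    }
}
```

Calling PullScheduler.run(100, 7, 4) splits 100 jobs into 15 subsets and lets 4 workers drain them. The subset size is the tuning knob: smaller subsets balance the load better but cost more scheduling round trips.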
To detect faults (a network connection loss, or a crash of the JVM or of the supporting node), a simple probe mechanism is implemented: the server regularly pings its sub-servers, and each sub-server regularly pings its workers. If the probed element fails to respond in time, it is considered faulty. To improve fault-recovery time, sub-servers regularly checkpoint the results received from the workers with the server.
When a worker disappears, the sub-server responsible for that worker first requests a node from the reserve pool. Next, it restarts a worker on that node and provides it with the task the faulty worker was in charge of. If the reserve pool is empty, the system runs in reduced mode, with a missing worker.
A slightly more complex situation arises when a sub-server fails to respond. In that case, the server requests a new node from the reserve pool; if the pool is empty, the server's node is used (since the server does not usually perform much computation, we expect this reduced mode not to affect performance significantly). A new sub-server is started, the server provides it with the interrupted task and the last checkpointed state of the dead sub-server, and each worker from the initial group must be re-attached to the new sub-server. Depending on the timing of the failure (from the worst case, right before a sub-server checkpoints or a worker sends its results, to the most favourable one, just after), and on the number of concomitant failures, the global computing time is more or less affected. This matter is further discussed in section.
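As an illustration of the worker-recovery path, here is a minimal, deterministic sketch of one probe round on a sub-server. It is not the project's code: the class SubServerSketch, its methods and the respondsInTime predicate (which stands in for a real ping with a timeout) are all hypothetical. The text does not say what happens to the task of a lost worker in reduced mode, so this sketch simply returns such tasks to the caller, which is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of probe-and-replace recovery; all names are illustrative. */
class SubServerSketch {
    private final Map<String, String> taskByNode = new LinkedHashMap<>(); // node -> task
    private final Deque<String> reservePool = new ArrayDeque<>();

    SubServerSketch(Map<String, String> assignments, List<String> reserveNodes) {
        taskByNode.putAll(assignments);
        reservePool.addAll(reserveNodes);
    }

    /**
     * Probes every worker once. A worker that does not respond is considered
     * faulty and is replaced by a reserve node when one is available.
     * Returns the tasks left without a worker (reduced mode; assumption).
     */
    List<String> probeRound(Predicate<String> respondsInTime) {
        List<String> orphanedTasks = new ArrayList<>();
        // Iterate over a snapshot because recovery modifies the map.
        for (String node : new ArrayList<>(taskByNode.keySet())) {
            if (respondsInTime.test(node)) {
                continue; // worker is alive
            }
            String task = taskByNode.remove(node); // drop the faulty worker
            String spare = reservePool.poll();     // request a node from the reserve
            if (spare != null) {
                taskByNode.put(spare, task);       // restart a worker with the same task
            } else {
                orphanedTasks.add(task);           // reserve is empty: reduced mode
            }
        }
        return orphanedTasks;
    }

    /** Returns a copy of the current node-to-task assignments. */
    Map<String, String> assignments() {
        return new LinkedHashMap<>(taskByNode);
    }
}
```

With three workers and a single reserve node, two simultaneous worker failures leave one task re-hosted on the reserve node and one task orphaned. In a real deployment the probe round would run periodically, and the same pattern would apply one level up, between the server and its sub-servers.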
It is worth noting that the programmer of an application (a financial one, for instance) does not need to take care of deployment or fault-recovery issues: thanks to the use of Java Generics, as long as they extend our classes, our architecture deals with those issues itself.
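The following sketch shows how a generic base class can keep fault handling out of application code. The text does not give the project's actual class names or signatures, so GridTask, compute, merge, runAll and SquaredExposure are hypothetical, and a simple local retry stands in for the real fault-recovery machinery.

```java
import java.util.List;

/** Hypothetical generic base class; the project's real class names are not given. */
abstract class GridTask<I, R> {

    /** Application code: process one independent job. */
    protected abstract R compute(I input);

    /** Application code: combine two partial results. */
    protected abstract R merge(R left, R right);

    /** Framework side: runs every job and merges the partial results. */
    final R runAll(List<I> inputs, int maxAttempts) {
        R result = null;
        for (I input : inputs) {
            R partial = attempt(input, maxAttempts);
            result = (result == null) ? partial : merge(result, partial);
        }
        return result;
    }

    /** Framework side: retries a failed job (stand-in for restarting a worker). */
    private R attempt(I input, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return compute(input);
            } catch (RuntimeException e) {
                last = e; // treated here like a detected worker failure
            }
        }
        throw new IllegalStateException("job failed after " + maxAttempts + " attempts", last);
    }
}

/** Example application class: only the domain logic is written by the programmer. */
class SquaredExposure extends GridTask<Double, Double> {
    @Override
    protected Double compute(Double x) {
        return x * x;
    }

    @Override
    protected Double merge(Double a, Double b) {
        return a + b;
    }
}
```

The type parameters I and R let the framework code move inputs and results around without knowing their concrete types, while the compiler still checks that compute and merge agree with each other.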
Financial algorithms fall into different categories, each of which is best suited to a particular programming paradigm (shared memory, message passing, remote procedure call). To ease the programming of parallel algorithms with frequent inter-processor communications and intensive data sharing, a JavaSpace can be instantiated on demand on a sub-server to offer a virtual shared memory to the group of workers it is in charge of.
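To give an idea of the shared-memory paradigm, the sketch below is a small in-memory stand-in for a shared space with write, read and take operations. It deliberately does not use the JavaSpaces API (which matches entries against templates and works with leases and transactions); the class MiniSpace and its key-based lookup are simplifying assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** In-memory stand-in for a shared space; it does not use the JavaSpaces API. */
class MiniSpace<V> {
    private final Map<String, BlockingQueue<V>> entries = new ConcurrentHashMap<>();

    private BlockingQueue<V> queue(String key) {
        return entries.computeIfAbsent(key, k -> new LinkedBlockingQueue<>());
    }

    /** Publishes a value under a key, visible to every worker sharing the space. */
    void write(String key, V value) {
        queue(key).add(value);
    }

    /** Returns a value stored under the key without removing it, or null if none. */
    V read(String key) {
        return queue(key).peek();
    }

    /** Removes and returns a value, waiting up to timeoutMillis; null on timeout. */
    V take(String key, long timeoutMillis) {
        try {
            return queue(key).poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        }
    }
}
```

A worker could, for example, write an intermediate price under a key such as "spot" and let its peers read it, while take gives exclusive ownership of an entry to a single worker, which is useful for handing out work items.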
Tools used: No information
Shared by: Sebastien Bezzine, Stephane Vialle