MPICH-V3: Hierarchical Fault Tolerance for the Grid (Middleware)
Conducted by: Camille Coti, Thomas Herault, Eric Rodriguez
Description: One of the major issues when running Message Passing Interface (MPI) applications over a large-scale grid is making efficient use of the grid's hierarchical topology. MPI offers some solutions, such as using different communicators for communication inside a single cluster and for communication between clusters. When transparent fault tolerance mechanisms are integrated into such deployments, however, the global nature of most fault tolerance algorithms breaks this hierarchical design. In a Chandy-Lamport algorithm, for example, a message must be sent on every communication channel, including those that cross communicators, to flush the messages that may be in transit on them. A current challenge of fault tolerance for the grid is therefore to design a hierarchical fault tolerance mechanism. Solutions based on the composition of classical fault tolerance protocols are attractive, but they may not be feasible or may introduce a high overhead into the system. We are investigating different composition techniques. One of the most promising for the grid builds on the first fault-tolerant protocol studied in the MPICH-V project: MPICH-V1. MPICH-V1 uses Channel Memories to relay and log messages between computing nodes. In a hierarchical deployment over the grid, the idea would be to use Channel Memories between clusters and direct communication inside a single cluster. Since inter-cluster messages already pass through many routers, adding another hop may not hurt performance significantly. Experiments are being conducted to evaluate the best solution that provides both relay and logging of messages. A fault tolerance protocol for the hierarchical grid will then be evaluated on Grid5000.
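The routing idea in the description above (direct communication inside a cluster, relay and logging through a Channel Memory between clusters) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not MPICH-V code: the class names `ChannelMemory` and `HierarchicalRouter`, the `cluster_of` mapping, and the in-memory inboxes are all hypothetical, and no real MPI calls are involved.

```python
from collections import defaultdict


class ChannelMemory:
    """Hypothetical relay that logs each inter-cluster message before forwarding it."""

    def __init__(self):
        # destination rank -> list of (source rank, payload), in arrival order
        self.log = defaultdict(list)

    def relay(self, src, dst, payload, inboxes):
        self.log[dst].append((src, payload))  # log first, so the message survives a crash of dst
        inboxes[dst].append((src, payload))   # then forward to the destination

    def replay(self, dst):
        """Return the logged messages a restarted process would receive again."""
        return list(self.log[dst])


class HierarchicalRouter:
    """Sends directly inside a cluster, through the Channel Memory between clusters."""

    def __init__(self, cluster_of):
        self.cluster_of = cluster_of          # rank -> cluster name (assumed known)
        self.inboxes = defaultdict(list)
        self.channel_memory = ChannelMemory()
        self.direct = 0                       # count of intra-cluster sends
        self.relayed = 0                      # count of inter-cluster sends

    def send(self, src, dst, payload):
        if self.cluster_of[src] == self.cluster_of[dst]:
            # Same cluster: direct communication, no extra hop and no logging here.
            self.inboxes[dst].append((src, payload))
            self.direct += 1
        else:
            # Different clusters: one extra hop through the Channel Memory.
            self.channel_memory.relay(src, dst, payload, self.inboxes)
            self.relayed += 1


# Two clusters of two ranks each (illustrative layout).
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
router = HierarchicalRouter(cluster_of)
router.send(0, 1, "intra-A")   # direct
router.send(0, 2, "inter")     # relayed and logged
router.send(3, 2, "intra-B")   # direct
```

In this sketch only the inter-cluster message is logged, so `router.channel_memory.replay(2)` returns just the message from rank 0. Messages exchanged inside a cluster are not covered, which is why the description calls for a separate fault tolerance protocol at the cluster level.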
Tools used: No information
Shared by: Camille Coti, Thomas Herault, Eric Rodriguez
Last update: 0000-00-00 00:00:00