Grid'5000 experiment


MPICH-V3: Hierarchical Fault Tolerance for the Grid (Middleware)

Conducted by

Camille Coti, Thomas Herault, Eric Rodriguez

Description

One of the major issues when running the Message Passing Interface (MPI) over a large-scale grid is making efficient use of the grid's hierarchical topology. MPI offers some solutions, such as using different communicators to communicate either within a single cluster or between clusters. When we try to integrate transparent fault tolerance mechanisms into such deployments, however, the global nature of most fault tolerance algorithms breaks this hierarchical design. In a Chandy-Lamport algorithm, for example, a message must be sent on every communication channel, and therefore across communicators, to flush the messages that may be in transit. A current challenge of fault tolerance for the grid is thus to design a hierarchical fault tolerance mechanism. Solutions based on the composition of classical fault tolerance protocols are attractive, but they may not be feasible, or may introduce a high overhead into the system.

We are investigating different composition techniques. One of the most promising for the grid is the first fault-tolerant protocol studied in the MPICH-V project: MPICH-V1. MPICH-V1 uses Channel Memories to relay and log messages between computing nodes. In a hierarchical deployment over the grid, the idea would be to use Channel Memories between clusters and direct communications inside a single cluster. Since inter-cluster messages already have to pass through many routers, adding one more hop may not noticeably hurt performance. Experiments are being conducted to determine the best solution for providing both relay and logging of messages. A fault tolerance protocol for the hierarchical grid will then be evaluated on Grid'5000.
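The routing idea described above (Channel Memories between clusters, direct communication inside a cluster) can be illustrated with a minimal in-memory simulation. This is only a sketch under stated assumptions, not the MPICH-V1 implementation: the names `ChannelMemory`, `Grid`, `relay`, `replay`, and `send` are hypothetical, and Python queues stand in for the network.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ChannelMemory:
    """Hypothetical relay that logs every inter-cluster message it forwards."""
    log: dict = field(default_factory=lambda: defaultdict(list))

    def relay(self, src, dst, payload, inboxes):
        # Log before forwarding, so a restarted receiver can replay the message.
        self.log[dst].append((src, payload))
        inboxes[dst].append((src, payload))

    def replay(self, dst):
        # Logged messages addressed to dst, in the order they were relayed.
        return list(self.log[dst])


class Grid:
    """Toy model of a grid: nodes grouped into clusters (an assumption)."""

    def __init__(self, cluster_of):
        self.cluster_of = cluster_of        # node name -> cluster name
        self.inboxes = defaultdict(deque)   # node name -> pending messages
        self.channel_memory = ChannelMemory()
        self.hops = []                      # route taken by each message

    def send(self, src, dst, payload):
        if self.cluster_of[src] == self.cluster_of[dst]:
            # Same cluster: deliver directly, without relay or logging.
            self.inboxes[dst].append((src, payload))
            self.hops.append("direct")
        else:
            # Different clusters: go through the Channel Memory.
            self.channel_memory.relay(src, dst, payload, self.inboxes)
            self.hops.append("relayed")
```

For example, with `grid = Grid({"a0": "A", "a1": "A", "b0": "B"})`, a call to `grid.send("a0", "a1", ...)` is delivered directly, while `grid.send("a0", "b0", ...)` is relayed and logged, so `grid.channel_memory.replay("b0")` can later return it to a restarted `b0`. The sketch leaves out fault tolerance for intra-cluster traffic, which would need its own protocol.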

Status

in progress

Resources

    Tools used

    No information

    Results

    Not yet

    Shared by: Camille Coti, Thomas Herault, Eric Rodriguez
    Last update: 0000-00-00 00:00:00
    Experiment #189
