Grid'5000 experiment


MPICH-V (Fault-tolerant MPI) (Middleware)

Conducted by

Aurelien Bouteiller, Thomas Herault


High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. Together, these two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerance protocols for MPI applications.




  • Nodes involved: 1000
  • Sites involved: 1
  • Minimum walltime: 4h
  • Batch mode: no
  • Use kadeploy: no
  • CPU bound: yes
  • Memory bound: no
  • Storage bound: no
  • Network bound: yes
  • Interlink bound: no

Tools used

Software: Linux, MPICH, PGI Fortran, Intel CC, NAS Parallel Benchmarks, NetPIPE. Hardware: Myrinet, InfiniBand, and Gigabit Ethernet networks. Large cluster configuration on GdX.


We presented an extensive related-work section highlighting the originality of our approach and of the proposed protocols. We implemented a generic fault-tolerance layer in one of the leading MPI implementations, MPICH. Within this fault-tolerant framework we implemented four different automatic fault-tolerance algorithms covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing combined with message logging. We measured the performance of these protocols on a micro-benchmark and compared them on the NAS benchmarks, using an original fault-tolerance test. Finally, we outlined the lessons learned from this in-depth comparison of fault-tolerant protocols for MPI applications.
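To make the uncoordinated-checkpoint-plus-message-logging family concrete, here is a minimal, hedged sketch of the idea, simulated in plain Python. This is not MPICH-V code; all class and method names are hypothetical. Each process checkpoints independently, senders log outgoing messages, and a crashed process recovers by restoring its checkpoint and replaying the logged messages it received after that checkpoint.

```python
# Illustrative simulation (not MPICH-V code) of sender-based message logging
# with uncoordinated checkpointing, for two simulated processes.

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.state = 0          # application state: running sum of received values
        self.sent_log = []      # sender-side log of (dest_rank, value) for replay
        self.checkpoint = None  # last saved (state, messages received so far)
        self.recv_count = 0

    def send(self, dest, value):
        self.sent_log.append((dest.rank, value))  # log before delivery
        dest.deliver(value)

    def deliver(self, value):
        self.state += value
        self.recv_count += 1

    def take_checkpoint(self):
        # Uncoordinated: each process checkpoints on its own schedule.
        self.checkpoint = (self.state, self.recv_count)

    def recover(self, peers):
        # Restart from the last checkpoint, then replay messages that
        # peers logged but that arrived after the checkpoint was taken.
        self.state, already_replayed = self.checkpoint
        self.recv_count = already_replayed
        for peer in peers:
            for dest_rank, value in peer.sent_log:
                if dest_rank == self.rank:
                    if already_replayed > 0:
                        already_replayed -= 1   # reflected in the checkpoint
                    else:
                        self.deliver(value)     # replay post-checkpoint message


p0, p1 = Process(0), Process(1)
p0.send(p1, 5)
p1.take_checkpoint()           # p1 checkpoints independently
p0.send(p1, 7)                 # received after the checkpoint, but logged by p0
state_before_crash = p1.state  # 12

p1.state = 0                   # simulate a crash wiping volatile state
p1.recover([p0])
assert p1.state == state_before_crash
print(p1.state)
```

Coordinated checkpointing avoids the log entirely by synchronizing all processes before saving state, at the cost of a global synchronization; message logging trades that synchronization for per-message overhead, which is one of the trade-offs the protocols above were designed to compare.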


Shared by: Aurelien Bouteiller, Thomas Herault
Last update: 2007-02-21 21:39:20
Experiment #87