Grid'5000 publication

Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery

Author: Bouteiller, Aurélien and Ropars, Thomas and Bosilca, George and Morin, Christine and Dongarra, Jack
EntryType: inproceedings
Abstract: With the growing scale of high-performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proven to tolerate higher failure rates. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high-performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable.
Booktitle: IEEE International Conference on Cluster Computing (Cluster 2009)
Address: New Orleans, United States of America
Month: 09
Year: 2009
Url: http://hal.inria.fr/inria-00424017/en/

Bibtex:
@inproceedings{BRB09,
	title = {Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery},
	author = {Bouteiller, Aurélien and Ropars, Thomas and Bosilca, George and Morin, Christine and Dongarra, Jack},
	abstract = {With the growing scale of high-performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proven to tolerate higher failure rates. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high-performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable.},
	booktitle = {IEEE International Conference on Cluster Computing (Cluster 2009)},
	address = {New Orleans, United States of America},
	month = {09},
	year = {2009},
	URL = {http://hal.inria.fr/inria-00424017/en/},
}

Shared by: George Bosilca, Aurélien Bouteiller, Thomas Ropars
Last update: 2010-01-20 11:21:08
Publication #660
