Grid'5000 user report for Thomas Herault


User information

Thomas Herault (users, user, account-manager, orsay, ml-users user)
More user information in the user management interface.


  • MPICH-V (Fault tolerant MPI) (Middleware) [achieved]
    Description: High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message passing libraries in HPC applications. These two trends raise the need for fault tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault tolerance protocols for MPI applications.
    Results: We presented an extensive related-work section highlighting the originality of our approach and of the proposed protocols. We implemented a generic fault tolerant layer in one of the leading MPI implementations, MPICH. Inside this fault tolerant framework we implemented four different types of automatic fault tolerance algorithms, covering a large spectrum of known approaches from coordinated checkpointing to uncoordinated checkpointing combined with message logging. We measured the performance of these protocols on a micro-benchmark and compared them on the NAS benchmarks, using an original fault tolerance test. Finally, we outlined the lessons learned from this in-depth comparison of fault tolerant protocols for MPI applications.
  • Self Stabilizing Spanning Tree for P2P Systems (Middleware) [achieved]
    Description: Peer-to-peer networks and grids are emerging large scale systems that gather thousands of nodes. These networks usually rely on IP to communicate: in such systems, two different nodes are identified by an address and can communicate only by addressing each other. Classical distributed applications need a notion of neighborhood. Large scale systems should provide such a notion as a basic system feature, to address two issues: 1) provide each node with a list of contacts in the network (its neighbours); 2) bound the number of processes each node has to communicate with. Issue 1) also implies that the virtual topology provided by the system should be connected, so that any node may obtain any information in the system even though it communicates only with its neighbours. Since it is not practical today to multiplex many communication streams on a single computer, it is desirable to bound the number of neighbours (issue 2). Maintaining a complete topology is not advisable at large scale, as each node would have to maintain connectivity with, and knowledge of, thousands of connections and addresses. Since a node can realistically maintain only a limited number of known addresses and open connections, it is necessary to build a bounded-degree topology. As a first attempt to address this issue we chose to build a spanning tree, which provides nice properties for peer-to-peer systems and efficient diffusion. Peer-to-peer systems rely mostly on Internet protocols. In such systems, a node can gracefully stop participating in the service, can stop running unexpectedly, or can rejoin the system at any time. The network can lose messages due to congestion (for non-reliable Internet communication protocols such as UDP), or an entire geographic region can go down. Maintaining a topology in these systems therefore implies being fault tolerant.
On one hand, many fault tolerant protocols have been studied, including replication and rollback-recovery protocols, but those protocols focus mostly on fail-stop faults. On the other hand, self-stabilization addresses a broader class of behaviours, as long as the system has enough time to converge between failures. A self-stabilizing system is a system that eventually behaves according to its specification regardless of the configuration it is initialized in. A set of failures (network failures, topology changes due to process departures or arrivals, or even memory corruption) leads the distributed system into some configuration. Thanks to its convergence property, the self-stabilizing system will reach from this configuration another configuration, called legitimate. Continuing from this legitimate configuration, the system behaves according to its specification. The idea of self-stabilizing systems was proposed by Dijkstra in 1974. By abstracting any kind of fault as an arbitrary initial configuration, self-stabilization provides a tool to handle the wide variety of faults of large scale systems. The issue addressed in this work is the following: consider a distributed system over the Internet. Every process p has a unique identifier, and may potentially communicate with any other process q if it has the identifier of q. Every process moreover has access to two services: a resource discovery service that provides identifiers of processes (that were at some point part of the system), and a failure detection service that provides accurate information on the status of processes given their identifiers. We propose a self-stabilizing algorithm to build a virtual tree that includes all the processes in the system while storing only a (small) bounded number of identifiers per process. The large scale of the system precludes solutions that store the set of all process identifiers. In this context, we use Grid5000 to evaluate the performance of the proposed solutions.
This is a challenge, since measuring the performance of self-stabilizing solutions is not a straightforward process. Indeed, it is not possible for the system itself to determine whether it has converged. We use a solution based on execution traces and post-mortem analysis to compute the convergence time and other meaningful measures. This work provides one of the rare experimental evaluations of self-stabilizing systems, which are useful for large scale distributed systems.
    Results: In this work, we propose a new model in which to design distributed algorithms for large scale systems. We replace the classical notion of a neighbour list with the combination of two devices, a resource discovery service and a failure detector, in order to avoid making unnecessary assumptions and to improve scalability. We illustrate our model with a self-stabilizing algorithm that builds a spanning tree whose degree is bounded by δ using only δ+1 process identifiers. We present a formal proof of convergence and performance measurements of a prototype implementation of this algorithm and its services for clusters. The experimental results show that the algorithm performs well enough to argue in favour of the application of self-stabilization in practice. Our intended follow-up on this work is to design other protocols in the same model, so as to explore its viability and efficiency for different problems. We will also study other topologies suitable for large scale systems. It would also be interesting to try to define the notion of stabilization time in this model. This would require stronger assumptions on the resource discovery service, but which ones exactly is an open question.
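    The effect of the δ+1-identifier bound (one parent plus at most δ children per node) can be illustrated with a minimal, hypothetical join procedure. This is not the self-stabilizing algorithm of the paper, only a sketch of how redirecting joins down the tree keeps every degree bounded; all names are illustrative.

```python
import random

DELTA = 3  # hypothetical degree bound (the paper's δ)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.parent = None   # 1 stored identifier
        self.children = []   # at most DELTA stored identifiers

def join(root, node):
    """Walk down from the root until a node with a free child slot is found."""
    cur = root
    while len(cur.children) >= DELTA:
        cur = random.choice(cur.children)  # full: redirect to one of the children
    cur.children.append(node)
    node.parent = cur

nodes = [Node(i) for i in range(32)]
root = nodes[0]
for n in nodes[1:]:
    join(root, n)

# every node is attached, and no node stores more than DELTA+1 identifiers
assert all(n.parent is not None for n in nodes[1:])
assert max(len(n.children) for n in nodes) <= DELTA
```

    In the actual system the same invariant must also be restored after arbitrary failures, which is what the self-stabilizing rules and the failure detector provide.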
  • Open MPI and MPICH2 Integration with Fault Tolerance Protocols (Middleware) [in progress]
    Description: The MPICH-V project of the Grand-Large team has demonstrated the efficiency and feasibility of fault tolerance techniques for message passing systems like MPI. We are now collaborating with both the Open MPI team at UTK and the MPICH2 team at Argonne National Laboratory to provide efficient fault tolerance mechanisms in their respective implementations of the MPI2 specification. One of the major challenges of this integration is to deal efficiently with high speed networks (such as those Grid5000 will soon provide) and with large scale (which Grid5000 already provides). In this context, we will use Grid5000 as a testbed for the integrated fault-tolerant MPI libraries.
  • MPICH-V3: Hierarchical Fault Tolerance for the Grid (Middleware) [in progress]
    Description: One of the major issues when dealing with the Message Passing Interface over a large scale grid is using its hierarchical topology efficiently. Some solutions exist within MPI, like using different communicators to communicate either inside a single cluster or between clusters. When we try to integrate transparent fault tolerance mechanisms into such deployments, the global nature of most fault tolerance algorithms breaks this hierarchical design (in a Chandy-Lamport algorithm, for example, one has to send a message into each communication channel, thus between the communicators, to flush the messages that may be circulating in them). A current challenge of fault tolerance for the grid is to design a hierarchical fault tolerance mechanism. Solutions based on the composition of classical fault tolerance protocols are attractive, but may not be feasible, or may introduce a high overhead into the system. We are investigating different composition techniques. One of the most promising for the grid is the first fault tolerance protocol studied in the MPICH-V project: MPICH-V1. MPICH-V1 uses Channel Memories to relay and log messages between computing nodes. The idea in a hierarchical deployment over the grid would be to use Channel Memories between clusters (since these messages have to pass through many routers anyway, adding another hop may not harm performance) and direct communications inside a single cluster. Experiments are being conducted to evaluate the best solution providing both relaying and logging of messages. A fault tolerance protocol for the hierarchical grid will then be evaluated on Grid5000.
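    The channel-flushing step mentioned above is the marker mechanism of the Chandy-Lamport snapshot algorithm. A minimal sketch of that mechanism follows; the two-process scenario and all names are illustrative, not MPICH-V code. Each process, on snapshotting, records its state and sends a marker on every channel; messages that arrive on a channel after the local snapshot but before that channel's marker were in transit and must be recorded.

```python
from collections import deque

MARKER = object()  # the flush marker sent on every channel

class Process:
    def __init__(self, pid, peers):
        self.pid = pid
        self.state = 0
        self.inbox = {p: deque() for p in peers}    # one FIFO channel per peer
        self.snapshot = None
        self.channel_log = {p: [] for p in peers}   # in-transit messages
        self.open_channels = set()

    def start_snapshot(self, net):
        self.snapshot = self.state                  # record local state
        self.open_channels = set(self.inbox)
        for p in self.inbox:                        # marker on every channel
            net[p].inbox[self.pid].append(MARKER)

    def receive(self, sender, net):
        msg = self.inbox[sender].popleft()
        if msg is MARKER:
            if self.snapshot is None:               # first marker: snapshot now
                self.start_snapshot(net)
            self.open_channels.discard(sender)      # channel flushed
        elif self.snapshot is not None and sender in self.open_channels:
            self.channel_log[sender].append(msg)    # message was in transit
        else:
            self.state += msg                       # normal computation

net = {0: Process(0, [1]), 1: Process(1, [0])}
p0, p1 = net[0], net[1]
p0.inbox[1].append(7)   # p1's message is in flight when the snapshot starts
p0.start_snapshot(net)  # p0 records its state and sends a marker to p1
p0.receive(1, net)      # 7 arrives before p1's marker: logged as in transit
p1.receive(0, net)      # p1 sees the marker, snapshots, sends its own marker
p0.receive(1, net)      # p1's marker closes the last channel
assert p0.channel_log[1] == [7] and not p0.open_channels
```

    In a hierarchical deployment, it is exactly this "marker on every channel, including inter-communicator ones" requirement that cuts across the cluster/grid structure.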
  • Coordinated Strategies for Grid-Level checkpointing of MPI applications (Middleware) [achieved]
    Description: A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.
    Results: In this paper, we present Pcl, a new implementation of a blocking, coordinated checkpointing, fault tolerant protocol inside MPICH2. We evaluate its performance on three typical high performance architectures: clusters of workstations, high speed network clusters, and computational grids. We compare its performance to Vcl, an implementation of a non-blocking, coordinated checkpointing protocol. A blocking, coordinated checkpointing protocol requires flushing communication channels before taking the state of a process, in order to ensure the coherency of the view built. It introduces synchronization into the distributed system, during which communications are frozen. However, since it does not require copies of incoming or outgoing messages, it is simpler to implement in an existing high-performance communication driver. A non-blocking, coordinated checkpointing protocol saves the state of the communication channels during the checkpoint without interrupting the computation. It requires logging in-transit messages and replaying them at restart, which implies coordination with the progress engine and queue mechanisms. The experimental study demonstrated that for high-speed networks, the blocking implementation gives the best performance for sensible checkpoint frequencies. On clusters of workstations and computational grids, the high cost of the network synchronization that produces the checkpointing wave of the blocking protocol introduces a high overhead that does not appear with the non-blocking implementation. An experimental study on a cluster demonstrated that the checkpoint frequency has a more significant impact on performance than the number of nodes involved in a checkpoint synchronization, for both the non-blocking and the blocking protocols. We are conducting a larger study to evaluate this result on computational grids.
Evaluating the MTTF (mean time to failure) of the system can significantly improve performance, since the best value for the checkpoint wave frequency is close to the MTTF, ideally taking a checkpoint just before every failure. Components detecting an increasing failure probability (e.g. through their CPU temperature probe) should also trigger a checkpoint wave. The non-blocking protocol seems to provide good performance at large scale, but suffers from implementation issues. We plan to integrate this protocol into the MPICH2-Nemesis framework in order to improve its performance and evaluate it on high speed networks. This work has been accepted for publication at the SuperComputing 2006 conference and, recently, an improved and extended version in the journal IJHPCN.
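    The link between checkpoint frequency and the MTTF noted above can be made concrete with a standard first-order estimate from the checkpointing literature (Young's approximation, not taken from this report): the interval between checkpoint waves should be about sqrt(2·C·M), where C is the cost of one checkpoint wave and M the MTTF. The figures below are purely illustrative.

```python
from math import sqrt

def young_interval(checkpoint_cost, mttf):
    """First-order optimal interval between checkpoint waves (Young's approximation)."""
    return sqrt(2 * checkpoint_cost * mttf)

# hypothetical figures: a 60 s checkpoint wave, one failure every 6 hours
tau = young_interval(60.0, 6 * 3600.0)
print(round(tau))  # a checkpoint wave roughly every 1610 s
```

    The estimate assumes failures are roughly exponentially distributed; a rising CPU temperature, as suggested above, is precisely the kind of signal that invalidates that assumption and justifies an extra, immediate checkpoint wave.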
  • Fault Tolerance for High Speed Network Cards using Open MPI (Networking) [in progress]
    Description: With HPC architectures gathering more and more processors, the mean time between failures (MTBF) is now counted in hours. As a consequence, fault tolerant algorithms have been developed for MPI to recover from failures. Currently, the preferred automatic fault tolerance protocol is coordinated checkpointing, due to its very low failure-free overhead. Unfortunately, coordinated checkpointing has a long recovery time. It has been shown that, in spite of a high overhead on network performance, message logging achieves a better application makespan in a low-MTBF system. In this paper we present a refinement of the message logging model intended to reduce the failure-free overhead of message logging by removing useless memory copies and reducing the overall number of logged events. We present an implementation of pessimistic message logging in Open MPI and compare its performance with the previous reference implementation, MPICH-V2. Results outline an improvement of several orders of magnitude in performance and a zero cost for most messages.
    Results: In this experiment we introduce a refinement of the message logging model intended to reduce the raw overhead of this fault tolerance protocol. Unlike the classical model, it does not consider message delivery as a single atomic step. Instead, a message may generate two kinds of events: matching events at the beginning of any-source receptions, and delivery events counting the number of times a message has been involved in a non-deterministic probe before delivery. The advantages of this model are 1) better fitting the actual MPI communication pattern, 2) removing the need for an intermediate copy of the message, and 3) allowing the implementation of fault tolerance mechanisms at a higher level of the software hierarchy, and thus distinguishing between non-deterministic and deterministic events. We implemented a pessimistic message logging algorithm according to this model in Open MPI and compared its performance to the previous reference implementation of pessimistic message logging, MPICH-V2. Results outline a lower cost for the fault tolerance framework (bandwidth multiplied by 4.5 and latency divided by 18) thanks to the removal of intermediate message copies. Furthermore, because of the better detection of deterministic events, most messages have no message logging overhead at all, leading to a 35 times better latency.
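    The two-event classification can be sketched as a toy model; this is hypothetical illustration, not the Open MPI implementation. The point is that only non-deterministic outcomes need stable logging: the identity of the sender matched by a wildcard receive, and the number of probes that raced a delivery. A receive from a named source is deterministic and replays identically for free.

```python
ANY_SOURCE = -1  # stands in for MPI's wildcard source

class PessimisticLogger:
    """Toy sketch of the two-event model: log matching events for
    wildcard receives and delivery events for raced probes; log
    nothing for deterministic operations."""
    def __init__(self):
        self.events = []  # would be flushed synchronously to stable storage

    def on_match(self, requested_source, actual_source, seq):
        if requested_source == ANY_SOURCE:
            # which sender matched was non-deterministic: record it for replay
            self.events.append(("match", actual_source, seq))
        # a named-source receive is deterministic: nothing to log

    def on_deliver(self, probes_before_delivery, seq):
        if probes_before_delivery > 0:
            # how many probes saw the message is non-deterministic
            self.events.append(("deliver", probes_before_delivery, seq))

log = PessimisticLogger()
log.on_match(ANY_SOURCE, actual_source=3, seq=0)  # wildcard match: logged
log.on_match(5, actual_source=5, seq=1)           # named source: free
log.on_deliver(0, seq=1)                          # no probe raced it: free
assert log.events == [("match", 3, 0)]
```

    This is why, as reported above, most messages carry no logging overhead at all: typical codes receive mostly from named sources without probing.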
  • Virtualization (Middleware) [planned]
    Description: Together with Thomas Herault, Benjamin Quetier and Thomas Largillier, we are designing a full virtual cluster in order to provide a complete and stable framework for the analysis of fault tolerance in large scale systems. The first experiment will be to deploy Xen virtual machines (XVM) linked together through BSD machines running Dummynet. Failure injection will be introduced by an ad hoc mechanism that destroys virtual machines, thus emulating physical crashes of computers.
  • Go states computation (Programming) [planned]
    Description: The goals of the project are two-fold. 1) An application goal: to compute the exact number of legal positions on a Go board of size 19x19 (the standard size). This is a considerable computational challenge, so far solved only for boards up to size 17x17, and it would be of definite interest to the Go community. As opposed to, say, "computing yet another zillion digits of Pi", which can go on indefinitely, this is a one-time computation at the border of what is computable with current computer technology. Mathematically, the problem is essentially to compute the product of a vector by 361 matrices. The dimension of the vector and matrices is 2^64. Luckily, everything is sparse, so there are only about 350G non-zero entries in any intermediate or final vector, and each of the matrices is given implicitly. The entries are about 100B in size, so overall one needs to process approximately 12 petabytes of data. 2) A programming goal: the scale of the computation is such that the reliability of hardware and software becomes an issue. As a distributed application seems to be the only way to carry out such a large computation, issues arising from the unreliability and failure of particular nodes come into play (besides the usual load balancing issues). Since the data and the overall state of the computation have to be distributed over many nodes, node failure recovery needs to be addressed. Hence, designing and testing a high performance system that can run reliably over somewhat reliable nodes is one of the goals of this experiment. The target parameters of the system are to process, continuously and semi-synchronously, on the order of 10 MB/s of data per node on the order of 1000 nodes, in a "map-and-reduce" fashion.
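    To make the notion of "legal position" concrete, here is a brute-force sketch, usable only on tiny boards (the actual project uses the implicit transfer-matrix product described above, not this enumeration): a position is legal when every maximal chain of same-coloured stones touches at least one empty point (a liberty).

```python
from itertools import product

def count_legal(rows, cols):
    """Count legal Go positions by brute force over all 3^(rows*cols) colourings."""
    n = rows * cols

    def neighbours(i):
        r, c = divmod(i, cols)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield (r + dr) * cols + (c + dc)

    total = 0
    for board in product((0, 1, 2), repeat=n):  # 0 empty, 1 black, 2 white
        seen = [False] * n
        legal = True
        for i in range(n):
            if board[i] == 0 or seen[i]:
                continue
            # flood-fill the chain containing i, looking for a liberty
            stack, liberty = [i], False
            seen[i] = True
            while stack:
                j = stack.pop()
                for k in neighbours(j):
                    if board[k] == 0:
                        liberty = True
                    elif board[k] == board[i] and not seen[k]:
                        seen[k] = True
                        stack.append(k)
            if not liberty:
                legal = False
                break
        total += legal
    return total

print(count_legal(2, 2))  # 57 legal positions on a 2x2 board
```

    Already at 4x4 this enumeration means checking 3^16 boards, which is why the real computation factors the count into a product of implicit sparse matrices instead.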
  • QosCosGrid MPI middleware and usecases (Application) [in progress]
    Description: The QosCosGrid project aims at running parallel applications on a grid (seen as a heterogeneous cluster of clusters) in a quasi-opportunistic way. We are dealing with MPI applications and running experiments on Grid5000 in several configurations: test a single component, e.g., the communication library (first phase of the project); test the integration of several components, e.g., metascheduler + communication library, or communication library + application (second phase); test the full software stack, from the lowest-level components to the applications (third phase). The applications are simulations of complex systems requiring frequent communications between nodes.


  • QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment [2010] (international)
    EntryType: inproceedings
    Author: Emmanuel Agullo and Camille Coti and Jack Dongarra and Thomas Herault and Julien Langou
    Booktitle: 24th IEEE International Parallel \& Distributed Processing Symposium (IPDPS'10)
    Address: Atlanta, GA, USA
    Month: April
  • Running parallel applications with topology-aware grid middleware [2009] (international)
    EntryType: inproceedings
    Author: Pavel Bar and Camille Coti and Derek Groen and Thomas Herault and Valentin Kravtsov and Assaf Schuster and Martin Swain
    Booktitle: 5th {IEEE} International Conference on e-Science (eScience 2009)
    Month: December
    Note: to appear
  • MPI Applications on Grids: a Topology-Aware Approach [2009] (international)
    EntryType: inproceedings
    Author: Camille Coti and Thomas Herault and Franck Cappello
    Booktitle: Proceedings of the 15th European Conference on Parallel and Distributed Computing (EuroPar'09)
    Editor: {LNCS}
    Address: Delft, the Netherlands
    Month: August
    Note: to appear
  • A Distributed and Replicated Service for Checkpoint Storage [2008] (international)
    EntryType: inbook
    Author: Fatiha Bouabache and Thomas Herault and Gilles Fedak and Franck Cappello
    Chapter: Checkpointing and Monitoring
    Editor: M. Danelutto, P. Fragopoulou and V. Getov
    Publisher: Springer
    Volume: 7
    Series: CoreGRID Books: Making Grids Work
    Pages: 293--306
  • Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI [2008] (international)
    EntryType: article
    Author: Darius Buntinas and Camille Coti and Thomas Herault and Pierre Lemarinier and Laurence Pilard and Ala Rezmerita and Eric Rodriguez and Franck Cappello
    Journal: Future Generation Computer Systems
    Volume: 1
  • Grid Services For MPI [2008] (international)
    EntryType: inproceedings
    Author: Camille Coti and Thomas Herault and Sylvain Peyronnet and Ala Rezmerita and Franck Cappello
    Booktitle: Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08)
    Pages: 417--424
    Editor: {ACM/IEEE}
    Address: Lyon, France
    Month: May
  • Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment [2008] (international)
    EntryType: inproceedings
    Author: Fatiha Bouabache and Thomas Herault and Gilles Fedak and Franck Cappello
    Booktitle: CCGRID '08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
    Isbn: 978-0-7695-3156-4
    Pages: 475--483
    Publisher: IEEE Computer Society
    Address: Washington, DC, USA
  • Grid Services For MPI [2007] (international)
    EntryType: inproceedings
    Author: Camille Coti and Ala Rezmerita and Thomas Herault and Franck Cappello
    Booktitle: Proceedings of the 14th European PVM/MPI Users' Group Meeting (EuroPVM/MPI)
    Pages: 393-394
    Editor: {ACM/IEEE}
    Address: Paris, France
    Month: October
  • Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI [2006] (international)
    EntryType: inproceedings
    Author: Camille Coti and Thomas Herault and Pierre Lemarinier and Laurence Pilard and Ala Rezmerita and Eric Rodriguez and Franck Cappello
    Booktitle: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC2006)
    Pages: electronic
    Editor: {ACM/IEEE}
    Address: Tampa, USA
    Month: November
  • Message Relaying Techniques for Computational Grids and their Relations to Fault Tolerant Message Passing for the Grid [2006] (international)
    EntryType: techreport
    Author: Michael Cadilhac and Thomas Herault and Pierre Lemarinier
    Institution: {CoreGRID} network of Excellence
    Address: 3rd workshop of the WP4, Paris
    Month: January
  • Self-Stabilizing Spanning Tree Algorithm for Large Scale Systems (Brief Announcement) [2006] (international)
    EntryType: inproceedings
    Author: Thomas Herault and Pierre Lemarinier and Olivier Peres and Laurence Pilard and Joffroy Beauquier
    Booktitle: Proceedings of the Eighth International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS 2006)
    Address: Dallas, USA
    Month: November
  • FAIL-MPI: How Fault-Tolerant Is Fault Tolerant MPI? [2006] (international)
    EntryType: inproceedings
    Author: William Hoarau and Pierre Lemarinier and Thomas Herault and Eric Rodriguez and Sebastien Tixeuil and Franck Cappello
    Booktitle: Proceedings of the IEEE International Conference on Cluster Computing (Cluster 2006)
    Address: Barcelona, Spain
    Month: September
    Abstract: One of the topics of paramount importance in the development of Cluster and Grid middleware is the impact of faults since their occurrence in Grid infrastructures and in large-scale distributed systems is common. MPI (Message Passing Interface) is a popular abstraction for programming distributed and parallel applications. FAIL (FAult Injection Language) is an abstract language for fault occurrence description capable of expressing complex and realistic fault scenarios. In this paper, we investigate the possibility of using FAIL to inject faults in a fault-tolerant MPI implementation. Our middleware, FAIL-MPI, is used to carry quantitative and qualitative faults and stress testing.
  • MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI [2006] (international)
    EntryType: article
    Author: Aurelien Bouteiller and Thomas Herault and Geraud Krawezik and Pierre Lemarinier and Franck Cappello
    Journal: The International Journal of High Performance Computing Applications
    Month: Summer
    Volume: 20
    Pages: 319--333
    Publisher: SAGE Publications


    Success stories and benefits from Grid'5000

    • Overall benefits

    last update: 2009-03-25 12:00:27