Grid'5000 user report for Olivier Richard


User information

Olivier Richard (users, web-staff, user, account-manager, site-manager, cp, ct, grenoble, grenoble-staff, g5kschool, ciment, ml-users, digitalis user)
More user information in the user management interface.

Experiments

  • Network isolation over the grid using dynamic VLANs (Networking) [achieved]
    Description: Development and testing of a tool to manage VLANs on the fly, in order to couple network isolation with OAR and Kadeploy.
    Results: The tool works efficiently on several routers/switches of the platform.
  • OAR Batch scheduler (Middleware) [achieved]
    Description: Development and maintenance of the Grid5000 batch scheduler, OAR. In addition, a Grid5000-specific development named OARGRID allows reserving computers on several clusters at once via OAR. A minimal reservation sketch follows this item.
    Results:
    More information here
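    As a concrete illustration of how experiments typically reserve resources through OAR, the fragment below wraps the oarsub command line from Python; it is only a minimal sketch, the resource expression and script path are placeholders, and the multi-cluster case handled by OARGRID is not shown.

      import subprocess

      def reserve_nodes(nodes, walltime, script):
          """Submit a job to OAR by wrapping the oarsub command line.

          nodes, walltime and script are placeholder values; the call
          assumes a standard oarsub installation on the frontend."""
          cmd = ["oarsub", "-l", f"nodes={nodes},walltime={walltime}", script]
          out = subprocess.run(cmd, capture_output=True, text=True, check=True)
          # oarsub prints a line of the form "OAR_JOB_ID=<id>" on success.
          for line in out.stdout.splitlines():
              if line.startswith("OAR_JOB_ID="):
                  return line.split("=", 1)[1]
          raise RuntimeError("could not find OAR_JOB_ID in oarsub output")

      # Example: 4 nodes for one hour running an experiment script.
      # job_id = reserve_nodes(4, "1:00:00", "./run_experiment.sh")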
  • OAR Green Computing Features (Middleware) [in progress]
    Description: In the new era of petascale Large-Scale Distributed Systems and High Performance Computing, energy consumption is an important parameter in the evolution of these systems. Local Resource Management Systems can play a vital role here, since they have an overall knowledge of the hardware resources and the users' workload. Turning off unutilized resources and exploiting Dynamic Voltage Scaling (DVS) techniques may drastically reduce the overall energy consumption. Nevertheless, the impact on job turnaround time and performance should also be taken into account. We have adapted the Cluster Resource Management System OAR with Energy Efficient Scheduling capabilities to deal with machine under-utilization. Furthermore, we provide a special type of parametrized job which lets users exploit CPU DVS and hard-disk spin-down during the actual execution of their jobs. In these experiments we test those Green Computing features by measuring the tradeoffs between energy consumption, execution performance and job turnaround time. A simplified sketch of the power-off policy follows this item.
    Results:
    More information here
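    The energy-saving behaviour described above can be summarised as a simple decision rule: power off nodes that have been idle longer than a threshold, and power nodes back on when the pending workload can no longer be served. The sketch below is only a simplified model of such a policy, not the actual OAR implementation; all names and thresholds are illustrative.

      import time
      from dataclasses import dataclass, field

      @dataclass
      class Node:
          name: str
          powered_on: bool = True
          busy: bool = False
          idle_since: float = field(default_factory=time.time)

      def energy_policy(nodes, pending_jobs, idle_timeout=600):
          """Toy energy-aware policy: wake nodes up when the free capacity is
          insufficient, power off nodes idle longer than idle_timeout while
          keeping enough capacity for the pending jobs."""
          now = time.time()
          awake_free = [n for n in nodes if n.powered_on and not n.busy]

          # Wake up nodes if the pending workload exceeds the free capacity.
          missing = max(0, pending_jobs - len(awake_free))
          to_power_on = [n for n in nodes if not n.powered_on][:missing]

          # Power off long-idle nodes, keeping enough for the pending jobs.
          to_power_off = []
          spare = len(awake_free) - pending_jobs
          for n in awake_free:
              if spare > 0 and now - n.idle_since > idle_timeout:
                  to_power_off.append(n)
                  spare -= 1

          return to_power_on, to_power_off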
  • Experimentation with the lightweight grid CIGRI: Fault-tolerance and Scheduling (Middleware) [in progress]
    Description:

    A widely used method for executing large-scale experiments on P2P or cluster computing platforms is the exploitation of idle resources. Specifically in the case of clusters, administrators share their cluster's idle cycles within Computational Grids, for the execution of so-called bag-of-tasks applications. Fault tolerance and scheduling are among the important challenges that have arisen in this research field. Our main interest lies in the large-scale deployment and experimentation of our lightweight grid computing approach, under real-life parameters. In this context, we experiment with CIGRI and a fully transparent system-level checkpointing feature, for scheduling and turnaround-time optimisation.

    The advantage of the Grid5000 platform is its degree of reconfigurability. Thanks to its deployment toolkit Kadeploy, scientists can deploy and install exactly the software needed for their experiments on the desired number of nodes. The platform therefore provides the ideal tool for controlled, and thus repeatable, experiments. Our experimental methodology is specifically constructed to permit the reproduction of our experiments: on the one hand for the observation, development and evaluation of the platform by the CIGRI developers; on the other hand for the verification, refinement and even extension of the experiments by whoever is interested. The goal of this study is to provide a real-life, large-scale experimental methodology on grid technologies for research on fault-tolerance and scheduling issues.

    We have constructed real-life experiments of the CIGRI lightweight grid system deployed on the Grid5000 platform. The real-life scenarios consist of local cluster workloads based on traces of the DAS2 grid platform, and of a bag-of-tasks application submitted to the grid: an astrophysics benchmark based on a Monte Carlo transfer code.

    In this first series of experiments the goal is to compare different fault-tolerance strategies based on the newly implemented checkpoint/restart functionality. Those strategies are:

    • "No Checkpoints" (default) The default fault treatment mechanism of CIGRI where in case of an interference failure the task is reentered on the bag of tasks and the job is rescheduled when resources are idle.
    • "Periodic Checkpoints" This strategy provides transparent periodic checkpoints for every task executed on the clusters.
    • "Specific Checkpoints" This strategy provides transparent periodic checkpoints for only the tasks that are going to be killed from the batch scheduler due to interference failures.
    • "Shared Checkpoints" (unstable) This strategy provides checkpointed job migration possibilities on other free clusters, in case the initial cluster is overloaded.
    In case of an interference failure, a CIGRI-specific module checks the storage resources of the clusters to see whether a checkpoint of the failed job exists; if it does, that job is treated with priority (ahead of all the other uncheckpointed tasks of the bag). A simplified sketch of this priority rule is given below.
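    The priority rule of the previous paragraph can be written down compactly: when the bag is rescheduled after interference failures, tasks for which a checkpoint image was found on the cluster storage are resubmitted first, so that their partial computation is not wasted. The following sketch is a simplified model of that rule, not CIGRI code, and the task fields are assumptions.

      from dataclasses import dataclass

      @dataclass
      class Task:
          task_id: int
          has_checkpoint: bool   # a checkpoint image exists on cluster storage
          submitted_at: float    # original submission time, used as a tie-breaker

      def reschedule_order(failed_tasks):
          """Order tasks that failed because of interference: checkpointed tasks
          first (their partial work can be restarted), then the rest in
          submission order."""
          return sorted(failed_tasks,
                        key=lambda t: (not t.has_checkpoint, t.submitted_at))

      # Example: task 2 has a checkpoint and is resubmitted before tasks 3 and 1.
      tasks = [Task(1, False, 10.0), Task(2, True, 20.0), Task(3, False, 5.0)]
      assert [t.task_id for t in reschedule_order(tasks)] == [2, 3, 1]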

    The experimental methodology is based on the following procedure:

    1. Reservation of a specific number of nodes upon Grid5000 clusters.

    2. Experimentation choices and initialisation. The user makes some choices about the type of experiment and then launches the relevant scripts for the experiment initialisation. The experiment starts with the image deployment on all the nodes and then launches the scripts which drive the experimentation.
    3. Deployment of the image containing all the needed software preinstalled (CIGRI, OAR, NFS, BLCR, ...)
    4. Configuration after node deployment: configuration of the software, distribution of roles (servers, computing nodes, ...), start of services (NFS, BLCR, ...), construction of the cluster and grid universe, OAR batch scheduler initialisation for each cluster and CIGRI grid server initialisation.
    5. Local cluster workload based on traces: the local cluster job submitter starts submitting jobs based on the selected traces (from the DAS2 grid platform), representing a specific percentage of the whole workload of the cluster (20% workload, 40%, 60%, ...)
    6. Grid bag-of-tasks application (astrophysics benchmark): a bag-of-tasks application is submitted to the CIGRI grid. It is an astrophysics benchmark based on a Monte Carlo transfer code (MCFOST), specifically structured as a bag of independent tasks producing valuable computation.
    7. CIGRI real-life operation for a specific duration: CIGRI schedules and executes the independent tasks of the application on the idle cluster resources. The local cluster job submitter sends jobs according to the trace. The local cluster workload interferes with the low-priority best-effort grid jobs, which take advantage of the idleness of the resources, producing interference failures. Hence the fault-treatment mechanism of CIGRI can be tested and the different fault-tolerance strategies ("Specific Checkpoints", "Periodic Checkpoints" and "No Checkpoints" (default)) can be compared. The CIGRI grid operates under these real-life conditions for a specific, chosen duration.
    8. Results collection: collection of the experimental results. On the one hand, collection of the results of the grid bag-of-tasks application for their processing and evaluation by the astrophysics researchers; on the other hand, collection of the results for the observation of the CIGRI grid platform and its fault-treatment mechanisms. Database traces (OAR, CIGRI) are collected along with all the information needed for the evaluation of the experiments (e.g. OAR logs, CIGRI logs, errors, ...)

    9. A posteriori treatment of the collected results for the validation of the experiment.

    Apart from the first and the last steps of the above procedure, the rest is completely automatic. For the second step, the experimentation choices, the user has to decide all the parameters of the experiments to be launched. More specifically, the user has to define the following parameters (a small configuration sketch is given after this list):

    • The Grid5000 clusters to be used for the deployment, along with the duration of the experimentation (5 hours or 10 hours proposed by default). These choices are made at node reservation time.
    • Small-scale grid experimentation with 1 cluster of 32 nodes (trace taken from a 32-node DAS2 cluster), or large-scale grid experimentation with 5 clusters totalling 200 nodes (traces taken from the 5 DAS2 clusters).
    • Execution Time for each task of the bag-of-tasks application.
    • Fault-tolerance strategy to be used: "Specific Checkpoints", "Periodic Checkpoints", "No Checkpoints" (default), and also "Shared Checkpoints" (unstable). Some strategies have their own parameters to define:
      • "Specific Checkpoints": the time delay during which the batch scheduler waits for the "to kill" job to checkpoint itself before killing it.
      • "Periodic Checkpoints": the period at which the job checkpoints itself during its execution.
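    To make the parameter choices above concrete, they can be grouped into a single configuration object read by the launch scripts; this is only an illustrative sketch naming the parameters listed in this report, with made-up default values, not the actual experiment scripts.

      from dataclasses import dataclass

      @dataclass
      class ExperimentConfig:
          # Reservation-time choices.
          clusters: tuple                  # Grid5000 clusters used for the deployment
          duration_hours: int = 5          # 5 or 10 hours proposed by default
          # Scale of the experiment.
          scale: str = "small"             # "small": 1 cluster / 32 nodes, "large": 5 clusters / 200 nodes
          # Bag-of-tasks application.
          task_execution_time: int = 600   # seconds per task (illustrative)
          # Fault-tolerance strategy and its parameters.
          strategy: str = "no_checkpoints" # or "periodic", "specific", "shared"
          checkpoint_period: int = 300     # "Periodic Checkpoints": seconds between checkpoints
          kill_delay: int = 60             # "Specific Checkpoints": delay granted before the kill

      config = ExperimentConfig(clusters=("grenoble", "sophia"), strategy="specific")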

      Results:

      The "No Checkpoints" default fault treatment mechanism provides a reliable grid system for bag-of-tasks application guarranteeing the succesful execution of all the tasks of the bag for the application. According to the results the grid platform makes a thorough use of the clusters idle resources arriving until the 98% of cluster usage. Its drawback is that all the computation made from jobs that are killed by the batch scheduler due to interference failures are completely useless, and represent not valuable cycles and energy consumption.

      Checkpointing is a good way to address this problem. Concerning the evaluation of the checkpointing strategies, our results show that the "Periodic Checkpoints" strategy has a very large overhead. This is expected and confirms our intuition, since all the jobs on the clusters checkpoint themselves and in the end only some of them are killed, resulting in lost time and cycles for all those that terminate their execution successfully. Of course, there are cases in which this strategy could still be the most efficient compared to the others.

      The "Specific Checkpoints" strategy on the other hand was turned to be the best in terms of turnaround time of the complete application. An important drawback of this strategy is the fact that it is not completely independent of the cluster and in a way it influences the clusters functionality, which goes against to our initial scopes. This is because the cluster has to wait some seconds before it kills the jobs so that it can allow it to checkpoint itself. This time delay is the subject of new studies that will investigate the relevance among the size of the checkpoint file, the duration to checkpoint and the time delay to be requested to the batch scheduler. Hence even if this strategy provides the best turnaround time under the tested conditions, it suffers important constraints .

      Finally the "Shared Checkpoints" strategy is currently being tested


  • Large scale experimentation of OARv2: Scalability and Comparison with other resource managers and batch schedulers (Middleware) [in progress]
    Description: Our goal is to compare various resource managers and batch schedulers in terms of scalability and performance under various workloads. Our experiments use virtualization techniques so as to scale up to a significantly larger number of nodes on our clusters. We use various well-known benchmarks and real-life workload traces to replay the experiments; a sketch of a trace replay loop follows this item.
    Results:
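    A central ingredient of these scalability experiments is replaying a recorded workload trace against the scheduler under test. The fragment below shows the general shape of such a replay loop under simple assumptions; submit_job is a placeholder for whatever submission command the evaluated resource manager provides.

      import time

      def replay_trace(trace, submit_job, speedup=1.0):
          """Replay a workload trace: trace is a list of
          (submission_time, job_description) pairs sorted by time, and
          submit_job is a callback that actually submits one job.
          speedup > 1 compresses the trace to shorten experiments."""
          start = time.time()
          t0 = trace[0][0]
          for submit_time, job in trace:
              # Wait until the (possibly compressed) submission instant.
              delay = start + (submit_time - t0) / speedup - time.time()
              if delay > 0:
                  time.sleep(delay)
              submit_job(job)

      # Example with a dummy submitter, 10x faster than the original trace:
      # replay_trace([(0, "job-a"), (30, "job-b")], print, speedup=10)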
  • Integration of SLURM resource manager with OARv2 batch scheduler (Middleware) [in progress]
    Description: SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. SLURM is not a sophisticated batch system, but it does provide an Applications Programming Interface (API) for integration with external schedulers such as the Maui Scheduler, Moab Cluster Suite and Platform LSF. OAR is an open-source batch scheduler which provides simple and flexible exploitation of a cluster. It manages the resources of clusters as a traditional batch scheduler does (like PBS / Torque / LSF / SGE). Its design is based on high-level tools: a relational database engine (MySQL or PostgreSQL), the scripting language Perl, the cpuset confinement mechanism, and the scalable execution tool Taktuk. It is flexible enough to be suitable for both production clusters and research experiments. It currently manages more than 5000 nodes and has executed more than 5 million jobs. Our goal is to couple SLURM with the OARv2 batch scheduler. For this there are two different paths: 1) use the SLURM API and provide OARv2 as an external scheduler of SLURM (in this case the SLURM commands are used for job submission and monitoring); 2) use SLURM as the low-level resource management tool in place of Taktuk for the OARv2 batch scheduler (in this case the OARv2 commands are used for job submission and monitoring). We intend to implement both methods and to propose experiments to test and evaluate the performance of the coupled systems in terms of scalability and efficiency, by comparing the two methods 1) between them, 2) with each system "standalone" and 3) with other possible resource managers and batch schedulers. A minimal sketch of the second path follows this item.
    Results:
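    The second integration path, using SLURM as the low-level launcher in place of Taktuk, essentially means that the OAR side delegates remote execution to srun on the nodes of the reservation. The sketch below illustrates that delegation with a plain subprocess call; the function name and the example command are assumptions for illustration, not an existing OAR or SLURM API.

      import subprocess

      def launch_on_nodes(nodes, command):
          """Run command on every node of nodes through SLURM's srun, playing
          the role that Taktuk plays in a stock OAR installation. Assumes the
          nodes belong to an existing SLURM allocation."""
          cmd = ["srun",
                 "--nodes=" + str(len(nodes)),
                 "--nodelist=" + ",".join(nodes)] + list(command)
          return subprocess.run(cmd, check=True)

      # Example (illustrative node names and command):
      # launch_on_nodes(["node-1", "node-2"], ["hostname"])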
  • kstress: stressing kadeploy (Other) [achieved]
    Description: kstress stresses kadeploy using a tool able to run test campaigns; each test runs 1, 2, 4 and then 8 concurrent deployments (a sketch of such a stress driver follows this item).
    Results: discovery of bugs in kadeploy; decrease of the deployment time.
    More information here
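    The core of such a stress test is launching an increasing number of deployments at the same time and checking whether they all finish. The sketch below reproduces that pattern with a thread pool; the kadeploy command line is a placeholder and has to be adapted to the actual tool and environment names.

      import subprocess
      from concurrent.futures import ThreadPoolExecutor

      # Placeholder command line: adapt to the kadeploy version in use.
      DEPLOY_CMD = ["kadeploy", "-e", "my_environment", "-m", "{node}"]

      def deploy(node):
          cmd = [part.format(node=node) for part in DEPLOY_CMD]
          return subprocess.run(cmd).returncode == 0

      def stress(node_groups):
          """Run 1, 2, 4 and then 8 concurrent deployments and report failures;
          node_groups maps each concurrency level to the nodes to deploy."""
          for level in (1, 2, 4, 8):
              nodes = node_groups[level]
              with ThreadPoolExecutor(max_workers=level) as pool:
                  results = list(pool.map(deploy, nodes))
              print(f"{level} concurrent deployments:"
                    f" {sum(results)}/{len(results)} succeeded")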
  • refenv: checking reference environment (Other) [achieved]
    Description: refenv checks programs and their versions in reference environments.
    Results:
    More information here
  • Kastafior Benchmark: a broadcasting tool evaluation (Other) [in progress]
    Description: Kastafior is a broadcasting program based on the Taktuk tool. Like Mput (see Brice Videau's report), Kastafior establishes a chain between the nodes to which the data is to be sent, and sends the data along this chain. We measured the data throughput in various conditions on Grid'5000. We used an early version of the Expo middleware to conduct the experiments. A small analytical model of chain broadcast is sketched after this item.
    Results:
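    The interest of a chain (pipelined) broadcast is that, once the pipeline is full, every node forwards one chunk while receiving the next, so the completion time grows only slowly with the number of nodes. The small model below computes that completion time under ideal assumptions (uniform link bandwidth, no latency); it is a back-of-the-envelope reference for the measurements, not part of Kastafior.

      def chain_broadcast_time(data_size, chunk_size, bandwidth, n_nodes):
          """Ideal completion time of a pipelined chain broadcast.
          data_size and chunk_size are in bytes, bandwidth in bytes/s.
          The first chunk needs (n_nodes - 1) hops to reach the last node;
          the remaining chunks then arrive one per chunk transfer time."""
          n_chunks = -(-data_size // chunk_size)   # ceiling division
          chunk_time = chunk_size / bandwidth
          return (n_nodes - 1) * chunk_time + (n_chunks - 1) * chunk_time

      # Example: 1 GiB broadcast to 100 nodes at 1 Gbit/s with 4 MiB chunks.
      t = chain_broadcast_time(2**30, 4 * 2**20, 125_000_000, 100)
      print(f"ideal completion time: {t:.1f} s")   # about 12 seconds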
  • IDTRUIT: testing the strength of frequently rebooted hardware (Other) [achieved]
    Description: Taking 4 nodes of the venerable and famous IDPOT cluster, we performed hard power off/on cycles for weeks to test whether the hardware could be damaged by an energy-saving feature that halts the nodes when there are no jobs running on a cluster.
    Results: About 14000 reboots without damaging the hardware (equivalent to 12 reboots per day during 3 years).
    More information here
  • Kadeploy2 toolkit optimisations (Middleware) [achieved]
    Description: The Kadeploy2 environment deployment toolkit provides automated software installation and reconfiguration mechanisms for clusters and grids. The Kadeploy2 toolkit introduces a prototype idea, aiming to be a new way of exploiting clusters and grids: letting users concurrently deploy computing environments exactly fitted to their experiment needs on different sets of nodes. Since the deployment execution time is a very important aspect for the viability of this approach, we studied and proposed optimization methods for the deployment procedure. Multiple performance measurements were conducted that validated our approach, met our expectations and generated ideas for deeper optimizations.
    Results: Comparison of the 2 deployment optimisation methods of the Kadeploy2 tool with the default deployment method on the GDX cluster (Orsay): the boot times (in seconds) according to the number of nodes. The featured chart validates our expectations for both optimization methods and shows that all deployment procedure methods scale very well with the number of nodes.
  • Griduis-2: A highly available high performance computing infrastructure based on G5K (Middleware) [in progress]
    Description:
    Results:
    More information here

Publications

Collaborations

Success stories and benefits from Grid'5000

last update: 2007-06-07 15:47:06
