Grid'5000 user report for Yiannis Georgiou


User information

Yiannis Georgiou (users, user, grenoble, ml-users user)


  • OAR Green Computing Features (Middleware) [in progress]
    Description: In the new era of petascale Large-Scale Distributed Systems and High Performance Computing, energy consumption is an important parameter in the evolution of these systems. Local Resource Management Systems can play a vital role here, since they have overall knowledge of the hardware resources and the users' workload. Turning off unused resources and exploiting Dynamic Voltage Scaling (DVS) techniques may drastically reduce overall energy consumption. Nevertheless, the impact on job turnaround time and performance should also be taken into account. We have adapted the cluster resource management system OAR with energy-efficient scheduling capabilities to deal with idle machines. Furthermore, we provide a special type of parametrized job that lets users exploit CPU DVS and hard-disk spin-down during the execution of their jobs. In these experiments we test those Green Computing features by measuring the trade-offs between energy consumption, execution performance and job turnaround time.
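
    As a rough illustration of the trade-off measured in these experiments, the Python sketch below compares keeping an idle node powered on with shutting it down and rebooting it. It is not OAR code: the power figures, the boot time and all function names are hypothetical assumptions chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePower:
    """Hypothetical power profile of one node (watts and seconds)."""
    idle_w: float = 100.0   # assumed draw while idle but powered on
    off_w: float = 5.0      # assumed standby draw while powered off
    boot_w: float = 150.0   # assumed draw while booting
    boot_s: float = 120.0   # assumed time needed to boot back into service


def idle_energy_joules(gap_s: float, p: NodePower) -> float:
    """Energy used if the node stays powered on for an idle gap of gap_s seconds."""
    return p.idle_w * gap_s


def shutdown_energy_joules(gap_s: float, p: NodePower) -> float:
    """Energy used if the node is off during the gap and rebooted at its end."""
    off_s = max(gap_s - p.boot_s, 0.0)
    return p.off_w * off_s + p.boot_w * p.boot_s


def should_power_off(gap_s: float, p: NodePower) -> bool:
    """Power off only if the reboot fits in the gap and energy is actually saved.

    Requiring the reboot to fit in the gap is a simple way to avoid delaying
    the next job, i.e. to protect turnaround time.
    """
    fits = gap_s >= p.boot_s
    return fits and shutdown_energy_joules(gap_s, p) < idle_energy_joules(gap_s, p)


if __name__ == "__main__":
    profile = NodePower()
    for gap in (60.0, 180.0, 3600.0):
        print(gap, should_power_off(gap, profile))
```

    Under these assumed figures, short idle gaps are not worth a shutdown because the reboot costs more energy than idling, which is the kind of trade-off an energy-aware scheduler has to weigh.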
  • Experimentations of the lightweight grid CIGRI: Fault-tolerance and Scheduling (Middleware) [in progress]

    A widely used method for large-scale experiment execution on P2P or cluster computing platforms is the exploitation of idle resources. Specifically in the case of clusters, administrators share their clusters' idle cycles with computational grids for the execution of so-called bag-of-tasks applications. Fault tolerance and scheduling are among the important challenges that have arisen in this research field. Our main interest lies in the large-scale deployment and experimentation of our lightweight grid computing approach under real-life parameters. In this context, we experiment with CIGRI and a fully transparent system-level checkpointing feature for scheduling and turnaround-time optimisation.

    The advantage of the Grid5000 platform is its degree of reconfigurability. Thanks to its deployment toolkit, Kadeploy, scientists can deploy and install the exact software needed for their experiments on the desired number of nodes. Hence, the platform provides the ideal tool for controlled and thus repeatable experiments. Our experimental methodology is specifically constructed to permit the reproduction of our experiments, on the one hand for the observation, development and evaluation of the platform by CIGRI developers, and on the other hand for the verification, refinement and even extension of the experiments by whoever is interested. The goal of this study is to provide a real-life, large-scale experimental methodology on grid technologies for research on fault-tolerance and scheduling issues.

    We have constructed real-life experiments with the CIGRI lightweight grid system deployed on the Grid5000 platform. The real-life scenarios consist of local cluster workloads based on traces of the DAS2 grid platform, and of a grid bag-of-tasks application: an astrophysics benchmark on a Monte Carlo transfer code.

    In this first version of the experiments, the goal is to compare different fault-tolerance strategies based on the newly implemented checkpoint/restart functionality. Those strategies are:

    • "No Checkpoints" (default): the default fault-treatment mechanism of CIGRI. In case of an interference failure, the task is re-entered into the bag of tasks and the job is rescheduled when resources are idle.
    • "Periodic Checkpoints": this strategy provides transparent periodic checkpoints for every task executed on the clusters.
    • "Specific Checkpoints": this strategy provides transparent checkpoints only for the tasks that are about to be killed by the batch scheduler due to interference failures.
    • "Shared Checkpoints" (unstable): this strategy makes it possible to migrate checkpointed jobs to other free clusters when the initial cluster is overloaded.

    In case of an interference failure, a CIGRI-specific module checks the storage resources of the clusters to see whether a checkpoint of the job exists. If it does, the job is treated with priority over all the uncheckpointed tasks of the bag.
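
    The priority rule for checkpointed jobs can be sketched in a few lines of Python. This is only an illustration of the ordering described above, not CIGRI's implementation: the `Task` type, the `submit_order` field and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Task:
    task_id: int
    submit_order: int  # hypothetical: position of the task in the original bag


def order_bag(tasks: List[Task], checkpointed_ids: Set[int]) -> List[Task]:
    """Put tasks that have a checkpoint first, then keep submission order."""
    return sorted(
        tasks,
        key=lambda t: (t.task_id not in checkpointed_ids, t.submit_order),
    )


def requeue_after_failure(bag: List[Task], killed: Task,
                          checkpointed_ids: Set[int]) -> List[Task]:
    """Re-enter a killed task into the bag and return the new execution order."""
    return order_bag(bag + [killed], checkpointed_ids)


if __name__ == "__main__":
    waiting = [Task(2, 2), Task(3, 3)]
    killed = Task(4, 4)
    # Task 4 was killed but left a checkpoint, so it is scheduled first.
    print([t.task_id for t in requeue_after_failure(waiting, killed, {4})])
```

    Without a checkpoint, the killed task simply rejoins the bag at its normal position, which matches the default "No Checkpoints" behaviour.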

    The experimental methodology is based on the following procedure:

    1. Reservation of a specific number of nodes upon Grid5000 clusters.

    2. Experimentation choices and initialisation: the user makes some choices about the type of experiments and then launches the relevant scripts for the experiment initialisation. The experiment starts with the image deployment on all the nodes and then launches the scripts that drive the experimentation.
    3. Deployment of the image containing all the needed software (CIGRI, OAR, NFS, BLCR, ...).
    4. Configuration after the deployment of the nodes: configuration of the software, distribution of roles (servers, computing nodes, ...), start of services (NFS, BLCR, ...), construction of the clusters and grid universe, OAR batch scheduler initialisation for each cluster, and CIGRI grid server initialisation.
    5. Local cluster workload based on traces: the local cluster job submitter starts submitting jobs based on the selected traces (from the DAS2 grid platform), representing a specific percentage of the whole workload of the cluster (20%, 40%, 60%, ...).
    6. Grid bag-of-tasks application (astrophysics benchmark): a bag-of-tasks application is submitted to the CIGRI grid. It is an astrophysics benchmark on a Monte Carlo transfer code (MCFOST), specifically constructed as a bag of independent tasks performing valuable computation.
    7. CIGRI real-life operation for a specific duration: CIGRI schedules and executes the independent tasks of the application on the idle cluster resources. The local cluster job submitter sends the jobs according to the trace. The local cluster workload interferes with the low-priority besteffort grid jobs, which take advantage of the idleness of the resources, producing interference failures. Hence the fault-treatment mechanism of CIGRI can be tested and the different fault-tolerance strategies ("Specific Checkpoints", "Periodic Checkpoints" and "No Checkpoints" (default)) can be compared. The CIGRI grid operates under those real-life conditions for a chosen duration.
    8. Results collection: collection of the experimental results. On the one hand, the results of the grid bag-of-tasks application are collected for processing and evaluation by the astrophysics researchers; on the other hand, results are collected for the observation of the CIGRI grid platform and its fault-treatment mechanisms. Database traces (OAR, CIGRI) are collected along with all the information needed to evaluate the experiments (e.g. OAR logs, CIGRI logs, errors, ...).

    9. A posteriori treatment of the collected results for the validation of the experiment.
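
    Step 5 of the procedure replays a trace at a chosen fraction of the cluster's capacity. The Python sketch below shows one possible way to pick jobs from a trace until a target load fraction is reached. The selection rule, the `TraceJob` fields and the function names are assumptions made for illustration; the report does not describe how the actual submitter scales the DAS2 traces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceJob:
    submit_s: int    # submission time, seconds from the start of the trace
    nodes: int       # number of nodes requested
    runtime_s: int   # job runtime in seconds


def load_fraction(jobs: List[TraceJob], cluster_nodes: int, duration_s: int) -> float:
    """Fraction of the cluster's node-seconds consumed by the given jobs."""
    used = sum(j.nodes * j.runtime_s for j in jobs)
    return used / (cluster_nodes * duration_s)


def select_for_target(trace: List[TraceJob], cluster_nodes: int,
                      duration_s: int, target: float) -> List[TraceJob]:
    """Greedily keep jobs, in submission order, while the load stays <= target."""
    capacity = cluster_nodes * duration_s
    selected: List[TraceJob] = []
    used = 0
    for job in sorted(trace, key=lambda j: j.submit_s):
        cost = job.nodes * job.runtime_s
        if (used + cost) / capacity <= target:
            selected.append(job)
            used += cost
    return selected


if __name__ == "__main__":
    trace = [TraceJob(0, 8, 400), TraceJob(10, 16, 500), TraceJob(20, 4, 300)]
    kept = select_for_target(trace, cluster_nodes=32, duration_s=1000, target=0.2)
    print(len(kept), load_fraction(kept, 32, 1000))
```

    A greedy filter like this never exceeds the target but may undershoot it; a real submitter could equally rescale inter-arrival times instead of dropping jobs.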

    Apart from the first and the last step of the above procedure, the rest is completely automatic. For the second step, the experimentation choices, the user has to decide all the parameters of the experiments to be launched. More specifically, the user has to define the following parameters:

    • The Grid5000 clusters that are going to be used for the deployment, along with the duration of the experimentation (5 hours or 10 hours proposed by default). These choices are made when the nodes are reserved.
    • Small-scale grid experimentation with 1 cluster of 32 nodes (trace taken from a DAS2 cluster of 32 nodes) or large-scale grid experimentation with 5 clusters of 200 nodes (traces taken from the 5 DAS2 clusters).
    • Execution time for each task of the bag-of-tasks application.
    • Fault-tolerance strategy to be used: "Specific Checkpoints", "Periodic Checkpoints", "No Checkpoints" (default), or "Shared Checkpoints" (unstable). Some strategies have their own parameters to define:
      • "Specific Checkpoints": the time delay that the batch scheduler waits for the "to kill" job to checkpoint itself before killing it.
      • "Periodic Checkpoints": the period at which the job checkpoints itself during its execution.
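
    The parameters listed above could be gathered in a single validated configuration object, as in the following Python sketch. The class, its field names, the default task runtime and the cluster names used in the example are hypothetical; they only mirror the choices described in the list and are not the format used by the actual experiment scripts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Strategy(Enum):
    NO_CHECKPOINTS = "No Checkpoints"      # default
    PERIODIC = "Periodic Checkpoints"
    SPECIFIC = "Specific Checkpoints"
    SHARED = "Shared Checkpoints"          # unstable


@dataclass(frozen=True)
class ExperimentConfig:
    clusters: Tuple[str, ...]                    # Grid5000 clusters to deploy on
    duration_hours: int = 5                      # 5 or 10 hours proposed by default
    large_scale: bool = False                    # False: 1 cluster, True: 5 clusters
    task_runtime_s: int = 600                    # hypothetical default task duration
    strategy: Strategy = Strategy.NO_CHECKPOINTS
    kill_delay_s: Optional[int] = None           # only for "Specific Checkpoints"
    checkpoint_period_s: Optional[int] = None    # only for "Periodic Checkpoints"

    def __post_init__(self) -> None:
        expected = 5 if self.large_scale else 1
        if len(self.clusters) != expected:
            raise ValueError(f"expected {expected} cluster(s), got {len(self.clusters)}")
        if self.duration_hours <= 0 or self.task_runtime_s <= 0:
            raise ValueError("durations must be positive")
        if self.strategy is Strategy.SPECIFIC and self.kill_delay_s is None:
            raise ValueError("Specific Checkpoints needs kill_delay_s")
        if self.strategy is Strategy.PERIODIC and self.checkpoint_period_s is None:
            raise ValueError("Periodic Checkpoints needs checkpoint_period_s")


if __name__ == "__main__":
    cfg = ExperimentConfig(clusters=("cluster-a",), strategy=Strategy.SPECIFIC,
                           kill_delay_s=30)
    print(cfg.strategy.value, cfg.duration_hours)
```

    Validating strategy-specific parameters at construction time catches an incomplete choice before any node is reserved or deployed.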


      The "No Checkpoints" default fault-treatment mechanism provides a reliable grid system for bag-of-tasks applications, guaranteeing the successful execution of all the tasks of the bag. According to the results, the grid platform makes thorough use of the clusters' idle resources, reaching up to 98% cluster usage. Its drawback is that all the computation done by jobs that are killed by the batch scheduler due to interference failures is wasted, and represents lost cycles and energy.

      Checkpointing is a good solution to address this problem. Concerning the evaluation of the checkpointing strategies, our results show that the "Periodic Checkpoints" strategy has a very large overhead. This is expected and confirms our assumptions: all the jobs on the clusters checkpoint themselves, but in the end only some of them are killed, resulting in lost time and cycles for all those that complete successfully. Of course, there are cases in which this strategy could be the most efficient of all.

      The "Specific Checkpoints" strategy, on the other hand, turned out to be the best in terms of turnaround time of the complete application. An important drawback of this strategy is that it is not completely independent of the cluster and in a way influences the cluster's operation, which goes against our initial goals. This is because the cluster has to wait some seconds before killing a job so that the job can checkpoint itself. This time delay is the subject of new studies that will investigate the relation between the size of the checkpoint file, the duration of the checkpoint, and the time delay to be requested from the batch scheduler. Hence, even if this strategy provides the best turnaround time under the tested conditions, it is subject to important constraints.

      Finally, the "Shared Checkpoints" strategy is currently being tested.
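
      The qualitative comparison above can be made concrete with a toy cost model for a single task. The Python sketch below is a simplification under stated assumptions, not the method used to obtain the results: it counts only lost computation and checkpoint overhead, and the checkpoint cost, the period and the function name are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    NO_CHECKPOINTS = "No Checkpoints"
    PERIODIC = "Periodic Checkpoints"
    SPECIFIC = "Specific Checkpoints"


def wasted_seconds(strategy: Strategy, ran_s: float, killed: bool,
                   period_s: float = 0.0, ckpt_cost_s: float = 0.0) -> float:
    """Seconds of one task that produced no useful progress (toy model).

    ran_s is the compute time accumulated before the task finished or was
    killed; ckpt_cost_s is the assumed cost of writing one checkpoint.
    """
    if strategy is Strategy.NO_CHECKPOINTS:
        # A killed task restarts from scratch, so all its work is lost.
        return ran_s if killed else 0.0
    if strategy is Strategy.PERIODIC:
        if period_s <= 0:
            raise ValueError("period_s must be positive")
        # Every task pays for its checkpoints, killed or not.
        overhead = (ran_s // period_s) * ckpt_cost_s
        # A killed task also loses the work done since its last checkpoint.
        lost = ran_s % period_s if killed else 0.0
        return overhead + lost
    if strategy is Strategy.SPECIFIC:
        # Only a task about to be killed pays for one checkpoint.
        return ckpt_cost_s if killed else 0.0
    raise ValueError(f"unsupported strategy: {strategy}")


if __name__ == "__main__":
    for s in Strategy:
        print(s.value,
              wasted_seconds(s, 1000, True, period_s=300, ckpt_cost_s=20),
              wasted_seconds(s, 1000, False, period_s=300, ckpt_cost_s=20))
```

      In this model the periodic strategy charges every task, including those that are never killed, while the specific strategy charges only killed tasks, which is consistent with the overhead pattern reported above.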

    • Large scale experimentation of OARv2: Scalability and Comparison with other resource managers and batch schedulers (Middleware) [in progress]
      Description: Our goal is to compare various resource managers and batch schedulers in terms of scalability and performance under various workloads. Our experiments use virtualization techniques to scale up to a significantly large number of nodes on our clusters. We use various known benchmarks and real-life workload traces to replay the experiments.
    • Supporting Malleability in Parallel Architectures with Dynamic CPUSET Mapping and Dynamic MPI (Programming) [in progress]
      Description: Current parallel architectures take advantage of new hardware evolution, such as the use of multicore machines in clusters and grids. The availability of such resources may also be dynamic. Therefore, some kind of adaptation is required from the applications and the resource manager to achieve good resource utilization. Malleable applications can provide a certain flexibility, adapting themselves on the fly to variations in the amount of available resources. However, enabling the execution of this kind of application requires some support from the resource manager, which introduces significant complexity such as special allocation and scheduling policies. In this context, we investigate techniques to provide malleable behavior for MPI applications and the impact of this support on a resource manager. Our study deals with two approaches to obtain malleability, dynamic CPUSET mapping and dynamic MPI, using the OAR resource manager. The validation experiments were conducted on the Grid5000 platform. The testbed combines the load of real workload traces with the execution of MPI benchmarks. Our results show that a dynamic approach using malleable jobs can lead to an improvement of almost 35% in resource utilization compared to a non-dynamic approach. Furthermore, the complexity of the malleability support for the resource manager seems to be outweighed by the improvement achieved.
    • Experimenting BLCR with MVAPICH2 upon InfiniBand (Middleware) [achieved]
      Description: The goal of this study is to experiment with and evaluate the system-level checkpoint/restart implementation BLCR on MVAPICH2, the MPI-2 implementation for InfiniBand networks. Our goals are: 1) to execute the Linpack HPL application on MVAPICH2 and evaluate the recovery of the system using periodic checkpoints with BLCR; 2) to evaluate the scalability and efficiency in terms of turnaround time (checkpointing duration) and storage (checkpoint file size) by testing various scenarios.
      Results:

      Scalability: In terms of scalability, the whole system behaved well. Our experiments were limited by the total number of nodes of the cluster (51 nodes in total); the maximum number of nodes that we could allocate for our experiments was 32. In the end we tested two scenarios: 5-node (10 CPUs) and 32-node (64 CPUs) platforms. System recovery was correct in both cases. We observed only one error, which was not permanent but was perhaps a bit more frequent with larger deployments, with checkpoint storage on NFS, and with short checkpointing periods. The error is described in detail in 3.5.1.

      Checkpointing duration and checkpoint file size: The checkpointing duration varies a lot, but it is actually not as long as we initially feared. In many cases a checkpoint takes less than a minute for checkpoint files of 800 MB per CPU, or 8000 MB for the 5-node platform. This is a very low figure and a very good result that should be taken into consideration in the overall evaluation of the system. Nevertheless, there were cases where checkpointing took 1055 s (~17.5 min) for checkpoint files of 1000 MB per CPU, or 10000 MB for the 5-node platform. We also had cases of very long checkpointing durations, on the order of 1-3 hours, for checkpoint files of 1400 MB. This is very long, but it can be explained by the very high memory utilization of the whole system (up to 93% of RAM).

      An important observation is that the checkpointing duration scales much better than expected with system size. From the second table we can see that a checkpoint file of 800 MB per CPU, or 25600 MB for the 32-node platform, can be taken in less than two minutes. Furthermore, the longest duration observed on the 32-node platform was 12 min for checkpoint files of 1000 MB per CPU, or 32000 MB for the whole platform. This is a very good result, and it even seems better than the results obtained in the 5-node case (17.5 min for 10000 MB in the 5-node case vs 12 min for 32000 MB in the 32-node case). Nevertheless, we must keep in mind that the 32-node experiments were repeated fewer times than the 5-node experiments and that errors were considerably more frequent in the 32-node case.

      Another observation is that the checkpointing duration can grow over time: the first checkpoints are taken much faster than those taken near the end of the execution. Finally, checkpointing on NFS takes much longer than checkpointing locally, as we expected. Moreover, memory utilization seems higher in the NFS checkpointing case than in the local checkpointing scenario.
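
      To compare the two platforms on a common scale, the durations and sizes quoted above can be converted into aggregate checkpoint throughput (total size divided by duration). The Python sketch below does this arithmetic using only the figures given in the text. The helper name is hypothetical, and for the "less than a minute" and "less than two minutes" cases the value is a lower bound on throughput, since only an upper bound on the duration is reported.

```python
def throughput_mb_per_s(total_mb: float, duration_s: float) -> float:
    """Aggregate checkpoint throughput in MB/s for the whole platform."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return total_mb / duration_s


# (label, total checkpoint size in MB, duration in seconds), from the text above
CASES = [
    ("5 nodes, fast case (at least)", 8000, 60),
    ("5 nodes, slow case", 10000, 1055),
    ("32 nodes, fast case (at least)", 25600, 2 * 60),
    ("32 nodes, longest case", 32000, 12 * 60),
]

if __name__ == "__main__":
    for label, size_mb, seconds in CASES:
        print(f"{label}: {throughput_mb_per_s(size_mb, seconds):.1f} MB/s")
```

      Seen this way, the longest 32-node case moves roughly 44 MB/s in aggregate against roughly 9.5 MB/s for the slow 5-node case, which is one way to read the observation that checkpointing scales better than expected.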
    • Kadeploy2 toolkit optimisations (Middleware) [achieved]
      Description: The Kadeploy2 environment deployment toolkit provides automated software installation and reconfiguration mechanisms for clusters and grids. It introduces a prototype idea that aims to be a new way of exploiting clusters and grids: letting users concurrently deploy computing environments exactly fitted to their experimental needs on different sets of nodes. Since the deployment execution time is a very important aspect of the viability of this approach, we studied and proposed optimization methods for the deployment procedure. Multiple performance measurements validated our approach, met our expectations and generated ideas for deeper optimizations.
      Results: Comparison of the two Kadeploy2 deployment optimisation methods with the default deployment method on the GDX cluster (Orsay): boot times (in seconds) according to the number of nodes. The featured chart validates our expectations for both optimization methods and shows that all deployment methods scale very well with the number of nodes.



      Success stories and benefits from Grid'5000

      • Overall benefits
      • A robust and scalable reconfiguration mechanism.

      last update: 2010-01-14 22:41:53
