Orchestrating Experiments on the gLite Production Grid Middleware (Orchestration)
Leaders: Lucas Nussbaum (ALGORILLE), Frédéric Suter (CC IN2P3)
The gLite grid middleware is the core software component of the EGI (formerly EGEE) production grid. gLite is organized as a set of services to manage security and authentication (Virtual Organisation Membership Service – VOMS), interact with the local batch schedulers (Computing Element – CE), manage data at the grid level (Storage Element – SE), and expose information on the available resources (Information Service – IS). It also provides a user interface that lets users find resources, submit or cancel jobs, check the status or output of jobs, manage their files, etc. Many users also add a layer on top of gLite, such as DIRAC to manage pilot jobs, or MOTEUR to orchestrate large computation campaigns.
The performance and ease of use of the grid platform are highly dependent on this software stack, so it is crucial to be able to evaluate possible improvements in a test environment. Unfortunately, building such a test environment is very hard: it must be flexible enough to let developers replace parts of the infrastructure, and large enough to reproduce problems that only arise when a very large number of resources are involved. Currently, engineers and researchers test their improvements directly on the production platform, which has several drawbacks (risk of breaking the infrastructure, low reproducibility of experiments, waste of production resources).
The goal of this challenge is to explore the use of the Grid’5000 testbed as a test environment for production grid software such as gLite and other related services. Grid’5000 has most of the required features: it is composed of a large number of nodes grouped in clusters and sites, which matches (at a smaller scale) the architecture of the EGI production grid, and it lets users deploy their own software stack. However, specific work is needed to (1) overcome the technical obstacles encountered when deploying such complex software, and (2) understand how to organize experiments involving a very large number of nodes performing complex activities.
This latter point raises important scientific challenges. Traditionally, Grid’5000 users have controlled their experiments either manually or with ad-hoc scripts. However, those approaches have serious limitations when trying to perform complex experiments at a large scale: it is extremely hard to write scalable scripts and to address reliability problems such as node failures. During this challenge, we plan to investigate easier and better ways to perform complex experiments at large scale. Instead of relying on ad-hoc scripts, we will work on reusable services that provide the basic features required to perform experiments. To organize the experiment itself, we will work on middleware for the orchestration of large-scale and complex experiments, together with experts from the field of Business Process Management.
We plan to contribute:
- A detailed procedure (appliances based on Kadeploy images and scripts, documentation) to deploy the gLite middleware on Grid’5000, enabling others to deploy a production grid infrastructure inside Grid’5000 for testing purposes;
- Reusable services for the control of a large number of nodes, the management of data, the emulation of experimental conditions, the injection of load and faults, instrumentation and monitoring, etc. In some cases, we will base our work on existing solutions;
- Middleware for the orchestration of experiments on Grid’5000, enabling users to run better experiments more easily;
- Large-scale experiments involving the gLite middleware and applications from production grids.