What is OAR 2?
OAR 2 is the next generation of the batch management system used on Grid'5000.
Until September 17, 2007, Grid'5000 was using version 1.6 of OAR, but OAR 1.6:
- is not maintained anymore
- showed its limitations in resource management, especially in the clean-up process
- cannot cope with the added complexity of managing resources at the cpu/core level
- does not handle resource sharing (time-sharing)
As a result, the Grid'5000 Technical Committee (CT) has been working on the migration of the platform to OAR 2 since January 2007.
Grid'5000 is now using OAR 2 on all its sites.
OAR 2 is maintained, and provides the resource management scalability required for the evolution of the Grid'5000 platform, given the ever-growing number of clusters and the need to handle the heterogeneity of the resources more rationally.
As a consequence of the migration, the following changes impact Grid'5000 users:
- need to use a core (cpuset) aware connector to access the resources at the core level. Either:
- any ssh software with special options to explicitly specify the user and the ssh key: -l oar and -i <job-key>
- or OAR 2 oarsh connector (wrapper on top of OpenSSH)
- or use the backward compatible mode: an ssh connector like with OAR 1.6, which however cannot take advantage of the CPU/core management level.
- need to adapt to the redesigned resource hierarchy, with one OAR server per site instead of per cluster.
This page presents details of the proposed architecture for Grid'5000 reservation mechanisms using OAR 2.
If you are looking for actual examples of use of OAR 2, the OAR2 use cases page presents usage examples.
What's new in OAR 2 and what will be the benefits of using OAR 2 on Grid'5000?
Below we present some points explaining the benefits of the new version of OAR for Grid'5000; generic information can be found on the OAR 2 website: http://oar.imag.fr/docs/manual.html
A better resource management
Using a Linux kernel feature called cpusets*, OAR 2 allows more reliable management of the resources:
- No unattended processes should remain from previous jobs, as used to happen with OAR 1.6, especially with a node weight > 1 (configurations with resources at the cpu/core level).
- Access to the resources is now de facto restricted to the owner of the resources.
Besides, features like job dependencies and check-pointing are now available, allowing better use of the resources.
(*) A cpuset is attached to every process, and allows:
- to specify which processor/memory resources can be used by a process, e.g. the resources allocated to the job in the OAR 2 context.
- to group and identify processes that share the same cpuset, e.g. the processes of a job in the OAR 2 context, so that actions like clean-up can be performed efficiently (here, cpusets provide a replacement for the process group/session concept, which is not reliable in Linux).
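The cpuset a process belongs to can be inspected directly from the proc filesystem, as the examples at the end of this page do. A minimal sketch, assuming a Linux kernel built with cpuset support (the /oar/... path shown in the comment is the form used in example 4):

```shell
# Show which cpuset the current shell belongs to.
# Inside an OAR 2 job this is the job's cpuset, e.g. /oar/pneyron_11415;
# outside a job it is the root cpuset "/".
cat /proc/self/cpuset 2>/dev/null || echo "(no cpuset support on this kernel)"
```

Since every process of a job carries the same cpuset path, the clean-up procedure can find and kill exactly the job's processes, instead of guessing from process owners.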
OAR 2 can manage complex hierarchies of resources. In the Grid'5000 case, we propose to configure one OAR 2 server per Grid'5000 site. OAR 2 then allows the following resource hierarchy to be configured on every Grid'5000 site:
- host (nodes)
Furthermore, as a side effect of this rethinking of resource management to take advantage of OAR 2's new features, a global normalization effort for the OAR resources definition has started. The OAR properties page proposes resource properties that will be implemented on every site with OAR 2.
A modern cluster management system
By providing a mechanism to isolate jobs at the core level, OAR 2 is one of the most modern cluster management systems. Users developing cluster or grid algorithms and programs will thus work in an up-to-date environment, similar to the ones they will meet with other recent cluster management systems on production platforms, for instance.
Optimization of the resources usage
Nowadays, machines with more than 4 cores are becoming common, so it is very important to be able to handle cores efficiently. By providing resource selection and process isolation at the core level, OAR 2 allows users running experiments that do not require exclusive access to a node (at least during a preparation phase) to use a single core on each of many nodes, leaving the remaining cores free for other users. This optimizes the number of available resources.
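Such a core-level reservation is expressed directly in the oarsub resource request, as in the examples later on this page. For instance, a sketch (the resource counts are illustrative):

pneyron@idpot:~$ oarsub -I -l host=4/core=1

This asks for 1 core on each of 4 different hosts; the other cores of those hosts remain available for other users' jobs.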
Besides, OAR 2 also provides a time-sharing feature which allows sharing a same set of resources among users. This will especially be useful during demonstrations or events such as plugtests.
Easier access to the resources
Using the OAR 2 OARSH connector to access job resources, basic usage no longer requires the user to configure his SSH environment, as everything is handled internally (known host key management, etc.). Besides, users who prefer not to use OARSH can still use SSH, at the cost of setting a few options (one of the features of the OARSH wrapper is precisely to hide these options).
Grid resources interconnection
As access to one cluster's resources is restricted to an attached job, one may wonder whether connections from job to job, from cluster to cluster, or from site to site are still possible. OAR 2 provides a mechanism called job-key that allows inter-job communication, even across several sites managed by several OAR 2 servers (this mechanism is indeed used by OARGrid 2, for instance).
Management of abstract resources
OAR 2 features a mechanism to manage resources like software licenses or other non-material resources the same way it manages classical resources. Grid'5000 may take advantage of this feature to manage the allocation of virtual IP address ranges in the context of deploying many virtualized systems (Xen instances, for example) on one physical node (to be implemented).
Grid'5000 OAR 2 configuration
OAR 2 configurations on Grid'5000 sites have some Grid'5000-specific settings:
One server per site managing the complete hierarchy of resources of the site
The uniformization of the Grid'5000 resources defined on all sites has been requested for a long time. One OAR 2 server for the whole of Grid'5000 would not work for several reasons, one being the lack of a robust fail-over mechanism in case of a temporary failure of the Grid'5000 national interlink. Hence, after evaluating the pros and cons, the CT decided to propose switching from the one-server-per-cluster configuration to a one-server-per-site configuration, as it enhances the resource management capabilities, especially in the case of heterogeneous clusters such as Bordeaux's or GDX.
The following hierarchy of resources is then shown to the users on ALL sites:
- host (nodes)
Users will be able to make OAR reservations on the whole set of resources of a site using a single oarsub request.
Mechanisms (admission rules) are configured so that the usage of the platform does not change much, for instance by forcing default job submissions onto a homogeneous cluster of machines, even on sites with several clusters.
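A user who wants a specific homogeneous cluster on a multi-cluster site can still select it explicitly with a property expression on the oarsub command line. A sketch, assuming a cluster property as proposed on the OAR properties page (the cluster name here is hypothetical):

pneyron@idpot:~$ oarsub -I -l host=2 -p "cluster='idpot'"

The -p option restricts the resources considered for the job to those matching the given property expression.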
cpuset and OARSH modules
In order to provide the cpu and core hierarchy levels as requested on some sites, the OAR 2 cpuset module must be activated. This choice is further encouraged by the fact that it also provides enhanced resource isolation and clean-up.
Hence, each job is now associated with a unique cpuset identifier, which is propagated to all the nodes belonging to the job's resources thanks to a job key.
Basically, a job key is an SSH key dedicated to the job. As a result, users have to use special options to connect to their job's resources with SSH, or, as an alternative, use the OAR 2 connector OARSH. OARSH is a wrapper for SSH that manages these options internally.
Grid mode enabled by default
OAR 2's default resource management does not allow connections from the resources of one job to the resources of another job. Such connections are enabled by OAR 2's job key mechanism.
On Grid'5000, the job key mechanism is therefore activated by default.
As with OAR 1.6, OAR 2 is configured with bindings to kadeploy. The mechanism is the same as with the previous version of OAR: access to the deployed nodes is possible through classical SSH connections, for instance, as OAR 2's cpuset management is not involved on deployed nodes.
One difference, however, is that deployment jobs no longer have a different priority, as they are now managed by a job type instead of a special queue (a different queue implies a different priority).
Backward compatibility mode
In order to stay as backward compatible as possible with OAR 1.6 usage, a backward compatibility mode is configured for the ssh connector:
If nodes are reserved entirely (i.e. all the cores/cpus of the node are assigned to a single job), the configuration allows access to the node via a classical SSH connection, using the same mechanism as the one in place for OAR 1.6 (access restricted to the job owner; cpuset management does not take place upon SSH connection to the node). This behavior is activated using the allow_classic_ssh submission type.
- CPU/Core level usage: this backward compatible connection mode cannot be used for experiments run at the cpu/core level.
- Time-sharing* case: with the backward compatible mode, node clean-up is performed on a brutal per-user (process owner) basis (the OAR 1.6 mechanism). So if time-sharing is activated and a node happens to run several jobs from the same user, the user's processes may be killed unexpectedly as a side effect of the termination of one of the user's jobs, even if the other jobs use cpusets (if they only run on some of the cores, for instance).
(*) Time-sharing allows running many jobs on the same resource during the same period of time.
Accessing Grid'5000 resources with OAR 2
Reminder: OAR 2 on Grid'5000 is configured with the job-key mechanism systematically activated (grid job mode), with one server per site.
Backward compatibility mode: the 3 paragraphs below do not explain the backward compatible mode, but the job key mechanism usage. Example 4 presents a use case of the backward compatible mode.
Job creation (submissions/reservations)
A job-key is associated with each job in order to identify cpusets (cpu/core management). Hence, two scenarios are possible:
- let oarsub automatically create the job-key for the job (job key can be exported using the -e aka --export-job-key-to-file option)
- force oarsub to use a pre-existing job-key (using the -i aka --import-job-key-from-file option)
- Forcing oarsub to use a pre-existing job-key is mandatory when creating several jobs on several sites with the purpose of interconnecting them afterward (grid job). The first job creation may generate the key; the subsequent job creations have to use the same job-key.
- Always using the same job key is very convenient when using ssh (and not oarsh) to connect to the nodes, as the key can be set in the ssh configuration file. However, users must be warned that this usage raises issues if submitting 2 or more jobs that share a same subset of nodes (on different cpusets), because in this case processes cannot be guaranteed to run in the right cpuset.
- Using oarsh eases the use of a dedicated job key per job.
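The two scenarios above combine naturally for a grid job spanning two sites. A sketch (the second frontend name and the resource requests are illustrative): the first oarsub exports a freshly generated key, and the second imports that same key.

pneyron@idpot:~$ oarsub -l host=2 -e ~/my_grid_key "sleep 3600"
pneyron@frontend2:~$ oarsub -l host=2 -i ~/my_grid_key "sleep 3600"

Since both jobs share the same job key, their resources can connect to each other.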
The OAR 2 connector approach was designed to stay as friendly as possible to SSH users, even at the cost of the cpuset propagation mechanism. Therefore, users can continue using either OpenSSH or other SSH implementations (a Java SSH API, for instance), as long as they can specify 3 options:
- the user for the connection, which must be oar
- the key to use for the connection, which has to be the job key
- the port the OAR dedicated SSH server is running on (TCP port 6667)
On the other hand, OAR 2 also provides a connector (a wrapper for ssh) called oarsh that makes connecting to the resources easier:
- Job key is managed transparently (oarsh automatically retrieves the information when available)
- User is not required to manage his ssh configuration to access the resources, as it is handled by oar (known_hosts, the StrictHostKeyChecking option, authentication keys, and so on).
- Advanced setups to mix connections to nodes (using oarsh) and to other resources (using ssh) are possible.
The examples below are just technical illustrations of the concepts explained above. For OAR2 usage examples, please go to the page OAR2 use cases.
Example 1 - oarsub and basic ssh commands
- First create the job with the option to export the job key
pneyron@idpot:~$ oarsub -I -l host=2/core=1 -e ~/my_job_key
Generating public/private job key pair...
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=10750
OAR_JOB_KEY_FILE=/home/grenoble/pneyron/my_job_key
Interactive mode : waiting...
[2007-07-03 15:36:25] Starting...
Initialize X11 forwarding...
Connect to OAR job 10750 via the node idpot1.grenoble.grid5000.fr
pneyron@idpot1:~$
The job key was generated for the job and exported to the file /home/grenoble/pneyron/my_job_key.
- Then connect to the nodes
2 nodes belong to the job: idpot1 and idpot2. We connect to them from the frontend:
pneyron@idpot:~$ ssh -l oar -p 6667 -i ~/my_job_key idpot1
Last login: Tue Jul 3 15:28:17 2007 from idpot.imag.fr
pneyron@idpot1:~$ ssh -l oar -p 6667 -i ~/my_job_key idpot2
Last login: Wed Jun 20 15:44:46 2007 from idpot.imag.fr
pneyron@idpot2:~$
If a series of batch jobs is submitted, with every oarsub exporting the job key to the same filename, then the job key of the last job overwrites the job keys of all the other jobs. As a result, connection is only possible to the resources of the last job.
In that case, one may either use a different filename for the job key exported for each job (for instance using the %jobid% string), or use the same job key for all jobs, using the -i aka --import-job-key-from-file option (see example 2).
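A sketch of the first option, using the %jobid% string in the exported filename (the resource request is illustrative):

pneyron@idpot:~$ oarsub -l host=2 -e ~/my_job_key_%jobid% "sleep 3600"

Each submission then exports its key to a distinct file, so the keys of earlier jobs are not overwritten.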
Example 2 - oarsub and ssh, hiding ssh options
An OpenSSH connection can be configured on a per-hostname basis within the configuration file. Hence it is possible to specify the user and the key in that file, and then simply run ssh <hostname>.
- Edit ~/.ssh/config and add
[...]
Host idpot?*
    User oar
    IdentityFile ~/my_OAR_jobkey
    Port 6667
[...]
The user then always uses the same job key, which he may have obtained from a previous oarsub, or which he can generate using ssh-keygen:
ssh-keygen -b 1024 -N "" -t rsa -f ~/my_OAR_jobkey
Warning: this usage raises issues if submitting 2 or more jobs that share a same subset of nodes (on different cpusets), because oarsh won't be able to connect to the right cpuset for a given job id on those nodes, as they all have the same job key identifier.
- Create the job, importing the job key.
pneyron@idpot:~$ oarsub -I -l host=2/core=1 -i ~/my_OAR_jobkey
Import job key from file: /home/grenoble/pneyron/my_OAR_jobkey
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=11332
Interactive mode : waiting...
[2007-07-12 15:30:51] Starting...
Initialize X11 forwarding...
Connect to OAR job 11332 via the node idpot1.grenoble.grid5000.fr
pneyron@idpot1:~$
The -i option tells oarsub where to import the job-key from (use the same location as in the ssh configuration file).
- Then connect to the nodes
2 nodes belong to the job: idpot1 and idpot2. We connect to them from the frontend:
pneyron@idpot:~$ ssh idpot1
Last login: Tue Jul 10 09:17:56 2007 from idpot.imag.fr
pneyron@idpot1:~$
pneyron@idpot1:~$ ssh idpot2
Last login: Fri May 25 11:49:27 2007 from idpot.imag.fr
pneyron@idpot2:~$
Example 3 - oarsub and oarsh
- Create the job
pneyron@idpot:~$ oarsub -I -l host=2/core=1
Generate a job key...
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=11337
Interactive mode : waiting...
[2007-07-12 16:09:36] Starting...
Initialize X11 forwarding...
Connect to OAR job 11337 via the node idcalc10.grenoble.grid5000.fr
pneyron@idcalc10:~$
- Connect from node to node
2 nodes belong to the job: idcalc9 and idcalc10.
pneyron@idcalc10:~$ oarsh idcalc9
Last login: Thu Jul 12 14:48:02 2007 from idpot.imag.fr
pneyron@idcalc9:~$
- Connect to the nodes from the frontend
We connect from the frontend. As oarsh cannot guess which job we want to connect to, we set OAR_JOB_ID. This is only needed on the frontend.
pneyron@idpot:~$ export OAR_JOB_ID=11337
pneyron@idpot:~$ oarsh idcalc9
Last login: Thu Jul 12 14:48:02 2007 from idpot.imag.fr
pneyron@idcalc9:~$
An alternative is to use oarsub -C <jobid>
pneyron@idpot:~$ oarsub -C 11337
Initialize X11 forwarding...
Connect to OAR job 11337 via the node idcalc10.grenoble.grid5000.fr
pneyron@idcalc10:~$
- Connect to the node from another machine on the grid
Below, the user must know the job key (the job key had to be exported or imported at submission time, see examples 1 and 2).
If ~/my_job_key is the job key for the job, run:
pneyron@idfix:~$ export OAR_JOB_KEY_FILE=~/my_job_key
pneyron@idfix:~$ oarsh idcalc10
Last login: Sat May 19 01:14:13 2007 from idpot.imag.fr
pneyron@idcalc10:~$

pneyron@idfix:~$ oarsh -i ~/my_job_key idcalc10
Last login: Sat May 19 01:14:14 2007 from idpot.imag.fr
pneyron@idcalc10:~$
If connecting from a machine that does not provide oarsh, you can run ssh:
pneyron@idbloc:~$ ssh -l oar -p 6667 -i ~/my_job_key idcalc10
Last login: Sat May 19 01:14:13 2007 from idpot.imag.fr
pneyron@idcalc10:~$
Example 4 - oarsub on entire nodes and backward compatible mode for ssh connection
- First create the job, using ONLY ENTIRE nodes
pneyron@idpot:~$ oarsub -I -t allow_classic_ssh -l nodes=2
Generate a job key...
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=11415
Interactive mode : waiting...
[2007-07-13 16:36:25] Starting...
Initialize X11 forwarding...
Connect to OAR job 11415 via the node idcalc1.grenoble.grid5000.fr
pneyron@idcalc1:~$ cat /proc/self/cpuset
/oar/pneyron_11415
- Then connect to the nodes
We connect to the nodes from the frontend:
pneyron@idpot:~$ ssh idcalc1
Last login: Fri Dec 15 17:46:07 2006 from idpot.imag.fr
pneyron@idcalc1:~$ cat /proc/self/cpuset
/
pneyron@idcalc1:~$ ssh idcalc2
Last login: Fri Dec 15 17:46:07 2006 from idpot.imag.fr
pneyron@idcalc2:~$ cat /proc/self/cpuset
/
pneyron@idcalc2:~$
We can connect to the nodes with no special SSH command. However, our processes are not part of the job's cpuset, which is acceptable in this case, as the job owns all the cores of its nodes (process isolation at the core level is not required/useful here).
This example requires no SSH configuration, unlike example 2 (~/.ssh/config). Remove the configuration from example 2 if it is set.