Advanced OAR

This section presents various details of OAR that are useful for advanced usage, as well as some tips and tricks. It assumes you are familiar with OAR and Grid5000 basics, and that you are using the bash shell (the examples should be easy to adapt to another shell).

useful setup tips

  • Take the time to carefully configure ssh, as described in [www.loria.fr/~lnussbau/files/g5kss10-grid5000-efficiently.pdf Working efficiently on Grid’5000].
  • Use screen so that your work is not lost if you lose the connection to Grid5000. Moreover, having a screen session opened with one or more shell sessions allows you to leave your work session whenever you want, then get back to it later and recover it exactly as you left it, as shown below.
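
For example, a minimal screen workflow (the session name is arbitrary):

$ screen -S g5k     # start a named session
# work, then detach with Ctrl-a d (or lose the connection)
$ screen -ls        # list your sessions
$ screen -r g5k     # reattach and recover the session as you left it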

connection to the job's nodes

job keys

By default, OAR generates an ssh key pair for each job, and oarsh is used to connect to the job's nodes. oarsh looks at the OAR_JOB_ID or OAR_JOB_KEY_FILE environment variables to know which key to use, so you need to be connected to the job (oarsub -C <JOB_ID>) to use it. If needed, OAR can export the job key of a specific job; you can then set the environment manually and use oarsh, or use ssh directly.
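
For example, a sketch using oarsub's job-key options (-k/--use-job-key and -e/--export-job-key-to-file; the node name is a placeholder):

$ oarsub -I -k -e ~/my_job_key
# from another shell, without being connected to the job:
$ OAR_JOB_KEY_FILE=~/my_job_key oarsh parapluie-7.rennes.grid5000.fr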

sharing keys between jobs

Telling OAR to always use the same key can be very convenient. If you have a passphrase-less ssh key dedicated to navigating inside Grid5000, then in your ~/.profile or ~/.bash_profile you can set:

export OAR_JOB_KEY_FILE=<path_to_your_key>

Then, OAR will always use this key for all submitted jobs, which allows you to connect to your nodes with oarsh without being connected to the job.

Moreover, if this key is replicated between all Grid5000 sites, and if the environment variable OAR_JOB_KEY_FILE is exported in ~/.profile or ~/.bash_profile on all sites, you will be able to connect directly from any frontend to any reserved node of any site.
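
For example, assuming the key is replicated and the variable is exported on every site (host name is a placeholder):

# from any frontend, e.g. lyon, connect directly to a node reserved at rennes:
$ oarsh parapluie-7.rennes.grid5000.fr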

If you use the same key for all jobs, be warned that this raises issues if you submit two or more jobs that share a subset of nodes with different cpusets: in that case, processes cannot be guaranteed to run in the correct cpuset.

allow_classic_ssh

Submitting with the option -t allow_classic_ssh allows you to use ssh directly (instead of oarsh) to connect to the nodes, at the cost of not being able to select resources at a finer level than the node (cpu, core).
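
For example (the resource request is arbitrary):

$ oarsub -I -t allow_classic_ssh -l nodes=2
# once the job has started, plain ssh works:
$ ssh <node>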

passive and interactive modes

In interactive mode: a shell is opened on the first node of the reservation (or on the frontend, with the appropriate environment set, if the job is of type deploy). The job will be killed as soon as this shell is closed, and is limited by the job's walltime. It can also be killed by an explicit oardel.

You can experiment with 3 shells. In the first shell, to see the list of your running jobs, regularly run:

$ oarstat -u $USER

In the second shell, run an interactive job:

$ oarsub -I

Wait for the job to start, run oarstat, then leave the job and run oarstat again. Submit another interactive job, and from the third shell, kill it:

$ oardel <JOB_ID>

In passive mode: an executable is run by OAR on the first node of the reservation (or on the frontend, with the appropriate environment set, if the job is of type deploy). The only limit on the job's length is its walltime. It can also be killed by an explicit oardel. For example:

JOBID=$(oarsub 'uname -a' | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')
# once the job has completed:
cat OAR.$JOBID.stdout

You may want a job that is neither interactive nor runs a script when it starts, for example because you will use the reserved resources from a program whose lifecycle is longer than the job's (and which will use the resources by connecting to the job). One trick to achieve this is to run the job in passive mode with a long sleep command. One drawback of this method is that the job may terminate with an error status if the sleep is killed, which can be a problem in some situations, e.g. when using job dependencies.
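
A minimal sketch of this trick (walltime and sleep duration are arbitrary):

$ oarsub -l nodes=1,walltime=2:00:00 "sleep 7200"
# the resources can then be used by connecting to the job, e.g.:
$ OAR_JOB_ID=<JOB_ID> oarsh <node>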

OARGRID

cigri

Reserve several resource types with constraints

We want 2 nodes and 4 /22 subnets with the following constraints:

  • Nodes are on 2 different clusters of the same site (Hint: use a site with several clusters :-D)
  • Nodes have virtualization capability enabled
  • /22 subnets are on two different /19 subnets
  • 2 subnets belonging to the same /19 subnet are consecutive
 oarsub -I -l /slash_19=2/slash_21=1+{"virtual!='none'"}/cluster=2/nodes=1
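
Reading the expression: /slash_19=2/slash_21=1 requests one /21 subnet (i.e. two consecutive /22 subnets) in each of two different /19 subnets, and {"virtual!='none'"}/cluster=2/nodes=1 requests one node on each of two clusters whose nodes have virtualization capability enabled.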

Let's verify the reservation:

 $ uniq $OAR_NODE_FILE
 paradent-6.rennes.grid5000.fr
 parapluie-7.rennes.grid5000.fr
 $ g5k-subnets -p
 10.158.8.0/22
 10.158.32.0/22
 10.158.36.0/22
 10.158.12.0/22
 $ g5k-subnets -ps
 10.158.8.0/21
 10.158.32.0/21
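
As expected, the two nodes belong to two different clusters (paradent and parapluie), the four /22 subnets form two consecutive pairs (10.158.8.0/22 with 10.158.12.0/22, and 10.158.32.0/22 with 10.158.36.0/22), and the two enclosing /21 subnets lie in two different /19 subnets (10.158.0.0/19 and 10.158.32.0/19).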