Advanced OAR

Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This tutorial is made of two relatively independent parts, one about #OAR and one about #CIGRI. The OAR part consists of various independent sections describing advanced OAR usage as well as some tips and tricks; you can read it linearly or pick only the sections you are interested in, but begin at least with #useful tips. The CIGRI part is more directive, and you should read it linearly.

OAR

This section shows various details of OAR that are useful for advanced usage, as well as some tips and tricks. It assumes you are familiar with OAR and Grid5000 basics, and that you are using the bash shell (it should be easy to adapt to another shell).

This OAR tutorial focuses on command line usage.

useful tips

  • Take the time to carefully configure ssh, as described in Getting Started#Connecting for the first time and preparing your SSH environment.
  • Use screen so that your work is not lost if you lose the connection to Grid5000. Moreover, having a screen session opened with one or more shell sessions allows you to leave your work session whenever you want, then get back to it later and recover it exactly as you left it.
  • Most OAR commands (oarsub, oarstat, oarnodes) can provide output in various formats (see the example after this list):
    • textual (this is the default mode)
    • PERL dumper (-D)
    • xml (-X)
    • yaml (-Y)
    • json (-J)
  • Direct access to the OAR database: users can directly access the mysql OAR database oar2 on the server mysql.<site>.grid5000.fr with the read-only account oarreader. The password is read.
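
For example, to list your own jobs in YAML or JSON, which is convenient when scripting:

$ oarstat -u -Y
$ oarstat -u -J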

Connection to a job

Being connected to a job means that your environment is set up (OAR_JOB_ID and OAR_JOB_KEY_FILE) so that OAR commands can work. You are automatically connected to a job if you have submitted it in interactive mode. Otherwise, you must manually connect to it:

$ JOBID=$(oarsub 'sleep 300' | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')
$ oarsub -C $JOBID
$ pkill -f 'sleep 300'

Connection to the job's nodes

You will normally use the oarsh wrapper instead of ssh to connect to the nodes, and oarcp instead of scp to copy files to/from the nodes. If you use taktuk (or a similar tool like pdsh), you have to configure it to use oarsh instead of ssh.
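
For example, from within a job, you can copy a file to a node with oarcp and tell taktuk to use oarsh as its connector (the file name and node name are placeholders):

$ oarcp ./myfile <NODE_NAME>:/tmp/
$ taktuk -c oarsh -f $OAR_NODE_FILE broadcast exec [ hostname ]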

oarsh and job keys

By default, OAR generates an ssh key pair for each job, and oarsh is used to connect to the job's nodes. oarsh looks at the environment variables OAR_JOB_ID or OAR_JOB_KEY_FILE to know which key to use. Thus oarsh works directly if you are connected to the job. You can also connect to the nodes without being connected to the job:

$ oarsub -I
[ADMISSION RULE] Set default walltime to 3600.
[ADMISSION RULE] Modify resource description with type constraints
Generate a job key...
OAR_JOB_ID=<JOBID>
...

then, in another terminal:

$ OAR_JOB_ID=<JOBID> oarsh <NODE_NAME>

If needed, OAR also allows you to export the job key of a job.
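
For instance, a sketch assuming the --export-job-key-to-file option of oarsub (check oarsub --help on your site, as options may vary with the OAR version):

$ oarsub -l nodes=2 --export-job-key-to-file=$HOME/my_job_key 'sleep 300'

The exported key can then be reused via OAR_JOB_KEY_FILE, or imported into another job with oarsub -i.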

sharing keys between jobs

Telling OAR to always use the same key can be very convenient. If you have a passphrase-less ssh key dedicated to navigating inside Grid5000, then in your ~/.profile or ~/.bash_profile you can set:

export OAR_JOB_KEY_FILE=<path_to_your_key>

Then, OAR will always use this key for all submitted jobs, which allows you to connect to your nodes with oarsh without being connected to the job.

Moreover, if this key is replicated between all Grid5000 sites, and if the environment variable OAR_JOB_KEY_FILE is exported in ~/.profile or ~/.bash_profile on all sites, you will be able to connect directly from any frontend to any reserved node of any site.

If using the same key for all jobs, be warned that this will raise issues if you submit two or more jobs that share a subset of nodes with different cpusets, because in this case processes cannot be guaranteed to run on the right cpuset.

allow_classic_ssh

Submitting with option -t allow_classic_ssh allows you to use ssh directly instead of oarsh to connect to the nodes, at the cost of not being able to select resources at a finer level than the node (cpu, core).
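
For example:

$ oarsub -I -t allow_classic_ssh -l nodes=2

then, from the frontend or from the first node:

$ ssh <NODE_NAME>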

oarsh details

oarsh is a frontend to ssh. It opens an ssh connection as user oar to the dedicated oar ssh server running on the node, listening on port 6667. It detects who you are based on your key, and if you have the right to use the node (if you have reserved it) it will su to your user on the node.

So, if you don't have oarsh installed, you can still connect to the nodes by simulating it. One use case is if you have reserved nodes and want to connect to them through an ssh proxy as described in SSH#Using_SSH_with_ssh_proxycommand_setup_to_access_hosts_inside_Grid.275000:

If you have a passphrase-less ssh key internal to Grid5000, which you use to navigate inside Grid5000, you can tell OAR to use this key instead of generating a job key (see #sharing keys between jobs). You can then copy this key to your workstation outside of Grid5000:

user-laptop$ scp g5k:.ssh/<internal_key_name> g5k:.ssh/<internal_key_name>.pub ~/

In Grid5000, submit a job using this key:

$ oarsub -i ~/.ssh/<internal_key_name> -I

Wait for the job to start. Then in another terminal, from outside Grid5000, try connecting to the node:

user-laptop$ ssh -i ~/<internal_key_name> -p 6667 oar@<node name>.g5k

passive and interactive modes

In interactive mode, a shell is opened on the first node of the reservation (or on the frontend, with the appropriate environment set, if the job is of type deploy). The job is killed as soon as this shell is closed, or when the walltime is reached. It can also be killed by an explicit oardel.

You can experiment with 3 shells. In the first shell, regularly run the following command to see the list of your running jobs:

$ oarstat -u

In the second shell, run an interactive job:

$ oarsub -I

Wait for the job to start, run oarstat, then leave the job and run oarstat again. Submit another interactive job, and in the third shell, kill it:

$ oardel <JOBID>

In passive mode, an executable is run by OAR on the first node of the reservation (or on the frontend, with the appropriate environment set, if the job is of type deploy). The job's duration is only limited by its walltime. It can also be killed by an explicit oardel.

JOBID=$(oarsub 'uname -a' | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')
cat OAR.$JOBID.stdout

You may want a job that is neither interactive nor runs a script when it starts, for example because you will use the reserved resources from a program whose lifecycle is longer than the job (and which will use the resources by connecting to the job). One trick to achieve this is to run the job in passive mode with a long sleep command. One drawback of this method is that the job may terminate with an error status if the sleep is killed. This can be a problem in some situations, e.g. when using job dependencies.
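
A possible sketch of this trick (the sleep duration and walltime below are arbitrary):

$ JOBID=$(oarsub -l nodes=2,walltime=2:00:00 'sleep 7200' | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')

then, from the long-lived program or another shell:

$ OAR_JOB_ID=$JOBID oarsh <NODE_NAME>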

Submission and Reservation

  • If you don't specify the job's start date (oar option -r), then your job is a submission and oar will choose the best schedule.
  • If you specify the job's start date, this is a reservation: OAR cannot decide the best schedule anymore, it is fixed.

There are some consequences:

  • Current Grid5000 user charter allows no more than 2 reservations per site
  • in submission mode, you're almost guaranteed to get the resources you want, because oar can decide which resources to allocate at the last moment. You cannot get the list of resources until the job starts.
  • in reservation mode, you're not guaranteed to get the resources you want, because oar has to plan the allocation of resources at reservation time. If resources later become unavailable, you lose them for your job. You can get the list of resources as soon as the reservation starts.
  • in submission mode, you cannot know the date at which your job will start until it actually starts, but OAR can give you an estimation of that date (see the example after this list).
  • to coordinate oar submissions on several sites, OARGRID must do OAR reservations.
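
For example, to see OAR's estimated start date for a waiting submission (the exact field names shown by oarstat -fj may vary with the OAR version):

$ JOBID=$(oarsub 'sleep 300' | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')
$ oarstat -fj $JOBID | grep -i start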

example: a reservation in one week:

$ oarsub -r "$(date '+%Y-%m-%d %H:%M:%S' --date='+1 week')"

For reservations, there is no interactive mode. You can give oar a command to execute or nothing. If you give it no command, you'll have to connect to the job once the reservation starts.

Getting information about a job

The oarstat command displays information about jobs. By default it lists the current jobs of all users. You can restrict it to your own jobs or to someone else's jobs with the -u option:

$ oarstat -u

You can get full details of a job:

$ oarstat -fj <JOBID>

If you script OAR and regularly poll job states with oarstat, you can cause a high load on the OAR server (the default oarstat invocation triggers costly SQL requests in the OAR database). In this case, you should use the -s option, which is optimized and only queries the current state of a given job:

$ oarstat -s -j <JOBID>
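
A minimal polling sketch based on this option, assuming oarstat -s -j prints a single <JOBID>: <state> line (check the actual output format on your frontend):

until oarstat -s -j $JOBID | grep -qE 'Terminated|Error'; do
    sleep 60
done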

Complex resources selection

The complete selector format (oarsub option -l):

"{sql1}/type1=n1/type2=n2+{sql2}/type3=n3/type4=n4/type5=n5+...,walltime=hh:mm:ss"

where

  • sql1..sqln are optional SQL predicates on property columns
  • type1=n1..typen=nn are the wanted number of given resource types
  • slashes between resource types express resource subtree selection
  • + allows aggregating different resource specifications
  • walltime is the job walltime (defaults to 1 hour)

You can get the list of column names for SQL predicates by running oarnodes to get the available properties of a particular node, for example in Lyon:

$ oarnodes -Y --sql="network_address = 'sagittaire-1.lyon.grid5000.fr'"

You can also get this list by looking directly at the columns of table 'resources' in the OAR database. For example:

$ mysql -u oarreader -D oar2 -h mysql -p -e "describe resources"

These OAR properties are described in OAR2 properties.

Note

Please refer to an SQL syntax manual in order to build a correct -p <...> expression (it is used as the WHERE clause of the resource selection SQL query).

Using the resource hierarchy

  • ask for 1 core on 15 nodes on the same cluster (total = 15 cores)
$ oarsub -I -l /cluster=1/nodes=15/core=1
  • ask for 1 core on 15 nodes on 2 clusters (total = 30 cores)
$ oarsub -I -l /cluster=2/nodes=15/core=1
  • ask for 1 core on 2 cpus on 15 nodes on the same cluster (total = 30 cores)
$ oarsub -I -l /cluster=1/nodes=15/cpu=2/core=1
  • ask for 10 cpus on 2 clusters (total = 20 cpus, the number of nodes and cores depends on the topology of the machines)
$ oarsub -I -l /cluster=2/cpu=10
  • ask for 1 core on 3 different network switches (total = 3 cores)
$ oarsub -I -l /switch=3/core=1

Selecting nodes from a specific cluster

For example in Nancy:

$ oarsub -I -l {'cluster="graphene"'}/nodes=2

Selecting specific nodes

For example in Lyon:

$ oarsub -I -l {'network_address in ("sagittaire-10.lyon.grid5000.fr", "sagittaire-11.lyon.grid5000.fr", "sagittaire-12.lyon.grid5000.fr")'}/nodes=1

By negating the SQL clause, you can also exclude some nodes.
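
For example, to exclude some specific nodes in Lyon:

$ oarsub -I -l {'network_address not in ("sagittaire-10.lyon.grid5000.fr", "sagittaire-11.lyon.grid5000.fr")'}/nodes=1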

Other examples using properties

  • ask for 10 cores of the cluster azur
$ oarsub -I -l core=10 -p "cluster='azur'"
  • ask for 2 nodes with 4096 MB of memory and Infiniband 10G
$ oarsub -I -p "memnode=4096 and ib10g='YES'" -l nodes=2
  • ask for any 4 nodes except gdx-45
$ oarsub -I -p "not host like 'gdx-45.%'" -l nodes=4

Two nodes with virtualization capability, on different clusters + IP subnets

We want 2 nodes and 4 /22 subnets with the following constraints:

  • Nodes are on 2 different clusters of the same site (Hint: use a site with several clusters :-D)
  • Nodes have virtualization capability enabled
  • /22 subnets are on two different /19 subnets
  • 2 subnets belonging to the same /19 subnet are consecutive
$ oarsub -I -l /slash_19=2/slash_22=1+'{"virtual"!="none"}'/cluster=2/nodes=1

Let's verify the reservation:

 $ uniq $OAR_NODE_FILE
 paradent-6.rennes.grid5000.fr
 parapluie-7.rennes.grid5000.fr
 $ g5k-subnets -p
 10.158.8.0/22
 10.158.32.0/22
 10.158.36.0/22
 10.158.12.0/22
 $ g5k-subnets -ps
 10.158.8.0/21
 10.158.32.0/21

1 core on 2 nodes on the same cluster with 4096 MB of memory and Infiniband 10G + 1 cpu on 2 nodes on the same switch with bicore processors for a walltime of 4 hours

$ oarsub -I -l "{memnode=4096 and ib10g='YES'}/cluster=1/nodes=2/core=1+{cpucore=2}/switch=1/nodes=2/cpu=1,walltime=4:0:0"
Warning
  1. walltime must always be the last argument of -l <...>
  2. if no resource matches your request, oarsub will exit with the message
Generate a job key...
[ADMISSION RULE] Set default walltime to 3600.
[ADMISSION RULE] Modify resource description with type constraints
There are not enough resources for your request
OAR_JOB_ID=-5
Oarsub failed: please verify your request syntax or ask for support to your admin.

Retrieving the resources allocated to my job

You can use oarprint, which allows you to pretty-print a job's resources.

Retrieving resources from within the job

We first submit a job

jdoe@capricorne:~$ oarsub -I -l nodes=4
...
OAR_JOB_ID=178361
..
Connect to OAR job 178361 via the node capricorne-34.lyon.grid5000.fr
..
Retrieve the host list

We want the list of the nodes we got, identified by unique hostnames

jdoe@capricorne-34:~$ oarprint host
sagittaire-32.lyon.grid5000.fr
capricorne-34.lyon.grid5000.fr
sagittaire-63.lyon.grid5000.fr
sagittaire-28.lyon.grid5000.fr

(We get 1 line per host, not per core!)

Warning

nodes is a pseudo property: you must use host instead

Retrieve the core list
jdoe@capricorne-34:~$ oarprint core
63
241
64
163
243
244
164
242

Obviously, retrieving OAR's internal core IDs might not help much. Hence the use of a customized output format.

Retrieve core list with host and cpuset Id as identifier

We want to identify our cores by their associated host names and cpuset Ids:

jdoe@capricorne-34:~$ oarprint core -P host,cpuset
capricorne-34.lyon.grid5000.fr 0
sagittaire-32.lyon.grid5000.fr 0
capricorne-34.lyon.grid5000.fr 1
sagittaire-28.lyon.grid5000.fr 0
sagittaire-63.lyon.grid5000.fr 0
sagittaire-63.lyon.grid5000.fr 1
sagittaire-28.lyon.grid5000.fr 1
sagittaire-32.lyon.grid5000.fr 1
A more complex example with a customized output format

We want to identify our cores by their associated host name and cpuset Id, and get the memory information as well, with a customized output format

jdoe@capricorne-34:~$ oarprint core -P host,cpuset,memnode -F "NODE=%[%] MEM=%"
NODE=capricorne-34.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-32.lyon.grid5000.fr[0] MEM=2048
NODE=capricorne-34.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-28.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-63.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-63.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-28.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-32.lyon.grid5000.fr[1] MEM=2048

Retrieving resources from the submission frontend

You just have to pipe the oarstat command into oarprint:

jdoe@capricorne:~$ oarstat -j <JOB_ID> -p | oarprint core -P host,cpuset,memnode -F "%[%] (%)" -f -
capricorne-34.lyon.grid5000.fr[0] (2048)
sagittaire-32.lyon.grid5000.fr[0] (2048)
capricorne-34.lyon.grid5000.fr[1] (2048)
sagittaire-28.lyon.grid5000.fr[0] (2048)
sagittaire-63.lyon.grid5000.fr[0] (2048)
sagittaire-63.lyon.grid5000.fr[1] (2048)
sagittaire-28.lyon.grid5000.fr[1] (2048)
sagittaire-32.lyon.grid5000.fr[1] (2048)

List OAR properties

Properties can be listed using the oarprint -l command:

jdoe@capricorne-34:~$ oarprint -l
List of properties:
besteffort, cpuset, ib10gmodel, memnode, memcore, ethnb, cpuarch, myri2gmodel, cpu, myri10g, memcpu, xpanagran, myri10gmodel, wattmetre, type,
cpufreq, myri2g, ib10g, core, deploy, ip, disktype, nodemodel, cluster, cpucore, network_address, virtual, host, rconsole, cputype, switch,
xpsalome
Note

Those properties can also be used with oarsub, for instance via the -p option.

X11 forwarding

X11 forwarding can be enabled with oarsh. As with ssh, you need to pass the -X option to oarsh.

We will use xterm to test X.

Shell 1

Connect to a frontend using ssh with the -X option:

Check DISPLAY
$ echo $DISPLAY
localhost:11.0
Job submission
$ oarsub -I -l /nodes=2/core=1
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=4926 
Interactive mode : waiting...
[2007-03-07 09:01:16] Starting...

Initialize X11 forwarding...
Connect to OAR job 4926 via the node idpot-8.grenoble.grid5000.fr
jdoe@idpot-8:~$ xterm &
[1] 14656
jdoe@idpot-8:~$ cat $OAR_NODEFILE
idpot-8.grenoble.grid5000.fr
idpot-9.grenoble.grid5000.fr
[1]+  Done                    xterm
jdoe@idpot-8:~$ oarsh idpot-9 xterm
Error: Can't open display: 
jdoe@idpot-8:~$ oarsh -X idpot-9 xterm

Shell 2

Also connected to the frontend with ssh -X:

$ echo $DISPLAY
localhost:13.0
$ OAR_JOB_ID=4928 oarsh -X idpot-9 xterm

Using a parallel launcher: taktuk

Warning

Taktuk MUST be installed on all nodes to test this point. This is the case in the production environment and in the provided default images, except the min and base images.

Shell 1

Unset DISPLAY so that X does not bother...
jdoe@idpot:~$ unset DISPLAY
Job submission
jdoe@idpot:~$ oarsub -I -l /nodes=20/core=1
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=4930 
Interactive mode : waiting...
[2007-03-07 09:15:13] Starting...

Connect to OAR job 4930 via the node idpot-1.grenoble.grid5000.fr
Running the taktuk command
jdoe@idpot-1:~$ taktuk -c "oarsh" -f $OAR_FILE_NODES broadcast exec [ date ]
idcalc-12.grenoble.grid5000.fr-1: date (11567): output > Thu May  3 18:56:58 CEST 2007
idcalc-12.grenoble.grid5000.fr-1: date (11567): status > Exited with status 0
idcalc-4.grenoble.grid5000.fr-8: date (31172): output > Thu May  3 19:00:09 CEST 2007
idcalc-2.grenoble.grid5000.fr-2: date (32368): output > Thu May  3 19:01:56 CEST 2007
idcalc-3.grenoble.grid5000.fr-5: date (31607): output > Thu May  3 18:56:44 CEST 2007
idcalc-3.grenoble.grid5000.fr-5: date (31607): status > Exited with status 0
idcalc-7.grenoble.grid5000.fr-13: date (31188): output > Thu May  3 18:59:54 CEST 2007
idcalc-9.grenoble.grid5000.fr-15: date (32426): output > Thu May  3 18:56:45 CEST 2007
idpot-6.grenoble.grid5000.fr-20: date (16769): output > Thu May  3 18:59:54 CEST 2007
idcalc-4.grenoble.grid5000.fr-8: date (31172): status > Exited with status 0
idcalc-5.grenoble.grid5000.fr-9: date (10288): output > Thu May  3 18:56:39 CEST 2007
idcalc-5.grenoble.grid5000.fr-9: date (10288): status > Exited with status 0
idcalc-6.grenoble.grid5000.fr-11: date (11290): output > Thu May  3 18:57:52 CEST 2007
idcalc-6.grenoble.grid5000.fr-11: date (11290): status > Exited with status 0
idcalc-7.grenoble.grid5000.fr-13: date (31188): status > Exited with status 0
idcalc-8.grenoble.grid5000.fr-14: date (10450): output > Thu May  3 18:57:34 CEST 2007
idcalc-8.grenoble.grid5000.fr-14: date (10450): status > Exited with status 0
idcalc-9.grenoble.grid5000.fr-15: date (32426): status > Exited with status 0
idpot-1.grenoble.grid5000.fr-16: date (18316): output > Thu May  3 18:57:19 CEST 2007
idpot-1.grenoble.grid5000.fr-16: date (18316): status > Exited with status 0
idpot-10.grenoble.grid5000.fr-17: date (31547): output > Thu May  3 18:56:27 CEST 2007
idpot-10.grenoble.grid5000.fr-17: date (31547): status > Exited with status 0
idpot-2.grenoble.grid5000.fr-18: date (407): output > Thu May  3 18:56:21 CEST 2007
idpot-2.grenoble.grid5000.fr-18: date (407): status > Exited with status 0
idpot-4.grenoble.grid5000.fr-19: date (2229): output > Thu May  3 18:55:37 CEST 2007
idpot-4.grenoble.grid5000.fr-19: date (2229): status > Exited with status 0
idpot-6.grenoble.grid5000.fr-20: date (16769): status > Exited with status 0
idcalc-2.grenoble.grid5000.fr-2: date (32368): status > Exited with status 0
idpot-11.grenoble.grid5000.fr-6: date (12319): output > Thu May  3 18:59:54 CEST 2007
idpot-7.grenoble.grid5000.fr-10: date (7355): output > Thu May  3 18:57:39 CEST 2007
idpot-5.grenoble.grid5000.fr-12: date (13093): output > Thu May  3 18:57:23 CEST 2007
idpot-3.grenoble.grid5000.fr-3: date (509): output > Thu May  3 18:59:55 CEST 2007
idpot-3.grenoble.grid5000.fr-3: date (509): status > Exited with status 0
idpot-8.grenoble.grid5000.fr-4: date (13252): output > Thu May  3 18:56:32 CEST 2007
idpot-8.grenoble.grid5000.fr-4: date (13252): status > Exited with status 0
idpot-11.grenoble.grid5000.fr-6: date (12319): status > Exited with status 0
idpot-9.grenoble.grid5000.fr-7: date (17810): output > Thu May  3 18:57:42 CEST 2007
idpot-9.grenoble.grid5000.fr-7: date (17810): status > Exited with status 0
idpot-7.grenoble.grid5000.fr-10: date (7355): status > Exited with status 0
idpot-5.grenoble.grid5000.fr-12: date (13093): status > Exited with status 0
Setting the connector permanently and running taktuk again
jdoe@idpot-1:~$ export TAKTUK_CONNECTOR=oarsh
jdoe@idpot-1:~$ taktuk -m idpot-3 -m idpot-4 broadcast exec [ date ]
idpot-3-1: date (12293): output > Wed Mar  7 09:20:25 CET 2007
idpot-4-2: date (7508): output > Wed Mar  7 09:20:19 CET 2007
idpot-3-1: date (12293): status > Exited with status 0
idpot-4-2: date (7508): status > Exited with status 0

Using best effort mode jobs

Best effort job campaign

OAR 2 provides a way to specify that jobs are best effort, which means that the server can delete them if room is needed to fit other jobs. One can submit such jobs using the besteffort type of job.

For instance you can run a job campaign as follows:

for param in $(< ./paramlist); do
    oarsub -t besteffort -l core=1 "./my_script.sh $param"
done

In this example, the file ./paramlist contains a list of parameters for a parametric application.
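
The content of ./paramlist is up to you; with the loop above, each whitespace-separated word becomes the parameter of one best effort job. A hypothetical example:

$ cat ./paramlist
0.1
0.5
1.0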

The following demonstrates the mechanism.

Note

Please have a look at the user charter to avoid abuses.

Best effort job mechanism

Running a besteffort job in a first shell
jdoe@idpot:~$ oarsub -I -l nodes=23 -t besteffort
[ADMISSION RULE] Added automatically besteffort resource constraint
[ADMISSION RULE] Redirect automatically in the besteffort queue
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9630 
Interactive mode : waiting...
[2007-05-10 11:06:25] Starting...

Initialize X11 forwarding...
Connect to OAR job 9630 via the node idcalc-1.grenoble.grid5000.fr
Running a non best effort job on the same set of resources in a second shell
jdoe@idpot:~$ oarsub -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9631 
Interactive mode : waiting...
[2007-05-10 11:06:50] Start prediction: 2007-05-10 11:06:50 (Karma = 0.000)
[2007-05-10 11:06:53] Starting...

Initialize X11 forwarding...
Connect to OAR job 9631 via the node idpot-9.grenoble.grid5000.fr

As expected, the best effort job was stopped in the meantime (watch the first shell):

jdoe@idcalc-1:~$ bash: line 1: 23946 Killed                  /bin/bash -l
Connection to idcalc-1.grenoble.grid5000.fr closed.
Disconnected from OAR job 9630
jdoe@idpot:~$

Testing the checkpointing trigger mechanism

Writing the test script

Here is a script which features an infinite loop and a signal handler triggered by SIGUSR2 (the default signal for OAR's checkpointing mechanism).

#!/bin/bash

handler() { echo "Caught checkpoint signal at: `date`"; echo "Terminating."; exit 0; }
trap handler SIGUSR2

cat <<EOF
Hostname: `hostname`
Pid: $$
Starting job at: `date`
EOF
while : ; do sleep 10; done

Running the job

We run the job on 1 core with a walltime of 5 minutes, and ask for the job to be checkpointed if it lasts (and it will indeed) more than walltime - 150 sec = 2 min 30 sec.

$ oarsub -l "core=1,walltime=0:05:00" --checkpoint 150 ./checkpoint.sh 
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9464 
$

Result

Taking a look at the job output:

$ cat OAR.9464.stdout 
Hostname: idpot-9
Pid: 26577
Starting job at: Fri May  4 19:41:11 CEST 2007
Caught checkpoint signal at: Fri May  4 20:26:12 CEST 2007
Terminating.

The checkpointing signal was sent to the job 2 minutes 30 seconds before the end of the walltime, as expected, so that the job could finish nicely.

Interactive checkpointing

The oardel command also provides the capability to interactively raise a checkpoint event for a job.

We submit the job again

$ oarsub -l "core=1,walltime=0:05:0" --checkpoint 150 ./checkpoint.sh 
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9521

Then run the oardel -c #jobid command...

$ oardel -c 9521
Checkpointing the job 9521 ...DONE.
The job 9521 was notified to checkpoint itself (send SIGUSR2).

And then watch the job's output:

$ cat OAR.9521.stdout 
Hostname: idpot-9
Pid: 1242
Starting job at: Mon May  7 16:39:04 CEST 2007
Caught checkpoint signal at: Mon May  7 16:39:24 CEST 2007
Terminating.

The job terminated as expected.

Testing the mechanism of dependency on the termination of a previous job

First Job

We run a first interactive job in a first Shell

jdoe@idpot:~$ oarsub -I 
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9458 
Interactive mode : waiting...
[2007-05-04 17:59:38] Starting...

Initialize X11 forwarding...
Connect to OAR job 9458 via the node idpot-9.grenoble.grid5000.fr
jdoe@idpot-9:~$

And leave that job pending.

Second Job

Then we run a second job in another shell, with a dependency on the first one:

jdoe@idpot:~$ oarsub -I -a 9458
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=9459 
Interactive mode : waiting...
[2007-05-04 17:59:55] Start prediction: 2007-05-04 19:59:39 (Karma = 4.469)

So this second job waits for the first job to reach its walltime (or to terminate sooner) before starting.

Job dependency in action

We do a logout on the first interactive job...

jdoe@idpot-9:~$ logout
Connection to idpot-9.grenoble.grid5000.fr closed.
Disconnected from OAR job 9458
jdoe@idpot:~$ 

... then watch the second Shell and see the second job starting

[2007-05-04 18:05:05] Starting...

Initialize X11 forwarding...
Connect to OAR job 9459 via the node idpot-7.grenoble.grid5000.fr

... as expected.
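
The same -a option also works for passive jobs, which is convenient for chaining batch executions (the script names below are placeholders):

$ JOBID=$(oarsub ./step1.sh | sed -n 's/OAR_JOB_ID=\(.*\)/\1/p')
$ oarsub -a $JOBID ./step2.sh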

Container jobs

With this functionality it is possible to execute jobs within another job, like a sub-scheduling mechanism. You first submit a *container* job, then you can submit several jobs which will all be contained inside the container job.

One interesting usage of such jobs is to manually submit a big container job, which enforces a constraint on the resources and time span you will use (for example to respect the Grid5000:UserCharter). Then, inside this container job, you may want to run several independent tasks. Without container jobs, you would have to implement a mechanism for allocating the resources of the container job to independent tasks, and you would end up reimplementing a batch scheduler... Instead, you can use OAR to schedule the inner jobs and avoid reinventing the wheel.

First a job of the type container must be submitted:

oarsub -I -t container -l nodes=10,walltime=2:00:00
...
OAR_JOB_ID=42
...

Then it is possible to use the inner type to schedule the new jobs within the previously created container job:

oarsub -I -t inner=42 -l nodes=7,walltime=00:10:00
oarsub -I -t inner=42 -l nodes=1,walltime=00:20:00
oarsub -I -t inner=42 -l nodes=10,walltime=00:10:00
Note

In the case:

oarsub -I -t inner=42 -l nodes=11

This job will never be scheduled because the container job "42" reserved only 10 nodes.

"-t container" is handled by every kind of jobs (passive, interactive and reservations). But "-t inner=..." cannot be used with a reservation.

CiGri

CiGri is a tool that manages the execution of multi-parametric experiments. Multi-parametric experiments can be Bag-of-Tasks (BoT) applications, Monte-Carlo simulations or embarrassingly parallel problems.

Warning

CiGri is in beta test on Grid'5000. Let us know of any problems you encounter.

Files

All the files used in this tutorial can be found in the /home/sdelamare/public/pov directory in Lyon. An archive is also available here.

It contains the povray executables, the landscape.pov file that generates the output, and the JDL files used to submit the campaign.

You can copy this folder into your own home directory and adapt it to your taste. However, as you will see, this is not mandatory because all files can be taken automatically from /home/sdelamare/public/pov.


Campaign description

The objective of this exercise is to create an animation using a povray file. We will create an animation composed of 100 images. Each image will be calculated separately using the same script, only changing the clock parameter (and the output file).

We call this experiment a campaign, and we will use CiGri to manage it. Each image calculation will be executed as a separate submission on Grid'5000.

The computation of one image will be done using the following command:

   povray +Olandscape_XXX.png +W1280 +H720 +KXXX landscape.pov
  • O: output file
  • W: image width
  • H: image height
  • K: clock parameter
  • landscape.pov: povray file to execute

Creating the JDL

JDL (Job Description Language) files are quite straightforward to write. Have a look at cigri1.jdl. It defines several things:

  • The name of the campaign: "povray landscape"
  • The resources to use for each submission: "core=1,walltime=00:30:00"
  • The script to execute: "~/cigri_campaign/pov/povray"
  • The clusters to use (each CiGri cluster is an instance of OAR, therefore a Grid'5000 site): rennes, lyon, .....
  • The parameters to pass to the script: +Olandscape_XXX.png +W1280 +H720 +KXXX landscape.pov

To generate the list of all parameters, you can use:

   ruby -e '100.times {|i| puts "\"+Olandscape_%03d.png +W1280 +H720 +K#{i} landscape.pov\"," % i}'

The JDL is defined in the JSON format. The final result will look like:

   {
       "name": "povray landscape",
       "resources": "core=1,walltime=00:30:00",
       "exec_file": "~/cigri_campaign/pov/povray",
       "clusters": {
           "rennes":{},
           "nancy":{},
           "reims":{},
           "lyon":{}
       },
       "params": [
           "+Olandscape_00.png +W1280 +H720 +K0 landscape.pov",
           "+Olandscape_01.png +W1280 +H720 +K1 landscape.pov",
           ...
           "+Olandscape_99.png +W1280 +H720 +K99 landscape.pov",
       ]
   }


Improving the JDL File

In the previous JDL file, there are two assumptions made:

  1. povray is already installed on the sites where the campaign will be executed
  2. The results will be gathered manually by the user at the end of the campaign

Installing povray on each execution site

Povray and the input files for the campaign are already installed on the Lyon site in /home/sdelamare/public/pov. Using the CiGri prologue, it is possible to install them on all clusters. The prologue is executed once on each cluster before job execution. Copying the folder can simply be done with:

   scp -r lyon:/home/sdelamare/public/pov ~/cigri_campaign

This can be included as a prologue in JDL format (see cigri2.jdl file):

   "prologue": [
       "mkdir ~/cigri_campaign",
       "scp -r lyon:/home/sdelamare/public/pov ~/cigri_campaign/"
   ],

This prologue is executed on each site. In Lyon, we don't want to use the scp method, so we can override the prologue with another one that only uses cp (see #Final_JDL, or the cigri3.jdl file).

Gathering the results

The second issue, gathering the results, is slightly trickier. There are two main options here:

  1. Gather the results at the end of the campaign
  2. Send the results during the campaign execution

Gathering the results after the campaign needs to be done carefully. The way we described our campaign earlier, it is not possible to guarantee that our results will be correct. Indeed, we don't know which landscape_XXX.png files are correct: if one of the jobs was killed during execution (remember, jobs are executed as best-effort jobs), the result file will be there anyway, but possibly incomplete. An easy way to detect that an execution completed is to create a new empty file after the end of the execution, so that this new file is only created upon successful completion of the job:

   povray +Olandscape_XXX.png +W1280 +H720 +KXXX landscape.pov && touch landscape_XXX.png.ok

The gathering method (which could be defined in an epilogue) should take this file into account and only copy a result if its "ok file" is there.
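
A possible gathering sketch along these lines, to be run on each execution site after the campaign. It assumes the result images and their "ok" files are written to the campaign directory used in this tutorial; adapt the paths to your setup:

for ok in ~/cigri_campaign/pov/landscape_*.png.ok; do
    [ -e "$ok" ] || continue
    img=${ok%.ok}
    scp "$img" lyon:~/cigri_results/ && rm "$img" "$ok"
done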


As the result files are small in this campaign, we could also gather the results as the execution goes. We will simply add an scp command to our execution script:

   povray +Olandscape_XXX.png +W1280 +H720 +KXXX landscape.pov && scp landscape_XXX.png lyon:~/cigri_results/ && rm landscape_XXX.png
Warning

It is strongly advised to create the "cigri_results" directory BEFORE the execution of the campaign. It is unsafe to do it inside the Lyon prologue, because it is not guaranteed that the Lyon prologue will be executed before jobs start on other sites.

Simplifying things

To handle this new gathering method, it is easier to call a wrapper script (see the pov_wrapper.sh file) instead of calling povray directly. It takes one argument: the clock value.

 #! /bin/bash
 
 output_file=$(printf "landscape_%03d.png" $1)
 
 cd $(dirname $0)
 ./povray +O${output_file} +W1280 +H720 +K$1 landscape.pov && scp ${output_file} lyon:~/cigri_results/ && rm ${output_file}

This script only takes an incrementing integer as a parameter. Therefore, we can use the nb_jobs option instead of the params option in the JDL file. This option automatically generates the parameters from 0 to nb_jobs - 1.

Final JDL

The final JDL (see cigri3.jdl file) automatically installs povray on each site and gathers the execution results. It uses a wrapper and the nb_jobs option:

 {
     "name": "povray landscape",
     "resources": "core=1,walltime=00:30:00",
     "exec_file": "~/cigri_campaign/pov/pov_wrapper.sh",
     "prologue": [
         "mkdir -p ~/cigri_campaign",
         "scp -r lyon:/home/sdelamare/public/pov ~/cigri_campaign/"
     ],
     "epilogue": [
 	"rm -rf ~/cigri_campaign"
     ],
     "clusters": {
         "lyon": {
            "prologue": [
                "mkdir -p ~/cigri_results",
                "mkdir -p ~/cigri_campaign",
                "cp -r /home/sdelamare/public/pov ~/cigri_campaign/"
            ]
         },
         "nancy":{},
         "reims":{},
         "rennes":{}
     },
     "nb_jobs": 100
 }


CiGri API

To interact with CiGri, you must use the Grid'5000 API. All details are given here.

For example, to submit your campaign, just run:

   $ curl -kn -X POST https://api.grid5000.fr/sid/cigri/campaigns?pretty -d @/home/sdelamare/public/pov/cigri3.jdl

To delete your campaign, use:

   $ curl -kn -X DELETE https://api.grid5000.fr/sid/cigri/campaigns/<campaign_number>


Now you can play with the other URLs given in the documentation and wait to get all the images from your campaign.