Grid experiment-OAR2
From Grid5000
Introduction
More than 20 clusters spread over 9 sites are available on Grid'5000.
OAR Grid is a simple tool built on top of OAR2 that lets you use resources from the whole grid at once.
The purpose of this tutorial is to run a job across 3 clusters from 3 sites.
| Note | |
|---|---|
We recommend that you open at least 2 terminals:
| |
Prepare the experiment environment
Cluster setup
Please refer to the cluster setup done in the previous tutorial.
Synchronize data on each site
Your home directories on each site are independent of each other. Before submitting grid jobs, you therefore have to make sure that the experiment's data is available on every site you plan to use.
(1) Synchronize your SSH public key and configuration:
(2) Synchronize the experiment data (code, configuration files...):
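As a rough sketch of step (2), the function below prints one rsync command per remote site instead of running it, so the commands can be reviewed first. The site list, the ~/hello directory (borrowed from the batch example later in this tutorial) and the frontend.SITE.grid5000.fr host pattern are assumptions to adapt, and print_sync_commands is a hypothetical helper name.

```shell
#!/bin/sh
# Print (do not run) one rsync command per remote site.
# Assumptions: frontends are reachable as frontend.SITE.grid5000.fr and
# the experiment data lives in a directory directly under the home directory.
# Usage: print_sync_commands DIRECTORY SITE [SITE ...]
print_sync_commands() {
    dir=$1
    shift
    for site in "$@"; do
        # -a keeps permissions and timestamps, -z compresses the transfer
        echo "rsync -az ~/$dir/ frontend.$site.grid5000.fr:$dir/"
    done
}

# Example: data in ~/hello, to be copied to Lille and Sophia
print_sync_commands hello lille sophia
```

Once the printed commands look right, piping the output to sh runs them.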
Visualize grid
As with cluster experiments, you can inspect the state of the grid.
disco
disco is a grid resource discovery tool that finds the maximum resources available over a given time range for the specified resource alias(es).
Find available resources from now to now + 1 hour on the paradent cluster (located in Rennes):
Find available resources from now to now + 1 hour on the Lille, Rennes and Sophia sites:
lille: resources:(max/dead/avail): 1045/690/355 nodes: (max/dead/fully_avail): 175/115/21
sophia: resources:(max/dead/avail): 1130/902/228 nodes: (max/dead/fully_avail): 224/73/6
rennes: resources:(max/dead/avail): 2578/2366/212 nodes: (max/dead/fully_avail): 390/232/4
Nb available resources: 795
Nb fully available nodes: 31
To reserve at resource level (inside/outside):
oargridsub -s "2011-04-09 15:59:08" -w 1:00:00 lille:rdef="core=355",sophia:rdef="core=228",rennes:rdef="core=212"
ssh frontend.grenoble.grid5000.fr oargridsub -s \"2011-04-09 15:59:08\" -w 1:00:00 lille:rdef="core=355",sophia:rdef="core=228",rennes:rdef="core=212"
To reserve at nodes level with -t allow_classic_ssh enabled (inside/outside):
oargridsub -t allow_classic_ssh -s "2011-04-09 15:59:08" -w 1:00:00 lille:rdef="nodes=21",sophia:rdef="nodes=6",rennes:rdef="nodes=4"
ssh frontend.grenoble.grid5000.fr oargridsub -t allow_classic_ssh -s \"2011-04-09 15:59:08\" -w 1:00:00 lille:rdef="nodes=21",sophia:rdef="nodes=6",rennes:rdef="nodes=4"
In addition to reporting the best available resources, it lists the oargridsub commands to use.
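The resource description in those oargridsub commands follows a regular pattern: SITE:rdef="TYPE=COUNT" entries joined by commas. A small sketch that builds this string from SITE=COUNT pairs, using the node counts reported above; build_rdef is a hypothetical helper, not part of OAR Grid or disco.

```shell
#!/bin/sh
# Build the resource part of an oargridsub command line.
# Usage: build_rdef TYPE SITE=COUNT [SITE=COUNT ...]
build_rdef() {
    kind=$1
    shift
    out=""
    for pair in "$@"; do
        site=${pair%%=*}     # text before the first '='
        count=${pair#*=}     # text after the first '='
        # Prepend a comma only when something was already appended
        out="${out:+$out,}${site}:rdef=\"${kind}=${count}\""
    done
    printf '%s\n' "$out"
}

# Example with the fully available node counts listed above
build_rdef nodes lille=21 sophia=6 rennes=4
```

The result can be pasted after the options of an oargridsub command; passing core instead of nodes gives the resource-level form.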
OarGridMonika
OarGridMonika is a web interface that gathers the information retrieved by the Monika instance of each site.
OarGridGantt
OarGridGantt summarizes information given by its cluster counterpart DrawOARGantt. It prints temporal diagrams of past, current and planned states of each cluster:
Grid'5000 API
Another way to visualize node and job status is to use the Grid'5000 API.
A script that imitates some of disco's behavior, using the API through restfully, is available here for API 2.0 or here for the SID API.
Grid reservation
In grid reservation mode, no script can be specified for interactive submissions.
Users are responsible for:
- connecting to the allocated nodes.
- launching their experiment.
Reservation submission
We are going to reserve 4 nodes on 3 different sites for half an hour:
frontend:
| oargridsub -t allow_classic_ssh -w '0:30:00' CLUSTER1:rdef="/nodes=2",CLUSTER2:rdef="/nodes=1",CLUSTER3:rdef="nodes=1" |
OAR Grid connects to each of the specified clusters and makes a passive submission. OAR returns a cluster job id for each of them, and OAR Grid returns a grid job id that binds the cluster job ids together.
You should see an output like this:
CLUSTER1:rdef=/nodes=2,CLUSTER2:rdef=/nodes=1,CLUSTER3:rdef=nodes=1
[OAR_GRIDSUB] [CLUSTER3] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [CLUSTER3] Reservation success on CLUSTER3: batchId = CLUSTER_JOB_ID3
[OAR_GRIDSUB] [CLUSTER2] Date/TZ adjustment: 1 seconds
[OAR_GRIDSUB] [CLUSTER2] Reservation success on CLUSTER2: batchId = CLUSTER_JOB_ID2
[OAR_GRIDSUB] [CLUSTER1] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [CLUSTER1] Reservation success on CLUSTER1: batchId = CLUSTER_JOB_ID1
[OAR_GRIDSUB] Grid reservation id = GRID_JOB_ID
[OAR_GRIDSUB] SSH KEY : /tmp/oargrid//oargrid_ssh_key_LOGIN_GRID_JOB_ID
You can use this key to connect directly to your OAR nodes with the oar user.
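When scripting around oargridsub, the grid job id and the SSH key path are the two values needed in later steps. A minimal sketch that extracts them from saved output, assuming the lines look like the sample above and that the grid job id is numeric; the function names and the sample values (12345, jdoe) are made up for illustration.

```shell
#!/bin/sh
# Extract values from oargridsub output read on standard input.
# Assumes lines of the form:
#   [OAR_GRIDSUB] Grid reservation id = 12345
#   [OAR_GRIDSUB] SSH KEY : /tmp/oargrid//oargrid_ssh_key_jdoe_12345
grid_job_id() {
    sed -n 's/^\[OAR_GRIDSUB] Grid reservation id *= *\([0-9][0-9]*\).*$/\1/p'
}

ssh_key_path() {
    sed -n 's/^\[OAR_GRIDSUB] SSH KEY *: *\([^ ]*\).*$/\1/p'
}

# Sample output with made-up values
sample='[OAR_GRIDSUB] Grid reservation id = 12345
[OAR_GRIDSUB] SSH KEY : /tmp/oargrid//oargrid_ssh_key_jdoe_12345'

printf '%s\n' "$sample" | grid_job_id
printf '%s\n' "$sample" | ssh_key_path
```

Both functions print nothing when the expected line is absent, which makes a failed reservation easy to detect in a script.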
Fetch the list of allocated nodes so that it can be passed to the script we want to run:
| Note | |
|---|---|
The
| |
(1) Select the node on which to launch the script (i.e. the first node listed in the ~/machines file).
If (and only if) this node does not belong to the site where the ~/machines file was saved,
copy the ~/machines file to this node:
frontend:
| OAR_JOB_ID=CLUSTER_JOB_ID oarcp -i /tmp/oargrid/oargrid_ssh_key_LOGIN_GRID_JOB_ID ~/machines `head -n 1 machines`: |
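Whether the copy is needed depends on the site of the first node. A small sketch of that check, assuming node names follow a NODE.SITE.grid5000.fr pattern and that the name of the local site is supplied by hand; node_site and needs_copy are hypothetical helpers, and the host name in the comment is only illustrative.

```shell
#!/bin/sh
# Print the site part of a node name such as paradent-1.rennes.grid5000.fr
# (assumption: the site is the second dot-separated field).
node_site() {
    echo "$1" | cut -d. -f2
}

# Succeed (exit status 0) when the first node of the machines file is on
# a different site than LOCAL_SITE, i.e. when the file has to be copied.
# Usage: needs_copy MACHINES_FILE LOCAL_SITE
needs_copy() {
    first=$(head -n 1 "$1")
    [ "$(node_site "$first")" != "$2" ]
}

# Example, assuming the current frontend is at Rennes (left commented out
# because it needs an existing ~/machines file):
# if needs_copy ~/machines rennes; then echo "copy ~/machines first"; fi
```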
(2) Connect to this node using oarsh:
frontend:
| OAR_JOB_ID=CLUSTER_JOB_ID oarsh -i /tmp/oargrid/oargrid_ssh_key_LOGIN_GRID_JOB_ID `head -n 1 machines` |
And then run the script:
Visualization
The Grid counterpart of oarstat gives information about the grid job:
Ending
Our grid submission is interactive, so its end time is unrelated to the end time of our script run. The submission ends when its owner asks for it to end or when the submission deadline is reached.
We are going to ask for our submission to end:
Grid'5000 API
The restfully tutorial describes how to reserve nodes on all sites in a way similar to oargridsub. You can adapt this script to your needs.
Grid batch
OAR Grid can also do batch submission. In this mode, you specify a script that will be run on each specified cluster.
Reservation submission
Let us run the script on April 19th, 2011 at 4:00pm for a 10-minute duration:
frontend:
| oargridsub -t allow_classic_ssh CLUSTER1:rdef="/nodes=2",CLUSTER2:rdef="/nodes=1",CLUSTER3:rdef="/nodes=1" -s '2011-04-19 16:00:00' -w '0:10:00' -p ~/hello/helloworld |
You should see behavior similar to that of an interactive submission.
- OAR Grid connects to each specified cluster and does a passive submission.
- As opposed to interactive mode, the specified script is run by each passive cluster submission.
| Warning | |
|---|---|
Our passive grid submission will trigger 3 independent cluster submissions.
| |
Visualization
The allocated nodes list:
Job results can be viewed on each cluster involved. As with any passive submission, standard output and standard error are saved in your home directory:
OAR.CLUSTER_JOB_ID.stdout
OAR.CLUSTER_JOB_ID.stderr
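A minimal sketch for checking both files of one cluster job at a glance, assuming only the OAR.CLUSTER_JOB_ID.stdout and OAR.CLUSTER_JOB_ID.stderr naming shown above; show_job_output is a hypothetical helper, and the optional directory argument exists so it can be pointed somewhere other than the home directory.

```shell
#!/bin/sh
# Print the last lines of the output files of one cluster job.
# Usage: show_job_output CLUSTER_JOB_ID [DIRECTORY]
show_job_output() {
    job_id=$1
    dir=${2:-$HOME}          # default to the home directory
    for ext in stdout stderr; do
        file="$dir/OAR.$job_id.$ext"
        if [ -f "$file" ]; then
            echo "== $file =="
            tail -n 5 "$file"
        else
            echo "== $file: not found =="
        fi
    done
}
```

It has to be run on the frontend of the site where the cluster job ran, since home directories are not shared between sites.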
We can follow the run of our cluster job live when connected to these clusters (Ctrl-C to quit):
Ending
A passive grid submission ends when every underlying passive cluster submission has terminated.
- So it ends when each script has finished running on each cluster.
Next tutorial
Learn to program Grid'5000 with API Main Practical.
