Cluster experiment-OAR2
From Grid5000
Introduction
This tutorial is about running jobs on a cluster. You will learn how to access a Grid'5000 cluster, install your data, run your jobs, and visualize them.
| Note | |
|---|---|
We recommend that you open at least 2 terminals:
| |
See the Quick Glossary on the OAR2 page for definitions of the terms used below.
Prepare the experiment environment
In this tutorial, we are going to use a very simple program using the OpenMPI library. This hello world example only prints the rank of each parallel process and the name of the node it is running on.
- To simulate a computation, each process sleeps for 60 seconds.
The program to run is stored outside of Grid'5000. Thus, we need to retrieve it and install it on the cluster we are currently connected to.
Cluster setup
Check proxy configuration
| Warning | |
|---|---|
Only a few external sites are reachable from the inside
| |
By default, $http_proxy and $https_proxy environment variables should not be set, to avoid surprises when you use http to connect to your own resources:
should return something like
http_proxy= https_proxy=
For this tutorial, set these 2 environment variables:
frontend:
| export http_proxy="http://proxy:3128" ; export https_proxy="http://proxy:3128" |
Running the same check again should now print:
http_proxy=http://proxy:3128 https_proxy=http://proxy:3128
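The check referred to above can be done with a one-line echo. The following is a minimal sketch, not necessarily the command the tutorial originally showed; the function name show_proxy is only illustrative.

```shell
# Print both proxy variables on one line.
# An unset variable is printed as an empty value.
show_proxy() {
    echo "http_proxy=${http_proxy:-} https_proxy=${https_proxy:-}"
}

show_proxy
```

Run it once before and once after the export line to compare the two outputs.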
Data installation
The experiment data are stored on the project's repository at INRIA Gforge.
(1) Copy them onto the frontend:
frontend:
| wget --no-check-certificate https://gforge.inria.fr/frs/download.php/26756/hello.tgz -O ~/hello.tgz |
(2) Unpack experiment data:
A ~/hello/ directory has been created in our home directory.
It is available on every node of every cluster of the site, because home directories are NFS-mounted.
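A gzipped tarball such as hello.tgz is usually unpacked with tar. The sketch below assumes the archive contains a top-level hello/ directory, as the text above implies; unpack_into is a hypothetical helper name, not part of the tutorial.

```shell
# Extract a .tgz archive into a destination directory.
# Usage: unpack_into ARCHIVE DEST_DIR
unpack_into() {
    tar -xzf "$1" -C "$2"
}

# On the frontend this would be called as:
#   unpack_into ~/hello.tgz ~
#   ls ~/hello/
```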
| Warning | |
|---|---|
If your experiment writes a lot of data, write it to the local disk space of the nodes instead of your network-shared home directory. | |
This way, you will avoid many NFS problems, such as lags or breakdowns.
Since the NFS service is shared among all users and all compute nodes, its performance may vary independently of your experiment.
To ensure your experiment's reproducibility, avoid measurements that could depend on the performance of a shared NFS server!
Visualize cluster
The experiment data are ready. Before running the job, we are going to examine the cluster state, which can be visualized in several ways.
Scheduled or running jobs
oarstat is a command-line tool to view current or planned job submissions.
View all submissions:
View the details of every submission:
View the details of a specific submission:
View the status of a specified job:
View all submissions from a given user:
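The five views listed above roughly correspond to the oarstat invocations below. This is a sketch: the wrapper function names are illustrative, and the job ID and user name you pass in are placeholders.

```shell
# Each helper wraps one oarstat invocation.
all_jobs()         { oarstat; }             # every submission
all_jobs_details() { oarstat -f; }          # every submission, full details
job_details()      { oarstat -f -j "$1"; }  # one submission, full details
job_status()       { oarstat -s -j "$1"; }  # state of one job
user_jobs()        { oarstat -u "$1"; }     # submissions of one user

# Example calls on the frontend (8670 is the job ID used later in this tutorial):
#   job_status 8670
#   user_jobs "$USER"
```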
The API provides a UI to view all scheduled or running jobs.
Nodes properties
oarnodes is also a command-line tool. It shows cluster node properties:
The returned information includes the current state of each node. This state is generally Alive or Absent. When a node has a problem, its state is Suspected or Dead.
Pretty output
oarprint is a tool that pretty-prints the resources of a job.
The command prints a sorted output of the resources of a job with regard to a key property, with a customisable format.
On a job connection node (where $OAR_RESOURCE_PROPERTIES_FILE is defined):
The following command must be executed in an OAR job (on a compute node)
On the submission frontend:
For now, you can test this tool using the second command with an OAR_JOB_ID obtained with oarstat.
Current nodes states
Monika is a web interface that summarizes the information given by oarstat and oarnodes.
It displays:
- Current nodes states
- Scheduled or running submissions (at the bottom).
Gantt charts
DrawOARGantt is a web interface that prints past, current and planned node states on a temporal diagram.
Metrics
Node load, memory usage, cpu usage and so on are available with the Ganglia web interface: https://helpdesk.grid5000.fr/ganglia/
By default, current metrics are displayed but you can have an aggregated view of up to 1 year in the past.
Interactive run
OAR2, the Grid'5000 batch scheduler, has an interactive mode. This mode connects the user to the first of their allocated nodes.
Submission
Submit an interactive job:
OAR2 returns a unique numeric ID that identifies our submission:
OAR_JOB_ID=8670
The -I option automatically connects you to the job's first node.
OAR2 sets several environment variables that scripts can use to adapt to the properties of the current submission:
In particular, you can view the list of your dedicated nodes:
| Note | |
|---|---|
Sometimes nodes name are duplicated inside the
| |
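When the node list contains repeated names, it can be reduced to one line per node. The sketch below assumes the list is the file pointed to by $OAR_NODEFILE, an OAR variable that the text above does not name explicitly; unique_nodes is a hypothetical helper.

```shell
# List each allocated node once.
# Reads the file given as $1, or $OAR_NODEFILE when no argument is given.
unique_nodes() {
    sort -u "${1:-$OAR_NODEFILE}"
}

# Inside a job:
#   unique_nodes
```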
It is time to run our script:
Results are going to be printed on the standard output.
Visualization
Submission is visible on the Monika web interface of the site where it was submitted:
Cluster status cannot be queried from the command line of the node OAR connected you to.
You need another terminal connected to the frontend machine:
Ending
With an interactive submission, the end of the job:
- Is not related to the execution lifespan of your scripts.
- Depends on the connection to the job's first node that OAR made for you.
Thus you can run as many scripts as you want until the job deadline.
| Warning | |
|---|---|
The default walltime of a submission is 1 hour. If your connection is still open after that deadline, OAR2 automatically cuts it off. | |
You can kill your submission by quitting the shell opened by OAR.
Ctrl-D or exit
The submission should then no longer appear when you request the current cluster status:
Check whether Monika and DrawOARGantt have been updated after the end of your submission.
The nodes dedicated to the job should return to the available state.
Passive run
OAR2 can also be used in passive mode:
- A script is passed as a parameter.
- It is executed on the reservation's head.
- It must know about the other dedicated nodes to split its work between them.
Submission
Submit a passive job with our script:
| Note | |
|---|---|
Environment variables, described during our interactive submission, are also set for passive jobs. | |
Visualization
Do the same as for interactive submission.
Wrapping script
The following script launches a passive job that will execute the hello_mpi program and waits until the job starts.
launchJob.sh :
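#!/bin/bash
# Script to submit; $HOME is used because ~ is not expanded inside quotes.
my_script="$HOME/hello/run_hello_mpi"
# Submit the job and extract its numeric ID from the "OAR_JOB_ID=..." line.
oar_job_id=$(oarsub "$my_script" | grep "OAR_JOB_ID" | cut -d '=' -f2)
oar_stdout_file="OAR.$oar_job_id.stdout"
# Poll the job state once per second until it is Running.
until oarstat -s -j "$oar_job_id" | grep Running ; do
    echo "Job (id: $oar_job_id) is waiting..."
    sleep 1
done
echo "Job $oar_job_id is started !"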
To ease post-run analysis, OAR redirects standard output and standard error
into OAR.OAR_JOB_ID.stdout and OAR.OAR_JOB_ID.stderr respectively.
They should appear in your current working directory:
OAR.OAR_JOB_ID.stdout OAR.OAR_JOB_ID.stderr
Thus you can follow the output of your job while it is running:
OAR lets you specify the files used for the standard output and error streams with the options
-O and -E respectively.
For example, it is possible to redirect the job output to /dev/null or to other files:
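A redirection with -O and -E might look like the sketch below. The wrapper name submit_with_outputs and the error file name in the usage comment are hypothetical; only the -O and -E options come from the text above.

```shell
# Submit a passive job, sending its stdout and stderr to chosen files.
# Usage: submit_with_outputs STDOUT_FILE STDERR_FILE SCRIPT
submit_with_outputs() {
    oarsub -O "$1" -E "$2" "$3"
}

# Discard stdout and keep errors in a file of your choice:
#   submit_with_outputs /dev/null ~/hello/errors.log ~/hello/run_hello_mpi
```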
Ending
The results of our job are available at the end of the files containing the standard (and error) outputs:
Connection to a running job
While a job is running, you can connect to its environment with the -C OAR_JOB_ID option.
Unless you submitted the job with -t allow_classic_ssh, you have to use the OAR shell, oarsh, to connect to the job's nodes.
Node number specification
Unless specified, submissions request the default resource quantity: 1 node.
To submit an interactive job on 2 nodes:
We are automatically connected to the reservation's head (one of the 2 nodes) due to the interactive submission.
We can learn about our dedicated resources:
As you can read, the example script detects the available nodes and adapts itself to use all the CPUs:
We can verify that the run occurs on the other node with another terminal:
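The 2-node interactive submission described at the start of this section can be sketched with the -l option of oarsub. The wrapper name interactive_on_nodes is illustrative only.

```shell
# Request an interactive job on a given number of nodes.
# Usage: interactive_on_nodes NODE_COUNT
interactive_on_nodes() {
    oarsub -I -l nodes="$1"
}

# For the 2-node example:
#   interactive_on_nodes 2
```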
Container jobs
Container jobs make it possible to run jobs inside another job.
- So it is like a sub-scheduling mechanism.
(1) Submit a job of type container:
oarsub returns the OAR_JOB_ID of the container job.
(2) From the frontend, in a new terminal, it is possible to use the inner type to schedule new jobs within it:
| Warning | |
|---|---|
Inner jobs have to be submitted with:
| |
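The two steps above might be sketched as follows, using the container and inner job types of oarsub. Running both jobs interactively and the choice of node counts are assumptions made for illustration; the function names are hypothetical.

```shell
# Step (1): submit an interactive container job.
# Usage: submit_container NODE_COUNT
submit_container() {
    oarsub -I -t container -l nodes="$1"
}

# Step (2): submit an interactive job inside an existing container.
# Usage: submit_inner CONTAINER_JOB_ID NODE_COUNT
submit_inner() {
    oarsub -I -t inner="$1" -l nodes="$2"
}

# Example, where 8670 stands for the OAR_JOB_ID returned by step (1):
#   submit_container 3
#   submit_inner 8670 2
```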
Planning
Until now, our submissions used the default start time (now) and the default duration (1 hour). OAR can of course let you choose a specific duration and a delayed start: these are advance reservations.
To run the job on April 19th, 5:30pm for 10 minutes:
| Note | |
|---|---|
You can do advance reservations without specifying a script.
| |
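An advance reservation such as the one described above (April 19th, 5:30pm, for 10 minutes) can be sketched with the -r and -l options of oarsub. The text does not give a year, so the one in the usage comment is a placeholder; reserve is a hypothetical wrapper name.

```shell
# Reserve resources at a given date for a given walltime.
# Usage: reserve "YYYY-MM-DD HH:MM:SS" WALLTIME [SCRIPT]
reserve() {
    start="$1"
    walltime="$2"
    shift 2
    # Remaining arguments (an optional script) are passed through unchanged.
    oarsub -r "$start" -l walltime="$walltime" "$@"
}

# Example (the year is a placeholder):
#   reserve "2030-04-19 17:30:00" 0:10:00 ~/hello/run_hello_mpi
```

Leaving out the script argument gives a reservation without a script, as mentioned in the note above.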
The delayed submission appears as Scheduled in Monika for the site where it was submitted.
When the job starts, you can connect to the reservation's head to interactively run the script or monitor its run:
The submission does not end when you disconnect from the reservation's head (even if you did not specify a script to run).
It ends when:
- The specified script ends.
- The job hits its walltime.
| Note | |
|---|---|
If you did not specify a script to run and you finished before the job's walltime, it is a good idea to release the allocated nodes earlier. | |
To terminate a job:
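Jobs are usually terminated with the oardel command. A minimal sketch, where end_job is an illustrative wrapper and the job ID is whatever oarsub or oarstat reported:

```shell
# Terminate a job and release its nodes.
# Usage: end_job OAR_JOB_ID
end_job() {
    oardel "$1"
}

# Example:
#   end_job 8670
```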
Next tutorial
Same things, but at Grid level: Grid experiment
