Cluster experiment-OAR2

Introduction

This practical session is about running jobs on a cluster. You will learn how to access a Grid'5000 cluster, how to install your data, and how to run your jobs and visualize them.

Note.png Note

We recommend opening at least 2 terminals:

  1. one for running jobs
  2. one to monitor/act on the jobs.

You're advised to look at the OAR2#Quick_Glossary for definitions of terms used below.

Prepare the experiment environment

In this tutorial, we are going to use a very simple program using the OpenMPI library. This hello world example only prints the rank of each parallel process and the name of the node it is running on.

  • To simulate a computation, each process sleeps for 60 seconds.

The program to run is stored outside of Grid'5000. Thus, we need to retrieve it and install it on the cluster we are currently connected to.


Cluster setup

Check proxy configuration

To retrieve data from a web site outside Grid'5000, you must use the web proxy.

Warning.png Warning

Only a few external web sites are reachable from inside Grid'5000.

  • A list of the authorized web sites is available at Web proxy access
  • To request the addition of a web site to the white-list, see Web proxy

We will use the wget command to retrieve our data from the Internet. To use a web proxy with wget, the $http_proxy and $https_proxy environment variables must be defined.

By default, the $http_proxy and $https_proxy environment variables are not set, to avoid any confusion when using HTTP to connect to resources inside Grid'5000. Run the following command to check your current environment:

Terminal.png frontend:
echo http_proxy=$http_proxy ; echo https_proxy=$https_proxy

You should get:

http_proxy=
https_proxy=

For this tutorial we use the web proxy, so we need to set the 2 environment variables:

Terminal.png frontend:
export http_proxy="http://proxy:3128" ; export https_proxy="http://proxy:3128"

We can now verify that the environment is indeed modified:

Terminal.png frontend:
echo http_proxy=$http_proxy ; echo https_proxy=$https_proxy

You should get:

http_proxy=http://proxy:3128
https_proxy=http://proxy:3128

Data installation

The experiment data are stored on the project's repository at INRIA Gforge.

(1) Copy them onto the frontend:

Terminal.png frontend:
wget --no-check-certificate https://gforge.inria.fr/frs/download.php/26756/hello.tgz -O ~/hello.tgz

(2) Unpack experiment data:

Terminal.png frontend:
tar -xvzf ~/hello.tgz -C ~/

A ~/hello/ directory has been created in our home directory. It will be available on every cluster node of the site thanks to the NFS-mounted home directories.

Warning.png Warning

If your experiment generates a lot of writes, it is advised to perform them on the local disk space of the nodes instead of your network-shared home directory.

This way, you will avoid NFS troubles such as lags or breakdowns. Since the NFS service is shared among all users and all compute nodes, its performance may vary independently of your experiment.

To ensure your experiment's reproducibility, be sure to avoid measurements that could depend on the performance of a shared NFS server!
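For example, a minimal sketch of this pattern (illustrative only: my_experiment and the directory names are placeholders, and /tmp is assumed to be on the node's local disk):

 #!/bin/bash
 # Write intermediate data to a job-specific directory on the node's local disk
 workdir="/tmp/$USER-$OAR_JOB_ID"
 mkdir -p "$workdir"
 
 # Run the experiment, sending its output to the local directory
 my_experiment > "$workdir/output.log" 2>&1
 
 # Copy only the final results back to the NFS-mounted home directory
 mkdir -p ~/results
 cp "$workdir"/output.log ~/results/
 rm -rf "$workdir"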

Visualize cluster

The experiment data are now ready. Before running the experiment in a job, let us look at the cluster state, which can be visualized in many ways.

Scheduled or running jobs

oarstat is a command-line tool to view current or planned job submissions.

View all submissions:

Terminal.png frontend:
oarstat

View the details of all submissions:

Terminal.png frontend:
oarstat -f

View the details of a specific submission:

Terminal.png frontend:
oarstat -f -j OAR_JOB_ID

View the status of a specified job:

Terminal.png frontend:
oarstat -s -j OAR_JOB_ID

View all submissions from a given user:

Terminal.png frontend:
oarstat -u LOGIN

The API provides a UI to view all scheduled or running jobs.
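For example, the jobs of a site can also be queried programmatically. The URL below is an assumption (check the API documentation for the exact entry point and version), and SITE is a placeholder for a site name such as rennes:

Terminal.png frontend:
curl https://api.grid5000.fr/stable/sites/SITE/jobs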

Nodes properties

oarnodes is also a command-line tool. It shows cluster node properties:

Terminal.png frontend:
oarnodes

Among the returned information is the current node state. This state is generally Alive or Absent. When nodes are faulty, their state is Suspected or Dead.

Pretty output

oarprint is a tool providing a pretty print of a job's resources. The command prints a sorted output of the resources of a job with regard to a key property, with a customisable format.

On a job's connection node, where $OAR_RESOURCE_PROPERTIES_FILE is defined, the following command must be executed inside an OAR job (on a compute node):

Terminal.png node:
oarprint host -P host,cpu,core -F "host: % cpu: % core: %" -C+

On the submission frontend:

Terminal.png frontend:
oarstat -j OAR_JOB_ID -p | oarprint core -P host,cpuset,memcore -F "%[%] (%)" -f - | sort

For now, you can test this tool using the second command with an OAR_JOB_ID obtained with oarstat.

Current nodes states

Monika is a web interface that synthesizes the information given by oarstat and oarnodes. It displays:

  • Current nodes states
  • Scheduled or running submissions (at the bottom).

Gantt charts

Drawgantt is a web interface that displays past, current and planned node states on a time chart.

Metrics

Node load, memory usage, cpu usage and so on are available with the Ganglia web interface: https://helpdesk.grid5000.fr/ganglia/

By default, current metrics are displayed but you can have an aggregated view of up to 1 year in the past.

Note.png Note

helpdesk.grid5000.fr is another Grid'5000 community website; use your Grid'5000 account as usual.

Interactive run

OAR2, the Grid'5000 batch scheduler, has an interactive mode. This mode connects the user to the first of the allocated nodes.

Submission

Submit an interactive job:

Terminal.png frontend:
oarsub -I

OAR2 returns a unique numeric id that identifies our submission:

OAR_JOB_ID=8670 

The -I option automatically connects you to the job's first node.

OAR2 sets several environment variables that scripts can use to retrieve the properties of the current submission:

Terminal.png node:
env | grep -i ^oar

In particular, the list of your dedicated nodes can be viewed:

Terminal.png node:
cat $OAR_NODE_FILE
Note.png Note

Sometimes node names are duplicated inside the $OAR_NODE_FILE file.

  • OAR2's notion of resource is core-based, so each node name is printed as many times as there are cores in the node (see the command below).
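To list each node only once, filter the file (the same sort -u trick is used later in this tutorial):

Terminal.png node:
sort -u $OAR_NODE_FILE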

It is time to run our script:

Terminal.png node:
~/hello/run_hello_mpi

Results are printed on the standard output.

Visualization

The submission is visible on the Monika web interface of the site where it was submitted.

The cluster status cannot be obtained from the command line on the node where OAR connected you. You need another terminal connected to the frontend machine:

Terminal.png frontend:
oarstat -f -j OAR_JOB_ID

Ending

With an interactive submission, the end of the job:

  • is not related to the execution lifespan of your scripts;
  • depends on the connection to the job's first node that OAR made for you.

Thus you can run as many scripts as you want until the job deadline.

Warning.png Warning

The default walltime of a submission is 1 hour. If your connection is still open after that deadline, it is automatically cut off by OAR2.

You can kill your submission by quitting the shell opened by OAR.

 Ctrl-D or exit

The submission should then no longer appear when requesting the current cluster status:

Terminal.png frontend:
oarstat

Check whether Monika and Drawgantt have been updated after your submission ended.

The nodes dedicated to the job should return to the available (Alive) state.

Passive run

OAR2 can also be used in passive mode:

  • A script is passed as a parameter.
    • It is executed on the reservation's head.
    • It must discover the other dedicated nodes in order to split its work among them (see the sketch below).
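As an illustration, such a script may look like the following minimal sketch, assuming OpenMPI. The actual ~/hello/run_hello_mpi provided with the tutorial may differ, and the path of the hello_mpi binary is an assumption:

 #!/bin/bash
 # Minimal sketch of a run script for a passive job (illustrative only).
 # $OAR_NODE_FILE lists the dedicated nodes (one line per allocated core),
 # so it can be passed directly to mpirun as a machine file.
 # On Grid'5000, mpirun may additionally need to be configured to use oarsh
 # for remote launches.
 mpirun -machinefile $OAR_NODE_FILE ~/hello/hello_mpi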

Submission

Submit a passive job with our script:

Terminal.png frontend:
oarsub ~/hello/run_hello_mpi
Note.png Note

The environment variables described for the interactive submission are also set for passive jobs.

Visualization

Do the same as for interactive submission.

Wrapping script

The following script, launchJob.sh, submits a passive job that executes the hello_mpi program, then waits until the job starts:

 #!/bin/bash
 
 # Script to submit (use $HOME rather than a quoted ~ so the path expands reliably)
 my_script="$HOME/hello/run_hello_mpi"
 
 # Submit the passive job and extract the job id from oarsub's output
 oar_job_id=$(oarsub "$my_script" | grep "OAR_JOB_ID" | cut -d '=' -f2)
 
 # File where OAR will redirect the job's standard output
 oar_stdout_file="OAR.$oar_job_id.stdout"
 
 # Poll the job state until it is Running
 until oarstat -s -j $oar_job_id | grep Running ; do
     echo "Job (id: $oar_job_id) is waiting..."
     sleep 1
 done
 
 echo "Job $oar_job_id is started!"

To ease post-run analysis, OAR will by default redirect standard output and standard error output into OAR.OAR_JOB_ID.stdout and OAR.OAR_JOB_ID.stderr respectively.

They can be found in your current working directory:

OAR.OAR_JOB_ID.stdout
OAR.OAR_JOB_ID.stderr

Thus you can follow the output of your job as it is running:

Terminal.png frontend:
tail -f OAR.OAR_JOB_ID.stdout

OAR lets you specify the files for the standard and error output streams with the -O and -E options respectively. For example, it is possible to redirect the job's standard output to /dev/null or to another file:

Terminal.png frontend:
oarsub ~/hello/run_hello_mpi -O /dev/null
Terminal.png frontend:
oarsub ~/hello/run_hello_mpi -O ~/hello_mpi.log
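Both streams can be redirected at once; for example (the file names are illustrative):

Terminal.png frontend:
oarsub ~/hello/run_hello_mpi -O ~/hello_mpi.out -E ~/hello_mpi.err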

Ending

The results of our job are available at the end of the files containing the standard (and error) outputs:

Terminal.png frontend:
tail OAR.OAR_JOB_ID.stdout

Connection to a running job

While a job is running, it is possible to connect to its environment with the -C OAR_JOB_ID option.

Terminal.png frontend:
oarsub -C OAR_JOB_ID

Unless you submitted the job with -t allow_classic_ssh, you have to use the OAR shell, oarsh, to connect to the job's nodes:

Terminal.png frontend:
oarsub -C OAR_JOB_ID
Terminal.png main_node:
sort -u $OAR_NODE_FILE
Terminal.png main_node:
oarsh OTHER_NODE_HOSTNAME

Node number specification

Unless specified otherwise, submissions request the default resource quantity: 1 node.

To submit an interactive job on 2 nodes:

Terminal.png frontend:
oarsub -I -l nodes=2
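A walltime can also be requested in the same -l option, as done later in this tutorial; for example (the value is illustrative):

Terminal.png frontend:
oarsub -I -l nodes=2,walltime=0:30:00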

We are automatically connected to the reservation's head (one of the 2 nodes) because this is an interactive submission.

We can learn about our dedicated resources:

Terminal.png node:
sort -u $OAR_NODE_FILE

As you can see, the example script detects the available nodes and adapts itself to use all the CPUs:

Terminal.png node:
~/hello/run_hello_mpi

We can verify that the run occurs on the other node with another terminal:

Terminal.png frontend:
export OAR_JOB_ID=OAR_JOB_ID
Terminal.png frontend:
oarsh OTHER_NODE_HOSTNAME ps -C hello_mpi

Container jobs

With this functionality, it is possible to execute jobs within another one.

  • It is thus a kind of sub-scheduling mechanism.

(1) Submit a job of type container:

Terminal.png frontend:
oarsub -I -t container -l nodes=4,walltime=0:45:00

oarsub returns the OAR_JOB_ID of the container job.

(2) From the frontend, in a new terminal, it is possible to use the inner type to schedule new jobs within the container:

Terminal.png frontend:
oarsub -I -t inner=containerJobID -l nodes=3,walltime=0:15:00
Terminal.png frontend:
oarsub -I -t inner=containerJobID -l nodes=1,walltime=0:40:00
Warning.png Warning

Inner jobs have to be submitted with:

  • fewer nodes than the container
  • a shorter walltime than the container

Otherwise they will never be scheduled.

Note.png Note

-t container is usable with every kind of job: passive, interactive and reservations.

Note.png Note

-t inner cannot be used with a reservation.

Planning

Until now, our submissions used the default start time (now) and the default duration (1 hour). OAR can of course let you choose a specific duration and a delayed start: this is an advance reservation.

To run the job on March 24th 2010, at 5:30pm, for 10 minutes:

Terminal.png frontend:
oarsub -r '2010-03-24 17:30:00' -l nodes=2,walltime=0:10:00 ~/hello/run_hello_mpi
Note.png Note

You can do advance reservations without specifying a script.

  • It is then the responsibility of the job's owner to ensure that nodes are not kept idle when the reservation starts.

Note.png Note

The timezone is that of the site from which you make the reservation.

The delayed submission appears as Scheduled on the Status#Monika page of the site where it was submitted.

Terminal.png frontend:
oarstat -f -j OAR_JOB_ID

When the job starts, you can connect to the reservation's head to interactively run the script or monitor its run:

Terminal.png frontend:
oarsub -C OAR_JOB_ID -I

The job does not end when you disconnect from the reservation's head (even if you did not specify a script to run).

Ending occurs when:

  • the specified script ends,
  • the job hits its walltime, or
  • the job is explicitly terminated.
Note.png Note

If you did not specify a script to run and you finished before the job's walltime, it is a good idea to release the allocated nodes earlier.

To terminate a job:

Terminal.png frontend:
oardel OAR_JOB_ID
Note.png Note

The terminated job will now be in the Error state, since OAR has no Canceled state.

Next tutorial

The same thing, but at grid level: Grid experiment
