GNU Parallel: Difference between revisions

From Grid5000
Jump to navigation Jump to search
No edit summary
 
(124 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{Portal|User}}
{{Portal|Tutorial}}
{{Pages|HPC}}
{{TutorialHeader}}
This page descibes the use of GNU Parallel on Grid'5000.
This page descibes the use of GNU Parallel on Grid'5000.


Quoting [https://www.gnu.org/software/parallel/|GNU Parallel website]:
Quoting [https://www.gnu.org/software/parallel/ GNU Parallel website]:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run  for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.  
<blockquote>
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run  for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.  
</blockquote>


For a more general and complete information, see the [https://www.gnu.org/software/parallel/|GNU Parallel website].
For a more general and complete information, see the [https://www.gnu.org/software/parallel/ GNU Parallel website].


We details in this page Grid'5000 specific information in order to let you take benefit from the tool on the platform.
We details in this page Grid'5000 specific information in order to let you take benefit from the tool on the platform.
Line 19: Line 25:


== Benefit from using GNU Parallel in Grid5000 ==
== Benefit from using GNU Parallel in Grid5000 ==
While OAR is the ''Resource and Job Management System'' of Grid'5000 and supports the management of ''batch of jobs'', its use may be '''overkill to handle SPMD parallel executions of small tasks within a larger reservation'''. In concrete terms, a user may create a first OAR job in order to book a large set of resources for some time (e.g. for the night), and then have to submit a batch of many small tasks (e.g. each using only one core) within that first job.
While OAR is the ''Resource and Job Management System'' of Grid'5000 and supports the management of ''batch of jobs'', the use OAR with its job container feature may be '''overkill to handle SPMD parallel executions of small tasks within a larger reservation'''.  


To that purpose, using OAR container for the first job, then OAR inner jobs for the small tasks is overkill. ''(But note that using OAR container and inner jobs makes sense when all jobs are not from the same user, for tutorials for instance)''.
In concrete terms, a user may create a OAR job to book a large set of resources for some time (e.g. for the night), and then submit a batch of many small tasks (e.g. each using only one core) within the first job. For that purpose, using a OAR container for the first job, then OAR inner jobs for the small tasks is overkill. Indeed, inner jobs have to pay the price of operations such as running prologues and epilogues, which are necessary to provide a healthy environment between jobs of possibly different users, but that are useless in that case because all jobs are of the same user. In this context, GNU Parallel is more relevant and faster than OAR.  


'''We strongly advise to use GNU parallel to handle the execution of the small tasks within the initial OAR reservation of resources'''. That means only create one OAR job to book the large set of resources (not using the container job type), then within this job, use GNU Parallel.
''(But please note that using OAR container and inner jobs makes sense when all jobs are not from the same user, for tutorials or teaching labs for instance.)''


Note that using GNU Parallel to handle the small tasks, the OAR restriction (e.g. max 200 jobs in queue) will not apply.
'''We strongly advise using GNU parallel to handle the execution of the small tasks within the initial OAR reservation of resources'''. That means creating only one OAR job to book the large set of resources (not using the container job type), then within this job, use GNU Parallel.


== How to use GNU Parallem in Grid'5000 ==
Note that using GNU Parallel to handle the small tasks, some OAR restrictions (e.g. max 200 jobs in queue) will not apply.


'''GNU Parallel must be used within a OAR job''': GNU Parallel does not book resources, it just manages the concurrent parallel execution on already reserved resources.
== How to use GNU Parallel in Grid'5000 ==


; Single node
'''GNU Parallel must be used within a OAR job''': GNU Parallel does not book resources, '''it just manages the concurrent/parallel executions of tasks on already reserved resources'''.
Within a job of only 1 node (host), there is nothing special in the use of GNU parallel in Grid'5000, in order to exploit all the cores of the node. See [https://www.gnu.org/software/parallel/ GNU parallel documentation], or manual page for more information.


; Multiple nodes
=== Single node ===
Within a job of many nodes (hosts), the user need to tell GNU Parallel how to execute on the nodes reserved in the OAR job.
Within a OAR job of only 1 node (host), there is nothing specific to Grid'5000 to know in the usage of GNU parallel, in order to exploit all the cores of the node. Just run the ''parallel'' command in the job shell. See [https://www.gnu.org/software/parallel/ GNU parallel documentation], or manual page for more information.
# It has provide the list of target nodes to execute on, to pass to the GNU Parallel ''--slf'' option.
# It has to use the oarsh connector (unless the ''-t allow_classic_ssh'' OAR job type was used), by passing it to the GNU Parallel ''--ssh'' option.


== Examples ==
=== Multiple nodes ===
=== Example of use of GNU Parallel in a multi-node reservation ===
Within a OAR job of many nodes, the user needs to tell GNU Parallel how to remotely execute the tasks on the nodes reserved in the OAR job, from the head node (where the ''parallel'' command is to be run).
# He has to provide the list of target nodes to execute on, passing it via the GNU Parallel ''--slf'' option. It is possible to use the OAR nodes file ''$OAR_NODEFILE''. ''(Note that this file contains as many lines with a node name as the count of cores of that node. Having these duplicates is however not an issue for GNU Parallel, as it pre-process this file to have only unique entries.)''
# He has to use the ''oarsh'' connector (only if your job did not reserve all the CPU cores of a node), by passing it to the GNU Parallel ''--ssh'' option.
 
 
{{Term|location=node| cmd=<code class="command">parallel --ssh oarsh --slf $OAR_NODEFILE</code> ...}}
 
=== Confining GNU Parallel tasks to GPU, CPU, or cores ===
OAR allows to run jobs at a finer grain than a host: that is confining the job execution to CPU, GPU or cores. This execution isolation can also be obtained when combining OAR with GNU Parallel.
 
After reserving with ''oarsub'' a whole set of resources at the host grain (say, N hosts), GNU Parallel executors can be confined to subsets of the obtained OAR job resources, such as having for instance a few cores only or only 1 GPU available per GNU Parallel task.
 
To do so, GNU Parallel executors (''sshlogin'') must define their isolation context (which CPU, GPU, cores to use), using ''oarsh'' with the ''OAR_USER_CPUSET'' and/or ''OAR_USER_GPUDEVICE'' environment set. The ''oarprint'' command provides what is necessary for that.
 
; Confining to GPUs:
For instance, let's say one want to have one GNU Parallel executor per GPU on 2 bi-GPU machines. First, one reserves a job, say one gets chifflet-3 and chifflet-7. One then creates thanks to ''oarprint'' a GNU Parallel ''sshlogin'' file, which defines the executors as follows:
 
{{Term|location=chifflet-3| cmd=<code class="command">oarprint gpu -P gpudevice,cpuset,host -C+ -F "1/OAR_USER_GPUDEVICE=% OAR_USER_CPUSET=% oarsh %"  &#124; tee </code> <code class="replace">gpu-executors</code>}}
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+14+12+16+0+6+26+24+22+10+4+2+8 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+12+14+16+0+26+24+22+6+10+4+8+2 oarsh chifflet-7.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=25+13+27+11+9+15+23+19+1+21+17+5+7+3 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=13+25+27+11+23+19+15+9+1+21+17+7+5+3 oarsh chifflet-7.lille.grid5000.fr
 
Each line of the <code class="replace">gpu-executors</code> file instructs GNU parallel to confine the execution of one task (and only one, as per the ''1/'' at the beginning of the line) to the GPU whose id is given in OAR_USER_GPUDEVICE, and the cores whose ids are given in OAR_USER_CPUSET (that's the cores that are associated to the GPU in the OAR resources definition).
 
Also note that the use of ''oarsh'' is set in the line, so the ''--ssh oarsh'' is not required in that case.
 
We use the ''-C+'' option of ''oarprint'', because GNU parallel does not like '','' as a separator for the ''OAR_USER_CPUSET'' values in the sshlogin file (''oarsh'' accepts ''+'' like '','' or ''.'' or '':'' indifferently).
 
One then just has to make GNU Parallel use that file:
{{Term|location=chifflet-3| cmd=<code class="command">parallel --slf</code> <code class="replace">gpu-executors</code> ...}}
 
; Confining to CPUs
In case only CPU matters, one can use:
{{Term|location=node| cmd=<code class="command">oarprint cpu -P cpuset,host -C+ -F "1/OAR_USER_CPUSET=% oarsh %"</code>}}
 
; Confining to cores:
For confinement to cores (warning: every core will come with its hyperthreading sibling, if any), one uses:
{{Term|location=node| cmd=<code class="command">oarprint core -P cpuset,host -C+ -F "1/OAR_USER_CPUSET=% oarsh %"</code>}}
 
== Typical usage and some examples ==
=== Typical coupling of GNU Parallel and OAR in a multi-node reservation ===
; Create a OAR job of 10 nodes
; Create a OAR job of 10 nodes
{{Term|location=frontend| cmd=<code class="command">oarsub -I -l host=10</code>}}  
{{Term|location=frontend| cmd=<code class="command">oarsub -I -l host=10</code>}}  
We create an interactive job for this example, so that the command below are executed in the opened job shell.
We create an interactive job for this example, so that the command below are executed in the opened job shell.


; Create the ''sshlogin'' file for GNU Parallel from OAR_NODEFILE:
''But note that all this can be scripted and passed to the oarsub command of a non-interactive OAR job.''
The OAR node file contains as many lines with a node name as the count of cores of that node.
 
{{Term|location=node| cmd=<code class="command">uniq $OAR_NODEFILE > nodes</code>}}  
; Run parallel with the ''--ssh'' and ''--sshlloginfile'' options:
For a basic ''sshlogin'' file, GNU Parallel needs one line per host. It will compute by itself how many tasks to execute on each host, as a function of the hardware threads count of the host.
{{Term|location=node| cmd=<code class="command">parallel --progress --ssh oarsh --sshloginfile $OAR_NODEFILE </code><code class="replace">command</code> <code>:::</code> <code class="replace">arguments</code>}}
 
{{Note|text=GNU Parallel has many, many, many features to pass <code class="replace">command</code> and <code class="replace">arguments</code>, and provides as many options.
 
The examples below will give some hints, but please refer to the '''[https://www.gnu.org/software/parallel/ many GNU Parallel documentations, examples and tutorials on the web site]''' for a complete information.}}
 
 
=== Example 1: Finding the best kernel parameters of a SVM algorithm ===
 
In this example, let's consider a Machine Learning problem in which we want to find the best kernel parameters of a [https://en.wikipedia.org/wiki/Support-vector_machine SVM] algorithm. To do that, we need first to build several training models configured with different kernel parameter values, and later to compute precision and recall for all such models to find the best one. On this example, we only focus on building all the training models.
 
We consider a python algorithm <code class="replace">training.py</code> that takes as argument the two kernel parameters for the SVM problem: <code class="command">--gamma</code> and <code class="command">--c</code>.
 
We want to generate the SVM models for:
* ''gamma'' values ranging in [0.1, 0.01, 0.001, 0.0001]
* ''C'' value ranging in [1, 10, 100, 1000]
 
The number of trainings corresponds to the cross-product of ''C'' values and ''gamma'' values. Considering the previous ranges, we need to build 16 SVM models. We can build the 16 SVM models iteratively but to speed up the process, especially for big data sets, it is better to parallelize and distribute the computation on multiple nodes.
 
Let's consider the cluster (<code class="replace">mycluster</code>) we want to use has nodes with '''4''' cores. As our SVM algorithm is mono-threaded, we need at least for '''4 nodes''' to distribute the 16 computations. To do so, we first make an OAR reservation to book the 4 nodes of our cluster, and then we execute the following command to start the 4 computations on each of the 4 nodes:
 
{{Term|location=frontend| cmd=<code class="command">oarsub -p mycluster -l nodes=4 -I</code>}}
Then, from the shell opened by oarsub on the head node of job:
{{Term|location=node-1| cmd=<code class="command">parallel --ssh oarsh --sshloginfile $OAR_FILE_NODES ./training.py --gamma {1} --c {2} ::: 0.1 0.01 0.001 0.0001 ::: 1 10 100 1000</code>}}
 
As a result, this command line will execute (the order may be different):
* on node 1 <code class="command">./training.py --gamma 0.1 --c 1</code>
* on node 1 <code class="command">./training.py --gamma 0.1 --c 10</code>
* on node 1 <code class="command">./training.py --gamma 0.1 --c 100</code>
* on node 1 <code class="command">./training.py --gamma 0.1 --c 1000</code>
* on node 2 <code class="command">./training.py --gamma 0.01 --c 1</code>
* on node 2 <code class="command">./training.py --gamma 0.01 --c 10</code>
* on node 2 <code class="command">./training.py --gamma 0.01 --c 100</code>
* on node 2 <code class="command">./training.py --gamma 0.01 --c 1000</code>
* on node 3 <code class="command">./training.py --gamma 0.01 --c 1</code>
* on node 3 <code class="command">./training.py --gamma 0.001 --c 10</code>
* on node 3 <code class="command">./training.py --gamma 0.001 --c 100</code>
* on node 3 <code class="command">./training.py --gamma 0.001 --c 1000</code>
* on node 4 <code class="command">./training.py --gamma 0.001 --c 1</code>
* on node 4 <code class="command">./training.py --gamma 0.001 --c 10</code>
* on node 4 <code class="command">./training.py --gamma 0.001 --c 100</code>
* on node 4 <code class="command">./training.py --gamma 0.001 --c 1000</code>
 
=== Example 2: Illustrating the execution of the GNU Parallel tasks in a gantt chart ===
In this example we use GNU Parallel and build a gantt diagram showing a timeline of execution of the GNU Parallel tasks, within a OAR job.
 
Below is an example of such a diagram. It illustrates the GNU Parallel tasks dispatch on the resources which were reserved previously in a OAR job. This somehow completes the [[Status#Resources_reservations_.28OAR.29_status|OAR drawgantt]] display, by showing what is happening inside the OAR job.
 
[[File:Parallel-gantt.png|1200px]]
 
Please find below how GNU Parallel is used in this cas, and the steps to build the diagram.
 
; Initiate your workspace
Our workspace is in the nantes frontend (any other site could do as well), '''in our ~/public directory''', because we will open the gantt chart in the web browser at the end.
<syntaxhighlight lang="bash" line='line'>
mkdir -p ~/public/parallel
cd ~/public/parallel
</syntaxhighlight>
 
; Create the GNU Parallel task script "''task.sh''"
This is where the actual user program (e.g. numerical simulation, ...) is to be placed. Here for the example, we just basically sleep for a random time, from 4 to 25s.
Parameter of the user program (''$@'') will be passed by GNU Parallel.
<syntaxhighlight lang="bash" line='line'>
#!/bin/bash
DURATION=$(($RANDOM % 22 + 4))
echo Runing "$0 with params $@ on $(hostname)... (task will run for $DURATION s)."
sleep $DURATION
</syntaxhighlight>
Here, for the example, we just basically sleep for a random time, from 4 to 25s.
; Create the OAR script "''oarjob.sh''"
<syntaxhighlight lang="bash" line='line'>
#!/bin/bash
[ -z "$OAR_NODEFILE" ] && echo "Error: Not in a OAR job" 1>&2 && exit 1
cd ${0%/*}
TASK_COUNT=${1:-10}
seq $TASK_COUNT | parallel --joblog $PWD/parallel.log --bar --ssh oarsh --sshloginfile $OAR_NODEFILE --use-cpus-instead-of-cores $PWD/task.sh
</syntaxhighlight>
Here we simply pass to parallel a list of parameters in its standard input, using the ''seq'' command. Thus, parameters are intergers from 1 to ''TASK_COUNT'': ''TASK_COUNT'' tasks will be executed, each with as parameter one of those intergers. See '''[https://www.gnu.org/software/parallel/ GNU Parallel documentation]''' for more details and other options.
 
We used the ''--use-cpus-instead-of-cores'' option of GNU Parallel, so that it will only run 2 tasks at a time on each node.
 
The script will start the GNU Parallel command from the head node of the OAR job, and generate the "'''''parallel.log'''''" file.
 
; Create the html/javascript code to render the gantt chart of the tasks in "''ganttchart.html''"
The ''parallel.log'' file is generated while GNU Parallel is executing the tasks. We create the ''ganttchart.html'' web page to render the gantt chart from it.
<syntaxhighlight lang="html" line='line'>
<html>
  <head>
    <script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
    <script type="text/javascript">
      var queryDict = {'refresh': 3000};
      location.search.substr(1).split("&").forEach(function(item) {queryDict[item.split("=")[0]] = item.split("=")[1]})
      google.charts.load("current", {packages:["timeline"]});
      google.charts.setOnLoadCallback(getLogs);
      function getLogs(){
        var request = new XMLHttpRequest();
        request.open('GET', './parallel.log?_=' + new Date().getTime(), true); //force cache miss
        request.responseType = 'text';
        request.send(null);
        request.onreadystatechange = function () {
          if (request.readyState === 4 && request.status === 200) {
            var lines = request.responseText.split('\n').filter(function(line) { return line.length > 0 && line.match(/^\d/); });
            var data = lines.map(function(line) {
              a = line.split('\t');
              return [ a[1], "Task "+a[0], new Date(a[2]*1000), new Date(a[2]*1000+a[3]*1000) ];
            });
            data.sort((a, b) => (a[0] > b[0]) ? 1 : ((a[0] < b[0]) ? -1 : ((a[1] > b[1]) ? 1 : -1)));
            var container = document.getElementById('gantt');
            var chart = new google.visualization.Timeline(container);
            var dataTable = new google.visualization.DataTable();
            dataTable.addColumn({ type: 'string', id: 'Ressource' });
            dataTable.addColumn({ type: 'string', id: 'JobId' });
            dataTable.addColumn({ type: 'date', id: 'Start' });
            dataTable.addColumn({ type: 'date', id: 'End' });
            dataTable.addRows(data);
            chart.draw(dataTable);
            if (queryDict['refresh'] > 0) {
              setTimeout(getLogs, queryDict['refresh']);           
            }
          }
        }
      }
    </script>
  </head>
  <body>
    <div id="gantt" style="height: 100%;"></div>
  </body>
</html>
</syntaxhighlight>
 
; Submit the OAR job named ''OAR2parallel'' requesting 8 nodes of the ecotype cluster, and executing the ''oarjob.sh'' script.
<syntaxhighlight lang="bash" line='line'>
oarsub -n OAR2parallel -l nodes=8 -p "cluster='ecotype'" "./oarjob.sh 80"
</syntaxhighlight>
80 tasks to run: roughly 10 tasks per node, 5 per CPUs (bi-CPU nodes), but see in the gantt chart the actual tasks dispatch !'''
 
; Open the ganttchart.html web page
Open the web page in a web browser on your workstation:
* either using the Grid'5000 VPN, so that your can access the internal URL at: http://public.nantes.grid5000.fr/~YOUR_LOGIN/parallel/ganttchart.html
* or using the api reverse proxy service, at : https://api.grid5000.fr/stable/grid5000/sites/nantes/public/YOUR_LOGIN/parallel/ganttchart.html
(Mind replacing ''YOUR_LOGIN'' by your actual Grid'5000 user login).  
 
The gantt chart will show up and refresh every 3 seconds while GNU Parallel executes tasks. This should be very alike the diagram shown at the beginning of this section.
 
; Bonus, show the progress
GNU Parallel progress (''--bar'' option) is written to the OAR job stderr file, we can look at it in a loop:
<syntaxhighlight lang="bash" line='line'>
OAR_JOB_ID=$(oarstat -u | grep OAR2parallel | cut -f 1 -d\ ) \
while oarstat -u | grep -q $OAR_JOB_ID; do
  sleep 1 && [ -r OAR.OAR2parallel.$OAR_JOB_ID.stderr ] && cat OAR.OAR2parallel.$OAR_JOB_ID.stderr
done
</syntaxhighlight>
 
28% 23:57=47s 39
 
Wait until it reaches 100% and the OAR job finishes.
 
=== Example 3: Confining GNU Parallel tasks to execute on single isolated GPU only ===
In this example, we run a batch of tasks, which each have to execute on a single GPU and making sure it does not access the GPU used by another task.
 
See the [[GNU_Parallel#Confining_GNU_Parallel_tasks_to_GPU.2C_CPU.2C_or_cores|section above for some more details]].
 
; First, we reserve a OAR Job with 2 nodes, which have 2 GPUs each.
We use the ''chifflet'' cluster in Lille.
{{Term|location=frontend| cmd=<code class="command">oarsub -p "cluster = 'chifflet'" -l nodes=2 -I</code>}}
Let's say we got chifflet-3 and chifflet-7.
 
; We generate the GNU Parallel ''sshlogin'' file, for the execution on every GPU:
 
From the head node of the OAR job, chifflet-3, we use ''oarprint'' as follows to create the <code class="replace">gpu-executors</code> file which will be used as the GNU Parallel ''sshlogin'' file:
 
{{Term|location=chifflet-3| cmd=<code class="command">oarprint gpu -P gpudevice,cpuset,host -C+ -F "1/OAR_USER_GPUDEVICE=% OAR_USER_CPUSET=% oarsh %"  &#124; tee </code> <code class="replace">gpu-executors</code>}}
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+14+12+16+0+6+26+24+22+10+4+2+8 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+12+14+16+0+26+24+22+6+10+4+8+2 oarsh chifflet-7.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=25+13+27+11+9+15+23+19+1+21+17+5+7+3 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=13+25+27+11+23+19+15+9+1+21+17+7+5+3 oarsh chifflet-7.lille.grid5000.fr
 
; We create the task script <code class="replace">~/script.sh</code>:
<syntaxhighlight lang="bash" line='line'>
#!/bin/bash
echo ===============================================================================
echo "TASK: $@"
echo -n "BEGIN: "; date
echo -n "Host: "; hostname
echo -n "Threads: "; cat /sys/fs/cgroup/cpuset/$(< /proc/self/cpuset)/cpuset.cpus
echo -n "GPUs: "; nvidia-smi | grep -io -e " \(tesla\|geforce\)[^|]\+|[^|]\+" | sed 's/|/=/' | paste -sd+ -
sleep 3
echo -n "END: "; date
</syntaxhighlight>
The script shows the threads and GPU used by a task.
 
We make the script executable:
{{Term|location=chifflet-3| cmd=<code class="command">chmod 755 ~/script.sh</code>}}
 
; Finally, we run our GNU Parallel tasks:
 
{{Term|location=chifflet-3| cmd=<code class="command">parallel --slf gpu-executors ~/script.sh ::: {1..5}</code>}}
Here is the output:
<syntaxhighlight lang="bash" line="line" highlight="4-6,11-13,18-20,25-27,32-34">
===============================================================================
TASK: 1
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 3
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 2
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-7.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 4
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-7.lille.grid5000.fr
Threads: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 5
BEGIN: Thu 27 Feb 2020 10:01:56 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off
END: Thu 27 Feb 2020 10:01:59 PM CET
</syntaxhighlight>


; Run parallel with the ''--ssh'' and ''--slf'' options:
As expected, every task only has access to 1 GPU.
{{Term|location=node| cmd=<code class="command">parallel --progress --ssh $(which oarsh) --slf nodes </code><code class="replace">command</code> <code>:::</code> <code class="replace">argument</code>}}


Note that all this can be scripted and passed to the oarsub command of a non-interactive OAR job.
{{Note|text=Regarding the CPU threads, when comparing with what is set in the ''sshlogin'' file, we see that we got those given by OAR along with their hyper-threading thread siblings (2nd thread for every core). This is because OAR in Grid'5000 does not define the thread siblings in its resources, but computes them at execution time.}}

Latest revision as of 15:12, 6 April 2022

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This page descibes the use of GNU Parallel on Grid'5000.

Quoting GNU Parallel website:

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.

For a more general and complete information, see the GNU Parallel website.

We details in this page Grid'5000 specific information in order to let you take benefit from the tool on the platform.

About the GNU Parallel version installed in Grid'5000

The version of GNU Parallel installed on Grid'5000 nodes comes from the Debian's official packaging.

It is a rather old version, but it seems sufficient.

Whenever one would need a more recent version, one can get the tarball provided at http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2, and install it in one's home directoy. This is straightforward (e.g. ./configure --prefix=$HOME/parallel && make install).

(An environment module could be provided if requested by some users.)

Benefit from using GNU Parallel in Grid5000

While OAR is the Resource and Job Management System of Grid'5000 and supports the management of batch of jobs, the use OAR with its job container feature may be overkill to handle SPMD parallel executions of small tasks within a larger reservation.

In concrete terms, a user may create a OAR job to book a large set of resources for some time (e.g. for the night), and then submit a batch of many small tasks (e.g. each using only one core) within the first job. For that purpose, using a OAR container for the first job, then OAR inner jobs for the small tasks is overkill. Indeed, inner jobs have to pay the price of operations such as running prologues and epilogues, which are necessary to provide a healthy environment between jobs of possibly different users, but that are useless in that case because all jobs are of the same user. In this context, GNU Parallel is more relevant and faster than OAR.

(But please note that using OAR container and inner jobs makes sense when all jobs are not from the same user, for tutorials or teaching labs for instance.)

We strongly advise using GNU parallel to handle the execution of the small tasks within the initial OAR reservation of resources. That means creating only one OAR job to book the large set of resources (not using the container job type), then within this job, use GNU Parallel.

Note that using GNU Parallel to handle the small tasks, some OAR restrictions (e.g. max 200 jobs in queue) will not apply.

How to use GNU Parallel in Grid'5000

GNU Parallel must be used within a OAR job: GNU Parallel does not book resources, it just manages the concurrent/parallel executions of tasks on already reserved resources.

Single node

Within a OAR job of only 1 node (host), there is nothing specific to Grid'5000 to know in the usage of GNU parallel, in order to exploit all the cores of the node. Just run the parallel command in the job shell. See GNU parallel documentation, or manual page for more information.

Multiple nodes

Within a OAR job of many nodes, the user needs to tell GNU Parallel how to remotely execute the tasks on the nodes reserved in the OAR job, from the head node (where the parallel command is to be run).

  1. He has to provide the list of target nodes to execute on, passing it via the GNU Parallel --slf option. It is possible to use the OAR nodes file $OAR_NODEFILE. (Note that this file contains as many lines with a node name as the count of cores of that node. Having these duplicates is however not an issue for GNU Parallel, as it pre-process this file to have only unique entries.)
  2. He has to use the oarsh connector (only if your job did not reserve all the CPU cores of a node), by passing it to the GNU Parallel --ssh option.


Terminal.png node:
parallel --ssh oarsh --slf $OAR_NODEFILE ...

Confining GNU Parallel tasks to GPU, CPU, or cores

OAR allows to run jobs at a finer grain than a host: that is confining the job execution to CPU, GPU or cores. This execution isolation can also be obtained when combining OAR with GNU Parallel.

After reserving with oarsub a whole set of resources at the host grain (say, N hosts), GNU Parallel executors can be confined to subsets of the obtained OAR job resources, such as having for instance a few cores only or only 1 GPU available per GNU Parallel task.

To do so, GNU Parallel executors (sshlogin) must define their isolation context (which CPU, GPU, cores to use), using oarsh with the OAR_USER_CPUSET and/or OAR_USER_GPUDEVICE environment set. The oarprint command provides what is necessary for that.

Confining to GPUs

For instance, let's say one want to have one GNU Parallel executor per GPU on 2 bi-GPU machines. First, one reserves a job, say one gets chifflet-3 and chifflet-7. One then creates thanks to oarprint a GNU Parallel sshlogin file, which defines the executors as follows:

Terminal.png chifflet-3:
oarprint gpu -P gpudevice,cpuset,host -C+ -F "1/OAR_USER_GPUDEVICE=% OAR_USER_CPUSET=% oarsh %" | tee gpu-executors
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+14+12+16+0+6+26+24+22+10+4+2+8 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+12+14+16+0+26+24+22+6+10+4+8+2 oarsh chifflet-7.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=25+13+27+11+9+15+23+19+1+21+17+5+7+3 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=13+25+27+11+23+19+15+9+1+21+17+7+5+3 oarsh chifflet-7.lille.grid5000.fr

Each line of the gpu-executors file instructs GNU parallel to confine the execution of one task (and only one, as per the 1/ at the beginning of the line) to the GPU whose id is given in OAR_USER_GPUDEVICE, and the cores whose ids are given in OAR_USER_CPUSET (that's the cores that are associated to the GPU in the OAR resources definition).

Also note that the use of oarsh is set in the line, so the --ssh oarsh is not required in that case.

We use the -C+ option of oarprint, because GNU parallel does not like , as a separator for the OAR_USER_CPUSET values in the sshlogin file (oarsh accepts + like , or . or : indifferently).

One then just has to make GNU Parallel use that file:

Terminal.png chifflet-3:
parallel --slf gpu-executors ...
Confining to CPUs

In case only CPU matters, one can use:

Terminal.png node:
oarprint cpu -P cpuset,host -C+ -F "1/OAR_USER_CPUSET=% oarsh %"
Confining to cores

For confinement to cores (warning: every core will come with its hyperthreading sibling, if any), one uses:

Terminal.png node:
oarprint core -P cpuset,host -C+ -F "1/OAR_USER_CPUSET=% oarsh %"

Typical usage and some examples

Typical coupling of GNU Parallel and OAR in a multi-node reservation

Create a OAR job of 10 nodes
Terminal.png frontend:
oarsub -I -l host=10

We create an interactive job for this example, so that the command below are executed in the opened job shell.

But note that all this can be scripted and passed to the oarsub command of a non-interactive OAR job.

Run parallel with the --ssh and --sshlloginfile options
Terminal.png node:
parallel --progress --ssh oarsh --sshloginfile $OAR_NODEFILE command ::: arguments
Note.png Note

GNU Parallel has many, many, many features to pass command and arguments, and provides as many options. The examples below will give some hints, but please refer to the many GNU Parallel documentations, examples and tutorials on the web site for a complete information.


Example 1: Finding the best kernel parameters of a SVM algorithm

In this example, let's consider a Machine Learning problem in which we want to find the best kernel parameters of a SVM algorithm. To do that, we need first to build several training models configured with different kernel parameter values, and later to compute precision and recall for all such models to find the best one. On this example, we only focus on building all the training models.

We consider a python algorithm training.py that takes as argument the two kernel parameters for the SVM problem: --gamma and --c.

We want to generate the SVM models for:

  • gamma values ranging in [0.1, 0.01, 0.001, 0.0001]
  • C value ranging in [1, 10, 100, 1000]

The number of trainings corresponds to the cross-product of C values and gamma values. Considering the previous ranges, we need to build 16 SVM models. We can build the 16 SVM models iteratively but to speed up the process, especially for big data sets, it is better to parallelize and distribute the computation on multiple nodes.

Let's consider the cluster (mycluster) we want to use has nodes with 4 cores. As our SVM algorithm is mono-threaded, we need at least for 4 nodes to distribute the 16 computations. To do so, we first make an OAR reservation to book the 4 nodes of our cluster, and then we execute the following command to start the 4 computations on each of the 4 nodes:

Terminal.png frontend:
oarsub -p mycluster -l nodes=4 -I

Then, from the shell opened by oarsub on the head node of job:

Terminal.png node-1:
parallel --ssh oarsh --sshloginfile $OAR_FILE_NODES ./training.py --gamma {1} --c {2} ::: 0.1 0.01 0.001 0.0001 ::: 1 10 100 1000

As a result, this command line will execute (the order may be different):

  • on node 1 ./training.py --gamma 0.1 --c 1
  • on node 1 ./training.py --gamma 0.1 --c 10
  • on node 1 ./training.py --gamma 0.1 --c 100
  • on node 1 ./training.py --gamma 0.1 --c 1000
  • on node 2 ./training.py --gamma 0.01 --c 1
  • on node 2 ./training.py --gamma 0.01 --c 10
  • on node 2 ./training.py --gamma 0.01 --c 100
  • on node 2 ./training.py --gamma 0.01 --c 1000
  • on node 3 ./training.py --gamma 0.01 --c 1
  • on node 3 ./training.py --gamma 0.001 --c 10
  • on node 3 ./training.py --gamma 0.001 --c 100
  • on node 3 ./training.py --gamma 0.001 --c 1000
  • on node 4 ./training.py --gamma 0.001 --c 1
  • on node 4 ./training.py --gamma 0.001 --c 10
  • on node 4 ./training.py --gamma 0.001 --c 100
  • on node 4 ./training.py --gamma 0.001 --c 1000

Example 2: Illustrating the execution of the GNU Parallel tasks in a gantt chart

In this example we use GNU Parallel and build a gantt diagram showing a timeline of execution of the GNU Parallel tasks, within a OAR job.

Below is an example of such a diagram. It illustrates the GNU Parallel tasks dispatch on the resources which were reserved previously in a OAR job. This somehow completes the OAR drawgantt display, by showing what is happening inside the OAR job.

Parallel-gantt.png

Please find below how GNU Parallel is used in this cas, and the steps to build the diagram.

Initiate your workspace

Our workspace is in the nantes frontend (any other site could do as well), in our ~/public directory, because we will open the gantt chart in the web browser at the end.

mkdir -p ~/public/parallel
cd ~/public/parallel
Create the GNU Parallel task script "task.sh"

This is where the actual user program (e.g. numerical simulation, ...) is to be placed. Here for the example, we just basically sleep for a random time, from 4 to 25s. Parameter of the user program ($@) will be passed by GNU Parallel.

#!/bin/bash
DURATION=$(($RANDOM % 22 + 4))
echo Runing "$0 with params $@ on $(hostname)... (task will run for $DURATION s)."
sleep $DURATION

Here, for the example, we just basically sleep for a random time, from 4 to 25s.

Create the OAR script "oarjob.sh"
#!/bin/bash
[ -z "$OAR_NODEFILE" ] && echo "Error: Not in a OAR job" 1>&2 && exit 1
cd ${0%/*}
TASK_COUNT=${1:-10}
seq $TASK_COUNT | parallel --joblog $PWD/parallel.log --bar --ssh oarsh --sshloginfile $OAR_NODEFILE --use-cpus-instead-of-cores $PWD/task.sh

Here we simply pass to parallel a list of parameters in its standard input, using the seq command. Thus, parameters are intergers from 1 to TASK_COUNT: TASK_COUNT tasks will be executed, each with as parameter one of those intergers. See GNU Parallel documentation for more details and other options.

We used the --use-cpus-instead-of-cores option of GNU Parallel, so that it will only run 2 tasks at a time on each node.

The script will start the GNU Parallel command from the head node of the OAR job, and generate the "parallel.log" file.

Create the html/javascript code to render the gantt chart of the tasks in "ganttchart.html"

The parallel.log file is generated while GNU Parallel is executing the tasks. We create the ganttchart.html web page to render the gantt chart from it.

<html>
  <head>
    <script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
    <script type="text/javascript">
      var queryDict = {'refresh': 3000};
      location.search.substr(1).split("&").forEach(function(item) {queryDict[item.split("=")[0]] = item.split("=")[1]})
      google.charts.load("current", {packages:["timeline"]});
      google.charts.setOnLoadCallback(getLogs);
      function getLogs(){
        var request = new XMLHttpRequest();
        request.open('GET', './parallel.log?_=' + new Date().getTime(), true); //force cache miss
        request.responseType = 'text';
        request.send(null);
        request.onreadystatechange = function () {
          if (request.readyState === 4 && request.status === 200) {
            var lines = request.responseText.split('\n').filter(function(line) { return line.length > 0 && line.match(/^\d/); });
            var data = lines.map(function(line) {
              a = line.split('\t');
              return [ a[1], "Task "+a[0], new Date(a[2]*1000), new Date(a[2]*1000+a[3]*1000) ];
            });
            data.sort((a, b) => (a[0] > b[0]) ? 1 : ((a[0] < b[0]) ? -1 : ((a[1] > b[1]) ? 1 : -1)));
            var container = document.getElementById('gantt');
            var chart = new google.visualization.Timeline(container);
            var dataTable = new google.visualization.DataTable();
            dataTable.addColumn({ type: 'string', id: 'Ressource' });
            dataTable.addColumn({ type: 'string', id: 'JobId' });
            dataTable.addColumn({ type: 'date', id: 'Start' });
            dataTable.addColumn({ type: 'date', id: 'End' });
            dataTable.addRows(data);
            chart.draw(dataTable);
            if (queryDict['refresh'] > 0) {
              setTimeout(getLogs, queryDict['refresh']);            
            }
          }
        }
      }
    </script>
  </head>
  <body>
    <div id="gantt" style="height: 100%;"></div>
  </body>
</html>
Submit the OAR job named OAR2parallel requesting 8 nodes of the ecotype cluster, and executing the oarjob.sh script.
oarsub -n OAR2parallel -l nodes=8 -p "cluster='ecotype'" "./oarjob.sh 80"

80 tasks to run: roughly 10 tasks per node, 5 per CPUs (bi-CPU nodes), but see in the gantt chart the actual tasks dispatch !

Open the ganttchart.html web page

Open the web page in a web browser on your workstation:

(Mind replacing YOUR_LOGIN by your actual Grid'5000 user login).

The gantt chart will show up and refresh every 3 seconds while GNU Parallel executes tasks. This should be very alike the diagram shown at the beginning of this section.

Bonus, show the progress

GNU Parallel progress (--bar option) is written to the OAR job stderr file, we can look at it in a loop:

OAR_JOB_ID=$(oarstat -u | grep OAR2parallel | cut -f 1 -d\ ) \
while oarstat -u | grep -q $OAR_JOB_ID; do
  sleep 1 && [ -r OAR.OAR2parallel.$OAR_JOB_ID.stderr ] && cat OAR.OAR2parallel.$OAR_JOB_ID.stderr
done
28% 23:57=47s 39

Wait until it reaches 100% and the OAR job finishes.

Example 3: Confining GNU Parallel tasks to execute on single isolated GPU only

In this example, we run a batch of tasks, which each have to execute on a single GPU and making sure it does not access the GPU used by another task.

See the section above for some more details.

First, we reserve a OAR Job with 2 nodes, which have 2 GPUs each.

We use the chifflet cluster in Lille.

Terminal.png frontend:
oarsub -p "cluster = 'chifflet'" -l nodes=2 -I

Let's say we got chifflet-3 and chifflet-7.

We generate the GNU Parallel sshlogin file, for the execution on every GPU

From the head node of the OAR job, chifflet-3, we use oarprint as follows to create the gpu-executors file which will be used as the GNU Parallel sshlogin file:

Terminal.png chifflet-3:
oarprint gpu -P gpudevice,cpuset,host -C+ -F "1/OAR_USER_GPUDEVICE=% OAR_USER_CPUSET=% oarsh %" | tee gpu-executors
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+14+12+16+0+6+26+24+22+10+4+2+8 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=0 OAR_USER_CPUSET=20+18+12+14+16+0+26+24+22+6+10+4+8+2 oarsh chifflet-7.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=25+13+27+11+9+15+23+19+1+21+17+5+7+3 oarsh chifflet-3.lille.grid5000.fr
1/OAR_USER_GPUDEVICE=1 OAR_USER_CPUSET=13+25+27+11+23+19+15+9+1+21+17+7+5+3 oarsh chifflet-7.lille.grid5000.fr
We create the task script ~/script.sh
#!/bin/bash
echo ===============================================================================
echo "TASK: $@"
echo -n "BEGIN: "; date
echo -n "Host: "; hostname
echo -n "Threads: "; cat /sys/fs/cgroup/cpuset/$(< /proc/self/cpuset)/cpuset.cpus
echo -n "GPUs: "; nvidia-smi | grep -io -e " \(tesla\|geforce\)[^|]\+|[^|]\+" | sed 's/|/=/' | paste -sd+ -
sleep 3
echo -n "END: "; date

The script shows the threads and GPU used by a task.

We make the script executable:

Terminal.png chifflet-3:
chmod 755 ~/script.sh
Finally, we run our GNU Parallel tasks
Terminal.png chifflet-3:
parallel --slf gpu-executors ~/script.sh ::: {1..5}

Here is the output:

===============================================================================
TASK: 1
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 3
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 2
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-7.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 4
BEGIN: Thu 27 Feb 2020 10:01:53 PM CET
Host: chifflet-7.lille.grid5000.fr
Threads: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
GPUs:  GeForce GTX 108...  Off  = 00000000:03:00.0 Off 
END: Thu 27 Feb 2020 10:01:56 PM CET
===============================================================================
TASK: 5
BEGIN: Thu 27 Feb 2020 10:01:56 PM CET
Host: chifflet-3.lille.grid5000.fr
Threads: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
GPUs:  GeForce GTX 108...  Off  = 00000000:82:00.0 Off 
END: Thu 27 Feb 2020 10:01:59 PM CET

As expected, every task only has access to 1 GPU.

Note.png Note

Regarding the CPU threads, when comparing with what is set in the sshlogin file, we see that we got those given by OAR along with their hyper-threading thread siblings (2nd thread for every core). This is because OAR in Grid'5000 does not define the thread siblings in its resources, but computes them at execution time.