Getting Started

{{Note|text=This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the [[Support]] page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.}}

This tutorial will guide you through your first steps on Grid'5000. Before proceeding, make sure you have a Grid'5000 account (if not, follow the account creation procedure first), and an SSH client.

== Getting support ==
The '''[[Support]] page''' describes how to get help during your Grid'5000 usage.
There's also an '''[[FAQ]] page''' with the most common commands.


== Connecting for the first time ==

The primary way to move around Grid'5000 is using SSH. A reference page for [[SSH]] is also maintained, with advanced configuration options that frequent users will find useful.

As described in the figure below, when using Grid'5000, you will typically:
# connect, using SSH, to an access machine
# connect from this access machine to a site frontend
# on this site frontend, reserve resources (nodes), and connect to those nodes

[[Image:Grid5000_SSH_access.png|950px|center|Grid5000 Access]]
=== Grid'5000 for Microsoft Windows Users ===
Documentation for users of Microsoft Windows on Grid'5000 is available: [[Grid5000_for_Microsoft_Windows_users|Grid5000_for_Microsoft_Windows_users]]


=== SSH connection through a web interface ===

If you want an out-of-the-box solution which does not require you to set up SSH, you can connect through a web interface, available at <code>https://intranet.grid5000.fr/shell/</code><code class="replace">SITE</code><code>/</code>. For example, to access the Nancy site, use https://intranet.grid5000.fr/shell/nancy/. To connect, you will have to type in your credentials twice (first for the HTTP proxy, then for the SSH connection).

This solution is probably suitable to follow this tutorial, but it is unlikely to be sufficient for real Grid'5000 usage, so you should read the next sections about how to set up and use SSH at some point.

=== Connect to a Grid'5000 access machine ===

To enter the Grid'5000 network from the Internet, one must use an access machine: <code class="host">access.grid5000.fr</code>. Note that <code class="host">access.grid5000.fr</code> is a round-robin alias to either <code class="host">access-north</code>, currently hosted in Lille, or <code class="host">access-south</code>, currently hosted in Sophia-Antipolis.

For all connections, you must use the <code class="replace">login</code> that was provided to you when you created your Grid'5000 account.
{{Term|location=outside|cmd=<code class="command">ssh</code> <code class="replace">login</code><code>@</code><code class="host">access.grid5000.fr</code>}}
You will get authenticated using the SSH public key you provided in the account creation form. Password authentication is disabled.

{{Note|text=You can modify your SSH keys in the account management interface.}}

=== Connecting to a Grid'5000 site ===

Grid'5000 is structured in sites (Grenoble, Rennes, Nancy, ...). Each site hosts one or more clusters (homogeneous sets of machines, usually bought at the same time).

To connect to a particular site, do the following (blue and red arrow labeled SSH in the figure above):
{{Term|location=access|cmd=<code class="command">ssh</code> <code class="replace">site</code>}}

; Home directories
You have a different home directory on each Grid'5000 site, so you will usually use <code class="command">rsync</code> or <code class="command">scp</code> to move data around. On access machines, you have direct access to each of those home directories through NFS mounts (but using that feature to transfer very large volumes of data is inefficient). Typically, to copy a file to your home directory on the Nancy site, you can use:
{{Term|location=outside|cmd=<code class="command">scp</code> <code>myfile.c</code> <code class="replace">login</code><code>@</code><code class="host">access.grid5000.fr</code><code>:nancy/targetdirectory/mytargetfile.c</code>}}

{{Warning|text=Grid'5000 does NOT have a BACKUP service for users' home directories: it is your responsibility to save important data somewhere outside Grid'5000 (or at least to copy data to several Grid'5000 sites in order to increase redundancy).}}

Quotas are applied on home directories: by default, you get 25 GB per Grid'5000 site. If your usage of Grid'5000 requires more disk space, it is possible to request quota extensions in the account management interface, or to use other storage solutions (see [[Storage]]).

=== Recommended tips and tricks for an efficient use of Grid'5000 ===

; Better exploit SSH and related tools
* Configure SSH aliases using the <code class="command">ProxyCommand</code> option. Using this, you can avoid the two-hop connection (access machine, then frontend) and establish connections directly to frontends. This requires OpenSSH, which is the SSH software available on all GNU/Linux systems, macOS, and recent versions of Microsoft Windows.

{{Note|text=Please really take the time to set up the following SSH configuration on the workstation or laptop from where you access Grid'5000 (<code class="host">outside</code>). It makes many tasks significantly easier and will save you time if you use Grid'5000 on a regular basis.}}

{{Term|location=outside|cmd=<code class="command">editor</code> <code class="file">~/.ssh/config</code>}}
<pre>
Host g5k
  User login
  Hostname access.grid5000.fr
  ForwardAgent no

Host *.g5k
  User login
  ProxyCommand ssh g5k -W "$(basename %h .g5k):%p"
  ForwardAgent no
</pre>

'''Reminder:''' <code class=replace>login</code> is your Grid'5000 username

'''Warning:''' the <code class=command>ProxyCommand</code> line works if your login shell is <code class=command>bash</code>. If not, you may have to adapt it. For instance, for the <code class=command>fish</code> shell, this line must be: <code class=command>ProxyCommand ssh g5k -W (basename %h .g5k):%p</code>.


Once done, you can establish connections to any machine (first of all: frontends) inside Grid'5000 directly, by suffixing <code class=host>.g5k</code> to its hostname (instead of first having to connect to an access machine). E.g.:
{{Term|location=outside|cmd=<code class="command">ssh</code> <code class="host">rennes.g5k</code>}}
{{Term|location=outside|cmd=<code class="command">scp</code> <code>a_file</code> <code class="host">lille.g5k</code><code>:</code>}}

* Use <code class="command">rsync</code> instead of <code class="command">scp</code> for better performance with multiple files.
* Access your data from your laptop using SSHFS.
* Edit files over SSH with your favorite text editor, with e.g.:
{{Term|location=outside|cmd=<code class="command">vim</code> <code>scp://nancy.g5k/my_file.c</code>}}
* For better bandwidth or latency, you may also be able to connect directly via the local access machine of one of the Grid'5000 sites. Local accesses use <code class="host">access.</code><code class="replace">site</code><code class="host">.grid5000.fr</code> instead of <code class="host">access.grid5000.fr</code>. However, mind that per-site access restrictions are applied: see [[External access]] for details about local access machines.

; VPN
A VPN service is also available, allowing direct connection to any Grid'5000 machine (bypassing the access machines). See the [[VPN]] page for more information.

; HTTP reverse proxies
If you only require HTTP/HTTPS access to a node, a reverse HTTP proxy is also available, see the [[HTTP/HTTPs_access]] page.

; Bash prompt
It is possible to modify your bash prompt to display useful information related to your current job, such as its job id, the reserved nodes and the remaining time.
 
{{Term|location=fnancy|cmd=<code class="command">jdoe@fnancy:~$ oarsub -C 3241912</code>}}
{{Term|location=grisou-15|cmd=<pre class="command">
Connect to OAR job 3241912 via the node grisou-15.nancy.grid5000.fr
[OAR] OAR_JOB_ID=3241912
[OAR] Your nodes are:
      grisou-15.nancy.grid5000.fr*16
 
[jdoe@grisou-15 ~](3241912-->57mn)$ sleep 1m
[jdoe@grisou-15 ~](3241912-->55mn)$
</pre>}}
 
If you are interested, you will find [https://oar.imag.fr/wiki:use_cases_and_user_tips#oar_aware_shell_prompt_for_interactive_jobs here] all the information you need to set up such a prompt.
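
As a minimal sketch, relying only on the <code class="env">OAR_JOB_ID</code> environment variable set by OAR (the full job-aware prompt described at the link above is richer), you could add something like this to your <code class="file">~/.bashrc</code>:
<syntaxhighlight lang="bash">
# Display the current OAR job id in the prompt when running inside a job.
# Bash re-expands $OAR_JOB_ID each time the prompt is displayed.
if [ -n "$OAR_JOB_ID" ]; then
    PS1='[\u@\h \W](OAR job $OAR_JOB_ID)\$ '
fi
</syntaxhighlight>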


== Discovering, visualizing and reserving Grid'5000 resources ==

{{Note|text=OAR is the resources and jobs management system (a.k.a. batch manager) used in Grid'5000, just like in traditional HPC centers. '''However, the settings and rules of OAR configured in Grid'5000 differ slightly from traditional batch manager setups in HPC centers, in order to match the requirements of an experimentation testbed'''. Please remember to read the '''Grid'5000 [[Grid5000:UsagePolicy#Resources_reservation|Usage Policy]]''' again to understand the expected usage.}}


In Grid'5000, the smallest unit of resource managed by OAR is the core (CPU core), but by default an OAR job reserves a host (a physical computer, including all its CPUs and cores, and possibly GPUs). Hence, what OAR calls ''nodes'' are hosts (physical machines). In the <code class="command">oarsub</code> resource request (<code class="command">-l</code> arguments), ''nodes'' is an alias for ''host'', so both are equivalent. But prefer using ''host'', for consistency with other arguments and with other tools that expose ''host'', not ''nodes''.


{{Note|text=Most of this tutorial uses the site of Nancy (with the frontend: <code class="host">fnancy</code>), but other sites can be used alternatively.}}

To reserve a single host (one node) for one hour, in interactive mode, do:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I</code>}}
As soon as the resource becomes available, you will be directly connected to the reserved resource with an interactive shell, as indicated by the shell prompt, and you can run commands on the node:
{{Term|location=grisou-1|cmd=<code class="command">lscpu</code>}}
; Reserving only part of a node


To reserve only one CPU core in interactive mode, run:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l core=1</code> <code>-I</code>}}
{{Note|text=When reserving only a share of a node's cores, you get a share of its memory in the same proportion: if you reserve the whole node, you get all of its memory; if you reserve half of the cores, you get half of the memory, and so on. You cannot reserve a memory size explicitly.}}

When reserving several CPU cores, there is no guarantee that they will be allocated on a single node. To ensure this, you need to specify that you want a single host:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l host=1/core=8</code> <code>-I</code>}}
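Once connected, you can check that you were granted the expected share of the node: since OAR confines each job to its reserved cores (using Linux cpusets), <code class="command">nproc</code> should report 8 here (the node name is just an example):
{{Term|location=grisou-1|cmd=<code class="command">nproc</code>}}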


; Non-interactive usage (scripts)


You can also simply launch your experiment along with your reservation:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l host=1/core=1</code> <code>"my_mono_threaded_script.py --in $HOME/data --out $HOME/results"</code>}}
Your program will be executed as soon as the requested resources are available. As this type of job is not interactive, you will have to check for its termination using the <code class="command">oarstat</code> command.
{{Template:OARscript}}
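
You can check the status of a given job with <code class="command">oarstat</code>; by default, OAR also writes the job's output to <code class="file">OAR.</code><code class="replace">jobid</code><code class="file">.stdout</code> and <code class="file">OAR.</code><code class="replace">jobid</code><code class="file">.stderr</code> files in the directory where you ran <code class="command">oarsub</code> (the job id below is illustrative):
{{Term|location=fnancy|cmd=<code class="command">oarstat</code> <code>-j</code> <code class="replace">3241912</code>}}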


; Other types of resources
To reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode, run:
{{Term|location=flille|cmd=<code class="command">oarsub -l gpu=1</code> <code>-I</code>}}
Or in Nancy where GPUs are only available in the production queue:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l gpu=1</code> <code>-I</code> <code>-q production</code>}}
To reserve several GPUs and ensure they are located in a single node, make sure to specify <code class="command">host=1</code>:
{{Term|location=flille|cmd=<code class="command">oarsub -l host=1/gpu=2</code> <code>-I</code>}}
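
Once on the node, you can check which GPUs were allocated to your job, for instance with <code class="command">nvidia-smi</code> on NVIDIA-equipped clusters (the node name is just an example):
{{Term|location=chifflot-1|cmd=<code class="command">nvidia-smi</code>}}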


; Tips and tricks
To terminate your reservation (and release the resources) before the end of the walltime, simply exit the interactive shell:
{{Term|location=graffiti-1|cmd=<code class="command">exit</code>}}



To avoid unanticipated termination of your jobs in case of errors (terminal closed by mistake, network disconnection), you can either use tools such as [https://tmux.github.io/ tmux] or [[screen]], or reserve and connect in two steps using the job id associated with your reservation. First, reserve a node, and run a <code class="command">sleep</code> command that does nothing for an infinite time:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> "<code class="command">sleep</code> <code class="replace">infinity</code>"}}
Of course, the job will not run for an infinite time: the command will be killed when the job expires.
 
Then:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-C</code> <code class="replace">job_id</code>}}
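If you no longer have the job id at hand, you can list your current jobs (and their ids) with:
{{Term|location=fnancy|cmd=<code class="command">oarstat</code> <code>-u</code>}}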
Once connected to the node, you can explore the software environment provided by default, e.g.:
{{Term|location=grisou-15|cmd=<br>
<code class="command">java</code> <code>-version</code><br>
<code class="command">java</code> <code>-version</code><br>
<code class="command">mpirun</code> <code>--version</code><br>
<code class="command">mpirun</code> <code>--version</code><br>
<code class="command">module available</code><code> # List [[Environment_modules|scientific-related software available using module]]</code><br>
<code class="command">module available</code><code> # List [[Modules|scientific-related software available using module]]</code><br>
<code class="command">whoami</code><br>
<code class="command">whoami</code><br>
<code class="command">env <nowiki>|</nowiki> grep OAR</code><code> # discover environment variables set by OAR</code>}}  
<code class="command">env <nowiki>|</nowiki> grep OAR</code><code> # discover environment variables set by OAR</code>}}  

Of course, you might want to run a job for a different duration than the default one hour. The <code>-l</code> option allows you to pass a comma-separated list of parameters specifying the resources needed for the job, and <code>walltime</code> is a special resource defining the duration of your job:

{{Term|location=fnancy|cmd=<code class="command">oarsub -l host=1/core=2,walltime=0:30</code> <code>-I</code>}}

The walltime is the duration you expect your job to need. Its format is <code>[hour:min:sec|hour:min|hour]</code>. For instance:


* <code>walltime=5</code> requests 5 hours
* <code>walltime=1:22</code> requests 1 hour and 22 minutes
* <code>walltime=0:03:30</code> requests 3 minutes and 30 seconds

; Working with more than one node
To reserve two hosts (two nodes) in interactive mode, do:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l host=2</code> <code>-I</code>}}

You will obtain a shell '''on the first node of the reservation'''. It is up to you to connect to the other nodes and distribute work among them.

By default, you can only connect to nodes that are part of your reservation. If you completely own the nodes within one job (or with one job per '''complete''' node), you will be able to connect to them using <code class="command">ssh</code>. For nodes that are not completely owned within a job (because you reserved only part of a node, or have several jobs on the same node), you will have to use the <code class="command">oarsh</code> connector to go from one node to the other. The connector supports the same options as the classical <code class="command">ssh</code> command, so it can be used as a replacement for software expecting ssh.
{{Term|location=gros-49|cmd=<br>
<code class="command">uniq</code> <code class="env">$OAR_NODEFILE</code> <code># list of resources of your reservation</code><br>
<code class="command">ssh</code> <code class="replace">gros-1</code><code>    # try to connect to a node not in the file (should fail)</code><br>
<code class="command">oarsh</code> <code class="replace">gros-54</code><code> # connect to the other node of your reservation (should work)</code><br>
<code class="command">ssh</code> <code class="replace">gros-54</code><code> # connect to the other node of your reservation (should work)</code>}}


{{Note|text=To take advantage of several nodes and distribute work between them, a good option is [[GNU_Parallel]].}}
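
For instance, here is a minimal sketch running a command once per reserved node through <code class="command">oarsh</code> (<code>--ssh</code>, <code>--sshloginfile</code> and <code>--nonall</code> are standard GNU Parallel options; adapt to your needs):
{{Term|location=gros-49|cmd=<br>
<code class="command">uniq</code> <code class="env">$OAR_NODEFILE</code> <code>> ~/nodes.txt</code><br>
<code class="command">parallel</code> <code>--ssh oarsh --sshloginfile ~/nodes.txt --nonall</code> <code class="command">hostname</code>}}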


<code class="command">oarsh</code> is a wrapper around <code class="command">ssh</code> that enables the tracking of user jobs inside compute nodes (for example, to enforce the correct sharing of resources when two different jobs share a compute node). If your application does not support choosing a different connector, it is possible to avoid using <code class="command">oarsh</code> for <code class="command">ssh</code> with the <code>allow_classic_ssh</code> job type, as in
<code class="command">oarsh</code> is a wrapper around <code class="command">ssh</code> that enables the tracking of user jobs inside compute nodes (for example, to enforce the correct sharing of resources when two different jobs share a compute node). If your application does not support choosing a different connector, be sure to reserve nodes entirely (which is the default with <code class="command">oarsub</code>) to be able to use <code class="command">ssh</code>.
{{Term|location=fnancy|cmd=<code class="command">oarsub -l host=2,walltime=0:30:0</code> <code>-I</code> <code>-t</code> <code class="env">allow_classic_ssh</code>}}


=== Selecting specific resources ===


So far, all examples have let OAR decide which resources to allocate to a job. It is possible to obtain finer-grained control over the allocated resources by using filters.


; Selecting nodes from a specific cluster or cluster type


* Reserve nodes from a specific cluster
{{Term|location=fgrenoble|cmd=<code class="command">oarsub</code> <code class="replace">-p dahu</code> <code class="command">-l host=2,walltime=2</code> <code>-I</code>}}
* Reserve nodes in the [[Grid5000:UsagePolicy#Rules_for_the_production_queue|'''production''' queue]]
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code class="replace">-q production</code> <code class="replace">-p grappe</code> <code class="command">-l host=2,walltime=2</code> <code>-I</code>}}
* Reserve nodes from an '''exotic''' cluster type
{{Term|location=flyon|cmd=<code class="command">oarsub</code> <code class="replace">-t exotic</code> <code class="replace">-p pyxis</code> <code class="command">-l host=2,walltime=2</code> <code>-I</code>}}


Clusters with the '''exotic''' type either have a non-x86 architecture or are specific enough to warrant this type. Resources with an exotic type are never selected by default by OAR. Using <code class="command">-t exotic</code> is required to obtain such resources.


The type of a cluster can be identified on the [[Hardware]] pages, see for instance [[Lyon:Hardware]].


{{Warning|text=When using the <code class="command">-t exotic</code> option, you can still obtain non-exotic resources! You should filter on the cluster name or other properties if you want exclusively exotic resources.}}
; Selecting specific nodes
If you know the exact node you want to reserve, you can specify the hostname of the node you require:
{{Term|location=fgrenoble|cmd=<code class="command">oarsub</code> <code class="replace">-p dahu-12</code> <code class="command">-l host=1,walltime=2</code> <code>-I</code>}}
If you want several specific nodes, you can use a list:
{{Term|location=fgrenoble|cmd=<code class="command">oarsub</code> <code class="replace">-p "host IN (dahu-5, dahu-12)"</code> <code class="command">-l host=2,walltime=2</code> <code>-I</code>}}


; Using OAR properties
The OAR nodes database contains a set of properties for each node, and the <code class="command">-p</code> option actually filters based on these properties:
* Nodes with Infiniband FDR interfaces:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code class="replace">-p "ib=FDR"</code> <code class="command">-l host=5,walltime=2 -I</code>}}
* Nodes with power sensors and GPUs:
{{Term|location=flyon|cmd=<code class="command">oarsub</code> <code class="replace">-p "wattmeter=YES AND gpu_count > 0"</code> <code class="command">-l host=2,walltime=2</code> <code>-I</code>}}
* Nodes with 2 GPUs:
{{Term|location=flille|cmd=<code class="command">oarsub</code> <code class="replace">-p "gpu_count = 2"</code> <code class="command">-l host=3,walltime=2</code> <code>-I</code>}}
* Nodes with a specific CPU model:
{{Term|location=flille|cmd=<code class="command">oarsub</code> <code class="replace">-p "cputype = 'Intel Xeon E5-2630 v4'"</code> <code class="command">-l host=3,walltime=2</code> <code>-I</code>}}
* Since <code class="command">-p</code> accepts SQL, you can write advanced queries:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code class="replace">-p "wattmeter=YES AND host NOT IN (graffiti-41, graffiti-42)"</code> <code class="command">-l host=5,walltime=2</code> <code>-I</code>}}
{{Term|location=flille|cmd=<code class="command">oarsub</code> <code class="replace">-p "cputype LIKE 'AMD%'"</code> <code class="command">-l host=3,walltime=2</code> <code>-I</code>}}
 
The OAR properties available on each site are listed on the Monika pages linked from [[Status]] ([https://intranet.grid5000.fr/oar/Nancy/monika.cgi example page for Nancy]). The full list of OAR properties is available on [[OAR_Properties|this page]].
{{Note|text=Since this is using a SQL syntax, quoting is important! Use double quotes to enclose the whole query, and single quotes to write strings within the query.}}


=== Advanced job management topics ===

If a running job turns out to need more time than initially requested, the <code class="command">oarwalltime</code> command lets you ask for a walltime change.

For more details, see the oarwalltime section of the [[Advanced_OAR#Changing_the_walltime_of_a_running_job_.28oarwalltime.29|Advanced OAR]] tutorial.
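
For instance, to request one and a half extra hours for a running job (the job id is illustrative; depending on the schedule, the request may only be partially granted):
{{Term|location=fnancy|cmd=<code class="command">oarwalltime</code> <code class="replace">3241912</code> <code>+1:30</code>}}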
== Using nodes in the default environment ==
When you run <code class="command">oarsub</code>, you gain access to physical nodes with a default (''standard'') software environment. This is a Debian-based system that is regularly updated by the [[Support|technical team]].
=== Storage ===
; Home directory
On each node, the home directory is a network filesystem (NFS): data in your home directory is not actually stored on the node itself, it is stored on a storage server managed by the Grid'5000 team. In particular, it means that all reserved nodes share the same home directory, and it is also shared with the site frontend. For example, you can compile or install software in your home, and it will be usable on all your nodes.
{{Note|text=The home directory is only shared within a site. Two nodes from different sites will not have access to the same home.}}
; /tmp
The <code class="file">/tmp/</code> directory is stored on a local disk of the node.  Use this directory if you need to access data locally.
; Additional local disks
Some nodes have additional local disks, see [[Hardware#Storage]] for a list of available disks for each cluster.
There are two ways to access these local disks:
# On some clusters, '''local disks need to be reserved''' to be accessible. See [[Disk_reservation|Disk reservation]] for a list of these clusters and for documentation on the reservation process.
# On other clusters, '''local disks can be used directly'''. In this case, jump directly to [[Disk_reservation#Using_local_disks_once_connected_on_the_nodes|Using local disks]].
In both cases, the disks are simply provided as raw devices, and it is the responsibility of the user to partition them and create a filesystem. Note that there may still be partitions and filesystems present from a previous job.
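
For instance, here is a minimal sketch to create and mount a filesystem on such a disk, assuming it shows up as <code class="replace">/dev/sdb</code> (the actual device name varies; in the standard environment, root rights are obtained with [[sudo-g5k]]):
{{Term|location=gros-42|cmd=<br>
<code class="command">sudo-g5k</code><code>                  # gain root rights on the node</code><br>
<code class="command">sudo mkfs.ext4</code> <code class="replace">/dev/sdb</code><code>  # create a filesystem (this erases the disk!)</code><br>
<code class="command">sudo mkdir -p /mnt/scratch</code><br>
<code class="command">sudo mount</code> <code class="replace">/dev/sdb</code> <code>/mnt/scratch</code>}}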
; Other storage options
[[Storage|More storage options]] are also available.
=== Getting access to the software you need ===
There are several options to get access to software:
* Many software packages are already installed and directly accessible: Git, editors, GCC, Python, Pip, Ruby, Java, ...
* Some software (mostly scientific software, such as MATLAB) is available through [[Modules|modules]]. For a list, use <code class="command">module avail</code>; a short usage example is shown after this list. Documentation (including how to access license tokens) is available on the [[Modules]] page.
* If the software you need is not available through the above options, you can:
** Install it manually in your home directory
** Get root access on your node using the [[sudo-g5k]] command, and then customize the operating system. The node will be reinstalled at the end of your resource reservation, so that it is in a clean state for the next user. It is thus best to avoid running [[sudo-g5k]] in very short jobs, as this has a cost for the platform.
** Install it using a user-level package manager, such as [[Guix|Guix]] (especially suitable for HPC software) and [[Conda]] (especially suitable for AI software)
** Install it using container technology, with [[Docker]] or [[Singularity|Singularity/Apptainer]]
** [[Virtualization on Grid'5000|Boot a virtual machine image]] on the node
** Re-install the node using a custom image with Kadeploy, as described in the following section
** Engage in a discussion with the [[Support|support team]] to see if the software you need could be added to the software available by default
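
A minimal example of using modules (the <code class="replace">matlab</code> module name is illustrative; check <code class="command">module avail</code> for the names and versions actually provided):
{{Term|location=fnancy|cmd=<br>
<code class="command">module avail</code><code>        # list the available software</code><br>
<code class="command">module load</code> <code class="replace">matlab</code><code>  # make it available in the current shell</code><br>
<code class="command">module list</code><code>         # show the currently loaded modules</code>}}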
You might also be interested in documentation about [[Run_MPI_On_Grid%275000|running MPI programs]], or [[GPUs_on_Grid5000|using GPUs with CUDA or AMD ROCm / HIP]].


== Deploying your nodes to get root access and create your own experimental environment ==
Using <code class="command">oarsub</code> gives you access to resources configured in their default (''standard'') environment, with a set of software selected by the Grid'5000 team. You can use such an environment to run Java or [[Run_MPI_On_Grid%275000|MPI programs]], [[Virtualization on Grid'5000|boot virtual machines with KVM]], or [[Modules|access a collection of scientific-related software]]. However, you cannot deeply customize that software environment.


Most Grid'5000 users use resources in a different, much more powerful way: they use [https://kadeploy.gitlabpages.inria.fr/ Kadeploy] to re-install the nodes with their software environment for the duration of their experiment, using Grid'5000 as a ''Hardware-as-a-Service'' Cloud. This enables them to use a different Debian version, another Linux distribution, or even Windows, and get root access to install the software stack they need.


{{Note|text=There is a tool, called <code class="command">sudo-g5k</code> (see the [[sudo-g5k]] page for details), that provides root access on the ''standard'' environment. It does not allow deep reconfiguration as Kadeploy does, but could be enough if you just need to install additional software, with e.g. <code class="command">sudo-g5k</code><code> apt-get install your-favorite-editor</code>. The node will be transparently reinstalled using Kadeploy after your reservation. Usage of <code class="command">sudo-g5k</code> is logged.}}


=== Deploying a system on nodes with Kadeploy ===


Reserve one node (the <code class="replace">deploy</code> job type is required to allow deployment with Kadeploy):
{{Term|location=fnancy|cmd=<code class="command">oarsub</code><code> -I -l host=1,walltime=1:45 -t </code><code class="replace">deploy</code>}}


Start a deployment of the <code class="env">debian11-min</code> environment on that node (this takes 5 to 10 minutes):
{{Term|location=fnancy|cmd=<code class="command">kadeploy3</code> <code class="env">debian11-min</code>}}
By default, all the nodes of the reservation are deployed. Alternatively, you can use <code class="command">-m</code> to specify a node (such as <code class="command">-m</code> <code class="host">gros-42.nancy.grid5000.fr</code>).


* '''<code class="env">base</code>''' = <code class="env">min</code> + various Grid'5000-specific tuning for performance (TCP buffers for 10 GbE, etc.), and a handful of commonly-needed tools to make the image more user-friendly (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/base changes]). Those could incur an experimental bias.
Kadeploy copies your SSH key from <code class="command">~/.ssh/authorized_keys</code> to the node's root account after deployment, so that you can connect without password. You may want to use another SSH key with <code class="command">-k</code> (such as <code class="command">-k</code> <code class="host">~/custom_authorized_keys</code>).


* '''<code class="env">xen</code>''' = <code class="env">base</code> + Xen hypervisor Dom0 + minimal DomU (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/xen changes]).
==== On Grid'5000 reference environments ====
Grid'5000 reference environments are named according to the following scheme: <code class="replace">OS version</code><code class="command">-</code><code class="replace">variant</code>.
* <code class="replace">OS version</code> is the OS distribution name and version, for instance '''<code class="file">debian11</code>''' (Debian 11 "Bullseye", released on 08/2021), '''<code class="file">ubuntu2204</code>''' (Ubuntu 2204 "Jammy Jellyfish", released on 04/2022), or '''<code class="file">centosstream9</code>''' (CentOS Stream 9, clone of RHEL, released on 12/2021), or '''<code class="file">rocky9</code>''' (Rocky Linux 9, released on 07/2022)
*  <code class="replace">variant</code> defines the set of features included in the environment, as follows (for the <code class="env">x86_64</code> architecture -- upport might differ on more experimental architectures like <code class="env">ppc64le</code> (POWER processors) and <code class="env">aarch64</code> (ARM64 processors)) :


* '''<code class="env">nfs</code>''' = <code class="env">base</code> + support for mounting your NFS home and accessing other storage services (Ceph), and using your Grid'5000 user account on deployed nodes (LDAP) (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/nfs changes]).
{|class="wikitable"
|-
!style=height:1em  rowspan="2"|Variant
!rowspan="2"|OS available
!colspan="6"|Installed tools
! rowspan="2"| Network storage
! rowspan="2"|HPC networks support<br>(Infiniband, Omni-Path)
! rowspan="2"|Grid'5000-specific tuning<br>for performance<br>(e.g., TCP buffers for 10 GbE)
|-
! | Standard system<br>utilities*
! | Common<br>utilities**
! | Advanced<br>packages***
! | [[Modules| Scientific software<br/>available via ''module'']]
! | [[Guix| ''Guix''<br> package manager]]
! | [[Conda| ''Conda''<br> package manager]]
|-
! style=height:4em  rowspan="2" |min
|style="text-align: center; font-weight:bold; background-color:#ffffff;" |Debian 10,11,12,testing
|style="text-align: center; background-color:#ffffff;"  rowspan="2" |  [[Image:Check.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center;  background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center; font-weight:bold; background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|style="text-align: center; font-weight:bold; background-color:#ffffff;"  rowspan="2" | [[Image:NoStarted.png]]
|-
!style="text-align: center; font-weight:bold; background-color:#ffffff;" | Ubuntu, CentOS, etc.  
|-
!style=height:4em  rowspan="2" |nfs
!style="text-align: center; font-weight:bold; background-color:#ffffff;" |Debian 10,11,12
|style="text-align: center; background-color:#ffffff;" rowspan="2"  | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" rowspan="2"  | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | partial support****
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" rowspan="4"| Support for:
- mounting your home and group<br>storage.
- using your Grid'5000 user account<br>on nodes.
And for the standard environment:
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|-
|style="text-align: center; font-weight:bold; background-color:#ffffff;" |Debian testing, Ubuntu, CentOS, etc.
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:NoStarted.png]]
|-
!style=height:4em|big
|style="text-align: center; font-weight:bold; background-color:#ffffff;" |Debian 10,11,12
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | partial support****
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|-
!style=" height:4em text-align: center; font-weight:bold;" colspan="2"|default environment without deployment,<br> based on Debian 11
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|style="text-align: center; background-color:#ffffff;" | [[Image:Check.png]]
|}


* '''<code class="env">std</code>''' = <code class="env">big</code> + integration with OAR. Currently, it is the <code class="command">debian10-x64-std</code> environment which is used on the nodes if you or another user did not "kadeploy" another environment (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/std changes]).
<small>'''<nowiki>*</nowiki>''' ''Including SSH server and network drivers.''<br>
<!-- see https://www.grid5000.fr/mediawiki/index.php/New_maintenance_strategy_of_reference_images for details -->
'''<nowiki>**</nowiki>''' ''Including among others: Python, Ruby, curl, git, vim, etc.''<br>
'''<nowiki>***</nowiki>''' ''Packages for development, system tools, editors and shells.''<br>
'''<nowiki>****</nowiki>''' ''Supported modules include Conda and Singularity. Others might work, with no guarantee.''</small>


The list of all supported environments is available by running <code class="command">kaenv3</code> on any frontend. Note that environments are versioned: former versions can be listed with <code class="command">kaenv3 -l -s</code>, and a specific version can be retrieved and used by adding the <code class="command">--env-version</code> <code class="replace">YYYYMMDDHH</code> option to the <code class="command">kaenv3</code> or <code class="command">kadeploy3</code> commands (also see the <code class="command">man</code> pages). This can be useful to reproduce experiments months or years later, using a previous version of an environment. On some sites, environments also exist for different architectures (<code class="env">x86_64</code>, <code class="env">ppc64le</code> and <code class="env">aarch64</code>). The full list can be found on the [[Advanced_Kadeploy#Search_an_environment|Advanced Kadeploy]] page.


The Grid'5000 reference environments are built using the '''<code class="command">kameleon</code>''' tool, from recipes detailing the whole construction process, and are updated on a regular basis (see the versions listed by <code class="command">kaenv3</code>). See the [[Environment creation]] page for details.

=== Customizing nodes and accessing the Internet ===
Now that your nodes are deployed, the next step is usually to copy data (usually using <code class="command">scp</code> or <code class="command">rsync</code>) and install software.

Deployed nodes are directly reachable as root over SSH, since Kadeploy copied your key to the root account. For instance (the node name is illustrative):
{{Term|location=fnancy|cmd=<code class="command">ssh</code> <code>root@</code><code class="host">gros-42.nancy.grid5000.fr</code>}}
You can then install the software you need, e.g.:
{{Term|location=gros-42|cmd=<code class="command">apt-get</code><code> install </code><code class="replace">stress</code>}}



Installing all the software required for your experiment can be quite cumbersome. Several approaches can be taken to address this:
* Deploy one of the reference environments, and automate the installation of your software environment after it is deployed, using a simple bash script (a minimal sketch is shown after this list) or more advanced configuration management tools such as [https://www.ansible.com/ Ansible], [http://puppetlabs.com/ Puppet] or [http://www.opscode.com/chef/ Chef].
* Or build a custom environment including all your requirements, then deploy it ready to use on all nodes. See the '''[[Environment creation]]''' page for more information.
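
As mentioned above, here is a minimal provisioning sketch in bash, to be run from the frontend after deployment (the package list is illustrative):
<syntaxhighlight lang="bash">
#!/usr/bin/env bash
# Install a few packages on every node of the current deploy job.
set -e
for node in $(uniq "$OAR_NODE_FILE"); do
    ssh root@"$node" "apt-get update && apt-get install -y stress"
done
</syntaxhighlight>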


=== Checking nodes' changes over time ===

=== Cleaning up after your reservation ===
At the end of your resource reservation, the infrastructure will automatically reboot the nodes to put them back in the default (''standard'') environment. There is no action needed on your side.
== Using Grid'5000 efficiently ==
Until now, you have been logging in and submitting jobs to Grid'5000 manually. This is convenient for learning, prototyping, and exploring ideas, but it quickly becomes tedious when performing a set of experiments on a daily basis. To be more efficient, Grid'5000 also supports more convenient ways of submitting jobs, such as [[API|API requests]] and [[Notebooks|computational notebooks]].
=== A quick example of Grid'5000 API usage with Python requests ===
There are many ways to send requests to an API. In this section, we present two examples based on the <code class="command">requests</code> Python package, which provides a quick and easy way to send HTTP requests. Its documentation can be found at https://docs.python-requests.org/en/latest/. The package is already available on Grid'5000 frontends and in the nodes' default environment, and can be installed elsewhere with <code class=command>pip install requests</code>.
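The API can also be queried with standard HTTP tools. For instance, from a frontend (where requests need no authentication), and assuming the <code class="command">jq</code> JSON processor is available:
{{Term|location=fnancy|cmd=<code class="command">curl</code> <code>-s https://api.grid5000.fr/stable/sites <nowiki>|</nowiki></code> <code class="command">jq</code> <code>'.items[].uid'</code>}}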
==== Retrieving information from the API with Python ====
Here is a simple script that fetches the names of all clusters of all sites:
<syntaxhighlight lang="python">
import os
import requests

# When running inside Grid'5000 (e.g. on a frontend), no password is
# needed: requests are then sent unauthenticated.
user = input(f"Grid'5000 username (default is {os.getlogin()}): ") or os.getlogin()
password = input("Grid'5000 password (leave blank on frontends): ")
g5k_auth = (user, password) if password else None

# Fetch the list of sites, then the list of clusters of each site
sites = requests.get("https://api.grid5000.fr/stable/sites", auth=g5k_auth).json()["items"]
print("Grid'5000 sites:")
for site in sites:
    site_id = site["uid"]
    print(site_id + ":")
    site_clusters = requests.get(
        f"https://api.grid5000.fr/stable/sites/{site_id}/clusters",
        auth=g5k_auth,
    ).json()["items"]
    for cluster in site_clusters:
        print("-", cluster["uid"])
</syntaxhighlight>
This script can be launched as follows (it will prompt for your credentials; leave the password blank when running on a frontend):
{{Term|location=outside|cmd=<code class="command">python3</code> <code class="replace">my-script.py</code>}}
==== Scripting job submission with Python ====
By scripting API calls, you can easily control the whole lifecycle of your jobs.
The following script submits a job requesting one ''taurus'' node in Lyon to echo a message (the output is redirected into a file <code>api-test-stdout</code> in your home directory).
<syntaxhighlight lang="python">
import os
import requests
from time import sleep

user = input(f"Grid'5000 username (default is {os.getlogin()}): ") or os.getlogin()
password = input("Grid'5000 password (leave blank on frontends): ")
g5k_auth = (user, password) if password else None

site_id = "lyon"
cluster = "taurus"
api_job_url = f"https://api.grid5000.fr/stable/sites/{site_id}/jobs"

# Job description: one node of the chosen cluster, running a single command
payload = {
    "resources": "nodes=1",
    "command": 'echo "APIs are awesome !"',
    "stdout": "api-test-stdout",
    "properties": f"cluster='{cluster}'",
    "name": "api-test",
}

# Submit the job
job = requests.post(api_job_url, data=payload, auth=g5k_auth).json()
job_id = job["uid"]
print(f"Job submitted ({job_id})")

# Give the job some time to be scheduled and to run
sleep(60)

# If the job has still not finished, delete it
state = requests.get(api_job_url + f"/{job_id}", auth=g5k_auth).json()["state"]
if state != "terminated":
    requests.delete(api_job_url + f"/{job_id}", auth=g5k_auth)
    print("Job deleted.")
</syntaxhighlight>
This script can be launched the same way as the previous one:
{{Term|location=outside|cmd=<code class="command">python3</code> <code class="replace">my-script.py</code>}}
{{Warning|text=Keep in mind that you are sharing clusters with other users. Please take the time to carefully debug your scripts, so that you don't reserve more resources than your experiment requires.}}
Scripting your experiments is an important step towards reproducibility and efficiency: by scripting API calls, you can automate your whole experiment.
If you are interested in using the Grid'5000 API, a tutorial is available on the '''[[API_tutorial]]''' page.
Another way to use the Grid'5000 API from Python is the '''[https://pypi.org/project/python-grid5000/ python-grid5000]''' package, or its higher-level counterpart '''[https://discovery.gitlabpages.inria.fr/enoslib/ EnOSlib]'''.
You can also read the [[Experiment scripting tutorial]], which presents several scripting libraries built on top of the Grid'5000 API.
=== Notebooks ===
Grid'5000 also supports Jupyter notebooks and JupyterLab servers.
JupyterLab servers provide a simple web interface to submit jobs on Grid'5000 and run Python notebooks.
Notebooks let you track the evolution of your experiment during the exploratory phase, while already scripting part of your process.
You can find more information about JupyterLab and Python notebooks on the '''[[Notebooks]]''' page.


== Going further ==

Latest revision as of 16:38, 25 March 2024

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This tutorial will guide you through your first steps on Grid'5000. Before proceeding, make sure you have a Grid'5000 account (if not, follow this procedure), and an SSH client.

Getting support

The Support page describes how to get help during your Grid'5000 usage.

There's also an FAQ page with the most common commands.

Connecting for the first time

The primary way to move around Grid'5000 is using SSH. A reference page for SSH is also maintained with advanced configuration options that frequent users will find useful.

As described in the figure below, when using Grid'5000, you will typically:

  1. connect, using SSH, to an access machine
  2. connect from this access machine to a site frontend
  3. on this site frontend, reserve resources (nodes), and connect to those nodes
Grid5000 Access

Grid'5000 for Microsoft Windows Users

Documentation for users using Microsoft Windows on Grid'5000 is available : Grid5000_for_Microsoft_Windows_users

SSH connection through a web interface

If you want an out-of-the-box solution which does not require you to setup SSH, you can connect through a web interface. The interface is available at https://intranet.grid5000.fr/shell/SITE/. For example, to access nancy's site, use: https://intranet.grid5000.fr/shell/nancy/ To connect you will have to type in your credentials twice (first for the HTTP proxy, then for the SSH connection).

This solution is probably suitable to follow this tutorial, but is unlikely to be suitable for real Grid'5000 usage. So you should probably read the next sections about how to setup and use SSH at some point.

Connect to a Grid'5000 access machine

To enter the Grid'5000 network from Internet, one must use an access machine: access.grid5000.fr (Note that access.grid5000.fr is a round robin alias to either: access-north which is currently hosted in Lille, or access-south currently hosted in Sophia-Antipolis).

For all connections, you must use the login that was provided to you when you created your Grid'5000 account.

Terminal.png outside:
ssh login@access.grid5000.fr

You will get authenticated using the SSH public key you provided in the account creation form. Password authentication is disabled.

Note.png Note

You can modify your SSH keys in the account management interface

Connecting to a Grid'5000 site

Grid'5000 is structured in sites (Grenoble, Rennes, Nancy, ...). Each site hosts one or more clusters (homogeneous sets of machines, usually bought at the same time).

To connect to a particular site, do the following (blue and red arrow labeled SSH in the figure above).

Terminal.png access:
ssh site
Home directories

You have a different home directory on each Grid'5000 site, so you will usually use Rsync or scp to move data around. On access machines, you have direct access to each of those home directories, through NFS mounts (but using that feature to transfer very large volumes of data is inefficient). Typically, to copy a file to your home directory on the Nancy site, you can use:

Terminal.png outside:
scp myfile.c login@access.grid5000.fr:nancy/targetdirectory/mytargetfile.c

Grid'5000 does NOT have a BACKUP service for users' home directories: it is your responsibility to save important data somewhere outside Grid'5000 (or at least to copy data to several Grid'5000 sites in order to increase redundancy).

Quotas are applied on home directories -- by default, you get 25 GB per Grid'5000 site. If your usage of Grid'5000 requires more disk space, it is possible to request quota extensions in the account management interface, or to use other storage solutions (see Storage).

Recommended tips and tricks for an efficient use of Grid'5000

Better exploit SSH and related tools

There are also several recommended tips and tricks for SSH and related tools (more details in the SSH page).

  • Configure SSH aliases using the ProxyCommand option. Using this, you can avoid the two-hop connection (access machine, then frontend) and establish connections directly to frontends. This requires using OpenSSH, which is the SSH software available on all GNU/Linux systems, macOS, and also recent versions of Microsoft Windows.
Note.png Note

Please really take the time to set up the following SSH configuration on the workstation or laptop from which you access Grid'5000 (outside). It makes many tasks significantly easier and will save you time if you use Grid'5000 on a regular basis.

Terminal.png outside:
editor ~/.ssh/config
Host g5k
  User login
  Hostname access.grid5000.fr
  ForwardAgent no

Host *.g5k
  User login
  ProxyCommand ssh g5k -W "$(basename %h .g5k):%p"
  ForwardAgent no

Reminder: login is your Grid'5000 username

Warning: the ProxyCommand line works if your login shell is bash. If not you may have to adapt it. For instance, for the fish shell, this line must be: ProxyCommand ssh g5k -W (basename %h .g5k):%p.

Once done, you can establish connections to any machine (first of all: frontends) inside Grid'5000 directly, by suffixing .g5k to its hostname (instead of first having to connect to an access machine). E.g.:

Terminal.png outside:
ssh rennes.g5k
Terminal.png outside:
scp a_file lille.g5k:
  • Use rsync instead of scp for better performance with multiple files (see the rsync example below).
  • Access your data from your laptop using SSHFS
  • Edit files over SSH with your favorite text editor, with e.g.:
Terminal.png outside:
vim scp://nancy.g5k/my_file.c
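
For instance, here is a minimal rsync invocation to mirror a local directory to your home on the Nancy site (the directory names are just examples):

Terminal.png outside:
rsync -avP mydata/ nancy.g5k:mydata/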

There is more in this talk from Grid'5000 School 2010, and in this talk more specifically focused on SSH.

  • For better bandwidth or latency, you may also be able to connect directly via the local access machine of one of the Grid'5000 sites.

Local accesses use access.site.grid5000.fr instead of access.grid5000.fr. However, note that per-site access restrictions apply: see External access for details about local access machines.

VPN

A VPN service is also available, allowing to connect directly to any Grid'5000 machines (bypassing the access machines). See the VPN page for more information.

HTTP reverse proxies

If you only require HTTP/HTTPS access to a node, a reverse HTTP proxy is also available, see the HTTP/HTTPs_access page.

Bash prompt

It is possible to modify your bash prompt to display useful information related to your current job, such as its job id, the reserved nodes and the remaining time.

Terminal.png fnancy:
jdoe@fnancy:~$ oarsub -C 3241912
Terminal.png grisou-15:
Connect to OAR job 3241912 via the node grisou-15.nancy.grid5000.fr
[OAR] OAR_JOB_ID=3241912
[OAR] Your nodes are:
      grisou-15.nancy.grid5000.fr*16

[jdoe@grisou-15 ~](3241912-->57mn)$ sleep 1m
[jdoe@grisou-15 ~](3241912-->55mn)$

If you are interested, you will find here all the information you need to set up such a prompt.
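
As a starting point, here is a minimal sketch (to adapt to your taste) that adds the job id to your prompt when running inside a job, based on the OAR_JOB_ID environment variable set by OAR; displaying the remaining time, as shown above, requires the extra scripting described in the page linked here:

# ~/.bashrc (sketch): show the OAR job id in the shell prompt when inside a job
if [ -n "$OAR_JOB_ID" ]; then
    PS1="[\u@\h \W]($OAR_JOB_ID)\$ "
fi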

Discovering, visualizing and reserving Grid'5000 resources

At this point, you should be connected to a site frontend, as indicated by your shell prompt (login@fsite:~$). This machine will be used to reserve and manipulate resources on this site, using the OAR software suite.

Discovering and visualizing resources

There are several ways to learn about the site's resources and their status:

  • The site's MOTD (message of the day) lists all clusters and their features. Additionally, it gives the list of current or future downtimes due to maintenance, which is also available from https://www.grid5000.fr/status/.
  • Site pages on the wiki (e.g. Nancy:Home) contain a detailed description of the site's hardware and network
  • The Status page links to the resource status on each site, with two different visualizations available, such as the Drawgantt view:
Example of Drawgantt in Nancy site
  • Hardware pages contain a detailed description of the site's hardware

Reserving resources with OAR: the basics

Note.png Note

OAR is the resources and jobs management system (a.k.a batch manager) used in Grid'5000, just like in traditional HPC centers. However, settings and rules of OAR that are configured in Grid'5000 slightly differ from traditional batch manager setups in HPC centers, in order to match the requirements for an experimentation testbed. Please remember to read again Grid'5000 Usage Policy to understand the expected usage.

In Grid'5000 the smallest unit of resource managed by OAR is the core (CPU core), but by default an OAR job reserves a host (a physical computer including all its CPUs and cores, and possibly GPUs). Hence, what OAR calls nodes are hosts (physical machines). In the oarsub resource request (-l arguments), nodes is an alias for host, so both are equivalent. But prefer using host, for consistency with other arguments and with other tools that expose host, not nodes.

Note.png Note

Most of this tutorial uses the site of Nancy (with the frontend: fnancy), but other sites can be used alternatively.

Interactive usage

To reserve a single host (one node) for one hour, in interactive mode, do:

Terminal.png fnancy:
oarsub -I

As soon as the resource becomes available, you will be directly connected to the reserved resource with an interactive shell, as indicated by the shell prompt, and you can run commands on the node:

Terminal.png grisou-1:
lscpu
Reserving only part of a node

To reserve only one CPU core in interactive mode, run:

Terminal.png fnancy:
oarsub -l core=1 -I
Note.png Note

When reserving only a share of the node's cores, you get a share of the memory proportional to the share of cores. If you take the whole node, you get all the memory of the node. If you take half the cores, you get half the memory, and so on. You cannot reserve an amount of memory explicitly.

When reserving several CPU cores, there is no guarantee that they will be allocated on a single node. To ensure this, you need to specify that you want a single host:

Terminal.png fnancy:
oarsub -l host=1/core=8 -I
Non-interactive usage (scripts)

You can also simply launch your experiment along with your reservation:

Terminal.png fnancy:
oarsub -l host=1/core=1 "my_mono_threaded_script.py --in $HOME/data --out $HOME/results"

Your program will be executed as soon as the requested resources are available. As this type of job is not interactive, you will have to check for its termination using the oarstat command.
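
For example, you can list your jobs, or query the state (Waiting, Running, Terminated, ...) of a specific one (the job id below is illustrative):

Terminal.png fnancy:
oarstat -u
Terminal.png fnancy:
oarstat -s -j 3241912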


Batch job using OAR scripts

As is standard with batch schedulers (e.g. SLURM), a good practice is to use a script that includes the OAR directives defining the resource request. Here is a simple example of such a script, which selects a GPU with specific characteristics.

The list of properties can be found on the OAR Properties and OAR_Syntax_simplification pages.

#!/bin/bash 

#OAR -q production 
#OAR -l host=1/gpu=1
#OAR -l walltime=3:00:00
#OAR -p gpu-16GB
#OAR -p gpu_compute_capability_major>=5
#OAR -O OAR_%jobid%.out
#OAR -E OAR_%jobid%.err 

# display some information about attributed resources
hostname 
nvidia-smi 
 
# make use of a python torch environment
module load conda
conda activate pytorch_env
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))";

The script must be executable:

Terminal.png frennes:
chmod u+x ./my_script_oar.sh

and can be called from the frontend using

Terminal.png frennes:
oarsub -S ./my_script_oar.sh

and will start as soon as resources become available.
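
Per the #OAR -O and #OAR -E directives above, the job's standard output and error are written to files on the frontend, which you can inspect once the job has run (the job id is illustrative):

Terminal.png frennes:
cat OAR_3241912.out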

Other types of resources

To reserve only one GPU (with the associated CPU cores and share of memory) in interactive mode, run:

Terminal.png flille:
oarsub -l gpu=1 -I
Note.png Note

Even if the node has several GPUs, this reservation will only be able to access a single one. It's a good practice if you only need one GPU: other users will be able to run jobs on the same node to access the other GPUs. Of course, if you need all GPUs of a node, you have the option to reserve the entire node which includes all its GPUs.

Or in Nancy, where GPUs are only available in the production queue:

Terminal.png fnancy:
oarsub -l gpu=1 -I -q production

To reserve several GPUs and ensure they are located in a single node, make sure to specify host=1:

Terminal.png flille:
oarsub -l host=1/gpu=2 -I
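
Once connected, a quick sanity check is to list the GPUs actually visible to your job; with the reservation above you should see exactly two entries (the hostname below is just an example of a GPU node):

Terminal.png chifflet-3:
nvidia-smi -L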
Tips and tricks

To terminate your reservation and return to the frontend, simply exit this shell by typing exit or CTRL+d:

Terminal.png graffiti-1:
exit

To avoid unanticipated termination of your jobs in case of errors (terminal closed by mistake, network disconnection), you can either use tools such as tmux or screen, or reserve and connect in two steps using the job id associated with your reservation. First, reserve a node, and run a sleep command that does nothing for an unlimited time:

Terminal.png fnancy:
oarsub "sleep infinity"

Of course, the job will not run for an infinite time: the command will be killed when the job expires.

Then:

Terminal.png fnancy:
oarsub -C job_id
Terminal.png grisou-42:
hostname && ps -ef | grep sleep

java -version
mpirun --version
module available # List scientific-related software available using module
whoami

env | grep OAR # discover environment variables set by OAR
Choosing the job duration

Of course, you might want to run a job for a different duration than one hour. The -l option allows you to pass a comma-separated list of parameters specifying the needed resources for the job, and walltime is a special resource defining the duration of your job:

Terminal.png fnancy:
oarsub -l host=1/core=2,walltime=0:30 -I

The walltime is the duration you expect your work to take. Its format is [hour:min:sec|hour:min|hour]. For instance:

  • walltime=5 => 5 hours
  • walltime=1:22 => 1 hour and 22 minutes
  • walltime=0:03:30 => 3 minutes and 30 seconds
Working with more than one node

You will probably want to use more than one node on a given site.

To reserve two hosts (two nodes), in interactive mode, do:

Terminal.png fnancy:
oarsub -l host=2 -I

or equivalently (nodes is an alias for host):

Terminal.png fnancy:
oarsub -l nodes=2 -I

You will obtain a shell on the first node of the reservation. It is up to you to connect to the other nodes and distribute work among them. By default, you can only connect to nodes that are part of your reservation. If you completely own the nodes within one job (or with one job per complete node), you can connect to them using ssh. For nodes that are not completely owned within a job (if you have reserved only part of a node, or have multiple jobs on a node), you have to use the oarsh connector to go from one node to the other. The connector supports the same options as the classical ssh command, so it can be used as a replacement for software expecting ssh.

Terminal.png gros-49:

uniq $OAR_NODEFILE # list of resources of your reservation
ssh gros-1 # try to connect to a node not in the file (should fail)
oarsh gros-54 # connect to the other node of your reservation (should work)

ssh gros-54 # connect to the other node of your reservation (should work)
Note.png Note

To take advantage of several nodes and distribute work between them, a good option is GNU_Parallel.
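
For instance, here is a sketch using GNU Parallel to run one command on every node of the reservation (this assumes you fully own the nodes, so plain ssh works; otherwise add --ssh oarsh):

Terminal.png gros-49:
uniq $OAR_NODEFILE > ~/nodes.txt # one line per node instead of one per core
Terminal.png gros-49:
parallel --sshloginfile ~/nodes.txt --nonall hostname # run the command once on each node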

oarsh is a wrapper around ssh that enables the tracking of user jobs inside compute nodes (for example, to enforce the correct sharing of resources when two different jobs share a compute node). If your application does not support choosing a different connector, be sure to reserve nodes entirely (which is the default with oarsub) to be able to use ssh.

Selecting specific resources

So far, all examples have been letting OAR decide which resource to allocate to a job. It is possible to obtain a finer-grained control of the allocated resources, by using filters.

Selecting nodes from a specific cluster or cluster type
  • Reserve nodes from a specific cluster
Terminal.png fgrenoble:
oarsub -p dahu -l host=2,walltime=2 -I
Terminal.png fnancy:
oarsub -q production -p grappe -l host=2,walltime=2 -I
  • Reserve nodes from an exotic cluster type
Terminal.png flyon:
oarsub -t exotic -p pyxis -l host=2,walltime=2 -I

Clusters with the exotic type either have a non-x86 architecture or are specific enough to warrant this type. Resources with an exotic type are never selected by default by OAR. Using -t exotic is required to obtain such resources.

The type of a cluster can be identified on the Hardware pages, see for instance Lyon:Hardware.

Warning.png Warning

When using the -t exotic option, you can still obtain non-exotic resources! You should filter on the cluster name or other properties if you want exclusively exotic resources.


Selecting specific nodes

If you know the exact node you want to reserve, you can specify the hostname of the node you require:

Terminal.png fgrenoble:
oarsub -p dahu-12 -l host=1,walltime=2 -I

If you want several specific nodes, you can use a list:

Terminal.png fgrenoble:
oarsub -p "host IN (dahu-5, dahu-12)" -l host=2,walltime=2 -I


Using OAR properties

The OAR nodes database contains a set of properties for each node, and the -p option actually filters based on these properties:

  • Nodes with Infiniband FDR interfaces:
Terminal.png fnancy:
oarsub -p "ib=FDR" -l host=5,walltime=2 -I
  • Nodes with power sensors and GPUs:
Terminal.png flyon:
oarsub -p "wattmeter=YES AND gpu_count > 0" -l host=2,walltime=2 -I
  • Nodes with 2 GPUs:
Terminal.png flille:
oarsub -p "gpu_count = 2" -l host=3,walltime=2 -I
  • Nodes with a specific CPU model:
Terminal.png flille:
oarsub -p "cputype = 'Intel Xeon E5-2630 v4'" -l host=3,walltime=2 -I
  • Since -p accepts SQL, you can write advanced queries:
Terminal.png fnancy:
oarsub -p "wattmeter=YES AND host NOT IN (graffiti-41, graffiti-42)" -l host=5,walltime=2 -I
Terminal.png flille:
oarsub -p "cputype LIKE 'AMD%'" -l host=3,walltime=2 -I

The OAR properties available on each site are listed on the Monika pages linked from Status (example page for Nancy). The full list of OAR properties is available on this page.

Note.png Note

Since this is using a SQL syntax, quoting is important! Use double quotes to enclose the whole query, and single quotes to write strings within the query.

Advanced job management topics

Reservations in advance

By default, oarsub will give you resources as soon as possible: once submitted, your request enters a queue. This is good for non-interactive work (when you do not care when exactly it will be scheduled), or when you know that the resources are available immediately.

You can also reserve resources at a specific time in the future, typically to perform large reservations over nights and weekends, with the -r parameter:

Terminal.png fnancy:
oarsub -l host=3,walltime=3 -r '2020-12-23 16:30:00'
Note.png Note

Remember that all your resource reservations must comply with the Usage Policy. You can verify your reservations' compliance with the Policy with usagepolicycheck -t.

Job management

To list jobs currently submitted, use the oarstat command (use the -u option to see only your jobs). A job can be deleted with:

Terminal.png fnancy:
oardel 12345
Extending the duration of a reservation

Provided that the resources are still available after your job, you can extend its duration (walltime) using e.g.:

Terminal.png fnancy:
oarwalltime 12345 +1:30

This will request to add one hour and a half to job 12345.

For more details, see the oarwalltime section of the Advanced OAR tutorial.
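
As a quick check, running oarwalltime with just a job id (illustrative here) should display the job's current walltime and the status of any pending change request:

Terminal.png fnancy:
oarwalltime 12345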

Using nodes in the default environment

When you run oarsub, you gain access to physical nodes with a default (standard) software environment. This is a Debian-based system that is regularly updated by the technical team.

Storage

Home directory

On each node, the home directory is a network filesystem (NFS): data in your home directory is not actually stored on the node itself, it is stored on a storage server managed by the Grid'5000 team. In particular, it means that all reserved nodes share the same home directory, and it is also shared with the site frontend. For example, you can compile or install software in your home, and it will be usable on all your nodes.

Note.png Note

The home directory is only shared within a site. Two nodes from different sites will not have access to the same home.

/tmp

The /tmp/ directory is stored on a local disk of the node. Use this directory if you need to access data locally.

Additional local disks

Some nodes have additional local disks, see Hardware#Storage for a list of available disks for each cluster.

There are two ways to access these local disks:

  1. On some clusters, local disks need to be reserved to be accessible. See Disk reservation for a list of these clusters and for documentation on the reservation process.
  2. On other clusters, local disks can be used directly. In this case, jump directly to Using local disks.

In both cases, the disks are simply provided as raw devices, and it is the responsibility of the user to partition them and create a filesystem. Note that there may still be partitions and filesystems present from a previous job.
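
As a sketch of those last steps on the standard environment (the device name varies between clusters, so check with lsblk first; mkfs erases any previous content; root rights are obtained with sudo-g5k, described further below):

Terminal.png gros-42:
lsblk # identify the additional disk, e.g. /dev/sdb
Terminal.png gros-42:
sudo-g5k mkfs.ext4 /dev/sdb
Terminal.png gros-42:
sudo-g5k mount /dev/sdb /mnt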

Other storage options

More storage options are also available.

Getting access to the software you need

There are several options to get access to software:

  • Many software packages are already installed and directly accessible: Git, editors, GCC, Python, Pip, Ruby, Java, ...
  • Some software (mostly scientific software, such as MatLab) is available through modules. For a list, use module avail. Documentation (including how to access license tokens) is available in the Modules page. A short example is shown after this list.
  • If the software you need is not available through the above options, you can:
    • Install it manually in your home directory
    • Get root access on your node using the sudo-g5k command, and then customize the operating system. The node will be reinstalled at the end of your resource reservation, so that it is in a clean state for the next user. It is thus best to avoid running sudo-g5k in very short jobs, as this has a cost for the platform.
    • Install it using a user-level package manager, such as Guix (especially suitable for HPC software) and Conda (especially suitable for AI software)
    • Install it using container technology, with Docker or Singularity/Apptainer
    • Boot a virtual machine image on the node
    • Re-install the node using a custom image with Kadeploy, as described in the following section
    • Engage in a discussion with the support team to see if the software you need could be added to the software available by default
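
For instance, here is the modules option in action with the Conda module (also used in the OAR script example earlier); this is just a quick check, see the Modules page for real usage:

Terminal.png fnancy:
module load conda
Terminal.png fnancy:
conda --version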

You might also be interested in documentation about running MPI programs, or using GPUs with CUDA or AMD ROCm / HIP.

Deploying your nodes to get root access and create your own experimental environment

Using oarsub gives you access to resources configured in their default (standard) environment, with a set of software selected by the Grid'5000 team. You can use such an environment to run Java or MPI programs, boot virtual machines with KVM, or access a collection of scientific-related software. However, you cannot deeply customize the software environment.

Most Grid'5000 users use resources in a different, much more powerful way: they use Kadeploy to re-install the nodes with their software environment for the duration of their experiment, using Grid'5000 as a Hardware-as-a-Service Cloud. This enables them to use a different Debian version, another Linux distribution, or even Windows, and get root access to install the software stack they need.

Note.png Note

There is a tool, called sudo-g5k (see the sudo-g5k page for details), that provides root access on the standard environment. It does not allow deep reconfiguration as Kadeploy does, but could be enough if you just need to install additional software, with e.g. sudo-g5k apt-get install your-favorite-editor. The node will be transparently reinstalled using Kadeploy after your reservation. Usage of sudo-g5k is logged.

Deploying a system on nodes with Kadeploy

Reserve one node (the deploy job type is required to allow deployment with Kadeploy):

Terminal.png fnancy:
oarsub -I -l host=1,walltime=1:45 -t deploy

Start a deployment of the debian11-min environment on that node (this takes 5 to 10 minutes):

Terminal.png fnancy:
kadeploy3 debian11-min

By default, all the nodes of the reservation are deployed. Alternatively, you can use -m to specify a node (such as -m gros-42.nancy.grid5000.fr).

Kadeploy copies your SSH key from ~/.ssh/authorized_keys to the node's root account after deployment, so that you can connect without password. You may want to use another SSH key with -k (such as -k ~/custom_authorized_keys).

On Grid'5000 reference environments

Grid'5000 reference environments are named according to the following scheme: OS version-variant.

  • OS version is the OS distribution name and version, for instance debian11 (Debian 11 "Bullseye", released on 08/2021), ubuntu2204 (Ubuntu 22.04 "Jammy Jellyfish", released on 04/2022), centosstream9 (CentOS Stream 9, clone of RHEL, released on 12/2021), or rocky9 (Rocky Linux 9, released on 07/2022)
  • variant defines the set of features included in the environment, as follows (for the x86_64 architecture -- support might differ on more experimental architectures like ppc64le (POWER processors) and aarch64 (ARM64 processors)):
| Variant | OS available | Standard system utilities* | Common utilities** | Advanced packages*** | Scientific software via module | Guix package manager | Conda package manager | Network storage | HPC networks support (Infiniband, Omni-Path) | Grid'5000-specific tuning for performance (e.g., TCP buffers for 10 GbE) |
|---------|--------------|----------------------------|--------------------|-----------------------|--------------------------------|-----------------------|------------------------|-----------------|----------------------------------------------|---------------------------------------------------------------------------|
| min | Debian 10, 11, 12, testing; Ubuntu, CentOS, etc. | yes | no | no | no | no | no | no | no | no |
| nfs | Debian 10, 11, 12 | yes | yes | no | partial support**** | yes | yes | mounting your home and group storage; using your Grid'5000 user account on nodes | yes | yes |
| nfs | Debian testing, Ubuntu, CentOS, etc. | yes | yes | no | no | no | no | no | no | no |
| big | Debian 10, 11, 12 | yes | yes | yes | partial support**** | yes | yes | mounting your home and group storage; using your Grid'5000 user account on nodes | yes | yes |
| standard (default environment without deployment, based on Debian 11) | Debian 11 | yes | yes | yes | yes | yes | yes | mounting your home and group storage; using your Grid'5000 user account on nodes | yes | yes |

* Including SSH server and network drivers.
** Including among others: Python, Ruby, curl, git, vim, etc.
*** Packages for development, system tools, editors and shells.
**** Supported modules include Conda and Singularity. Others might work, with no guarantee.

The list of all supported environments is available by running kaenv3 on any frontend. Note that environments are versioned: old versions can be listed using the kaenv3 -l -s command, and a former version can be retrieved and used by adding the --env-version YYYYMMDDHH option to the kaenv3 or kadeploy3 commands (also see the man pages). This can be useful to reproduce experiments months or years later, using a previous version of an environment. On some sites, environments exist for different architectures (x86_64, ppc64le and aarch64). The full list can be found on the Advanced Kadeploy page.

The Grid'5000 reference environments are built using the kameleon tool from recipes detailing the whole construction process, and updated on a regular basis (see versions). See the Environment creation page for details.

Customizing nodes

Now that your nodes are deployed, the next step is typically to copy data (using scp or rsync) and install software.

First, connect to the node as root:

Terminal.png fnancy:
ssh root@gros-42

You can access websites outside Grid'5000: for example, to fetch the Linux kernel sources:
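
A possible command (the kernel version and URL are merely illustrative):

Terminal.png gros-42:
curl -LO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.tar.xz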

Warning.png Warning

Please note that, for legal reasons, your Internet activity from Grid'5000 is logged and monitored.

Let's install stress (a simple load generator) on the node from Debian's APT repositories:

Terminal.png gros-42:
apt-get install stress
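
You can then generate some CPU load for a bounded amount of time, for instance:

Terminal.png gros-42:
stress --cpu 4 --timeout 60 # four CPU-bound workers for 60 seconds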

Installing all the software required for your experiment can be quite cumbersome. Several approaches can be taken to address this:

  • Deploy one of the reference environments, and automate the installation of your software environment after it is deployed (one may use a simple bash script, or more advanced tools for configuration management such as Ansible, Puppet or Chef).
  • Or build a custom environment including all your requirements, then deploy it ready to use on all nodes. See the Environment creation page for more information.

Checking nodes' changes over time

The Grid'5000 team puts a strong focus on ensuring that nodes meet their advertised capabilities. A detailed description of each node is stored in the Reference API, and nodes are frequently checked against this description in order to detect hardware failures or misconfigurations.

To see the description of grisou-1.nancy.grid5000.fr, use:
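
One way to do so is to query the Reference API directly from a frontend (a sketch; the URL scheme follows the sites/clusters/nodes hierarchy also used in the Python examples later in this page):

Terminal.png fnancy:
curl -s https://api.grid5000.fr/stable/sites/nancy/clusters/grisou/nodes/grisou-1 | less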

Cleaning up after your reservation

At the end of your resource reservation, the infrastructure will automatically reboot the nodes to put them back in the default (standard) environment. There's no action needed on your side.

Using Grid'5000 efficiently

Until now, you have been logging in and submitting jobs to Grid'5000 manually. This way of working is convenient for learning, prototyping, and exploring ideas. But it may quickly become tedious when it comes to performing a set of experiments on a daily basis. In order to be more efficient and user-friendly, Grid'5000 also supports more convenient ways of submitting jobs, such as API requests and computational notebooks.

A quick example of Grid'5000 API usage with Python requests

There are many ways to send requests to an API. In this section, we will present two examples of using the API with the requests Python package. This package was written to provide a quick and easy way to submit API requests. Its documentation can be found at https://docs.python-requests.org/en/latest/. It is already available on Grid'5000 frontends and in the nodes' default environment, and can otherwise be easily installed with pip install requests.

Retrieving information from the API with Python

Here is a simple script that fetches the names of all clusters of all sites:

import os
import requests

user = input(f"Grid'5000 username (default is {os.getlogin()}): ") or os.getlogin()
password = input("Grid'5000 password (leave blank on frontends): ")
g5k_auth = (user, password) if password else None

sites = requests.get("https://api.grid5000.fr/stable/sites", auth=g5k_auth).json()["items"]

print("Grid'5000 sites:")
for site in sites:

    site_id = site["uid"]
    print(site_id + ":")

    site_clusters = requests.get(
        f"https://api.grid5000.fr/stable/sites/{site_id}/clusters",
        auth=g5k_auth,
    ).json()["items"]

    for cluster in site_clusters:
        print("-", cluster["uid"])

This script can be launched as follows (it will prompt for your username and password):

Terminal.png outside:
python3 my-script.py

Scripting job submission with Python

By scripting API calls, you can easily control the lifecycle of your jobs. The following script submits a job requesting one taurus node in Lyon to echo a message (the output is redirected into a file api-test-stdout in your home directory).

import os
import requests
from time import sleep

user = input(f"Grid'5000 username (default is {os.getlogin()}): ") or os.getlogin()
password = input("Grid'5000 password (leave blank on frontends): ")
g5k_auth = (user, password) if password else None

site_id = "lyon"
cluster = "taurus"

api_job_url = f"https://api.grid5000.fr/stable/sites/{site_id}/jobs"

payload = {
    "resources": "nodes=1",
    "command": 'echo "APIs are awesome !"',
    "stdout": "api-test-stdout",
    "properties": f"cluster='{cluster}'",
    "name": "api-test"
}
job = requests.post(api_job_url, data=payload, auth=g5k_auth).json()
job_id = job["uid"]

print(f"Job submitted ({job_id})")

sleep(60)
state = requests.get(api_job_url+f"/{job_id}", auth=g5k_auth).json()["state"]

if state != "terminated":
    # Deleting the job, because it takes too much time.
    requests.delete(api_job_url+f"/{job_id}", auth=g5k_auth)
    print("Job deleted.")

This script can be launched as follows (it will prompt for your username and password):

Terminal.png outside:
python3 my-script.py
Warning.png Warning

Keep in mind that you are sharing clusters with other users. Please take the time to carefully debug your scripts so that you don't reserve more resources than your experiment requires.

Scripting your experiments is a very important step if you seek reproducibility and efficiency. By scripting API calls, you can automate your whole experiment.

If you are interested in using the Grid'5000 API, a tutorial is available on the API_tutorial page.

Another way to use the Grid'5000 API from Python is the python-grid5000 package, or its higher-level counterpart EnOSlib.

You can also read Experiment scripting tutorial which presents several scripting libraries built on top of the Grid'5000 API.

Notebooks

Grid'5000 also supports Jupyter notebooks and Jupyter Lab servers. Jupyter Lab servers provide you with a simple web interface to submit jobs on Grid'5000 and run Python notebooks. Using notebooks allows you to track the evolution of your experiment during the exploratory phase while scripting parts of your process.

You can find more information about Jupyter Lab and python notebooks on the Notebooks page.

Going further

In this tutorial, you learned the basics of Grid'5000:

  • The general structure of Grid'5000, and how to move between sites
  • How to manage your data (one NFS server per site; remember: it is not backed up)
  • How to find and reserve resources using OAR and the oarsub command
  • How to get root access on nodes using Kadeploy and the kadeploy3 command

You should now be ready to use Grid'5000.

Additional tutorials

There are many more tutorials available on the Users Home page. Please have a look at the page to continue learning how to use Grid'5000.