Getting Started

From Grid5000
  
 
== Getting support ==
 
The '''[[Support]] page''' describes how to get help while using Grid'5000.
  
There's also an '''[[FAQ]] page''' and a '''[[media:G5k_cheat_sheet.pdf|cheat sheet]]''' with the most common commands.
  
 
== Connecting for the first time ==
 
  
  
The primary way to move around Grid'5000 is using SSH. A [[SSH|reference page for SSH]] is also maintained with advanced configuration options that frequent users will find useful.
  
 
As described in the figure below, when using Grid'5000, you will typically:
 
  
 
=== Connect to a Grid'5000 access machine ===
 
The <code class="host">access.grid5000.fr</code> address points to two actual machines: <code class="host">access-north</code> (currently hosted in Lille) and <code class="host">access-south</code> (currently hosted in Sophia-Antipolis). These machines provide SSH access to Grid'5000 from the Internet.
  
 
{{Term|location=outside|cmd=<code class="command">ssh</code> <code class="replace">login</code><code class="command">@</code><code class="host">access.grid5000.fr</code>}}
 
 
If you prefer (for better bandwidth and latency), you might also be able to connect directly via your local Grid'5000 site. However, per-site access restrictions are applied, so using <code class="host">access.grid5000.fr</code> is usually a simpler choice. See [[External_access]] for details about local access machines.
 
  
A VPN service is also available to connect directly to Grid'5000 hosts. See [[VPN|the VPN page]] for more information. If you only require HTTP/HTTPS access to a node, a reverse HTTP proxy is also available, see [[FAQ#How_can_I_connect_to_an_HTTP_or_HTTPS_service_running_on_a_node.3F|this FAQ]].
  
 
=== Connecting to a Grid'5000 site ===
 
 
=== Recommended tips and tricks for efficient use of Grid'5000 ===
 
 
There are also several '''recommended tips and tricks for SSH and related tools''', explained in the [[SSH]] page:
 
* Configure [[SSH#Using_SSH_ProxyCommand_feature_to_ease_the_access_to_hosts_inside_Grid.275000|SSH aliases using the ProxyCommand option]]. Using this, you can avoid the two-step connection (access machine, then frontend) and connect directly to frontends. Edit your ~/.ssh/config
  
 
  Host g5k
 
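The configuration excerpt above is truncated. As a sketch of what such an alias-based setup can look like (the alias name <code>g5k</code>, the placeholder <code>login</code>, and the exact option lines are assumptions illustrating the ProxyCommand approach, not this page's literal recommendation):

```
# ~/.ssh/config (sketch) -- replace "login" with your Grid'5000 username
Host g5k
  User login
  Hostname access.grid5000.fr

# Any host ending in .g5k is reached by jumping through the access machine
Host *.g5k
  User login
  ProxyCommand ssh g5k -W "$(basename %h .g5k):%p"
```

With such a setup, <code>ssh nancy.g5k</code> would connect to the Nancy frontend in one step.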
  
 
* Using <code class="command">rsync</code> instead of <code class="command">scp</code> (better performance with multiple files)
 
* Access your data from your laptop using [[SSH#Mounting_remote_filesystem_.28sshfs.29|SSHFS]]
 
* Edit files over SSH with your favorite text editor, with e.g. <code class="command">vim scp://nancy.g5k/my_file.c</code>
 
There are more in [http://www.loria.fr/~lnussbau/files/g5kss10-grid5000-efficiently.pdf this talk from Grid'5000 School 2010], and [https://github.com/lnussbaum/slides-lectures/blob/master/ssh/ssh.pdf this talk more focused on SSH].
  
 
Additionally, the '''[[media:G5k_cheat_sheet.pdf|Grid'5000 cheat sheet]]''' provides a nice summary of everything described in the tutorials.
 
 
** Gantt, which shows current and planned resource reservations (see [https://intranet.grid5000.fr/oar/Nancy/drawgantt-svg/ Nancy's current status]; example in the figure below).
 
 
[[Image:nancy-gantt.png|600px|center|Example of Drawgantt in Nancy site]]
 
* The Grid'5000 API (we'll look at that later on) provides a machine-readable description of Grid'5000 and machine-readable status information. [https://api.grid5000.fr/3.0/ui/quick-start.html This web UI] can be used to discover resources.
* Hardware pages contain a detailed description of the site's hardware
{{Site link|Hardware}}
=== Reserving resources with OAR: the basics ===
{{Note|text=OAR is the resource and job management system (a.k.a. batch manager) used in Grid'5000, just like in traditional HPC centers. '''However, the OAR settings and rules configured in Grid'5000 differ slightly from traditional batch manager setups in HPC centers, in order to match the requirements of an experimentation testbed'''. Please remember to read the '''Grid'5000 [[Grid5000:UsagePolicy#Resources_reservation|Usage Policy]]''' again to understand the expected usage.}}
In Grid'5000, the smallest unit of resource managed by OAR is the core (CPU core), but by default an OAR job reserves a host (a physical computer, including all of its CPUs and cores). Hence, what OAR calls ''nodes'' are hosts (physical machines).
  
 
{{Note|text=Most of this tutorial uses the site of Nancy (with the frontend: <code class="host">fnancy</code>), but other sites can be used alternatively.}}
 
 
 
  
 
To reserve one host (one node), in interactive mode, do:
 
 
To reserve only one core in interactive mode, run:
 
 
{{Term|location=fnancy|cmd=<code class="command">oarsub -l core=1</code> <code>-I</code>}}
 
{{Note|text=When reserving only a share of a node's cores, you get a share of its memory with the same ratio as the cores. If you reserve the whole node, you get all of its memory; if you reserve half of the cores, you get half of the memory, and so on. You cannot reserve an amount of memory explicitly.}}
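As a worked example of this pro-rata rule (the 96 GiB, 32-core node is hypothetical, not a specific Grid'5000 cluster):

```shell
# Hypothetical node: 96 GiB of RAM and 32 cores.
# Reserving 8 cores grants 8/32 = 1/4 of the memory.
NODE_MEM_GIB=96
NODE_CORES=32
RESERVED_CORES=8
echo "$(( NODE_MEM_GIB * RESERVED_CORES / NODE_CORES )) GiB"   # 24 GiB
```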
 
As soon as a resource becomes available, you will be directly connected to the reserved resource with an interactive shell, as indicated by the shell prompt.
 
You can also simply launch your experiment along with your reservation:
{{Term|location=fnancy|cmd=<code class="command">oarsub -l core=1</code> <code>"my_mono_threaded_script.py --in $HOME/data --out $HOME/results"</code>}}
Your program will be executed as soon as the requested resources are available (you will have to check for its termination using the <code class="command">oarstat</code> command).
  
 
To reserve only one GPU (with the associated cores) in interactive mode, run:
 
{{Term|location=fnancy|cmd=<code class="command">oarsub -l gpu=1</code> <code>-I</code> <code>-q production</code>}}
(in Nancy, nodes with GPUs are exclusively in the production queue; on other sites, simply omit the <code>-q production</code> option to obtain the same result)
 
 
As soon as a resource becomes available, you will be directly connected to the reserved resource with an interactive shell, as indicated by the shell prompt.
 
  
  
 
To terminate your reservation and return to the frontend, simply exit this shell by typing <code class="command">exit</code> or <code class="command">CTRL+d</code>:
 
{{Term|location=graffiti-1|cmd=<code class="command">exit</code>}}  
  
To avoid unanticipated termination of your jobs in case of errors (terminal closed by mistake, network disconnection), you can either use tools such as [https://tmux.github.io/ tmux] or [[screen]], or reserve and connect in two steps using the job id associated with your reservation. First, reserve a node, and ask it to sleep for a long time:
 
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> "<code class="command">sleep</code> <code class="replace">10d</code>"}} (10d stands for ''10 days'' -- the command will be killed when the job expires anyway)
 
 
Then:
 
 
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-C</code> <code class="replace">job_id</code>}}  
 
{{Term|location=grisou-42|
 
cmd=<code class="command">hostname</code> && <code class="command">ps</code> -ef <nowiki>|</nowiki> <code class="command">grep</code> sleep<br>
 
 
<code class="command">java</code> <code>-version</code><br>
 
 
<code class="command">mpirun</code> <code>--version</code><br>
 
<code class="command">module available</code><code> # List [[Environment_modules|scientific-related software available using module]]</code><br>
 
<code class="command">whoami</code><br>
 
 
<code class="command">env <nowiki>|</nowiki> grep OAR</code><code> # discover environment variables set by OAR</code>}}  
 
  
 
By default, you can only connect to nodes in your reservation, and only using the <code class="command">oarsh</code> connector to go from one node to the other. The connector supports the same options as the classical <code class="command">ssh</code> command, so it can be used as a replacement for software expecting ssh.  
 
{{Term|location=gros-49|cmd=<br>
 
<code class="command">uniq</code> <code class="env">$OAR_NODEFILE</code> <code># list of resources of your reservation</code><br>
 
<code class="command">oarsh</code> <code class="replace">gros-1</code><code>    # use a node not in the file (will fail)</code><br>
<code class="command">oarsh</code> <code class="replace">gros-54</code><code> # use the other node of your reservation</code><br>
<code class="command">ssh</code> <code class="replace">gros-54</code><code> # will fail</code>}}
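The role of <code>$OAR_NODEFILE</code> can be illustrated locally (the file content below is fabricated, not taken from a real job): OAR writes one line per reserved core, so <code>uniq</code> collapses it to one line per host.

```shell
# Fabricated nodefile: a 2-node job on 4-core machines yields
# 8 lines (one per core) but only 2 distinct hosts.
cat > fake_nodefile <<'EOF'
gros-49.nancy.grid5000.fr
gros-49.nancy.grid5000.fr
gros-49.nancy.grid5000.fr
gros-49.nancy.grid5000.fr
gros-54.nancy.grid5000.fr
gros-54.nancy.grid5000.fr
gros-54.nancy.grid5000.fr
gros-54.nancy.grid5000.fr
EOF
uniq fake_nodefile        # prints each host once
```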
  
 
<code class="command">oarsh</code> is a wrapper around <code class="command">ssh</code> that enables the tracking of user jobs inside compute nodes (for example, to enforce the correct sharing of resources when two different jobs share a compute node). If your application does not support choosing a different connector, it is possible to avoid using <code class="command">oarsh</code> for <code class="command">ssh</code> with the <code>allow_classic_ssh</code> job type, as in  
 
  
 
By default, <code class="command">oarsub</code> will give you resources as soon as possible. You can also reserve resources at a specific time in the future, with the <code class="command">-r</code> parameter:
 
{{Term|location=fnancy|cmd=<code class="command">oarsub</code><code> -l nodes=3,walltime=3 </code><code class="command">'''-r'''</code><code> '2020-12-23 16:30:00'</code>}}
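The <code>-r</code> argument must use the <code>YYYY-MM-DD HH:MM:SS</code> format. One way to generate such a string is GNU <code>date</code> (the relative spec <code>'tomorrow 19:00'</code> is just an example):

```shell
# Print a date in the format expected by oarsub -r
# (GNU date, as found on Linux frontends).
date -d 'tomorrow 19:00' '+%Y-%m-%d %H:%M:%S'
```

which could then be combined with the reservation, e.g. <code>oarsub -l nodes=3,walltime=3 -r "$(date -d 'tomorrow 19:00' '+%Y-%m-%d %H:%M:%S')"</code>.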
  
 
; Job management
 
 
{{Term|location=fnancy|cmd=<code class="command">oardel</code> <code class="replace">12345</code>}}
 
  
{{Note|text=Remember that '''all your resource reservations must comply with the [[Grid5000:UsagePolicy#Resources_reservation|Usage Policy]]'''. You can verify your reservations' compliance with the Policy with <code>usagepolicycheck -t</code>.}}
  
 
; Selection of resources using OAR properties
 
 
{{Term|location=flyon|cmd=<code class="command">oarsub</code> <code class="replace">-p "wattmeter='YES' and gpu_count > 0"</code><code> -l nodes=2,walltime=2 -I</code>}}
 
 
* Since <code class="command">-p</code> accepts SQL, you could write
 
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code class="replace">-p "wattmeter='YES' and host not in ('graffiti-41.nancy.grid5000.fr', 'graffiti-42.nancy.grid5000.fr')"</code><code> -l nodes=5,walltime=2 -I</code>}}
 
The OAR properties available on each site are listed on the Monika pages linked from [[Status]] ([https://intranet.grid5000.fr/oar/Nancy/monika.cgi example page for Nancy]). The full list of OAR properties is available on [[OAR_Properties|this page]].
 
  
 
; Extending the duration of a reservation
 
Provided that the resources are still available after your job, you can extend its duration (walltime) using e.g.:
 
{{Term|location=fnancy|cmd=<code class="command">oarwalltime</code> <code class="replace">12345</code> <code class="replace">+1:30</code>}}
 
This will request to add one hour and a half to job 12345.
 
 
For more details, see the oarwalltime section of the [[Advanced_OAR#Changing_the_walltime_of_a_running_job_.28oarwalltime.29|Advanced OAR]] tutorial.
 
 
 
 
 
  
 
== Deploying your nodes to get root access and create your own experimental environment ==
 
Using <code class="command">oarsub</code> gives you access to resources configured in their default (''standard'') environment, with a set of software selected by the Grid'5000 team. You can use such an environment to run Java or [[Run_MPI_On_Grid%275000|MPI programs]], [[Virtualization on Grid'5000|boot virtual machines with KVM]], or [[Environment_modules|access a collection of scientific-related software]]. However, you cannot deeply customize the software environment.
  
 
Most Grid'5000 users use resources in a different, much more powerful way: they use [http://kadeploy3.gforge.inria.fr/ Kadeploy] to re-install the nodes with their software environment for the duration of their experiment, using Grid'5000 as a ''Hardware-as-a-Service'' Cloud. This enables them to use a different Debian version, another Linux distribution, or even Windows, and get root access to install the software stack they need.
 
  
{{Note|text=There is a tool, called <code class="command">sudo-g5k</code> (see the [[sudo-g5k]] page for details), that provides root access on the ''standard'' environment. It does not allow deep reconfiguration as Kadeploy does, but could be enough if you just need to install additional software, with e.g. <code class="command">sudo-g5k</code><code> apt-get install your-favorite-editor</code>. The node will be transparently reinstalled using Kadeploy after your reservation. Usage of <code class="command">sudo-g5k</code> is logged.}}
  
 
=== Deploying nodes with Kadeploy ===
 
 
{{Term|location=fnancy|cmd=<code class="command">oarsub</code><code> -I -l nodes=1,walltime=1:45 -t </code><code class="replace">deploy</code>}}
 
  
Start a deployment of the <code class="env">debian10-x64-base</code> image on that node (this takes 5 to 10 minutes):
{{Term|location=fnancy|cmd=<code class="command">kadeploy3</code><code> -f $OAR_NODE_FILE -e </code><code class="env">debian10-x64-base</code><code> -k</code>}}
The <code class="command">-f</code> parameter specifies a file containing the list of nodes to deploy. Alternatively, you can use <code class="command">-m</code> to specify a node (such as <code class="command">-m</code> <code class="host">gros-42.nancy.grid5000.fr</code>). The <code class="command">-k</code> parameter asks Kadeploy to copy your SSH key to the node's root account after deployment, so that you can connect without password. If you don't specify it, you will need to provide a password to connect. However, SSH is often configured to disallow root login using password. The root password for all Grid'5000-provided images is <code class="command">grid5000</code>.
Reference images are named <code class="replace">debian version</code><code class="command">-</code><code class="replace">architecture</code><code class="command">-</code><code class="replace">type</code>. The <code class="replace">debian version</code> can be <code class="env">debian10</code> (Debian 10 "Buster", released in 07/2019) or <code class="env">debian9</code> (Debian 9 "stretch", released in 06/2017). The <code class="replace">architecture</code> is <code class="command">x64</code> (in the past, 32-bit images were also provided). The <code class="replace">type</code> can be:
  
 
 
* '''<code class="env">min</code>''' = a minimalistic image (standard Debian installation) with minimal Grid'5000-specific customization (the default configuration provided by Debian is used): addition of an SSH server, network interface firmware, etc  (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/min changes]).
 
 
* '''<code class="env">base</code>''' = <code class="env">min</code> + various Grid'5000-specific tuning for performance (TCP buffers for 10 GbE, etc.), and a handful of commonly-needed tools to make the image more user-friendly (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/base changes]). Those could incur an experimental bias.
 
 
* '''<code class="env">xen</code>''' = <code class="env">base</code> + Xen hypervisor Dom0 + minimal DomU (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/xen changes]).
 
 
* '''<code class="env">nfs</code>''' = <code class="env">base</code> + support for mounting your NFS home and accessing other storage services (Ceph), and using your Grid'5000 user account on deployed nodes (LDAP) (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/nfs changes]).
 
 
* '''<code class="env">big</code>''' = <code class="env">nfs</code> + packages for development, system tools, editors, shells (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/big changes]).
 
 
And for the standard environment:
 
 
* '''<code class="env">std</code>''' = <code class="env">big</code> + integration with OAR. Currently, it is the <code class="command">debian10-x64-std</code> environment which is used on the nodes if you or another user did not "kadeploy" another environment (see [https://github.com/grid5000/environments-recipes/tree/master/steps/data/setup/puppet/modules/env/manifests/std changes]).
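The naming scheme described above can be sketched as a simple assembly of the three fields (the values below are examples, not a recommendation):

```shell
# Build a reference image name: <debian version>-<architecture>-<type>
version="debian10"    # or debian9
arch="x64"
envtype="base"        # min, base, xen, nfs, big, std
echo "${version}-${arch}-${envtype}"   # debian10-x64-base
```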
 
<!-- see https://www.grid5000.fr/mediawiki/index.php/New_maintenance_strategy_of_reference_images for details -->
 
<!-- see https://www.grid5000.fr/mediawiki/index.php/New_maintenance_strategy_of_reference_images for details -->
  
As a result, the environments you are the most likely to use are <code class="env">debian9-x64-min</code>, <code class="env">debian9-x64-base</code>, <code class="env">debian9-x64-xen</code>, <code class="env">debian9-x64-nfs</code>, <code class="env">debian9-x64-big</code>, and their ''jessie'' counterparts.
+
As a result, the environments you are the most likely to use are <code class="env">debian10-x64-min</code>, <code class="env">debian10-x64-base</code>, <code class="env">debian10-x64-xen</code>, <code class="env">debian10-x64-nfs</code>, <code class="env">debian10-x64-big</code>, and their ''debian9'' counterparts.
  
 
Environments are also provided and supported for some other distributions, only in the '''<code class="env">min</code>''' variant:
 
Environments are also provided and supported for some other distributions, only in the '''<code class="env">min</code>''' variant:
* Ubuntu: <code class="env">ubuntu1804-x64-min</code>
+
* Ubuntu: <code class="env">ubuntu1804-x64-min</code> and <code class="env">ubuntu2004-x64-min</code>
* Centos: <code class="env">centos7-x64-min</code>
+
* Centos: <code class="env">centos7-x64-min</code> and <code class="env">centos8-x64-min</code>
 
Last, an environment for the upcoming Debian version (also known as ''Debian testing'') is provided: <code class="env">debiantesting-x64-min</code> (only <code class="env">min</code> as well).
 
Last, an environment for the upcoming Debian version (also known as ''Debian testing'') is provided: <code class="env">debiantesting-x64-min</code> (only <code class="env">min</code> as well).
  
The list of all provided environments is available using <code class="command">kaenv3 -l</code>. Note that environments are versionned, and old versions of reference environments are available in <code class="file">/grid5000/images/</code> on each frontend (as well as images that are no longer supported, such as Centos 6 images). This can be used to reproduce experiments even months or years later, still using the same software environment.
+
The list of all provided environments is available using <code class="command">kaenv3 -l</code>. Note that environments are versioned, and old versions of reference environments are available in <code class="file">/grid5000/images/</code> on each frontend (as well as images that are no longer supported, such as CentOS 6 images). This can be used to reproduce experiments even months or years later, still using the same software environment.
  
 
=== Customizing nodes and accessing the Internet ===
 
=== Customizing nodes and accessing the Internet ===
Line 223: Line 232:
  
 
First, connect to the node as root:
 
First, connect to the node as root:
{{Term|location=fnancy|cmd=<code class="command">ssh</code> <code>root@</code><code class="replace">griffon-42</code>}}
+
{{Term|location=fnancy|cmd=<code class="command">ssh</code> <code>root@</code><code class="replace">gros-42</code>}}
  
 
You can access websites outside Grid'5000 : for example, to fetch the Linux kernel sources:  
 
You can access websites outside Grid'5000 : for example, to fetch the Linux kernel sources:  
{{Term|location=griffon-42|cmd=<code class="command">wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.6.tar.bz2</code>}}
+
{{Term|location=gros-42|cmd=<code class="command">wget http://www.kernel.org/pub/linux/kernel/v5.x/linux-5.4.43.tar.xz</code>}}
  
 
{{Warning|text= Please note that, for legal reasons, your Internet activity from Grid'5000 is logged and monitored.}}
 
{{Warning|text= Please note that, for legal reasons, your Internet activity from Grid'5000 is logged and monitored.}}
  
 
Let's install <code>stress</code> (a simple load generator) on the node from Debian's APT repositories:
 
Let's install <code>stress</code> (a simple load generator) on the node from Debian's APT repositories:
{{Term|location=griffon-42|cmd=<code class="command">apt-get</code><code> install </code><code class="replace">stress</code>}}
+
{{Term|location=gros-42|cmd=<code class="command">apt-get</code><code> install </code><code class="replace">stress</code>}}
  
 
Installing all the software needed for your experiment can be quite time-consuming. There are three approaches to avoid spending time at the beginning of each of your Grid'5000 sessions:
 
Installing all the software needed for your experiment can be quite time-consuming. There are three approaches to avoid spending time at the beginning of each of your Grid'5000 sessions:
Line 239: Line 248:
  
 
All those approaches have different pros and cons. We recommend that you start by scripting software installation after deploying a reference environment, and that you move to other approaches when this proves too limited.
 
All those approaches have different pros and cons. We recommend that you start by scripting software installation after deploying a reference environment, and that you move to other approaches when this proves too limited.
 
  
 
 

Latest revision as of 09:32, 31 July 2020

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This tutorial will guide you through your first steps on Grid'5000. Before proceeding, make sure you have a Grid'5000 account (if not, follow this procedure), and an SSH client.

Getting support

The Support page describes how to get help during your Grid'5000 usage.

There's also an FAQ page and a cheat sheet with the most common commands.

Connecting for the first time

The primary way to move around Grid'5000 is using SSH. A reference page for SSH is also maintained with advanced configuration options that frequent users will find useful.

As described in the figure below, when using Grid'5000, you will typically:

  1. connect, using SSH, to an access machine
  2. connect from this access machine to a site frontend
  3. on this site frontend, reserve resources (nodes), and connect to those nodes
Grid5000 Access

SSH connection through a web interface

If you want an out-of-the-box solution that does not require you to set up SSH, you can connect through a web interface. The interface is available at https://intranet.grid5000.fr/shell/SITE/. For example, to access the Nancy site, use: https://intranet.grid5000.fr/shell/nancy/. To connect, you will have to type in your credentials twice (first for the HTTP proxy, then for the SSH connection).

This solution is probably suitable for following this tutorial, but is unlikely to be suitable for real Grid'5000 usage, so you should still read the following sections about how to set up and use SSH at some point.

Connect to a Grid'5000 access machine

The access.grid5000.fr address points to two actual machines: access-north (currently hosted in Lille) and access-south (currently hosted in Sophia-Antipolis). Those machines provide SSH access to Grid'5000 from the Internet.

Terminal.png outside:
ssh login@access.grid5000.fr

You will get authenticated using the SSH public key you provided in the account creation form. Password authentication is disabled.

Note.png Note

You can modify your SSH keys in the account management interface

If you prefer (for better bandwidth and latency), you might also be able to connect directly via your local Grid'5000 site. However, per-site access restrictions are applied, so using access.grid5000.fr is usually a simpler choice. See External_access for details about local access machines.

A VPN service is also available to connect directly to Grid'5000 hosts. See the VPN page for more information. If you only require HTTP/HTTPS access to a node, a reverse HTTP proxy is also available, see this FAQ.

Connecting to a Grid'5000 site

Grid'5000 is structured in sites (Grenoble, Rennes, Nancy, ...). Each site hosts one or more clusters (homogeneous sets of machines, usually bought at the same time).

To connect to a particular site, do the following (blue and red arrow labelled SSH in the figure above).

Terminal.png access:
ssh site

Home directories

You have a different home directory on each Grid'5000 site, so you will usually use rsync or scp to move data around. On access machines, you have direct access to each of those home directories through NFS mounts (but using that feature to transfer very large volumes of data is inefficient). Typically, to copy a file to your home directory on the Nancy site, you can use:

Terminal.png outside:
scp myfile.c login@access.grid5000.fr:nancy/targetdirectory/mytargetfile.c

Grid'5000 does NOT have a BACKUP service for its users' home directories: it is your responsibility to save important data outside Grid'5000 (or at least to copy data to several Grid'5000 sites in order to increase redundancy).

Quotas are applied on home directories -- by default, you get 25 GB per Grid'5000 site. If your usage of Grid'5000 requires more disk space, it is possible to request quota extensions in the account management interface, or to use other storage solutions (see Storage).

Recommended tips and tricks for efficient use of Grid'5000

There are also several recommended tips and tricks for SSH and related tools, explained in the SSH page:

  • Simplifying access with host aliases in your ~/.ssh/config:

Host g5k
  User USERNAME
  Hostname access.grid5000.fr
  ForwardAgent no
Host *.g5k
  User USERNAME
  ProxyCommand ssh g5k -W "$(basename %h .g5k):%p"
  ForwardAgent no
  • Using rsync instead of scp (better performance with multiple files)
  • Access your data from your laptop using SSHFS
  • Edit files over SSH with your favorite text editor, with e.g. vim scp://nancy.g5k/my_file.c

There are more in this talk from Grid'5000 School 2010, and this talk more focused on SSH.

Additionally, the Grid'5000 cheat sheet provides a nice summary of everything described in the tutorials.

Discovering, visualizing and reserving Grid'5000 resources

At this point, you should be connected to a site frontend, as indicated by your shell prompt (login@fsite:~$). This machine will be used to reserve and manipulate resources on this site, using the OAR software suite.

Discovering and visualizing resources

There are several ways to learn about the site's resources and their status:

  • The site's MOTD (message of the day) lists all clusters and their features. Additionally, it gives the list of current or future downtimes due to maintenance, which is also available from https://www.grid5000.fr/status/.
  • Site pages on the wiki (e.g. Nancy:Home) contain a detailed description of the site's hardware and network
  • The Status page links to the resource status on each site, with two different visualizations available: Monika (the current state of nodes and jobs) and Drawgantt (a Gantt chart of past, current and scheduled jobs)
Example of Drawgantt at the Nancy site
  • Hardware pages contain a detailed description of the site's hardware

Reserving resources with OAR: the basics

Note.png Note

OAR is the resource and job management system (a.k.a. batch manager) used in Grid'5000, just like in traditional HPC centers. However, the OAR settings and rules configured in Grid'5000 differ slightly from traditional batch manager setups in HPC centers, in order to match the requirements of an experimentation testbed. Please remember to read the Grid'5000 Usage Policy again to understand the expected usage.

In Grid'5000 the smallest unit of resource managed by OAR is the core (CPU core), but by default an OAR job reserves a host (a physical computer including all its CPUs/cores). Hence, what OAR calls nodes are hosts (physical machines).

Note.png Note

Most of this tutorial uses the site of Nancy (with the frontend: fnancy), but other sites can be used alternatively.

To reserve one host (one node), in interactive mode, do:

Terminal.png fnancy:
oarsub -I

To reserve three hosts (three nodes), in interactive mode, do:

Terminal.png fnancy:
oarsub -l host=3 -I

or equivalently:

Terminal.png fnancy:
oarsub -l nodes=3 -I

To reserve only one core in interactive mode, run:

Terminal.png fnancy:
oarsub -l core=1 -I
Note.png Note

When reserving only a share of a node's cores, you get a share of the node's memory in the same ratio as the cores. If you take the whole node, you get all of its memory; if you take half the cores, you get half the memory, and so on. You cannot reserve a specific amount of memory explicitly.
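The proportional memory rule can be sketched with illustrative numbers (a hypothetical node with 96 GiB of RAM and 32 cores, not a specific Grid'5000 cluster):

```shell
# Memory available to a job = node memory × (reserved cores / total cores).
# The numbers below are illustrative only.
TOTAL_MEM_GIB=96
TOTAL_CORES=32
RESERVED_CORES=8
echo "$(( TOTAL_MEM_GIB * RESERVED_CORES / TOTAL_CORES )) GiB"   # 24 GiB for the job
```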

As soon as a resource becomes available, you will be directly connected to the reserved resource with an interactive shell, as indicated by the shell prompt.

You can also simply launch your experiment along with your reservation:

Terminal.png fnancy:
oarsub -l core=1 "my_mono_threaded_script.py --in $HOME/data --out $HOME/results"

Your program will be executed as soon as the requested resources are available (you will have to check for its termination using the oarstat command).

To reserve only one GPU (with the associated cores) in interactive mode, run:

Terminal.png fnancy:
oarsub -l gpu=1 -I -q production

(in Nancy, nodes with GPUs are only available in the production queue; on other sites, just omit the -q production option to obtain the same result)



To terminate your reservation and return to the frontend, simply exit this shell by typing exit or CTRL+d:

Terminal.png graffiti-1:
exit

To avoid unanticipated termination of your jobs in case of errors (terminal closed by mistake, network disconnection), you can either use tools such as tmux or screen, or reserve and connect in two steps using the job id associated with your reservation. First, reserve a node, and ask it to sleep for a long time:

Terminal.png fnancy:
oarsub "sleep 10d"
(10d stands for 10 days -- the command will be killed when the job expires anyway)

Then:

Terminal.png fnancy:
oarsub -C job_id
Terminal.png grisou-42:
hostname && ps -ef | grep sleep

java -version
mpirun --version
module available # List scientific-related software available using module
whoami

env | grep OAR # discover environment variables set by OAR

Of course, you will probably want to use more than one node on a given site, and you might want them for a different duration than one hour. The -l switch allows you to pass a comma-separated list of parameters specifying the needed resources for the job.

Terminal.png fnancy:
oarsub -I -l nodes=2,walltime=0:30

The walltime is the duration you expect your work to take. Its format is [hour:min:sec|hour:min|hour] (walltime=5 => 5 hours, walltime=1:22 => 1 hour 22 minutes, walltime=0:03:30 => 3 minutes, 30 seconds).
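As a sketch, the three walltime formats can be converted to seconds like this (an illustrative helper, not an OAR tool):

```shell
# Convert an OAR walltime (hour, hour:min, or hour:min:sec) to seconds.
walltime_to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base-10 so values with leading zeros (e.g. "03") parse correctly.
  echo $(( 10#$h * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}
walltime_to_seconds 5        # 18000 (5 hours)
walltime_to_seconds 1:22     # 4920  (1 hour 22 minutes)
walltime_to_seconds 0:03:30  # 210   (3 minutes 30 seconds)
```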

By default, you can only connect to nodes in your reservation, and only using the oarsh connector to go from one node to the other. The connector supports the same options as the classical ssh command, so it can be used as a replacement for software expecting ssh.

Terminal.png gros-49:

uniq $OAR_NODEFILE # list of resources of your reservation
oarsh gros-1 # use a node not in the file (will fail)
oarsh gros-54 # use the other node of your reservation

ssh gros-54 # will fail

oarsh is a wrapper around ssh that enables the tracking of user jobs inside compute nodes (for example, to enforce the correct sharing of resources when two different jobs share a compute node). If your application does not support choosing a different connector, you can use plain ssh instead of oarsh by submitting your job with the allow_classic_ssh type, as in:

Terminal.png fnancy:
oarsub -I -l nodes=2,walltime=0:30:0 -t allow_classic_ssh

Reservations in advance, job management, and selection of resources

Reservations in advance

By default, oarsub will give you resources as soon as possible. You can also reserve resources at a specific time in the future, with the -r parameter:

Terminal.png fnancy:
oarsub -l nodes=3,walltime=3 -r '2020-12-23 16:30:00'
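The -r argument expects a 'YYYY-MM-DD hh:mm:ss' timestamp. A small sketch building one for tomorrow at 16:30, assuming GNU date is available (as on the frontends):

```shell
# Build an absolute start time for oarsub -r using GNU date's relative syntax.
START="$(date -d 'tomorrow 16:30' '+%Y-%m-%d %H:%M:%S')"
echo "oarsub -l nodes=3,walltime=3 -r '$START'"
```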
Job management

To list jobs currently submitted, use the oarstat command (use -u option to see only your jobs). A job can be deleted with:

Terminal.png fnancy:
oardel 12345
Note.png Note

Remember that all your resource reservations must comply with the Usage Policy. You can verify your reservations' compliance with the Policy with usagepolicycheck -t.

Selection of resources using OAR properties

The OAR nodes database contains a set of properties for each node, that can be used to request specific resources:

  • Nodes from a given cluster
Terminal.png fluxembourg:
oarsub -p "cluster='granduc'" -l nodes=5,walltime=2 -I
  • Nodes with Infiniband FDR interfaces
Terminal.png fnancy:
oarsub -p "ib='FDR'" -l nodes=5,walltime=2 -I
  • Nodes with power sensors and GPUs
Terminal.png flyon:
oarsub -p "wattmeter='YES' and gpu_count > 0" -l nodes=2,walltime=2 -I
  • Since -p accepts SQL, you could write
Terminal.png fnancy:
oarsub -p "wattmeter='YES' and host not in ('graffiti-41.nancy.grid5000.fr', 'graffiti-42.nancy.grid5000.fr')" -l nodes=5,walltime=2 -I

The OAR properties available on each site are listed on the Monika pages linked from Status (example page for Nancy). The full list of OAR properties is available on this page.

Extending the duration of a reservation

Provided that the resources are still available after your job, you can extend its duration (walltime) using e.g.:

Terminal.png fnancy:
oarwalltime 12345 +1:30

This will request to add one hour and a half to job 12345.

For more details, see the oarwalltime section of the Advanced OAR tutorial.

Deploying your nodes to get root access and create your own experimental environment

Using oarsub gives you access to resources configured in their default (standard) environment, with a set of software selected by the Grid'5000 team. You can use such an environment to run Java or MPI programs, boot virtual machines with KVM, or access a collection of scientific-related software. However, you cannot deeply customize the software environment.

Most Grid'5000 users use resources in a different, much more powerful way: they use Kadeploy to re-install the nodes with their software environment for the duration of their experiment, using Grid'5000 as a Hardware-as-a-Service Cloud. This enables them to use a different Debian version, another Linux distribution, or even Windows, and get root access to install the software stack they need.

Note.png Note

There is a tool, called sudo-g5k (see the sudo-g5k page for details), that provides root access on the standard environment. It does not allow deep reconfiguration as Kadeploy does, but could be enough if you just need to install additional software, with e.g. sudo-g5k apt-get install your-favorite-editor. The node will be transparently reinstalled using Kadeploy after your reservation. Usage of sudo-g5k is logged.

Deploying nodes with Kadeploy

Reserve one node (the deploy job type is required to allow deployment with Kadeploy):

Terminal.png fnancy:
oarsub -I -l nodes=1,walltime=1:45 -t deploy

Start a deployment of the debian10-x64-base image on that node (this takes 5 to 10 minutes):

Terminal.png fnancy:
kadeploy3 -f $OAR_NODE_FILE -e debian10-x64-base -k

The -f parameter specifies a file containing the list of nodes to deploy. Alternatively, you can use -m to specify a node (such as -m gros-42.nancy.grid5000.fr). The -k parameter asks Kadeploy to copy your SSH key to the node's root account after deployment, so that you can connect without password. If you don't specify it, you will need to provide a password to connect. However, SSH is often configured to disallow root login using password. The root password for all Grid'5000-provided images is grid5000.

Reference images are named following the version-architecture-type scheme. The version can be debian10 (Debian 10 "Buster", released in 07/2019) or debian9 (Debian 9 "Stretch", released in 06/2017). The architecture is x64 (in the past, 32-bit images were also provided). The type can be:

  • min = a minimalistic image (standard Debian installation) with minimal Grid'5000-specific customization (the default configuration provided by Debian is used): addition of an SSH server, network interface firmware, etc (see changes).
  • base = min + various Grid'5000-specific tuning for performance (TCP buffers for 10 GbE, etc.), and a handful of commonly-needed tools to make the image more user-friendly (see changes). These tunings could introduce an experimental bias.
  • xen = base + Xen hypervisor Dom0 + minimal DomU (see changes).
  • nfs = base + support for mounting your NFS home and accessing other storage services (Ceph), and using your Grid'5000 user account on deployed nodes (LDAP) (see changes).
  • big = nfs + packages for development, system tools, editors, shells (see changes).

And for the standard environment:

  • std = big + integration with OAR. Currently, it is the debian10-x64-std environment which is used on the nodes if you or another user did not "kadeploy" another environment (see changes).

As a result, the environments you are the most likely to use are debian10-x64-min, debian10-x64-base, debian10-x64-xen, debian10-x64-nfs, debian10-x64-big, and their debian9 counterparts.
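The naming scheme can be sketched as a tiny parser (illustrative only, not a Grid'5000 tool):

```shell
# Split a reference image name into its version, architecture and type parts.
parse_env() {
  local version arch variant
  IFS=- read -r version arch variant <<< "$1"
  echo "version=$version arch=$arch type=$variant"
}
parse_env debian10-x64-base   # version=debian10 arch=x64 type=base
parse_env ubuntu1804-x64-min  # version=ubuntu1804 arch=x64 type=min
```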

Environments are also provided and supported for some other distributions, only in the min variant:

  • Ubuntu: ubuntu1804-x64-min and ubuntu2004-x64-min
  • Centos: centos7-x64-min and centos8-x64-min

Last, an environment for the upcoming Debian version (also known as Debian testing) is provided: debiantesting-x64-min (only min as well).

The list of all provided environments is available using kaenv3 -l. Note that environments are versioned, and old versions of reference environments are available in /grid5000/images/ on each frontend (as well as images that are no longer supported, such as CentOS 6 images). This can be used to reproduce experiments even months or years later, still using the same software environment.

Customizing nodes and accessing the Internet

Now that your nodes are deployed, the next step is usually to copy data (usually using scp or rsync) and install software.

First, connect to the node as root:

Terminal.png fnancy:
ssh root@gros-42

You can access websites outside Grid'5000: for example, to fetch the Linux kernel sources:

Terminal.png gros-42:
wget http://www.kernel.org/pub/linux/kernel/v5.x/linux-5.4.43.tar.xz

Warning.png Warning

Please note that, for legal reasons, your Internet activity from Grid'5000 is logged and monitored.

Let's install stress (a simple load generator) on the node from Debian's APT repositories:

Terminal.png gros-42:
apt-get install stress

Installing all the software needed for your experiment can be quite time-consuming. There are three approaches to avoid spending time at the beginning of each of your Grid'5000 sessions:

  • Always deploy one of the reference environments, and automate the installation of your software environment after the image has been deployed. You can use a simple bash script, or more advanced tools for configuration management such as Ansible, Puppet or Chef.
  • Register a new environment with your modifications, using the tgz-g5k tool. More details are provided in the Advanced Kadeploy tutorial.
  • Use a tool to generate your environment image from a set of rules, such as Kameleon or Puppet. The Grid'5000 technical team uses those two tools to generate all Grid'5000 environments in a clean and reproducible process.

All those approaches have different pros and cons. We recommend that you start by scripting software installation after deploying a reference environment, and that you move to other approaches when this proves too limited.
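The recommended first approach might look like the following sketch: a provisioning script kept in your home directory, copied to each freshly deployed node and run there as root (package names are examples only; the local syntax check simply verifies the script):

```shell
# Write a hypothetical post-deploy provisioning script, then syntax-check it.
# On Grid'5000 you would scp it to the node and run it as root after kadeploy3.
cat > /tmp/provision.sh <<'EOF'
#!/bin/bash
set -eu
apt-get update
apt-get install -y stress htop   # example packages; adapt to your experiment
# ... copy data, tune sysctls, start services, etc.
EOF
bash -n /tmp/provision.sh && echo "syntax OK"
```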

Checking nodes' changes over time

The Grid'5000 team puts a strong focus on ensuring that nodes meet their advertised capabilities. A detailed description of each node is stored in the Reference API, and each node is frequently checked against this description in order to detect hardware failures or misconfigurations.

To see the description of grisou-1.nancy.grid5000.fr, use:

Terminal.png fnancy:
curl https://api.grid5000.fr/stable/sites/nancy/clusters/grisou/nodes/grisou-1.json?pretty

Cleaning up after your reservation

At the end of your resource reservation, the infrastructure will automatically reboot the nodes to put them back in the default (standard) environment. No action is needed on your side.

Going further

In this tutorial, you learned the basics of Grid'5000:

  • The general structure of Grid'5000, and how to move between sites
  • How to manage your data (one NFS server per site; remember: it is not backed up)
  • How to find and reserve resources using OAR and the oarsub command
  • How to get root access on nodes using Kadeploy and the kadeploy3 command

You should now be ready to use Grid'5000.

Additional tutorials

There are many more tutorials available on the Users Home page. Please have a look at the page to continue learning how to use Grid'5000.