Disk reservation: Difference between revisions

From Grid5000
 
(215 intermediate revisions by 7 users not shown)
{{Maintainer|Florent Didier}}
{{Portal|User}}
<!--{{Portal|User}}
{{Status|In production}}
{{Status|In production}}-->
{{StorageHeader}}
{{TutorialHeader}}


'''Disk reservation''' is available in '''beta release''' on the grimoire cluster in Nancy. Grimoire has eight nodes with five bookable disks each. You can now reserve hard disks, for example to '''store large datasets between your node reservations'''.
__TOC__
 
'''Disk reservation''' consists in reserving additional hard disks on nodes, which are otherwise not usable.
 
The table below shows the Grid'5000 clusters with such additional hard disks available for reservation.
{{:Generated/DiskReservation}}


= How it works =
When a job of type deploy starts, the disks you reserved are enabled by the RAID card of the node, and the other disks are disabled. Thus, reserved disks can only be accessed by the user who reserved them.
Two use cases of the disk reservation are possible:
; Long run reservations of '''disks only''' (job reserving no host, i.e. no processing power): disk-only reservations do not have to fit in the day vs. night&week-end ''host'' reservation policy, and can last up to many days (see [[Grid5000:UsagePolicy]]). The reserved disks can then be used by regular ''host'' jobs during the period of time of the ''disk'' reservation. In this use case, the goal is to get more persistence for the local storage of nodes, e.g. avoid the need to reformat disks and reimport dataset in each regular ''host'' job. Those long run jobs must use the <code class='command'>noop</code> OAR job type.
; Regular jobs reserving both '''host and disks''': In this use case, the goal is to get access to the reservable disks within the experiment, just as if the disks did not have to be reserved separately.
 
In both cases, using the reserved disks requires '''root privileges''', since disks are provided as bare-metal hardware to be partitioned, formatted, mounted and filled by the experimenter without restriction. As a result, the experimenter can use the reserved disks:
* either in a '''non-deploy''' job, in the standard environment but after enabling sudo with the <code class='command'>sudo-g5k</code> command ;
* or in a '''deploy''' job, in a kadeployed environment (use the <code class='command'>deploy</code> OAR job type, then <code class='command'>kadeploy</code>).


Technically speaking, when a ''deploy'' job starts, or whenever <code class='command'>sudo-g5k</code> is called in a ''non-deploy'' job, the reserved disks stay available (shown by <code class=command>lsblk</code>) while the other disks are disabled and disappear.
{{Warning|text=Mind that some disks may show up in <code class=command>lsblk</code>, while not being reserved:
* <code class=file>sda</code> is the system disk and hosts the partitions of the running system.
* non-reservable disks (have a look at the hardware description of the reserved cluster to find out what disks are reservable, in the site's hardware pages, e.g. [[Nancy:Hardware]] for the clusters of Nancy) also show up every time for any user (their access cannot be protected by a reservation as they are not reservable !)
}}
Reserved disks can only be accessed by the user who reserved them.

Please note that reserved disks are '''not cleaned-up''' at the end of reservation. As a result:
* Data left on the disks can be accessed by users in later reservations.
* Reserved disks may first need to get cleaned up before use (remove previous formatting and partitioning)

See also [[Disk_reservation#Security_issues|Security issues]].

= Main commands =
One or several disks can be reserved with the following commands. Note that disk reservation only works with a node reservation of type deploy. What's more, by default, reserving a node on grimoire will only grant access to the main disk.

== Reserve nodes with only their main disk ==
You can reserve one node with its main disk on the grimoire cluster:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l {"cluster='grimoire'"}/nodes=1</code>}}

== Reserve nodes with all their disks ==
You can also reserve one node with all its disks:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l {"type='disk' or type='default' and disk_reservation_count > 0"}/nodes=1</code>}}
disk_reservation_count represents the number of bookable disks on the node. At present only grimoire has bookable disks.

== Reserve disks and nodes ==
You can reserve nodes grimoire-1 and grimoire-2 with 1 disk per node:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l {"type='default' and host in ('grimoire-1.nancy.grid5000.fr','grimoire-2.nancy.grid5000.fr')"}/host=2+{"type='disk' and host in ('grimoire-1.nancy.grid5000.fr','grimoire-2.nancy.grid5000.fr')"}/host=2/disk=1</code>}}

= Usage =
The main commands to reserve disks are given below.
 
The maximum duration of a disk reservation is defined in the [[Grid5000:UsagePolicy#Rules_for_disks_reservations usage policy|Usage Policy]].
 
{{Note|text=In the following examples, add <code class=command>-t deploy</code> to the <code class=command>oarsub</code> command if you plan to deploy your own environment for your experiment.}}
 
== Reserve disks and nodes at the same time ==
; How to reserve a node with only the main disk (none of the additional disks), on the grimoire cluster:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -p grimoire -l /host=1</code>}}
(no change to the way a node was to be reserved in the past, before the disk reservation mechanism existed.)
 
; How to reserve a node with all its disks:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l "{host+disk & grimoire}"/host=1</code>}}
 
; How to reserve specific nodes grimoire-1 and grimoire-2 with all their reservable disks:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l "{host+disk & host in (grimoire-1, grimoire-2)}"/host=2</code>}}
 
{{Note|text=Make sure to set the same number of nodes in <code>host=N</code> as the number of nodes in the host name constraint list.}}
 
; How to reserve specific nodes grimoire-1 and grimoire-2 with only one reservable disk per node:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -p "host in (grimoire-1, grimoire-2)" -l "/host=2 + {disk}/host=2/disk=1"</code>}}
{{Note|text=Yes, the syntax of the last oarsub command is a bit awkward, so please be careful and mind having:
* the ''-p'' option explicitly set the hosts you want (using ''grimoire'' or ''"cluster='grimoire'"'' instead would not ensure that you get the disks on the same nodes you reserve) ;
* both ''host='' values in the ''-l'' option (''2'' in the example) exactly match the count of hosts in the list you provide in the ''-p'' option (''grimoire-1 and grimoire-2'' in the example).
* we do not need to explicitly write ''"{type='default'}"'' in the ''-l'' option (before the ''/host=2+''), because ''default'' is implicit if the ''type'' is not set.
See [[Advanced_OAR#Complex_resources_selection|Advanced OAR]] for more explanation of the oarsub syntax.
}}


== Reserve disks and nodes separately ==
=== First : reserve disks ===
You may, for example, decide to reserve some disks for one week, but the nodes where your disks are located only when you want to carry out an experiment.
You can decide to reserve some disks for one week, and the corresponding nodes only when you want to carry out an experiment.
=== First: reserve the disks ===
Since we want to reserve disks only in a first time, we use the '''noop''' job type: with this '''noop''' job type, OAR will not try to execute anything on the job resources (which is what we want since disk resources are not capable of ''executing'' programs).
 
(Please mind that jobs of type ''noop'' cannot be interactive: <code class='command'>oarsub</code> <code>-I -t noop ...</code> is not supported.)
 
3 examples:
 
Reserve two disks on grimoire-1 for one week, starting on 2018-01-01:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l "{disk&grimoire-1}"/host=1/disk=2,walltime=168</code>}}
 
Or reserve the first two reservable disks on grimoire-2 (named disk1 and disk2, since disk0 is the system disk which is not reservable):
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l "{disk & grimoire-2 and disk in (disk1.grimoire-2, disk2.grimoire-2)}"/host=1/disk=2,walltime=168</code>}}
 
Or reserve all disks on two nodes:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l "{disk&grimoire}"/host=2/disk=ALL,walltime=168</code>}}
 
=== Second: reserve the nodes ===
You can then reserve nodes grimoire-1 and grimoire-2 for 3 hours, in the usual way:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-I -l "{host in (grimoire-1, grimoire-2)}"/host=2,walltime=3</code>}}
 
''You must respect this order : reserve the disks first, then reserve the nodes. Otherwise the disks you reserved will not be available on your nodes.''
 
= Checking the state of reserved disks =
== Gantt diagrams with disk reservations ==
Reservations of both nodes (processors) and disks are displayed on the following Gantt diagrams:
{|
|-
|bgcolor="#ffffff" valign="top" style="border:1px solid #cccccc;padding:1em;padding-top:0.5em;"|
[https://intranet.grid5000.fr/oar/Grenoble/drawgantt-svg-disks/ Grenoble]
|bgcolor="#ffffff" valign="top" style="border:1px solid #cccccc;padding:1em;padding-top:0.5em;"|
[https://intranet.grid5000.fr/oar/Lille/drawgantt-svg-disks/ Lille]
|bgcolor="#ffffff" valign="top" style="border:1px solid #cccccc;padding:1em;padding-top:0.5em;"|
[https://intranet.grid5000.fr/oar/Lyon/drawgantt-svg-disks/ Lyon]
|bgcolor="#ffffff" valign="top" style="border:1px solid #cccccc;padding:1em;padding-top:0.5em;"|
[https://intranet.grid5000.fr/oar/Nancy/drawgantt-svg-disks/ Nancy]
|bgcolor="#ffffff" valign="top" style="border:1px solid #cccccc;padding:1em;padding-top:0.5em;"|
[https://intranet.grid5000.fr/oar/Rennes/drawgantt-svg-disks/ Rennes]
|}
 
== Getting information about disk reservations from OAR and G5K APIs ==
* The OAR API shows the properties of each resource of a job. You can retrieve the properties of your reserved disks, such as disk or diskpath:
{{Term|location=fnancy|cmd=<code class="command">curl</code> <code>https://api.grid5000.fr/stable/sites/</code><code class="replace">site</code><code>/internal/oarapi/jobs/</code><code class="replace">job_id</code><code>/resources.json</code> (or <code>resources.yaml</code>)}}
 
* The Grid'5000 API also provides some details about disk reservations under the '''"disks"''' key of the status and jobs APIs:
{{Term|location=fnancy|cmd=<code class="command">curl</code> <code>https://api.grid5000.fr/stable/sites/</code><code class="replace">site</code><code>/status &#124; json_pp</code>}}
{{Term|location=fnancy|cmd=<code class="command">curl</code> <code>https://api.grid5000.fr/stable/sites/</code><code class="replace">site</code><code>/jobs/</code><code class="replace">job_id</code><code> &#124; json_pp</code>}}
 
= Using local disks once connected on the nodes =
 
Log in as root on a node where you reserved one or more disks:
 
* either use <code class="command">sudo-g5k -i</code> from the standard environment to become root
* or log in with SSH as root on an environment you deployed
 
→ Example of <code class=command>oarsub</code> command for such a reservation:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-t exotic</code> <code class=replace>-l "{type='disk' and host like 'yeti-1.%' and disk like 'disk2.%'}"/disk=1+"{type='default' and host like 'yeti-1.%'}"/host=1</code> <code>-I</code>}}
(then <code class=command>ssh yeti-1</code> and run <code class=command>sudo-g5k</code>).
 
All examples below assume that you are already logged in as root on the node.
 
== Discovering available disks ==
 
The <code class="command">lsblk</code> command lists all block devices. For instance, on a <code class="host">yeti</code> machine in Grenoble, this might show:
# <code class="command">lsblk</code>
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda          8:0    0 447.1G  0 disk
├─sda1        8:1    0  3.7G  0 part [SWAP]
├─sda2        8:2    0  19.6G  0 part /
├─sda3        8:3    0  22.4G  0 part
├─sda4        8:4    0    1K  0 part
└─sda5        8:5    0 401.5G  0 part /tmp
sdc          8:32  0  1.8T  0 disk
nvme0n1    259:0    0  1.5T  0 disk
nvme1n1    259:1    0  1.5T  0 disk
 
In this case:
 
* <code class="file">disk0</code> is shown as <code class="file">sda</code> and is the system disk, so it is always available
* <code class="file">disk2</code> is shown as <code class="file">sdc</code> and has been reserved explicitly so it is visible
* <code class="file">disk1</code> and <code class="file">disk3</code> that should map to <code class="file">sdb</code> and <code class="file">sdd</code> do not show up: indeed, they have not been reserved for this example
* <code class="file">disk4</code> and <code class="file">disk5</code> are shown as <code class="file">nvme0n1</code> and <code class="file">nvme1n1</code>, that are NVMe SSDs and are always available (not reservable)
 
You can compare the output with the reference data shown in [[Grenoble:Hardware#yeti]].
 
{{Warning|text=The actual mapping between <code class="file">diskN</code> and <code class="file">sdX</code> names might change depending on disk initialization order during boot. Thus, <code class="file">sdc</code> might be an entirely different disk if you reboot the machine. In the following, we show how to use the disk aliases (symlinks <code class="file">/dev/diskN</code> set in Grid'5000 environments) or the use of PCI paths to make sure we unambiguously identify the right disks.}}
 
If using an environment where the disk aliases are activated (default environment or deployed environment where <code>g5k-postinstall</code> is called with the <code>--disk-aliases</code> option), the following symlinks should show the disks with the matching between the <code class="file">diskN</code> and <code class="file">sdX</code> names:
# ls -l /dev/disk[0-9]*
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk0 -> sda
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p1 -> sda1
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p2 -> sda2
lrwxrwxrwx 1 root root 4 Oct 13 09:15 /dev/disk0p3 -> sda3
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p4 -> sda4
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p5 -> sda5
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk2 -> sdc
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk4 -> nvme0n1
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk5 -> nvme1n1
 
It is also possible to display disks with their PCI path, which is guaranteed to always be the same (unless the hardware is physically modified):
 
# <code class="command">ls -l /dev/disk/by-path/</code>
total 0
lrwxrwxrwx 1 root root  9 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Oct  7 20:12 pci-0000:18:00.0-scsi-0:0:0:0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:2:0 -> ../../sdc
lrwxrwxrwx 1 root root 13 Oct  7 20:11 pci-0000:59:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct  7 20:11 pci-0000:6d:00.0-nvme-1 -> ../../nvme1n1
 
Here, we see that <code class="file">sdc</code> has the PCI path <code class="file">pci-0000:18:00.0-scsi-0:0:2:0</code>, which matches the second reservable disk listed on [[Grenoble:Hardware#yeti]].
 
== Partitioning a disk ==
 
To start using the disk, you will likely need to partition it. Several tools exist to do this: <code class="command">fdisk</code>, <code class="command">sfdisk</code>, <code class="command">cfdisk</code>, <code class="command">parted</code>...
 
For example, to partition the [[Grenoble:Hardware#yeti|second 2 TB disk of a yeti machine]] interactively:
 
{{Term|location=yeti-1|cmd=<code class="command">cfdisk</code> <code class="replace">/dev/disk2</code>}}
 
Use the interactive prompt to create a single partition of type "Linux filesystem", possibly by deleting existing partitions first.
 
As an advanced usage, you could use LVM to create logical volumes that may span several disks, or mdadm to create software RAID volumes.
 
== Creating a filesystem ==


Continuing the previous example, let's create an ext4 filesystem on the first partition of the same disk:

{{Term|location=yeti-1|cmd=<code class="command">mkfs.ext4</code> <code class="command">-m 0</code> <code class="replace">/dev/disk2p1</code>}}

Mount it and check that it appears:

{{Term|location=yeti-1|cmd=<code class="command">mkdir -p</code> <code class="replace">/mnt/mylocaldisk</code>}}
{{Term|location=yeti-1|cmd=<code class="command">mount</code> <code class="replace">/dev/disk2p1</code> <code class="replace">/mnt/mylocaldisk</code>}}
{{Term|location=yeti-1|cmd=<code class="command">df -h</code>}}

As an advanced usage, you may use any filesystem: Btrfs, HDFS, Ceph, ZFS, Beegfs, etc. Refer to the documentation for each of these systems for guidance.

You can reserve two disks on grimoire-1 for one week starting from January 1st:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l {"type='disk' and host='grimoire-1.nancy.grid5000.fr'"}/host=1/disk=2,walltime=168</code>}}

Or you can specify that you want the first two disks of grimoire-1:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l {"type='disk' and host='grimoire-1.nancy.grid5000.fr' and disk in ('1', '2')"}/host=1/disk=2,walltime=168</code>}}

You can also reserve all disks of two nodes:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-r "2018-01-01 00:00:00" -t noop -l {"type='disk' and cluster='grimoire'"}/host=2/disk=ALL,walltime=168</code>}}

=== Secondly: reserve nodes ===
You can reserve nodes grimoire-1 and grimoire-2 for 3 hours:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code>-l {"host in ('grimoire-1.nancy.grid5000.fr', 'grimoire-2.nancy.grid5000.fr')"}/host=2,walltime=3</code>}}

= Data loss =
As with your home directory, your data on the disks is not backed up by Grid'5000. Be cautious: events such as a disk failure or a bad command on the node may delete your data.

== Troubleshooting ==
 
When partitioning or formatting local disks, you might encounter an error such as:
 
Error: Partition(s) on /dev/sdb are being used
 
This may be because the disks already contained partitions of a certain type (LVM, software RAID...) from a previous job, and your system automatically started using them. To solve this, you have several options:
 
* use a tool such as <code class="command">wipefs</code> or <code class="command">pvremove</code> to remove previous information from the disk.
 
* use the <code class="command">mdadm</code> tool if a Linux software RAID was configured, with a command such as <code class="command">mdadm --misc --stop /dev/md127</code> for instance, to remove relics of an old RAID device.
 
* use a low-level tool such as <code class="command">dd</code> to completely erase the beginning of the disk, and reboot. Use with care as it can destroy your data.
 
For instance, here is an example script that cleans up disks automatically: https://github.com/pmorillon/terraform-provider-grid5000/blob/master/examples/ceph/modules/rook_ceph/files/disk-format.sh.tmpl


= Security issues =
Unfortunately, it is not possible to prevent a malicious user from taking control of the RAID card and therefore of the disks. If you have sensitive data and don't know how to protect it, you may contact the Grid'5000 technical team for help. It is also your responsibility to erase the data on the disks you reserved, at the end of your experiment.

The mechanism used to enable/disable disks is designed to avoid mistakes from other users. However, a malicious user could take control of the RAID card, enable any disk, and access or erase your data. Please notify the [mailto:support-staff@lists.grid5000.fr Grid'5000 tech-team] in case of such an event, but first of all mind securing your data:
* Keep a copy (backup) in a safe place if relevant for your data ;
* If your data is sensitive, mind using cryptographic mechanisms to secure it.

Also, the data on reserved disks is not automatically erased at the end of your job. If you don't want the next user to access it, you have to erase it yourself.
 
Finally, no backup of data stored on the reserved disks is made.

Latest revision as of 19:52, 24 September 2025

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Disk reservation consists in reserving additional hard disks on nodes, which are otherwise not usable.

The table below shows the Grid'5000 clusters with such additional hard disks available for reservation.

Site     | Cluster       | Number of nodes | Number of reservable disks per node
Grenoble | yeti          | 4               | 3
Lille    | chiclet-8     | 1               | 1
Lille    | chiclet-[1-7] | 7               | 2
Lille    | chifflot      | 8               | 5
Lyon     | gemini        | 2               | 4
Nancy    | gros          | 124             | 1
Nancy    | grouille      | 2               | 1
Rennes   | parasilo      | 25              | 5

Last generated from the Grid'5000 Reference API on 2025-09-03 (commit b23b5cd50e)

How it works

Two use cases of the disk reservation are possible:

Long run reservations of disks only (job reserving no host, i.e. no processing power)
disk-only reservations do not have to fit in the day vs. night&week-end host reservation policy, and can last up to many days (see Grid5000:UsagePolicy). The reserved disks can then be used by regular host jobs during the period of time of the disk reservation. In this use case, the goal is to get more persistence for the local storage of nodes, e.g. avoid the need to reformat disks and reimport dataset in each regular host job. Those long run jobs must use the noop OAR job type.
Regular jobs reserving both host and disks
In this use case, the goal is to get access to the reservable disks within the experiment, just as if the disks did not have to be reserved separately.

In both cases, using the reserved disks requires root privileges, since disks are provided as bare-metal hardware to be partitioned, formatted, mounted and filled by the experimenter without restriction. As a result, the experimenter can use the reserved disks:

  • either in a non-deploy job, in the standard environment but after enabling sudo with the sudo-g5k command ;
  • or in a deploy job, in a kadeployed environment (use the deploy OAR job type, then kadeploy).

Technically speaking, when a deploy job starts, or whenever sudo-g5k is called in a non-deploy job, the reserved disks stay available (shown by lsblk) while the other disks are disabled and disappear.

Warning.png Warning

Mind that some disks may show up in lsblk, while not being reserved:

  • sda is the system disk and hosts the partitions of the running system.
  • non reservable disks (have a look at the hardware description of the reserved cluster to find out what disks are reservable, in the site's hardware pages, e.g. Nancy:Hardware for the clusters of Nancy) also show up every time for any user (their access cannot be protected by a reservation as they are not reservable !)

Reserved disks can only be accessed by the user who reserved them.

Please note that reserved disks are not cleaned-up at the end of reservation. As a result:

  • Data left on the disks can be accessed by users in later reservations.
  • Reserved disks may first need to get cleaned up before use (remove previous formatting and partitioning)

See also Security issues.

Usage

The main commands to reserve disks are given below.

The maximum duration of a disk reservation is defined in the Usage Policy.

Note.png Note

In the following examples, add -t deploy to the oarsub command if you plan to deploy your own environment for your experiment.

Reserve disks and nodes at the same time

How to reserve a node with only the main disk (none of the additional disks), on the grimoire cluster
Terminal.png fnancy:
oarsub -I -p grimoire -l /host=1

(no change to the way a node was to be reserved in the past, before the disk reservation mechanism existed.)

How to reserve a node with all its disks
Terminal.png fnancy:
oarsub -I -l "{host+disk & grimoire}"/host=1
How to reserve specific nodes grimoire-1 and grimoire-2 with all their reservable disks
Terminal.png fnancy:
oarsub -I -l "{host+disk & host in (grimoire-1, grimoire-2)}"/host=2
Note.png Note

Make sure to set the same number of nodes in host=N as the number of nodes in the host name constraint list.

How to reserve specific nodes grimoire-1 and grimoire-2 with only one reservable disk per node
Terminal.png fnancy:
oarsub -I -p "host in (grimoire-1, grimoire-2)" -l "/host=2 + {disk}/host=2/disk=1"
Note.png Note

Yes, the syntax of the last oarsub command is a bit awkward, so please be careful and mind having:

  • the -p option explicitly set the hosts you want (using grimoire or "cluster='grimoire'" instead would not ensure that you get the disks on the same nodes you reserve) ;
  • both host= values in the -l option (2 in the example) exactly match the count of hosts in the list you provide in the -p option (grimoire-1 and grimoire-2 in the example).
  • we do not need to explicitly write "{type='default'}" in the -l option (before the /host=2+), because default is implicit if the type is not set.
See Advanced OAR for more explanation of the oarsub syntax.
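
The consistency rules above (same host list in -p, same host count in both host= values) can be enforced by generating the command instead of typing it. The following is only a sketch: build_disk_request is a hypothetical helper name, not a Grid'5000 tool, and it prints the command rather than submitting it.

```shell
# Hypothetical helper (not a Grid'5000 tool): print the oarsub command that
# reserves the listed hosts with a given number of reservable disks on each.
# Usage: build_disk_request DISKS_PER_HOST HOST...
build_disk_request() {
    local disks="$1" list="" h
    shift
    for h in "$@"; do
        # Join host names with ", " as in the example above.
        if [ -z "$list" ]; then list="$h"; else list="$list, $h"; fi
    done
    # $# is now the number of hosts, so both host= values always match the list.
    printf 'oarsub -I -p "host in (%s)" -l "/host=%d + {disk}/host=%d/disk=%d"\n' \
        "$list" "$#" "$#" "$disks"
}

# Print (do not submit) the request for grimoire-1 and grimoire-2, one disk each.
build_disk_request 1 grimoire-1 grimoire-2
```

Review the printed command before running it on the frontend.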

Reserve disks and nodes separately

You may, for example, decide to reserve some disks for one week, but the nodes where your disks are located only when you want to carry out an experiment.

First: reserve the disks

Since we want to reserve disks only in a first time, we use the noop job type: with this noop job type, OAR will not try to execute anything on the job resources (which is what we want since disk resources are not capable of executing programs).

(Please mind that jobs of type noop cannot be interactive: oarsub -I -t noop ... is not supported.)

3 examples:

Reserve two disks on grimoire-1 for one week, starting on 2018-01-01:

Terminal.png fnancy:
oarsub -r "2018-01-01 00:00:00" -t noop -l "{disk&grimoire-1}"/host=1/disk=2,walltime=168

Or reserve the first two reservable disks on grimoire-2 (named disk1 and disk2, since disk0 is the system disk which is not reservable):

Terminal.png fnancy:
oarsub -r "2018-01-01 00:00:00" -t noop -l "{disk & grimoire-2 and disk in (disk1.grimoire-2, disk2.grimoire-2)}"/host=1/disk=2,walltime=168

Or reserve all disks on two nodes:

Terminal.png fnancy:
oarsub -r "2018-01-01 00:00:00" -t noop -l "{disk&grimoire}"/host=2/disk=ALL,walltime=168
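
In the three examples above, walltime=168 is one week expressed in hours (7 × 24). A minimal sketch that derives the walltime from a number of days and prints the first request; days_to_walltime and noop_disk_request are hypothetical helper names, and nothing is submitted to OAR.

```shell
# Hypothetical helper: OAR walltime in hours for a duration given in days.
days_to_walltime() {
    echo $(( $1 * 24 ))
}

# Print (do not submit) a disk-only reservation request.
# Arguments: start date, host, number of disks, duration in days.
noop_disk_request() {
    printf 'oarsub -r "%s" -t noop -l "{disk&%s}"/host=1/disk=%d,walltime=%d\n' \
        "$1" "$2" "$3" "$(days_to_walltime "$4")"
}

# Two disks on grimoire-1 for one week, starting on 2018-01-01.
noop_disk_request "2018-01-01 00:00:00" grimoire-1 2 7
```

Check the resulting duration against the Usage Policy before submitting.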

Second: reserve the nodes

You can then reserve nodes grimoire-1 and grimoire-2 for 3 hours, in the usual way:

Terminal.png fnancy:
oarsub -I -l "{host in (grimoire-1, grimoire-2)}"/host=2,walltime=3

You must respect this order : reserve the disks first, then reserve the nodes. Otherwise the disks you reserved will not be available on your nodes.

Checking the state of reserved disks

Gantt diagrams with disk reservations

Reservations of both nodes (processors) and disks are displayed on the following Gantt diagrams:

Grenoble

Lille

Lyon

Nancy

Rennes

Getting information about disk reservations from OAR and G5K APIs

  • The OAR API shows the properties of each resource of a job. You can retrieve the properties of your reserved disks, such as disk or diskpath:
Terminal.png fnancy:
curl https://api.grid5000.fr/stable/sites/site/internal/oarapi/jobs/job_id/resources.json (or resources.yaml)
  • The Grid'5000 API also provides some details about disk reservations under the "disks" key of the status and jobs APIs:
Terminal.png fnancy:
curl https://api.grid5000.fr/stable/sites/site/status | json_pp
Terminal.png fnancy:
curl https://api.grid5000.fr/stable/sites/site/jobs/job_id | json_pp

Using local disks once connected on the nodes

Log in as root on a node where you reserved one or more disks:

  • either use sudo-g5k -i from the standard environment to become root
  • or log in with SSH as root on an environment you deployed

→ Example of oarsub command for such a reservation:

Terminal.png fnancy:
oarsub -t exotic -l "{type='disk' and host like 'yeti-1.%' and disk like 'disk2.%'}"/disk=1+"{type='default' and host like 'yeti-1.%'}"/host=1 -I

(then ssh yeti-1 and run sudo-g5k).

All examples below assume that you are already logged in as root on the node.

Discovering available disks

The lsblk command lists all block devices. For instance, on a yeti machine in Grenoble, this might show:

# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    0 447.1G  0 disk 
├─sda1        8:1    0   3.7G  0 part [SWAP]
├─sda2        8:2    0  19.6G  0 part /
├─sda3        8:3    0  22.4G  0 part 
├─sda4        8:4    0     1K  0 part 
└─sda5        8:5    0 401.5G  0 part /tmp
sdc           8:32   0   1.8T  0 disk 
nvme0n1     259:0    0   1.5T  0 disk
nvme1n1     259:1    0   1.5T  0 disk

In this case:

  • disk0 is shown as sda and is the system disk, so it is always available
  • disk2 is shown as sdc and has been reserved explicitly so it is visible
  • disk1 and disk3 that should map to sdb and sdd do not show up: indeed, they have not been reserved for this example
  • disk4 and disk5 are shown as nvme0n1 and nvme1n1, that are NVMe SSDs and are always available (not reservable)

You can compare the output with the reference data shown in Grenoble:Hardware#yeti.
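
To check from a script whether a given disk is visible, the lsblk listing can be filtered on its TYPE column. This is a sketch that assumes the default column layout shown above (TYPE is the sixth column); visible_disks and disk_is_visible are hypothetical helper names.

```shell
# List whole disks (TYPE column equal to "disk") from lsblk text read on stdin.
visible_disks() {
    awk '$6 == "disk" { print $1 }'
}

# Succeed if the given kernel name (e.g. sdc) is among the visible disks.
disk_is_visible() {
    visible_disks | grep -qx "$1"
}

# Typical use on a node:
#   lsblk | disk_is_visible sdc && echo "sdc is visible"
```

Because sdX names can change between boots, this only tells you what is visible right now; it does not identify which diskN a name corresponds to.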

Warning.png Warning

The actual mapping between diskN and sdX names might change depending on disk initialization order during boot. Thus, sdc might be an entirely different disk if you reboot the machine. In the following, we show how to use the disk aliases (symlinks /dev/diskN set in Grid'5000 environments) or the use of PCI paths to make sure we unambiguously identify the right disks.

If using an environment where the disk aliases are activated (default environment or deployed environment where g5k-postinstall is called with the --disk-aliases option), the following symlinks should show the disks with the matching between the diskN and sdX names:

# ls -l /dev/disk[0-9]*
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk0 -> sda
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p1 -> sda1
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p2 -> sda2
lrwxrwxrwx 1 root root 4 Oct 13 09:15 /dev/disk0p3 -> sda3
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p4 -> sda4
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p5 -> sda5
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk2 -> sdc
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk4 -> nvme0n1
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk5 -> nvme1n1
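
The alias symlinks can also be resolved from a script. A minimal sketch, assuming the aliases live directly under /dev as in the listing above; alias_target is a hypothetical helper name, and the directory is a parameter only so that the function can be tried on a scratch directory.

```shell
# Print the kernel device name behind a diskN alias (e.g. disk2 -> sdc).
# Usage: alias_target NAME [DIRECTORY]   (DIRECTORY defaults to /dev)
alias_target() {
    local name="$1" dir="${2:-/dev}"
    if [ ! -L "$dir/$name" ]; then
        echo "no alias $dir/$name (disk not reserved or aliases not enabled?)" >&2
        return 1
    fi
    # readlink prints the symlink target; basename drops any directory part.
    basename "$(readlink "$dir/$name")"
}

# Typical use on a node: alias_target disk2
```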

It is also possible to display disks with their PCI path, which is guaranteed to always be the same (unless the hardware is physically modified):

# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root  9 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Oct  7 20:12 pci-0000:18:00.0-scsi-0:0:0:0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part5 -> ../../sda5
lrwxrwxrwx 1 root root  9 Oct  7 20:11 pci-0000:18:00.0-scsi-0:0:2:0 -> ../../sdc
lrwxrwxrwx 1 root root 13 Oct  7 20:11 pci-0000:59:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct  7 20:11 pci-0000:6d:00.0-nvme-1 -> ../../nvme1n1

Here, we see that sdc has the PCI path pci-0000:18:00.0-scsi-0:0:2:0, which matches the second reservable disk listed on Grenoble:Hardware#yeti.
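
The reverse lookup, from a kernel name to its PCI path, can be sketched as a loop over the by-path symlinks. bypath_of is a hypothetical helper name; the directory is a parameter so that the function can be tried outside a node.

```shell
# Print the by-path name(s) whose symlink points to the given kernel device.
# Usage: bypath_of DEVICE [DIRECTORY]   (DIRECTORY defaults to /dev/disk/by-path)
bypath_of() {
    local dev="$1" dir="${2:-/dev/disk/by-path}" link
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        # Targets look like ../../sdc, so compare on the last path component.
        if [ "$(basename "$(readlink "$link")")" = "$dev" ]; then
            basename "$link"
        fi
    done
}

# Typical use on a node: bypath_of sdc
```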

Partitioning a disk

To start using the disk, you will likely need to partition it. Several tools exist to do this: fdisk, sfdisk, cfdisk, parted...

For example, to partition the second 2 TB disk of a yeti machine interactively:

Terminal.png yeti-1:
cfdisk /dev/disk2

Use the interactive prompt to create a single partition of type "Linux filesystem", possibly by deleting existing partitions first.

As an advanced usage, you could use LVM to create logical volumes that may span several disks, or mdadm to create software RAID volumes.
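
For scripted experiments, the same single-partition layout can be created without the interactive prompt. The sketch below uses sfdisk, one of the tools listed above, and assumes a GPT label is wanted; single_partition_layout and partition_whole_disk are hypothetical helper names. Applying the layout replaces the existing partition table of the target disk.

```shell
# Print an sfdisk script: GPT label, one partition with default start and
# default size (the whole disk), type L (Linux filesystem).
single_partition_layout() {
    printf 'label: gpt\n,,L\n'
}

# Apply the layout to a disk. DESTRUCTIVE: the partition table is replaced.
partition_whole_disk() {
    single_partition_layout | sfdisk "$1"
}

# Example, as root on the node:
#   partition_whole_disk /dev/disk2
```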

Creating a filesystem

Continuing the previous example, let's create an ext4 filesystem on the first partition of the same disk:

Terminal.png yeti-1:
mkfs.ext4 -m 0 /dev/disk2p1

Mount it and check that it appears:

Terminal.png yeti-1:
mkdir -p /mnt/mylocaldisk
Terminal.png yeti-1:
mount /dev/disk2p1 /mnt/mylocaldisk
Terminal.png yeti-1:
df -h
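
When a disk reservation outlives several host jobs, the filesystem only needs to be created once; later jobs just mount it again. A minimal sketch of a mount step that is safe to repeat, assuming mount points are read from /proc/mounts; is_mounted and mount_once are hypothetical helper names.

```shell
# Succeed if the given directory is listed as a mount point.
# Usage: is_mounted DIRECTORY [MOUNTS_FILE]   (defaults to /proc/mounts)
is_mounted() {
    local dir="$1" mounts="${2:-/proc/mounts}"
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# Mount a partition on a directory unless something is already mounted there.
mount_once() {
    local part="$1" dir="$2"
    mkdir -p "$dir"
    is_mounted "$dir" || mount "$part" "$dir"
}

# Example, as root on the node:
#   mount_once /dev/disk2p1 /mnt/mylocaldisk
```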

As an advanced usage, you may use any filesystem: Btrfs, HDFS, Ceph, ZFS, Beegfs, etc. Refer to the documentation for each of these systems for guidance.

Troubleshooting

When partitioning or formatting local disks, you might encounter an error such as:

Error: Partition(s) on /dev/sdb are being used

This may be because the disks already contained partitions of a certain type (LVM, software RAID...) from a previous job, and your system automatically started using them. To solve this, you have several options:

  • use a tool such as wipefs or pvremove to remove previous information from the disk.
  • use the mdadm tool if a Linux software RAID was configured, with a command such as mdadm --misc --stop /dev/md127 for instance, to remove relics of an old RAID device.
  • use a low-level tool such as dd to completely erase the beginning of the disk, and reboot. Use with care as it can destroy your data.

For instance, here is an example script that cleans up disks automatically: https://github.com/pmorillon/terraform-provider-grid5000/blob/master/examples/ceph/modules/rook_ceph/files/disk-format.sh.tmpl
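
As a smaller illustration of the options above, the following sketch only prints the cleanup commands for one disk so that they can be reviewed before being run by hand as root. cleanup_plan is a hypothetical helper name, and the 100 MiB size passed to dd is an arbitrary assumption. Every printed command destroys data on the target.

```shell
# Print (do not execute) cleanup commands for a disk.
# Usage: cleanup_plan DISK [MD_DEVICE]
cleanup_plan() {
    local disk="$1" md="${2:-}"
    # Stop a leftover software RAID array first, if one was named.
    if [ -n "$md" ]; then
        echo "mdadm --misc --stop $md"
    fi
    # Erase the signatures (filesystem, RAID, partition table) found on the device.
    echo "wipefs -a $disk"
    # Zero the beginning of the disk; a reboot may be needed afterwards.
    echo "dd if=/dev/zero of=$disk bs=1M count=100"
}

# Example: cleanup_plan /dev/disk2 /dev/md127
```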

Security issues

The mechanism used to enable/disable disks is designed to avoid mistakes from other users. However, a malicious user could take control of the RAID card, enable any disk, and access or erase your data. Please notify the Grid'5000 tech-team in case of such an event, but first of all mind securing your data:

  • Keep a copy (backup) in a safe place if relevant for your data ;
  • If your data is sensitive, mind using cryptographic mechanisms to secure it.

Also, the data on reserved disks is not automatically erased at the end of your job. If you don't want the next user to access it, you have to erase it yourself.

Finally, no backup of data stored on the reserved disks is made.