Production: Difference between revisions
Revision as of 13:00, 16 October 2024
Introduction
The Nancy and Rennes Grid'5000 sites also host clusters for production use (including clusters with GPUs). See Nancy:Hardware and Rennes:Hardware for details.
The usage rules differ from the rest of Grid'5000:
- Advance reservations (oarsub -r) are not allowed (to avoid fragmentation). Only submissions (and reservations that start immediately) are allowed.
- All Grid'5000 users can use those nodes (provided they meet the conditions stated in Grid5000:UsagePolicy), but it is expected that users outside of LORIA / Centre Inria Nancy -- Grand Est and IRISA / Centre Inria de l'Université de Rennes will use their own local production resources in priority, and mostly use those resources for tasks that require Grid'5000 features. Examples of local production clusters are Cleps (Paris), Margaret (Saclay), Plafrim (Bordeaux), etc.
Using the resources
Getting an account
Users from the Loria laboratory (LORIA/Centre Inria Nancy Grand-Est) and from Irisa (IRISA/Centre Inria de l'Université de Rennes) who want to access Grid'5000 primarily for production usage must use the account request form to open an account, like regular Grid'5000 users.
- The following fields must be filled in as follows:
  - Group Granting Access (GGA): either the group named after your research team or, if your team is not in the team list below, loria (for Nancy) or igrida (for Rennes).
  - Laboratory: LORIA or IRISA
  - Team: INTUIDOC, SYNALP, LACODAM, MULTISPEECH, SERPICO, CARAMBA, CAPSID, SIROCCO, ORPAILLEUR, LARSEN, CIDRE, SEMAGRAMME, LINKMEDIA, SISR, TANGRAM...
 
Other users from Nancy (not belonging to the Loria laboratory) can ask to join using the nancy-misc Group Granting Access while other users from Rennes (not belonging to the Irisa laboratory) can ask to join using the rennes-misc Group Granting Access.
- Users are automatically subscribed to the Grid'5000 users mailing list: users@lists.grid5000.fr. This list is the user-to-user or user-to-admin means of communication for help and support requests about Grid'5000. The technical team can be reached at support-staff@lists.grid5000.fr.
Learning to use Moyens de Calcul hosted by Grid'5000
Refer to the Production:Getting Started Production tutorial (derived from the Getting Started Grid'5000 tutorial). Other tutorials are listed on the Users Home page.
Using deep learning software on Grid'5000
A tutorial on using deep learning software on Grid'5000, written by Ismael Bada, is also available.
Using production resources
To access production resources, you need to submit jobs to the production queue using the option -q production. Job submissions in the production queue are prioritized based on who funded the hardware. There are four levels of priority, each with a maximum job duration:
- p1 -- 168h (one week)
- p2 -- 96h (four days)
- p3 -- 48h (two days)
- p4 -- 24h (one day)
- You may also have access to the clusters on besteffort.
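The limits listed above can be captured in a small helper, for instance to check a requested walltime before submitting. This is only a sketch: the function names max_walltime_hours and fits_priority are hypothetical and are not part of OAR.

```shell
#!/bin/sh
# Maximum job duration (in hours) for each production priority level,
# as listed above. Returns 1 for an unknown level.
max_walltime_hours() {
    case "$1" in
        p1) echo 168 ;;
        p2) echo 96 ;;
        p3) echo 48 ;;
        p4) echo 24 ;;
        *)  return 1 ;;
    esac
}

# Succeeds if a walltime (in whole hours) fits within a priority level.
fits_priority() {
    limit=$(max_walltime_hours "$1") || return 1
    [ "$2" -le "$limit" ]
}

fits_priority p3 36 && echo "36h fits in p3"
fits_priority p4 36 || echo "36h does not fit in p4"
```

Keep in mind that the per-node walltime restrictions described in the FAQ still apply on top of these limits.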
|   | Note | 
|---|---|
| Moreover, with p1 priority, users can submit advance reservations, for example to reserve resources one week from now. More information can be found on the Advanced OAR page. The p1 priority level also allows extending the duration of a job. The extension is only applied 24h before the end of the job and cannot be longer than 168h. More information about this feature can also be found on the Advanced OAR page. | |
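As an illustration of the note above, the following sketch builds an advance reservation starting one week from now, then a walltime extension request. It assumes GNU date; the resource request (one node, two hours) and the job id 12345 are placeholders. The commands are only printed so that the snippet can be tried anywhere: on a frontend, run them without the echo.

```shell
#!/bin/sh
# Start time one week from now, in the "YYYY-MM-DD HH:MM:SS" format
# expected by oarsub -r (GNU date syntax).
start=$(date --date='+1 week' +'%F %T')

# Advance reservation at p1 priority (placeholder resource request).
echo oarsub -q p1 -r "\"$start\"" -l nodes=1,walltime=2

# Request one more hour for a running job (12345 is a placeholder job id).
echo oarwalltime 12345 +1:00
```

The Advanced OAR page remains the reference for the exact conditions under which reservations and extensions are accepted.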
|   | Warning | 
|---|---|
| These limits DO NOT replace the maximum walltimes per node, which are still in effect. | |
You can check your priority level for any cluster using https://api.grid5000.fr/explorer.
|   | Note | 
|---|---|
| As of today, the resources explorer only shows basic information. Additional information will be added in the near future. | |
When submitting a job, by default, you will be placed at the highest priority level that allows you to maximize resources:
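A minimal sketch of such a default submission, here an interactive job with no cluster and no priority level specified. The command is only printed; run it directly on a frontend.

```shell
#!/bin/sh
# No cluster and no priority level given: the level is chosen as described above.
cmd='oarsub -q production -I'
echo "$cmd"
```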
Using the command above will generally place your job at the lowest priority to allow usage of all clusters, even those where your priority is p4.
When you specify a cluster, your job will be set to your highest priority level for that cluster:
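For instance, assuming a production cluster called mycluster (a placeholder name, to be replaced by a cluster listed on the hardware pages), the request could look like this. The command is only printed here.

```shell
#!/bin/sh
cluster=mycluster   # placeholder: replace with a real production cluster name
# Restrict the job to one cluster with an OAR property expression.
cmd="oarsub -q production -p \"cluster='$cluster'\" -I"
echo "$cmd"
```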
You can also limit a job submission to a cluster at a specific priority level using -q PRIORITY_LEVEL:
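A sketch of a submission pinned to the p2 level: the cluster name, the resource request and the script name are all placeholders, and the walltime (72h) is chosen to stay under the 96h limit of p2. The command is only printed here.

```shell
#!/bin/sh
level=p2              # one of p1, p2, p3, p4
cluster=mycluster     # placeholder cluster name
script=./my_script.sh # placeholder batch script
cmd="oarsub -q $level -p \"cluster='$cluster'\" -l nodes=2,walltime=72 $script"
echo "$cmd"
```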
Dashboards and status pages
Nancy
Rennes
Contact information and support
For support, see the Support page.
Contacts:
- The Grid'5000 site manager (responsable de site) for Nancy is Thomas Lambert (thomas.lambert@inria.fr), and for Rennes it is Anne Cécile Orgerie (anne-cecile.orgerie@irisa.fr).
- Local mailing lists: all Grid'5000 users from Nancy and Rennes are automatically subscribed to nancy-users@lists.grid5000.fr or rennes-users@lists.grid5000.fr, respectively.
FAQ
Data storage
Research teams, members of different teams, and individuals can ask for Group storages in order to manage their data at the team level. The main benefit of Group storages is that they allow the members of a group to share their data (corpora, datasets, results...) and to easily overcome the quota restrictions of the home directories.
Please remember that NFS servers (the home directories are also served by an NFS server) are quite slow when processing a huge number of small files during a computation. If this is your case, consider doing most of your I/O on the nodes and copying the results back to the NFS server at the end of the experiment.
See here for the other kinds of storage available on the platform.
Nancy
Group storages are used to control access to different storage spaces located on the storage[1-5].nancy.grid5000.fr NFS servers (more information about the maximum capacity of each of these servers can be found here). Ask your GGA leader whether your team has access to one or more storage spaces (this is the case, for instance, for the following teams: Bird, Capsid, Caramba, Heap, Multispeech, Optimist, Orpailleur, Semagramme, Sisr, Synalp, Tangram).
Rennes
Group storages are used to control access to different storage spaces located on the storage2.rennes.grid5000.fr NFS server (more information about the maximum capacity of this server can be found here). Ask your GGA leader whether your team has access to one or more storage spaces (this is the case, for instance, for the following teams: cidre and sirocco (compactdisk storage)).
I am physically located in the LORIA/IRISA building, is there a shorter path to connect?
When you are located in the LORIA/IRISA building, you can benefit from a direct connection that does not go through the Grid'5000 national access machines (access-south and access-north). To do so, use access.nancy or access.rennes (instead of access).
Configure an SSH alias for the local access
To establish a connection to the Grid'5000 network from the local access, you can configure your SSH client as follows:
Host g5kl
  User login
  Hostname access.site.grid5000.fr
  ForwardAgent no

Host *.g5kl
  User login
  ProxyCommand ssh g5kl -W "$(basename %h .g5kl):%p"
  ForwardAgent no
Reminder: login is your Grid'5000 username and site is either nancy or rennes.
With such a configuration, you can:
- connect to the frontend of your local site
- transfer files from your laptop to your local frontend (with better bandwidth than using the national Grid'5000 access)
- access the frontend of a different site
- transfer files from your laptop to the frontend of a different site
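The four uses above can be sketched as follows, assuming the g5kl aliases are present in ~/.ssh/config. The host names are assumptions derived from the ProxyCommand, which strips the .g5kl suffix and resolves the remaining name from the access machine: frontend is taken to designate the local frontend, grenoble is only an example of another site, and results.tar is a placeholder file. The commands are only printed here.

```shell
#!/bin/sh
local_frontend=frontend.g5kl   # frontend of your local site (assumed name)
other_frontend=grenoble.g5kl   # frontend of another site (example)
file=results.tar               # placeholder file name

echo "ssh $local_frontend"           # connect to the local frontend
echo "scp $file $local_frontend:"    # copy a file to the local frontend
echo "ssh $other_frontend"           # connect to another site's frontend
echo "scp $file $other_frontend:"    # copy a file to another site's frontend
```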
How to access data hosted on Inria/Loria or Inria/Irisa servers
The Grid'5000 network is not directly connected to the Inria/Loria or Inria/Irisa internal servers. If you want to access them from a Grid'5000 frontend and/or from Grid'5000 nodes, you need to go through a local bastion host. If you need to transfer data regularly, it is highly recommended to configure the SSH client on each Grid'5000 frontend.
|   | Note | 
|---|---|
| Please note that you have a different home directory on each Grid'5000 site, so you may need to replicate your SSH configuration across multiple sites. | |
Nancy
bastionssh.loria.fr is an access machine hosted on Loria side.
That machine can be used to access all services in the Inria/Loria environment.
Host accessloria
  Hostname bastionssh.loria.fr
  # Replace jdoe with your LORIA login
  User jdoe

Host *.loria
  ProxyCommand ssh accessloria -W $(basename %h .loria):%p
  # Replace jdoe with your LORIA login
  User jdoe
|   | Note | 
|---|---|
| Given that  | |
Rennes
ssh-rba.inria.fr is an access machine hosted on Irisa side. That machine can be used to access all services in the Inria/Irisa environment.
Host transit
  Hostname ssh-rba.inria.fr
  # Replace jdoe with your IRISA login
  User jdoe
Data hosted on Inria's NAS server is accessible under /nfs on ssh-rba.inria.fr. Assuming that you have set up this configuration in your home directory on the Grenoble site:
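A possible transfer through the transit alias is sketched below. The path under /nfs is a hypothetical placeholder, since the actual layout depends on your team. The command is only printed here.

```shell
#!/bin/sh
src='transit:/nfs/path/to/my_data'   # hypothetical path under /nfs
dest="$HOME/my_data"                 # destination in your Grid'5000 home
cmd="scp -r $src $dest"
echo "$cmd"
```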
Transfer files to Grid'5000 storage
With that setup, you can now use:
- Rsync to synchronize your data in the Inria/Loria environment with the data in your local home on a Grid'5000 frontend
- Sshfs to mount your data directory from the Inria/Loria environment directly under your local home, that is, to mount your /user/my_team/my_username (origin, reached through bastionssh.loria.fr) on a folder on fnancy (destination).
For example:
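The sketch below assumes the *.loria aliases above are configured on the frontend. The machine name my_machine.loria, the dataset directory and the mount point are hypothetical. The commands are only printed here.

```shell
#!/bin/sh
remote='my_machine.loria:/user/my_team/my_username'   # hypothetical origin
mountpoint="$HOME/loria_data"                         # hypothetical local folder

# Synchronize a remote directory into the local home.
echo "rsync -avz $remote/dataset/ $HOME/dataset/"

# Mount the remote directory under the local home.
echo "mkdir -p $mountpoint"
echo "sshfs $remote $mountpoint"
```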
To unmount the remote filesystem:
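A minimal sketch, assuming the filesystem was mounted with sshfs on a hypothetical mount point in your home. The command is only printed here.

```shell
#!/bin/sh
mountpoint="$HOME/loria_data"   # hypothetical mount point used with sshfs
cmd="fusermount -u $mountpoint"
echo "$cmd"
```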
I submitted a job, there are free resources, but my job doesn't start as expected!
Most likely, this is because of our configuration of resource restrictions per walltime. To make sure that someone requesting only a few nodes for a short amount of time can get them soon enough, the nodes are split into categories. The split depends on each cluster and is visible in the Gantt chart. An example split is:
- 20% of the nodes only accept jobs with walltime lower than 1h
- 20% -- 2h
- 20% -- 24h (1 day)
- 20% -- 48h (2 days)
- 20% -- 168h (one week)
Note that best-effort jobs are excluded from those limitations.
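To make the effect of such a split concrete, the sketch below computes the share of nodes that can accept a job of a given walltime under the example split above. It assumes that each bound is inclusive and that a job may run on any category whose limit is at least its walltime; the function name eligible_share is hypothetical, and real clusters may use a different split.

```shell
#!/bin/sh
# Share of nodes (in %) that accept a job of the given walltime (whole hours)
# under the example split: five categories of 20% each, limited to
# 1h, 2h, 24h, 48h and 168h.
eligible_share() {
    h=$1
    if   [ "$h" -le 1 ];   then echo 100
    elif [ "$h" -le 2 ];   then echo 80
    elif [ "$h" -le 24 ];  then echo 60
    elif [ "$h" -le 48 ];  then echo 40
    elif [ "$h" -le 168 ]; then echo 20
    else echo 0
    fi
}

for h in 1 12 72; do
    echo "walltime ${h}h: $(eligible_share "$h")% of the nodes"
done
```

With this split, a 72h job can only use a fifth of the nodes, which explains why it may wait even though other nodes appear free.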
To see the exact walltime partition of each production cluster, have a look at the Nancy Hardware page or Rennes Hardware page.
Another OAR feature that could impact the scheduling of your jobs is the OAR scheduling with fair-sharing, which is based on the notion of karma: this feature assigns a dynamic priority to submissions based on the history of submissions by a specific user. With that feature, the jobs from users that rarely submit jobs will be generally scheduled earlier than jobs from heavy users.
I have an important demo, can I reserve all resources in advance?
There is a special challenge queue that can be used to combine resources from the classic Grid'5000 clusters and the production clusters for special events. If you would like to use it, please ask the executive committee for special permission.
Can I use besteffort jobs in production?
Yes, you can submit a besteffort job on the production resources by using the OAR -t besteffort option. Here is an example:
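A minimal sketch of such a submission, where my_script.sh is a placeholder for your own batch script. The command is only printed here.

```shell
#!/bin/sh
script=./my_script.sh   # placeholder batch script
# Besteffort job restricted to the production resources.
cmd="oarsub -t besteffort -q production $script"
echo "$cmd"
```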
If you do not specify the -q production option, your job can run on both production and non-production resources.
How to cite
If you use the Grid'5000 production clusters for your research and publish your work, please add this sentence in the acknowledgements section of your paper:
Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).
