Nancy:Production
{{Author|Clément Parisot}}
{{Maintainer|Clément Parisot}}
{{Portal|User}}
= Introduction =
The Nancy Grid'5000 site also hosts clusters for production use. Those clusters are:
* '''graphique''', a 6-node Intel Xeon cluster, with 2 GPUs per node (Nvidia GTX 980, Titan Black)
* '''graoully''', a 16-node cluster with 2 Intel Xeon E5-2630 v3 CPUs per node, 8 cores/CPU, 126GB RAM, 2x558GB HDD, 10Gbps Ethernet + InfiniBand
* '''grimani''', a 6-node cluster with 2 Intel Xeon E5-2603 v3 CPUs and 2 GPUs (Tesla K40m) per node, 6 cores/CPU, 64GB RAM, 1TB HDD, 10Gbps Ethernet. Omni-Path is coming.
* '''grele''', a 14-node cluster with 2 Intel Xeon E5-2650 v4 CPUs and 2 GPUs (GTX 1080 Ti) per node, 12 cores/CPU, 128GB RAM, 2x300GB SAS 15K rpm HDDs, 10Gbps Ethernet. Omni-Path is coming.


The usage rules differ from the rest of Grid'5000:
* Advance reservations (<code>oarsub -r</code>) are not allowed (to avoid fragmentation). Only submissions (and reservations that start immediately) are allowed.
* All Grid'5000 users can use those nodes (provided they meet the conditions stated in [[Grid5000:UsagePolicy]]), but it is expected that users outside of LORIA / Inria Nancy -- Grand Est will use their own local production resources in priority, and mostly use those resources for tasks that require Grid'5000 features. Examples of local production clusters are Tompouce (Saclay), Igrida (Rennes), Plafrim (Bordeaux), etc.


= Using the resources =
== Getting an account ==
Please use the [[Special:G5KRequestAccountUMS|request form here]].


* The following fields must be filled as indicated (but the '''other fields (team, lab, ...) must be filled too'''):
** manager: lnussbaum
** site: nancy
** groups, roles: none
** privileges: user
* You are automatically subscribed to the Grid'5000 users' mailing list: users@lists.grid5000.fr
This list is the user-to-user and user-to-admin communication channel for help and support requests on Grid'5000.

== Learning to use Grid'5000 ==
Refer to the [[Getting Started]] tutorial.
There are other tutorials listed on the [https://www.grid5000.fr/mediawiki/index.php/Category:Portal:User Users Home] page.
== Using deep learning software on Grid'5000 ==
A tutorial for using deep learning software on Grid'5000, written by Ismael Bada [[User:Ibada/Tuto_Deep_Learning|is also available]].


== Using production resources ==
To access production resources, you need to submit jobs in the ''production'' queue or using the ''production'' job type:
 oarsub -q production -I
 oarsub -q production -p "cluster='talc'" -I
 oarsub -q production -l nodes=2,walltime=240 -I
 oarsub -q production -l walltime=24 -t deploy 'sleep 100d'
 ...
or
 oarsub -t production -I
 oarsub -t production -p "cluster='talc'" -I
 oarsub -t production -l nodes=2,walltime=240 -I
 oarsub -t production -l walltime=24 -t deploy 'sleep 100d'
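For a non-interactive (batch) job, you can pass a script instead of <code>-I</code> and check its state with <code>oarstat</code>. A minimal sketch, assuming a hypothetical <code>./run_experiment.sh</code> script in your home directory:
 oarsub -q production -l nodes=1,walltime=12 ./run_experiment.sh
 oarstat -u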


= Cluster management tools =
== Dashboards and status pages ==
* [https://intranet.grid5000.fr/oar/Nancy/drawgantt-svg-prod/ DrawGantt: Gantt diagram of jobs on the cluster]
* [https://intranet.grid5000.fr/oar/Nancy/monika-prod.cgi Monika: currently running jobs]
* [https://intranet.grid5000.fr/ganglia/?r=hour&s=descending&c=Nancy Ganglia: status of nodes]
* [https://www.grid5000.fr/status/ planned and ongoing maintenances, events and issues on Grid'5000]


= Contact information and support =
Contacts:
* The Grid'5000 team can be contacted as described on the [[Support]] page.
* The Grid'5000 ''responsable de site'' for Nancy is Lucas Nussbaum ([mailto:lucas.nussbaum@loria.fr lucas.nussbaum@loria.fr])
* Ismael Bada (engineer funded by CPER LCHN) can also help local users, especially regarding requests related to deep learning on Grid'5000 ([mailto:ismael.bada@loria.fr ismael.bada@loria.fr])
 
To get support, you can:
* Use the [mailto:users@lists.grid5000.fr users@lists.grid5000.fr] mailing list: all Grid'5000 users (700+ people) are automatically subscribed
* Use the [mailto:nancy-users@lists.grid5000.fr nancy-users@lists.grid5000.fr] mailing list: all Grid'5000 users from Nancy are automatically subscribed
* Contact Ismael Bada (see above)


The Grid'5000 team does not have the resources (manpower) to do user support, such as helping with writing scripts, creating system images, etc. If you need such help, please contact either Ismael Bada (see above), or the SED service.


= FAQ =
== Data storage ==
All data needed for the experiments of the production teams is stored on the talc-data NFS server, which has a capacity of 58TB. This includes:
* '''talc_home''': old talc home directories
* '''talc_data[1-3]''': various data and results used by several teams, and old backups of data
* '''one directory for each local team''': data of each team
Those directories are exported on the Nancy frontend under the '''/talc''' directory. They are also exported, under the same '''/talc''' path, on all Nancy nodes that run the standard environment (''jessie-x64-std'').
Please remember that this data is hosted on an NFS server, which is not recommended for compute usage.
For other shorter-term storage options, see [[Storage5k]].
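Since the NFS export should not sustain heavy I/O from compute jobs, a common pattern is to copy the input data you need to the local disk of the node at the beginning of a job and work on that copy. A minimal sketch, assuming a hypothetical <code class="file">/talc/myteam/dataset</code> directory:
{{Term|location=node|cmd=<code class="command">cp</code> <code>-r</code> <code class="file">/talc/myteam/dataset</code> <code class="file">/tmp/dataset</code>}}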
== I am physically located in the LORIA building, is there a shorter path to connect? ==
If for some reason you don't want to go through the Grid'5000 national access machines (access-south and access-north), you can also connect directly using:
{{Term|location=mylaptop|cmd=<code class="command">ssh</code> <code class="replace">jdoe</code><code>@</code><code class="host">access.nancy.grid5000.fr</code>}}
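If you connect regularly, you can also add a shortcut to your <code class="file">~/.ssh/config</code>; a minimal sketch, where the <code>nancy.g5k</code> alias is an arbitrary name and <code>jdoe</code> stands for your Grid'5000 login:
<pre>
Host nancy.g5k
        Hostname access.nancy.grid5000.fr
        User jdoe # to be replaced by your Grid'5000 login
</pre>
You can then simply run <code>ssh nancy.g5k</code>.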
== I have a large number of jobs to execute, is there a better solution? ==
Yes. You should have a look at [[Multi-parametric experiments with CiGri|CiGri]]. This middleware makes it easier to submit a large number of jobs as ''best-effort'' jobs on Grid'5000. The advantage is that you benefit from the computing power of the whole of Grid'5000 for your jobs, not just Nancy's production resources.
== How to access data in Inria/Loria ==
bastionssh.loria.fr is an access machine hosted on the Loria side.
That machine can be used to access all services in the Inria/Loria environment.
You need to use an SSH ProxyCommand for that purpose.
Add the following lines to your <code class="file">~/.ssh/config</code>:
<pre>
Host accessloria
        Hostname bastionssh.loria.fr
        User jdoe # to be replaced by your LORIA login
Host *.loria
        User jdoe # to be replaced by your LORIA login
        ProxyCommand ssh accessloria -W $(basename %h .loria):%p
</pre>       
With that setup, you can now use:
* [https://www.grid5000.fr/mediawiki/index.php/Rsync Rsync] to synchronize data between the Inria/Loria environment and your local home on the Grid'5000 frontend
* [https://www.grid5000.fr/mediawiki/index.php/SSH#Mounting_remote_filesystem Sshfs] to mount your data directory from the Inria/Loria environment under your local home, i.e. mount your /users/my_team/my_username directory (served by bastionssh.loria.fr) on a folder on fnancy
For example:
{{Term|location=fnancy|cmd=<code class="command">sshfs</code> <code>-o idmap=user</code> <code class="replace">jdoe</code><code>@</code><code class="host">tregastel.loria</code>:<code class="file">/users/myteam/jdoe ~/local_dir</code>}}
 
To unmount the remote filesystem:
{{Term|location=fnancy|cmd=<code class="command">fusermount</code> <code>-u</code> <code class="file">~/local_dir</code>}}
 
{{Note|text=Given that bastionssh.loria.fr only accepts logins using SSH key, you cannot simply connect with your LORIA password.}}
== I submitted a job, there are free resources, but my job doesn't start as expected! ==
Most likely, this is because of our configuration of resource restrictions per walltime. In order to make sure that someone requesting only a few nodes for a small amount of time will be able to get them soon enough, the nodes are split into categories:
* 20% of the nodes only accept jobs with a walltime lower than 1h
* 20% -- 2h
* 20% -- 24h
* 20% -- 48h
* 20% accept all jobs (no limit on duration)
Note that ''best-effort'' jobs are excluded from those limitations.
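In practice, keeping the requested walltime low increases the number of nodes on which your job can be scheduled: a job asking for slightly less than 2 hours can run on roughly 80% of the nodes, whereas a 96-hour job only fits on the 20% of nodes without a duration limit. A minimal sketch, with a hypothetical script name:
 oarsub -q production -l nodes=4,walltime=1:50:00 ./my_script.sh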


Another enabled OAR feature that could impact the scheduling of your jobs is the OAR ''karma'': this feature assigns a dynamic priority to submissions based on the history of submissions by a specific user. With that feature, the jobs from users that rarely submit jobs will generally be scheduled earlier than jobs from heavy users.


== I have an important demo, can I reserve all resources in advance? ==
There's a special ''challenge'' queue that can be used to combine resources from the classic Grid'5000 clusters and the production clusters for special events. If you would like to use it, please get in touch with the cluster managers.


== Can I use besteffort jobs in production? ==
Yes, you can submit a besteffort job on the production resources by using the OAR <code>-t besteffort</code> option. Here is an example:
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> <code class="replace">-t besteffort</code> <code>-q production</code> <code class="file">./my_script.sh</code>}}
If you don't specify the <code>-q production</code> option, your job may run on both production and non-production resources.

== Is it possible to run Matlab? ==
Yes, through SSH tunneling to access the UL license server (access to bastionssh.loria.fr required).
More information is available in [https://members.loria.fr/FSur/articles/MatlabGrid5000.pdf this document].


== Energy costs ==
Grid'5000 nodes are automatically shut down when they are not reserved, so, when possible, it is a good idea to reserve nodes during cheaper time slots.

Electricity costs are currently:
* Periods:
** Peak hours: December, January, February; 09:00-11:00 / 18:00-20:00
** Winter full hours: 06:00-22:00 (excluding the peak hours specified in article 18)
** Winter off-peak hours: 22:00-06:00
** Summer full hours: 06:00-22:00
** Summer off-peak hours: 22:00-06:00
** Sundays only count as off-peak hours, in winter and summer alike.
* Cost per kWh:
** Peak hours: 10.893 c€/kWh
** Winter full hours: 6.535 c€/kWh
** Winter off-peak hours: 4.474 c€/kWh
** Summer full hours: 4.125 c€/kWh
** Summer off-peak hours: 2.580 c€/kWh


== How to cite / Comment citer ==
If you use the Grid'5000 production clusters for your research and publish your work, please add this sentence in the acknowledgements section of your paper:
<blockquote>
Experiments presented in this paper were carried out using the Grid'5000
testbed, supported by a scientific interest group hosted by
Inria and including CNRS, RENATER and several Universities as well as
other organizations (see https://www.grid5000.fr).
</blockquote>
