FAQ: Difference between revisions

From Grid5000
 
(80 intermediate revisions by 11 users not shown)
Line 4: Line 4:
== About this document ==
== About this document ==
=== How to add/correct an entry to the FAQ? ===
=== How to add/correct an entry to the FAQ? ===
Just like any other page of this wiki, you can edit the FAQ yourself to improve it. Clicking one of the small "edit" links placed after each question lets you edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.
{{Note|text=Just like any other page of this wiki, you can edit the FAQ yourself to improve it. Clicking one of the small "edit" links placed after each question lets you edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.}}
 


== Publications and Grid'5000 ==
== Publications and Grid'5000 ==
Line 13: Line 12:
=== How to mention Grid'5000 in HAL  ? ===
=== How to mention Grid'5000 in HAL  ? ===
[http://hal.inria.fr HAL] is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the '''Grid'5000''' keyword, capitalized as such.
[http://hal.inria.fr HAL] is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the '''Grid'5000''' keyword, capitalized as such.
== Accessing Grid'5000 ==
=== How can I connect to Grid'5000 ? ===
This is documented at length in the [[Getting Started]] tutorial.
You should be able to access Grid'5000 from anywhere on the Internet, by connecting to <code class="host">access.grid5000.fr</code> using SSH. You'll need SSH keys properly configured (please refer to [[SSH#SSH_Key_usage| the page dedicated to SSH]] if you don't understand these last words) as this machine will not allow you to log in using a password.
Some sites have an <code class="host">access.<em>site</em>.grid5000.fr</code> machine, which is only reachable from an IP address belonging to the local laboratory.
=== How to connect from different workstations with the same account? ===
You can associate multiple public SSH keys to your account. In order to do so, you have to:
* login
* go to  [https://api.grid5000.fr/ui/account User Portal > Manage Account]
* select the 'My account' tab
* in actions list select «Edit Profil»
* then, paste the public SSH key in a new line just after the other(s).
More information about [[SSH]] and [[Public key authentication]].
=== How to directly connect by SSH to any machine within Grid'5000 from my workstation? ===
This tip consists of customizing the SSH configuration file <code class="file">~/.ssh/config</code>.
Host *.g5k
    User <code class="replace">login</code>
    ProxyCommand <code class="command">ssh</code> <code class="replace">login</code>@access.grid5000.fr -W "`basename %h .g5k`:%p"
You are now able to connect to any machine using <code class="command">ssh</code> <code class="replace">machine.site</code><code>.g5k</code>
Please have a look at the [[SSH#Using_ssh_proxy_to_access_ssh_servers_.28ie_hosts.29_behind_a_firewall|SSH page]] for a deeper understanding of this proxy feature.
'''Note''': the Grid'5000 internal network uses private IP addresses, which are not directly reachable from outside of Grid'5000.
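In the ProxyCommand above, ssh replaces <code>%h</code> with the host name typed on the command line, and <code>basename</code> strips the <code>.g5k</code> suffix before the name is passed to the access machine. As a minimal sketch of that single step (the function name <code>strip_g5k_suffix</code> and the host name <code>node-1.mysite.g5k</code> are hypothetical, used for illustration only), it can be tried in a plain shell:

```shell
#!/bin/sh
# Hypothetical helper: mimics what `basename %h .g5k` does in the ProxyCommand,
# where ssh substitutes %h with the host name given on the command line.
strip_g5k_suffix() {
    basename "$1" .g5k
}

# Example with a made-up host name; prints the name without the .g5k suffix.
strip_g5k_suffix "node-1.mysite.g5k"
```

The suffix only exists on the workstation side: it selects the <code>Host *.g5k</code> block, and the remaining <code>machine.site</code> name is what gets resolved inside Grid'5000.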
=== Is access to the Internet possible from nodes? ===
Since the end of 2015, full Internet access is allowed from the Grid'5000 network. For security reasons, connections are logged.
=== How can I connect to an HTTP or HTTPS service running on a node? ===
Network connections (tcp, udp, ...) to any Grid'5000 node from outside Grid'5000 are possible using Grid'5000 [[VPN]].
See also the [[SSH]] page (forwarding and SOCKS proxy).
However, if you only need to connect to a web interface or web API served by a node using HTTP or HTTPS, you can use the Grid'5000 reverse proxy. The base URL to use is ''https://mynode.mysite.http.proxy.grid5000.fr/'' for HTTP and ''https://mynode.mysite.https.proxy.grid5000.fr/'' for HTTPS.
If you can't run your service on port 80 or 443 (because you are not root for example), you can use port 8080 or 8443 (you don't need to be root to bind them). You can then use ''https://mynode.mysite.http8080.proxy.grid5000.fr/'' for HTTP and ''https://mynode.mysite.https8443.proxy.grid5000.fr/'' for HTTPS.
Please note that the reverse proxy needs authentication using your Grid'5000 credentials.
{{Note|text=You may have to add an exception as required by your web browser to access the target web service, because of an SSL certificate mismatch.}}
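The reverse proxy URLs described above follow a regular pattern: node, site, then the scheme with an optional port appended. A minimal sketch of that pattern, assuming a hypothetical helper named <code>g5k_proxy_url</code> (it is not a Grid'5000 tool, and <code>mynode</code>/<code>mysite</code> are placeholders):

```shell
#!/bin/sh
# Hypothetical helper: builds the reverse proxy base URL described above.
# Arguments: node name, site name, scheme (http or https), optional port
# (8080 for http or 8443 for https, as mentioned in the text).
g5k_proxy_url() {
    node=$1
    site=$2
    scheme=$3
    port=${4:-}
    printf 'https://%s.%s.%s%s.proxy.grid5000.fr/\n' "$node" "$site" "$scheme" "$port"
}

g5k_proxy_url mynode mysite http        # service listening on port 80
g5k_proxy_url mynode mysite https 8443  # service listening on port 8443
```

Note that the URL always starts with <code>https://</code>: the scheme in the host name only describes how the proxy talks to the node.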


== Account management ==
== Account management ==
Line 74: Line 27:


=== How to get my home mounted on deployed nodes? ===
=== How to get my home mounted on deployed nodes? ===
This is completely automatic if you deploy a *-nfs or *-big image. You can then connect using your own login, and once connected into the node, just enter your home:
This is completely automatic if you deploy a *-nfs or *-big image (automount).
   cd /home/<your login>
* You can connect using your own username and should land in your home;
* If connecting as root, once connected to the node, just change directory to your home and it will be mounted:
   <code class="command">cd</code> /home/<code class="replace">username</code>
{{Note|text=However, the homes of other users cannot be mounted, for security reasons.}}


=== How to restore a wrongly deleted file? ===
=== How to restore a wrongly deleted file? ===
Line 81: Line 37:


=== What about disk quotas ? ===
=== What about disk quotas ? ===
You'll find that for each account and each site, disk quotas may be activated.
See the section about the <code class=file>/home</code> in the [[Storage#.2Fhome|Storage]] page.
* the soft limit is set to what the admins find a reasonable limit for an account on a more or less permanent basis. You can use more disk space temporarily, but you should not try and trick the system to keep that data on the shared file system.
* the hard limit is set so as to preserve usability for other users if one of your scripts produces unexpected amounts of data. You'll not be able to override that limit.
 
 
=== How to increase my disk quota limitation? ===
Should you need higher quotas, please visit your user account settings page at https://api.grid5000.fr/ui/account (my storage tab), or send an email to [mailto:support-staff@lists.grid5000.fr support-staff@lists.grid5000.fr], explaining how much storage space you want and what for.


=== How do I unsubscribe from the mailing-list ? ===
=== How do I unsubscribe from the mailing-list ? ===
Line 93: Line 43:
Users' mailing-list subscription is tied to your Grid'5000 account.  You can configure your subscriptions in your account settings:
Users' mailing-list subscription is tied to your Grid'5000 account.  You can configure your subscriptions in your account settings:


[[File:Grid5000-unsubscribe-mailing-list.png|thumb|How to unsubscribe from the mailing list]]  
[[File:Grid5000-unsubscribe-mailing-list.png|thumb|How to unsubscribe from the mailing list]]
* Log in to https://api.grid5000.fr/ui/account
* Log in to https://api.grid5000.fr/ui/account
* Go to the "My account" tab, then click on the "Actions" button, then choose "Manage mailing lists"
* Go to the "My account" tab, then click on the "Actions" button, then choose "Manage mailing lists"
Line 103: Line 53:
* From the left panel, select ''users_grid5000''. Then go to your subscriber options (''Options d'abonné'') and in the ''reception'' field (''Mode de réception''), select ''suspend'' (''interrompre la réception des messages'').
* From the left panel, select ''users_grid5000''. Then go to your subscriber options (''Options d'abonné'') and in the ''reception'' field (''Mode de réception''), select ''suspend'' (''interrompre la réception des messages'').


== SSH related questions ==
== Network access to/from Grid'5000 ==
=== How can I connect to Grid'5000 ? ===
This is documented at length in the [[Getting Started]] tutorial.


=== How to avoid SSH host key checking? ===
You should be able to access Grid'5000 from anywhere on the Internet, by connecting to <code class="host">access.grid5000.fr</code> using SSH. You'll need SSH keys properly configured (please refer to [[SSH#SSH_Key_usage| the page dedicated to SSH]] if you don't understand these last words) as this machine will not allow you to log in using a password.
With the <code>StrictHostKeyChecking</code> option, SSH host key checking can be turned off. This option can be set in the <code class="file">~/.ssh/config</code> file:
StrictHostKeyChecking no
Or it can be passed on the command line:
<code class="command">ssh</code> -o StrictHostKeyChecking=no <code class="replace">host</code>


=== How not to get tons of [[SSH]] errors about Man-in-the-middle attacks while deploying images ? ===
Some sites have an <code class="host">access.</code><code class="replace">site</code><code class="host">.grid5000.fr</code> machine, which is only reachable from an IP address belonging to the local laboratory (replace <code class="replace">site</code> with the actual site name).
If you get the following error when you try to connect a machine using <code class="command">ssh</code>:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!    @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
6e:44:89:6d:ac:fc:d8:84:fd:2b:fb:22:e5:ba:5c:88.
Please contact your system administrator.


This is because [[SSH]] gets worried by the fact that the machine answering the connection is not the same from run to run. This is to be expected if you just redeployed the image: it is no longer the same system that is answering.
=== How to connect from different workstations with the same account? ===
You can associate several public SSH keys to your account. In order to do so, you have to:
* login
* go to  [https://api.grid5000.fr/ui/account User Portal > Manage Account],
* select the ''My account'' top tab,
* select the ''SSH keys'' left tab,
* then, manage your keys:
** add a new public SSH key ;
** remove an old one.


Technically speaking, the file <code class="file">/etc/ssh/ssh_host_dsa_key.pub</code> is likely to be different in your own deployed image and in the default image. SSH thus raises an alarm, since such a replacement usually denotes that someone is intercepting the communication and pretending to be the server to get information from you.
More information in the [[SSH]] page and the [[Public key authentication]] page.


If you don't want to deal with these issues, there are several solutions:
=== How to directly connect by SSH to any machine within Grid'5000 from my workstation? ===
* Add <code>StrictHostKeyChecking=no</code> to your <code class="file">.ssh/config</code> file to tell [[SSH]] to ignore those errors.
This tip consists of customizing the SSH configuration file <code class="file">~/.ssh/config</code> (compatible with the OpenSSH ssh client)
* Pass this option (<code>StrictHostKeyChecking=no</code>) on the command line to ssh (using -o)
* Make sure that you have the same <code>host_dsa_key</code> in your own images as in the default ones. They can usually be found in the pre/post install scripts of your site.


Outside of Grid'5000 scope, the correct solution is to fix your <code class="file">~/.ssh/known_hosts</code>, either by hand or using the command <code class="command">ssh-keygen</code> -R <code class="replace">hostname</code>.
Host *.g5k
    User <code class="replace">login</code>
    ProxyCommand <code class="command">ssh</code> <code class="replace">login</code>@access.grid5000.fr -W "$(basename %h .g5k):%p"


Please have a look at the [[SSH]] page also.
You can then connect to any machine using <code class="command">ssh</code> <code class="replace">machine.site</code><code>.g5k</code>


=== What kind of public keys are supported on Grid'5000 ? ===
Please have a look at the '''[[SSH]]''' page for a deeper understanding and more information.
The only public key format allowed in Grid'5000 is the OpenSSH format.<br/> You '''MUST''' provide and use SSH public keys in this format.


* SSH2 like public key '''(NOT SUPPORTED)''' :
For users of ''powershell'' in ''Microsoft Windows'' which also comes with OpenSSH ssh client, mind adapting the configuration as the <code class="command">basename</code> command may not be available.
---- BEGIN SSH2 PUBLIC KEY ----
Comment: rsa-key-20090623
AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2t
D/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5Mwn
qAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDV
edGCLoM=
---- END SSH2 PUBLIC KEY ----


* OpenSSH like public key :
{{Note|text=The Grid'5000 internal network uses private IPv4 addresses, which are not directly reachable from outside of Grid'5000.}}
ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2tD/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5MwnqAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDVedGCLoM=


To convert ssh public keys from SSH2 to OpenSSH, see [http://burnz.wordpress.com/2007/12/14/ssh-convert-openssh-to-ssh2-and-vise-versa/ this tutorial].
=== Is access to the Internet possible from nodes? ===
 
Full Internet access is allowed from the Grid'5000 network.
== Software installation issues ==
=== What is the general philosophy ? ===
This is how things should work: a basic set of software is installed on the frontends and nodes' standard environment of each site. If you need some other software packages on nodes, you can create a Kadeploy image including then, and deploy it. You can also use sudo-g5k. If you think that software should be installed by default, you can contact the [[Support|support-staff]].


== Deployment related issues ==
All IPv4 communication is NATed, while with [[IPv6]] each node uses its own public IPv6 address.


=== My environment does not work on all clusters ===
{{Warning|text=For security reasons, all connections are logged.}}
On some rare occasions, an environment may not work on a given cluster:
# The kernel used does not support all hardware. You are advised to base your environment on one of the reference environments to avoid dealing with this, or to carefully read the hardware section of each site to see the list of kernel drivers that need to be compiled in your environment for it to be able to boot on all clusters. Of course, when a new cluster is integrated, you might need to update your kernel for portability.
# The post-installation scripts do not recognize your environment, and therefore network access, console access or site specific configurations are not taken into account. You can check the contents of the default post-installation scripts to see the variables set by kadeploy by looking at environment's description using kaenv.


=== Kadeploy fails with ''Image file not found!'' ===
=== What is the source address of outgoing traffic from Grid'5000 nodes to the Internet? ===
This means that <code>kadeploy</code> is not able to read your environment's main archive. This can have several causes, e.g.:
The outgoing IPv4 traffic from Grid'5000 nodes to the Internet is NATed. The public IPv4 addresses used as sources for the NATed packets are:
* registered filename is wrong
    194.254.60.35 (nr-lil-536.grid5000.fr)
* extension is not right (for example <code>.tar.gz</code> does not work, whereas <code>.tgz</code> is OK)
    194.254.60.13 (nr-sop-535.grid5000.fr)


=== Kadeploy is complaining about a node already involved in another deployment ===
=== How can I connect to an HTTP or HTTPS service running on a node? ===
The warning you see is
node $node is already involved in another deployment
This error occurs
* when 2 concurrent deployments are attempted on the same node. If you have 2 simultaneous deployments, make sure you have 2 distinct sets of nodes.
* when there is a problem in the kadeploy database: typically when a deployment ended in a strange way, this can happen. The best is to wait for about 15 minutes and retry the deployment: kadeploy can correct its database automatically.


=== How to kill all my processes on a host ? ===
See the [[HTTP/HTTPs_access]] page.
On the currently connected host (warning, it will disconnect you)
<code class="command">kill</code> -KILL -1


=== How do I exit from kaconsole on cluster X from site Y ===
=== How can I share files from Grid'5000 using HTTP? ===
You can try the '&.' sequence (French keyboard), but this may not work on all clusters. The [[Kaconsole#Escape_sequence_for_every_site|Kaconsole]] page may give you more information.
See the [[HTTP/HTTPs_access]] page.


=== Why are the network interfaces named eth2,eth3...eth''n'' in my deployed environment? ===
=== Could I access Grid'5000 nodes directly from the internet? ===
This is probably due to the default [http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html udev] rules on Debian-based systems, which allocate unique interface names to physical network devices. When you deploy an environment on another node, it will detect new physical network devices and allocate them the next available interface names, incrementing the number each time. Delete the appropriate rules in your environment to prevent udev from behaving this way:
For other protocols than [[SSH#Easing_SSH_connections_from_the_outside_to_Grid.275000|SSH]] and [[HTTP/HTTPs_access|HTTP/HTTPs]] which provide lighter specific solutions, see the [[VPN]] and [[Reconfigurable_Firewall]].


{{Term|location=node|cmd=<code class=command>rm</code> <code class=dir>/etc/udev/rules.d/</code><code class=file>*persistent-net.rules</code>}}
=== SSH related questions ===
See the [[SSH]] page.


== Job submission related issues ==
== Software installation issues ==
=== What is the general philosophy ? ===
This is how things should work: a basic set of software is installed on the frontends and nodes' standard environment of each site. If you need some other software packages on nodes, you can create a Kadeploy image including them, and deploy it. You can also use sudo-g5k. If you think that software should be installed by default, you can contact the [[Support|support-staff]].


=== What is the so called "best-effort" mode of OAR? ===
== Deployment related issues ==
See [[Advanced_Kadeploy#FAQ]].


The best-effort was implemented to back-fill the cluster with jobs considered as less important without blocking "regular" jobs.
== About resources reservations (jobs) ==
To submit jobs under that policy, you simply have to select the besteffort type of job in your oarsub command.


<code class="command">oarsub</code> <code class="replace">-t besteffort</code> script_to_launch
=== How can I check whether my reservations are respecting the Grid'5000 Usage Policy? ===
You can use the script <code>usagepolicycheck</code>, present on all frontends. Check whether your current reservations respect the Policy with <code>usagepolicycheck -t</code>; use <code>usagepolicycheck -h</code> to see the other options.


Jobs submitted that way will only get scheduled on processes when no other job uses them (any regular job  
To help respect the usage policy, it is possible to use <code>day</code> and <code>night</code> OAR job types to fit batch jobs inside day vs. night / week-end time frames. More details are available in the [[Advanced OAR#Restricting_jobs_to_daytime_or_night.2Fweek-end_time|Advanced OAR]] guide.
overtakes besteffort jobs in the waiting queue, regardless of submission times).  
Moreover, these jobs are killed (as if oardel were called) when a recently submitted regular job needs the nodes used by a besteffort job.


By default, no checkpointing or automatic restart of besteffort jobs is provided. They are just killed. That is why this mode
is best used with a tool which can detect the killed jobs and resubmit them. However OAR2 provides options for that. You may also have a look at tools like CiGri.


=== How to pass arguments to my script ===
== How can I execute a campaign of tasks within previously reserved resources? (or smaller job in a bigger job) ==
When you do passive submission through <code class="command">oarsub</code>, you must specify a script. This script can be a simple script name or a more complex command line with arguments.
This can be done either with OAR's ''container'' jobs, or with '''[[GNU Parallel]]''':
* If all jobs, container and inner, are from the same user, using '''[[GNU Parallel]]''' should be '''preferred'''.
* Container jobs are mostly relevant for tutorials or teaching labs, where jobs are created by a set of '''different users'''. More information in [[Advanced_OAR#Container_jobs]]


To pass arguments, you have to quote the whole command line, like in the following example:
== About checkpoint/restart support of job ==
<code class="command">oarsub</code> -l nodes=4,walltime=2 <code class="replace">"/path/to/myscript arg1 arg2 arg3"</code>


'''Note:''' to avoid arbitrary code injection, <code class="command">oarsub</code> allows only alphanumeric characters (<code>[a-zA-Z0-9_]</code>), whitespace characters (<code>[ \t\n\r\f\v]</code>) and a few others (<code>[/.-]</code>) inside its command line argument.
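The character restriction in the note above can be checked locally before submitting. The following is only a sketch under assumptions: the function name <code>is_oarsub_safe</code> is hypothetical, it is not part of OAR, and it merely mirrors the three character classes listed in the note rather than OAR's actual validation code.

```shell
#!/bin/sh
# Hypothetical pre-check (not part of OAR): succeeds only if the command line
# contains nothing but the characters listed in the note above.
is_oarsub_safe() {
    # Delete every allowed character; anything left over is not allowed.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d 'a-zA-Z0-9_ \t\n\r\f\v/.-')
    [ -z "$leftover" ]
}

if is_oarsub_safe "/path/to/myscript arg1 arg2 arg3"; then
    echo "accepted"
fi
if ! is_oarsub_safe "/path/to/myscript arg1; rm file"; then
    echo "rejected"
fi
```

If a script needs arguments containing other characters, one option is to wrap the call in a small launcher script and pass only its path to <code class="command">oarsub</code>.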
The Grid'5000 OAR service setup does not provide a seamless checkpoint/restart mechanism for jobs. While this is obviously a much-wanted feature, especially for long-running tasks that have to be split in order to fit in the platform usage policy, we think it is better to let users take care of it. Indeed, while some techniques exist, such as [https://criu.org/ CRIU], none seems satisfactory enough for a sustainable deployment in Grid'5000.


=== Why are /core and -t deploy or -t allow_classic_ssh incompatible ? ===
Note that OAR provides a [[Advanced_OAR#Using_the_checkpointing_trigger_mechanism|mechanism]] to trigger an application to checkpoint itself, and to get a checkpointed job resubmitted.
Jobs with type <code>deploy</code> or type <code>allow_classic_ssh</code> imply the exclusive usage of a node. Therefore, specifying core information for your submission can only lead to some inconsistencies. It is therefore prohibited by an admission rule.
 
=== Why did my advance reservation start with less than all the resources I requested ? ===
Since OAR 2.2.12, an advance reservation is validated regardless of the state of resources being either:
# ''alive''
# ''suspected''
# ''absent''
(but not ''dead'') at the time the reservation is required to start and during the planned walltime (because those states are transitional).
 
Moreover, resources allocated to an advance reservation are definitely fixed upon this validation, which means that if any of those resources becomes ''dead'', ''absent'' or ''suspected'' after the validation, that resource won't be replaced.
 
At the start time of the advance reservation, OAR looks for any unavailable resources (''absent'' or ''suspected'') and, if there are any, waits up to 5 minutes for them to return. This can happen in the following cases:
* resources are in the ''absent'' state during the reboot after a kadeploy job, and become ''alive'' again as soon as the boot completes
* resources whose good health is ''suspected'' by OAR might be fixed by an admin or a maintenance tool
If resources are not back by the time the job actually starts, they are lost for the job, which then provides fewer resources than expected.
 
That is a price to pay for using advance reservation.
 
;NB
Information about a reduced number of resources or a reduced walltime for a reservation due to this mechanism is available in the event part of the output of
<code class='command'>oarstat -fj </code><code class='replace'>jobid</code>
 
=== How can I check whether my reservations are respecting the Grid'5000 Usage Policy ? ===
You can use the script <code>usagepolicycheck</code>, present on all frontends. Check whether your current reservations respect the Policy with <code>usagepolicycheck -t</code>; use <code>usagepolicycheck -h</code> to see the other options.


== Access to logs ==
== Continuous Integration (CI) jobs ==
=== OAR database logs ===
Grid'5000 gives all users read-only access to OAR's database. You should be able to connect using a PostgreSQL client as user <code>oarreader</code> with password <code>read</code> to database <code>oar2</code> on all <code class=host>oardb.</code><code class=replace>site</code><code class=host>.grid5000.fr</code>. This gives you access to the complete history of jobs on all Grid'5000 sites. This is the production database of OAR: please be careful with your queries to avoid overloading the testbed!


{{Note|text=Careful: Grid'5000 is not a computation grid, nor an HPC center, nor a Cloud (Grid'5000 is a research instrument). That means that the usage of Grid'5000 by itself (OAR logs of Grid'5000 users' reservations) does not reflect a typical usage of any such infrastructure. It is therefore not relevant to analyze Grid'5000 OAR logs for that purpose. As a user, one can however use Grid'5000 to emulate an HPC cluster or cloud on reserved resources (in a job), possibly injecting a real load from a real infrastructure.}}
Running CI tasks on Grid'5000 is allowed, but special precautions must be taken:
* Inform the [[Support|support staff]] that you plan to use Grid'5000 for CI
* Use a dedicated user account (not your personal user account) that reflects your project's name, and make sure that the ''Professional status/Employee type'' is set to ''bot''.
* Remember that you remain responsible for the usage made by your project's bot account.
** Specifically, if you use GitHub, [https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks configure GitHub Actions to require approval before running workflows from external collaborators].


= About OAR =
Orchestrating such tasks can be done using the [[API|Grid'5000 REST API]], together with client libraries described on [[Grid5000:Software|Software]] and [[Experiment_scripting_tutorial]].
== How to know if a node is in energy saving mode or really absent ? ==


Nodes in energy saving mode are displayed with the state "Absent (standby)" by the oarnodes command.<br>
Several schemas are possible to run such tasks from GitLab (and manage credentials):
The state "Absent (standby)" means that the node is shut down in order to save energy. <br>
* Use an existing GitLab runner (such as GitLab's shared ones), store credentials in GitLab secrets, and create a job that will reserve resources as needed (typically using the Grid5000 API). See for example [https://gitlab.inria.fr/discovery/enoslib/-/blob/main/.gitlab-ci.yml?ref_type=heads#L45 test_invivo_g5k* in EnOSLib's .gitlab-ci.yml]
Nodes in this state will be automatically started by OAR when they are needed.
* Run your own GitLab runner on a Grid5000 frontend, as documented in the [https://gitlab.inria.fr/gitlabci_gallery/orchestration/supercomputer-oar GitLab CI gallery]. Credentials are stored in the home directory.
* Use a [[Persistent Virtual Machine]] to host your GitLab runner service. Credentials are stored in the virtual machine.


Advanced users who check directly the OAR database can determine if a node is in energy saving mode or  absent with the field "available_upto" in the resources table.<br>
== Maintenance on Grid'5000 ==
If energy saving is enabled on the cluster, the field "available_upto" provides a date (unix timestamp) until when the resource will be available.


* A node "Absent" is in energy saving mode if the field "available_upto" is greater than the current unix timestamp
A maintenance slot is planned every Thursday on Grid'5000.


An example of SQL query listing absent nodes because of the energy saving mode:
If a maintenance operation can impact users' jobs, we announce it on the mailing list users@lists.grid5000.fr.
SELECT distinct(network_address) FROM resources WHERE state="Absent" AND available_upto >= UNIX_TIMESTAMP()


* A node "Absent" is really absent if :
When a maintenance is announced, you can follow its progress on ''[https://www.grid5000.fr/status/ the platform's operation schedule]''
- the field "available_upto" is equal to 0<br>
- or the field "available_upto" is smaller than the current unix timestamp (this case should not occur upon Grid'5000)


An example of SQL query listing really absent nodes:
== How to use MPI in Grid5000? ==
SELECT distinct(network_address) FROM resources WHERE state='Absent' AND (available_upto < UNIX_TIMESTAMP() OR available_upto = 0)
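The two SQL queries above encode a simple decision rule on the <code>available_upto</code> field of an "Absent" node. A minimal sketch of that rule, assuming a hypothetical function name <code>classify_absent_node</code> and following the comparison used in the queries (<code>>=</code> for energy saving; zero or a past timestamp for really absent):

```shell
#!/bin/sh
# Hypothetical helper: classifies a node already known to be in the "Absent"
# state, using the same rule as the SQL queries above.
# Arguments: available_upto (unix timestamp), optional current time
# (defaults to now, mainly overridable for testing).
classify_absent_node() {
    available_upto=$1
    now=${2:-$(date +%s)}
    if [ "$available_upto" -ne 0 ] && [ "$available_upto" -ge "$now" ]; then
        echo "standby"   # energy saving mode: OAR can start the node on demand
    else
        echo "absent"    # really absent: available_upto is 0 or in the past
    fi
}

classify_absent_node 2000 1000   # available_upto in the future
classify_absent_node 0 1000      # available_upto equal to 0
```

The timestamps in the example calls are arbitrary small numbers chosen to keep the comparison easy to follow; real values come from the resources table.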


== How to detect nodes in maintenance ? ==
See [[Run_MPI_On_Grid'5000|The MPI Tutorial]].
Nodes in maintenance are nodes with a ''Dead'' state, an OAR ''maintenance'' property set to ''YES'', or temporarily in a really absent state (see above).


== How to execute jobs within another one ? ==
== How to share data with other users in Grid5000? ==
This functionality is named ''container jobs''.
With this functionality it is possible to execute jobs within another one. So it is like a sub-scheduling mechanism.


* First a job of the type container must be submitted, for example:
See [[Storage]].
oarsub -I -t container -t cosystem -l nodes=10,walltime=2:10:00
...
OAR_JOB_ID=13542
...


* Then it is possible to use the ''inner'' type to schedule the new jobs within the previously created container job:
== How do I access other scientific infrastructures from Grid'5000 ? ==
oarsub -I -t inner=13542 -l nodes=7,walltime=0:30:00
oarsub -I -t inner=13542 -l nodes=3,walltime=0:45:00


More information in [[Advanced_OAR#Container_jobs]]
=== Jean Zay supercomputer (and possibly other GENCI supercomputers) ===


= How to use MPI in Grid5000? =
If you have an account on the Jean Zay supercomputer operated by the Institute for Development and Resources in Intensive Scientific Computing (IDRIS), it is possible to connect directly to it using ssh/scp/sftp from Grid'5000 frontends or reserved nodes.


See also : [[Run_MPI_On_Grid'5000|The MPI Tutorial]]
For this to be effective, you must add the Grid'5000 outgoing SSH IP addresses to the list of the IP addresses bound to your Jean Zay account.


These addresses are:


== MPI options to use ==
* 194.254.60.35 (nr-lil-536.grid5000.fr)
It is preferable to use the following options to avoid warnings or errors:<br>
* 194.254.60.13 (nr-sop-535.grid5000.fr)


* INFINIBAND clusters: Parapluie, Parapide, Griffon, Graphene, Edel, Genepi
The procedure is the following:


{{Term|location=node|cmd=<code class="command"> mpirun --mca orte_rsh_agent "oarsh" --mca btl openib,sm,self --mca pml ^cm -machinefile $OAR_NODEFILE  $HOME/mpi_programm</code>}}
* First download from the IDRIS website the form required to manage your account
** English: http://www.idris.fr/media/eng/forms/fgc-eng.pdf
** French: http://www.idris.fr/media/data/formulaires/fgc.pdf


* For other clusters, you may use the following options:  
* Then fill in the required sections:
** --mca pml ob1 --mca btl tcp,self
** "Add, modify or delete machines"
** --mca btl ^openib
*** add the IP addresses and names of the two machines listed above
** --mca btl ^mx
*** both you and your associated security manager must sign this part of the form
** "Complete this box only if the machines are under the responsibility of an organisation or a department which is not the demander’s organisation"
*** Organisation hosting the machines: '''GIS Grid'5000'''
*** Laboratory unit number (if CNRS) or acronym: '''Grid'5000'''
*** Address: '''https://www.grid5000.fr'''
*** Telephone: leave this field blank
*** Last name, first name and qualification/function of the site manager: '''Guillaume Schreiner, Technical Director'''
*** Professional e-mail address: '''support-staff@lists.grid5000.fr'''
*** Telephone: leave this field blank
* Send us your request by mail at '''support-staff@lists.grid5000.fr''':
** Subject: ''Request to connect to Jean Zay supercomputer from Grid'5000''
** Attached: the above form filled and signed (PDF)
** Body of the mail (example): ''Hello, could you please sign the attached form because I need it to access to Jean Zay from Grid'5000 ? Best regards. YOU.''
* We will send you back the form with our signature, and you will have to send the form to '''gestutil@idris.fr''' (this will take roughly a day for this to be effective)

Latest revision as of 20:42, 27 March 2024

About this document

How to add/correct an entry to the FAQ?

Note:

Just like any other page of this wiki, you can edit the FAQ yourself to improve it. If you click on one of the little "edit" placed after each question, you'll get the possibility to edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.

Publications and Grid'5000

Is there an official acknowledgement ?

Yes there is: you agreed to it when accepting the usage policy. As the policy might have been updated since, please refer to the latest version. You should use it on all publications presenting results obtained (even partially) using Grid'5000.

How to mention Grid'5000 in HAL  ?

HAL is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of submission form, with the Grid'5000 keyword, capitalized as such.

Account management

I forgot my password, how can I retrieve it ?

To retrieve your password, you can use this form, or ask your account manager to reset it.

My account expired, how can I extend it?

Use the account management interface (Manage account link in the sidebar).

Why doesn't my home directory contain the same files on every site?

Every site has its own file server, so it is the user's responsibility to synchronize personal data between their home directories on the different sites. You may use the rsync command to synchronize the home directory of a remote site with the local one (be careful: this will erase any remote file that is not the same as in the local home directory). The trailing slash on the source makes rsync copy the contents of your home directory rather than create a nested directory on the remote side:

rsync -n --delete -avz ~/ frontend.site.grid5000.fr:~/

NB: the -n argument makes rsync perform a dry run only. Remove it once you are sure the listed changes are the ones you want.

How to get my home mounted on deployed nodes?

This is completely automatic if you deploy a *-nfs or *-big image (automount).

  • You can connect using your own username and should land in your home;
  • If connecting as root, once connected to the node, just change directory to your home and it will be mounted:
 cd /home/username
Note:

The home directories of other users cannot be mounted, for security reasons.

How to restore a wrongly deleted file?

No backup facility is provided by the Grid'5000 platform. Please watch your fingers, and back up your data using external backup services.

What about disk quotas ?

See the section about the /home in the Storage page.

How do I unsubscribe from the mailing-list ?

Users' mailing-list subscription is tied to your Grid'5000 account. You can configure your subscriptions in your account settings:

[Figure: How to unsubscribe from the mailing list]

Alternative method: configure Sympa to stop receiving any email from the list (while still being subscribed):

  • If you haven't done it before, ask for a password on sympa.inria.fr from this form: https://sympa.inria.fr/sympa/firstpasswd/. Use the email address you used to register to Grid'5000.
  • Connect to https://sympa.inria.fr using the email address you used to register to Grid'5000 and your sympa.inria.fr password.
  • From the left panel, select users_grid5000. Then go to your subscriber options (Options d'abonné) and in the reception field (Mode de réception), select suspend (interrompre la réception des messages).

Network access to/from Grid'5000

How can I connect to Grid'5000 ?

This is documented at length in the Getting Started tutorial.

You should be able to access Grid'5000 from anywhere on the Internet, by connecting to access.grid5000.fr using SSH. You'll need SSH keys properly configured (please refer to the page dedicated to SSH if you don't understand these last words) as this machine will not allow you to log in using a password.

Some sites have an access.site.grid5000.fr machine, which is only reachable from an IP address belonging to the local laboratory (replace site with the actual site name).

How to connect from different workstations with the same account?

You can associate several public SSH keys to your account. In order to do so, you have to:

  • log in,
  • go to User Portal > Manage Account,
  • select the My account top tab,
  • select the SSH keys left tab,
  • then, manage your keys:
    • add a new public SSH key;
    • remove an old one.

More information in the SSH page and the Public key authentication page.

How to directly connect by SSH to any machine within Grid'5000 from my workstation?

This tip consists of customizing SSH configuration file ~/.ssh/config (compatible with OpenSSH ssh client)

Host *.g5k
   User login
   ProxyCommand ssh login@access.grid5000.fr -W "$(basename %h .g5k):%p"

You can then connect to any machine using ssh machine.site.g5k

Please have a look at the SSH page for a deeper understanding and more information.

Users of PowerShell on Microsoft Windows, which also comes with the OpenSSH ssh client, should adapt this configuration, as the basename command may not be available there.
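To see what the ProxyCommand does with the host name, the suffix stripping can be tried on its own in a POSIX shell. In this sketch, node-1.site.g5k is a hypothetical host name and both helper functions are named for this illustration only. The second one gets the same result with shell parameter expansion, which avoids calling basename.

```shell
# Strip the .g5k suffix exactly as the ProxyCommand does for %h.
strip_g5k() {
    basename "$1" .g5k
}

# Same result using POSIX parameter expansion instead of basename.
strip_g5k_noexec() {
    printf '%s\n' "${1%.g5k}"
}

strip_g5k node-1.site.g5k          # prints node-1.site
strip_g5k_noexec node-1.site.g5k   # prints node-1.site
```

The resulting name (machine.site) is what the access machine resolves inside Grid'5000. The .g5k suffix only exists on your workstation, to select this Host block.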

Note:

The Grid'5000 internal network uses private IPv4 addresses, which are not directly reachable from outside Grid'5000.

Is access to the Internet possible from nodes?

Full access to the Internet is allowed from the Grid'5000 network.

All IPv4 communication is NATed, while with IPv6 each node uses its own public IPv6 address.

Warning:

For security reasons, all connections are logged.

What is the source address of outgoing traffic from Grid'5000 nodes to the Internet?

The IPv4 outgoing traffic from Grid'5000 nodes to the Internet is NATed. The public IPv4 addresses used as sources for the NATed packets are:

   194.254.60.35 (nr-lil-536.grid5000.fr)
   194.254.60.13 (nr-sop-535.grid5000.fr)

How can I connect to an HTTP or HTTPS service running on a node?

See the HTTP/HTTPs_access page.

How can I share files from Grid'5000 using HTTP?

See the HTTP/HTTPs_access page.

Could I access Grid'5000 nodes directly from the internet?

SSH and HTTP/HTTPS have their own lighter, dedicated solutions. For other protocols, see the VPN and Reconfigurable_Firewall pages.

SSH related questions

See the SSH page.

Software installation issues

What is the general philosophy ?

This is how things should work: a basic set of software is installed on the frontends and in the nodes' standard environment at each site. If you need other software packages on nodes, you can create a Kadeploy image that includes them and deploy it. You can also use sudo-g5k. If you think some software should be installed by default, you can contact the support-staff.

Deployment related issues

See Advanced_Kadeploy#FAQ.

About resources reservations (jobs)

How can I check whether my reservations are respecting the Grid'5000 Usage Policy?

You can use the usagepolicycheck script, available on all frontends. Run usagepolicycheck -t to check whether your current reservations respect the policy, and usagepolicycheck -h to see the other options.

To help you respect the usage policy, you can use the day and night OAR job types to fit batch jobs inside day vs. night/week-end time frames. More details are available in the Advanced OAR guide.


How can I execute a campaign of tasks within previously reserved resources? (or smaller job in a bigger job)

This can be done either with OAR's container jobs, or with GNU Parallel:

  • If all jobs, container and inner, are from the same user, GNU Parallel should be preferred.
  • Container jobs are mostly relevant for tutorials or teaching labs, where jobs are created by a set of different users. More information in Advanced_OAR#Container_jobs

About checkpoint/restart support of job

The Grid'5000 OAR service setup does not provide a seamless checkpoint/restart mechanism for jobs. This is obviously a much-wanted feature, especially for long-running tasks that have to be split in order to fit in the platform usage policy, but we think it is better to let users take care of it themselves. Indeed, while some techniques exist, such as CRIU, none seems satisfactory enough for a sustainable deployment in Grid'5000.

Note that OAR provides a mechanism to trigger an application to checkpoint itself, and to get a checkpointed job resubmitted.
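As a minimal sketch of the application side of such a mechanism, a job script could save its progress when it is signaled and resume from the saved point when it runs again. Everything specific here is an assumption made for the illustration and should be checked against the OAR documentation: the use of SIGUSR2 as the checkpoint signal, the exit code 99, the state file path and the function names.

```shell
# Hypothetical location of the saved progress; override with STATE_FILE.
STATE_FILE="${STATE_FILE:-./job_step.state}"

# Print the saved step number, or 0 if nothing was saved yet.
load_step() {
    if [ -s "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo 0
    fi
}

# Record the given step number as the point to resume from.
save_step() {
    echo "$1" > "$STATE_FILE"
}

# Run steps up to $1, resuming from the saved step if there is one.
run_job() {
    step=$(load_step)
    # Assumed checkpoint signal: save progress, then exit with a distinct code.
    trap 'save_step "$step"; exit 99' USR2
    while [ "$step" -lt "$1" ]; do
        step=$((step + 1))   # placeholder for one unit of real work
    done
    rm -f "$STATE_FILE"      # finished: nothing left to resume
}
```

A real job would call run_job with its own step count and keep the state file on storage that survives the job, such as the home directory, so that a resubmitted job can find it.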

Continuous Integration (CI) jobs

Running CI tasks on Grid'5000 is allowed, but special precautions must be taken:

Orchestrating such tasks can be done using the Grid'5000 REST API, together with client libraries described on Software and Experiment_scripting_tutorial.

Several schemas are possible to run such tasks from GitLab (and manage credentials):

  • Use an existing GitLab runner (such as GitLab's shared ones), store credentials in GitLab secrets, and create a job that will reserve resources as needed (typically using the Grid5000 API). See for example test_invivo_g5k* in EnOSLib's .gitlab-ci.yml
  • Run your own GitLab runner on a Grid5000 frontend, as documented in the GitLab CI gallery. Credentials are stored in the home directory.
  • Use a Persistent Virtual Machine to host your GitLab runner service. Credentials are stored in the virtual machine.

Maintenance on Grid'5000

A maintenance slot is planned every Thursday on Grid'5000.

If a maintenance operation can impact users' jobs, we announce it on the mailing list users@lists.grid5000.fr.

When a maintenance operation is announced, you can follow its progress on the platform's operation schedule.

How to use MPI in Grid5000?

See The MPI Tutorial.

How to share data with other users in Grid5000?

See Storage.

How do I access other scientific infrastructures from Grid'5000 ?

Jean Zay supercomputer (and possibly other GENCI supercomputers)

If you have an account on the Jean Zay supercomputer operated by the Institute for Development and Resources in Intensive Scientific Computing (IDRIS), it is possible to connect directly to it using ssh/scp/sftp from Grid'5000 frontends or reserved nodes.

For this to be effective, you must add the Grid'5000 SSH outgoing IP addresses to the list of the IP addresses bound to your Jean Zay account.

These addresses are:

  • 194.254.60.35 (nr-lil-536.grid5000.fr)
  • 194.254.60.13 (nr-sop-535.grid5000.fr)

The procedure is the following:

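  • First download from the IDRIS website the form required to manage your account:
    • English: http://www.idris.fr/media/eng/forms/fgc-eng.pdf
    • French: http://www.idris.fr/media/data/formulaires/fgc.pdf
  • Then fill in the required sections: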
    • "Add, modify or delete machines"
      • add the IP address and name of each of the two machines listed above
      • both you and your associated security manager must sign this part of the form
    • "Complete this box only if the machines are under the responsibility of an organisation or a department which is not the demander’s organisation"
      • Organisation hosting the machines: GIS Grid'5000
      • Laboratory unit number (if CNRS) or acronym: Grid'5000
      • Address: https://www.grid5000.fr
      • Telephone: leave this field blank
      • Last name, first name and qualification/function of the site manager: Guillaume Schreiner, Technical Director
      • Professional e-mail address: support-staff@lists.grid5000.fr
      • Telephone: leave this field blank
  • Send us your request by e-mail at support-staff@lists.grid5000.fr:
    • Subject: Request to connect to Jean Zay supercomputer from Grid'5000
    • Attached: the above form, filled in and signed (PDF)
    • Body of the mail (example): Hello, could you please sign the attached form? I need it to access Jean Zay from Grid'5000. Best regards. YOU.
  • We will send the form back to you with our signature. You then have to send it to gestutil@idris.fr (it takes roughly a day for the change to become effective).