Revision as of 12:51, 17 April 2019

About this document

How to add/correct an entry to the FAQ?

Just like any other page of this wiki, you can edit the FAQ yourself to improve it. Click one of the small "edit" links placed after each question to edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.

Publications and Grid'5000

Is there an official acknowledgement?

Yes, there is: you agreed to it when accepting the usage policy. As the policy might have been updated since, please refer to the latest version. You should use it in all publications presenting results obtained (even partially) using Grid'5000.

How to mention Grid'5000 in HAL?

HAL is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the Grid'5000 keyword, capitalized as such.

Accessing Grid'5000

How can I connect to Grid'5000?

This is documented at length in the Getting Started tutorial.

You should be able to access Grid'5000 from anywhere on the Internet, by connecting to access.grid5000.fr using SSH. You'll need properly configured SSH keys (please refer to the page dedicated to SSH if this is unfamiliar), as this machine does not allow you to log in using a password.

Some sites have an access.site.grid5000.fr machine, which is only reachable from an IP address belonging to the local laboratory.

How to connect from different workstations with the same account?

You can associate multiple public SSH keys with your account. To do so:

log in
go to User Portal > Manage Account
select the 'My account' tab
in the actions list, select «Edit Profil»
then paste the public SSH key on a new line just after the other(s).

More information about SSH and Public key authentication.

How to directly connect by SSH to any machine within Grid'5000 from my workstation?

This tip consists of customizing the SSH configuration file ~/.ssh/config (replace login with your Grid'5000 user name):

Host *.g5k
   User login
   ProxyCommand ssh login@access.grid5000.fr -W "`basename %h .g5k`:%p"

You are now able to connect to any machine using ssh machine.site.g5k
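To see what the ProxyCommand does with the host name, the basename step can be tried on its own. In the sketch below, node-1.site.g5k is a made-up host name; in the real configuration, SSH substitutes %h with the host you typed and %p with the port.

```shell
# What the ProxyCommand computes for a host given as "machine.site.g5k".
# "node-1.site.g5k" is a hypothetical host name used for illustration.
requested_host="node-1.site.g5k"

# basename with a suffix argument strips the trailing ".g5k", leaving
# the name that is resolvable from inside Grid'5000.
internal_host=$(basename "$requested_host" .g5k)

echo "$internal_host"   # prints: node-1.site
```

The access machine then forwards the connection to that internal name, which is why the .g5k suffix never needs to exist in DNS.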

Please have a look at the SSH page for a deeper understanding of this proxy feature.

Note: the Grid'5000 internal network uses private IP addresses, so its machines are not directly reachable from outside Grid'5000.

Is access to the Internet possible from nodes?

Since the end of 2015, full Internet access is allowed from the Grid'5000 network. For security reasons, connections are logged.

How can I connect to an HTTP or HTTPS service running on a node?

Network connections (TCP, UDP, ...) to any Grid'5000 node from outside Grid'5000 are possible using the Grid'5000 VPN. See also the SSH page (forwarding and SOCKS proxy).

However, if you only need to connect to a web interface or web API served by a node using HTTP or HTTPS, you can use the Grid'5000 reverse proxy. The base URL to use is https://mynode.mysite.http.proxy.grid5000.fr/ for HTTP and https://mynode.mysite.https.proxy.grid5000.fr/ for HTTPS.

If you can't run your service on port 80 or 443 (because you are not root, for example), you can use port 8080 or 8443 (you don't need to be root to bind them). You can then use https://mynode.mysite.http8080.proxy.grid5000.fr/ for HTTP and https://mynode.mysite.https8443.proxy.grid5000.fr/ for HTTPS.

Please note that the reverse proxy needs authentication using your Grid'5000 credentials.

	Note
	You may have to add an exception in your web browser to access the target web service, because of an SSL certificate mismatch.
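The URL patterns above can be assembled mechanically. The following is only a sketch: g5k_proxy_url is a hypothetical helper, not a Grid'5000 tool, and it handles only the four scheme and port combinations named in this section.

```shell
# Build the reverse proxy URL for a service running on a node.
# Arguments: node name, site name, scheme (http or https), optional port.
# g5k_proxy_url is a hypothetical helper written for this example.
g5k_proxy_url() {
  node=$1 site=$2 scheme=$3 port=${4:-}
  case "$scheme:$port" in
    http:|http:80)    label=http ;;
    https:|https:443) label=https ;;
    http:8080)        label=http8080 ;;
    https:8443)       label=https8443 ;;
    *) echo "unsupported scheme/port: $scheme $port" >&2; return 1 ;;
  esac
  # The proxy itself is always reached over HTTPS.
  printf 'https://%s.%s.%s.proxy.grid5000.fr/\n' "$node" "$site" "$label"
}

g5k_proxy_url mynode mysite http 8080
# prints: https://mynode.mysite.http8080.proxy.grid5000.fr/
```

The resulting URL can be opened in a browser or passed to an HTTP client, which will then be asked for your Grid'5000 credentials.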

Account management

I forgot my password, how can I retrieve it?

To retrieve your password, you can use this form, or ask your account manager to reset it.

My account expired, how can I extend it?

Use the account management interface (Manage account link in the sidebar).

Why doesn't my home directory contain the same files on every site?

Every site has its own file server, so it is the user's responsibility to synchronize personal data between home directories on the different sites. You may use the rsync command to synchronize a remote site's home directory with the local one (be careful: this will delete or overwrite any remote file that differs from the local home directory):

rsync -n --delete -avz ~/ frontend.site.grid5000.fr:~/

NB: remove the -n argument once you have reviewed the dry-run output and are sure you want to perform the actual synchronization.

How to get my home mounted on deployed nodes?

This is completely automatic if you deploy a *-nfs or *-big image. You can then connect using your own login and, once connected to the node, simply go to your home directory:

 cd /home/<your login>

How to restore a wrongly deleted file?

No backup facility is provided by the Grid'5000 platform. Please be careful when deleting files, and back up your data using external backup services.

What about disk quotas?

You'll find that for each account and each site, disk quotas may be activated.

The soft limit is set to what the admins find a reasonable limit for an account on a more or less permanent basis. You can use more disk space temporarily, but you should not try to trick the system into keeping that data on the shared file system.
The hard limit is set so as to preserve usability for other users if one of your scripts produces unexpected amounts of data. You will not be able to exceed that limit.

How to increase my disk quota limitation?

Should you need higher quotas, please visit your user account settings page at https://api.grid5000.fr/ui/account (my storage tab), or send an email to support-staff@lists.grid5000.fr, explaining how much storage space you want and what for.

How do I unsubscribe from the mailing list?

Users' mailing-list subscription is tied to your Grid'5000 account. You can configure your subscriptions in your account settings:

How to unsubscribe from the mailing list

Log in to https://api.grid5000.fr/ui/account
Go to the "My account" tab, then click on the "Actions" button, then choose "Manage mailing lists"

Alternate method, by configuring Sympa to stop receiving any email from the list (while still being subscribed):

If you haven't done it before, ask for a password on sympa.inria.fr using this form: https://sympa.inria.fr/sympa/firstpasswd/. Use the email address you used to register with Grid'5000.
Connect to https://sympa.inria.fr using the email address you used to register with Grid'5000 and your sympa.inria.fr password.
From the left panel, select users_grid5000. Then go to your subscriber options (Options d'abonné) and in the reception field (Mode de réception), select suspend (interrompre la réception des messages).

SSH related questions

How to avoid SSH host key checking?

With the StrictHostKeyChecking option, SSH host key checking can be turned off. This option can be set in the ~/.ssh/config file:

StrictHostKeyChecking no

Or it can be passed on the command line:

ssh -o StrictHostKeyChecking=no host

How to avoid getting tons of SSH errors about man-in-the-middle attacks while deploying images?

If you get the following error when you try to connect to a machine using ssh:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
6e:44:89:6d:ac:fc:d8:84:fd:2b:fb:22:e5:ba:5c:88.
Please contact your system administrator.

This is because SSH notices that the machine answering the connection is not the same from one run to the next. This is expected if you have just redeployed an image: it is no longer the same system that is answering.

Technically speaking, the file /etc/ssh/ssh_host_dsa_key.pub is likely to be different in your own deployed image and in the default image. SSH therefore raises an alarm, since such a replacement usually denotes that someone is intercepting the communication and pretending to be the server to get information from you.

If you do not want to deal with this issue, there are several solutions:

Add StrictHostKeyChecking=no to your .ssh/config file to tell SSH to ignore those errors.
Pass this option (StrictHostKeyChecking=no) on the command line to ssh (using -o).
Make sure that you have the same host_dsa_key in your own images as in the default ones. They can usually be found in the pre/post install scripts of your site.

Outside of Grid'5000 scope, the correct solution is to fix your ~/.ssh/known_hosts, either by hand or using the command ssh-keygen -R hostname.

Please have a look at the SSH page also.

What kind of public keys are supported on Grid'5000?

The only public key format allowed in Grid'5000 is the OpenSSH format.
You MUST provide and use SSH public keys in this format.

SSH2-like public key (NOT SUPPORTED):

---- BEGIN SSH2 PUBLIC KEY ----
Comment: rsa-key-20090623
AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2t
D/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5Mwn
qAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDV
edGCLoM=
---- END SSH2 PUBLIC KEY ----

OpenSSH-like public key:

ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2tD/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5MwnqAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDVedGCLoM=

To convert ssh public keys from SSH2 to OpenSSH, see this tutorial.
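Before uploading a key, you can check which of the two formats a file is in by looking at its first line. The sketch below is a hypothetical helper (pubkey_format is not an existing tool), and it only recognizes the two layouts shown above. With OpenSSH installed, ssh-keygen -i -f can usually perform the conversion itself, as noted in the closing comment.

```shell
# Report whether a public key file looks like the one-line OpenSSH
# format or the multi-line SSH2 format, based on its first line only.
# pubkey_format is a hypothetical helper written for this example.
pubkey_format() {
  first_line=$(head -n 1 "$1")
  case "$first_line" in
    "---- BEGIN SSH2 PUBLIC KEY ----"*) echo ssh2 ;;
    ssh-*\ *|ecdsa-sha2-*\ *)           echo openssh ;;
    *)                                  echo unknown ;;
  esac
}

# Possible conversion of an SSH2 key with OpenSSH's ssh-keygen
# (file names are placeholders):
#   ssh-keygen -i -f key_ssh2.pub > key_openssh.pub
```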

Software installation issues

What is the general philosophy?

This is how things should work: a basic set of software is installed on the frontends and in the nodes' standard environment of each site. If you need other software packages on nodes, you can create a Kadeploy image including them, and deploy it. You can also use sudo-g5k. If you think some software should be installed by default, you can contact the support-staff.

Deployment related issues

My environment does not work on all clusters

On rare occasions, an environment may not work on a given cluster:

The kernel used does not support all hardware. You are advised to base your environment on one of the reference environments to avoid dealing with this, or to carefully read the hardware section of each site to see the list of kernel drivers that need to be compiled in your environment for it to be able to boot on all clusters. Of course, when a new cluster is integrated, you might need to update your kernel for portability.
The post-installation scripts do not recognize your environment, and therefore network access, console access or site-specific configurations are not taken into account. You can check the contents of the default post-installation scripts to see the variables set by kadeploy, by looking at the environment's description using kaenv.

Kadeploy fails with Image file not found!

This means that kadeploy is not able to read your environment's main archive. This can have several causes, for example:

the registered filename is wrong
the extension is not right (for example .tar.gz does not work, whereas .tgz is OK)

Kadeploy is complaining about a node already involved in another deployment

The warning you see is

node $node is already involved in another deployment

This error occurs

when 2 concurrent deployments are attempted on the same node. If you have 2 simultaneous deployments, make sure you have 2 distinct sets of nodes.
when there is a problem in the kadeploy database: this can typically happen when a deployment ended abnormally. The best option is to wait for about 15 minutes and retry the deployment: kadeploy can correct its database automatically.

How to kill all my processes on a host?

On the currently connected host (warning: this will disconnect you):

kill -KILL -1

How do I exit from kaconsole on cluster X from site Y?

You can try the '&.' sequence (French keyboard), but this may not work on all clusters. The Kaconsole page may give you more information.

Why are the network interfaces named eth2,eth3...ethn in my deployed environment?

This is most likely due to the default udev rules on Debian-based systems, which allocate unique interface names to physical network devices. When you deploy an environment on another node, udev detects new physical network devices and allocates them the next available interface names, incrementing the number each time. Delete the corresponding rules in your environment to prevent udev from behaving this way:

node:

rm /etc/udev/rules.d/*persistent-net.rules

Job submission related issues

What is the so called "best-effort" mode of OAR?

The best-effort mode was implemented to back-fill the cluster with jobs considered less important, without blocking "regular" jobs. To submit jobs under that policy, you simply have to select the besteffort job type in your oarsub command.

oarsub -t besteffort script_to_launch

Jobs submitted that way will only get scheduled on processors when no other job uses them (any regular job overtakes besteffort jobs in the waiting queue, regardless of submission times). Moreover, these jobs are killed (as if oardel were called) when a recently submitted regular job needs the nodes used by a besteffort job.

By default, no checkpointing or automatic restart of besteffort jobs is provided: they are just killed. That is why this mode is best used with a tool that can detect the killed jobs and resubmit them. OAR2 does provide options for that, and you may also have a look at tools like CiGri.

How to pass arguments to my script?

When you do passive submission through oarsub, you must specify a script. This script can be a simple script name or a more complex command line with arguments.

To pass arguments, you have to quote the whole command line, like in the following example:

oarsub -l nodes=4,walltime=2 "/path/to/myscript arg1 arg2 arg3"

Note: to avoid random code injection, oarsub allows only alphanumeric characters ([a-zA-Z0-9_]), whitespace characters ([ \t\n\r\f\v]) and a few others ([/.-]) inside its command line argument.
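A quick local check can catch characters that oarsub would reject before you submit. The sketch below is not part of OAR: is_oarsub_safe is a hypothetical helper that only mirrors the character classes listed in the note above, and the actual check performed by oarsub may differ in details.

```shell
# Return 0 if the command line only contains characters from the sets
# listed in the note: [a-zA-Z0-9_], whitespace, and [/.-].
# is_oarsub_safe is a hypothetical helper, not an OAR command.
is_oarsub_safe() {
  # grep -q succeeds when at least one character outside the allowed
  # set is present, in which case the command line is rejected.
  if printf '%s' "$1" | LC_ALL=C grep -q '[^A-Za-z0-9_[:space:]/.-]'; then
    return 1
  fi
  return 0
}

cmd="/path/to/myscript arg1 arg2 arg3"
if is_oarsub_safe "$cmd"; then
  echo "ok: $cmd"
else
  echo "rejected: $cmd" >&2
fi
```

If your script needs arguments containing other characters (quotes, semicolons, dollar signs), one option is to wrap the call in a small launcher script and submit that script instead.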

Why are /core and -t deploy or -t allow_classic_ssh incompatible?

Jobs with type deploy or type allow_classic_ssh imply the exclusive usage of a node. Therefore, specifying core information for your submission can only lead to inconsistencies, and it is prohibited by an admission rule.

Why did my advance reservation start with fewer resources than I requested?

Since OAR 2.2.12, an advance reservation is validated regardless of whether the resources are in any of the following states:

alive
suspected
absent

(but not dead) at the time the reservation is required to start and during the planned walltime (because those states are transitional).

Moreover, resources allocated to an advance reservation are definitively fixed upon this validation, which means that if any of those resources becomes dead, absent or suspected after the validation, that resource won't be replaced.

At the start time of the advance reservation, OAR then looks for any unavailable resources (absent or suspected) and, if there are any, waits up to 5 minutes for them to return. This can happen in the following cases:

resources are in the absent state during the reboot after a kadeploy job, and then become alive again as soon as the boot completes
resources whose good health is suspected by OAR might be fixed by an admin or a maintenance tool operation

If resources are not back yet at the time the job actually starts, these resources are lost for the job, which then provides fewer resources than expected.

That is a price to pay for using advance reservation.

NB

Information about a reduced number of resources or a reduced walltime for a reservation due to this mechanism is available in the event part of the output of

oarstat -fj jobid

How can I check whether my reservations respect the Grid'5000 Usage Policy?

You can use the usagepolicycheck script, present on all frontends. Check whether your current reservations respect the Policy with usagepolicycheck -t, and use usagepolicycheck -h to see the other options.

Access to logs

OAR database logs

Grid'5000 gives all users read-only access to OAR's database. You should be able to connect using a PostgreSQL client as user oarreader with password read to database oar2 on all oardb.site.grid5000.fr hosts. This gives you access to the complete history of jobs on all Grid'5000 sites. Since this is read-only access to the production database of OAR, please be careful with your queries to avoid overloading the testbed!
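With the standard psql client, the connection parameters above map onto command-line options roughly as follows. This is a sketch: oardb_psql_command is a hypothetical helper that only prints the command, and the site name passed to it is a placeholder to replace with a real site.

```shell
# Print the psql command line for the OAR database of a given site.
# -h, -U and -d are the standard psql options for host, user and
# database; psql then prompts for the password, which is "read".
# oardb_psql_command is a hypothetical helper written for this example.
oardb_psql_command() {
  printf 'psql -h oardb.%s.grid5000.fr -U oarreader -d oar2\n' "$1"
}

oardb_psql_command site
# prints: psql -h oardb.site.grid5000.fr -U oarreader -d oar2
```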

	Note
	Careful: Grid'5000 is not a computation grid, an HPC center, or a Cloud (Grid'5000 is a research instrument). That means that the usage of Grid'5000 itself (OAR logs of Grid'5000 users' reservations) does not reflect a typical usage of any such infrastructure. It is therefore not relevant to analyze Grid'5000 OAR logs for that purpose. As a user, one can however use Grid'5000 to emulate an HPC cluster or cloud on reserved resources (in a job), possibly injecting a real load from a real infrastructure.

About OAR

How to know if a node is in energy saving mode or really absent?

Nodes in energy saving mode are displayed with the state "Absent (standby)" by the oarnodes command.
The state "Absent (standby)" means that the node is shut down in order to save energy.
Nodes in this state will be automatically started by OAR when they are needed.

Advanced users who query the OAR database directly can determine whether a node is in energy saving mode or absent using the "available_upto" field of the resources table.
If energy saving is enabled on the cluster, the "available_upto" field gives the date (Unix timestamp) until which the resource will be available.

An "Absent" node is in energy saving mode if the "available_upto" field is greater than the current Unix timestamp.

An example of SQL query listing absent nodes because of the energy saving mode:

SELECT distinct(network_address) FROM resources WHERE state='Absent' AND available_upto >= UNIX_TIMESTAMP()

An "Absent" node is really absent if:

- the "available_upto" field is equal to 0
- or the "available_upto" field is smaller than the current Unix timestamp (this case should not occur on Grid'5000)

An example of SQL query listing really absent nodes:

SELECT distinct(network_address) FROM resources WHERE state='Absent' AND (available_upto < UNIX_TIMESTAMP() OR available_upto = 0)
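The same decision can be written as a small function, which may help when post-processing values already extracted from the resources table. This is a sketch under assumptions: absent_kind is a hypothetical helper, it expects the available_upto value of a node whose state is already Absent, and it follows the >= comparison used in the first query.

```shell
# Classify an Absent node from its available_upto value.
# The second argument (current time as a Unix timestamp) defaults to
# now; it is a parameter only so the logic can be tried with fixed values.
# absent_kind is a hypothetical helper written for this example.
absent_kind() {
  available_upto=$1
  now=${2:-$(date +%s)}
  if [ "$available_upto" -ne 0 ] && [ "$available_upto" -ge "$now" ]; then
    echo "standby"         # energy saving mode
  else
    echo "really absent"   # 0, or a timestamp already in the past
  fi
}

absent_kind 0                       # prints: really absent
absent_kind 1555600000 1555500000   # prints: standby
```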

How to detect nodes in maintenance?

Nodes in maintenance are nodes in the Dead state, nodes with the OAR maintenance property set to YES, or nodes temporarily in a really absent state (see above).

How to execute jobs within another one?

This functionality is called container jobs. It makes it possible to execute jobs within another one, acting as a sub-scheduling mechanism.

First a job of the type container must be submitted, for example:

oarsub -I -t container -t cosystem -l nodes=10,walltime=2:10:00
...
OAR_JOB_ID=13542
...

Then it is possible to use the inner type to schedule the new jobs within the previously created container job:

oarsub -I -t inner=13542 -l nodes=7,walltime=0:30:00
oarsub -I -t inner=13542 -l nodes=3,walltime=0:45:00
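When scripting this, the container job id has to be picked out of the oarsub output before the inner jobs can be submitted. The sketch below assumes the container is submitted in a way that lets its output be captured (that is, not interactively) and that the output contains an OAR_JOB_ID=<number> line as shown above. extract_oar_job_id and ./inner_script are hypothetical names, and the sample text stands in for a real oarsub call.

```shell
# Read oarsub output on stdin and print the job id found on the
# OAR_JOB_ID=<number> line.
# extract_oar_job_id is a hypothetical helper written for this example.
extract_oar_job_id() {
  sed -n 's/^OAR_JOB_ID=\([0-9][0-9]*\)$/\1/p'
}

# Sample output standing in for a real container submission.
sample_output='...
OAR_JOB_ID=13542
...'

container_id=$(printf '%s\n' "$sample_output" | extract_oar_job_id)

# Show the inner submission that would follow (./inner_script is a placeholder).
echo "oarsub -t inner=$container_id -l nodes=7,walltime=0:30:00 ./inner_script"
```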

More information in Advanced_OAR#Container_jobs

How to use MPI in Grid'5000?

MPI options to use

It is preferable to use the following options to avoid warnings or errors:

INFINIBAND clusters: Parapluie, Parapide, Griffon, Graphene, Edel, Genepi