FAQ

About this document

How to add/correct an entry to the FAQ?

Just like any other page of this wiki, you can edit the FAQ yourself to improve it. Clicking one of the small "edit" links placed after each question lets you edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.


Publications and Grid'5000

Is there an official acknowledgement?

Yes, there is: you agreed to it when accepting the User Charter. As the charter may have been updated since then, please refer to the latest version. You should use the acknowledgement in all publications presenting results obtained (even partially) using Grid'5000.

How to mention Grid'5000 in HAL?

HAL is an open archive that you are invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the keyword Grid'5000, capitalized as such.

Accessing Grid'5000

What is the theory?

You should be able to access Grid'5000 from anywhere on the Internet by connecting to access.grid5000.fr using SSH. You'll need properly configured SSH keys (please refer to the page dedicated to SSH if this is unfamiliar), as this machine will not allow you to log in using a password.

Some sites have an access.site.grid5000.fr machine, which is only reachable from IP addresses belonging to the local laboratory.

How to connect from different workstations with the same account?

You can associate multiple public SSH keys with your account. To do so:

  • log in
  • go to User Portal > Manage Account
  • select the 'My account' tab
  • in the actions list, select «Edit Profil»
  • then paste the public SSH key on a new line, just after the existing one(s).

More information about SSH and Public key authentication.

How to directly connect by SSH to any machine within Grid'5000 from my workstation?

The trick is to customize your SSH configuration file, ~/.ssh/config:

Host *.g5k
   User login
   ProxyCommand ssh login@access.grid5000.fr -W "`basename %h .g5k`:%p"

You are now able to connect to any machine using ssh machine.site.g5k

Please have a look at the SSH page for a deeper understanding of this proxy feature.

Note: the Grid'5000 internal network uses private IP addresses, so its machines are not directly reachable from outside Grid'5000.
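
To see which target the ProxyCommand above asks the access machine to forward to, the host rewriting can be tried on its own. This is only a minimal sketch: the function name g5k_target, the default port of 22 and the host name node-1.site.g5k are assumptions made for illustration; only the basename call comes from the configuration above.

```shell
# Hypothetical helper mirroring the `basename %h .g5k` call of the ProxyCommand:
# it strips the .g5k suffix to recover the internal host name.
g5k_target() {
    # $1: host name as typed on the command line (SSH's %h)
    # $2: port (SSH's %p), assumed here to default to 22
    printf '%s:%s\n' "$(basename "$1" .g5k)" "${2:-22}"
}

# Illustrative host name, not a real node.
g5k_target node-1.site.g5k
```

The resulting host:port pair is what ssh -W receives, so the internal name is resolved from the access machine rather than from your workstation.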

Is access to the Internet possible from nodes?

Since the end of 2015, full Internet access is allowed from the Grid'5000 network. For security reasons, connections are logged.

How can I connect to an HTTP or HTTPS service running on a node?

You can connect to any node from outside Grid'5000 using the VPN. If you only need HTTP or HTTPS (and have set up a server on your node), the service can be reached at https://mynode.mysite.proxy-http.grid5000.fr/ for HTTP and at https://mynode.mysite.proxy-https.grid5000.fr/ for HTTPS.
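
The naming scheme of the two proxy addresses can be sketched as a small shell function. The function name g5k_proxy_url and its argument order are hypothetical; the URL pattern itself is the one given above.

```shell
# Hypothetical helper: builds the proxy URL for a service running on a node.
# Usage: g5k_proxy_url NODE SITE [http|https]
g5k_proxy_url() {
    node=$1
    site=$2
    scheme=${3:-http}   # protocol spoken by the service on the node
    case $scheme in
        http|https) ;;
        *) echo "unsupported scheme: $scheme" >&2; return 1 ;;
    esac
    # Both proxies are themselves reached over HTTPS.
    printf 'https://%s.%s.proxy-%s.grid5000.fr/\n' "$node" "$site" "$scheme"
}

g5k_proxy_url mynode mysite
g5k_proxy_url mynode mysite https
```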

Account management

I forgot my password, how can I retrieve it?

To retrieve your password, you can use this form, or ask your account manager to reset it.

Why does my home directory not contain the same files on every site?

Every site has its own file server, so it is the user's responsibility to synchronize personal data between their home directories on the different sites. You may use the rsync command to synchronize the home directory of a remote site (be careful: this will delete or overwrite any remote file that differs from the local home directory):

rsync -n --delete -avz ~/ frontend.site.grid5000.fr:~/

NB: remove the -n argument once you are sure you no longer want a dry run only.

How to get my home mounted on deployed nodes?

This is completely automatic if you deploy a *-nfs or *-big image. You can then connect using your own login and, once connected to the node, simply enter your home directory:

 cd /home/<your login>

How to restore a wrongly deleted file?

No backup facility is provided by the Grid'5000 platform. Please watch your fingers and back up your data using external backup services.

What about disk quotas?

Disk quotas may be activated for each account on each site.

  • the soft limit is set to what the admins find a reasonable limit for an account on a more or less permanent basis. You can use more disk space temporarily, but you should not try to trick the system into keeping that data on the shared file system.
  • the hard limit is set so as to preserve usability for other users if one of your scripts produces unexpected amounts of data. You will not be able to exceed that limit.

More information is available in the quotas page.

How to increase my disk quota?

Should you need higher quotas, please visit your user account settings page at https://api.grid5000.fr/ui/account (my storage tab), or send an email to support-staff@lists.grid5000.fr, explaining how much storage space you want and what for.

SSH related questions

How to fetch all the SSH host keys of one site?

To avoid answering 'yes' when connecting with SSH for the first time to hosts, the ~/.ssh/known_hosts file can be automatically generated for one site:

nodelist site | ssh-keyscan -t dsa,rsa -f - >> ~/.ssh/known_hosts

Please have a look at "How to get a site list of nodes?", for information on the nodelist command.

How to avoid SSH host key checking?

With the StrictHostKeyChecking option, SSH host key checking can be turned off. This option can be set in the ~/.ssh/config file:

StrictHostKeyChecking no

Or it can be passed on the command line:

ssh -o StrictHostKeyChecking=no host

How not to get tons of SSH errors about man-in-the-middle attacks while deploying images?

If you get the following error when you try to connect to a machine using ssh:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
6e:44:89:6d:ac:fc:d8:84:fd:2b:fb:22:e5:ba:5c:88.
Please contact your system administrator.

This is because SSH gets worried when the machine answering the connection is not the same from one run to the next. This is perfectly logical if you have just redeployed the image, since it is no longer the same system that is answering.

Technically speaking, the file /etc/ssh/ssh_host_dsa_key.pub is likely to differ between your own deployed image and the default image. SSH will thus freak out, since such a replacement usually denotes that someone is intercepting the communication and pretending to be the server in order to get information from you.

If you don't want to deal with this issue, there are several solutions:

  • Add StrictHostKeyChecking=no to your .ssh/config file to tell SSH to ignore those errors.
  • Pass this option (StrictHostKeyChecking=no) on the command line to ssh (using -o)
  • Make sure that you have the same host_dsa_key in your own images as in the default ones. They can usually be found in the pre/post install scripts of your site.

Outside of Grid'5000 scope, the correct solution is to fix your ~/.ssh/known_hosts, either by hand or using the command ssh-keygen -R hostname.

Please have a look at the SSH page also.

What kind of public keys are supported on Grid'5000?

The only public key format allowed on Grid'5000 is the OpenSSH format.
You MUST provide and use SSH public keys in this format.

  • SSH2-style public key (NOT SUPPORTED):
---- BEGIN SSH2 PUBLIC KEY ----
Comment: rsa-key-20090623
AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2t
D/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5Mwn
qAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDV
edGCLoM=
---- END SSH2 PUBLIC KEY ----
  • OpenSSH-style public key:
ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2tD/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5MwnqAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDVedGCLoM=

To convert ssh public keys from SSH2 to OpenSSH, see this tutorial.

Experiments issues

Why and how to fill in an experiment report

You should document the experiments you are running in your user report. This helps the team in charge of the instrument to document the interest of the platform.

Software installation issues

What is the general philosophy?

This is how things should work: a basic set of software is installed on the frontends and in the nodes' standard environment at each site. If you need another software package, you should create a Kadeploy image that includes it, as you have root access on deployed images. You may also request that a package be included by contacting Support; your request will be reviewed.


Deployment related issues

Deployments seem to fail for unknown reasons

From time to time, kadeploy may report errors during a deployment. Don't hesitate to report such problems to Support.

My environment does not work on all clusters

On rare occasions, an environment may not work on a given cluster:

  1. The kernel used does not support all hardware. You are advised to base your environment on one of the reference environments to avoid dealing with this, or to carefully read the hardware section of each site to see the list of kernel drivers that need to be compiled in your environment for it to be able to boot on all clusters. Of course, when a new cluster is integrated, you might need to update your kernel for portability.
  2. The post-installation scripts do not recognize your environment, and therefore network access, console access or site-specific configurations are not taken into account. You can check the contents of the default post-installation scripts to see the variables set by kadeploy; the scripts are listed in the environment's description, which you can display using kaenv.

Kadeploy fails with Image file not found!

This means that kadeploy is not able to read your environment's main archive. This can have several causes, for example:

  • registered filename is wrong
  • extension is not right (for example .tar.gz does not work, whereas .tgz is OK)

Kadeploy is complaining about a node already involved in another deployment

The warning you see is

node $node is already involved in another deployment

This error occurs

  • when 2 concurrent deployments are attempted on the same node. If you have 2 simultaneous deployments, make sure they use 2 distinct sets of nodes.
  • when there is a problem in the kadeploy database, typically after a deployment ended abnormally. The best option is to wait about 15 minutes and retry the deployment: kadeploy can correct its database automatically.

How to quickly check the health of nodes

You can check the health of nodes, based on ICMP requests, with the nmap command if it is available on a site:

nodelist site | nmap -iL - -sP

Or with the fping command, if it is available:

nodelist site | fping -a 2> /dev/null

How to kill all my processes on a host?

On the host you are currently connected to (warning: this will disconnect you):

kill -KILL -1

How do I exit from kaconsole on cluster X from site Y?

You can try the '&.' sequence (French keyboard), but this may not work on all clusters. The Kaconsole page may give you more information.

Why are the network interfaces named eth2,eth3...ethn in my deployed environment?

This is most likely due to the default udev rules on Debian-based systems, which allocate unique interface names to physical network devices. When you deploy an environment on another node, udev detects new physical network devices and allocates them the next available interface names, incrementing the number each time. Delete the corresponding rules in your environment to prevent udev from behaving this way:

On the node:
rm /etc/udev/rules.d/*persistent-net.rules

Job submission related issues

What is the so called "best-effort" mode of OAR?

The best-effort mode was implemented to back-fill the cluster with jobs considered less important, without blocking "regular" jobs. To submit jobs under that policy, simply select the besteffort job type in your oarsub command.

oarsub -t besteffort script_to_launch

Jobs submitted that way only get scheduled on resources when no other job uses them (any regular job overtakes besteffort jobs in the waiting queue, regardless of submission times). Moreover, these jobs are killed (as if oardel were called) when a newly submitted regular job needs the nodes used by a besteffort job.

By default, no checkpointing or automatic restart of besteffort jobs is provided: they are just killed. That is why this mode is best used with a tool that can detect the killed jobs and resubmit them. OAR2 does provide options for that, however. You may also have a look at tools like CiGri (see Multi-parametric experiments with CiGri) or APST.

How to pass arguments to my script

When you do passive submission through oarsub, you must specify a script. This script can be a simple script name or a more complex command line with arguments.

To pass arguments, you have to quote the whole command line, like in the following example:

oarsub -l nodes=4,walltime=2 "/path/to/myscript arg1 arg2 arg3"

Note: to avoid code injection, oarsub allows only alphanumeric characters ([a-zA-Z0-9_]), whitespace characters ([ \t\n\r\f\v]) and a few others ([/.-]) inside its command line argument.
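
A command line can be checked against this character set before it is submitted. The sketch below only approximates the rule described in the note; it is not OAR's actual check, and the function name oarsub_cmd_is_safe is made up for illustration.

```shell
# Hypothetical pre-check approximating the character set described above.
# Returns 0 if the command line contains only allowed characters.
oarsub_cmd_is_safe() {
    # LC_ALL=C keeps the a-z, A-Z and 0-9 ranges ASCII-only.
    # grep succeeds when it finds at least one forbidden character.
    if printf '%s' "$1" | LC_ALL=C grep -q '[^a-zA-Z0-9_[:space:]/.-]'; then
        return 1
    fi
    return 0
}

if oarsub_cmd_is_safe "/path/to/myscript arg1 arg2 arg3"; then
    echo "accepted"
fi
if ! oarsub_cmd_is_safe 'myscript; rm -rf results'; then
    echo "rejected"
fi
```

If a script needs arguments containing other characters, one option is to wrap the call in a small launcher script and submit that instead.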

Why are /core and -t deploy or -t allow_classic_ssh incompatible?

Jobs of type deploy or allow_classic_ssh imply the exclusive usage of a node. Specifying core information for such a submission can therefore only lead to inconsistencies, and it is prohibited by an admission rule.

Why did my advance reservation start without all the resources I requested?

Since OAR 2.2.12, an advance reservation is validated whenever the requested resources are, at the time the reservation is due to start and during the planned walltime, in any of the following states:

  1. alive
  2. suspected
  3. absent

Resources in the dead state are excluded; the three states above are accepted because they are transitional.

Moreover, the resources allocated to an advance reservation are definitively fixed upon validation: if any of those resources becomes dead, absent or suspected after the validation, it won't be replaced.

At the start time of the advance reservation, OAR looks for any unavailable resources (absent or suspected) and, if there are any, waits up to 5 minutes for them to return. Resources can come back because:

  • resources are in the absent state during the reboot that follows a kadeploy job, and become alive again as soon as the boot completes
  • resources whose health is suspected by OAR may be fixed by an admin or a maintenance tool

If the resources are still not back when the job actually starts, they are lost for the job, which then provides fewer resources than expected.

That is a price to pay for using advance reservation.

NB

Information about a reduced number of resources or a reduced walltime for a reservation due to this mechanism is available in the event part of the output of

oarstat -fj jobid

Access to logs

OAR database logs

A little-known feature of Grid'5000 is that all users have read-only access to OAR's database. You should be able to connect with a PostgreSQL client as user oarreader with password read to the database oar2 on any oardb.site.grid5000.fr. This gives you access to the complete history of jobs on all Grid'5000 sites. Note that this is the production database of OAR: please be careful with your queries!

About OAR 2.4

How to know if a node is in energy saving mode or really absent?

Nodes in energy saving mode are displayed with the state "Absent (standby)" by the oarnodes command.
The state "Absent (standby)" means that the node is shut down in order to save energy.
Nodes in this state are automatically started by OAR when they are needed.

Advanced users who query the OAR database directly can determine whether a node is in energy saving mode or absent using the field "available_upto" in the resources table.
If energy saving is enabled on the cluster, the field "available_upto" provides the date (Unix timestamp) until which the resource will be available.

  • An "Absent" node is in energy saving mode if the field "available_upto" is greater than the current Unix timestamp.

An example SQL query listing the nodes that are absent because of the energy saving mode:

SELECT distinct(network_address) FROM resources WHERE state='Absent' AND available_upto >= UNIX_TIMESTAMP()

  • An "Absent" node is really absent if:
    • the field "available_upto" is equal to 0
    • or the field "available_upto" is smaller than the current Unix timestamp (this case should not occur on Grid'5000)

An example of SQL query listing really absent nodes:

SELECT distinct(network_address) FROM resources WHERE state='Absent' AND (available_upto < UNIX_TIMESTAMP() OR available_upto = 0)
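
The same decision can be applied outside the database, for instance to values exported from the resources table. This is a minimal sketch that follows the comparisons used in the two queries above; the function name classify_absent_node and the timestamps are made up for illustration.

```shell
# Hypothetical helper applying the rule above to one "Absent" resource.
# $1: value of available_upto
# $2: current Unix timestamp (defaults to the current time)
classify_absent_node() {
    available_upto=$1
    now=${2:-$(date +%s)}
    if [ "$available_upto" -ne 0 ] && [ "$available_upto" -ge "$now" ]; then
        echo "standby"   # energy saving mode
    else
        echo "absent"    # really absent
    fi
}

# Illustrative values only.
classify_absent_node 2000000000 1700000000
classify_absent_node 0 1700000000
```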

How to detect nodes in maintenance?

Nodes in maintenance are nodes in the Dead state, nodes with the OAR maintenance property set to YES, or nodes temporarily in a really absent state (see above).

How to execute jobs within another one?

This functionality is called container jobs.
It makes it possible to execute jobs within another job, acting as a sub-scheduling mechanism.

  • First a job of the type container must be submitted, for example:
oarsub -I -t container -l nodes=10,walltime=2:10:00
...
OAR_JOB_ID=13542
...
  • Then it is possible to use the inner type to schedule the new jobs within the previously created container job:
oarsub -I -t inner=13542 -l nodes=7,walltime=0:30:00
oarsub -I -t inner=13542 -l nodes=3,walltime=0:45:00

How to use MPI on Grid'5000?

See also: the MPI Tutorial

MPI Warnings

On many clusters, you are likely to encounter the following warnings. They may come from conflicts between the InfiniBand and Myrinet technologies.

  • 1 - mca: base: component_find: unable to open /usr/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  • 2 - Error in mx_open_endpoint (error Busy)
  • 3 - librdmacm: couldn't read ABI version.
  • 4 - Error in mx_init (error No MX device entry in /dev.)

MPI options to use

It is preferable to use the following options to avoid warnings or errors:

  • InfiniBand clusters: Parapluie, Parapide, Griffon, Graphene, Edel, Adonis, Genepi
On the node:
mpirun --mca orte_rsh_agent "oarsh" --mca btl openib,sm,self --mca pml ^cm -machinefile $OAR_NODEFILE $HOME/mpi_programm
  • Myrinet clusters: Chinqchint, some of the Sol nodes (unsupported in Debian Jessie):
On the node:
mpirun --mca orte_rsh_agent "oarsh" --mca pml ob1 --mca btl tcp,self -machinefile $OAR_NODEFILE $HOME/mpi_programm
  • For other clusters, you may use the following options:
    • --mca pml ob1 --mca btl tcp,self
    • --mca btl ^openib
    • --mca btl ^mx