FAQ

From Grid5000

About this document

How to add/correct an entry to the FAQ?

Just like any other page of this wiki, you can edit the FAQ yourself to improve it. Clicking one of the small "edit" links placed after each question lets you edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.


Publications and Grid'5000

Is there an official acknowledgement ?

Yes there is: you agreed to it when accepting the usage policy. As the policy might have been updated since, please refer to the latest version. You should use it on all publications presenting results obtained (even partially) using Grid'5000.

How to mention Grid'5000 in HAL?

HAL is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the Grid'5000 keyword, capitalized as such.

Accessing Grid'5000

How can I connect to Grid'5000 ?

This is documented at length in the Getting Started tutorial.

You should be able to access Grid'5000 from anywhere on the Internet, by connecting to access.grid5000.fr using SSH. You'll need properly configured SSH keys (please refer to the page dedicated to SSH if this is unfamiliar to you), as this machine will not allow you to log in using a password.

Some sites have an access.site.grid5000.fr machine (replace site with the actual site name), which is only reachable from an IP address belonging to the local laboratory.

How to connect from different workstations with the same account?

You can associate several public SSH keys to your account. In order to do so, you have to:

  • log in,
  • go to User Portal > Manage Account,
  • select the My account top tab,
  • select the SSH keys left tab,
  • then, manage your keys:
    • add a new public SSH key ;
    • remove an old one.

More information in the SSH page and the Public key authentication page.

How to directly connect by SSH to any machine within Grid'5000 from my workstation?

This tip consists of customizing the SSH configuration file ~/.ssh/config (compatible with the OpenSSH ssh client):

Host *.g5k
   User login
   ProxyCommand ssh login@access.grid5000.fr -W "$(basename %h .g5k):%p"

You can then connect to any machine using ssh machine.site.g5k
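To see what the ProxyCommand hands to the access machine, the host name rewriting can be tried on its own. The sketch below only wraps the basename call from the configuration above in a helper; the name g5k_target is hypothetical and nothing here opens a connection.

```shell
# Hypothetical helper: show the host name that the ProxyCommand forwards
# to access.grid5000.fr, i.e. the requested name without its .g5k suffix.
g5k_target() {
  basename "$1" .g5k
}

g5k_target machine.site.g5k   # prints machine.site
```

The .g5k suffix is therefore purely local: it selects the Host block in ~/.ssh/config and is removed before the name is resolved inside Grid'5000.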

Please have a look at the SSH page for a deeper understanding and more information.

For users of PowerShell on Microsoft Windows, which also comes with the OpenSSH ssh client: the basename command may not be available, so the ProxyCommand above must be adapted.

Note

The Grid'5000 internal network uses private IP addresses, so its machines are not directly reachable from outside Grid'5000.

Is access to the Internet possible from nodes?

Yes. Full Internet access is allowed from the Grid'5000 network. IP addresses are NATed.

Warning

For security reasons, all connections are logged.

How can I connect to an HTTP or HTTPS service running on a node?

See Web services page.

How can I share files from Grid'5000 using HTTP?

See Web services page.

Account management

I forgot my password, how can I retrieve it ?

To retrieve your password, you can use this form, or ask your account manager to reset it.

My account expired, how can I extend it?

Use the account management interface (Manage account link in the sidebar).

Why doesn't my home directory contain the same files on every site?

Every site has its own file server. It is the user's responsibility to synchronize personal data between their home directories on the different sites. You may use the rsync command to synchronize a remote site's home directory with the local one (be careful: this overwrites remote files that differ from the local ones and deletes remote files that do not exist in the local home directory):

rsync -n --delete -avz ~/ frontend.site.grid5000.fr:~/

NB: the -n argument makes rsync perform a dry run only. Remove it once you have checked the output and are sure you want to apply the changes.
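One way to avoid running the destructive form by accident is to keep the dry run as the default. The helper below is a minimal sketch: g5k_sync_cmd and its "apply" keyword are hypothetical names, and it only prints the command so that it can be reviewed before being run. The trailing slashes make rsync copy the contents of the home directory rather than the directory itself.

```shell
# Hypothetical helper: print the rsync command for a given site.
# $1 = site name, $2 = "apply" to drop the dry-run flag (anything else keeps -n).
g5k_sync_cmd() {
  site=$1
  mode=${2:-dry-run}
  flags="--delete -avz"
  if [ "$mode" != "apply" ]; then
    flags="-n $flags"   # dry run unless explicitly told otherwise
  fi
  printf 'rsync %s %s %s\n' "$flags" "$HOME/" "frontend.$site.grid5000.fr:~/"
}

g5k_sync_cmd site         # prints the dry-run command
g5k_sync_cmd site apply   # prints the command that really synchronizes
```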

How to get my home mounted on deployed nodes?

This is completely automatic if you deploy a *-nfs or *-big image (automount).

  • You can connect using your own username and should land in your home;
  • If connecting as root, once connected to the node, just change directory to your home and it will be mounted:
 cd /home/username
Note

The home directories of other users cannot be mounted, for security reasons.

How to restore a wrongly deleted file?

No backup facility is provided by the Grid'5000 platform. Please be careful when deleting files, and back up your data using external backup services.

What about disk quotas ?

You'll find that for each account and each site, disk quotas may be activated.

  • the soft limit is set to what the admins find a reasonable limit for an account on a more or less permanent basis. You can use more disk space temporarily, but you should not try and trick the system to keep that data on the shared file system.
  • the hard limit is set so as to preserve usability for other users if one of your scripts produces unexpected amounts of data. You'll not be able to override that limit.


How to increase my disk quota limitation?

Should you need higher quotas, please visit your user account settings page at https://api.grid5000.fr/ui/account (my storage tab), or send an email to support-staff@lists.grid5000.fr, explaining how much storage space you want and what for.

How do I unsubscribe from the mailing-list ?

The users' mailing-list subscription is tied to your Grid'5000 account. You can configure your subscriptions in your account settings.

Alternate method, by configuring Sympa to stop receiving any email from the list (while still being subscribed):

  • If you haven't done it before, ask for a password on sympa.inria.fr from this form: https://sympa.inria.fr/sympa/firstpasswd/. Use the email address you used to register to Grid'5000.
  • Connect to https://sympa.inria.fr using the email address you used to register to Grid'5000 and your sympa.inria.fr password.
  • From the left panel, select users_grid5000. Then go to your subscriber options (Options d'abonné) and in the reception field (Mode de réception), select suspend (interrompre la réception des messages).

SSH related questions

See the SSH page.

Software installation issues

What is the general philosophy ?

This is how things should work: a basic set of software is installed on the frontends and in the nodes' standard environment of each site. If you need other software packages on nodes, you can create a Kadeploy image including them, and deploy it. You can also use sudo-g5k. If you think some software should be installed by default, you can contact the support-staff.

Deployment related issues

My environment does not work on all clusters

On some rare occasions, an environment may not work on a given cluster:

  1. The kernel used does not support all hardware. You are advised to base your environment on one of the reference environments to avoid dealing with this, or to carefully read the hardware section of each site to see the list of kernel drivers that need to be compiled in your environment for it to be able to boot on all clusters. Of course, when a new cluster is integrated, you might need to update your kernel for portability.
  2. The post-installation scripts do not recognize your environment, and therefore network access, console access or site specific configurations are not taken into account. You can check the contents of the default post-installation scripts to see the variables set by kadeploy by looking at environment's description using kaenv.

Kadeploy fails with Image file not found!

This means that kadeploy is not able to read your environment's main archive. This can have several causes, for example:

  • registered filename is wrong
  • extension is not right (for example .tar.gz does not work, whereas .tgz is OK)

Kadeploy is complaining about a node already involved in another deployment

The warning you see is:

node node is already involved in another deployment

This error occurs:

  • when 2 concurrent deployments are attempted on the same node. If you have 2 simultaneous deployments, make sure you have 2 distinct sets of nodes.
  • when there is a problem in the kadeploy database: typically when a deployment ended in a strange way, this can happen. The best is to wait for about 15 minutes and retry the deployment: kadeploy can correct its database automatically.

How to kill all my processes on a host ?

On the currently connected host (warning, it will disconnect you)

kill -KILL -1

How do I exit from kaconsole on cluster X from site Y

You can try typing & and then . (that is, the sequence &.), but this may not work on all clusters. The Kaconsole page may give you more information.

Why are the network interfaces named eth2,eth3...ethn in my deployed environment?

This is probably due to the default udev rules on Debian-based systems, which allocate unique interface names to physical network devices. When you deploy an environment on another node, udev detects new physical network devices and allocates them the next available interface names, incrementing the number each time. Delete the corresponding rules in your environment to prevent udev from behaving this way:

node:
rm /etc/udev/rules.d/*persistent-net.rules

Job submission related issues

What is the so called "best-effort" mode of OAR?

The best-effort mode was implemented to back-fill the cluster with jobs considered less important, without blocking "regular" jobs. To submit jobs under that policy, you simply have to select the besteffort type of job in your oarsub command.

oarsub -t besteffort script_to_launch

Jobs submitted that way will only get scheduled on resources when no other job uses them (any regular job overtakes besteffort jobs in the waiting queue, regardless of submission times). Moreover, these jobs are killed (as if oardel were called) when a recently submitted regular job needs the nodes used by a besteffort job.

By default, no checkpointing or automatic restart of besteffort jobs is provided. They are just killed. That is why this mode is best used with a tool which can detect the killed jobs and resubmit them. However OAR2 provides options for that.

How to pass arguments to my script

When you do passive submission through oarsub, you must specify a script. This script can be a simple script name or a more complex command line with arguments.

To pass arguments, you have to quote the whole command line, like in the following example:

oarsub -l nodes=4,walltime=2 "/path/to/myscript arg1 arg2 arg3"

Note: to avoid arbitrary code injection, oarsub allows only alphanumeric characters ([a-zA-Z0-9_]), whitespace characters ([ \t\n\r\f\v]) and a few others ([/.-]) inside its command line argument.
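The character restriction can be approximated locally before submitting a job. The function below is a sketch under stated assumptions: oarsub_cmd_ok is a hypothetical name, and the pattern is transcribed from the character classes listed in the note, not taken from oarsub itself, so the real admission check may differ in details.

```shell
# Hypothetical pre-check, not part of OAR: succeeds only if the command
# line contains no character outside the sets listed in the note.
oarsub_cmd_ok() {
  # LC_ALL=C keeps the a-z and A-Z ranges limited to ASCII letters.
  # grep succeeds when it finds a forbidden character, hence the negation.
  ! printf '%s' "$1" | LC_ALL=C grep -q '[^a-zA-Z0-9_[:space:]/.-]'
}

oarsub_cmd_ok "/path/to/myscript arg1 arg2 arg3" && echo "accepted"
oarsub_cmd_ok "myscript; rm -rf /tmp/x" || echo "rejected"
```

If a script needs arguments containing other characters (quotes, equal signs, shell metacharacters), one option is to put them in a small wrapper script and submit that wrapper instead.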

Why are /core and -t deploy or -t allow_classic_ssh incompatible?

Jobs with type deploy or type allow_classic_ssh imply the exclusive usage of a node. Therefore, specifying core information for your submission can only lead to some inconsistencies. It is therefore prohibited by an admission rule.

Why did my advance reservation start with less than all the resources I requested ?

Since OAR 2.2.12, an advance reservation is validated regardless of the state of resources being either:

  1. alive
  2. suspected
  3. absent

(but not dead) at the time the reservation is required to start and during the planned walltime (because those states are transitional).

Moreover, the resources allocated to an advance reservation are definitively fixed upon this validation, which means that if any of those resources becomes dead, absent or suspected after the validation, that resource won't be replaced.

At the start time of the advance reservation, OAR then looks for any unavailable resources (absent or suspected) and, if some exist, waits up to 5 minutes for them to return, which can happen because:

  • resources are in the absent state during the reboot that follows a kadeploy job, and become alive again as soon as the boot completes
  • resources whose health is suspected by OAR might be fixed by an admin or by a maintenance tool

If resources are not back by the time the job actually starts, they are lost for the job, which then gets fewer resources than requested.

That is a price to pay for using advance reservation.

NB

Information about a reduced number of resources or a reduced walltime for a reservation due to this mechanism is available in the event part of the output of

oarstat -fj jobid

How can I check whether my reservations are respecting the Grid'5000 Usage Policy ?

You can use the script usagepolicycheck, present on all frontends. Check whether your current reservations respect the policy with usagepolicycheck -t, and use usagepolicycheck -h to see the other options.

To help comply with the usage policy, it is possible to use the day and night OAR job types to fit batch jobs inside day vs. night / week-end time frames. More details are available in the Advanced OAR guide.

Access to logs

OAR database logs

Grid'5000 gives all users read-only access to OAR's database. You should be able to connect using a PostgreSQL client as user oarreader with password read to database oar2 on all oardb.site.grid5000.fr. This gives you access to the complete history of jobs on all Grid'5000 sites. This is the production database of OAR: please be careful with your queries to avoid overloading the testbed!

Note

Careful: Grid'5000 is neither a computation grid, nor an HPC center, nor a Cloud (Grid'5000 is a research instrument). That means that the usage of Grid'5000 by itself (OAR logs of Grid'5000 users' reservations) does not reflect a typical usage of any such infrastructure. It is therefore not relevant to analyze Grid'5000 OAR logs for that purpose. As a user, one can however use Grid'5000 to emulate an HPC cluster or cloud on reserved resources (in a job), possibly injecting a real load from a real infrastructure.

About OAR

How to know if a node is in energy saving mode or really absent ?

Nodes in energy saving mode are displayed with the state "Absent (standby)" by the oarnodes command.
The state "Absent (standby)" means that the node is shut down in order to save energy.
Nodes in this state are automatically started by OAR when they are needed.

Advanced users who check directly the OAR database can determine if a node is in energy saving mode or absent with the field "available_upto" in the resources table.
If energy saving is enabled on the cluster, the field "available_upto" provides a date (unix timestamp) until when the resource will be available.

  • A node "Absent" is in energy saving mode if the field "available_upto" is greater than or equal to the current unix timestamp.

An example of SQL query (PostgreSQL syntax) listing nodes that are absent because of the energy saving mode:

SELECT DISTINCT host FROM resources WHERE state = 'Absent' AND available_upto >= EXTRACT(EPOCH FROM now());

  • A node "Absent" is really absent if:
    • the field "available_upto" is equal to 0,
    • or the field "available_upto" is smaller than the current unix timestamp (this case should not occur on Grid'5000).

An example of SQL query (PostgreSQL syntax) listing really absent nodes:

SELECT DISTINCT host FROM resources WHERE state = 'Absent' AND (available_upto < EXTRACT(EPOCH FROM now()) OR available_upto = 0);
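The same rule can be applied to a single value once "available_upto" has been read from the resources table. The function below is a sketch only: absent_kind is a hypothetical name, and it assumes the node is already known to be in the "Absent" state.

```shell
# Hypothetical helper: classify an "Absent" node from its available_upto value.
# $1 = available_upto (unix timestamp, 0 when not set)
# $2 = current unix timestamp (defaults to the output of date +%s)
absent_kind() {
  upto=$1
  now=${2:-$(date +%s)}
  if [ "$upto" -ne 0 ] && [ "$upto" -ge "$now" ]; then
    echo "standby"   # energy saving mode, OAR can wake the node up
  else
    echo "absent"    # really absent
  fi
}

absent_kind 2000000000 1700000000   # prints standby
absent_kind 0 1700000000            # prints absent
```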

How to detect nodes in maintenance ?

Nodes in maintenance are nodes with a Dead state, nodes with the OAR maintenance property set to YES, or nodes temporarily in a really absent state (see above).

How to execute jobs within another one ?

For this functionality OAR provides the container job type. But please also have a look at the GNU Parallel tool, which may be more relevant and efficient.

With this functionality it is possible to execute several jobs inside another one, involving the OAR scheduling. This is especially relevant for tutorials or teaching labs, where jobs are created by a set of different users.

If all jobs, container and inner, belong to the same user, GNU Parallel should be preferred.

  • First create a job of type container, for example:
node:
oarsub -I -t container -t cosystem -l nodes=10,walltime=2:10:00
...
OAR_JOB_ID=13542
...
  • Then the inner job type can be used, to get new jobs scheduled inside the previously created container job:
node:
oarsub -I -t inner=13542 -l nodes=7,walltime=0:30:00
node:
oarsub -I -t inner=13542 -l nodes=3,walltime=0:45:00

More information in the Container jobs section of the Advanced OAR guide.

How to use MPI in Grid'5000?

See also: the MPI Tutorial

MPI options to use

It is preferable to use the following options to avoid warnings or errors:

  • INFINIBAND clusters: Parapluie, Parapide, Griffon, Graphene, Edel, Genepi
node:
mpirun --mca orte_rsh_agent "oarsh" --mca btl openib,sm,self --mca pml ^cm -machinefile $OAR_NODEFILE $HOME/mpi_programm
  • For other clusters, you may use the following options:
    • --mca pml ob1 --mca btl tcp,self
    • --mca btl ^openib
    • --mca btl ^mx

Access to the Jean Zay supercomputer (and possibly other GENCI supercomputers)