Difference between revisions of "FAQ"
Revision as of 12:51, 17 April 2019
- 1 About this document
- 2 Publications and Grid'5000
- 3 Accessing Grid'5000
- 3.1 How can I connect to Grid'5000 ?
- 3.2 How to connect from different workstations with the same account?
- 3.3 How to directly connect by SSH to any machine within Grid'5000 from my workstation?
- 3.4 Is access to the Internet possible from nodes?
- 3.5 How can I connect to an HTTP or HTTPS service running on a node?
- 4 Account management
- 4.1 I forgot my password, how can I retrieve it ?
- 4.2 My account expired, how can I extend it?
- 4.3 Why doesn't my home directory contain the same files on every site?
- 4.4 How to get my home mounted on deployed nodes?
- 4.5 How to restore a wrongly deleted file?
- 4.6 What about disk quotas ?
- 4.7 How to increase my disk quota limitation?
- 4.8 How do I unsubscribe from the mailing-list ?
- 5 SSH related questions
- 6 Software installation issues
- 7 Deployment related issues
- 7.1 My environment does not work on all clusters
- 7.2 Kadeploy fails with Image file not found!
- 7.3 Kadeploy is complaining about a node already involved in an other deployment
- 7.4 How to kill all my processes on a host ?
- 7.5 How do I exit from kaconsole on cluster X from site Y
- 7.6 Why are the network interfaces named eth2,eth3...ethn in my deployed environment?
- 8 Job submission related issues
- 8.1 What is the so called "best-effort" mode of OAR?
- 8.2 How to pass arguments to my script
- 8.3 Why are /core and -t deploy or -t allow_classic_ssh incompatible ?
- 8.4 Why did my advance reservation start with less than all the resources I requested ?
- 8.5 How can I check whether my reservations are respecting the Grid'5000 Usage Policy ?
- 9 Access to logs
- 10 About OAR
- 11 How to use MPI in Grid5000?
About this document
How to add/correct an entry to the FAQ?
Just like any other page of this wiki, you can edit the FAQ yourself to improve it. If you click on one of the little "edit" links placed after each question, you can edit that particular question. To edit the whole page, simply choose the edit tab at the top of the page.
Publications and Grid'5000
Is there an official acknowledgement ?
Yes there is: you agreed to it when accepting the usage policy. As the policy might have been updated since, please refer to the latest version. You should use it in all publications presenting results obtained (even partially) using Grid'5000.
How to mention Grid'5000 in HAL ?
HAL is an open archive you're invited to use. If you do so, the recommended way of mentioning Grid'5000 is to use the collaboration field of the submission form, with the Grid'5000 keyword, capitalized as such.
How can I connect to Grid'5000 ?
This is documented at length in the Getting Started tutorial.
You should be able to access Grid'5000 from anywhere on the Internet, by connecting to
access.grid5000.fr using SSH. You'll need SSH keys properly configured (please refer to the page dedicated to SSH if you are not familiar with SSH keys), as this machine will not allow you to log in using a password.
Some sites have an
access.site.grid5000.fr machine, which is only reachable from an IP address belonging to the local laboratory.
How to connect from different workstations with the same account?
You can associate multiple public SSH keys to your account. In order to do so, you have to:
- go to User Portal > Manage Account
- select the 'My account' tab
- in the actions list, select «Edit Profil»
- then, paste the public SSH key in a new line just after the other(s).
How to directly connect by SSH to any machine within Grid'5000 from my workstation?
This tip consists of customizing your SSH configuration file:
Host *.g5k User
email@example.com -W "`basename %h .g5k`:%p"
You are now able to connect to any machine using
Please have a look at the SSH page for a deeper understanding of this proxy feature.
Note: the Grid'5000 internal network uses private IP addresses, which are not directly reachable from outside Grid'5000.
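The backquoted part of the proxy command above can be tried on its own. In an SSH configuration file, %h stands for the host name given on the command line and %p for the port; the basename call strips the .g5k suffix, so that -W receives the real internal host name. The following is only a sketch: g5k_target is a hypothetical helper name, and mynode.mysite is a placeholder host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reproduce what `basename %h .g5k` computes
# for a host alias ending in .g5k.
g5k_target() {
    # basename NAME SUFFIX removes SUFFIX from the end of NAME
    basename "$1" .g5k
}

# Placeholder alias; prints the name that ssh -W would be asked to reach.
g5k_target mynode.mysite.g5k
```

With such a configuration, the .g5k suffix only acts as a marker that selects the proxy rule; it never reaches the Grid'5000 network.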
Is access to the Internet possible from nodes?
Since the end of 2015, full Internet access has been allowed from the Grid'5000 network. For security reasons, connections are logged.
How can I connect to an HTTP or HTTPS service running on a node?
If you only need to connect to a web interface or web API served by a node using HTTP or HTTPS, you can use the Grid'5000 reverse proxy. The base URL to use is https://mynode.mysite.http.proxy.grid5000.fr/ for HTTP and https://mynode.mysite.https.proxy.grid5000.fr/ for HTTPS.
If you can't run your service on port 80 or 443 (because you are not root, for example), you can use port 8080 or 8443 (you don't need to be root to bind to them). You can then use https://mynode.mysite.http8080.proxy.grid5000.fr/ for HTTP and https://mynode.mysite.https8443.proxy.grid5000.fr/ for HTTPS.
Please note that the reverse proxy requires authentication using your Grid'5000 credentials.
You may have to add an exception in your web browser to access the target web service, because of an SSL certificate mismatch.
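The URL patterns above can be summed up in a small shell function. This is a minimal sketch: g5k_proxy_url is a hypothetical helper name, mynode and mysite are placeholders, and only the patterns quoted in this section are assumed to be valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the reverse-proxy URL for a service on a node.
# Usage: g5k_proxy_url NODE SITE SCHEME [PORT]
#   SCHEME is "http" or "https" (the protocol spoken by the service on the node).
#   PORT is optional; this FAQ only mentions 8080 (HTTP) and 8443 (HTTPS).
g5k_proxy_url() {
    local node=$1 site=$2 scheme=$3 port=${4:-}
    case $scheme in
        http|https) ;;
        *) echo "scheme must be http or https" >&2; return 1 ;;
    esac
    # The proxy is always reached over HTTPS; the scheme and optional port
    # of the target service are encoded in the host name.
    printf 'https://%s.%s.%s%s.proxy.grid5000.fr/\n' "$node" "$site" "$scheme" "$port"
}

g5k_proxy_url mynode mysite http        # service on port 80
g5k_proxy_url mynode mysite https 8443  # service on port 8443
```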
I forgot my password, how can I retrieve it ?
To retrieve your password, you can use this form, or ask your account manager to reset it.
My account expired, how can I extend it?
Use the account management interface (Manage account link in the sidebar).
Why doesn't my home directory contain the same files on every site?
Every site has its own file server, so it is the user's responsibility to synchronize personal data between their home directories on the different sites. You may use the
rsync command to synchronize a remote site's home directory (be careful: this will delete or overwrite any remote file that differs from the local home directory):
rsync -n --delete -avz
NB : please remove the -n argument once you are sure you actually don't want to do a dry-run only...;)
How to get my home mounted on deployed nodes?
This is completely automatic if you deploy a *-nfs or *-big image. You can then connect using your own login and, once connected to the node, just enter your home:
cd /home/<your login>
How to restore a wrongly deleted file?
No backup facility is provided by the Grid'5000 platform. Please be careful, and back up your data using external backup services.
What about disk quotas ?
You'll find that for each account and each site, disk quotas may be activated.
- the soft limit is set to what the admins find a reasonable limit for an account on a more or less permanent basis. You can use more disk space temporarily, but you should not try and trick the system to keep that data on the shared file system.
- the hard limit is set so as to preserve usability for other users if one of your scripts produces unexpected amounts of data. You'll not be able to override that limit.
How to increase my disk quota limitation?
Should you need higher quotas, please visit your user account settings page at https://api.grid5000.fr/ui/account (my storage tab), or send an email to firstname.lastname@example.org, explaining how much storage space you want and what for.
How do I unsubscribe from the mailing-list ?
Users' mailing-list subscription is tied to your Grid'5000 account. You can configure your subscriptions in your account settings:
- Log in to https://api.grid5000.fr/ui/account
- Go to the "My account" tab, then click on the "Actions" button, then choose "Manage mailing lists"
Alternate method, by configuring Sympa to stop receiving any email from the list (while still being subscribed):
- If you haven't done it before, ask for a password on sympa.inria.fr from this form: https://sympa.inria.fr/sympa/firstpasswd/. Use the email address you used to register to Grid'5000.
- Connect to https://sympa.inria.fr using the email address you used to register to Grid'5000 and your sympa.inria.fr password.
- From the left panel, select users_grid5000. Then go to your subscriber options (Options d'abonné) and in the reception field (Mode de réception), select suspend (interrompre la réception des messages).
How to avoid SSH host key checking?
With the StrictHostKeyChecking option, SSH host key checking can be turned off. This option can be set in the ~/.ssh/config file.
Or it can be passed on the command line, using -o StrictHostKeyChecking=no.
How not to get tons of SSH errors about Man-in-the-middle attacks while deploying images ?
If you get the following error when you try to connect to a machine using ssh:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
6e:44:89:6d:ac:fc:d8:84:fd:2b:fb:22:e5:ba:5c:88.
Please contact your system administrator.
This is because SSH notices that the machine answering the connection is not the same from run to run. This is to be expected if you have just redeployed the image: it is not the same system that is answering.
Technically speaking, the file
/etc/ssh/ssh_host_dsa_key.pub is likely to be different in your own deployed image and in the default image. SSH will thus raise an alarm, since such a replacement usually means that someone is intercepting the communication and pretending to be the server to get information from you.
If you don't want to deal with this issue, there are several solutions:
- Set the StrictHostKeyChecking option in your
.ssh/config file to tell SSH to ignore those errors.
- Pass this option (
StrictHostKeyChecking=no) on the command line to ssh (using -o)
- Make sure that you have the same
host_dsa_key in your own images as in the default ones. They can usually be found in the pre/post install scripts of your site.
Outside of Grid'5000 scope, the correct solution is to fix your
~/.ssh/known_hosts, either by hand or using the command
Please have a look at the SSH page also.
What kind of public keys are supported on Grid'5000 ?
The only public key format allowed in Grid'5000 is the OpenSSH format.
You MUST provide and use SSH public keys in this format.
- SSH2-like public key (NOT SUPPORTED) :
---- BEGIN SSH2 PUBLIC KEY ----
Comment: rsa-key-20090623
AAAAB3NzaC1yc2EAAAABJQAAAIEA1YO87ubDgjQmCEdyX98UZ1RaBNAEXNGUNX2t
D/lEw7MPShJKpVYpcj4JhrOqTc0QXIcLqefkucDaoAIlEAp7e5aShWhWFtYR5Mwn
qAF1hrMBMF0xJIqgZjUWUPxvvFVeQXkObUWQkRyj5AjlG9+qQDLOoD9GgBOqfLDV
edGCLoM=
---- END SSH2 PUBLIC KEY ----
- OpenSSH like public key :
To convert ssh public keys from SSH2 to OpenSSH, see this tutorial.
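To tell the two formats apart before uploading a key, looking at the first line of the file is usually enough. The sketch below assumes that an OpenSSH public key is a single line starting with the key type (ssh-rsa, ssh-ed25519, ...) and that an SSH2 key starts with the BEGIN marker shown above; pubkey_format is a hypothetical helper name, not a Grid'5000 tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "openssh", "ssh2" or "unknown" for a public key file.
pubkey_format() {
    local first
    # Read the first line only; fail if the file is missing or empty.
    IFS= read -r first < "$1" || [ -n "$first" ] || return 1
    case $first in
        "---- BEGIN SSH2 PUBLIC KEY ----"*) echo ssh2 ;;
        ssh-rsa\ *|ssh-dss\ *|ssh-ed25519\ *|ecdsa-sha2-*\ *) echo openssh ;;
        *) echo unknown ;;
    esac
}

# Example use (the file name is a placeholder):
#   pubkey_format ~/.ssh/id_rsa.pub
# With OpenSSH installed, an SSH2 key can normally be converted with:
#   ssh-keygen -i -f key_ssh2.pub > key_openssh.pub
```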
Software installation issues
What is the general philosophy ?
This is how things should work: a basic set of software is installed on the frontends and in the nodes' standard environment of each site. If you need other software packages on nodes, you can create a Kadeploy image including them, and deploy it. You can also use sudo-g5k. If you think some software should be installed by default, you can contact the support staff.
My environment does not work on all clusters
On some rare occasions, an environment may not work on a given cluster:
- The kernel used does not support all hardware. You are advised to base your environment on one of the reference environments to avoid dealing with this, or to carefully read the hardware section of each site to see the list of kernel drivers that need to be compiled in your environment for it to be able to boot on all clusters. Of course, when a new cluster is integrated, you might need to update your kernel for portability.
- The post-installation scripts do not recognize your environment, and therefore network access, console access or site-specific configurations are not taken into account. You can check the contents of the default post-installation scripts, and see the variables set by kadeploy by looking at the environment's description using kaenv.
Kadeploy fails with Image file not found!
This means that
kadeploy is not able to read your environment's main archive. This can have several causes, for example:
- registered filename is wrong
- extension is not right (for example
.tar.gz does not work, whereas
Kadeploy is complaining about a node already involved in an other deployment
The warning you see is
node $node is already involved in another deployment
This error occurs
- when 2 concurrent deployments are attempted on the same node. If you have 2 simultaneous deployments, make sure you have 2 distinct sets of nodes.
- when there is a problem in the kadeploy database: this typically happens when a deployment ended abnormally. The best option is to wait for about 15 minutes and retry the deployment: kadeploy can correct its database automatically.
How to kill all my processes on a host ?
On the currently connected host (warning, it will disconnect you)
How do I exit from kaconsole on cluster X from site Y
You can try the '&.' sequence (French keyboard), but this may not work on all clusters. The Kaconsole page may give you more information.
Why are the network interfaces named eth2,eth3...ethn in my deployed environment?
This is most likely due to the default udev rules on Debian-based systems, which allocate unique interface names to physical network devices. When you deploy an environment on another node, udev detects new physical network devices and allocates them the next available interface names, incrementing the number each time. Delete the appropriate rules in your environment to prevent udev from behaving this way:
What is the so called "best-effort" mode of OAR?
The best-effort mode was implemented to back-fill the cluster with jobs considered less important, without blocking "regular" jobs. To submit jobs under that policy, you simply have to select the besteffort type of job in your oarsub command.
Jobs submitted that way will only get scheduled on processors when no other job uses them (any regular job overtakes besteffort jobs in the waiting queue, regardless of submission times). Moreover, these jobs are killed (as if oardel were called) when a recently submitted regular job needs the nodes used by a besteffort job.
By default, no checkpointing or automatic restart of besteffort jobs is provided. They are just killed. That is why this mode is best used with a tool which can detect the killed jobs and resubmit them. However OAR2 provides options for that. You may also have a look at tools like CiGri.
How to pass arguments to my script
When you do passive submission through
oarsub, you must specify a script. This script can be a simple script name or a more complex command line with arguments.
To pass arguments, you have to quote the whole command line, like in the following example:
"/path/to/myscript arg1 arg2 arg3"
Note: to avoid random code injection,
oarsub allows only alphanumeric characters (
[a-zA-Z0-9_]), whitespace characters (
[ \t\n\r\f\v]) and a few others (
[/.-]) inside its command line argument.
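A command line can be checked against these character classes before submission. The sketch below only mirrors the classes listed in the note above; it is not OAR's actual admission check, and oarsub_cmd_ok is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the command line is made of
# alphanumerics and underscore, whitespace, and the characters / . -
oarsub_cmd_ok() {
    local LC_ALL=C   # make a-z, A-Z and [:space:] mean exactly the ASCII sets
    local re='^[a-zA-Z0-9_[:space:]/.-]+$'
    [[ $1 =~ $re ]]
}

cmd="/path/to/myscript arg1 arg2 arg3"
if oarsub_cmd_ok "$cmd"; then
    echo "accepted: $cmd"
else
    echo "rejected: $cmd" >&2
fi
```

Characters such as ;, $ or quotes inside the command fail this check, which is why more complex invocations are usually wrapped in a script of their own.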
Why are /core and -t deploy or -t allow_classic_ssh incompatible ?
Jobs with type
deploy or type
allow_classic_ssh imply the exclusive usage of a node. Therefore, specifying core information for your submission can only lead to some inconsistencies. It is therefore prohibited by an admission rule.
Why did my advance reservation start with less than all the resources I requested ?
Since OAR 2.2.12, an advance reservation is validated regardless of the state of resources being either:
(but not dead) at the time the reservation is required to start and during the planned walltime (because those states are transitional).
Moreover, resources allocated to an advance reservation are definitely fixed upon this validation, which means that if any of those resources becomes dead, absent or suspected after the validation, that resource won't be replaced.
At the start time of the advance reservation, OAR then looks for any unavailable resources (absent or suspected) and, if there are any, waits up to 5 minutes for them to return, as can happen in the following cases:
- resources are in the absent state during the reboot after a kadeploy job, and then become alive again as soon as the boot completes
- resources whose health is suspected by OAR might be fixed by an admin or by a maintenance tool
If resources are not back yet at the time the job actually starts, these resources are lost for the job, which then has fewer resources than expected.
That is a price to pay for using advance reservation.
Information about reduced number of resources or reduced walltime for a reservation due to this mechanism are available in the event part of the output of
How can I check whether my reservations are respecting the Grid'5000 Usage Policy ?
You can use the script
usagepolicycheck, present on all frontends. See if your current reservations are respecting the Policy with
usagepolicycheck -t, use
usagepolicycheck -h to see the other options.
Access to logs
OAR database logs
Grid'5000 gives all users read-only access to OAR's database. You should be able to connect using a PostgreSQL client as user
oarreader with password
read to database
oar2 on all
.grid5000.fr. This gives you access to the complete history of jobs on all Grid'5000 sites. Since this is the production database of OAR, please be careful with your queries to avoid overloading the testbed!
How to know if a node is in energy saving mode or really absent ?
Nodes in energy saving mode are displayed with the state "Absent (standby)" by the oarnodes command.
The state "Absent (standby)" means that the node is shut down in order to save energy.
Nodes in this state will be automatically started by OAR when they are needed.
Advanced users who query the OAR database directly can determine if a node is in energy saving mode or absent with the field "available_upto" in the resources table.
If energy saving is enabled on the cluster, the field "available_upto" provides a date (unix timestamp) until when the resource will be available.
- A node "Absent" is in energy saving mode if the field "available_upto" is greater than the current unix timestamp
An example of SQL query listing absent nodes because of the energy saving mode:
SELECT distinct(network_address) FROM resources WHERE state='Absent' AND available_upto >= UNIX_TIMESTAMP()
- A node "Absent" is really absent if :
- the field "available_upto" is equal to 0
- or the field "available_upto" is smaller than the current unix timestamp (this case should not occur on Grid'5000)
An example of SQL query listing really absent nodes:
SELECT distinct(network_address) FROM resources WHERE state='Absent' AND (available_upto < UNIX_TIMESTAMP() OR available_upto = 0)
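The same decision can be made in a script once the "available_upto" value of an absent node is known. The sketch below follows the comparisons used in the two queries above; absent_kind is a hypothetical helper name, and the timestamps in the examples are made up. Note also that UNIX_TIMESTAMP() is a MySQL function: with a PostgreSQL client, the current timestamp would probably have to be obtained with EXTRACT(EPOCH FROM now()) instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a node whose OAR state is "Absent".
# Usage: absent_kind AVAILABLE_UPTO [NOW]
#   AVAILABLE_UPTO: value of the available_upto field (unix timestamp, or 0)
#   NOW: current unix timestamp (defaults to the local clock)
absent_kind() {
    local available_upto=$1 now=${2:-$(date +%s)}
    if [ "$available_upto" -ne 0 ] && [ "$available_upto" -ge "$now" ]; then
        echo standby   # energy saving mode: OAR can wake the node up
    else
        echo absent    # really absent: 0, or a date in the past
    fi
}

absent_kind 2000000000 1555500000   # standby
absent_kind 0 1555500000            # absent
```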
How to detect nodes in maintenance ?
Nodes in maintenance are nodes with a Dead state, an OAR maintenance property set to YES, or temporarily in a really absent state (see above).
How to execute jobs within another one ?
This functionality is named container jobs. With this functionality it is possible to execute jobs within another one. So it is like a sub-scheduling mechanism.
- First a job of the type container must be submitted, for example:
oarsub -I -t container -t cosystem -l nodes=10,walltime=2:10:00
...
OAR_JOB_ID=13542
...
- Then it is possible to use the inner type to schedule the new jobs within the previously created container job:
oarsub -I -t inner=13542 -l nodes=7,walltime=0:30:00
oarsub -I -t inner=13542 -l nodes=3,walltime=0:45:00
More information in Advanced_OAR#Container_jobs
How to use MPI in Grid5000?
See also : The MPI Tutorial
MPI options to use
It is preferable to use the following options to avoid warnings or errors:
- INFINIBAND clusters: Parapluie, Parapide, Griffon, Graphene, Edel, Genepi
- For other clusters, you may use the following options:
- --mca pml ob1 --mca btl tcp,self
- --mca btl ^openib
- --mca btl ^mx