Environment Management
Introduction
This page describes the process of creating, updating and deleting images (both production and reference) on Grid'5000. The following steps describe the general process that should happen when creating or updating an env:
- deploying on a node
- tune or generate a new env
- test and validate the new env on all clusters
- deploy this new env on all the grid5000 frontends into /grid5000/images
- update the API
- create an info page on the wiki
Versioning
The environment name is built as: distribution name-architecture-version-flavor, where the version has 3 numeric components (major.minor.micro).
The major component indicates the distribution version, whereas the minor and micro components follow the internal Grid'5000 version numbering. The minor component is bumped whenever drastic changes that can break the user experience occur; otherwise, bumping the micro component is recommended.
Concerning the reference environments, if users cannot easily port their own environment based on the previous version to the new one, the minor component is bumped.
Examples of valid names:
debian-x64-5.2.8-base
debian-x64-5.6.12-prod
fedora-x64-13.1.2-big
We still have problems with distributions that number their versions with 2 components, such as gentoo. For example, the latest gentoo profile is 10.0, which would give:
gentoo-x64-10.0.1.0-nfs
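The naming scheme can be illustrated with a small parser. This is only a sketch: the helper `parse_env_name` and its regular expression are hypothetical and not part of any Grid'5000 tool. It assumes a strict reading of the scheme (exactly 3 numeric components), which is why the gentoo name above is rejected.

```ruby
# Hypothetical helper, for illustration only: splits an environment name of the
# form distribution-architecture-major.minor.micro-flavor into its components.
ENV_NAME_PATTERN = /\A(?<distribution>[a-z0-9]+)-(?<arch>[a-z0-9]+)-(?<major>\d+)\.(?<minor>\d+)\.(?<micro>\d+)-(?<flavor>[a-z0-9]+)\z/

def parse_env_name(name)
  match = ENV_NAME_PATTERN.match(name)
  return nil unless match # the name does not follow the 3-component scheme

  {
    distribution: match[:distribution],
    arch: match[:arch],
    major: match[:major].to_i, # distribution version
    minor: match[:minor].to_i, # internal numbering, breaking changes
    micro: match[:micro].to_i, # internal numbering, other changes
    flavor: match[:flavor]
  }
end

p parse_env_name("debian-x64-5.2.8-base")
p parse_env_name("gentoo-x64-10.0.1.0-nfs") # 4 numeric components: no match
```

With this strict pattern the second call returns nil, which mirrors the problem described above for distributions that use a 2-component version.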
Creation
from an existing environment
nothing new here; the kadeploy3 tutorial contains everything you need to know: https://www.grid5000.fr/mediawiki/index.php/Deploy_environment
with chef
Create an environment
get the git repository
git clone ssh://g5kadmin@git.grid5000.fr/srv/git/repos/chef-node-bootstrap.git
Add a new environment in the chef-node-bootstrap/envs folder. As an example, here is the rennes production environment, prod-rennes.json:
{
"grid5000": { "site": "rennes" },
"oar": { "version": "2.4" },
"recipes": [ "setup", "g5kparts", "oar", "ganglia", "sensors", "kernel", "myrinet", "ofed", "openmpi" ]
}
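Before committing a new description, it can help to check that the file is valid JSON and lists some recipes. The sketch below is an assumption about what a reasonable sanity check looks like; `check_env_description` is a hypothetical helper and is not part of chef-node-bootstrap.

```ruby
require "json"

# Same content as prod-rennes.json, inlined so the sketch is self-contained.
description = JSON.parse(<<~JSON)
  {
    "grid5000": { "site": "rennes" },
    "oar": { "version": "2.4" },
    "recipes": [ "setup", "g5kparts", "oar", "ganglia", "sensors", "kernel", "myrinet", "ofed", "openmpi" ]
  }
JSON

# Hypothetical sanity check: returns a list of problems, empty when none is found.
def check_env_description(desc)
  recipes = desc["recipes"]
  problems = []
  unless recipes.is_a?(Array) && !recipes.empty? && recipes.all? { |r| r.is_a?(String) }
    problems << "recipes must be a non-empty list of recipe names"
  end
  problems
end

problems = check_env_description(description)
puts problems.empty? ? "description looks sane" : problems.join("\n")
```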
Generate an environment
manually generating the environment
- Deploy a reference environment (like squeeze-x64-big) on a node, install debootstrap, then start generating the env:
apt-get update && apt-get install debootstrap
cd /tmp
wget http://git.grid5000.fr/chef/scripts/gen.sh
sh gen.sh squeeze-x64-base
This last command generates the image /tmp/squeeze-x64-base-<version>.tgz. Its argument is the basename of a description file envs/[distrib-arch-variant].json in the git repo:
- squeeze is the distribution name as passed to debootstrap; currently only squeeze and lenny are supported by our recipes,
- x64 is the architecture,
- base is the variant name.
The current <version> of the image is defined inside that JSON file.
- For a production environment, the command will be like this:
sh gen.sh rennes_squeeze-x64-prod
let the scripts do the job
The Rakefile found in the chef-node-bootstrap repo contains everything you need to automate the generation of an environment.
The following commands generate a new production environment on all sites and put the .tgz in the ~/kaenvs/prod_rc/ directory of each frontend.site:
rake bootstrap:build
rake bootstrap:push
rake envs:generate SITE=all KAENV=prod_rc ENV_NAME=prod
You'll be notified on the grid5000 notification room whenever an environment is ready.
There are 3 mandatory parameters for envs:generate:
- SITE: site where to launch the generation. Comma separated list of grid5000 sites, or the keyword all
- KAENV: name of the directory in ~/kaenvs/ where to put the generated images/logs. Results in: ~/kaenvs/$KAENV
- ENV_NAME: basename of the environment description intended for generation (e.g. to generate the environment described in the envs/ directory and named squeeze-x64-base.json, pass ENV_NAME=squeeze-x64-base)
There are 2 special keywords that are not passed directly, but expanded in the Rakefile:
- SITE=all => will generate on all sites, only useful for production environments
- ENV_NAME=prod => will be expanded into $SITE_squeeze-x64-prod for each site (e.g. rennes_squeeze-x64-prod), only useful when using multiple sites
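The expansion of the two special keywords can be sketched as follows. This is not the code of the actual Rakefile: the function names and the hard-coded `KNOWN_SITES` list are assumptions made for illustration, and the real Rakefile may obtain the list of sites differently.

```ruby
# Hypothetical site list, for illustration only.
KNOWN_SITES = %w[bordeaux lille lyon rennes sophia].freeze

# SITE: the keyword "all" becomes every known site,
# otherwise the comma separated list is split.
def expand_sites(site_param, known_sites = KNOWN_SITES)
  return known_sites.dup if site_param == "all"

  site_param.split(",").map(&:strip).reject(&:empty?)
end

# ENV_NAME: the keyword "prod" becomes <site>_squeeze-x64-prod,
# any other value is passed through unchanged.
def expand_env_name(env_name, site)
  env_name == "prod" ? "#{site}_squeeze-x64-prod" : env_name
end

# One [site, environment name] pair per generation to launch.
def generation_plan(site_param, env_name)
  expand_sites(site_param).map { |site| [site, expand_env_name(env_name, site)] }
end

generation_plan("rennes,lyon", "prod").each do |site, env|
  puts "#{site}: #{env}"
end
```

Under these assumptions, SITE=rennes,lyon with ENV_NAME=prod yields rennes_squeeze-x64-prod on rennes and lyon_squeeze-x64-prod on lyon.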
Those three commands and their interactions with the whole of Grid'5000 result in this sequence diagram (not 100% accurate, for the sake of simplicity: here the apiserver actually includes the oar-server and kadeploy-server).
Tests and Validation
Reference Environment
We describe the tests associated with the 3 flavors of the reference environments:
- base Environment
  - System integrity regarding the API
  - System rights integrity (as in 3061)
- nfs Environment
  - base Environment tests
  - check for nfs mounts
  - check for ldap accounts
- big Environment
  - nfs Environment tests
  - Bandwidth tests with MPI (this implies deploying 2 nodes per cluster)
Production Environment
The production environment differs slightly from a big reference environment in that it is an oar-node, and thus needs further testing.
We propose to deploy the nodes in a container job in order to bring the nodes up and test the oar-node functionality (submit a job) without disturbing the platform.
The integration of g5k-code in the production environment shall be tested.
The rest of the tests are identical to those of a big flavor reference environment.
As an example, this is how you would launch this test by hand:
oarsub -t container -t deploy -I -l walltime=2:00
kadeploy3 -u root -e debian_prod -f $OAR_NODEFILE
oarsub -I -t inner=$OAR_JOBID -l nodes=1,walltime=0:30
If the second oarsub succeeds, then you can consider that the oar-node is functional.
This can be summarized with this sequence diagram.
Scripts
Deploy
One script is provided on the forge subversion repository to deploy and test an environment on all clusters:
- deploy_env_all_cluster.rb
  - sensible defaults have been set: running without options will deploy one lenny-x64-base per cluster and run uptime on the nodes
  - you will be notified via xmpp on the grid5000 jabber
  - the script can be configured via a yaml file in ~/.restfully/deploy_env_all.yml
  - a summary will be displayed at the end of the script
  - the script is designed to be quite modular and to fit different needs
  - the list of available options can be quite impressive at first, but most of the time the default values are suitable
  - multiple examples covering the most useful options are shown below:
#deploying one lenny-x64-base on all clusters and launching uptime
ruby deploy_env_all_cluster.rb
#deploying one reference environment on all nodes and launching a ping on frontend
ruby deploy_env_all_cluster.rb -e http://public.rennes.grid5000.fr/~granquet/lenny-x64-base_rc.env -c 'ping -c1 frontend'
#deploying gentoo on 10 nodes of suno and compiling a kernel with distcc
ruby deploy_env_all_cluster.rb -e http://public.rennes.grid5000.fr/~granquet/gentoo-x64-base.env --clusters suno -s sophia --numnodes 10 -c 'echo @N > /etc/distcc/hosts && emerge --sync && USE=symlink emerge gentoo-sources && cd /usr/src/linux && genkernel --kernel-cc=distcc --makeopts=-j80 all'
#testing the production environment on all sites
#this deploys 3 nodes per cluster with the environment named prod, makes an inner reservation and launches an mpi application.
ruby deploy_env_all_cluster.rb -e prod --numnodes 3 -d ',' -c 'mpirun -H @N /home/bordeaux/granquet/bwlat/latency_flow_tests/latency_flow_tests -g' -U $USER -t prod -K ~/.ssh/internal.key.pub
all the options are listed in the help
#displaying help
ruby deploy_env_all_cluster.rb -h
--help -h: You are here
--env -e: env environment name
--command -c: command specify a command line to execute on the first node of the deployment
the sequences @N, @n, @u, @U, @t, will be replaced respectively by the list of nodes involved in the deployment (see delim for a custom separator), the first node of the deployment, your grid5000 username, the username you are connected with over ssh on the node, the type of deployment
--user -u: user user-name to use on grid5000
--gateway -g: gw gateway to connect through ssh
--base-url -B: url url to the API
--type -t: type type of environment {prod,ref}
--sites -s: sites comma separated list of sites you want to restrict the deployment on
--nocleanup -n: do not cleanup the reservations at end of script
--npassword -P: the password to use to connect to the node with ssh
--delim -d: delim delimiter for multiple values replacements in command
--nuname -U: username username to use when connecting to the node
--key -K: keyfile public ssh key
--clusters comma separated list of clusters
--numnodes number of nodes to allocate _per cluster_
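The placeholder replacement described for the --command option can be sketched as below. This is a hypothetical re-implementation written for illustration, not the code of deploy_env_all_cluster.rb; the function name, its keyword arguments, the node names and the default delimiter (a space) are all assumptions.

```ruby
# Hypothetical sketch of the @N, @n, @u, @U, @t expansion.
def expand_command(command, nodes:, g5k_user:, node_user:, type:, delim: " ")
  replacements = {
    "@N" => nodes.join(delim), # all nodes of the deployment, joined with delim
    "@n" => nodes.first.to_s,  # first node of the deployment
    "@u" => g5k_user,          # grid5000 username
    "@U" => node_user,         # username used over ssh on the node
    "@t" => type               # type of deployment
  }
  # A single pass over the command avoids re-expanding already replaced text.
  command.gsub(/@[NnuUt]/) { |token| replacements[token] }
end

puts expand_command(
  "mpirun -H @N ./latency_test",
  nodes: ["node-1", "node-2", "node-3"],
  g5k_user: "jdoe", node_user: "root", type: "prod", delim: ","
)
```

With the delimiter set to ',' as in the production example, @N expands to node-1,node-2,node-3, which is the host list format expected by mpirun -H.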
Qualify
As seen earlier, the tests defined for the different flavors of the environments stack in a hierarchy, which makes it easy to put in place programmatically (through inheritance) a framework for testing all these envs.
Various tools of interest for this framework include:
- g5k-checks :: tests the hardware against the API
- hwdb3 :: launches various tests and records results in a mysql database
- ivtools :: tests the hardware against the oar database
- bwlat :: throughput and latency testing tools using MPI
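The test hierarchy mentioned in this section maps naturally onto class inheritance. The sketch below only illustrates the idea: the class names are hypothetical, no such framework is described on this page, and each check is represented by its name instead of a real test.

```ruby
# Hypothetical test suites; check names mirror the lists in "Tests and Validation".
class BaseEnvironmentTests
  def checks
    ["system integrity regarding the API", "system rights integrity"]
  end
end

# nfs runs the base tests plus its own.
class NfsEnvironmentTests < BaseEnvironmentTests
  def checks
    super + ["nfs mounts", "ldap accounts"]
  end
end

# big runs the nfs tests plus its own.
class BigEnvironmentTests < NfsEnvironmentTests
  def checks
    super + ["bandwidth tests with MPI"]
  end
end

# prod runs the big tests plus the oar-node specific ones.
class ProductionEnvironmentTests < BigEnvironmentTests
  def checks
    super + ["oar-node job submission", "g5k-code integration"]
  end
end

puts ProductionEnvironmentTests.new.checks
```

Each flavor only declares what it adds, so a change to the base checks is picked up by every other flavor.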
Broadcasting an image on all sites
To help admins with the deployment of the images, 2 helper scripts are provided on the forge subversion repository. The scripts are intended to be run with your g5kadmin account.
To dispatch a new environment on all sites:
- broadcast_image.rb [sites] [MyEnv.tgz] [MyEnv.env]:
  - ruby script that will broadcast and register an image via taktuk
  - taktuk needs to execute 'sudo su deploy' on the frontends, therefore the script has to be launched with the g5kadmin account
  - the first argument is a comma separated list of sites; the special keyword all will use the API to fetch all the available sites
  - the second and third arguments are paths to the image and to the env
ruby broadcast_image.rb all ~/images/mylenny-x64.tgz ~/images/mylenny-x64.env
ruby broadcast_image.rb bordeaux,lyon,lille ~/images/mylenny-x64.tgz ~/images/mylenny-x64.env
To register a new production environment on all sites after generating it with chef:
- common_env_register.rb [sites] [MyEnv.env] [Kaenv_Directory]:
  - ruby script that will register a production environment via taktuk
  - taktuk needs to execute 'sudo su deploy' on the frontends, therefore the script has to be launched with the g5kadmin account
  - the first argument is a comma separated list of sites; the special keyword all will use the API to fetch all the available sites
  - the second argument is a kaenv3 environment file
  - the third argument is the path where the image has been generated (typically the directory named by the KAENV variable you passed when generating your env with chef)
ruby common_env_register.rb all ~granquet/kaenvs/prod/debian-x64-5-prod.dsc ~granquet/kaenvs/prod/
Deletion and Deprecation
This section's purpose is to explain the process of deprecating reference and production environments.
Most (if not all) of our reference and production environments are generated from various versions of Debian : Sid, Etch, Lenny, Squeeze,... So after some time (a few years), certain environments become obsolete and must be removed from the Grid'5000 platform because :
- Very few users ever deploy them,
- They are not maintained, since bugs from those versions have been corrected in more recent versions,
- Driver updates and security updates are stopped, so they no longer work on all clusters,
- They are too old.
Deprecated environments should be removed from Grid'5000 platform in a way that :
- Classic users should not be bothered with information about deprecated environments,
- Documentation on deprecated environments should still be available to users who explicitly want to use them,
- Only users who really want to use them should be able to do so.
Here are the actions to do once an environment is declared as deprecated:
- Its documentation on the wiki should be updated to clearly state that it is deprecated: on the Environment global page and on the environment description page (example: Sid-x64-big-1.1)
- On each site:
  - Its description should be stored in a file in the directory /grid5000/descriptions/, with 2 requirements:
    - The word deprecated should be added to the environment description file name. Example: /grid5000/descriptions/sid-x64-big-1.1-deprecated.dsc
    - The visibility parameter should be set to shared within that description file.
  - The environment tarball and post-install (.tgz files) should be neither moved nor deleted from their respective directories /grid5000/images/ and /grid5000/postinstalls/.
- On the API:
  - The environment should be removed from the description of each site. This is done by removing the environment name from the list of environments within the file reference-repository/generators/input/site.rb. Do not forget to remove the corresponding environment files in the directories reference-repository/data/grid5000/sites/site/environments/ before pushing your modifications to the git server.
  - The declaration and description of the environment should be left inside the API, but its state parameter should be set to deprecated. This is done in the file reference-repository/generators/input/environments.rb. This way, users will still be able to consult its description through the API by explicitly requesting it.
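The file naming rule for deprecated descriptions can be sketched as a small helper. `deprecated_description_path` is a hypothetical name used for illustration; the sketch only computes the new file name and assumes description files end in .dsc, as in the example given on this page. Setting the visibility parameter inside the file is not covered.

```ruby
# Hypothetical helper: derives the file name of a deprecated description,
# e.g. sid-x64-big-1.1.dsc => sid-x64-big-1.1-deprecated.dsc
def deprecated_description_path(path)
  base = File.basename(path, ".dsc")
  return path if base.end_with?("-deprecated") # already renamed

  File.join(File.dirname(path), "#{base}-deprecated.dsc")
end

puts deprecated_description_path("/grid5000/descriptions/sid-x64-big-1.1.dsc")
```

The helper leaves a path untouched when it already carries the deprecated suffix, so it can be applied twice without harm.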
Environments should never be completely removed from Grid'5000 platform. It should always be possible to retrieve archives of all environments that once ran on Grid'5000.
But if you absolutely have to delete images of a deprecated environment from your site, make sure first that there are copies of that environment on some other sites.
Note that the important/required files are :
- The environment description file located in the directory /grid5000/descriptions/,
- The environment tarball file located in the directory /grid5000/images/,
- The environment post-install file located in the directory /grid5000/postinstalls/
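Before deleting anything from a site, the presence of these three files on another site can be checked with a sketch like the following. The helper name, the `root:` parameter (handy for trying the code outside a frontend) and the file names in the usage example are all hypothetical; only the three directories come from this page.

```ruby
# Directories named in the list above.
REQUIRED_DIRECTORIES = {
  description: "/grid5000/descriptions/",
  tarball: "/grid5000/images/",
  postinstall: "/grid5000/postinstalls/"
}.freeze

# Returns the full paths of the required files that do not exist under root.
# file_names maps :description, :tarball and :postinstall to bare file names.
def missing_environment_files(file_names, root: "/")
  REQUIRED_DIRECTORIES.map do |kind, directory|
    path = File.join(root, directory, file_names.fetch(kind))
    File.exist?(path) ? nil : path
  end.compact
end

# Hypothetical file names, for illustration only.
files = {
  description: "sid-x64-big-1.1-deprecated.dsc",
  tarball: "sid-x64-big-1.1.tgz",
  postinstall: "sid-x64-big-1.1-post.tgz"
}
missing = missing_environment_files(files)
puts missing.empty? ? "all required files are present" : "missing: #{missing.join(', ')}"
```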
Updating the API
You now have to update the API to reflect the changes you have made to your (new) environment.
The environment description is stored in the git reference-repository in a json file.
Do not edit the json directly: json files are generated using ruby templates and a generator script.
The following commands guide you through the steps of fetching the reference repository, updating the environment template, generating the .json files and sending your modifications upstream.
git clone ssh://g5kadmin@git.grid5000.fr/srv/git/repos/reference-repository.git
vi reference-repository/generators/input/environments.rb
ruby reference-repository/generators/grid5000 reference-repository/generators/input/environments.rb
cd reference-repository
git commit -m "[Environments] Added Environment foo version 42 to the API"
git push
Documentation
Any deployed environment has to be documented on the Environment wiki page.
This documentation shall include an identification sheet and all the steps that have been performed to create this image.
If the image has been created manually, then the documentation needs to include:
- the env on which this new one is based.
- all the tuning of configuration files.
- a list of provided packages and services.
If the image has been created with chef, then it needs to include:
- the environment json
- the git version of the chef repo used to generate the env.
