Environment Management

From Grid5000

Introduction

This page describes how to create, update and delete images (both production and reference) on Grid'5000. The following six steps describe the general process for creating or updating an environment.

  1. Deploy on a node
  2. Tune or generate a new environment
  3. Test and validate the new environment on all clusters
  4. Deploy the new environment on all the Grid'5000 frontends, into /grid5000/images
  5. Update the API
  6. Create an info page on the wiki

Versioning

The environment name is built as: distribution name-architecture-three-number version (major.minor.micro)-flavor.

The major component indicates the distribution version, whereas the minor and micro components follow the internal Grid'5000 version numbering. The minor version is bumped whenever drastic changes that can break the user experience occur; otherwise, bumping the micro component is recommended.

For reference environments, the minor component is bumped if users cannot easily port their own environment, based on the previous version, to the new one.

Examples of valid names:

debian-x64-5.2.8-base
debian-x64-5.6.12-prod
fedora-x64-13.1.2-big

Distributions that use two-component version numbers, such as Gentoo, remain a problem. For example, the latest Gentoo profile is 10.0, which would give:

gentoo-x64-10.0.1.0-nfs
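
The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the Grid'5000 tooling: the function names is_valid_env_name and env_version_parts are hypothetical, and the character classes allowed for each field are an assumption inferred from the examples.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Grid'5000 tooling): succeeds when the
# name matches distribution-architecture-major.minor.micro-flavor.
# The allowed characters per field are an assumption based on the examples.
is_valid_env_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z]+-[a-z0-9]+-[0-9]+\.[0-9]+\.[0-9]+-[a-z]+$'
}

# Print the major, minor and micro components of a valid name,
# separated by spaces. Fails if the name is not valid.
env_version_parts() {
  is_valid_env_name "$1" || return 1
  printf '%s\n' "$1" |
    sed -E 's/^[a-z]+-[a-z0-9]+-([0-9]+)\.([0-9]+)\.([0-9]+)-[a-z]+$/\1 \2 \3/'
}
```

With this sketch, debian-x64-5.6.12-prod is accepted, while gentoo-x64-10.0.1.0-nfs is rejected because its version has four components, which illustrates the Gentoo problem described above.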

Creation

With jenkins

Environments can be created with Jenkins.

Production environments will be available in ~ajenkins/public/prod on each site. Reference environments will be available in ~ajenkins/public/ref-env on the site configured in the Jenkins tasks.

With chef

Create an environment

Get the git repository:

git clone git@gitolite.grid5000.fr:chef-node-bootstrap

Add a new environment in the chef-node-bootstrap/envs folder. As an example, here is the Rennes production environment, prod-rennes.json:

{
  "grid5000": { "site": "rennes" },
  "oar": { "version": "2.4" },
  "recipes": [ "setup", "g5kparts", "oar", "ganglia", "sensors", "kernel", "myrinet", "ofed", "openmpi" ]
}

Generate an environment

manually generating the environment

  • If needed, update the bootstrap and the JSON environment files:
rake bootstrap:build
rake bootstrap:push
rake envs:push
  • Deploy a reference environment (like squeeze-x64-big) on a node and install debootstrap.

Then you can start generating the environment:

cd /tmp
wget http://git.grid5000.fr/chef/scripts/gen.sh
bash gen.sh squeeze-x64-base

This last command generates the image /tmp/squeeze-x64-base-<version>.tgz.

The argument squeeze-x64-base breaks down as follows:
squeeze is the distribution name as passed to debootstrap (currently only wheezy, squeeze and lenny are supported by our recipes),
x64 is the architecture,
base is the variant name.
It matches a file envs/[distrib-arch-variant].json in the git repo.
The current <version> of the image is defined inside that JSON file.
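
As an illustration of how the argument breaks down, here is a small sketch. The function names are hypothetical, and the version is passed explicitly because the name of the JSON key that holds it is not given here.

```shell
#!/bin/sh
# Hypothetical helper: split "distrib-arch-variant" into its three fields.
split_env_name() {
  name=$1
  distrib=${name%%-*}   # text before the first dash
  rest=${name#*-}       # text after the first dash
  arch=${rest%%-*}
  variant=${rest#*-}
  printf 'distrib=%s arch=%s variant=%s\n' "$distrib" "$arch" "$variant"
}

# Path of the tarball that gen.sh is described as writing,
# for a given environment name ($1) and version ($2).
expected_image_path() {
  printf '/tmp/%s-%s.tgz\n' "$1" "$2"
}
```

For example, split_env_name squeeze-x64-base prints distrib=squeeze arch=x64 variant=base.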

  • For a production environment, the command looks like this:
bash gen.sh rennes_squeeze-x64-prod

let the scripts do the job

Ensure that your /chef-node-bootstrap/config/config.yaml is correctly set.

---
url: https://api.grid5000.fr/2.0/grid5000
username: user_name
gateway: 194.254.60.5
password: user_password

The Rakefile in the chef-node-bootstrap repo contains everything you need to automate the generation of an environment.

The following commands generate a new production environment on all sites and put the .tgz in the ~/kaenvs/prod_rc/ directory of each site's frontend.

rake bootstrap:build
rake bootstrap:push
rake envs:generate SITE=all KAENV=prod_rc ENV_NAME=prod 

You'll be notified in the Grid'5000 notification room whenever an environment is ready.
There are 3 mandatory parameters for envs:generate:

  • SITE: site where the generation is launched. Comma-separated list of Grid'5000 sites, or the keyword all
  • KAENV: name of the directory under ~/kaenvs/ where the generated images and logs are put. Results in: ~/kaenvs/$KAENV
  • ENV_NAME: basename of the environment description to generate (e.g. to generate the environment described by squeeze-x64-base.json in the envs/ directory, pass ENV_NAME=squeeze-x64-base)

There are 2 special keywords that are not passed as-is, but expanded in the Rakefile:

  • SITE=all => generates on all sites; only useful for production environments
  • ENV_NAME=prod => expanded into $SITE_squeeze-x64-prod for each site; only useful when using multiple sites
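
To make the keyword expansion concrete, here is a minimal shell sketch of the behaviour described above. It is not the Rakefile's code (which is Ruby): the function names are hypothetical and ALL_SITES is a placeholder list, whereas the real list of sites comes from the platform.

```shell
#!/bin/sh
# Placeholder list for illustration only; not the real list of sites.
ALL_SITES="rennes lyon lille"

# Expand the SITE parameter into one site per line.
expand_sites() {
  if [ "$1" = "all" ]; then
    printf '%s\n' $ALL_SITES
  else
    printf '%s\n' "$1" | tr ',' '\n'
  fi
}

# Expand the ENV_NAME parameter ($1) for a given site ($2).
expand_env_name() {
  if [ "$1" = "prod" ]; then
    printf '%s_squeeze-x64-prod\n' "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Usage: show what would be generated for SITE=rennes,lyon ENV_NAME=prod
for site in $(expand_sites "rennes,lyon"); do
  echo "$site: $(expand_env_name prod "$site")"
done
```

Note that in an actual shell script the expansion has to be written ${SITE}_squeeze-x64-prod, since $SITE_squeeze would be read as a variable named SITE_squeeze.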

Those three commands and their interactions with the rest of Grid'5000 result in the following sequence diagram (simplified, so not 100% accurate: here the apiserver actually includes the oar-server and the kadeploy-server):
Chef-seq.png

from an existing environment

nothing new here; the kadeploy3 tutorial contains everything you need to know: https://www.grid5000.fr/mediawiki/index.php/Deploy_environment

Qcow2 images

  • Deploy a reference environment (nfs or big) on a node.

Then you can start generating the image:

cd /tmp
wget http://git.grid5000.fr/chef/scripts/gen_qcow2_img.sh
bash gen_qcow2_img.sh wheezy-x64-base-1.1

This last command will generate the image /tmp/wheezy-x64-base-1.1.qcow2.

The parameter must be the name of an existing tgz environment stored in /grid5000/images on the site you are working on.
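
Before launching the generation, it can help to check that the tarball is actually there. This is a minimal sketch: check_tgz_exists is a hypothetical helper, and it assumes the tarball is stored as <name>.tgz directly under the images directory.

```shell
#!/bin/sh
# Assumption: the tarball is stored as $IMAGES_DIR/<name>.tgz.
IMAGES_DIR=${IMAGES_DIR:-/grid5000/images}

# Hypothetical helper: succeed only if the tarball for environment $1 exists.
check_tgz_exists() {
  tgz="$IMAGES_DIR/$1.tgz"
  if [ -f "$tgz" ]; then
    echo "found $tgz"
  else
    echo "missing $tgz" >&2
    return 1
  fi
}
```

It could then be used as: check_tgz_exists wheezy-x64-base-1.1 && bash gen_qcow2_img.sh wheezy-x64-base-1.1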

Tests and Validation

Reference Environment

We describe the tests associated with the 3 flavors of the reference environments:

  • base Environment
    • System Integrity regarding the API
    • System Rights integrity (as in bug #3061)
  • nfs Environment
    • base Environment tests
    • check for nfs mounts
    • check for ldap accounts
  • big Environment
    • nfs Environment tests
    • Bandwidth tests with MPI (this implies deploying 2 nodes per cluster)

Production Environment

The production environment differs slightly from a big reference environment in that it is an oar-node, and thus needs further testing.
We propose to deploy the nodes in a container job, in order to bring the nodes up and test the oar-node functionality (submit a job) without interfering with the platform.
The integration of g5k-code in the production environment shall also be tested.
The rest of the tests are identical to those of a big flavor reference environment.

As an example, this is how you would launch this test by hand:

 oarsub -t container -t deploy -I -l walltime=2:00
 kadeploy3 -u root -e debian_prod -f $OAR_NODEFILE
 oarsub -I -t inner=$OAR_JOBID -l nodes=1,walltime=0:30

If the second oarsub succeeds, you can consider the oar-node functional.

This can be summarized with the following sequence diagram:
Prod-env-qualif.png

Jenkins

Environments created with Jenkins can be deployed on one node of each cluster with the following tasks:

Be careful: this task only tests whether the deployment is successful on one node of each cluster; no other test is run on the environment.

Scripts

Deploy

Shell script

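This script can be used to deploy and register an environment on all sites (from a frontend, as g5kadmin):

#!/bin/sh
# Copy an image and its description to each site, then register it with kaenv3.

if [ -z "$1" ] || [ -z "$2" ]; then
  echo "usage: $0 tgz_file desc_file [sites]"
  exit 1
fi

# Space-separated sites come from the third argument,
# or from the SITES environment variable.
if [ -n "$3" ]; then
  SITES=$3
fi
if [ -z "$SITES" ]; then
  echo "no site given: pass a third argument or set SITES" >&2
  exit 1
fi

TGZ=$(basename "$1")
DSC=$(basename "$2")

for site in $SITES; do
  # Install the image tarball and hand it over to the deploy user
  scp "$1" "$site:/tmp"
  ssh "$site" "sudo mv /tmp/$TGZ /grid5000/images/users/ubuntu/"
  ssh "$site" "sudo chown deploy /grid5000/images/users/ubuntu/$TGZ"

  # Same for the environment description
  scp "$2" "$site:/tmp"
  ssh "$site" "sudo mv /tmp/$DSC /grid5000/descriptions/users/ubuntu/"
  ssh "$site" "sudo chown deploy /grid5000/descriptions/users/ubuntu/$DSC"

  # Register the environment as the deploy user
  ssh "$site" "sudo su deploy -c \"kaenv3 -a /grid5000/descriptions/users/ubuntu/$DSC\""
done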


Unmaintained scripts

Warning: script broken (2014-02-18)

One script is provided on the forge subversion repository to deploy and test an environment on all clusters:


  • deploy_env_all_cluster.rb
Sensible defaults have been set: running it without options deploys one
lenny-x64-base per cluster and runs uptime on the nodes.
You will be notified via XMPP on the Grid'5000 jabber.
The script can be configured via a YAML file, ~/.restfully/deploy_env_all.yml.
A summary is displayed at the end of the script.
The script is designed to be quite modular and to fit different needs.
The list of available options can look impressive at first, but most of the time the default values are suitable.
Examples covering the most useful options are shown below:
#deploying one lenny-x64-base on all clusters and launching uptime
ruby deploy_env_all_cluster.rb

#deploying one reference environment on all nodes and launching a ping on frontend
ruby deploy_env_all_cluster.rb -e http://public.rennes.grid5000.fr/~granquet/lenny-x64-base_rc.env -c 'ping -c1 frontend'

#deploying gentoo on 10 nodes of suno and compiling a kernel with distcc
ruby deploy_env_all_cluster.rb -e http://public.rennes.grid5000.fr/~granquet/gentoo-x64-base.env --clusters suno -s sophia --numnodes 10 -c 'echo @N > /etc/distcc/hosts && emerge --sync && USE=symlink emerge gentoo-sources && cd /usr/src/linux && genkernel --kernel-cc=distcc --makeopts=-j80 all'

#testing the production environment on all sites
#this deploys 3 nodes per cluster with the environment named prod, makes an inner reservation and launches an MPI application.
ruby deploy_env_all_cluster.rb -e prod --numnodes 3 -d ',' -c 'mpirun -H @N /home/bordeaux/granquet/bwlat/latency_flow_tests/latency_flow_tests -g' -U $USER -t prod -K ~/.ssh/internal.key.pub

all the options are listed in the help

#displaying help
ruby deploy_env_all_cluster.rb -h
--help -h:       You are here
--env -e: env    environment name
--command -c: command    specify a command line to execute on the first node of the deployment
 the sequences @N, @n, @u, @U, @t will be replaced respectively by the list of nodes involved in the deployment (see delim for a custom separator), the first node of the deployment, your grid5000 username, the username you are connected with over ssh on the node, the type of deployment
--user -u: user          user-name to use on grid5000
--gateway -g: gw         gateway to connect through ssh
--base-url -B: url       url to the API
--type -t: type          type of environment {prod,ref}
--sites -s: sites        comma separated list of sites you want to restrict the deployment on
--nocleanup -n:          do not cleanup the reservations at end of script
--npassword -P:          the password to use to connect to the node with ssh
--delim -d: delim        delimiter for multiple values replacements in command
--nuname -U: username    username to use when connecting to the node
--key -K: keyfile        public ssh key
--clusters       comma separated list of clusters
--numnodes       number of nodes to allocate _per cluster_

Qualify

As seen earlier, the tests defined for the different flavors of the environments stack in a hierarchy, which makes it easy to build a testing framework for all these environments programmatically (through inheritance).
Tools of interest for this framework include:

  • g5k-checks :: tests the hardware against the API
  • hwdb3 :: Launch various tests and record results in a mysql database
  • ivtools :: tests the hardware against the oar database
  • bwlat :: Throughput and Latency testing tools using MPI

Broadcasting an image on all sites

To help admins with the deployment of the images, 2 helper scripts are provided on the forge subversion repository. The scripts are intended to be run with your g5kadmin account.

To dispatch a new environment on all sites:

  • broadcast_image.rb [sites] [MyEnv.tgz] [MyEnv.env]:
Ruby script that broadcasts and registers an image via taktuk.
taktuk needs to execute 'sudo su deploy' on the frontends,
therefore the script has to be launched with the g5kadmin account.
The first argument is a comma-separated list of sites; the special keyword all uses the API to fetch all the available sites.
The second and third arguments are the paths to the image and to the environment description.
ruby broadcast_image.rb --site all --tgz ~/images/mylenny-x64.tgz --dsc ~/images/mylenny-x64.env
ruby broadcast_image.rb --site bordeaux,lyon,lille --tgz ~/images/mylenny-x64.tgz --dsc ~/images/mylenny-x64.env

To register a new production environment after generating it with chef on all sites:

  • common_env_register.rb [sites] [MyEnv.env] [Kaenv_Directory]:
Ruby script that registers a production environment via taktuk.
taktuk needs to execute 'sudo su deploy' on the frontends,
therefore the script has to be launched with the g5kadmin account.
The first argument is a comma-separated list of sites; the special keyword all uses the API to fetch all the available sites.
The second argument is a kaenv3 environment file.
The third argument is the path where the image has been generated (typically the KAENV variable you passed when generating your environment with chef).
ruby common_env_register.rb all ~granquet/kaenvs/prod/debian-x64-5-prod.dsc ~granquet/kaenvs/prod/

Deletion and Deprecation

This section's purpose is to explain the process of deprecating reference and production environments.

Most (if not all) of our reference and production environments are generated from various versions of Debian: Sid, Etch, Lenny, Squeeze, ... After some time (a few years), some environments become obsolete and must be removed from the Grid'5000 platform because:

  1. Very few users ever deploy them,
  2. They are not maintained, since bugs from those versions have been corrected in more recent versions,
  3. Driver updates and security updates have stopped, so they no longer work on all clusters,
  4. They are too old.

Deprecated environments should be removed from the Grid'5000 platform in such a way that:

  • Classic users should not be bothered with information about deprecated environments,
  • Documentation on deprecated environments should still be available to users who explicitly want to use them,
  • Only users who really want to use them should be able to do so.


Here are the actions to do once an environment is declared as deprecated:

  1. Its documentation on the wiki should be updated to clearly specify that it is deprecated : on the Environment global page and on the environment description page (example Sid-x64-big-1.1 )
  2. On each site :
    1. Its description should be stored in a file in the directory /grid5000/descriptions/ , with 2 requirements:
      1. The word deprecated should be added to the environment description file name . Example : /grid5000/descriptions/sid-x64-big-1.1-deprecated.dsc
      2. The visibility parameter should be set to shared within that description file.
    2. The environment tarball and post-install (.tgz files) should be neither moved nor deleted from their respective directories /grid5000/images/ and /grid5000/postinstalls/.
  3. On the API:
    1. The environment should be removed from the description of each site. This is done by:
      1. removing the environment JSON file from each site's API description, reference-repository/data/grid5000/sites/site/environments/environment.json
      2. removing the content of the parameter available_on from the environment generator file reference-repository/generators/input/environments/environment.rb
    2. The declaration and description of the environment should be left inside the API, but its state parameter should be set to deprecated. This is done in the file reference-repository/generators/input/environments/environment.rb. This way, users will still be able to consult its description through the API by explicitly requesting it.
    3. Also move the environment generator file to the directory for deprecated environments: reference-repository/generators/input/environments/deprecated/environment.rb
    4. Generate the deprecated JSON file with its new configuration, from your local machine:
reference-repository$ rake env:generate ENV_NAME="deprecated/environment.rb"
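
For the per-site renaming step, the new description file name can be derived from the old one. This is a minimal sketch under the assumption that description files end in .dsc; deprecated_dsc_path is a hypothetical helper that only computes the path and does not touch any file.

```shell
#!/bin/sh
# Hypothetical helper: print the deprecated file name for a description file,
# e.g. .../sid-x64-big-1.1.dsc -> .../sid-x64-big-1.1-deprecated.dsc
# Assumes the description file ends in .dsc.
deprecated_dsc_path() {
  dir=$(dirname "$1")
  base=$(basename "$1" .dsc)
  # Leave the name alone if it is already marked as deprecated
  case "$base" in
    *-deprecated) ;;
    *) base="$base-deprecated" ;;
  esac
  printf '%s/%s.dsc\n' "$dir" "$base"
}
```

The visibility parameter still has to be set to shared inside the file by hand.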

Here is an example of a deprecated environment :

Environments should never be completely removed from the Grid'5000 platform. It should always be possible to retrieve archives of all environments that once ran on Grid'5000.
If you absolutely have to delete the images of a deprecated environment from your site, first make sure that copies of that environment exist on other sites.

Note that the important/required files are :

  • The environment description file located in the directory /grid5000/descriptions/,
  • The environment tarball file located in the directory /grid5000/images/,
  • The environment post-install file located in the directory /grid5000/postinstalls/
Note: there will soon be a script to automate all these actions.
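
Until such a script exists, the presence of the three required files can be checked with a small sketch like the one below. check_env_files and G5K_ROOT are hypothetical names, and the file names passed as arguments are assumptions to adapt to the environment being checked.

```shell
#!/bin/sh
# Root of the directory tree holding descriptions/, images/ and postinstalls/.
G5K_ROOT=${G5K_ROOT:-/grid5000}

# Hypothetical check. Arguments are file names (not paths):
#   $1 description file, $2 environment tarball, $3 post-install tarball.
# Prints one line per missing file and returns the number of missing files.
check_env_files() {
  missing=0
  for f in "descriptions/$1" "images/$2" "postinstalls/$3"; do
    if [ ! -f "$G5K_ROOT/$f" ]; then
      echo "missing: $G5K_ROOT/$f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}
```

Running it on another site before deleting anything locally gives a quick confirmation that a complete copy still exists there.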

Updating the API

You now have to update the API to reflect the changes made to your (new) environment.
The environment description is stored in the reference-repository git repository, in a JSON file.
Do not edit the JSON directly: JSON files are generated using Ruby templates and a generator script.

The following commands guide you through fetching the reference repository, updating the environment template, generating the .json files and sending your modifications upstream.

git clone ssh://g5kadmin@git.grid5000.fr/srv/git/repos/reference-repository.git
vi reference-repository/generators/input/environments.rb
ruby reference-repository/generators/grid5000 reference-repository/generators/input/environments.rb
cd reference-repository
git add -A
git commit -m "[Environments] Added Environment foo version 42 to the API"
git push

Documentation

Any deployed environment has to be documented on the Environment wiki page.
This documentation shall include an identification sheet and all the steps that were followed to create the image.
If the image has been created manually, the documentation needs to include:

  • the environment on which the new one is based,
  • all the tuning applied to configuration files,
  • a list of provided packages and services.

If the image has been created with chef, it needs to include:

  • the environment JSON,
  • the git version of the chef repo used to generate the environment.

Inform users

As part of bug 5053, users have to be informed of each environment update. The admin in charge of the update should send an email to the users mailing list. Here is an example:

Dear Grid'5000 users,

All wheezy reference environments¹ were updated to version 1.1.
All wheezy variants now use ext4 as root filesystem (/) and are not exposed to Debian CVE-2013-2094.

Here is the changelog for all variants:

wheezy-x64-min²
=============
- Remove netcat
- Use ext4 as filesystem
- Debian CVE-2013-2094 correction (Bug 4999³)

wheezy-x64-base⁴
=============
- Use ext4 as filesystem
- Debian CVE-2013-2094 correction (Bug 4999³)
- Netcat (nc) now points to netcat-openbsd (nc.openbsd) instead of netcat-traditional (nc.traditional) (Bug 5097⁵)

wheezy-x64-nfs⁶
=============
- Use ext4 as filesystem
- Debian CVE-2013-2094 correction (Bug 4999³)
- Netcat (nc) now points to netcat-openbsd (nc.openbsd) instead of netcat-traditional (nc.traditional) (Bug 5097⁵)

wheezy-x64-big⁷
=============
- Use ext4 as filesystem
- Debian CVE-2013-2094 correction (Bug 4999³)
- Netcat (nc) now points to netcat-openbsd (nc.openbsd) instead of netcat-traditional (nc.traditional) (Bug 5097⁵)
- Diffstat installation (Bug 5064⁸)

wheezy-x64-xen⁹
=============
- Use ext4 as filesystem
- Debian CVE-2013-2094 correction (Bug 4999³)
- Netcat (nc) now points to netcat-openbsd (nc.openbsd) instead of netcat-traditional (nc.traditional) (Bug 5097⁵)


As a reminder, previous versions of environments are still available in /grid5000/images/.

Regards,

¹ https://www.grid5000.fr/mediawiki/index.php/Category:Portal:Environment
² https://www.grid5000.fr/mediawiki/index.php/Wheezy-x64-min-1.1
³ https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=4999
⁴ https://www.grid5000.fr/mediawiki/index.php/Wheezy-x64-base-1.1
⁵ https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=5097
⁶ https://www.grid5000.fr/mediawiki/index.php/Wheezy-x64-nfs-1.1
⁷ https://www.grid5000.fr/mediawiki/index.php/Wheezy-x64-big-1.1
⁸ https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=5064
⁹ https://www.grid5000.fr/mediawiki/index.php/Wheezy-x64-xen-1.1