Understanding Grid5000


The platform

Grid'5000 is a scientific instrument designed to support experiment-driven research in all areas of computer science related to parallel, large-scale or distributed computing and networking.

Grid'5000 was prototyped in mid-2003 and took off in 2004.

Status

On April 5th 2011, 7244 cores distributed among 1500 nodes were available to experiments.

The trend is toward a higher ratio of cores per node, while the number of nodes remains stable.
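As a quick check of the figures above, the average number of cores per node on that date can be computed directly. The sketch below uses only the two numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted above for April 5th 2011.
total_cores = 7244
total_nodes = 1500

# Average number of cores per node (the ratio the trend refers to).
cores_per_node = total_cores / total_nodes
print(f"{cores_per_node:.2f} cores per node on average")
```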

Hardware information is gathered at :

The live status of the platform (live or dead machines, job load among the clusters, ...) is at :


Specialized visualizations of the platform are reachable through the Grid5000 API User Interface.

The technical team

The whole team comprises between 15 and 20 engineers, organized into specialized groups :

  • support :
    • All subjects related to platform administration and maintenance
  • development :
    • various developments about the platform and its tools : KaVLAN, Kadeploy, APIs, UMS, ...

For specialized topics such as specific parallel or distributed technologies, you are better off requesting help from the Grid'5000 user community through the users mailing list (users@lists.grid5000.fr).


The resources

Warning

Reserved for Grid'5000 account owners.

Account

Account management is performed through the centralized User Management System (UMS).

  1. Request your account (if not already done) with that form.
  2. Once your request is approved, manage your account with the dedicated interface.

Privileges

Some accounts have higher privileges than others, in order to implement hierarchical management of accounts.

  • User accounts have standard privileges : access to the wiki, web services and grid systems.
    • They are supervised by managers.
  • Account managers : they are in charge of one or several accounts.
    • They are able to approve some requests from user accounts.
    • They are supervised by Top managers.
  • Site managers (or Top Managers) : they are responsible for all the accounts of a site or a project.
    • They are able to approve all the requests from user accounts.
  • Admin accounts : reserved for Grid'5000 staff ; unlimited powers.
    • They are able to repair or unlock things.
  • Observer accounts : access only to the wiki.
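The account hierarchy above can be sketched as a small data model. This is only an illustration, not how UMS is implemented: the `Role` names, the `ACCESS` and `SUPERVISOR` tables and both helper functions are hypothetical. Only the observer and user access sets are stated in the text; giving managers and admins the same service access as users is an assumption.

```python
from enum import Enum


class Role(Enum):
    OBSERVER = "observer"
    USER = "user"
    ACCOUNT_MANAGER = "account manager"
    SITE_MANAGER = "site manager"
    ADMIN = "admin"


# Services reachable by each role. Observer and user entries follow the
# list above; the other three entries are an assumption.
STANDARD_ACCESS = {"wiki", "web services", "grid systems"}
ACCESS = {
    Role.OBSERVER: {"wiki"},
    Role.USER: STANDARD_ACCESS,
    Role.ACCOUNT_MANAGER: STANDARD_ACCESS,
    Role.SITE_MANAGER: STANDARD_ACCESS,
    Role.ADMIN: STANDARD_ACCESS,
}

# Who supervises whom, as described in the list above.
SUPERVISOR = {
    Role.USER: Role.ACCOUNT_MANAGER,
    Role.ACCOUNT_MANAGER: Role.SITE_MANAGER,
}


def can_access(role, service):
    """Return True if the given role may use the given service."""
    return service in ACCESS[role]


def supervision_chain(role):
    """Return the roles supervising `role`, closest supervisor first."""
    chain = []
    while role in SUPERVISOR:
        role = SUPERVISOR[role]
        chain.append(role)
    return chain


print(can_access(Role.OBSERVER, "grid systems"))
print([r.value for r in supervision_chain(Role.USER)])
```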

Reporting

Grid'5000 is a unique tool, but its usage and scientific production are regularly audited by its funders.

Users are therefore strongly encouraged to keep their reports up to date, describing their research activities and listing their experiments, results, publications, collaborations...

Warning

User reports are essential for assessing Grid'5000's scientific impact !

The wiki

Our wiki-based website (powered by MediaWiki) :

  • Gathers the knowledge of engineers and researchers about Grid'5000.
  • Archives and shares information for the community.
  • Features 3 portals : public, users, admin.


Public portal

  • Institutional pages for public visibility
    • Targeted at the whole scientific community: history, descriptions, publications...
  • unauthenticated access

Scientific work produced with Grid'5000 can be accessed at:

Users portal

  • Dedicated to the Grid'5000 user.
  • Unauthenticated access allowed.

Go there to grab basic documentation on how to start with Grid'5000. It features main pointers to :

  • the tutorials : help to learn ...
  • the tools : resource reservation, deployment, ...
  • your account : management, usage, ...
  • the platform information : events, status, ...
  • Meta-information about the social functioning of the Grid'5000 community : charter, support, ...


Admin portal

  • Heart of the Grid'5000 knowledge base
  • Authenticated access only.

It gathers more detailed documentation about :

  • Grid'5000 software.
  • Site hardware.
  • Technical documentation about Grid'5000 usage.
  • Technical committee meeting minutes...

Some technical webservices (monitoring tools, bugzilla database, etc.) are hosted on a separate Helpdesk but all links are provided by the main Grid'5000 portal.


Philosophy

The Grid'5000 wiki is open to its community : do not hesitate to contribute (typos, updates, add-ons...) !

Some editing rules must nevertheless be followed, to avoid chaos and to keep information easy to find and easy to maintain. Pay attention to :

  • Namespace+name when creating a new page.
  • Cross-linking to/from existing pages :
    • New information should be easy to find from existing top pointers (sidebar, portals).
  • Avoid redundancy.
    • Prefer cross-links, inclusions or redirections to simple duplication.

In case of doubt, do not hesitate to ask the wiki administrators and coordinators (David Margery, et al.) for help.

Computing environment

For historical reasons, Grid'5000 has been built upon a network of dedicated clusters. It's not an ad-hoc grid.

This implies 2 ways of using Grid'5000 :

  • At the grid level (to make experiments in a grid environment).
  • At the cluster level (to maximise hardware homogeneity and bandwidth).
    • It's the easiest level to start with as it is the basic building block unit, for hardware as well as for software.
    • But multi-site experiments are favoured by the charter and some tools because these are difficult


Definitions

site

A site refers to the geographical area of the laboratory hosting the machines. Almost all Grid'5000 sites host more than one cluster.

Historically, Grid'5000 hardware has been acquired in incremental steps at each site, each acquisition forming a new cluster.

cluster

A cluster is a set of computers with homogeneous properties.

A cluster is connected to a given network architecture and is physically installed on a given site.

All clusters from a given site rely upon a common network infrastructure.

node

A computer that is part of a cluster is called a node. We can therefore define two types of nodes :

  • compute nodes : the base element of a cluster, on which computations are run ; they are usually referred to simply as nodes.
  • service nodes : these machines are not meant to execute user applications but are dedicated to hosting the grid infrastructure services.
    • Some service nodes are called frontends because of their particular role.

Each node may offer several resources to the users. On Grid'5000, the finest grain of resource is the core.
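The site / cluster / node / core hierarchy defined above can be sketched as nested data structures. This is a minimal illustration, assuming each node is described by a number of CPUs and a number of cores per CPU; the class names, field names and the example cluster and site names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpus: int           # number of CPUs in the node
    cores_per_cpu: int  # number of cores in each CPU

    @property
    def cores(self) -> int:
        # The core is the finest grain of resource.
        return self.cpus * self.cores_per_cpu


@dataclass
class Cluster:
    name: str
    nodes: list = field(default_factory=list)

    @property
    def cores(self) -> int:
        return sum(node.cores for node in self.nodes)


@dataclass
class Site:
    name: str
    clusters: list = field(default_factory=list)

    @property
    def cores(self) -> int:
        return sum(cluster.cores for cluster in self.clusters)


# A homogeneous cluster: every node has the same properties.
cluster_a = Cluster(
    "cluster-a",
    [Node(f"cluster-a-{i}", cpus=2, cores_per_cpu=4) for i in range(1, 4)],
)
site = Site("example-site", [cluster_a])
print(site.cores)  # 3 nodes * 2 CPUs * 4 cores = 24
```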


Architecture

Clusters

Every cluster usually consists of:

  • one or more service nodes (also called frontends)
    • from the user side, these machines are mainly used for access, resource reservation and deployment
    • from the admin side, these machines host virtual hosts and infrastructure services
  • compute nodes : the main computing power.
    • Each node may have several CPUs, and each CPU possibly several cores. The Grid'5000 resource manager allows reservation at the node level and at the core level.
  • Service nodes hosting all the infrastructure services : their systems rely upon virtualization for isolation needs.

Note

In principle, service nodes and frontends do not contribute to the computing power of the cluster and are therefore not counted as compute nodes in the cluster hardware description.

Network

  • All clusters of a site are physically connected.
  • The network topology of each site is described in pages like :
Sidebar > users portal > platform > sites > SITE > network

Infrastructure services

  • System services that make the platform work (DNS, LDAP, ...)
  • Distributed among all sites, with no outbound routing policy so the traffic is well isolated.
  • External services (WiKi, mailing-lists manager...) are centralized on dedicated hosts belonging to the outside (public IP addresses).


Software and middleware

The two main services offered specifically by Grid'5000 are :

  • resource management and job scheduling : the ability for users to request resources on the platform, and the guarantee of having fair access to them.

  • deployment of system images : the ability for the users to re-install the system on their reserved nodes.

Other services are under way :

Resource management and job scheduling

This job is handled by OAR 2, developed at IMAG.

It performs 3 tasks :

  • Reserve resources (compute nodes, cores...) for a given duration, on behalf of the requesting user.
  • Schedule the user's job over the reserved nodes ; the scheduler guarantees a fair use of the machine time.
  • Free the resources at the end of the reservation.

All the resources of a given site are managed by a single instance of OAR 2.
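The reserve / run / free cycle described above can be illustrated with a toy model. This is not OAR's implementation or API: the `ToyResourceManager` class, its methods and the abstract time units are all hypothetical, and the sketch deliberately leaves out job scheduling and fairness.

```python
class ToyResourceManager:
    """Toy model of the reserve -> use -> free cycle (not OAR itself)."""

    def __init__(self, total_cores):
        self.total_cores = total_cores
        self.reservations = {}  # job id -> (user, cores, end_time)
        self._next_id = 1

    def release_expired(self, now):
        """Free the resources of every reservation that has ended."""
        expired = [job_id for job_id, (_, _, end) in self.reservations.items()
                   if end <= now]
        for job_id in expired:
            del self.reservations[job_id]

    def free_cores(self, now):
        """Return the number of cores not reserved at time `now`."""
        self.release_expired(now)
        used = sum(cores for _, cores, _ in self.reservations.values())
        return self.total_cores - used

    def reserve(self, user, cores, now, duration):
        """Reserve `cores` for `duration` time units on behalf of `user`.

        Returns a job id, or None if not enough cores are free.
        """
        if cores > self.free_cores(now):
            return None
        job_id = self._next_id
        self._next_id += 1
        self.reservations[job_id] = (user, cores, now + duration)
        return job_id


rm = ToyResourceManager(total_cores=8)
job = rm.reserve("alice", cores=6, now=0, duration=10)
print(rm.free_cores(now=5))                            # 2 cores left
print(rm.reserve("bob", cores=4, now=5, duration=5))   # None: not enough cores
print(rm.free_cores(now=10))                           # 8: reservation ended
```

A real resource manager would also queue the refused request and start it once resources are freed; that part is omitted here to keep the sketch short.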

Warning

Resource reservations must abide by the User Charter.

Grid level

OAR-Grid 2 is a tool built on top of OAR for Grid reservations : several nodes of several sites at the same time.

Images deployment

Grid'5000 allows users to build their own customized system environment and install it on their reserved nodes.

This task is handled by Kadeploy and the associated ka-* tools. Kadeploy usage is also covered in the next practical sessions.

Two options are available to benefit from a customized image :

  • Building an image from scratch :
    • Requires some preliminary work and specific knowledge.
    • Allows for optimum experiment conditions and reproducibility (as long as the hardware remains the same).
  • Customizing an image maintained by the support staff.
    • Some knowledge about services infrastructure is required (name resolution: DNS, authentication: LDAP, Home directories: NFS, access: SSH, ...).

Clusters hardware

The Grid5000 API UI provides a simple way to browse the hardware constitution of Grid'5000.

The full Grid'5000 hardware description can be browsed programmatically through the API, but it is also gathered there.

Nodes

  • All machines are based on the x86-64 architecture.
  • Processors are either AMD or Intel.
  • Hardware differs slightly from cluster to cluster, so as to create a richer grid ecosystem.

Network

  • At least full 1 Gbps Ethernet interconnection.
  • Low-latency, high-performance networks such as Myrinet or InfiniBand (usually offering 10 Gbps of bandwidth).

Storage

Compute nodes

  • Local disks (at least 80 GB).

The same partitioning scheme is used throughout Grid'5000.

Frontends

  • Huge storage space, mainly for hosting /home directories.
Warning

Home directories are neither backed up nor synchronized from site to site !

Working with Grid'5000

Log in

SSH

Users log in to Grid'5000 over SSH, through 2 SSH gateways, with public key authentication (External access).

The security policy can be summarized as follows :

  1. No outbound connection allowed from Grid'5000 toward the Internet.
  2. Inbound internet connections to site's clusters may be filtered depending on local security policy.
  3. Hosting laboratory networks are allowed to connect to site's Grid'5000 cluster(s).
  4. All traffic allowed between 2 Grid'5000 endpoints.

For more information, see Security model.
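The four rules of the security policy can be written as a small lookup table. This is only a sketch of the summary given above, not the actual firewall configuration: the zone names, the `RULES` table and the `connection_policy` function are hypothetical.

```python
# (source zone, destination zone) -> outcome, following rules 1 to 4 above.
RULES = {
    ("grid5000", "internet"): "denied",          # rule 1: no outbound connection
    ("internet", "grid5000"): "site-dependent",  # rule 2: may be filtered locally
    ("hosting-lab", "grid5000"): "allowed",      # rule 3: lab to its own site's cluster(s)
    ("grid5000", "grid5000"): "allowed",         # rule 4: between Grid'5000 endpoints
}


def connection_policy(src_zone, dst_zone):
    """Return the outcome for a connection, or 'not covered' if no rule applies."""
    return RULES.get((src_zone, dst_zone), "not covered")


print(connection_policy("grid5000", "internet"))
print(connection_policy("grid5000", "grid5000"))
```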

Web

Internet access is filtered, so only a few sites are accessible from inside Grid'5000, for security reasons. Accesses are split into two categories :

  • common : these accesses are common to the whole Grid'5000 platform.
  • site : these accesses are specific to a site.

Common accesses are mainly about :

  • Linux packages mirrors (Debian, Fedora, CentOS, ...)
  • Kernel archive repository
  • INRIA Gforge (INRIA's software forge)

Web accesses are driven by a policy. Please refer to it to gain a deeper understanding of web accesses inside Grid'5000.

If you need an extra access which is not currently authorized from inside Grid'5000, and if you feel it is for a legitimate use of Grid'5000, you may ask for it to be added to the whitelisted hosts, using the default support request procedure.

Communicating

Grid'5000 is a grid, so each node of any site can communicate with any other node of any site.


Home directory

A Grid'5000 account gives you a home directory on each site. This home directory is mounted on the site's frontend and on nodes that use the reference image (the default site system image).

  • Please note that you have a distinct home directory on each site (i.e. 9 Grid'5000 sites --> 9 home directories).

Synchronizing data between your lab's home directory and all your Grid'5000 home directories is your responsibility.

  • Grid'5000 home directories are subject to a quota on each site (soft limit of 25 GB, hard limit of 100 GB).
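The two quota limits can be illustrated with a small check. The sketch below assumes conventional quota semantics, which this page does not spell out: the soft limit may be exceeded temporarily, while the hard limit can never be exceeded. The function name and the returned labels are hypothetical.

```python
SOFT_LIMIT_GB = 25   # soft limit quoted above
HARD_LIMIT_GB = 100  # hard limit quoted above


def quota_status(used_gb):
    """Classify a home directory's disk usage against the two limits."""
    if used_gb >= HARD_LIMIT_GB:
        return "hard limit reached"
    if used_gb > SOFT_LIMIT_GB:
        return "over soft limit"
    return "ok"


for usage in (10, 40, 100):
    print(usage, "GB:", quota_status(usage))
```

Since each site has its own home directory, such a check would apply to each site separately.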

Getting Support

Information to consider :

Next tutorial
