Grid5000:Gotchas

This page documents various gotchas (counter-intuitive features of Grid'5000) that could affect users' experiments in surprising ways.

Network

Topology of ethernet networks

Most (large) clusters have a hierarchical ethernet topology, because ethernet switches with a large number of ports are too expensive. A good example of such a hierarchical topology is the Orsay network (see Orsay:Network), where nodes are first connected to 18 24-port switches, which are themselves connected to a central Cisco Catalyst 6509 switch. When doing experiments that use the ethernet network intensively, it is a good idea to request nodes on the same switch, e.g. using oarsub -l switch=1/nodes=5, or to request nodes connected to a specific switch, e.g. using oarsub -p "switch='cisco2'" -l nodes=5.
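
For instance, an interactive reservation with these constraints could look like the sketch below. The -I flag requests an interactive job; switch names such as cisco2 are site-specific and can be looked up in the node properties (e.g. with oarnodes):

  # 5 nodes that all share one (unspecified) ethernet switch:
  oarsub -I -l switch=1/nodes=5

  # 5 nodes attached to one particular, named switch:
  oarsub -I -p "switch='cisco2'" -l nodes=5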

Performance of ethernet networks

The backplane bandwidth of ethernet switches doesn't usually allow full-speed communications between all the ports of the switch.
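
To see why this matters, take some made-up but plausible numbers: a 48-port gigabit switch needs 48 Gbit/s of switching capacity (in each direction) to be non-blocking. If its backplane only sustains 32 Gbit/s, then at most 32 ports can transmit at line rate at the same time, and an experiment driving all 48 ports at once will measure roughly 32/48 ≈ 0.67 Gbit/s per node instead of the nominal 1 Gbit/s.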

High-performance networks

The topology of Infiniband and Myrinet networks is generally less surprising, and many of them are non-blocking (the switch can handle the total bandwidth of all ports simultaneously). However, there are some exceptions:

  • the Infiniband network in Grenoble is hierarchical (see Grenoble:Network).
  • in Nancy, graphene-144 is connected to the griffon Infiniband switch. This was required in order to free a port on the graphene switch, which is used to connect the two Infiniband switches together. This can impact the performance of your application if you are using all 144 graphene nodes (see the sketch after this list for a way to avoid that node).
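
If that single node matters for your measurements, one possible workaround is to exclude it from the reservation. This is only a sketch: it assumes the OAR property holding the full hostname is called host (on some deployments it is network_address), so check your site's properties first:

  # Reserve graphene nodes while leaving out graphene-144, whose
  # Infiniband port sits on the griffon switch:
  oarsub -I -l nodes=64 -p "cluster='graphene' and host <> 'graphene-144.nancy.grid5000.fr'"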

Compute nodes

All Grid'5000 clusters are supposed to contain homogeneous (identical) sets of nodes, but there are some exceptions.

Hard disks

Due to their high failure rate, hard disks tend to get replaced frequently, and it is not always possible to keep the same model during the whole life of a cluster.

Different CPU performance in the Orsay gdx cluster

The gdx cluster in Orsay is composed of two sets of nodes, as documented in Orsay:Hardware:

  • 186 IBM e326m with 2.0 GHz CPUs
  • 126 IBM e326m with 2.4 GHz CPUs

In order to select one type of node or the other, you need to use the cpufreq OAR property, e.g. oarsub -p "cluster='gdx' and cpufreq='2.0'" to get the 2.0 GHz nodes.
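
A complete reservation line could look like this (a sketch; nodes=4 and the interactive -I flag are arbitrary choices for illustration):

  # 2.0 GHz gdx nodes:
  oarsub -I -l nodes=4 -p "cluster='gdx' and cpufreq='2.0'"

  # 2.4 GHz gdx nodes:
  oarsub -I -l nodes=4 -p "cluster='gdx' and cpufreq='2.4'"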

This is supposed to be fixed eventually, and is tracked in bug 1875 (https://www.grid5000.fr/cgi-bin/bugzilla3/show_bug.cgi?id=1875).

Different machines in the Grenoble adonis cluster

The Grenoble adonis cluster is composed of two different sets of nodes:

  • adonis-1 to adonis-10 have two E5520 CPUs (Intel Xeon Nehalem Gainestown, 4 cores); OAR property: cputype=xeon-Gainestown
  • adonis-11 and adonis-12 have two E5620 CPUs (Intel Xeon Nehalem Gulftown, 6 cores); OAR property: cputype=xeon-gulftown (see the example after this list)
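
The cputype property can then be used with oarsub -p, just like cpufreq on gdx. A sketch (nodes=2 and the interactive -I flag are arbitrary illustration choices):

  # Nodes with E5520 CPUs (adonis-1 to adonis-10):
  oarsub -I -l nodes=2 -p "cluster='adonis' and cputype='xeon-Gainestown'"

  # Nodes with E5620 CPUs (adonis-11 and adonis-12):
  oarsub -I -l nodes=2 -p "cluster='adonis' and cputype='xeon-gulftown'"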

This is supposed to be fixed eventually, and is tracked in bug 3576 (https://www.grid5000.fr/cgi-bin/bugzilla3/show_bug.cgi?id=3576).

Software

  • The production environment on all compute nodes is identical, except for additional drivers and software to support GPUs and Myrinet or Infiniband networks on the sites where that hardware is available.
  • The user frontends are identical.