Grid'5000 user report for Lucas Nussbaum

User information

Lucas Nussbaum (users, user, account-manager, site-manager, ct, nancy, devel, ml-users user)
More user information in the user management interface.


  • Study of peer-to-peer systems using emulation (Networking) [in progress]
    Description: I use Grid'5000 to study peer-to-peer systems using emulation. My experimental platform uses FreeBSD (which I deploy on Grid'5000) and Dummynet, FreeBSD's network emulator. My framework currently runs on a single cluster only.
    Results: My experiments involved the concurrent execution of 5760 BitTorrent clients on 180 nodes running FreeBSD, using our emulation platform P2PLab. This graph shows the progress of the download of a 16 MB file on each client. Clients are started with a 0.25 ms interval. Each virtual node has a 2 Mbps (download) / 128 kbps (upload) network connection. I can provide an EPS file if needed.
    [Chart not available: download progress of the 16 MB file on each client]
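    The figures quoted above imply a few simple bounds, worked out in the sketch below. It assumes that "16 MB" means decimal megabytes and that link rates are in powers of ten; the upload-limited figure is only a rough fluid bound that ignores the initial seed's capacity. The variable names are illustrative and do not come from P2PLab.

```python
# Figures taken from the experiment description above.
CLIENTS = 5760
NODES = 180
FILE_BYTES = 16 * 10**6   # assumption: 16 MB read as decimal megabytes
DOWN_BPS = 2 * 10**6      # 2 Mbps download link per virtual node
UP_BPS = 128 * 10**3      # 128 kbps upload link per virtual node

# Number of emulated BitTorrent clients hosted by each physical node.
clients_per_node = CLIENTS // NODES

file_bits = FILE_BYTES * 8

# No client can finish faster than its own download link allows.
min_download_s = file_bits / DOWN_BPS

# Rough swarm-wide bound: on average a client cannot download faster than
# the average upload rate, if the seed's capacity is negligible.
upload_bound_s = file_bits / UP_BPS

print(clients_per_node, min_download_s, upload_bound_s)
```

    Under these assumptions each physical node hosts 32 virtual clients, and the upload links rather than the download links limit the swarm as a whole.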
  • FreeBSD Setups on Grid'5000 (Other) [achieved]
    Description: FreeBSD was a requirement for my experiments, so I carried out quite a lot of work to support it on Grid'5000, detailed on the FreeBSD wiki page. FreeBSD is now easily deployable on Orsay, but still needs to be patched manually to support the new BCM5780 network cards in the IBM e326m.
    More information here
  • MPI'5000 (Middleware) [in progress]

    MPI5000 is a new transparent layer placed between MPI and TCP. It allows an application composed of several tasks to be distributed across the available nodes according to the grid topology and the application's communication scheme. Our layer therefore needs two data files: one describing the grid topology, including the available nodes and both the latency and the bandwidth between nodes and between sites; and another describing the application's communication patterns, with the size and the number of messages sent between MPI processes. Using these two inputs, our layer should place tasks efficiently on the grid nodes.
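    As an illustration of the placement idea only, here is a minimal greedy sketch. It is not MPI5000's algorithm: it assumes the topology is reduced to a number of free nodes per site (latency and bandwidth are ignored) and that the communication pattern is a table of bytes exchanged per pair of ranks. The names `place_tasks` and `wan_bytes` are hypothetical.

```python
def place_tasks(site_slots, traffic):
    """Greedy sketch: co-locate the heaviest-communicating rank pairs.

    site_slots: {site_name: number_of_free_nodes}
    traffic:    {(rank_a, rank_b): bytes_exchanged}
    """
    free = dict(site_slots)
    placement = {}

    def put(rank, site):
        placement[rank] = site
        free[site] -= 1

    # Visit pairs from the heaviest to the lightest.
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if a in placement and b in placement:
            continue
        if a not in placement and b not in placement:
            site = max(free, key=free.get)   # site with the most free nodes
            if free[site] >= 2:
                put(a, site)
                put(b, site)
            continue
        # Exactly one rank is placed: try to put its peer on the same site.
        placed, other = (a, b) if a in placement else (b, a)
        if free[placement[placed]] > 0:
            put(other, placement[placed])

    # Ranks that could not be co-located go wherever room is left.
    ranks = sorted({r for pair in traffic for r in pair})
    for rank in ranks:
        if rank not in placement:
            site = max(free, key=free.get)
            if free[site] == 0:
                raise ValueError("not enough nodes for all ranks")
            put(rank, site)
    return placement


def wan_bytes(placement, traffic):
    """Bytes that have to cross between sites for a given placement."""
    return sum(n for (a, b), n in traffic.items() if placement[a] != placement[b])


# Hypothetical example: two sites with two nodes each, four ranks.
traffic = {(0, 1): 1000, (2, 3): 1000, (1, 2): 10}
placement = place_tasks({"lyon": 2, "rennes": 2}, traffic)
print(placement, wan_bytes(placement, traffic))
```

    In this example the two heavy pairs each stay within one site, so only the light exchange between ranks 1 and 2 crosses the WAN.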

    Our layer also proposes to transparently split TCP connections between MPI processes in order to take the grid topology into account. This architecture is based on a system of relays placed at the LAN/WAN interface. We replace each end-to-end TCP connection with three connections (two on the LAN, each between a node and a relay, and one on the WAN between the two relays). We thus expect faster loss recovery on the LAN, as well as a reduction in the memory used for local TCP buffers (their size depends on the RTT of the connection). On the relays, we plan to use different TCP implementations or different protocols for local and distant communications. The relays could also schedule messages differently depending on their size; for example, we could give priority to small messages (usually control messages). Finally, as MPI applications mostly use small messages, they are penalised more when the network is congested by large flows. We plan to reserve bandwidth in order to optimise MPI communications on the shared long-distance link. The implementation of our proposal is based on both a library placed between MPI and the system calls and on relay daemons. The architecture is thus independent of MPI implementations.

    Results: For the moment, the relays and the library are in a test phase. We are now testing our architecture on Grid'5000. We will then implement the optimisations proposed above.
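    The idea of giving priority to small messages on a relay can be sketched with a two-class priority queue. This is only an illustration under stated assumptions, not the MPI5000 relay code: the `RelayQueue` class and the 1024-byte threshold are hypothetical.

```python
import heapq
import itertools

SMALL_MESSAGE_BYTES = 1024  # hypothetical threshold, not taken from MPI5000


class RelayQueue:
    """Forward small (typically control) messages before large ones.

    Messages of the same class keep their arrival order.
    """

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival counter, breaks ties FIFO

    def push(self, payload: bytes) -> None:
        is_large = len(payload) > SMALL_MESSAGE_BYTES
        # False sorts before True, so small messages leave first.
        heapq.heappush(self._heap, (is_large, next(self._seq), payload))

    def pop(self) -> bytes:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = RelayQueue()
for message in (b"x" * 4096, b"ack", b"y" * 2048, b"barrier"):
    queue.push(message)

sent_sizes = [len(queue.pop()) for _ in range(len(queue))]
print(sent_sizes)
```

    A strict two-class scheme like this one can starve large messages under a steady stream of small ones; a real relay would probably need some form of fairness, which is left out here.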
  • Debian and Ubuntu Quality Assurance work (Other) [in progress]
    Description: Being involved in Debian and Ubuntu development, I sometimes use Grid'5000 to work on distribution-wide quality assurance tasks. One example of such a task is the rebuild (recompilation) of all Debian and Ubuntu packages from their source packages. Such rebuilds are important because all packages in a release are supposed to be buildable from their sources (in case, one year later, a security patch has to be applied to the sources and a package has to be rebuilt). They also make it possible to track poorly maintained packages and to ensure that all packages meet a minimum quality level. These tasks should not normally conflict with normal Grid'5000 experiments, since they are only submitted at night, and never using reservations (only submissions). I am working towards fully automating such tasks, so that they can run in batch mode. They are currently limited to one cluster, but it should soon be possible to run them across several clusters.
    Results: A complete rebuild of Debian on one node takes about 10 days. Using Grid'5000, Debian could be rebuilt in less than 8 hours, which made it possible to detect many bugs. Tests of package installation were also conducted. About 200 "release critical" bugs were found and fixed in Debian Etch.
  • kstress: stressing kadeploy (Other) [achieved]
    Description: kstress stresses kadeploy by running a series of tests; each test runs 1, 2, 4 and then 8 concurrent deployments.
    Results: discovered bugs in kadeploy; decreased deployment time.
    More information here
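    The test pattern described for kstress (1, 2, 4, then 8 concurrent deployments) can be sketched as follows. This is not kstress itself: `fake_deploy` is a hypothetical stand-in that only sleeps, where a real harness would launch an actual kadeploy deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_deploy(job_id):
    """Hypothetical stand-in for one deployment; returns True on success."""
    time.sleep(0.01)
    return True


def stress(deploy, levels=(1, 2, 4, 8)):
    """Run `deploy` at each concurrency level.

    Returns {level: (number_of_successes, elapsed_seconds)}.
    """
    report = {}
    for level in levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=level) as pool:
            outcomes = list(pool.map(deploy, range(level)))
        report[level] = (sum(outcomes), time.perf_counter() - start)
    return report


report = stress(fake_deploy)
for level, (ok, seconds) in report.items():
    print(f"{level} concurrent: {ok} succeeded in {seconds:.3f}s")
```

    Comparing the elapsed time and success count across levels is what would reveal concurrency bugs or scaling problems in the deployment tool.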
  • refenv: checking reference environment (Other) [achieved]
    Description: refenv checks programs and their versions in reference environments.
    More information here
  • end-to-end (node-to-node, over the backbone) 10 Gbps data transfer (Networking) [achieved]
    Description: The goal of this experiment was to check whether it would be possible to make use of the 10 Gbps backbone directly from nodes.

    I used two sites:
    - Lyon, where some machines from the RESO team are connected to the Grid'5000 network using 10 GbE (Myrinet 10G cards connected to a Fujitsu 10G switch).
    - Rennes, where the Myrinet 10G network (running in "Myrinet", not "Ethernet" mode) is connected to the site's main router via an Ethernet/Myrinet bridge (ref: 10G-SW16LC-6C2ER). So I used IP over Ethernet, emulated over MX from the node to the Myrinet switch, then 10 GbE from the switch to Rennes' router.

    I used the routable IP addresses in the 10.0.0.0/8 range to configure the NICs at both ends. In Lyon, the myri10ge driver was used, while in Rennes, the proprietary Myrinet driver was used.

    Issues encountered:
    - the proprietary Myrinet driver has a bug that causes frames coming from the Ethernet bridge to be broadcast to all nodes. Fixed in driver version 1.2.8.
    - TCP buffers need to be tuned.

    The achieved bandwidth was a bit disappointing.
    Rennes->Lyon: 3.2 Gbps max
    Lyon->Rennes: 4.7 Gbps max

    Possible bottlenecks (not investigated yet):
    - Ethernet emulation over MX: unlikely, because node-to-node bandwidth is much higher (8.9 Gbps measured)
    - Ethernet/Myrinet bridge: we would need to stress it using local nodes
    - backbone network
    - TCP tuning: I configured the buffer sizes to 64M to be on the safe side, but didn't play with the various TCP congestion control algorithms.

    The nodes in Lyon were not deployable, and kernel version 2.6.18 was used. After installing a 2.6.26 kernel, the max bandwidth was about the same, but results were more reproducible.

    The results were not very reproducible (only the max bandwidth is given above): the bandwidth sometimes stayed stable at a much lower rate, so the most likely suspect is a problem with TCP tuning.
    I did the following experiment:
    A single Myrinet node sends data through the Myrinet/Ethernet bridge to an increasing number of Ethernet nodes in Rennes. The expected result was to reach a 10 Gbps outgoing bandwidth. The bandwidth stopped increasing after three target nodes were added, at around 3 Gbps. This clearly points to the bridge as the bottleneck.
    Results: "it works."
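    On the TCP buffer tuning mentioned above: a single TCP flow needs a window of at least the bandwidth-delay product (bandwidth × RTT) to fill a link. The sketch below works out what a 64M buffer covers at 10 Gbps. It assumes "64M" means 64 MiB; the Lyon-Rennes RTT is not given in this report, so the 10 ms value is purely hypothetical.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bps * rtt_s / 8


def max_rtt_covered_s(buffer_bytes, bandwidth_bps):
    """Largest RTT at which a window of `buffer_bytes` still fills the link."""
    return buffer_bytes * 8 / bandwidth_bps


LINK_BPS = 10 * 10**9        # 10 Gbps backbone
BUFFER_BYTES = 64 * 2**20    # assumption: "64M" read as 64 MiB

rtt_limit_ms = max_rtt_covered_s(BUFFER_BYTES, LINK_BPS) * 1000
example_bdp = bdp_bytes(LINK_BPS, 0.010)  # hypothetical 10 ms RTT

print(f"64 MiB covers RTTs up to about {rtt_limit_ms:.1f} ms at 10 Gbps")
print(f"BDP at a 10 ms RTT: {example_bdp / 1e6:.1f} MB")
```

    The usable window is somewhat smaller than the socket buffer, because the kernel reserves part of it for bookkeeping, so these numbers are upper bounds.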
  • Fixing the Linux implementation of the TCP CUBIC congestion avoidance algorithm (Networking) [achieved]
    Description: The TCP CUBIC algorithm, which is the default congestion avoidance algorithm since Linux 2.6.19, gained a new set of heuristics called HyStart in Linux 2.6.29 [1]. HyStart uses RTT and ACK spacing measurements to decide when to exit slow start. However, experiments on Grid'5000 demonstrated very poor performance in some cases [2]. We proposed some patches [3] to improve the behaviour of Linux and, after more discussion, another set of patches [4] was proposed and applied to Linux. The fixes will be available in Linux 2.6.39, and are likely to be backported to stable kernel versions.
    [1] Sangtae Ha and Injong Rhee. Taming the Elephants: New TCP Slow Start. NCSU Technical Report, 2008.
    [2] [3] [4]
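    To give an idea of how an RTT-based slow-start exit can work, here is a simplified sketch of a delay-increase check in the spirit of HyStart. It is not the Linux code and covers only the RTT part, not the ACK spacing part. The divisor (1/8 of the base RTT) and the 4 ms / 16 ms bounds are illustrative assumptions, and the function names are hypothetical.

```python
def clamp(value, low, high):
    return max(low, min(value, high))


def delay_threshold_ms(base_rtt_ms, low_ms=4.0, high_ms=16.0):
    """RTT increase tolerated before leaving slow start (assumed constants)."""
    return clamp(base_rtt_ms / 8, low_ms, high_ms)


def should_exit_slow_start(base_rtt_ms, round_rtt_samples_ms):
    """Delay-increase check for one round of RTT samples.

    base_rtt_ms:          smallest RTT seen on the connection so far
    round_rtt_samples_ms: RTT samples collected during the current round
    """
    if not round_rtt_samples_ms:
        return False
    # Use the minimum of the round to filter out transient spikes.
    current_rtt_ms = min(round_rtt_samples_ms)
    return current_rtt_ms >= base_rtt_ms + delay_threshold_ms(base_rtt_ms)


print(should_exit_slow_start(100.0, [101.0, 105.0]))
print(should_exit_slow_start(100.0, [113.0, 120.0]))
```

    A check of this kind is sensitive to how RTT samples are collected and filtered: a threshold that triggers too easily ends slow start well before the link is full.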



    Success stories and benefits from Grid'5000

    • Overall benefits
    • Grid'5000 allows me to easily run my experiments on a large number of high-performance nodes, without the additional nodes adding much work. The kadeploy deployment tool makes it very easy to deploy complex customized environments efficiently on a large number of nodes. Its support of ".dd.gz" images allowed me to deploy non-Linux systems like FreeBSD.

    last update: 2011-06-22 11:20:00