BigData hands-on tutorial


In this tutorial

This tutorial offers information regarding many components of Grid'5000 that may be useful to BigData experimenters.

We first look at those components, then present a basic usage scenario that takes advantage of some of them.

Finally, we present a user-contributed library that may help automate BigData experimentation scenarios.

Example of workflow for a BigData experiment on Grid'5000

[Figure: G5K BigData Workflow]

Grid'5000 components for BigData

Choose resources for the experiment

As for any experiment, a BigData experiment may require looking at compute, network and storage questions.

Compute
  • Which processing units will we use?
  • How much memory do we need?
  • Heterogeneous or homogeneous nodes?
  • At what scale?

To find out what Grid'5000 can provide, one may begin by looking at the Hardware page. Then, looking at each site's hardware page gives more details.

Storage
  • What technology?
  • What capacity and persistence?

The Storage page is a good entry point.

Network
  • What technology?
  • What topology?

The Hardware page is again a good entry point to choose the network technology. But you may also consider looking at the networking tutorials (see KaVLAN) if you wish to set up complex topologies.


Finally, we have to reserve the required resources for the experiment. The platform status page is a good entry point to find out what is available and when, and to organize your experimentation calendar according to the Grid'5000 Usage Policy.
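For instance, once the resources are chosen, a reservation of 4 nodes for 2 hours (illustrative values) can be made interactively with OAR from a site's frontend:

Terminal.png frontend:
oarsub -I -l nodes=4,walltime=2:00:00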

Involved G5K components

Deploy BigData stacks

Tools required by BigData stacks/services may be very specific and are obviously not installed in the Grid'5000 standard environment. It is left to the user to install the software, or the specific version of it, they want to use or evaluate.

Involved G5K components
  • Sudo-g5k: get root privileges in the Grid'5000 environment (without deploying)
  • Kameleon & Kadeploy: fully master your experimentation environment (a short command sketch is given below)
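As an illustration, a minimal sequence could be: run sudo-g5k on a node of a standard (non-deploy) job to get root privileges, or deploy a registered environment with Kadeploy on the nodes of a deploy job (the environment name debian9-x64-big below is the base environment used later in this tutorial; adapt it to your needs):

Terminal.png node:
sudo-g5k

Terminal.png frontend:
kadeploy3 -e debian9-x64-big -f $OAR_NODE_FILE -k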

Manage data

Manage long term data storage

  • request long term data storage
  • either generate input datasets
  • or import input datasets
    • To import data by ssh or rsync, consider using your site's local access machine (access.site.grid5000.fr) in order to possibly speed up your data transfer (connections to a site's access machine are restricted to that site's university and lab networks). See the example below.
  • reuse your data storage between experiments
  • export data (results) and free the storage
    • The same advice applies when exporting data by ssh or rsync: use your site's local access machine (see the example below).
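A hypothetical import and export could look like the following, assuming a username jdoe and the rennes site; the per-site subdirectory used here in the access machine's home is the layout documented for the access machines, and maps to your home directory on that site:

Terminal.png workstation:
rsync -avzP ./my-dataset/ jdoe@access.rennes.grid5000.fr:rennes/my-dataset/

Terminal.png workstation:
rsync -avzP jdoe@access.rennes.grid5000.fr:rennes/my-results/ ./my-results/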
Involved G5K components
  • see the different kinds of storage available on the platform
  • learn to master SSH, rsync, TakTuk, or any distributed copy tool (e.g. hadoop distcp); a short sketch is given below
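For instance, assuming a job is running (so that $OAR_NODE_FILE is defined), a command can be broadcast to all nodes with TakTuk; and assuming a Hadoop cluster has been deployed and the NFS home is reachable from the nodes, a dataset can be copied into HDFS with distcp (paths and the username jdoe are illustrative):

Terminal.png frontend:
taktuk -c "oarsh" -f $OAR_NODE_FILE broadcast exec [ 'df -h /tmp' ]

Terminal.png node:
hadoop distcp file:///home/jdoe/my-dataset hdfs:///user/jdoe/my-dataset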

Access user interfaces

  • Access BigData tools web interfaces
  • Jupyter notebook
Involved G5K components

Automated experiment management

Grid'5000 provides higher-level interfaces for experiment automation, using for instance Python or Ruby rather than shell scripts (bash). This is something to seriously consider when designing experiments.
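As an illustration, here is a minimal sketch of a resource reservation automated with the execo library (the Python library used later in this tutorial); the site, amount of resources and walltime are purely illustrative:

# Minimal execo sketch: reserve nodes, use them, release them.
# Site name, resources and walltime below are illustrative only.
from execo_g5k import OarSubmission, oarsub, oardel, \
    wait_oar_job_start, get_oar_job_nodes

# Submit an OAR job on the rennes frontend
jobs = oarsub([(OarSubmission(resources="nodes=2", walltime="1:00:00"),
                "rennes")])
job_id, site = jobs[0]
try:
    wait_oar_job_start(job_id, site)          # block until the job starts
    nodes = get_oar_job_nodes(job_id, site)   # list of allocated hosts
    print("Got nodes: %s" % nodes)
    # ... run the experiment on these nodes (e.g. with execo.Remote) ...
finally:
    oardel([(job_id, site)])                  # always release the resources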

Involved G5K components

An example of experiment

Prepare the environment (operating system) for the experiment

In this first phase, we generate the Grid'5000 node environment (OS) for the tutorial. We base this environment on the Grid'5000 provided big environment (debian9-x64-big), and add some additional pieces of software.

This is achieved using the kameleon tool.

First clone the tutorial git repo:

Terminal.png frontend:
git clone https://gitlab.inria.fr/grid5000/bigdata-tutorial.git
cd bigdata-tutorial/kameleon

You can then look at the kameleon recipe: debian9-x64-bigdata-tutorial.yaml and associated steps.

You can build the environment as explained in the README.md file.

A log of a previous build is provided in the kameleon.build.log file.

Prepare data storage

In this tutorial, we propose to use storage5k for long-term storage of our input datasets.

To do so, you have to find a site offering the storage5k service, and reserve about 30 GB of NFS storage for at least the duration of the tutorial. The command will usually be:

Terminal.png frontend:
storage5k -a add -l chunks=3,walltime=24

See the storage5k page for more details on how to use storage5k.
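For instance, you can later check your storage reservations and, once the results have been exported, free the storage (the job id below is illustrative; see the storage5k page for the exact options):

Terminal.png frontend:
storage5k -a info

Terminal.png frontend:
storage5k -a del -j 154872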

Run the experiment

In this part of the tutorial, we will use a Jupyter notebook to perform an experiment step by step. We will use the execo Python library as the upper instrumentation layer on top of the Grid'5000 tools (OAR and Kadeploy).

First choose a Grid'5000 site and connect to its frontend
Terminal.png workstation:
ssh site.g5k
Then retrieve the notebook files (clone the repository again on this site's frontend if needed)
Terminal.png frontend:
cd bigdata-tutorial
And start the notebook
Warning.png Warning

You might have to install jupyter first:

Terminal.png frontend:
pip3 install --user jupyter

and if ~/.local/bin is not in your PATH yet:

Terminal.png frontend:
echo 'export PATH=$PATH:~/.local/bin/' >> ~/.bashrc && . ~/.bashrc
Also make sure that your ~/.bashrc file is loaded by your ~/.bash_profile or ~/.profile file.

Run:

Terminal.png frontend:
jupyter notebook --ip=$(hostname -f)

At this point, you must have either an SSH SOCKS proxy or the Grid'5000 VPN running, in order to access the Jupyter notebook from the web browser running on your workstation.
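For example, a SOCKS proxy on a hypothetical local port 2100 can be opened from your workstation as follows (replace jdoe with your Grid'5000 username), and your browser then configured to use localhost:2100 as a SOCKS proxy:

Terminal.png workstation:
ssh -ND 2100 jdoe@access.grid5000.fr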

Open the Jupyter URL in your web browser

Find the Jupyter URL in the output of the jupyter command.

Warning.png Warning

Unfortunately, issues have been reported when using Firefox. You may consider using Chromium (on Debian) or Chrome instead.

This opens a Jupyter file browser (in a web page). Click on Experiment.ipynb.

You can then follow the next steps in the notebook.

Note.png Note

The notebook text can also be read directly on gitlab: https://gitlab.inria.fr/grid5000/bigdata-tutorial/blob/master/Experiment.ipynb

Toward more automated experiments

With the notebook, we already get some automation, but it is still interactivity-centric: in particular, error handling is quite weak.

It is however a good starting point for preparing unattended experiments, such as those enabled by Michael Mercier's library: https://gitlab.inria.fr/evalys/big-data-hpc-g5k-expe-tools.

This small library is the result of factoring out a lot of code from the experiments Michael Mercier has run (and is still running) for his PhD thesis work (whose subject is the HPC-BigData convergence, hence the presence of HPC tools in the library).

As the work of a Grid'5000 user, it is of course open to contributions from other users. Feedback and pull requests are very welcome.

Other contributed works shared with the community are of course also very welcome.