Moving Data around Grid'5000
The Grid'5000 platform provides computing and storage resources for many types of research experiments, in particular in the domain of Big Data. In this tutorial, different aspects of Big Data experiments will be studied, with a focus on the storage resources available in Grid'5000.
Big Data on Grid'5000
This tutorial elaborates on:
- Proposed data lifecycle for Big Data experiments
- Suggested best practices in Big Data experiments.
In the context of Big Data experiments, the following generalised data lifecycle is proposed as a reference schema; it forms the basis of the steps in this tutorial.
Generalised data lifecycle in Big Data experiments
The dotted arrows represent actions outside the domain of Grid'5000, e.g. importing datasets from external sources such as Wikipedia, Google, or Yahoo!
Suggested best practices in Experiments
Motivation for Best Practices: Common User Pitfalls
In the above schema, steps 2 and 5 are not strictly essential. Hence, in most cases experimenters skip them and follow a shorter lifecycle: Step 1 --> Step 3 --> Step 4 --> Step 1
Users tend to adopt the following reduced sequence of steps:
- Step 1: Save their data directly in the home directory (or, at best, on storage5k).
- Step 3: Install data and compute frameworks to work with data directly from the home directory or storage5k.
- Step 4: Run experiments and save results directly on home directory or storage5k.
Clearly, the quality of experiments can be compromised in all three of these steps, because of disk contention. When data is read from or written to shared resources (e.g. /home, storage5k), there is always contention from other user processes reading and writing on the same servers. Contention is unpredictable: it can occur at any hour of the day and last for any length of time. If these I/O operations are part of the measured output of an experiment, they introduce a strong source of irreproducibility.
Moreover, NFS served from a single server, as in Grid'5000, is not the best-performing protocol for moving large datasets.
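One way to observe the contention described above is to compare sequential-write throughput on the shared home directory against a node-local disk. The sketch below is a rough probe, under the assumption that on a reserved Grid'5000 node $HOME is NFS-mounted while /tmp sits on the node-local disk; exact mount points vary by site.

```shell
#!/bin/sh
# Rough sequential-write probe.
# Assumption: $HOME is NFS-mounted, /tmp is on the node-local disk.
probe() {
    target="$1/iobench.$$"
    # Write 64 MiB of zeros; conv=fsync forces the data to reach the
    # server/disk, so the reported rate is not just page-cache speed.
    dd if=/dev/zero of="$target" bs=1M count=64 conv=fsync 2>&1 | tail -n 1
    rm -f "$target"
}

echo "shared (NFS) storage:"
probe "$HOME"
echo "node-local storage:"
probe /tmp
```

Running such a probe at different hours of the day makes the variability visible: the shared-storage figure typically fluctuates with other users' activity, while the local-disk figure stays comparatively stable.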
In the next sub-section, we explain the best practices to adopt in order to improve the quality of experiments. The following sections then demonstrate them on a typical data-storage use case.
Best Practices for selecting storage resources
It is advisable to follow the complete proposed data lifecycle, using different storage resources at each step:
- Step 1 - Import Datasets
- Save initial (external) datasets on shared storage resources (e.g. storage5k, bigdata-srv)
- Step 2 - Prepare Dedicated Storage
- Prepare a pooled dedicated storage on reserved nodes for the duration of an experiment (e.g. dedicated Ceph cluster on reserved nodes)
- Load the datasets from shared storage to the dedicated storage resources; prepare them as necessary (e.g. untar, unzip).
- Step 3 - Install Data/Compute Framework
- Reserve nodes for computations
- Install data/compute frameworks (e.g. Hadoop MapReduce, Spark, Flink)
- Step 4 - Run Experiment
- As per the requirements of the project.
- Step 5 - Save Results
- Move the results from dedicated storage back to shared storage resources (e.g. managed Ceph clusters).