Scalable Distributed Processing Using the MapReduce Paradigm (Map-Red)
Leaders: Gabriel Antoniu (KERDATA), Gilles Fedak (GRAAL)
Map/Reduce is a parallel programming paradigm successfully used by large Internet service providers to perform computations on massive amounts of data. After being strongly promoted by Google, it has also been implemented by the open source community through the Yahoo! project, maintained by the Apache Foundation and supported by Yahoo! and even by Google itself. This model is currently getting more and more popular as a solution for rapid implementation of distributed data-intensive applications. The key strength of the Map/Reduce model is its inherently high degree of potential parallelism that should enable processing of petabytes of data in a couple of hours on large clusters consisting of several thousand nodes.
At the core of the Map/Reduce frameworks stays a key component: the storage layer. To enable massively parallel data processing to a high degree over a large number of nodes, the storage layer must meet a series of specific requirements. Firstly, since data is stored in huge files, the computation will have to efficiently process small parts of these huge files concurrently. Thus, the storage layer is expected to provide efficient fine-grain access to the files. Secondly, the storage layer must be able to sustain a high throughput in spite of heavy access concurrency to the same file, as thousands of clients simultaneously access data.
These critical needs of data-intensive distributed applications have not been addressed by classical, POSIX-compliant distributed file systems. Therefore, specialized file systems have been designed, such as HDFS, the default storage layer of Hadoop. HDFS has how- ever some difficulties in sustaining a high throughput in the case of concurrent accesses to the same file. Amazon’s cloud computing initiative, Elastic MapReduce, employs Hadoop on their Elastic Compute Cloud infrastructure (EC2) and inherits these limitations. The storage backend used by Hadoop is Amazon’s Simple Storage Service (S3), which pro- vides limited support for concurrent accesses to shared data. Moreover, many desirable features are missing altogether, such as the support for versioning and for concurrent up- dates to the same file. Finally, another important requirement for the storage layer is its ability to expose an interface that enables the application to be data-location aware. This is critical in order to allow the scheduler to use this information to place computation tasks close to the data and thus reduce network traffic, contributing to a better global data throughput.
Given this context, the challenge we propose to address is the following: we aim to overcome the limitations of current MapReduce frameworks (such as Hadoop) and enable ultra-scalable MapReduce-based data processing on various physical platforms such as clouds and desktop grids. To meet this global goal, several critical aspects need to be investigated. First, we will explore advanced data and metadata management techniques recently introduced into the BlobSeer BLOB-based distributed management service for grids and clouds. BlobSeer is being designed by the KerData team to enable efficient, fine-grain access to massive, distributed data accessed under heavy concurrency.
Second, we will investigate how the MapReduce programming model can be used to program data-intense applications on Desktop Grid platforms. To achieve this goal, we will leverage Bitdew, a data management platform designed by the GRAAL team to explore specificities of Desktop Grid platforms. In contrast with existing cluster-based implementation of MapReduce file system, Desktop Grids would not allow every storage and computing nodes to communicate with each other. Instead, the approach followed by BitDew is to build a network of trusted nodes which handles the distribution of large data, the scheduling of Map and Reduce tasks, the data flow control as well as the certification of intermediate results. Unlike “real world” Desktop Grid platforms, by using Grid’5000 we can understand the impact of Desktop Grid characteristics on the performances and reliability of our approach by elaborating experiments and scenarios which isolate each one of these parameters. Indeed many of the key charateristics of Desktop Grids can be simulated in a controllable fashion: volatility by generating process crashes, storage, network and CPU heterogeneity by (eventually dynamically) degrading performance and scalability by launching several virtual processes on each physical nodes. Finally, we will also investigate advanced the impact of data and task co-scheduling techniques on the overall performance.
Finally, we will investigate various scheduling issues related to large executions of MapReduce instances. In particular, we will study how the scheduler of the Hadoop implementation of MapReduce can scale over heterogeneous platforms. Care should be taken around the replication of data. Right now, replication is not algorithmically com- puted. However, it must be highly connected to the scheduling of tasks. The scheduling of multiple DAGs of MapReduce applications must taken as a whole to ensure fairness between users.
Our goal is to explore how combinating these techniques may improve the behav- ior of MapReduce-based applications on grids, clouds and desktop grids. We target a scalable execution on more than 1000 nodes spread across multiple sites with real-life workloads from existing data mining applications. Grid’5000 will definitely be highly valuable in such a context, as it provides the ideal experimental infrastructure required by the targeted experiments.