Hemera:WG:Data

From Grid5000
Revision as of 16:27, 7 March 2011 by Cperez (talk | contribs)

Efficient management of very large volumes of information for data-intensive applications (Data)

Leaders: Gabriel Antoniu (KERDATA), Jean-Marc Pierson (ASTRE)

In recent years, a wide spectrum of applications has started to exploit large-scale distributed storage infrastructures and to use tremendous volumes of data (e.g. up to petabytes). Initiated as grids were starting to develop, this trend is growing stronger with the emergence of distributed cloud environments supported by major actors such as IBM, Google, Yahoo!, Amazon and others. This increases the sustainability of grid technologies by proposing an economic model. Data-intensive applications making use of such large-scale distributed infrastructures have emerged in nuclear physics, healthcare, environment, e-business, etc. Such applications deal with various data types (images, text, video, numerical values, ...). Data are often distributed, heterogeneous, structured or unstructured, semantically rich or enriched, and sometimes confidential. They may be stored as raw or structured data, in distributed file systems, in distributed databases, or in more sophisticated cloud data storage services or data warehouses. Whatever the means of producing these data (synthetic data, sensor data, real-life data or simulation data), the need arises to store, manage and process them in a reliable, secure and efficient way at a large scale.

The challenge of this working group is to provide high-level services for information management (search, mining, visualization, processing) for very large volumes of distributed data, while taking into account specific constraints related to security, efficiency and heterogeneity, according to the application requirements and to the execution infrastructures (grids, clouds, ...).

Meeting this challenge requires addressing several issues. At the lowest level, the main challenges relate to fault tolerance, caching, transport, security (access control, encryption), and consistency. At an intermediate level come issues concerning interoperability among storage systems (e.g. interoperability with cloud technologies), data indexing, etc., while making the heterogeneity of the technologies transparent to the higher-level data management services. Finally, at the highest level come challenges related to data mining, data classification, data assimilation, knowledge extraction, and data visualization, based on intelligent metadata management and efficient algorithm design. To tackle all of the challenges mentioned above, it is important to leverage the experience of several research communities: distributed applications, distributed systems (including cluster systems, grid systems, peer-to-peer systems), fault-tolerant systems, databases, data mining, security, numerical algorithms, etc.