KadoP: A platform for sharing and querying XML data in P2P (Application)
Conducted by: Nicoleta Preda
As part of the KadoP project, developed in the Gemo project at INRIA, we study the scalable management of XML data, more specifically the indexing and querying of a large collection of XML documents and Web services, in a P2P setting based on distributed hash tables (DHTs).
We identified aspects of DHTs that hinder efficient query processing and proposed improvements to lift these limitations. We propose a parallel data structure and an efficient algorithm, based on novel optimization techniques, to index and query XML documents in such a setting. We have already tested the system we developed on a small network within our team. We are currently also testing it on the Orsay cluster, and we intend to run tests on a larger network such as Grid5000. The optimization algorithms and the experiments will be included in an article that is in preparation.
We are concerned with processing queries over a very large number (millions) of XML documents stored on a large number (thousands) of peers. We focus on evaluating tree pattern queries (a subset of XPath) and ignore other aspects of KadoP, such as semantics (concepts and relationships) or Web services. The peers are not too volatile, so our setting is quite different from that of, say, Kazaa. Each peer (node) in the network runs the KadoP application, and the nodes are connected through the FreePastry overlay network.
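To make the setting concrete, here is a minimal sketch of how a tree pattern such as book//title could be answered from a label index spread over peers. It is an illustration only, not KadoP's actual data structure or algorithm: the names Posting, home_peer, ToyIndex and descendant_join are hypothetical, hashing a label modulo the peer count stands in for DHT routing, and the (start, end) interval encoding of element positions is an assumption.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One occurrence of an element label (hypothetical encoding)."""
    doc: str    # document identifier
    start: int  # position of the opening tag
    end: int    # position of the closing tag


def home_peer(label: str, n_peers: int) -> int:
    """Pick the peer responsible for a label (stand-in for DHT routing)."""
    digest = hashlib.sha1(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_peers


class ToyIndex:
    """In-memory simulation of a label index partitioned over peers."""

    def __init__(self, n_peers: int) -> None:
        self.n_peers = n_peers
        self.peers = [defaultdict(list) for _ in range(n_peers)]

    def publish(self, label: str, posting: Posting) -> None:
        # Store the posting on the peer that owns this label.
        self.peers[home_peer(label, self.n_peers)][label].append(posting)

    def lookup(self, label: str) -> list:
        # Fetch the posting list from the owning peer, in sorted order.
        return sorted(self.peers[home_peer(label, self.n_peers)].get(label, []))


def descendant_join(ancestors, descendants):
    """Naive structural join: keep pairs where d is nested inside a."""
    return [
        (a, d)
        for a in ancestors
        for d in descendants
        if a.doc == d.doc and a.start < d.start and d.end < a.end
    ]


index = ToyIndex(n_peers=4)
index.publish("book", Posting("d1", 1, 10))
index.publish("title", Posting("d1", 2, 3))
index.publish("title", Posting("d2", 1, 2))

# Evaluate book//title: only the title nested in the book of d1 qualifies.
matches = descendant_join(index.lookup("book"), index.lookup("title"))
```

The nested-loop join is quadratic and is used here only for brevity; shipping whole posting lists between peers, as this sketch implicitly does, is exactly the kind of cost that matters at the scale described above.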
As a performance measure, we consider the index latency: the time delay between the moment a document is published, and the moment it is seen by the index-queries. To measure query performance, we consider: (i) the overall query processing latency, (ii) the latency of the first answer, (iii) the bandwidth consumed by both the index-query evaluation and by the retrieve-data phase. For both publishing and querying we measure the total traffic consumption in the network for several workloads.
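The measures above can be sketched as simple computations over recorded timestamps and byte counts. This is a hypothetical bookkeeping structure, not the instrumentation used in KadoP: the QueryTrace class, its field names, and the choice of seconds and bytes as units are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryTrace:
    """Timestamps (seconds) and traffic (bytes) recorded for one query."""
    submitted: float           # when the query was issued
    answer_times: List[float]  # arrival time of each answer
    index_bytes: int           # traffic of the index-query evaluation
    data_bytes: int            # traffic of the retrieve-data phase

    def first_answer_latency(self) -> Optional[float]:
        # (ii) delay until the earliest answer; None if no answer arrived.
        if not self.answer_times:
            return None
        return min(self.answer_times) - self.submitted

    def overall_latency(self) -> Optional[float]:
        # (i) delay until the last answer; None if no answer arrived.
        if not self.answer_times:
            return None
        return max(self.answer_times) - self.submitted

    def bandwidth(self) -> int:
        # (iii) total bytes consumed by both phases.
        return self.index_bytes + self.data_bytes


def index_latency(published_at: float, visible_at: float) -> float:
    """Delay between publishing a document and it being seen by index queries."""
    return visible_at - published_at


trace = QueryTrace(
    submitted=10.0,
    answer_times=[10.5, 12.0, 11.25],
    index_bytes=2048,
    data_bytes=8192,
)
```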
The tests we ran on a LAN with a modest number of peers (hundreds) showed gains of more than 60% for the optimized query processing algorithm over a large set of queries.
Expected results in Grid5000
The goal of the experiments we plan to run in Grid5000 is two-fold: (1) verify that publishing and query processing scale with the number of peers and documents published in the network, (2) measure the gains brought by the optimization algorithms.
- Nodes involved: >1000
- Sites involved: >3
- Minimum walltime: >1d
- Batch mode: yes
- Use kadeploy: no
- CPU bound: no
- Memory bound: no
- Storage bound: no
- Network bound: yes
- Interlink bound: yes
Tools used: FreePastry, ActiveXML
More information here
Shared by: Nicoleta Preda
Last update: 2007-02-26 13:52:34