Institutional Repository
Technical University of Crete
EN  |  EL



My Space

Statistical synopses for graph-structured XML databases

Polyzotis, Neoklis, Garofalakis Minos

Full record

Year 2002
Type of Item Conference Full Paper
Bibliographic Citation N. Polyzotis and M. Garofalakis, "Statistical synopses for graph-structured XML databases", in ACM SIGMOD International Conference on Management of Data, June 2002, pp. 358-369.
Appears in Collections


Effective support for XML query languages is becoming increasingly importantwith the emergence of new applications that access large volumes ofXML data. All existing proposals for querying XML (e.g., XQuery) rely ona pattern-specification language that allows path navigation and branchingthrough the XML data graph in order to reach the desired data elements.Optimizing such queries depends crucially on the existence of concise synopsisstructures that enable accurate compile-time selectivity estimates forcomplex path expressions over graph-structured XML data. In this paper,we introduce a novel approach to building and using statistical summariesof large XML data graphs for effective path-expression selectivity estimation.Our proposed graph-synopsis model (termed XSKETCH) exploits localizedgraph stability to accurately approximate (in limited space) the pathand branching distribution in the data graph. To estimate the selectivities ofcomplex path expressions over concise XSKETCH synopses, we develop anestimation framework that relies on appropriate statistical (uniformity andindependence) assumptions to compensate for the lack of detailed distributioninformation. Given our estimation framework, we demonstrate that theproblem of building an accuracy-optimal XSKETCH for a given amount ofspace is ✁✄✂-hard, and propose an efficient heuristic algorithm based ongreedy forward selection. Briefly, our algorithm constructs an XSKETCHsynopsis by successive refinements of the label-split graph, the coarsestsummary of the XML data graph. Our refinement operations act locallyand attempt to capture important statistical correlations between data paths.Extensive experimental results with synthetic as well as real-life data setsverify the effectiveness of our approach. To the best of our knowledge, oursis the first work to address this timely problem in the most general settingof graph-structured data and complex (branching) path expressions