XCluster synopses for structured XML content

Polyzotis, Neoklis; Garofalakis, Minos

Year: 2006
Type of Item: Conference Full Paper
Bibliographic Citation: N. Polyzotis and M. Garofalakis, "XCluster synopses for structured XML content", in 22nd International Conference on Data Engineering, 2006, doi: 10.1109/ICDE.2006.175
We tackle the difficult problem of summarizing the path/branching structure and value content of an XML database that comprises both numeric and textual values. We introduce a novel XML-summarization model, termed XCLUSTERs, that enables accurate selectivity estimates for the class of twig queries with numeric-range, substring, and textual IR predicates over the content of XML elements. In a nutshell, an XCLUSTER synopsis represents an effective clustering of XML elements based on both their structural and value-based characteristics. By leveraging techniques for summarizing XML-document structure as well as numeric and textual data distributions, our XCLUSTER model provides the first known unified framework for handling path/branching structure and different types of element values. We detail the XCLUSTER model, and develop a systematic framework for the construction of effective XCLUSTER summaries within a specified storage budget. Experimental results on synthetic and real-life data verify the effectiveness of our XCLUSTER synopses, clearly demonstrating their ability to accurately summarize XML databases with mixed-value content. To the best of our knowledge, ours is the first work to address the summarization problem for structured XML content in its full generality.
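As a rough illustration of what a selectivity estimate over a clustered synopsis might look like, the following toy sketch estimates the result size of a simple path query with a numeric-range predicate. It is not the XCLUSTER model or its construction algorithm: the `Cluster` class, the fixed-bucket histogram, the per-edge average child counts, and the independence and uniformity assumptions are all hypothetical simplifications chosen for this example, and substring and textual IR predicates are not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical synopsis node: a group of XML elements sharing a label."""
    label: str
    count: int  # number of elements summarized by this cluster
    # Numeric value summary: list of (low, high, fraction_of_values) buckets.
    histogram: list = field(default_factory=list)
    # Child label -> (child cluster, average number of such children per element).
    children: dict = field(default_factory=dict)


def range_fraction(cluster, lo, hi):
    """Estimated fraction of the cluster's values in [lo, hi],
    assuming values are uniformly spread inside each bucket."""
    total = 0.0
    for b_lo, b_hi, frac in cluster.histogram:
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        total += frac * overlap / (b_hi - b_lo)
    return total


def estimate_path(root, labels, value_range=None):
    """Estimated number of elements reached from `root` by following the
    child steps in `labels`, optionally filtered by a range predicate on
    the last step. Assumes structure and values are independent."""
    estimate = float(root.count)
    node = root
    for label in labels:
        node, avg_children = node.children[label]
        estimate *= avg_children
    if value_range is not None:
        estimate *= range_fraction(node, *value_range)
    return estimate


# Toy synopsis: one catalog, 100 books, one price per book.
price = Cluster("price", 100, histogram=[(0, 20, 0.5), (20, 40, 0.3), (40, 100, 0.2)])
book = Cluster("book", 100, children={"price": (price, 1.0)})
catalog = Cluster("catalog", 1, children={"book": (book, 100.0)})

# Roughly: /catalog/book/price[. >= 10 and . <= 30]
est = estimate_path(catalog, ["book", "price"], value_range=(10, 30))
print(f"estimated matches: {est:.1f}")
```

In this toy setup the estimate is the product of the root count, the average child counts along the path, and the fraction of histogram mass that overlaps the queried range. A tighter storage budget would correspond to fewer clusters and coarser histograms, at the cost of accuracy.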