Institutional Repository
Technical University of Crete
EN  |  EL

Search

Browse

My Space

SPARTAN: A model-based semantic compression system for massive data tables

Babu Shivnath, Garofalakis Minos, Rastogi Rajeev

Full record


URI: http://purl.tuc.gr/dl/dias/5507F55A-78BD-4114-A824-3CB697C32ED1
Year 2001
Type of Item Conference Full Paper
License
Details
Bibliographic Citation S. Babu, M. Garofalakis and R. Rastogi, "SPARTAN: A model-based semantic compression system for massive data tables", in International Conference on Management of Data (SIGMOD 2001), May 2001, pp. 283-294.
Appears in Collections

Summary

While a variety of lossy compression schemes have been developed for certainforms of digital data (e.g., images, audio, video), the area of lossycompression techniques for arbitrary data tables has been left relatively unexplored.Nevertheless, such techniques are clearly motivated by the everincreasingdata collection rates of modern enterprises and the need for effective,guaranteed-quality approximate answers to queries over massiverelational data sets. In this paper, we propose SPARTAN , a system thattakes advantage of attribute semantics and data-mining models to performlossy compression of massive data tables. SPARTAN is based on thenovel idea of exploiting predictive data correlations and prescribed errortolerances for individual attributes to construct concise and accurate Classificationand Regression Tree (CaRT) models for entire columns of a table.More precisely, SPARTAN selects a certain subset of attributes forwhich no values are explicitly stored in the compressed table; instead, conciseCaRTs that predict these values (within the prescribed error bounds)are maintained. To restrict the huge search space and construction costof possible CaRT predictors, SPARTAN employs sophisticated learningtechniques and novel combinatorial optimization algorithms. Our experimentationwith several real-life data sets offers convincing evidence of theeffectiveness of SPARTAN ’s model-based approach – SPARTAN isable to consistently yield substantially better compression ratios than existingsemantic or syntactic compression tools (e.g., gzip) while utilizing onlysmall data samples for model inference.

Services

Statistics