Institutional Repository
Technical University of Crete
EN  |  EL

Search

Browse

My Space

Probabilistic histograms for probabilistic data

Cormode, Graham, 1977-, Deligiannakis Antonios, Garofalakis Minos, McGregor Andrew

Full record


URI: http://purl.tuc.gr/dl/dias/E6E011A7-1E7E-41A6-824F-1BD230C6047A
Year 2009
Type of Item Conference Full Paper
License
Details
Bibliographic Citation G. Cormode, A. Deligiannakis, M. Garofalakis and A. McGregor, "Probabilistic histograms for probabilistic data," in 35th International Conference on Very Large Data Bases, 2009.
Appears in Collections

Summary

There is a growing realization that modern database managementsystems (DBMSs) must be able to manage data that contains uncertaintiesthat are represented in the form of probabilistic relations.Consequently, the design of each core DBMS componentmust be revisited in the presence of uncertain and probabilistic information.In this paper, we study how to build histogram synopsesfor probabilistic relations, for the purposes of enabling bothDBMS-internal decisions (such as indexing and query planning),and (possibly, user-facing) approximate query processing tools. Incontrast to initial work in this area, our probabilistic histogramsretain the key possible-worlds semantics of probabilistic data, allowingfor more accurate, yet concise, representation of the uncertaintycharacteristics of data and query results. We present avariety of techniques for building optimal probabilistic histograms,each one tuned to a different choice of approximation-error metric.We show that these can be incorporated into a general DynamicProgramming (DP) framework, which generalizes that used for existinghistogram constructions. The end result is a histogram whereeach “bucket” is approximately represented by a compact probabilitydistribution function (PDF), which can be used as the basisfor query planning and approximate query answering. We presentnovel, polynomial-time algorithms to find optimal probabilistic histogramsfor a variety of PDF-error metrics (including variation distance,sum squared error, max error and EMD1). Our experimentalstudy shows that our probabilistic histogram synopses can accuratelycapture the key statistical properties of uncertain data, whilebeing much more compact to store and work with than the originaluncertain relations.

Services

Statistics