Institutional Repository
Technical University of Crete
EN  |  EL



My Space

Join-distinct aggregate estimation over update streams

Ganguly, Sumit, Garofalakis Minos, Kumar Amit, Rastogi Rajeev

Full record

Year 2005
Type of Item Conference Full Paper
Bibliographic Citation S. Ganguly, M. Garofalakis, A. Kumar and R. Rastogi, "Join-distinct aggregate estimation over update streams", in 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2005.
Appears in Collections


There is growing interest in algorithms for processing and queryingcontinuous data streams (i.e., data that is seen only once in afixed order) with limited memory resources. Providing (perhapsapproximate) answers to queries over such streams is a crucial requirementfor many application environments; examples includelarge IP network installations where performance data from differentparts of the network needs to be continuously collected andanalyzed.The ability to estimate the number of distinct (sub)tuples inthe result of a join operation correlating two data streams (i.e.,the cardinality of a projection with duplicate elimination over ajoin) is an important requirement for several data-analysis scenarios.For instance, to enable real-time traffic analysis and loadbalancing, a network-monitoring application may need to estimatethe number of distinct (source, destination) IP-addresspairs occurring in the stream of IP packets observed by router R1,where the source address is also seen in packets routed througha different router R2. Earlier work has presented solutions to theindividual problems of distinct counting and join-size estimation(without duplicate elimination) over streams. These solutions,however, are fundamentally different and extending or combiningthem to handle our more complex “Join-Distinct” estimationproblem is far from obvious. In this paper, we propose the firstspace-efficient algorithmic solution to the general Join-Distinctestimation problem over continuous data streams (our techniquescan actually handle general update streams comprising tuple deletionsas well as insertions). Our estimators are probabilistic innature and rely on novel algorithms for building and combininga new class of hash-based synopses (termed “JD sketches”) forindividual update streams. We demonstrate that our algorithmscan provide low error, high-confidence Join-Distinct estimatesusing only small space and small processing time per update. Infact, we present lower bounds showing that the space usage ofour estimators is within small factors of the best possible for theJoin-Distinct problem. Preliminary experimental results verifythe effectiveness of our approach.