Institutional Repository
Technical University of Crete
EN  |  EL

Search

Browse

My Space

Join-distinct aggregate estimation over update streams

Ganguly, Sumit, Garofalakis Minos, Kumar Amit, Rastogi Rajeev

Simple record


URIhttp://purl.tuc.gr/dl/dias/5325F237-4135-4A25-ABF2-A2CFBC74B7CF-
Identifierhttp://cse.iitk.ac.in/users/sganguly/jd.pdf-
Languageen-
Extent12 pagesen
TitleJoin-distinct aggregate estimation over update streamsen
CreatorGanguly, Sumiten
CreatorGarofalakis Minosen
CreatorΓαροφαλακης Μινωςel
CreatorKumar Amiten
CreatorRastogi Rajeeven
PublisherAssociation for Computing Machineryen
Content SummaryThere is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis scenarios. For instance, to enable real-time traffic analysis and load balancing, a network-monitoring application may need to estimate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a different router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally different and extending or combining them to handle our more complex “Join-Distinct” estimation problem is far from obvious. In this paper, we propose the first space-efficient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple deletions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed “JD sketches”) for individual update streams. We demonstrate that our algorithms can provide low error, high-confidence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the effectiveness of our approach.en
Type of ItemΠλήρης Δημοσίευση σε Συνέδριοel
Type of ItemConference Full Paperen
Licensehttp://creativecommons.org/licenses/by/4.0/en
Date of Item2015-12-01-
Date of Publication2005-
SubjectManagement systemsen
SubjectQuery-processing algorithmsen
Bibliographic CitationS. Ganguly, M. Garofalakis, A. Kumar and R. Rastogi, "Join-distinct aggregate estimation over update streams", in 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2005. en

Services

Statistics