URI | http://purl.tuc.gr/dl/dias/5325F237-4135-4A25-ABF2-A2CFBC74B7CF | - |
Identifier | http://cse.iitk.ac.in/users/sganguly/jd.pdf | - |
Language | en | - |
Extent | 12 pages | en |
Title | Join-distinct aggregate estimation over update streams | en |
Creator | Ganguly, Sumit | en |
Creator | Garofalakis Minos | en |
Creator | Γαροφαλακης Μινως | el |
Creator | Kumar Amit | en |
Creator | Rastogi Rajeev | en |
Publisher | Association for Computing Machinery | en |
Content Summary | There is growing interest in algorithms for processing and querying
continuous data streams (i.e., data that is seen only once in a
fixed order) with limited memory resources. Providing (perhaps
approximate) answers to queries over such streams is a crucial requirement
for many application environments; examples include
large IP network installations where performance data from different
parts of the network needs to be continuously collected and
analyzed.
The ability to estimate the number of distinct (sub)tuples in
the result of a join operation correlating two data streams (i.e.,
the cardinality of a projection with duplicate elimination over a
join) is an important requirement for several data-analysis scenarios.
For instance, to enable real-time traffic analysis and load
balancing, a network-monitoring application may need to estimate
the number of distinct (source, destination) IP-address
pairs occurring in the stream of IP packets observed by router R1,
where the source address is also seen in packets routed through
a different router R2. Earlier work has presented solutions to the
individual problems of distinct counting and join-size estimation
(without duplicate elimination) over streams. These solutions,
however, are fundamentally different and extending or combining
them to handle our more complex “Join-Distinct” estimation
problem is far from obvious. In this paper, we propose the first
space-efficient algorithmic solution to the general Join-Distinct
estimation problem over continuous data streams (our techniques
can actually handle general update streams comprising tuple deletions
as well as insertions). Our estimators are probabilistic in
nature and rely on novel algorithms for building and combining
a new class of hash-based synopses (termed “JD sketches”) for
individual update streams. We demonstrate that our algorithms
can provide low error, high-confidence Join-Distinct estimates
using only small space and small processing time per update. In
fact, we present lower bounds showing that the space usage of
our estimators is within small factors of the best possible for the
Join-Distinct problem. Preliminary experimental results verify
the effectiveness of our approach. | en |
Type of Item | Πλήρης Δημοσίευση σε Συνέδριο | el |
Type of Item | Conference Full Paper | en |
License | http://creativecommons.org/licenses/by/4.0/ | en |
Date of Item | 2015-12-01 | - |
Date of Publication | 2005 | - |
Subject | Management systems | en |
Subject | Query-processing algorithms | en |
Bibliographic Citation | S. Ganguly, M. Garofalakis, A. Kumar and R. Rastogi, "Join-distinct aggregate estimation over update streams", in 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2005.
| en |