Institutional Repository
Technical University of Crete

Data mining algorithms over Akka and Storm frameworks

Karatza Dimitra


Year: 2016
Type of Item: Diploma Work
Bibliographic Citation: Dimitra Karatza, "Data mining algorithms over Akka and Storm frameworks", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2016


Efficient processing of massive data sets has become increasingly important over the last few decades, owing to the growing availability of large volumes of data in a variety of computer science applications. In particular, monitoring huge and rapidly changing streams of data that arrive online has emerged as an important data management problem. Relevant applications include the analysis of network traffic, telephone call records, internet advertising and databases. For these reasons, the streaming model has recently received a lot of attention. This model differs from computation over traditional stored data sets, since algorithms must process their input in a single pass using only a limited amount of working memory. It applies to settings where the size of the input far exceeds the size of the available main memory and the only feasible access to the data is a single pass over it.

Typical streaming algorithms use space at most polylogarithmic in the length of the input stream. Because using linear space is infeasible in this setting, summary data structures with small memory footprints, also known as synopses, are needed. Algorithms such as Misra-Gries, Lossy Counting, Sticky Sampling and Space Saving take user-specified parameters (support, error and probability of failure δ) in order to extract from an unbounded data stream the items whose frequency exceeds some threshold (the support). Accuracy guarantees are typically stated in terms of those parameters: the reported frequency of each frequent item is within a factor of 1+error of the item's true frequency with probability at least 1-δ. The space used also depends on these parameters.

Since we make only one pass over the unbounded data stream, we have to use suitable computation systems. We introduce the Storm and Akka frameworks, both of which are real-time, distributed and fault-tolerant.
The two frameworks have completely different architectures, which are explained in depth in this diploma thesis. The crucial difference is that Storm processes the data stream synchronously, while Akka processes it asynchronously. We execute the Misra-Gries, Lossy Counting, Sticky Sampling and Space Saving algorithms on both frameworks in a multi-node cluster, tuning the topologies to optimize performance. We measure throughput, that is, the number of items of the input data set processed per second. Our goal is to compare the algorithms' behavior on the two frameworks.

The data set used in our experiments contains two weeks of HTTP requests to the ClarkNet server. ClarkNet is a full Internet access provider for the Metro Baltimore-Washington DC area.
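
To illustrate the kind of counter-based synopsis the abstract refers to, the following is a minimal single-process Python sketch of the Misra-Gries algorithm. It is not taken from the thesis and does not use Storm or Akka; the names `misra_gries` and `frequent_candidates`, the parameter `k`, and the choice k = ceil(1/support) are assumptions made for illustration only.

```python
import math


def misra_gries(stream, k):
    """One-pass Misra-Gries summary using at most k - 1 counters.

    For a stream of n items, every stored count c for an item with
    true frequency f satisfies f - n/k <= c <= f, and every item with
    f > n/k is guaranteed to be present in the summary.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_candidates(stream, support):
    """Return candidate items whose frequency may exceed support * n.

    Assumption for this sketch: k is chosen as ceil(1 / support), so any
    item with true frequency above support * n survives in the summary.
    The result can contain false positives but no false negatives.
    """
    k = math.ceil(1 / support)
    return set(misra_gries(stream, k))


if __name__ == "__main__":
    requests = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
    print(misra_gries(requests, k=3))
    print(frequent_candidates(requests, support=0.4))
```

The summary never holds more than k - 1 counters regardless of the stream length, which is what makes the single-pass, bounded-memory setting described above workable. A second pass (when available) or a tighter error parameter would be needed to filter out false positives.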
