Distributed k-Means streaming algorithms in Spark

Kyriakidou Ioanna

URI:

http://purl.tuc.gr/dl/dias/0569DC4B-8A26-4026-9964-7F802533074C

Year

2021

Type of Item

Diploma Work

Bibliographic Citation

Ioanna Kyriakidou, "Distributed k-Means streaming algorithms in Spark", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2021 https://doi.org/10.26233/heallink.tuc.89431

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Distributed Multimedia Information Systems and Applications Laboratory

Summary

K-means is one of the most commonly used clustering algorithms; it partitions multi-dimensional data points into a predefined number of clusters. When data arrives as a stream, the clusters must be estimated dynamically and updated as new points arrive. In this thesis, we apply a sampling technique based on a data structure called a coreset tree before any approximation algorithm is run. Coresets are used to obtain a small weighted sample from the data stream. Organizing coresets in a tree-like form speeds up the computation of a summary of the original data. The advantage of such a coreset is that any clustering algorithm can be applied to a much smaller sample, so a solution for the original dataset is computed faster. In the second step, we use StreamKM++ to estimate the cluster centres of the summary. We evaluate how the level of parallelism affects the time needed to extract the clusters. Finally, we compare the within-cluster consistency of the data and draw conclusions about the use of coreset trees as a distributed sampling method.
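
The two-step idea in the summary (summarize the stream into a small weighted sample, then cluster the sample) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation, and it uses neither Spark nor the actual StreamKM++ coreset tree. As a simplified stand-in, each chunk is reduced by k-means++-style D² sampling, summaries are combined in a binary merge-and-reduce hierarchy, and weighted Lloyd iterations are run on the result. All function names and parameters here are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seed(points, weights, m, rng):
    """Pick up to m indices by weighted D^2 (k-means++ style) sampling."""
    chosen = [rng.choices(range(len(points)), weights=weights)[0]]
    dist = [sq_dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < m:
        scores = [w * d for w, d in zip(weights, dist)]
        if sum(scores) == 0:  # every point coincides with a chosen one
            break
        idx = rng.choices(range(len(points)), weights=scores)[0]
        chosen.append(idx)
        dist = [min(d, sq_dist(p, points[idx])) for p, d in zip(points, dist)]
    return chosen


def reduce_to_summary(points, weights, m, rng):
    """Summarize a weighted set by at most m weighted representatives.

    Assumption: a simplified stand-in for a coreset construction.
    """
    if len(points) <= m:
        return list(points), list(weights)
    reps = [points[i] for i in d2_seed(points, weights, m, rng)]
    rep_weights = [0.0] * len(reps)
    for p, w in zip(points, weights):
        nearest = min(range(len(reps)), key=lambda j: sq_dist(p, reps[j]))
        rep_weights[nearest] += w  # total weight is preserved
    return reps, rep_weights


def stream_summary(chunks, m, rng):
    """Merge-and-reduce: buckets[i] summarizes 2**i chunks, like a binary counter."""
    buckets = []
    for chunk in chunks:
        carry = reduce_to_summary(chunk, [1.0] * len(chunk), m, rng)
        level = 0
        while level < len(buckets) and buckets[level] is not None:
            other, buckets[level] = buckets[level], None
            carry = reduce_to_summary(other[0] + carry[0], other[1] + carry[1], m, rng)
            level += 1
        if level == len(buckets):
            buckets.append(None)
        buckets[level] = carry
    points, weights = [], []
    for bucket in buckets:
        if bucket is not None:
            points += bucket[0]
            weights += bucket[1]
    return reduce_to_summary(points, weights, m, rng)


def weighted_kmeans(points, weights, k, rng, iters=20):
    """k-means++ seeding followed by weighted Lloyd iterations on the summary."""
    centers = [points[i] for i in d2_seed(points, weights, k, rng)]
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Keep the old center if no weight was assigned to it.
        centers = [tuple(s / mw for s in sm) if mw > 0 else c
                   for sm, mw, c in zip(sums, mass, centers)]
    return centers
```

A caller would split the incoming points into chunks (lists of coordinate tuples), call `stream_summary(chunks, m, rng)` to obtain at most `m` weighted representatives, and then pass them to `weighted_kmeans` with the desired `k`. In a distributed setting, each worker could build its own summary and the summaries could be merged in the same way, though how the thesis maps this onto Spark is not stated in the summary.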
