URI | http://purl.tuc.gr/dl/dias/E1E86B11-FFC0-4334-AE22-A90F29608701 | - |
Identifier | https://doi.org/10.26233/heallink.tuc.78767 | - |
Language | en | - |
Extent | 52 pages | en |
Title | Implementation of decision trees for data streams in the Spark Streaming platform | en |
Title | Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming | el |
Creator | Ziakas Christos | en |
Creator | Ζιακας Χρηστος | el |
Contributor [Thesis Supervisor] | Garofalakis Minos | en |
Contributor [Thesis Supervisor] | Γαροφαλακης Μινως | el |
Contributor [Committee Member] | Deligiannakis Antonios | en |
Contributor [Committee Member] | Δεληγιαννακης Αντωνιος | el |
Contributor [Committee Member] | Samoladas Vasilis | en |
Contributor [Committee Member] | Σαμολαδας Βασιλης | el |
Publisher | Πολυτεχνείο Κρήτης | el |
Publisher | Technical University of Crete | en |
Academic Unit | Technical University of Crete::School of Electrical and Computer Engineering | en |
Academic Unit | Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών | el |
Content Summary | In the era of big data, enormous amounts of data are created, replicated
and transferred every day. The current technology for handling and analyzing
vast amounts of data allows us to develop applications for various problems
(e.g., DNA sequence analysis, medical imaging, traffic control) that could not
previously be solved efficiently. In particular, the time required to process
large volumes of data can be reduced by using distributed computing platforms
such as Apache Spark. The Apache Spark framework includes various
implementations for large-scale machine learning, distributed stream
processing and parallel graph analytics. The Spark Streaming platform provides
scalable and fault-tolerant stream processing. However, only a limited
number of distributed incremental machine learning algorithms have been
implemented for the Spark Streaming platform.
In this thesis, we propose a parallel implementation of an incremental and
scalable tree learning method for classification in Spark Streaming, the Hoeffding
decision tree. Our implementation applies horizontal data
parallelism on the shared-nothing architecture of Spark. The Hoeffding bound
guarantees with high confidence that the Hoeffding decision tree is asymptotically
identical to a batch-learning one. The high-dimensional statistics required
for evaluating splits are stored as sparse matrices in main memory
across the Spark cluster. These statistics are updated as soon as new
training instances become available. Furthermore, distributed computations are
performed to identify the optimal split and to assess whether the splitting
criterion is satisfied. The generated model is used to classify colors
based on the spectral signature of each color. Each color has a
different chemical composition and, as a consequence, a different spectral signature. | en |
Type of Item | Διπλωματική Εργασία | el |
Type of Item | Diploma Work | en |
License | http://creativecommons.org/licenses/by/4.0/ | en |
Date of Item | 2018-09-17 | - |
Date of Publication | 2018 | - |
Subject | Spark | en |
Bibliographic Citation | Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018 | en |
Bibliographic Citation | Χρήστος Ζιάκας, "Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2018 | el |