URI | http://purl.tuc.gr/dl/dias/E1E86B11-FFC0-4334-AE22-A90F29608701 | - |
Identifier | https://doi.org/10.26233/heallink.tuc.78767 | - |
Language | en | - |
Extent | 52 pages | en |
Title | Implementation of decision trees for data streams in the Spark Streaming platform | en |
Title | Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming | el |
Creator | Ziakas Christos | en |
Creator | Ζιάκας Χρήστος | el |
Contributor [Thesis Supervisor] | Garofalakis Minos | en |
Contributor [Thesis Supervisor] | Γαροφαλακης Μινως | el |
Contributor [Committee Member] | Deligiannakis Antonios | en |
Contributor [Committee Member] | Δεληγιαννακης Αντωνιος | el |
Contributor [Committee Member] | Samoladas Vasilis | en |
Contributor [Committee Member] | Σαμολαδας Βασιλης | el |
Publisher | Πολυτεχνείο Κρήτης | el |
Publisher | Technical University of Crete | en |
Academic Unit | Technical University of Crete::School of Electrical and Computer Engineering | en |
Academic Unit | Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών | el |
Abstract | In the era of big data, enormous amounts of data are created, replicated,
and transferred every day. Current technology for handling and analyzing
vast amounts of data allows us to develop applications for various problems
(e.g., DNA sequence analysis, medical imaging, traffic control) that could not
previously be solved efficiently. More precisely, the time required to process
large volumes of data can be reduced by using distributed computing platforms
such as Apache Spark. The Apache Spark framework includes various
implementations for large-scale machine learning, distributed data stream
processing, and parallel graph analytics. The Spark Streaming platform provides
scalable and fault-tolerant data stream processing. However, only a limited
number of distributed incremental machine learning algorithms have been
implemented for the Spark Streaming platform.
In this thesis, we propose a parallel implementation in Spark Streaming of the
Hoeffding decision tree, an incremental and scalable tree learning method for
classification. Our implementation applies horizontal data
parallelism in the shared-nothing architecture of Spark. The Hoeffding bound
guarantees with high confidence that the Hoeffding decision tree is asymptotically
identical to one built by a batch learner. The high-dimensional statistics required
for evaluating splits are stored as sparse matrices in main memory
across the Spark cluster. These statistics are updated as soon as new
training instances become available. Furthermore, distributed computations
identify the optimal split and assess whether the splitting
criterion is satisfied. The generated model is used to classify colors
based on the spectral signature of each color. Each color has a
different chemical composition and, as a consequence, a different spectral signature. | en |
Type | Διπλωματική Εργασία | el |
Type | Diploma Work | en |
License | http://creativecommons.org/licenses/by/4.0/ | en |
Date | 2018-09-17 | - |
Date of Publication | 2018 | - |
Subject | Spark | en |
Bibliographic Citation | Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018 | en |
Bibliographic Citation | Χρήστος Ζιάκας, "Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2018 | el |
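
The abstract refers to the Hoeffding bound and to a splitting criterion without stating either. As a rough illustration only, the sketch below uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and the usual rule of splitting when the gap between the two best split scores exceeds epsilon. The function names, the parameter values, and the choice of a split score with range R are assumptions made for this example; the record does not say which criterion or settings the thesis uses, and this is not the thesis's Spark code.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for the mean of n observations.

    value_range: range R of the observed quantity (e.g. log2(#classes)
                 for information gain); an assumption of this sketch.
    delta:       allowed probability that the bound fails.
    n:           number of training instances seen at the leaf.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_score, second_score, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)


# Hypothetical scores for the two best candidate attributes at one leaf.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
print(f"epsilon after 1000 instances: {eps:.4f}")
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
print(should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000))
```

Because epsilon shrinks as 1/sqrt(n), a leaf that cannot yet separate two close candidates simply waits for more instances, which is what makes the method incremental.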