Implementation of decision trees for data streams in the Spark Streaming platform

Ziakas Christos

URI	http://purl.tuc.gr/dl/dias/E1E86B11-FFC0-4334-AE22-A90F29608701	-
Identifier	https://doi.org/10.26233/heallink.tuc.78767	-
Language	en	-
Extent	52 pages	en
Title	Implementation of decision trees for data streams in the Spark Streaming platform	en
Title	Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming	el
Creator	Ziakas Christos	en
Creator	Ζιακας Χρηστος	el
Contributor [Thesis Supervisor]	Garofalakis Minos	en
Contributor [Thesis Supervisor]	Γαροφαλακης Μινως	el
Contributor [Committee Member]	Deligiannakis Antonios	en
Contributor [Committee Member]	Δεληγιαννακης Αντωνιος	el
Contributor [Committee Member]	Samoladas Vasilis	en
Contributor [Committee Member]	Σαμολαδας Βασιλης	el
Publisher	Πολυτεχνείο Κρήτης	el
Publisher	Technical University of Crete	en
Academic Unit	Technical University of Crete::School of Electrical and Computer Engineering	en
Academic Unit	Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
Content Summary	In the era of big data, enormous amounts of data are created, replicated and transferred every day. Current technology for handling and analyzing such volumes makes it possible to develop applications for problems (e.g., DNA sequence analysis, medical imaging, traffic control) that could not previously be solved efficiently. In particular, the time required to process large volumes of data can be reduced by using distributed computing platforms such as Apache Spark. The Apache Spark framework includes implementations for large-scale machine learning, distributed data stream processing and parallel graph analytics. The Spark Streaming platform provides scalable and fault-tolerant data stream processing. However, only a limited number of distributed incremental machine learning algorithms are implemented in Spark Streaming. In this thesis, we propose a parallel implementation of the Hoeffding decision tree, an incremental and scalable tree learning method for classification, in Spark Streaming. Our implementation applies horizontal data parallelism in the shared-nothing architecture of Spark. The Hoeffding bound guarantees with high confidence that the Hoeffding decision tree is asymptotically identical to a batch-learned one. The high-dimensional statistics required for evaluating splits are stored as sparse matrices in main memory across the Spark cluster. These statistics are updated as soon as new training instances become available. Furthermore, distributed computations are performed to identify the optimal split and to assess whether the splitting criterion is satisfied. The generated model is used to classify colors based on their spectral signatures: each color has a different chemical composition and, as a consequence, a different spectral signature.	en
Type of Item	Διπλωματική Εργασία	el
Type of Item	Diploma Work	en
License	http://creativecommons.org/licenses/by/4.0/	en
Date of Item	2018-09-17	-
Date of Publication	2018	-
Subject	Spark	en
Bibliographic Citation	Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018	en
Bibliographic Citation	Χρήστος Ζιάκας, "Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2018	el
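
The content summary refers to the Hoeffding bound and to a splitting criterion without stating them. As a rough illustration only, the sketch below uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and the usual Hoeffding tree rule of splitting when the gap between the two best candidate splits exceeds epsilon. It is not taken from the thesis: the function names `hoeffding_bound` and `should_split`, the choice of information gain as the split measure (so that R = log2 of the number of classes), and the example values are all assumptions, and the distributed Spark Streaming part is not shown.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 num_classes: int, delta: float, n: int) -> bool:
    """Hypothetical helper: split a leaf once the observed gap between the best
    and second-best split exceeds the Hoeffding bound.

    Assumes information gain as the split measure, whose range is
    log2(num_classes)."""
    value_range = math.log2(num_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


# Illustrative values only: a two-class leaf with delta = 1e-7.
few_samples = should_split(0.30, 0.15, num_classes=2, delta=1e-7, n=100)
many_samples = should_split(0.30, 0.15, num_classes=2, delta=1e-7, n=1000)
print(few_samples, many_samples)
```

With the same observed gap, the leaf is not split after 100 instances but is split after 1000, because epsilon shrinks as more instances arrive.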
