Institutional Repository
Technical University of Crete

Implementation of decision trees for data streams in the Spark Streaming platform

Ziakas Christos


URI: http://purl.tuc.gr/dl/dias/E1E86B11-FFC0-4334-AE22-A90F29608701
Identifier: https://doi.org/10.26233/heallink.tuc.78767
Language: en
Extent (en): 52 pages
Title (en): Implementation of decision trees for data streams in the Spark Streaming platform
Title (el): Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming
Creator (en): Ziakas Christos
Creator (el): Ζιακας Χρηστος
Contributor [Thesis Supervisor] (en): Garofalakis Minos
Contributor [Thesis Supervisor] (el): Γαροφαλακης Μινως
Contributor [Committee Member] (en): Deligiannakis Antonios
Contributor [Committee Member] (el): Δεληγιαννακης Αντωνιος
Contributor [Committee Member] (en): Samoladas Vasilis
Contributor [Committee Member] (el): Σαμολαδας Βασιλης
Publisher (el): Πολυτεχνείο Κρήτης
Publisher (en): Technical University of Crete
Academic Unit (en): Technical University of Crete::School of Electrical and Computer Engineering
Academic Unit (el): Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών
Content Summary (en): In the era of big data, enormous amounts of data are created, replicated and transferred every day. Current technology for handling and analyzing vast amounts of data allows us to develop applications for various problems (e.g., DNA sequence analysis, medical imaging, traffic control) that could not previously be solved efficiently. More precisely, the time required to process large volumes of data can be reduced by using distributed computing platforms such as Apache Spark. The Apache Spark framework includes implementations for large-scale machine learning, distributed data stream processing and parallel graph analytics. The Spark Streaming platform provides scalable and fault-tolerant data stream processing. However, only a limited number of distributed incremental machine learning algorithms have been implemented for the Spark Streaming platform. In this thesis, we propose a parallel implementation of the Hoeffding decision tree, an incremental and scalable tree learning method for classification, in Spark Streaming. Our proposed implementation uses horizontal data parallelism in the shared-nothing architecture of Spark. The Hoeffding bound guarantees with high confidence that the Hoeffding decision tree is asymptotically identical to one built by a batch learner. The high-dimensional statistics required for evaluating splits are stored as sparse matrices in main memory across the Spark cluster. These statistics are updated as soon as new training instances become available. Furthermore, distributed computations are performed to identify the optimal split and to assess whether the splitting criterion is satisfied. The generated model is used to classify colors based on their spectral signatures. Each color has a different chemical composition and, as a consequence, a different spectral signature.
Type of Item (el): Διπλωματική Εργασία
Type of Item (en): Diploma Work
License (en): http://creativecommons.org/licenses/by/4.0/
Date of Item: 2018-09-17
Date of Publication: 2018
Subject (en): Spark
Bibliographic Citation (en): Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018
Bibliographic Citation (el): Χρήστος Ζιάκας, "Υλοποίηση δέντρων αποφάσεων για ροές δεδομένων στην πλατφόρμα Spark Streaming", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2018
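
Illustrative note: The Content Summary refers to the Hoeffding bound and to a check of whether the splitting criterion is satisfied. The following is a minimal single-machine sketch of the standard form of that check, written in Python. It is not the thesis implementation and does not use Spark. The function names, the choice of information gain as the split measure (whose range is log2 of the number of classes), and the parameter values in the usage note are assumptions made for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability at least 1 - delta,
    the true mean of a random variable with the given range lies within
    epsilon of the mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 num_classes: int, delta: float, n: int) -> bool:
    """Decide whether a leaf has seen enough instances to split.

    Assumption: the split measure is information gain, whose range is
    log2(num_classes). The leaf splits when the observed advantage of
    the best attribute over the runner-up exceeds the Hoeffding bound.
    """
    value_range = math.log2(num_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon
```

For example, with two classes, delta = 1e-7 and 1000 instances seen at a leaf, the bound is roughly 0.09, so a gain difference of 0.15 would trigger a split while a difference of 0.05 would not.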
