Implementation of decision trees for data streams in the Spark Streaming platform

Ziakas Christos

URI:

http://purl.tuc.gr/dl/dias/E1E86B11-FFC0-4334-AE22-A90F29608701

Year

2018

Type of Item

Diploma Work

Bibliographic Citation

Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018. https://doi.org/10.26233/heallink.tuc.78767

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Software Technology and Network Applications Laboratory

Summary

In the era of big data, enormous amounts of data are created, replicated and transferred every day. Current technology for handling and analyzing vast amounts of data allows us to develop applications for various problems (e.g., DNA sequence analysis, medical imaging, traffic control) that could not previously be solved efficiently. More precisely, the time required to process large volumes of data can be reduced by using distributed computing platforms such as Apache Spark. The Apache Spark framework includes various implementations for large-scale machine learning, distributed data stream processing and parallel graph analytics. The Spark Streaming platform provides scalable and fault-tolerant data stream processing. However, only a limited number of distributed incremental machine learning algorithms have been implemented for the Spark Streaming platform.

In this thesis, we propose a parallel implementation of an incremental and scalable tree learning method for classification in Spark Streaming, the Hoeffding decision tree. Our proposed implementation performs horizontal data parallelism in the shared-nothing architecture of Spark. The Hoeffding bound guarantees with high confidence that the Hoeffding decision tree is asymptotically identical to a batch-learning one. The high-dimensional statistics required for evaluating splits are stored as sparse matrices in main memory across the Spark cluster. These statistics are updated as soon as new training instances become available. Furthermore, distributed computations are performed in order to identify the optimal split and assess whether the splitting criterion is satisfied. The generated model is used to classify colors based on their spectral signatures: each color has a different chemical composition and, as a consequence, a different spectral signature.
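As an illustration of the splitting criterion mentioned in the summary, the following is a minimal sketch of the standard Hoeffding-bound test used by Hoeffding trees in general. It is not taken from the thesis: the function names, the use of information gain as the split measure, and the tie-breaking threshold are assumptions made for this example. The bound states that, after n observations of a quantity with range R, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean with probability 1 - delta.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon for a quantity with the given range after n observations.

    With probability 1 - delta, the true mean is within epsilon of the
    mean observed over the n samples.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, num_classes: int,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf has seen enough instances to split.

    Assumption: the split measure is information gain, whose range is
    log2(num_classes). tie_threshold is a hypothetical parameter that
    forces a split when the two best attributes are nearly equivalent.
    """
    epsilon = hoeffding_bound(math.log2(num_classes), delta, n)
    return (best_gain - second_gain) > epsilon or epsilon < tie_threshold


if __name__ == "__main__":
    # Two classes, 1000 instances seen at the leaf, confidence 1 - 1e-7.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, num_classes=2, delta=1e-7, n=1000))
```

In a distributed setting such as the one the summary describes, only the per-attribute gains would need to be computed across the cluster; the test itself is a constant-time comparison once the two best gains are known.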
