Scaling text processing pipelines using Apache Spark

Katsani Merieme

URI	http://purl.tuc.gr/dl/dias/6D2F87E6-508B-4687-97B1-9A5CA530ACB6	-
Identifier	https://doi.org/10.26233/heallink.tuc.70651	-
Language	en	-
Size	48 pages	en
Title	Scaling text processing pipelines using Apache Spark	en
Creator	Katsani Merieme	en
Creator	Κατσανι Μεριεμε	el
Contributor [Thesis Supervisor]	Deligiannakis Antonios	en
Contributor [Thesis Supervisor]	Δεληγιαννακης Αντωνιος	el
Contributor [Committee Member]	Garofalakis Minos	en
Contributor [Committee Member]	Γαροφαλακης Μινως	el
Contributor [Committee Member]	Lagoudakis Michail	en
Contributor [Committee Member]	Λαγουδακης Μιχαηλ	el
Publisher	Πολυτεχνείο Κρήτης	el
Publisher	Technical University of Crete	en
Academic Unit	Technical University of Crete::School of Electrical and Computer Engineering	en
Academic Unit	Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
Abstract	Big data, generated by humans and machines alike, from social media to smartphones and sensors, and taking the form of text, images, or transactions, is a continuously evolving field. The ongoing growth in generated data creates a need to extract knowledge from it through data analysis. Several areas are engaged in data mining, in particular machine learning, which has become well established in recent years. Various machine learning techniques and methods aim to solve big data problems, and the two areas are now closely intertwined. This combination is the main subject of this study, which aims to implement a large-scale text processing architecture. More specifically, the architecture processes streaming texts from Reddit in real time and classifies them as sarcastic or non-sarcastic using a machine learning model. It uses the latest distributed information-processing platforms, such as Apache Kafka and Spark, together with ML algorithms that are state-of-the-art yet simple and powerful, namely Random Forests, Naive Bayes, and Logistic Regression. After comparing the methodology and design of each component of the final layout, the most appropriate model is selected and the framework is implemented. The success rates obtained were close to those reported in the relevant literature and sometimes higher, depending on the technique examined. Finally, the results are indexed in the distributed search engine Elasticsearch and evaluated through the Kibana plugin.	en
Type	Διπλωματική Εργασία	el
Type	Diploma Work	en
License	http://creativecommons.org/licenses/by/4.0/	en
Date	2018-01-02	-
Date of Publication	2017	-
Subject	Big data	en
Subject	Machine learning	en
Subject	NLP	en
Subject	Natural Language Processing	en
Bibliographic Citation	Merieme Katsani, "Scaling text processing pipelines using Apache Spark", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2017	en
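
The abstract names Naive Bayes as one of the candidate models for labelling texts as sarcastic or non-sarcastic. As a rough illustration of that classification step only, here is a minimal single-machine sketch in plain Python. It is not the thesis implementation, which runs on Apache Spark with Kafka input. The class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the six training sentences are all assumptions invented for this example, not data or code from the work.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical stand-in for the classifier stage described in the
    abstract; the thesis itself uses Spark, not this class.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalised log-probability of each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy examples, for illustration only (not Reddit data).
TRAINING_DATA = [
    ("oh great another monday just what i wanted", "sarcastic"),
    ("yeah right because that always works so well", "sarcastic"),
    ("wow what a totally brilliant idea", "sarcastic"),
    ("the update fixed the login bug", "non-sarcastic"),
    ("the library opens at nine tomorrow", "non-sarcastic"),
    ("this recipe needs two cups of flour", "non-sarcastic"),
]

texts, labels = zip(*TRAINING_DATA)
classifier = NaiveBayesTextClassifier().fit(texts, labels)

print(classifier.predict("oh great what a brilliant idea"))
print(classifier.predict("the login bug needs an update"))
```

In a streaming setting such as the one the abstract describes, a fitted model of this kind would be applied to each incoming text and the predicted label stored alongside it. The transport and indexing stages (Kafka, Elasticsearch, Kibana) are outside the scope of this sketch.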
