Self-organized clustering of Big Data: application on the extraction of cervical classes from backscattering curves

Vourlaki Ioanna-Theoni

URI	http://purl.tuc.gr/dl/dias/3D8AEF7F-C51D-4F47-8299-79F819DBF953	-
Identifier	https://doi.org/10.26233/heallink.tuc.67233	-
Language	en	-
Extent	9.935KB	en
Title	Self-organized clustering of Big Data: application on the extraction of cervical classes from backscattering curves	en
Creator	Vourlaki Ioanna-Theoni	en
Creator	Βουρλακη Ιωαννα-Θεωνη	el
Contributor [Thesis Supervisor]	Zervakis Michalis	en
Contributor [Thesis Supervisor]	Ζερβακης Μιχαλης	el
Contributor [Committee Member]	Balas Costas	en
Contributor [Committee Member]	Μπαλας Κωστας	el
Contributor [Committee Member]	Lagoudakis Michael	en
Contributor [Committee Member]	Λαγουδακης Μιχαηλ	el
Publisher	Πολυτεχνείο Κρήτης	el
Publisher	Technical University of Crete	en
Academic Unit	Technical University of Crete::School of Electrical and Computer Engineering	en
Academic Unit	Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
Content Summary	Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics and data systems. Machine learning often proceeds in parallel with data mining; the former is regarded as a supervised scheme, whereas the latter focuses more on exploratory data analysis and is known as unsupervised learning. Clustering is an unsupervised learning approach that aims to organize the available data into compact classes according to some notion of similarity, and its contribution to medicine and biology is highly significant. In this context, this Master's thesis examines three fundamental aspects of clustering, namely stability, generalizability and separability, in order to discover and interpret the appropriate information in the processed data and to make clustering attractive for the effective organization of large datasets. First, in connection with stability problems, we propose a novel algorithmic approach that extracts adequate information from the input dataset and self-organizes the data by extending clustering with resampling. This approach aims to derive stable and representative class centers by varying the initialization process through data resampling. Using data resampling with replacement, we produce multiple k-means partitions from multiple reruns on the same population. The resulting large number of class centroids is then reorganized into tight groups through the mean-shift approach, which rigorously searches for maxima in this new distribution space of meta-data (class centroids). Second, by further exploring clustering stability problems, we attempt to refine and improve the clustering result by sequentially updating (instead of replacing) centers on the basis of their present and previous positions. 
Based on this updating strategy, we can exploit both prior expert knowledge and posterior information from the statistical distribution of the examined population. In this part, we address the stability problem of k-means by proposing a novel algorithmic scheme for self-organizing data that adopts a recursive-mode k-means clustering approach. Third, we examine issues of k-means clustering associated with its ability to generalize when organizing big datasets. For this purpose, we exploit a data bootstrapping strategy without replacement. By generating multiple datasets of rather small size, we attempt to cover the entire data distribution space and capture its structural properties within the multiple classes generated. Each bootstrap stage exploits the stabilization process of the k-means algorithm. Finally, all class centroids generated by the bootstrap process are treated as (meta-data) samples of higher abstraction, which are organized into classes via the mean-shift approach, as in the stabilization process. In association with the data resampling strategies, we also consider the appropriate choice of distance metric, which is another major problem of exploratory data analysis schemes. We apply our algorithmic developments to data expressing the temporal course of tissue reflection at a specific wavelength. The process of aceto-whitening is of paramount importance in cervical cancer diagnosis, and we examine clustering methodologies for extracting, processing and interpreting the relevant information from the available data. As in most time-series formulations, the response curves considered are characterized by both overall amplitude (or power) characteristics and local shape formations. The proposed metric attempts to capture both of these aspects in a single configuration, which can be parametrically adjusted to the particular application domain. 
Overall, the test results indicate the importance of data resampling (and bootstrapping) in the appropriate partitioning of large datasets and the efficient operation of data mining (and clustering) schemes.	en
Type of Item	Μεταπτυχιακή Διατριβή	el
Type of Item	Master Thesis	en
License	http://creativecommons.org/licenses/by/4.0/	en
Date of Item	2017-01-03	-
Date of Publication	2016	-
Subject	Data Mining	en
Subject	Machine learning	en
Subject	Pattern recognition	en
Subject	Clustering	en
Subject	Statistics	en
Bibliographic Citation	Ioanna-Theoni Vourlaki, "Self-organized clustering of Big Data: application on the extraction of cervical classes from backscattering curves", Master Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2016	en
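
The stabilization step outlined in the Content Summary (resampling with replacement, one k-means run per resample, then mean shift over the pooled centroids) might be sketched roughly as below. This is an illustration under stated assumptions, not the thesis implementation: the function names, the flat-kernel mean shift, the bandwidth, the number of resamples and the synthetic two-blob data are all hypothetical choices made for the example.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of float tuples; returns k centroids."""
    # Start from k distinct points (a resample may contain duplicates).
    centers = rng.sample(sorted(set(points)), k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the group mean; keep it if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift (assumed variant): shift every point to the mean
    of its neighbours, then merge modes closer than one bandwidth."""
    modes = list(points)
    for _ in range(iters):
        shifted = []
        for m in modes:
            near = [p for p in points if math.dist(p, m) <= bandwidth]
            if not near:  # guard against rounding at the ball boundary
                shifted.append(m)
                continue
            shifted.append(tuple(sum(c) / len(near) for c in zip(*near)))
        modes = shifted
    merged = []
    for m in modes:
        if all(math.dist(m, u) > bandwidth for u in merged):
            merged.append(m)
    return merged


def stable_centers(data, k, n_resamples, bandwidth, seed=0):
    """Pool k-means centroids over resamples, then group them by mean shift."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_resamples):
        sample = rng.choices(data, k=len(data))  # resampling with replacement
        pooled.extend(kmeans(sample, k, rng))
    return mean_shift(pooled, bandwidth)


# Synthetic stand-in data: two well-separated 2D blobs (not thesis data).
demo_rng = random.Random(1)
data = [(demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3)) for _ in range(100)]
data += [(demo_rng.gauss(5, 0.3), demo_rng.gauss(5, 0.3)) for _ in range(100)]

centers = stable_centers(data, k=2, n_resamples=10, bandwidth=1.5)
print(centers)
```

In this sketch the pooled centroids play the role of the meta-data mentioned in the summary: mean shift is applied to them rather than to the raw samples, so the number of final classes follows from the bandwidth instead of being fixed in advance.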
