Self-organized clustering of Big Data: application on the extraction of cervical classes from backscattering curves
Master's Thesis
2017-01-03
2016
en
Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics and data systems. Machine learning often proceeds in parallel with data mining: the former is typically treated as a supervised scheme, whereas the latter focuses more on exploratory data analysis and is associated with unsupervised learning. Clustering is an unsupervised learning approach that aims to organize the available data into compact classes according to some notion of similarity. Its contribution to medicine and biology is highly significant.
In this context, this Master's thesis examines three fundamental aspects of clustering, namely stability, generalizability and separability, in order to discover and interpret the appropriate information in the processed data and to make clustering attractive for the effective organization of large datasets.
First, in connection with stability problems, we propose a novel algorithmic approach for extracting adequate information from the input dataset and self-organizing the data, which extends clustering with resampling. This approach aims to derive stable and representative class centers by varying the initialization process through data resampling. Using data resampling with replacement, we produce multiple k-means partitions from repeated runs on the same population. The resulting large number of class centroids is then reorganized into tight groups through the mean-shift approach, which searches for maxima in this new distribution space of meta-data (class centroids).
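The resampling-plus-mean-shift idea described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the thesis implementation: the function names (`kmeans`, `mean_shift`, `resampled_kmeans_modes`, `make_blobs`), the flat mean-shift kernel, the 2-D toy data, and all parameter values are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on (x, y) tuples; returns k centroids."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def mean_shift(points, bandwidth, iters=30):
    """Flat-kernel mean-shift (an assumed kernel choice): repeatedly move each
    point to the mean of the original points within `bandwidth`, then merge
    shifted points that ended up within one bandwidth of an existing mode."""
    shifted = list(points)
    for _ in range(iters):
        moved = []
        for p in shifted:
            near = [q for q in points if math.dist(p, q) <= bandwidth]
            moved.append(
                tuple(sum(axis) / len(near) for axis in zip(*near)) if near else p
            )
        shifted = moved
    modes = []
    for p in shifted:
        if all(math.dist(p, m) > bandwidth for m in modes):
            modes.append(p)
    return modes


def resampled_kmeans_modes(points, k, n_resamples, bandwidth, seed=0):
    """Run k-means on resamples drawn WITH replacement, pool all centroids
    as meta-data, and group the pooled centroids with mean-shift."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_resamples):
        resample = rng.choices(points, k=len(points))  # sampling with replacement
        pooled.extend(kmeans(resample, k, rng))
    return mean_shift(pooled, bandwidth)


def make_blobs(centers, n_per_blob, spread, seed=0):
    """Toy data generator (hypothetical helper): Gaussian blobs in 2-D."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in centers
        for _ in range(n_per_blob)
    ]
```

On well-separated toy blobs, one would expect a mode close to each true blob center; poorly converged k-means runs may contribute additional isolated modes, which is one reason the bandwidth has to be chosen with care.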
Second, exploring clustering stability problems further, we attempt to refine and improve the clustering result by sequentially updating (instead of replacing) centers on the basis of their present and previous positions. This updating strategy lets us exploit both prior expert knowledge and posterior information from the statistical distribution of the examined population. In this part, we address the stability problem of k-means by proposing a novel algorithmic scheme for self-organizing data that adopts a recursive-mode k-means clustering approach.
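One way to read "updating instead of replacing" is as a blend of each center's previous position with the mean of the points currently assigned to it. The sketch below illustrates that reading only; the blending weight `alpha`, the function name `recursive_kmeans`, and the use of prior centers as the starting point are assumptions, and the actual update rule in the thesis may differ.

```python
import math


def recursive_kmeans(points, prior_centers, alpha=0.5, iters=50):
    """k-means variant in which each center is updated as a weighted blend of
    its previous position and the new cluster mean (alpha is hypothetical).

    alpha = 1.0 reduces to the standard k-means replacement step;
    smaller values keep more of the prior (e.g. expert-supplied) position.
    """
    centers = [tuple(c) for c in prior_centers]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: blend previous position with the new cluster mean.
        updated = []
        for center, group in zip(centers, groups):
            if not group:
                updated.append(center)  # nothing assigned: keep the old center
                continue
            mean = tuple(sum(axis) / len(group) for axis in zip(*group))
            updated.append(
                tuple((1 - alpha) * c + alpha * m for c, m in zip(center, mean))
            )
        centers = updated
    return centers
```

For example, `recursive_kmeans(points, prior_centers=[(2, 2), (8, 8)], alpha=0.5)` starts from the supplied prior positions and moves each center only halfway toward its cluster mean per iteration, so the prior influences the early iterations while the data dominate in the limit.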
Third, we examine issues of k-means clustering associated with its generalization ability in organizing big datasets. For this purpose, we exploit a data bootstrapping strategy without replacement. By generating multiple datasets of rather small size, we attempt to cover the entire data distribution space and capture its structural properties within the multiple classes generated. Each bootstrap stage exploits the stabilization process of the k-means algorithm. Finally, all class centroids generated from the bootstrap process are treated as (meta-data) samples of higher abstraction, which are organized into classes via the mean-shift approach, as in the stabilization process.
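The abstract does not say how the small datasets are drawn. One plausible reading of "bootstrapping without replacement" is a random split of the population into disjoint subsets, so that together they cover the whole data distribution. The sketch below shows only that splitting step, under this assumption; `disjoint_subsets` is a hypothetical helper name.

```python
import random


def disjoint_subsets(points, n_subsets, seed=0):
    """Shuffle the data once and deal it into n_subsets disjoint parts,
    so every sample appears in exactly one subset (no replacement)."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    # Striding by n_subsets yields parts whose sizes differ by at most one.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

Each subset would then be clustered separately with k-means, and the resulting centroids pooled and grouped with mean-shift, as the paragraph above describes.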
Alongside the data resampling strategies, we also consider the appropriate choice of distance metric, which is another major problem in exploratory data analysis schemes. We apply our algorithmic developments to data expressing the temporal course of tissue reflection at a specific wavelength. The process of aceto-whitening is of paramount importance in cervical cancer diagnosis, and we examine clustering methodologies for extracting, processing and interpreting the relevant information from the available data. As in most time-series formulations, the response curves considered are characterized by both overall amplitude (or power) characteristics and local shape formations. The proposed metric attempts to capture both of these aspects in a single configuration, which can be parametrically adjusted to the particular application domain.
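The abstract does not give the form of the proposed metric. Purely to illustrate what a parametrically adjustable combination of amplitude and shape could look like, the sketch below mixes a difference in root-mean-square level with one minus the Pearson correlation. Both terms, the weight `lam`, and all function names are assumptions for this example and should not be read as the metric defined in the thesis.

```python
import math


def amplitude_term(a, b):
    """Absolute difference between the root-mean-square levels of two curves."""
    def rms(curve):
        return math.sqrt(sum(v * v for v in curve) / len(curve))
    return abs(rms(a) - rms(b))


def shape_term(a, b):
    """1 - Pearson correlation: near 0 for curves with the same shape,
    up to 2 for perfectly opposite shapes."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a flat curve; treat two flat curves
        # as identical in shape and flat-vs-varying as uncorrelated.
        return 0.0 if var_a == var_b else 1.0
    return 1.0 - cov / math.sqrt(var_a * var_b)


def combined_distance(a, b, lam=0.5):
    """Hypothetical mixed metric: lam weights amplitude, (1 - lam) weights shape."""
    return lam * amplitude_term(a, b) + (1 - lam) * shape_term(a, b)
```

With `lam` close to 1 the distance mainly separates curves by overall level, while values close to 0 mainly separate them by shape. In practice the two terms would likely need to be normalized to comparable ranges before mixing.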
Overall, the test results indicate the importance of data resampling (and bootstrapping) in the appropriate partitioning of large datasets and the efficient operation of data mining (and clustering) schemes.
http://creativecommons.org/licenses/by/4.0/
Technical University of Crete::School of Electrical and Computer Engineering
Vourlaki_Ioanna_MSc_2016.pdf
Chania [Greece]
Library of TUC
2017-01-03
application/pdf
10.2 MB
free
Vourlaki Ioanna-Theoni
Βουρλακη Ιωαννα-Θεωνη
Zervakis Michalis
Ζερβακης Μιχαλης
Balas Costas
Μπαλας Κωστας
Lagoudakis Michael
Λαγουδακης Μιχαηλ
Πολυτεχνείο Κρήτης
Technical University of Crete
Data Mining
Machine learning
Pattern recognition
Clustering
Statistics