The work titled "Techniques for biological data feature aggregation for medical diagnosis and use of neural networks as classifiers" by Kormpi Konstantina is licensed under Creative Commons Attribution 4.0 International
Bibliographic Citation
Konstantina Kormpi, "Techniques for biological data feature aggregation for medical diagnosis and use of neural networks as classifiers", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.84018
Cancer is a global problem, as described in the World Cancer Report. Today’s technology offers approaches that reveal cancer at the cellular and molecular level. For a cancer sample such as a cell biopsy, thousands of genes can be analyzed simultaneously on a single chip, called a microarray. Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allow computers to “learn” from past examples and to detect hard-to-discern patterns in large, noisy or complex data sets. This capability is particularly well suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently, machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine.
Our goal was, firstly, to construct a framework for statistical analysis, description and visualization of real biological data and, secondly, to build a predictive model for binary classification of cancer based on machine learning algorithms and feature selection techniques. We test six supervised machine learning algorithms, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naïve Bayes (NB) and Linear Support Vector Machines (SVM), on different datasets of cervical, breast, acute myeloid leukemia (AML) and pancreatic cancer, publicly available on the Gene Expression Omnibus platform. During the learning procedure, the data were split into training and validation sets.
The training set is used in 5-fold cross-validation under three different scenarios: on the original, untransformed data, on standardized data, and finally on standardized data transformed by the dimensionality reduction technique of Principal Component Analysis (PCA) and by other feature reduction techniques. Finally, we compare the results and use the validation set to evaluate our models’ predictions on unseen data.
We obtain prediction accuracies of 100% for models trained with LR, NB and SVM on the cervical dataset, 90% for models built with LDA on the breast dataset, 95.4% for models trained with NB on the AML dataset and 94.4% for models trained with LR on the pancreatic dataset. Throughout the procedure, we compare the 5-fold cross-validation results at each step, and finally we compute further evaluation metrics such as precision, sensitivity, F1-score and ROC curves, in order to extract useful insights.
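The evaluation procedure described in the abstract can be sketched roughly as follows. This is an illustration only, not the thesis code. It assumes scikit-learn. The data is a synthetic stand-in generated with `make_classification`, not a Gene Expression Omnibus dataset. The 80/20 split, the stratified folds, the number of PCA components and all classifier hyperparameters are assumptions, as are the helper names `make_models` and `build`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a (samples x genes) expression matrix; NOT thesis data.
X, y = make_classification(n_samples=120, n_features=200,
                           n_informative=15, random_state=0)

# Hold out a validation set (assumed 80/20 split, stratified by class).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def make_models():
    """Fresh instances of the six classifiers (hyperparameters are assumptions)."""
    return {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "CART": DecisionTreeClassifier(random_state=0),
        "NB": GaussianNB(),
        "SVM": SVC(kernel="linear"),
    }


def build(scenario, clf):
    """Wrap a classifier in the preprocessing of one of the three scenarios."""
    if scenario == "raw":
        return make_pipeline(clf)
    if scenario == "standardized":
        return make_pipeline(StandardScaler(), clf)
    if scenario == "standardized+pca":
        # n_components=10 is an arbitrary illustrative choice.
        return make_pipeline(StandardScaler(), PCA(n_components=10), clf)
    raise ValueError(f"unknown scenario: {scenario}")


SCENARIOS = ("raw", "standardized", "standardized+pca")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean 5-fold cross-validation accuracy for every (scenario, model) pair.
cv_scores = {}
for scenario in SCENARIOS:
    for name, clf in make_models().items():
        scores = cross_val_score(build(scenario, clf), X_train, y_train,
                                 cv=cv, scoring="accuracy")
        cv_scores[(scenario, name)] = scores.mean()

# Refit the best cross-validated combination and check it on unseen data.
best_scenario, best_name = max(cv_scores, key=cv_scores.get)
best_model = build(best_scenario, make_models()[best_name]).fit(X_train, y_train)
y_pred = best_model.predict(X_val)

print(best_scenario, best_name, round(cv_scores[(best_scenario, best_name)], 3))
print(classification_report(y_val, y_pred))  # precision, recall (sensitivity), F1
```

Placing the scaler and PCA inside the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into preprocessing. Whether the thesis applied preprocessing this way is not stated in the abstract.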