Nikolaos-Kosmas Chlis, "Machine learning methods for genomic signature extraction", Master Thesis, School of Electronic and Computer Engineering, Technical University of Crete, Chania, Greece, 2015
https://doi.org/10.26233/heallink.tuc.26968
The application of machine learning methods to the analysis of DNA microarray data has become common practice in bioinformatics. DNA microarrays simultaneously measure the expression values of thousands of genes. Given these measurements, machine learning methods can be employed to identify candidate genes that are related to a biological state or phenotype of interest, such as cancer. Such lists of candidate genes are often called “genomic signatures” in the literature. Machine learning is necessary for extracting genomic signatures because it is practically impossible for field experts to assess the importance of each gene by manual inspection, given the large size of the genome, which consists of approximately 25,000 genes. Feature subset selection and classification algorithms are popular choices for this task. Univariate feature selection methods filter genes according to differences in their expression profiles among samples belonging to different classes of interest, such as control and disease. Because they test each gene individually, univariate methods are computationally efficient and select genes with high discrimination ability. However, they ignore associations among genes. Multivariate methods, on the other hand, assess groups of genes simultaneously and select candidate genes based on their predictive performance when used in conjunction with a classifier. As such, they are more effective at capturing the latent associations among genes and select genes with high predictive capability, at the cost of being computationally expensive. While the applied feature selection and classification methodologies have matured and several state-of-the-art algorithms have been established, the stability of the extracted genomic signatures is often overlooked.
As a result, the genomic signatures extracted by many methodologies are unstable: they differ significantly under variations of the training data. Since result stability is related to generalization, this instability raises skepticism in the expert community and hinders the validity and clinical application of research findings from such gene expression studies. This thesis deals with three aspects of the selection and evaluation of gene signatures: stability, predictive capability and statistical significance. First, a framework for the extraction of stable genomic signatures, called Stable Bootstrap Validation (SBV), is introduced. The proposed methodology enforces stability at the validation step. As a result, it can be combined with any classification method that supports feature selection. Three publicly available gene expression datasets are used to test the proposed methodology. First, the dimensionality of the datasets is reduced using a filtering method. Next, bootstrap resampling is used to generate a list of candidate signatures according to the selection frequency of genes across all bootstrap datasets. Then, a stable signature with maximal predictive performance in terms of accuracy, sensitivity and specificity is extracted, and the predictive performance of all candidate signatures is plotted in detail for further inspection. Additionally, the application of random sampling methods for countering the negative effects of imbalanced datasets in classification is investigated, since imbalanced datasets are frequently found in DNA microarray studies, where control samples are usually scarce.
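The idea of ranking genes by their selection frequency across bootstrap datasets can be illustrated with a minimal sketch. This is not the SBV implementation: the selector `top_genes` (absolute difference of class means) is a simple stand-in for whatever feature selection method is actually paired with the classifier, and the names `selection_frequency`, `k` and `n_boot`, as well as the toy data, are assumptions made purely for illustration.

```python
import random
from collections import Counter
from statistics import mean


def top_genes(X, y, k):
    """Return indices of the k genes with the largest absolute
    difference between class means (a stand-in selector)."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def selection_frequency(X, y, k, n_boot=200, seed=0):
    """Fraction of bootstrap datasets in which each gene is selected."""
    rng = random.Random(seed)
    n = len(X)
    counts = Counter()
    done = 0
    while done < n_boot:
        # Draw n samples with replacement to form one bootstrap dataset.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip resamples that miss one of the two classes
        Xb = [X[i] for i in idx]
        counts.update(top_genes(Xb, yb, k))
        done += 1
    return {g: c / n_boot for g, c in counts.items()}


# Toy data (hypothetical): 20 samples, 6 genes, only gene 0 is informative.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[data_rng.gauss(4.0 * label if g == 0 else 0.0, 1.0) for g in range(6)]
     for label in y]

freq = selection_frequency(X, y, k=2)
ranked = sorted(freq, key=freq.get, reverse=True)  # most stable genes first
```

Genes that are selected in most bootstrap datasets appear at the top of `ranked`; nested prefixes of such a ranking are one plausible way to form a list of candidate signatures for later evaluation.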
Moreover, a statistical framework comprising two separate statistical tests is implemented in order to assess the statistical significance of the extracted signature in terms of both classification accuracy and association with the response variable (phenotype/biological state). Finally, the robustness of the methodology is assessed by testing the degree of “agreement” among signatures extracted from independent executions of the methodology.
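One common way to quantify the degree of agreement among signatures from independent runs is the mean pairwise Jaccard index. The abstract does not state which agreement measure the thesis uses, so the sketch below is an assumption chosen for illustration; the function names and the gene labels `G1` to `G4` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty signatures are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(signatures):
    """Average Jaccard index over all pairs of signatures."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Signatures from three hypothetical independent runs.
runs = [
    {"G1", "G2", "G3"},
    {"G1", "G2", "G4"},
    {"G1", "G2", "G3"},
]
agreement = mean_pairwise_agreement(runs)
```

A value of 1 means every run produced the same signature, while values near 0 indicate that the runs share few genes.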