Institutional Repository
Technical University of Crete

Computational methods for knowledge discovery from heterogeneous data sources: methodology and implementation on biological and molecular sources

Koumakis Eleftherios

Year: 2014
Type of Item: Doctoral Dissertation
Bibliographic Citation: Eleftherios Koumakis, "Computational methods for knowledge discovery from heterogeneous data sources: methodology and implementation on biological and molecular sources", Doctoral Dissertation, School of Production Engineering and Management, Technical University of Crete, Chania, Greece, 2014

More than a decade after the completion of the Human Genome Project, advances in genome research and biotechnology have drastically influenced the concept of disease diagnosis and treatment. In this context, the improvement of high-throughput technologies, such as microarrays, has fundamentally transformed research on various diseases (e.g. cancer). Microarrays are a powerful tool for studying the molecular basis of the genesis and progression of diseases, and have advanced life scientists’ ability not only to detect but also to quantify simultaneously the expression of thousands of genes for various diseases and phenotypes. The initial expectation was that microarrays would reveal specific gene co-expression patterns (gene signatures or gene-biomarkers) for various phenotypes, but the utility of gene-expression profiles seems to be bounded by a number of limitations, mainly related to: (a) the variation and heterogeneity of the examined tissues: when two different tissue samples are compared, the observed differences in gene-expression levels reflect all the cell types present in each sample, which ties the induced gene signatures to the specific tissues examined; (b) the different microarray platforms utilised and the different experimental protocols followed, which make it very difficult to combine gene-expression datasets from heterogeneous platforms and different studies; and (c) the great imbalance between the huge number of transcripts and genes (tens of thousands) and the relatively small number of available sample cases (hundreds). In addition, the utilization of ‘knowledge-ignorant’ feature-selection approaches does not guarantee the ‘biological validity’ of the result (the selected gene-biomarkers). In other words, focusing just on highly differential genes might not be the optimal process to follow.
The aforementioned observations have been reported and justified by various studies in the literature. Currently, the bioinformatics community focuses on more ‘knowledge-aware’, enhanced methods for selecting genes from microarray data. These methods aim to guide the gene-selection process by taking advantage of, and ‘amalgamating’, knowledge from other established biological sources, such as molecular pathways and especially gene regulatory networks (GRNs). In cells, thousands of genes are expressed and work in concert to ensure the cell's function, fitness, and survival. The gene relationships have been mapped onto GRNs that can be interrogated to gain insight into the mechanisms of differential gene expression at a systems level. These networks can also be used to understand the flow of information in a biological system, to identify circuits that may be used for a specific purpose, and to model changes in gene expression under different conditions. The study of the function, structure and evolution of GRNs in combination with microarray gene-expression profiles has become essential for contemporary biology research. The most prominent research line in the respective fields, called pathway analysis, focuses on the identification of the most discriminant GRNs (pathways), or parts of GRNs (sub-paths), that differentiate between specific phenotypes by integrating and coupling the underlying gene regulatory machinery of GRNs with gene-expression profiles from microarray data. The number of relevant approaches and methodologies has increased significantly over the past years, a fact that indicates the importance of such an integration endeavour. In addition, all reported methodologies and developed tools have significantly contributed to the identification of informative associations between GRNs and target phenotypes. One critical drawback of these tools comes from the way the methodologies handle the knowledge encoded in GRNs.
In most cases each GRN is represented and manipulated just as the set of the genes engaged in the network. With this approach, and following the gene enrichment analysis (GEA) algorithmic processes, one can determine which biological pathways are significantly over-represented (i.e., more than expected by chance) for a specific phenotype. As a result, GEA-like methodologies cannot access, and do not provide information about, parts (i.e., sub-paths) of the pathway. Recently, some enhanced GEA-like tools have started to utilize the topology of the GRNs in their analysis (based on graph-theoretic approaches and network visualization techniques), but only a limited number of the methodologies reported so far take advantage of the signalling information present in a GRN, i.e., the topology and the type of the involved interactions, such as the activation or inhibition relations holding between genes. The work reported in this thesis introduces and presents a novel pathway-analysis methodology. The whole methodology is implemented in a system called MinePath, a web-based platform aiming to facilitate and ease the identification and visualization of differentially active paths or sub-paths within a GRN, using gene-expression data. The methodology takes advantage of the topology and the underlying regulatory mechanisms of GRNs, including the direction and the type of the engaged interactions (e.g. activation/expression, inhibition).
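To make the GEA-style notion of a pathway being "over-represented (i.e., more than expected by chance)" concrete, the following is a minimal sketch of one common way to score it: a one-sided hypergeometric test on gene sets. The abstract does not name the statistic used by any particular GEA tool, so the choice of test, the function name and the parameter names are all assumptions made for illustration.

```python
from math import comb


def overrepresentation_p_value(population: int, pathway_size: int,
                               selected: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` pathway genes by chance.

    Assumed setting (not taken from the source): `selected` genes are drawn
    without replacement from `population` genes, of which `pathway_size`
    belong to the pathway; the result is the hypergeometric upper tail.
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    # Sum the probabilities of every overlap from the observed one upwards.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20000 genes, a 100-gene pathway,
    # 500 selected genes, 12 of which fall in the pathway.
    print(overrepresentation_p_value(20000, 100, 500, 12))
```

A score of this kind treats the pathway purely as a set of genes, which is exactly the limitation noted above: it says nothing about which sub-paths inside the pathway are active.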
Each GRN sub-path is interpreted according to Kauffman’s principles and semantics: (i) the network is a directed graph with genes (inputs and outputs) being the graph nodes and the edges representing the causal links between them, i.e., the regulatory reactions; (ii) each node can be in one of two states: ‘ON’, the gene is expressed or up-regulated (i.e., the respective substance is present), or ‘OFF’, the gene is not expressed or is targeted by a specific gene; and (iii) time is viewed as proceeding in discrete steps: at each step the new state of a node is a Boolean function of the prior states of the nodes with arrows pointing towards it. The MinePath method unfolds into five modular steps: I. Gene-expression values are discretized into two states, with values 1 and 0 for up-regulated and down-regulated genes, respectively, and the binary gene-expression sample matrix is formed; II. Each target GRN is decomposed into its constituent sub-paths, e.g., the path A → B -| C (with → denoting activation and -| inhibition) is decomposed into three sub-paths, A → B, B -| C and A → B -| C (note that overlapping sub-paths are also identified and formed); III. Each sub-path is interpreted on the basis of its functional active state and is represented by a binary ordered vector of active states, e.g., sub-path A → B -| C is considered functional when A and B are up-regulated and C is down-regulated, resulting in the active-state ordered vector <1,1,0> for the corresponding genes; IV. The binary ordered vector of each sub-path is aligned and matched against all (discretized) binary gene-expression sample profiles. A sub-path is considered to match a sample if and only if all the corresponding genes in the sub-path exhibit the same active state in the sample, i.e., genes A and B are up-regulated and gene C is down-regulated, giving the sample ordered vector <1,1,0>, which matches the sub-path vector.
In addition, a binary sub-path expression matrix is formed, with the sub-paths as rows, the input samples as columns, and cell values of 1 or 0 indicating whether or not the respective sub-path is functional and active (i.e., holds) for the corresponding sample. In other words, the sub-paths take the place of sample descriptor features and are utilized for the construction of sub-path-based phenotype prediction models. V. Finally, the differential power of each sub-path is computed and appropriately parameterized (users may adjust the parameters to their exploratory needs). The highly ranked (best matching) sub-paths are kept according to user-defined thresholds. Subsequently, each sub-path is characterized by its phenotype inclination: sub-paths with positive differential power values are characterized as inclined to phenotype 1, and those with negative power as inclined to phenotype 2. These sub-paths present putative evidential molecular mechanisms that govern the disease itself,
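The five steps described above can be illustrated with a small Python sketch. It is not the MinePath implementation: the abstract does not specify the discretization rule, the general rule for deriving active states, or the differential-power formula, so the sketch assumes (a) a per-gene mean cutoff for discretization, (b) that the first gene of a sub-path is ON and every other gene takes its state from the edge pointing to it (1 after activation, 0 after inhibition), which reproduces the <1,1,0> example, and (c) a differential power equal to the fraction of matching phenotype-1 samples minus the fraction of matching phenotype-2 samples. Activation is written as `->` and inhibition as `-|`; all function names are hypothetical.

```python
from statistics import mean

ACTIVATION, INHIBITION = "->", "-|"  # edge notation chosen for this sketch


def discretize(expression):
    """Step I: binarize each gene (assumed rule: 1 if above the gene's mean)."""
    binary = {}
    for gene, values in expression.items():
        cutoff = mean(values)
        binary[gene] = [1 if v > cutoff else 0 for v in values]
    return binary


def decompose(genes, relations):
    """Step II: list every contiguous sub-path that has at least one edge."""
    subpaths = []
    for start in range(len(genes) - 1):
        for end in range(start + 1, len(genes)):
            subpaths.append((genes[start:end + 1], relations[start:end]))
    return subpaths


def active_state(relations):
    """Step III: active-state vector (assumed: first gene ON, then by edge type)."""
    return [1] + [1 if r == ACTIVATION else 0 for r in relations]


def subpath_matrix(subpaths, binary, n_samples):
    """Step IV: rows are sub-paths, columns are samples, 1 means a full match."""
    matrix = []
    for genes, relations in subpaths:
        state = active_state(relations)
        row = [
            int(all(binary[g][s] == bit for g, bit in zip(genes, state)))
            for s in range(n_samples)
        ]
        matrix.append(row)
    return matrix


def differential_power(row, labels):
    """Step V (assumed formula): match rate in phenotype 1 minus phenotype 2."""
    p1 = [m for m, label in zip(row, labels) if label == 1]
    p2 = [m for m, label in zip(row, labels) if label == 2]
    return sum(p1) / len(p1) - sum(p2) / len(p2)


if __name__ == "__main__":
    # Made-up expression values for four samples; the first two are phenotype 1.
    expression = {
        "A": [9.0, 8.0, 2.0, 3.0],
        "B": [7.0, 9.0, 1.0, 8.0],
        "C": [1.0, 9.0, 9.0, 8.0],
    }
    labels = [1, 1, 2, 2]
    binary = discretize(expression)
    subpaths = decompose(["A", "B", "C"], [ACTIVATION, INHIBITION])
    matrix = subpath_matrix(subpaths, binary, len(labels))
    for (genes, relations), row in zip(subpaths, matrix):
        print(genes, relations, row, differential_power(row, labels))
```

Under assumption (c), a positive value means the sub-path matches phenotype-1 samples more often, which lines up with the sign convention in step V; ranking and thresholding would then operate on these values.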
