Integrative Methods for the Analysis of Gene Regulation

Integrative Machine Learning approaches for gene regulation

Despite extensive research, gene regulation is still not fully understood. In particular, the interplay between regulators that modulate the transcriptional activation of a gene and post-transcriptional regulators that modulate the abundance of the gene’s product, the mRNA, is neglected in most systems biology studies. With numerous complete epigenomics datasets now available, our vision is to produce a comprehensive computational catalogue of gene regulation for each gene in the human genome, including transcriptional and post-transcriptional regulators, in much greater detail than is currently available.

Relating Histone Modifications to Regulation of Gene Expression

The chemical modification of residues on histone proteins, the building blocks of the nucleosome, has been shown to be connected to the regulation of gene transcription. Many studies have shown that the abundance of core histone modifications, such as H3K4me3, H3K27me and H3K27ac, along a gene promoter is predictive of the gene’s expression level. We are interested in producing a detailed map of functional histone modifications.

We study machine learning methods that enable us to link the abundance of modifications along a gene region, e.g., the promoter, to the expression of the gene. A particular interest is in understanding the regulation of sense / antisense promoters.
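As a minimal sketch of how modification abundance can be linked to expression, the example below fits a ridge-regularized linear regression from per-gene promoter signals to log expression. It is not the group's method: the function names, the choice of a linear model, and the toy numbers are all assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(features, targets, penalty=1e-6):
    """Return weights (intercept first) of a ridge-regularized linear model."""
    rows = [[1.0] + list(f) for f in features]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for i in range(1, p):  # the intercept is left unpenalized
        xtx[i][i] += penalty
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """Predicted log expression for one gene."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical toy data: log-scaled promoter signal of three modifications
# (two activating marks and one repressive mark) for six genes.
signals = [
    [3.0, 2.5, 0.2],
    [2.8, 2.9, 0.1],
    [0.5, 0.4, 2.7],
    [0.3, 0.6, 3.1],
    [1.9, 1.2, 1.0],
    [1.1, 2.0, 1.5],
]
# Toy log expression generated from known weights, so the fit can be checked.
expression = [0.5 + 1.0 * a + 0.5 * b - 0.8 * c for a, b, c in signals]

weights = fit_ridge(signals, expression)
print([round(w, 3) for w in weights])
```

In practice the signal would be binned along the promoter and the model evaluated on held-out genes; a linear model is used here only because its weights are easy to read as activating or repressive contributions.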

histone example figure

Prediction of Context-Specific microRNA-Transcript Interactions

MicroRNAs (miRNAs) are small non-coding RNAs that play a critical role in a wide range of biological processes via post-transcriptional gene regulation. Identifying miRNA targets is a critical step toward elucidating their functions in different diseases. In recent years, several computational methods based on miRNA-mRNA sequence complementarity have been developed. However, the expected false positive rate of sequence-based predictions is still large. In addition, many target relationships are context-specific. Therefore, most approaches incorporate miRNA and mRNA expression levels to improve prediction accuracy.

Next-generation RNA sequencing (RNA-seq) extends transcriptome profiling to the quantitative analysis of expression levels of genes and their transcript isoforms. We use machine learning approaches to infer miRNA-mRNA interaction networks in cancer from both gene and transcript expression levels. Learning miRNA regulation for individual transcripts has the advantage that the effect of a miRNA on the direct precursor of a protein can be estimated; this effect is ambiguous when all transcripts of a gene are summarized as a single gene expression level.
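The idea of combining sequence-based candidates with expression data can be sketched as follows: keep only those predicted miRNA-transcript pairs whose expression profiles are clearly anti-correlated across samples. This is a deliberately simple stand-in for the machine learning models mentioned above; the function names, the correlation threshold, and the identifiers `miR-A`, `GENE1-201` and `GENE1-202` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def rank_candidate_interactions(mirna_expr, transcript_expr,
                                sequence_candidates, max_corr=-0.5):
    """Keep sequence-predicted pairs whose expression is anti-correlated.

    sequence_candidates is a list of (miRNA, transcript) pairs, for example
    from a sequence-complementarity predictor. The result is sorted from the
    strongest negative correlation to the weakest.
    """
    hits = []
    for mirna, transcript in sequence_candidates:
        r = pearson(mirna_expr[mirna], transcript_expr[transcript])
        if r <= max_corr:
            hits.append((mirna, transcript, r))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical toy data: five samples, one miRNA, two isoforms of one gene.
mirna_expr = {"miR-A": [1.0, 2.0, 3.0, 4.0, 5.0]}
transcript_expr = {
    "GENE1-201": [10.0, 8.0, 6.0, 4.0, 2.0],  # drops as the miRNA rises
    "GENE1-202": [5.0, 5.1, 4.9, 5.2, 5.0],   # essentially unaffected
}
candidates = [("miR-A", "GENE1-201"), ("miR-A", "GENE1-202")]

hits = rank_candidate_interactions(mirna_expr, transcript_expr, candidates)
print(hits)
```

Only one of the two isoforms is retained in this toy example, which illustrates why transcript-level profiles can resolve an effect that a summed gene-level profile would dilute.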

Machine Learning approaches to learn context-specific miRNA-transcript interactions

Reconstruction of dynamic regulatory networks is a challenging task in Computational Systems Biology. Current models of gene regulatory networks are often constructed as a static snapshot of the regulatory wiring in cells. We are working on methods that can dynamically rewire the network connections, modeling transcriptional and post-transcriptional factors through the integration of binding data (e.g. ChIP-Seq) and gene expression data. In addition, we are extending these methods to use transcript-level expression measurements from RNA-Seq to improve the resolution of the reconstructed dynamic regulatory networks.

dynamic networks

MH Schulz, KV Pandit, CL Lino Cardenas, N Ambalavanan, N Kaminski and Z Bar-Joseph
Reconstructing dynamic microRNA-regulated interaction networks
PNAS 2013 [full text]

Probabilistic Models for the Analysis of Open Chromatin Regions

As part of our ongoing involvement in the DEEP epigenomics project, we have developed a new method for the analysis of NOMe data. NOMe is a bisulfite-sequencing-based, genome-wide approach that measures open chromatin regions through methylation of GpC dinucleotides. Open chromatin regions are normally predicted using DNase I treatment followed by sequencing, but this technology is limited to samples with a large number of cells and has been reported to have large biases.

We developed a new hidden Markov model (HMM) based approach that predicts open chromatin regions from NOMe data more accurately than the previously suggested window-based method, on our own and other available datasets. Our method employs HMMs with binomial emission distributions and an automatic, permutation-based procedure for empirical False Discovery Rate (FDR) calculation, and it adjusts for genomic regions of uneven coverage.
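To illustrate the modeling idea, the sketch below decodes a two-state HMM (closed and open chromatin) with binomial emissions over per-GpC counts of methylated reads and coverage, using the Viterbi algorithm. It is an assumption-laden toy rather than the published method: the emission probabilities, the transition probability, and the counts are invented, and the permutation-based FDR step is not shown.

```python
import math


def log_binomial(k, n, p):
    """Log probability of k methylated reads out of n under Binomial(n, p)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def viterbi(observations, emit_p, stay_p=0.9, start_p=(0.5, 0.5)):
    """Most likely state path for a list of (methylated, coverage) pairs.

    State 0 is closed chromatin and state 1 is open chromatin. emit_p holds
    the assumed per-state probability that a read is methylated at a GpC.
    """
    n_states = len(emit_p)
    log_trans = [
        [math.log(stay_p if i == j else (1 - stay_p) / (n_states - 1))
         for j in range(n_states)]
        for i in range(n_states)
    ]
    k0, n0 = observations[0]
    scores = [math.log(start_p[s]) + log_binomial(k0, n0, emit_p[s])
              for s in range(n_states)]
    backpointers = []
    for k, n in observations[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda q: scores[q] + log_trans[q][s])
            new_scores.append(scores[best_prev] + log_trans[best_prev][s]
                              + log_binomial(k, n, emit_p[s]))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy data: (methylated reads, coverage) at consecutive GpC sites.
sites = [(1, 10), (0, 8), (2, 12), (9, 10), (7, 8), (3, 4), (1, 9), (0, 11)]
path = viterbi(sites, emit_p=(0.1, 0.8))
print(path)
```

Because the binomial likelihood depends on coverage, a site with few reads contributes less evidence than a deeply covered one, which is one reason a count-based emission model is a natural fit for data with uneven coverage.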

RNA-Seq

example gene with RNA-seq transcript expression levels

We are involved in several projects that improve the analysis of RNA-Seq data. Oases is an accurate de novo transcriptome assembler that uses an explicit alternative splicing model and is able to assemble full-length mRNAs from RNA-Seq data without needing to map the reads to a reference sequence. The software is freely available and is based on parts of the Velvet genome assembler.

MH Schulz*, DR Zerbino*, M Vingron and E Birney (2012)
Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels
Bioinformatics 28 (8): 1086-1092 [full text]

The quality of downstream RNA-Seq analyses, such as de novo transcriptome assembly, is diminished by errors introduced during sequencing. We have developed the SEECER software, which corrects mismatch and indel errors in non-uniform sequencing data sets such as RNA-Seq data. If a genomic reference and a transcript annotation are available, software distributed in the Solas package can be used to infer new alternative splicing events. Reliable estimates of isoform expression levels for a gene can also be computed using the POEM algorithm (see the picture above; % values on the right).

H Richard*, MH Schulz*, M Sultan*, A Nürnberger, S Schrinner, D Balzereit, E Dagand, A Rasche, H Lehrach, M Vingron, SA Haas, and ML Yaspo (2010)
Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments
Nucleic Acids Research, 38 (10):e112 [full text]
*shared first authorship

Clinical Diagnostics

Phenomizer query overlap on the Human Phenotype Ontology

Using the Human Phenotype Ontology (HPO), we develop methods for diagnosing diseases from phenotypes observed in patients. We have developed a new procedure that ranks potential causal diseases using p-values of semantic similarity measures between terms of the HPO. The most successful measure can be tested online using the Phenomizer web application. The exact p-values can even be computed with efficient algorithms, an approach presented in BMC Bioinformatics that was shown to outperform random sampling. Extending the applicability of these methods and accounting for annotation errors are directions of our future research.
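The ranking idea can be sketched on a toy ontology: score a patient's query terms against each disease with an information-content-based similarity, then turn the score into an empirical p-value by comparing it with scores of random queries of the same size. The sketch assumes a Resnik-style similarity with a best-match average and uses plain random sampling instead of the exact computation; the ontology, the term names, the disease annotations, and all function names are hypothetical.

```python
import math
import random


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(parents, annotations):
    """IC(t) = -log(fraction of diseases annotated to t or a descendant)."""
    counts = {term: 0 for term in parents}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] += 1
    n = len(annotations)
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def resnik(t1, t2, parents, ic):
    """IC of the most informative common ancestor (t2 must be annotated)."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(ic[t] for t in common)


def query_score(query, disease_terms, parents, ic):
    """Average over query terms of the best match among the disease terms."""
    best = [max(resnik(q, d, parents, ic) for d in disease_terms) for q in query]
    return sum(best) / len(best)


def empirical_p_value(query, disease_terms, parents, ic, n_samples=2000, seed=0):
    """Share of random same-size queries scoring at least as well."""
    rng = random.Random(seed)
    observed = query_score(query, disease_terms, parents, ic)
    terms = sorted(t for t in parents if parents[t])  # all non-root terms
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(terms, len(query))
        if query_score(sample, disease_terms, parents, ic) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_samples + 1)


# Hypothetical toy ontology (term -> parents) and disease annotations.
parents = {
    "root": [],
    "abnormal_heart": ["root"],
    "abnormal_limb": ["root"],
    "septal_defect": ["abnormal_heart"],
    "arrhythmia": ["abnormal_heart"],
    "polydactyly": ["abnormal_limb"],
    "short_fingers": ["abnormal_limb"],
}
annotations = {
    "disease_A": {"septal_defect", "polydactyly"},
    "disease_B": {"arrhythmia"},
    "disease_C": {"short_fingers"},
    "disease_D": {"septal_defect"},
}

ic = information_content(parents, annotations)
query = ["septal_defect", "polydactyly"]
ranking = sorted(
    (empirical_p_value(query, terms, parents, ic), name)
    for name, terms in annotations.items()
)
for p_value, name in ranking:
    print(name, round(p_value, 3))
```

Sampling is used here only because it is short to write. Its resolution is limited by the number of samples, which is the kind of limitation an exact score distribution avoids.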

S Köhler, MH Schulz, P Krawitz, S Bauer, S Dölken, CE Ott, C Mundlos, D Horn, S Mundlos and PN Robinson (2009)
Clinical Diagnostics with Semantic Similarity Searches in Ontologies
The American Journal of Human Genetics, 85 (4):457-64 [full text]

MH Schulz, S Köhler, S Bauer and PN Robinson
Exact score distribution computation for ontological similarity searches
BMC Bioinformatics, 12:441 [full text]

SeqAn - C++ Library for Biological Sequence Analysis

Bioinformatics software must constantly adapt to the increasing throughput of modern technologies, especially next-generation sequencing. It is therefore essential that open source libraries supply researchers with up-to-date implementations of common sequence analysis tasks. Too often, researchers resort to ad hoc implementations and non-optimized algorithms because suitable implementations are unavailable or time is short. SeqAn is a C++ template library for biological sequence analysis that has grown considerably over the last years and contains efficient implementations of all major building blocks for sequence analysis. I contribute to SeqAn to improve it further. Thus far, my most important contributions are data mining algorithms and algorithms for the construction of variable order Markov chains.
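For readers unfamiliar with variable order Markov chains, the sketch below shows the core idea in Python: the probability of the next symbol is conditioned on the longest recent context that was observed often enough in training, falling back to shorter contexts otherwise. It does not reproduce SeqAn's C++ implementation or its construction algorithm; the function names, the count threshold, and the add-one smoothing are assumptions chosen for brevity.

```python
from collections import defaultdict


def train_vomc(sequences, max_order=2):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for order in range(max_order + 1):
                if i - order < 0:
                    break  # not enough history for this context length
                context = seq[i - order:i]
                counts[context][symbol] += 1
    return counts


def next_symbol_probability(counts, history, symbol,
                            max_order=2, min_count=2, alphabet="ACGT"):
    """P(symbol | history) using the longest sufficiently frequent context."""
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]
        total = sum(counts[context].values()) if context in counts else 0
        if total >= min_count:
            # Add-one smoothing keeps unseen symbols at non-zero probability.
            return (counts[context][symbol] + 1) / (total + len(alphabet))
    return 1 / len(alphabet)  # no usable context at all: uniform fallback


# Hypothetical toy training sequence.
model = train_vomc(["ACGTACGTACGT"], max_order=2)

# The context "AC" was seen three times, always followed by "G".
p_long = next_symbol_probability(model, "AC", "G")
# The context "TT" was never seen, so the model backs off to context "T".
p_backoff = next_symbol_probability(model, "TT", "A")
print(p_long, p_backoff)
```

Storing every context explicitly, as done here, is only practical for short sequences and small orders; efficient construction on large inputs is exactly where index-based algorithms become necessary.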

MH Schulz, D Weese, T Rausch, A Döring, K Reinert and M Vingron
Fast and adaptive variable order Markov chain construction
Proceedings WABI 2008, Springer LNCS, Volume 5251 [full text]

D Weese and MH Schulz
Efficient string mining under constraints via the deferred frequency index
Industrial Conference for Data Mining (ICDM 2008), LNAI 5077, pp. 374-388 [full text]