Jan, 22. 04. 2008:
- Stormo82
-
Use of the 'perceptron' algorithm to distinguish translational initiation sites
G. D. Stormo and T. D. Schneider and L. M. Gold and A. Ehrenfeucht
Nucleic Acids Research
10
2997-3010
(1982)
- Staden84
-
Computer methods to locate signals in nucleic acid sequences
R. Staden
Nucleic Acids Research
12
505-519
(1984)
- zhang93wam
-
A weight array method for splicing signal analysis
M. Q. Zhang and T. G. Marr
Comput. Appl. Biosci.
9
499-509
(1993)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/9/5/499
A new method of sequence analysis, using a weight array method (WAM), which generalizes the traditional Staden weight matrix method (WMM), is proposed. With the help of a statistical mechanical model, the discriminant function is identified with the energy function describing macromolecular interactions. The method is applied to the study of 5'-splice signals in Schizosaccharomyces pombe pre-mRNA sequences. The results show that there may exist weak pairwise correlations within the signals and that our method can help to better discriminate these signals. Experiments are proposed to test the predictions of the theory.
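To make the difference between a weight matrix and a weight array concrete, the following is a minimal sketch and not the authors' implementation. It assumes a first-order dependence on the immediately preceding position and add-one pseudocounts; it returns plain log-probabilities rather than a discriminant against a background model. All function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def wmm_scores(sites, pseudo=1.0):
    """Weight matrix: log P(base at position i), positions treated as independent."""
    total = len(sites) + pseudo * len(ALPHABET)
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        matrix.append({b: math.log((counts[b] + pseudo) / total) for b in ALPHABET})
    return matrix

def wam_scores(sites, pseudo=1.0):
    """Weight array: log P(base at i | base at i-1) for positions 1..L-1."""
    arrays = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        prev_counts = Counter(s[i - 1] for s in sites)
        table = {}
        for a in ALPHABET:
            denom = prev_counts[a] + pseudo * len(ALPHABET)
            for b in ALPHABET:
                table[(a, b)] = math.log((pair_counts[(a, b)] + pseudo) / denom)
        arrays.append(table)
    return arrays

def score_wmm(seq, matrix):
    """Sum of independent per-position log-probabilities."""
    return sum(matrix[i][b] for i, b in enumerate(seq))

def score_wam(seq, matrix, arrays):
    """First position from the weight matrix, the rest from the conditionals."""
    total = matrix[0][seq[0]]
    for i in range(1, len(seq)):
        total += arrays[i - 1][(seq[i - 1], seq[i])]
    return total

# Toy training set with a perfect pairwise correlation between the two positions.
toy_sites = ["AA", "AA", "CC", "CC"]
toy_wmm = wmm_scores(toy_sites)
toy_wam = wam_scores(toy_sites)
```

On the toy set, the weight matrix cannot tell "AA" from "AC" because both positions have the same marginal frequencies, whereas the weight array prefers "AA", which is the kind of pairwise correlation the abstract refers to.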
- bailey94meme
-
Fitting a Mixture model by expectation maximization to discover motifs in biopolymers
T. L. Bailey and C. Elkan
(1994)
- bailey06meme
-
MEME: discovering and analyzing DNA and protein sequence motifs
T. L. Bailey and N. Williams and C. Misleh and W. W. Li
Nucl. Acids Res.
34
W369-373
(2006)
http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_2/W369
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel 'signals' in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource (http://meme.nbcr.net) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.
- Burge97mdd
-
Prediction of complete gene structures in human genomic DNA
C. Burge and S. Karlin
Journal of Molecular Biology
268
78--94
(1997)
http://www.sciencedirect.com/science/article/B6WK7-45VGF7T-9/2/0db8132754939b6d0d07e85a6276d801
Ralf, 29. 04. 2008:
- barash03modeling
-
Modelling dependencies in protein-DNA binding sites
Y. Barash and G. Elidan and N. Friedman and T. Kaplan
28--37
(2003)
http://portal.acm.org/citation.cfm?id=640079
- yeo2004mem
-
Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals
G. Yeo and C. B. Burge
Journal of Computational Biology
11
377-394
(2004)
http://www.liebertonline.com/doi/abs/10.1089/1066527041410418
- Elemento2007fire
-
A Universal Framework for Regulatory Element Discovery across All Genomes and Data Types
O. Elemento and N. Slonim and S. Tavazoie
Molecular Cell
28
337--350
(2007)
http://www.sciencedirect.com/science/article/B6WSR-4R05PJ7-K/2/3a48e1f3cf2c02a016c544e229c1db4e
Summary: Deciphering the noncoding regulatory genome has proved a formidable challenge. Despite the wealth of available gene expression data, there currently exists no broadly applicable method for characterizing the regulatory elements that shape the rich underlying dynamics. We present a general framework for detecting such regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements. Our approach makes minimal assumptions about the background sequence model and the mechanisms by which elements affect gene expression. This provides a versatile motif discovery framework, across all data types and genomes, with exceptional sensitivity and near-zero false-positive rates. Applications from yeast to human uncover putative and established transcription-factor binding and miRNA target sites, revealing rich diversity in their spatial configurations, pervasive co-occurrences of DNA and RNA motifs, context-dependent selection for motif avoidance, and the strong impact of posttranscriptional processes on eukaryotic transcriptomes.
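The central quantity in this abstract, the mutual information between sequence and expression, can be illustrated with a plug-in estimate over discrete labels. This is only a sketch under assumptions: motif presence is reduced to an exact substring match, expression is assumed to be already discretized into cluster labels, and the significance testing a real method would need is omitted. The function names are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length label lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def motif_presence(sequences, motif):
    """Binary profile: does each sequence contain the motif as an exact substring?"""
    return [motif in s for s in sequences]

# Toy data: the motif occurs only in promoters of genes in expression cluster 0.
promoters = ["TTGACGTCAT", "ACTGACGTCA", "TTTTAAAACC", "GGGCCCAAAT"]
clusters = [0, 0, 1, 1]
profile = motif_presence(promoters, "TGACGTCA")
```

With the toy data, `mutual_information(profile, clusters)` is 1 bit because motif presence determines the cluster exactly; a profile that is independent of the clusters gives 0.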
Martin, 13. May 2008:
- levitsky07effective
-
Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions
V. Levitsky and E. Ignatieva and E. Ananko and I. Turnaev and T. Merkulova and N. Kolchanov and T. Hodgman
BMC Bioinformatics
8
481
(2007)
http://www.biomedcentral.com/1471-2105/8/481
BACKGROUND: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amount of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding-sites is considered. RESULTS: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to bio-medically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-kappaB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), regulation of the steroidogenic (SF-1) and cell cycle (E2F) genes expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure a higher confidence in our approach, we applied resampling-jackknife and bootstrap tests for the comparison; it appears that optimized PWM and SiteGA have shown similar recognition performances. Then we applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for BSs using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs, and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which spanned both core and flanking regions. When SiteGA and optimized PWM models were applied together, this substantially reduced false positives at least at higher stringencies. CONCLUSION: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
Daniel, 20. May 2008:
- yousef07microRNA
-
Naive Bayes for microRNA target predictions: machine learning for microRNA targets
M. Yousef and S. Jung and A. V. Kossenkov and L. C. Showe and M. K. Showe
Bioinformatics
23
2987-2992
(2007)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/22/2987
Motivation: Most computational methodologies for miRNA:mRNA target gene prediction use the seed segment of the miRNA and require cross-species sequence conservation in this region of the mRNA target. Methods that do not rely on conservation generate numbers of predictions, which are too large to validate. We describe a target prediction method (NBmiRTar) that does not require sequence conservation, using instead, machine learning by a naive Bayes classifier. It generates a model from sequence and miRNA:mRNA duplex information from validated targets and artificially generated negative examples. Both the 'seed' and 'out-seed' segments of the miRNA:mRNA duplex are used for target identification. Results: The application of machine-learning techniques to the features we have used is a useful and general approach for microRNA target gene prediction. Our technique produces fewer false positive predictions and fewer target candidates to be tested. It exhibits higher sensitivity and specificity than algorithms that rely on conserved genomic regions to decrease false positive predictions. Availability: The NBmiRTar program is available at http://wotan.wistar.upenn.edu/NBmiRTar/ Contact: yousef@wistar.org Supplementary information: http://wotan.wistar.upenn.edu/NBmiRTar/
- jiang07oscar
-
OSCAR: One-class SVM for accurate recognition of cis-elements
B. Jiang and M. Q. Zhang and X. Zhang
Bioinformatics
23
2823-2828
(2007)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/21/2823
Motivation: Traditional methods to identify potential binding sites of known transcription factors still suffer from large number of false predictions. They mostly use sequence information in a position-specific manner and neglect other types of information hidden in the proximal promoter regions. Recent biological and computational researches, however, suggest that there exist not only locational preferences of binding, but also correlations between transcription factors. Results: In this article, we propose a novel approach, OSCAR, which utilizes one-class SVM algorithms, and incorporates multiple factors to aid the recognition of transcription factor binding sites. Using both synthetic and real data, we find that our method outperforms existing algorithms, especially in the high sensitivity region. The performance of our method can be further improved by taking into account locational preference of binding events. By testing on experimentally-verified binding sites of GATA and HNF transcription factor families, we show that our algorithm can infer the true co-occurring motif pairs accurately, and by considering the co-occurrences of correlated motifs, we not only filter out false predictions, but also increase the sensitivity. Availability: An online server based on OSCAR is available at http://bioinfo.au.tsinghua.edu.cn/oscar. Contact: zhangxg@tsinghua.edu.cn
- yuan08nucleosome
-
Genomic Sequence Is Highly Predictive of Local Nucleosome Depletion
G. Yuan and J. S. Liu
PLoS Comput Biol
4
e13
(2008)
http://dx.doi.org/10.1371%2Fjournal.pcbi.0040013
The regulation of DNA accessibility through nucleosome positioning is important for transcription control. Computational models have been developed to predict genome-wide nucleosome positions from DNA sequences, but these models consider only nucleosome sequences, which may have limited their power. We developed a statistical multi-resolution approach to identify a sequence signature, called the N-score, that distinguishes nucleosome binding DNA from non-nucleosome DNA. This new approach has significantly improved the prediction accuracy. The sequence information is highly predictive for local nucleosome enrichment or depletion, whereas predictions of the exact positions are only modestly more accurate than a null model, suggesting the importance of other regulatory factors in fine-tuning the nucleosome positions. The N-score in promoter regions is negatively correlated with gene expression levels. Regulatory elements are enriched in low N-score regions. While our model is derived from yeast data, the N-score pattern computed from this model agrees well with recent high-resolution protein-binding data in human.
Claus, 27. May 2008:
- li08fdrmotif
-
fdrMotif: identifying cis-elements by an EM algorithm coupled with false discovery rate control
L. Li and R. L. Bass and Y. Liang
Bioinformatics
24
629-636
(2008)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/5/629
Motivation: Most de novo motif identification methods optimize the motif model first and then separately test the statistical significance of the motif score. In the first stage, a motif abundance parameter needs to be specified or modeled. In the second stage, a Z-score or P-value is used as the test statistic. Error rates under multiple comparisons are not fully considered. Methodology: We propose a simple but novel approach, fdrMotif, that selects as many binding sites as possible while controlling a user-specified false discovery rate (FDR). Unlike existing iterative methods, fdrMotif combines model optimization [e.g. position weight matrix (PWM)] and significance testing at each step. By monitoring the proportion of binding sites selected in many sets of background sequences, fdrMotif controls the FDR in the original data. The model is then updated using an expectation (E)- and maximization (M)-like procedure. We propose a new normalization procedure in the E-step for updating the model. This process is repeated until either the model converges or the number of iterations exceeds a maximum. Results: Simulation studies suggest that our normalization procedure assigns larger weights to the binding sites than do two other commonly used normalization procedures. Furthermore, fdrMotif requires only a user-specified FDR and an initial PWM. When tested on 542 high confidence experimental p53 binding loci, fdrMotif identified 569 p53 binding sites in 505 (93.2%) sequences. In comparison, MEME identified more binding sites but in fewer ChIP sequences than fdrMotif. When tested on 500 sets of simulated 'ChIP' sequences with embedded known p53 binding sites, fdrMotif, compared to MEME, has higher sensitivity with similar positive predictive value. Furthermore, fdrMotif is robust to noise: it selected nearly identical binding sites in data adulterated with 50% added background sequences and the unadulterated data. We suggest that fdrMotif represents an improvement over MEME. Availability: C code can be found at: http://www.niehs.nih.gov/research/resources/software/fdrMotif/ Contact: li3@niehs.nih.gov Supplementary information: Supplementary data are available at http://www.niehs.nih.gov/research/resources/software/fdrMotif/
Peter, 03. June 2008:
- gunewardena07hybrid
-
A hybrid model for robust detection of transcription factor binding sites
S. Gunewardena and Z. Zhang
Bioinformatics
24
484-491
(2008)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/4/484
Motivation: The short and degenerate nature of transcription factor (TF) binding sites contributes towards a low signal to noise ratio making it very difficult to separate them from their background. In order to tackle this problem one needs to look at ways of capturing the underlying biophysical properties that best discriminates TF binding sites from their background DNA. One such discriminatory property lies in the observed compositional differences in the nucleotide levels of TF binding sites and background DNA which are a result of processes such as purifying selection and selective preferences of TF binding sites for particular nucleotides or a combination of nucleotides over others. Results: In this article, we present a hybrid model, referred to as a MonoDi-nucleotide model for robustly detecting TF binding sites. It incorporates both mono- and dinucleotide statistics to optimally partition the base positions of an aligned set of TF binding sites (motif) into a non-redundant sequence of mono and/or dinucleotide segments that maximizes the odds ratio of the binding sites relative to their background DNA. We tested the MonoDi-nucleotide model on the benchmark dataset compiled by Tompa et al. (2005) for assessing computational tools that predict TF binding sites. The performance of the MonoDi-nucleotide model on this data set compares well to, and in many cases exceeds, the performance of existing tools. This is in part attributed to the significant role played by dinucleotides in discriminating TF binding sites from background DNA. Availability: A Matlab implementation of the MonoDi-nucleotide model can be found at http://www.utoronto.ca/zhanglab/MonoDi/. Contact: sumedha@cantab.net, Zhaolei.Zhang@utoronto.ca Supplementary information: Supplementary data are available at Bioinformatics online.
- wang08microRNA
-
Identification of phylogenetically conserved microRNA cis-regulatory elements across 12 Drosophila species
X. Wang and J. Gu and M. Q. Zhang and Y. Li
Bioinformatics
24
165-171
(2008)
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/2/165
Motivation: MicroRNAs are a class of endogenous small RNAs that play regulatory roles. Intergenic miRNAs are believed to be transcribed independently, but the transcriptional control of these crucial regulators is still poorly understood. Results: In this work, phylogenetic footprinting is used to identify conserved cis-regulatory elements (CCEs) surrounding intergenic miRNAs in Drosophila. With a two-step strategy that takes advantage of both alignment-based and motif-based methods, we identified CCEs that are conserved across the 12 fly species. When compared with TRANSFAC database, these CCEs are significantly enriched in known transcription factor binding sites (TFBSs). Moreover, several TFs that play essential roles in Drosophila development (e.g. Adf-1, Abd-B, Sd, Prd, Ubx, Zen and En) are found to be preferentially regulating the miRNA genes. Further analysis revealed many over-represented cis-regulatory modules (CRMs) composed of multiple known TFBSs, motif pairs with significant distance constraints and a number of novel motifs, many of which preferentially occur near the transcription start site of protein-coding genes. Additionally, a number of putative miRNA-TF regulatory feedback loops were also detected. Availability: Supplementary Material and the Perl scripts performing two-step phylogenetic footprinting are available at http://bioinfo.au.tsinghua.edu.cn/member/xwwang/mircisreg Contact: daulyd@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
Jochen, 10. June 2008:
- manke08modelling
-
Statistical Modeling of Transcription Factor Binding Affinities Predicts Regulatory Interactions
T. Manke and H. G. Roider and M. Vingron
PLoS Comput Biol
4
e1000039
(2008)
http://dx.doi.org/10.1371%2Fjournal.pcbi.1000039
Recent experimental and theoretical efforts have highlighted the fact that binding of transcription factors to DNA can be more accurately described by continuous measures of their binding affinities, rather than a discrete description in terms of binding sites. While the binding affinities can be predicted from a physical model, it is often desirable to know the distribution of binding affinities for specific sequence backgrounds. In this paper, we present a statistical approach to derive the exact distribution for sequence models with fixed GC content. We demonstrate that the affinity distribution of almost all known transcription factors can be effectively parametrized by a class of generalized extreme value distributions. Moreover, this parameterization also describes the affinity distribution for sequence backgrounds with variable GC content, such as human promoter sequences. Our approach is applicable to arbitrary sequences and all transcription factors with known binding preferences that can be described in terms of a motif matrix. The statistical treatment also provides a proper framework to directly compare transcription factors with very different affinity distributions. This is illustrated by our analysis of human promoters with known binding sites, for many of which we could identify the known regulators as those with the highest affinity. The combination of physical model and statistical normalization provides a quantitative measure which ranks transcription factors for a given sequence, and which can be compared directly with large-scale binding data. Its successful application to human promoter sequences serves as an encouraging example of how the method can be applied to other sequences.