cWords - systematic microRNA regulatory motif discovery from mRNA expression data
© Rasmussen et al.; licensee BioMed Central Ltd. 2013
Received: 16 January 2013
Accepted: 24 April 2013
Published: 20 May 2013
Post-transcriptional regulation of gene expression by small RNAs and RNA binding proteins is of fundamental importance in development of complex organisms, and dysregulation of regulatory RNAs can influence onset, progression and potentially be target for treatment of many diseases. Post-transcriptional regulation by small RNAs is mediated through partial complementary binding to messenger RNAs leaving nucleotide signatures or motifs throughout the entire transcriptome. Computational methods for discovery and analysis of sequence motifs in high-throughput mRNA expression profiling experiments are becoming increasingly important tools for the identification of post-transcriptional regulatory motifs and the inference of the regulators and their targets.
cWords is a method designed for regulatory motif discovery in differential case–control mRNA expression datasets. We have improved the algorithms and statistical methods of cWords, resulting in at least a factor 100 speed gain over the previous implementation. On a benchmark dataset of 19 microRNA (miRNA) perturbation experiments cWords showed equal or better performance than two comparable methods, miReduce and Sylamer. We have developed rigorous motif clustering and visualization that accompany the cWords analysis for more intuitive and effective data interpretation. To demonstrate the versatility of cWords we show that it can also be used for identification of potential siRNA off-target binding. Moreover, cWords analysis of an experiment profiling mRNAs bound by Argonaute ribonucleoprotein particles discovered endogenous miRNA binding motifs.
cWords is an unbiased, flexible and easy-to-use tool designed for regulatory motif discovery in differential case–control mRNA expression datasets. cWords is based on rigorous statistical methods that demonstrate comparable or better performance than other existing methods. Rich visualization of results promotes intuitive and efficient interpretation of data. cWords is available as a stand-alone Open Source program at Github https://github.com/simras/cWords and as a web-service at: http://servers.binf.ku.dk/cwords/.
KeywordsMicroRNA siRNA RNA binding proteins Motif discovery Post-transcriptional regulation
MicroRNAs (miRNAs) are endogenous small regulatory RNAs of size approximately 22 nucleotides. miRNAs, bound by the RNA induced silencing complex (RISC), repress gene and protein expression post-transcriptionally. miRNA targeting and binding of complementary messenger RNA (mRNA) sequences - often in the 3′ untranslated regions (UTRs) - generally leads to target mRNA degradation [1–3]. Perfect base-pairing between nucleotide 2 to 8 of the mature miRNA (the seed) and the mRNA target site plays an essential role , but cannot alone explain the full regulatory potential of miRNAs .
The function of a miRNA in a given cellular context can be studied experimentally by analyzing changes in mRNA expression after miRNA inhibition [5, 6] or overexpression [1, 2]. When interpreting data from such experiments it is important to establish that the miRNA was successfully and efficiently perturbed leading to change in expression of target mRNAs. This can be achieved by showing differential regulation of the predicted target mRNAs  or by showing seed site enrichment using unbiased 3′UTR motif analysis of differentially expressed genes [7–10]. An unbiased motif analysis may have additional advantages as a standard tool when analyzing miRNA perturbation experiments. For example, miRNA target prediction methods may not detect non-canonical target motifs specific to the perturbed miRNA, and systematic analysis of miRNA perturbation experiments has shown that in addition to miRNA seed sites, other 3′UTR motifs, some corresponding to known binding sites of RNA binding proteins (RNA-BPs), can also be predictive of the observed mRNA expression changes . There is therefore a need for computational methods that allow for unbiased and systematic analysis of mRNA sequence motifs in miRNA perturbation experiments to confirm effective experimental perturbation and to explore regulatory sequence elements other than established miRNA binding sites.
Motif discovery has a long history in bioinformatics , in particular for analysis of transcription factor binding sites . There are many different approaches to motif discovery. Most use a fixed set of sequences and identify motifs that are overrepresented in this set compared to a Markov chain background model (Gibbs Sampler , MEME , and Weeder ). Other methods do discriminative analysis, where the goal is to identify motifs that are over-represented in a positive set compared to a negative or background set of sequences (DEME  and ). However often we are dealing with transcriptome-wide measurements of gene expression, and a priori it is difficult to set a natural cut-off that defines the positive (or negative) set.
Recently, methods for identifying correlations of word occurrences in mRNA sequences and transcriptome-wide changes in gene expression have been developed. miReduce  and Sylamer  are two such methods designed for unbiased analysis of miRNA regulation in mRNA 3′UTR sequences (and for analyses of other types of gene regulation). miReduce uses a stepwise linear regression model to estimate the words that best explain the observed gene expression changes. Sylamer computes word enrichment based on a hyper-geometric test of word occurrences in a ranked list of sequences. Sylamer is computationally efficient and allows for bin-wise 3′UTR sequence composition bias correction.
Here we present cWords, a method for correlating word enrichment in mRNA sequences and changes in mRNA expression. It permits for correction of sequence composition bias for each individual sequence and is based on methods developed in . By development of robust and efficient parametric statistics, cWords offers a factor 100 to 1000 speed gain over the previous permutation-based framework. An exhaustive 7mer word analysis of a gene-expression dataset can be completed in less than 10 minutes mainly due to efficient approximations of statistical tests, and the parallelized implementation that enables full utilization of multicore computer resources.
cWords includes methods for clustering and visualization of enriched words with similar sequences that can aid exploratory analysis of enriched words and degenerate motifs such as noncanonical miRNA binding sites and RNA-BP binding sites. We show that cWords is effective for analyzing miRNA binding and regulation in miRNA overexpression and inhibition experiments, and we demonstrate how cWords can be used to identify enrichment of other types of regulatory motifs in such experiments. We demonstrate that miReduce, Sylamer, and cWords exhibit comparable performance on a panel of miRNA perturbation experiments. Finally, we demonstrate how cWords can be used to identify potential siRNA off-target binding and regulation in RNAi experiments, and to discover endogenous miRNA binding sites in an experiment profiling mRNAs bound by Argonaute ribonucleoprotein.
Results and discussion
We have developed an efficient enumerative motif discovery method that can be used for extracting correlations of differential expression and motif occurrences. In brief, sequences are ranked by fold change of expression, and motifs (words) are correlated with gene ranks. Unlike other methods, cWords can detect subtle correlations of words only present in few sequences due to sequence specific background models. The rigorous statistical framework allows for simultaneous analysis of multiple word lengths, and words are clustered into motifs presented in plots providing both overview and in-depth information for interpretation.
The summary plots of cWords
The gene rank in the enrichment profile plot at which the global maximum enrichment score is obtained is termed the enrichment specificity (ES) index. A low ES index is indicative of a specific enrichment signal corresponding to enrichment of a motif in a small set of strongly differentially expressed genes. Oppositely, a high ES index reflects that the word enrichment was found for a larger set of less differentially expressed genes. Words enriched in sets of genes with a large intersection will tend to exhibit similar enrichment profiles and have ES indices that are numerically close. For example, variants of miRNA target sequences (seed sites with 1 or 2 nucleotides offsets) tend to have similar ES indices when analyzing miRNA overexpression experiments (Figure 1A).
The enrichment profile plot provides a lot of detail for individual words, but is also limited by the number of words that can be effectively summarized in the same plot, which may be an important factor in the discovery phase of a motif analysis. For this purpose we developed the word cluster plot (Figure 1B). This plot shows the maximum enrichment score versus the ES index for all words, and it displays word relationships found through word similarity clustering. We found that this type of plot produces a simple and yet informative summary for miRNA perturbation experiments. For example, when analyzing expression changes after miR-9 overexpression in HeLa cells, the word with strongest enrichment in 3′UTRs of downregulated genes corresponds to the 7mer seed site of miR-9 (Figure 1B). Several shifted variants of the seed site also show enrichment in the plot highlighting the preference for sites with a flanking adenosine. Furthermore, the plot reveals significant enrichment for certain T-rich motifs (including TTTTAAA, DNA-alphabet was used with T instead of U), which were also reported in our previous study . The word cluster plot can therefore provide a rich and unbiased summary for exploration of regulatory motifs associated with gene expression changes.
cWords analysis of miRNA target sites in coding regions of mRNAs
cWords identifies siRNA off-target effects
cWords analysis of endogenous miRNA binding sites in HEK293 cells
Comparison to miReduce and Sylamer
The performance of cWords was compared to two other methods, miReduce and Sylamer, on the task of identifying seed site binding in mRNA 3′UTRs in a panel of 18 miRNA transfection experiments and one miRNA inhibition.
miReduce uses a stepwise linear regression estimation procedure and does not compute scores for all words of a given length - only the most significant word among a group of strongly correlated words will be included in the model and summarized in the output. Words of different lengths cannot be compared by the Sylamer statistic. Due to these issues we compare performance of the three methods by computing enrichment for all 7mers in each miRNA perturbation experiment. We report the rank of the highest ranking word that is identical to the reverse complement of the canonical A1 7mer seed (identity in positions 2 to 7, with preferentially an A in position 1) or the canonical m8 7mer seed (identity in positions 2 to 8) of the transfected miRNA .
For miReduce, Sylamer and cWords, we found that the top-ranked word corresponded to the seed site of the perturbed miRNA in 13 of the 19 experiments [see Table 2 in Additional file 1]. In six experiments the results diverged. For transfection of miR-133a, the top word had an overlap of the six rightmost characters with the six leftmost in the m8 7mer canonical seed site, for all methods. This most likely reflects the biological reality that miRNAs under certain conditions bind in atypical ways. Another exception was in the inhibition of miR-21 , where Sylamer ranked a 7mer seed site as number 12 and all higher ranking words were not similar to the seed site. In the other four experiments Sylamer did not rank a 7mer seed site as the first word.
This serves as a demonstration that the three methods are able to find and discriminate the seed motif in datasets where this is expected to be the strongest signal. In five cases cWords performed better than Sylamer, but generally the performance of the three methods was very similar under these benchmarking conditions. The parallel implementation is an advantage of cWords over the other methods. Using four cores cWords finished an analysis on average approximately five times faster than Sylamer and approximately two times faster than miReduce and using 40 cores cWords was up to 20 times faster. In both cases, Sylamer was run disabling approximations to not compromise precision. If a larger window size is used, Sylamer is faster than the other methods. For more details on which data was used in the comparison see Supplementary methods in Additional file 1.
We have presented cWords, which finds overrepresented words in sets of DNA (or RNA) sequences. Contrary to most other methods, it uses a sensitive statistics that takes the individual sequence composition into account. cWords can rank words across different word lengths and uses clustering to group similar words. cWords outputs multiple summary plots and tables, which in combination provide both an overview and detailed information for in depth analysis of the results.
cWords is designed for analysis of experiments in which gene expression is measured after perturbation of a miRNA. We have shown cWords successfully identifies seed sites as the highest-ranking words in such experiments. Furthermore, we have shown that cWords can identify likely off-target effects of siRNAs mediated by miRNA-like binding of 3′UTRs, and that binding motifs of endogenous miRNAs can be identified from Argonaute immunoprecipitation data.
We conducted a comparative study of cWords, miReduce and Sylamer on published datasets from 19 miRNA transfection and miRNA knockdown experiments. No single method was notably better than the others, and overall the performance of cWords, miReduce, and Sylamer was very good for the specific application of identifying seed sites as high-ranking motifs.
The word cluster plot of cWords provides a summary and a way to associate words among the highest-ranking words. An advantage of both miReduce and cWords is that they can statistically evaluate and compare enrichment for motifs of different lengths. Sylamer can only be used for words of the same length in an analysis and results from analyses of different word lengths are not directly comparable. Sylamer is a fast tool, but actually this is only the case when a large ‘window size’ is used, however, the speedup resulting from a large window size comes at the expense of a less precise background model.
We have strived to make cWords user friendly, and it offers the flexibility of a downloadable Open Source program rich in features as well as the simplicity and ease of use of the cWords web server.
cWords is an exact method, in which all words of a given length are counted in the sequences. Based on these word frequencies, enrichment scores (scores of over-representation) are calculated for each word in each sequence by a binomial model with a kth-order Markov Model that corrects for composition bias in each sequence. Enrichment scores are summarized and enrichment profiles normalized in a Kolmogorov like statistics used for ranking and discriminating regulatory words from non-regulatory.
Scoring word overrepresentation in individual sequences
where p = P k (W) is the probability of observing m occurrences of the word W in a sequence (calculated by equation 1). In the original implementation of cWords the expected frequency of a word in a sequence was estimated by shuffling it. The above probability was calculated as the fraction of shuffles where m or more instances of the word would occur.
Evaluating word enrichment in a ranked list of sequences
Clustering words into motifs
Signals of regulatory sites typically surface as degenerate motifs and not as single words. To also facilitate analysis of motifs in cWords, the most significant words are clustered into motifs. The algorithm developed for word clustering is based on the UPGMA algorithm . In this implementation of UPGMA, association of two words is inferred by ungapped local alignment. An alignment of two words is scored by the number of matches minus the number of mismatches. The highest scoring ungapped alignment is found and the score is normalized dividing by the length of the shortest word to control for score biases when comparing words of different lengths. This score is used for clustering.
Human lung cancer cell line
Human colon cancer cell line
- ES index:
Enrichment specificity index
False discovery rate
Human colon cancer cell line
Human embryonic kidney cell line
Human cervical cancer cell line
RNA induced silencing complex
RNA binding protein
Small interfering RNA
Human ovary cancer cell line
Unweighted pair group method using arithmetic averages
3′ end untranslated region
The authors would like to thank Line Skotte for her contribution in the development of an approximation to a statistical test and Jan Teichmann who contributed in setting up the web server. Peter Menzel is thanked for his valuable advice on specific problems in design of the webserver application. Moreover Line Skotte and Mireya Plass are thanked for their comments on the manuscript.
This project was supported by grants from the Danish Strategic Research Council and from the Novo Nordisk Foundation. AJ was supported by a grant from the Danish Research Council.
- Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM: Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005, 7027: 769-773.View ArticleGoogle Scholar
- Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27: 91-105. 10.1016/j.molcel.2007.06.017.PubMed CentralView ArticlePubMedGoogle Scholar
- Bartel DP: MicroRNAs: Target Recognition and Regulatory Functions. Cell. 2009, 136: 215-233. 10.1016/j.cell.2009.01.002.PubMed CentralView ArticlePubMedGoogle Scholar
- Betel D, Koppal A, Agius P, Sander C, Leslie C: Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol. 2010, 11: R90-10.1186/gb-2010-11-8-r90.PubMed CentralView ArticlePubMedGoogle Scholar
- Frankel LB, Christoffersen NR, Jacobsen A, Lindow M, Krogh A, Lund AH: Programmed cell death 4 (PDCD4) is an important functional target of the microRNA miR-21 in breast cancer cells. J Biol Chem. 2008, 283: 1026-1033. 10.1074/jbc.M707224200.View ArticlePubMedGoogle Scholar
- Krützfeldt J, Rajewsky N, Braich R, Rajeev KG, Tuschl T, Manoharan M, Stoffel M: Silencing of microRNAs in vivo with “antagomirs”. Nature. 2005, 438: 685-689. 10.1038/nature04303.View ArticlePubMedGoogle Scholar
- Jacobsen A, Wen J, Marks DS, Krogh A: Signatures of RNA binding proteins globally coupled to effective microRNA target sites. Genome Res. 2010, 20: 1010-1019. 10.1101/gr.103259.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Sood P, Krek A, Zavolan M, Macino G, Rajewsky N: Cell-type-specific signatures of microRNAs on target mRNA expression. PNAS. 2006, 103: 2746-2751. 10.1073/pnas.0511045103.PubMed CentralView ArticlePubMedGoogle Scholar
- Dongen SV, Abreu-Goodger C, Enright AJ: Detecting microRNA binding and siRNA off-target effects from expression data. Nat Methods. 2008, 5: 1023-1025. 10.1038/nmeth.1267.PubMed CentralView ArticlePubMedGoogle Scholar
- Gregersen LH, Jacobsen AB, Frankel LB, Wen J, Krogh A, Lund AH: MicroRNA-145 targets YES and STAT1 in colon cancer cells. PLoS One. 2010, 5: e8836-10.1371/journal.pone.0008836.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL: Discovering sequence motifs. Methods Mol Biol. 2008, 452: 231-251. 10.1007/978-1-60327-159-2_12.View ArticlePubMedGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.View ArticlePubMedGoogle Scholar
- Lawrance CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.View ArticleGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.PubMedGoogle Scholar
- Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001, 17: 207-214. 10.1093/bioinformatics/17.suppl_1.S207.View ArticleGoogle Scholar
- Redhead E, Bailey TL: Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics. 2007, 8: 385-10.1186/1471-2105-8-385.PubMed CentralView ArticlePubMedGoogle Scholar
- Valen E, Sandelin A, Winther O, Krogh A: Discovery of regulatory elements is improved by a discriminatory approach. PLoS Comput Biol. 2009, 5: e1000562-10.1371/journal.pcbi.1000562.PubMed CentralView ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov J: Gene set enrichment: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.PubMed CentralView ArticlePubMedGoogle Scholar
- Schnall-Levin M, Rissland OS, Johnston WK, Perrimon N, Bartel DP, Berger B: Unusually effective microRNA targeting within repeat-rich coding regions of mammalian mRNAs. Genome Res. 2011, 21: 1395-1403. 10.1101/gr.121210.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Zeng Y, Yi R, Cullen BR: MicroRNAs and small interfering RNAs can inhibit mRNA expression by similar mechanisms. Proc Natl Acad Sci USA. 2003, 100: 9779-9784. 10.1073/pnas.1630797100.PubMed CentralView ArticlePubMedGoogle Scholar
- Saxena S, Jónsson ZO, Dutta A: Small RNAs with Imperfect Match to Endogenous mRNA Repress Translation. J Biol Chem. 2003, 1278: 44312-44319.View ArticleGoogle Scholar
- Doench JG, Petersen CP, Sharp PA: siRNAs can function as miRNAs. Genes Dev. 2003, 17: 438-442. 10.1101/gad.1064703.PubMed CentralView ArticlePubMedGoogle Scholar
- Jackson AL, Burchard J, Schelter J, Chau BN, Cleary M, Lim L, Linsley PS: Widespread siRNA “off-target” transcript silencing mediated by seed region sequence complementarity. RNA. 2006, 12: 1179-1187. 10.1261/rna.25706.PubMed CentralView ArticlePubMedGoogle Scholar
- Jackson AL, Bartz SR, Schelter J, Kobayashi SV, Burchard J, Mao M, Li B, Cavet G, Linsley PS: Expression profiling reveals off-target gene regulation by RNAi. RNA. 2003, 21: 635-637.Google Scholar
- Jackson AL, Burchard J, Leake D, Reynolds A, Schelter J, Guo J, Johnson JM, Lim L, Karpilow J, Nichols K, Marshall W, Khvorova A, Linsley PS: Position-specific chemical modification of siRNAs reduces “off-target” transcript silencing. RNA. 2006, 12: 1197-1205. 10.1261/rna.30706.PubMed CentralView ArticlePubMedGoogle Scholar
- Landthaler M, Gaidatzis D, Rothballer A, Chen PY, Soll SJ, Dinic L, Ojo T, Hafner M, Zavolan M, Tuschl T: Molecular characterization of human Argonaute-containing ribonucleoprotein complexes and their bound target mRNAs. RNA. 2008, 14: 2580-2596. 10.1261/rna.1351608.PubMed CentralView ArticlePubMedGoogle Scholar
- Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice A, Kamphorst AO, Landthaler M, Lin C, Socci ND, Hermida L, Fulci V, Chiaretti S, Foà R, Schliwka J, Fuchs U, Novosel A, Müller RU, Schermer B, Bissels U, Inman J, Phan Q, Chien M, Weir DB, Choksi R, De Vita G, Frezzetti D, Trompeter HI: A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell. 2007, 129: 1401-1414. 10.1016/j.cell.2007.04.040.PubMed CentralView ArticlePubMedGoogle Scholar
- Schbath S: On Overview of the Distribution of Word Counts in Markov Chains. J Comput Biol. 2000, 7: 193-201. 10.1089/10665270050081469.View ArticlePubMedGoogle Scholar
- Feller W: The asymptotic distribution of the range of sums of independent random variables. Ann Mathematical Stat. 1951, 22: 427-432. 10.1214/aoms/1177729589.View ArticleGoogle Scholar
- Mises RV: Mathematical theory of probability and statistics. 1964, New York: AcademicGoogle Scholar
- Marsaglia G, Tsang WW, Wang J: Evaluating Kolmogorov’s Distribution. J Stat Software. 2003, 08 (18): 1-4.View ArticleGoogle Scholar
- Durbin R, Eddy S, Krogh A: Biological Sequence Analysis. 2007, Cambridge, UK: Cambridge University PressGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.