SVM model  

microRNAs (miRNAs) are small non-coding RNAs that act as regulators, leading to either mRNA cleavage or translational repression by hybridizing to the 3’-untranslated regions (3’-UTRs) of their target genes. This negative regulatory mechanism at the post-transcriptional level gives miRNAs key roles in controlling diverse biological processes such as carcinogenesis, cellular proliferation, and differentiation. Moreover, miRNAs are associated with human diseases, especially cancers.


Because miRNAs play important roles in pathology, their targeting and transcriptional regulation have become critical issues in RNA research. Although miRNA target prediction has advanced remarkably in recent years, the transcriptional regulation of miRNAs is still inadequately understood.


To understand the transcriptional regulation of miRNAs, identifying the genuine transcriptional start sites (TSSs) of miRNA genes demands immediate attention. For most intragenic miRNAs, which are harbored in annotated genes, the regulatory mechanism coincides with that of their host genes (Figure 1). This implies that intragenic miRNAs and their host genes share common TSSs and are expressed simultaneously. For intergenic miRNAs, however, the distance between the promoter region and the precursor varies dramatically, which limits the reliability of predicted results. Moreover, because miRNAs are expressed at lower levels, it is more difficult to perform experimental validation and to obtain full-length cDNAs of miRNA primary transcripts than it is for coding genes. Therefore, a systematic approach to accurately identifying miRNA TSSs is necessary to solve these problems.

Figure 1. The biogenesis of microRNA.
Features used in identification of microRNA TSS
Histone methylation
Many studies indicate that histone methylation can influence gene expression. ChIP-Seq, which combines chromatin immunoprecipitation with massively parallel sequencing, performs well in detecting chromatin modifications and offers high-resolution profiling of histone methylation in the human genome. For example, an enriched peak of H3K4me3 (histone H3 trimethylated at its lysine 4 residue) is found around TSSs and is positively correlated with gene expression, regardless of whether the genes are productively transcribed. This characteristic helps scientists determine putative TSSs reliably.
Support from evidence sequences

Besides ChIP-Seq data on histone methylation, many experimentally derived evidence sequences also provide useful references for TSS prediction. Cap-analysis gene expression (CAGE) tags are ~20-nt sequences derived from the 5’ ends of cDNAs in human and mouse. CAGE tags are generated using a biotinylated cap-trapper with a specific linker sequence, which ensures that the sequence immediately after the 5’ cap of the cDNA is preserved. Because RNA polymerase II transcripts have 5’ cap structures, CAGE tags contain the first base of the 5’ terminal sequence, that is, the transcription start site of the RNA polymerase II transcript. In this work, we integrated the FANTOM4 CAGE tag database, in which CAGE tags are clustered into CAGE tag start sites (CTSSs). We mapped the CTSSs to the upstream flanking regions of intergenic miRNA precursors to determine the transcriptional start sites.
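The mapping step described above can be illustrated with a minimal Python sketch. It is not the actual miRStart pipeline: the record types `Precursor` and `Tag`, the function names, and the 1-based inclusive coordinates are all hypothetical, and the 50 kb flank is borrowed from the scanning range mentioned later on this page. The sketch only shows the strand-aware idea, namely that "upstream" lies at lower coordinates for a plus-strand precursor and at higher coordinates for a minus-strand precursor.

```python
from typing import List, NamedTuple, Tuple


class Precursor(NamedTuple):
    """Hypothetical miRNA precursor record (1-based, inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


class Tag(NamedTuple):
    """Hypothetical evidence tag (for example a CTSS) with its tag count."""
    chrom: str
    pos: int
    strand: str
    count: int


def upstream_window(p: Precursor, flank: int = 50_000) -> Tuple[int, int]:
    """Return the (low, high) coordinates of the upstream flanking region."""
    if p.strand == "+":
        # Upstream of a plus-strand gene lies at lower coordinates.
        return max(1, p.start - flank), p.start - 1
    # Upstream of a minus-strand gene lies at higher coordinates.
    return p.end + 1, p.end + flank


def tags_upstream(p: Precursor, tags: List[Tag], flank: int = 50_000) -> List[Tag]:
    """Keep same-chromosome, same-strand tags inside the upstream window,
    ordered from nearest to farthest from the precursor's 5' end."""
    low, high = upstream_window(p, flank)
    five_prime = p.start if p.strand == "+" else p.end
    hits = [
        t for t in tags
        if t.chrom == p.chrom and t.strand == p.strand and low <= t.pos <= high
    ]
    return sorted(hits, key=lambda t: abs(t.pos - five_prime))
```

A list comprehension is enough for a handful of precursors; for genome-wide tag sets, an interval index or sorted positions with binary search would probably be a better fit.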

Moreover, Illumina Solexa tags are derived using a next-generation, high-throughput sequencing technology in which DNA templates are immobilized on a special surface and sequenced with fluorescently labeled nucleotides and a specific enzyme. We incorporated two TSS Seq data sets from DBTSS, which were generated by the oligo-capping method and Solexa sequencing technology. The genomic positions of TSS Seq tags in DBTSS can be mapped directly to the upstream flanking regions of intergenic miRNA precursors to provide supporting evidence for the detection of miRNA TSSs.

Systematic identification of human miRNA TSSs

To efficiently identify the TSSs of 940 human miRNAs using a computational approach, an SVM model was developed to systematically select the representative TSSs for each miRNA gene. The performance of the model was estimated by 5-fold cross-validation: Sensitivity = 90.36%, Specificity = 90.05%, Accuracy = 90.21% and Precision = 90.08%. After scanning the 50 kb upstream region of each miRNA precursor with the SVM model and then applying a filtering process, miRStart provides the top 5 representative TSSs for each intergenic miRNA gene. For intragenic miRNA genes, miRStart officially uses the host gene start as the TSS, but the top 5 representative TSSs identified by the SVM model are still provided because several studies have indicated that intragenic miRNA genes may have their own promoters.
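For readers unfamiliar with the four reported measures, the following Python sketch shows how they are conventionally computed from confusion-matrix counts, together with a simple way to keep the highest-scoring candidates. It is an illustration only: the function names, the `(position, score)` candidate format, and the assumption that a higher score means a more likely TSS are hypothetical, and the SVM features and the filtering process used by miRStart are not reproduced here.

```python
from typing import Dict, List, Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard definitions of the four measures; assumes non-zero denominators."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all real TSSs
        "specificity": tn / (tn + fp),   # true negatives among all non-TSSs
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),     # true positives among predicted TSSs
    }


def top_candidates(
    scored: List[Tuple[int, float]], k: int = 5
) -> List[Tuple[int, float]]:
    """Return the k candidates with the highest score.

    `scored` holds hypothetical (genomic position, classifier score) pairs;
    a higher score is assumed to mean a more likely TSS.
    """
    return sorted(scored, key=lambda c: c[1], reverse=True)[:k]
```

In a 5-fold cross-validation, the counts from the five held-out folds would typically be pooled or the per-fold measures averaged; which of the two was used for the figures above is not stated here.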
