microRNA target predictor
News: miRPredictor has been released.

Overview

    MicroRNAs (miRNAs) are important post-transcriptional regulators of gene expression. To understand the function of a miRNA, it is necessary to identify its target genes. Here we developed a novel miRNA target predictor based on the support vector machine (SVM), a widely used machine learning approach, combined with a feature selection procedure. We considered several types of features, including the flanking sequences of the potential targets and pattern information. The selected features were also analyzed to investigate the intrinsic mechanism of miRNA-target interaction. The website of our predictor is http://bis.zju.edu.cn/mirpredictor/.

Method

Dataset

    In our study, data were extracted from miRecords, a database of experimentally identified interactions between miRNAs and mRNAs developed by Xiao et al. in 2008. The first version of the Drosophila melanogaster (D. melanogaster) and human records in miRecords was used, containing 121 and 1311 entries, respectively. After excluding redundant and incomplete examples, we obtained 278 validated miRNA-target pairs: 83 from D. melanogaster and 195 from human. These were used as the positive samples in this study.

    To obtain negative samples, data were extracted manually from the literature listed by miRecords. MiRNA-target site pairs that were experimentally shown to be non-regulatory were collected first. To improve specificity, we inferred additional negative samples by considering the interaction between miRNA and mRNA after site mutation. This gave a negative sample set of 194 examples: 30 from D. melanogaster and 164 from human.

    With these samples, we constructed three datasets: the complete dataset with all samples, the fruit fly dataset with the 113 samples from D. melanogaster, and the human dataset with the 359 samples from human.

Support Vector Machine (SVM)

    In our study, an SVM was used to classify each miRNA-target site candidate as regulatory or non-regulatory. SVM is one of the most popular machine learning approaches and is used in many fields, including biological research. It constructs an optimal hyperplane in the feature space that maximizes the separation between the classes of samples, and predicts the class of a new sample by mapping it into the same feature space. To implement nonlinear classification, SVMs implicitly map feature vectors into a high-dimensional, nonlinear feature space, using a kernel function to calculate the similarity between samples in acceptable time. This study used the radial basis function (RBF) as the kernel function:

    $$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where x_i and x_j are the two feature vectors to be compared, and γ is the parameter determining the similarity level of features.
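The RBF kernel can be sketched in a few lines of plain Python. This is only an illustration of the formula K(x_i, x_j) = exp(-γ ||x_i - x_j||^2); the function name `rbf_kernel` is ours, and in practice the kernel is evaluated inside the SVM package rather than by hand.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Return exp(-gamma * ||x_i - x_j||^2) for two equal-length vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a similarity of 1, and the value decays toward 0 as the vectors move apart; a larger γ makes the decay faster.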

    The SVM package LIBSVM was used to construct our SVM predictor. It was developed by Chang et al. and is widely used in many areas. The package can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
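LIBSVM reads training data as text lines of the form `<label> <index>:<value> ...`, with feature indices starting at 1 and zero-valued features optionally omitted. As a minimal sketch, a feature vector could be written in that format as follows. The helper name `to_libsvm_line` and the label convention (+1 for regulatory, -1 for non-regulatory) are assumptions made for illustration, not details taken from the paper.

```python
def to_libsvm_line(label, features):
    """Format one sample as a sparse LIBSVM line (1-based feature indices)."""
    parts = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value != 0  # zero entries may be left out of the sparse format
    ]
    return " ".join([f"{label:+d}"] + parts)
```

For example, `to_libsvm_line(1, [0.5, 0, 1])` gives the line `+1 1:0.5 3:1`.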

Feature Vector Construction

    For an SVM classifier, each sample must be represented by a feature vector that covers all aspects of the interaction between miRNAs and mRNAs. In this study, the features fall into 6 groups: structural features, thermodynamic features, position-based features, compositional features, secondary-structure features and pattern-based features. The first three groups are also considered by miTarget, while the rest are novel ones introduced for this predictor.

    Structural and thermodynamic features describe characteristics of the binding between miRNAs and their target sites. Structural features are the percentages of matches, mismatches, G:C matches, A:U matches, G:U wobble pairs and other mismatches in each of the five parts we considered: the 5' part (seed part) of the binding site alignment, the 3' part of the binding site alignment, the total alignment of the binding site, the total alignment between the miRNA and the 5' flanking sequence of the binding site, and the total alignment between the miRNA and the 3' flanking sequence of the binding site. The first three parts are identical to those used by miTarget, while the last two are novel. Previous studies indicate cooperation between adjacent miRNA binding sites, so if another possible binding site is available in the flanking sequences, the tested binding site is more likely to be a real one. Thermodynamic features are analogous to the structural ones and give the free energy values of the five alignment structures described above. Both structural and thermodynamic features are calculated with RNAduplex, one of the programs provided by the Vienna RNA Package.
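As a rough sketch of how the six structural percentages for one alignment part might be computed, assume the alignment is already available as a list of (miRNA nucleotide, target nucleotide) pairs. That input format and the function names are hypothetical, and parsing RNAduplex output is not shown. The sketch also assumes that "matches" means G:C plus A:U pairs, so that G:U wobble pairs are counted among the mismatches; the text does not state this explicitly.

```python
from collections import Counter


def classify_pair(a, b):
    """Name the pair type formed by two aligned nucleotides."""
    pair = frozenset((a.upper(), b.upper()))
    if pair == frozenset("GC"):
        return "GC"
    if pair == frozenset("AU"):
        return "AU"
    if pair == frozenset("GU"):
        return "GU"
    return "other"


def structural_features(aligned_pairs):
    """Return the six pair-type fractions for one alignment part."""
    names = ("match", "mismatch", "GC", "AU", "GU", "other")
    total = len(aligned_pairs)
    if total == 0:
        return {name: 0.0 for name in names}
    counts = Counter(classify_pair(a, b) for a, b in aligned_pairs)
    matches = counts["GC"] + counts["AU"]       # Watson-Crick pairs only
    mismatches = counts["GU"] + counts["other"]  # wobble + all other pairs
    values = (matches, mismatches, counts["GC"], counts["AU"],
              counts["GU"], counts["other"])
    return {name: value / total for name, value in zip(names, values)}
```

Calling the function once for each of the five alignment parts would give 5 x 6 structural features under these assumptions.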

    Position-based features were first introduced by miTarget to imitate the shape and mechanism of seed pairing. They describe the pairing state at each specific position of the miRNA. Each position is represented by a three-dimensional vector whose components indicate an A:U match, a C:G match and a G:U wobble pair, respectively. An A:U match is encoded as "1,0,0", a C:G match as "0,1,0", a G:U wobble pair as "0,0,1", and a mismatch as "0,0,0". The first 20 nt of the given miRNA are considered, so 60 features are generated.
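The position-based encoding can be illustrated with a short sketch. The input format is an assumption: `paired` is a hypothetical string giving, for each miRNA position, the target nucleotide aligned to it (or "-" for a gap). Positions beyond the end of either string are treated as mismatches here, which is also an assumption.

```python
# One-hot codes for the pair types named in the text; anything else is a mismatch.
PAIR_CODES = {
    frozenset("AU"): (1, 0, 0),  # A:U match
    frozenset("CG"): (0, 1, 0),  # C:G match
    frozenset("GU"): (0, 0, 1),  # G:U wobble pair
}
MISMATCH = (0, 0, 0)


def encode_positions(mirna, paired, length=20):
    """Encode the first `length` miRNA positions as 3 * length binary features."""
    features = []
    for i in range(length):
        if i < len(mirna) and i < len(paired):
            pair = frozenset((mirna[i].upper(), paired[i].upper()))
        else:
            pair = frozenset()  # no nucleotide available: treated as mismatch
        features.extend(PAIR_CODES.get(pair, MISMATCH))
    return features
```

With the default `length=20` the result always has 60 entries, matching the feature count given above.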

    Previous studies show that miRNA binding sites have a specific nucleotide composition that cannot be clearly explained by known mechanisms. Nucleotide composition is also widely used in other nucleic acid research. In our study, we considered the content of each nucleotide in the five parts of the miRNA binding site, defined in the same way as for the structural and thermodynamic features.
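A minimal sketch of the nucleotide content of one sequence part is given below. The function name is ours, and converting T to U so that DNA-style input is also accepted is a convenience we added, not something described in the text.

```python
def nucleotide_composition(sequence, alphabet="ACGU"):
    """Return the fraction of each nucleotide in `sequence`."""
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input as well
    total = len(seq)
    return {nt: (seq.count(nt) / total if total else 0.0) for nt in alphabet}
```

Applying it to each of the five parts would yield four composition values per part.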

    The secondary structure of the miRNA-binding site is thought to play an important role in regulating the target gene, and it should be thermodynamically stable enough. Many classical miRNA target predictors, such as RNAHybrid, reach their conclusions mainly through thermodynamic analysis of the miRNA-mRNA secondary structure. In our study, each candidate miRNA binding site is treated as a whole together with 100 nt of flanking sequence on each side. Before the miRNA attaches, the binding site and its flanking sequences form their own secondary structure, which was predicted by RNAcofold in the Vienna RNA Package. We counted the percentages of matches, mismatches, A:U matches, C:G matches, G:U wobble pairs and other mismatches as secondary structure features. We also calculated the free energy of the secondary structure before and after miRNA binding with RNAcofold, and included the change of free energy in the binding process. In total, we obtained 6 + 3 = 9 secondary structure features.
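The Vienna RNA Package reports predicted structures in dot-bracket notation, where matching parentheses mark paired bases. As a hedged sketch, the pair types in such a structure could be tallied as below. The function names are ours, and the exact counting rules used for the six percentages are not given in the text. The free energy change is taken as the energy after binding minus the energy before binding, which is an assumed sign convention.

```python
def base_pairs(structure):
    """Return (i, j) index pairs of paired bases from a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def pair_composition(sequence, structure):
    """Count A:U, C:G, G:U and other pairs among the paired bases."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    names = {frozenset("AU"): "AU", frozenset("CG"): "CG", frozenset("GU"): "GU"}
    counts = {}
    for i, j in base_pairs(structure):
        pair = frozenset((sequence[i].upper(), sequence[j].upper()))
        name = names.get(pair, "other")
        counts[name] = counts.get(name, 0) + 1
    return counts


def free_energy_change(energy_before, energy_after):
    """Change of free energy in the binding process (after minus before)."""
    return energy_after - energy_before
```

Dividing each count by the number of pairs, or by the sequence length, would turn the counts into percentages; which denominator was used is not stated in the text.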

    In several previous studies of miRNA target prediction, motifs were extracted from the sequences and used as important features. Most of these studies count "words" in the binding site. This is simple, but such "words" rarely carry significant biological meaning. Rna22 is a miRNA target predictor based on motif discovery. It uses the Teiresias algorithm to discover variable-length motifs in known miRNAs. In this study, we adopted the same method. Using the web server of the Teiresias algorithm at http://cbcsrv.watson.ibm.com/Tspd.html, we obtained 228941 motifs. These motifs have a minimum length of L = 4, have at least 30% of their positions specified (W = 12) and appear at least K = 2 times in the input. Because mRNAs are reverse complementary to miRNAs, the motifs were reversed and complemented to generate target site motifs. We considered four parts of each miRNA binding site: the direct binding site, the 5' flanking sequence of the binding site, the 3' flanking sequence of the binding site, and the binding site together with its flanking sequences. For each part, a valid pattern value is calculated as follows. First, motifs that occur in the given miRNA are selected as valid patterns, and the number of these patterns is counted. Second, the target site motifs corresponding to the valid patterns that occur in this part of the target site are counted. If there are n valid patterns in the miRNA and N corresponding target site motifs in the target site, the valid pattern value of this part of the target site is n/N.
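One possible reading of the valid pattern value is sketched below. Several points are assumptions: motifs are taken to be strings over A, C, G, U with "." as a wildcard position, each motif is counted once if it occurs at all (rather than once per occurrence), and the value is left undefined (`None`) when N = 0, since the text does not say how that case is handled. All function names are ours.

```python
import re

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G", ".": "."}


def reverse_complement(motif):
    """Reverse and complement an RNA motif, keeping wildcard positions."""
    return "".join(COMPLEMENT[ch] for ch in reversed(motif.upper()))


def occurs(motif, sequence):
    """True if the motif matches somewhere in the sequence ('.' = any base)."""
    return re.search(motif.upper(), sequence.upper()) is not None


def valid_pattern_value(motifs, mirna, target_part):
    """Return n / N for one part of the target site, or None if N is 0."""
    valid = [m for m in motifs if occurs(m, mirna)]  # valid patterns
    n = len(valid)
    big_n = sum(1 for m in valid if occurs(reverse_complement(m), target_part))
    return n / big_n if big_n else None
```

For example, with the motifs `["UGAG", "A.G", "CCCC"]`, the miRNA `"UGAGGUAG"` and the target part `"AACUCAAA"`, two motifs are valid patterns and one of their reverse complements (`"CUCA"`) occurs in the target part, so the value is 2.0.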

Classifier Performance Evaluation

    Cross-validation tests and independent dataset tests are widely used in different fields for assessing prediction quality. In our research, three categories of predictors were constructed with the same feature set but different training sets. All predictors were tested with 10-fold cross-validation. For the predictors based on the fruit fly dataset or the human dataset, an independent dataset test using the samples of the other species was also performed.
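The splitting step of 10-fold cross-validation can be sketched as follows. This is a generic illustration, not the procedure actually used in the study: the function name, the fixed random seed and the simple unstratified shuffle are all our assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold
```

Each sample is used for testing exactly once, and the predictor is trained on the remaining nine folds each time.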

    The test results can be described in different ways. Receiver operating characteristic (ROC) analysis, a plot of the true positive rate against the false positive rate and one of the most effective evaluation tools, is used to show the specificity-sensitivity trade-off. The overall accuracy is also calculated for comparison.
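As an illustration of these two measures, the sketch below computes ROC points by sweeping a decision threshold over the predictor's scores, together with the overall accuracy. It assumes that positive samples are labelled 1 and that a higher score means "more likely regulatory"; the function names are ours.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; samples scoring >= threshold are called positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y != 1)
        points.append((fp / negatives, tp / positives))
    return points


def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```

The true positive rate is the sensitivity and the false positive rate is 1 minus the specificity, so each point on the curve is one specificity-sensitivity trade-off.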

Feature Selection

    Features that differ little between the classes of samples have a negative effect on predictor performance. To improve the prediction results, feature selection is used to obtain the optimal feature set. Moreover, analyzing the features in the optimal feature set helps us to better interpret the mechanism of miRNA-mRNA binding.

    In this study, eight feature evaluation algorithms provided by Weka3 were used to rank each feature in the complete feature set based on the complete dataset: Chi-Square Attribute Evaluation, Filtered Attribute Evaluation, Gain Ratio Attribute Evaluation, Information Gain Attribute Evaluation, OneR Attribute Evaluation, ReliefF Attribute Evaluation, SVM Attribute Evaluation and Symmetrical Uncertainty (SU) Attribute Evaluation. Each feature receives a rank in every evaluation, where a smaller rank indicates greater importance. The total rank is defined as the sum of all 8 ranks and is interpreted in the same way as the single ranks.
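The total rank can be sketched as a simple sum over the individual rankings. Here each ranking is assumed to be a list of feature names ordered from most to least important, as might be exported from each evaluator; the function name and the alphabetical tie-breaking are our own choices.

```python
def total_ranks(rankings):
    """Sum each feature's rank over all rankings and order features by the total."""
    totals = {}
    for ordering in rankings:
        for rank, feature in enumerate(ordering, start=1):
            totals[feature] = totals.get(feature, 0) + rank
    # Smaller total rank = more important; ties are broken by feature name.
    ordered = sorted(totals, key=lambda feature: (totals[feature], feature))
    return ordered, totals
```

With eight rankings of the complete feature set as input, `ordered` would be the ranked feature list used in the next step.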

    With the ranked list of all features, the next step is to determine the size of the optimal feature set. The process is similar to Incremental Feature Selection (IFS). By adding features one by one in the order of the list, we obtain N feature sets, where N is the total number of features (here N = 128), and the i-th feature set is:

    $$S_i = \{f_1, f_2, \ldots, f_i\}, \quad 1 \le i \le N$$

where f_i is the i-th feature in the ordered feature list. Based on the complete dataset and the N feature sets, N different SVM-based predictors were constructed, and 10-fold cross-validation was used to test their performance. S_optimal = {f_1, f_2, ..., f_h} is regarded as the optimal feature set if the predictor based on it reaches the highest overall accuracy among all predictors.
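The incremental search over nested feature sets can be sketched as below. The `evaluate` callback is hypothetical: it stands for "train an SVM on these features and return its 10-fold cross-validation accuracy", which is not implemented here. When several set sizes reach the same accuracy, this sketch keeps the smallest one, which is an assumption.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return the best prefix of the ranked feature list and its score."""
    best_size, best_score = 0, float("-inf")
    for i in range(1, len(ranked_features) + 1):
        subset = ranked_features[:i]  # S_i = {f_1, ..., f_i}
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the smallest set on ties
            best_size, best_score = i, score
    return ranked_features[:best_size], best_score
```

A toy call such as `incremental_feature_selection(["f1", "f2", "f3"], lambda s: {1: 0.6, 2: 0.8, 3: 0.7}[len(s)])` returns `(["f1", "f2"], 0.8)`.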

Statistical Evaluation of Features

    To find differences between positive and negative samples, the Kolmogorov-Smirnov test (K-S test) was used to test whether each feature in the optimal feature set is distributed differently in positive and negative samples. The K-S test is a form of minimum distance estimation used as a nonparametric test of equality of one-dimensional probability distributions. It can compare a sample with a reference probability distribution (one-sample K-S test) or compare two samples (two-sample K-S test). The null distribution of the statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case). Here the two-sample K-S test was used. In this case, the K-S statistic is:

    $$D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|$$

where F_{1,n} and F_{2,m} are the empirical distribution functions of the two samples, of sizes n and m.

    The null hypothesis is rejected at level α if:

    $$D_{n,m} > c(\alpha) \sqrt{\frac{n + m}{n m}}, \quad c(\alpha) = \sqrt{-\tfrac{1}{2} \ln \frac{\alpha}{2}}$$
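A minimal sketch of the two-sample K-S test in plain Python follows. It computes the largest gap between the two empirical distribution functions and compares it with the usual large-sample critical value c(α) sqrt((n + m) / (n m)), where c(α) = sqrt(-ln(α/2) / 2). The function names and the default α = 0.05 are our assumptions; the significance level used in the study is not stated here.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_reject(sample_a, sample_b, alpha=0.05):
    """True if the null hypothesis (same distribution) is rejected at level alpha."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > threshold
```

In this setting, `sample_a` and `sample_b` would hold the values of one feature in the positive and the negative samples, respectively.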


Program

Please refer to our paper for more information.


Citation: Zhisong He, Dijun Chen, Kuangyu Wang, Ming Chen* (2012) miRPreditor: a Novel MiRNA Target Predictor Based on SVM with Feature Analysis. Electronic Journal of Biology, 8(4): 79-89.

Contact

If you have any questions or suggestions about miRPredictor, please feel free to contact us: