Prediction of > 3000 novel human microRNAs …
Martin Reczko
ICS/IMBB Bioinformatics Program
Biomedical Informatics Lab
Institute for Computer Science – FORTH
ID     #miRNAs  name
-------------------------------------------------------------------
aga    42       A. gambiae (MOZ2)
ame    26       A. mellifera (AMEL2.0)
ath    117      A. thaliana (RefSeq entries)
cbr    82       C. briggsae (cb25.agp8)
cel    115      C. elegans (WormBase WS140)
cfa    6        C. familiaris (BROADD1)
dme    78       D. melanogaster (BDGP4)
dps    73       D. pseudoobscura (DPSE2.0)
dre    293      D. rerio (WTSI Zv5)
fru    130      F. rubripes (FUGU2.0)
gga    122      G. gallus (WASHUC1)
hsa    325      H. sapiens (NCBI35)
mmu    255      M. musculus (NCBIM34)
osa    123      O. sativa (TIGR 3.0)
ptr    67       P. troglodytes (CHIMP1)
rno    189      R. norvegicus (RGSC3.4)
tni    131      T. nigroviridis (TETRAODON7)
zma    95       Z. mays (TIGR AZM4)
ebv    5        Epstein Barr virus (EMBL:V01555.1)
hcmv   8        Human cytomegalovirus (Refseq:NC_001347.2)
kshv   11       Kaposi sarcoma associated herpesvirus (EMBL:U75698.1)
mghv   9        Mouse gammaherpesvirus 68 (EMBL:U97553.1)
microrna.sanger.ac.uk
Rfam/miRBase 7.1 (October 2005)
used 227 from miRBase 6.0
Negative examples: 3’ UTRs
~9 Mbases, http://www.ensembl.org/BioMart/
Conservation: MultiZ alignments
11111111111111111111111111111111111111110111111111111111111101111111111111111110111111111111111111111111 0
11111011111111111111111111111111111111010111111111111111111111111111110111110110111111111111111111111111 1
11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 2
11111101111111111111111111111111111111111101101011111111111111111111111111111110111111101111111111111111 3
11101001101101111111111111111111111110011111011011111111011011111111001111011100111111101111111111111111 4
11100001101101111111111111111111111110010100001011111111011001111111000111010000111111101111111111111111 5
Conservation rules: number of 1’s in the rows above >= 120, and at least one stretch of 12 consecutive 1’s
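A minimal sketch of how the two conservation rules could be checked on such bit rows. The function name and the reading of the rules (1's summed over all rows, the stretch of 12 searched within a single row) are assumptions, not details given on the slide.

```python
def passes_conservation(rows, min_ones=120, min_run=12):
    """Return True if the conservation bit rows satisfy both rules.

    Assumption: 1's are counted over all rows together, and a 'stretch'
    is a run of consecutive 1's inside one row.
    """
    total_ones = sum(row.count("1") for row in rows)
    # Splitting on "0" leaves the runs of consecutive 1's in each row.
    longest_run = max(
        (len(run) for row in rows for run in row.split("0")),
        default=0,
    )
    return total_ones >= min_ones and longest_run >= min_run
```

A window with many isolated 1's but no run of 12 is rejected, as is a window with one long run but fewer than 120 conserved positions in total.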
Genome wide prediction pipeline
Process windows of 104 nt along genome:
1. Fast filtering using composition and palindromes
2. Comparative analysis with other genomes (BLASTZ)
3. Approximate secondary structure prediction (stem-loop) using a novel dynamic programming algorithm.
4. Feature extraction and classification (SVMs)
5. Filter conserved secondary structures
‘Fast’ rules:
- No window containing an unknown base
- No window consisting entirely of repeat regions
  Gain: 40% reduction in analyzed size; sensitivity 100% -> 98.4%
  (lost: hsa-mir-151, hsa-mir-370, hsa-mir-422a, hsa-mir-513-1, hsa-mir-513-2)
- Single nt composition, both strands:
    A   max 43%     min 9%
    C   max 38%     min 10.6%
    G   max 45%     min 11%
    T   max 40%     min 9.3%
- Single nt composition, single strands:
    A   max 37.5%   min 9%
    C   max 38%     min 10.6%
    G   max 43.8%   min 12.5%
    T   max 40%     min 12.7%
More ‘fast’ rules:
- Double nt composition, single strands:
    AA   max 15.4%   min 0%
    AC   max 10.7%   min 0%
    AG   max 14.2%   min 1%
    AT   max 16.1%   min 0%
    CA   max 14.7%   min 0%
    CC   max 18.3%   min 0%
    CG   max 15.8%   min 0%
    CT   max 16.4%   min 1.3%
    GA   max 11.9%   min 0%
    GC   max 17.6%   min 0%
    GG   max 19.3%   min 1%
    GT   max 13.4%   min 1.4%
    TA   max 15.7%   min 0%
    TC   max 15.6%   min 1.1%
    TG   max 18.8%   min 2.9%
    TT   max 25.8%   min 0%
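The composition limits among the ‘fast’ rules amount to a cheap per-window test. A minimal sketch for the single-strand, single-nucleotide limits follows; the function and constant names are illustrative, and the limits are the min/max percentages quoted on the slide.

```python
# Single-strand limits from the slide: base -> (min %, max %).
SINGLE_NT_LIMITS = {
    "A": (9.0, 37.5),
    "C": (10.6, 38.0),
    "G": (12.5, 43.8),
    "T": (12.7, 40.0),
}

def passes_composition(window, limits=SINGLE_NT_LIMITS):
    """Return True if every base frequency lies inside its allowed range."""
    window = window.upper()
    if not window or any(base not in "ACGT" for base in window):
        return False  # windows with unknown bases (e.g. N) are rejected
    for base, (low, high) in limits.items():
        percent = 100.0 * window.count(base) / len(window)
        if not (low <= percent <= high):
            return False
    return True
```

The dinucleotide limits could be checked the same way by counting overlapping 2-mers instead of single bases.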
>= 4nt palindrome rule:
Hash table with 4^4 = 256 entries:
Hash-key     occurred at position   rev.comp
--------------------------------------------
000  AAAA    3                      255
001  AAAC    0                      254
002  AAAG    0                      253
003  AAAU    4                      252
004  AACA    0                      251
005  ...
...
254  UUUG    0                      001
255  UUUU    60                     000
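One way such a 4-mer table can support the palindrome rule is sketched below. It uses a Python dictionary rather than a fixed 256-entry array, and the requirement that the two 4-mers must not overlap is an assumption; the slide does not spell out the exact test.

```python
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}

def reverse_complement(kmer):
    return "".join(COMPLEMENT[base] for base in reversed(kmer))

def has_palindrome(window, k=4):
    """True if some k-mer is followed, without overlap, by its reverse complement."""
    window = window.upper().replace("T", "U")
    # Hash table: k-mer -> list of start positions in the window.
    positions = {}
    for i in range(len(window) - k + 1):
        positions.setdefault(window[i:i + k], []).append(i)
    for kmer, starts in positions.items():
        if any(base not in COMPLEMENT for base in kmer):
            continue  # skip k-mers containing unknown bases
        for j in positions.get(reverse_complement(kmer), []):
            if any(j >= i + k for i in starts):
                return True
    return False
```

Windows without any such pair cannot form even a 4-bp helix, so they can be discarded before the more expensive structure prediction.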
microRNA computational prediction pipeline
2 851 352 871 bases
Inverted repeats, composition
RNA secondary structure prediction
Energy + structural features
Cross-species conservation
SVM
SS-conservation
Novel microRNAs: Microarray verification
Prediction features
From predicted secondary structure and comparative analysis
=> 10 features for SVM classification:
1. Stem_Length
2. GC_Content
3. Stem_BPs
4. maxLinHelix
5. MatureCons
6. MatureOppositeCons
7. ArmCons
8. SS_Energy
9. MatureBPs
10. MatureEnergyProfile
Histogram for feature: stem length
Histogram for feature: GC content
Histogram for feature: #base pairs in stem
Feature: longest ‘linear’ helix
maxlinhelix = 18 nt
maxlinhelix = 26 nt
Histogram for feature: longest ‘linear’ helix
Features related to mature region
window of 23 nt
Sliding 0 to 15 nt from loop
Calculate ‘mature’ feature at all positions and keep the prediction with the highest score
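The sliding-window search for the mature region can be sketched as follows. The function name, the orientation of the arm (index 0 next to the loop) and the scoring callback are assumptions made for illustration; any of the ‘mature’ features could be plugged in as `score`.

```python
def best_mature_window(arm, score, width=23, max_offset=15):
    """Slide a width-nt window 0..max_offset nt away from the loop.

    `arm` is one stem arm written so that index 0 is the base next to
    the loop (an assumption of this sketch). Returns (best_score, offset),
    or None if the arm is shorter than the window.
    """
    best = None
    for offset in range(max_offset + 1):
        candidate = arm[offset:offset + width]
        if len(candidate) < width:
            break  # the window no longer fits on the arm
        value = score(candidate)
        if best is None or value > best[0]:
            best = (value, offset)
    return best
```

For example, `best_mature_window(arm, lambda w: w.count("G"))` would return the offset of the most G-rich 23-nt window; on ties the window closest to the loop is kept.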
Histogram for feature: #conserved bases in mature region
Histogram for feature: #conserved bases in mature region (on opposite strand)
Histogram for feature: #conserved bases in both arms of the stem
Histogram for feature: secondary structure minimal free energy
Histogram for feature: #paired bases in mature region
Mature region: average stacking energy
Histogram for feature: correlation with average mature energy profile in mature region
Learning with Support Vector Machines
Training data Test data
‘Soft-margin’ hyperplanes, cost parameter C
Training with libsvm-2.6 package by C.-C. Chang & C.-J. Lin
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Modification: optimize Matthews correlation, not % correct
Importance of features with ‘knockout’ retraining:

Feature removed            Cross Validation Accuracy
----------------------------------------------------
(none, all features)       87.2728%
ss-energy                  75.4618%  ***
stem-start                 84.6784%
stem-end                   84.409%
loop-length                85.2758%
loop-start                 82.3163%
# base-pairs               82.3909%
GC-content                 76.4124%  **
higher arm conservation    86.3902%
lower arm conservation     84.97%
loop conservation          85.0393%
# GU pairs                 84.0942%
length of longest bulge    85.4047%
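Reading the knockout results as feature importance means ranking features by how much the cross-validation accuracy drops when each one is left out. A short sketch using the accuracies quoted above (the variable and function names are illustrative):

```python
ALL_FEATURES_ACCURACY = 87.2728

# Cross-validation accuracy (%) after retraining without the named feature.
KNOCKOUT_ACCURACY = {
    "ss-energy": 75.4618,
    "stem-start": 84.6784,
    "stem-end": 84.409,
    "loop-length": 85.2758,
    "loop-start": 82.3163,
    "# base-pairs": 82.3909,
    "GC-content": 76.4124,
    "higher arm conservation": 86.3902,
    "lower arm conservation": 84.97,
    "loop conservation": 85.0393,
    "# GU pairs": 84.0942,
    "length of longest bulge": 85.4047,
}

def rank_by_accuracy_drop(baseline, knockouts):
    """Sort features by accuracy lost when they are removed (largest first)."""
    drops = {name: baseline - acc for name, acc in knockouts.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_accuracy_drop(ALL_FEATURES_ACCURACY, KNOCKOUT_ACCURACY)
```

The two largest drops belong to ss-energy and GC-content, the features marked with asterisks on the slide.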
Test-set results for various SVM thresholds

Q      SENS   SPEC   CORR     cp  cn     fp   fn  threshold
-----------------------------------------------------------
99.60  96.74  28.16  +0.5208  89  56497  227  3   0.010000
99.76  95.65  39.82  +0.6163  88  56591  133  4   0.020000
99.83  95.65  48.09  +0.6776  88  56629  95   4   0.030000
99.86  95.65  54.32  +0.7203  88  56650  74   4   0.040000
99.87  95.65  55.00  +0.7248  88  56652  72   4   0.050000
99.92  95.65  67.18  +0.8012  88  56681  43   4   0.100000
99.94  95.65  75.21  +0.8479  88  56695  29   4   0.150000
99.95  95.65  78.57  +0.8667  88  56700  24   4   0.200000
99.96  95.65  82.24  +0.8868  88  56705  19   4   0.250000
99.96  95.65  83.02  +0.8909  88  56706  18   4   0.300000 ***
99.96  94.57  85.29  +0.8979  87  56709  15   5   0.350000
99.97  94.57  86.14  +0.9024  87  56710  14   5   0.400000
99.97  92.39  87.63  +0.8996  85  56712  12   7   0.450000
99.97  91.30  90.32  +0.9080  84  56715  9    8   0.500000
99.97  88.04  91.01  +0.8950  81  56716  8    11  0.550000
99.96  85.87  90.80  +0.8828  79  56716  8    13  0.600000
99.96  85.87  91.86  +0.8880  79  56717  7    13  0.650000
99.97  85.87  94.05  +0.8985  79  56719  5    13  0.700000
99.96  82.61  93.83  +0.8802  76  56719  5    16  0.750000
99.96  80.43  96.10  +0.8790  74  56721  3    18  0.800000
99.96  80.43  96.10  +0.8790  74  56721  3    18  0.849999
99.96  77.17  97.26  +0.8662  71  56722  2    21  0.899999
< 3 weeks on ~40 AMD-242-Opterons (ICS-FORTH)
Hg17-scan results for various SVM thresholds

precursor      #candidates
sensitivity    (incl. known miRNAs)    hit-rate
-----------------------------------------------
95.1%          96699                   16 ppm
90.3%          45231                   7.6 ppm
85.9%          23025                   3.9 ppm
80.6%          14429                   2.4 ppm
75.7%          9732                    1.6 ppm
70.9%          6912                    1.2 ppm
-----------------------------------------------
Total nt processed: 5976557831
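The hit-rate column appears to be the number of candidates per million nucleotides processed. A short sketch under that assumption (the function name is illustrative):

```python
TOTAL_NT_PROCESSED = 5_976_557_831  # total from the scan table

def hit_rate_ppm(candidates, total_nt=TOTAL_NT_PROCESSED):
    """Candidates per million processed nucleotides."""
    return 1e6 * candidates / total_nt
```

For instance, 45231 candidates over the total above come to about 7.6 ppm, matching the second row of the table.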
Secondary structure conservation:
From RNAfold-library:
structure – structure comparison, edit costs:

        Null  H    B    I    M    S    E
----------------------------------------
Null  {  0,   2,   2,   2,   2,   1,   1  }
H     {  2,   0,   2,   2,   2,  INF, INF }
B     {  2,   2,   0,   1,   2,  INF, INF }
I     {  2,   2,   1,   0,   2,  INF, INF }
M     {  2,   2,   2,   2,   0,  INF, INF }
S     {  1,  INF, INF, INF, INF,  0,  INF }
E     {  1,  INF, INF, INF, INF, INF,  0  }

'H' hairpin loop
'I' interior loop
'B' bulge
'M' multi-loop
'S' stack
'E' external elements
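To show how such a cost matrix is used, the sketch below aligns two sequences of structure-element labels with a standard edit-distance recursion, where the Null row and column give the insertion and deletion costs. This is a simplified, flattened illustration: the comparison in the RNAfold library works on structure trees, and the function names here are hypothetical.

```python
INF = float("inf")
LABELS = ["Null", "H", "B", "I", "M", "S", "E"]
COSTS = [
    [0, 2, 2, 2, 2, 1, 1],
    [2, 0, 2, 2, 2, INF, INF],
    [2, 2, 0, 1, 2, INF, INF],
    [2, 2, 1, 0, 2, INF, INF],
    [2, 2, 2, 2, 0, INF, INF],
    [1, INF, INF, INF, INF, 0, INF],
    [1, INF, INF, INF, INF, INF, 0],
]
INDEX = {label: i for i, label in enumerate(LABELS)}

def cost(a, b):
    """Cost of replacing element a by b; 'Null' stands for a gap."""
    return COSTS[INDEX[a]][INDEX[b]]

def element_distance(first, second):
    """Minimum-cost alignment of two sequences of element labels."""
    rows, cols = len(first) + 1, len(second) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + cost(first[i - 1], "Null")
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + cost("Null", second[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost(first[i - 1], second[j - 1]),  # replace
                dist[i - 1][j] + cost(first[i - 1], "Null"),             # delete
                dist[i][j - 1] + cost("Null", second[j - 1]),            # insert
            )
    return dist[rows - 1][cols - 1]
```

With these costs a bulge and an interior loop are considered similar (cost 1), while a stack can never be turned directly into a loop element (cost INF), so it has to be deleted and re-inserted instead.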
Secondary structure conservation vs. SVM scores
Q: 99.96  SENS: 85.87  SPEC: 91.86  CORR: +0.8880  cp 79  cn 56717  fp 7  fn 13  th 0.67
spec = cp/(cp+fp) = cp/nhits  =>  (expected cp) = spec * nhits = 0.9168 * 7664 = 7026
Probe-design for experimental verification (RNA-RNA chip):
- 2 probes with 60 nt for each candidate
- end of 5' probes reach 75% into the hairpin-loop
- 3' probes start after 50% of the hairpin-loop
- sensitivity detecting mature miRNA: 86%
- Chip in preparation at University of Toronto
Estimate for the number of true miRNAs:
All predictions are available!
Just the tip of an iceberg
-tiling window expression analysis of mouse:
30 % of the genome is transcribed !
- mRNA genes are 3% of the truth….
Acknowledgments:
Artemis Hatzigeorgiou, Praveen Sethupathy, Molly Megraw, Karol Szafranski
Center for Bioinformatics, School of Medicine, University of Pennsylvania
Yannis Tollis
Panayiota Poïrazi
Anastasis Oulas
Alkiviadis Simeonidis
Angelos Bilas, Michalis Flouris
Advanced Computing Systems, Computer Architecture and VLSI Systems Lab, ICS-FORTH