
Identifying RNA Editing Sites Using RNA Sequencing Data Alone

Gokul Ramaswami1,3, Rui Zhang1,3, Robert Piskol1, Liam P Keegan2, Patricia Deng1, Mary A O'Connell2, Jin Billy Li1

1. Department of Genetics, Stanford University, Stanford, California, USA.

2. MRC Human Genetics Unit, MRC IGMM, University of Edinburgh, Western General Hospital, Edinburgh, UK.

3. These authors contributed equally to this work

Correspondence should be addressed to J.B.L. ([email protected]).

Nature Methods: doi: 10.1038/nmeth.2330


Supplementary Figures

Supplementary Figure 1 – Identification of A-to-I RNA editing sites using RNA-seq data from human lymphocyte cell lines. (a-c) The fraction and number of putative A-to-I sites using the Separate Samples method. Relationship between the minimum number of samples shared by each variant with (a) the A-to-G and T-to-C variant proportion and (b) the total number of variants. Variants were required to be supported by at least 1 read in each sample. (c) Percentage of all twelve mismatch types. Alu and non-Alu variants were required to be present in at least one or two samples, respectively. (d-f) The fraction and number of putative A-to-I sites using the Pooled Samples method. Relationship between the minimum number of reads harboring the altered nucleotide for each variant with (d) the percentage of A-to-G and T-to-C variants and (e) the total number of variants. (f) Percentage of all twelve mismatch types. Alu and non-Alu variants were required to be supported by at least one or two mismatch reads, respectively. (g-i) Overlaps between sites found using the Separate Samples method with sites found using the Pooled Samples method for (g) Alu, (h) repetitive non-Alu, and (i) non-repetitive sites. The color legends are shared by all figure panels.
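The two filtering strategies named in this caption can be illustrated with a short Python sketch. It is not the authors' pipeline: the `Variant` record, the function names and the toy data are hypothetical. The default thresholds are the ones quoted in panels (c) and (f) for this dataset (Alu variants in at least one sample or one mismatch read, non-Alu variants in at least two).

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical record for one candidate site (not the authors' data model)."""
    site: str
    in_alu: bool
    # Mismatch-read count per sample; samples without a mismatch read are absent.
    reads_per_sample: dict = field(default_factory=dict)


def passes_separate_samples(v, min_alu=1, min_non_alu=2):
    """Separate Samples: count the samples with at least 1 mismatch read."""
    n_samples = sum(1 for n in v.reads_per_sample.values() if n >= 1)
    return n_samples >= (min_alu if v.in_alu else min_non_alu)


def passes_pooled_samples(v, min_alu=1, min_non_alu=2):
    """Pooled Samples: count mismatch reads summed over all samples."""
    n_reads = sum(v.reads_per_sample.values())
    return n_reads >= (min_alu if v.in_alu else min_non_alu)


# Toy data (invented for illustration only).
variants = [
    Variant("siteA", in_alu=True, reads_per_sample={"s1": 1}),
    Variant("siteB", in_alu=False, reads_per_sample={"s1": 3}),
    Variant("siteC", in_alu=False, reads_per_sample={"s1": 1, "s2": 1}),
]
separate = {v.site for v in variants if passes_separate_samples(v)}
pooled = {v.site for v in variants if passes_pooled_samples(v)}
print(sorted(separate), sorted(pooled))
```

In the toy data, siteB has three mismatch reads but all in one sample, so it passes only the pooled filter; this is the kind of site that ends up outside the overlap in panels (g-i).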


Supplementary Figure 2 – Annotation of editing types. (a) Percentage of all twelve mismatch types for RNA-DNA differences called by Ramaswami et al. (2012)1 in the GM12878 lymphoblastoid cell line using a strand-specific RNA-seq library. (b) Percentage of all twelve mismatch types for the same mismatches when mismatch type was determined using gene annotations instead of the strand-specificity of the library.
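Panel (b) depends on assigning the mismatch type from the annotated gene strand rather than from the library. A minimal Python sketch of that idea follows; the function names and the toy calls are hypothetical. The only rule used is the standard one: a mismatch reported on the genome plus strand is complemented when the annotated gene lies on the minus strand, so a plus-strand T-to-C inside a minus-strand gene is counted as A-to-G.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mismatch_type(ref_base, alt_base, gene_strand):
    """Return the mismatch type as seen on the transcribed strand.

    ref_base and alt_base are given on the genome plus strand;
    gene_strand is '+' or '-' and comes from the gene annotation.
    """
    if gene_strand == "-":
        ref_base, alt_base = COMPLEMENT[ref_base], COMPLEMENT[alt_base]
    return f"{ref_base}-to-{alt_base}"


def mismatch_percentages(calls):
    """Percentage of each mismatch type among (ref, alt, strand) calls."""
    counts = Counter(mismatch_type(*call) for call in calls)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# Toy calls (invented for illustration only).
calls = [("A", "G", "+"), ("T", "C", "-"), ("T", "C", "+"), ("C", "T", "+")]
percentages = mismatch_percentages(calls)
print(percentages)
```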


Supplementary Figure 3 – Comparison of editing sites identified in GM12891 using RNA-seq only (Separate Samples Method) against editing sites identified in GM12891 using the pipeline from Ramaswami et al. (2012)1. Venn diagrams showing the overlaps between sites in (a) Alu regions, (b) repetitive non-Alu regions and (c) non-repetitive regions.


Supplementary Figure 4 – Distribution of read depth for editing sites. Histogram of sequencing read depths for editing sites identified in the Lymphoblastoid dataset using the Separate Samples method. For each editing site, we identified every one of the 40 samples in which it was present as a variant and used the read depths in these samples.


Supplementary Figure 5 – Identification of A-to-I RNA editing sites using RNA-seq data from human brain tissues. (a) Relationship between the total number of variants with the minimum number of samples shared by each variant using the Separate Samples method. Variants were required to be supported by at least 1 read in each sample. (b) Relationship between the total number of variants with the minimum number of reads harboring the altered nucleotide for each variant using the Pooled Samples method. (c-e) Overlaps between sites found using the Separate Samples method with sites found using the Pooled Samples method for (c) Alu, (d) repetitive non-Alu, and (e) non-repetitive sites. The color legends are shared by all figure panels.


Supplementary Figure 6 – Distribution of editing levels. Cumulative distribution function of editing levels for sites found in the Brain dataset. Editing levels were calculated using the pooled alignment file of all 50 samples. A minimum coverage of 10 reads was required for each site to estimate the editing level.
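As an illustration of the calculation described in this caption, the sketch below computes an editing level with a minimum-coverage cutoff and an empirical cumulative distribution. It assumes the usual definition of editing level, G reads divided by A plus G reads, and that coverage means A plus G reads; neither is spelled out in the caption. The function names are hypothetical, and the read counts in the example are taken from Supplementary Figure 13.

```python
def editing_level(n_a, n_g, min_coverage=10):
    """Fraction of G reads at a site, or None if coverage is too low.

    Assumption: coverage is counted as A reads plus G reads.
    """
    coverage = n_a + n_g
    if coverage < min_coverage:
        return None
    return n_g / coverage


def empirical_cdf(levels):
    """Return (level, cumulative fraction) pairs sorted by level."""
    xs = sorted(levels)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Read counts (#A, #G) from Supplementary Figure 13.
plch2 = editing_level(337, 50)   # 50 / 387
tor1b = editing_level(4, 4)      # only 8 reads, below the cutoff of 10
print(round(plch2, 3), tor1b)

cdf = empirical_cdf([0.1, 0.5, 0.2])
print(cdf)
```

With a cutoff of 10 reads, a site such as TOR1B (4 A reads and 4 G reads in the pooled alignments) would not contribute an editing level under this assumption.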


Supplementary Figure 7 – ADAR-binding sequence preferences. The sequence preferences flanking A-to-I RNA editing sites are based on (a) 612,573 Alu sites, (b) 13,724 repetitive non-Alu sites and (c) 12,160 non-repetitive sites called in the Brain dataset. The edited adenosine is present at position 4. (d) The proportions of triplet sequences centered on the edited adenosine (NAN) are plotted.
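The triplet proportions in panel (d) can be illustrated with a small Python sketch. It assumes each site is represented by a short window with the edited adenosine at position 4 (1-based), as in the caption; the 7-nt window length, the function name and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def triplet_proportions(windows, edit_position=4):
    """Proportion of each NAN triplet centered on the edited adenosine.

    windows: sequences with the edited A at `edit_position` (1-based).
    """
    idx = edit_position - 1  # convert to a 0-based index
    counts = Counter()
    for seq in windows:
        seq = seq.upper()
        if seq[idx] != "A":
            raise ValueError(f"expected A at position {edit_position}: {seq}")
        counts[seq[idx - 1: idx + 2]] += 1
    total = sum(counts.values())
    return {triplet: n / total for triplet, n in counts.items()}


# Toy 7-nt windows (invented for illustration only).
windows = ["CCTAGGA", "GATAGCC", "TTCACTT", "AAAAGTT"]
proportions = triplet_proportions(windows)
print(proportions)
```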


Supplementary Figure 8 – Identification of A-to-I sites using RNA-seq data from human BodyMap tissues. (a-c) The fraction and number of putative A-to-I sites using the Separate Samples method. Relationship between the minimum number of samples shared by each variant with (a) the A-to-G and T-to-C variant proportion and (b) the total number of variants. Variants were required to be supported by at least 1 read in each sample. (c) Percentage of all twelve mismatch types. Alu and non-Alu variants were required to be present in at least one or two samples, respectively. (d-f) The fraction and number of putative A-to-I sites using the Pooled Samples method. Relationship between the minimum number of reads harboring the altered nucleotide for each variant with (d) the percentage of A-to-G and T-to-C variants and (e) the total number of variants. (f) Percentage of all twelve mismatch types. Alu and non-Alu variants were required to be supported by at least one or three mismatch reads, respectively. (g-i) Overlaps between sites found using the Separate Samples method with sites found using the Pooled Samples method for (g) Alu, (h) repetitive non-Alu, and (i) non-repetitive sites. The color legends are shared by all figure panels.


Supplementary Figure 9 – Comparison of sites between human datasets. (a-c) Venn diagrams showing the overlaps for the human sites found between the three datasets (Lymphoblastoid, Brain, BodyMap) in this study for (a) Alu regions, (b) repetitive non-Alu regions and (c) non-repetitive regions.


Supplementary Figure 10 – Cross-species comparative transcriptome method.


Supplementary Figure 11 – Accurate identification of RNA editing sites in the primate lineage. (a-c) Relationship between the number of A-to-G variants and the minimum editing level in (a) Alu, (b) repetitive non-Alu, and (c) non-repetitive regions. The color legend is shared by all panels.


Supplementary Figure 12 – Gene ontology terms associated with human edited genes. The top 10 gene ontology terms significantly associated with (a) genes harboring non-repetitive editing sites and (b) genes harboring non-repetitive editing sites in protein coding regions. Sites were taken from the Brain dataset. The p-values are uncorrected with respect to multiple testing.


Supplementary Figure 13 – Sanger validation of non-synonymous sites. 7/9 randomly chosen novel non-repetitive non-synonymous editing sites were validated by Sanger sequencing in a human cerebellum sample. *: A previously identified non-synonymous editing site in CACNA1D was also validated2. The number of A nucleotides, G nucleotides, and editing frequency were calculated from the pooled alignments of 50 human brain RNA-seq samples.

Validated sites

| Chr | Position | Gene | #A | #G | Editing frequency |
| --- | --- | --- | --- | --- | --- |
| chr1 | 2436080 | PLCH2 | 337 | 50 | 0.129 |
| *chr3 | 53820892 | CACNA1D | 133 | 44 | 0.249 |
| chr4 | 2835556 | SH3BP2 | 99 | 17 | 0.147 |
| chr6 | 44120349 | TMEM63B | 481 | 88 | 0.155 |
| chr9 | 21384902 | IFNA2 | 34 | 7 | 0.175 |
| chr9 | 132571671 | TOR1B | 4 | 4 | 0.500 |
| chr19 | 56895363 | ZNF582 | 204 | 30 | 0.128 |
| chr19 | 56895428 | ZNF582 | 44 | 5 | 0.102 |

Not validated sites

| Chr | Position | Gene | #A | #G | Editing frequency |
| --- | --- | --- | --- | --- | --- |
| chr1 | 160319987 | NCSTN | 243 | 34 | 0.123 |
| chr9 | 21384898 | IFNA2 | 5 | 4 | 0.800 |

[The gDNA trace and cDNA trace columns of the original figure contain Sanger chromatogram images.]


Supplementary Figure 14 - Validation of nonsynonymous non-repetitive editing sites identified in brain tissues. Validation success rate for sites found using (a) Separate Samples method and (b) Pooled Samples method.


Supplementary Figure 15 – Accurate identification of RNA editing sites in Drosophila by comparative transcriptome analysis. (a, b) Relationship between the proportion of detected mismatches that are A-to-G type and the minimum editing level for (a) OregonR head and (b) 1 day old y1; cn bw1 sp1 strain RNA-seq datasets. OregonR is an inbred strain distinct from the y1; cn bw1 sp1 strain that was used to generate the D. melanogaster reference genome sequence. (c, d) Relationship between the number of A-to-G variants and the minimum editing level for (c) OregonR head and (d) 1 day old y1; cn bw1 sp1 strain RNA-seq datasets. The color legend is shared by all panels.
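The relationship plotted in panels (a, b) can be reproduced in outline with the sketch below: for each minimum editing level, keep the mismatches at or above that level and report the fraction that are A-to-G. The function name, the thresholds and the toy mismatch list are hypothetical and serve only to show the calculation.

```python
def a_to_g_fraction_by_threshold(mismatches, thresholds):
    """Fraction of A-to-G mismatches among those passing each threshold.

    mismatches: list of (mismatch_type, editing_level) pairs.
    Returns {threshold: fraction}, with None where no mismatch passes.
    """
    result = {}
    for t in thresholds:
        kept = [kind for kind, level in mismatches if level >= t]
        result[t] = kept.count("A-to-G") / len(kept) if kept else None
    return result


# Toy mismatches (invented for illustration only).
mismatches = [
    ("A-to-G", 0.05),
    ("C-to-T", 0.02),
    ("A-to-G", 0.30),
    ("G-to-A", 0.08),
    ("A-to-G", 0.12),
]
fractions = a_to_g_fraction_by_threshold(mismatches, [0.0, 0.1, 0.9])
print(fractions)
```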


Supplementary Figure 16 – Characterization of all newly identified A-to-I editing sites in the Drosophila lineage. (a) Venn diagram showing the overlap between known edited sites3-5 (blue) and sites identified in this study (red). (b) Distribution of newly identified sites in different genic regions. Sites were annotated using ANNOVAR6. (c) Gene Ontology (GO) categories of edited genes. DAVID was used to perform GO analysis. P values shown were corrected for multiple hypothesis testing using the Benjamini-Hochberg method.

Panel (c):

| Ontology | Category | Description | # Genes | P value |
| --- | --- | --- | --- | --- |
| Molecular function | GO:0022836 | gated channel activity | 26 | 3.18E-12 |
| Molecular function | GO:0005261 | cation channel activity | 25 | 2.21E-11 |
| Molecular function | GO:0015267 | channel activity | 29 | 3.56E-10 |
| Molecular function | GO:0022803 | passive transmembrane transporter activity | 29 | 3.56E-10 |
| Molecular function | GO:0005216 | ion channel activity | 28 | 3.26E-10 |
| Molecular function | GO:0046873 | metal ion transmembrane transporter activity | 26 | 3.00E-10 |
| Molecular function | GO:0022838 | substrate specific channel activity | 28 | 4.32E-10 |
| Molecular function | GO:0022843 | voltage-gated cation channel activity | 14 | 8.78E-09 |
| Molecular function | GO:0005244 | voltage-gated ion channel activity | 14 | 1.21E-07 |
| Molecular function | GO:0022832 | voltage-gated channel activity | 14 | 1.21E-07 |
| Biological process | GO:0007267 | cell-cell signaling | 31 | 1.01E-09 |
| Biological process | GO:0007268 | synaptic transmission | 29 | 1.12E-09 |
| Biological process | GO:0006811 | ion transport | 39 | 9.49E-10 |
| Biological process | GO:0019226 | transmission of nerve impulse | 29 | 1.40E-09 |
| Biological process | GO:0030001 | metal ion transport | 25 | 2.87E-08 |
| Biological process | GO:0006812 | cation transport | 29 | 3.84E-07 |
| Biological process | GO:0001505 | regulation of neurotransmitter levels | 19 | 7.21E-07 |
| Biological process | GO:0046903 | secretion | 20 | 8.21E-07 |
| Biological process | GO:0007269 | neurotransmitter secretion | 18 | 1.16E-06 |
| Biological process | GO:0003001 | generation of a signal in cell-cell signaling | 18 | 1.22E-06 |
| Cellular component | GO:0045202 | synapse | 27 | 1.38E-11 |
| Cellular component | GO:0044456 | synapse part | 24 | 5.70E-11 |
| Cellular component | GO:0034702 | ion channel complex | 17 | 7.61E-10 |
| Cellular component | GO:0044459 | plasma membrane part | 43 | 2.59E-09 |
| Cellular component | GO:0005886 | plasma membrane | 58 | 4.15E-07 |
| Cellular component | GO:0030136 | clathrin-coated vesicle | 15 | 3.62E-07 |
| Cellular component | GO:0008021 | synaptic vesicle | 14 | 9.04E-07 |
| Cellular component | GO:0030054 | cell junction | 21 | 8.32E-07 |
| Cellular component | GO:0030135 | coated vesicle | 16 | 1.02E-06 |
| Cellular component | GO:0016023 | cytoplasmic membrane-bounded vesicle | 17 | 5.16E-06 |
| Cellular component | GO:0034703 | cation channel complex | 10 | 6.10E-06 |
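For readers unfamiliar with the Benjamini-Hochberg correction mentioned in the caption, the following is a minimal sketch of the standard step-up adjustment in plain Python. It is not DAVID's implementation and does not reproduce the values in panel (c); the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each raw P value is scaled by m / rank (rank 1 = smallest), then a
    running minimum from the largest rank downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy raw P values (invented for illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Because of the running minimum, several terms can share the same adjusted value even when their raw P values differ.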


Supplementary Figure 17 – Overlap between human and fly edited genes. Venn diagram showing the overlap between human (blue) and fly (red) edited genes.


Supplementary Figure 18 – Comparison of sites between this study and previous studies. Venn diagrams showing the overlaps between the union of all human sites found in this study with the union of sites found in previous studies1,7-9 for (a) Alu regions, (b) repetitive non-Alu regions and (c) non-repetitive regions.


Supplementary Figure 19 – Identifying editing sites in a single sample. Fraction of all twelve mismatch types for (a) all variants identified in the sample SRR094854, (b) non-Alu variants shared between SRR094854 and any other sample from the Brain dataset, and (c) variants identified after pooling the alignments of SRR094854 with alignments from all of the samples from the Brain dataset. The proportions of novel A-to-G sites identified are shown as overlaid black stripes. The total number of variants for each category is given in parentheses in the legend.


Supplementary Tables

Supplementary Table 1 – Lymphoblastoid dataset samples

| GEO # | SRR #'s | Tissue | Read length | # reads | # mapped reads |
| --- | --- | --- | --- | --- | --- |
| GSE16921 | SRR032239 | Lymphocyte | 50 | 37,931,753 | 32,354,941 |
| GSE16921 | SRR032240 | Lymphocyte | 50 | 37,634,150 | 33,839,762 |
| GSE16921 | SRR032241 | Lymphocyte | 50 | 36,665,036 | 31,110,807 |
| GSE16921 | SRR032242 | Lymphocyte | 50 | 35,480,405 | 29,018,708 |
| GSE16921 | SRR032243 | Lymphocyte | 50 | 41,948,913 | 36,552,670 |
| GSE16921 | SRR032244 | Lymphocyte | 50 | 41,089,776 | 33,834,197 |
| GSE16921 | SRR032245 | Lymphocyte | 50 | 43,291,432 | 36,417,453 |
| GSE16921 | SRR032246 | Lymphocyte | 50 | 45,169,846 | 38,391,236 |
| GSE16921 | SRR032247 | Lymphocyte | 50 | 37,293,182 | 30,312,832 |
| GSE16921 | SRR032248 | Lymphocyte | 50 | 36,412,185 | 30,126,968 |
| GSE16921 | SRR032249 | Lymphocyte | 50 | 34,024,869 | 26,438,055 |
| GSE16921 | SRR032250 | Lymphocyte | 50 | 38,657,782 | 32,204,942 |
| GSE16921 | SRR032251 | Lymphocyte | 50 | 29,880,357 | 22,742,481 |
| GSE16921 | SRR032252 | Lymphocyte | 50 | 37,470,160 | 27,912,241 |
| GSE16921 | SRR032253 | Lymphocyte | 50 | 33,956,826 | 26,322,119 |
| GSE16921 | SRR032254 | Lymphocyte | 50 | 7,146,920 | 5,471,399 |
| GSE16921 | SRR032255 | Lymphocyte | 50 | 34,898,453 | 29,802,719 |
| GSE16921 | SRR032256 | Lymphocyte | 50 | 33,289,678 | 28,190,819 |
| GSE16921 | SRR032257 | Lymphocyte | 50 | 34,290,820 | 28,501,171 |
| GSE16921 | SRR032258 | Lymphocyte | 50 | 27,757,799 | 22,476,857 |
| GSE16921 | SRR032259 | Lymphocyte | 50 | 26,342,094 | 21,072,084 |
| GSE16921 | SRR032260 | Lymphocyte | 50 | 15,982,126 | 13,330,210 |
| GSE16921 | SRR032261 | Lymphocyte | 50 | 28,737,075 | 23,562,953 |
| GSE16921 | SRR032262 | Lymphocyte | 50 | 11,862,352 | 9,762,476 |
| GSE16921 | SRR032263 | Lymphocyte | 50 | 9,712,545 | 8,246,891 |
| GSE16921 | SRR032264 | Lymphocyte | 50 | 8,618,199 | 7,281,558 |
| GSE16921 | SRR032265 | Lymphocyte | 50 | 35,020,229 | 29,423,512 |
| GSE16921 | SRR032266 | Lymphocyte | 50 | 9,606,273 | 8,083,473 |
| GSE16921 | SRR032267 | Lymphocyte | 50 | 12,026,381 | 9,964,686 |
| GSE16921 | SRR032268 | Lymphocyte | 50 | 16,248,397 | 13,700,241 |
| GSE16921 | SRR032269 | Lymphocyte | 50 | 16,462,947 | 14,013,819 |
| GSE16921 | SRR032270 | Lymphocyte | 50 | 16,843,152 | 14,413,617 |
| GSE16921 | SRR032271 | Lymphocyte | 50 | 45,066,631 | 36,567,173 |
| GSE16921 | SRR032272 | Lymphocyte | 50 | 14,115,948 | 11,426,544 |
| GSE16921 | SRR032273 | Lymphocyte | 50 | 19,250,405 | 16,682,010 |
| GSE16921 | SRR032274 | Lymphocyte | 50 | 18,435,291 | 15,409,444 |
| GSE16921 | SRR032275 | Lymphocyte | 50 | 19,763,004 | 16,477,379 |
| GSE16921 | SRR032276 | Lymphocyte | 50 | 45,367,154 | 35,775,820 |
| GSE16921 | SRR032277 | Lymphocyte | 50 | 39,588,081 | 31,624,708 |
| GSE16921 | SRR032278 | Lymphocyte | 50 | 8,970,706 | 6,878,317 |

Data from Cheung VG et al. PLoS Biol. 2010 (ref. 10)


Supplementary Table 2. Human genic A-to-I RNA editing sites identified in this study

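| Dataset | Category | Alu repeat | repetitive non-Alu | non-repetitive |
| --- | --- | --- | --- | --- |
| Lymphoblastoid | Separate Samples | 255,816 | 1,398 | 1,455 |
| Lymphoblastoid | Pooled Samples | 158,873 | 2,548 | 2,481 |
| Lymphoblastoid | Union | 303,624 | 2,796 | 2,815 |
| Lymphoblastoid | Nonsynonymous | 214 | 1 | 37 |
| Lymphoblastoid | Synonymous | 92 | 0 | 13 |
| Lymphoblastoid | 5'UTR | 978 | 18 | 24 |
| Lymphoblastoid | 3'UTR | 12,191 | 117 | 306 |
| Lymphoblastoid | ncRNA | 3,164 | 46 | 54 |
| Lymphoblastoid | Intronic | 286,985 | 2,614 | 2,381 |
| Brain | Separate Samples | 484,674 | 5,693 | 5,395 |
| Brain | Pooled Samples | 393,529 | 12,803 | 10,772 |
| Brain | Union | 612,573 | 13,724 | 12,160 |
| Brain | Nonsynonymous | 479 | 5 | 115 |
| Brain | Synonymous | 226 | 4 | 86 |
| Brain | 5'UTR | 2,378 | 70 | 147 |
| Brain | 3'UTR | 22,957 | 291 | 690 |
| Brain | ncRNA | 7,788 | 197 | 227 |
| Brain | Intronic | 578,745 | 13,157 | 10,895 |
| Bodymap | Separate Samples | 342,570 | 2,051 | 1,955 |
| Bodymap | Pooled Samples | 227,770 | 2,667 | 2,517 |
| Bodymap | Union | 364,497 | 3,110 | 3,016 |
| Bodymap | Nonsynonymous | 333 | 2 | 38 |
| Bodymap | Synonymous | 153 | 2 | 15 |
| Bodymap | 5'UTR | 1,452 | 32 | 24 |
| Bodymap | 3'UTR | 21,592 | 243 | 488 |
| Bodymap | ncRNA | 5,450 | 76 | 84 |
| Bodymap | Intronic | 335,517 | 2,755 | 2,367 |
| Primate | Hum-Chimp | 17,800 | 308 | 464 |
| Primate | Hum-Rhe | 8,255 | 79 | 177 |
| Primate | Hum-Chimp-Rhe | 4,947 | 58 | 115 |
| Primate | Union | 21,108 | 334 | 542 |
| Primate | Nonsynonymous | 183 | 8 | 54 |
| Primate | Synonymous | 67 | 1 | 20 |
| Primate | 5'UTR | 792 | 38 | 36 |
| Primate | 3'UTR | 16,822 | 215 | 356 |
| Primate | ncRNA | 3,244 | 72 | 76 |
| Primate | Intronic | N/A | N/A | N/A |

Abbreviations: Chimp: Chimpanzee, Hum: human, Rhe: Rhesus Macaque, N/A: not applicable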
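The Separate Samples, Pooled Samples and Union rows can be related by simple arithmetic. The sketch below assumes that "Union" is the set union of the sites found by the two methods, which the table does not state explicitly; under that assumption the number of sites found by both methods follows from inclusion-exclusion. The dictionary and function names are hypothetical; the counts are copied from the Lymphoblastoid rows of this table.

```python
# (Separate Samples, Pooled Samples, Union) counts from the
# Lymphoblastoid rows of Supplementary Table 2.
LYMPHOBLASTOID = {
    "Alu repeat": (255_816, 158_873, 303_624),
    "repetitive non-Alu": (1_398, 2_548, 2_796),
    "non-repetitive": (1_455, 2_481, 2_815),
}


def shared_sites(separate, pooled, union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return separate + pooled - union


overlap = {
    region: shared_sites(*counts)
    for region, counts in LYMPHOBLASTOID.items()
}
for region, n in overlap.items():
    print(f"{region}: {n:,} sites found by both methods")
```

Under the stated assumption this gives 111,065 shared Alu sites, 1,150 shared repetitive non-Alu sites and 1,121 shared non-repetitive sites for the Lymphoblastoid dataset.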


Supplementary Table 3 – Summary of mismatches identified in human RNA-seq data. A-to-G and T-to-C percentages and total number of mismatches identified in Alu, repetitive non-Alu, and non-repetitive regions separated by genomic locations.

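| Dataset | Method | Location | Alu: Mismatches | Alu: AG % | Alu: (AG+TC) % | Repetitive non-Alu: Mismatches | Repetitive non-Alu: AG % | Repetitive non-Alu: (AG+TC) % | Non-repetitive: Mismatches | Non-repetitive: AG % | Non-repetitive: (AG+TC) % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lymphoblastoid | Separate Samples | Genic | 359938 | 71% | 80% | 1596 | 88% | 94% | 1987 | 73% | 80% |
| Lymphoblastoid | Separate Samples | nonsynonymous | 308 | 59% | 67% | 2 | 0% | 0% | 87 | 38% | 40% |
| Lymphoblastoid | Separate Samples | synonymous | 119 | 69% | 81% | 1 | 0% | 0% | 36 | 25% | 36% |
| Lymphoblastoid | Separate Samples | 5'UTR | 1326 | 64% | 72% | 21 | 57% | 57% | 58 | 31% | 41% |
| Lymphoblastoid | Separate Samples | 3'UTR | 13781 | 81% | 85% | 108 | 90% | 93% | 339 | 76% | 78% |
| Lymphoblastoid | Separate Samples | ncRNA | 3899 | 72% | 80% | 30 | 93% | 97% | 47 | 68% | 72% |
| Lymphoblastoid | Separate Samples | intronic | 340505 | 71% | 80% | 1434 | 88% | 94% | 1420 | 78% | 85% |
| Lymphoblastoid | Separate Samples | Intergenic | 89473 | N/A | 72% | 422 | N/A | 97% | 404 | N/A | 77% |
| Lymphoblastoid | Pooled Samples | Genic | 195647 | 81% | 89% | 2884 | 88% | 94% | 3163 | 78% | 86% |
| Lymphoblastoid | Pooled Samples | nonsynonymous | 120 | 78% | 86% | 3 | 33% | 33% | 36 | 47% | 61% |
| Lymphoblastoid | Pooled Samples | synonymous | 54 | 83% | 91% | 0 | N/A | N/A | 16 | 38% | 56% |
| Lymphoblastoid | Pooled Samples | 5'UTR | 564 | 86% | 92% | 14 | 79% | 86% | 28 | 54% | 64% |
| Lymphoblastoid | Pooled Samples | 3'UTR | 6549 | 95% | 97% | 80 | 98% | 98% | 231 | 87% | 90% |
| Lymphoblastoid | Pooled Samples | ncRNA | 1880 | 82% | 89% | 43 | 91% | 93% | 56 | 79% | 82% |
| Lymphoblastoid | Pooled Samples | intronic | 186480 | 81% | 89% | 2744 | 88% | 94% | 2796 | 79% | 87% |
| Lymphoblastoid | Pooled Samples | Intergenic | 67449 | N/A | 71% | 870 | N/A | 94% | 731 | N/A | 79% |
| Brain | Separate Samples | Genic | 608996 | 80% | 91% | 6438 | 88% | 97% | 7627 | 71% | 81% |
| Brain | Separate Samples | nonsynonymous | 533 | 78% | 86% | 8 | 50% | 50% | 450 | 21% | 32% |
| Brain | Separate Samples | synonymous | 246 | 77% | 89% | 3 | 67% | 67% | 310 | 23% | 48% |
| Brain | Separate Samples | 5'UTR | 2622 | 79% | 86% | 59 | 80% | 86% | 155 | 58% | 66% |
| Brain | Separate Samples | 3'UTR | 25648 | 83% | 89% | 298 | 82% | 88% | 1221 | 49% | 59% |
| Brain | Separate Samples | ncRNA | 8802 | 78% | 89% | 161 | 81% | 94% | 224 | 71% | 83% |
| Brain | Separate Samples | intronic | 571145 | 79% | 91% | 5909 | 89% | 97% | 5267 | 83% | 92% |
| Brain | Separate Samples | Intergenic | 204693 | N/A | 91% | 2978 | N/A | 98% | 1619 | N/A | 89% |
| Brain | Pooled Samples | Genic | 449694 | 88% | 97% | 16295 | 79% | 88% | 16847 | 64% | 75% |
| Brain | Pooled Samples | nonsynonymous | 284 | 92% | 97% | 4 | 50% | 75% | 131 | 44% | 57% |
| Brain | Pooled Samples | synonymous | 130 | 89% | 98% | 2 | 100% | 100% | 105 | 33% | 66% |
| Brain | Pooled Samples | 5'UTR | 1508 | 94% | 98% | 61 | 80% | 95% | 148 | 70% | 78% |
| Brain | Pooled Samples | 3'UTR | 12505 | 96% | 99% | 235 | 85% | 92% | 509 | 74% | 82% |
| Brain | Pooled Samples | ncRNA | 4909 | 87% | 98% | 217 | 69% | 90% | 260 | 65% | 83% |
| Brain | Pooled Samples | intronic | 430358 | 87% | 97% | 15776 | 79% | 87% | 15694 | 64% | 75% |
| Brain | Pooled Samples | Intergenic | 161589 | N/A | 94% | 7457 | N/A | 91% | 4591 | N/A | 76% |
| Bodymap | Separate Samples | Genic | 448683 | 76% | 87% | 2363 | 87% | 95% | 2784 | 70% | 79% |
| Bodymap | Separate Samples | nonsynonymous | 447 | 70% | 78% | 2 | 50% | 50% | 128 | 18% | 25% |
| Bodymap | Separate Samples | synonymous | 213 | 69% | 79% | 2 | 50% | 50% | 52 | 17% | 46% |
| Bodymap | Separate Samples | 5'UTR | 1873 | 74% | 80% | 27 | 81% | 81% | 52 | 17% | 27% |
| Bodymap | Separate Samples | 3'UTR | 25046 | 84% | 87% | 219 | 91% | 94% | 598 | 72% | 75% |
| Bodymap | Separate Samples | ncRNA | 6876 | 75% | 84% | 71 | 85% | 92% | 2784 | 70% | 79% |
| Bodymap | Separate Samples | intronic | 414228 | 76% | 87% | 2042 | 87% | 95% | 1871 | 76% | 86% |
| Bodymap | Separate Samples | Intergenic | 137596 | N/A | 86% | 851 | N/A | 96% | 688 | N/A | 84% |
| Bodymap | Pooled Samples | Genic | 283288 | 80% | 90% | 3311 | 81% | 91% | 3914 | 64% | 73% |
| Bodymap | Pooled Samples | nonsynonymous | 174 | 83% | 91% | 3 | 33% | 33% | 104 | 24% | 34% |
| Bodymap | Pooled Samples | synonymous | 77 | 86% | 95% | 1 | 100% | 100% | 58 | 16% | 38% |
| Bodymap | Pooled Samples | 5'UTR | 888 | 87% | 92% | 37 | 70% | 76% | 91 | 22% | 29% |
| Bodymap | Pooled Samples | 3'UTR | 11006 | 95% | 97% | 198 | 87% | 91% | 496 | 62% | 66% |
| Bodymap | Pooled Samples | ncRNA | 3327 | 83% | 92% | 69 | 80% | 91% | 117 | 56% | 64% |
| Bodymap | Pooled Samples | intronic | 267816 | 80% | 90% | 3003 | 80% | 92% | 3048 | 69% | 78% |
| Bodymap | Pooled Samples | Intergenic | 99085 | N/A | 85% | 1251 | N/A | 94% | 971 | N/A | 78% |
| Primate | Human-Chimp | Exonic | 19550 | 91% | 95% | 364 | 85% | 93% | 553 | 84% | 89% |
| Primate | Human-Chimp | nonsynonymous | 168 | 95% | 99% | 11 | 72% | 91% | 64 | 66% | 80% |
| Primate | Human-Chimp | synonymous | 70 | 86% | 97% | 1 | 100% | 100% | 40 | 40% | 60% |
| Primate | Human-Chimp | 5'UTR | 714 | 96% | 98% | 39 | 82% | 95% | 36 | 83% | 86% |
| Primate | Human-Chimp | 3'UTR | 15406 | 91% | 94% | 228 | 87% | 92% | 331 | 92% | 93% |
| Primate | Human-Chimp | ncRNA | 3192 | 89% | 98% | 85 | 81% | 95% | 82 | 89% | 95% |
| Primate | Human-Chimp | Intronic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Primate | Human-Chimp | Intergenic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Primate | Human-Rhesus | Exonic | 8964 | 92% | 95% | 92 | 86% | 96% | 218 | 81% | 87% |
| Primate | Human-Rhesus | nonsynonymous | 46 | 96% | 100% | 0 | N/A | N/A | 36 | 83% | 89% |
| Primate | Human-Rhesus | synonymous | 18 | 83% | 100% | 1 | 0% | 100% | 14 | 50% | 57% |
| Primate | Human-Rhesus | 5'UTR | 252 | 96% | 97% | 10 | 80% | 90% | 12 | 83% | 83% |
| Primate | Human-Rhesus | 3'UTR | 7526 | 92% | 94% | 75 | 89% | 96% | 133 | 89% | 90% |
| Primate | Human-Rhesus | ncRNA | 1122 | 91% | 99% | 6 | 67% | 100% | 23 | 52% | 83% |
| Primate | Human-Rhesus | Intronic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Primate | Human-Rhesus | Intergenic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Primate | Hum-Chim-Rhe | Exonic | 5175 | 96% | 98% | 63 | 92% | 98% | 131 | 88% | 90% |
| Primate | Hum-Chim-Rhe | nonsynonymous | 22 | 95% | 100% | 0 | N/A | N/A | 29 | 86% | 90% |
| Primate | Hum-Chim-Rhe | synonymous | 10 | 80% | 100% | 0 | N/A | N/A | 5 | 60% | 80% |
| Primate | Hum-Chim-Rhe | 5'UTR | 136 | 100% | 100% | 2 | 100% | 100% | 4 | 100% | 100% |
| Primate | Hum-Chim-Rhe | 3'UTR | 4346 | 96% | 97% | 29 | 93% | 98% | 80 | 91% | 91% |
| Primate | Hum-Chim-Rhe | ncRNA | 661 | 93% | 99% | 2 | 50% | 100% | 13 | 77% | 85% |
| Primate | Hum-Chim-Rhe | Intronic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Primate | Hum-Chim-Rhe | Intergenic | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |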


Supplementary Table 4 – Brain dataset samples

| GEO # | SRR #'s | Tissue | Read length | Trimmed read length | # reads | # mapped reads | PMID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSE30611 | ERR030890,ERR030882 | Whole brain | 50,75 | 50,75 | 211,339,298 | 196,143,878 | N/A |
| GSE19166 | SRR039628,SRR039629 | Whole brain | 50 | 45 | 16,402,286 | 15,452,343 | 20371516 (ref. 11) |
| GSE19166 | SRR039630,SRR039631 | Whole brain | 50 | 45 | 15,520,177 | 14,626,939 | 20371516 |
| GSE19166 | SRR039632,SRR039633 | Whole brain | 50 | 45 | 14,901,988 | 14,027,002 | 20371516 |
| SRP004879 | SRR085726 | Frontal Lobe | 36 | 35 | 15,228,832 | 14,593,280 | 21283692 (ref. 12) |
| SRP004879 | SRR085473 | Temporal Lobe | 36 | 35 | 14,227,702 | 13,435,512 | 21283692 |
| SRP004879 | SRR087416 | Whole brain | 36 | 36 | 14,720,816 | 14,251,780 | 21283692 |
| GSE30573 | SRR309137,SRR309138 | Frontal Lobe | 76 | 76 | 47,240,074 | 42,422,059 | 21614001 (ref. 13) |
| GSE30573 | SRR309133,SRR309134 | Temporal Lobe | 76 | 74 | 47,806,095 | 44,402,560 | 21614001 |
| GSE30573 | SRR309135,SRR309136 | Temporal Lobe | 76 | 74 | 46,627,463 | 42,796,406 | 21614001 |
| GSE13652 | SRR036966 | Whole brain | 32 | 32 | 17,246,957 | 14,860,459 | 18978789 (ref. 14) |
| GSE13652 | SRR014262 | Cerebral Cortex | 32 | 32 | 31,940,303 | 25,811,445 | 18978789 |
| SRP004879 | SRR085474 | Frontal Lobe | 36 | 35 | 15,772,947 | 15,275,200 | 21283692 |
| GSE30573 | SRR309139,SRR309140 | Frontal Lobe | 76 | 76 | 54,181,963 | 48,345,199 | 21614001 |
| SRP004879 | SRR085471 | Temporal Lobe | 36 | 36 | 15,256,752 | 14,752,736 | 21283692 |
| GSE30573 | SRR309141,SRR309142 | Temporal Lobe | 76 | 76 | 50,989,533 | 46,450,238 | 21614001 |
| GSE30573 | SRR309143,SRR309144 | Temporal Lobe | 76 | 73 | 52,560,671 | 48,380,620 | 21614001 |
| SRP004879 | SRR085725 | Whole brain | 36 | 36 | 13,442,077 | 12,967,752 | 21283692 |
| SRP005169 | SRR090440 | Frontal Gyrus | 76 | 76 | 32,196,586 | 24,836,371 | N/A |
| SRP005169 | SRR090441 | Frontal Gyrus | 76 | 76 | 31,890,926 | 24,732,128 | N/A |
| SRP005169 | SRR090442 | Frontal Gyrus | 76 | 76 | 35,664,636 | 26,495,200 | N/A |
| SRP005169 | SRR107727 | Frontal Gyrus | 100 | 100 | 21,277,649 | 9,059,914 | N/A |
| SRP005169 | SRR111895 | Frontal Gyrus | 100 | 100 | 21,284,713 | 17,360,595 | N/A |
| SRP005169 | SRR111896 | Frontal Gyrus | 100 | 100 | 20,754,409 | 14,680,990 | N/A |
| SRP005169 | SRR111897 | Frontal Gyrus | 100 | 100 | 23,722,421 | 19,914,343 | N/A |
| SRP005169 | SRR111898 | Frontal Gyrus | 100 | 100 | 23,416,250 | 18,211,563 | N/A |
| SRP005169 | SRR111899 | Frontal Gyrus | 100 | 100 | 22,698,303 | 18,022,392 | N/A |
| SRP005169 | SRR111900 | Frontal Gyrus | 100 | 100 | 23,934,412 | 19,476,603 | N/A |
| SRP005169 | SRR111901 | Frontal Gyrus | 100 | 100 | 17,759,057 | 13,810,076 | N/A |
| SRP005169 | SRR111902 | Frontal Gyrus | 100 | 100 | 19,901,399 | 15,191,854 | N/A |
| SRP005169 | SRR111903 | Frontal Gyrus | 100 | 100 | 23,201,284 | 18,508,848 | N/A |
| SRP005169 | SRR111904 | Frontal Gyrus | 100 | 100 | 16,019,209 | 12,570,780 | N/A |
| SRP005169 | SRR111905 | Frontal Gyrus | 100 | 100 | 20,948,595 | 16,934,309 | N/A |
| SRP005169 | SRR111906 | Frontal Gyrus | 100 | 100 | 21,032,459 | 16,889,571 | N/A |
| SRP005169 | SRR111907 | Frontal Gyrus | 100 | 100 | 20,255,260 | 16,635,310 | N/A |
| SRP005169 | SRR111935 | Cerebellum | 76 | 50 | 36,310,372 | 30,702,201 | N/A |
| SRP005169 | SRR111936 | Cerebellum | 76 | 50 | 36,006,838 | 28,457,209 | N/A |
| SRP005169 | SRR111937 | Cerebellum | 76 | 50 | 32,584,988 | 26,627,941 | N/A |
| SRP005169 | SRR112600 | Frontal Gyrus | 76 | 50 | 43,457,126 | 35,231,630 | N/A |
| SRP005169 | SRR112601 | Frontal Gyrus | 76 | 50 | 40,322,760 | 32,696,000 | N/A |
| SRP005169 | SRR112672 | Frontal Gyrus | 76 | 50 | 41,156,904 | 34,417,974 | N/A |
| SRP005169 | SRR112673 | Frontal Gyrus | 76 | 50 | 39,199,852 | 33,239,677 | N/A |
| SRP005169 | SRR112674 | Frontal Gyrus | 76 | 50 | 45,298,510 | 36,357,866 | N/A |
| SRP005169 | SRR112675 | Frontal Gyrus | 76 | 50 | 43,922,080 | 36,253,577 | N/A |
| GSE30352 | SRR306839 | Frontal Cortex | 76 | 76 | 18,850,030 | 11,417,815 | 22012392 (ref. 15) |
| GSE30352 | SRR306840 | Prefrontal Cortex | 76 | 76 | 45,153,410 | 29,390,122 | 22012392 |
| GSE30352 | SRR306841 | Prefrontal Cortex | 76 | 76 | 24,325,223 | 16,001,002 | 22012392 |
| GSE30352 | SRR306842 | Prefrontal Cortex | 76 | 76 | 34,845,988 | 20,637,444 | 22012392 |
| GSE30352 | SRR306844 | Cerebellum | 76 | 76 | 32,698,558 | 22,278,508 | 22012392 |
| GSE30554 | SRR309262 | Whole brain | 50 | 50 | 62,276,114 | 51,885,655 | 21890647 (ref. 16) |


Supplementary Table 5 – Bodymap dataset samples

| GEO # | SRR #'s | Tissue | Read length | # reads | # mapped reads |
| --- | --- | --- | --- | --- | --- |
| GSE30611 | ERR030888,ERR030880 | Adipose | 50,75 | 230,869,369 | 210,467,294 |
| GSE30611 | ERR030889,ERR030881 | Adrenal | 50,75 | 225,117,311 | 207,342,765 |
| GSE30611 | ERR030890,ERR030882 | Brain | 50,75 | 211,339,298 | 196,143,878 |
| GSE30611 | ERR030891,ERR030883 | Breast | 50,75 | 228,919,690 | 209,895,271 |
| GSE30611 | ERR030892,ERR030884 | Colon | 50,75 | 245,132,643 | 228,243,609 |
| GSE30611 | ERR030894,ERR030886 | Heart | 50,75 | 242,604,430 | 229,742,305 |
| GSE30611 | ERR030893,ERR030885 | Kidney | 50,75 | 240,567,067 | 219,752,478 |
| GSE30611 | ERR030895,ERR030887 | Liver | 50,75 | 237,551,123 | 224,040,054 |
| GSE30611 | ERR030896,ERR030879 | Lung | 50,75 | 239,849,248 | 222,800,771 |
| GSE30611 | ERR030897,ERR030878 | Lymph node | 50,75 | 246,072,774 | 223,902,952 |
| GSE30611 | ERR030900,ERR030875 | Lymphocyte | 50,75 | 245,219,969 | 230,242,496 |
| GSE30611 | ERR030901,ERR030874 | Ovary | 50,75 | 242,895,572 | 226,037,903 |
| GSE30611 | ERR030898,ERR030877 | Prostate | 50,75 | 247,988,054 | 233,522,418 |
| GSE30611 | ERR030899,ERR030876 | Skeletal muscle | 50,75 | 247,086,914 | 230,449,519 |
| GSE30611 | ERR030902,ERR030873 | Testes | 50,75 | 245,716,717 | 229,162,601 |
| GSE30611 | ERR030903,ERR030872 | Thyroid | 50,75 | 244,072,431 | 226,332,460 |

Data from Illumina Human BodyMap 2.0


Supplementary Table 6 - Datasets used for analysis in primates and mouse

Species Number of datasets Read numbers Read lengths (bp) References

Chimpanzee 12 391,381,439 75, 76 and 101 15,17

Rhesus monkey 13 330,830,959 75, 76 and 101 17-19

Mouse 48 562,407,149 36 and 72 20,21


Supplementary Table 7 – Validation of nonsynonymous non-repetitive editing sites by deep sequencing

Columns: Chr, Position, Gene, RNAseq Edit Level ᵃ, RNAseq Reads ᵃ, followed by Reads, Edit Level and P value for each of four samples in this order: Brain Reference Total RNA ᶜ, Mixed Brain RNA ᵇ, Frontal Lobe RNA ᵇ, Frontal Lobe Genomic DNA ᵇ.

Validated

chr1 2436080 PLCH2 0.129 387 8084 0.016 2.799E-12 10490 0.005 1 7421 0.051 6.3E-74 2173 0.004 1

chr1 26611864 uc010ofb.1 0.167 72 752 0.097 2.808E-16 1538 0.033 4.268E-8 222 0.005 1 2786 0.004 1

chr1 160319987 NCSTN 0.123 277 9873 0.031 3.207E-49 14558 0.016 1.698E-24 12870 0.005 1 187 0.016 1

chr1 247264181 ZNF669 0.046 130 7899 0.006 1 12117 0.011 1.824E-07 6506 0.005 1 1566 0.006 1

chr10 46999805 GPRIN2 0.2 35 2831 0.009 1 4755 0.012 0.0003946 12 0 1 5443 0.002 1

chr11 61135627 uc010rli.1 0.122 49 4569 0.107 4.46E-112 6262 0.014 6.714E-08 5020 0.124 2E-146 3137 0.003 1

chr11 61724916 BEST1 0.222 361 587 0.039 5.899E-04 947 0.024 0.009 445 0.002 1 8910 0.003 1

chr12 50475424 ACCN2 0.164 317 12415 0.023 3.421E-37 14297 0.004 1 7683 0.002 0.113 10217 0.002 0.5

chr12 69140502 SLC35E3 0.049 102 9080 0.011 1.177E-06 9896 0.072 4.07E-154 1261 0.003 1 710 0.001 1

chr14 70990511 ADAM20 0.4 5 10976 0.033 1.802E-59 15100 0.137 0 9702 0.003 1 12478 0.002 0.18

chr14 105058377 BX927359.1 0.279 323 8338 0.385 0 16860 0.23 0 11049 0.289 0 3816 0.004 1

chr15 65425992 PDCD7 0.089 45 22236 0.374 0 46036 0.122 0 8153 0.565 0 16289 0.005 1

chr15 65426067 PDCD7 0.038 53 14358 0.171 0 24406 0.275 0 671 0.335 2.5E-54 7391 0.01 0.003

chr17 43045220 C1QL1 0.076 79 110 0 1 216 0.005 1 168 0.125 6.3E-04 1522 0.002 1

chr17 37826529 PNMT 0.26 181 1857 0.229 7.39E-104 1584 0.247 1.97E-95 1465 0.113 2.6E-38 2993 0.005 1

chr17 42475813 GPATCH8 0.065 308 2148 0.036 6.312E-13 7606 0.022 2.36E-22 7452 0.037 2.3E-48 2873 0.006 1

chr17 46198763 SNX11 0.073 206 7585 0.034 4.435E-43 11129 0.068 9.6E-161 1361 0.029 2.2E-05 1681 0.011 1

chr17 48703595 CACNA1G 0.041 490 10730 0.006 1 11680 0.057 9.1E-134 6478 0.007 0.600 3939 0.006 1

chr18 32825609 ZNF397 0.089 112 7148 0.051 4.204E-71 6051 0.009 0.04 1722 0.017 0.011 12881 0.002 0.014

chr18 32825654 ZNF397 0.09 122 1057 0.057 4.067E-12 1328 0.051 1.36E-12 903 0.008 1 1202 0.002 1

chr18 55274824 NARS 0.022 929 9888 0.081 5.98E-176 10636 0.006 1 8889 0.004 1 15141 0.003 1

chr19 3592462 AC116968.1 0.182 11 8138 0.017 4.594E-14 6560 0.003 1 7228 0.004 1 3511 0.002 1

chr19 13947328 uc002mxi.3 0.333 6 166 0.452 5.776E-17 754 0.001 1 371 0.008 1 13189 0.004 1

chr19 47152737 DACT3 0.283 53 15543 0.61 0 19798 0.773 1 16265 0.113 0 10 0.8 1

chr19 52538520 ZNF432 0.043 69 22369 0.004 1 22311 0.009 1.5E-07 20680 0.004 1 14535 0.004 1

chr19 56895363 ZNF582 0.102 49 9008 0.19 0 13831 0.101 0 1009 0.269 1.8E-66 6723 0.003 1

chr19 56895428 ZNF582 0.175 40 1749 0.087 7.179E-34 1279 0.076 5.63E-21 1060 0.1 3.5E-24 236 0 1

chr2 69741674 AAK1 0.031 259 8509 0.017 6.63E-16 8685 0.005 1 8667 0.004 1 8426 0.003 1

chr2 210835613 UNC80 0.065 215 14188 0.056 3.95E-158 12051 0.122 0 9956 0.133 0 9835 0.005 1

chr2 219563809 STK36 0.051 198 5680 0.01 0.005 12673 0.013 4.68E-14 9290 0.016 2.9E-14 13815 0.002 0.464

chr21 34922801 SON 0.137 131 6757 0.237 0 6084 0.054 6.45E-66 4960 0.097 2.0E-108 1098 0.001 1

chr22 46657119 PKDREJ 0.286 7 10358 0.004 1 19227 0.004 1 16649 0.009 3.3E-06 14537 0.005 1

chr3 147128798 ZIC1 0.027 448 20164 0.012 6.251E-16 24644 0.008 6.34E-07 18291 0.006 0.144 11838 0.005 1

chr3 179093066 MFN1 0.017 767 10402 0.015 4.403E-14 12654 0.005 1 9921 0.01 1.3E-05 5358 0.003 1

chr4 775178 AC139887.1 0.167 24 2616 0.024 1.06E-08 16189 0.147 0 11271 0.005 1 7330 0.005 1

chr4 2835556 SH3BP2 0.147 116 12195 0.29 0 16080 0.125 0 11304 0.01 1.4E-05 4035 0.002 1

chr4 17805279 DCAF16 0.072 222 8966 0.038 4.238E-59 11276 0.037 4.20E-72 10488 0.04 1.5E-75 12111 0.002 0.324


chr5 38949495 uc003jlo.2 0.054 37 6275 0.213 0 8395 0.19 0 4123 0.439 0 97 0.01 1

chr6 34100903 GRM4 0.079 190 20810 0.104 0 18906 0.102 0 15469 0.203 0 9991 0.004 1

chr6 44120349 TMEM63B 0.155 569 17895 0.237 0 8756 0.187 0 16726 0.132 0 4143 0.006 1

chr6 142468514 uc011edu.1 0.012 754 14148 0.006 1 9166 0.004 1 11878 0.01 1.2E-05 7386 0.004 1

chr6 167352220 uc003qvi.1 0.182 44 9821 0.03 2.678E-47 10306 0.015 2.04E-14 1379 0.004 1 2964 0.002 1

chr6 167352256 uc003qvi.1 0.241 29 9015 0.056 7.4E-101 10068 0.001 0.012 1397 0.004 1 2755 0.003 1

chr7 39990527 CDK13 0.312 16 1847 0.313 3.55E-141 1359 0.241 1.73E-79 909 0.367 8.3E-81 731 0.01 1

chr9 132571671 TOR1B 0.128 234 10977 0.177 0 11815 0.145 0 4945 0.002 1 8203 0.001 0.069

chrX 2825467 ARSD 0.125 40 10562 0.033 2.559E-57 10826 0.005 1 9472 0.004 1 4930 0.004 1

Not Validated

chr1 111717535 CEPT1 0.008 236 8206 0.002 0.074 11505 0.001 0.001 5675 0.001 0.045 829 0 1

chr1 146650158 uc001epg.1 0.146 41 8905 0.004 1 9260 0.003 1 9037 0.002 1 1 0 1

chr1 153610779 C1orf77 0.025 315 6791 0.004 1 13071 0.003 1 6528 0.003 1 2528 0.003 1

chr11 46388498 DGKZ 0.157 70 662 0.003 1 683 0.006 1 560 0.005 1 8624 0.006 1

chr11 66620033 PC 0.009 326 330 0.012 1 17685 0.005 1 9247 0.007 1 3842 0.007 1

chr11 88029400 CTSC 0.028 71 575 0.003 1 945 0.002 1 1276 0.008 1 971 0.001 1

chr13 28519518 ATP5EP2 1 3 789 0.003 1 1 0 1 2 0 1 5783 0.001 0.061

chr14 73746097 NUMB 0.215 65 4632 0.002 1 7371 0.003 1 301 0.013 1 31 0 1

chr16 22134962 VWA3A 0.117 60 12368 0.002 0.068 11274 0.002 0.683 6833 0.003 1 13809 0.003 1

chr17 26902513 ALDOC 0.014 2164 7332 0.006 1 5054 0.006 1 4072 0.007 1 10818 0.005 1

chr17 35902595 SYNRG 0.047 107 6285 0.003 1 7485 0.005 1 3969 0.003 1 916 0.005 1

chr17 74714882 JMJD6 0.006 638 368 0.005 1 258 0.004 1 83 0 1 3886 0.007 1

chr18 33719511 uc010xcg.1 0.068 44 1547 0.007 1 2201 0.003 1 1666 0.005 1 11028 0.007 0.832

chr19 23544208 ZNF91 0.009 107 1277 0.001 1 1408 0.006 1 1551 0.005 1 1233 0.007 1

chr19 38189366 ZNF607 0.079 101 799 0.013 1 969 0.001 1 681 0.018 1 1482 0.003 1

chr19 50662822 IZUMO2 0.667 3 6396 0.004 1 2865 0.003 1 7 0 1 219 0.014 1

chr19 54485579 CACNG8 0.107 56 4717 0.007 1 5369 0.004 1 3011 0.004 1 0 0 1

chr2 17699634 RAD51AP2 0.333 6 9719 0.002 0.727 13366 0.003 1 16361 0.004 1 3 0 1

chr2 64779254 AFTPH 0.01 509 4094 0.003 1 2785 0.001 1 3947 0.002 1 18240 0.003 0.994

chr2 97364828 FER1L5 0.6 5 2644 0.003 1 2 0 1 2 0 1 6228 0.003 1

chr5 96073628 CAST 0.03 133 15676 0.005 1 13685 0.004 1 8619 0.008 0.119 6296 0.006 1

chr5 118970017 FAM170A 1 2 15356 0.006 0.328 12529 0.005 1 22 0 1 10965 0.005 1

chr5 148999948 ARHGEF37 0.017 242 17922 0.005 1 14514 0.004 1 19151 0.004 1 13360 0.004 1

chr6 49667583 CRISP2 0.3 10 178 0.006 1 2158 0.001 1 6 0 1 9578 0.004 1

chr6 74176314 MTO1 0.005 435 12031 0.006 0.382 11429 0.005 1 9637 0.005 1 12284 0.005 1

chr7 39990302 CDK13 0.167 66 172 0.006 1 2944 0.008 1 5 0 1 3610 0.014 0.001

chr7 39990736 CDK13 0.28 25 27 0.111 1 44 0.023 1 2 0 1 43 0 1

chr7 45141465 TBRG4 0.015 330 12127 0.005 1 7972 0.005 1 7373 0.005 1 6404 0.003 1

chr9 21384898 IFNA2 0.444 9 3896 0.005 1 10420 0.005 1 5183 0.007 1 6484 0.005 1

chr9 21384902 IFNA2 0.5 8 3887 0.002 1 10407 0.003 1 5169 0.002 1 6484 0.003 1

chr9 131381164 SPTAN1 0.014 1109 690 0 1 574 0.009 1 492 0.004 1 35 0 1

chr9 139720242 C9orf86 0.011 901 15230 0.004 1 12938 0.004 1 12816 0.004 1 9018 0.004 1

chrX 48435461 RBM3 0.044 1286 5160 0.003 1 5571 0.003 1 4521 0.005 1 488 0.002 1


chrX 70324174 CXorf65 0.4 5 11 0 1 631 0.002 1 0 0 1 2981 0.002 1

Insufficient Data

chr1 3759756 KIAA0562 0.333 9 0 0 1 0 0 1 0 0 1 53 0 1

chr19 14593605 GIPC1 0.024 289 0 0 1 2 0 1 0 0 1 50 0 1

chr19 47152854 DACT3 0.809 47 0 0 1 0 0 1 0 0 1 0 0 1

chr2 88825980 C2orf51 0.667 3 12 0 1 19 0 1 10 0 1 17655 0.006 1

chr20 23545585 CST9L 0.667 3 12 0 1 18 0 1 2 0 1 5351 0.002 1

chr20 29628273 uc010ztk.1 0.415 82 0 0 1 0 0 1 0 0 1 0 0 1

chr7 39990590 CDK13 0.231 26 16 0 1 3 0 1 6 0 1 3597 0.004 1

The validated sites and tissues are highlighted in yellow (Benjamini-Hochberg correction: P value < 0.05).
ᵃ RNA-seq data that were used to identify putative A-to-I RNA editing sites. The “Reads” and “Edit Level” were calculated based on the pooled brain RNA-seq datasets.
ᵇ These three are postmortem samples from a single human individual (Biochain Institute).
ᶜ This is a pooled set of brain samples from 23 individuals (Ambion, Cat # 6050).
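The table footnote names the Benjamini-Hochberg correction but does not describe the procedure or the test that produced the raw P values. As a minimal sketch of the standard step-up adjustment, the following Python function computes adjusted P values; the function name and the example P values are made up for illustration and are not taken from the table.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for four sites (not from the table).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

A site would then be called validated in a given sample when its adjusted P value falls below 0.05, the cutoff stated in the footnote.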


Supplementary Table 8 - Datasets used for analysis in Drosophila

Species Stage Read number Read length Reference or Accession

D. melanogaster Adult 5 day male and female (y1; cn bw1 sp1) 185,521,954 76 bp 3

D. melanogaster Adult 5~7 day male and female heads (OregonR) 77,060,093 76 bp GSE20348

D. melanogaster Adult 1 day male and female (y1; cn bw1 sp1) 226,527,733 76 bp 3

D. simulans Adult 5~7 day male and female 397,123,340 101 bp GSM694262, GSM694263, GSM694264, GSM694265

D. yakuba Adult 5~7 day female 173,949,062 101 bp GSM694271, GSM694272

D. pseudoobscura Adult 5~7 day female 198,103,340 101 bp GSM694279, GSM694280

y1 – yellow, cn – cinnabar, bw1 – brown, sp1 – speck


Supplementary Table 9 – Primer sequences used for high-throughput sequencing validation

Editing Site Primers

Chrom Position Gene cDNA primer (forward) cDNA primer (reverse)

chr1 2436080 PLCH2 GCGTTATCGAGGTCCCCATACCTGACGAACTGC GTGCTCTTCCGATCGGTGGGTGACGTGATGT

chr1 3759756 KIAA0562 GCGTTATCGAGGTCCCTTTAGACCAGGCACCTC GTGCTCTTCCGATCTTATGTGGATGCAGTAGGACA

chr1 26611864 uc010ofb.1 GCGTTATCGAGGTCTCCTGGAGGGCATCAGA GTGCTCTTCCGATCCACCCAAGTGACACCAGT

chr1 111717535 CEPT1 GCGTTATCGAGGTCCAAAAAGGTGGTCCTCCAATC GTGCTCTTCCGATCTGGGAAACAGGCAAGAAGA

chr1 146650158 uc001epg.1 GCGTTATCGAGGTCGAACTCGGAGTGAGCTTCA GTGCTCTTCCGATCGTGGACACTGCAAGAGACT

chr1 153610779 C1orf77 GCGTTATCGAGGTCCGTCGGCTGTTTGTTCTTC GTGCTCTTCCGATCACTACCGCTTACGCATCTAAC

chr1 160319987 NCSTN GCGTTATCGAGGTCGGCAATGGTTTGGCTTATGAA GTGCTCTTCCGATCGCATGCTCCACACATTGTAA

chr1 247264181 ZNF669 GCGTTATCGAGGTCCATTCCTTACATTCGTAGGGTTTC GTGCTCTTCCGATCTGTGGAAAAGTCTTTGTACGTC

chr10 46999805 GPRIN2 GCGTTATCGAGGTCAAGTCATTGGCTGAGGTCAT GTGCTCTTCCGATCTGAACAGCTGGCTACCAC

chr11 46388498 DGKZ GCGTTATCGAGGTCTACGCGAGCAGGGCGCTCCA GTGCTCTTCCGATCGGCTACTATCGGCGCCTCAGC

chr11 61135627 uc010rli.1 GCGTTATCGAGGTCGCCTCCTCATTTCCCATAGT GTGCTCTTCCGATCTCATATGGACAGATGGACTTCA

chr11 61724916 BEST1 GCGTTATCGAGGTCTACTCAGTGTGGACACCTGTA GTGCTCTTCCGATCCGACAATCCAGTTGGTCTCAA

chr11 66620033 PC GCGTTATCGAGGTCGAGTGGCCAGCAGTGAC GTGCTCTTCCGATCATGGTAAACGGTCCAACCAC

chr11 88029400 CTSC GCGTTATCGAGGTCTACGGATTCTCGCTTCTAGC GTGCTCTTCCGATCAAGTCTTGGACTGCAACTACA

chr12 50475424 ACCN2 GCGTTATCGAGGTCCCCATCTAGGCCTTTGGTTC GTGCTCTTCCGATCGGTCATTAAGCACAAGCTGT

chr12 69140502 SLC35E3 GCGTTATCGAGGTCGGAGAAGGTTTTCTGGTAGCA GTGCTCTTCCGATCTGGCATTGCTGGTGGAC

chr13 28519518 ATP5EP2 GCGTTATCGAGGTCAGTCAGGGTAGATTATTCCTTCTTT GTGCTCTTCCGATCGCCAGAGTCTTGGCGATTT

chr14 70990511 ADAM20 GCGTTATCGAGGTCAAACGACCAACCTGTTGTCT GTGCTCTTCCGATCCTTCCTACCAGTGGAGACCTA

chr14 73746097 NUMB GCGTTATCGAGGTCTCCAAGCTAATGGCACTGAC GTGCTCTTCCGATCCCTGGAAGAGACCTGGAGA

chr14 105058377 BX927359.1 GCGTTATCGAGGTCATCCCAGCTCCTCCACTTG GTGCTCTTCCGATCGCTCCCACGCCCAGCATCCA

chr15 65425992 PDCD7 GCGTTATCGAGGTCCTGCCCTCCCCGGCTTTCCC GTGCTCTTCCGATCGGCTCGGGGCTGCAGAGCCA

chr15 65426067 PDCD7 GCGTTATCGAGGTCCATTCTTCGGCCAGGGTCGC GTGCTCTTCCGATCGTGGCGGACAGCCGAAAGGA

chr16 22134962 VWA3A GCGTTATCGAGGTCCAGGGAGAGCCACTCAATC GTGCTCTTCCGATCTGCAGAGTGGCTTAAGGTC

chr17 26902513 ALDOC GCGTTATCGAGGTCCTGTCACCATGCCTCACTC GTGCTCTTCCGATCACACCCTTGTCAACCTTGAT

chr17 35902595 SYNRG GCGTTATCGAGGTCATCATCCCTGTCTGACCAGT GTGCTCTTCCGATCTGAAGATGCACTCTCTGTTCA

chr17 37826529 PNMT GCGTTATCGAGGTCTGTCTGAGGAGGAGGTGA GTGCTCTTCCGATCCCACTTCAAAGAACAGGGAATC

chr17 42475813 GPATCH8 GCGTTATCGAGGTCGGTGATCAGCAGTAGCTTCTC GTGCTCTTCCGATCGGAAAGTGGAGAGGAAACCT

chr17 43045220 C1QL1 GCGTTATCGAGGTCCGCCCTGAGCGAGCAGAGCG GTGCTCTTCCGATCTTGCCTGGCTCACCCTTCTCC

chr17 46198763 SNX11 GCGTTATCGAGGTCACCCCTCTCCTCACCAT GTGCTCTTCCGATCAGCCCAAGCAGAGGTGA

chr17 48703595 CACNA1G GCGTTATCGAGGTCGTCCAACTCTAAGCTGCTTCT GTGCTCTTCCGATCGCAGCAATAAGGACTGACTC

chr17 74714882 JMJD6 GCGTTATCGAGGTCTCCCCGTTTCCCACCAT GTGCTCTTCCGATCTGGTGGCATGTTGTCCTC


chr18 32825609 ZNF397 GCGTTATCGAGGTCAACTAAAGGCCTTCCCACAC GTGCTCTTCCGATCTTAGTGTGGAAGCAAGGAAGT

chr18 32825654 ZNF397 GCGTTATCGAGGTCTGTGGGAAGGCCTTTAGTTT GTGCTCTTCCGATCTTTGAGCTCTGGCTAAAGGTT

chr18 33719511 uc010xcg.1 GCGTTATCGAGGTCATTCTGTGCAAGGCTCTGT GTGCTCTTCCGATCCTTGTGAACAGCTTGCTAGGA

chr18 55274824 NARS GCGTTATCGAGGTCGACATCAACGTCAGACTCCT GTGCTCTTCCGATCGATCCAAGTCTCCCAGAGC

chr19 3592462 AC116968.1 GCGTTATCGAGGTCAACCCAGCTCCAGAACTC GTGCTCTTCCGATCCAGGCTGGGTTCCAGAA

chr19 13947328 uc002mxi.3 GCGTTATCGAGGTCTGTGGACCCTGCTCACAAGC GTGCTCTTCCGATCCCAGGGATTTCCAACCGACC

chr19 14593605 GIPC1 GCGTTATCGAGGTCTCGATGCGGCCAGTGGGACT GTGCTCTTCCGATCCTAGTGGAAAATGAGGAGGCTGAG

chr19 23544208 ZNF91 GCGTTATCGAGGTCATGTTTATTAAGGGTTAAGGATTGTC GTGCTCTTCCGATCGGCAAAGCATTTAACTGGTC

chr19 38189366 ZNF607 GCGTTATCGAGGTCTGCCACATTCCTTACATTCGTA GTGCTCTTCCGATCCGCAGAGTTCATACTGGTGA

chr19 47152737 DACT3 GCGTTATCGAGGTCGCGCTCCAACGAGGGCTCCC GTGCTCTTCCGATCCAGAACAGCGTGCGCCAGCG

chr19 47152854 DACT3 GCGTTATCGAGGTCTACATCTCGGCGCTCCT GTGCTCTTCCGATCGGATGTACTGCGCCTTCA

chr19 50662822 IZUMO2 GCGTTATCGAGGTCCTCTCTGAAAGATGAGCCTCT GTGCTCTTCCGATCTCTGTATGTACAAGCCGAGAC

chr19 52538520 ZNF432 GCGTTATCGAGGTCCAGCTTTTGTTCTGGTTGACT GTGCTCTTCCGATCGTCCAGAAAACAACGAAGTTGA

chr19 54485579 CACNG8 GCGTTATCGAGGTCGCGGTAACTGGGCAGAC GTGCTCTTCCGATCTCAATCATTTCCCGGAGGAC

chr19 56895363 ZNF582 GCGTTATCGAGGTCCTACCGTTCAACCTCAGAGAA GTGCTCTTCCGATCACCTCTGAATGAAGGCATTGT

chr19 56895428 ZNF582 GCGTTATCGAGGTCTGACTCAAGGTCTTCTCACAT GTGCTCTTCCGATCACAACTCAAGCAACATCAGAGA

chr2 17699634 RAD51AP2 GCGTTATCGAGGTCGATGGCCGAGCTCAGAAA GTGCTCTTCCGATCCATATCTGCTTCCCACTGACT

chr2 64779254 AFTPH GCGTTATCGAGGTCATTCCTCAGCAGGACTAGGTA GTGCTCTTCCGATCAGTAGTGGAGCAGAGACAGA

chr2 69741674 AAK1 GCGTTATCGAGGTCGCTGCCTTGCAGCAAAA GTGCTCTTCCGATCGTTGGGTTTTGGGGGATGA

chr2 88825980 C2orf51 GCGTTATCGAGGTCACATATACCAAAGCTCCCACAT GTGCTCTTCCGATCCACGGAAAGTCAGCACTCT

chr2 97364828 FER1L5 GCGTTATCGAGGTCCAAAGATGGGATCCAGAGTGT GTGCTCTTCCGATCCATCTACCCCTTTCCTGAGAA

chr2 210835613 UNC80 GCGTTATCGAGGTCCAGTGTTGCTGATGGATTACC GTGCTCTTCCGATCTTTGCTTCCAGATCCTACTGT

chr2 219563809 STK36 GCGTTATCGAGGTCAGTGCCAGCTTTGCTGT GTGCTCTTCCGATCCTCCACATGCCATTTCTAGGA

chr20 23545585 CST9L GCGTTATCGAGGTCGCCTCCTGAACAAGACCT GTGCTCTTCCGATCATTAGGCTCAAGATGTGTGGA

chr20 29628273 uc010ztk.1 GCGTTATCGAGGTCCTTTGTTGGCCTCAAATAGCT GTGCTCTTCCGATCCAGTATCTGTCGGCTTTCAATT

chr21 34922801 SON GCGTTATCGAGGTCCTGGCAACTCAGGCACT GTGCTCTTCCGATCGCAAAAGTGTTAGAGCCTTCA

chr22 46657119 PKDREJ GCGTTATCGAGGTCGGACCAAGTTCATTGCTGTC GTGCTCTTCCGATCATTCAGAGGGTTTCTGGGTTA

chr3 147128798 ZIC1 GCGTTATCGAGGTCCGAAGACCTTGCCACAGC GTGCTCTTCCGATCCAACAAAACTTTCAGCACCAT

chr3 179093066 MFN1 GCGTTATCGAGGTCGAAAACAAAGTTCGAACAGCAC GTGCTCTTCCGATCCCAAAACAGACAGTCGACAA

chr4 775178 AC139887.1 GCGTTATCGAGGTCAGCCGCAGAAAGCGCCCCGACG GTGCTCTTCCGATCCCCACCCTAAGCCCATAGCC

chr4 2835556 SH3BP2 GCGTTATCGAGGTCACCCCTACGGCTACACT GTGCTCTTCCGATCGCTTGGTCCTGGGTTTCT

chr4 17805279 DCAF16 GCGTTATCGAGGTCAACATTGAGTAGAGCCACTCC GTGCTCTTCCGATCAGGGTTTGTCAAAGGGTATCC

chr5 38949495 uc003jlo.2 GCGTTATCGAGGTCACACCAAGCAGGATCGATTT GTGCTCTTCCGATCTCCTGTAGTCCAGTGTCTTCT

chr5 96073628 CAST GCGTTATCGAGGTCGGTGGAACTCATTGGATCTGA GTGCTCTTCCGATCACCAAAGACTAAACCACAAGAC

chr5 118970017 FAM170A GCGTTATCGAGGTCGCTGTCTGATTCTGTCCTTTC GTGCTCTTCCGATCCTCACAACATGTCTCCTTGTC


chr5 148999948 ARHGEF37 GCGTTATCGAGGTCGCAGGCTTTCCTCTACTTCA GTGCTCTTCCGATCAGACGCTTCTTGATCAGGTTC

chr6 34100903 GRM4 GCGTTATCGAGGTCCCTCTGTGCCATCCTTCTC GTGCTCTTCCGATCCATAGATGGGGACATCACACT

chr6 44120349 TMEM63B GCGTTATCGAGGTCGCATCAGGCCTACGAGTT GTGCTCTTCCGATCCATAGACGTGGGAGCTAGGA

chr6 49667583 CRISP2 GCGTTATCGAGGTCAAAGATGGAATGGAGCAGAGA GTGCTCTTCCGATCGGGACTCTTTGGTCCTACAC

chr6 74176314 MTO1 GCGTTATCGAGGTCGCTTCCATCCACCAAAACAAC GTGCTCTTCCGATCCACGATCGGTCAGATGTCA

chr6 142468514 uc011edu.1 GCGTTATCGAGGTCGGCTCAGGAGCATGACA GTGCTCTTCCGATCGCCTTCCATCTGGCATACTT

chr6 167352220 uc003qvi.1 GCGTTATCGAGGTCATGATCTGGAAAGGCTTGGTA GTGCTCTTCCGATCTGCCAGAAATGAGGGCATAC

chr6 167352256 uc003qvi.1 GCGTTATCGAGGTCGGTAGGCTCAGAACATTCCTC GTGCTCTTCCGATCCCGCTTCTGGAAGCATGA

chr7 39990302 CDK13 GCGTTATCGAGGTCGCTGAGGGGACAGGAATC GTGCTCTTCCGATCCGTACCCCACTGTGACC

chr7 39990527 CDK13 GCGTTATCGAGGTCTGGAGGTCAAGCGGCTGGCG GTGCTCTTCCGATCGGCTGGGGCAGCGAGAAGAC

chr7 39990590 CDK13 GCGTTATCGAGGTCCTGGGGCAGCGAGAAGA GTGCTCTTCCGATCGGAGATGGCCAGGATCTGA

chr7 39990736 CDK13 GCGTTATCGAGGTCGAGCAGGGGCTGCTGCTGGG GTGCTCTTCCGATCGCTGGGTGCCGGAGGAGGAG

chr7 45141465 TBRG4 GCGTTATCGAGGTCCTGCAGGAGACGCTGAA GTGCTCTTCCGATCCGCATCTTGTCCTTGAGGTA

chr9 21384898 IFNA2 GCGTTATCGAGGTCGACTCCATTCTGGCTGTGA GTGCTCTTCCGATCTAATGGATCAGTCAGCATGGT

chr9 21384902 IFNA2 GCGTTATCGAGGTCGACTCCATTCTGGCTGTGA GTGCTCTTCCGATCTAATGGATCAGTCAGCATGGT

chr9 131381164 SPTAN1 GCGTTATCGAGGTCTCTTCCACATTGGCTACAAAC GTGCTCTTCCGATCGTGCAGAACCTGAGGAAGAA

chr9 132571671 TOR1B GCGTTATCGAGGTCACAGTGGACTGATCGACAAA GTGCTCTTCCGATCCTAAGGTGCAAACAGGCAAAA

chr9 139720242 C9orf86 GCGTTATCGAGGTCCATCCACTGGAGCTACAAGACC GTGCTCTTCCGATCGTAGTTTCCCAGCACGCACA

chrX 2825467 ARSD GCGTTATCGAGGTCGACTCCGAGCCCCTGTA GTGCTCTTCCGATCTTTTGTTTGCAACGCAGTCA

chrX 48435461 RBM3 GCGTTATCGAGGTCCCGCTACTCAGGAGGAAATTAC GTGCTCTTCCGATCCTGGCTTGATGCACATTTCTT

chrX 70324174 CXorf65 GCGTTATCGAGGTCAGGAGACGAATTCAGATGCTTAA GTGCTCTTCCGATCTTACTTCTTTTGCTTGGTGGT

Chrom Position Gene gDNA primer (forward) gDNA primer (reverse)

chr1 2436080 PLCH2 GCGTTATCGAGGTCCCCATACCTGACGAACTGC GTGCTCTTCCGATCGTGGGTGACGTGATGTCT

chr1 3759756 KIAA0562 GCGTTATCGAGGTCCCTTTAGACCAGGCACCTC GTGCTCTTCCGATCCAGGAAATTGAGGCAGGAAGA

chr1 26611864 uc010ofb.1 GCGTTATCGAGGTCCTCCTGGAGGGCATCAGA GTGCTCTTCCGATCCTGGGTGACCCTGGATCT

chr1 111717535 CEPT1 GCGTTATCGAGGTCATGTGACTGAAGTGCAAATCTT GTGCTCTTCCGATCTCCTGTTTTGGAACCAGAGTTA

chr1 146650158 uc001epg.1 GCGTTATCGAGGTCGGACTGAGGAAGAATTTAAG GTGCTCTTCCGATCCGTTATCATCATACTCGTTC

chr1 153610779 C1orf77 GCGTTATCGAGGTCCATCCATTTTGTTCGACCTTGA GTGCTCTTCCGATCGCCACATAGATGGCAAACTC

chr1 160319987 NCSTN GCGTTATCGAGGTCGGCAATGGTTTGGCTTATGAA GTGCTCTTCCGATCGAGCAAAATGTGGGGAAAGAA

chr1 247264181 ZNF669 GCGTTATCGAGGTCTCGTAGGGTTTCTCTCCAGT GTGCTCTTCCGATCTTCTTCGTTTCTGTTCCAGGT

chr10 46999805 GPRIN2 GCGTTATCGAGGTCCTGGGTTAGTCCCAGAGC GTGCTCTTCCGATCTCCGTACACCTCCCATGT

chr11 46388498 DGKZ GCGTTATCGAGGTCCAGGCGCCAGGTAGCCCTAC GTGCTCTTCCGATCCCAGACCCACAGGAGGTCACA

chr11 61135627 uc010rli.1 GCGTTATCGAGGTCGCCTCCTCATTTCCCATAGT GTGCTCTTCCGATCAGCTAACCAAGTTGGAGACAT

chr11 61724916 BEST1 GCGTTATCGAGGTCACTCAGTGTGGACACCTGTA GTGCTCTTCCGATCATCACATTGCCTCTACTGGAC

chr11 66620033 PC GCGTTATCGAGGTCGAGCTGTGCGGAACCAC GTGCTCTTCCGATCTAATTAACAGGCAGGTGTCTC


chr11 88029400 CTSC GCGTTATCGAGGTCATACGGATTCTCGCTTCTAGC GTGCTCTTCCGATCCCAACCTGATGGCTTTGAAC

chr12 50475424 ACCN2 GCGTTATCGAGGTCCCATCTAGGCCTTTGGTTCA GTGCTCTTCCGATCCTGCCACCACCTTCTCTC

chr12 69140502 SLC35E3 GCGTTATCGAGGTCATTCTGGTGGAGAAGGTTTTCT GTGCTCTTCCGATCCCTGGCTGGGCTTGTATATC

chr13 28519518 ATP5EP2 GCGTTATCGAGGTCGTCAGGGTAGATTATTCCTTCTTTAC GTGCTCTTCCGATCAGTGAGCCAAGATCACATCA

chr14 70990511 ADAM20 GCGTTATCGAGGTCGACCAACCTGTTGTCTTCAAA GTGCTCTTCCGATCCCTACCAGTGGAGACCTAGAT

chr14 73746097 NUMB GCGTTATCGAGGTCTGTAGCTAATGGCACTGACTC GTGCTCTTCCGATCAGTTTTAGAACCAGGGTCACT

chr14 105058377 BX927359.1 GCGTTATCGAGGTCGCTCCTCCACTTGCTTCT GTGCTCTTCCGATCGTGAATGTGTGACTGCATGT

chr15 65425992 PDCD7 GCGTTATCGAGGTCCTGCCCTCCCCGGCTTTCCC GTGCTCTTCCGATCGGCTCGGGGCTGCAGAGCCA

chr15 65426067 PDCD7 GCGTTATCGAGGTCGACAGCCGAAAGGAGCAGGA GTGCTCTTCCGATCCAAAGCATCTCCGTCGTGAACA

chr16 22134962 VWA3A GCGTTATCGAGGTCCACTTCTGCAGCAACAGC GTGCTCTTCCGATCCCTAATCTAAGTGATGGCTTCTGAT

chr17 26902513 ALDOC GCGTTATCGAGGTCCAGGGCAATGTCAGACAAC GTGCTCTTCCGATCCTACCAGGGACAAAAGGAGAT

chr17 35902595 SYNRG GCGTTATCGAGGTCCTGAGTTGCATCATCCCTGT GTGCTCTTCCGATCACAAAGTCAAGATGAAGGGAGA

chr17 37826529 PNMT GCGTTATCGAGGTCGTGTCTGAGGAGGAGGTGA GTGCTCTTCCGATCAATGAATGATGCGATGGACAC

chr17 42475813 GPATCH8 GCGTTATCGAGGTCGGTGATCAGCAGTAGCTTCTC GTGCTCTTCCGATCGAAGAGGGTCAGAAGGTGAC

chr17 43045220 C1QL1 GCGTTATCGAGGTCGCCCTGAGCGAGCAGAGCG GTGCTCTTCCGATCCTTGCCTGGCTCACCCTTCTC

chr17 46198763 SNX11 GCGTTATCGAGGTCACCCCTCTCCTCACCATTA GTGCTCTTCCGATCCAAAACTGGTCACCTGAGGA

chr17 48703595 CACNA1G GCGTTATCGAGGTCGTCCAACTCTAAGCTGCTTCT GTGCTCTTCCGATCTAGGCAGCAATAAGGACTGAC

chr17 74714882 JMJD6 GCGTTATCGAGGTCACCGCAGGAAGAAGAGGA GTGCTCTTCCGATCCGTTTATCTGCATTGGTAGCA

chr18 32825609 ZNF397 GCGTTATCGAGGTCACTAAAGGCCTTCCCACAC GTGCTCTTCCGATCTTAGTGTGGAAGCAAGGAAGT

chr18 32825654 ZNF397 GCGTTATCGAGGTCTGTGGGAAGGCCTTTAGTTT GTGCTCTTCCGATCTTTGAGCTCTGGCTAAAGGTT

chr18 33719511 uc010xcg.1 GCGTTATCGAGGTCCAGAATGTGTGCAGCTTCTC GTGCTCTTCCGATCGTGAAGGTCTGGGGTAAGTT

chr18 55274824 NARS GCGTTATCGAGGTCCTGCTGGAGGAGCTGAC GTGCTCTTCCGATCCCTTCTACTTGTGTTTGCACT

chr19 3592462 AC116968.1 GCGTTATCGAGGTCAACCCAGCTCCAGAACTC GTGCTCTTCCGATCGGGCTCAGCCAGATTCTT

chr19 13947328 uc002mxi.3 GCGTTATCGAGGTCCCTGCTCACAAGCAGCTA GTGCTCTTCCGATCAATCTCATTACCTCCTTTGCTCT

chr19 14593605 GIPC1 GCGTTATCGAGGTCCTCGATGCGGCCAGTGGGAC GTGCTCTTCCGATCCTAGTGGAAAATGAGGAGGCTGAG

chr19 23544208 ZNF91 GCGTTATCGAGGTCATGTTTATTAAGGGTTAAGGATTGTC GTGCTCTTCCGATCGGCAAAGCATTTAACTGGTC

chr19 38189366 ZNF607 GCGTTATCGAGGTCTGCCACATTCCTTACATTCG GTGCTCTTCCGATCGGCCTTTAGTGTATCTGGACAA

chr19 47152737 DACT3 GCGTTATCGAGGTCGCGCTCCAACGAGGGCTCCC GTGCTCTTCCGATCCCCCTGGACGGCTACATCTCGG

chr19 47152854 DACT3 GCGTTATCGAGGTCCCTGGACGGCTACATCTC GTGCTCTTCCGATCATGTACTGCGCCTTCACC

chr19 50662822 IZUMO2 GCGTTATCGAGGTCTCTCCTCAGATGAGCCTCT GTGCTCTTCCGATCGAAGGGAGAGATAGACGATCC

chr19 52538520 ZNF432 GCGTTATCGAGGTCCAGCTTTTGTTCTGGTTGACTA GTGCTCTTCCGATCCTTTCCTAGAAAACAACGAAGTTGA

chr19 54485579 CACNG8 GCGTTATCGAGGTCCTGGGCAGACGGAGGAT GTGCTCTTCCGATCACTTGGGTCAGCCAGATC

chr19 56895363 ZNF582 GCGTTATCGAGGTCCTACCGTTCAACCTCAGAGAA GTGCTCTTCCGATCAAAATGTGGTTCTGGAATGTGT

chr19 56895428 ZNF582 GCGTTATCGAGGTCTTCATCATCAGGTTATTCACACT GTGCTCTTCCGATCCAAGAGGGAGAAGTAAGTCAC

chr2 17699634 RAD51AP2 GCGTTATCGAGGTCGTCCTCAGGAGGCGTTAA GTGCTCTTCCGATCTTCGTTCTACCACAGGGAAAA


chr2 64779254 AFTPH GCGTTATCGAGGTCGACGGAAGCCTCTTAGCAC GTGCTCTTCCGATCCTGATCCTCTCCATCAGTGTC

chr2 69741674 AAK1 GCGTTATCGAGGTCGCTGCCTTGCAGCAAAA GTGCTCTTCCGATCGCAACCTGGACCCTCATC

chr2 88825980 C2orf51 GCGTTATCGAGGTCCCAAAGCTCCCACATGGT GTGCTCTTCCGATCATACATGCACATGCACAGAAA

chr2 97364828 FER1L5 GCGTTATCGAGGTCGTGACCCTTATGTGATCCTGAA GTGCTCTTCCGATCTCCCCTGGTTATGGTTGAAAA

chr2 210835613 UNC80 GCGTTATCGAGGTCCCAGTGTTGCTGATGGATTAC GTGCTCTTCCGATCACTGAAGTGGTTCCTGAAGTT

chr2 219563809 STK36 GCGTTATCGAGGTCAGTGCCAGCTTTGCTGT GTGCTCTTCCGATCCTCCACATGCCATTTCTAGGA

chr20 23545585 CST9L GCGTTATCGAGGTCCAGTTCAGCCTCCTGAACAA GTGCTCTTCCGATCTCCATCTACCCAGATCCTACT

chr20 29628273 uc010ztk.1 GCGTTATCGAGGTCTTTGTTGGCCTCAAATAGCT GTGCTCTTCCGATCTGCATGGCAGGCATATATCTT

chr21 34922801 SON GCGTTATCGAGGTCTGCTACCCCGGTGCCAGAGT GTGCTCTTCCGATCTCCTGTGGTGGCTCCAACCC

chr22 46657119 PKDREJ GCGTTATCGAGGTCGGACCAAGTTCATTGCTGTC GTGCTCTTCCGATCATTCAGAGGGTTTCTGGGTTA

chr3 147128798 ZIC1 GCGTTATCGAGGTCCGAAGACCTTGCCACAGC GTGCTCTTCCGATCCCCATCAAGCAAGAGCTCATC

chr3 179093066 MFN1 GCGTTATCGAGGTCAGTTCGAACAGCACACTATCA GTGCTCTTCCGATCGTTCTTCCACTGCTTCAAAGA

chr4 775178 AC139887.1 GCGTTATCGAGGTCAGCCGCAGAAAGCGCCCCGACG GTGCTCTTCCGATCCCCACCCTAAGCCCATAGCC

chr4 2835556 SH3BP2 GCGTTATCGAGGTCCCACATGGACTGCCATCA GTGCTCTTCCGATCGGGAAGTGAGGTGGGAAC

chr4 17805279 DCAF16 GCGTTATCGAGGTCCCAGAAACACATGAATTAGGGATC GTGCTCTTCCGATCCTCTCCCATTGTTCCCATTGT

chr5 38949495 uc003jlo.2 GCGTTATCGAGGTCCCTACCTGAAAAGGTTGTTTGTT GTGCTCTTCCGATCTGGCATCTCCAGCAAAAGAT

chr5 96073628 CAST GCGTTATCGAGGTCGATACTTTAGGAGGACCTGAAGA GTGCTCTTCCGATCGGTCTTGAATGCCTGACTTC

chr5 118970017 FAM170A GCGTTATCGAGGTCCCACCAGAAACCTCCTGTC GTGCTCTTCCGATCGGTGCTCTCTGATTCCATGTT

chr5 148999948 ARHGEF37 GCGTTATCGAGGTCGGTCCAGGCTTTCCTCTAC GTGCTCTTCCGATCGCCAAAGTGCCTGACTTCTA

chr6 34100903 GRM4 GCGTTATCGAGGTCCCTCTGTGCCATCCTTCTC GTGCTCTTCCGATCCATAGATGGGGACATCACACT

chr6 44120349 TMEM63B GCGTTATCGAGGTCCGTGAAGACGCACATCATC GTGCTCTTCCGATCCTCAAGTCCACCCAACTCTT

chr6 49667583 CRISP2 GCGTTATCGAGGTCGGAATGGAGCAGAGAGGTAAC GTGCTCTTCCGATCGTTCTTATTGTAAATACCCCTCTGTAT

chr6 74176314 MTO1 GCGTTATCGAGGTCCTTACAGAACCAGAGCCTGAA GTGCTCTTCCGATCCATGGTACTGATGAGCCTAGA

chr6 142468514 uc011edu.1 GCGTTATCGAGGTCGGCTCAGGAGCATGACAA GTGCTCTTCCGATCCTTCTTTCGGACACTCACCT

chr6 167352220 uc003qvi.1 GCGTTATCGAGGTCTATTCGGCTACTGGGATCATT GTGCTCTTCCGATCCTTCCAGAGTTTCCGTCTTTC

chr6 167352256 uc003qvi.1 GCGTTATCGAGGTCGGTAGGCTCAGAACATTCCTC GTGCTCTTCCGATCGCAGCCTGTTTCACTACTTT

chr7 39990302 CDK13 GCGTTATCGAGGTCGCTGAGGGGACAGGAATC GTGCTCTTCCGATCCCCACTGTGACCTGGAAC

chr7 39990527 CDK13 GCGTTATCGAGGTCTGGAGGTCAAGCGGCTGGCG GTGCTCTTCCGATCGGCTGGGGCAGCGAGAAGAC

chr7 39990590 CDK13 GCGTTATCGAGGTCCTGGGGCAGCGAGAAGA GTGCTCTTCCGATCGGAGATGGCCAGGATCTGA

chr7 39990736 CDK13 GCGTTATCGAGGTCGAGCAGGGGCTGCTGCTGGG GTGCTCTTCCGATCGCTGGGTGCCGGAGGAGGAG

chr7 45141465 TBRG4 GCGTTATCGAGGTCCTGCAGGAGACGCTGAA GTGCTCTTCCGATCGGGGACATCCCTCAGAAAC

chr9 21384898 IFNA2 GCGTTATCGAGGTCCTCCCCTGATGAAGGAGGA GTGCTCTTCCGATCAATCGTGTCATGGTCATAGCA

chr9 21384902 IFNA2 GCGTTATCGAGGTCCTCCCCTGATGAAGGAGGA GTGCTCTTCCGATCAATCGTGTCATGGTCATAGCA

chr9 131381164 SPTAN1 GCGTTATCGAGGTCTGATTTTCTTCCTGTCTCCTGT GTGCTCTTCCGATCACTGAGACTGACAAAGCCATA

chr9 132571671 TOR1B GCGTTATCGAGGTCACAGTGGACTGATCGACAAA GTGCTCTTCCGATCCTAAGGTGCAAACAGGCAAAA


chr9 139720242 C9orf86 GCGTTATCGAGGTCCTACATCCCAGACTTCAACCTT GTGCTCTTCCGATCAGGGTCCCCTTCCAGTAT

chrX 2825467 ARSD GCGTTATCGAGGTCGACTCCGAGCCCCTGTA GTGCTCTTCCGATCATTTTGTTTGCAACGCAGTCA

chrX 48435461 RBM3 GCGTTATCGAGGTCATGTGCATGTCTCATTTCAGT GTGCTCTTCCGATCAGTTGCGACATGGTAAAGTAT

chrX 70324174 CXorf65 GCGTTATCGAGGTCGCAAGTCTTACCGGGACA GTGCTCTTCCGATCTCCACTTTAGGAGGGACGT

Barcode PCR Primers

Forward primer AATGATACGGCGACCACCGACACTCTTTCCCTACACGAGCGTTATCGAGGTC

Reverse primer 1 (BARCODE 1) CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

Reverse primer 2 (BARCODE 2) CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

Reverse primer 3 (BARCODE 3) CAAGCAGAAGACGGCATACGAGATCTCATGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

Reverse primer 4 (BARCODE 4) CAAGCAGAAGACGGCATACGAGATACCGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

MiSeq custom sequencing primer (R1) CCACCGACACTCTTTCCCTACACGAGCGTTATCGAGGTC
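The layout of the barcode PCR primers can be read directly from the sequences listed above. The short Python sketch below strips the prefix and suffix shared by the four reverse primers to expose the stretch that differs between them, which is presumably the barcode, and checks that the barcode primers end in the same 14-nt tails that begin the site-specific primers. The variable and function names are illustrative assumptions; only the sequences come from the table.

```python
import os

FORWARD_BARCODE_PRIMER = "AATGATACGGCGACCACCGACACTCTTTCCCTACACGAGCGTTATCGAGGTC"
REVERSE_BARCODE_PRIMERS = [
    "CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
    "CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
    "CAAGCAGAAGACGGCATACGAGATCTCATGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
    "CAAGCAGAAGACGGCATACGAGATACCGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]
# 5' tails shared by every site-specific forward and reverse primer.
FORWARD_TAIL = "GCGTTATCGAGGTC"
REVERSE_TAIL = "GTGCTCTTCCGATC"


def variable_cores(primers):
    """Return the part of each primer left after removing the shared ends."""
    prefix = os.path.commonprefix(primers)
    # Common suffix = reversed common prefix of the reversed strings.
    suffix = os.path.commonprefix([p[::-1] for p in primers])[::-1]
    return [p[len(prefix):len(p) - len(suffix)] for p in primers]


barcodes = variable_cores(REVERSE_BARCODE_PRIMERS)
print(barcodes)  # ['TACAAG', 'CTTCGA', 'CTCATG', 'ACCGGC']
print(FORWARD_BARCODE_PRIMER.endswith(FORWARD_TAIL))  # True
print(all(p.endswith(REVERSE_TAIL) for p in REVERSE_BARCODE_PRIMERS))  # True
```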


Supplementary Table 10 – Primer sequences used for Sanger validation

chr sites gDNA primers cDNA primers

A-to-G sites in Supplementary Figure 13

chr1 2436080 gDNA left: CCCGTGATTAGAAGGGTGAA right: CAAAGTCATCAGCAGTGAGGT; cDNA left: CCCGTGATTAGAAGGGTGAA right: CAAAGTCATCAGCAGTGAGGT

chr3 53820892 gDNA left: CTCTCCTTCTGAGCATGTATGT right: CATTTGGGGGAGTGTGTAGTAT; cDNA left: GCAAGAGATTAGTTGCCATGAA right: GCTGAAGGGAATCTCTCCTATC

chr4 2835556 gDNA left: CCCACAGGACTCTAAGTTCTAC right: GTAATCCCCTCCTTCCCAATC; cDNA left: ACCTCTAACAAAGTGAGGAACT right: GTAATCCCCTCCTTCCCAATC

chr6 44120349 gDNA left: TACCCTAGACATTAGGTCCTGT right: TTACAGGGTGGTGTAAAAGTGT; cDNA left: ACAAGTGCTACACTTTCCTCAT right: GCACTGAGGTATTTGAAGTGTC

chr9 21384898, 21384902 gDNA left: ATAATGGATCAGTCAGCATGGT right: CAATCTCTTCAGCACAAAGGAC; cDNA left: ATAATGGATCAGTCAGCATGGT right: CAATCTCTTCAGCACAAAGGAC

chr9 132571671 gDNA left: CTTAGGGCATACTGTGCTAGAA right: CCAAAAGTCTAAGGTGCAAACA; cDNA left: TCTAGACTACTACGAGCAGGTT right: CCAAAAGTCTAAGGTGCAAACA

chr19 56895363, 56895428 gDNA left: AGGCAAAACAGAACTGAGAGAT right: CAGGTGGATGGAGATCAAGATT; cDNA left: AGGCAAAACAGAACTGAGAGAT right: CAGGTGGATGGAGATCAAGATT

chr1 160319987 gDNA left: TACTTCTTTGAGTCACCACCTC right: TATGACACAGCTTGGAGATCAA; cDNA left: CCTACAGTGGGTATTGACTGAT right: TCAGGTTGTGATCTTGATAGCA


Supplementary Notes

Supplementary Note 1 – RNA editing sites identified in lymphoblastoid cell lines

We applied our two methods to identify RNA editing sites using RNA-seq data obtained from 40 lymphoblastoid cell lines (Supplementary Table 1). This dataset presents an ideal scenario for our two methods for three reasons: i) all of the samples are from the same cell type, ii) the RNA was sequenced with the same read length and at similar depth, and iii) common SNPs in these samples were previously identified by the 1000 Genomes Project (ref. 22). We categorized both A-to-G and T-to-C mismatches as potential A-to-I editing sites, because the RNA-seq libraries were not strand-specific (see below). Using the Separate Samples method, we found that RNA variants shared by one or more samples in Alu repeats and RNA variants shared by two or more samples in non-Alu regions were highly enriched for potential A-to-I editing sites (Supplementary Fig. 1a). We identified 359,938 variants in Alu repeats, 287,342 (79.8%) of which were A-to-G or T-to-C mismatches (Supplementary Fig. 1a,b). In non-Alu regions, we found 1,596 variants in repetitive regions and 1,987 variants in non-repetitive regions, including 1,493 (93.5%) and 1,585 (79.8%) A-to-G/T-to-C mismatches respectively (Supplementary Fig. 1a,b). Using the Pooled Samples method, we found that RNA variants with one or more variant reads in Alu repeats and RNA variants with two or more variant reads in non-Alu regions were highly enriched for potential A-to-I editing sites (Supplementary Fig. 1d). We identified 195,647 variants in Alu repeats, 174,058 (89.0%) of which were A-to-G or T-to-C mismatches (Supplementary Fig. 1d,e). In non-Alu regions, we found 2,884 variants in repetitive regions and 3,163 variants in non-repetitive regions, including 2,717 (94.2%) and 2,726 (86.2%) A-to-G/T-to-C mismatches respectively (Supplementary Fig. 1d,e).
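The two filtering rules described in this note can be sketched in a few lines of Python. This is an illustrative sketch and not the pipeline used in the study: the variant tuple layout `(chrom, pos, ref, alt)`, the function names, and the `region_of` callback are assumptions introduced here. Only the thresholds (one for Alu, two for non-Alu) and the A-to-G/T-to-C criterion are taken from the text.

```python
from collections import Counter

# Thresholds from the text: Alu sites need support >= 1, non-Alu sites >= 2.
# "Support" means samples (Separate Samples) or variant reads (Pooled Samples).
MIN_SUPPORT = {"alu": 1, "non_alu": 2}

# Without strand information, both mismatch types may reflect A-to-I editing.
A_TO_I_LIKE = {("A", "G"), ("T", "C")}


def separate_samples(per_sample_variants, region_of):
    """Keep variants that are present in enough individual samples.

    per_sample_variants: iterable of collections of (chrom, pos, ref, alt).
    region_of: hypothetical callback returning "alu" or "non_alu" for a variant.
    """
    n_samples = Counter()
    for variants in per_sample_variants:
        n_samples.update(set(variants))  # count each sample at most once
    return {v for v, n in n_samples.items() if n >= MIN_SUPPORT[region_of(v)]}


def pooled_samples(pooled_read_counts, region_of):
    """Keep variants with enough variant reads after pooling all samples.

    pooled_read_counts: dict mapping a variant to its pooled variant-read count.
    """
    return {
        v for v, n in pooled_read_counts.items() if n >= MIN_SUPPORT[region_of(v)]
    }


def a_to_i_fraction(variants):
    """Fraction of variants that are A-to-G or T-to-C mismatches."""
    if not variants:
        return 0.0
    hits = sum((ref, alt) in A_TO_I_LIKE for (_chrom, _pos, ref, alt) in variants)
    return hits / len(variants)
```

The reported percentages follow from the same ratio: for the Separate Samples method in Alu repeats, 287,342 / 359,938 is approximately 79.8%.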

The editing type of each mismatch was inferred using existing gene annotations, and the majority of edits were of the A-to-G type, as expected (Supplementary Fig. 1c,f). We hypothesized that the slight enrichment for T-to-C editing types was caused by incorrect assignment of A-to-G edits owing to deficiencies in existing gene annotations. We tested this hypothesis by taking a set of mismatches identified from a strand-specific RNA-seq library and inferring their editing type as if they came from a non-strand-specific library. Indeed, we found that the vast majority of the T-to-C mismatches should be annotated as A-to-G mismatches (Supplementary Fig. 2).
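One way such an annotation-based inference could work is sketched below. The source does not spell out the exact rule, so the logic here is an assumption: mismatches are called against the genomic plus strand, and a mismatch inside a gene annotated on the minus strand is complemented to give the transcript-level type. The function name and the `gene_strand` argument are hypothetical.

```python
# Watson-Crick complements used to convert a plus-strand mismatch
# into the mismatch seen on a minus-strand transcript.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def infer_editing_type(ref, alt, gene_strand):
    """Infer the transcript-level mismatch type from a plus-strand call.

    gene_strand: "+", "-", or None when no gene annotation overlaps the site.
    Assumption: unannotated sites are reported as observed on the plus strand.
    """
    if gene_strand == "-":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}-to-{alt}"


# A T-to-C call inside a minus-strand gene is an A-to-G edit on the transcript.
corrected = infer_editing_type("T", "C", "-")

# With a missing annotation the same call stays T-to-C, which is the kind of
# misassignment that incomplete annotations would produce.
uncorrected = infer_editing_type("T", "C", None)
```

Under this assumed rule, any true A-to-G edit in an unannotated or wrongly annotated minus-strand transcript is counted as T-to-C, consistent with the explanation tested in Supplementary Fig. 2.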

We compared our performance in identifying editing sites using RNA-seq data only with that of our previous method (ref. 1), which identified editing sites using matched DNA and RNA sequencing. We took one sample from the lymphoblastoid dataset (GM12891) whose genome was previously deep sequenced (ref. 22) and called editing sites using our previous method (ref. 1). We identified 658 variants in Alu regions, 3 variants in repetitive non-Alu regions, and 6 variants in non-repetitive regions, including 612 (93%), 3 (100%), and 6 (100%) A-to-G mismatches respectively. In comparison, we identified 6,130, 73, and 62 editing sites in Alu, repetitive non-Alu, and non-repetitive regions respectively in GM12891 using the Separate Samples method on the lymphoblastoid dataset (Supplementary Fig. 3), highlighting the effectiveness of our new methods. The sensitivity of our new method is enhanced by the ability to detect editing sites with only one mismatched read (Supplementary Fig. 4) as well as the ability to identify editing sites where the genome sequencing data do not meet the stringent criteria of our previous method to establish a nucleotide position as homozygous reference.

Supplementary Note 2 – RNA editing sites identified from human tissues

To identify RNA editing sites from other human tissues, we applied our two methods to the Illumina Human BodyMap 2.0 dataset, which consists of RNA sequencing data from 16 different human tissues (Supplementary Table 5). In theory, this dataset is less ideal for our methods than the lymphocyte dataset for two reasons: i) the samples are from varying tissue types and ii) genomic SNPs from these samples were not previously ascertained. However, using the same criteria as for the lymphoblastoid cell lines and brain tissues, except that three variant reads were required in non-Alu regions for the Pooled Samples method, we were able to enrich for A-to-G variants (Supplementary Fig. 8). We identified a total of 364,497 Alu, 3,110 repetitive non-Alu, and 3,016 non-repetitive A-to-G variants in the BodyMap dataset (Supplementary Table 2 and Supplementary Data 5-6).

Supplementary Note 3 – Validation of amino-acid recoding editing sites in human brains

In non-repetitive regions of the brain dataset, we identified 115 nonsynonymous editing sites, 87 of which are novel. We examined the editing levels of these sites in the pooled brain samples and found that the novel sites generally have much lower editing levels than the known sites (Fig. 1g). Some of the lowly edited sites exhibited a wide variation in editing level between samples and may be edited specifically in certain regions of the brain (Fig. 1g). We validated 7 out of 9 randomly selected novel nonsynonymous sites (with >10% editing frequency) using PCR and Sanger sequencing of gDNA and cDNA from a single human cerebellum (Supplementary Fig. 13). For a more comprehensive validation of the novel sites, many of which have editing levels less than 10% (Fig. 1g), we designed primers to amplify all 87 novel sites from three human brain samples and deeply sequenced the PCR products using an Illumina MiSeq (Online Methods). We were able to validate 46 of 80 sites in at least one sample (7 sites did not have sufficient sequencing data, probably due to failed PCR) (Supplementary Table 7). Most of the sites that we were unable to validate were shared by few samples in the Separate Samples method (Supplementary Fig. 14a) or supported by few variant reads in the Pooled Samples method (Supplementary Fig. 14b). Of the sites identified by the Separate Samples method, all 15 sites present in ≥4 samples were successfully validated (Supplementary Fig. 14a). Of the sites identified by the Pooled Samples method, 15 of 19 sites having ≥5 variant reads were successfully validated (Supplementary Fig. 14b). We note that one editing site in the IFNA2 gene was validated by Sanger sequencing in the cerebellum (Supplementary Fig. 13), but not validated by deep sequencing of 6 pooled brain tissues (including the cerebellum) from the same individual (Supplementary Table 7). This result suggests that the low editing signal from the cerebellum can be diluted away when looking at the whole brain and that more sites can be validated if more brain tissues are examined separately.
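The dilution argument can be made concrete with a small calculation. The read counts below are hypothetical and chosen only for illustration; they are not measurements from this study. The sketch assumes that the editing level at a site is the number of edited reads divided by the total number of reads covering it.

```python
def pooled_editing_level(tissues):
    """Editing level after pooling reads from several tissues.

    tissues: list of (edited_reads, total_reads) pairs, one per tissue.
    """
    edited = sum(e for e, _total in tissues)
    total = sum(t for _edited, t in tissues)
    return edited / total


# Hypothetical site: 12% edited in cerebellum, unedited in five other tissues,
# with 100 reads of coverage per tissue.
cerebellum_only = [(12, 100)]
six_tissues = [(12, 100)] + [(0, 100)] * 5

level_single = pooled_editing_level(cerebellum_only)  # 12 / 100
level_pooled = pooled_editing_level(six_tissues)      # 12 / 600
```

In this toy case the signal falls from 12% in the cerebellum alone to 2% in the pool, which illustrates how a tissue-specific edit can drop toward the detection limit once tissues are combined.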

Supplementary Note 4 – Relationship between shared RNA edits and genetic distance

We first carried out two-way comparisons between human and chimpanzee, rhesus macaque, or mouse. The number of shared A-to-G mismatches was inversely correlated with the genetic distance (Fig. 2d). For Alu editing in primates, humans share many more sites with chimpanzee than with rhesus macaque; ~15% of the Alu editing sites were present in Alus conserved between human and chimpanzee but not rhesus macaque, suggesting the recent origin of many editing events in the great ape lineage. Similarly, the genetic distance also affected the number of editing sites in non-repetitive regions, with far fewer sites conserved between human and mouse. This analysis highlights the limitations of a previous attempt (ref. 23), in which comparative analysis of human and mouse cDNA sequences led to identification of editing events in only four genes. It also suggests that most human non-repetitive RNA editing sites are probably not conserved in mouse, although human and mouse share ~85% similarity in genic regions (ref. 24). We then extended the analysis to the Drosophila lineage, comparing D. melanogaster and D. simulans, D. yakuba, or D. pseudoobscura using similar numbers of mappable bases from RNA-seq data from the whole bodies of 5-7-day-old females. A similar trend was observed (Fig. 3d).

This finding was further strengthened by examination of the conservation of genes harboring RNA editing sites between distantly related species such as human and D. melanogaster. Of 1,969 genes edited in human and 791 genes edited in D. melanogaster, only 37 genes were shared (Supplementary Fig. 17), and none of the editing sites in these genes were conserved. These results suggest that RNA editing was introduced independently in different lineages during evolution, although the edited genes may share similar neural-activity functions in different species.

Supplementary Note 5 – Identifying RNA editing sites from a single RNA-seq sample

Although our methods are powerful for editing site discovery, they rely on large numbers of RNA-seq samples. We investigated whether the catalog of RNA variants that we identified can be used as a resource to reliably detect RNA editing sites in a single sample of interest. To test this, we downloaded a human hippocampus RNA-seq sample (SRR094854) and identified RNA variants using the same pipeline as described earlier. As expected, a large fraction of non-A-to-G variants were identified in non-Alu regions (Supplementary Fig. 19a). Non-Alu variants shared with any other sample in the brain dataset were highly enriched for A-to-G variants, and ~40% of the editing sites identified in SRR094854 were not identified previously using the Separate Samples method on the brain dataset (Supplementary Fig. 19a,b). These additional sites were not identified using the Separate Samples method on the original 50 samples, because they were only present in a single sample. Next, we investigated how the addition of another sample would influence the Pooled Samples method. We pooled the alignments from SRR094854 with the other 50 brain samples and identified ~1% and ~4% more sites in Alu and non-Alu regions respectively (Supplementary Fig. 19c). These additional sites were not identified using the Pooled Samples method on the original 50 samples, because either they were not supported by a sufficient number of variant reads or they were present at a very low variant frequency and not identified by GATK. Overall, by including an additional sample (SRR094854), our two methods found 3,960, 359, and 565 more sites in Alu, repetitive non-Alu, and non-repetitive regions respectively.
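The catalog lookup described in this note amounts to a set intersection, sketched below. The variant tuple layout `(chrom, pos, ref, alt)`, the function names, and the toy inputs are assumptions made for illustration; the source does not describe its implementation at this level of detail.

```python
def split_by_catalog(sample_variants, catalog):
    """Split a single sample's variants by whether other samples share them.

    sample_variants: set of (chrom, pos, ref, alt) called in the new sample.
    catalog: set of variants previously seen in the multi-sample dataset.
    Returns (shared, private) as two sets.
    """
    shared = sample_variants & catalog
    private = sample_variants - catalog
    return shared, private


def a_to_g_fraction(variants):
    """Fraction of A-to-G mismatches, used as a proxy for enrichment."""
    if not variants:
        return 0.0
    return sum((ref, alt) == ("A", "G") for _c, _p, ref, alt in variants) / len(variants)


# Hypothetical toy input: two variants are in the catalog, one is not.
catalog = {("chr1", 100, "A", "G"), ("chr4", 500, "A", "G")}
sample = {("chr1", 100, "A", "G"), ("chr4", 500, "A", "G"), ("chr7", 42, "C", "T")}
shared, private = split_by_catalog(sample, catalog)
```

Comparing `a_to_g_fraction(shared)` with `a_to_g_fraction(private)` mirrors the enrichment check in the text: variants shared with the catalog are expected to be mostly A-to-G, whereas private non-Alu variants contain more of the other mismatch types.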

Supplementary References

1 Ramaswami, G. et al. Accurate identification of human Alu and non-Alu RNA editing sites. Nature methods 9, 579-581, doi:10.1038/nmeth.1982 (2012).
2 Huang, H. et al. RNA editing of the IQ domain in Ca(v)1.3 channels modulates their Ca2+-dependent inactivation. Neuron 73, 304-316, doi:10.1016/j.neuron.2011.11.022 (2012).
3 Graveley, B. R. et al. The developmental transcriptome of Drosophila melanogaster. Nature 471, 473-479, doi:10.1038/nature09715 (2011).
4 Hoopengardner, B., Bhalla, T., Staber, C. & Reenan, R. Nervous system targets of RNA editing identified by comparative genomics. Science 301, 832-836, doi:10.1126/science.1086763 (2003).
5 Stapleton, M., Carlson, J. W. & Celniker, S. E. RNA editing in Drosophila melanogaster: New targets and functional consequences. RNA 12, 1922-1932, doi:10.1261/rna.254306 (2006).
6 Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research 38, e164, doi:10.1093/nar/gkq603 (2010).
7 Bahn, J. H. et al. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome research 22, 142-150, doi:10.1101/gr.124107.111 (2012).
8 Kiran, A. & Baranov, P. V. DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics 26, 1772-1776, doi:10.1093/bioinformatics/btq285 (2010).
9 Peng, Z. et al. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nature biotechnology 30, 253-260, doi:10.1038/nbt.2122 (2012).
10 Cheung, V. G. et al. Polymorphic cis- and trans-regulation of human gene expression. PLoS biology 8, doi:10.1371/journal.pbio.1000480 (2010).
11 Au, K. F., Jiang, H., Lin, L., Xing, Y. & Wong, W. H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic acids research 38, 4570-4578, doi:10.1093/nar/gkq211 (2010).
12 Twine, N. A., Janitz, K., Wilkins, M. R. & Janitz, M. Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease. PloS one 6, e16266, doi:10.1371/journal.pone.0016266 (2011).
13 Voineagu, I. et al. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380-384, doi:10.1038/nature10110 (2011).
14 Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics 40, 1413-1415, doi:10.1038/ng.259 (2008).
15 Brawand, D. et al. The evolution of gene expression levels in mammalian organs. Nature 478, 343-348, doi:10.1038/nature10532 (2011).
16 Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & development 25, 1915-1927, doi:10.1101/gad.17446611 (2011).
17 Liu, X. et al. Extension of cortical synaptic development distinguishes humans from chimpanzees and macaques. Genome research 22, 611-622, doi:10.1101/gr.127324.111 (2012).
18 Liu, Y. et al. Ab initio identification of transcription start sites in the Rhesus macaque genome by histone modification and RNA-Seq. Nucleic acids research 39, 1408-1418, doi:10.1093/nar/gkq956 (2011).
19 Yan, G. et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nature biotechnology 29, 1019-1023, doi:10.1038/nbt.1992 (2011).
20 Gregg, C. et al. High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science 329, 643-648, doi:10.1126/science.1190830 (2010).
21 Tollervey, J. R. et al. Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. Nature neuroscience 14, 452-458, doi:10.1038/nn.2778 (2011).
22 A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073, doi:10.1038/nature09534 (2010).
23 Levanon, E. Y. et al. Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic acids research 33, 1162-1168, doi:10.1093/nar/gki239 (2005).
24 Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562, doi:10.1038/nature01262 (2002).
