Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
SHAPE as a way to evaluate our ability to predict the effects of SNPs on RNA
Laederach Lab
UNC Chapel Hill, Department of Biology
http://ribosnitch.bio.unc.edu
SHAPE Chemical Mapping
Mitra S, Shcherbakova IV, Altman RB, Brenowitz M, Laederach A. High-throughput single-nucleotide structural mapping by capillary automated footprinting analysis. NAR, 2008, 36 e63
SHAPE Analysis of FTL 5’ UTR
5’ UTR Exon 1A)
U22GPrediction
IRE
B)C)
0
1
2
3
4
5
Sequence Position
Inte
nsity
WTU22GG4A
27 30 33 36 39 42 45 48 51 54 57 60 63 66 69
x103
*In our CAFA analysis, peaks are proportional to the amount of accessibility-dependent labeling in folded RNA at different positions*Peaks here represent unfiltered fluorescence intensities of the three variants of FTL 5‘UTR that we will focus on today (WT,A196G and U22G)*The area that is being focused on here is representative of the IRE in the secondary structure of 5‘UTR*We see that even without filtration of datapoints, not only do WT and A196G capillary traces closely follow one another, but also U22G deviates significantly from the WT patterns of peaks shown
How do we get a lot more probing data?
(a)
(b)
THESISCRAP
THESISCRAP
SNRNASM (Single Nucleotide Resolution Nucleic Acid Structure Mapping)
ISA-Tab is a Standard
124 VOLUME 44 | NUMBER 2 | FEBRUARY 2012 | NATURE GENETICS
COMMENTARY
a
b
c
Collectionand curation
Common representation
format
Connectingdistributed data
Systems supporting the ISA-Tab format
Systems powered by ISAsoftware components
Investigation
Study Study
Assay (s)
Pointers to data !lenames/location
External !les innative or other
formats
Assay (s)
Data Data
InvestigationHigh-level concept to link related studies
StudyThe central unit, containing information on the subject under study, its characteristics and any treatments applied
A study has associated assays
AssayTest performed either on material taken from the subject or on the whole initial subject, which produce qualitative or quantitative measurements (data)
(i) regularize local collection and management of experimental metadata, (ii) reduce the adoption barrier for using community mini-mum reporting guidelines and terminologies through customizable configuration, (iii) facili-tate consistent curation at source and (iv) sup-port direct submission to a growing number of public repositories, both in ISA-Tab format (such as MetaboLights and the other systems shown in Box 1) and through conversion to other supported formats12–14. An example of the ISA framework in action is illustrated by the Harvard Stem Cell Institute (HCSI)’s Stem Cell Discovery Engine (SCDE)22 and shown in Figure 1.
Without community-level harmonization and interoperability, many community proj-ects risk becoming data silos, aggravating the problem. Using the shared, metadata-focused ISA framework, it is now possible to aggregate investigations in community ‘staging posts’, merge them in various combinations, perform meta-analyses and more straightforwardly submit to public repositories. Furthermore, simplifying the integration of bioscience data can only speed systems biology research23 and improve the ability of the R&D community to utilize shared data24.
The growing number of communities using the ISA framework adds credibility to this meta-data-focused data sharing vision. Taking this a step further, Figure 2 shows how these com-munities’ systems—a mix of public and internal tools that use ISA software components or, min-imally, the ISA-Tab format—will progressively interrelate to build the ‘ISA commons’. Activities are already underway under the auspices of the World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG)’s Scientific Discourse task
force to generate serialized ISA-Tab metadata in compliance with the recommendations of the international Linked Data community25. Semantic integration of bioscience data with the wider corpus of human knowledge then becomes more straightforward.
BioSharing: standard cooperating proceduresIt is widely acknowledged that unlocking shared data promises to accelerate discovery, but this process requires new models for the way we col-laborate1–3,5,6,17,18,26. But reporting standards often have different levels of maturity, and inevi-tably, duplication of effort. Communication between standards initiatives is pivotal to ensure that a common or at least complementary set of
standards exists and is widely used by the aca-demic and commercial sectors to maximize the utility of shared data. Building on the effort of the Minimum Information for Biological and Biomedical Investigations (MIBBI) portal10, the BioSharing initiative works to strengthen collaborations between researchers, funders, industry and journals and to discourage redun-dant (if unintentional) competition between standards-generating groups27. The BioSharing catalog maps the landscape of standards and the systems implementing them, and it also works to build graphs of complementarities in scope and functionality. In time and after consultation, a set of criteria for assessing the usability and popularity of standards will be implemented to maximize their adoption and use to assist the
Figure 2 Building the ‘ISA commons’, a growing ecosystem of resources that work to provide a data commons. (a) Data sets of interest to each community are collected and curated. (b) Capture systems, either powered by the ISA software suite or supporting the hierarchical ISA-Tab structure, deliver a common representation of experimental content that transcends individual domains. (c) To achieve broader data integration, the next step is to explore the growing Linked Data universe. The European Innovative Medicines Initiative (IMI) Open PHACTS project, for example, will use semantic web approaches to make existing knowledge available for linking, querying and where possible, reasoning. This project will benefit greatly from study descriptions that draw on the ISA model to connect quantified information held in semantic triple stores to data from actual experiments performed. As a result, the project will connect public and private datasets to genomics resources, enabling the combination of existing and new experimental data.
Susanna-Assunta SansonePhilippe Rocca-Serra
SNRNASM is a link farm to Google Spreadsheets
Providing open access to the data and meta-data.
A community effortBIOINFORMATICS
Sharing and archiving nucleic acid structure mapping data
PHILIPPE ROCCA-SERRA,1 STANISLAV BELLAOUSOV,2 AMANDA BIRMINGHAM,3 CHUNXIA CHEN,4
PABLO CORDERO,5 RHIJU DAS,5 LAUREN DAVIS-NEULANDER,4 CAIA DUNCAN,6 MATTHEW HALVORSEN,4
ROB KNIGHT,7 NEOCLES LEONTIS,8 DAVID H. MATHEWS,2 JUSTIN RITZ,4 JESSE STOMBAUGH,7
5 KEVIN M. WEEKS,6 CRAIG L. ZIRBEL,9 and ALAIN LAEDERACH4,10
1Oxford e-Research Center, University of Oxford, OX1 3QG, Oxford, United Kingdom2Department of Biochemistry and Biophysics and Center for RNA Biology, University of Rochester, Rochester, New York 14642, USA3Thermo Fisher Scientific, Lafayette, Colorado 80026, USA4Biology Department, University of North Carolina, Chapel Hill, North Carolina 27599-3290, USA
105Biochemistry Department, Stanford University, Stanford, California 94305, USA6Department of Chemistry, University of North Carolina, Chapel Hill, North Carolina 27599, USA7Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA8Department of Chemistry, Bowling Green State University, Bowling Green, Ohio 43403, USA9Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, Ohio 43403, USA
15 ABSTRACT
NucleicAU1 acids are particularly amenable to structural characterization using chemical and enzymatic probes. Each individualstructure mapping experiment reveals specific information about the structure and/or dynamics of the nucleic acid. Currently,there is no simple approach for making these data publically available in a standardized format. We therefore developeda standard for reporting the results of single nucleotide resolution nucleic acid structure mapping experiments, or SNRNASMs.
20 We propose a schema for sharing nucleic acid chemical probing data that uses generic public servers for storing, retrieving, andsearching the data. We have also developed a consistent nomenclature (ontology) within the Ontology of BiomedicalInvestigations (OBI), which provides unique identifiers (termed persistent URLs, or PURLs) for classifying the data. Links tostandardized data sets shared using our proposed format along with a tutorial and links to templates can be found at http://snrnasm.bio.unc.edu.
25 Keywords: RNA structure; chemical mapping; secondary structure
INTRODUCTION
Fields in which data standardization has allowed sharingamong many researchers, including sequence data in GenBank(Benson et al. 2008; Wheeler et al. 2008) and structural data
30 in the Protein Data Bank (Bernstein et al. 1977), havebenefited enormously from the ability of investigators todraw insights from the work of thousands of peopledispersed across the globe (Cannone et al. 2002; Griffiths-Jones et al. 2003; Noy et al. 2003; Zhang et al. 2006; Elnitski
35 et al. 2007; Musen et al. 2008; Brown et al. 2009). Atpresent, there is currently no standard database for archiv-ing and sharing nucleic acid structure mapping data,despite the compelling opportunities to incorporate suchdata in studies with direct relevance to human health
40and to a wide range of scientific challenges (Russell andHerschlag 2001; Tullius 2002; Schroeder et al. 2004; Takamotoet al. 2004; Thirumalai and Hyeon 2005; Mortimer and Weeks2007; Tijerina et al. 2007; Shcherbakova and Brenowitz2008; Woodson 2008; Deigan et al. 2009). Chemical and
45enzymatic structure mapping techniques are useful in thefield of nucleic acids and are commonly used to experi-mentally validate and/or constrain structural predictions,‘‘footprint’’ protein-binding sites, and characterize foldingreactions both kinetically and thermodynamically (Mathews
50et al. 2004; Deigan et al. 2009; Quarrier et al. 2010; Weeks2010). Recent developments allowing the analysis of chemicalmapping reactions in a quantitative and high-throughputmanner yield large amounts of high-quality data that re-quire automated processing and annotation (Das et al. 2005;
55Laederach et al. 2008; Mitra et al. 2008; Vasa et al. 2008;Wilkinson et al. 2008; Deigan et al. 2009; Watts et al. 2009;Underwood et al. 2010) .
A standardized approach for making such data availableupon publication is needed to facilitate sharing and wider
10Corresponding author.E-mail [email protected] published online ahead of print. Article and publication date are
at http://www.rnajournal.org/cgi/doi/10.1261/rna.2753211.
RNA (2011), 17:00–00. Published by Cold Spring Harbor Laboratory Press. Copyright ! 2011 RNA Society. 1
BIOINFORMATICS
Sharing and archiving nucleic acid structure mapping data
PHILIPPE ROCCA-SERRA,1 STANISLAV BELLAOUSOV,2 AMANDA BIRMINGHAM,3 CHUNXIA CHEN,4
PABLO CORDERO,5 RHIJU DAS,5 LAUREN DAVIS-NEULANDER,4 CAIA DUNCAN,6 MATTHEW HALVORSEN,4
ROB KNIGHT,7 NEOCLES LEONTIS,8 DAVID H. MATHEWS,2 JUSTIN RITZ,4 JESSE STOMBAUGH,7
5 KEVIN M. WEEKS,6 CRAIG L. ZIRBEL,9 and ALAIN LAEDERACH4,10
1Oxford e-Research Center, University of Oxford, OX1 3QG, Oxford, United Kingdom2Department of Biochemistry and Biophysics and Center for RNA Biology, University of Rochester, Rochester, New York 14642, USA3Thermo Fisher Scientific, Lafayette, Colorado 80026, USA4Biology Department, University of North Carolina, Chapel Hill, North Carolina 27599-3290, USA
105Biochemistry Department, Stanford University, Stanford, California 94305, USA6Department of Chemistry, University of North Carolina, Chapel Hill, North Carolina 27599, USA7Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA8Department of Chemistry, Bowling Green State University, Bowling Green, Ohio 43403, USA9Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, Ohio 43403, USA
15 ABSTRACT
NucleicAU1 acids are particularly amenable to structural characterization using chemical and enzymatic probes. Each individualstructure mapping experiment reveals specific information about the structure and/or dynamics of the nucleic acid. Currently,there is no simple approach for making these data publically available in a standardized format. We therefore developeda standard for reporting the results of single nucleotide resolution nucleic acid structure mapping experiments, or SNRNASMs.
20 We propose a schema for sharing nucleic acid chemical probing data that uses generic public servers for storing, retrieving, andsearching the data. We have also developed a consistent nomenclature (ontology) within the Ontology of BiomedicalInvestigations (OBI), which provides unique identifiers (termed persistent URLs, or PURLs) for classifying the data. Links tostandardized data sets shared using our proposed format along with a tutorial and links to templates can be found at http://snrnasm.bio.unc.edu.
25 Keywords: RNA structure; chemical mapping; secondary structure
INTRODUCTION
Fields in which data standardization has allowed sharingamong many researchers, including sequence data in GenBank(Benson et al. 2008; Wheeler et al. 2008) and structural data
30 in the Protein Data Bank (Bernstein et al. 1977), havebenefited enormously from the ability of investigators todraw insights from the work of thousands of peopledispersed across the globe (Cannone et al. 2002; Griffiths-Jones et al. 2003; Noy et al. 2003; Zhang et al. 2006; Elnitski
35 et al. 2007; Musen et al. 2008; Brown et al. 2009). Atpresent, there is currently no standard database for archiv-ing and sharing nucleic acid structure mapping data,despite the compelling opportunities to incorporate suchdata in studies with direct relevance to human health
40and to a wide range of scientific challenges (Russell andHerschlag 2001; Tullius 2002; Schroeder et al. 2004; Takamotoet al. 2004; Thirumalai and Hyeon 2005; Mortimer and Weeks2007; Tijerina et al. 2007; Shcherbakova and Brenowitz2008; Woodson 2008; Deigan et al. 2009). Chemical and
45enzymatic structure mapping techniques are useful in thefield of nucleic acids and are commonly used to experi-mentally validate and/or constrain structural predictions,‘‘footprint’’ protein-binding sites, and characterize foldingreactions both kinetically and thermodynamically (Mathews
50et al. 2004; Deigan et al. 2009; Quarrier et al. 2010; Weeks2010). Recent developments allowing the analysis of chemicalmapping reactions in a quantitative and high-throughputmanner yield large amounts of high-quality data that re-quire automated processing and annotation (Das et al. 2005;
55Laederach et al. 2008; Mitra et al. 2008; Vasa et al. 2008;Wilkinson et al. 2008; Deigan et al. 2009; Watts et al. 2009;Underwood et al. 2010) .
A standardized approach for making such data availableupon publication is needed to facilitate sharing and wider
10Corresponding author.E-mail [email protected] published online ahead of print. Article and publication date are
at http://www.rnajournal.org/cgi/doi/10.1261/rna.2753211.
RNA (2011), 17:00–00. Published by Cold Spring Harbor Laboratory Press. Copyright ! 2011 RNA Society. 1
On/off conformations in Boltzmann sampling
Prin
cipa
l Com
pone
nt 2
Prin
cipa
l Com
pone
nt 2
WTG G A A
AGG
AAAG
GGA
A AGAAAC
GCUUCAUAU
AAU
CC U
AA
GAUAUG
GUU
UGG
AGUUUCUA
CC A
AGA
GCCU
U A A A C U
UUG
AU
UAUG
AA
GU
GA A
AAC A
AA
ACAA
A GAA
ACAAC
A A C A A C A A C70
10
60
20
50
3040
11080
100
90
120
5´
G G A A A G G A A AGGGA
AA
GAA
AC G
CUUC A U
AU
AAUCCU
A A G A U A U G G UUUGG
A GUUUCU
ACCAAGAGCC
UU
A A AC U
U U GA U U
AU G
A AG U G
AA
AACA
AAACA
AAGAAAC
AA
CAA
C A A C A A C
10
30
20
4050
110
8070
100
90
120
60
5´
A
P2
P1
P3
G ACGCUUCAUAU
AA
UCCUAAGAUAU G G
U U U G G AG
U U U C UA
CC A A G A G
C CUU
AAACUUUG
AUU
AUGAAGUG
A A A A C A A A A C A C
20 90
50
8040
70
100 120
30
60
5´ ...
G ACGCUUCAUA
UAAUC
CUA
A G AUAUGGUUUGG
AGUUU
CU
A C CAA
GAGCCUUAAACU
UUG
AU
UAUGAAGUG
A A A A C A A A A C A A C
20 90
30
40
80
5070
60
100 120
5´ ...-4 -3 -2 -1 0 1 2 3 4
-4
-3
-2
-1
0
1
2
...
...
Principal Component 1Prin
cipa
l Com
pone
nt 2
Aptamer Binding DomainAdenine Bound Conformation “On”
Adenine Free Conformation “O!”G
G
G
G
SNitching the Switch
B C
U39AC77G
Principal Component 1Prin
cipa
l Com
pone
nt 2
0.0
0.2
0.4
0.6
0.8
1.0
Sequence Position5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
WTU39AC77G
Nor
mal
ized
Rea
ctiv
ity D
-4 -3 -2 -1 0 1 2 3 4-4
-3
-2
-1
0
1
2
3
-4 -3 -2 -1 0 1 2 3 4-4
-3
-2
-1
0
1
2
3
Principal Component 1Prin
cipa
l Com
pone
nt 2
Principal Component 1Prin
cipa
l Com
pone
nt 2
Figure 1
SHAPE data for most single point mutations!
G G A A A G G A A A G G G A A A G A A ACGCUUCAUAU
AAUCCUAAU
GAUAU G G
U U U G G G AGU U U C U
ACC A A G A G C C
UUAAA
CUCUUGA
UUAUGAAGUGA A A A C A A A A C A A A G A A A C A A C A A C A A C A A C
10 20 90
50
8040
70
100 110 120
30
60
5´
P2
P1
P3
0.0
0.2
0.4
0.6
0.8
1.0
Sequence Position5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
Nor
mal
ized
Rea
ctiv
ity D
0.0
0.2
0.4
0.6
0.8
1.0
Sequence Position5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
Nor
mal
ized
Rea
ctiv
ity D
Sequence Position
Nor
mal
ized
Rea
ctiv
ity0.
00.
20.
40.
60.
81.
0
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
Quantifying the structure change
0.0
0.2
0.4
0.6
0.8
1.0
Sequence Position
Nor
mal
ized
Rea
ctiv
ity
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
Adenine Riboswitch
- Ligand+ Ligand
eSDC = 1.45, Biological “significance.”
eSDC, the experimental Structure Disruption Coefficient
0.0
0.5
1.0
1.5
2.0
2.5
3.0
U22G.A196G
U22GU22G.G4A
U24A
A56U.U59CG63CU22GC90UU59C
G67C
C77G
C35GU76A
U113A
U13AU113A
U13A
G85CC51G
G105C
G85C
C32G
A107U
C65G
0.5.05
.01.001
.00011.0
P-V
alue
0.5.05
.01.001
.00011.0
FTL Adenine RS Glycine RS No G
GMP RSw/ CDM
P4P6Glycine RS w/ G
GMP RSNo CDM
U32A C128GU39A0.
00.
20.
40.
60.
81.
0
15 25 35 45 55 65 75 85 95 105 115 125 135 145 155 165 175
WTU32AU113A
0.0
0.2
0.4
0.6
0.8
1.0
Sequence Position
Nor
mal
ized
SH
AP
E R
eact
ivity
25 35 45 55 65 75 85 95 105 115 125 135 145 155
WTU32AU113A
WTC128GC65G
eSD
C
A
B
C
± Adenine ± Glycine
± CDM
Figure 2
Predicting SNitch inducing SNPs
Stocastic SampledStructures
Suboptimal Structures
Mutant Structures
sfold -o outdir seq.farun 10 times!nd MFE structure
Minimal Free EnergyStructure
RNAFoldVienna’s
AGACCAAACUCU
AGACCAAAgUCUAGgCCAAACUCU
AGACCcAACUCU
AGAgCAAuCUCU
2 mutations
1 mutation
RNAmutants
RNAstructureMathews Lab
Minimal Free EnergyStructure
Stocastic SampledStructures
Suboptimal Structures
Minimal Free EnergyStructure
Minimal Free EnergyStructure
Stocastic SampledStructures sfold -o outdir seq.fa
run 10 times
RNAsubopt -e E -s < seq.fa > out!le.txt AllSub seq.seq out!le.ct -a E
RNAsubopt -p 10000 -s < seq.fa >out!le.txt
RNAfold -noPS < seq.fa > out!le.txt Fold seq.seq out!le.ct -m 1
partition seq.seq out!le.pfsProbabilityPlot out!le.pfs out!le.txt -t
RNAmutants -l ~/RNAmutants /lib/ -f seq.fa -mutants 1 -n 10000 > out!le.txt
RNAmutants -l ~/RNAmutants /lib/ -f seq.fa -mutants 1 -n 1 > out!le.txt
pSDC, or the predicted Structure Disruption Coefficient
SHAPE
MFE
Cluster Centroid
Z Centroid
Probability of Pairing
Sequence Position20 40 60 80 100 120 140 160 180
1.000.75
0.50
0.25
0.00
5 15 25 35 45 55 65 75 85 95 105 115 125 135 145 155 165 175
WTU22G
Sequence Position
Sequence Position20 40 60 80 100 120 140 160 180
1.00
0.75
0.50
0.25
0.00
Sequence Position20 40 60 80 100 120 140 160 180
1.000.75
0.50
0.250.00
Sequence Position20 40 60 80 100 120 140 160 180
1.000.75
0.50
0.25
0.00
METRIC WT vs. Mutant2.0
1.5
1.0
0.5
0.0Nor
mal
ized
Rea
ctiv
ity
eSDC = 2.3
pSDC = 6.7
pSDC = 4.6
pSDC = 6.1
pSDC = 5.6
A
E
D
C
B
PPV and AUC analysis of predictions
A
B0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
1-Specificity
Sen
sitiv
ity
Centroid-RNAsubopt-.64PP-RNAmutantsPair Perc. RNAsubopt-.53
Z Centroid.RNAsubopt - 0.64Prob. Pair.RNAfold - 0.60
MFE.Sfold - 0.54
AUC reveals there is still work to be done.B
Metric Program AUC +/- .03
Allsub 0.59 RNAfold 0.60 RNAMutants 0.59 RNAStructure 0.59 RNAsubopt 0.61 0.58
Allsub 0.56 RNAfold 0.62 RNAMutants 0.58 RNAStructure 0.56 RNAsubopt 0.64 0.54
Allsub 0.57 RNAfold 0.63 RNAMutants 0.59 RNAStructure 0.58 RNAsubopt 0.62 0.55
Probability of Pairing
Distance Centroid
Z Centroid
RNAfold 0.62 RNAmutants 0.61 RNAstructure 0.57 sFold 0.54
MFE
mFold 0.62
sFold
sFold
sFold
Open Questions
• What exactly is SHAPE telling us? Notice we only used it comparatively, i.e. does the structure change and how much?
• Is there information in partial SHAPE (or for that matter any probing) reactivities or do we just need to know the most reactive species?
• Why do we correctly predict that there are major structure disrupting SNPs but no one gets the right ones?
• Are hyper-reactive nucleotides even structurally important or just an artifact of ideal acylation geometry?
Josh Martin
Matt Halvorsen Wes SandersJustin Ritz
Gabriela Phillips
Chetna Gopinath
R00 GM079953 (NIGMS),R21 MH087336 (NIMH), R01 HL111527 (NHLBI), R01 GM101237 (NGMS).