SHAPE as a way to evaluate our ability to predict the ...benasque.org/2012rna/talks_contr/315_Laederach.pdf · simplifying the integration of bioscience data can only speed systems

SHAPE as a way to evaluate our ability to predict the effects of SNPs on RNA

Laederach Lab

UNC Chapel Hill, Department of Biology

[email protected]

http://ribosnitch.bio.unc.edu

SHAPE Chemical Mapping

Mitra S, Shcherbakova IV, Altman RB, Brenowitz M, Laederach A. High-throughput single-nucleotide structural mapping by capillary automated footprinting analysis. NAR, 2008, 36 e63

SHAPE Analysis of FTL 5’ UTR

5’ UTR Exon 1A)

U22GPrediction

IRE

B)C)

0

1

2

3

4

5

Sequence Position

Inte

nsity

WTU22GG4A

27 30 33 36 39 42 45 48 51 54 57 60 63 66 69

x103

*In our CAFA analysis, peaks are proportional to the amount of accessibility-dependent labeling in folded RNA at different positions*Peaks here represent unfiltered fluorescence intensities of the three variants of FTL 5‘UTR that we will focus on today (WT,A196G and U22G)*The area that is being focused on here is representative of the IRE in the secondary structure of 5‘UTR*We see that even without filtration of datapoints, not only do WT and A196G capillary traces closely follow one another, but also U22G deviates significantly from the WT patterns of peaks shown

How do we get a lot more probing data?

(a)

(b)

THESISCRAP

THESISCRAP

SNRNASM (Single Nucleotide Resolution Nucleic Acid Structure Mapping)

ISA-Tab is a Standard

124 VOLUME 44 | NUMBER 2 | FEBRUARY 2012 | NATURE GENETICS

COMMENTARY

a

b

c

Collectionand curation

Common representation

format

Connectingdistributed data

Systems supporting the ISA-Tab format

Systems powered by ISAsoftware components

Investigation

Study Study

Assay (s)

Pointers to data !lenames/location

External !les innative or other

formats

Assay (s)

Data Data

InvestigationHigh-level concept to link related studies

StudyThe central unit, containing information on the subject under study, its characteristics and any treatments applied

A study has associated assays

AssayTest performed either on material taken from the subject or on the whole initial subject, which produce qualitative or quantitative measurements (data)

(i) regularize local collection and management of experimental metadata, (ii) reduce the adoption barrier for using community mini-mum reporting guidelines and terminologies through customizable configuration, (iii) facili-tate consistent curation at source and (iv) sup-port direct submission to a growing number of public repositories, both in ISA-Tab format (such as MetaboLights and the other systems shown in Box 1) and through conversion to other supported formats12–14. An example of the ISA framework in action is illustrated by the Harvard Stem Cell Institute (HCSI)’s Stem Cell Discovery Engine (SCDE)22 and shown in Figure 1.

Without community-level harmonization and interoperability, many community proj-ects risk becoming data silos, aggravating the problem. Using the shared, metadata-focused ISA framework, it is now possible to aggregate investigations in community ‘staging posts’, merge them in various combinations, perform meta-analyses and more straightforwardly submit to public repositories. Furthermore, simplifying the integration of bioscience data can only speed systems biology research23 and improve the ability of the R&D community to utilize shared data24.

The growing number of communities using the ISA framework adds credibility to this meta-data-focused data sharing vision. Taking this a step further, Figure 2 shows how these com-munities’ systems—a mix of public and internal tools that use ISA software components or, min-imally, the ISA-Tab format—will progressively interrelate to build the ‘ISA commons’. Activities are already underway under the auspices of the World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG)’s Scientific Discourse task

force to generate serialized ISA-Tab metadata in compliance with the recommendations of the international Linked Data community25. Semantic integration of bioscience data with the wider corpus of human knowledge then becomes more straightforward.

BioSharing: standard cooperating proceduresIt is widely acknowledged that unlocking shared data promises to accelerate discovery, but this process requires new models for the way we col-laborate1–3,5,6,17,18,26. But reporting standards often have different levels of maturity, and inevi-tably, duplication of effort. Communication between standards initiatives is pivotal to ensure that a common or at least complementary set of

standards exists and is widely used by the aca-demic and commercial sectors to maximize the utility of shared data. Building on the effort of the Minimum Information for Biological and Biomedical Investigations (MIBBI) portal10, the BioSharing initiative works to strengthen collaborations between researchers, funders, industry and journals and to discourage redun-dant (if unintentional) competition between standards-generating groups27. The BioSharing catalog maps the landscape of standards and the systems implementing them, and it also works to build graphs of complementarities in scope and functionality. In time and after consultation, a set of criteria for assessing the usability and popularity of standards will be implemented to maximize their adoption and use to assist the

Figure 2 Building the ‘ISA commons’, a growing ecosystem of resources that work to provide a data commons. (a) Data sets of interest to each community are collected and curated. (b) Capture systems, either powered by the ISA software suite or supporting the hierarchical ISA-Tab structure, deliver a common representation of experimental content that transcends individual domains. (c) To achieve broader data integration, the next step is to explore the growing Linked Data universe. The European Innovative Medicines Initiative (IMI) Open PHACTS project, for example, will use semantic web approaches to make existing knowledge available for linking, querying and where possible, reasoning. This project will benefit greatly from study descriptions that draw on the ISA model to connect quantified information held in semantic triple stores to data from actual experiments performed. As a result, the project will connect public and private datasets to genomics resources, enabling the combination of existing and new experimental data.

Susanna-Assunta SansonePhilippe Rocca-Serra

SNRNASM is a link farm to Google Spreadsheets

Providing open access to the data and meta-data.

A community effortBIOINFORMATICS

Sharing and archiving nucleic acid structure mapping data

PHILIPPE ROCCA-SERRA,1 STANISLAV BELLAOUSOV,2 AMANDA BIRMINGHAM,3 CHUNXIA CHEN,4

PABLO CORDERO,5 RHIJU DAS,5 LAUREN DAVIS-NEULANDER,4 CAIA DUNCAN,6 MATTHEW HALVORSEN,4

ROB KNIGHT,7 NEOCLES LEONTIS,8 DAVID H. MATHEWS,2 JUSTIN RITZ,4 JESSE STOMBAUGH,7

5 KEVIN M. WEEKS,6 CRAIG L. ZIRBEL,9 and ALAIN LAEDERACH4,10

1Oxford e-Research Center, University of Oxford, OX1 3QG, Oxford, United Kingdom2Department of Biochemistry and Biophysics and Center for RNA Biology, University of Rochester, Rochester, New York 14642, USA3Thermo Fisher Scientific, Lafayette, Colorado 80026, USA4Biology Department, University of North Carolina, Chapel Hill, North Carolina 27599-3290, USA

105Biochemistry Department, Stanford University, Stanford, California 94305, USA6Department of Chemistry, University of North Carolina, Chapel Hill, North Carolina 27599, USA7Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA8Department of Chemistry, Bowling Green State University, Bowling Green, Ohio 43403, USA9Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, Ohio 43403, USA

15 ABSTRACT

NucleicAU1 acids are particularly amenable to structural characterization using chemical and enzymatic probes. Each individualstructure mapping experiment reveals specific information about the structure and/or dynamics of the nucleic acid. Currently,there is no simple approach for making these data publically available in a standardized format. We therefore developeda standard for reporting the results of single nucleotide resolution nucleic acid structure mapping experiments, or SNRNASMs.

20 We propose a schema for sharing nucleic acid chemical probing data that uses generic public servers for storing, retrieving, andsearching the data. We have also developed a consistent nomenclature (ontology) within the Ontology of BiomedicalInvestigations (OBI), which provides unique identifiers (termed persistent URLs, or PURLs) for classifying the data. Links tostandardized data sets shared using our proposed format along with a tutorial and links to templates can be found at http://snrnasm.bio.unc.edu.

25 Keywords: RNA structure; chemical mapping; secondary structure

INTRODUCTION

Fields in which data standardization has allowed sharingamong many researchers, including sequence data in GenBank(Benson et al. 2008; Wheeler et al. 2008) and structural data

30 in the Protein Data Bank (Bernstein et al. 1977), havebenefited enormously from the ability of investigators todraw insights from the work of thousands of peopledispersed across the globe (Cannone et al. 2002; Griffiths-Jones et al. 2003; Noy et al. 2003; Zhang et al. 2006; Elnitski

35 et al. 2007; Musen et al. 2008; Brown et al. 2009). Atpresent, there is currently no standard database for archiv-ing and sharing nucleic acid structure mapping data,despite the compelling opportunities to incorporate suchdata in studies with direct relevance to human health

40and to a wide range of scientific challenges (Russell andHerschlag 2001; Tullius 2002; Schroeder et al. 2004; Takamotoet al. 2004; Thirumalai and Hyeon 2005; Mortimer and Weeks2007; Tijerina et al. 2007; Shcherbakova and Brenowitz2008; Woodson 2008; Deigan et al. 2009). Chemical and

45enzymatic structure mapping techniques are useful in thefield of nucleic acids and are commonly used to experi-mentally validate and/or constrain structural predictions,‘‘footprint’’ protein-binding sites, and characterize foldingreactions both kinetically and thermodynamically (Mathews

50et al. 2004; Deigan et al. 2009; Quarrier et al. 2010; Weeks2010). Recent developments allowing the analysis of chemicalmapping reactions in a quantitative and high-throughputmanner yield large amounts of high-quality data that re-quire automated processing and annotation (Das et al. 2005;

55Laederach et al. 2008; Mitra et al. 2008; Vasa et al. 2008;Wilkinson et al. 2008; Deigan et al. 2009; Watts et al. 2009;Underwood et al. 2010) .

A standardized approach for making such data availableupon publication is needed to facilitate sharing and wider

10Corresponding author.E-mail [email protected] published online ahead of print. Article and publication date are

at http://www.rnajournal.org/cgi/doi/10.1261/rna.2753211.

RNA (2011), 17:00–00. Published by Cold Spring Harbor Laboratory Press. Copyright ! 2011 RNA Society. 1

BIOINFORMATICS

Sharing and archiving nucleic acid structure mapping data

PHILIPPE ROCCA-SERRA,1 STANISLAV BELLAOUSOV,2 AMANDA BIRMINGHAM,3 CHUNXIA CHEN,4

PABLO CORDERO,5 RHIJU DAS,5 LAUREN DAVIS-NEULANDER,4 CAIA DUNCAN,6 MATTHEW HALVORSEN,4

ROB KNIGHT,7 NEOCLES LEONTIS,8 DAVID H. MATHEWS,2 JUSTIN RITZ,4 JESSE STOMBAUGH,7

5 KEVIN M. WEEKS,6 CRAIG L. ZIRBEL,9 and ALAIN LAEDERACH4,10

1Oxford e-Research Center, University of Oxford, OX1 3QG, Oxford, United Kingdom2Department of Biochemistry and Biophysics and Center for RNA Biology, University of Rochester, Rochester, New York 14642, USA3Thermo Fisher Scientific, Lafayette, Colorado 80026, USA4Biology Department, University of North Carolina, Chapel Hill, North Carolina 27599-3290, USA

105Biochemistry Department, Stanford University, Stanford, California 94305, USA6Department of Chemistry, University of North Carolina, Chapel Hill, North Carolina 27599, USA7Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA8Department of Chemistry, Bowling Green State University, Bowling Green, Ohio 43403, USA9Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, Ohio 43403, USA

15 ABSTRACT

NucleicAU1 acids are particularly amenable to structural characterization using chemical and enzymatic probes. Each individualstructure mapping experiment reveals specific information about the structure and/or dynamics of the nucleic acid. Currently,there is no simple approach for making these data publically available in a standardized format. We therefore developeda standard for reporting the results of single nucleotide resolution nucleic acid structure mapping experiments, or SNRNASMs.

20 We propose a schema for sharing nucleic acid chemical probing data that uses generic public servers for storing, retrieving, andsearching the data. We have also developed a consistent nomenclature (ontology) within the Ontology of BiomedicalInvestigations (OBI), which provides unique identifiers (termed persistent URLs, or PURLs) for classifying the data. Links tostandardized data sets shared using our proposed format along with a tutorial and links to templates can be found at http://snrnasm.bio.unc.edu.

25 Keywords: RNA structure; chemical mapping; secondary structure

INTRODUCTION

Fields in which data standardization has allowed sharingamong many researchers, including sequence data in GenBank(Benson et al. 2008; Wheeler et al. 2008) and structural data

30 in the Protein Data Bank (Bernstein et al. 1977), havebenefited enormously from the ability of investigators todraw insights from the work of thousands of peopledispersed across the globe (Cannone et al. 2002; Griffiths-Jones et al. 2003; Noy et al. 2003; Zhang et al. 2006; Elnitski

35 et al. 2007; Musen et al. 2008; Brown et al. 2009). Atpresent, there is currently no standard database for archiv-ing and sharing nucleic acid structure mapping data,despite the compelling opportunities to incorporate suchdata in studies with direct relevance to human health

40and to a wide range of scientific challenges (Russell andHerschlag 2001; Tullius 2002; Schroeder et al. 2004; Takamotoet al. 2004; Thirumalai and Hyeon 2005; Mortimer and Weeks2007; Tijerina et al. 2007; Shcherbakova and Brenowitz2008; Woodson 2008; Deigan et al. 2009). Chemical and

45enzymatic structure mapping techniques are useful in thefield of nucleic acids and are commonly used to experi-mentally validate and/or constrain structural predictions,‘‘footprint’’ protein-binding sites, and characterize foldingreactions both kinetically and thermodynamically (Mathews

50et al. 2004; Deigan et al. 2009; Quarrier et al. 2010; Weeks2010). Recent developments allowing the analysis of chemicalmapping reactions in a quantitative and high-throughputmanner yield large amounts of high-quality data that re-quire automated processing and annotation (Das et al. 2005;

55Laederach et al. 2008; Mitra et al. 2008; Vasa et al. 2008;Wilkinson et al. 2008; Deigan et al. 2009; Watts et al. 2009;Underwood et al. 2010) .

A standardized approach for making such data availableupon publication is needed to facilitate sharing and wider

10Corresponding author.E-mail [email protected] published online ahead of print. Article and publication date are

at http://www.rnajournal.org/cgi/doi/10.1261/rna.2753211.

RNA (2011), 17:00–00. Published by Cold Spring Harbor Laboratory Press. Copyright ! 2011 RNA Society. 1

On/off conformations in Boltzmann sampling

Prin

cipa

l Com

pone

nt 2

Prin

cipa

l Com

pone

nt 2

WTG G A A

AGG

AAAG

GGA

A AGAAAC

GCUUCAUAU

AAU

CC U

AA

GAUAUG

GUU

UGG

AGUUUCUA

CC A

AGA

GCCU

U A A A C U

UUG

AU

UAUG

AA

GU

GA A

AAC A

AA

ACAA

A GAA

ACAAC

A A C A A C A A C70

10

60

20

50

3040

11080

100

90

120

5´

G G A A A G G A A AGGGA

AA

GAA

AC G

CUUC A U

AU

AAUCCU

A A G A U A U G G UUUGG

A GUUUCU

ACCAAGAGCC

UU

A A AC U

U U GA U U

AU G

A AG U G

AA

AACA

AAACA

AAGAAAC

AA

CAA

C A A C A A C

10

30

20

4050

110

8070

100

90

120

60

5´

A

P2

P1

P3

G ACGCUUCAUAU

AA

UCCUAAGAUAU G G

U U U G G AG

U U U C UA

CC A A G A G

C CUU

AAACUUUG

AUU

AUGAAGUG

A A A A C A A A A C A C

20 90

50

8040

70

100 120

30

60

5´ ...

G ACGCUUCAUA

UAAUC

CUA

A G AUAUGGUUUGG

AGUUU

CU

A C CAA

GAGCCUUAAACU

UUG

AU

UAUGAAGUG

A A A A C A A A A C A A C

20 90

30

40

80

5070

60

100 120

5´ ...-4 -3 -2 -1 0 1 2 3 4

-4

-3

-2

-1

0

1

2

...

...

Principal Component 1Prin

cipa

l Com

pone

nt 2

Aptamer Binding DomainAdenine Bound Conformation “On”

Adenine Free Conformation “O!”G

G

G

G

SNitching the Switch

B C

U39AC77G


cipa

l Com

pone

nt 2

0.0

0.2

0.4

0.6

0.8

1.0

Sequence Position5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95

WTU39AC77G

Nor

mal

ized

Rea

ctiv

ity D

-4 -3 -2 -1 0 1 2 3 4-4

-3

-2

-1

0

1

2

3

-4 -3 -2 -1 0 1 2 3 4-4

-3

-2

-1

0

1

2

3


cipa

l Com

pone

nt 2


cipa

l Com

pone

nt 2

Figure 1

SHAPE data for most single point mutations!

G G A A A G G A A A G G G A A A G A A ACGCUUCAUAU

AAUCCUAAU

GAUAU G G

U U U G G G AGU U U C U

ACC A A G A G C C

UUAAA

CUCUUGA

UUAUGAAGUGA A A A C A A A A C A A A G A A A C A A C A A C A A C A A C

10 20 90

50

8040

70

100 110 120

30

60

5´

P2

P1

P3

0.0

0.2

0.4

0.6

0.8

1.0


Nor

mal

ized

Rea

ctiv

ity D

0.0

0.2

0.4

0.6

0.8

1.0


Nor

mal

ized

Rea

ctiv

ity D

Sequence Position

Nor

mal

ized

Rea

ctiv

ity0.

00.

20.

40.

60.

81.

0

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95

Quantifying the structure change

0.0

0.2

0.4

0.6

0.8

1.0

Sequence Position

Nor

mal

ized

Rea

ctiv

ity

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95

Adenine Riboswitch

- Ligand+ Ligand

eSDC = 1.45, Biological “significance.”

eSDC, the experimental Structure Disruption Coefficient

0.0

0.5

1.0

1.5

2.0

2.5

3.0

U22G.A196G

U22GU22G.G4A

U24A

A56U.U59CG63CU22GC90UU59C

G67C

C77G

C35GU76A

U113A

U13AU113A

U13A

G85CC51G

G105C

G85C

C32G

A107U

C65G

0.5.05

.01.001

.00011.0

P-V

alue

0.5.05

.01.001

.00011.0

FTL Adenine RS Glycine RS No G

GMP RSw/ CDM

P4P6Glycine RS w/ G

GMP RSNo CDM

U32A C128GU39A0.

00.

20.

40.

60.

81.

0

15 25 35 45 55 65 75 85 95 105 115 125 135 145 155 165 175

WTU32AU113A

0.0

0.2

0.4

0.6

0.8

1.0

Sequence Position

Nor

mal

ized

SH

AP

E R

eact

ivity

25 35 45 55 65 75 85 95 105 115 125 135 145 155

WTU32AU113A

WTC128GC65G

eSD

C

A

B

C

± Adenine ± Glycine

± CDM

Figure 2

Predicting SNitch inducing SNPs

Stocastic SampledStructures

Suboptimal Structures

Mutant Structures

sfold -o outdir seq.farun 10 times!nd MFE structure

Minimal Free EnergyStructure

RNAFoldVienna’s

AGACCAAACUCU

AGACCAAAgUCUAGgCCAAACUCU

AGACCcAACUCU

AGAgCAAuCUCU

2 mutations

1 mutation

RNAmutants

RNAstructureMathews Lab


Stocastic SampledStructures

Suboptimal Structures



Stocastic SampledStructures sfold -o outdir seq.fa

run 10 times

RNAsubopt -e E -s < seq.fa > out!le.txt AllSub seq.seq out!le.ct -a E

RNAsubopt -p 10000 -s < seq.fa >out!le.txt

RNAfold -noPS < seq.fa > out!le.txt Fold seq.seq out!le.ct -m 1

partition seq.seq out!le.pfsProbabilityPlot out!le.pfs out!le.txt -t

RNAmutants -l ~/RNAmutants /lib/ -f seq.fa -mutants 1 -n 10000 > out!le.txt

RNAmutants -l ~/RNAmutants /lib/ -f seq.fa -mutants 1 -n 1 > out!le.txt

pSDC, or the predicted Structure Disruption Coefficient

SHAPE

MFE

Cluster Centroid

Z Centroid

Probability of Pairing

Sequence Position20 40 60 80 100 120 140 160 180

1.000.75

0.50

0.25

0.00

5 15 25 35 45 55 65 75 85 95 105 115 125 135 145 155 165 175

WTU22G

Sequence Position


1.00

0.75

0.50

0.25

0.00


1.000.75

0.50

0.250.00


1.000.75

0.50

0.25

0.00

METRIC WT vs. Mutant2.0

1.5

1.0

0.5

0.0Nor

mal

ized

Rea

ctiv

ity

eSDC = 2.3

pSDC = 6.7

pSDC = 4.6

pSDC = 6.1

pSDC = 5.6

A

E

D

C

B

PPV and AUC analysis of predictions

A

B0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

1-Specificity

Sen

sitiv

ity

Centroid-RNAsubopt-.64PP-RNAmutantsPair Perc. RNAsubopt-.53

Z Centroid.RNAsubopt - 0.64Prob. Pair.RNAfold - 0.60

MFE.Sfold - 0.54

AUC reveals there is still work to be done.B

Metric Program AUC +/- .03

Allsub 0.59 RNAfold 0.60 RNAMutants 0.59 RNAStructure 0.59 RNAsubopt 0.61 0.58



Probability of Pairing

Distance Centroid

Z Centroid

RNAfold 0.62 RNAmutants 0.61 RNAstructure 0.57 sFold 0.54

MFE

mFold 0.62

sFold

sFold

sFold

Open Questions

• What exactly is SHAPE telling us? Notice we only used it comparatively, i.e. does the structure change and how much?

• Is there information in partial SHAPE (or for that matter any probing) reactivities or do we just need to know the most reactive species?

• Why do we correctly predict that there are major structure disrupting SNPs but no one gets the right ones?

• Are hyper-reactive nucleotides even structurally important or just an artifact of ideal acylation geometry?

Josh Martin

Matt Halvorsen Wes SandersJustin Ritz

Gabriela Phillips

Chetna Gopinath

R00 GM079953 (NIGMS),R21 MH087336 (NIMH), R01 HL111527 (NHLBI), R01 GM101237 (NGMS).

Documents

SHAPE as a way to evaluate our ability to predict the ...benasque.org/2012rna/talks_contr/315_Laederach.pdf · simplifying the integration of bioscience data can only speed systems