
ISFOLD: Structure Prediction of Base Pairs in Non-Helical RNA Motifs from Isostericity Signatures in Their Sequence Alignments

http://www.jbsdonline.com

Abstract

The existence and identity of non-Watson-Crick base pairs (bps) within RNA bulges, internal loops, and hairpin loops cannot reliably be predicted by existing algorithms. We have developed the Isfold (Isosteric Folding) program as a tool to examine patterns of nucleotide substitutions from sequence alignments or mutation experiments and identify plausible bp interactions. We infer these interactions based on the observation that each non-Watson-Crick bp has a signature pattern of isosteric substitutions where mutations can be made that preserve the 3D structure. Isfold produces a dynamic representation of predicted bps within defined motifs in order of their probabilities. The software was developed under Windows XP, and is capable of running on PC and Mac with Matlab 7.1 (SP3) or higher. A PC stand-alone version that does not require Matlab also is available. This software and a user manual are freely available at www.ucsf.edu/frankel/isfold.

Key words: RNA 3D structure; Structure prediction; Non-Watson-Crick base pair; Base pair isostericity; and Non-helical motif.

Introduction

Many computational efforts to determine the structures of RNA molecules from their sequences have been aimed at determining their secondary structures, i.e., the set of cis Watson-Crick base pairs (WC bps) that make up the stacked helical skeleton and establish the overall architecture of the molecules. However, depending on context, as many as 20-50% of edge-to-edge bps in non-coding structured RNAs are of non-WC types, and these fall into about a dozen bp types (1-3). These non-WC bps are found within internal loops, hairpin loops, and other non-helical elements such as pseudoknots. They can form long range inter- and intra-molecular interactions, and they often are dynamic in nature, allowing for transient formation and breaking of contacts that permit the molecule to change its shape as needed (4-7). Such bps typically are found in function-determining parts of RNA molecules, so modeling their exact structures or range of possible structures is of considerable value. Computational and experimental approaches exist to determine secondary structures with good accuracy but provide much less information about the structures of non-helical regions. Dynamic programming algorithms such as Mfold (8) and RNAstructure (9) are relatively accurate in predicting RNA secondary structure in helical areas based on stacking free energies between consecutive WC bps. Furthermore, they indicate the approximate locations of non-WC regions but without details of their structures. Other programs such as RNAfold (10), Pfold (11), and Sfold (12) use related approaches but still are limited in their ability to detect non-WC bps (13).

Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102, Volume 25, Issue Number 5 (2008). ©Adenine Press (2008)

Ali Mokdad*

Alan D. Frankel

Department of Biochemistry and Biophysics, University of California at San Francisco, 600 16th Street, San Francisco, CA 94143-2280, USA

*Email: [email protected]

Open Access Article. The authors, the publisher, and the right holders grant the right to use, reproduce, and disseminate the work in digital form to all users.

Another major class of computational approaches is based on comparative sequence analysis (CSA) (14-17), where compensatory mutations (covariations) observed in sequence alignments are used to predict secondary structure. As with the free energy approaches, CSA methods also do not generally detect the positions and types of all possible bp interactions, in part because such interactions are represented not by simple (two-base WC) covariations but rather by more complex types of sequence signatures. For example, C:G, G:C, and G:G all can be equivalent substitutions of the cHH (cis Hoogsteen/Hoogsteen) bp family, and A:A, A:C, A:G, and A:U of the tSS (trans Sugar edge/Sugar edge) bp family (Figure 1). These signatures are not detected by classical CSA because the compensating mutations do not necessarily affect both positions that form the bp. Other methods to predict structure are based on primary sequence alone (18) or utilize graph grammar to detect structure from primary sequence alone or from sequence alignments (19-21), but these also have shown limited success in detecting non-WC bps. To date, no reliable method has been dedicated to determining non-WC bps (22), in part because the rules that establish such complex relationships have not yet been systematically defined. Here, we describe an approach that focuses solely on predicting plausible non-WC bps based on their degree of structural similarity or isomorphism (3, 23).
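The point that classical covariation analysis misses one-sided substitutions can be illustrated with a minimal Python sketch. Only the cHH substitution set (C:G, G:C, G:G) comes from the text; the toy alignment, the column indices, and the function names are hypothetical and are not part of Isfold.

```python
from itertools import combinations

# Equivalent cHH combinations named in the text.
CHH_EQUIVALENT = {("C", "G"), ("G", "C"), ("G", "G")}

def observed_pairs(alignment, i, j):
    """Return the set of (base_i, base_j) combinations seen at columns i and j (0-based)."""
    return {(seq[i], seq[j]) for seq in alignment}

def is_two_sided_covariation(pair_a, pair_b):
    """Classical covariation: BOTH positions differ between the two combinations."""
    return pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]

def summarize(alignment, i, j):
    """Split all observed substitutions into two-sided and one-sided changes."""
    pairs = sorted(observed_pairs(alignment, i, j))
    changes = list(combinations(pairs, 2))
    two_sided = [c for c in changes if is_two_sided_covariation(*c)]
    one_sided = [c for c in changes if not is_two_sided_covariation(*c)]
    return pairs, two_sided, one_sided

# Hypothetical three-sequence alignment; columns 1 and 4 hold the candidate bp.
toy_alignment = ["ACAAGA", "AGAACA", "AGAAGA"]
pairs, two_sided, one_sided = summarize(toy_alignment, 1, 4)
```

In this toy case, only the C:G to G:C change alters both columns. The two changes involving G:G alter a single column, so a detector that requires paired changes would not count them, even though all three combinations are listed as equivalent for cHH.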

Materials and Methods

A few years have passed since all possible bp types or families were categorized according to their structural similarities (physical dimensions and bond orientations), resulting in isostericity matrices (IMs) (23). According to this classification, each bp family is organized into isosteric subfamilies that yield distinctive patterns of acceptable sequence variations (Figure 1). These patterns in principle can identify the bp type amongst aligned sequences, but this is complicated because IMs within bp families often are not equally populated (24). IMs have been used to improve sequence alignments based on known 3D structures (25, 26), but few efforts have been made to predict bp types beginning with aligned sequences or from mutational data. One such effort resulted in the manual prediction of a loop E motif in potato spindle tuber viroid based on patterns of viable and lethal mutations (27, 28); here, we describe a more automated procedure.

[Figure 1 panels: one 4 x 4 matrix, with rows and columns labeled A, C, G, U, for each of the 12 bp families: 1-cWW, 2-tWW, 3-cWH, 4-tWH, 5-cWS, 6-tWS, 7-cHH, 8-tHH, 9-cHS, 10-tHS, 11-cSS, 12-tSS.]

Figure 1: The 12 bp families and their isosteric sub-families (23), demarcated by colors that show the unique patterns of acceptable and unacceptable substitutions for each interaction. Within one bp family, boxes with the same color represent isosteric combinations, boxes with “similar” colors (those grouped within single ovals at the bottom) near-isosteric, boxes with different and non-similar colors heterosteric, and gray boxes implausible (structurally incompatible). Three-letter names of the bp families are: c, cis; t, trans; W, Watson-Crick edge; H, Hoogsteen edge; S, sugar edge; according to previous nomenclature (2). Consequently, cWW designates a cis Watson-Crick/Watson-Crick interaction, tHS designates a trans Hoogsteen/Sugar edge interaction, and so on.
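One way to picture how an isostericity matrix sorts substitutions into the four categories of Figure 1 is the following minimal Python sketch. The matrix `TOY_IM`, its subfamily labels, and the near-isosteric grouping `TOY_NEAR` are hypothetical placeholders chosen for illustration. They are not published IM data, and the function is not Isfold's implementation.

```python
from enum import Enum

class Relation(Enum):
    ISOSTERIC = "isosteric"
    NEAR_ISOSTERIC = "near-isosteric"
    HETEROSTERIC = "heterosteric"
    FORBIDDEN = "forbidden"

# HYPOTHETICAL toy matrix: base combination -> isosteric subfamily label.
# Combinations that are absent play the role of the gray (implausible) boxes.
TOY_IM = {
    ("C", "G"): "I1", ("G", "C"): "I1", ("G", "G"): "I1",
    ("A", "U"): "I2",
    ("A", "A"): "I3",
}

# HYPOTHETICAL grouping of subfamilies treated as near-isosteric.
TOY_NEAR = {frozenset({"I1", "I2"})}

def classify(im, near, ref_pair, new_pair):
    """Classify the substitution ref_pair -> new_pair against one bp family's matrix."""
    ref_sub = im.get(ref_pair)
    new_sub = im.get(new_pair)
    if ref_sub is None or new_sub is None:
        return Relation.FORBIDDEN          # at least one combination is implausible
    if ref_sub == new_sub:
        return Relation.ISOSTERIC          # same subfamily (same color in Figure 1)
    if frozenset({ref_sub, new_sub}) in near:
        return Relation.NEAR_ISOSTERIC     # "similar" colors grouped together
    return Relation.HETEROSTERIC           # allowed, but geometrically different
```

With these placeholder values, `classify(TOY_IM, TOY_NEAR, ("C", "G"), ("G", "G"))` is isosteric, while a change to a combination missing from the matrix, such as `("U", "U")`, is forbidden.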

We have created the Isfold program, which uses isostericity patterns observed in sequence alignments or in experimental mutational data to suggest plausible bp configurations, particularly within RNA internal loops and hairpin loops. Isfold compares substitution patterns, which may or may not be considered standard two-base covariations, at every pair of nucleotide positions that may potentially form a bp, to the known isostericity patterns of all bp families. An “isostericity compliance score” for each potential bp is calculated, based on the adherence of its sequence variation to isostericity rules. The formula for calculating this score takes into consideration the number of isosteric, near-isosteric, heterosteric, and forbidden substitutions that are observed when comparing the sequence alignment or experimental mutation patterns to the IM of each bp type. The more favorable (isosteric and near-isosteric) substitutions produce higher scores, and the more unfavorable (heterosteric and forbidden) substitutions lower scores. Users also may modify scoring functions as detailed in the online user manual. After all possible bps are scored against all possible IMs, scores are sorted and bp predictions are displayed in text and graphical form (Figure 2). The user can systematically examine possible base pairing within a user-defined motif, in order of these scores. Because some bp families share similarities in their isostericity patterns, and because not all allowed substitutions may be fully populated, several bp types may be satisfied equally by the sequence alignment or mutational data. In such cases, Isfold sorts the equally satisfied bp families based on their observed rate of occurrence in the particular structural context (Table I).

Figure 2: Screenshot of Isfold Results Screen, which is a dynamic Graphical User Interface (GUI) depicting the predicted structures. The user can browse through plausible bp schemes, one Result Screen at a time, in order of their likelihood based on the calculated scores. The bp symbols used are the Leontis/Westhof representation (2): circle represents Watson-Crick edge, square represents Hoogsteen edge, and triangle represents sugar edge. Solid or filled symbols of any color indicate cis orientation, and open symbols indicate trans orientation. For example, the symbol △–□ (open triangle, open square) indicates a tSH or trans Sugar edge/Hoogsteen interaction and ●–▲ (filled circle, filled triangle) indicates a cWS or cis Watson-Crick/Sugar edge interaction. For simplicity, when both nucleotides in the bp use the same edge, only one such edge symbol is drawn. Thus, –■– (single filled square) indicates a cHH or cis Hoogsteen/Hoogsteen interaction. Values and colors of the arrows represent the isostericity compliance scores. The bottom section displays warnings concerning the quality of predictions, such as inadequate data in sequence alignment or prediction of incompatible interactions. Besides this dynamic graphical output, results also are provided in text format.

Figure 3: Flowchart outlining the procedure followed by Isfold, together with an interpretation of the results by Ribostral (25). The alignment (step a) is a starting point for other programs to predict secondary structure (step b) and for Isfold to predict non-WC base pairs (step c). For simplicity, panel (c) shows only four of the possible base pair types between nucleotides 8 and 28. The first two predictions shown have equal and perfect scores, because all substitutions are isosteric according to the cHS (cis Hoogsteen/Sugar edge) and tWS (trans Watson-Crick/Sugar edge) bp types, as seen in Ribostral screenshots (step d). Panel (e) explains how mutation experiments can be designed to differentiate between such similarly scoring base pairing possibilities.

a) Alignment

b) Secondary structure (by other software)

c) Isfold analysis of internal loop area (shaded, positions 7-10, 26-29): Pairing potential between each position from the first strand with each position from the second strand is tested, and the best base pair matches are chosen. For simplicity, results are shown only for the bolded positions 8 and 28.

d) Using Ribostral (25) together with Isfold helps visualize and interpret the results.

e) Ribostral can then help guide experiments by suggesting mutations to differentiate between similar scores. For example, mutation of positions 8 and 28 to G/G would be functional if it is a cHS bp type but lethal or non-functional if it is a tWS bp type (forbidden gray box).
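As a rough illustration of the scoring and ranking steps, the Python sketch below computes a score from the four substitution counts and enumerates candidate position pairs between two strands. The source does not give Isfold's formula, so the weights in `WEIGHTS`, the normalization, and all function names are assumptions made only for this example.

```python
from itertools import product

# ASSUMED weights: favorable categories raise the score, unfavorable ones lower it.
# The actual Isfold formula is documented in its user manual, not reproduced here.
WEIGHTS = {
    "isosteric": 1.0,
    "near-isosteric": 0.5,
    "heterosteric": -0.5,
    "forbidden": -1.0,
}

def compliance_score(counts, weights=WEIGHTS):
    """Weighted mean over observed substitutions; 0.0 when nothing was observed."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(weights[category] * n for category, n in counts.items()) / total

def candidate_pairs(strand1, strand2):
    """Every position of the first strand paired with every position of the second."""
    return list(product(strand1, strand2))

def rank(scored):
    """Sort (pos_i, pos_j, family, score) records from highest to lowest score."""
    return sorted(scored, key=lambda record: record[3], reverse=True)

# Internal loop positions used in the Figure 3 example.
pairs = candidate_pairs(range(7, 11), range(26, 30))

perfect = compliance_score({"isosteric": 4})                   # all substitutions isosteric
mixed = compliance_score({"isosteric": 3, "forbidden": 1})     # one forbidden substitution
ranked = rank([(8, 28, "family_x", mixed), (8, 28, "family_y", perfect)])
```

Under these assumed weights, a candidate whose substitutions are all isosteric receives the maximum score, and a single forbidden substitution pulls the score down. The family labels `family_x` and `family_y` are placeholders.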

Results and Discussion

Isfold provides a text output and a graphical interface to display bp possibilities based on isostericity rules. Figure 3 outlines the basic procedure with an example. Isfold was applied to 5S rRNA beginning with a high quality alignment (29) that was refined manually based on its known 3D structure. The quality of the input sequence alignment is the single most critical element to the success of the method. In 65% of the bps examined, the first prediction by Isfold was the correct one as observed in the crystal structure (30), in 15% the second prediction was correct, and in 15% the third prediction was correct. In most cases where predictions were erroneous, there was insufficient sequence variation to distinguish between possible bps.
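The three percentages can be read cumulatively. The short sketch below assumes they refer to disjoint sets of bps, which is how the sentence reads; the variable names are illustrative.

```python
from itertools import accumulate

# Percent of examined bps whose correct type was the 1st, 2nd, or 3rd prediction.
rank_hit_percent = [65, 15, 15]

# Percent of bps whose correct type appears within the top 1, 2, or 3 predictions.
cumulative = list(accumulate(rank_hit_percent))

# Remainder not recovered within the first three predictions.
missed = 100 - cumulative[-1]
```

On this reading, the correct bp type is among the first three predictions for 95% of the bps examined, leaving 5% outside the top three.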

This method has several limitations. As mentioned, the predictions inherently rely on the data used to generate sequence alignments, for which many inaccuracies often exist. While all sequence comparative methods rely on good alignments, the search for non-WC bps is especially sensitive because there typically are more limited phylogenetic data and thus the sequence signatures can be quite subtle. Mutational data can substantially improve the signal-to-noise of sequence variation by providing more diversity or by constraining the space of possible interactions. Nevertheless, even with limited sequence variation, Isfold can help differentiate between plausible and highly improbable bp configurations. The ones that are most reasonable can be used to guide mutation experiments to further restrict the possibilities, and may be combined with other programs such as Ribostral (25) to interpret sequence variations by superposing them on IMs (Figure 3). Using both programs together can help iteratively improve structure prediction and refine the alignment. The analysis is further complicated in that not every internal or hairpin loop adopts a unique structure, as some motifs are dynamic in nature and can form different structures, for example in response to ligand or protein binding or substrate positioning (4-7). In such interesting cases, molecular dynamics (MD) simulations may help establish alternative conformers (6) and can provide a useful and feasible additional step to discriminate between structures proposed by isostericity rules alone (31, 32).

Table I: Observed occurrence rate of bp types in different structural contexts, based on structures of small and large ribosomal subunits solved to high resolution (30, 37). Non-helical structural contexts include internal loops, hairpin loops, and any other tertiary bp interaction that is not part of a WC helix. Isfold options allow users to control how these contexts are specified. The three-letter names of the bp families are the same as in Figure 1.

bp family   All RNA   Non-helical   Internal Loop   Hairpin Loop
01-cWW      62.1      12.9          10.2            8.5
02-tWW      1.4       3.3           1.4             2.1
03-cWH      1.3       3.1           0.0             0.0
04-tWH      4.7       10.8          18.1            13.8
05-cWS      3.1       7.2           8.4             9.6
06-tWS      2.0       4.6           0.9             10.6
07-cHH      0.1       0.3           0.0             0.0
08-tHH      1.5       3.5           5.1             4.3
09-cHS      1.8       4.2           2.8             1.1
10-tHS      8.3       19.1          40.0            45.7
11-cSS      7.2       16.6          1.9             3.2
12-tSS      5.4       12.5          6.5             0.0
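The context-dependent ordering of equally scoring bp families, described in Materials and Methods, can be illustrated with the Table I values. The numbers below are copied from Table I; the data layout, the context labels, and the function `sort_predictions` are assumptions for this sketch, since the text does not describe how Isfold implements the tie-break.

```python
# Occurrence rates from Table I, in column order:
# (All RNA, Non-helical, Internal Loop, Hairpin Loop).
OCCURRENCE = {
    "cWW": (62.1, 12.9, 10.2, 8.5),
    "tWW": (1.4, 3.3, 1.4, 2.1),
    "cWH": (1.3, 3.1, 0.0, 0.0),
    "tWH": (4.7, 10.8, 18.1, 13.8),
    "cWS": (3.1, 7.2, 8.4, 9.6),
    "tWS": (2.0, 4.6, 0.9, 10.6),
    "cHH": (0.1, 0.3, 0.0, 0.0),
    "tHH": (1.5, 3.5, 5.1, 4.3),
    "cHS": (1.8, 4.2, 2.8, 1.1),
    "tHS": (8.3, 19.1, 40.0, 45.7),
    "cSS": (7.2, 16.6, 1.9, 3.2),
    "tSS": (5.4, 12.5, 6.5, 0.0),
}

# Hypothetical context labels matching the four table columns.
CONTEXTS = ("all", "non-helical", "internal loop", "hairpin loop")

def sort_predictions(predictions, context):
    """Order (family, score) records by score, then by occurrence rate in the context."""
    column = CONTEXTS.index(context)
    return sorted(
        predictions,
        key=lambda p: (-p[1], -OCCURRENCE[p[0]][column]),
    )

# The Figure 3 example: cHS and tWS both receive a perfect score.
tied = [("tWS", 1.0), ("cHS", 1.0)]
in_internal_loop = sort_predictions(tied, "internal loop")
in_hairpin_loop = sort_predictions(tied, "hairpin loop")
```

With these values, cHS (2.8) is listed before tWS (0.9) in an internal loop, while tWS (10.6) is listed before cHS (1.1) in a hairpin loop. The occurrence rate is used only to break ties, so a lower-scoring family never overtakes a higher-scoring one in this sketch.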

Isfold is limited further by the established patterns of isostericity (23) and thus cannot predict intermediate or novel types of bps not represented in the IMs (33). The discovery of such interactions by experimental, modeling (3, 23), or quantum mechanical (34-36) methods can be used later to revise Isfold and the bp classification schemes.

Conclusions and Future Improvements

Isfold represents a new computational approach to help predict bps in internal loops, hairpin loops, and other RNA motifs that play important roles in RNA folding and function. When the target motif is well localized within an alignment, a task typically well-achieved by available secondary structure prediction algorithms, Isfold can provide plausible WC and non-WC arrangements within the motif. The results of mutation experiments can be used to further narrow the possibilities that are most consistent with isomorphic combinations.

Currently, Isfold assesses bps independently of each other. However, RNA motifs often have specific architectures involving bp stacking and other discrete arrangements. As databases of such motifs become more complete, it will be possible to incorporate these contextual aspects into Isfold scoring schemes. In the interim, Isfold should provide a useful tool to help evaluate plausible base pairings that conform to basic rules of structural isomorphism.

Acknowledgements

The authors thank Matt Daugherty and Jason Fernandes for valuable discussions and contributions. This work was supported by NIH grant GM47478.

References and Footnotes

1. Lemieux, S. and Major, F. Nucleic Acids Res 30, 4250-4263 (2002).
2. Leontis, N. B. and Westhof, E. RNA 7, 499-512 (2001).
3. Walberer, B. J., Cheng, A. C., and Frankel, A. D. J Mol Biol 327, 767-780 (2003).
4. Mokdad, A., Krasovska, M. V., Sponer, J., and Leontis, N. B. Nucleic Acids Res 34, 1326-1341 (2006).
5. Noller, H. F. Science 309, 1508-1514 (2005).
6. Razga, F., Koca, J., Sponer, J., and Leontis, N. B. Biophys J 88, 3466-3485 (2005).
7. Shankar, N., Kennedy, S. D., Chen, G., Krugh, T. R., and Turner, D. H. Biochemistry 45, 11776-11789 (2006).
8. Zuker, M. Nucleic Acids Res 31, 3406-3415 (2003).
9. Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M., and Turner, D. H. Proc Natl Acad Sci USA 101, 7287-7292 (2004).
10. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, S., Tacker, M., and Schuster, P. Monatsh Chem 125, 167-188 (1994).
11. Knudsen, B. and Hein, J. Nucleic Acids Res 31, 3423-3428 (2003).
12. Ding, Y., Chan, C. Y., and Lawrence, C. E. Nucleic Acids Res 32, W135-141 (2004).
13. Lindgreen, S., Gardner, P. P., and Krogh, A. Bioinformatics 22, 2988-2995 (2006).
14. Gutell, R. R., Lee, J. C., and Cannone, J. J. Curr Opin Struct Biol 12, 301-310 (2002).
15. Gutell, R. R., Power, A., Hertz, G. Z., Putz, E. J., and Stormo, G. D. Nucleic Acids Res 20, 5785-5795 (1992).
16. MacKay, R. M., Spencer, D. F., Schnare, M. N., Doolittle, W. F., and Gray, M. W. Can J Biochem 60, 480-489 (1982).
17. Pays, E. Arch Int Physiol Biochim 84, 647-648 (1976).
18. Das, R. and Baker, D. Proc Natl Acad Sci USA 104, 14664-14669 (2007).
19. Major, F. and Griffey, R. Curr Opin Struct Biol 11, 282-286 (2001).
20. St-Onge, K., Thibault, P., Hamel, S., and Major, F. Nucleic Acids Res 35, 1726-1736 (2007).
21. Lisi, V. and Major, F. RNA 13, 1537-1545 (2007).
22. Shapiro, B. A., Yingling, Y. G., Kasprzak, W., and Bindewald, E. Curr Opin Struct Biol 17, 157-165 (2007).
23. Leontis, N. B., Stombaugh, J., and Westhof, E. Nucleic Acids Res 30, 3497-3531 (2002).
24. Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H., and Westhof, E. Nucleic Acids Res 31, 3450-3460 (2003).
25. Mokdad, A. and Leontis, N. B. Bioinformatics 22, 2168-2170 (2006).
26. Lescoute, A., Leontis, N. B., Massire, C., and Westhof, E. Nucleic Acids Res 33, 2395-2409 (2005).
27. Wang, Y., Zhong, X., Itaya, A., and Ding, B. J Virol 81, 2074-2077 (2007).
28. Zhong, X., Leontis, N., Qian, S., Itaya, A., Qi, Y., Boris-Lawrie, K., and Ding, B. J Virol 80, 8566-8581 (2006).
29. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S. R. Nucleic Acids Res 31, 439-441 (2003).
30. Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. Science 289, 905-920 (2000).
31. Leontis, N. B. and Westhof, E. RNA 4, 1134-1153 (1998).
32. Reblova, K., Spackova, N., Stefl, R., Csaszar, K., Koca, J., Leontis, N. B., and Sponer, J. Biophys J 84, 3564-3582 (2003).
33. Razga, F., Koca, J., Mokdad, A., and Sponer, J. Nucleic Acids Res 35, 4007-4017 (2007).
34. Sponer, J. E., Leszczynski, J., Sychrovsky, V., and Sponer, J. J Phys Chem B 109, 18680-18689 (2005).
35. Sponer, J. E., Spackova, N., Kulhanek, P., Leszczynski, J., and Sponer, J. J Phys Chem A 109, 2292-2301 (2005).
36. Sponer, J. E., Spackova, N., Leszczynski, J., and Sponer, J. J Phys Chem B 109, 11399-11410 (2005).
37. Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Jr., Morgan-Warren, R. J., Carter, A. P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. Nature 407, 327-339 (2000).

Date Received: October 5, 2007

Communicated by the Editor Jiri Sponer
