
Solutions to Solutions - Elsevier · 2015-05-14 · 188 Algebraic and Discrete Mathematical Methods for Modern Biology Solution. HereistheMotzkinpathandpoint-bracketnotationforthefirstRNAstructure:


Chapter 13

Solutions to Solutions

Qijun He¹, Matthew Macauley¹ and Robin Davies²

¹Department of Mathematical Sciences, Clemson University, Clemson, SC, USA, ²Department of Biology, Sweet Briar College, Sweet Briar, VA, USA

Exercise 13.1. Consider the RNA structures shown below.

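[Figure: two RNA structures, each drawn with its 5′-end and 3′-end labeled. Read from 5′ to 3′, the first strand is GAAGGGUUUUUUAAAAAUUU (20 bases) and the second is GGACCUUCCCCCCAAGGGGGGGGG (24 bases).]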

For each, draw an RNA diagram and find its minimum arc length, minimum stack size, and crossing number.

Solution. Here is the arc diagram of the first structure:

G A A G G G U U U U U U A A A A A U U U

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The minimum arc length is λ = 5, the minimum stack size is σ = 2, and the crossing number is 1.

Here is the arc diagram of the second structure:

G G A C C U U C C C C C C A A G G G G G G G G G

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

The minimum arc length is λ = 4, the minimum stack size is σ = 2, and the crossing number is 2.

Exercise 13.2. Draw the Motzkin path and point-bracket notation for the first RNA structure in the previous exercise. Attempt to do this for the second RNA structure. What goes wrong, and why?

Algebraic and Discrete Mathematical Methods for Modern Biology. http://dx.doi.org/10.1016/B978-0-12-801213-0.00019-8
Copyright © 2015 Elsevier Inc. All rights reserved.


Solution. Here is the Motzkin path and point-bracket notation for the first RNA structure:

[Figure: Motzkin path over positions 0 to 20 on the horizontal axis, with heights from 0 to 4 on the vertical axis.]

• ( ( • • • ( ( • • • • ) ) • • • ) ) •

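This does not work for the second structure because it has crossing arcs. A balanced string of parentheses, and hence a Motzkin path, can only encode arcs that are nested or disjoint, so once two arcs cross, a single type of bracket no longer records which opening position is paired with which closing position.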
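The passage from point-bracket notation to arcs and to the Motzkin path can be sketched in a few lines of Python. This is a minimal illustration, not the book's method: the function names are hypothetical, the symbol "." stands in for "•", and the stack-size routine assumes a stack is a maximal run of arcs (i, j), (i+1, j−1), and so on.

```python
def parse_point_bracket(s):
    """Return the arcs (i, j), 1-indexed, encoded by a point-bracket string."""
    stack, arcs = [], []
    for pos, ch in enumerate(s, start=1):
        if ch == "(":
            stack.append(pos)                 # remember where the arc opens
        elif ch == ")":
            arcs.append((stack.pop(), pos))   # close the most recent open arc
    return sorted(arcs)


def motzkin_heights(s):
    """Heights of the Motzkin path: up on '(', down on ')', flat otherwise."""
    step = {"(": 1, ")": -1}
    heights = [0]
    for ch in s:
        heights.append(heights[-1] + step.get(ch, 0))
    return heights


def min_arc_length(arcs):
    return min(j - i for i, j in arcs)


def min_stack_size(arcs):
    """Size of the smallest maximal run (i, j), (i+1, j-1), ..."""
    arc_set = set(arcs)
    sizes = []
    for i, j in arcs:
        if (i - 1, j + 1) in arc_set:
            continue                          # not the outermost arc of its stack
        size = 1
        while (i + size, j - size) in arc_set:
            size += 1
        sizes.append(size)
    return min(sizes)


# First structure of Exercise 13.1, with "." written for the bullet symbol.
structure = ".((...((....))...))."
arcs = parse_point_bracket(structure)
heights = motzkin_heights(structure)
print(arcs)
print(heights)
print(min_arc_length(arcs), min_stack_size(arcs))
```

The recovered arcs give λ = 5 and σ = 2, matching Exercise 13.1, and the path peaks at height 4 before returning to 0. The parser relies on a stack, which is exactly the step that has no analogue once arcs cross.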

Exercise 13.3. Draw an RNA strand that exhibits a pseudoknot structure, as a 5-noncrossing (but not 4-noncrossing) arc diagram. Draw the diagram and an actual RNA strand that it represents.

Solution. Consider the following fold of the RNA strand b = AAAACCCCCCCUUUU:

[Figure: the folded strand and its arc diagram.]

A A A A C C C C C C C U U U U

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

As it turns out, it is chemically impossible for an RNA strand to bond in this “antiparallel” fashion, but this would be an example of a 5-noncrossing (but not 4-noncrossing) arc diagram if it were possible. However, there are certain proteins that can fold into antiparallel structures like this.
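The claim "5-noncrossing but not 4-noncrossing" says that the largest set of mutually crossing arcs has size 4, and this can be checked by brute force. The sketch below is an illustration with hypothetical function names. The arc list for the strand AAAACCCCCCCUUUU is inferred rather than read off the figure: only A–U pairs are available in this strand, and four mutually crossing arcs force the pairing (1, 12), (2, 13), (3, 14), (4, 15).

```python
from itertools import combinations


def crosses(a, b):
    """Arcs (i, j) and (k, l) cross when i < k < j < l."""
    (i, j), (k, l) = sorted([a, b])
    return i < k < j < l


def max_mutually_crossing(arcs):
    """Largest number of arcs that pairwise cross (brute force, small inputs only)."""
    for size in range(len(arcs), 0, -1):
        for subset in combinations(arcs, size):
            if all(crosses(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Inferred pairing for AAAACCCCCCCUUUU (an assumption, see the lead-in).
pseudoknot = [(1, 12), (2, 13), (3, 14), (4, 15)]
# Noncrossing structure from Exercise 13.1, for comparison.
secondary = [(2, 19), (3, 18), (7, 14), (8, 13)]

print(max_mutually_crossing(pseudoknot))
print(max_mutually_crossing(secondary))
```

A diagram is k-noncrossing when it has no k mutually crossing arcs, so a maximum of 4 means 5-noncrossing but not 4-noncrossing, while the secondary structure from Exercise 13.1 has maximum 1.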

Exercise 13.4. Write out the loop decomposition of the noncrossing RNA structure from Exercise 13.1.

Solution. The null loop is L0 = {1, 2, 19, 20}; there is one 1-loop, L1 = {9, 10, 11, 12}; two stacked pairs, {3, 18} and {8, 13}; and one interior loop, {4, 5, 6, 7, 14, 15, 16, 17}.

Exercise 13.5. Draw a secondary structure that has a 3-loop, and then draw its arc diagram and write out its loop decomposition.

Solution. Consider the following fold of b = AAAGGCCCCCGGGGUUUUUU:

[Figure: the folded strand and its arc diagram.]

A A A G G C C C C C G G G G U U U U U U

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The loop decomposition consists of the null loop L0 = {1, 2, 19, 20}; two 1-loops, {6, 7, 8} and {13, 14, 15}; three stacked pairs, {3, 18}, {5, 9}, and {12, 16}; and one 3-loop, {4, 10, 11, 17}.
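The two loop decompositions above follow one bookkeeping rule, which the sketch below tries to make explicit. It rests on assumptions read off from the stated answers rather than from a definition in the text: the loop closed by an arc (i, j) collects the unpaired vertices directly under it together with the endpoints of the arcs directly under it, while (i, j) itself belongs to the loop one level up. The arc list for Exercise 13.5 is inferred from its stated decomposition, and the function names are hypothetical.

```python
def loop_decomposition(n, arcs):
    """Map each closing arc (None for the null loop) to its list of vertices.

    Assumes the arcs are noncrossing and vertices are numbered 1..n.
    """
    partner = {}
    for i, j in arcs:
        partner[i], partner[j] = j, i
    loops = {None: []}
    open_arcs = [None]                  # arcs currently open, innermost last
    for v in range(1, n + 1):
        p = partner.get(v)
        if p is None:                   # unpaired vertex
            loops[open_arcs[-1]].append(v)
        elif p > v:                     # an arc opens at v
            loops[open_arcs[-1]].append(v)
            open_arcs.append((v, p))
            loops[(v, p)] = []
        else:                           # an arc closes at v
            open_arcs.pop()
            loops[open_arcs[-1]].append(v)
    return loops


def loop_type(closing_arc, vertices, arcs):
    """Name a loop by k = 1 + (number of arcs directly under the closing arc)."""
    if closing_arc is None:
        return "null loop"
    endpoints = {v for arc in arcs for v in arc}
    k = 1 + sum(v in endpoints for v in vertices) // 2
    if k == 2:
        return "stacked pair" if len(vertices) == 2 else "interior loop"
    return f"{k}-loop"


ex_13_1 = [(2, 19), (3, 18), (7, 14), (8, 13)]
# Inferred from the decomposition stated for Exercise 13.5 (an assumption).
ex_13_5 = [(2, 19), (3, 18), (4, 10), (5, 9), (11, 17), (12, 16)]

for arcs in (ex_13_1, ex_13_5):
    for closing_arc, vertices in loop_decomposition(20, arcs).items():
        print(loop_type(closing_arc, vertices, arcs), vertices)
    print()
```

Every vertex lands in exactly one loop, which is why the listed sets partition {1, ..., 20} in both exercises.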

Exercise 13.6. Say that a secondary structure S of the RNA strand b = GGGACCUUCC is saturated if no more arcs can be added without introducing crossings. Find all possible foldings S of b = GGGACCUUCC that are saturated and have at least two base pairs, and compute ES(b) of each.


Solution. There are six possible foldings. In addition to the four shown in Figures 13.12 and 13.13 are these two:

[Figure: arc diagram and folded structure on G G G A C C U U C C with ES(b) = 6.]

[Figure: arc diagram and folded structure on G G G A C C U U C C with ES(b) = 8.]

Exercise 13.7. There is one more valid traceback for the table in Figure 13.12. Find it, and draw the corresponding RNA structure and arc diagram.

Solution.

[Figure: the DP table for b = GGGACCUUCC with the additional traceback marked, together with the corresponding arc diagram and folded RNA structure; ES(b) = 8.]

Exercise 13.8. The optimal fold of the sequence b = GAAACAAAAU is a secondary structure with two nonnested arcs. Use a dynamic program to fill out the table, and then traceback. Find the arc diagram and RNA structure. How is the feature of two nonnested arcs reflected in the DP table?

Solution. Here is the DP table, arc diagram, and RNA structure:

[Figure: the DP table for b = GAAACAAAAU, the arc diagram, and the folded RNA structure; ES(b) = 5.]

The fact that there are two nonnested arcs is reflected in the very last step of the table, where the “Case (4)” recursive step finally arises to compute E(1, 10). The E(1, 10) entry is

$$E(1, 10) = \max_{1<k<10} \{E(1, k) + E(k + 1, 10)\} = E(1, 5) + E(6, 10) = 5.$$
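The table-filling step can be sketched as a short memoized recursion. This is a Nussinov-style sketch under stated assumptions, not a transcription of the book's algorithm: the pair scores (G–C = 3, A–U = 2, G–U = 1) are inferred from the energy scores quoted in these solutions, an arc (i, j) is allowed only when j − i is at least `min_arc`, and the names `SCORE` and `max_energy` are hypothetical. The last branch of the recursion is the split over k that the text calls Case (4).

```python
from functools import lru_cache

# Assumed pair scores, inferred from the ES(b) values in the solutions.
SCORE = {("G", "C"): 3, ("C", "G"): 3,
         ("A", "U"): 2, ("U", "A"): 2,
         ("G", "U"): 1, ("U", "G"): 1}


def max_energy(b, min_arc=4):
    """Maximum total pair score over noncrossing structures on the strand b."""

    @lru_cache(maxsize=None)
    def E(i, j):                               # positions are 1-indexed, inclusive
        if j - i < min_arc:
            return 0                           # interval too short to hold an arc
        best = max(E(i + 1, j), E(i, j - 1))   # i unpaired, or j unpaired
        pair = SCORE.get((b[i - 1], b[j - 1]), 0)
        if pair:
            best = max(best, E(i + 1, j - 1) + pair)    # i pairs with j
        for k in range(i + 1, j):                       # split: i < k < j
            best = max(best, E(i, k) + E(k + 1, j))
        return best

    return E(1, len(b)) if b else 0


print(max_energy("GAAACAAAAU"))                      # full strand of Exercise 13.8
print(max_energy("GAAAC"), max_energy("AAAAU"))      # the two halves E(1,5), E(6,10)
```

For GAAACAAAAU the two halves score 3 and 2, and no single nested structure does better, so the optimum of 5 is only reached through the split at k = 5.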


Exercise 13.9. The example from the text of folding b = GGGACCUUCC uses the requirement that the minimum arc length is λ = 4. Repeat this example, except allow arc lengths of size λ = 3. That is, fill out the table to find E(1, 10) and then traceback to find all secondary structures that achieve this maximum.

Solution. Allowing arcs of length 3 yields two structures with energy score ES(b) = 9:

[Figure: the DP table for b = GGGACCUUCC with λ = 3, shown twice, once for each traceback from E(1, 10) = 9, each with its arc diagram and folded RNA structure; ES(b) = 9 in both cases.]
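For a strand this short, the set of optimal structures can also be found by exhaustive enumeration instead of traceback. The sketch below is a cross-check under assumptions, not the traceback procedure from the text: the pair scores (G–C = 3, A–U = 2, G–U = 1) are inferred from the quoted energy scores, and the function names are hypothetical.

```python
# Assumed pair scores, inferred from the ES(b) values in the solutions.
SCORE = {("G", "C"): 3, ("C", "G"): 3,
         ("A", "U"): 2, ("U", "A"): 2,
         ("G", "U"): 1, ("U", "G"): 1}


def structures(b, i, j, min_arc):
    """Yield every noncrossing set of allowed arcs on positions i..j (1-indexed)."""
    if j - i < min_arc:
        yield frozenset()
        return
    # Position i is left unpaired.
    yield from structures(b, i + 1, j, min_arc)
    # Position i is paired with some k, splitting the rest into inside and outside.
    for k in range(i + min_arc, j + 1):
        if (b[i - 1], b[k - 1]) in SCORE:
            for inner in structures(b, i + 1, k - 1, min_arc):
                for outer in structures(b, k + 1, j, min_arc):
                    yield inner | outer | {(i, k)}


def optimal_structures(b, min_arc):
    """Return the best score and the sorted list of structures achieving it."""
    scored = [(sum(SCORE[(b[i - 1], b[j - 1])] for i, j in s), sorted(s))
              for s in structures(b, 1, len(b), min_arc)]
    best = max(score for score, _ in scored)
    return best, sorted(arcs for score, arcs in scored if score == best)


best, optima = optimal_structures("GGGACCUUCC", min_arc=3)
print(best)
for arcs in optima:
    print(arcs)
```

Under these assumptions the enumeration returns score 9 with exactly two structures: one with arcs (1, 10), (2, 9), (3, 6), and one with arcs (1, 10), (2, 9), (3, 8), (4, 7).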

Exercise 13.10. Find a derivation for the string α = aaaabbbb from the grammar S → aSb | ab and draw the parse tree. Is this grammar unambiguous?

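Solution. A derivation for the string α = aaaabbbb is

S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaaabbbb.

Here is the parse tree, listed level by level from the root (each S has children a, S, b, except the lowest S, whose children are a and b):

S

a S b

a S b

a S b

a b

This grammar is unambiguous: the only way to derive $a^n b^n$ is to apply S → aSb exactly n − 1 times and then S → ab, so each string in the language has a single parse tree.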
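The derivation can be generated mechanically, which also makes the lack of choice visible: each sentential form contains a single S, and the target length fixes which rule to apply. A minimal sketch, with a hypothetical function name:

```python
def derive(n):
    """Return the sentential forms deriving a^n b^n (n >= 1) from S -> aSb | ab."""
    if n < 1:
        raise ValueError("the grammar generates a^n b^n only for n >= 1")
    forms = ["S"]
    for _ in range(n - 1):
        forms.append(forms[-1].replace("S", "aSb"))   # apply S -> aSb
    forms.append(forms[-1].replace("S", "ab"))        # finish with S -> ab
    return forms


print(" => ".join(derive(4)))
```

For n = 4 this reproduces the five sentential forms of the derivation above.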

Exercise 13.11. Construct a regular grammar that generates the language $\{b^n a \mid n \ge 0\}$.

Solution. The grammar S → bS | a generates the language $\{b^n a \mid n \ge 0\}$.

Exercise 13.12. Construct a regular grammar that generates the language $\{a b^n a \mid n \ge 0\}$. Try to construct a regular grammar that generates the language $\{a^n b^n \mid n \ge 0\}$. What goes wrong?

Solution. The grammar

S → aB

B → bB | a


generates the language $\{a b^n a \mid n \ge 0\}$. The language $\{a^n b^n \mid n \ge 0\}$ is not regular, so no regular grammar can generate it. The reader is encouraged to explain in their own words what goes wrong.
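One way to check the first grammar is to expand it exhaustively up to a length bound and compare the result with the pattern ab*a. The sketch below does this; the function name and the length bound are illustrative choices, and uppercase letters are treated as nonterminals.

```python
import re

# Right-regular grammar from the solution: S -> aB, B -> bB | a.
RULES = {"S": ["aB"], "B": ["bB", "a"]}


def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    words, frontier = set(), ["S"]
    while frontier:
        form = frontier.pop()
        head = next((c for c in form if c in RULES), None)
        if head is None:                  # no nonterminal left: a finished word
            words.add(form)
            continue
        for rhs in RULES[head]:
            new_form = form.replace(head, rhs, 1)
            terminals = sum(c not in RULES for c in new_form)
            if terminals <= max_len:      # prune forms that are already too long
                frontier.append(new_form)
    return sorted(words)


words = generate(5)
print(words)
print(all(re.fullmatch(r"ab*a", w) for w in words))
```

Each sentential form carries at most one nonterminal, always at the right end, so the grammar can remember which rule comes next but not how many a's have already been produced.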

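Exercise 13.13. Use the Knudsen-Hein grammar to construct the next three secondary structures, S3, S4, and S5, from Figure 13.17.

Solution. A derivation of the structure S3:

[Figure: derivation of S3, with each step labeled by the probability of the rule applied.]

A derivation of the structure S4:

[Figure: derivation of S4, with each step labeled by the probability of the rule applied.]

A derivation of the structure S5:

[Figure: derivation of S5, with each step labeled by the probability of the rule applied.]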

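Exercise 13.14. Use the Knudsen-Hein grammar to construct a derivation of the hairpin loop ssddsssd′d′ss, and compute its probability.

Solution. A derivation for this hairpin loop (this time without the optional boldface and arc decorations) is

$$S \overset{p_1^4}{\Longrightarrow} LLLLS \overset{q_1}{\Longrightarrow} LLLLL \overset{p_2}{\Longrightarrow} LLdFd'LL \overset{p_3}{\Longrightarrow} LLddFd'd'LL \overset{q_3}{\Longrightarrow} LLddLSd'd'LL$$

$$\overset{p_1}{\Longrightarrow} LLddLLSd'd'LL \overset{q_1}{\Longrightarrow} LLddLLLd'd'LL \overset{q_2^7}{\Longrightarrow} ssddsssd'd'ss.$$

The probability of this structure S is $P(S) = p_1^5 p_2 p_3 q_1^2 q_2^7 q_3$.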
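The exponents in P(S) are just the number of times each rule is used, so they can be recovered by parsing the string. The sketch below is an illustrative recursive-descent parser and assumes the rule set used in the derivation above: S → LS (p1) | L (q1), L → dFd′ (p2) | s (q2), F → dFd′ (p3) | LS (q3). The function names are hypothetical, and d′ may be typed as d' in the input.

```python
from collections import Counter


def tokenize(structure):
    """Split a string over s, d, d' into tokens (the prime may be ' or ′)."""
    tokens = []
    for ch in structure:
        if ch in "'′":
            tokens[-1] += "'"          # attach the prime to the preceding d
        else:
            tokens.append(ch)
    return tokens


def rule_counts(structure):
    """Count how often each rule is used in the derivation of the structure."""
    tokens = tokenize(structure)
    match, stack = {}, []
    for pos, tok in enumerate(tokens):
        if tok == "d":
            stack.append(pos)
        elif tok == "d'":
            match[stack.pop()] = pos
    counts = Counter()

    def end_of_L(i):                   # last token of the L that starts at i
        return match[i] if tokens[i] == "d" else i

    def parse_S(i, j):                 # tokens i..j inclusive, nonempty
        e = end_of_L(i)
        if e == j:
            counts["q1"] += 1          # S -> L
        else:
            counts["p1"] += 1          # S -> LS
            parse_S(e + 1, j)
        parse_L(i, e)

    def parse_L(i, j):
        if i == j:
            counts["q2"] += 1          # L -> s
        else:
            counts["p2"] += 1          # L -> dFd'
            parse_F(i + 1, j - 1)

    def parse_F(i, j):
        if i > j:
            raise ValueError("empty interior is not derivable from F")
        if tokens[i] == "d" and match[i] == j:
            counts["p3"] += 1          # F -> dFd'
            parse_F(i + 1, j - 1)
        else:
            e = end_of_L(i)
            if e >= j:
                raise ValueError("F -> LS needs at least two L's")
            counts["q3"] += 1          # F -> LS
            parse_L(i, e)
            parse_S(e + 1, j)

    parse_S(0, len(tokens) - 1)
    return counts


print(dict(rule_counts("ssddsssd'd'ss")))
```

For ssddsssd′d′ss the counts are p1: 5, q1: 2, p2: 1, p3: 1, q3: 1, q2: 7, which are the exponents in P(S). The same function applied to ddssd′sd′ gives q1: 3, p2: 2, q3: 2, q2: 3.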

Exercise 13.15. Modify the rules to make the minimum loop size j − i ≥ 4, and repeat the above problem.

Solution. If the minimum loop size is j − i ≥ 4, then the Knudsen-Hein grammar needs to be modified as follows:

S → LS with probability p1, or L with probability q1

L → dFd′ with probability p2, or s with probability q2

F → dFd′ with probability p3, or LLS with probability q3.

Below is a derivation of the hairpin loop ssddsssd′d′ss from the previous exercise:

$$S \overset{p_1^4}{\Longrightarrow} LLLLS \overset{q_1}{\Longrightarrow} LLLLL \overset{p_2}{\Longrightarrow} LLdFd'LL \overset{p_3}{\Longrightarrow} LLddFd'd'LL$$

$$\overset{q_3}{\Longrightarrow} LLddLLSd'd'LL \overset{q_1}{\Longrightarrow} LLddLLLd'd'LL \overset{q_2^7}{\Longrightarrow} ssddsssd'd'ss.$$

The probability of this structure S is $p_1^4 p_2 p_3 q_1^2 q_2^7 q_3$.

Exercise 13.16. Consider the following stochastic context-free grammar for RNA folding:

S → sS (p1) | dSd′S (p2) | ε (p3),

where ε is the empty string. This grammar was proposed by Ivo Hofacker but unpublished, and it was one of the eight grammars compared to the Knudsen-Hein grammar in [18]. What types of structures does this grammar produce? What are its weaknesses?

Solution. If pairs of the form (d, d′) are assumed to be arcs, and s is an isolated vertex, then this grammar produces noncrossing matchings. The grammar is unambiguous (not obvious), but it allows structures that do not arise in RNA secondary structures, such as the empty string, hairpin loops of any size, and 1-loops.


Exercise 13.17. The following grammar proposed in [18] is ambiguous:

S → dSd′ | dS | Sd | SS | ε.

Find a secondary structure that has multiple left parse trees.

Solution. This is one of many possible solutions: two left parse trees for the structure ddddd′.

[Figure: two parse trees for ddddd′. Both begin with S → dSd′. In the first, the inner S is expanded by S → Sd three times and then by S → ε; the second is its mirror image, expanding the inner S by S → dS three times and then by S → ε.]

Though these trees are the same topologically, they are different as embeddings in the plane. For a simpler solution, consider the following two left derivations of the empty string, ε:

S ⇒ ε, and S ⇒ SS ⇒ Sε ⇒ ε.

Exercise 13.18. Consider the following “mystery grammar” from Chapter 9.3 of [21]:

S → aAu | cAg | gAc | uAa

A → aBu | cBg | gBc | uBa

B → aCu | cCg | gCc | uCa

C → gaaa | gcaa.

What is the language L derived from this grammar? Describe it in terms of RNA secondary structures.

Solution. This language consists of the RNA hairpins whose loop is either GAAA or GCAA and whose stack, or “stem,” has size 3. Unlike the Knudsen-Hein grammar, this one specifies the actual bases in the structure.
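Because the grammar has no recursion, its language is finite and can be listed by expanding every rule. The sketch below does so; the function name is hypothetical, and uppercase letters are nonterminals while lowercase letters are bases.

```python
# The mystery grammar, written as nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": ["aAu", "cAg", "gAc", "uAa"],
    "A": ["aBu", "cBg", "gBc", "uBa"],
    "B": ["aCu", "cCg", "gCc", "uCa"],
    "C": ["gaaa", "gcaa"],
}


def expand(form):
    """Yield every terminal string derivable from the sentential form."""
    for idx, ch in enumerate(form):
        if ch in GRAMMAR:                        # leftmost nonterminal
            for rhs in GRAMMAR[ch]:
                yield from expand(form[:idx] + rhs + form[idx + 1:])
            return
    yield form                                   # no nonterminal left


language = sorted(expand("S"))
print(len(language))
print(language[:4])
```

There are 4 × 4 × 4 choices for the three stacked Watson-Crick pairs and 2 choices for the loop, so the language has 128 words, each of length 10: three bases, a four-base loop, and the three complementary bases in reverse order.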

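Exercise 13.19. Draw the left parse tree of the string ddssd′sd′ in the Knudsen-Hein grammar. The left derivation of this string is shown in Figure 13.18.

Solution. [Figure: the left parse tree.] In bracket form, with each node followed by its children in left-to-right order, the tree is

S( L( d F( L( d F( L(s) S( L(s) ) ) d′ ) S( L(s) ) ) d′ ) ).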


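Exercise 13.20. Compute the right derivation of the string ddssd′sd′ and draw its right parse tree.

Solution. The right derivation of ddssd′sd′ is

$$S \overset{q_1}{\Longrightarrow} L \overset{p_2}{\Longrightarrow} dFd' \overset{q_3}{\Longrightarrow} dLSd' \overset{q_1}{\Longrightarrow} dLLd' \overset{q_2}{\Longrightarrow} dLsd' \overset{p_2}{\Longrightarrow} ddFd'sd'$$

$$\overset{q_3}{\Longrightarrow} ddLSd'sd' \overset{q_1}{\Longrightarrow} ddLLd'sd' \overset{q_2}{\Longrightarrow} ddLsd'sd' \overset{q_2}{\Longrightarrow} ddssd'sd',$$

and its right parse tree is the same as its left parse tree:

S( L( d F( L( d F( L(s) S( L(s) ) ) d′ ) S( L(s) ) ) d′ ) ).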

Exercise 13.21. Draw the core and the shape of the pseudoknot shown in Figure 13.5.

Solution. Contracting the stacks into single edges yields the core, shown below at left. On the right is the shape, which results from removing isolated vertices from the core.

[Figure: the core, on vertices 1 to 15 (left), and the shape, on vertices 1 to 4 (right).]

Exercise 13.22. Draw the core and the shape of the pseudoknot from Exercise 13.1.

Solution. The core is shown at left, and the shape is shown at right:

[Figure: the core, on vertices 1 to 14 (left), and the shape, on vertices 1 to 6 (right).]


Exercise 13.23. Find all shapes on exactly six nodes. That is, find all noncrossing matchings on six nodes with maximum stack length equal to 1.

Solution. There are ten shapes on six nodes:

[Figure: the ten shapes, each drawn as an arc diagram on vertices 1 to 6.]
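The count can be checked by enumerating all perfect matchings on six nodes and discarding those that contain a stack of size two or more, that is, a pair of arcs (i, j) and (i+1, j−1). The sketch below is a brute-force check with hypothetical function names. It suggests that the count of ten is obtained when crossing arcs are allowed, as for shapes of pseudoknot structures; under the same stack condition, only two of the ten matchings are noncrossing.

```python
def perfect_matchings(vertices):
    """Yield every perfect matching of a sorted vertex list as a list of arcs."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def has_stack(matching):
    """True if some arc (i, j) sits directly on top of an arc (i+1, j-1)."""
    arcs = set(matching)
    return any((i + 1, j - 1) in arcs for i, j in arcs)


def has_crossing(matching):
    """True if two arcs (i, j), (k, l) satisfy i < k < j < l."""
    return any(i < k < j < l for i, j in matching for k, l in matching)


all_matchings = list(perfect_matchings(list(range(1, 7))))
shapes = [m for m in all_matchings if not has_stack(m)]
noncrossing_shapes = [m for m in shapes if not has_crossing(m)]

print(len(all_matchings), len(shapes), len(noncrossing_shapes))
for m in shapes:
    print(m)
```

Of the 15 perfect matchings on six nodes, five contain a stack, which leaves the ten shapes.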