R’MES: Finding Exceptional Motifs in Sequences

S. Schbath

INRA, Jouy-en-Josas, France

http://genome.jouy.inra.fr/ssb/rmes/

BOSC, Stockholm, June 27-28, 2009 – p.1


Introduction: motifs and statistics


DNA and motifs

• DNA: long molecule, sequence of nucleotides

• Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine).

• Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG

• Functional motif: recognized by proteins or enzymes to initiate a biological process

TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA. . .

TAGACAGATAGACGAT CAGTAG CCAGTAGACAGTAGGCATGA. . .
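Counting the occurrences of a motif, overlaps included, can be sketched as follows. The helper name `find_occurrences` is illustrative and not part of R’MES; it is applied to the visible part of the example sequence above.

```python
def find_occurrences(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start of every (possibly overlapping) occurrence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one position later so overlapping occurrences are kept.
        start = sequence.find(motif, start + 1)
    return positions


sequence = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
positions = find_occurrences(sequence, "CAGTAG")
print(len(positions), positions)  # → 3 [16, 23, 30]
```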


Some functional motifs

• Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break.
  E.g. GAATTC, recognized by EcoRI.
  → very rare along bacterial genomes

• Chi motif: recognized by an enzyme which moves along the DNA sequence and degrades it ⇒ the enzyme's degradation activity is stopped and DNA repair is stimulated by recombination.
  E.g. GCTGGTGG, recognized by RecBCD (E. coli).
  → very frequent along the E. coli genome

• parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains.
  [degenerate motif drawn with stacked alternative bases; fragments as extracted: Tc GTTt Ac ACt ACGTGAt AACA]
  → very frequent in the ORI domain, rare elsewhere

• Promoter: structured motif recognized by the RNA polymerase to initiate gene transcription.
  E.g. TTGAC-(16;18)-TATAAT (E. coli): two boxes separated by a spacer of 16 to 18 nucleotides.
  → particularly located in front of genes
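A structured motif such as the promoter example (two boxes and a spacer of variable length) can be searched with a regular expression. This is only a sketch of an exact-match scan; the function name and the toy sequence are made up for illustration, and real promoter prediction tolerates mismatches in the boxes.

```python
import re

# Two boxes separated by 16 to 18 arbitrary nucleotides.
# The lookahead lets the scan report candidates that overlap each other;
# at most one spacer length is reported per start position.
PROMOTER = re.compile(r"(?=(TTGAC([ACGT]{16,18})TATAAT))")


def find_structured_motif(sequence: str) -> list[tuple[int, int]]:
    """Return (start position, spacer length) for each candidate site."""
    return [(m.start(), len(m.group(2))) for m in PROMOTER.finditer(sequence.upper())]


toy = "GG" + "TTGAC" + "A" * 17 + "TATAAT" + "CC"
print(find_structured_motif(toy))  # → [(2, 17)]
```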


Prediction of functional motifs

Most functional motifs are unknown in the different species. For instance,
• What would be the Chi motif of S. aureus? [Halpern et al. (08)]
• Is there an equivalent of parS in E. coli? [Mercier et al. (08)]

Statistical approach: identify candidate motifs based on their statistical properties.

The most over-represented 8-letter words under M1, E. coli (ℓ = 4.6 × 10^6):

word      obs  exp    score
gctggtgg  762  84.9   73.5
ggcgctgg  828  125.9  62.6
cgctggcg  870  150.8  58.6
gctggcgg  723  125.9  53.3
cgctggtg  619  101.7  51.3

The most over-represented families anbcdefg under M1, H. influenzae (ℓ = 1.8 × 10^6):

motif     obs  exp    score
gntggtgg  223  55.3   22.33
anttcatc  469  180.3  21.59
anatcgcc  288  87.8   21.38
tnatcgcc  279  84.5   21.18
gnagaaga  270  83.6   20.10
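As a rough sanity check, the E. coli scores in the table are close to the simple standardization (obs − exp)/√exp, i.e. a Gaussian score in which the variance of the count is replaced by its expectation. That shortcut is an assumption made here for illustration only; the variance used by R’MES under M1 is more involved, so small discrepancies are expected.

```python
import math

# (word, observed count, expected count, score reported on the slide)
rows = [
    ("gctggtgg", 762, 84.9, 73.5),
    ("ggcgctgg", 828, 125.9, 62.6),
    ("cgctggcg", 870, 150.8, 58.6),
    ("gctggcgg", 723, 125.9, 53.3),
    ("cgctggtg", 619, 101.7, 51.3),
]


def naive_score(observed: float, expected: float) -> float:
    """Standardized excess, assuming Var(N) is approximately E(N)."""
    return (observed - expected) / math.sqrt(expected)


for word, obs, exp, reported in rows:
    print(f"{word}  naive={naive_score(obs, exp):6.2f}  reported={reported}")
```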


Presentation of R’MES


Statistical questions addressed by R’MES

Questions related to the significance of the number of occurrences of motifs w in sequences:

• Is N^obs(w) significantly high?
• Is N^obs(w) significantly higher than N^obs(w′)?
  → If w′ = w̄ (the reverse complement of w): is w significantly skewed (strand bias)?
• Is N_1^obs(w) significantly more unexpected than N_2^obs(w)?

Several types of motifs w:

• fixed words (e.g. gctggtgg),
• degenerate patterns (e.g. gntggtgg),
• sets of words (e.g. {w, w̄}).
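A degenerate pattern can be viewed as a family of fixed words whose counts are added up. The sketch below expands a pattern containing `n` (any nucleotide) into its words; the function names and the toy sequence are illustrative, and only the letters a, c, g, t and n are handled.

```python
from itertools import product

# Subset of the degenerate alphabet: plain letters plus 'n' for any nucleotide.
ALTERNATIVES = {"a": "a", "c": "c", "g": "g", "t": "t", "n": "acgt"}


def expand_pattern(pattern: str) -> list[str]:
    """List every fixed word covered by the degenerate pattern."""
    choices = [ALTERNATIVES[letter] for letter in pattern.lower()]
    return ["".join(letters) for letters in product(*choices)]


def family_count(sequence: str, pattern: str) -> int:
    """Total number of (overlapping) occurrences of all words in the family."""
    seq = sequence.lower()
    return sum(
        sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))
        for word in expand_pattern(pattern)
    )


print(expand_pattern("gntggtgg"))
print(family_count("gctggtggaagttggtgg", "gntggtgg"))
```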


Is N^obs(w) significantly high?

• One needs to calculate the p-value P(N(w) ≥ N^obs(w)), where N(w) is the count (a random variable) of w in random sequences (→ model).

• R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligos of length 1 up to (m + 1).
  It is possible to take the phase in coding sequences into account (Mm_3).

• R’MES approximates the p-value by using
  • either a Gaussian approximation of N(w) (when E(N(w)) is large) [Prum et al. (95)], [Schbath et al. (95)]
  • or a compound Poisson distribution of N(w) (when E(N(w)) is small) [Schbath (95)], [Roquain and Schbath (07)]
  (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)

• R’MES produces scores of exceptionality (probit transformation).
  High positive (resp. negative) scores correspond to exceptionally frequent (resp. rare) motifs.

rmes –gauss –s seqfile –m m –l wordlength –o outputfile
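To make the model concrete, here is a minimal sketch of two ingredients: the usual plug-in estimate of the expected count of a word under an order-m Markov model (product of the counts of its (m+1)-letter subwords divided by the product of the counts of its inner m-letter subwords), and a probit transformation of an upper-tail p-value. Function names are illustrative, the sign convention of the score is an assumption chosen so that small p-values give high positive scores, and none of this is the R’MES code itself.

```python
from collections import Counter
from statistics import NormalDist


def count_words(sequence: str, length: int) -> Counter:
    """Count all overlapping words of the given length."""
    return Counter(sequence[i:i + length] for i in range(len(sequence) - length + 1))


def expected_count(sequence: str, word: str, m: int) -> float:
    """Plug-in estimate of E(N(word)) under an order-m Markov model (m >= 1)."""
    if not 1 <= m <= len(word) - 1:
        raise ValueError("need 1 <= m <= len(word) - 1")
    long_counts = count_words(sequence, m + 1)
    short_counts = count_words(sequence, m)
    numerator = 1.0
    for i in range(len(word) - m):          # all (m+1)-letter subwords
        numerator *= long_counts[word[i:i + m + 1]]
    denominator = 1.0
    for i in range(1, len(word) - m):       # inner m-letter subwords
        denominator *= short_counts[word[i:i + m]]
    return numerator / denominator if denominator else 0.0


def probit_score(p_value: float) -> float:
    """Map an upper-tail p-value in (0, 1) to a standard normal quantile."""
    # By symmetry this equals inv_cdf(1 - p) but stays accurate for tiny p.
    return -NormalDist().inv_cdf(p_value)


toy = "ACGTACGTACGT"
print(expected_count(toy, "ACG", 1))  # N(AC) * N(CG) / N(C) = 3 * 3 / 3
print(probit_score(0.001))
```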


Is N^obs(w) significantly higher than N^obs(w̄)?

• One needs to calculate the p-value P( N(w)/N(w̄) ≥ N^obs(w)/N^obs(w̄) ), where N(·) is the count (a random variable) in random sequences (→ model).

• R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligos of length 1 up to (m + 1).
  It is possible to take the phase in coding sequences into account (Mm_3).

• R’MES approximates the p-value by using the 2-dimensional Gaussian approximation of (N(w), N(w̄)) (when E(N(w)) and E(N(w̄)) are large) [Prum et al. (95)], [Schbath et al. (95)]

• R’MES produces scores of exceptional skew (probit transformation):
  High positive (resp. negative) scores correspond to motifs significantly more frequent (resp. rare) along the sequence than along the complementary one.

rmes –skew –seq seqfile –m m –l wordlength –o outputfile


Is N_1^obs(w) significantly more exceptional than N_2^obs(w)?

• One wants to compare the exceptionality of a motif w in two different sequences (two observed counts N_1^obs(w) and N_2^obs(w)).

• R’MES computes a test statistic and its associated p-value to test
  H0: {w is equally exceptional in both sequences}
  against
  H1: {w is more exceptional in the first sequence}
  [Robin et al. (08)]

• The test is performed by considering the occurrence processes as Poisson processes whose intensities take the sequence compositions in oligos of length 1 up to (m + 1) into account.

• Option –seq2 soon available in R’MES.
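One way to picture such a comparison, assuming independent Poisson counts whose means are the expected counts e1 and e2 multiplied by a common factor under H0: conditionally on the total n1 + n2, the first count is binomial with success probability e1/(e1 + e2), and the p-value is its upper tail. This is a sketch under that assumption, with illustrative names and inputs; it is not taken from the R’MES source.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def compare_exceptionality(n1: int, n2: int, e1: float, e2: float) -> float:
    """P-value of H0 'equally exceptional' vs H1 'more exceptional in sequence 1'."""
    share = e1 / (e1 + e2)  # expected share of occurrences in sequence 1 under H0
    return binomial_upper_tail(n1, n1 + n2, share)


# Illustrative inputs: 7 occurrences vs 0, with expected counts 0.21 and 0.43.
print(compare_exceptionality(7, 0, 0.21, 0.43))
```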


RMESPlot interface


Prediction and identification of functional DNA motifs


Chi motifs in bacterial genomes

• Motif involved in the repair of double-strand DNA breaks.
  Chi needs to be frequent along bacterial genomes.

• Chi motifs have been identified for only a few bacterial species. They are not conserved across species.

• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.

• Moreover, Chi activity is strongly orientation-dependent (direction of DNA replication).
  It is present preferentially on the leading strands (high skew).


E. coli as a learning case

• 8-letter word GCTGGTGG
• 762 occurrences on the leading strands (ℓ = 4.6 × 10^6)
• Among the most over-represented 8-letter words (whatever the model Mm) ⇒ its frequency cannot be explained by the genome composition.
• Its rank improves if one analyzes only the backbone genome (genome conserved in several strains of the species).
• Its skew equals 3.20 (p-value of 3.3 × 10^-11).

The skew of a motif w is defined by N^obs(w)/N^obs(w̄), where w̄ is the reverse complement of w.
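The skew definition above translates directly into a few lines. The helper names and the toy sequence are illustrative; the sketch counts on a single given strand and returns infinity when the reverse complement never occurs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Reverse complement of a DNA word (A<->T, C<->G, read backwards)."""
    return word.upper().translate(COMPLEMENT)[::-1]


def count_overlapping(sequence: str, word: str) -> int:
    """Number of occurrences of word in sequence, overlaps included."""
    return sum(sequence.startswith(word, i) for i in range(len(sequence) - len(word) + 1))


def skew(sequence: str, word: str) -> float:
    """Ratio of the count of word to the count of its reverse complement."""
    forward = count_overlapping(sequence, word)
    reverse = count_overlapping(sequence, reverse_complement(word))
    if reverse == 0:
        return float("inf") if forward else float("nan")
    return forward / reverse


toy = "GCTGGTGGAA" * 3 + "CCACCAGC"
print(reverse_complement("GCTGGTGG"), skew(toy, "GCTGGTGG"))  # → CCACCAGC 3.0
```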


Identification of Chi motif in S. aureus

Halpern et al. (07)

• Analysis of the S. aureus backbone (ℓ = 2.44 × 10^6).
• 8-letter words: none of the most over-represented and skewed motifs were frequent enough.
• 7-letter words:
  A = gaaaatg (1067), B = ggattag (266), C = gaagcgg (272), D = gaattag (614)


Organization of the Ter macrodomain in E. coli

The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)]. How is such a structure ensured?

Bacillus subtilis as a learning case:

• In B. subtilis, the parS motif is responsible for the structuring of the chromosomal domain surrounding the origin of replication [Lin and Grossman (98)].

• The parS motif is 16 nt long; its sequence is partially degenerate and rather palindromic.
  [degenerate motif drawn with stacked alternative bases; fragments as extracted: Tc GTTt Ac ACt ACGTGAt AACA]

• It is recognized by Spo0J in both directions.
• One of its 11-mers is the most exceptional 11-mer (w, w̄) in the origin domain.


Identification of matS in E. coli

10 most over-represented 11-mers (w, w̄) of the TER domain (compound Poisson approximation + family option):

word         N1  N2  E1    E2     score1  score2  p-skew  rank R’MES  rank Skew
GACACTGTCAC  7   0   0.21  0.43   5.84    0.39    0.0004  1           7
TGACACTGTCA  7   2   0.28  0.53   5.49    1.29    0.0101  2           40
GACAGTGTCAC  6   0   0.20  0.43   5.24    0.38    0.0011  3           18
GACGTTGTCAC  7   3   0.35  1.30   5.22    1.06    0.0012  4           19
GACAACGTCAC  7   3   0.37  1.49   5.15    0.88    0.0008  5           13
GACCCGAACGA  5   1   0.12  0.47   5.09    0.31    0.0017  6           23
ATAGGGTAGAT  4   1   0.06  0.26   4.94    0.73    0.0041  7           30
TAGTTACAACA  5   1   0.16  0.54   4.79    0.21    0.0032  8           29
ATAAACGGCCC  6   3   0.31  1.68   4.76    0.71    0.0008  9           14
TGACAACGTCA  7   5   0.51  1.786  4.72    1.81    0.0073  10          34


Identification of matS in E. coli

  GACACTGTCAC
 TGACACTGTCA
  GACAGTGTCAC
  GACGTTGTCAC
  GACAACGTCAC
 TGACAACGTCA
GTGACRNYGTCAC

matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which structures the Ter domain [Mercier et al. (08)].
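Whether an 11-mer is compatible with the 13-nt consensus can be checked by sliding it along the consensus and comparing letter by letter with the standard degenerate codes (R = A or G, Y = C or T, N = any). The function names are illustrative; this only verifies compatibility and does not reproduce how the consensus was derived.

```python
import re

# Subset of the IUPAC nucleotide codes needed for this consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def iupac_to_regex(pattern: str) -> str:
    """Translate a degenerate pattern into a regular expression."""
    return "".join(IUPAC[letter] for letter in pattern.upper())


def fits_consensus(word: str, consensus: str) -> bool:
    """True if word matches some window of the same length in the consensus."""
    for offset in range(len(consensus) - len(word) + 1):
        window = consensus[offset:offset + len(word)]
        if re.fullmatch(iupac_to_regex(window), word.upper()):
            return True
    return False


candidates = ["GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
              "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA"]
for word in candidates:
    print(word, fits_consensus(word, "GTGACRNYGTCAC"))
```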


Acknowledgment

Françoise Gélis (R’MES 1.0)
Annie Bouvier (R’MES 2.0)
Mark Hoebeke (R’MES 3.0)

http://genome.jouy.inra.fr/ssb/rmes/
