Computational Biology, Part 2 Sequence Motifs Robert F. Murphy Copyright 1996, 1999-2009. All rights reserved

Slides from Chapter 4

Ch04_Motifs_mod.ppt

Describing features using frequency matrices

Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences

Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices

Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
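
The definition can be sketched in a few lines of Python (used here instead of the Matlab toolbox so the example is self-contained). The function name `frequency_matrix`, the DNA alphabet, and the four example sites are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at each position]} for equal-length sequences.

    This is the n-by-m matrix of the definition: one row per alphabet
    character (n rows), one column per feature position (m columns).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")
    matrix = {ch: [0.0] * length for ch in alphabet}
    for i in range(length):
        counts = Counter(s[i] for s in aligned)  # character counts in column i
        for ch in alphabet:
            matrix[ch][i] = counts[ch] / len(aligned)
    return matrix

# Hypothetical aligned instances of a 4-base feature (made-up data).
sites = ["TATA", "TATT", "TACA", "AATA"]
fm = frequency_matrix(sites)
for ch in "ACGT":
    print(ch, " ".join(f"{x:.2f}" for x in fm[ch]))
```

Each column sums to 1 as long as every character in the alignment belongs to the alphabet.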

Frequency matrices (continued)

Three uses of frequency matrices

  Describe a sequence feature

  Calculate probability of occurrence of feature in a random sequence

  Calculate degree of match between a new sequence and a feature

Matlab Demonstration

% read some aligned sequences provided with the bioinformatics toolbox
seqs = fastaread('pf00002.fa');
seqdisp(seqs);
startposition=4; endposition=13;
% P is the frequency matrix (profile), S the list of symbols (one per row of P)
[P,S] = seqprofile(seqs,'limits',[startposition endposition]);
% print a header of column numbers, then one row of frequencies per symbol
disp([' ' sprintf('%2d ',[1:size(P,2)])]);
for i=1:length(S)
    disp([S(i) ' ' sprintf('%4.3f ',P(i,:))])
end
% draw the sequence logo for the same range of positions
seqlogo(seqs,'startat',startposition,'endat',endposition,'alphabet','aa');

Frequency matrix

Logo Example

Logos for displaying sequence motifs

http://www.ccrnp.ncifcrf.gov/~toms/sequencelogo.html

Free logo maker at http://weblogo.berkeley.edu/

Frequency Matrices, PSSMs, and Profiles

A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores

PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Methods for converting frequency matrices to PSSMs

Using log ratio of observed to expected

$$\mathrm{score}(j,i) = \log \frac{m(j,i)}{f(j)}$$

where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)

Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]
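
The log ratio of observed to expected can be sketched as follows. This is a minimal Python illustration, assuming a natural logarithm (the slide does not state the base), a uniform background f(j), and a made-up two-position frequency matrix; the name `log_odds_pssm` is hypothetical. It also assumes every m(j,i) is greater than zero.

```python
import math

def log_odds_pssm(freq, background):
    """score(j, i) = log(m(j, i) / f(j)); assumes every m(j, i) > 0."""
    return {
        j: [math.log(m_ji / background[j]) for m_ji in row]
        for j, row in freq.items()
    }

# Made-up frequency matrix m(j, i) for a 2-position feature.
freq = {"A": [0.7, 0.1], "C": [0.1, 0.1], "G": [0.1, 0.1], "T": [0.1, 0.7]}
# Assumed uniform expected frequencies f(j).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

pssm = log_odds_pssm(freq, background)
for j, row in pssm.items():
    print(j, " ".join(f"{s:+.3f}" for s in row))
```

Characters seen more often than expected at a position get positive scores; characters seen less often get negative scores.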

Pseudo-counts

How do we get a score for a position with zero counts for a particular character? Can’t take log(0).

Solution: add a small number to all positions with zero frequency
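
One way to sketch this in Python is shown below. The value of `epsilon`, the function name `add_pseudocounts`, and the rescaling of each column back to a sum of 1 are assumptions made for illustration; the slide only says to add a small number where the frequency is zero. A common variant instead adds a pseudo-count to every cell before normalizing.

```python
def add_pseudocounts(freq, epsilon=0.01):
    """Replace zero frequencies with epsilon, then rescale each column to sum to 1."""
    chars = list(freq)
    length = len(freq[chars[0]])
    adjusted = {j: [x if x > 0 else epsilon for x in freq[j]] for j in chars}
    for i in range(length):
        total = sum(adjusted[j][i] for j in chars)
        for j in chars:
            adjusted[j][i] /= total  # rescaling is an assumption, not from the slide
    return adjusted

# Made-up frequency matrix with several zero entries.
freq = {"A": [1.0, 0.5], "C": [0.0, 0.5], "G": [0.0, 0.0], "T": [0.0, 0.0]}
smoothed = add_pseudocounts(freq)
for j, row in smoothed.items():
    print(j, " ".join(f"{x:.4f}" for x in row))
```

After this step every entry is positive, so the log ratio is defined at every position.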

Finding occurrences of a sequence feature using a Profile

As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches

For each position, we calculate a score by “looking up” the value corresponding to the base at that position
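
The scan can be sketched in Python as below. The sketch assumes the looked-up values are summed across the columns of the profile to give one score per candidate start position (the usual choice for log-based scores); the score matrix and target sequence are made up, and the name `scan` is hypothetical.

```python
def scan(pssm, target):
    """Score every window of the target; returns [(start, score), ...]."""
    width = len(next(iter(pssm.values())))  # number of columns in the profile
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        # Look up the score of each base at its position in the window and sum.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Made-up 3-column score matrix that favours "TAT".
pssm = {
    "A": [-1.0, 1.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "T": [1.0, -1.0, 1.0],
}
hits = scan(pssm, "GGTATC")
best = max(hits, key=lambda h: h[1])
print(hits)
print("best start:", best[0], "score:", best[1])
```

Positions are 0-based here; a target of length L and a profile of width m give L - m + 1 candidate windows.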

Block Diagram for Building a PSSM – Aligned Sequences

[Diagram: Set of Aligned Sequence Features + Expected frequencies of each sequence element → PSSM builder → PSSM]

Block Diagram for Building a PSSM – Unaligned Sequences

[Diagram: Set of unaligned sequences + Expected frequencies of each sequence element + Parameters for aligning (i.e., expected length) → PSSM builder → PSSM]

Block Diagram for Searching with a PSSM

[Diagram: PSSM + Set of Sequences to search + Threshold → PSSM search → Sequences that match above threshold; Positions and scores of matches]
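
The search step takes a PSSM, a set of sequences, and a threshold, and reports the matching sequences with the positions and scores of their matches. A minimal Python sketch follows, assuming window scores are sums of looked-up values and that "above threshold" means strictly greater; the function name `pssm_search`, the score matrix, the sequences, and the threshold are all made up for illustration.

```python
def pssm_search(pssm, sequences, threshold):
    """Return {name: [(position, score), ...]} for sequences with a match above threshold."""
    width = len(next(iter(pssm.values())))
    results = {}
    for name, seq in sequences.items():
        matches = []
        for start in range(len(seq) - width + 1):
            score = sum(pssm[seq[start + i]][i] for i in range(width))
            if score > threshold:
                matches.append((start, score))
        if matches:  # keep only sequences with at least one match
            results[name] = matches
    return results

# Made-up 3-column score matrix that favours "TAT".
pssm = {
    "A": [-1.0, 1.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "T": [1.0, -1.0, 1.0],
}
sequences = {"seq1": "GGTATC", "seq2": "CCCCCC", "seq3": "TATTAT"}
results = pssm_search(pssm, sequences, threshold=2.0)
for name, matches in results.items():
    print(name, matches)
```

Raising the threshold trades sensitivity for specificity: fewer weak matches are reported, at the risk of missing true but imperfect sites.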

Block Diagram for Searching for sequences related to a family with a PSSM

[Diagram, step 1: Set of Aligned Sequence Features + Expected frequencies of each sequence element → PSSM builder → PSSM]

[Diagram, step 2: PSSM + Set of Sequences to search + Threshold → PSSM search → Sequences that match above threshold; Positions and scores of matches]

Consensus sequences vs. PSSMs

Should I use a consensus sequence or a frequency matrix to describe my site?

  If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence

    Example: Restriction enzyme recognition sites

  If some allowed characters are "better" than others, use PSSM

    Example: Promoter sequences
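
For the consensus case, matching with IUB codes can be sketched in Python by translating each code into the set of bases it allows. The example uses the HincII recognition site GTYRAC (Y = C or T, R = A or G); the function name `find_sites` and the target sequence are illustrative assumptions. Every allowed base counts as an equally good match, which is exactly the property that distinguishes a consensus from a PSSM.

```python
import re

# Standard IUB nucleotide codes mapped to the bases they allow.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_sites(consensus, target):
    """Return 0-based start positions where the target matches the IUB consensus."""
    body = "".join(f"[{IUB[code]}]" for code in consensus)
    # A lookahead lets overlapping sites be reported as well.
    return [m.start() for m in re.finditer(f"(?={body})", target)]

# Made-up target containing two HincII sites: GTCGAC and GTTAAC.
target = "AAGTCGACTTGTTAACGG"
positions = find_sites("GTYRAC", target)
print(positions)
```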

Consensus sequences vs. frequency matrices

Advantages of consensus sequences: smaller description, quicker comparison

Disadvantage: lose quantitative information on preferences at certain locations

Reading for next class

Jones/Pevzner Ch 6 through section 6.9 (p. 185)

Read paper by Needleman and Wunsch on web site

(recommended) Durbin et al, pp 17-32