Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright 1996, 1999-2006. All rights reserved

Page 1: Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, 1999-2006. All rights reserved

Computational Biology, Part 2

Sequence Comparison with Dot Matrices

Robert F. Murphy

Copyright 1996, 1999-2006.

All rights reserved.

Page 2

Sequence Alignment

Definition: a procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

  • Pair-wise alignment: compare two sequences
  • Multiple sequence alignment: compare more than two sequences

Page 3

Example sequence alignment

Task: align “abcdef” with “abdgf”

  • Write second sequence below the first

abcdef
abdgf

  • Move sequences to give maximum match between them
  • Show characters that match using vertical bar

Page 4

Example sequence alignment

abcdef
||
abdgf

  • Insert gap between b and d on lower sequence to allow d and f to align

Page 5

Example sequence alignment

abcdef
|| | |
ab-dgf

Page 6

Example sequence alignment

abcdef
|| | |
ab-dgf

  • Note e and g don’t match
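The match line in the example above can be generated mechanically once the gap has been placed. The following is a minimal sketch, assuming the two sequences are already aligned to equal length and that `-` marks a gap; the function name `match_line` is illustrative and not part of the original slides.

```python
def match_line(top: str, bottom: str, gap: str = "-") -> str:
    """Return a line with '|' under each column where both sequences agree.

    Assumes the sequences are already aligned (same length, gaps inserted).
    A gap character never counts as a match.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


top, bottom = "abcdef", "ab-dgf"
print(top)
print(match_line(top, bottom))
print(bottom)
```

This only displays an alignment that has already been chosen; finding where to put the gap is the job of the alignment methods introduced later in the deck.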

Page 7

Matching Similarity vs. Identity

  • Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
  • More on how to define similarity later

Page 8

Global vs. Local Alignment

We distinguish

  • Global alignment algorithms, which optimize overall alignment between two sequences
  • Local alignment algorithms, which seek only relatively conserved pieces of sequence
      • Alignment stops at the ends of regions of strong similarity
      • Favors finding conserved patterns in otherwise different pairs of sequences

Page 9

Global vs. Local Alignment

Global

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local

--------GKG--------
        |||
--------GKG--------

Page 10

Global vs. Local Alignment

Global

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local

-------TGKG--------
        |||
-------AGKG--------

Page 11

Why do sequence alignments?

  • To find whether two (or more) genes or proteins are evolutionarily related to each other
  • To find structurally or functionally similar regions within proteins

Page 12

Origin of similar genes

  • Similar genes arise by gene duplication
      • Copy of a gene inserted next to the original
      • Two copies mutate independently
      • Each can take on separate functions
      • All or part can be transferred from one part of genome to another

Page 13

Methods for Pairwise Alignment

  • Dot matrix analysis
  • Dynamic Programming
  • Word or k-tuple methods (FASTA and BLAST)

Page 14

Sequence comparison with dot matrices

Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Page 15

Sequence comparison with dot matrices

Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
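The basic method can be sketched in a few lines. This is an illustrative sketch rather than the code behind the course demonstrations: the names `dot_matrix` and `render` are assumptions, and the text rendering uses `*` for a dot and `.` for an empty cell.

```python
def dot_matrix(top: str, left: str) -> list[list[bool]]:
    """Build a len(left) x len(top) grid; True marks a dot (identical characters)."""
    return [[row_char == col_char for col_char in top] for row_char in left]


def render(top: str, left: str, matrix: list[list[bool]]) -> str:
    """Draw the grid as text: '*' is a dot, '.' is an empty cell."""
    lines = ["  " + top]
    for row_char, row in zip(left, matrix):
        lines.append(row_char + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


top, left = "abcdef", "abdgf"
print(render(top, left, dot_matrix(top, left)))
```

For the earlier example sequences, the dots for a and b fall on one diagonal, while the dots for d and f fall on the neighbouring diagonal, which is how the single-character gap shows up in the plot.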

Page 16

Examples for protein sequences

  • (Demonstration A6, Sequence 1 vs. 2)
  • (Demonstration A6, Sequence 2 vs. 3)

Page 17

Interpretation of dot matrices

  • Regions of similarity appear as diagonal runs of dots
  • Reverse diagonals (perpendicular to diagonal) indicate inversions
  • Reverse diagonals crossing diagonals (Xs) indicate palindromes
      • (Demonstration A6, Sequence 4 vs. 4)

Page 18

Interpretation of dot matrices

  • Can link or "join" separate diagonals to form alignment with "gaps"
      • Each amino acid or base can only be used once
          • Can't trace vertically or horizontally
          • Can't double back
      • A gap is introduced by each vertical or horizontal skip

Page 19

Uses for dot matrices

  • Can use dot matrices to align two proteins or two nucleic acid sequences
  • Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
      • Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
      • (Demonstration A6, Sequence 5 vs. 6)

Page 20

Uses for dot matrices

  • Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
  • Excellent approach for finding sequence transpositions
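The RNA self base-pairing comparison can be sketched as follows. This is a toy illustration under stated assumptions: the sequence `GGGAAAUCCC` is a made-up hairpin, only Watson-Crick pairs (A-U, G-C) are considered, and the helper names are hypothetical.

```python
# Watson-Crick complements only; G-U wobble pairs are ignored in this sketch.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Complement every base, then reverse the result."""
    return rna.translate(COMPLEMENT)[::-1]


def dot_matrix(top: str, left: str) -> list[list[bool]]:
    """True at [i][j] when left[i] and top[j] are the same character."""
    return [[row_char == col_char for col_char in top] for row_char in left]


rna = "GGGAAAUCCC"  # hypothetical toy hairpin, not taken from the slides
rc = reverse_complement(rna)
matrix = dot_matrix(rna, rc)

# A dot at [i][j] means base j of the RNA could pair with base len(rna)-1-i,
# so a diagonal run of dots suggests a possible stem.
for row_char, row in zip(rc, matrix):
    print(row_char, "".join("*" if hit else "." for hit in row))
```

Because the second sequence is reversed as well as complemented, antiparallel stems show up as ordinary diagonal runs, so the interpretation rules for diagonals carry over unchanged.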

Page 21

Filtering to remove “noise”

  • A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single matching A)
  • Solution: use a window and a threshold
      • compare character by character within a window (have to choose window size)
      • require certain fraction of matches within window in order to display it with a “dot”
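The window-and-threshold filter can be sketched as below. Assumptions made for illustration: the window is anchored at its first position rather than centered, the threshold is expressed as a minimum match count (the "stringency" used later in the deck) rather than a fraction, and the function name is hypothetical.

```python
def windowed_dot_matrix(top: str, left: str, window: int, stringency: int) -> list[list[bool]]:
    """Place a dot at [i][j] when at least `stringency` of the `window`
    diagonal character pairs starting at (i, j) are identical."""
    rows = max(len(left) - window + 1, 0)
    cols = max(len(top) - window + 1, 0)
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count matches along the diagonal that starts at (i, j).
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


top, left = "abcdef", "abdgf"
unfiltered = windowed_dot_matrix(top, left, window=1, stringency=1)
filtered = windowed_dot_matrix(top, left, window=2, stringency=2)
print(sum(map(sum, unfiltered)), "dots without filtering")
print(sum(map(sum, filtered)), "dots with window 2, stringency 2")
```

With a window of 1 and a stringency of 1 this reduces to the basic dot matrix; larger windows suppress isolated single-character matches.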

Page 22

Example spreadsheet with window

  • (Demonstration A7)

Page 23

How do we choose a window size?

  • Window size changes with goal of analysis
      • size of average exon
      • size of average protein structural element
      • size of gene promoter
      • size of enzyme active site

Page 24

How do we choose a threshold value?

  • Threshold based on statistics
      • using shuffled actual sequence
          • find average ($m$) and s.d. ($\sigma$) of match scores of shuffled sequence
          • convert original (unshuffled) scores ($x$) to Z scores
              • $Z = (x - m)/\sigma$
          • use threshold Z of 3 to 6
      • using analysis of other sets of sequences
          • provides “objective” standard of significance
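The shuffling procedure can be sketched as follows. This is a rough illustration under several assumptions not stated on the slide: only one of the two sequences is shuffled, the match score is the count of identities in a window, the population standard deviation is used, and the number of shuffles (`rounds`), the seed, and the toy DNA strings are arbitrary choices.

```python
import random
import statistics


def window_scores(top: str, left: str, window: int) -> list[int]:
    """Identity counts for every window position in the dot matrix."""
    return [
        sum(left[i + k] == top[j + k] for k in range(window))
        for i in range(len(left) - window + 1)
        for j in range(len(top) - window + 1)
    ]


def shuffled_stats(top: str, left: str, window: int, rounds: int = 20, seed: int = 0):
    """Mean m and s.d. of window scores after shuffling `left` (assumed setup)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = list(left)
        rng.shuffle(shuffled)  # keeps composition, destroys order
        scores.extend(window_scores(top, "".join(shuffled), window))
    return statistics.mean(scores), statistics.pstdev(scores)


def z_score(x: float, m: float, sd: float) -> float:
    """Z = (x - m) / sd, as on the slide."""
    return (x - m) / sd


top, left = "ACGTACGTACGT", "ACGTTGCAACGT"  # toy sequences for illustration
m, sd = shuffled_stats(top, left, window=4)
z_values = [z_score(x, m, sd) for x in window_scores(top, left, window=4)]
print(f"m = {m:.2f}, s.d. = {sd:.2f}, max Z = {max(z_values):.2f}")
```

A dot would then be drawn only where the Z score exceeds the chosen threshold, for example somewhere in the 3 to 6 range suggested on the slide.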

Page 25

Dot matrix analysis with DNA Strider (Mount, Fig 3.4)

  • Get phage λ cI and phage P22 c2 repressor sequences from GenBank (X00166 and V01153 respectively)
  • Use DNA Strider 1.4 (contact TA to get a copy)
  • Use window size of 11 and stringency of 7

Page 26

Dot matrix (Mount Fig 3.4)

  • Note set of diagonals in lower right that do not line up due to insertion near 475 on cI

[Figure: dot matrix plot; one axis labeled 100 to 600, the other 100 to 700]

Page 27

Dot matrix analysis with DNA Strider (Mount, Fig 3.6)

  • Get human LDL receptor protein sequence from GenBank (P01130)
  • Use weighting “Identity”
  • Use window size of 1 and stringency of 1
  • Use window size of 23 and stringency of 7

Page 28

Dot matrix (Mount Fig 3.6), W=1 S=1

  • Note set of stacked diagonals in upper left

[Figure: dot matrix plot; both axes labeled 100 to 800]

Page 29

Dot matrix (Mount Fig 3.6), W=23 S=7

  • Note set of stacked diagonals in upper left

[Figure: dot matrix plot; both axes labeled 100 to 800]

Page 30

Reading for next class

  • Mount, Chapter 3 through page 93
  • Look over paper by Needleman and Wunsch on web site
  • (03-510/710) Durbin et al, pp 17-32