Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Whole genome alignments

Genome 559: Introduction to Statistical and Computational Genomics

Prof. James H. Thomas

Review

• What a score matrix is and how to calculate and use one.

• Why an affine gap penalty is desirable.

• How to align sequences using dynamic programming.

• How to calculate and interpret p-values and E-values for pair alignments and database searches.

Whole genome alignments

Why?

[Figure: UCSC Browser track of a whole genome alignment. Labels: known gap in assembly; averaged conservation for 17 genomes; individual genome alignments (darker = higher scoring); alignment discontinuity (e.g. translocation break point); questionable alignment segment; sequence present but unalignable.]

GQSQVGQGPPCPHHRCTTCCPDGCHFEPQVCMCDWESCCEEG
GQSEVRQGPQCPYHKCIKCQPDGCHYEPTVCICREKPCDEKG

How are genome-wide alignments made?

• mouse and human genomes are each about 3 x 10^9 nucleotides.

• how many calculations would a dynamic programming alignment have to make?

• at a minimum - 3 integer additions and 3 inequality tests for each DP matrix position

(by the way, there are other problems too, including assuming colinearity)
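To put numbers on the question above, here is a rough back-of-the-envelope sketch. It takes the slide's figures (3 x 10^9 nucleotides per genome, 3 additions plus 3 comparisons per matrix position); the machine speed of 10^9 operations per second is purely an assumption for illustration.

```python
def dp_cost(len_a, len_b, ops_per_cell=6):
    """Return (matrix cells, elementary operations) for a full DP alignment."""
    cells = len_a * len_b
    return cells, cells * ops_per_cell

genome = 3 * 10**9                   # nucleotides per genome, from the slide
cells, ops = dp_cost(genome, genome)

ASSUMED_OPS_PER_SECOND = 10**9       # assumption: one operation per nanosecond
years = ops / ASSUMED_OPS_PER_SECOND / (365 * 24 * 3600)

print(f"{cells:.1e} cells, {ops:.1e} operations, about {years:,.0f} years")
```

Under these assumptions the full matrix has 9 x 10^18 positions and the run time is on the order of 1,700 years, before even considering the memory needed to hold the matrix.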

Making large searches faster

• The most common method is the BLAST search (Basic Local Alignment Search Tool). Only the initial step is substantially different from dynamic programming alignment.

• The search sequence is broken into small words (usually 3 residues long for proteins); there are 20 * 20 * 20 = 8,000 possible protein words. These act as seeds for searches.

• The target dataset is pre-indexed to indicate the positions in the database sequences that match each search word above some score threshold (using a global score matrix such as BLOSUM62).
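A minimal sketch of the pre-indexing idea: every overlapping 3-residue word in the database is recorded with its position, so a search word can be looked up directly instead of being compared with every position. This is a simplification that indexes exact words only and omits the score-threshold step; the function name and the two database sequences are made up for illustration.

```python
from collections import defaultdict

def build_word_index(database, word_len=3):
    """Map each word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_len + 1):
            index[seq[pos:pos + word_len]].append((name, pos))
    return index

# Hypothetical toy database, not real sequences.
database = {"seqA": "MKWIYQKA", "seqB": "AAWVHLL"}
index = build_word_index(database)

print(index["WIY"])   # positions of the word WIY in the database
print(index["WVH"])   # positions of the word WVH in the database
```

The index is built once per database, so its cost is shared by every later search.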

BLAST searches (cont.)

• For example, the search sequence word “WVH” might score above threshold with these indexed sequences:

Indexed word   Score
WVH            23
WIH            22
WVY            17
WIY            16

• Target sequences around each indexed word hit are retrieved and the initial match is extended in both directions:

...VFEWVHLLP...   your sequence
      WIY         database (many sites)
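The word scores in this example can be reproduced by summing substitution scores position by position. The sketch below hard-codes only the five BLOSUM62 entries needed for these four words (W/W = 11, V/V = 4, V/I = 3, H/H = 8, H/Y = 2); a real search would load the full matrix. The function names are illustrative.

```python
# Only the BLOSUM62 entries needed for this example; not the full matrix.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("V", "V"): 4,
    ("V", "I"): 3,
    ("H", "H"): 8,
    ("H", "Y"): 2,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, indexed_word):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(q, d) for q, d in zip(query_word, indexed_word))

for word in ["WVH", "WIH", "WVY", "WIY"]:
    print(word, word_score("WVH", word))
```

This prints 23, 22, 17 and 16, matching the scores listed for WVH, WIH, WVY and WIY.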

Schematic of indexed matches

Result – instead of aligning these 3 amino acids to everything, they are aligned only with the tiny fraction of sequence regions that are good candidates for a valid alignment.

(Note: BLAST actually looks for two such matches close to each other.)

Extension and scoring

Your sequence: ...QSVFEWVHLLPGA...

Database match   Match score   Total score
..WIY..               16            16
..WIYQ..              -3            13
..WIYQK..             -2            11
..WIYQKA..            -1            10

[mention gap variant]

Extension termination

• Extension is continued until the cumulative score drops below some threshold (usually 0).

• This permits the match to cross a region of marginal similarity or frank mismatching (e.g. a small intron in tblastn) if it flanks a region of high similarity.

• Extensions whose maximal cumulative score is above some threshold are kept for reporting to user.

• For web interfaces, various formatting, links, and overviews are added and reported according to user settings (it is also fairly easy to download and run your own BLAST).
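The extension rule described above can be sketched as a running total: add the score of each newly aligned pair, stop once the cumulative score falls below the threshold, and remember where the total was highest. This is a one-directional, ungapped illustration of the rule as the slide states it, not BLAST's actual implementation; the function name is hypothetical.

```python
def extend_right(seed_score, step_scores, stop_below=0):
    """Extend a seed to the right.

    seed_score  -- score of the initial word match
    step_scores -- score of each additional aligned pair, in order
    stop_below  -- stop once the cumulative score drops below this value
    Returns (best cumulative score, number of added positions at that best).
    """
    total = seed_score
    best_score, best_len = seed_score, 0
    for i, score in enumerate(step_scores, start=1):
        total += score
        if total < stop_below:
            break
        if total > best_score:
            best_score, best_len = total, i
    return best_score, best_len

# Seed score 16 followed by pair scores -3, -2, -1 (totals 13, 11, 10).
print(extend_right(16, [-3, -2, -1]))
# A dip followed by a region of high similarity is crossed, then extension stops.
print(extend_right(16, [-3, -2, 6, 4, -30, 50]))
```

The first call returns (16, 0) because no extension improves on the seed. The second returns (21, 4): the extension survives two negative positions, reaches a higher total, and ends when the cumulative score goes below zero.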

Key to speed: word matching and prior indexing

• Though gapped BLAST local alignment is slow (like dynamic programming), only a very small part of the total search space is analyzed.

• Because the positions of all database word matches are indexed and stored prior to the BLAST search, the relevant parts of search space are reached quickly.

• The tradeoff is in accuracy and certainty: occasionally matches will be missed (when they are distant enough and dispersed enough that no local word pairs match well enough).

Dynamic programming after BLAST matching

[Figure: schematic of genome A and genome B. Labels: BLAST matches; DP alignment region (M x N manageable).]
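One way to read this slide is that dynamic programming is applied only to the regions left between BLAST matches, so each DP matrix is M x N for a small M and N. The sketch below compares the number of matrix positions with and without anchors. It assumes the matches are colinear and sorted, uses half-open coordinates, and the anchor positions and sequence lengths are invented for illustration.

```python
def gap_regions(anchors, len_a, len_b):
    """Sizes (M, N) of the unaligned regions before, between, and after anchors.

    anchors -- sorted, colinear matches as (start_a, end_a, start_b, end_b)
    """
    regions = []
    prev_a, prev_b = 0, 0
    for start_a, end_a, start_b, end_b in anchors:
        regions.append((start_a - prev_a, start_b - prev_b))
        prev_a, prev_b = end_a, end_b
    regions.append((len_a - prev_a, len_b - prev_b))
    return regions

def dp_cells(regions):
    """Total DP matrix positions needed to align every region separately."""
    return sum(m * n for m, n in regions)

# Hypothetical anchors on two 10,000-residue sequences.
anchors = [(1000, 1500, 1100, 1600),
           (4000, 4800, 4200, 5000),
           (9000, 9500, 9100, 9600)]
regions = gap_regions(anchors, 10_000, 10_000)

print(regions)
print(dp_cells(regions), "cells with anchors vs", 10_000 * 10_000, "without")
```

With these made-up anchors the anchored approach fills about a quarter of the full matrix; denser anchors shrink the regions and the cost further.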


Defining what a “tree” means

rooted tree (all real trees are rooted):

[Figure: rooted tree drawn along a time axis. Labels: root; ancestral sequence; branches; branch points; sequences (leaves or tips). Divergence time is the sum of (horizontal) branch lengths.]

unrooted tree (used when the root isn’t known):

[Figure: unrooted tree. Time vaguely radiates out from somewhere near the center.]

A tree has topology and distances

Are these different trees?

The number of tree topologies grows extremely fast

Leaves   Branches   Internal nodes   Topologies   Insertion points for the next leaf
3        3          1                1            3
4        5          2                3 (x3)       5
5        7          3                15 (x5)      7

In general, an unrooted tree with N leaves has:

2N – 3 branches
N – 2 internal nodes
~O(N!) topologies: 3 × 5 × 7 × ... × (2N – 5)
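The product follows from the insertion argument: a tree with k – 1 leaves has 2(k – 1) – 3 branches, and the k-th leaf can be attached to any one of them. A short sketch of that count (the function name is illustrative):

```python
def unrooted_topologies(n_leaves):
    """Number of unrooted binary tree topologies for n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1  # a single topology for 3 leaves
    for k in range(4, n_leaves + 1):
        # The k-th leaf can be inserted on any branch of a (k - 1)-leaf tree.
        count *= 2 * (k - 1) - 3
    return count

for n in (3, 4, 5, 10):
    print(n, unrooted_topologies(n))
```

For 3, 4 and 5 leaves this gives 1, 3 and 15, as on the slide. Python integers are arbitrary precision, so the same function works for much larger N.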

There are many rooted trees for each unrooted tree

For each unrooted tree, there are 2N – 3 times as many rooted trees, where N is the number of leaves (the root can be placed on any of the 2N – 3 branches).

20 leaves - 8,200,794,532,637,891,559,375 rooted topologies (221,643,095,476,699,771,875 unrooted topologies x 37)
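For small N the rooted count can be checked by brute force. The sketch below builds every rooted tree by attaching each new leaf to every branch of every existing tree, including the branch above the root. Trees are represented as nested 2-tuples of leaf names, which is an illustrative choice rather than a standard format.

```python
def insert_leaf(tree, leaf):
    """Yield every rooted tree made by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)                    # attach above the current root
    if isinstance(tree, tuple):           # internal node: recurse into both children
        left, right = tree
        for sub in insert_leaf(left, leaf):
            yield (sub, right)
        for sub in insert_leaf(right, leaf):
            yield (left, sub)

def rooted_trees(leaves):
    """Enumerate all rooted binary topologies on the given leaf names."""
    trees = [leaves[0]]
    for leaf in leaves[1:]:
        trees = [new for tree in trees for new in insert_leaf(tree, leaf)]
    return trees

for names in ("ABC", "ABCD", "ABCDE"):
    print(len(names), "leaves:", len(rooted_trees(list(names))), "rooted trees")
```

This gives 3, 15 and 105 rooted trees for 3, 4 and 5 leaves, which is the unrooted count (1, 3, 15) multiplied by 2N – 3 (3, 5, 7). Enumeration is only practical for small N, since the list grows as fast as the count itself.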