21
CMSC423: Bioinformatic Algorithms, Databases and Tools Lecture 15 Genome assembly

CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423: Bioinformatic Algorithms, Databases and Tools

Lecture 15

Genome assembly

Page 2: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 2

Admin• Project questions?

Page 3: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 3

Questions/answers• Why do you need a multiple alignment for phylogeny?• What is the running time of the neighbor-joining

algorithm, given k sequences of length L?• What is the parsimony score of the following tree, and

what are the labels at internal nodes?

C T AG T T

ACTG2112

ACTG2211

ACTG2322

ACTG4434

ACTG5535

Page 4: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 4

Reading assignment• http://www.cbcb.umd.edu/research/assembly_primer.shtml• Chapter 4.5 – coverage statistics• Chapter 8 – genome assembly

Page 5: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 5

Shotgun sequencing

shearing

sequencing

assemblyoriginal DNA (hopefully)

Page 6: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 6

Overview of terms

Assembly

Scaffolding

Page 7: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 7

Shortest common superstring problem

Given a set of strings, Σ=(s1, ..., sn), determine the shortest string Ssuch that every si is a sub-string of S. NP-hardapproximations: 4, 3, 2.89, ...

Greedy algorithm (4-approximation)

phrap, TIGR Assembler, CAP

...ACAGGACTGCACAGATTGATAG ACTGCACAGATTGATAGCTGA...

Page 8: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 8

Greedy algorithm detailsCompute all pairwise overlaps*Pick best (e.g. in terms of alignment score) overlapJoin corresponding readsRepeat from * until no more joins possible

• How do you compute an overlap alignment?• Hint: modify Smith-Waterman dynamic programming

algorithm

Page 9: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 9

Repeats (where greedy fails)

AAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAA AAAAAA

AAAAAA AAAAAAAAAAAA AAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Page 10: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 10

Impact of randomness – non-uniform coverage

1 2 3 4 5 6 C

over

age

Contig

Reads

Imagine raindrops on a sidewalk

Page 11: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 11

Lander-Waterman statistics

L = read lengthT = minimum overlapG = genome sizeN = number of readsc = coverage (NL / G)σ = 1 – T/L

E(#islands) = Ne-cσ E(island size) = L(ecσ – 1) / c + 1 – σcontig = island with 2 or more reads

See chapter 4.5

Page 12: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 12

All pairs alignment• Needed by the assembler• Try all pairs – must consider ~ n2 pairs• Smarter solution: only n x coverage (e.g. 8) pairs are

possible– Build a table of k-mers contained in sequences (single pass

through the genome)– Generate the pairs from k-mer table (single pass through k-

mer table)

k-mer

A

B

C

D H I

F

G E

Page 13: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 13

Additional pairwise-alignment details• 4 types of overlaps• Often – assume first read is “forward”

• Representing the alignment

• Why not store length of overlap?

Normal

Innie

Outie

Anti-normal

A-hang B-hang

Page 14: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 14

Overlap-layout-consensus

Main entity: readRelationship between reads: overlap

12

3

45

6

78

9

1 2 3 4 5 6 7 8 9

1 2 3

1 2 3

1 2 3 12

3

1 3

2

13

2

ACCTGAACCTGAAGCTGAACCAGA

Page 15: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 15

Paths through graphs and assembly• Hamiltonian circuit: visit each node (city) exactly once,

returning to the start

A

B D C

E

H G

I

F

A

B

C

D H I

F

G E

Genome

Page 16: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 16

Sequencing by hybridization

AAAAAAACAAAGAAATAACAAACGAACTAAGA...

probes - all possible k-mers

AACAGTAGCTAGATGAACA TAGC AGAT ACAG AGCT GATG CAGT GCTA AGTA CTAG GTAG TAGA

Page 17: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 17

Assembling SBH data

Main entity: oligomer (overlap)Relationship between oligomers: adjacency

ACCTGATGCCAATTGCACT...

CTGAT follows CCTGA (they share 4 nucleotides: CTGA)

Problem: given all the k-mers, find the original string

In assembly: fake the SBH experiment - break the reads into k-mers

Page 18: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 18

Eulerian circuit

• Eulerian circuit: visit each edge (bridge) exactly once and come back to the start

ACCTAGATTGAGGTCG

CCTAGATTGAGGTCGACCTAGATTGAGGTC

Page 19: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 19

deBruijn graph• Nodes – set of k-mers obtained from the reads• Edges – link k-mers that overlap by k-1 lettersACCAGTGCA

CCAGTGCAT

• This formulation particularly useful for very short reads• Solution – Eulerian path through the graph• Note – multiple Eulerian paths possible (exponential

number) due to repeats

Page 20: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 20

deBruijn graph of Mycoplasma genitalium

Page 21: CMSC423: Bioinformatic Algorithms, Databases and Tools ... · • What is the running time of the neighbor-joining algorithm, given k sequences of length L? • What is the parsimony

CMSC423 Fall 2008 21

Read-length vs. genome complexity