Transcript
Page 1: Genome Sequencing Algorithms - Basavaraj Talawarbt.nitk.ac.in/c/14a/co457/notes/GenomeSequencing0.pdf · Genome Sequencing Algorithms William Hamilton (1805 – 1865) Leonhard Euler

Genome Sequencing Algorithms

William Hamilton (1805 – 1865)

Leonhard Euler (1707 – 1783)

Nicolaas Govert de Bruijn (1918 – 2012)

Phillip Compeau and Pavel Pevzner

Bioinformatics Algorithms: an Active Learning Approach

Page 2

The Genome Sequencing Problem

● Determining the order of nucleotides in a genome

● The human genome contains about 3 billion nucleotides

– The genomes of Amoeba dubia and Paris japonica are reported to be far larger (up to roughly 200 times, by some estimates)

● Applications in medicine, agriculture, biotechnology, ...

Page 3

The Genome Sequencing Problem

● No technology can read a genome from one end to the other.

– Only short snippets, called reads (200-300 nucleotides), can be identified.

– The location of a read within the genome is not known.

● Assembling individual reads into the entire genome is akin to solving a giant overlapping puzzle.

● The newspaper explosion analogy: many copies of the same newspaper are blown to pieces, and the text must be recovered from the overlapping scraps.

Page 4

History of Genome Sequencing

● 1977: Walter Gilbert and Frederick Sanger developed independent DNA sequencing methods.

● 1990: Human Genome Project, Francis Collins.

● 1997: Celera Genomics, Craig Venter.

● 2000: Human genome is sequenced.

Page 5

Next Generation Sequencing

● Illumina sequences human genomes for $10,000

● Complete Genomics sequences 100s of genomes per month

● Beijing Genomics Institute (BGI) has 100s of sequencing machines and is the world's biggest sequencing center.

Page 6

Next Generation Sequencing

● Identification of mutations in personal genomes for health diagnosis

● Genome 10K project

Page 7

Genome Assembly – The Computational Problem

[Figure: multiple identical copies of the genome]

ATGTGCATACTAAGCATACTAAGCATACTAAGCATACTAATGTGCATACTAAGCATGCTA

Page 8

Genome Assembly – The Computational Problem

Sequencing machine generates reads

A String Reconstruction Problem

Page 9

The Genome Sequencing Problem

Reconstruct a genome from reads

Input: A collection of strings, Reads

Output: A string, Genome, reconstructed from all the Reads

Page 10

k-mer Composition

Composition3(TAATGCCATGGGATGTT) =

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Lexicographical ordering of k-mers

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
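The k-mer composition above can be computed with a sliding window. This is a minimal Python sketch; the function name composition is an illustrative choice and does not come from the slides.

```python
def composition(text, k):
    """Return every k-mer of text, in order of appearance, repeats included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


kmers = composition("TAATGCCATGGGATGTT", 3)
print(" ".join(kmers))          # k-mers in genome order
print(" ".join(sorted(kmers)))  # lexicographical ordering
```

Sorting discards the genome order, which is exactly the information the reconstruction problem has to recover.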

Page 11

The String Reconstruction Problem

Reconstruct a string from its k-mer composition.

Input: A collection of k-mers

Output: A Genome, such that Compositionk(Genome) is equal to the collection of k-mers

Page 12

Naive String Reconstruction Approach

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

TAA → AAT → ATG → TGT → GTT

No 3-mer begins with TT!
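The dead end on this slide can be reproduced with a greedy extension. This is a rough sketch under stated assumptions: the helper greedy_extend and its pick parameter are hypothetical names, and the rule for choosing among several matching k-mers (last match by default) is picked only to mirror the slide's walk.

```python
def greedy_extend(kmers, start, pick=-1):
    """Extend start one k-mer at a time; pick selects among the matching k-mers."""
    remaining = list(kmers)
    remaining.remove(start)
    text = start
    overlap = len(start) - 1
    while True:
        # k-mers whose prefix equals the current suffix of text
        candidates = [km for km in remaining if km[:overlap] == text[-overlap:]]
        if not candidates:
            return text, remaining  # stuck, or every k-mer has been used
        chosen = candidates[pick]
        remaining.remove(chosen)
        text += chosen[-1]


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
text, unused = greedy_extend(kmers, "TAA")
print(text, len(unused))
```

With pick=0 the walk gets further but still strands one k-mer, so a local choice rule alone does not appear to be enough.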

Page 13

Representing a Genome as a Path

Composition3(TAATGCCATGGGATGTT) =

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

[Figure: the genome drawn as a path whose nodes are its 3-mers, in the order above]

Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)

Page 14

Path turns into a Graph

Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Page 15

Path turns into a Graph

Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)
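The connection rule above can be sketched in Python as follows. The function name overlap_graph is illustrative, and two choices are assumptions of this sketch rather than something the slides state: nodes are identified by their index in the k-mer list (so repeated k-mers stay distinct), and a k-mer is not connected to itself.

```python
from collections import defaultdict


def overlap_graph(kmers):
    """Map index i to the indices j with suffix(kmers[i]) == prefix(kmers[j])."""
    by_prefix = defaultdict(list)
    for j, kmer in enumerate(kmers):
        by_prefix[kmer[:-1]].append(j)
    return {
        i: [j for j in by_prefix.get(kmer[1:], []) if j != i]
        for i, kmer in enumerate(kmers)
    }


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = overlap_graph(kmers)
for i, targets in graph.items():
    print(kmers[i], "->", " ".join(kmers[j] for j in targets))
```

Grouping k-mers by prefix first avoids comparing every pair of k-mers.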

Page 16

Path turns into a Graph

Nodes are ordered lexicographically.

How does one find the genome string?

Page 17

Genome Path in the Graph

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

TAATGCCATGGGATGTT

The genome string is a Hamiltonian path in the graph

Page 18

Hamiltonian Path Problem

Find a Hamiltonian path in the graph

Input: A graph.

Output: A path visiting every node in the graph exactly once.

Hamiltonian Path: A path in a graph that traverses every node exactly once

William R. Hamilton (1805 – 1865)
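For a graph as small as the 15-node example, a Hamiltonian path can be found by exhaustive backtracking. This is a sketch only: hamiltonian_path is a hypothetical helper, its running time grows exponentially in the worst case, and the adjacency construction repeats the suffix/prefix rule with self-connections left out.

```python
def hamiltonian_path(adj):
    """Return a list of nodes visiting every node once, or None (backtracking)."""
    n = len(adj)

    def extend(path, visited):
        if len(path) == n:
            return path
        for nxt in adj[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found is not None:
                    return found
        return None

    for start in adj:
        found = extend([start], {start})
        if found is not None:
            return found
    return None


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
adj = {
    i: [j for j, b in enumerate(kmers) if j != i and a[1:] == b[:-1]]
    for i, a in enumerate(kmers)
}
path = hamiltonian_path(adj)
# Spell the string: first k-mer, then the last letter of each later k-mer.
text = kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
print(text)
```

The spelled string has the input k-mers as its 3-mer composition, but it need not be the original genome, since more than one Hamiltonian path may exist.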

Page 19

A Different Path

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

3-mers as nodes

3-mers as edges

[Figure: the same 15 3-mers drawn first as the nodes of a path, then as the edges of a path]

Page 20

A Different Path

3-mers as edges; the nodes are the prefixes and suffixes of the corresponding 3-mers

[Figure: a path whose edges are the 3-mers in genome order]

Edges: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

Page 21

Glue Identical Nodes

[Figure: the path with edges TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT; the three nodes labeled AT are brought together]

Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

Page 22

Glue Identical Nodes

[Figure: the three AT nodes are glued into a single node; the graph still contains three TG nodes]

Page 23

Glue Identical Nodes

[Figure: gluing continues with the remaining nodes that share a label]

Page 24

De Bruijn Graph of the Genome

[Figure: De Bruijn graph obtained after gluing]

Nodes: TA AA AT TG GC CC CA GG GA GT TT

Edges: TAA AAT ATG ATG ATG TGC GCC CCA CAT TGG GGG GGA GAT TGT GTT

Page 25

De Bruijn Graph of the Genome

[Figure: the De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT]

The genome string is an Eulerian path in the De Bruijn graph

TAATGCCATGGGATGTT

Page 26

Eulerian Path Problem

Leonhard Euler (1707 – 1783)

Find an Eulerian path in a graph

Input: A graph.

Output: A path visiting every edge in the graph exactly once.

Eulerian Path: A path in a graph that traverses every edge exactly once.

Page 27

Hamiltonian Path vs. Eulerian Path

[Figure: the De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT]

Page 28

Hamiltonian Path vs. Eulerian Path

[Figure: the same De Bruijn graph]

Euler presented an efficient solution to the Eulerian path problem. No fast (polynomial-time) algorithm is known for the Hamiltonian Path Problem, which is NP-complete.

Page 29

The Objective

TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Page 30

To Do ...

TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

[Figure: the De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT]

Page 31

Constructing De Bruijn Graph

The composition of the genome is known

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Each 3-mer is drawn as an isolated edge from its prefix to its suffix:

TAA: TA → AA
AAT: AA → AT
ATG: AT → TG
TGC: TG → GC
GCC: GC → CC
CCA: CC → CA
CAT: CA → AT
ATG: AT → TG
TGG: TG → GG
GGG: GG → GG
GGA: GG → GA
GAT: GA → AT
ATG: AT → TG
TGT: TG → GT
GTT: GT → TT

Page 32

Constructing De Bruijn Graph

[Figure: the 15 isolated edges listed on page 31]

Page 33

Constructing De Bruijn Graph

[Figure: animation step; the suffix node of one edge is joined to the prefix node of the next]

Page 34

Constructing De Bruijn Graph

[Figure: animation step; further consecutive edges are joined]

Page 35

Constructing De Bruijn Graph

[Figure: all edges joined into a single path]

Edges: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

Page 36

Glue Identical Nodes

[Figure: the path from page 35; the three nodes labeled AT are brought together]

Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

Page 37

Glue Identical Nodes

[Figure: the three AT nodes are glued into a single node; the graph still contains three TG nodes]

Page 38

Glue Identical Nodes

[Figure: gluing continues with the remaining nodes that share a label]

Page 39

De Bruijn Graph of the Genome Composition

[Figure: the De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT]

De Bruijn Graph(Genome Composition) == De Bruijn Graph(Genome)

Page 40

Constructing the De Bruijn Graph

● De Bruijn graph of a collection of k-mers:

– Represent every k-mer as an edge between its prefix and suffix

– Glue all nodes with identical labels
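The two construction steps above can be sketched in a few lines of Python. The name de_bruijn is illustrative. Keying the adjacency lists by node label is a design choice of this sketch: it makes the gluing step implicit, because equal labels land on the same dictionary key.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """One edge prefix -> suffix per k-mer; nodes with equal labels are shared."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = de_bruijn(kmers)
for node in sorted(graph):
    print(node, "->", " ".join(graph[node]))
```

Repeated k-mers such as ATG show up as parallel edges, so the graph keeps one edge per k-mer in the composition.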

Page 41

Eulerian Cycle Problem

Find an Eulerian cycle in a graph

Input: A graph.

Output: A cycle visiting every edge in the graph exactly once.

Page 42

The Königsberg Bridges

Page 43

Eulerian Graph

A graph is Eulerian if it contains an Eulerian cycle

Every balanced and strongly connected directed graph is Eulerian (balanced: each node has as many incoming as outgoing edges)

[Figure: example graph with labels 1 to 11]
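The condition on this slide can be checked directly. The sketch below assumes a directed graph stored as a dictionary of adjacency lists; the helper names reachable and is_eulerian are hypothetical, and an empty graph is treated as not Eulerian by convention of this sketch.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_eulerian(graph):
    """True if the directed graph is balanced and strongly connected."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    if not nodes:
        return False
    reverse = {v: [] for v in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    # Balanced: out-degree equals in-degree at every node.
    balanced = all(len(graph.get(v, [])) == len(reverse[v]) for v in nodes)
    # Strongly connected: everything is reachable forwards and backwards.
    start = next(iter(nodes))
    connected = reachable(graph, start) == nodes == reachable(reverse, start)
    return balanced and connected


print(is_eulerian({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(is_eulerian({"A": ["B"], "B": ["C"], "C": []}))
```

The first graph is a directed triangle and passes both checks; the second is a simple path, so its end nodes are unbalanced.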

Page 44

Algorithm to Find the Eulerian Cycle

[Figure: worked example; numbered labels show the traversal order as the cycle is extended]

Page 45

Algorithm to Find the Eulerian Cycle

[Figure: worked example continued; labels run from 1 to 11]

Page 46

Algorithm to Find the Eulerian Cycle

EulerianCycle(Graph)
    form a cycle Cycle by randomly walking in Graph
    while there are unexplored edges in Graph
        select a node newStart in Cycle with unexplored edges
        form Cycle' by traversing Cycle (starting at newStart) and then randomly walking
        Cycle ← Cycle'
    return Cycle
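One way to turn the pseudocode above into running code is the stack-based variant commonly attributed to Hierholzer. It is a swapped-in formulation, not a literal transcription of the slide: instead of re-traversing Cycle from newStart, it backs up along a stack until it finds a node with unexplored edges. The function name and the sample graph are illustrative, and the input is assumed to be Eulerian.

```python
def eulerian_cycle(graph):
    """Return a node list tracing every edge once; graph is assumed Eulerian."""
    unused = {node: list(targets) for node, targets in graph.items()}
    start = next(iter(unused))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            # Keep walking along an unexplored edge.
            stack.append(unused[node].pop())
        else:
            # Dead end: this node is finished, emit it and back up.
            cycle.append(stack.pop())
    return cycle[::-1]


graph = {
    0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2],
    5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6],
}
cycle = eulerian_cycle(graph)
print(" -> ".join(map(str, cycle)))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.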

Page 47

From Reads to De Bruijn Graph to Genome

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

[Figure: De Bruijn graph built from the 3-mers above, with nodes TA AA AT TG GC CC CA GG GA GT TT]

TAATGCCATGGGATGTT
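The whole pipeline on this slide (k-mers, De Bruijn graph, Eulerian path, genome) fits in one short sketch. The function name reconstruct is hypothetical, and the code assumes the k-mers admit an Eulerian path whose start node has one more outgoing than incoming edge.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Spell a string whose k-mer composition equals kmers (Eulerian path assumed)."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1

    # Start where out-degree exceeds in-degree by one; otherwise start anywhere.
    start = next(
        (node for node in list(graph) if len(graph[node]) - indegree[node] == 1),
        next(iter(graph)),
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # First node in full, then the last letter of every later node.
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
genome = reconstruct(kmers)
print(genome)
```

The result always has the right composition, but which of the possible genomes comes out depends on the order in which edges are explored.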

Page 48

Multiple Eulerian Paths

[Figure: the De Bruijn graph with an Eulerian path highlighted]

TAATGCCATGGGATGTT

Page 49

Multiple Eulerian Paths

[Figure: the same De Bruijn graph traversed by different Eulerian paths]

TAATGCCATGGGATGTT
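A quick check makes the ambiguity concrete. The second string below is not taken from the slides; it is an alternative obtained by visiting the GG loop before the GC loop in the De Bruijn graph, and the sketch only verifies that both strings share the same 3-mer composition.

```python
def composition(text, k):
    """Sorted list of the k-mers of text, repeats included."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


first = "TAATGCCATGGGATGTT"
second = "TAATGGGATGCCATGTT"  # assumed alternative, checked below
same = composition(first, 3) == composition(second, 3)
print(first != second and same)
```

Because the two genomes cannot be told apart from their 3-mers alone, extra information such as read-pairs is needed to choose between them.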

Page 50

DNA Sequencing with Read-pairs

● A read-pair is a pair of reads separated by a fixed distance d.

Genome: TAATGCCATGGGATGTT

AAT-CCA is a (3,1) read-pair: two 3-mers separated by a gap of d = 1 nucleotide.

AAT-CCA represents the sequence AATGCCA in the original De Bruijn graph

Page 51

DNA Sequencing with Read-pairs

Composition3(TAATGCCATGGGATGTT) =

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

PairedComposition3,1(TAATGCCATGGGATGTT) =

TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT
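The paired composition above can be generated with a window of length 2k + d. This is a minimal sketch; paired_composition is an illustrative name, and each read-pair is represented as a tuple of two strings.

```python
def paired_composition(text, k, d):
    """Return all pairs of k-mers in text that are separated by a gap of d."""
    span = 2 * k + d
    return [
        (text[i:i + k], text[i + k + d:i + span])
        for i in range(len(text) - span + 1)
    ]


pairs = paired_composition("TAATGCCATGGGATGTT", 3, 1)
print(" ".join(f"{a}|{b}" for a, b in pairs))
print(" ".join(f"{a}|{b}" for a, b in sorted(pairs)))  # lexicographical order
```

A string of length n yields n - (2k + d) + 1 pairs, which gives 11 pairs for the 17-letter example.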

Page 52

Paired Composition

TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT

Lexicographical order:

AAT|CCA ATG|CAT ATG|GAT CAT|GGA CCA|GGG GCC|TGG GGA|GTT GGG|TGT TAA|GCC TGC|ATG TGG|ATG

Page 53

String Reconstruction from Read-pairs

String reconstruction from read-pairs

Input: A collection of paired k-mers

Output: A string Text such that PairedComposition(Text) is equal to the collection of paired k-mers

Page 54

Paired De Bruijn Graphs

Each paired 3-mer is drawn as an edge from its pair of prefixes to its pair of suffixes:

TAA|GCC: (TA, GC) → (AA, CC)
AAT|CCA: (AA, CC) → (AT, CA)
ATG|CAT: (AT, CA) → (TG, AT)
TGC|ATG: (TG, AT) → (GC, TG)
GCC|TGG: (GC, TG) → (CC, GG)
CCA|GGG: (CC, GG) → (CA, GG)
CAT|GGA: (CA, GG) → (AT, GA)
ATG|GAT: (AT, GA) → (TG, AT)
TGG|ATG: (TG, AT) → (GG, TG)
GGG|TGT: (GG, TG) → (GG, GT)
GGA|GTT: (GG, GT) → (GA, TT)

Page 55

Paired De Bruijn Graphs

[Figure: the edges listed on page 54]

Combine nodes with identical labels

Page 56

Paired De Bruijn Graphs

[Figure: the paired De Bruijn graph after nodes with identical labels are combined]

Page 57

Paired De Bruijn Graphs

[Figure: the paired De Bruijn graph]

Paired De Bruijn graphs obtained from the paired composition and the genome are identical.

Page 58

Paired De Bruijn Graphs

[Figure: the paired De Bruijn graph]

Unique genome string: TAATGCCATGGGATGTT
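String reconstruction from read-pairs can be sketched by combining the paired De Bruijn graph with an Eulerian path. The names below are hypothetical, and the code makes two assumptions: an Eulerian path exists, and it starts at the node with one more outgoing than incoming edge. In general not every Eulerian path spells a consistent string, so the sketch returns None when the two halves disagree.

```python
from collections import defaultdict


def reconstruct_from_pairs(pairs, k, d):
    """Spell a string from (k,d)-mers via an Eulerian path (assumed to exist)."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for first, second in pairs:
        source = (first[:-1], second[:-1])  # pair of prefixes
        target = (first[1:], second[1:])    # pair of suffixes
        graph[source].append(target)
        indegree[target] += 1

    start = next(
        (node for node in list(graph) if len(graph[node]) - indegree[node] == 1),
        next(iter(graph)),
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    top = path[0][0] + "".join(node[0][-1] for node in path[1:])
    bottom = path[0][1] + "".join(node[1][-1] for node in path[1:])
    # The two spelled strings must agree where they overlap (offset k + d).
    if top[k + d:] != bottom[:-(k + d)]:
        return None
    return top + bottom[-(k + d):]


text = ("TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG "
        "CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT")
pairs = [tuple(item.split("|")) for item in text.split()]
print(reconstruct_from_pairs(sorted(pairs), 3, 1))
```

For this example the graph has a single Eulerian path, so the order of the input pairs does not change the result.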

