Genome Sequencing Algorithms - Basavaraj

  • Genome Sequencing Algorithms

    William Hamilton (1805 – 1865)

    Leonhard Euler (1707 – 1783)

    Nicolaas Govert de Bruijn (1918 – 2012)

    Phillip Compeau and Pavel Pevzner, Bioinformatics Algorithms: An Active Learning Approach

  • The Genome Sequencing Problem

    ● Determining the order of nucleotides in a genome

    ● The human genome contains about 3 billion nucleotides
      – Amoeba dubia and Paris japonica contain 200 times more!

    ● Applications in medicine, agriculture, biotechnology, ...

  • The Genome Sequencing Problem

    ● There is no technology to read the genome from one end to the other.
      – Short snippets, called reads (200-300 nucleotides), can be identified.
      – No information about the location of a read is known.

    ● Assembling individual reads into the entire genome is akin to solving a giant overlapping puzzle.

    ● The newspaper explosion analogy

  • History of Genome Sequencing

    ● 1977: Walter Gilbert and Frederick Sanger developed independent DNA sequencing methods.
    ● 1990: Human Genome Project, Francis Collins.
    ● 1997: Celera Genomics, Craig Venter.
    ● 2000: Human genome is sequenced.

  • Next Generation Sequencing

    ● Illumina sequences human genomes for $10,000
    ● Complete Genomics sequences 100s of genomes per month
    ● Beijing Genome Institute has 100s of sequencing machines and is the world's biggest sequencing center.

  • Next Generation Sequencing

    ● Identification of mutations in personal genomes for health diagnosis
    ● Genome 10K project

  • Genome Assembly – The Computational Problem

    ATGTGCATACTAAGCATACTAAGCATACTAAGCATACTAATGTGCATACTAAGCATGCTA

    [Figure: multiple identical copies of the genome string above.]

  • Genome Assembly – The Computational Problem

    Sequencing Machine generates reads

    A String Reconstruction Problem

  • The Genome Sequencing Problem

    Reconstruct a genome from reads

    Input: A collection of strings, Reads
    Output: A string, Genome, reconstructed from all the Reads

  • k-mer Composition

    Composition3(TAATGCCATGGGATGTT) =

    TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    Lexicographical ordering of k-mers

    AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
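    The composition above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `composition` is an illustrative choice, not something defined in the slides.

```python
def composition(text, k):
    """Return every k-mer of text in order of appearance, repeats included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


kmers = composition("TAATGCCATGGGATGTT", 3)
print(" ".join(kmers))          # k-mers in genome order
print(" ".join(sorted(kmers)))  # lexicographic order, as on the slide
```

    Sorting discards the order in which the k-mers occur in the genome, which is exactly the information a sequencing machine does not provide.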

  • The String Reconstruction Problem

    Reconstruct a string from its k-mer composition.

    Input: A collection of k-mers
    Output: A Genome, such that Compositionk(Genome) is equal to the collection of k-mers
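    One way to make the output condition concrete is a checker that compares multisets of k-mers. The sketch below assumes a non-empty collection of equal-length k-mers; the name `has_composition` is hypothetical.

```python
from collections import Counter


def has_composition(genome, kmers):
    """True if the k-mers of genome, counted with repeats, equal the given collection."""
    k = len(kmers[0])  # assumes all k-mers share one length
    found = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return found == Counter(kmers)


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
print(has_composition("TAATGCCATGGGATGTT", kmers))
print(has_composition("TAATGGGATGCCATGTT", kmers))
```

    Both calls succeed: TAATGGGATGCCATGTT has the same 3-mer composition as TAATGCCATGGGATGTT, so a solution to the problem need not be unique.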

  • Naive String Reconstruction Approach

    AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

    TAA AAT ATG TGT GTT

    No 3-mer begins with TT!
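    The dead end above can be reproduced with a greedy extension that never backtracks. This is a sketch under assumptions: the function `greedy_extend` and its `pick` parameter (the rule for choosing among several overlapping k-mers) are illustrative and not from the slides.

```python
def greedy_extend(kmers, start, pick=max):
    """Extend a chain from start by overlap, never backtracking.

    Returns the chain and the k-mers left unused when it gets stuck.
    """
    unused = list(kmers)
    unused.remove(start)
    path = [start]
    while True:
        # Candidates are unused k-mers whose prefix equals the current suffix.
        candidates = [km for km in unused if km[:-1] == path[-1][1:]]
        if not candidates:
            return path, unused
        nxt = pick(candidates)
        unused.remove(nxt)
        path.append(nxt)


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
path, unused = greedy_extend(kmers, "TAA", pick=max)
print(" ".join(path), "| unused:", len(unused))
```

    With `pick=max` the chain follows TAA AAT ATG TGT GTT and stops with ten 3-mers unused. Choosing `pick=min` gets further but still strands GGG, so the choice at each branch matters.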

  • Representing a Genome as a Path

    Composition3(TAATGCCATGGGATGTT) =

    TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    [Figure: the genome drawn as a path through its 3-mers; each 3-mer is a node in a graph.]

    Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)

  • Path turns into a Graph

    Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)

    TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

  • Path turns into a Graph

    Nodes are ordered lexicographically.

    How does one find the genome string?
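    The connection rule can be sketched as code. Because ATG occurs three times, nodes are identified here by their position in the sorted list rather than by their text; that indexing and the name `overlap_graph` are assumptions made for illustration.

```python
def overlap_graph(kmers):
    """Return lexicographically sorted nodes and directed edges (i, j) as index pairs.

    An edge i -> j exists when the suffix of node i equals the prefix of node j.
    Self-loops are skipped.
    """
    nodes = sorted(kmers)
    edges = [
        (i, j)
        for i, a in enumerate(nodes)
        for j, b in enumerate(nodes)
        if i != j and a[1:] == b[:-1]
    ]
    return nodes, edges


genome = "TAATGCCATGGGATGTT"
kmers = [genome[i:i + 3] for i in range(len(genome) - 2)]
nodes, edges = overlap_graph(kmers)
print(len(nodes), "nodes,", len(edges), "edges")
for i, j in edges[:5]:
    print(nodes[i], "->", nodes[j])
```

    Once the nodes are sorted, nothing in the graph marks which edges belong to the genome; that is what makes the question on the slide non-trivial.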

  • Genome Path in the Graph

    TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    TAATGCCATGGGATGTT

    The genome string is a Hamiltonian walk in the graph
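    Reading the genome off such a path is mechanical: take the first k-mer whole, then append the last letter of each following k-mer. A minimal sketch, with the hypothetical helper name `spell_path`:

```python
def spell_path(path):
    """Spell the string given by a list of k-mers where each overlaps the next by k-1 letters."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


path = "TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT".split()
print(spell_path(path))
```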

  • Hamiltonian Path Problem

    Find a Hamiltonian path in the graph

    Input: A graph.
    Output: A path visiting every node in the graph exactly once.

    Hamiltonian Path: A path in a graph that traverses every node exactly once

    William R Hamilton (1805 – 1865)
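    For a graph as small as the 15-node example, a Hamiltonian path can be found by plain backtracking. The sketch below is an illustrative brute-force search, not an efficient method; the helper names and the index-based node labels are assumptions.

```python
def hamiltonian_path(adj):
    """Backtracking search. adj maps each node to a list of successors.

    Returns a list of nodes visiting every node exactly once, or None.
    Worst-case running time is exponential in the number of nodes.
    """
    total = len(adj)

    def extend(path, visited):
        if len(path) == total:
            return path
        for nxt in adj[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found is not None:
                    return found
        return None

    for start in adj:
        found = extend([start], {start})
        if found is not None:
            return found
    return None


genome = "TAATGCCATGGGATGTT"
kmers = sorted(genome[i:i + 3] for i in range(len(genome) - 2))
# Overlap graph on indices, since the 3-mer ATG is repeated.
adj = {
    i: [j for j, b in enumerate(kmers) if j != i and a[1:] == b[:-1]]
    for i, a in enumerate(kmers)
}
path = hamiltonian_path(adj)
rebuilt = kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
print(rebuilt)
```

    The string printed has the required 3-mer composition, but it is not guaranteed to be the original genome, since more than one Hamiltonian path may exist.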

  • A Different Path

    TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    [Figure: the same path drawn twice, once with 3-mers as nodes and once with 3-mers as edges.]

  • A Different Path

    3-mers as edges and nodes as prefixes and suffixes of the corresponding 3-mers

    Edges: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
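    The relabeling can be sketched in code: each 3-mer becomes an edge from its 2-letter prefix to its 2-letter suffix. The function name `kmer_edges` is an illustrative choice.

```python
def kmer_edges(kmers):
    """Turn each k-mer into a directed edge (prefix, suffix) of (k-1)-mers."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]


path = "TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT".split()
edges = kmer_edges(path)
# Node labels along the path: start of the first edge, then the end of every edge.
node_labels = [edges[0][0]] + [end for _, end in edges]
print(" ".join(node_labels))
```

    The 15 edges yield 16 node labels along the path, several of them identical (AT, TG, and GG repeat).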

  • Glue Identical Nodes

    Edges: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    Nodes: TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

    [Figure: the three AT nodes are brought together, ready to be glued.]

  • Glue Identical Nodes

    [Figure: the three AT nodes are glued into one.]

    Nodes: TA AA AT TG GC CC CA TG GG GG GA TG GT TT

  • Glue Identical Nodes

    [Figure: the three TG nodes and the two GG nodes are glued as well.]

    Nodes: TA AA AT TG GC CC CA GG GA GT TT

  • De Bruijn Graph of the Genome

    [Figure: De Bruijn graph of the genome.]

    Edges: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

    Nodes: TA AA AT TG GC CC CA GG GA GT TT
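    Gluing identical nodes falls out naturally when the graph is stored as a dictionary keyed by node label: identical labels share one key. A minimal sketch, with `de_bruijn` as an assumed function name:

```python
def de_bruijn(kmers):
    """Build a De Bruijn graph: one node per distinct (k-1)-mer, one edge per k-mer.

    Returns a dict mapping each node to the list of nodes its edges point to.
    Repeated k-mers produce repeated entries (parallel edges).
    """
    graph = {}
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph.setdefault(prefix, []).append(suffix)
        graph.setdefault(suffix, [])  # keep nodes that have no outgoing edge
    return graph


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = de_bruijn(kmers)
for node in sorted(graph):
    print(node, "->", " ".join(graph[node]))
```

    The graph is built from the k-mers alone, in any order; the genome is not needed. The three copies of ATG appear as three parallel edges from AT to TG.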

  • De Bruijn Graph of the Genome

    The genome string is an Eulerian walk in the De Bruijn graph

    TAATGCCATGGGATGTT

  • Eulerian Path Problem

    Leonhard Euler (1707 – 1783)

    Find an Eulerian path in a graph

    Input: A graph.
    Output: A path visiting every edge in the graph exactly once.

    Eulerian Path: A path in a graph that traverses every edge exactly once.
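    A standard way to find such a path is Hierholzer's algorithm, which the slides do not name; it is swapped in here as one well-known option. The sketch assumes the graph has at least one edge and that an Eulerian path exists, and it does not verify either.

```python
def eulerian_path(graph):
    """Hierholzer's algorithm on a directed multigraph.

    graph maps each node to a list of successors, one entry per edge.
    Assumes an Eulerian path exists; returns it as a list of nodes.
    """
    remaining = {node: list(succs) for node, succs in graph.items()}
    in_degree = {}
    for succs in graph.values():
        for node in succs:
            in_degree[node] = in_degree.get(node, 0) + 1
            remaining.setdefault(node, [])
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(iter(remaining))
    for node, succs in remaining.items():
        if len(succs) - in_degree.get(node, 0) == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


genome = "TAATGCCATGGGATGTT"
graph = {}
for i in range(len(genome) - 2):
    kmer = genome[i:i + 3]
    graph.setdefault(kmer[:-1], []).append(kmer[1:])
walk = eulerian_path(graph)
rebuilt = walk[0] + "".join(node[-1] for node in walk[1:])
print(rebuilt)
```

    The running time is linear in the number of edges. The string spelled by the walk has the same 3-mer composition as the input genome but may be a different reconstruction.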

  • Hamiltonian Path vs. Eulerian Path

    [Figure: De Bruijn graph of the genome, with 3-mers as edges.]

    Euler presented an efficient solution to the Eulerian path problem. No fast (polynomial-time) algorithm is known for the Hamiltonian Path Problem, which is NP-complete.
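    Part of what makes the Eulerian problem tractable is a simple degree condition: a connected directed graph has an Eulerian path exactly when at most one node has one more outgoing than incoming edge, at most one node has one more incoming than outgoing edge, and every other node is balanced. The sketch below checks only the degree part; connectivity is not tested, and the name `degree_imbalance` is hypothetical.

```python
def degree_imbalance(graph):
    """Return {node: out_degree - in_degree} for every unbalanced node.

    graph maps each node to a list of successors, one entry per edge.
    """
    balance = {node: len(succs) for node, succs in graph.items()}
    for succs in graph.values():
        for node in succs:
            balance[node] = balance.get(node, 0) - 1
    return {node: diff for node, diff in balance.items() if diff != 0}


# De Bruijn graph of TAATGCCATGGGATGTT, written out by hand.
graph = {
    "TA": ["AA"], "AA": ["AT"], "AT": ["TG", "TG", "TG"],
    "TG": ["GC", "GG", "GT"], "GC": ["CC"], "CC": ["CA"], "CA": ["AT"],
    "GG": ["GG", "GA"], "GA": ["AT"], "GT": ["TT"], "TT": [],
}
print(degree_imbalance(graph))
```

    Only TA (one extra outgoing edge) and TT (one extra incoming edge) are unbalanced, so an Eulerian path must start at TA and end at TT. No comparably simple test is known for Hamiltonian paths.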

  • The Objective

    TAATGCCATGGGATGTT

    AAT ATG ATG ATG CAT CCA GAT GCC