Jigsaw Puzzlers' Delight: Sequencing DNA

Prof. Sara Billey and Prof. Sreeram Kannan
University of Washington

Happy Mathday! March 20, 2016

http://www.nist.gov/pml/div689/images/sh_17004592_dna_Benjamin_Albiach_Galan_LR.jpg

Jigsaw Puzzlers' Delight - University of Washingtonmorrow/mathday/mathday16/dna.puzzle.pdfNote: Each snip is a consecutive sequence of 20 letters from the original. These were sampled




Who Are These People?

Each human genome is a three billion nucleotide long “book” written in an alphabet with only the four letters A, C, G, T.

http://davidlazarphoto.com/


•  Different people have slightly different genomes: on average, roughly 1 mutation in 1000 nucleotides.

•  The 1 in 1000 nucleotides difference accounts for height, high cholesterol susceptibility, and 1000s of genetic diseases.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Few Mutations Can Make a Big Difference…

©2013 by Compeau and Pevzner.
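To make "roughly 1 mutation in 1000 nucleotides" concrete, here is a minimal Python sketch that counts the positions at which two equal-length sequences differ. The function name `count_mismatches` and the two short strings are illustrative assumptions, not part of the slides.

```python
def count_mismatches(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Toy example (made-up strings): one substitution in ten letters.
reference = "CTGATGATGG"
variant = "CTGATCATGG"
mismatches = count_mismatches(reference, variant)
print(mismatches, mismatches / len(reference))
```

The same function could be applied to the two long sequences shown on this slide, since they have equal length.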


Genomes for Different Species

•  All human genomes are similar (99.9% agreement).

•  Human genomes and chimpanzee genomes are further apart (96% agreement).

•  Some genomes are 100 times larger than the human genome: Amoeba dubia, Paris japonica.

©2013 by Compeau and Pevzner.


A Short Genome (5386 bases long)

Enterobacteria phage phiX174 sensu lato, complete genome from (http://www.ncbi.nlm.nih.gov/nuccore/9626372?report=fasta) NCBI Reference Sequence: NC_001422.1 GenBank Graphics >gi|9626372|ref|NC_001422.1| Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTT GGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACA ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT TCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGC ATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAAC 
CTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTT GATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGC CGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGAC TAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTG TATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGT TTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGA AGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGAT TATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAAC GCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGC TTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGT TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTG TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG AATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGC CGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGT TTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTG CTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAA AGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCT GGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTG GTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGA TAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTAT CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG TTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGA GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA 
AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC TTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTT CTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGA TACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCG TCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTT CTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTAT TGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGC ATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATG TTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGA ATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGG GACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCC CTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATT GCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAG GCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTT ATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCG CAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGC CGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTC GTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCAT CGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAG CCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATA TGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACT TCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTG TCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGC AGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACC TGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA ***************************************************************
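The genome above is given in FASTA format: a header line starting with `>` followed by lines of sequence. As a rough sketch of how such a record could be read and its length checked, the Python below parses FASTA text into a dictionary. The function name `read_fasta` and the header `toy_record` are assumptions for illustration; the toy sequence is just the first 20 letters of the genome above, split over two lines.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {header: sequence}."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated onto the current record.
            records[header] += line.upper()
    return records


toy = ">toy_record\nGAGTTTTATC\nGCTTCCATGA\n"
records = read_fasta(toy)
for name, seq in records.items():
    base_counts = {base: seq.count(base) for base in "ACGT"}
    print(name, len(seq), base_counts)
```

Run on the full phiX174 record, the reported length should match the 5386 bases stated in the slide title.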


DNA = Deoxyribonucleic acid

•  DNA encodes genetic instructions used in the development, functioning and reproduction of all known living things.

•  DNA is a type of polymer (long chain of repeating molecules) first isolated by Friedrich Miescher in 1869.

•  Photo 51: X-ray diffraction image by Rosalind Franklin and Ray Gosling, 1952.

https://en.wikipedia.org/wiki/Photo_51


DNA = Deoxyribonucleic acid

Each nucleotide is composed of a nitrogen-containing nucleobase, either cytosine (C), guanine (G), adenine (A), or thymine (T), along with a sugar called deoxyribose and a phosphate group.

T = Thymine = C5H6N2O2

http://www.councilforresponsiblegenetics.org/geneticprivacy/images/c16x6base-pairs.png


Scientific Goal of the Century

Goal: To discover how we can read off and classify the functions and variations in each genome.

First Steps: Acquire exact DNA sequences from different living creatures. Compare.

Question: What is holding us back?

Answer: It’s not so easy to read DNA!

http://www.nist.gov/oles/forensics/images/DNA-Strand.jpg


Real Image of DNA from 2012

http://scitechdaily.com/first-electron-microscope-image-of-dna-double-helix/


What Makes Genome Sequencing Difficult?

•  Modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end (like we read a book).

•  They can only shred the genome and generate short reads.

•  The genome assembly is not the same as a jigsaw puzzle: we must use overlapping reads to reconstruct the genome, a giant overlap puzzle!

©2013 by Compeau and Pevzner.


Why Do We Sequence 1000s of Species?

–  Applications in medicine (genomes of pathogens), agriculture (oil palm genome), biotechnology (genomes of energy-producing cyanobacteria), etc.

©2013 by Compeau and Pevzner.


Why Do We Sequence Personal Genomes?

•  2010: Nicholas Volker became the first human being to be saved by genome sequencing.

–  Doctors could not diagnose his condition; he went through dozens of surgeries.

–  Sequencing revealed a rare mutation in the XIAP gene linked to a defect in his immune system.

–  This led doctors to use immunotherapy, which saved the child.


Brief History of Genome Sequencing

•  1977: Walter Gilbert and Frederick Sanger develop independent DNA sequencing methods.

•  1980: They share the Nobel Prize.

•  Still, their sequencing methods were too expensive ($3 billion to sequence a human genome).

[Photos: Walter Gilbert, Frederick Sanger]

©2013 by Compeau and Pevzner.


The Race to Sequence the Human Genome

•  1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome by 2005.

•  1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.

•  2000:

[Photos: Francis Collins, Craig Venter]

©2013 by Compeau and Pevzner.



From Human to Mouse to Rat to …

Early 2000s: Many more mammalian genomes are sequenced using the same Sanger sequencing method, but it is clear that new technology is needed for further progress.

©2013 by Compeau and Pevzner.


Next Generation Sequencing Technologies

•  Late 2000s: The market for new sequencing machines takes off.

–  Illumina reduces the cost of sequencing a human genome from $3 billion to $10,000.

–  Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.

–  Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.

©2013 by Compeau and Pevzner.


10,000 Genomes and Beyond

•  2010: Scientists launch a project to sequence 10,000 vertebrate genomes.

•  Now: Human genome sequencing costs about $1000, and partial sequencing is $199 on 23andme.com!

©2013 by Compeau and Pevzner.


What brought the price of genome sequencing down?

1)  Better technology: Shorter reads are cheaper to produce.

2)  Better mathematical algorithms: Solve the overlapping puzzle problem to reconstruct the original DNA sequence quickly and reliably from the random reads.


From Reads to Sequences

Unsolved Problem: Find the best possible way to reconstruct the original DNA sequence from the reads.

Example reads: AAGT TAGA GTAG GAAG

One Solution: AAGTAGAGTAGAAG (14 letters)

Better Solution: GAAGTAGA (8 letters, the shortest possible; GTAGAAGT is the only other 8-letter solution)
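As a quick check of the example, the sketch below tests whether a candidate sequence contains every read as a contiguous piece and reports its length. The helper name `explains_all` is an assumption made for illustration.

```python
reads = ["AAGT", "TAGA", "GTAG", "GAAG"]


def explains_all(candidate: str, reads: list[str]) -> bool:
    """True if every read occurs as a contiguous substring of the candidate."""
    return all(read in candidate for read in reads)


for candidate in ["AAGTAGAGTAGAAG", "GAAGTAGA"]:
    print(candidate, len(candidate), explains_all(candidate, reads))
```

Both candidates pass the check, so the difference between them is only their length.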


From Snips to Sequences

Activity: Get into groups of 4-6 people. Get a packet of snips. Try to recreate the most likely DNA sequence from these snips.

Note: Each snip is a consecutive sequence of 20 letters from the original. These were sampled in a circular fashion, so some snips wrap around.

Find the secret message in positions: 10, 20, 30, …, 90.

Hint: The original sequence has 100 letters. It ends with ATATGGA.


From Snips to Sequences

Solution Sequence:

C G C C T C G A G A
A T T T G T C T A T
T C A T T A A C G T
C A G T T T G C T A
C T C C G G A C C C
G C C G T G A C A A
T C C G A C T A T C
G T G C T G G C C A
C C G C A G T C T T
T T G A T A T G G A
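The activity setup can be imitated in a few lines of Python. This is only a sketch of one way the snips could have been produced: the number of snips, the random seed, and the function names `circular_snip`, `sample_snips` and `every_tenth_letter` are assumptions, while the 100-letter sequence, the snip length of 20, and the positions 10, 20, …, 90 come from the slides.

```python
import random

# Solution sequence from the slide, written as ten rows of ten letters.
ROWS = [
    "CGCCTCGAGA", "ATTTGTCTAT", "TCATTAACGT", "CAGTTTGCTA", "CTCCGGACCC",
    "GCCGTGACAA", "TCCGACTATC", "GTGCTGGCCA", "CCGCAGTCTT", "TTGATATGGA",
]
genome = "".join(ROWS)


def circular_snip(seq: str, start: int, length: int = 20) -> str:
    """Return `length` letters from 0-based `start`, wrapping past the end."""
    return "".join(seq[(start + i) % len(seq)] for i in range(length))


def sample_snips(seq: str, count: int, length: int = 20, seed: int = 0) -> list[str]:
    """Draw `count` snips at uniformly random circular start positions."""
    rng = random.Random(seed)
    return [circular_snip(seq, rng.randrange(len(seq)), length) for _ in range(count)]


def every_tenth_letter(seq: str) -> str:
    """Letters at 1-based positions 10, 20, ..., 90."""
    return "".join(seq[pos - 1] for pos in range(10, 100, 10))


snips = sample_snips(genome, count=5)
print(snips)
print(every_tenth_letter(genome))
```

A snip that starts within the last 19 positions wraps around to the beginning, which is why some snips in the packet join the end of the sequence to its start.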


How to reconstruct?

  T T G T C T A T T C A T T A A C G T C A
            T A T T C A T T A A C G T C A G T T T G
                    C A T T A A C G T C A G T T T G C T A C
                              A C G T C A G T T T G C T A C T C C G G
                                C G T C A G T T T G C T A C T C C G G A
                                                  G C T A C T C C G G A C C C G C C G T G

..T T G T C T A T T C A T T A A C G T C A G T T T G C T A C T C C G G A C C C G C C G T G ..


Why does this work?

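•  Each read is extended using the read that overlaps it the most.

•  The overlap is significant: usually more than 7 letters.

•  Puzzle: What is the chance that two k-length DNA sequences are the same? For independent, uniformly random letters it is 4^{-k}.

•  How many possible k-length starting positions are there in a DNA sequence of length D? About D.

So a chance match of length k is unlikely when D · 4^{-k} < 1, that is, when 4^{k} > D.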
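The counting argument can be tried numerically. The sketch below assumes an idealized model in which every letter is independent and equally likely to be A, C, G or T (real genomes are not like this); under that assumption a fixed k-letter pattern matches a given position with probability 4^{-k}, and there are roughly D positions to try. The function name is hypothetical.

```python
def expected_chance_matches(genome_length: int, k: int) -> float:
    """Approximate expected number of positions matching a fixed k-mer by chance."""
    return genome_length * 4.0 ** (-k)


D = 100  # length of the puzzle sequence in the activity
for k in (2, 4, 8):
    print(k, expected_chance_matches(D, k))
```

For the 100-letter puzzle the expected number of chance matches drops well below one once k reaches 8, which is consistent with overlaps of more than 7 letters being trustworthy.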


Greedy algorithm (TIGR Assembler, phrap, CAP3...)

Input: the set of snips

1.  Set the initial set of “contigs” as the snips

2.  Find two contigs with the largest overlap and merge them into a new contig

3.  Repeat step 2 until only one contig remains
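The three steps above can be sketched in Python as follows. This is a toy illustration, not the code of TIGR Assembler, phrap or CAP3: it assumes error-free snips, breaks ties by iteration order, and the names `overlap` and `greedy_assemble` are made up for this example. The six snips are the ones shown on the "How to reconstruct?" slide.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(snips: list[str]) -> str:
    """Greedily merge snips into a single contig (expects at least one snip)."""
    unique = sorted(set(snips))
    # Step 1: every snip starts as its own contig; drop snips contained in others.
    contigs = [s for s in unique if not any(s != t and s in t for t in unique)]
    # Steps 2 and 3: merge the pair with the largest overlap until one contig remains.
    while len(contigs) > 1:
        k, a, b = max(
            ((overlap(a, b), a, b) for a in contigs for b in contigs if a != b),
            key=lambda triple: triple[0],
        )
        merged = a + b[k:]
        # Remove the two merged contigs and anything the new contig swallows.
        contigs = [c for c in contigs if c not in merged] + [merged]
    return contigs[0]


reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
print(greedy_assemble(reads))

snips = [
    "TTGTCTATTCATTAACGTCA", "TATTCATTAACGTCAGTTTG", "CATTAACGTCAGTTTGCTAC",
    "ACGTCAGTTTGCTACTCCGG", "CGTCAGTTTGCTACTCCGGA", "GCTACTCCGGACCCGCCGTG",
]
print(greedy_assemble(snips))
```

Recomputing all pairwise overlaps in every round is wasteful but keeps the sketch short; it is fine for a packet of snips and far too slow for real read sets.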


Shortest Common Supersequence (SCS)

Input: A set of reads

Output: The shortest sequence containing all the reads as contiguous subsequences (substrings) ⇔ Shortest DNA sequence that can explain all reads

Example reads: AAGT TAGA GTAG GAAG

One Solution: AAGTAGAGTAGAAG

Better Solution: GAAGTAGA
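For a handful of reads the shortest solutions can be found by brute force: try every ordering of the reads, glue each ordering together with maximal overlaps, and keep the shortest results. The sketch below does this under the assumption that no read is contained in another; the function names are illustrative. The number of orderings grows factorially, so this only works for tiny examples.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(order: tuple[str, ...]) -> str:
    """Glue reads left to right, overlapping each one with the string built so far."""
    result = order[0]
    for read in order[1:]:
        result += read[overlap(result, read):]
    return result


def shortest_superstrings(reads: list[str]) -> list[str]:
    """All shortest strings obtainable by merging the reads in some order."""
    candidates = {merge_in_order(order) for order in permutations(reads)}
    best = min(len(c) for c in candidates)
    return sorted(c for c in candidates if len(c) == best)


reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
print(shortest_superstrings(reads))
```

For the four example reads the search returns two sequences of 8 letters, GAAGTAGA and GTAGAAGT.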


Read Overlap Graph

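[Figure: read overlap graph on the nodes AAGT, TAGA, GTAG and GAAG; each directed edge is labeled with the overlap length between its two reads (the labels 3, 2, 2, 1, 1 appear in the figure).]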

Find a path through all the reads (nodes) with highest weight.
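A read overlap graph for the four example reads can be built directly from pairwise overlaps. In the sketch below the graph is stored as a dictionary from directed edges to weights, which is just one convenient representation chosen for illustration; `overlap_graph` and `path_weight` are hypothetical names.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads: list[str]) -> dict[tuple[str, str], int]:
    """Directed graph: edge (a, b) carries weight overlap(a, b) when it is positive."""
    return {
        (a, b): overlap(a, b)
        for a in reads
        for b in reads
        if a != b and overlap(a, b) > 0
    }


def path_weight(graph: dict[tuple[str, str], int], path: list[str]) -> int:
    """Sum of edge weights along consecutive reads of the path (missing edges count 0)."""
    return sum(graph.get(edge, 0) for edge in zip(path, path[1:]))


reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
graph = overlap_graph(reads)
for (a, b), weight in sorted(graph.items(), key=lambda item: -item[1]):
    print(a, "->", b, weight)

path = ["GAAG", "AAGT", "GTAG", "TAGA"]
print(path_weight(graph, path))
```

The path GAAG, AAGT, GTAG, TAGA has weight 8, so gluing the reads along it saves 8 of the 16 letters and yields the 8-letter sequence GAAGTAGA.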


Shortest Common Supersequence (SCS)

Question: How to find the SCS?

Answer: Equivalent to finding a Hamiltonian path in a read-overlap graph ⇔ hard problem

Alternate way: formulate an Eulerian cycle problem ⇔ easy to find
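One common way to set up the Eulerian formulation is a de Bruijn graph, as in the Compeau, Pevzner and Tesler article listed in the resources. The sketch below is a simplified illustration under several assumptions of its own: reads are cut into 3-letter pieces (k = 3 is chosen only because it works for this toy example), each distinct piece is used exactly once, the graph is connected and balanced, and the walk starts at node GA. The cycle is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """All k-letter windows of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph: dict[str, list[str]] = defaultdict(list)
    for kmer in sorted({km for read in reads for km in kmers(read, k)}):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_cycle(graph: dict[str, list[str]], start: str) -> list[str]:
    """Hierholzer's algorithm; assumes a connected graph with in-degree == out-degree."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # Dead end: this node is finished, add it to the cycle.
            cycle.append(stack.pop())
    return cycle[::-1]


def spell(path: list[str]) -> str:
    """Spell the sequence along a walk: first node, then the last letter of each next node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["AAGT", "TAGA", "GTAG", "GAAG"]
graph = de_bruijn_graph(reads, k=3)
cycle = eulerian_cycle(graph, start="GA")
print(cycle)
print(spell(cycle))
```

Every edge is visited once in time proportional to the number of edges, in contrast to the exhaustive search over read orderings that a Hamiltonian path formulation suggests.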


Conditions for Reconstruction

•  When can a DNA sequence be reconstructed correctly from reads?

•  Which jigsaw puzzles are easily reconstructible?


Jigsaw puzzles

[Images: an easier jigsaw puzzle; a harder jigsaw puzzle]

How exactly do the fundamental limits depend on repeat statistics?


Not reconstructable: Interleaved repeats


Unreconstructable DNA Sequences

These two are confusable if: Read length < Interleaved Repeat length
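A small made-up example shows why interleaved repeats cause trouble. The two sequences below share the repeats X and Y in interleaved order and differ only in which short segment sits between each X and Y. All strings and names here are assumptions chosen for illustration. With reads shorter than the repeats, both sequences produce exactly the same collection of reads; with reads long enough to span a whole repeat plus a letter on each side, they can be told apart.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Multiset of all length-k reads of a linear sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Interleaved repeats X ... Y ... X ... Y, each of length 4 (made-up example).
X, Y = "AAAA", "CCCC"
a, b, c, d, e = "G", "G", "T", "T", "G"
seq1 = a + X + b + Y + c + X + d + Y + e
seq2 = a + X + d + Y + c + X + b + Y + e  # segments b and d swapped

for k in (3, 6):
    same = kmer_counts(seq1, k) == kmer_counts(seq2, k)
    print(k, "same reads" if same else "different reads")
```

With reads of length 3 no assembler could decide between the two sequences, since its input would be identical in both cases.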


Reprise

•  Algorithms for DNA reconstruction are derived from graph theory and probability theory.

•  Mathematical algorithms have led to faster and more accurate reconstruction.

•  Still many questions unanswered.

•  Come, join, contribute to the DNA revolution!


Resources and Acknowledgements

Many thanks to Glenn Tesler, Phillip Compeau and Pavel Pevzner, Alan, Marisa and Paul Viola for help on preparing some of these slides!

Thanks to all of you for listening and participating!

Resources:

“How to apply de Bruijn graphs to genome assembly” by Phillip E. C. Compeau, Pavel A. Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011). http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html

“Genome Sequencing” by Phillip Compeau and Pavel Pevzner. https://www.coursera.org/course/assembly