

Some Problems in Mathematics Related to Genomics & SUTTA Algorithm

Bud Mishra

Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA

Jointly with Giuseppe Narzisi (Graduate Student, Courant)

B Mishra Some Problems in Mathematics Related to Genomics & SUTT


Outline

1 Indian Population Genome Project
    Whole-Genome Shotgun Sequence Assembly
    A Phylogeny of Assemblers
    Assembly Paradigms

2 Methods
    SUTTA: Scoring-and-Unfolding Trimmed Tree Assembler
    Algorithmic Improvements

3 Results
    Assembly comparison

4 Conclusions and Discussions


Midnight’s Children...

“Whither do we go and what shall be our endeavour? To bring freedom and opportunity to the common man, to the peasants and workers of India; to fight and end poverty and ignorance and disease; to build up a prosperous, democratic and progressive nation, and to create social, economic and political institutions which will ensure justice and fullness of life to every man and woman. We have hard work ahead.”

Pandit Jawaharlal Nehru, August 15, 1947


Indian Population — A melting pot ...

A current population of approximately 1.17 billion people (more precisely, according to the CIA World Factbook, 1,166,079,217 people; July 1, 2009 est.).

Remarkably diverse genealogies and demographics:

Linguistic diversity (with four major language groups: Indo-European, Dravidian, Austro-Asiatic and Tibeto-Burman; with language isolates like Nihali or Great Andamanese)

Religious multiplicity (with every major religion represented)

Racial variety (more than two thousand ethnic groups)

“Separate individual men and women, each differing from the other ... a bundle of contradictions held together by strong but invisible threads.” (Nehru)


A Few Numbers..

The entire Indian human population can thus be described by about 7.3805818 × 10^18 base pairs of genomic data...

This entire body of genomic information can be stored in merely 2 exabytes of memory, significantly less than the current global monthly Internet traffic.

Equivalent to about 1 × 10^12 (1 trillion) copies of the Mahabharata, the great Indian epic.

While reading and storing India’s genomic information may be a surmountable challenge, understanding it, with every bit of its nuances, is not!
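The storage figure can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes 2 bits per base (four letters in the alphabet), which the slide does not state, and the names `BITS_PER_BASE` and `storage_exabytes` are made up for this example.

```python
# Figures quoted on the slides.
POPULATION = 1_166_079_217        # CIA World Factbook, July 2009 est.
TOTAL_BASE_PAIRS = 7.3805818e18   # base pairs for the whole population

# Assumption (not stated on the slide): 2 bits per base, because the
# alphabet {A, C, G, T} has four letters.
BITS_PER_BASE = 2


def storage_exabytes(base_pairs, bits_per_base=BITS_PER_BASE):
    """Return the storage needed in exabytes (1 EB = 10**18 bytes)."""
    return base_pairs * bits_per_base / 8 / 1e18


# Base pairs per person implied by the two quoted figures.
per_person = TOTAL_BASE_PAIRS / POPULATION

print(f"implied base pairs per person: {per_person:.3e}")
print(f"storage: {storage_exabytes(TOTAL_BASE_PAIRS):.2f} EB")
```

Under this assumption the total comes out just under 2 exabytes, consistent with the figure quoted above.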


Challenges..

Sex ratio: It has been plunging precipitously ... according to the 2001 census, India’s sex ratio is 927 females per 1,000 males (at birth). (Worse in states like Punjab, Haryana and Delhi, but better in a few states like Kerala.)

Age structure: 0–14 years: 30.8%, 15–64 years: 64.3%, 65+ years: 4.9% ... a population growth rate of 1.548% (2009 est.).

As India’s population ages, the health care cost of older Indians could become an onerous burden.


Future

Preparing for this future, India must step up to innovate and create a unique vision of her own in bio-medicine, biotechnology and computational biology at all scales, ranging from single nucleotides, single molecules and single cells to individual citizens and her entire population.

Concepts from many disciplines must intermingle to fulfill this vision, namely, population genetics, genomics, biotechnology, bioinformatics, genome-wide association studies and systems biology, as described below.


“The Grandest Genetic Experiment Ever Performed on Man...”

1 Best described in terms of two idealized genetically divergent ancestral populations: ANI (Ancestral North Indians) and ASI (Ancestral South Indians)... ANI-ASI admixtures described along an Indian Cline.

2 ANI ancestry was estimated to be higher than average in the upper castes (Brahmins and Kshatriyas) and Indo-Aryan linguistic groups. ASI ancestry was determined to be best represented by the Onge population in the Andaman Islands.


“The Grandest Genetic Experiment Ever Performed on Man...”

1 Autosomal estimates of ANI ancestries indicate a stronger male gene flow from groups with high ANI ancestry into ones with less.

2 Two interesting outlier groups: the Siddi (African ancestry) and the Nyshi and Ao Naga (Chinese ancestry)... Wright’s fixation index (FST) statistics show that the 19 main Indian population groups differ from one another much more than the 23 European groups do.

3 Strong founder effects (followed by limited gene flow... due to strong endogamy)... A high rate of recessive diseases...


Few Technological Milestones

1 Indian Human Population Genomes Project ... sequencing several thousand whole genomes haplotypically and accurately

2 High Throughput and Inexpensive Haplotypic Sequencing Technology Project ... capable of sequencing genomic DNA from a small number of cells and minute amounts of genomic material


Few Technological Milestones

1 High Accuracy Sequence Assembly Project ... accurate haplotypic sequence assembly software (accompanied by better sequence annotation and comparison algorithms)

2 Advanced Genome-Wide Association Studies Project ... incorporating improved analysis of ancestry, haplotypes and causal relations

3 Translational Systems Biology Project ... phenotyping, disease etiologies and systems biology models of diseases (with capabilities for model checking).


Shotgun Sequencing

The DNA sequence of an organism is sheared into a large number of small fragments (8-10x coverage), the ends of the fragments are sequenced (≈ 500 bp), and the resulting sequences are then joined together by a computer program called an assembler.


Shotgun Sequencing

Assume: If two sequence reads (strings of letters produced by the sequencing machine) share the same string of letters, then they must have originated from the same genomic location.


Sequence Assembly Entry from Wiki

Given a set of sequence fragments, the object is to find the shortest common supersequence.

Algorithm 1: GREEDY (pseudocode)
Input: Set of reads
Output: Set of contigs

1  Calculate pairwise alignments of all fragments;
2  Choose two fragments with the largest overlap;
3  Merge chosen fragments;
4  repeat
5      steps 2 and 3
6  until only one fragment is left;
7  return Set of contigs;

This is a suboptimal approach, but it’s our best idea!

Two other good ideas: OLC (Overlap-Layout-Consensus) and SBH (Sequencing-By-Hybridization).
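The GREEDY pseudocode can be sketched in Python as follows. This is a toy illustration under simplifying assumptions that are not in the slides: overlaps are exact suffix-prefix matches rather than alignments, reads are error-free and on one strand, and merging stops as soon as no overlap remains. The function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the two fragments with the largest exact overlap."""
    # Drop duplicates and reads fully contained in another read.
    frags = [r for r in set(reads)
             if not any(r != s and r in s for s in reads)]
    frags.sort()  # deterministic tie-breaking
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining fragments are contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATTAGACC", "GACCTGCC", "TGCCGGAA"]))
```

On these three toy reads the two overlaps of length 4 are merged in turn and a single contig remains. With repeats in the genome, the same greedy choice can merge the wrong pair.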


Name                          Algorithm    Author                              Year
Arachne WGA                   OLC          Batzoglou, S. et al.                2002 / 2003
Celera WGA Assembler / CABOG  OLC          Myers, G. et al.; Miller G. et al.  2004 / 2008
Minimus (AMOS)                OLC          Sommer, D.D. et al.                 2007
Newbler                       OLC          454/Roche                           2009
Edena                         OLC          Hernandez D., et al.                2008
SUTTA                         B&B          NYU/Abraxis (unpublished)           2009 / 2010
TIGR                          Greedy       TIGR                                1995 / 2003
Phusion                       Greedy       Mullikin JC, et al.                 2003
Phrap                         Greedy       Green, P.                           2002 / 2003 / 2008
CAP3, PCAP                    Greedy       Huang, X. et al.                    1999 / 2005
Euler                         SBH          Pevzner, P. et al.                  2001 / 2006
Euler-SR                      SBH          Chaisson, MJ. et al.                2008
Velvet                        SBH          Zerbino, D. et al.                  2007 / 2009
ALLPATHS                      SBH          Butler, J. et al.                   2008
ABySS                         SBH          Simpson, J. et al.                  2008 / 2009
SOAPdenovo                    SBH          Ruiqiang Li, et al.                 2009
SHARCGS                       Prefix-Tree  Dohm et al.                         2007
SSAKE                         Prefix-Tree  Warren, R. et al.                   2007
VCAKE                         Prefix-Tree  Jeck, W. et al.                     2007
QSRA                          Prefix-Tree  Douglas W. et al.                   2009
Sequencher                    -            Gene Codes Corporation              2007
SeqMan NGen                   -            DNASTAR                             2008
Staden gap4 package           -            Staden et al.                       1991 / 2008
MIRA, miraEST                 -            Chevreux, B.                        1998 / 2008
NextGENe                      -            Softgenetics                        2008
CLC Genomics Workbench        -            CLC bio                             2008 / 2009
CodonCode Aligner             -            CodonCode Corporation               2003 / 2009


Shortest Superstring Problem (First approximation)

Definition (Shortest Superstring Problem)

Given a set of strings $\{s_1, s_2, \ldots, s_n\}$, find the shortest string $T$ such that $\forall i$, $s_i$ is a substring of $T$.

First issue: NP-completeness! [Gallant et al. 1980]

Second issue: Incorrectly formulates the assembly problem!!


Intractability (NP-Completeness)

A class of problems having two properties:

Any given solution to the problem can be verified quickly (in polynomial time); the set of problems with this property is called NP (nondeterministic polynomial time).

Any NP problem can be converted into this one by a transformation of the inputs in polynomial time.

There are thousands of important computational problems that represent essentially one problem in many disguises!
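The first property, quick verification, is easy to illustrate for the superstring problem. The sketch below checks a proposed answer to the decision version ("is there a common superstring of length at most `max_len`?"). The function names are hypothetical and the reads are toy data.

```python
def is_superstring(candidate, strings):
    """True if every input string occurs somewhere in candidate."""
    return all(s in candidate for s in strings)


def verifies_bound(candidate, strings, max_len):
    """Check a certificate for the decision version of the problem.

    The work is a handful of substring searches, so it is polynomial in
    the total input length, even though finding a good candidate may
    require searching exponentially many possibilities.
    """
    return len(candidate) <= max_len and is_superstring(candidate, strings)


reads = ["ATTAGACC", "GACCTGCC", "TGCCGGAA"]
print(verifies_bound("ATTAGACCTGCCGGAA", reads, 16))
```

Checking is cheap; it is the search for the certificate that is believed to be hard.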


Intractability (NP-Completeness)

For instance, think about sequence reads as “towns” and overlaps as “roads.” Then the Shortest Common Superstring problem is the same as visiting all the towns (never more than once) using the roads in a minimum-distance tour.

SCSP (= Shortest Common Superstring Problem)

↦ HAM (TSP = Traveling Salesman Problem)

↦ SAT (= Satisfiability)

↦ LatinSquare

↦ Sudoku
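The towns-and-roads picture can be made concrete with a brute-force search over all visiting orders. This is a sketch for tiny inputs only: it assumes error-free reads, exact suffix-prefix overlaps, and that no read is a substring of another. The names `overlap` and `shortest_superstring` are made up for the example.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(reads):
    """Try every order of the reads ("towns") and keep the shortest result.

    Each order is a path that visits every read exactly once; a large
    overlap between consecutive reads is a short "road". There are n!
    orders, so this is only usable for a handful of reads.
    """
    best = None
    for order in permutations(reads):
        s = order[0]
        for nxt in order[1:]:
            s += nxt[overlap(s, nxt):]
        if best is None or len(s) < len(best):
            best = s
    return best


print(shortest_superstring(["ATTAGACC", "GACCTGCC", "TGCCGGAA"]))
```

Three reads need only 6 orders, but 20 reads already need about 2.4 × 10^18, which is why exhaustive search has to be constrained.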


Sudoku

It is easy to verify if a solution of a Sudoku problem is correct.

If you exhaustively try all possible configurations, you can find the correct solution, surely. This takes very long (exponential) time.

Nobody has (yet) a rigorous argument to convince us that there might not be a better/efficient way to solve Sudoku.

Not all instances are hard: they range from easy, medium and hard to devilishly hard.

But if you try to create a Sudoku puzzle at random, with high probability, it will be easy to solve. Pathologically hard instances can be hard to construct.
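The "easy to verify" half of the argument fits in a few lines. The checker below is a sketch; the grid representation (a list of nine lists of integers) and the example grid, built from a simple shifting pattern, are choices made for this illustration.

```python
def is_valid_sudoku_solution(grid):
    """Verify a completed 9x9 grid: every row, column and box holds 1..9."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6) for c in (0, 3, 6)
    ]
    # 27 groups of 9 cells each: the check takes time linear in the grid size.
    return all(group == digits for group in rows + cols + boxes)


# A valid filled grid generated by shifting each row.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]

# Swapping two cells within a row keeps the row valid but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

print(is_valid_sudoku_solution(solved), is_valid_sudoku_solution(broken))
```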


How to Cope with NP-Completeness

Tell the biologist to think about easier problems.

Come up with a simpler problem that vaguely looks like the original problem. Solve the easy problem, even if it gives the wrong solution. Tell the biologist to learn to live with incomplete or incorrect solutions.

Work with the biologist to cheat. Design experiments and technologies so that they only generate easy instances of a hard problem. Solve them correctly.

Solve the problem by exhaustive search, but learn to constrain the search space intelligently.

Try all of the above (+ kitchen sink)!


Traveling Salesman Problem

A traveling salesman wishes to visit a given number of cities, returning to the starting point in a tour. Each pair of cities incurs a cost proportional to the distance between these cities. The Traveling Salesman Problem (TSP) is to find a tour with minimum cost.

Year  Team                                       Cities
1994  Applegate, Bixby, Chvátal, Cook            7,397
1998  Applegate, Bixby, Chvátal, Cook            13,509 (USA tour)
2001  Applegate, Bixby, Chvátal, Cook            15,112 (D tour)
2004  Applegate, Bixby, Chvátal, Cook, Helsgaun  24,978 (Swe tour)
2006  Applegate, Bixby, Chvátal, Cook, Helsgaun  85,900

Table: TSP Competition.

These five largest instances were solved by Concorde, which is based on branch-and-cut.
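To give a feel for how exhaustive search can be constrained, here is a plain depth-first branch-and-bound for very small instances. It is not Concorde's branch-and-cut, which adds linear-programming relaxations and cutting planes; the pruning rule here only compares the partial tour cost with the best complete tour found so far and assumes non-negative distances. The function name and the 4-city matrix are illustrative.

```python
import math


def tsp_branch_and_bound(dist):
    """Exact TSP on a small distance matrix by depth-first branch and bound."""
    n = len(dist)
    best_cost, best_tour = math.inf, None

    def extend(tour, cost, visited):
        nonlocal best_cost, best_tour
        if cost >= best_cost:
            return  # bound: this partial tour cannot beat the incumbent
        if len(tour) == n:
            total = cost + dist[tour[-1]][tour[0]]  # close the tour
            if total < best_cost:
                best_cost, best_tour = total, tour[:]
            return
        for city in range(n):
            if city not in visited:
                visited.add(city)
                tour.append(city)
                extend(tour, cost + dist[tour[-2]][city], visited)
                tour.pop()
                visited.remove(city)

    extend([0], 0, {0})
    return best_cost, best_tour


dist = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]
print(tsp_branch_and_bound(dist))
```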


Traveling Salesman Problem: World Tour


SAT Solver

Given a Boolean formula (in CNF), determine if, under some truth-assignment, the formula will evaluate to true. This is a classical NP-complete problem.

A DPLL SAT solver employs a systematic backtracking search procedure to explore the (exponentially-sized) space of variable assignments looking for satisfying assignments. The basic search procedure is based on the Davis-Putnam-Logemann-Loveland algorithm (DPLL). Its worst-case running time is exponential: there are formula families on which it provably needs exponentially many steps.
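A bare-bones version of the DPLL search can be sketched as below. It shows only simplification, unit propagation and chronological backtracking; the clause encoding (lists of signed integers, as in the DIMACS convention) and the function name `dpll` are choices made for this example, and none of the refinements of modern solvers are included.

```python
def dpll(clauses, assignment=None):
    """Return a satisfying assignment (dict var -> bool) or None.

    clauses: list of lists of non-zero ints; -3 means NOT x3.
    Variables missing from the returned dict may take either value.
    """
    assignment = dict(assignment or {})

    # Simplify every clause under the current partial assignment.
    simplified = []
    for clause in clauses:
        kept, satisfied = [], False
        for lit in clause:
            var, want = abs(lit), lit > 0
            if var in assignment:
                if assignment[var] == want:
                    satisfied = True
                    break
            else:
                kept.append(lit)
        if satisfied:
            continue
        if not kept:
            return None  # empty clause: conflict, so backtrack
        simplified.append(kept)

    if not simplified:
        return assignment  # every clause is satisfied

    # Unit propagation: a one-literal clause forces its variable.
    for clause in simplified:
        if len(clause) == 1:
            lit = clause[0]
            assignment[abs(lit)] = lit > 0
            return dpll(simplified, assignment)

    # Branch on the first unassigned variable: try True, then False.
    var = abs(simplified[0][0])
    for value in (True, False):
        result = dpll(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None


cnf = [[1, 2], [-1, 3], [-2, -3]]
print(dpll(cnf))
print(dpll([[1], [-1]]))
```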


SAT Solver

Modern SAT solvers come in two flavors: “conflict-driven” and “look-ahead.”

Conflict-driven solvers augment the DPLL search algorithm with efficient conflict analysis, clause learning, non-chronological backtracking, “two-watched-literals” unit propagation, adaptive branching, and random restarts.

Look-ahead solvers have especially strengthened reductions (going beyond unit-clause propagation).

SAT Competitions: The conflict-driven MiniSAT (2005 SAT competition) has only about 600 lines of code. The look-ahead solver march_dl won a prize at the 2007 SAT competition.


Fragments and Overlaps

Fragments :

A set of fragments/reads $F = \{r_1, r_2, \ldots, r_N\}$, s.t. $r_i \in \{A, C, G, T\}^*$.

Each read is represented as a pair of integers $(s_i, e_i)$, $i \in [1, |F|]$, where $1 \le s_i, e_i \le |R|$ and $R$ is the reconstructed string (the order of $s_i$ and $e_i$ encodes the orientation of the read).

Overlaps:

Use the Smith-Waterman algorithm to compute the best alignment between a pair of strings.

[Figure: the two overlap types between reads A and B, Normal and Innie, annotated with the overlap coordinates π.s_A, π.e_A, π.s_B, π.e_B and the overhangs π.hang_A and π.hang_B.]
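For reference, the Smith-Waterman recurrence used to score overlaps can be sketched as follows. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example, not parameters from the talk, and only the best local score is returned, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The suffix GACC of the first read matches the prefix GACC of the second.
print(smith_waterman("ATTAGACC", "GACCTGCC"))
```

An assembler would also need the traceback, to recover where the alignment starts and ends in each read and hence the hang values.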


Layout Representation

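Let us define the layout $L$ associated with a set of reads $F = \{r_1, r_2, \ldots, r_N\}$ as follows:

$$L = r_1 \overset{\pi_1}{\rightleftharpoons} r_2 \overset{\pi_2}{\rightleftharpoons} r_3 \overset{\pi_3}{\rightleftharpoons} \cdots \overset{\pi_{N-1}}{\rightleftharpoons} r_N \qquad (1)$$

where there are no containments (contained reads can be initially removed and then added later after the layout has been created).

Definition (Consistency Property)

A layout $L$ is consistent if the following property holds for $i = 2, \ldots, N-1$:

$$\overset{\pi_{i-1}}{\rightleftharpoons} r_i \overset{\pi_i}{\rightleftharpoons} \quad \text{iff} \quad \mathrm{suffix}_{\pi_{i-1}}(r_i) \neq \mathrm{suffix}_{\pi_i}(r_i) \qquad (2)$$

The estimated start positions for each fragment are given by:

$$sp_1 = 1, \qquad sp_i = sp_{i-1} + \pi_{i-1}.\mathrm{hang}_{r_{i-1}} \quad \text{if } i > 1 \qquad (3)$$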
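The start-position recurrence (3) is a running sum of the overhangs. A minimal sketch, where the list `hangs` stands in for the values π_1.hang, ..., π_{N-1}.hang (a representation chosen for this example):

```python
def start_positions(hangs):
    """Estimated start positions sp_1..sp_N from the N-1 overhangs.

    hangs[i] is the overhang of overlap pi_(i+1): the number of bases of
    read r_(i+1) that precede the start of read r_(i+2).
    """
    positions = [1]  # sp_1 = 1
    for hang in hangs:
        positions.append(positions[-1] + hang)  # sp_i = sp_(i-1) + hang
    return positions


# Three reads of length 8, each overlapping the next by 4 bases: hang = 4.
print(start_positions([4, 4]))
```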


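Layout Representation (Example)

Layout for a set of fragments $F = \{A, B, C, D, E, F, G\}$ with a sequence of overlaps $\pi^{N}_{(A,B)}, \pi^{I}_{(B,C)}, \pi^{N}_{(C,D)}, \pi^{I}_{(D,E)}, \pi^{N}_{(E,F)}, \pi^{N}_{(F,G)}$.

[Figure: reads A through G laid out along the reconstruction, each marked with its estimated start position sp_A, sp_B, ..., sp_G.]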


Definition (Sequence Assembly Problem)

Given a collection of fragments/reads $F = \{r_i\}_{i=1}^{N}$ and a tolerance level (error rate) $\epsilon$, find a reconstruction $R$ whose layout $L$ is $\epsilon$-valid, consistent, and such that the following set of properties (oracles) is satisfied:

Overlap-Constraint (O): The cumulative overlap score of the layout is optimized.

NP-completeness of this formulation becomes a serious issue.


Problems Related to Genome Structure: Repeats

If we look for a reconstruction of minimum length, the reconstructed string can have many errors due to repeats.

[Figure: "Correct Assembly" with segments A, R1, B, R2, C, and "Mis-assembly" with segments A, R1+R2, C and B.]


Screwed!

Unless P = NP, this incorrect formulation (SCSP) results in an intractable problem... If we can solve SCSP, we can also solve other hard problems like the Traveling Salesman Problem (TSP), Satisfiability (SAT), the Quadratic Assignment Problem (QAP), etc.

“The shortest superstring problem, an elegant but flawed abstraction: [since it defines assembly problem as finding] a shortest string containing a set of given strings as substrings.”

The Role of Algorithmic Research in Computational Genomics, Richard M. Karp

IEEE Computer Society Bioinformatics Conference, August 14, 2003


Toxic Assembly Recovery Program (TARP)

“Of particular interest are the relative rates of misassembly (sequence assembled in the wrong order and/or orientation) and the relative coverage achieved by the three protocols.

“Unfortunately the UCSC group were alone in having published assessments of the rate of misassembly in the contigs they produced. Using artificial data sets, they found that, on average 10 per cent of assembled fragments were assigned the wrong orientation and 15 per cent of fragments were placed in wrong order by their protocol (Kent and Haussler, 2001).

“Two independent assessments [more recent] of UCSC assemblies have come to the similar conclusions.”

Colin A.M. Semple, 2007


Genomic Dark Matter

Quality of the human genome sequence?

Most of the reference sequences are genotypic (i.e., non-haplotypic) and lack long range information (i.e., fail to characterize rearrangements, duplications, inversions and translocations).

Resequencing reveals about 30% of the resequenced reads not aligning to the reference sequences: Genomic Dark Matter.

Optical Mapping, a high-throughput, high-resolution single-molecule system, reveals many previously unknown genome structural variants not captured in the reference sequences (Teague et al.)

Genome-Wide Association Studies based on the currently available genomic data have proven inadequate in explaining common diseases.


Dude, Where is my genome??


Hooplas, Hypes, Haplotypes

Whole-Genome Human Optical Map (above) constructed by our Gentig algorithm (Anantharaman, Mishra, Schwartz, 1999).


Greedy Strategy(TIGR 1995, Phrap 1996, CAP3 1999)

Pick the highest scoring overlap.Merge the two fragments (add this new sequence to the pool of sequences).Heuristically correct regions of the overlay in some plausible manner (wheneverpossible).Regions that do not yield to these error-correction heuristics are abandoned asirrecoverable and shown as gaps.Repeat until no more merges can be done.

[Figure: three-step greedy merge of fragments A, B, C and D, with overlap scores 130, 75 and 35.]


Overlap-Layout-Consensus (CELERA 2000, Minimus 2007)

Idea: Construct a graph in which nodes represent reads and edges indicate overlaps.

Goal: Need to solve for a Hamiltonian path!

Strategy:

Remove contained and transitivity edges.
Collapse "unique connector" overlaps (chordal subgraph with no conflicting edges).
Use mate-pairs to connect and order the contigs.

[Figure: example overlap graphs on reads 1 to 7.]


Sequencing by Hybridization (EULER 2001, Velvet 2008)

Idea: Break the reads into overlapping n-mers (an n-mer is a substring of length n). Build a DeBruijn graph in which each edge is an n-mer and the source and destination nodes are respectively the n − 1 prefix and n − 1 suffix of the corresponding n-mer.

Goal: find a path that uses all the edges (an Eulerian path) → linear time algorithm (however actual performance similar to the overlap-layout-consensus approach).

Problem: Errors in the data can introduce many erroneous edges!


[Figure: DeBruijn graph on the nodes AA, AC, CA, CG and GC.]

DeBruijn graph for the list L = {AAA, AAC, ACA, CAC, CAA, CGC, GCG}. The Euler path is: AC → CA → AC → CG → GC → CA → AA → AC
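A minimal Python sketch of the construction follows: build the graph from 3-mers, then walk every edge once with Hierholzer's algorithm. As an assumption for this illustration, the input is the set of 3-mers spelled by consecutive nodes of the Euler path quoted above (AC → CA gives ACA, and so on), since those are the edges that path actually traverses. The function names are hypothetical and are not taken from EULER or Velvet.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path or circuit exists."""
    out_edges = {u: list(vs) for u, vs in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, vs in out_edges.items():
        balance[u] += len(vs)
        for v in vs:
            balance[v] -= 1
    # Start at the node with one surplus outgoing edge, or anywhere on a circuit.
    start = next((u for u, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node while backtracking
    return path[::-1]


# 3-mers read off the quoted path (an assumption about the intended input).
KMERS = ["ACA", "CAC", "ACG", "CGC", "GCA", "CAA", "AAC"]
path = eulerian_path(de_bruijn(KMERS))
sequence = path[0] + "".join(node[-1] for node in path[1:])
```

Each edge is visited a constant number of times, which is where the linear-time claim comes from. A single erroneous n-mer adds an edge that can change the node balances and break the traversal, which is the problem the slide warns about.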


Next Generation

Greedy algorithm is competitive (within a factor of four).

Let G be a weighted directed graph induced by the nodes representing the reads and edges representing the overlaps (with their weights determined by overhangs).

CYC(G) ≤ MWHC(G) ≤ Opt(S),

An optimal Cycle Cover has a smaller weight than that of a minimal weight Hamiltonian Cycle of the graph.

Furthermore, if T is the solution of the greedy algorithm then

Opt(S) ≤ |T| ≤ 3CYC(G) + Opt(S) ≤ 4Opt(S).

Note: if the genome is completely random, then |T| = (1 + o(1))Opt(S).
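The factor of four in the greedy bound follows by substituting the cycle-cover inequality into the middle term; written out with the symbols used on this slide:

```latex
\begin{align*}
|T| &\le 3\,\mathrm{CYC}(G) + \mathrm{Opt}(S) \\
    &\le 3\,\mathrm{Opt}(S) + \mathrm{Opt}(S)
       && \text{since } \mathrm{CYC}(G) \le \mathrm{MWHC}(G) \le \mathrm{Opt}(S) \\
    &= 4\,\mathrm{Opt}(S).
\end{align*}
```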

Try to create better Cycle Cover. . . As few contigs as one can.


Higher Coverage ... Shorter Reads

Massively parallel sequencing platforms such as:

Illumina, Inc. Genome Analyzer,

Applied Biosystems SOLiD System,

454 Life Sciences (Roche) GS FLX, and

IonTorrent Sequencer

Features:

Typical read size 35-500 bp. Poorer quality base-calls.

Very high coverage (up to 200X).

Need to assemble millions of reads!


Raw Coverage vs. Effective Coverage

Raw coverage depth $c = LN/G$, where

G = Genome length (in bp).
L = Average length of a fragment.
N = Number of fragments.
K = Number of base pairs two fragments must have in common to ensure their overlap (overlap parameter).

Effective coverage

$c_e = N(L - K)/G$

S. aureus:
Raw coverage $c = LN/G$ = 48X
Effective coverage $c_e = N(L - K)/G$ = 14X
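The two formulas are easy to evaluate directly. The sketch below uses the Brucella suis figures from the benchmark table later in this deck (G = 3,315,173 bp, N = 36,276 reads, L = 895.8 bp); the overlap parameter K = 40 is a purely hypothetical value chosen for illustration, since the deck does not state the K used.

```python
def raw_coverage(n_reads: int, avg_len: float, genome_len: int) -> float:
    """c = L * N / G"""
    return n_reads * avg_len / genome_len


def effective_coverage(n_reads: int, avg_len: float, genome_len: int,
                       min_overlap: float) -> float:
    """c_e = N * (L - K) / G"""
    return n_reads * (avg_len - min_overlap) / genome_len


G, N, L = 3_315_173, 36_276, 895.8  # Brucella suis benchmark figures
K = 40                              # hypothetical overlap parameter

c = raw_coverage(N, L, G)
ce = effective_coverage(N, L, G, K)
print(f"raw coverage = {c:.1f}X, effective coverage = {ce:.1f}X")
```

Since $c_e = c\,(1 - K/L)$, the gap between raw and effective coverage grows as the required overlap K approaches the read length, which is why short reads lose so much of their nominal coverage.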


Higher Effective Coverage

As the read-length increases, the needed overlap ratio parameter gets smaller.

Even better: shorter repeats and haplotype ambiguities become less of a problem.

However, in order to get better quality in base calling, it becomes necessary to develop single-molecule methods that can detect chemistry at single base resolution.

Cost of sample preparation and high resolution sensing!
How about low-cost low resolution approaches:

Optical Mapping (Restn. or Nicking enz. or probes)
Mate-pairs (with multiple-length clones)
Strobed sequencing
Dilution


Optical Maps

Whole-Genome Optical Map: Ordered Restriction Map; Markers are restriction sites; usually represented by an ordered sequence of restriction fragment lengths.

Statistical algorithms (e.g., Gentig [AMS-1999] and Haptig [AM-2005]) construct an accurate consensus map, even if the raw input suffers from many corrupting error processes (e.g., sizing, partial digestion, desorption, false-cuts, etc.)

Gentig and Haptig integrate nicely with the SUTTA assembler in a technology-agnostic manner.


Wicked Problem

The Sequence Assembly problem is an NP-hard combinatorial optimization problem.

The Sequence Assembly problem is claimed to have been successfully (but approximately) solved using greedy and heuristic methods; the greedy approaches exhibit many limitations and low flexibility.

“Fast” Brute-Force global optimization of the sequence assembly problem is possible!

SUTTA outperforms many assembly algorithms on bacterial genomes.

SUTTA has the potential to assemble haplotypic whole-genome sequences.

SUTTA is technology-agnostic: if the sequencing technology changes, just change the score function.


Toxic Assembly Recovery Program: Reformulate

Definition (Sequence Assembly Problem)

Given a collection of fragment/reads $F = \{r_i\}_{i=1}^{N}$ and a tolerance level (error rate) ε, find a reconstruction R whose layout L is ε-valid, consistent and such that the following set of properties (oracles) are satisfied:

Overlap-Constraint (O): The cumulative overlap score of the layout is optimized.

Local-Read-Distribution-Constraint (R): The observed distribution of fragment read start points, $D_{obs}$, has the minimum deviation from the source distribution $D_{src}$.

Mate-Pair-Constraint (M): The distance between mate-pairs is consistent.

Optical-Map-Constraint (OM): The observed distribution of restriction enzyme sites, $C_{obs}$, is consistent with the distribution of experimental optical map data $C_{src}$.

Constrained Optimization: We suggest an approach that can combine and use all the oracles while searching the optimal layout.


NP-Easy De Novo Genome Assembly

SUTTA’s approach

1 Could potentially lead to an exhaustive search over all possible overlays;
2 Tames the computational complexity through a constrained search (Branch-and-Bound) by identifying implausible overlays quickly;
3 Uses a score-function (oracle) combining different structural properties (e.g., transitivity, coverage, physical maps, etc).


SUTTA: Illustration

[Figure: double tree rooted at the start node, the best path through it, and the resulting reads layout.]

First generate LEFT and RIGHT trees for the start read.
Next, the best LEFT path is concatenated with the root and the best RIGHT path to create a globally optimal contig.


SUTTA: Pseudocode

Algorithm 2: SUTTA - pseudo code
Input: Set of N reads
Output: Set of contigs

F := ∅;                                /* Forest of D-trees */
C := ∅;                                /* Set of contigs */
B := ⋃_{i=1}^{N} {r_i};                /* All the available reads */
while (B ≠ ∅) do
    r := B.getNextRead();
    if (¬isUsed(r) && ¬isContained(r)) then
        DT := create_double_tree(r);
        F := F ∪ {DT};
        Contig CTG := create_contig(DT);
        C := C ∪ {CTG};
        CTG.layout();                  /* Compute contig layout */
        B := B \ {CTG.reads};          /* Remove used reads */
    else
        /* jump to next available read */
    end
end
return C;
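The control flow of the outer loop can be sketched in Python as follows. This is a toy under stated assumptions, not the SUTTA implementation: the set B is replaced by a single pass over the reads plus a `used` set, and the double-tree search is abstracted into a hypothetical `build_contig` callback. The `chain_right` helper is likewise invented for the example and only extends to the right, whereas the double tree extends in both directions.

```python
def assemble(reads, is_contained, build_contig):
    """Outer loop: seed a contig from every read not yet used or contained."""
    contigs = []
    used = set()
    for r in reads:
        if r in used or is_contained(r):
            continue  # jump to next available read
        contig = build_contig(r, used)  # stands in for double tree + best path
        contigs.append(contig)
        used.update(contig)             # remove used reads from consideration
    return contigs


def chain_right(successor):
    """Toy contig builder: follow a fixed 'best right neighbour' map."""
    def build(start, used):
        contig = [start]
        nxt = successor.get(start)
        while nxt is not None and nxt not in used and nxt not in contig:
            contig.append(nxt)
            nxt = successor.get(nxt)
        return contig
    return build


toy_successor = {"a": "b", "b": "c", "d": "e"}
result = assemble(["a", "b", "c", "d", "e"],
                  is_contained=lambda r: False,
                  build_contig=chain_right(toy_successor))
```

Keeping the contig builder behind a callback mirrors the design point made elsewhere in the deck: the outer loop stays the same while the search and its score function change.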


Node expansion (High-level Description)

1 Start with a random read (it will be the root of a tree; use only a read that has not been "used" in a contig yet and that is not "contained").

2 Create RIGHT Tree: Start with an unexplored leaf node (a read) with the best score-value; choose all its non-contained "right"-overlapping reads and expand the node by making them its children; compute their scores. (Add the "contained" nodes along the way, while including them in the computed scores; check that no read occurs repeatedly along any path of the tree.) STOP when the tree cannot be expanded any further.

3 Create LEFT Tree: Symmetric to previous step.


Node expansion (Branch-and-Bound)

Algorithm 3: Node expansion
Input: Start read r0, max queue size K, percentage T of top ranking solutions
Output: Best scoring leaf

T := ∅;                                /* Set of leaves */
L := {(r0, g(r0))};                    /* Live nodes (priority queue) */
while (L ≠ ∅) do
    L := Prune(L, K, T);               /* Prune the queue */
    ri := L.getNext();                 /* Get the best scoring node */
    L := L \ {ri};
    if (no reads align with ri) then
        T := T ∪ {ri};                 /* ri is a leaf */
    else
        Add contained reads to ri;
        /* Branch on ri generating ri1, ri2, ..., riM */
        for (j = 1 to M) do
            L := L ∪ {(rij, g(rij))};
        end
    end
end
return max_{ri ∈ T} {g(ri)};
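A small Python sketch of this best-first expansion with a bounded queue is given below. It makes several simplifying assumptions that are not in the slides: the score g of a node is the sum of overlap scores along its path, pruning keeps only the `max_queue` best live nodes (the percentage parameter is ignored), contained reads are not handled, and the `overlaps` dictionary is a hypothetical precomputed overlap table.

```python
import heapq


def expand(start, overlaps, max_queue=1000):
    """Best-first expansion; returns (score, path) of the best scoring leaf.

    overlaps maps a read to a list of (right_neighbour, overlap_score) pairs.
    """
    live = [(0.0, [start])]  # min-heap on negated score, so best node pops first
    best_score, best_path = float("-inf"), None
    while live:
        if len(live) > max_queue:  # prune the queue to its best entries
            live = heapq.nsmallest(max_queue, live)
            heapq.heapify(live)
        neg_score, path = heapq.heappop(live)
        # A read may not occur twice along the same path.
        children = [(nxt, s) for nxt, s in overlaps.get(path[-1], [])
                    if nxt not in path]
        if not children:  # nothing aligns: this node is a leaf
            if -neg_score > best_score:
                best_score, best_path = -neg_score, path
            continue
        for nxt, s in children:  # branch on the current node
            heapq.heappush(live, (neg_score - s, path + [nxt]))
    return best_score, best_path


toy_overlaps = {
    "r0": [("r1", 5), ("r2", 3)],
    "r1": [("r3", 1)],
    "r2": [("r3", 6)],
    "r3": [],
}
score, path = expand("r0", toy_overlaps)
```

In the toy input the locally best first step (r0 to r1, score 5) leads to a worse total than the alternative through r2, so the search returns the path a purely greedy choice would miss.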


Overlap Score (Weighted transitivity)

Idea: if read A overlaps read B, and read B overlaps read C, we will score those overlaps strongly if in addition A and C also overlap. This implicitly assumes that the coverage is higher than 3.

$$\text{if } \big(\pi(A,B) \wedge \pi(B,C)\big) \text{ then } \big\{\, S_\pi(A,B,C) = S_\pi(A,B) + S_\pi(B,C) + \big(\pi(A,C)\ ?\ S_\pi(A,C) : 0\big) \,\big\} \qquad (4)$$

[Figure: layout of reads A, B and C with start and end coordinates sA, eA, sB, eB, sC, eC and the hang of B relative to A.]

A simple generalization for higher coverage is obvious.

This score cannot resolve repeats or haplotypic variations. Solution: augment the score with information for mate-pairs distances or optical map alignment to put an appropriate reward/penalty term.
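Equation (4) translates almost literally into code. In the sketch below the predicate π(X, Y) is modelled, as an assumption, by the presence of the pair (X, Y) in a dictionary of detected overlaps, and the scores themselves are hypothetical numbers.

```python
def transitivity_score(a, b, c, overlap_score):
    """Score of the triple (a, b, c) per equation (4).

    overlap_score maps (x, y) to the score of a detected overlap x -> y.
    Returns None when a-b or b-c do not overlap (the 'if' guard fails).
    """
    if (a, b) not in overlap_score or (b, c) not in overlap_score:
        return None
    # The a-c term is added only when that overlap was also detected.
    return (overlap_score[(a, b)]
            + overlap_score[(b, c)]
            + overlap_score.get((a, c), 0))


scores = {("A", "B"): 130, ("B", "C"): 75, ("A", "C"): 35}  # hypothetical values
with_transitive = transitivity_score("A", "B", "C", scores)
without_transitive = transitivity_score(
    "A", "B", "C", {("A", "B"): 130, ("B", "C"): 75})
```

The triple scores higher when the transitive overlap A to C is present, so layouts that are mutually consistent are preferred over chains of pairwise overlaps alone.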


Strategy for selecting next sub-problem

Problem: trade-off between keeping the number of explored nodes in the search tree low and staying within the memory capacity.

Best First Search (BeFS): always select among the live subproblems the one with best score.
1 Memory problems can arise since this strategy behaves similarly to a Breadth First Search (BFS).
2 Checking repeated nodes in a branch is computationally expensive (linear time).
3 Theoretically superior: whenever a node is chosen for expansion, a best-score path to that node has been found.

Depth First Search (DFS): always select among the live subproblems the one with largest level in the search tree.
1 Memory requirements are bounded by depth × branching.
2 Use depth-first search interval schemes to check if a read occurs repeatedly along a path (constant time).

Solution: combined strategy, DFS + BeFS.


Transitivity pruning

Observation: do not waste time expanding nodes that create suffix-paths of a previously created path.
Idea: delay expansion of the "last" node/read involved in a transitivity relation.

[Figure: read A with right-overlapping reads B1, B2, B3, ..., Bn at hangs h1, h2, h3, ..., hn, shown as a layout and as a tree.]

The expansion of nodes B2, B3, ..., Bn can be delayed because their overlap with read A is enforced by read B1 (h1 ≤ h2 ≤ · · · ≤ hn).
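One way to read this rule is sketched below: sort the children of A by hang and delay any child that is itself right-overlapped by a child with a smaller hang. This is an interpretation for illustration only; the function name, the `(read, hang)` tuples and the `overlaps` set are hypothetical, and the slides do not specify how SUTTA stores these relations.

```python
def split_children(children, overlaps):
    """Partition the right-overlapping reads of A into (expand_now, delayed).

    children: list of (read, hang) pairs for reads overlapping A.
    overlaps: set of (x, y) pairs meaning read x right-overlaps read y.
    """
    expand_now, delayed = [], []
    for read, _hang in sorted(children, key=lambda c: c[1]):
        # A read reachable through a smaller-hang sibling closes a transitivity
        # relation, so expanding it from A would only rebuild a suffix-path.
        if any((earlier, read) in overlaps for earlier in expand_now + delayed):
            delayed.append(read)
        else:
            expand_now.append(read)
    return expand_now, delayed


now, later = split_children(
    [("B2", 20), ("B1", 10), ("B3", 30)],
    {("B1", "B2"), ("B2", "B3")},
)
```

When two children do not overlap each other, as at a repeat boundary, neither is delayed and both branches stay live.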


Lookahead: Using Long Range Information

Scenario: A potential repeat boundary between reads A, B and C. Read A overlaps both reads B and C, but B and C do not overlap each other.
Observation: No decision can be made at this point on which read to keep/prune.
Idea: Choose between reads B and C based on how well the mate-pairs in their subtrees satisfy the length constraints.

[Figure: start node A at the boundary of repeat R, with the two candidate branches B and C.]


Staphylococcus Epidermidis - 2,616,530 bp (SUTTA DotPlot)

[Figure: two dot plots of SUTTA contigs (QRY) against the S. epidermidis RP62A chromosome and plasmid (REF); left plot: no lookahead; right plot: with lookahead.]

Num. of reads: 60,761; Avg read length: 900.2; Coverage: 19.9X


Hard instances (SUTTA DotPlot)

Chromosome 22 - 5 Mbp (simulated reads)

[Figure: dot plot of SUTTA contigs (QRY) against chromosome 22, positions 0 to 5,000,000.]

Num. of reads: 62,542; Avg read length: 799.5; Coverage: 10X




Benchmark data

Genome             Length (bp)  # reads  Avg. read length (bp)  Coverage
Brucella S.        3,315,173    36,276   895.8                  9.8
Wolbachia Sp.      1,267,782    26,817   981.9                  20.7
Staphylococcus E.  2,616,530    60,761   900.2                  19.9

Table: Bacteria benchmark data.

These bacteria have been sequenced and fully finished at TIGR, and all the sequencing reads generated for these projects are publicly available at both the NCBI Trace Archive and the CBCB website¹.

¹ www.cbcb.umd.edu/research/benchmark.shtml


Genome                      Assembler  # contigs  # big contigs  Max contig  Mean big contig  N50²   Big contig
                                                  (>10 kbp)      size (kbp)  size (kbp)       (kbp)  coverage (%)
Brucella Suis               Minimus    203        101            89          30               32     93.1
                            TIGR       108        67             182         48               57     98.8
                            CAP3       1321       35             18          12               4      12.7
                            Euler      238        108            82          25               25     82.2
                            Phrap      54         23             434         126              199    103.2
                            SUTTAc     73         53             268         62               79     99.2
                            SUTTAa     73         45             396         72               98     98.4
Wolbachia Sp.               Minimus    1545       37             16          13               2      40.7
                            TIGR       1080       46             46          20               5      73.6
                            CAP3       1661       1              10          10               2      0.7
                            Euler      615        0              6           0                1      0
                            Phrap      2253       55             64          22               1.8    98.5
                            SUTTAc     1089       39             87          26               6      80.8
                            SUTTAa     1068       27             181         39               6      83.5
Staphylococcus Epidermidis  Minimus    425        86             119         10               19     80.7
                            TIGR       94         38             230         68               100    99.8
                            CAP3       1219       39             21          13               5      20.2
                            Euler      116        54             149         44               55     91.5
                            Phrap      86         22             357         123              183    103.9
                            SUTTAc     65         33             268         78               98     98.7
                            SUTTAa     64         24             756         108              148    99.1

² N50 = length Lc of the largest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.
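The N50 definition in the footnote can be computed with a single pass over the contig lengths sorted from longest to shortest. The function name and the sample lengths below are illustrative only and are not taken from the benchmark assemblies.

```python
def n50(lengths):
    """Largest length Lc such that contigs of length >= Lc cover at least
    50% of the total assembled length; returns 0 for an empty assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length
    return 0


example = n50([80, 70, 50, 40, 30, 20])  # total 290; 80 + 70 = 150 >= 145
```

N50 rewards a few long contigs over many short ones, but it says nothing about correctness, which is why the table reports it alongside big-contig coverage and the deck also shows dot plots.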


Brucella Suis - 2 chromosomes of 2,107,792 and 1,207,381 bp (Minimus DotPlot)

Minimus’s conservative strategy fails to create long contigs.

[Figure: dot plot of Minimus contigs (QRY) against the reference sequences AE014291 and AE014292 (REF).]

Num. of reads: 36,276; Avg read length: 895.8; Coverage: 9.8X


Brucella Suis - 2 chromosomes of 2,107,792 and 1,207,381 bp (Phrap DotPlot)

Phrap’s aggressive strategy creates many mis-assemblies.

[Figure: dot plot of Phrap contigs brucella.seq.Contig1 to brucella.seq.Contig54 (QRY) against the reference sequences AE014291 and AE014292 (REF).]

Num. of reads: 36,276; Avg read length: 895.8; Coverage: 9.8X


Brucella Suis - 2 chromosomes of 2,107,792 and 1,207,381 bp (SUTTA DotPlot)

[Figure: dot plot of SUTTA contigs (QRY) against the reference sequences AE014291 and AE014292 (REF).]

Num. of reads: 36,276; Avg read length: 895.8; Coverage: 9.8X


Staphylococcus Epidermidis - 2,616,530 bp (TIGR DotPlot)

TIGR’s greedy strategy fails to join some of the contigs and produces a few mis-assemblies.

[Figure: dot plot of TIGR contigs and unassembled reads (QRY) against the S. epidermidis reference (REF).]
_epidermidis_307GSFDH17TRGSEJO26TRGSFDH09TRGSFDH20TFstaphylococcus_epidermidis_166GSEJL84TRGSFDH12TFGSFGF01TFGSEAL02TRGSEAI60TRGSFDH20TRGSEHT62TRGSEIK82TRGSFDK71TRGSFDK63TRstaphylococcus_epidermidis_240GSFDK39TRGSEFC79TFGSEDD62TFGSFDH92TFGSEGN37TFGSEDO46TRGSFDH84TFGSFDH76TFGSEHS02TFGSEJO85TFGSFDH68TFGSEFW41TRGSEAM68TFGSEGK79TFGSEGN61TRGSEAI02TFGSEHH07TRGSEAT02TRGSEAZ11TRGSFDH92TRGSFDH84TRGSFDH76TRGSFDH68TRGSEIQ17TFGSFDH71TFGSFFF86TRGSFDH63TFGSFFW41TRBGSFDH55TFGSEDX93TRGSFGH69TFGSFDZ66TFGSFDH47TFGSEFT38TFstaphylococcus_epidermidis_12GSEJV96TRGSFDH39TFstaphylococcus_epidermidis_13GSEBY60TRstaphylococcus_epidermidis_14GSFFN06TRstaphylococcus_epidermidis_15GSFFT13TRBstaphylococcus_epidermidis_16GSFDH71TRstaphylococcus_epidermidis_17GSFDH63TRGSFDH55TRGSEEO76TFGSFDH47TRGSFDH39TRGSEKE45TFGSFDH50TFGSFDH42TFGSFDH34TFGSEGV61TRGSEJH40TRGSFGH48TFGSFDH26TFGSFDH18TFGSEGE28TFGSEGV45TRGSEJR01TRGSFDH50TRGSEIQ89TFGSFDH42TRGSEJU60TRGSFEW82TRGSFDH34TRGSEKE40TFGSFDI28TFGSFFA53TFGSFDH26TRGSEIP74TFGSFDH18TRstaphylococcus_epidermidis_22GSFGH51TFstaphylococcus_epidermidis_23staphylococcus_epidermidis_264staphylococcus_epidermidis_24GSFDH21TFGSEDW09TRGSFDH13TF

S_epidermidis_RP62A_chromsome

S_epidermidis_RP62A_plasmid

QRY

REF

Num. of reads: 60, 761; Avg read length: 900.2; Coverage: 19.9X


Staphylococcus epidermidis - 2,616,530 bp (Phrap DotPlot)

Phrap’s greedy strategy fails to join some of the contigs and produces many mis-assemblies.

[Figure: dot plot of the Phrap contigs (QRY) against the reference (REF): S_epidermidis_RP62A_chromosome and S_epidermidis_RP62A_plasmid.]

Num. of reads: 60,761; Avg read length: 900.2; Coverage: 19.9X


Staphylococcus epidermidis - 2,616,530 bp (SUTTA DotPlot)

[Figure: dot plot of the SUTTA contigs (QRY) against the reference (REF): S_epidermidis_RP62A_chromosome and S_epidermidis_RP62A_plasmid.]

Num. of reads: 60,761; Avg read length: 900.2; Coverage: 19.9X


Feature-Response curve (motivation)

Too often assemblies have been judged only by contig size, with larger contigs preferred without regard to quality. A new and more reliable metric needs to be devised.

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-Response curve characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features/errors).

Features include:
(M) mate-pair orientations and separations,
(K) repeat content by k-mer analysis,
(C) depth-of-coverage,
(P) correlated polymorphism in the read alignments, and
(B) read alignment breakpoints to identify structurally suspicious regions of the assembly.


Feature-Response curve (computation)

For a fixed feature threshold φ, the contigs are sorted by size and, starting from the longest, only those contigs are tallied if their sum of features is ≤ φ. For this set of contigs, the corresponding genome coverage is computed, leading to a single point of the Feature-Response curve.

[Figure: Feature-Response curve comparison on Brucella suis (no mate-pairs). Curves: SUTTA, Minimus, Phrap, TIGR, CAP3. x-axis: feature threshold; y-axis: coverage (%).]

[Figure: Feature-Response curve comparison on Staphylococcus epidermidis. Curves: SUTTA, TIGR, Arachne. x-axis: feature threshold; y-axis: coverage (%).]
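The computation of one curve point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(length, features)` tuple layout and the function names are hypothetical, and "sum of features" is read here as the running total over the contigs taken so far, so the tally stops at the first contig that would push the total above φ.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout: (contig length in bp, number of features on the contig).
Contig = Tuple[int, int]


def feature_response_point(contigs: Iterable[Contig],
                           genome_length: int,
                           phi: int) -> float:
    """Coverage (%) reached by the longest contigs whose running
    feature total stays <= phi (assumed reading of the slide)."""
    total_features = 0
    covered = 0
    # Sort by size, longest first, as described on the slide.
    for length, features in sorted(contigs, key=lambda c: c[0], reverse=True):
        if total_features + features > phi:
            break  # assumption: stop at the first contig that exceeds phi
        total_features += features
        covered += length
    # Coverage is total contig length over genome length, so it may exceed 100%.
    return 100.0 * covered / genome_length


def feature_response_curve(contigs: List[Contig],
                           genome_length: int,
                           thresholds: Iterable[int]) -> List[Tuple[int, float]]:
    """One (phi, coverage) point per threshold."""
    return [(phi, feature_response_point(contigs, genome_length, phi))
            for phi in thresholds]


if __name__ == "__main__":
    toy = [(300, 1), (500, 3), (200, 0)]  # made-up contigs for illustration
    print(feature_response_curve(toy, 1000, [0, 3, 4]))
```

A steeper curve means the assembler reaches high coverage while accumulating few suspicious features.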


Wicked Problem

The Sequence Assembly problem is an NP-hard combinatorial optimization problem.

The Sequence Assembly problem is claimed to have been successfully solved using greedy and heuristic methods; the greedy approaches exhibit many limitations and low flexibility.

“Fast” Brute-Force global optimization of the sequence assembly problem is possible!

SUTTA outperforms many assembly algorithms on bacterial genomes.

SUTTA has the potential to assemble haplotypic whole-genome sequences.

SUTTA is technology-agnostic: if the sequencing technology changes, just change the score function.


Short-Read Overlapper

Idea: use exact matching

allowing approximate matching (using dynamic programming) would significantly increase the number of nonspecific spurious overlaps (sequencing errors)

drastically faster than approximate matching

Implementation: trie data structure (prefix tree)

index the non-redundant read data set by a prefix tree (both forward and reverse complement).

find overlaps by simple in-order traversal of the tree.
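A minimal sketch of the exact-match idea, assuming reads over the alphabet ACGT. It is not SUTTA's overlapper: all names are hypothetical, and each suffix is looked up by walking the trie from the root rather than by the in-order traversal mentioned above.

```python
from typing import Dict, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


class TrieNode:
    __slots__ = ("children", "read_ids")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.read_ids: List[Tuple[str, str]] = []  # reads whose prefix ends here


def build_trie(reads: Dict[Tuple[str, str], str]) -> TrieNode:
    """Index every read by all of its prefixes."""
    root = TrieNode()
    for rid, seq in reads.items():
        node = root
        for ch in seq:
            node = node.children.setdefault(ch, TrieNode())
            node.read_ids.append(rid)
    return root


def exact_overlaps(reads: Dict[str, str], min_overlap: int
                   ) -> Set[Tuple[Tuple[str, str], Tuple[str, str], int]]:
    """Return (a, b, n): the last n bases of a equal the first n bases of b.
    Reads are keyed as (name, strand), with '+' forward and '-' reverse complement."""
    stranded: Dict[Tuple[str, str], str] = {}
    for name, seq in reads.items():
        stranded[(name, "+")] = seq
        stranded[(name, "-")] = reverse_complement(seq)
    root = build_trie(stranded)

    found = set()
    for a, seq in stranded.items():
        # Try every proper suffix of length >= min_overlap.
        for start in range(1, len(seq) - min_overlap + 1):
            node = root
            for ch in seq[start:]:
                node = node.children.get(ch)
                if node is None:
                    break  # no read begins with this suffix
            else:
                for b in node.read_ids:
                    if b[0] != a[0]:  # skip a read overlapping its own strands
                        found.add((a, b, len(seq) - start))
    return found


if __name__ == "__main__":
    print(exact_overlaps({"r1": "ACGTAC", "r2": "TACGGA"}, min_overlap=3))
```

Storing read identifiers at every node keeps the lookup simple at the cost of memory proportional to the total read length; a production overlapper would likely compress this.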


Short-Read Overlapper (illustration)

[Figure: prefix tree over the set of overlapping reads, showing reads A and B with labels i, j and K.]


Comparison of assemblies (short read data)

Only contigs ≥ 100 bp.
Correct contig (S. aureus): aligned along its whole length with at least 98% base similarity.
Correct contig (E. coli): fewer than 5 consecutive base mismatches at the termini and at least 95% base similarity.

Genome        Assembler          # correct  # misassembled (mean)  N50 (kbp)  Mean (kbp)  Max (kbp)  Coverage (%)

S. aureus     SUTTA                    998  11                           6.0        2.6        22.8       97
              Edena (strict)          1122  0                            6.0        2.6        25.7       98
              Edena (nonstrict)        733  14                           9.4        3.7        51.8       97
              Velvet                  1093  2                            5.4        2.5        22.9       98
              SSAKE                   2334  99                           2.0        1.2        12.6       97
              SHARCGS                 3632  3                            1.2        0.76        8.6       97

E. coli       SUTTA                    423  7 (18.8)                    22.7       10.2        84.5       98
(K12 MG1655)  ABySS                    220  13 (33.2)                   45.3       20.2       173.8       99
              Edena (strict)           674  6 (13.2)                    16.4        6.6        67.1       99
              Velvet                   277  9 (52.3)                    54.3       15.9       164.2       98
              SSAKE                    893  38 (5.8)                    11.4        4.9        50.6       99
              EULER-SR                 190  26 (37.8)                   57.4       21.1       174.0       99
              SOAPdenovo               182  5 (n.a.)                    89.0       25.0        n.a.       n.a.

N50 = the largest number L such that the combined length of all contigs of length ≥ L is at least 50% of the total length of all contigs.
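The N50 definition above translates directly into a few lines of code. The sketch below is illustrative only; the function name and the toy contig lengths are made up.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Largest L such that contigs of length >= L together make up
    at least 50% of the total contig length (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the 50% mark.
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

For the toy lengths 2 through 10 the total is 54, and the three longest contigs (10 + 9 + 8 = 27) reach exactly half of it, so the N50 is 8.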


N50 vs. min overlap size k

The min overlap length k is a determinant parameter and its optimal setting strongly depends on the data (coverage). Trade-off between number of spurious overlaps and lack of overlaps.

[Figure: two plots of N50 contig size (y-axis) against min overlap size k (x-axis), for k from 18 to 33 and from 18 to 34.]


SOAPdenovo


Wicked Problem

The Sequence Assembly problem is an NP-hard combinatorial optimization problem.

The Sequence Assembly problem is claimed to have been successfully solved using greedy and heuristic methods; the greedy approaches exhibit many limitations and low flexibility.

“Fast” Brute-Force global optimization of the sequence assembly problem is possible!

SUTTA outperforms many assembly algorithms on bacterial genomes.

SUTTA has the potential to assemble haplotypic whole-genome sequences.

SUTTA is technology-agnostic: if the sequencing technology changes, just change the score function.


Transcriptomics

MMC (Molecular Morse Codes): Single Molecule Single Cell AFM-Based Transcriptome Profiling


[End of Talk]

Puzzle: A shotgun assembly of a few words.

WordList: “assembled,” “completely,” “correct,” “genome,” “human,” “in-,” “is,” “only,” “sequence,” and “the.”

WordAssembly: “the human genome sequence is correct, only incompletely assembled;”

Other Solutions: “the only assembled human genome sequence is completely incorrect;”

Other Solutions: “only, correct the assembled sequence; genome is completely inhuman.”
