Page 1

High Throughput Sequencing: Microscope in the Big Data Era

Sreeram Kannan and David Tse

Tutorial

ISIT 2014

Research supported by NSF Center for Science of Information.

Page 2

DNA sequencing

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Page 3

High throughput sequencing revolution

tech. driver for communications

Page 4

Shotgun sequencing

read

Page 5

Technologies

Sequencer | Mechanism | Read length | Error rate | Output data (per run)
Sanger 3730xl | Dideoxy chain termination | 400-900 bp | 0.001% | 100 KB
454 GS | Pyrosequencing | 700 bp | 0.1% | 1 GB
Ion Torrent | Detection of hydrogen ion | ~400 bp | 2% | 100 GB
SOLiDv4 | Ligation and two-base coding | 50 + 50 bp | 0.1% | 100 GB
Illumina HiSeq 2000 | Reversible nucleotides | 100 bp PE | 2% | 1 TB
Pac Bio | Single molecule real time | 1000~10000 bp | 10-15% | 10 GB

Page 6

High throughput sequencing: Microscope in the big data era

Genomic variations, 3-D structures, transcription, translation, protein interaction, etc.

The quantities measured can be dynamic and vary spatially.

Example: RNA expression is different in different tissues and at different times.

Page 7

Computational problems for high throughput data

measure data

manage data

utilize data

• Assembly (de Novo)

• Variant calling (reference-based assembly)

• Compression

• Privacy

• Genome wide association studies

• Phylogenetic tree reconstruction

• Pathogen detection

Scope of this tutorial

Page 8

Assembly: three points of view

• Software engineering

• Computational complexity theoretic

• Information theoretic

Page 9

Assembly as a software engineering problem

• A single sequencing experiment can generate hundreds of millions of reads, amounting to tens to hundreds of gigabytes of data.

• Primary concerns are to minimize time and memory requirements.

• No guarantee on optimality of assembly quality and in fact no optimality criterion at all.

Page 10

Computational complexity view

• Formulate the assembly problem as a combinatorial optimization problem:
  – Shortest common superstring (Kececioglu-Myers 95)
  – Maximum likelihood (Medvedev-Brudno 09)
  – Hamiltonian path on overlap graph (Nagarajan-Pop 09)

• Typically NP-hard and even hard to approximate.

• Does not address the question of when the solution reconstructs the ground truth.

Page 11

Information theoretic view

Basic question:

What quality and quantity of read data are needed to reliably reconstruct the underlying sequence?

Page 12

Tutorial outline

I. De Novo DNA assembly.

II. Reference-based DNA assembly.

III. De Novo RNA assembly.

Page 13

Themes

• Interplay between information and computational complexity.

• Role of empirical data in driving theory and algorithm development.

Page 14

Part I:

De Novo DNA Assembly

Page 15

Shotgun sequencing model

Basic model: uniformly sampled reads.

Assembly problem: reconstruct the genome given the reads.
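The uniform sampling model can be sketched in a few lines. This is only an illustration under stated assumptions: the genome is treated as a linear string, all reads have the same length, and the function name `sample_reads` and the seed are hypothetical choices, not part of the tutorial.

```python
import random

def sample_reads(genome, num_reads, read_len, seed=0):
    """Draw fixed-length reads at uniformly random start positions (linear genome assumed)."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]

# Toy genome: a prefix of the example sequence shown on the "DNA sequencing" slide.
genome = "ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGC"
reads = sample_reads(genome, num_reads=20, read_len=8)
```

The assembly problem is then to recover `genome` from `reads` alone, without the start positions.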

Page 16

A Gigantic Jigsaw Puzzle

Page 17

Challenges

Long repeats

[Figure: Human Chr 22 repeat length histogram; axes: repeat length ℓ, log(# of ℓ-repeats)]

Read errors

[Figure: Illumina read error profile]

Page 18

Two-step approach

• First, we assume the reads are noiseless

• Derive fundamental limits and near-optimal assembly algorithms.

• Then, we add noise and see how things change.

Page 19

Repeat statistics

[Figure: an easier jigsaw puzzle and a harder jigsaw puzzle]

How exactly do the fundamental limits depend on repeat statistics?

Page 20

Lower bound: coverage

• Introduced by Lander-Waterman in 1988.

• What is the number of reads, NLW, needed to cover the entire DNA sequence with probability 1-ε?

• NLW only provides a lower bound on the number of reads needed for reconstruction.

• NLW does not depend on the DNA repeat statistics!
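A rough numerical sketch of the coverage requirement follows. It assumes the commonly used coupon-collector approximation in which the expected number of uncovered gaps, N·exp(-N·L/G), is set equal to ε, giving the fixed-point equation N = (G/L)·ln(N/ε). The function name and the example values of G, L and ε are illustrative assumptions, not numbers from the slides.

```python
import math

def lander_waterman_reads(G, L, eps, iters=100):
    """Approximate coverage number: solve N = (G / L) * ln(N / eps) by fixed-point iteration."""
    n = G / L  # starting guess: the number of reads needed just to tile the genome
    for _ in range(iters):
        n = (G / L) * math.log(n / eps)
    return n

# Hypothetical example: 55 Mbp genome, 100 bp reads, 5% failure probability.
n_lw = lander_waterman_reads(G=55_000_000, L=100, eps=0.05)
coverage_depth = n_lw * 100 / 55_000_000
```

Note that the result depends only on G, L and ε, which matches the remark that the coverage bound ignores repeat statistics.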

Page 21

Simple model: I.I.D. DNA, G → ∞

(Motahari, Bresler & Tse 12)

[Figure: normalized # of reads vs. read length L; region labels: many repeats of length L, no repeats of length L, no coverage, coverage, reconstructable by greedy algorithm]

What about for finite real DNA?

Page 22

I.I.D. DNA vs real DNA

[Figure: log(# of ℓ-repeats) vs. repeat length ℓ; curves: i.i.d. fit, data]

Example: human chromosome 22 (build GRCh37, G = 35M)

(Bresler, Bresler & Tse 12)

Can we derive performance bounds on an individual sequence basis?

Page 23

Individual sequence performance bounds

Given a genome s

Human Chr 19, Build 37

[Figure: curves GREEDY, DEBRUIJN, SIMPLEBRIDGING, MULTIBRIDGING, Lander-Waterman coverage, ML lower bound; Lcritical marked; axis labels: repeat length, log(# of ℓ-repeats)]

(Bresler, Bresler, Tse BMC Bioinformatics 13)

Page 24

GAGE Benchmark Datasets

http://gage.cbcb.umd.edu/

Rhodobacter sphaeroides: G = 4,603,060
Staphylococcus aureus: G = 2,903,081
Human Chromosome 14: G = 88,289,540

[Figure: three panels, one per dataset, each showing the MULTIBRIDGING curve and the lower bound]

Page 25

Lower bound: Interleaved repeats

Necessary condition: all interleaved repeats are bridged.

[Figure: interleaved repeats labeled m and n, with read length L marked]

In particular: L > longest interleaved repeat length (Ukkonen)

Page 26

Lower bound: Triple repeats

Necessary condition: all triple repeats are bridged.

[Figure: a triple repeat, with read length L marked]

In particular: L > longest triple repeat length (Ukkonen)
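Repeat lengths of this kind can be measured directly on small strings. The brute-force sketch below finds the longest substring that occurs at least a given number of times, which gives the longest repeat (two copies) and the longest triple repeat (three copies). It does not detect interleaving, which needs the positions of the copies as well. The function name is hypothetical, overlapping copies are counted, and the running time is far too slow for real genomes, where suffix-based indexes would be used instead.

```python
from collections import Counter

def longest_repeat(s, copies=2):
    """Length of the longest substring occurring at least `copies` times (overlaps allowed)."""
    for length in range(len(s) - 1, 0, -1):
        counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
        if max(counts.values()) >= copies:
            return length
    return 0

genome = "ATAGACCCTAGACGAT"  # short illustrative string
l_repeat = longest_repeat(genome, copies=2)  # longest substring seen twice: TAGAC
l_triple = longest_repeat(genome, copies=3)  # longest substring seen three times: GA
```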

Page 27

Individual sequence performance bounds

Human Chr 19, Build 37

[Figure: curves Lander-Waterman coverage, lower bound; axis labels: length, log(# of ℓ-repeats)]

(Bresler, Bresler, Tse BMC Bioinformatics 13)

Page 28

Greedy algorithm (TIGR Assembler, phrap, CAP3...)

Input: the set of N reads of length L

1. Set the initial set of contigs as the reads

2. Find two contigs with largest overlap and merge them into a new contig

3. Repeat step 2 until only one contig remains
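The three steps above can be sketched as follows. This is a quadratic-time toy, not how TIGR Assembler, phrap or CAP3 are implemented: it assumes noiseless reads, breaks ties between equal overlaps arbitrarily, and simply concatenates contigs once no overlap is left. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    contigs = list(dict.fromkeys(reads))  # step 1: contigs are the distinct reads
    while len(contigs) > 1:               # step 3: repeat until one contig remains
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(contigs):   # step 2: find the pair with the largest overlap
            for j, b in enumerate(contigs):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs[0]

# Toy example: a repeat-free string and all of its length-4 reads.
genome = "ATGCGTACCTGA"
reads = [genome[i:i + 4] for i in range(len(genome) - 3)]
assembled = greedy_assemble(reads)
```

In this example every 3-mer of the toy string is unique, so each maximal-overlap merge joins truly adjacent reads and the original string is recovered.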

Page 29

Greedy algorithm: first error at overlap

[Figure: a repeat, a bridging read, and already merged contigs; read length L marked]

A sufficient condition for reconstruction: all repeats are bridged

Page 30

Back to chromosome 19

GRCh37 Chr 19 (G = 55M)

[Figure: log(# of ℓ-repeats) vs. repeat length, with curves for the lower bound and the greedy algorithm; annotations: "longest interleaved repeats at length 2248" and "longest repeat at" (value not recoverable from the extraction)]

Non-interleaved repeats are resolvable!

Page 31

Dense Read Model

• As the number of reads N increases, one can recover exactly the L-spectrum of the genome.

• If there is at least one non-repeating L-mer on the genome, this is equivalent information to having a read at every starting position on the genome.

• Key question:

What is the minimum read length L for which the genome is uniquely reconstructable from its L-spectrum?
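As a small illustration of the object in question, the L-spectrum can be computed as a multiset of all length-L substrings. Treating it as a multiset with multiplicities (rather than a plain set) and treating the genome as linear are assumptions of this sketch; `l_spectrum` is a hypothetical name.

```python
from collections import Counter

def l_spectrum(genome, L):
    """Multiset of all length-L substrings of a linear genome."""
    return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

spectrum = l_spectrum("ATAGACCCTAGACGAT", L=5)
num_lmers = sum(spectrum.values())  # one L-mer per starting position
```

Under the dense read model the assembler sees exactly this information, so the key question becomes whether the spectrum determines the genome.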

Page 32

de Bruijn graph

ATAGACCCTAGACGAT

1. Add a node for each (L-1)-mer on the genome.

2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in the genome.

[Figure: de Bruijn graph for L = 5; node labels as extracted: TAGA, AGCC, AGCG, GCCC, GCGA, CCCT, CCTA, CTAG, ATAG, CGAT, AGAC]
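The two construction steps can be sketched as below. The graph is stored as a counter from (prefix, suffix) pairs of (L-1)-mers to edge multiplicities; this representation and the name `de_bruijn_edges` are choices made for the example, and the nodes are computed from the genome string as printed above.

```python
from collections import Counter

def de_bruijn_edges(genome, L):
    """Map each (prefix, suffix) pair of (L-1)-mers to the multiplicity of its L-mer."""
    edges = Counter()
    for i in range(len(genome) - L + 1):
        lmer = genome[i:i + L]
        edges[(lmer[:-1], lmer[1:])] += 1  # one edge per occurrence of the L-mer
    return edges

edges = de_bruijn_edges("ATAGACCCTAGACGAT", L=5)
nodes = {node for edge in edges for node in edge}
```

Here the repeated 5-mer TAGAC produces a double edge from TAGA to AGAC, while every other edge has multiplicity one.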

Page 33

Eulerian path

ATAGACCCTAGACGAT

[Figure: de Bruijn graph for L = 5; node labels as extracted: TAGA, AGCC, AGCG, GCCC, GCGA, CCCT, CCTA, CTAG, ATAG, CGAT, AGAC]

Theorem (Pevzner 95):

If L > max(linterleaved, ltriple), then the de Bruijn graph has a unique Eulerian path which is the original genome.
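A minimal sketch of reconstruction by Eulerian path is given below, using Hierholzer's algorithm on the de Bruijn graph built from the L-spectrum. It assumes a linear genome and that an Eulerian path exists; when several Eulerian paths exist, it returns just one of them, so it does not by itself certify uniqueness. The function name is hypothetical.

```python
from collections import defaultdict

def eulerian_reconstruct(lmers):
    """Spell a sequence from an Eulerian path over the de Bruijn graph of `lmers`."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "ATAGACCCTAGACGAT"
L = 5
lmers = [genome[i:i + L] for i in range(len(genome) - L + 1)]
assembled = eulerian_reconstruct(sorted(lmers))  # sorting discards the positional order
```

For this string the only repeat of length 5 is not interleaved with another one, and the path through the graph is forced, so the original string comes back.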

Page 34

Resolving non-interleaved repeats

non-interleaved repeat

Unique Eulerian path.

Condensed sequence graph

Page 35

From dense reads to shotgun reads

[Idury-Waterman 95] [Pevzner et al 01]

Idea: mimic the dense read scenario by looking at K-mers of the length L reads

Construct the K-mer graph and find an Eulerian path.

Success if we have K-coverage of the genome and K > Lcritical

Page 36

De Bruijn algorithm: performance

Human Chr 19, Build 37

[Figure: curves GREEDY, DEBRUIJN, Lander-Waterman coverage, lower bound; axis label: length]

Loss of info. from the reads!

Page 37

Resolving bridged interleaved repeats

interleaved repeat

bridging read

Bridging read resolves one repeat and the unique Eulerian path resolves the other.

Page 38

Simple bridging: performance

Human Chr 19, Build 37

[Figure: curves GREEDY, DEBRUIJN, SIMPLEBRIDGING, Lander-Waterman coverage, lower bound; axis label: length]

Page 39

Resolving triple repeats

[Figure: a triple repeat with all copies bridged, and the neighborhood of the triple repeat]

All copies bridged: resolve the repeat locally.

Page 40

Triple Repeats: subtleties

Page 41

Multibridging De Bruijn

Theorem (Bresler, Bresler, Tse 13):

Original sequence is reconstructable if:

1. triple repeats are all-bridged

2. interleaved repeats are (single) bridged

3. coverage

Necessary conditions for ANY algorithm:

1. triple repeats are (single) bridged

2. interleaved repeats are (single) bridged.

3. coverage.

Page 42

Multibridging: near optimality for Chr 19

Human Chr 19, Build 37

[Figure: curves GREEDY, DEBRUIJN, SIMPLEBRIDGING, MULTIBRIDGING, Lander-Waterman coverage, lower bound; axis label: length]

Page 43

GAGE Benchmark Datasets

http://gage.cbcb.umd.edu/

Rhodobacter sphaeroides: G = 4,603,060
Staphylococcus aureus: G = 2,903,081
Human Chromosome 14: G = 88,289,540

[Figure: three panels, one per dataset, each showing the MULTIBRIDGING curve and the lower bound, with Lcritical marked]

Lcritical = length of the longest triple or interleaved repeat.

Page 44

Gap

Sulfolobus islandicus, G = 2,655,198

[Figure: curves triple repeat lower bound, interleaved repeat lower bound, MULTIBRIDGING algorithm]

Page 45

Complexity: Computational vs Informational

• Complexity of MULTIBRIDGING
  – For a genome of length G, O(G^2)

• Alternate formulations of Assembly
  – Shortest Common Superstring: NP-Hard
  – Greedy is O(G), but only a 4-approximation to SCS in the worst case
  – Maximum Likelihood: NP-Hard

• Key differences
  – We are concerned only with instances when reads are informationally sufficient to reconstruct the genome.
  – Individual sequence formulation lets us focus on issues arising only in real genomes.

Page 46

Confidence

• When the algorithm obtains an answer, can it be sure?

• Under the dense read model, we can guarantee that when there is a unique Eulerian cycle, the reconstructed answer is correct.
  – This happens whenever L > max(linterleaved, ltriple)

• Conversely, when L ≤ max(linterleaved, ltriple), there are multiple reconstructions that are consistent with the observed data.

• Under the shotgun read model, there is ambiguity in some scenarios.

Page 47

Read Errors

The error rate and the nature of the errors depend on the sequencing technology.

Examples:

• Illumina: 0.1 – 2% substitution errors

• PacBio: 10 – 15% indel errors

We will focus on a simple substitution noise model with noise parameter p.

ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA
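The substitution noise model can be sketched as follows. One assumption is made explicit here: each base is corrupted independently with probability p, and a corrupted base is replaced by one of the other three bases chosen uniformly. The function name and the seed are hypothetical.

```python
import random

def add_substitution_noise(read, p, seed=0):
    """Independently replace each base, with probability p, by a different base chosen uniformly."""
    rng = random.Random(seed)
    noisy = []
    for base in read:
        if rng.random() < p:
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

read = "ACGTCCTATGCGTATGCGTAATGCC"  # prefix of the example sequence above
noisy_read = add_substitution_noise(read, p=0.02)
```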

Page 48

Consistency

Basic question:

What is the impact of noise on Lcritical?

This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N → ∞.

Theorem (C.C. Wang 13):

Yes, for all p except p = ¾.
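One way to see why p = 3/4 is the exceptional value, assuming the symmetric substitution model in which a corrupted base becomes each of the other three bases with probability p/3 (an assumption of this note, not a statement from the slide): at p = 3/4 the observed base is uniform regardless of the true base.

```latex
\Pr(Y = y \mid X = x) =
\begin{cases}
1 - p, & y = x,\\
p/3,   & y \neq x,
\end{cases}
\qquad
p = \tfrac{3}{4} \;\Longrightarrow\; \Pr(Y = y \mid X = x) = \tfrac{1}{4}
\ \text{ for all } x, y \in \{A, C, G, T\}.
```

In that case the reads are statistically independent of the genome, so no number of reads can recover the L-spectrum.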

Page 49

What about coverage depth?

Theorem (Motahari, Ramchandran,Tse, Ma 13):

Assume an i.i.d. genome model. If the read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > Lcritical.

For the uniform distribution on {A,G,C,T}, the threshold is 19%.

A separation architecture is optimal: error correction, then assembly.

Page 50

Why?

• Coverage means most positions are covered by many reads.

• Multiple aligning overlapping noisy reads is possible if

• Assembly using noiseless reads is possible if

noise averaging

M

Page 51

From theory to practice

Two issues:

1) Multiple alignment is performed by testing joint typicality of M sequences, which is computationally too expensive.

Solution: use the technique of fingerprinting.

2) Real genomes are not i.i.d.

Solution: replace greedy by multibridging.

Page 52

X-phased multibridging

Prochlorococcus marinus

Substitution errors of rate 1.5%

[Figure: performance plot with Lcritical marked]

(Lam, Khalak, Tse, RECOMB-Seq 14)

Page 53

More results

[Figure: four panels, each with Lcritical marked: Helicobacter pylori, Methanococcus maripaludis, Mycoplasma agalactiae, Prochlorococcus marinus]

Page 54

A more careful look

Mycoplasma agalactiae

Lcritical

Lcritical-approx

Page 55

Approximate repeat example: Yersinia pestis

exact triple repeat, length 1662

approximate triple repeat, length 5608

Page 56

Application: finishing tool for PacBio reads

[Diagram: raw_reads.fasta → PacBio Assembler (HGAP) → contigs.fasta; raw_reads.fasta and contigs.fasta → our finishing tool → improved_contigs.fasta]

https://github.com/kakitone/finishingTool

Page 57

Experimental results

[Figure: before and after panels for Escherichia coli, Meiothermus ruber, and Pedobacter heparinus]

Page 58

More detail of the result

Species | Before [Ncontigs] | After [Ncontigs] | % Match with reference | Time | Size
Escherichia coli (MG 1655) | 21 | 7 [finisherSC] | 99.60 | < 3 mins (laptop) | ~ 4.6M
Meiothermus ruber (DSM 1279) | 3 | 1 [finisherSC] | 99.99 | < 1 min (laptop) | ~ 3.0M
Pedobacter heparinus (DSM 2366) | 18 | 5 [finisherSC] | 99.89 | < 3 mins (laptop) | ~ 5.1M
S. cerevisiae (fungus) | 252 | 78 [finisherSC] | 95.46 | < 3 hours (laptop) | ~ 12.4M
S. cerevisiae (fungus) | 252 | 55 [Greedy] | 53.91 | < 3 hours (laptop) | ~ 12.4M

Page 59

Part II:

Reference-Based DNA Assembly

(Mohajer, Kannan, Tse ‘14)

Page 60

Many genomes to sequence…

100 million species (e.g. phylogeny)

7 billion individuals (SNP, personal genomics)

10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)

(courtesy: Batzoglou)

… but not all independent

Page 61

Reference Based Assembly: Formulation

Target: ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC

Reference (side information): ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC

[Diagram: Assembler, with the reference as side information]

Page 62

Types of Variations

Substitutions (Single Nucleotide Polymorphisms: SNP)

Reference: ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC

Target: ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC

Page 63

Types of Variations

Small Indels (Insertions and Deletions)

Reference: ACGTCCATGCGTATGCTAATGCCACATATTGAGCTATGCGTAATGCTGTACC

Target: ACGTCCTAGATGCGTATGCGTAATGCCACATATGCTATGCGTAATGGTACC

Aligned reference: ACGTCC___ATGCGTATGC_TAATGCCACATATTGAGCTATGCGTAATGCTGTACC

Aligned target: ACGTCCTAGATGCGTATGCGTAATGCCACATAT___GCTATGCGTAATG__GTACC

Page 64

Types of Variations

Structural Variation

Reference

Inversion

Duplication

Duplication (dispersed)

Copy Number Variation

Page 65

Mathematical Formulation

Focus on SNP version

Define SNP rate

Noiseless reads

Want exact reconstruction

What is Lcritical for this problem?

[Diagram: inputs r (Reference DNA), SNP rate, and dense reads from target t; the algorithm outputs an estimate of the target DNA]

Page 66

Mathematical Formulation

For any given reference DNA and SNP rate, what is the read length required for reconstruction in the worst case among target DNA sequences?

Lcritical is a function of r and the SNP rate.

[Diagram: inputs r (Reference DNA), SNP rate, and dense reads from target t; the algorithm outputs an estimate of the target DNA]

Page 67

Necessary Conditions

Let the reference DNA r have a repeat of size lrep > 2L.

Consider two possible target DNA sequences t1 and t2.

Since L < lrep / 2, the two targets t1 and t2 are indistinguishable from the reads.

Sanity check: interleaved repeat of length lrep / 2 in t1 and t2.

[Figure: r with two repeat copies of length lrep, the targets t1 and t2, and read length L marked]

Page 68

Necessary Conditions

Let the reference DNA r have an approximate repeat of size lrep,app > 2L.

Can create r’ close to r but having an exact repeat of size lrep,app.

If L < lrep,app / 2, the two possible targets t1 and t2 are indistinguishable.

Tolerance for the approximate repeat depends on the SNP rate.

[Figure: r, r’, and the targets t1 and t2]

Page 69

Algorithm

Let L > lrep,app / 2

1. Map reads to r

2. Keep only uniquely mapped reads

3. Estimate t

[Figure: r with two approximate repeat copies of length lrep,app, the target t, and its estimate t̂]
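The map, filter and estimate steps can be sketched on a toy example. This sketch makes several assumptions that are not in the slides: reads are mapped by exhaustive Hamming-distance search rather than a real aligner, a read is accepted if it matches exactly one position within `max_mismatches` substitutions, and the estimate is formed by overwriting the reference with the accepted reads. The names `hamming` and `call_target` are hypothetical; the two strings are prefixes of the reference and target shown on the formulation slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def call_target(reference, reads, max_mismatches=1):
    """Estimate the target by overwriting the reference with uniquely mapped reads."""
    L = len(reads[0])
    estimate = list(reference)
    for read in reads:
        hits = [i for i in range(len(reference) - L + 1)
                if hamming(reference[i:i + L], read) <= max_mismatches]
        if len(hits) == 1:  # keep only uniquely mapped reads
            start = hits[0]
            estimate[start:start + L] = read
    return "".join(estimate)

reference = "ACGTCCCATGCGTATGCATAATGCC"
target = "ACGTCCTATGCGTATGCGTAATGCC"  # differs from the reference at two SNPs
L = 12
reads = [target[i:i + L] for i in range(len(target) - L + 1)]  # dense, noiseless reads
estimate = call_target(reference, reads, max_mismatches=2)
```

In this toy case every read is closer to its true location than to any other window of the reference, so all reads map uniquely and both SNPs are called correctly.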

Page 70

Condition for Success

Loci covered by uniquely mapped reads are correctly called.

Algorithm fails at a particular locus => none of the (L-1) possible reads is uniquely mapped.

[Figure: Case 1 and Case 2, each showing a region of length 2L in r]

Second case more typical in real genome => 2L length approximate repeat in r.

L > lrep,app / 2 => the algorithm succeeds.

Page 71

Assembly Vs. Alignment: I

Necessary condition L ≥ lrep,app (r) / 2

Sufficient condition L > lrep,app (r) / 2 (subject to the assumption)

=> Alignment near optimal and Lref = lrep,app (r) / 2.

De Novo algorithm achieves Lcrit (t) = max {linterleaved(t), ltriple(t) }

In terms of r, for worst case t: Lde-novo = max {linterleaved,app (r), ltriple,app (r)}

Page 72

Assembly Vs. Alignment: II

1. Clearly Lde-novo ≥ Lref since Lref is necessary.

2. Lde-novo = max {linterleaved,app (r), ltriple,app (r)} ≤ lrep,app(r) = 2 Lref

Thus the gain from the reference is at most a factor of 2 in the read length.

The maximal gain happens when linterleaved,app (r) = lrep,app (r), i.e., when the largest approximate repeat is an interleaved repeat.

This happens, for example, when the DNA is an i.i.d. sequence.
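Written as one chain, the two observations on this slide combine as follows (the subscripted notation is just a typeset form of the quantities used above):

```latex
L_{\text{ref}} \;\le\; L_{\text{de-novo}}
= \max\{\ell_{\text{interleaved,app}}(r),\ \ell_{\text{triple,app}}(r)\}
\;\le\; \ell_{\text{rep,app}}(r) = 2\,L_{\text{ref}}
\quad\Longrightarrow\quad
1 \;\le\; \frac{L_{\text{de-novo}}}{L_{\text{ref}}} \;\le\; 2 .
```

The upper end of the range is attained exactly when the maximum equals the length of the longest approximate repeat.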

Page 73

Reference based Assembly: Reprise

• Complexity of alignment
  – Very fast aligners using fingerprinting are available when the SNP rate is small

• Better than alignment?
  – Theory shows alignment near optimal
  – But alignment is what everyone uses anyway
  – Nothing better is possible?

• The limitations of the worst case formulation!

• If we adopt an individual sequence analysis for both reference and target, a better solution is possible.

Page 74

Part III:

RNA (Transcriptome) Assembly

Kannan, Pachter, Tse Genome Informatics ‘13

Page 75

RNA: The RAM in Cells

• The instructions from DNA are copied to mRNA transcripts by transcription
  – RNA transcripts capture the dynamics of the cell

• RNA Sequencing: Importance
  – Clinical purposes
  – Research: Discovery of novel functions
  – Understanding gene regulation
  – Most popular *-Seq

DNA → (transcription) → RNA → (translation) → Protein

Page 76

Alternative splicing

DNA: ATAC GAAT CAAT TCAG

RNA Transcript 1: ATAC CAAT TCAG

RNA Transcript 2: GAAT TCAG

[Figure labels: Exon, Intron; additional sequence labels as extracted: AC, TGAA, AGC]

Alternative splicing yields different isoforms.

1,000s to 10,000s of symbols long

Page 77

RNA-Seq

(Mortazavi et al, Nature Methods 08)

[Figure: copies of the transcripts ATAC CAAT TCAG and GAAT TCAG are sampled into short reads such as TCA, ATT, GAA; the assembler reconstructs the transcripts from the reads]

• Existing Assemblers
  – Genome guided: Cufflinks, Scripture, IsoLasso, ...
  – De novo: Trinity, Oases, Trans-ABySS, ...

Page 78

RNA Sequencing: Bottleneck

Source: Wei Li et al, JCB 2011, Data from ENCODE project

[Figure: Venn diagram comparing the outputs of IsoLasso, Scripture, and Cufflinks; region counts as extracted: 24243, 7553, 9741, 6457 448216, 59647, 5588]

Popular assemblers diverge significantly when fed the same input

Is the bottleneck informational or computational or neither?

Page 79

Informational Limits

• Lcritical for transcriptome assembly

[Figure: read length axis L starting at 0, with Lcritical marked; region labels: "No algo. can reconstruct" and "Proposed algo. can reconstruct in linear time"]

On many examples, these two bounds match, establishing Lcritical !

• Mouse transcriptome: Lcritical = 4077 revealing complex transcriptome structure

• What can we do at practical values of L?


Page 80

Near-Optimality at Practical L

[Figure: fraction of transcripts reconstructable vs. read length]

Page 81

Near-Optimality at Practical L

[Figure: fraction of transcripts reconstructable vs. read length; curves: upper bound on any algorithm, upper bound without abundance]

Page 82

Near-Optimality at Practical L

[Figure: fraction of transcripts reconstructable vs. read length; curve: proposed algorithm]

Page 83

Necessity of Abundance Information

[Figure: fraction of transcripts reconstructable vs. read length; curves: upper bound without abundance diversity, upper bound without abundance]

Page 84

Transcriptome Assembly: Formulation

• M transcripts s1,..,sM with relative abundances α1,..,αM, which are generic (rationally independent).
– Dense read model: Look at Lcrit
– Get all substrings of length L along with their relative weights

[Figure: transcripts s1, s2, ..., sM with abundances α1, α2, ..., αM; a substring shared by s1 and s2 carries weight α1+α2.]
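As a minimal sketch of the dense read model, the weighted L-spectrum can be computed directly from a known set of transcripts. The function name `weighted_spectrum` and the toy sequences are illustrative assumptions, as is the convention that every occurrence of a substring contributes the abundance of the transcript it comes from.

```python
from collections import defaultdict

def weighted_spectrum(transcripts, L):
    """Map each length-L substring to its total relative weight.

    `transcripts` maps a transcript string to its relative abundance.
    Assumption for this sketch: every occurrence of a substring in a
    transcript contributes that transcript's abundance.
    """
    weights = defaultdict(float)
    for seq, alpha in transcripts.items():
        for i in range(len(seq) - L + 1):
            weights[seq[i:i + L]] += alpha
    return dict(weights)

# Toy example (hypothetical sequences): "CGT" occurs in both transcripts,
# so its weight is the sum of the two abundances.
spectrum = weighted_spectrum({"ACGT": 0.25, "CGTA": 0.75}, L=3)
```

A substring seen in several transcripts accumulates their abundances, which is how a shared segment ends up with a weight such as α1+α2.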


What is Lcritical for transcriptome?

• Lcritical is lower bounded by the length of the longest interleaved repeat in any transcript

• It can potentially be much larger due to inter-transcript repeats of exons across isoforms.

ATAC CAAT TCAG

GAAT TCAG

The Information Bottleneck

[Figure, built up over four slides: candidate transcriptomes that share the segment s3.]
– First possibility: s1 s3 s4 and s2 s3 s5
– Second possibility: s1 s3 s5 and s2 s3 s4
– Four-transcript alternative over the same segments (drawn as s1 s3, s2 s3, s4, s5)

Unless L > s3 (the read length exceeds the length of s3), these two transcriptomes are confused.

Sparsity can help rule out the four-transcript alternative.

But the first two possibilities are still confusable unless L > s3.
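The confusion can be checked on a toy instance. The sketch below ignores abundances and compares the sets of length-L substrings of the two two-transcript possibilities. The segment strings are hypothetical, chosen only so that no accidental repeats occur. In this exact-substring model a read bridges s3 only when it covers s3 plus one base on each side, so the threshold used is len(s3) + 2.

```python
def spectrum(transcripts, L):
    """Set of all length-L substrings (abundances ignored)."""
    return {t[i:i + L] for t in transcripts for i in range(len(t) - L + 1)}

# Hypothetical segments; s3 is the segment shared by both transcripts.
s1, s2, s3, s4, s5 = "AAT", "CCG", "ACGTTGCA", "TTA", "GGC"
first = [s1 + s3 + s4, s2 + s3 + s5]
second = [s1 + s3 + s5, s2 + s3 + s4]

short_L = len(s3) + 1  # a read cannot touch both flanks of s3
long_L = len(s3) + 2   # a read can bridge s3 with one base on each side

same_when_short = spectrum(first, short_L) == spectrum(second, short_L)
same_when_long = spectrum(first, long_L) == spectrum(second, long_L)
```

With the short read length the two spectra coincide, so no algorithm working from the unweighted substrings alone could tell the transcriptomes apart.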

How to Distinguish the Two

[Figure: the two candidate transcriptomes, {s1 s3 s5, s2 s3 s4} and {s1 s3 s4, s2 s3 s5}.]

Abundance diversity

[Figure: abundance diversity in a lymphoblastoid cell line, Geuvadis dataset.]

Abundance Diversity

[Figure, built up over two slides: transcripts sharing the segment s3 (segment labels as extracted: s1, s3, s4, s5), drawn with unequal abundances, next to an alternative transcriptome.]

This transcriptome is not a viable alternative (non-uniform coverage).

Even if L < s3, these transcriptomes are distinguishable.
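A small numerical check of this idea, under stated assumptions: the segment strings are hypothetical, the abundances 0.12 and 0.88 are borrowed from the splice-graph example used later in the slides, and only the two obvious abundance assignments for the swapped transcriptome are tried. Exact fractions avoid floating-point noise when comparing weights.

```python
from collections import Counter
from fractions import Fraction

def weighted_spectrum(transcripts, L):
    """Length-L substrings with summed abundances, as a plain dict."""
    weights = Counter()
    for seq, alpha in transcripts:
        for i in range(len(seq) - L + 1):
            weights[seq[i:i + L]] += alpha
    return dict(weights)

# Hypothetical segments; s3 is shared, and L is too short to bridge it.
s1, s2, s3, s4, s5 = "AAT", "CCG", "ACGTTGCA", "TTA", "GGC"
a, b = Fraction(12, 100), Fraction(88, 100)
L = len(s3)

truth = weighted_spectrum([(s1 + s3 + s4, a), (s2 + s3 + s5, b)], L)

# The swapped transcriptome, under either assignment of the two abundances.
swapped = [
    weighted_spectrum([(s1 + s3 + s5, x), (s2 + s3 + s4, y)], L)
    for x, y in [(a, b), (b, a)]
]
distinguishable = all(alt != truth for alt in swapped)

# With equal abundances the weights no longer separate the two.
h = Fraction(1, 2)
equal_case_confusable = (
    weighted_spectrum([(s1 + s3 + s4, h), (s2 + s3 + s5, h)], L)
    == weighted_spectrum([(s1 + s3 + s5, h), (s2 + s3 + s4, h)], L)
)
```

The swapped transcriptome would need the same transcript to be covered at two different depths, which is the non-uniform coverage the slide refers to.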

Fooling Set under Abundance Diversity

[Figure: two transcriptomes over the segments s1 to s5 that share the segment s2, with abundances a, b, c.]

These two transcriptomes are still confusable if L < s2.

Achievability: Algorithm

• From the reads, we construct a transcript graph
• Weight edges based on relative frequencies

[Figure, built up over three slides. Reads: ATCCA, TCCAT, CCATT, CATTC, ATTCG, GATTC. Graph on the reads with edge weights 0.3, 0.1, 0.3 as labeled. Condensed graph with nodes ATC, CAT, GAT, TCG.]
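The construction can be sketched as follows. This is a de Bruijn-style variant chosen for brevity and is an assumption of this sketch: each distinct read becomes an edge between its prefix and suffix, whereas the slides draw the reads themselves as nodes. The two sequences spell out the paths suggested by the example reads, and the 3:1 multiplicity mimics the 0.3 and 0.1 weights.

```python
from collections import Counter

def build_weighted_graph(reads):
    """Return {(prefix, suffix): relative frequency} for each distinct read.

    Each read of length L becomes an edge from its first L-1 bases to its
    last L-1 bases, weighted by the fraction of reads equal to it.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    return {(r[:-1], r[1:]): c / total for r, c in counts.items()}

def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical read set: three copies of every 5-mer of one sequence and
# one copy of every 5-mer of the other.
reads = 3 * kmers("ATCCATTCG", 5) + 1 * kmers("GATTCG", 5)
graph = build_weighted_graph(reads)
```

The read ATTCG is shared by both sequences, so its edge carries the combined frequency of the two.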

Transcripts from Graph

• Paths correspond to transcripts

• Naïve Algorithm: Output all paths from the graph

[Figure: graph with nodes ATC, GAT, CAT, TCG and edge weights 0.3, 0.1, 0.3. Paths: GAT → TCG and ATC → CAT → TCG.]
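A minimal sketch of the naïve algorithm, assuming the graph is acyclic and given as an adjacency dictionary (the representation and the function name `all_paths` are choices made for this sketch):

```python
def all_paths(graph):
    """Enumerate every source-to-sink path of an acyclic graph.

    `graph` maps each node to the list of its successors. Sources are
    nodes without incoming edges; sinks are nodes without successors.
    """
    has_incoming = {v for targets in graph.values() for v in targets}
    sources = [u for u in graph if u not in has_incoming]
    paths = []

    def extend(path):
        successors = graph.get(path[-1], [])
        if not successors:
            paths.append(path)
            return
        for nxt in successors:
            extend(path + [nxt])

    for s in sources:
        extend([s])
    return paths

# The condensed example graph from the slide.
graph = {"ATC": ["CAT"], "CAT": ["TCG"], "GAT": ["TCG"], "TCG": []}
paths = all_paths(graph)
```

On larger splice graphs the number of paths can grow exponentially, and not every path is a real transcript, which is why the abundance information is used next.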

Utility of Abundance

• Consider the following splice graph
– Not all paths are transcripts
– Node frequencies give abundance information
– First idea: Use continuity of copy counts

[Figure: splice graph in which s1 and s2 feed into s3, which feeds into s4 and s5. Weights 0.12 and 0.88 appear on the incoming side and again on the outgoing side. Resolved transcripts: s1 s3 s4 (0.12) and s2 s3 s5 (0.88).]
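Continuity of copy counts can be sketched as a matching step at the shared node: an incoming segment is paired with the outgoing segment that carries the same weight. The function name, the tolerance parameter, and the choice to return `None` when the pairing is ambiguous are assumptions of this sketch.

```python
def match_by_continuity(in_weights, out_weights, tol=1e-9):
    """Pair incoming and outgoing segments whose weights agree.

    Returns a list of (incoming, outgoing) pairs, or None when some
    segment has zero or several candidates, i.e. the node cannot be
    resolved from continuity alone.
    """
    pairs = []
    for u, w_in in in_weights.items():
        candidates = [v for v, w_out in out_weights.items()
                      if abs(w_in - w_out) <= tol]
        if len(candidates) != 1:
            return None
        pairs.append((u, candidates[0]))
    # Every outgoing segment must be used exactly once.
    if len({v for _, v in pairs}) != len(out_weights):
        return None
    return pairs

resolved = match_by_continuity({"s1": 0.12, "s2": 0.88},
                               {"s4": 0.12, "s5": 0.88})
ambiguous = match_by_continuity({"s1": 0.5, "s2": 0.5},
                                {"s4": 0.5, "s5": 0.5})
```

When the two abundances are equal, every pairing is consistent and the step declines to decide.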

Utility of Abundance: Beyond Continuity

• More complex splice graphs:

[Figure: splice graph on nodes s0 to s6 with copy-count values 12, 9, 5, 7, 6, 15 on nodes / edges.]

In general, we are given values on nodes / edges.

Need to find the sparsest flow (on the fewest paths).

General Splice graphs

• Principle for general splice graphs:
– Find the smallest set of paths that corresponds to the node / edge copy counts
• Network routing, snooping, societal networks
• How to split a flow?
– Edge-flow: Flow value on each edge (satisfying conservation)
– Path-flow: Flow value on each path
– Given an edge-flow, find the sparsest path-flow

[Figure: splice graph from Start to End through s1, s2, s3, s4, s5, with edge flows 0.12 and 0.88.]
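To make the edge-flow and path-flow distinction concrete, here is a simple greedy decomposition. It is a generic heuristic, not the algorithm proposed in the tutorial, and it carries no guarantee of finding the sparsest path-flow. It assumes an acyclic graph whose edge-flow satisfies conservation, and it uses integer flows (the 0.12 and 0.88 of the example scaled by 100) to keep the arithmetic exact.

```python
def greedy_path_decomposition(edge_flow, source, sink):
    """Split an edge-flow on a DAG into (path, flow) pairs.

    Repeatedly walk from source to sink along the outgoing edge with the
    largest remaining flow, then subtract the bottleneck of that path.
    """
    remaining = dict(edge_flow)
    paths = []
    while True:
        path, node = [source], source
        while node != sink:
            out = [(f, v) for (u, v), f in remaining.items()
                   if u == node and f > 0]
            if not out:
                break
            _, node = max(out)
            path.append(node)
        if node != sink:
            return paths  # no flow left to route
        hops = list(zip(path, path[1:]))
        bottleneck = min(remaining[h] for h in hops)
        for h in hops:
            remaining[h] -= bottleneck
        paths.append((path, bottleneck))

edge_flow = {
    ("Start", "s1"): 12, ("Start", "s2"): 88,
    ("s1", "s3"): 12, ("s2", "s3"): 88,
    ("s3", "s4"): 12, ("s3", "s5"): 88,
    ("s4", "End"): 12, ("s5", "End"): 88,
}
path_flow = greedy_path_decomposition(edge_flow, "Start", "End")
```

On this instance the greedy walk happens to return two paths. A different edge choice at s3 would split the same edge-flow into three paths, which is exactly the non-uniqueness the sparsest-flow formulation is meant to remove.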

Sparsest Flow Decomposition

• Problem is NP-Hard. [Vatinlen et al. ’08, Hartman et al. ’12]
– Closer look at hard instances: most paths have the same flow
– Equivalent to: most transcripts have the same abundance (!)
– This is not characteristic of the biological problem
• Our Result:
– Assume that abundances are generic
– Propose a provably correct algorithm that reconstructs when L > Lsuff
– Algorithm is linear time under this condition
• Approximately satisfied by biological data!
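One way to get a feel for the genericity assumption is to test a weaker, finite proxy: whether all non-empty subsets of the abundances have distinct sums. Rational independence is a stronger condition than this, so the check below is only an illustration, and the function name is hypothetical. Integer abundances keep the comparison exact.

```python
from itertools import combinations

def has_distinct_subset_sums(abundances):
    """True if no two non-empty subsets of positions share the same sum.

    This is a finite proxy for genericity, not a test of rational
    independence. It enumerates all subsets, so it is only meant for
    small examples.
    """
    seen = set()
    for r in range(1, len(abundances) + 1):
        for combo in combinations(abundances, r):
            total = sum(combo)
            if total in seen:
                return False
            seen.add(total)
    return True

diverse = has_distinct_subset_sums([12, 9, 5])    # all sums differ
repeated = has_distinct_subset_sums([5, 5, 7])    # two equal abundances
additive = has_distinct_subset_sums([1, 2, 3])    # 1 + 2 collides with 3
```

Instances with many equal abundances, like the hard instances mentioned above, fail this check immediately.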

Iterative Algorithm

• The algorithm locally resolves paths using abundance diversity
– Error propagation?
• Decompose a node only when sure
• If unsure, decompose other nodes before coming back to this node
• The algorithm solves paths like a sudoku puzzle
– Solving one node can help uniquely resolve other nodes!
– Can analyze conditions for correct recovery
• L > Lsuff

Algorithm: Example Run

[Figure: step-by-step run of the iterative algorithm on a graph with nodes 1 to 7 and edge weights a, b, c, a+b, b+c. Partial paths such as 46, 47, 346, 35, 347 are resolved one node at a time. Final paths: 1346 (a), 235 (b), 2347 (c).]

Practical Implementation

• Transcripts as paths
– Sparsest decomposition of edge-flow into paths
– Deals with inter-transcript repeats
• Aggregate abundance estimation
– Node-wise copy count estimates
– Smoothing CC estimates using min-cost network flow
• Multibridging to construct transcript graph
– Condensation and intra-transcript repeat resolution
– Identify and discard sequencing errors

Practical Performance

• Simulated reads from human chromosome 15, Gencode transcriptome
• Hard test case
– 1700 transcripts chosen randomly from Chr 15
– Abundance generated from log-uniform distribution
– Read length = 100, 1 million reads
– 1% error rate
– Single-end reads / stranded protocol
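For readers who want to set up a comparable simulation, log-uniform abundances can be drawn as below. The range (`low`, `high`), the seed, and the function name are hypothetical, since the slides do not state the parameters that were used.

```python
import math
import random

def log_uniform_abundances(n, low, high, seed=0):
    """Draw n abundances whose logarithm is uniform on [log low, log high],
    then normalize them to sum to 1."""
    rng = random.Random(seed)
    raw = [math.exp(rng.uniform(math.log(low), math.log(high)))
           for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

# 1700 transcripts as in the test case; the range 1 to 1000 is an assumption.
abundances = log_uniform_abundances(1700, low=1.0, high=1000.0)
```

A log-uniform draw spreads abundances across several orders of magnitude, so many transcripts end up with low coverage.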

Practical Performance

[Bar charts comparing Trinity and our algorithm. First chart: false positives for Trinity and ours (y-axis 0 to 1). Second chart: fraction of transcripts missed (y-axis 0 to 1) by coverage depth of transcripts, binned as 1 to 10, 10 to 25, 25 to 50, 50 to 100, 100+.]

Complexity

• Sparsest flow problem known to be NP-Hard
– Can show using similar reduction that RNA-Seq problem under dense reads is also NP-Hard, assuming arbitrary abundances
• Reasons why our formulation leads to poly-time algorithm:
– Our assumption that abundances are generic
– Only worry about instances where there is enough information
– Individual sequence formulation lets us focus on issues arising only in real genomes.

Confidence

• Can we be sure when the produced solution is correct?
– Assume dense read model
– We are finding the sparsest set of transcripts that satisfy the given L-spectrum
• Under the assumption of genericity
– Theorem: If the sparsest solution is unique, then it is the only generic solution satisfying the L-spectrum (!)

[Figure: splice graph on s1 to s5 with weights 0.12 and 0.88.]

Summary

• An approach to assembly design based on principles of information theory.

• Driven by and tested on genomics and transcriptomics data.

• Ultimate goal is to build robust, scalable software with performance guarantees.


Problem Landscape

measure data
• Assembly (de novo)
– Noisy reads
– RNA: Finite N
• Variant calling (reference-based assembly)
– Indels
– Large variants
• Metagenomic assembly

manage data
• Compression
– Compress memory?
• Privacy
– Information theoretic methods?

utilize data
• Genome wide association studies
– Information bounds
• Phylogenetic tree reconstruction
• Pathogen detection

Acknowledgements

DNA Assembly RNA Assembly

• Guy Bresler (MIT)
• Ma’ayan Bresler (Berkeley)
• Ka Kit Lam (Berkeley)
• Asif Khalak (Pacific Biosciences)
• Lior Pachter (Berkeley)
• Joseph Hui (Berkeley)
• Kayvon Mazooji (Berkeley)
• Abolfazl Motahari (Sharif)
• Soheil Mohajer
• Eren Sasoglu