DNA Sequencing and Assembly

DNA sequencing

How we obtain the sequence of nucleotides of a species

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Which representative of the species?

Which human?

Answer one:

Answer two: it doesn’t matter

Polymorphism rate: number of letter changes between two different members of a species

Humans: ~1/1,000 – 1/10,000

Other organisms have much higher polymorphism rates

DNA sequencing – vectors

Shake the DNA to break it into DNA fragments

Vector: circular genome (bacterium, plasmid)

DNA fragment + vector = insert at a known location (restriction site)

Different types of vectors

VECTOR                                   Size of insert (bp)
Plasmid                                  2,000-10,000 (can control the size)
Cosmid                                   40,000
BAC (Bacterial Artificial Chromosome)    70,000-300,000
YAC (Yeast Artificial Chromosome)        > 300,000 (not used much recently)

DNA sequencing – gel electrophoresis

Start at primer (restriction site)

Grow DNA chain

Include dideoxynucleoside (modified a, c, g, t)

Stops reaction at all possible points

Separate products by length, using gel electrophoresis

Electrophoresis diagrams

Output of gel electrophoresis: a read

A read: 500-700 nucleotides

Bases:    A   C   G   A   A   T   C   A   G   ….  A
Quality:  16  18  21  23  25  15  28  30  32      21

Quality scores: -10 × log10 Prob(Error)
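The quality-score formula can be tried out with a minimal sketch. The function names are illustrative, not taken from any sequencing tool.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score Q = -10 * log10(Prob(Error))."""
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse of the formula above: Prob(Error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # Some of the scores from the example read above.
    for q in (15, 21, 32):
        print(q, error_probability(q))
```

A score of 20 corresponds to a 1% chance that the base is wrong, a score of 30 to 0.1%.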

Reads can be obtained from leftmost, rightmost ends of the insert

Double-barreled sequencing: both leftmost & rightmost ends are sequenced

Method to sequence segments longer than 500

cut many times at random (Shotgun)

genomic segment

Get one or two reads from each segment

~500 bp ~500 bp

Reconstructing the Sequence (Fragment Assembly)

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region

reads

Definition of Coverage

Length of genomic segment: L
Number of reads: n
Length of each read: l

Definition: Coverage C = nl/L
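A small sketch of the coverage definition, plus the inverse question of how many reads a target coverage needs. The function names and the example numbers (500 bp reads, a 1,000,000 bp segment) are chosen for illustration.

```python
import math


def coverage(num_reads: int, read_length: int, segment_length: int) -> float:
    """Coverage C = n * l / L."""
    return num_reads * read_length / segment_length


def reads_needed(target_coverage: float, read_length: int, segment_length: int) -> int:
    """Smallest n with n * l / L >= target coverage."""
    return math.ceil(target_coverage * segment_length / read_length)


if __name__ == "__main__":
    print(coverage(14_000, 500, 1_000_000))   # 7.0
    print(reads_needed(7, 500, 1_000_000))    # 14000
```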

How much coverage is enough?

Lander-Waterman model: assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
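The figure for C = 10 can be checked with a simplified form of the Lander-Waterman estimate. This sketch assumes reads start uniformly at random and ignores the minimum overlap needed to join two reads, so the chance that a read is the last one before a gap is about e^(-C) and the expected number of gapped regions is about n × e^(-C). The read length of 500 is an assumption taken from the read lengths quoted earlier.

```python
import math


def uncovered_fraction(c: float) -> float:
    """Expected fraction of bases covered by no read: e^(-C)."""
    return math.exp(-c)


def expected_gaps(num_reads: int, c: float) -> float:
    """Expected number of gapped regions, about n * e^(-C) (simplified model)."""
    return num_reads * math.exp(-c)


if __name__ == "__main__":
    segment_length = 1_000_000
    read_length = 500                                  # assumed read length
    c = 10
    num_reads = c * segment_length // read_length      # n = C * L / l = 20,000
    print(expected_gaps(num_reads, c))                 # close to 1 gap per 1,000,000 nucleotides
```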

Challenges with Fragment Assembly

• Sequencing errors: ~1-2% of bases are wrong

• Repeats: false overlap due to repeat

• Computation: ~ O(N^2), where N = # reads

Repeats

Bacterial genomes: 5%
Mammals: 50%

Repeat types:

Low-Complexity DNA (e.g. ATATATATACATA…)

Microsatellite repeats: (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)

Common Repeat Families
• SINE (Short Interspersed Nuclear Elements) (e.g. ALU: ~300-long, 10^6 copies)
• LINE (Long Interspersed Nuclear Elements): ~500-5,000-long, 200,000 copies
• MIR
• LTR/Retroviral

Other
• Genes that are duplicated & then diverge (paralogs)
• Recent duplications, ~100,000-long, very similar copies

What can we do about repeats?

Two main approaches:

• Cluster the reads

• Link the reads

Strategies for sequencing a whole genome

1. Hierarchical – Clone-by-clone
   i. Break genome into many long pieces
   ii. Map each long piece onto the genome
   iii. Sequence each piece with shotgun

   Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walking
   i. Break genome into many long pieces
   ii. Start sequencing each piece with shotgun
   iii. Construct map as you go

   Example: Rice genome

3. Whole genome shotgun
   One large shotgun pass on the whole genome

   Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Hierarchical Sequencing

Hierarchical Sequencing Strategy

1. Obtain a large collection of BAC clones
2. Map them onto the genome (Physical Mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun
5. Assemble
6. Put everything together

[Figure labels: a BAC clone, map, genome]
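Selecting a minimum tiling path can be sketched as a greedy interval cover, assuming the physical map already gives each clone a start and end coordinate. The clone names and coordinates below are hypothetical, and real maps are noisier than this.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) on the physical map; hypothetical data


def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy cover: repeatedly take the clone that starts inside the covered
    part and reaches farthest to the right."""
    ordered = sorted(clones, key=lambda clone: clone[1])
    path: List[str] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan clones that begin at or before the current frontier.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"clones leave a gap at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    clones = [("A", 0, 100), ("B", 50, 120), ("C", 90, 200), ("D", 110, 210), ("E", 190, 300)]
    print(minimum_tiling_path(clones, 0, 300))  # ['A', 'C', 'E']
```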

Methods of physical mapping

Goal:

Make a map of the locations of each clone relative to one another
Use the map to select a minimal set of clones to sequence

Methods:

• Hybridization
• Digestion

1. Hybridization

Short words, the probes, attach to complementary words

1. Construct many probes p1 … pn
2. Treat each BAC with all probes
3. Record which ones attach to it
4. If the same words attach to BACs X and Y, then X and Y overlap

Hybridization – Computational Challenge

Matrix: m probes × n clones

(i, j): 1 if p_i hybridizes to C_j; 0 otherwise

Definition: Consecutive-ones matrix
A matrix in which the 1s in each row are consecutive

Computational problem:
Reorder the probes so that the matrix is in consecutive-ones form

Can be solved in O(m^3) time (m >> n)
Unfortunately, data is not perfect

Original probe order:

        p1  p2  ……  pm
C1      1   0   1   ……  0
C2      1   1   0   ……  0
…
Cn      0   0   1   ……  1

After reordering:

        pi1 pi2 ……  pim
Cj1     1 1 1 0 0 0 …… 0
Cj2     0 1 1 1 1 1 …… 0
…       0 0 1 1 1 0 …… 0
…       0 0 0 0 0 0 …… 1 1 1 0
Cjn     0 0 0 0 0 0 …… 0 1 1 1
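To make the consecutive-ones problem concrete, here is a toy sketch with one row per clone and one column per probe. It tries every probe order, so it is exponential and only usable on tiny matrices. It is not the polynomial-time algorithm mentioned above, and it assumes perfect data.

```python
from itertools import permutations
from typing import List, Optional, Sequence


def row_is_consecutive(row: Sequence[int]) -> bool:
    """True if all 1s in the row sit next to each other."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def is_consecutive_ones(matrix: Sequence[Sequence[int]]) -> bool:
    return all(row_is_consecutive(row) for row in matrix)


def find_probe_order(matrix: Sequence[Sequence[int]]) -> Optional[List[int]]:
    """Brute force: return a column (probe) order that makes every row
    consecutive, or None if no such order exists."""
    num_probes = len(matrix[0])
    for order in permutations(range(num_probes)):
        if all(row_is_consecutive([row[j] for j in order]) for row in matrix):
            return list(order)
    return None


if __name__ == "__main__":
    matrix = [[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 1, 0, 1]]
    order = find_probe_order(matrix)
    print(order)
    print([[row[j] for j in order] for row in matrix])
```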

2. Digestion

Restriction enzymes cut DNA where specific words appear

1. Cut each clone separately with an enzyme
2. Run fragments on a gel and measure length
3. Clones Ca, Cb that both have fragments of lengths { l_i, l_j, l_k } overlap

Double digestion: cut with enzyme A, enzyme B, then enzymes A + B
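Comparing digestion fingerprints can be sketched as counting fragment lengths that two clones share. The 2% length tolerance and the threshold of three shared fragments are assumptions made for this example; gel measurements are inexact and real thresholds depend on the data.

```python
from typing import Sequence


def shared_fragments(lengths_a: Sequence[int], lengths_b: Sequence[int],
                     tolerance: float = 0.02) -> int:
    """Count fragments of clone A that match a not-yet-used fragment of
    clone B within a relative length tolerance (assumed measurement error)."""
    remaining = sorted(lengths_b)
    shared = 0
    for length_a in sorted(lengths_a):
        for idx, length_b in enumerate(remaining):
            if abs(length_a - length_b) <= tolerance * max(length_a, length_b):
                shared += 1
                del remaining[idx]   # each fragment of B is matched at most once
                break
    return shared


def likely_overlap(lengths_a: Sequence[int], lengths_b: Sequence[int],
                   min_shared: int = 3, tolerance: float = 0.02) -> bool:
    """Call two clones overlapping if they share enough fragment lengths."""
    return shared_fragments(lengths_a, lengths_b, tolerance) >= min_shared


if __name__ == "__main__":
    clone_a = [1200, 3400, 5600, 800]    # hypothetical fragment lengths
    clone_b = [1210, 3390, 5650, 9000]
    print(shared_fragments(clone_a, clone_b), likely_overlap(clone_a, clone_b))
```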

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing

Cut the genome many times at random

Forward-reverse linked reads (~500 bp each) at a known distance:

plasmids (2 – 10 Kbp)

cosmids (40 Kbp)

The Overlap-Layout-Consensus approach

1. Find overlapping reads

2. Merge good pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..

+ many heuristics

1. Find Overlapping Reads

• Sort all k-mers in reads (k ~ 24)

• Find pairs of reads sharing a k-mer

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

• Extend to full alignment – throw away if not >95% similar

[Figure: a shared k-mer extended into a full alignment with mismatches]
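The k-mer step can be sketched by sorting (k-mer, read) pairs and grouping equal k-mers. This only produces candidate pairs; the extension to a full alignment and the 95% similarity filter are not shown. Reads are plain strings and read IDs are list positions, both assumptions of this example.

```python
from itertools import combinations, groupby
from typing import List, Set, Tuple


def candidate_pairs(reads: List[str], k: int = 24) -> Set[Tuple[int, int]]:
    """Return pairs of read indices that share at least one k-mer."""
    # Sort all (k-mer, read id) pairs so identical k-mers become adjacent.
    kmers = sorted(
        (seq[i:i + k], read_id)
        for read_id, seq in enumerate(reads)
        for i in range(len(seq) - k + 1)
    )
    pairs: Set[Tuple[int, int]] = set()
    for _, group in groupby(kmers, key=lambda item: item[0]):
        read_ids = sorted({read_id for _, read_id in group})
        pairs.update(combinations(read_ids, 2))
    return pairs


if __name__ == "__main__":
    reads = ["ACGTACGTAA", "CGTACGTAAT", "TTTTTTTTTT"]
    print(candidate_pairs(reads, k=5))  # {(0, 1)}
```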

1. Find Overlapping Reads

One caveat: repeats

A k-mer that appears N times initiates N^2 comparisons

ALU: 1,000,000 times

Solution:

Discard all k-mers that appear more than c × Coverage times (c ~ 10)
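The repeat filter can be sketched as a counting pass over all k-mers. This reads the rule as "keep a k-mer only if its count is at most c × Coverage"; the function names are illustrative.

```python
from collections import Counter
from typing import List, Set


def count_kmers(reads: List[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts: Counter = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def nonrepeat_kmers(reads: List[str], k: int, coverage: float, c: float = 10) -> Set[str]:
    """Keep only k-mers seen at most c * coverage times."""
    limit = c * coverage
    return {kmer for kmer, n in count_kmers(reads, k).items() if n <= limit}


if __name__ == "__main__":
    reads = ["ACACACAC", "ACACGGTT"]
    # With coverage 1 and c = 2, "AC" (6 times) and "CA" (4 times) are discarded.
    print(sorted(nonrepeat_kmers(reads, k=2, coverage=1, c=2)))
```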

1. Find Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

1. Find Overlapping Reads (cont’d)

• Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

Substitution column (base: quality), before and after correction:
C: 20   C: 35   T: 30   C: 35   C: 40
C: 20   C: 35   C: 0    C: 35   C: 40

Gap column (base: quality), before and after correction:
A: 15   A: 25   A: 40   A: 25   -
A: 15   A: 25   A: 40   A: 25   A: 0

• Score alignments

• Accept alignments with good scores
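The two columns above suggest a rule along these lines: when one base has overwhelming quality-weighted support in a column, overwrite the dissenting entries with it and give them quality 0. The sketch below is one possible version of that rule; the 3:1 support ratio is an assumption, not a value from the slides.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (base or '-', quality score)


def correct_column(column: List[Entry], min_ratio: float = 3.0) -> List[Entry]:
    """Overwrite minority entries with the dominant base at quality 0,
    but only if the dominant base has at least min_ratio times the
    summed quality of everything else in the column."""
    support = {}
    for base, quality in column:
        support[base] = support.get(base, 0) + quality
    winner = max(support, key=support.get)
    others = sum(support.values()) - support[winner]
    if others * min_ratio > support[winner]:
        return list(column)  # no clear winner: leave the column unchanged
    return [(base, quality) if base == winner else (winner, 0)
            for base, quality in column]


if __name__ == "__main__":
    substitution = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
    gap = [("A", 15), ("A", 25), ("A", 40), ("A", 25), ("-", 0)]
    print(correct_column(substitution))
    print(correct_column(gap))
```

On the two example columns this reproduces the corrected values shown above: the T becomes C with quality 0, and the gap becomes A with quality 0.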

Basic principle of assembly

Repeats confuse us

Ability to merge two reads = ability to detect repeats

We can dismiss as repeat any overlap of < t% similarity

Role of error correction:

Discards ~90% of single-letter sequencing errors

Threshold t% increases

2. Merge Reads into Contigs (cont’d)

Merge reads up to potential repeat boundaries (Myers, 1995)

repeat region

2. Merge Reads into Contigs (cont’d)

• Ignore non-maximal reads
• Merge only maximal reads into contigs
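A rough sketch of setting aside non-maximal reads, under the assumption that a non-maximal read is one fully contained in another read. Exact substring containment stands in for containment within an alignment, which is a simplification: real reads contain errors.

```python
from typing import List


def maximal_reads(reads: List[str]) -> List[str]:
    """Drop duplicates and reads that are exact substrings of a longer read."""
    # Longest first, so any read that could contain r is examined before r.
    unique = sorted(set(reads), key=lambda r: (-len(r), r))
    kept: List[str] = []
    for read in unique:
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTAC", "GTA", "CGTACG", "ACGTAC"]
    print(maximal_reads(reads))  # ['ACGTAC', 'CGTACG']
```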

repeat region

2. Merge Reads into Contigs (cont’d)

• Ignore “hanging” reads, when detecting repeat boundaries

[Figure, labels a and b: sequencing error, or repeat boundary???]

2. Merge Reads into Contigs (cont’d)

• Insert non-maximal reads whenever unambiguous

[Figure: ambiguous placements (?????) versus an unambiguous one]

3. Link Contigs into Supercontigs

Find all links between unique contigs (Myers et al. 2000)

Normal density

Too dense: Overcollapsed?

Inconsistent links: Overcollapsed?

3. Link Contigs into Supercontigs (cont’d)

Connect contigs incrementally, if ≥ 2 links

Fill gaps in supercontigs with paths of overcollapsed contigs
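Grouping contigs by links can be sketched with a union-find structure, reading the rule as "join two contigs when at least two read pairs link them". The sketch only groups contigs; it ignores orientation, ordering, and link distances, which a real scaffolder must handle. Contig names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_supercontigs(contigs: Iterable[str], links: Iterable[Tuple[str, str]],
                       min_links: int = 2) -> List[List[str]]:
    """Group contigs connected by at least min_links forward-reverse links."""
    contigs = list(contigs)
    parent: Dict[str, str] = {c: c for c in contigs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Count links per unordered contig pair.
    link_counts = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in link_counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
             ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
    print(build_supercontigs(contigs, links))  # [['c1', 'c2'], ['c3', 'c4']]
```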

3. Link Contigs into Supercontigs

Define G = ( V, E )

V := contigs

E := ( A, B ) such that d( A, B ) < C

Reason to do so: Efficiency; full shortest paths cannot be computed

3. Link Contigs into Supercontigs

[Figure: Contig A and Contig B separated by distance d( A, B )]

3. Link Contigs into Supercontigs

[Figure: Contig A and Contig B with a gap between them]

Define T: contigs linked to either A or B

Fill gap between A and B if there is a path in G passing only through contigs in T
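The gap-filling test can be sketched as a breadth-first search in G that may only step onto contigs in T. The adjacency-dict representation of G and the contig names are assumptions of this example.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Set


def gap_path(graph: Dict[str, Sequence[str]], a: str, b: str,
             allowed: Set[str]) -> Optional[List[str]]:
    """Shortest path from a to b whose intermediate contigs all lie in
    `allowed` (the set T); None if there is no such path."""
    previous: Dict[str, Optional[str]] = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous and (neighbor == b or neighbor in allowed):
                previous[neighbor] = node
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    graph = {"A": ["r1", "x"], "r1": ["r2"], "r2": ["B"], "x": ["B"]}
    print(gap_path(graph, "A", "B", {"r1", "r2"}))  # ['A', 'r1', 'r2', 'B']
```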

4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

Consensus:

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting
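Column-wise voting can be sketched as follows. Rows are assumed to be already aligned, with a space marking a gap as in the alignment above. Each read gets one weight here (equal by default); per-base quality weights would be a natural refinement but are not shown.

```python
from collections import defaultdict
from typing import List, Optional, Sequence


def consensus(rows: Sequence[str], weights: Optional[Sequence[float]] = None) -> str:
    """Pick, for every column, the character with the largest total weight.
    A space counts as a gap and can win the vote like any base."""
    if weights is None:
        weights = [1.0] * len(rows)
    width = max(len(row) for row in rows)
    result: List[str] = []
    for col in range(width):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            char = row[col] if col < len(row) else " "
            votes[char] += weight
        result.append(max(votes, key=votes.get))
    return "".join(result)


if __name__ == "__main__":
    rows = [
        "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
        "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
        "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
        "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
        "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
    ]
    print(consensus(rows))
```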

Mouse Genome

Several heuristics, applied iteratively:
• Breaking supercontigs that are suspicious
• Rejoining supercontigs

Size of problem: 32,000,000 reads

Time: 15 days, 1 processor
Memory: 28 Gb

N50 contig size: 16.3 Kb → 24.8 Kb
N50 supercontig size: 0.265 Mb → 16.9 Mb
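N50 is the length such that contigs (or supercontigs) at least that long contain half of all assembled bases. A minimal sketch, with made-up lengths:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Largest length X such that pieces of length >= X hold at least
    half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```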

Mouse Assembly

Sequencing in the (near) future

[Figure: CMOS chip with photodiodes; microfluidic chip with inlet and outlet]
