DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species...

DNA Sequencing and Assembly

DNA sequencing

How we obtain the sequence of nucleotides of a species

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Which representative of the species?

Which human?

Answer one:

Answer two: it doesn’t matter

Polymorphism rate: number of letter changes between two different members of a species

Humans: ~1/1,000 – 1/10,000

Other organisms have much higher polymorphism rates

DNA sequencing – vectors

DNA fragments

Vector: circular genome (bacterium, plasmid)

Known location (restriction site)

Different types of vectors

VECTOR                                 Size of insert (bp)   Notes
Plasmid                                2,000-10,000          Can control the size
Cosmid                                 40,000
BAC (Bacterial Artificial Chromosome)  70,000-300,000
YAC (Yeast Artificial Chromosome)      > 300,000             Not used much recently

DNA sequencing – gel electrophoresis

Start at primer (restriction site)

Grow DNA chain

Include dideoxynucleoside (modified a, c, g, t)

Stops reaction at all possible points

Separate products by length, using gel electrophoresis

Electrophoresis diagrams

Output of gel electrophoresis: a read

A read: 500-700 nucleotides

A  C  G  A  A  T  C  A  G  …  A
16 18 21 23 25 15 28 30 32 …  21

Quality scores: -10 × log10(Prob(Error))
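The quality-score formula can be tried out with a short sketch. The function names `phred_quality` and `error_probability` are illustrative choices, not part of the slides.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score Q = -10 * log10(P(error)) for one base call."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse mapping: P(error) = 10^(-Q / 10)."""
    return 10.0 ** (-quality / 10.0)
```

Under this definition a score of 20 corresponds to a 1% chance that the base is wrong, and a score of 30 to 0.1%.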

Reads can be obtained from leftmost, rightmost ends of the insert

Double-barreled sequencing: both leftmost & rightmost ends are sequenced

Method to sequence segments longer than 500

cut many times at random (Shotgun)

genomic segment

Get one or two reads from each segment

~500 bp ~500 bp

Reconstructing the Sequence (Fragment Assembly)

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region

Definition of Coverage

Length of genomic segment: L
Number of reads: n
Length of each read: l

Definition: Coverage C = nl/L

How much coverage is enough?

Lander-Waterman model: assuming a uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
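The coverage definition and the gap estimate can be checked numerically. The sketch below assumes the simplest form of the Lander-Waterman estimate, in which the minimum overlap needed to join two reads is ignored and the expected number of gaps is roughly n·e^(-C). The function names are illustrative.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate expected number of gaps, n * e^(-C).

    Assumption: reads start uniformly at random and the minimum
    detectable overlap is ignored.
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)
```

With L = 1,000,000, l = 500 and n = 20,000 the coverage is 10 and the estimate comes out at roughly 0.9 gaps, in line with the figure of about one gapped region per 1,000,000 nucleotides.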

Challenges with Fragment Assembly

• Sequencing errors: ~1-2% of bases are wrong

• Repeats

• Computation: ~ O(N^2) where N = # reads

false overlap due to repeat

Repeats

Bacterial genomes: 5%
Mammals: 50%

Repeat types:

Low-Complexity DNA (e.g. ATATATATACATA…)

Microsatellite repeats: (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)

Common Repeat Families
• SINE (Short Interspersed Nuclear Elements) (e.g. ALU: ~300-long, 10^6 copies)
• LINE (Long Interspersed Nuclear Elements): ~500-5,000-long, 200,000 copies
• MIR
• LTR/Retroviral

Other
• Genes that are duplicated & then diverge (paralogs)
• Recent duplications, ~100,000-long, very similar copies

What can we do about repeats?

Two main approaches:

• Cluster the reads

• Link the reads

Strategies for sequencing a whole genome

1. Hierarchical – Clone-by-clone
   i. Break genome into many long pieces
   ii. Map each long piece onto the genome
   iii. Sequence each piece with shotgun

Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walking
   i. Break genome into many long pieces
   ii. Start sequencing each piece with shotgun
   iii. Construct map as you go

Example: Rice genome

3. Whole genome shotgun

One large shotgun pass on the whole genome

Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Hierarchical Sequencing

Hierarchical Sequencing Strategy

1. Obtain a large collection of BAC clones
2. Map them onto the genome (Physical Mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun
5. Assemble
6. Put everything together
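Step 3 can be illustrated as an interval-covering problem, assuming the physical map already gives each clone a start and end coordinate. The sketch below uses the standard greedy interval-cover method, which is not necessarily what any real sequencing project used. The tuple layout `(name, start, end)` and the function name are hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end].

    clones: list of (name, start, end) tuples with map coordinates.
    Greedy rule: among clones starting at or before the covered
    position, always take the one that reaches furthest right.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[2]
    return path
```

For example, with clones A (0-100), B (50-180), C (90-150), E (120-200) and D (170-300), the path chosen for the region 0-300 is A, B, D.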

a BAC clone

map

genome

Methods of physical mapping

Make a map of the locations of each clone relative to one another. Use the map to select a minimal set of clones to sequence.

Methods:

• Hybridization
• Digestion

1. Hybridization

Short words, the probes, attach to complementary words

1. Construct many probes
2. Treat each BAC with all probes
3. Record which ones attach to it
4. If the same words attach to BACs X and Y, then X and Y overlap

Hybridization – Computational Challenge

Matrix: m probes × n clones

(i, j): 1, if p_i hybridizes to C_j; 0, otherwise

Definition: consecutive-ones matrix: a matrix in which the 1s in each row are consecutive

Computational problem: reorder the probes so that the matrix is in consecutive-ones form

Can be solved in O(m^3) time (m >> n)

Unfortunately, data is not perfect

Before reordering:

      p1  p2  …  pm
C1    1 0 1 … 0
C2    1 1 0 … 0
…
Cn    0 0 1 … 1

After reordering the probes:

      pi1 pi2 … pim
Cj1   1 1 1 0 0 0 … 0
Cj2   0 1 1 1 1 1 … 0
…     0 0 1 1 1 0 … 0
      0 0 0 0 0 0 … 1 1 1 0
Cjn   0 0 0 0 0 0 … 0 1 1 1
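The consecutive-ones problem can be made concrete with a tiny brute-force sketch. This is not the O(m^3) algorithm mentioned on the slide: it tries every probe order, so it is only usable for a handful of probes, and it assumes error-free data. Rows are taken to be clones and columns probes, and the function names are illustrative.

```python
from itertools import permutations


def rows_have_consecutive_ones(matrix):
    """True if, in every row, all 1s form one contiguous block."""
    for row in matrix:
        ones = [j for j, value in enumerate(row) if value == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def find_probe_order(matrix):
    """Return a column order giving consecutive-ones form, or None.

    Brute force over all column permutations (factorial time).
    """
    if not matrix:
        return []
    n_probes = len(matrix[0])
    for order in permutations(range(n_probes)):
        reordered = [[row[j] for j in order] for row in matrix]
        if rows_have_consecutive_ones(reordered):
            return list(order)
    return None
```

A single wrong entry in the hybridization matrix can make `find_probe_order` return `None`, which is why noisy data turns this into a harder optimization problem.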

2. Digestion

Restriction enzymes cut DNA where specific words appear

1. Cut each clone separately with an enzyme
2. Run fragments on a gel and measure length
3. Clones Ca, Cb that share fragments of lengths { li, lj, lk } overlap

Double digestion: cut with enzyme A, enzyme B, then enzymes A + B
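The digestion test in step 3 can be sketched as a comparison of fragment-length lists. Because gel measurements are imprecise, the sketch matches lengths within a relative tolerance. The 2% tolerance, the threshold of three shared fragments, the greedy matching and the function names are all assumptions made for illustration.

```python
def shared_fragments(frags_a, frags_b, tolerance=0.02):
    """Count fragments of clone A matched by a distinct fragment of clone B.

    Two lengths match if they differ by at most `tolerance` times the
    larger of the two. Matching is greedy over sorted lengths.
    """
    remaining = sorted(frags_b)
    shared = 0
    for length in sorted(frags_a):
        for i, other in enumerate(remaining):
            if abs(length - other) <= tolerance * max(length, other):
                shared += 1
                del remaining[i]  # each fragment of B is used once
                break
    return shared


def likely_overlap(frags_a, frags_b, min_shared=3, tolerance=0.02):
    """Call two clones overlapping if they share enough fragment lengths."""
    return shared_fragments(frags_a, frags_b, tolerance) >= min_shared
```

For example, fragment sets [1200, 3400, 5100, 800] and [3410, 5080, 805, 9000, 2000] share three lengths within 2% and would be called overlapping under these assumptions.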

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing

cut many times at random

genome

forward-reverse linked reads

plasmids (2 – 10 Kbp)

cosmids (40 Kbp) known dist

~500 bp ~500 bp

The Overlap-Layout-Consensus approach

1. Find overlapping reads

2. Merge good pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..

+ many heuristics

1. Find Overlapping Reads

• Sort all k-mers in reads (k ~ 24)

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

• Find pairs of reads sharing a k-mer

• Extend to full alignment – throw away if not >95% similar

[Figure: two reads sharing a k-mer, extended to a full alignment that contains mismatches]
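The first two bullets can be sketched as follows. The slide describes sorting all k-mers; this sketch swaps in a hash-based index, which finds the same read pairs. It stops at candidate pairs and leaves out the full alignment and the 95% similarity filter. Function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)
    return index


def candidate_pairs(reads, k):
    """Return all pairs (i, j), i < j, of reads sharing at least one k-mer."""
    pairs = set()
    for read_ids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs
```

Each candidate pair would then be extended to a full alignment and kept only if the aligned region is more than 95% similar.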

One caveat: repeats

A k-mer that appears N times initiates N^2 comparisons

ALU: 1,000,000 times

Solution:

Discard all k-mers that appear more than c × Coverage times (c ~ 10)
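A minimal sketch of this filter, assuming the coverage is already known: count every k-mer across all reads and keep only those at or below the cutoff c × Coverage. Function names and the default c = 10 follow the slide's rule of thumb but are otherwise illustrative.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for pos in range(len(read) - k + 1):
            counts[read[pos:pos + k]] += 1
    return counts


def usable_kmers(reads, k, coverage, c=10):
    """Keep k-mers that occur at most c * coverage times.

    More frequent k-mers are assumed to come from repeats and are
    dropped before overlap detection.
    """
    limit = c * coverage
    return {kmer for kmer, n in count_kmers(reads, k).items() if n <= limit}
```

A k-mer from a unique region should appear about Coverage times, so anything far above that is far more likely to come from a repeat such as ALU.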

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

1. Find Overlapping Reads (cont’d)

• Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

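Substitution (base: quality, one entry per read)
Before correction: C: 20, C: 35, T: 30, C: 35, C: 40
After correction:  C: 20, C: 35, C: 0, C: 35, C: 40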

• Score alignments

• Accept alignments with good scores

Deletion (base: quality, one entry per read)
Before correction: A: 15, A: 25, A: 40, A: 25, -
After correction:  A: 15, A: 25, A: 40, A: 25, A: 0
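The correction shown in the two examples can be sketched for a single alignment column. The sketch assumes a simple rule: the base with the largest total quality wins, and any read that disagrees is overwritten with that base at quality 0. A real assembler would apply stricter conditions before correcting. The function name is illustrative.

```python
from collections import defaultdict


def correct_column(column):
    """Correct one column of a multiple alignment.

    column: list of (base, quality) pairs, one per read; a gap is "-"
    with quality 0. Returns a list of the same shape.
    """
    support = defaultdict(int)
    for base, quality in column:
        support[base] += quality
    consensus = max(support, key=support.get)
    return [
        (base, quality) if base == consensus else (consensus, 0)
        for base, quality in column
    ]
```

Setting the corrected quality to 0 records that the base was inferred from the other reads rather than observed.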

Basic principle of assembly

Repeats confuse us

Ability to merge two reads depends on ability to detect repeats

We can dismiss as repeat any overlap of < t% similarity

Role of error correction:

Discards ~90% of single-letter sequencing errors

Threshold t% increases

2. Merge Reads into Contigs (cont’d)

Merge reads up to potential repeat boundaries (Myers, 1995)

repeat region

• Ignore non-maximal reads
• Merge only maximal reads into contigs

repeat region

• Ignore “hanging” reads, when detecting repeat boundaries

sequencing error

repeat boundary???

Unambiguous

• Insert non-maximal reads whenever unambiguous

3. Link Contigs into Supercontigs

Too dense: Overcollapsed?

(Myers et al. 2000)

Inconsistent links: Overcollapsed?

Normal density

Find all links between unique contigs

3. Link Contigs into Supercontigs (cont’d)

Connect contigs incrementally, if ≥ 2 links

Fill gaps in supercontigs with paths of overcollapsed contigs

Define G = (V, E)
V := contigs

E := (A, B) such that d(A, B) < C

Reason to do so: Efficiency; full shortest paths cannot be computed

[Figure: contigs A and B separated by distance d(A, B)]

Define T: contigs linked to either A or B

Fill gap between A and B if there is a path in G passing only through contigs in T
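The path condition can be sketched as a breadth-first search in G that is only allowed to step onto contigs in T. The graph representation (a dict from contig name to a list of neighbouring contigs) and the function name are assumptions made for illustration.

```python
from collections import deque


def find_filling_path(graph, a, b, allowed):
    """Search for a path from contig a to contig b.

    graph: dict mapping a contig to the contigs it has an edge to
           (edges exist only where d(A, B) < C).
    allowed: the set T of contigs linked to either a or b; every
             intermediate contig on the path must belong to it.
    Returns the path as a list of contigs, or None.
    """
    queue = deque([[a]])
    seen = {a}
    while queue:
        path = queue.popleft()
        for neighbour in graph.get(path[-1], ()):
            if neighbour == b:
                return path + [b]
            if neighbour in allowed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None
```

Restricting the search to T keeps it small, which fits the efficiency argument given for limiting edges to d(A, B) < C.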

4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting
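Weighted voting can be sketched column by column. This version assumes one weight per read (defaulting to equal weights) rather than per-base quality scores, and writes gaps as "-". The function name is illustrative.

```python
from collections import defaultdict


def consensus(rows, weights=None):
    """Derive a consensus string from equal-length aligned rows.

    rows: aligned reads, with "-" marking a gap.
    weights: one vote weight per row; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(rows)
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    result = []
    for col in range(length):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            votes[row[col]] += weight
        result.append(max(votes, key=votes.get))
    return "".join(result)


aligned = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(aligned))
```

On the five aligned rows from the slide, every disputed column is decided four votes to one, so equal weights already reproduce the consensus line shown there.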

Mouse Genome

Several heuristics applied iteratively:
• Breaking supercontigs that are suspicious
• Rejoining supercontigs

Size of problem: 32,000,000 reads

Time: 15 days, 1 processor
Memory: 28 Gb

N50 contig size:       16.3 Kb    24.8 Kb
N50 supercontig size:  0.265 Mb   16.9 Mb
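For readers unfamiliar with the N50 statistic, a minimal sketch follows. It uses the usual definition: the largest length such that contigs of that length or longer contain at least half of all assembled bases. The function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
```

For lengths 80, 70, 50, 40, 30 and 20 the total is 290, and the two longest contigs already cover 150 of it, so the N50 is 70.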

Mouse Assembly

Sequencing in the (near) future

CMOS Chip

Photodiodes

Microfluidic Chip

Outlet
