
Build Reference Genomes Using Next-Generation Sequencing Technologies

Jianbin Wang

HMGP7620, STBB7620, CPBS7620 and MICB7620

Advanced Genome Analysis

1/22/15

Yeast, 1996 E. coli, 1997

C. elegans, 1998 Fruit fly, 2000

Arabidopsis, 2000 Mouse, 2002

1st Generation Large-Scale Sequencing (Sanger Capillary)

Human, 2001

Produced many important genomes for modern biology

Genome Sequences: 1st Step to Comprehensively Understand the Biology of Organisms

Biodiversity is Everywhere

Diversity of fungi from Northern Saskatchewan

Diversity of butterfly wing sizes, shapes, and colors

Available (Eukaryotic) Genomes Increased Steadily at NCBI

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

[Figure: bar chart, "Number of Eukaryotic Genomes Released @ NCBI GenBank (1/9/15)". x-axis: Year(s); y-axis: Released Genomes (0 to 600). Bar labels as extracted: 6, 3, 15, 22, 34, 47, 46, 65, 51, 60, 66, 205, 235, 396, 574, and 6.]

Total: 1,831 by 1/9/15

Examples of Reported New Genomes in 2014

Almost all were done using Next-Generation Sequencing (NGS)

The Nature of NGS Data

• Higher parallel operation/yield

• Much lower cost per base

• Usually shorter (unfortunately)

Illumina has the lowest cost/Mb ($0.05-$0.15) and is the most popular platform

Illumina Paired-End vs. Mate Pair Sequencing

Paired-end Mate pair

Building a Genome is Like Solving a Puzzle of a Map

States: 50 = 3Gb/50 = 60 Mb

Counties: 3,144 = 1 Mb

Zip Codes: 43,000 = 70 Kb

Sanger reads: 800 bp = 3.75 million reads (x 10x)

Illumina reads: 200 bp = 15 million reads (x 50x)

Illumina reads: 50 bp = 60 million reads (x 100x)
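The read counts above follow from dividing a 3 Gb genome by the read length, then multiplying by the desired coverage. A minimal sketch of that arithmetic, where the function name `reads_needed` is an illustrative assumption and not part of any tool:

```python
def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float = 1.0) -> float:
    """Number of reads needed to cover a genome at the given depth."""
    return genome_size_bp * coverage / read_length_bp


GENOME = 3_000_000_000  # 3 Gb, the genome size used in the puzzle analogy

# (platform, read length in bp, coverage multiplier) as listed on the slide
for name, length, cov in [("Sanger", 800, 10), ("Illumina", 200, 50), ("Illumina", 50, 100)]:
    at_1x = reads_needed(GENOME, length)
    at_cov = reads_needed(GENOME, length, cov)
    print(f"{name} {length} bp: {at_1x / 1e6:.2f} million reads at 1x, "
          f"{at_cov / 1e6:.1f} million reads at {cov}x")
```

Shorter reads need both more reads per 1x and a higher coverage multiplier, which is why the puzzle has far more pieces with Illumina data than with Sanger data.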

De-novo Genome Assembly Concepts

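[Figure: genomic DNA is fragmented by whole genome shotgun sequencing into genomic reads; de novo assembly joins the reads into contigs (Contig1, Contig2, Contig3, Contig4) separated by gaps; paired-end information links the contigs into a scaffold.]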

Metrics for Genome Assembly

N50 = 18,063 bp

N50 number = 4,175

N90 = 3,548 bp

N90 number = 16,950

• Number of contigs/scaffolds

• Total size of contigs/scaffolds

• Longest contig/scaffold

• N50/N90 contigs/scaffolds length
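N50 and N90 can be computed directly from the list of contig or scaffold lengths. The sketch below uses the common definition (sort lengths from longest to shortest and accumulate until the chosen fraction of the total assembly size is reached). The function name `nx` and the toy lengths are assumptions for illustration only.

```python
def nx(lengths, fraction=0.5):
    """Return (Nx length, Nx number) for a list of contig/scaffold lengths.

    Nx length: length of the shortest sequence in the smallest set of
    longest sequences that together reach `fraction` of the total size.
    Nx number: how many sequences are in that set.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("lengths must be non-empty")


toy_contigs = [100, 80, 60, 40, 20]  # hypothetical contig lengths in bp
print("N50:", nx(toy_contigs, 0.5))  # (80, 2)
print("N90:", nx(toy_contigs, 0.9))  # (40, 4)
```

A larger N50 length with a smaller N50 number indicates a more contiguous assembly, as in the slide's example (N50 = 18,063 bp, N50 number = 4,175).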

Methods: Overlap-Layout-Consensus

• Pair-wise sequence alignments (computationally expensive)

• Construction and manipulation of an overlap graph to produce the read layout

• Multiple sequence alignment to generate the consensus

Examples: Phrap, Celera, Arachne, CAP, PCAP, Newbler, SGA …
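The overlap step can be sketched with exact suffix-prefix matching. This is a simplification: the assemblers listed above use alignment methods that tolerate sequencing errors, and the function names, the `min_length` parameter, and the toy reads here are assumptions for illustration.

```python
from itertools import permutations


def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def overlap_graph(reads, min_length: int = 3):
    """Map each ordered read pair (a, b) to its overlap length, if any.

    Comparing every pair of reads is what makes this step expensive.
    """
    graph = {}
    for a, b in permutations(reads, 2):
        size = overlap(a, b, min_length)
        if size:
            graph[(a, b)] = size
    return graph


reads = ["AGATGA", "TGATTC", "ATTCG"]  # toy reads drawn from AGATGATTCG
print(overlap_graph(reads))
```

Following the edges AGATGA -> TGATTC -> ATTCG gives the layout from which a consensus sequence would be built.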


Methods: Eulerian Path/de Bruijn Graph

• K-mer hash table

• de Bruijn graph/Eulerian path search

Examples: Euler, Velvet, ALLPATHS, ABySS, SOAPdenovo, ...

AGATGATTCG (k = 3): AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
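The k-mer decomposition of AGATGATTCG can be reproduced in a few lines. The sketch below also builds a small de Bruijn graph using one common convention, assumed here: nodes are (k-1)-mers and each k-mer is an edge. Some assemblers instead treat k-mers as nodes. The function names are illustrative.

```python
from collections import defaultdict


def kmers(sequence: str, k: int):
    """All overlapping k-mers of `sequence`, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn(sequences, k: int):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph = defaultdict(list)
    for sequence in sequences:
        for kmer in kmers(sequence, k):
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(kmers("AGATGATTCG", 3))
for node, successors in de_bruijn(["AGATGATTCG"], 3).items():
    print(node, "->", successors)
```

The repeated k-mer GAT shows up as a doubled edge GA -> AT, and the node AT branches to both TG and TT. Repeats create this kind of ambiguity for an Eulerian path search.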

Differences Between an Overlap Graph and a de Bruijn Graph

Schatz et al. 2010

Coverage and K-mer Coverage

• Coverage (C)

• K-mer coverage (Ck)

• K-mer coverage depends on K-mer size and read length

Ck = C * (L - k +1) / L

where k is your hash length, and L your read length

• Choice of K-mer: a tug-of-war between specificity and coverage
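The formula Ck = C * (L - k + 1) / L makes the tug-of-war concrete: a larger k is more specific but leaves fewer k-mers per read. A minimal sketch, where the coverage of 50x and the read length of 100 bp are illustrative assumptions:

```python
def kmer_coverage(coverage: float, read_length: int, k: int) -> float:
    """K-mer coverage: Ck = C * (L - k + 1) / L."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return coverage * (read_length - k + 1) / read_length


# Hypothetical data set: 50x base coverage from 100 bp reads
for k in (21, 31, 51, 91):
    print(f"k = {k}: Ck = {kmer_coverage(50, 100, k):.1f}")
```

With these assumed values, raising k from 21 to 91 drops the k-mer coverage from 40x to 5x.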

Challenges for De-novo Genome Assembly

• Repetitive sequence

• DNA polymorphisms/sequencing errors

• Non-uniform coverage

• Computational complexity of processing large volumes of data

Reducing the Complexity of the Data

• Sub-assembly (grouped assembly)

– Illumina Tn5 transposase based barcodes

– Fosmid, BAC pooling and others

• Repeat-masking

• Reference based

Additional Scaffolding (with indirect source)

• Related-genome as reference

• cDNAs/transcriptomes

• Conserved proteins

[Figure: Contig1, Contig2, Contig3, and Contig4 joined into a scaffold using scaffolding information from a reference genome, cDNA, or conserved protein.]

This step needs extra caution because it relies on assumptions that might not be true!

To the Next Level: Chromosomal Size Scaffolding Approaches

• Fosmid: 35-40 Kb

• BAC: 150-350 Kb

• Optical mapping: chromosomal level

• Hi-C assembly: chromosome-scale

• Longer reads (Sanger, Illumina, PacBio, Nanopore, …)

[Figure: Scaffold1, Scaffold2, Scaffold3, and Scaffold4 joined into a super-scaffold/chromosome using higher-level scaffolding information from fosmid/BAC, optical mapping, or longer reads.]

Genome Assessment - Coverage

• Reads coverage/reads used

• Physical coverage

• Functional coverage

– Core Eukaryotic Genes Mapping Approach (CEGMA)

– Transcriptomes (mRNAs, Small RNAs, and others)

– Other sequences of interest
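Read coverage and physical coverage can differ substantially for paired libraries. The sketch below assumes one common definition: read coverage counts sequenced bases, while physical coverage counts the full span between the two ends of each pair. The function names and all numbers are hypothetical.

```python
def read_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of sequenced bases covering each genome position."""
    return n_reads * read_length / genome_size


def physical_coverage(n_pairs: int, insert_size: int, genome_size: int) -> float:
    """Average number of paired fragments spanning each genome position."""
    return n_pairs * insert_size / genome_size


# Hypothetical library: 10 million 2 x 100 bp pairs, 450 bp inserts, 100 Mb genome
pairs, genome = 10_000_000, 100_000_000
print("read coverage:", read_coverage(2 * pairs, 100, genome))
print("physical coverage:", physical_coverage(pairs, 450, genome))
```

Under these assumptions the library gives 20x read coverage and 45x physical coverage. Long-insert mate-pair libraries raise physical coverage much further, which is what makes them useful for scaffolding.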

Genome Assessment - Continuity

• N50 and N90 on contigs and scaffolds

• Consistency with available genetic maps

• Paired-end discrepancy

• mRNA/cDNA intactness

• …

Summary of De-novo Assembly Process for WGS

• Experiment design

– Genome size and complexity

– Goal and budget

• Sample collection

• Sample preparation

• Sequencing

– Choice of platform(s)

• Pre-processing

• Assembly

– Strategies and software choices

• Post-assembly analysis

See http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/De_novo_assembly for details such as experiment design, data processing, and available software comparison

Programmed DNA Elimination is an Exception to Genome Constancy in Multicellular Organisms

Wang & Davis 2014, Current Opinion in Genetics & Development

First cell lineage defined in 1910 by Theodor Boveri

Ascaris Early Embryo Development, Cell Lineage, and Chromatin Diminution

[Figure: Ascaris early embryo cell lineage from the zygote through the 2-cell, 4-cell, 8-cell, 16-cell, and 32-cell stages. Germline cells: P0, P1, P2, P3, P4. Somatic cells: S1, S2, S3, S4, S1a, S1b, S2a; lineage labels (AB), (EMS), (MS), (E), (C), (D).]

Wang et al. 2012 Developmental Cell

A. suum Diminution Mitosis

Eliminated DNA (red) stays at the metaphase plate while retained chromosomes are pulled toward the daughter cells in early anaphase.

Eliminated DNA (red) is in fragments between segregating chromosomes in anaphase. DNA fragments from a previous diminution are still visible.

Samples and Reads for Ascaris Genome Assembly

1 male carcass = whole male - testis - spermatids - intestine

2 female carcass = whole female - ovary (oviduct) - uterus (embryos) - intestine

3 Jex et al. used mixed DNA sources for genome assembly

Assemblies for A. suum Genomes

Protein-coding genes

Functional coverage

Wang et al. 2012 Developmental Cell

Read Coverage for the Germline Genome Defines the Eliminated Sequences & Breakpoints

Wang et al. 2012 Developmental Cell

17 additional sites confirmed by PCR
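The idea of using read coverage to find eliminated sequences can be illustrated with a toy comparison of germline and somatic coverage in fixed windows. This is only a sketch of the concept, not the pipeline from Wang et al. 2012; the function names, thresholds, and coverage values are all hypothetical.

```python
def eliminated_windows(germline, somatic, min_germline=10.0, max_ratio=0.1):
    """Flag windows covered by germline reads but nearly absent from somatic reads."""
    return [g >= min_germline and s <= max_ratio * g
            for g, s in zip(germline, somatic)]


def breakpoints(flags):
    """Window indices where the retained/eliminated state changes."""
    return [i for i in range(1, len(flags)) if flags[i] != flags[i - 1]]


# Hypothetical per-window coverage along one germline scaffold
germline_cov = [40, 42, 38, 41, 39, 40]
somatic_cov = [41, 40, 1, 0, 2, 39]

flags = eliminated_windows(germline_cov, somatic_cov)
print(flags)
print(breakpoints(flags))
```

In this toy case windows 2 to 4 are flagged as eliminated, and the state changes at windows 2 and 5 mark candidate breakpoints that would still need experimental confirmation, for example by PCR.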

Parascaris Genome Sequencing and Assemblies

Sequencing

Genomic DNA Source    Insertion Size (bp)    Sequencing type    # of sequencing lanes    Reads Number (million)    Genome Coverage (x)
Male #1 testis        450                    2 x 100            5 x HiSeq 2000           1,468                     48
Male #1 intestine     450                    2 x 100            1 x HiSeq 2000           264                       70
Male #1 intestine     450                    2 x 250            1 x MiSeq                31                        18

Assemblies

Genome assembly features              Germline           Somatic
Estimated genome size (Mb)            ~2,500             ~285
Total base assembled (bp)             234,063,191        229,444,662
Number of scaffolds (>= 200bp)        25,519             19,520
N50 of scaffolds (bp); N50 number     36,592; 1,644      103,210; 675
N90 of scaffolds (bp); N90 number     6,674; 7,370       21,816; 2,406
Maximum length of scaffold (bp)       397,251            495,322
N50 of contigs (bp); N50 number       16,545; 3,468      26,670; 2,122
N90 of contigs (bp); N90 number       2,414; 17,435      2,112; 14,596

• 88% of germline genome is eliminated in somatic cells

• Primarily satellite repeats eliminated

• 5-mer = 1.3 Gb

• 10-mer = 0.9 Gb

• ~ 700 genes eliminated
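The eliminated fraction can be checked against the estimated genome sizes in the assembly table (~2,500 Mb germline, ~285 Mb somatic). A minimal sketch of that arithmetic, with an illustrative function name:

```python
GERMLINE_MB = 2500  # estimated germline genome size (~2,500 Mb)
SOMATIC_MB = 285    # estimated somatic genome size (~285 Mb)


def eliminated_fraction(germline_mb: float, somatic_mb: float) -> float:
    """Fraction of the germline genome that is absent from the somatic genome."""
    return (germline_mb - somatic_mb) / germline_mb


print(f"{eliminated_fraction(GERMLINE_MB, SOMATIC_MB):.1%}")
```

Because both genome sizes are rough estimates, the result should be read as approximate; it is close to the 88% quoted above.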

Genes lost and many breakpoints are conserved, suggesting an ancient mechanism for diminution

Parascaris Germline Genome Assembly Was Enabled by Repeat Masker

Work is under way to improve the genomes using Bionano, PacBio, and Fosmid libraries

Genome Assembly Using NGS Data

• Is feasible and is the method of choice for sequencing a new genome

• Is still a challenge for complex genomes

• The algorithm matters, but the source of DNA and the type/quality of the libraries matter more

• A reference genome or other higher-order genetic map is of great value

• The quality of genome assemblies is improving constantly

• Put it into the biological context

References and Additional Reading

• Schatz, M. C., A. L. Delcher, et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Research 20(9): 1165-1173.

• Earl, D., K. Bradnam, et al. (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Research 21(12): 2224-2241.

• Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Research.

• Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nature Reviews Genetics 13(1): 36-46.

• Nagarajan, N. and M. Pop (2013). "Sequence assembly demystified." Nature Reviews Genetics 14(3): 157-167.

