66
De novo Genome Assembly A/Prof Torsten Seemann Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016

De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Embed Size (px)

Citation preview

Page 1: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

De novo Genome AssemblyA/Prof Torsten Seemann

Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016

Page 2: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Introduction

Page 3: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

The human genome has 47 pieces

MT

(or XY)

Page 4: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

The shortest piece is 48,000,000 bp

Total haploid size

3,200,000,000 bp(3.2 Gbp)

Total diploid size

6,400,000,000 bp(6.4 Gbp)

∑ = 6,400,000,000 bp

Page 5: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Human DNA iSequencer ™ 46 chromosomal and1 mitochondrial

sequences

In an ideal world ...

AGTCTAGGATTCGCTATAGATTCAGGCTCTGATATATTTCGCGGCATTAGCTAGAGATCTCGAGATTCGTCCCAGTCTAGGATTCGCTATAAGTCTAAGATTC...

Page 6: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

The real world ( for now )

Shortfragments

Reads are stored in “FASTQ” files

Genome

Sequencing

Page 7: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016
Page 8: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Assemble Compare

Page 9: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Genome assembly(the red pill)

Page 10: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

De novo genome assembly

Ideally, one sequence per replicon.

Millions of short sequences

(reads)

A few long sequences(contigs)

Reconstruct the original genome from the sequence reads only

Page 11: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

De novo genome assembly

“From scratch”

Page 12: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

An example

Page 13: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them

tomorrow!

Page 14: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

• Readsds, Romans, countns, countrymen, leFriends, Rom send me your ears;crymen, lend me

ShakespearomicsWhoops! I dropped

them.

Page 15: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

• Readsds, Romans, countns, countrymen, leFriends, Rom send me your ears;crymen, lend me

• OverlapsFriends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears;

ShakespearomicsI am good

with words.

Page 16: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

• Readsds, Romans, countns, countrymen, leFriends, Rom send me your ears;crymen, lend me

• OverlapsFriends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears;

• Majority consensusFriends, Romans, countrymen, lend me your ears; (1 contig)

ShakespearomicsWe have reached a

consensus !

Page 17: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Overlap - Layout - ConsensusAmplified DNA

Shear DNA

Sequenced reads

Overlaps

Layout

Consensus ↠ “Contigs”

Page 18: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Assembly graphs

Page 19: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Overlap graph

Page 20: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Another example is this

Page 21: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Overlaps find

Page 22: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Do the graph traverse

Size matters not. Look at me. Judge me by my size, do you?Size matters not. Look at me. Judge me by my soze, do you?

2 supporting reads

1 supporting reads

Page 23: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

So far, so good.

Page 24: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016
Page 25: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Why is it so hard?

Page 26: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes a jigsaw puzzle hard?

Repetitive regions

Lots of pieces

Missing pieces

Multiplecopies

No corners (circular

genomes)

No box

Dirty pieces

Frayed pieces : .,

Page 27: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes genome assembly hard

Size of the human genome = 3.2 x 109 bp (3,200,000,000)

Typical short read length = 102 bp (100)

A puzzle with millions to billions of pieces

Storing in RAM is a challenge ⇢ “succinct data structure”

1. Many pieces (read length is very short compared to the genome)

Page 28: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes genome assembly hard

Finding overlaps means examining every pair of reads

Comparisons = N×(N-1)/2 ~ N2

Lots of smart tricks to reduce this close to ~N

2. Lots of overlaps

Page 29: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes genome assembly hard

ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA

3. Lots of sky (short repeats)

How could we possibly assemble this segment of DNA?All the reads are the same!

Page 30: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes genome assembly hard

Read 1: GGAACCTTTGGCCCTGTRead 2: GGCGCTGTCCATTTTAGAAACC

4. Dirty pieces (sequencing errors)

1. The error in Read 2 might prevent us from seeing that it overlaps with Read 1

2. How would we know which is the correct sequence?

Page 31: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What makes genome assembly hard

5. Multiple copies (long repeats)

Our old nemesis the REPEAT !

Gene Copy 1 Gene Copy 2

Page 32: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Repeats

Page 33: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

What is a repeat?

A segment of DNA that occurs more than once

in the genome

Page 34: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Major classes of repeats in the human genome

Repeat Class Arrangement Coverage (Hg) Length (bp)

Satellite (micro, mini) Tandem 3% 2-100

SINE Interspersed 15% 100-300

Transposable elements Interspersed 12% 200-5k

LINE Interspersed 21% 500-8k

rDNA Tandem 0.01% 2k-43k

Segmental Duplications Tandem or Interspersed

0.2% 1k-100k

Page 35: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

RepeatsRepeat copy 1 Repeat copy 2

Collapsed repeat consensus

1 locus

4 contigs

Page 36: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Repeats are hubs in the graph

Page 37: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Long reads can span repeatsRepeat copy 1 Repeat copy 2

long reads

1 contig

Page 38: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Draft vs Finished genomes (bacteria)

250 bp - Illumina - $250 12,000 bp - Pacbio - $2500

Page 39: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

The two laws of repeats

1. It is impossible to resolve repeats of length L unless you have reads longer than L.

2. It is impossible to resolve repeats of length L unless you have reads longer than L.

Page 40: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

How much data do we need?

Page 41: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Coverage vs Depth

Coverage

Fraction of the genome sequenced by

at least one read

Depth

Average number of reads that cover any given region

Intuitively more reads should increase coverage and depth

Page 42: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Example: Read length = ⅛ Genome Length

Coverage = ⅜

Depth = ⅜Genome

Reads

Page 43: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Example: Read length = ⅛ Genome Length

Coverage = 4.5/8

Depth = 5/8Genome

Reads

Newly Covered Regions = 1.5 Reads

Page 44: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Example: Read length = ⅛ Genome Length

Coverage = 5.2/8

Depth = 7/8Genome

Reads

Newly Covered Regions = 0.7 Reads

Page 45: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Example: Read length = ⅛ Genome Length

Coverage = 6.7/8

Depth = 10/8Genome

Reads

Newly Covered Regions = 1.5 Reads

Page 46: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Depth is easy to calculate

Depth = N × L / G

Depth = Y / G

N = Number of reads

L = Length of a read

G = Genome length

Y = Sequence yield (N x L)

Page 47: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Example: Coverage of the Tarantula genome

“The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata”

Sangaard et al 2014. Nature Comm (5) p1-11

Depth = N × L / G40 = N x 100 / (6 Billion)

N = 40 * 6 Billion / 100 N = 2.4 Billion

Page 48: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Coverage and depth are related

Approximate formula for coverage assuming random reads

Coverage = 1 - e-Depth

Page 49: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Much more sequencing needed in reality

● Sequencing is not random○ GC and AT rich regions

are under represented

○ Other chemistry quirks

● More depth needed for:○ sequencing errors○ polymorphisms

Page 50: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Assessing assemblies

Page 51: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

ContiguityCompletenessCorrectness

Page 52: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Contiguity

● Desire○ Fewer contigs○ Longer contigs

● Metrics○ Number of contigs

○ Average contig length

○ Median contig length

○ Maximum contig length

○ “N50”, “NG50”, “D50”

Page 53: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Contiguity: the N50 statistic

Page 54: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Completeness : Total size

Proportion of the original genome represented by the assembly

Can be between 0 and 1

Proportion of estimated genome size … but estimates are not perfect

Page 55: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Completeness: core genes

Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms.

Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes.

Number of Core Genes in Assembly

Number of Core Genes in Database

In the past this was done with a tool called CEGMA

There is a new tool for this called BUSCO

Page 56: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Correctness

Proportion of the assembly that is free from errors

Errors include

1. Mis-joins2. Repeat compressions3. Unnecessary duplications4. Indels / SNPs caused by assembler

Page 57: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Correctness: check for self consistency

● Align all the reads back to the contigs● Look for inconsistencies

Original ReadsAlign

Assembly

Mapped Read

Page 58: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Assemble ALL the things

Page 59: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Why assemble genomes

● Produce a reference for new species● Genomic variation can be studied by comparing

against the reference● A wide variety of molecular biological tools become

available or more effective● Coding sequences can be studied in the context of their

related non-coding (eg regulatory) sequences● High level genome structure (number, arrangement of

genes and repeats) can be studied

Page 60: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Vast majority of genomes remain unsequenced

Estimated number of species ~ 9 Million

Whole genome sequences (including incomplete drafts) ~ 3000

Estimated number of species Millions to Billions

Genomic sequences ~ 150,000

Eukaryotes Bacteria & Archaea

Every individual has a different genome

Some diseases such as cancer give rise to massive genome level variation

Page 61: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016
Page 62: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Not just genomes

● Transcriptomes○ One contig for every isoform

○ Do not expect uniform coverage

● Metagenomes○ Mixture of different organisms○ Host, bacteria, virus, fungi all at once

○ All different depths

● Metatranscriptomes○ Combination of above!

Page 63: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Conclusions

Page 64: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Take home points

● De novo assembly is the process of reconstructing a long sequence from many short ones

● Represented as a mathematical “overlap graph”

● Assembly is very challenging (“impossible”) because○ sequencing bias under represents certain regions○ Reads are short relative to genome size○ Repeats create tangled hubs in the assembly graph○ Sequencing errors cause detours and bubbles in the assembly graph

Page 65: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

Contact

tseemann.github.io

[email protected]

@torstenseemann

Page 66: De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

The End