Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Le Reads devo essere tra,ate per trasformare i da1 in informazioni
Assembly
In bioinforma1cs, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence
A con1g (from con1guous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In bo,om-‐up sequencing projects, a con1g refers to overlapping sequence data (reads);
Dat
a si
ze
Raw reads
Pre- processing
Assembly: Alignment / de novo
Application specific: Variant calling, count matrix,...
Compare samples / methods
Question
Generalized NGS analysis
Answer?
Merge small DNA fragments together so they form a previously unknown sequence
What is de novo assembly?
Merge millions reads together so they form previously unknown sequences
•
de novo assembly • Assemble reads into longer fragments
Find overlap between reads
• Many approaches
reads
con1gs
scaffolds
•
de novo assembly • Assemble reads into longer fragments
Find overlap between reads
• Many approaches
reads
con1gs
scaffolds
• • • • •
Lets try to assemble some reads!
• Rules: a minimum of 7-bp overlap overlap must not include any N bases same orientation so that the sequence can be read left to right there may be 1-bp differences simplified - no double stranded DNA
Valid assemblies ..NNNNGGACTATGATTCG ||||||| TGATTCGAGGCTAANN.. ..NNNNNNNNCGATTCTGATCCGA ||||||| GTCCTCGATTCTNNNNNNNN..
Invalid assemblies ..NNNNCGGACTATGATT
|||||| ATGATTCGAGGCTAANN..
..NNNNNNNNCGCTACTGATCCGA || | ||| GTCCTCGATTCTGNNNNNNN..
NGS de novo assembly
• Success is a factor of:
• Genome size,genomic repeats(!),ploidy
• High coverage,long read lengths,PE/MP libraries
Repeats in E.coli Domani vedremo una storia di successo di un genemo assembly
Two bacterial genomes de Bruijn graphs
Few repeats “more” repeats
Alla fine di questa giornata non vedrete due scarabocchi, ma molto altro
•
Which approaches? Greedy (“Simple” approach)
• Overlap-Layout-Consensus (Long fewer reads)
• de Bruijn graphs (Many short reads)
•
Simple approach - Greedy • Pseudo code:
1. Pairwise alignment of all reads
2. Identify fragments that have largest overlap 3. Merge these
4. Repeat until all overlaps are used
• Can only resolve repeats smaller than read length
High computational cost with increasing no.reads
Shredded Book Reconstruc1on • Dickens accidentally shreds the first prin1ng of A Tale of Two Ci1es
– Text printed on 5 long spools
• How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
Greedy Reconstruc1on
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
worst of times, it was
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
wisdom, it was the age
it was the age of
was the age of foolishness,
the worst of times, it
The repeated sequence make the correct reconstruc1on ambiguous • It was the best of 1mes, it was the [worst/age]
Model sequence reconstruc1on as a graph problem.
La teoria dei Grafi
la teoria dei grafi si occupa di studiare i grafi, oggeX discre1 che perme,ono di schema1zzare una grande varietà di situazioni e di processi e spesso di consen1rne l'analisi in termini quan1ta1vi e algoritmici.
La teoria dei grafi è un modo di vedere le cose
• oggeX semplici, deX ver1ci (ver1ces) o nodi (nodes), • collegamen1 tra i ver1ci. I collegamen1 possono essere:
• orienta1, e in questo caso sono deX archi (arcs) o cammini (paths), e il grafo è de,o orientato
• non orienta1, e in questo caso sono deX spigoli (edges), e il grafo è de,o non orientato
• eventualmente da1 associa1 a nodi e/o collegamen1.
Per grafo si intende una stru,ura cos1tuita da:
La stru,ura informa1ca di WikiPedia
Problema dei pon1 di Königsberg
Königsberg, è percorsa dal fiume Pregel e da suoi affluen1 e presenta due estese isole che sono connesse tra di loro e con le due aree principali della ci,à da se,e pon1
Nel corso dei secoli è stata più volte proposta la ques1one se sia possibile con una passeggiata seguire un percorso che a,raversi ogni ponte una e una volta soltanto e tornare al punto di partenza
Nel 1736 Leonhard Euler affrontò tale problema, dimostrando che la passeggiata ipo1zzata non era possibile
Problema dei pon1 di Königsberg
Eulero ha il merito di aver formulato il problema in termini di teoria dei grafi, astraendo dalla situazione specifica di Königsberg; innanzitu,o eliminò tuX gli aspeX con1ngen1 ad esclusione delle aree urbane delimitate dai bracci fluviali e dai pon1 che le collegano; secondariamente rimpiazzò ogni area urbana con un punto, ora chiamato ver1ce o nodo e ogni ponte con un segmento di linea, chiamato spigolo, arco o collegamento.
Eulero rappresentò la disposizione dei se,e pon1 congiungendo con altre,ante linee le qua,ro grandi zone della ci,à, come nella prima immagine. Si no1 che dai nodi A, B e D partono (e arrivano) tre pon1; dal nodo C, invece, cinque pon1. Ques1 sono i gradi dei nodi: rispeXvamente, 3, 3, 5, 3. Prima di raggiungere una conclusione, Eulero ha ipo1zzato delle situazioni diverse di zone e pon1 (nodi e collegamen1): con qua,ro nodi e qua,ro pon1 è possibile par1re, ad esempio, da A, e tornarci passando per tuX i pon1 una e una sola volta. Il grado di ciascun nodo è un numero pari. Se invece si parte da A per arrivare a D, ogni nodo è di grado pari a eccezione di due nodi, di grado dispari (uno). Sulla base di queste osservazioni, Eulero ha enunciato il seguente teorema: Un qualsiasi grafo è percorribile se e solo se ha tu5 i nodi di grado pari, o due di essi sono di grado dispari; per percorrere un grafo "possibile" con due nodi di grado dispari, è necessario par:re da uno di essi, e si terminerà sull’altro nodo dispari.
Overlap Layout
Consensus
de Bruijn
Graph Theory!!!
Create overlap graph by all-vs-all alignment
Contigs created based on overlap
In the graph each node is a read, edges are overlaps between reads
Overlap-Layout-Consensus
• Consensus:Hamiltonian path (visit each node exactly once)
• Computationally hard problem
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find poten1ally overlapping reads
Layout: merge reads into con1gs and con1gs into supercon1gs
Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
Overlap-‐Layout-‐Consensus
• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, need to use dynamic programming to find the op1mal overlap alignment
• Apply a filtra1on method to filter out pairs of fragments that do not share a significantly long common substring
Overlap
TAGATTACACAGATTAC
TAGATTACACAGATTAC |||||||||||||||||
• Sort all k-‐mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >95% similar
T GA
TAGA | ||
TACA
TAGT ||
Overlapping Reads
Che cos’è un k-‐mer e il k-‐mer?
• A k-‐mer that appears N 1mes, ini1ates N2 comparisons
• For an Alu that appears 106 1mes à 1012 comparisons – too much
• Solu:on: Discard all k-‐mers that appear more than
t × Coverage, (t ~ 10)
Overlapping Reads and Repeats
Alu elements are the most abundant transposable elements in the human genome
Create local mul1ple alignments from the overlapping reads
TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA
Finding Overlapping Reads
• Correct errors using mul1ple alignment
TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA
C: 20 C: 35 T: 30 C: 35 C: 40
C: 20 C: 35 C: 0 C: 35 C: 40
• Score alignments • Accept alignments with good scores
A: 15 A: 25 A: 40 A: 25 -
A: 15 A: 25 A: 40 A: 25 A: 0
Finding Overlapping Reads (cont’d)
• Repeats are a major challenge • Do two aligned fragments really overlap, or are they from two copies of a repeat?
• Solu1on: repeat masking – hide the repeats!!! • Masking results in high rate of misassembly (up to 20%)
• Misassembly means a lot more work at the finishing step
Layout
• Repeats shorter than read length are OK • Repeats with more base pair differencess than sequencing error rate are OK
• To make a smaller por1on of the genome appear repe11ve, try to: – Increase read length – Decrease sequencing error rate
Repeats, Errors, and Contig Lengths
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
De Bruijn graph assembly
• Dickens accidentally shreds the first prin1ng of A Tale of Two Ci1es – Text printed on 5 long spools
• How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruc1on
Greedy Reconstruc1on
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
worst of times, it was
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
wisdom, it was the age
it was the age of
was the age of foolishness,
the worst of times, it
The repeated sequence make the correct reconstruc1on ambiguous • It was the best of 1mes, it was the [worst/age]
Model sequence reconstruc1on as a graph problem.
• Dk = (V,E) • V = All length-‐k subfragments • E = Directed edges between consecu1ve subfragments
• Nodes overlap by k-‐1 words
• Locally constructed graph reveals the global sequence structure • Overlaps between sequences implicitly computed
It was the best was the best of It was the best of
Original Fragment Directed Edge
de Bruijn, 1946 Idury and Waterman, 1995 Pevzner, Tang, Waterman, 2001
de Bruijn Graph Construc1on
• Can this really work? • How do we choose a value for k?
– Needs to be big enough to be unique – But repeats make it impossible to use such a large k, because en1re reads are not unique
– So pick k to be “big enough”
No need to compute overlaps!
• Dickens accidentally shreds the first prin1ng of A Tale of Two Ci1es – Text printed on 5 long spools
• How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruc1on
de Bruijn Graph Assembly
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age of the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
A unique Eulerian tour of the graph reconstructs the
original text
If a unique tour does not exist, try to simplify the
graph as much as possible
de Bruijn Graph Assembly
the age of foolishness
It was the best of times, it
of times, it was the
it was the worst of times, it
it was the age of the age of wisdom, it was the A unique Eulerian tour of
the graph reconstructs the original text
If a unique tour does not exist, try to simplify the
graph as much as possible
1
2
It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …
38
Example
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
Velvet / Curtain
Velvet / Curtain 09.03.12 39
GTCG (1x)
Example
Read: GTCGAGG
Velvet / Curtain 09.03.12 40
GTCG (1x)
TCGA (1x)
Example
Read: GTCGAGG
Velvet / Curtain 09.03.12 41
GTCG (1x)
TCGA (1x)
CGAG (1x)
Example
Read: GTCGAGG
Velvet / Curtain 09.03.12 42
GTCG (1x)
TCGA (1x)
CGAG (1x)
GAGG (1x)
Example
Read: GTCGAGG
Velvet / Curtain 09.03.12 43
Example New read: CGAGGCT
GTCG (1x)
TCGA (1x)
CGAG (2x)
GAGG (1x)
Velvet / Curtain 09.03.12 44
GTCG (1x)
TCGA (1x)
CGAG (2x)
GAGG (2x)
Example
Read: CGAGGCT
Velvet / Curtain 09.03.12 45
GTCG (1x)
TCGA (1x)
CGAG (2x)
GAGG (2x)
AGGC (1x)
Example
Read: CGAGGCT
Velvet / Curtain 09.03.12 46
GTCG (1x)
TCGA (1x)
GGCT (1x)
CGAG (2x)
GAGG (2x)
AGGC (1x)
Example
Read: CGAGGCT
Velvet / Curtain 09.03.12 47
Example
New read: TCGACGC
GTCG (1x)
TCGA (2x)
CGAG (2x)
GAGG (2x)
AGGC (1x)
Velvet / Curtain 09.03.12 48
GTCG (1x)
TCGA (2x)
CGAG (2x) CGAC (1x)
GAGG (2x) GACG (1x)
AGGC (1x) ACGC (1x)
Example
Read: TCGACGC
Velvet / Curtain 09.03.12 49
AGAT (8x)
ATCC (7x)
TCCG (7x)
CCGA (7x)
CGAT (6x)
GATG (5x)
ATGA (8x)
TGAG (9x)
GATC (8x)
GATT (1x)
TAGT (3x)
AGTC (7x)
GTCG (9x)
TCGA (10x)
GGCT (11x)
TAGA (16x)
AGAG (9x)
GAGA (12x)
GACA (8x)
ACAG (5x)
GCTT (8x)
GCTC (2x)
CTTT (8x)
CTCT (1x)
TTTA (8x)
TCTA (2x)
TTAG (12x)
CTAG (2x)
AGAC (9x)
AGAA (1x)
CGAG (8x)
CGAC (1x)
GAGG (16x)
GACG (1x)
AGGC (16x)
ACGC (1x)
Example
etc…
Velvet / Curtain 09.03.12 50
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG
GCTCTAG
AGACAG
AGAA
CGAG
CGACGC
GAGGCT
GATCCGATGAG
GATT
Example
After simplification…
GGCT
Velvet / Curtain 09.03.12 51
Example
Tips removed…
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG
GCTCTAG
AGACAG
CGAG
GAGGCT
GATCCGATGAG
GGCT
Velvet / Curtain 09.03.12 56
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG AGACAG
CGAG
GAGGCT
GATCCGATGAG
GGCT
Example
Bubbles removed… by TourBus
Velvet / Curtain 09.03.12 57
TAGTCGAG AGAGACAG
AGATCCGATGAG
GAGGCTTTAGA
Example
Final simplification…
Velvet / Curtain 09.03.12 58
One possible walk through the graph ...
TAGTCGAG GAGGCTTTAGA AGATCCGATGAG GAGGCTTTAGA AGAGACAG
TAGTCGAG AGAGACAG
Example TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG Final simplification…
AGATCCGATGAG
GAGGCTTTAGA
Now we create a dra} assembly in con1g
But is not sufficient to understand the characteris1c of a genome
Contigs
Scaffolds
Reads
‘De Bruijn’ assembly
To go ahead we have to talk about the paired-‐end sequencing technology
Paired-‐end Sequencing
Scaffolding
Contigs
Scaffolds
(An assembly)
Reads ‘De Bruijn’ assembly
“Captured” gaps caused by repeats. Represented by “NNN” in assembly
Join contigs using evidence from paired end data
Align reads to DeBruijn contigs
Scaffolding
SUPERSCAFOLDING!!!
A “real” protocol
1. Retrieve reads 2. Quality check of reads 3. Trimming and filtering 4. Assembly 5. Using paired-‐end for scaffolding 6. Check the genome quality
Reads
Overlap
Local Mul1ple Alignment
Con1gs
Scaffolding
Alignment Scoring
Finishing
Assembly Problems: -Repeats
-Chimerism
-Gaps
• Number of large contigs
• Total size • Coverage
• Average length • N50
• Longest contig • % genome assembled
Important Assembler Metrics How can we asses the quality of a genome?
How can we understand if we performed a good assembly?
Species Genome size
(Mb) N50 Scaffold
index N50 scaffold size
(Mb) # scaffolds N50 contig size
(Kb) sequencing technology reference
Melon 450 26 4,678 1,594 18.2 454, Sanger this report
Potato 844 121 1,782 2,043 31,4 Illumina, 454,
Sanger The Potato Genome Sequencing Consortium
2011 Apple 743 102 1,542 1,629 13.4 Sanger, 454 Velasco et al 2010
Fragaria 240 n.a. 1,361 3,263 n.a. 454, Illumina,
SOLiD Shulaev et al 2011
Cucumber 367 59 1,144 47,837 19.8 Illumina, Sanger Huang et al 2009
Brassica rapa 529 n.a. 1,97 n.a. 27.3 Illumina
The Brassica rapa Genome Sequencing Project Consortium 2011
Cacao 430 178 0,47 4,792 19,8 454 Argout et al 2011
Date palm 658 n.a. 0,03 57,277 6.4 Illumina Al-Dous et al 2011