Next Generation Sequencing Giulio Pavesi University of Milano [email protected]


Next generation sequencing vs Sanger sequencing

http://en.wikipedia.org/wiki/DNA_sequencing

Next Generation Sequencing

Applications:
De novo genome sequencing
Genome resequencing for variant identification
Metagenomics
Sequencing and quantification of transcriptomes
Sequencing of DNA/RNA “samples” (extracted according to different criteria)

“Epigenetics”

Epigenetics (from the Greek επί, epì = "above" and γεννετικός, gennetikòs = "relating to family inheritance") refers to changes that affect the phenotype without altering the genotype. It is a branch of genetics that describes all heritable modifications that change gene expression without altering the DNA sequence

What does DNA sequencing have to do with something that does *not* concern the DNA sequence?

“Nucleosome”

The nucleosome core particle consists of approximately 147 base pairs of DNA wrapped in 1.67 left-handed superhelical turns around a histone octamer

Octamer: 2 copies each of the core histones H2A, H2B, H3, and H4

Core particles are connected by stretches of "linker DNA", which can be up to about 80 bp long

The histone code

Example: H3K4me3

H3 is the histone

K4 is the residue that is modified and its position (K = lysine, in position 4 of the sequence)

me3 is the modification (three methyl groups attached to K4)

If there is no number at the end, as in H3K9ac, only one group is attached
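The naming scheme can be checked mechanically. The sketch below is only an illustration: the function name `parse_mark` and the returned dictionary keys are hypothetical, and the pattern covers only names of the form shown on this slide (it will reject variants such as H2A.Z).

```python
import re

# histone, residue letter, residue position, modification type, optional group count
MARK_RE = re.compile(r"^(H2A|H2B|H3|H4)([A-Z])(\d+)([a-z]+)(\d*)$")

def parse_mark(name):
    """Split a name such as 'H3K4me3' into its components."""
    m = MARK_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised mark name: {name}")
    histone, residue, position, modification, count = m.groups()
    return {
        "histone": histone,
        "residue": residue,
        "position": int(position),
        "modification": modification,
        # no trailing number (e.g. H3K9ac) means a single group
        "groups": int(count) if count else 1,
    }

print(parse_mark("H3K4me3"))
print(parse_mark("H3K9ac"))
```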

Different chromatin states

Chromatin structure (and thus gene expression) also depends on the post-translational modifications of the histones forming nucleosomes

“ChIP”

If we have the “right” antibody, we can extract (“immunoprecipitate”) from living cells the protein of interest bound to the DNA

And - we can try to identify which were the DNA regions bound by the protein

Can be done for transcription factors

But can be done also for histones - and separately for each modification

[Figure: TF ChIP vs. histone ChIP]

ChIP-Seq

Many cells - many copies of the same region bound by the protein

After ChIP

Identification of the DNA fragment bound by the protein

Sequencing

Size selection: only fragments of the “right size” (200 bp) are kept

So - if we found that a region has been sequenced many times, then we can suppose that it was bound by the protein, but…

Only a short fragment of the extracted DNA region can be sequenced, at either or both ends (“single” vs “paired end” sequencing), for no more than 35 (before) / 50 (yesterday) / 100 (now) bps

Thus, original regions have to be “reconstructed”

Read Mapping

Each sequence read has to be assigned to its original position in the genome

A typical ChIP-Seq experiment produces from 6 (before) to 100 million (now) reads of 50-70 and more base pairs for each sequencing “lane” (Solexa/Illumina)

There exist efficient “sequence mappers” against the genome for NGS reads

Read Mapping “Typical” Output

@12_10_2007_SequencingRun_3_1_119_647
(actual sequence)
TTTGAATATATTGAGAAAATATGACCATTTTT
+12_10_2007_SequencingRun_3_1_119_647
(“quality” scores)
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 27 40 40 4 27 40
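To make the record concrete, here is a minimal sketch that reads the four fields of the example above and turns each quality score into an error probability. It assumes the Phred convention Q = -10 log10(P); early Solexa pipelines defined their scores slightly differently, so treat the probabilities as approximate. The function names are illustrative.

```python
def parse_record(lines):
    """Parse a four-line record: @name, sequence, +name, space-separated scores."""
    header, sequence, plus, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line read record")
    scores = [int(q) for q in quality.split()]
    if len(scores) != len(sequence):
        raise ValueError("one quality score is expected per base")
    return header[1:], sequence, scores

def error_probability(q):
    """Phred convention (assumed here): P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

record = [
    "@12_10_2007_SequencingRun_3_1_119_647",
    "TTTGAATATATTGAGAAAATATGACCATTTTT",
    "+12_10_2007_SequencingRun_3_1_119_647",
    "40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 "
    "40 40 40 40 40 40 40 40 40 39 27 40 40 4 27 40",
]

name, seq, scores = parse_record(record)
# index (0-based) of the least reliable base call in the read
worst = min(range(len(scores)), key=scores.__getitem__)
print(name, len(seq), worst, seq[worst], error_probability(scores[worst]))
```

A score of 40 corresponds to one expected error in 10,000 calls, while the single score of 4 near the end of this read marks a base that is wrong roughly 40% of the time.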

“Peak finding”

The critical part of any ChIP-Seq analysis is the identification of the genomic regions that produced a significantly high number of sequence reads, corresponding to the region where the protein (nucleosome) of interest was bound to DNA

Since a graphical visualization of the “piling” of read mapping on the genome produces a “peak” in correspondence of these regions, the problem is often referred to as “peak finding”

A “peak” then marks the region that was enriched in the original DNA sample


Peaks: How tall? How wide? How much enriched?
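A naive way to answer "how tall" and "how wide" is to pile up the reads and keep every stretch whose depth reaches a threshold. The sketch below assumes a single region with 0-based coordinates, reads all on the same strand, and a hypothetical fixed fragment length; real peak finders also use a control sample and a statistical model, which this does not.

```python
def coverage(read_starts, fragment_len, region_len):
    """Per-base depth after extending each read to the assumed fragment length."""
    diff = [0] * (region_len + 1)
    for start in read_starts:
        if 0 <= start < region_len:
            end = min(start + fragment_len, region_len)
            diff[start] += 1   # depth rises where the fragment begins
            diff[end] -= 1     # and falls just after it ends
    depth, running = [], 0
    for delta in diff[:region_len]:
        running += delta
        depth.append(running)
    return depth

def call_peaks(depth, min_height):
    """Return (start, end, height) for each maximal run with depth >= min_height.

    'end' is exclusive, so the width of a peak is end - start.
    """
    if min_height <= 0:
        raise ValueError("min_height must be positive")
    peaks, start = [], None
    for pos, d in enumerate(depth + [0]):  # trailing 0 closes a run at the edge
        if d >= min_height and start is None:
            start = pos
        elif d < min_height and start is not None:
            peaks.append((start, pos, max(depth[start:pos])))
            start = None
    return peaks

depth = coverage([10, 12, 15, 100], fragment_len=20, region_len=200)
print(call_peaks(depth, min_height=2))
```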


The main issue: the DNA sample sequenced (apart from sequencing errors/artifacts) contains a lot of “noise”

Sample “contamination” - the DNA of the PhD student performing the experiment

DNA shearing is not uniform: open chromatin regions tend to be fragmented more easily and thus are more likely to be sequenced

Repetitive sequences might be artificially enriched due to inaccuracies in genome assembly

Amplification pushed too much: you see a single DNA fragment amplified, not enriched

As yet unknown problems, that anyway seem to produce “noisy” sequencings and screw the experiment up
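One of these noise sources, over-amplification, is commonly mitigated by collapsing reads that map to exactly the same position and strand. The sketch below is one possible way to do it, not a method taken from these slides; the tuple layout and the `max_copies` parameter are assumptions.

```python
def collapse_duplicates(reads, max_copies=1):
    """Keep at most max_copies reads per (chromosome, position, strand) tuple."""
    seen = {}
    kept = []
    for read in reads:
        n = seen.get(read, 0)
        if n < max_copies:
            kept.append(read)
        seen[read] = n + 1
    return kept

reads = [
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),   # same start and strand: likely an amplification copy
    ("chr1", 1000, "-"),   # other strand: a different fragment
    ("chr1", 1004, "+"),
]
print(collapse_duplicates(reads))
```

Genuinely enriched regions produce many reads at nearby but distinct positions, so they survive this filter, while a single over-amplified fragment collapses to one read.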

ChIP-Seq histone data

Histone modifications tend to be located at preferred locations with respect to gene annotations/transcribed regions

Hence, enrichment can be assessed in two ways:
Enrichment with respect to the control experiment, with peak identification
“Local” enrichment in given regions with respect to gene annotations:
  Promoters (active/non active)
  Upstream of transcribed/non transcribed genes
  Within transcribed/non transcribed regions
  Enhancers, whatever else
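For the first kind of enrichment, a simple figure of merit is the ratio of ChIP reads to control reads in a region, after scaling the control to the same sequencing depth. This is a sketch under stated assumptions: the function name, the linear depth scaling and the pseudocount are illustrative choices, not something prescribed by the slides.

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total,
                    pseudocount=1.0):
    """Ratio of ChIP to depth-scaled control reads in one region.

    chip_total and control_total are the total mapped reads of each experiment;
    the pseudocount avoids division by zero in regions with no control reads.
    """
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)

# same depth in both experiments: 99 ChIP reads against 9 control reads
print(fold_enrichment(99, 9, 1_000_000, 1_000_000))
# ChIP sequenced twice as deep: the control count is doubled before comparing
print(fold_enrichment(39, 9, 2_000_000, 1_000_000))
```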

Experiment

Perform a ChIP-Seq for several histone modifications, starting from the most “classic” ones

Verify:
Whether each modification has its own “preferential” localization on the genome or with respect to genes (e.g. in the promoter, in the transcribed region, etc.)
Whether each modification is “correlated” in some way with gene transcription/expression

Genome wide histone modifications maps through ChIP-Seq

Barski et al. - Cell 129, 823-837, 2007

20 histone lysine and arginine methylations in CD4+ T cells:
H3K27 H3K9 H3K36 H3K79 H3R2 H4K20 H4R3 H2BK5

Plus:
Pol II binding
H2A.Z (replaces H2A in some nucleosomes)
insulator-binding protein (CTCF)



ChIP-Seq associated with a particular modification (e.g. H3K4me3)

Question: can the modification be “correlated” with gene transcription?

That is, does the modification “mark” particular nucleosomes with respect to the start of transcription, or to the transcribed region?

Example: there could be modifications that:
Mark the start of transcription
Mark all and only the transcribed region
“Silence” particular gene loci, preventing transcription
Have nothing to do with transcription proper and are located elsewhere


Sequences obtained from ChIP-Seq for the modification under study

Input: genomic coordinates of the positions where each sequence maps (see example file)

Input: genomic coordinates of the annotated RefSeq genes

A nucleosome marked by the modification should correspond to a “little pile” of overlapping reads (a “peak”)

Let us count, nucleosome by nucleosome, how tall the “pile” is, i.e. how many reads can be associated with the nucleosome
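When nucleosome positions are not known in advance, a rough stand-in is to cut the genome into fixed-size windows of about one nucleosome plus linker and count the reads in each. This is only a sketch: the 200 bp window and the `(chromosome, position)` input layout are assumptions, and real nucleosomes will not line up with the window boundaries.

```python
from collections import Counter

def reads_per_bin(positions, bin_size=200):
    """Count reads per (chromosome, window index); window i spans
    [i * bin_size, (i + 1) * bin_size)."""
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts

mapped = [("chr1", 1010), ("chr1", 1050), ("chr1", 1190), ("chr1", 1200), ("chr2", 30)]
for (chrom, index), height in sorted(reads_per_bin(mapped).items()):
    print(chrom, index * 200, (index + 1) * 200, height)
```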

Example: if the modification were found in the nucleosome upstream of the TSS of transcribed genes, we would find a “pile” like this

[Figure labels: Nucleosome, Modification]

Example: if the modification were found in the nucleosomes associated with transcribed regions, we would find “piles” like this

[Figure labels: Nucleosome, Modification]

“Transcription start sites”

Laboratory techniques such as “CAGE” (Cap-Analysis-Gene-Expression) make it possible to:
Map the 5’ end of RNAs exactly on the genome, i.e. locate the exact TSSs
Quantify the level of transcript produced from each of the identified TSSs

Since we are looking for the precise localization of histone modifications with respect to the TSSs, it is important to locate the TSSs precisely as well

Analysis: first example

Input:
Sorted list of the genomic coordinates of the TSSs, with the associated transcript level
Sorted list of the genomic coordinates where each ChIP-Seq sequence maps

Output: compute the distribution (the “piles”) with respect to the TSSs
Split the TSSs by transcript level:
  Transcribed genes
  Poorly transcribed genes
  NON-transcribed genes
And check whether there are clear differences depending on whether the TSS is actually transcribed or not

Compare the results for the histone modification with a control experiment
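Splitting the TSSs into the three groups can be sketched as below. The two cut-offs `low` and `high` are hypothetical placeholders (the slides do not give thresholds), and each TSS is assumed to be a tuple whose last element is the transcript level.

```python
def expression_class(level, low=1.0, high=10.0):
    """Assign a transcript level to one of the three groups (cut-offs are assumptions)."""
    if level >= high:
        return "transcribed"
    if level >= low:
        return "poorly transcribed"
    return "not transcribed"

def split_by_expression(tss_list, low=1.0, high=10.0):
    """Group (chromosome, position, strand, level) tuples by expression class."""
    groups = {"transcribed": [], "poorly transcribed": [], "not transcribed": []}
    for tss in tss_list:
        groups[expression_class(tss[-1], low, high)].append(tss)
    return groups

tss_list = [
    ("chr1", 5000, "+", 42.0),
    ("chr1", 9000, "-", 3.5),
    ("chr2", 1200, "+", 0.0),
]
for label, members in split_by_expression(tss_list).items():
    print(label, len(members))
```

The read distribution around the TSSs is then computed separately for each group, and for the control experiment, so the curves can be compared.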

Algorithm!

[Figure: TSS with a window from -1000 to +1000]

For each TSS, compute how many sequences map between -1000 and +1000 bp from the TSS

Count how many sequences map at -1000, -999, -998, ..., -1, 0, +1, +2, ..., +998, +999, +1000

Sum over all TSSs the counts at each distance (-1000, -999, -998, ..., -1, 0, +1, +2, ..., +998, +999, +1000)

Warning!

[Figure: TSS with the window flipped, from +1000 to -1000]

Coordinates relative to the TSS depend on the direction of transcription!
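Putting the counting steps and the strand caveat together, a minimal sketch for a single chromosome could look like this. The function name, the `(position, strand)` input layout and the use of binary search on the sorted read list are illustrative choices. For genes on the reverse strand the sign of the offset is flipped, so that negative offsets always mean "upstream".

```python
from bisect import bisect_left, bisect_right

def tss_profile(tss_list, read_positions, flank=1000):
    """Sum read counts at each offset -flank..+flank over all TSSs.

    tss_list: iterable of (position, strand), strand being '+' or '-'.
    read_positions: sorted list of read coordinates on the same chromosome.
    Returns a list of length 2 * flank + 1; index i holds offset i - flank.
    """
    profile = [0] * (2 * flank + 1)
    for tss, strand in tss_list:
        # binary search picks out only the reads inside the window
        lo = bisect_left(read_positions, tss - flank)
        hi = bisect_right(read_positions, tss + flank)
        for pos in read_positions[lo:hi]:
            offset = pos - tss
            if strand == "-":
                offset = -offset   # upstream lies at higher coordinates
            profile[offset + flank] += 1
    return profile

tss_list = [(5000, "+"), (9000, "-")]
reads = [4900, 5000, 5010, 8990, 9100, 12000]
profile = tss_profile(tss_list, reads)
print(profile[1000 - 100], profile[1000], profile[1000 + 10])
```

In the example, the read at 9100 lies 100 bp upstream of the reverse-strand TSS at 9000, so it is added to the same bin (offset -100) as the read at 4900 for the forward-strand TSS at 5000.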

Output: histone modifications at TSS

[Figures: read count (peak height) vs. distance from TSS, from -1000 to +1000]

Pol II is found bound to DNA at the TSS of transcribed genes

H3K4me3 is found just before and after the TSS of transcribed genes

H3K4me2 (not me3!) is found just before and after the TSS of transcribed genes, but farther away than H3K4me3

H3K4me1 is found just before and after the TSS of transcribed genes, but farther away than H3K4me3 and H3K4me2

H3K27me3 covers the whole locus of “silent” genes - no transcription here

H3K27me1 (not me3!) is, conversely, found before and after the loci of transcribed genes

H3K36me3 is found within the transcribed region - a bit downstream of the TSS - as if it “lets” polymerase proceed with transcription

H3K9me1 is similar in profile to H3K4me3

Barski et al., High-Resolution Profiling of Histone Methylations in the Human Genome, Cell 129(4)

Histone modifications at transcribed regions

[Figure: read count (peak height) vs. expression level, from high to low]
