[I0D51A] Bioinformatics: High-Throughput Analysis Next-generation sequencing. Part 3: Variation discovery Prof Jan Aerts Faculty of Engineering - ESAT/SCD [email protected] TA: Alejandro Sifrim ([email protected]) 1

Next-generation sequencing - variation discovery


Page 1: Next-generation sequencing - variation discovery

[I0D51A] Bioinformatics: High-Throughput Analysis

Next-generation sequencing.

Part 3: Variation discovery

Prof Jan Aerts

Faculty of Engineering - ESAT/SCD

[email protected]

TA: Alejandro Sifrim ([email protected])

1

Page 2: Next-generation sequencing - variation discovery

Context

2

Page 3: Next-generation sequencing - variation discovery

Types of genomic variation

SNPs vs structural variation

3

Page 4: Next-generation sequencing - variation discovery

A - Single nucleotide polymorphisms (SNPs)

4

Page 5: Next-generation sequencing - variation discovery

What are SNPs and why are they important?

• SNP = single nucleotide polymorphism

• It’s the differences that matter:

• Human vs chimp: 98% identical (<2 differences every 100bp)

• Between any 2 individuals: 1 difference every 1000bp

• Disease: A or G == life or death

• Mutations can result in:

• change in level of transcription or translation (loss/gain)

• change in protein structure

5

Page 6: Next-generation sequencing - variation discovery

6

Page 7: Next-generation sequencing - variation discovery

SNP discovery - overview

generate sequence reads

➡ map reads to reference sequence

➡ convert from read-based to position-based (“pileup”)

➡ identify differences

7

Page 8: Next-generation sequencing - variation discovery

8

Page 9: Next-generation sequencing - variation discovery

9

Page 10: Next-generation sequencing - variation discovery

10

Page 11: Next-generation sequencing - variation discovery

11

Page 12: Next-generation sequencing - variation discovery

Monet “Meule, Effet de Neige, le Matin”

Not a trivial problem...

12

Page 13: Next-generation sequencing - variation discovery

Many SNP callers:

• samtools

• GATK

• SOAPsnp

• ...

Read-based -> position-based

Here: (1) samtools -> pileup; (2) GATK -> VCF

13

Page 14: Next-generation sequencing - variation discovery

pileup

14

Page 15: Next-generation sequencing - variation discovery

15

Page 16: Next-generation sequencing - variation discovery

pileup

16

1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

alignment mapping quality
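To make the read-bases column of the pileup concrete, here is a minimal Python sketch that counts the bases seen at one position. It is a simplification, not how samtools does it: the function name `count_pileup_bases` is made up for illustration, `.` and `,` are counted as the reference base, read-start markers (`^` plus one mapping-quality character) and read-end markers (`$`) are dropped, and indel annotations are skipped rather than interpreted.

```python
import re
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed in one pileup column (simplified sketch)."""
    # Remove "^" + mapping-quality character (start of a read) and "$" (end of a read).
    cleaned = re.sub(r"\^.", "", read_bases).replace("$", "")
    counts = Counter()
    i = 0
    while i < len(cleaned):
        ch = cleaned[i]
        if ch in "+-":
            # Indel annotation: sign, length, then that many bases. Skip it.
            j = i + 1
            while j < len(cleaned) and cleaned[j].isdigit():
                j += 1
            i = j + int(cleaned[i + 1:j])
            continue
        if ch in ".,":
            # Match to the reference on the forward (.) or reverse (,) strand.
            counts[ref_base.upper()] += 1
        elif ch.upper() in "ACGTN":
            # Mismatch; lower case means reverse strand.
            counts[ch.upper()] += 1
        i += 1
    return counts


# Position 277 from the pileup above (reference base T).
print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

At position 277 this gives 12 T, 7 C and 3 G, which is why that site looks suspicious: a clean diploid SNP should show at most two alleles.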

Page 17: Next-generation sequencing - variation discovery

Intermezzo: quality scores

“Phred-score”: used for sequence quality as well as mapping quality

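Phred-score Q = -10 × log10(p), with p the probability of an error

Chance of 1/1000 that read is mapped at wrong position: p = 10^-3 => phred-score = 30

Chance of 1/100 that read is mapped at wrong position: p = 10^-2 => phred-score = 20

Sanger encoding (ASCII character = phred-score + 33): quality score 30 = “?”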
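The conversions on this slide can be sketched in a few lines of Python. The function names are made up for illustration; the only assumptions are the standard Phred definition Q = -10 log10(p) and the Sanger offset of 33 between the quality score and the ASCII code of the character.

```python
import math


def phred_from_error_prob(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Error probability for a Phred score q."""
    return 10 ** (-q / 10)


def sanger_char(q):
    """Character that encodes Phred score q in Sanger (Phred+33) encoding."""
    return chr(q + 33)


def sanger_quality(ch):
    """Phred score encoded by a Sanger (Phred+33) character."""
    return ord(ch) - 33


print(round(phred_from_error_prob(0.001)))  # 1/1000 chance of error
print(sanger_char(30))
print(sanger_quality("<"))  # the most common character in the pileup example
```

With offset 33, a score of 30 is written as `?`, and the `<` that dominates the pileup qualities corresponds to a score of 27.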

17

Page 18: Next-generation sequencing - variation discovery

pileup

18


Page 19: Next-generation sequencing - variation discovery

Heterozygous SNPs and the binomial distribution

SNPs are bi-allelic => allele combinations for heterozygous SNP follow binomial distribution

outcome = binary (red/white, head/tail, yes/no, A/G)

probability p of the outcome of a single draw is the same for all draws

E.g. 8 A’s + 12 G’s = SNP?

hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8; probability p of outcome in single draw = 0.5

table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob

8 A’s given coverage of 20 => cumulative probability P(X <= 8) = 0.252 > 0.05 => heterozygous hypothesis not rejected => heterozygote
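Instead of a lookup table, the cumulative binomial probability can be computed directly. The sketch below uses only the Python standard library; `cumulative_binomial` is a name chosen here for illustration, and the 0.05 threshold is the one used on the slide.

```python
from math import comb


def cumulative_binomial(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# 8 A's out of 20 reads, under the heterozygous hypothesis (p = 0.5).
prob = cumulative_binomial(8, 20)
print(round(prob, 3))  # 0.252
print("consistent with heterozygote" if prob > 0.05 else "heterozygote unlikely")
```

For comparison, 2 A's out of 20 reads would give a cumulative probability of about 0.0002, so the heterozygous hypothesis would be rejected and the two A's would more likely be sequencing errors.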

19

Page 20: Next-generation sequencing - variation discovery

20

Page 21: Next-generation sequencing - variation discovery

samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  > output.pileup

samtools

21

Page 22: Next-generation sequencing - variation discovery

VCF file

##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

22

Page 23: Next-generation sequencing - variation discovery

VCF file

23

The three parts of the VCF file shown on the previous slide:

file header: the lines starting with ##

column header: the line starting with #CHROM

actual data: the remaining lines, one per variant

Page 24: Next-generation sequencing - variation discovery

VCF file

24

INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE

FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00

GT = genotype

DP = depth

GQ = genotype quality

1/1 = homozygous non-reference

0/1 = heterozygous
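As a rough illustration of how the INFO, FORMAT and sample columns fit together, the following Python sketch parses one data line into dictionaries. It is not a full VCF parser: the function names are invented here, it assumes exactly one sample column, and it splits on any whitespace because the slide text uses spaces (a real VCF file is tab-delimited).

```python
def parse_vcf_record(line):
    """Parse one single-sample VCF data line into a dictionary (sketch)."""
    fields = line.split()
    chrom, pos, snp_id, ref, alt, qual, filt, info, fmt, sample = fields[:10]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without "=" are flags, e.g. DB (dbSNP membership).
        info_dict[key] = value if sep else True
    return {
        "chrom": chrom, "pos": int(pos), "id": snp_id,
        "ref": ref, "alt": alt, "qual": float(qual), "filter": filt,
        "info": info_dict,
        # Pair each FORMAT key (GT, DP, GQ) with the sample's value.
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


def zygosity(gt):
    """Translate a diploid GT string such as 0/1 or 1/1 into words."""
    first, second = gt.replace("|", "/").split("/")
    if first != second:
        return "heterozygous"
    return "homozygous reference" if first == "0" else "homozygous non-reference"


record = parse_vcf_record(
    "1 856182 rs9988021 G A 36.00 0;TARGET "
    "DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00"
)
print(record["sample"], zygosity(record["sample"]["GT"]))
```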

Page 25: Next-generation sequencing - variation discovery

java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod

GATK

25

Page 26: Next-generation sequencing - variation discovery

SNP annotation

26

by piculak (Flickr)

Page 27: Next-generation sequencing - variation discovery

We have: chromosome + position + alleles

We need:

• in gene?

• damaging?

will be basis for filtering

SIFT (http://sift.bii.a-star.edu.sg), annovar, PolyPhen, ...

27

Page 28: Next-generation sequencing - variation discovery

28

SIFT

input:

3,81780820,1,T/C
2,43881517,1,A/T
2,43857514,1,T/C

output:

#SNP codon substitution region type prediction gene OMIM
3,81780820,1,T/C AGA-gGA R190G EXON CDS Nonsynonymous DAMAGING GBE1 POLYGLUCOSAN BODY DISEASE
2,43881517,1,A/T ATA-tTA I230L EXON CDS Nonsynonymous TOLERATED DYNC2LI1
2,43857514,1,T/C TTT-TcT F33S EXON CDS Nonsynonymous TOLERATED DYNC2LI1

Page 29: Next-generation sequencing - variation discovery

29


Page 30: Next-generation sequencing - variation discovery

SNP filtering

2 aspects:

• filtering to improve quality of SNP calls

• filtering to find likely candidates

30

Page 31: Next-generation sequencing - variation discovery

Reduce false positives without increasing false negatives:

• depth of coverage

• mapping quality

• SNP clusters

• allelic balance (diploid genome)

• number of reads with mq0

• consequence

Filtering to improve quality

31

Page 32: Next-generation sequencing - variation discovery

java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < 20' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'

GATK

32

Page 33: Next-generation sequencing - variation discovery

VCF file

33

##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

Page 34: Next-generation sequencing - variation discovery

Transition/transversion ratio

Transition/transversion ratio Ti/Tv

random: Ti/Tv = 0.5

whole genome: Ti/Tv = 2.0-2.1

exome: Ti/Tv = 3-3.5
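The Ti/Tv ratio is easy to compute from a list of (reference, alternative) allele pairs. The Python sketch below uses invented function names; it also shows where the "random" value of 0.5 comes from: of the 12 possible substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ti/Tv for a list of (ref, alt) pairs; infinite if there are no transversions."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


# Every possible substitution exactly once: 4 transitions, 8 transversions.
all_substitutions = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
print(ti_tv_ratio(all_substitutions))  # 0.5
```

A call set whose Ti/Tv drifts from the expected range towards 0.5 probably contains many false positives, since errors are closer to random substitutions.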

34

Page 35: Next-generation sequencing - variation discovery

Novel SNPs

Number of novel SNPs

exome:

total = 20k - 25k

novel = 1k - 3k

35

Page 36: Next-generation sequencing - variation discovery

Factors that influence SNP accuracy

• sequencing technology

• mapping algorithms and parameters

• post-mapping manipulation

duplicate removal, base quality recalibration, local realignment around indels, ...

• SNP calling algorithms and parameters

36

Page 37: Next-generation sequencing - variation discovery

Specificity vs sensitivity

37

true positives

false positives

Page 38: Next-generation sequencing - variation discovery

Filtering to find likely candidates

Which are the most interesting?

• only high quality: DP, QUAL, AB, but keep an eye on Ti/Tv

• novel

• loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)

• found in multiple individuals

• conserved

• homozygous non-reference or compound heterozygous

38

Page 39: Next-generation sequencing - variation discovery

Disease model

• dominant: a single heterozygous SNP is damaging

• recessive: either homozygous non-reference or compound heterozygous necessary to lead to disease phenotype

(e.g. phenylketonuria: cannot convert phenylalanine to tyrosine. Can lead to: mental retardation, microcephaly, ...)
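Under a recessive model, the filter above translates into a per-gene check. The Python sketch below is only an illustration under simplifying assumptions: the function name and the input layout (gene name plus GT string per damaging variant) are invented, the genotypes in the example are made up, and two heterozygous variants in one gene are treated as a possible compound heterozygote even though, without phasing, they could sit on the same chromosome copy.

```python
from collections import defaultdict


def recessive_candidate_genes(variants):
    """Genes with a homozygous non-reference variant or at least two heterozygous ones.

    `variants` is a list of (gene, genotype) pairs for variants that were
    already filtered down to high-quality, predicted-damaging calls.
    """
    genotypes_per_gene = defaultdict(list)
    for gene, genotype in variants:
        genotypes_per_gene[gene].append(genotype)

    candidates = set()
    for gene, genotypes in genotypes_per_gene.items():
        homozygous = any(gt == "1/1" for gt in genotypes)
        heterozygous = sum(1 for gt in genotypes if gt in ("0/1", "1/0"))
        if homozygous or heterozygous >= 2:
            candidates.add(gene)
    return candidates


# Hypothetical genotypes, for illustration only.
calls = [("GBE1", "1/1"), ("DYNC2LI1", "0/1"), ("DYNC2LI1", "0/1"), ("GENE_X", "0/1")]
print(sorted(recessive_candidate_genes(calls)))
```

For a dominant model the check is simpler: any gene with at least one heterozygous damaging variant is a candidate.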

39

Page 40: Next-generation sequencing - variation discovery

B - Structural variation

40

Page 41: Next-generation sequencing - variation discovery

Why bother?

Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004

Redon et al, Nature 2006: 12% of genome is covered by copy number variable regions (270 individuals) => more nucleotide content per genome than SNPs

• colour vision in primates

• CCL3L1 copy number -> susceptibility to HIV

• AMY1 copy number -> diet

=> “the dynamic genome”

41

Page 42: Next-generation sequencing - variation discovery

42

Case 1: Evolution - chromosome fusion

Page 43: Next-generation sequencing - variation discovery

human chromosome 2

chimp chromosome 12

chimp chromosome 13

by Beth Kramer

43

Page 44: Next-generation sequencing - variation discovery

Molecular Biology of the Cell, 4th Edition

colorectal cancer karyotype

normal karyotype

44

Case 2: Cancer - rearranged genome

Page 45: Next-generation sequencing - variation discovery

Robberecht et al, 2010

45

Case 3: Embryogenesis - “abnormal” cells

segmental chromosomal imbalances

mosaicism for whole chromosomes

uniparental isodisomy

Page 46: Next-generation sequencing - variation discovery

46

Case 4: Down Syndrome = trisomy 21

Page 47: Next-generation sequencing - variation discovery

Types of structural variation

Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009

47

Page 48: Next-generation sequencing - variation discovery

Types of structural variation

48

Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009

CNV = Copy Number Variation

Page 49: Next-generation sequencing - variation discovery

Copy number variation (CNV)

Not equally distributed over genome: more pericentromeric and subtelomeric (especially in primates)

Pericentromeric & subtelomeric regions: bias towards interchromosomal rearrangements; interstitial regions: bias towards intrachromosomal

Generation of duplications:

pericentromeric: 2-stage model (Sharp & Eichler, 2006)

1. series of seeding events: one or more progenitor loci transpose together to a pericentromeric receptor => generates a mosaic block of duplicated segments derived from different loci

2. inter- & intrachromosomal duplication => large blocks are duplicated near other centromeres

subtelomeric: due to normal recombination: cross-overs lead to translocation of distal sequences between chromosomes

49

Page 50: Next-generation sequencing - variation discovery

Copy number variation and segmental duplications

Close relationship between CNVs and segmental duplications (aka low-copy repeats aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least 90% sequence similarity):

• Copy number variation that is fixed in population = segmental duplication (in other words: segmental duplications started out themselves as copy number variations)

• Segmental duplications can stimulate formation of new CNVs due to NAHR (see later)

➡ In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap with segmental duplications

➡ 80% of human segmental duplications arose after the divergence of the Great Apes from the rest of the primates

50

Page 51: Next-generation sequencing - variation discovery

Effects of structural variation

51

Feuk et al, 2006

Page 52: Next-generation sequencing - variation discovery

Mechanisms of formation for structural variation

52

Gu et al, 2008

Page 53: Next-generation sequencing - variation discovery

Mechanisms: NAHR

53

NAHR = non-allelic homologous recombination

often between segmental duplications

• can recur

• clustered breakpoints

• larger

Hastings et al, 2009

Page 54: Next-generation sequencing - variation discovery

Mechanisms: NHEJ

54

Gu et al, 2008

NHEJ = non-homologous end-joining

pathway to repair double-strand breaks, but may lead to translocations and telomere fusion

not associated with segmental duplications

• more scattered

• unique origins

• smaller

Page 55: Next-generation sequencing - variation discovery

Mechanisms: FoSTeS

55

Hastings et al, 2009

FoSTeS = DNA replication fork-stalling and template switching

can occur multiple times in series => can generate very complex rearrangements

Page 56: Next-generation sequencing - variation discovery

Feuk et al, 2006

Discovery of structural variation

56

Page 57: Next-generation sequencing - variation discovery

Approaches for discovery

• karyotyping, fluorescent in situ hybridization (FISH)

• array comparative genomic hybridization (aCGH)

• next-generation sequencing: combination of:

• read pair information

• read depth information

• split read information

• for fine-mapping breakpoints: local assembly

=> identify signatures

57

Page 58: Next-generation sequencing - variation discovery

Feuk et al, 2006

FISH = fluorescence in situ hybridization

duplication

inversion

duplication

Structural variation discovery using FISH

58

Page 59: Next-generation sequencing - variation discovery

Structural variation discovery using aCGH

59

Xie & Tammi, 2009

aCGH = array comparative genomic hybridization

Page 60: Next-generation sequencing - variation discovery

60

http://www.breenlab.org/array.html

Page 61: Next-generation sequencing - variation discovery

61

van de Wiel et al, 2010

Page 62: Next-generation sequencing - variation discovery

Structural variation discovery using next-generation sequencing

General approaches:

1.Read depth

2.Read pairs

3.Split reads

62

Page 63: Next-generation sequencing - variation discovery

Structural variation discovery: read depth

Xie & Tammi, 2009

63

Page 64: Next-generation sequencing - variation discovery

Workflow

1.Mapping

2.Read filtering

3.GC correction

4.Spike identification

5.Validation

64

Page 65: Next-generation sequencing - variation discovery

General principle

• Similar to aCGH: using reference RD file (e.g. from 1000Genomes Project)

• In theory: higher resolution, but noisier than aCGH

• Algorithms not mature yet

• More complex steps

➡Data binned
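The binning idea can be sketched as follows: count read start positions per fixed-size window in the test sample and in the reference sample, and compare them as a log2 ratio. This Python sketch is an assumption-laden illustration, not the method of any particular tool: the function names and window size are invented, and the only normalisation is scaling by the total number of reads.

```python
import math
from collections import Counter


def bin_counts(positions, window):
    """Number of reads starting in each window (window index = position // window)."""
    return Counter(pos // window for pos in positions)


def log2_ratios(test_positions, ref_positions, window):
    """log2(test/reference) read depth per window, scaled for library size."""
    test = bin_counts(test_positions, window)
    ref = bin_counts(ref_positions, window)
    scale = len(ref_positions) / len(test_positions)
    # Windows without reads in either sample are skipped to avoid log(0).
    return {
        w: math.log2(test[w] * scale / ref[w])
        for w in sorted(ref)
        if test[w] > 0
    }


# Toy example: window 0 has more test reads than reference reads, window 1 fewer.
test_reads = [5, 15, 25, 35, 1005, 1015]
ref_reads = [5, 15, 25, 1005, 1015, 1025]
print(log2_ratios(test_reads, ref_reads, window=1000))
```

Positive values point to a possible gain and negative values to a possible loss in the test sample; in real data the ratios are noisy, which is why normalisation and segmentation steps follow.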

65

Page 66: Next-generation sequencing - variation discovery

66

Page 67: Next-generation sequencing - variation discovery

67

van de Wiel et al, 2010

Page 68: Next-generation sequencing - variation discovery

Xie & Tammi, 2009

68

Page 69: Next-generation sequencing - variation discovery

69

CNV = copy number variation

Combining CNV data for >1 individuals/samples

Page 70: Next-generation sequencing - variation discovery

70

CNVR = copy number variation region

CNVR = any region covered by at least 1 CNV

Page 71: Next-generation sequencing - variation discovery

71

CNVE = copy number variation event

CNVE = subgroups of CNVR with >= 50% reciprocal overlap
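Both definitions can be sketched in Python. The function names are invented for illustration, and intervals are assumed to be (start, end) pairs on the same chromosome with the end position exclusive. CNVRs are the union of all overlapping CNVs; the reciprocal overlap used for CNVEs is the overlap expressed as a fraction of each of the two CNVs, taking the smaller fraction.

```python
def merge_into_cnvrs(cnvs):
    """Merge overlapping (or touching) CNV intervals into CNV regions."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it if needed.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by the other."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


cnvs = [(100, 200), (150, 250), (400, 500)]
print(merge_into_cnvrs(cnvs))                      # [(100, 250), (400, 500)]
print(reciprocal_overlap((100, 200), (150, 250)))  # 0.5
```

Two CNVs with a reciprocal overlap of at least 0.5 would be grouped into the same CNVE; a small CNV lying inside a much larger one is in the same CNVR but not in the same CNVE.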

Page 72: Next-generation sequencing - variation discovery

Data normalization

• Mainly: GC

• Other: repeat-rich regions, mapping Q, ...

• Fit a linear model of read depth (RD) against GC content and correct for it => noise decreases
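A minimal sketch of GC normalisation in Python, using an ordinary least-squares line and only the standard library. The function names are invented, and a straight line is the simplest possible model (the slide's suggestion); real read depth often depends on GC content in a non-linear way, so this is an illustration rather than a recommended procedure.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gc_correct(gc_content, read_depth):
    """Remove the linear GC trend from per-window read depth, keeping the mean depth."""
    slope, intercept = fit_line(gc_content, read_depth)
    mean_depth = sum(read_depth) / len(read_depth)
    return [
        depth - (slope * gc + intercept) + mean_depth
        for gc, depth in zip(gc_content, read_depth)
    ]


# Toy windows in which depth depends only on GC content.
gc = [0.3, 0.4, 0.5, 0.6]
depth = [20, 30, 40, 50]
print(gc_correct(gc, depth))
```

In this toy example all corrected values end up at the mean depth of 35, because GC content explains all of the variation; in real data what remains after correction is the signal (and residual noise) passed on to segmentation.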

72

Page 73: Next-generation sequencing - variation discovery

Segmentation

• Identify spikes

• Many segmentation algorithms, e.g. GADA

• Issues: setting parameters: when to cut off peaks?

• Combine outputs from different runs with different parameters

• Compare to known CNVs

73

Page 74: Next-generation sequencing - variation discovery

74

Xie & Tammi, 2009

Page 75: Next-generation sequencing - variation discovery

75

Xie & Tammi, 2009

peak

Page 76: Next-generation sequencing - variation discovery

76

Xie & Tammi, 2009

...but is this?

Page 77: Next-generation sequencing - variation discovery

77

Abyzov et al

Page 78: Next-generation sequencing - variation discovery

Drawbacks

• Can only find unbalanced structural variation (i.e. CNVs)

• How to assess specificity and sensitivity? => compare with known CNVs

• Database of Genomic Variants DGV (http://projects.tcag.ca/variation/)

• Decipher (http://decipher.sanger.ac.uk/)

• Breakpoints: unknown

• Different parameters for rare vs common CNVs => which?

78

Page 79: Next-generation sequencing - variation discovery

Structural variation discovery: read pairs

79

50

Korbel et al, 2007

Page 80: Next-generation sequencing - variation discovery

Discordant readpairs

• Orientation

• Distance

• Plot insert size distribution for chromosome

• Very long tail!! => difficult to set cutoff: 4 MAD or 0.01%?
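One of the cutoffs mentioned here, the median plus or minus 4 MAD (median absolute deviation), can be sketched in Python with the standard library. The function names are invented for illustration. The median and MAD are used instead of the mean and standard deviation because they are barely affected by the long tail.

```python
from statistics import median


def insert_size_cutoffs(insert_sizes, n_mads=4):
    """Lower and upper cutoff: median -/+ n_mads * median absolute deviation."""
    centre = median(insert_sizes)
    mad = median(abs(size - centre) for size in insert_sizes)
    return centre - n_mads * mad, centre + n_mads * mad


def discordant_by_distance(insert_sizes, n_mads=4):
    """Insert sizes that fall outside the cutoffs."""
    low, high = insert_size_cutoffs(insert_sizes, n_mads)
    return [size for size in insert_sizes if size < low or size > high]


sizes = [190, 195, 200, 205, 210, 2000]
print(insert_size_cutoffs(sizes))     # (172.5, 232.5)
print(discordant_by_distance(sizes))  # [2000]
```

Pairs that map much further apart than expected suggest a deletion in the sample relative to the reference; pairs that map too close together suggest an insertion.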

80

Page 81: Next-generation sequencing - variation discovery

Read pair signatures

Medvedev et al, 2009

81

Page 82: Next-generation sequencing - variation discovery

Real data

82

Page 83: Next-generation sequencing - variation discovery

Read pair workflow

1. Map reads

2. Identify discordant pairs

3. Cluster on location

4. Filter on number of readpairs per cluster

5. Filter on read depth

6. Filter on mapping quality for read pairs

7. Identify signatures

8. (Optionally) create alternative reference

9. Validate

83

Page 84: Next-generation sequencing - variation discovery

84

figure by Klaudia Walter

Page 85: Next-generation sequencing - variation discovery

85

figure by Klaudia Walter

Page 86: Next-generation sequencing - variation discovery

86

figure by Klaudia Walter

Page 87: Next-generation sequencing - variation discovery

87

figure by Klaudia Walter

Page 88: Next-generation sequencing - variation discovery

88

figure by Klaudia Walter

Page 89: Next-generation sequencing - variation discovery

Clustering

• “standard clustering strategy”

• only consider mate pairs that do not have concordant mappings

• ignore read pairs that have more than one good mapping

• clustering: use insert size distribution (e.g. 2x4 MAD)
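A very simplified version of clustering on location can be sketched in Python as a greedy pass over the sorted discordant pairs. Everything here is an assumption made for illustration: the function name, the representation of a pair as (position of left read, position of right read) on one chromosome, and the single `max_gap` parameter standing in for a window derived from the insert size distribution. Published tools use more careful procedures.

```python
def cluster_read_pairs(pairs, max_gap, min_pairs=2):
    """Group discordant pairs whose left and right reads both map close together.

    A pair joins the current cluster if both of its reads lie within
    max_gap of the previous pair in that cluster. Clusters supported by
    fewer than min_pairs pairs are dropped.
    """
    clusters = []
    for pair in sorted(pairs):
        if clusters:
            previous = clusters[-1][-1]
            if pair[0] - previous[0] <= max_gap and abs(pair[1] - previous[1]) <= max_gap:
                clusters[-1].append(pair)
                continue
        clusters.append([pair])
    return [cluster for cluster in clusters if len(cluster) >= min_pairs]


# Three pairs support the same event; the fourth stands alone and is filtered out.
discordant = [(1000, 6000), (1050, 6040), (1100, 6100), (50000, 50300)]
print(cluster_read_pairs(discordant, max_gap=200))
```

Requiring a minimum number of pairs per cluster removes isolated discordant pairs, which are more likely to be chimeric fragments or mapping errors than real structural variants.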

89

Page 90: Next-generation sequencing - variation discovery

Clustering: issues

• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)

• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2stdev?)

• Low library quality or mix of libraries => multiple peaks in size distribution

90

Page 91: Next-generation sequencing - variation discovery

Filtering

• On number of RPs per cluster

• normally: n = 2

• for high coverage (e.g. 1000Genomes pilot 2: 80X): n = 5

• On drop in read depth and split reads

• On (mappingQ x nrRP)

• if published data available: look at specificity and sensitivity for different cutoffs mQ x nrRP

• if not: very difficult

91

Page 92: Next-generation sequencing - variation discovery

Filtering: issues

• Large insert size: low resolution for detecting breakpoints

• Small insert size: low resolution for detecting complex regions

92

Page 93: Next-generation sequencing - variation discovery

Structural variation discovery: split reads

93

Page 94: Next-generation sequencing - variation discovery

Mapping

• short subsequences => many possible mappings

• solution: “anchored split mapping” (e.g. Pindel)

94

Medvedev et al, 2009

Page 95: Next-generation sequencing - variation discovery

Local reassembly

• Aim: to determine breakpoints

• Which reads?

• for deletions: local reads

• for insertions: hanging reads for read pairs with only one read mapped

• (rather not: unmapped reads)

• For large region: split up

95

Page 96: Next-generation sequencing - variation discovery

96

Page 97: Next-generation sequencing - variation discovery

97

Nielsen et al, 2009

sequence reads -> contigs (using sequence overlap)

contigs -> scaffolds (using read-pair information)

1 scaffold contigs

Page 98: Next-generation sequencing - variation discovery

98

              +                                  -
read depth    conceptually simple                only unbalanced (CNVs); low resolution
read pairs    wide range of types of variation   complicated
split reads   basepair resolution                very small reads

General conclusions NGS & structural variation (1)

Page 99: Next-generation sequencing - variation discovery

General conclusions NGS & structural variation (2)

• Available algorithms: more to demonstrate technique than comprehensive solution

• Difficult => different software = different results => “consensus set”

• based on read pairs and split reads: many sets agree

• based on read depth: totally different

• sometimes drop in read depth, but no aberrant read pairs spanning the region => why???

• Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results

99

Page 100: Next-generation sequencing - variation discovery

Software for structural variation discovery

100

Medvedev et al, 2009

Page 101: Next-generation sequencing - variation discovery

Chris Yoon

101

Page 102: Next-generation sequencing - variation discovery

Chris Yoon

102

Page 103: Next-generation sequencing - variation discovery

103

Websites

http://www.broadinstitute.org/gatk

http://samtools.sourceforge.net

http://picard.sourceforge.net

http://www.annotate-it.org

http://bit.ly/siftsnp

Page 104: Next-generation sequencing - variation discovery

References and software

• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)

• Lee S et al. Bioinformatics 24:i59-i67 (2008)

• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)

• Campbell P et al. Nat Genet 40:722-729 (2008)

• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)

• Chen K et al. Genome Res 19:1527-1741 (2009)

• Yoon S et al. Genome Res 19:1586-1592 (2009)

• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)

• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)

• Hastings P et al. Nat Rev Genet 10:551-564 (2009)

104

Page 105: Next-generation sequencing - variation discovery

Exercises

105

Page 106: Next-generation sequencing - variation discovery

Finding SNPs using Galaxy

Based on the SAM-file you created in Galaxy in the last lecture, create a list of SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter only return variants where the coverage is larger than 3 and the base quality is larger than 20.

How many SNPs do you find?

Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered file you just created)

106

Page 107: Next-generation sequencing - variation discovery

Finding SNPs using samtools

Using the SAM file you created in the last lecture on the linux command line: Generate a BAM file and sort it. Next, generate a pileup for that BAM file using ~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print the variant sites and also compute the reference sequence (run “samtools pileup” without arguments to get more info).

How many SNPs are identified? Is the SNP at position 139,391,636 heterozygous or homozygous-non-reference? And the one at 139,399,365? Do you trust the SNP at 139,401,304?

107

Page 108: Next-generation sequencing - variation discovery

Annotating and filtering SNPs

Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo sapiens build NCBI36. Make sure to let SIFT send the results by email.

How many SNPs are in/near genes?

How many are in exons?

What percentage of the SNPs is predicted damaging?

108

Page 109: Next-generation sequencing - variation discovery

Structural variation

We’ll be looking at copy number variation using the cnv-seq package. This software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/

We’ll be running the example from the cnv-seq tutorial at http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!)

• Log into the server mentioned on Toledo.

• Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits:

/mnt/apps/cnv-seq/current/cnv-seq.pl \
  --test ~jaerts/i0d51a/test_1.hits \
  --ref ~jaerts/i0d51a/ref_1.hits \
  --genome chrom1 \
  --log2 0.6 \
  -p 0.001 \
  --bigger-window 1.5 \
  --annotate \
  --minimum-windows 4

• Finally investigate in R. Start R by typing “R”. Then:

library(cnv)
data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv')
cnv.print(data)
cnv.summary(data)
plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6)
ggsave('sample_1.pdf')

• Describe the main features in the plot.

109