Next-generation sequencing and structural variation Jan Aerts Wellcome Trust Sanger Institute [email protected]

ECCB10 talk - Next-generation sequencing and structural variation


Page 1: ECCB10 talk - Next-generation sequencing and structural variation

Next-generation sequencing and structural variation

Jan Aerts, Wellcome Trust Sanger Institute

[email protected]

Page 2: ECCB10 talk - Next-generation sequencing and structural variation

principles & pitfalls

vs

list of commands

Page 3: ECCB10 talk - Next-generation sequencing and structural variation

What is structural variation?

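• “variation that changes the structure of a chromosome”

• Mechanisms: NAHR (non-allelic homologous recombination), NHEJ (non-homologous end joining), FoSTeS (fork stalling and template switching)
• This presentation: focus on discovery (not: genotyping)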

“experiment 4” from last slide Thomas

Page 4: ECCB10 talk - Next-generation sequencing and structural variation

Types of structural variation

Page 5: ECCB10 talk - Next-generation sequencing and structural variation

Approaches for discovery

Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly

=> Identify signatures

Page 6: ECCB10 talk - Next-generation sequencing and structural variation

A. Read Pairs

Page 7: ECCB10 talk - Next-generation sequencing and structural variation

RP - General principle

• Paired-end library => insert size
• Orientation/distance

Page 8: ECCB10 talk - Next-generation sequencing and structural variation

RP - Signatures

Medvedev et al, 2009

Page 9: ECCB10 talk - Next-generation sequencing and structural variation

RP - Real world

Page 10: ECCB10 talk - Next-generation sequencing and structural variation

RP - Workflow overview

• Mapping
• Identify discordant read pairs
• Cluster on location
• Filter on nr RPs/cluster
• Filter on RD
• Filter: mappingQ x #readpairs
• Identify signatures
• Alternative reference
• Validate

Page 11: ECCB10 talk - Next-generation sequencing and structural variation

RP - Mapping

• Provides raw data => crucial
• MAQ/bwa
– only report one hit, even for ambiguous reads (mappingQ = 0)
– MAQ might prefer mismatches to aberrant distance!
• Insert size = a distribution rather than an exact value

Page 12: ECCB10 talk - Next-generation sequencing and structural variation

RP - Discordant readpairs

• Orientation
• Distance
– Plot insert size distribution for chromosome
– Very long tail! => difficult to set cutoff: 4 MAD (median absolute deviations) or 0.01%?
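The MAD-based cutoff can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the function names, the toy insert sizes, and the default of 4 MAD are assumptions chosen to match the slide.

```python
import statistics

def mad_cutoffs(insert_sizes, n_mad=4):
    """Return (low, high) bounds as median +/- n_mad * MAD."""
    med = statistics.median(insert_sizes)
    # MAD = median of the absolute deviations from the median
    mad = statistics.median([abs(x - med) for x in insert_sizes])
    return med - n_mad * mad, med + n_mad * mad

def is_discordant(insert_size, proper_orientation, low, high):
    """Discordant = unexpected orientation OR distance outside the bounds."""
    return (not proper_orientation) or insert_size < low or insert_size > high

# Toy insert sizes with a long tail (hypothetical data)
sizes = [195, 198, 200, 200, 201, 203, 205, 210, 950, 5000]
low, high = mad_cutoffs(sizes)
flags = [is_discordant(s, True, low, high) for s in sizes]
```

Because the median and MAD are robust statistics, the two extreme pairs in the tail barely influence the bounds, which is one reason to prefer them over mean and standard deviation here.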

Page 13: ECCB10 talk - Next-generation sequencing and structural variation
Page 14: ECCB10 talk - Next-generation sequencing and structural variation
Page 15: ECCB10 talk - Next-generation sequencing and structural variation
Page 16: ECCB10 talk - Next-generation sequencing and structural variation
Page 17: ECCB10 talk - Next-generation sequencing and structural variation
Page 18: ECCB10 talk - Next-generation sequencing and structural variation

RP - Clustering

“Standard clustering strategy”
– Only consider mate pairs that do not have concordant mappings
– Ignore read pairs that have more than one good mapping

Clustering: use insert size distribution (e.g. 2 x 4 MAD)
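A rough sketch of clustering discordant pairs on location, assuming a maximum gap of 2 x 4 MAD between neighbouring pairs. The `ReadPair` record and `cluster_pairs` function are hypothetical; real SV callers also take orientation and library into account.

```python
from collections import namedtuple

# Hypothetical minimal record: chromosome plus start of left and right read
ReadPair = namedtuple("ReadPair", "chrom left right")

def cluster_pairs(pairs, max_gap):
    """Greedy clustering: a pair joins the current cluster if both of its
    ends lie within max_gap of the previous pair's ends."""
    clusters = []
    for p in sorted(pairs, key=lambda p: (p.chrom, p.left)):
        last = clusters[-1][-1] if clusters else None
        if (last is not None and p.chrom == last.chrom
                and p.left - last.left <= max_gap
                and abs(p.right - last.right) <= max_gap):
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

mad = 3.5                 # assumed MAD of the insert size distribution
max_gap = 2 * 4 * mad     # "2 x 4 MAD" from the slide
pairs = [
    ReadPair("1", 1000, 6000), ReadPair("1", 1010, 6015),
    ReadPair("1", 1020, 6005), ReadPair("1", 50000, 58000),
    ReadPair("2", 1005, 6002),
]
clusters = cluster_pairs(pairs, max_gap)
```

Comparing only against the most recently added pair keeps the sketch short, but it means that an outlier sorted into the middle of a true cluster can split it in two.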

Page 19: ECCB10 talk - Next-generation sequencing and structural variation

RP - Clustering: issues

• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)

• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)

• Low library quality or mix of libraries => multiple peaks in size distribution

Page 20: ECCB10 talk - Next-generation sequencing and structural variation

RP - Filtering

• On nr RPs/cluster
– Normally: n=2
– For high coverage (e.g. pilot 2: 80X): n=5

• On drop in RD & SR
• On (mappingQ x nrRP)
– If published data available: ROC for different cutoffs of mQ x nrRP
– If not: very difficult
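The two cluster-level filters might look like this in code. The slide does not say how mappingQ is summarised per cluster; this sketch assumes the mean mapping quality, and the `min_score` threshold of 60 is purely illustrative.

```python
def passes_filters(mapqs, min_pairs=2, min_score=60):
    """Keep a cluster if it has enough read pairs and if
    (mean mappingQ x number of read pairs) reaches min_score.
    mapqs: one mapping quality per read pair in the cluster."""
    n_pairs = len(mapqs)
    if n_pairs < min_pairs:
        return False
    mean_mapq = sum(mapqs) / n_pairs   # assumption: mean as the summary
    return mean_mapq * n_pairs >= min_score

# Hypothetical clusters: name -> mapping qualities of supporting pairs
clusters = {"a": [30, 35, 40], "b": [60], "c": [5, 10]}
kept = [name for name, mq in clusters.items() if passes_filters(mq)]
kept_high_cov = [name for name, mq in clusters.items()
                 if passes_filters(mq, min_pairs=5)]
```

With published calls available, the same function could be evaluated over a range of `min_score` values to draw the ROC curve mentioned on the slide.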

Page 21: ECCB10 talk - Next-generation sequencing and structural variation

RP - Issues

• Difficult => different groups = different results; “consensus set”
– RP & SR: many sets agree
– RD: totally different

• CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why??

• Mapper = critical; MAQ/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results

Page 22: ECCB10 talk - Next-generation sequencing and structural variation

RP - Issues (2)

• Large insert size: low resolution for detecting breakpoints

• Small insert size: low resolution for detecting complex regions

Page 23: ECCB10 talk - Next-generation sequencing and structural variation

B. Read Depth

Page 24: ECCB10 talk - Next-generation sequencing and structural variation

RD - General principle

• Similar to aCGH: using reference RD file (e.g. based on 1kG)

• In theory: higher resolution, but noisier than aCGH
– Algorithms not mature yet
– More complex steps

=> Data binned
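As an illustration of binning and of the aCGH-like comparison against a reference, the sketch below counts read starts per fixed-size bin and computes a log2 ratio per bin. The bin size, the pseudocount, and the function names are assumptions, not part of the talk.

```python
import math
from collections import Counter

def bin_counts(read_starts, bin_size):
    """Count reads per fixed-size bin, keyed by bin index."""
    return Counter(pos // bin_size for pos in read_starts)

def log2_ratios(sample, reference, n_bins, pseudocount=1.0):
    """Per-bin log2(sample / reference) after scaling each track to its
    total; the pseudocount avoids division by zero in empty bins."""
    s_total = sum(sample.values()) + pseudocount * n_bins
    r_total = sum(reference.values()) + pseudocount * n_bins
    return [
        math.log2(((sample[b] + pseudocount) / s_total)
                  / ((reference[b] + pseudocount) / r_total))
        for b in range(n_bins)
    ]

# Toy read start positions (hypothetical); the sample has extra reads in bin 2
sample = bin_counts([10, 20, 150, 250, 260, 270, 280], bin_size=100)
reference = bin_counts([10, 20, 150, 250, 260], bin_size=100)
ratios = log2_ratios(sample, reference, n_bins=3)
```

Larger bins reduce noise at the cost of resolution, which is the trade-off behind the "higher resolution, but noisier" remark.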

Page 25: ECCB10 talk - Next-generation sequencing and structural variation

RD - Exome

here: using exome data

Page 26: ECCB10 talk - Next-generation sequencing and structural variation

RD - Example

Page 27: ECCB10 talk - Next-generation sequencing and structural variation

RD - Workflow overview

• Mapping
• Read filtering
• GC correction
• Spike identification
• Validation

Page 28: ECCB10 talk - Next-generation sequencing and structural variation
Page 29: ECCB10 talk - Next-generation sequencing and structural variation

RD - mapping

Critical…(see RP)

Page 30: ECCB10 talk - Next-generation sequencing and structural variation

RD - Filtering

• mapQ
– mapQ >= 0 (noisy; few FN, many FP)
– mapQ >= 10
– mapQ >= 30 (many FN, few FP)

• Mean depth exon (often: e.g. +/- 0.01)
– Mean depth > 1
– Mean depth > 5

Page 31: ECCB10 talk - Next-generation sequencing and structural variation

RD - Filtering: what’s left

                    mapQ >= 0    mapQ >= 10    mapQ >= 30
all                 207,000      207,000       207,000
mean DP exon > 1    169,000      163,000       162,000
mean DP exon > 5    160,000      153,000       152,000

Page 32: ECCB10 talk - Next-generation sequencing and structural variation

RD - correction

• Mainly: GC
– Other: repeat-rich regions, mapping Q, …

• Fit linear model of exon RD on exon GC content => noise decreases
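A minimal sketch of the GC correction, assuming an ordinary least-squares line of depth against GC fraction and keeping the residual re-centred on the mean depth. The helper names and toy values are hypothetical; the talk does not specify the exact model.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def gc_correct(gc, depth):
    """Remove the linear GC trend; keep the overall mean depth."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth
            for g, d in zip(gc, depth)]

# Toy exons: depth rises with GC content (hypothetical values)
gc = [0.3, 0.4, 0.5, 0.6, 0.7]
depth = [20.0, 30.0, 40.0, 50.0, 60.0]
corrected = gc_correct(gc, depth)
```

In this toy case the depth is fully explained by GC, so every corrected value collapses to the mean; on real data the residual spread is what remains for CNV calling.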

Page 33: ECCB10 talk - Next-generation sequencing and structural variation
Page 34: ECCB10 talk - Next-generation sequencing and structural variation
Page 35: ECCB10 talk - Next-generation sequencing and structural variation

RD - segmentation

• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
– Combine outputs from different runs with different parameters
– Compare to known CNVs
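Combining runs with different parameters can be illustrated with a deliberately simple stand-in for a real segmentation algorithm. This is not GADA: `call_runs` is a hypothetical threshold-based caller on per-bin log2 ratios, and `consensus_points` keeps the data points called in at least `min_support` runs.

```python
def call_runs(values, threshold, min_len=2):
    """Return half-open (start, end) index ranges where |value| >= threshold
    for at least min_len consecutive points. Assumes threshold > 0.
    Gains and losses are not distinguished in this sketch."""
    runs, start = [], None
    for i, v in enumerate(list(values) + [0.0]):  # sentinel closes a final run
        if abs(v) >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

def consensus_points(values, thresholds, min_support):
    """Indices called in at least min_support of the runs."""
    support = [0] * len(values)
    for t in thresholds:
        for start, end in call_runs(values, t):
            for i in range(start, end):
                support[i] += 1
    return [i for i, s in enumerate(support) if s >= min_support]

# Toy log2 ratios (hypothetical)
values = [0.0, 0.1, 0.9, 1.0, 0.8, 0.0, 0.5, 0.6, 0.1, 1.2]
loose = call_runs(values, 0.4)
strict = call_runs(values, 0.7)
consensus = consensus_points(values, [0.4, 0.7], min_support=2)
```

The isolated spike at the last position is dropped by the minimum-length rule under both thresholds, and the weaker run at positions 6 to 7 survives only the loose threshold.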

Page 36: ECCB10 talk - Next-generation sequencing and structural variation

RD - Combine algorithms

Page 37: ECCB10 talk - Next-generation sequencing and structural variation
Page 38: ECCB10 talk - Next-generation sequencing and structural variation
Page 39: ECCB10 talk - Next-generation sequencing and structural variation

RD - Issues

• How to assess TP/FP/FN? => compare with known CNVs

• Breakpoints: unknown
– 1 datapoint/exon
– Can be outside of exon

• Different parameters for rare vs common CNVs => which?

Page 40: ECCB10 talk - Next-generation sequencing and structural variation

C. Split Reads

Page 41: ECCB10 talk - Next-generation sequencing and structural variation

SR - Principle

Page 42: ECCB10 talk - Next-generation sequencing and structural variation

SR - Mapping

Short subsequences => many possible mappings

Solution: “anchored split mapping” (e.g. Pindel)
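The idea behind anchored split mapping can be shown with a toy example: the mapped mate restricts the search to a window of the reference, and the unmapped read is split into a prefix and suffix that must both match inside that window. This is a simplified sketch with exact matching only, not Pindel's actual algorithm; `split_map` and its parameters are hypothetical.

```python
def split_map(read, ref, window_start, window_end, min_part=5):
    """Try each split point of the read within ref[window_start:window_end].
    Returns (prefix_pos, suffix_pos, deletion_len) in reference coordinates
    for the first split whose two parts map with a gap, else None.
    Only the first occurrence of each prefix is considered."""
    region = ref[window_start:window_end]
    for cut in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = region.find(prefix)
        if p == -1:
            continue
        s = region.find(suffix, p + cut)
        if s > p + cut:  # suffix found downstream with a gap => deletion
            return window_start + p, window_start + s, s - (p + cut)
    return None

# Toy reference: left flank + deleted segment + right flank (hypothetical)
left, deleted, right = "ACGTACGGTCAAGCTTAGGC", "TTTTTTTTTT", "GATCCGATAGCCATGGACTA"
ref = left + deleted + right
read = left[-8:] + right[:8]          # read spanning the deletion
hit = split_map(read, ref, 0, len(ref))
```

Restricting the window around the anchor is what keeps the number of candidate placements for the short prefix and suffix manageable.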

Page 43: ECCB10 talk - Next-generation sequencing and structural variation
Page 44: ECCB10 talk - Next-generation sequencing and structural variation

D. Local reassembly

Aim: to determine breakpoints

Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with only one read mapped
– (rather not: unmapped reads)

For large region: split up

Page 45: ECCB10 talk - Next-generation sequencing and structural variation

Assemblers

Velvet
ABySS
TIGRA
…

Page 46: ECCB10 talk - Next-generation sequencing and structural variation
Page 47: ECCB10 talk - Next-generation sequencing and structural variation

Conclusions

• Available algorithms: serve more to demonstrate a technique than to provide a complete solution

• Different algorithms => different results

Page 48: ECCB10 talk - Next-generation sequencing and structural variation

Chris Yoon

Page 49: ECCB10 talk - Next-generation sequencing and structural variation
Page 50: ECCB10 talk - Next-generation sequencing and structural variation

Genotyping
• Create alternative reference => remap reads
– All reads vs reads covering variant loci
– Whole-genome vs concatenation of variant loci

• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different signatures
• Bayesian approach: see what’s the most likely: do the reads support wild-type/het/hom-nonref?
• Not exact mapping => local reassembly
– Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp

• Convention: left-most position reported (but not always)
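The Bayesian genotyping step can be sketched with a simple binomial model over reads that support either the reference or the alternative allele. The error rate, the uniform priors, and the function name are assumptions for illustration; the talk does not specify the model.

```python
import math

def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of each genotype given read counts.
    Expected alt-read fraction: ~0 (wild-type), 0.5 (het), ~1 (hom-nonref)."""
    alt_fraction = {"wild-type": error, "het": 0.5, "hom-nonref": 1.0 - error}
    if priors is None:
        priors = {g: 1.0 / 3.0 for g in alt_fraction}  # assumed uniform prior
    n = n_ref + n_alt
    unnorm = {
        g: priors[g] * math.comb(n, n_alt) * f ** n_alt * (1.0 - f) ** n_ref
        for g, f in alt_fraction.items()
    }
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

def most_likely(n_ref, n_alt):
    post = genotype_posteriors(n_ref, n_alt)
    return max(post, key=post.get)

calls = [most_likely(20, 0), most_likely(10, 9), most_likely(0, 15)]
```

Population-based priors, for instance from known allele frequencies, could replace the uniform prior by passing a `priors` dictionary.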

Page 51: ECCB10 talk - Next-generation sequencing and structural variation

References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)

Page 52: ECCB10 talk - Next-generation sequencing and structural variation

Questions?