PipeCleaner: Sanitation for your NGS Pipeline Ken Doig, Jason Ellul Peter MacCallum Bioinformatics Core BigData April 2013

PipeCleaner: Sanitation for your NGS Pipeline

Ken Doig, Jason Ellul

Peter MacCallum Bioinformatics Core

BigData – April 2013

What we do

• Molecular pathology services

• Blood and tumour tissue samples

• Targeted genetic sequencing using amplicon panels

• Between 4 and 48 cancer-specific genes

• Looking for needles in haystacks

• Very sensitive assays

19 Apr 2013 BigData 2

...

... AAAAGCAGGT TATATAGGCT AAATAGAACT AATCATTGTT TTAGACATAC TTATTGACTC TAAGAGGAAA GATGAAGTAC TATGTTTTAA AGAATATTAT ATTACAGAAT TATAGAAATT AGATCTCTTA CCTAAACTCT TCATAATGCT TGCTCTGATA GGAAAATGAG ATCTACTGTT TTCCTTTACT TACTACACCT CAGATATATT TCTTCATGAA GACCTCACAG TAAAAATAGG TGATGTTGGT AGCTAGGAGT GAAATCTCGA TGGAGTGGGT CCCATCAGTT TGAACAGTTG TCTGGATCCA TTTTGTGGAT GGTAAGAATT GAGGCTATTT TTCCACTGAT TAAATTTTTG GCCCTGAGAT GCTGCTGAGT TACTAGAAAG TCATTGAAGG TCTCAACTAT AGTATTTTCA TAGTTCCCAG TATTCACAAA AATCAGTGTT CTTATTTTTT ATGTAAATAG ATTTTTTAAC TTTTTTCTTT ... ...

Why: Acquisition of mutations

B Vogelstein et al. Science 2013;339:1546-1558

Driver mutations

Somatic mutations

Allele distribution Cancer 2015 study data

Chart categories:

• Known Polymorphisms (dbSNP)

• Known Cancerous (Cosmic)

• VOUS

The problem

• Ageing population – more cancer

• NGS means more data / more variants

• Need faster turnaround

• Need audited processes

• Replace manual paper trail

• Get rid of uncontrolled data

Software Qualities

Software should be:

• Correct

• Efficient

• Robust

• Flexible

• Repeatable

• Maintainable

• Reusable

• Using debugged methods/libraries

• Logging all activity

• Version controlled

Groovy

http://www.youtube.com/watch?v=7jZsEUMeU94

Sample Data Flow

Flowchart (labels recovered from the slide graphic, in extraction order):

• Lanes: Wet Lab; Bioinformatics; Clinical Informatics

• Starting point: Patient Sample

• Steps: Histology; DNA Extract; PCR; Sequencing; Alignment; Variant Calling; Annotation; DB Load; Filtering; Publish Report; Curation; Document Assembly; Variant Normalisation; Report Editing and Signoff

• Decision: Known Variant? (Yes / No)

• Data stores: Peter Mac Mutation DB; External and Locus Specific DBs

• End product: Patient Clinical Report

• Legend: Manual Step; Automatic Step

Report Assembly

Diagram (labels recovered from the slide graphic):

• Images

• Generated graphics

• PathOS Database: analysis data, patient details, lab QC, pharmacogenomics, clinical reports, mutation descriptions

• “TransMute”

• Document Assembly System

Continuous testing

Diagram: PipeCleaner (labels recovered from the slide graphic):

• Variant Generator: takes the reference genome, SNPs and indels (VCF file(s)) and known variants (dbSNP, Cosmic); produces mutated genomes

• Read Generator (wgsim): produces simulated read data

• Pipeline Under Test: receives the simulated read data, plus reference genomes (if mapping to reference)

• Pipeline results: variant calls, assemblies, alignments

• Result Comparator: produces the test results

• Test Controller
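
The Variant Generator stage can be pictured as applying a list of known edits to the reference sequence to produce a mutated genome. The following is a minimal sketch under stated assumptions, not PipeCleaner's actual code: the function name `apply_variants` and the `(position, ref, alt)` tuple layout (0-based positions, VCF-style alleles) are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (0-based position, reference allele, alternate allele)
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return a copy of `reference` with every variant applied.

    Variants are applied left to right; overlapping variants or a
    mismatch between the stated ref allele and the sequence is an error.
    """
    pieces = []
    cursor = 0  # first reference base not yet copied
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"overlapping variant at position {pos}")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged stretch
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # tail after the last variant
    return "".join(pieces)


# Usage: one SNP, one 1 bp deletion and one 2 bp insertion
mutated = apply_variants(
    "AAAAGCAGGT",
    [(4, "G", "T"), (6, "AG", "A"), (9, "T", "TCC")],
)
```

Checking the stated reference allele before each edit is a deliberate choice: a coordinate or genome-build mismatch then fails loudly instead of silently producing a wrong test genome.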

Tumour clonal evolution

Nature 481, 506–510 (26 January 2012)

Simulating tumour genetics

Figure annotations:

• Amplicon region

• Germline variant (homozygous)

• Variant allele at 40%: detected

• Variant allele at 20%: detected

• Low-frequency allele at 10%: not detected
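
One way to simulate a tumour variant at a chosen allele frequency is to mix reads from the mutated and the unmutated genome in the matching proportion. The sketch below is an illustration only; the helper names `split_depth` and `observed_allele_freq`, and the example bases, are assumptions and not taken from PipeCleaner.

```python
from typing import List, Tuple


def split_depth(depth: int, allele_freq: float) -> Tuple[int, int]:
    """Split `depth` reads into (variant_reads, reference_reads).

    `allele_freq` is a fraction between 0 and 1, e.g. 0.10 for 10%.
    """
    if depth < 0 or not 0.0 <= allele_freq <= 1.0:
        raise ValueError("depth must be >= 0 and allele_freq within [0, 1]")
    variant_reads = round(depth * allele_freq)
    return variant_reads, depth - variant_reads


def observed_allele_freq(pileup: List[str], alt: str) -> float:
    """Fraction of bases in a pileup column that match the alternate allele."""
    if not pileup:
        raise ValueError("empty pileup")
    return pileup.count(alt) / len(pileup)


# Usage: a 10% variant at depth 200, with hypothetical alleles C (ref) and T (alt)
variant_reads, reference_reads = split_depth(depth=200, allele_freq=0.10)
pileup = ["T"] * variant_reads + ["C"] * reference_reads
freq = observed_allele_freq(pileup, "T")
```

This idealised column has no sequencing errors. In a real run, read errors and the caller's thresholds decide whether a low-frequency allele is reported.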

Pipeline Validation (PipeCleaner)

Test parameters:

• Germline mutation: dbSNP = rs1050171 @ 100%

• Tumour mutation: cosmic = cosm476 @ 10%

• Readlen: 200

• Read depth: 200

• Readerr: 1.0%

• Indel fraction: 10.0%

Figure panels: Input Parameters (expected); Pipeline Output (actual)
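
Parameters like these map naturally onto the command line of wgsim, the read generator named in the PipeCleaner diagram. The sketch below builds such a command without running it. The flags `-e` (base error rate), `-R` (fraction of indels), `-1`/`-2` (read lengths) and `-N` (number of read pairs) are standard wgsim options; the mapping from the slide's parameters to those flags, the 1000 bp region length, the file names and the function name `wgsim_command` are all assumptions made for illustration.

```python
import math
from typing import List


def wgsim_command(ref_fasta: str, out1: str, out2: str, read_len: int,
                  depth: int, region_len: int, read_err: float,
                  indel_frac: float) -> List[str]:
    """Build a wgsim argument list for paired-end reads.

    The number of read pairs is derived from the target depth:
    pairs = depth * region_len / (2 * read_len), rounded up.
    """
    pairs = math.ceil(depth * region_len / (2 * read_len))
    return [
        "wgsim",
        "-e", str(read_err),    # per-base sequencing error rate
        "-R", str(indel_frac),  # fraction of simulated mutations that are indels
        "-1", str(read_len),    # length of the first read
        "-2", str(read_len),    # length of the second read
        "-N", str(pairs),       # number of read pairs
        ref_fasta, out1, out2,
    ]


# Usage with the parameters listed above and a hypothetical 1000 bp region
cmd = wgsim_command("mutated.fa", "reads_1.fq", "reads_2.fq",
                    read_len=200, depth=200, region_len=1000,
                    read_err=0.01, indel_frac=0.10)
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where wgsim is installed.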

SNPs – simulated reads (PipeCleaner)

Chart: SNP Accuracy (err=0.003, depth=2000)

• X axis: Allele Frequency % (0.2 to 10)

• Y axis: SNP Count (0 to 120)

• Series: True SNPs; VarScan 2; VS false -ve; GATK 2.1; GK false -ve; FreeBayes; FB false -ve
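
A chart like this needs the caller's output compared against the set of variants that were actually simulated. A minimal sketch of such a comparison follows; it assumes variants can be matched on an exact `(chromosome, position, ref, alt)` key, which in practice requires both sets to be normalised to the same representation. The names `compare_calls` and `Comparison`, and the example keys, are hypothetical.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# Hypothetical matching key: (chromosome, position, ref allele, alt allele)
VariantKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class Comparison:
    true_pos: int   # simulated and called
    false_pos: int  # called but never simulated
    false_neg: int  # simulated but not called

    @property
    def sensitivity(self) -> float:
        """Fraction of simulated variants that the caller recovered."""
        simulated = self.true_pos + self.false_neg
        return self.true_pos / simulated if simulated else 0.0


def compare_calls(truth: Set[VariantKey], calls: Set[VariantKey]) -> Comparison:
    """Count true positives, false positives and false negatives."""
    return Comparison(
        true_pos=len(truth & calls),
        false_pos=len(calls - truth),
        false_neg=len(truth - calls),
    )


# Usage with made-up variant keys
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "T", "C")}
result = compare_calls(truth, calls)
```

Running the comparison separately for each simulated allele frequency would give the per-frequency false negative counts that a chart like this plots.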

Deletions – simulated reads (PipeCleaner)

Chart: Deletion Validation 6-Dec-2012

• X axis: Deletion Size (bp) (1 to 124)

• Y axis: Number of Variants (0 to 25)

• Series: True variants; VarScan 2; GATK 2.1; FreeBayes

Inserts – simulated reads (PipeCleaner)

Chart: Insert validation 6-Dec-2012

• X axis: Insert Size (bp) (1 to 129)

• Y axis: Number of Variants (0 to 30)

• Series: True variants; VarScan 2; GATK 2.1

PipeCleaner Outcomes

• Regression testing

– Unit testing

– Control samples

• End to end test from PCR to Database

• Two control samples per run

• 4 variants per control (3 at 2.5% allele frequency, 1 at 100% allele frequency)

• Test for ref/alt, gene, HGVS, allele frequency, consequence

– Set of dbSNP and Cosmic mutations

• Test for annotation and allele freq.

• Performance testing

– Increased insert/deletion size from 15 bp to 100 bp

– Calculating false +ve and false –ve rates of pipeline
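
The control-sample checks listed above amount to comparing each expected variant with what the pipeline reported, field by field. The sketch below shows one way such a regression check could be written. The record layout, the function name `check_control`, the allele-frequency tolerance and the placeholder gene and HGVS strings are all hypothetical and are not real control data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ControlVariant:
    gene: str
    hgvs: str
    ref: str
    alt: str
    allele_freq: float  # percent
    consequence: str


def check_control(expected: List[ControlVariant],
                  observed: List[ControlVariant],
                  af_tolerance: float) -> List[str]:
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []
    by_key = {(v.gene, v.hgvs): v for v in observed}
    for exp in expected:
        obs = by_key.get((exp.gene, exp.hgvs))
        if obs is None:
            failures.append(f"{exp.gene} {exp.hgvs}: not called")
            continue
        # Annotation fields must match exactly
        for name in ("ref", "alt", "consequence"):
            if getattr(obs, name) != getattr(exp, name):
                failures.append(f"{exp.gene} {exp.hgvs}: {name} mismatch")
        # Allele frequency may drift within the chosen tolerance
        if abs(obs.allele_freq - exp.allele_freq) > af_tolerance:
            failures.append(f"{exp.gene} {exp.hgvs}: allele frequency out of range")
    return failures


# Usage with placeholder variants (not real control data)
expected = [
    ControlVariant("GENE1", "c.1A>G", "A", "G", 2.5, "missense"),
    ControlVariant("GENE2", "c.2C>T", "C", "T", 100.0, "synonymous"),
]
observed = [ControlVariant("GENE1", "c.1A>G", "A", "G", 2.7, "missense")]
failures = check_control(expected, observed, af_tolerance=0.5)
```

Returning every failure, instead of stopping at the first, lets a single run report all discrepancies for a control sample.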

Acknowledgements

• Molecular Pathology Dept. support

– Andrew Fellowes

– Anthony Bell

• Peter Mac amplicon aligner

– Jason Ellul

– Franco Caramia