Quality Assurance of High Throughput Sequencing in the Diagnostic Laboratory Gary Van Domselaar, PhD Chief, Bioinformatics WAVLD 17 , June 17 2015


Quality Assurance of High Throughput Sequencing in the Diagnostic Laboratory

Gary Van Domselaar, PhD

Chief, Bioinformatics

WAVLD17, June 17 2015


Canada’s National Microbiology Laboratory


Next-Generation Sequencing

• Illumina MiSeq: 15 gigabases / run (1 d); $125K
• Ion Torrent PGM: 1 gigabase / run (2 h); $80K
• Oxford Nanopore: 1 gigabase / run; $1K


• Ion Torrent / Ion Proton: 6+ million / 80 million reads per run
• Illumina MiSeq: 25 million reads per run

Big Data Analytics and High Performance Computing


Genomic Epidemiology


Canadian Listeriosis Outbreak, 2008


Genomic Epidemiology

• Eppinger M et al. mBio 2014; doi:10.1128/mBio.01721-14
• Katz L et al. mBio 2013 Jul-Aug; 4(4): e00398-13


PulseNet Canada Genomic Epidemiology Roadmap

Aleisha Reimer, with contributions from Drs. Celine Nadon and Morag Graham and PulseNet Canada members

October 16, 2013

• Based on existing PulseNet model

• De-centralized sequencing and analysis

• Parallel, centralized storage & analysis of national data sets

• Continued NML support in reference testing, training, certification & proficiency

• Continued method development, refinement, and KT


Diagnostic Metagenomics


Pathogen Profiling Pipeline


Neptune: Target Pathogen Signature Detection

Inclusion Group Exclusion Group

Target Sequences


HIV Genotyping


NGS HIV Genotyping with HyDRA


Quality Assurance


QA/QC Today: A Work in Progress


QC Metrics: From Reads to Annotated Genome

• Sequence Reads
– Basic stats on read number, read length, etc.
– Sequence quality
– GC content
– Duplicated sequences, over-represented sequences
– K-mer content

• Error correction

• Contamination

• Sequence Mapping
– Average read coverage and its distribution
– Composition of the data set according to read length
– Fragment-length average, distribution, and outliers
– Base quality values
– Read duplication rate
– Mapping quality values and fraction of properly mapped read pairs
– % of mapped reads
– % of concordant reads
– % of discordant reads

• Sequence Assemblies
– # contigs
– Total length of contigs
– Largest contig
– N50 (ref-free); NG50 (with reference)
– # misassemblies (with reference)
– Unaligned contigs / length

• Sequence Annotation
– Genome Quality Score (Land et al., Standards in Genomic Sciences 2014)
– Number of contigs and number of non-standard bases
– Presence of full-length 5S, 16S, and 23S rRNA
– Presence of at least one tRNA for each of the 20 standard amino acids
– Presence of a set of essential genes containing 102 conserved Pfam-A domains

• cgMLST / wgMLST

• Metabolic Pathways
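Of the assembly metrics listed above, N50 and NG50 are the ones most often misread, so a small worked example may help. The sketch below is illustrative only: the function name and the contig lengths are made up for this example, and NG50 is computed here against a supplied genome size (for instance, the length of a reference genome).

```python
def n50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when a genome size is supplied.

    N50: length of the contig at which the running total of contig lengths
    (sorted longest first) reaches half of the total assembly length.
    NG50: the same, but half of the supplied genome size is the target.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    target = total / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly never reaches half the genome size


# Hypothetical contig lengths, for illustration only.
contigs = [100, 80, 50, 30, 20, 10]
print("N50:", n50(contigs))
print("NG50:", n50(contigs, genome_size=400))
```

Because NG50 is tied to a fixed genome size rather than to the assembly's own total length, it is the more comparable number when several assemblies of the same organism are being ranked.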


Quality Assessment for Genomic Epidemiology


The Daubert Standard

• Judge is gatekeeper:
– The task of assuring that scientific expert testimony truly proceeds from "scientific knowledge" rests on the trial judge.

• Relevance and reliability:
– The trial judge must ensure that the expert's testimony is "relevant to the task at hand" and that it rests "on a reliable foundation".

• Scientific knowledge = scientific method/methodology:
– A conclusion will qualify as scientific knowledge if the proponent can demonstrate that it is the product of sound "scientific methodology" derived from the scientific method.

• Factors relevant: The Court defined "scientific methodology" as the process of formulating hypotheses and then conducting experiments to prove or falsify the hypothesis, and provided a nondispositive, nonexclusive, "flexible" set of "general observations" (i.e. not a "test") that it considered relevant for establishing the "validity" of scientific testimony:
– Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.
– Whether it has been subjected to peer review and publication.
– The known or potential error rate.
– The existence and maintenance of standards and controls concerning its operation.
– The degree to which the theory and technique is generally accepted by a relevant scientific community.


Best Practices for Regulatory Bioinformatics


• Fitness-for-purpose (statement of quality requirements, validation)
– Gap: Standards for high-quality reference genomes/materials
– Gap: Minimum requirements for quality metrics
– Gap: Benchmarking and validation of pipelines
– Gap: Proficiency verification

• Traceability and auditability
– Gap: Organization of information in a manner facilitating retrospective analysis
– Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.
– Gap: Measures to prevent or mitigate procedural errors
– Gap: Standardization for bioinformatics analyses*

• Documentation
– Gap: Storage of raw reads vs assembled genomes
– Gap: Metadata


Bioinformatic Genomic Analytical Validation, and Best Practices for Microbial Forensics


IRIDA


User and File Management

• Gap: Measures to prevent or mitigate procedural errors

• Gap: Secure storage and sharing of information


Users, Projects, Samples, and Files


Federated Identity with OAuth2


• Designed to be a “simple” authorization protocol.

• Developed by a consortium of developers and industry leaders.

• Implemented by Facebook, Google, Twitter, Hotmail, Amazon, Dropbox...

• More of a description of a protocol than a protocol itself.
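To make the OAuth2 bullets more concrete, here is a minimal sketch of the client side of the authorization code grant: building the authorization request and checking the callback. The endpoint URL, client ID, and redirect URI are hypothetical placeholders, not IRIDA's actual configuration; only the query parameter names (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`, `code`) come from the OAuth 2.0 specification.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical authorization endpoint, for illustration only.
AUTHORIZE_URL = "https://irida.example.org/oauth/authorize"


def build_authorization_url(client_id, redirect_uri, scope):
    """Return (url, state) for the first leg of the authorization code grant."""
    state = secrets.token_urlsafe(16)  # random value to tie the callback to this request
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_URL}?{query}", state


def extract_code(callback_url, expected_state):
    """Validate the state echoed back by the server and return the authorization code."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: callback does not belong to this request")
    return params["code"][0]


url, state = build_authorization_url(
    "my-client", "https://client.example.org/callback", "read"
)
print(url)
```

The client would then exchange the returned code for an access token at the server's token endpoint; that step is omitted here because it requires a live server.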


Ontologies and Data Standards


Subunit Quality Control Module

Diagram labels: Analytical Subunit, Quality Control, Quality Report

Quality Assessment
• JSON-formatted report: [ { "metric": "n50", "score": "125000" }, ... ]

Quality Verification Logic
• User-modifiable runtime parameters.
• Modify JSON-formatted report to include assessment.
• Developer customized module

• Pass with or without warnings (continue workflow), or fail (halt workflow).

• Low-quality/invalid data filtering.
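The quality verification step described on this slide can be sketched as follows. This is not IRIDA's implementation: the function name, the threshold structure, and the `mean_coverage` metric are assumptions made for illustration, and the sketch assumes every metric is "higher is better". Only the report shape (a list of `metric`/`score` entries) and the three outcomes (pass, pass with warnings, fail) come from the slide.

```python
import json


def verify_quality(report_json, thresholds):
    """Add an assessment to each metric and return (status, annotated report).

    thresholds maps a metric name to {"warn_below": x, "fail_below": y};
    these are the user-modifiable runtime parameters in this sketch.
    """
    report = json.loads(report_json)
    status = "pass"
    for entry in report:
        limits = thresholds.get(entry["metric"])
        if limits is None:
            entry["assessment"] = "not assessed"
            continue
        score = float(entry["score"])
        if score < limits["fail_below"]:
            entry["assessment"] = "fail"
            status = "fail"  # a failure halts the workflow
        elif score < limits["warn_below"]:
            entry["assessment"] = "warning"
            if status == "pass":
                status = "pass with warnings"  # workflow continues
        else:
            entry["assessment"] = "pass"
    return status, json.dumps(report, indent=2)


# Hypothetical report and thresholds, for illustration only.
report = '[{"metric": "n50", "score": "125000"}, {"metric": "mean_coverage", "score": "35"}]'
limits = {
    "n50": {"warn_below": 100000, "fail_below": 20000},
    "mean_coverage": {"warn_below": 40, "fail_below": 20},
}
status, annotated = verify_quality(report, limits)
print(status)
print(annotated)
```

Keeping the thresholds outside the function mirrors the slide's separation between the assessment (the report) and the verification logic (the parameters applied to it).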


The IRIDA SNVPhyl Pipeline

[Workflow diagram. Labels: User selects isolates and selects reference; Isolate Sequencing Reads; Reference Genome; Variant Calling; Variant Consolidation; Filtering (HGT & Recombination; Repeat region filtering); Meta-alignment generation; SNP Matrix; Whole Genome Phylogeny; Phylogeny Viewer]


The SNVPhyl Pipeline


Sequence Reads

• FastQC run for all read data uploaded to IRIDA

• Provides statistics about quality of reads
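As a rough illustration of the kind of read statistics involved, the sketch below computes read count, mean read length, GC fraction, and mean base quality from FASTQ text. It is not FastQC (which is a separate tool with many more modules); it assumes simple four-line FASTQ records and Phred+33 quality encoding, and the two reads are invented for the example.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Summarise four-line FASTQ records read from a text handle."""
    reads = bases = gc = quality_sum = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        reads += 1
        bases += len(sequence)
        gc += sum(sequence.upper().count(base) for base in "GC")
        quality_sum += sum(ord(char) - phred_offset for char in quality)
    if bases == 0:
        return {"reads": reads, "mean_length": 0.0, "gc_fraction": 0.0, "mean_quality": 0.0}
    return {
        "reads": reads,
        "mean_length": bases / reads,
        "gc_fraction": gc / bases,
        "mean_quality": quality_sum / bases,
    }


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\n!!!!!!\n"
print(fastq_stats(io.StringIO(example)))
```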


Parameters UI

• Files selected for analysis

• An optional set of parameters can be defined

• Alternatively, parameters can be re-loaded from a previously saved set (not shown)

Gap: Standardization for bioinformatics analyses; measures to prevent or mitigate procedural errors


Reports


Pipeline Results

• Pipelines produce report or log files on the quality of the data
– E.g. SPAdes logs, Prokka logs, SNV table, assembly statistics

• These are stored along with the results of each pipeline

Prokka summary
contigs: 243
bases: 2131392
rRNA: 2
gene: 2033
CDS: 1978
tRNA: 52
tmRNA: 1

SNV Table
#Chromosome Position Status Reference 08-5923
08-5578v3 47737 valid T C
08-5578v3 113283 valid C T
08-5578v3 172841 valid A G
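A table like the SNV excerpt above is easy to summarise programmatically. The following sketch counts, for each isolate column, the valid positions where the isolate differs from the reference. It assumes the whitespace-delimited layout shown on the slide (chromosome, position, status, reference base, then one column per isolate); the function name is invented for this example, and real SNVPhyl output may contain additional columns or status values.

```python
SNV_TABLE = """\
#Chromosome Position Status Reference 08-5923
08-5578v3 47737 valid T C
08-5578v3 113283 valid C T
08-5578v3 172841 valid A G
"""


def snv_counts(table_text):
    """Count valid positions at which each isolate differs from the reference."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split()
    isolates = header[4:]  # columns after Chromosome, Position, Status, Reference
    counts = {name: 0 for name in isolates}
    for line in lines[1:]:
        fields = line.split()
        status, reference, calls = fields[2], fields[3], fields[4:]
        if status != "valid":
            continue  # skip positions that were filtered out
        for name, base in zip(isolates, calls):
            if base != reference:
                counts[name] += 1
    return counts


print(snv_counts(SNV_TABLE))
```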


Data Provenance

Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.


Provenance – Report


Auditing

• IRIDA tracks resources in its database as they are created, modified, and removed from the system.

• Resource auditing in IRIDA is implemented as resource-level database auditing, tracking what, who, and when.

• [What]: Whenever a resource in the IRIDA database is modified, the user account credentials are captured. IRIDA securely exposes internal resources for use in external tools over a REST API.

• [Who]: If a user modifies a resource through a non-human client (an external tool or process), the client credentials are also captured in the database's auditing information, allowing for traceability of any data manipulations.

• [When]: In addition to user credentials, IRIDA tracks the modification of resources over time.
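The what / who / when model above can be sketched as a small append-only log. The class and field names below are hypothetical and chosen only for illustration; IRIDA's own auditing lives at the database level and is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEntry:
    resource: str           # what was touched
    action: str             # created, modified, or removed
    user: str               # who: the user account
    client: Optional[str]   # who: the non-human client acting for the user, if any
    timestamp: datetime     # when


class AuditLog:
    """Append-only record of changes to resources."""

    def __init__(self):
        self._entries: List[AuditEntry] = []

    def record(self, resource, action, user, client=None):
        entry = AuditEntry(resource, action, user, client, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def history(self, resource):
        """Return every recorded change to one resource, oldest first."""
        return [entry for entry in self._entries if entry.resource == resource]


log = AuditLog()
log.record("sample/42", "created", user="alice")
log.record("sample/42", "modified", user="alice", client="galaxy-importer")
for entry in log.history("sample/42"):
    print(entry)
```

Making each entry immutable (`frozen=True`) reflects the point of an audit trail: past records are read and exported, never edited.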


Transparency

• All provenance information captured by IRIDA is viewable by end-users with permission to view the data stored in IRIDA.

• If a user has permission to view some analysis executed by IRIDA, then the same user can also view the individual tool execution details.

• All provenance information captured by IRIDA is available for export.
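One way to picture exportable provenance is a structured record of each tool execution in an analysis. The sketch below is an assumption about what such an export could look like, not IRIDA's export format: the function name, field names, analysis name, version strings, and parameters are all placeholders.

```python
import json


def export_provenance(analysis_name, tool_executions):
    """Serialise the ordered tool executions of one analysis as JSON."""
    record = {
        "analysis": analysis_name,
        "steps": [
            {"order": order, "tool": tool, "version": version, "parameters": dict(parameters)}
            for order, (tool, version, parameters) in enumerate(tool_executions, start=1)
        ],
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder versions and parameters, for illustration only.
executions = [
    ("SPAdes", "x.y.z", {"careful": True}),
    ("Prokka", "x.y.z", {"kingdom": "Bacteria"}),
]
print(export_provenance("assembly-and-annotation", executions))
```

Recording tool versions and parameters alongside results addresses the "method-specific details" gap: an analysis can be re-run or reviewed later with the same settings.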


HIV Genotyping


IRIDA Data Import / Export


IRIDA Data Source for Galaxy


Acknowledgements: PPP

• Thomas Matthews
• Eric Marinier
• Hellen Butungi
• Philip Mabon
• Franklin Bristow
• Heather Kent
• Shane Thiessen
• Morag Graham
• Shaun Tyler
• Geoff Peters
• Kim Melnychuk
• Christine Bonner

Acknowledgements: HyDRA

• Lead Bioinformatician:
– Eric Enns
• National HIV and Retrovirology Laboratories
– Dr. James Brooks, Hezhao Ji
• Bioinformatics Core Facility
– Dr. Ben Liang
• Co-op Students
– David Peddle, Rory Finnegan, Jonathan Boisvert

Acknowledgements: IRIDA

• Project Leaders
– Fiona Brinkman – SFU
– Will Hsiao – PHMRL
– Gary Van Domselaar – NML
– Rob Beiko – Dalhousie University
• University of Lisbon
– João Carriço
• National Microbiology Laboratory (NML)
– Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter
• Laboratory for Foodborne Zoonoses (LFZ)
– Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall
• Simon Fraser University (SFU)
– Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo
• BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC)
– Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella
• University of Maryland
– Lynn Schriml
• Canadian Food Inspection Agency (CFIA)
– Burton Blais, Catherine Carrillo, Dominic Lambert
• Dalhousie University
– Rob Beiko, Alex Keddy
• McMaster University
– Andrew McArthur, Daim Sardar
• European Nucleotide Archive
– Guy Cochrane, Petra ten Hoopen, Clara Amid
• European Food Safety Authority
– Ernesto Leibana Criado, Francesco Vernazza, Valentina Rizzi