Quality Assurance of High Throughput Sequencing in the Diagnostic Laboratory

Gary Van Domselaar, PhD

Chief, Bioinformatics

WAVLD17, June 17 2015

Canada’s National Microbiology Laboratory

Next-Generation Sequencing

Illumina MiSeq: 15 Gigabases / Run (1 d), $125K

Ion Torrent PGM: 1 Gigabase / Run (2 h), $80K

Oxford Nanopore: 1 Gigabase / Run, $1K

Ion Torrent / Ion Proton 6+ m/80 Million reads/run

Illumina MiSeq 25-Million reads / run

Big Data Analytics and High Performance Computing

Genomic Epidemiology

Canadian Listeriosis Outbreak, 2008

Eppinger M et al. mBio 2014; doi:10.1128/mBio.01721-14

Katz L et al. mBio. 2013 Jul-Aug; 4(4): e00398-13.

Genomic Epidemiology

PulseNet Canada Genomic Epidemiology Roadmap

Roadmap

Aleisha Reimer, with contributions from Drs. Celine Nadon, Morag Graham, and PulseNet Canada members

October 16, 2013

• Based on existing PulseNet model

• De-centralized sequencing and analysis

• Parallel, centralized storage & analysis of national data sets

• Continued NML support in reference testing, training, certification & proficiency

• Continued method development, refinement, and KT

Diagnostic Metagenomics

Pathogen Profiling Pipeline

Neptune: Target Pathogen Signature Detection

Inclusion Group Exclusion Group

Target Sequences

HIV Genotyping

NGS HIV Genotyping with HyDRA

Quality Assurance

QA/QC Today: A Work in Progress

QC Metrics: From Reads to Annotated Genome

• Sequence Reads
– Basic stats in read number, read length, etc.
– Sequence quality
– GC content
– Duplicated sequences, over-represented sequences
– Kmer content

• Error correction

• Contamination

• Sequence Mapping
– Average read coverage and its distribution
– Composition of the data set according to read length
– Fragment-length average, distribution, and outliers
– Base quality values
– Read duplication rate
– Mapping quality values and fraction of properly mapped read pairs
– % of mapped reads
– % of concordant reads
– % of discordant reads

• Sequence Assemblies
– # Contigs
– Total length of contigs
– Largest contigs
– N50 (ref-free); NG50 (with reference)
– # misassemblies (with reference)
– Unaligned contigs / length

• Sequence Annotation
– Genome Quality Score (Land et al Genomic Sciences
– Number of contigs and number of non-standard bases
– Presence of a full-length 5S, 16S, and 23S rRNA
– Presence of at least one tRNA coding for each of the 20 standard amino acids
– Presence of a set of essential genes containing 102 conserved Pfam-A domains

• cgMLST / wgMLST

• Metabolic Pathways
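As an illustration of the assembly metrics listed above, the following is a minimal Python sketch of how contig count, total length, largest contig, N50, and NG50 could be computed from a list of contig lengths. The function names and the returned dictionary keys are assumptions made for this example; they are not taken from any pipeline named in these slides.

```python
def n50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when genome_size is given.

    N50 is the length L such that contigs of length >= L together cover
    at least half of the total assembly length. NG50 uses half of a
    reference (or estimated) genome size instead of the assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if genome_size is None:
        target = sum(lengths) / 2
    else:
        target = genome_size / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    # Reached only when the assembly is empty or covers less than
    # half of the supplied genome size.
    return None


def assembly_metrics(contig_lengths, genome_size=None):
    """Collect basic assembly statistics (hypothetical key names)."""
    metrics = {
        "num_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }
    if genome_size is not None:
        metrics["ng50"] = n50(contig_lengths, genome_size)
    return metrics
```

Because NG50 is measured against a fixed genome size rather than the assembly's own length, it stays comparable across assemblies of the same organism, whereas N50 can look better simply because short contigs were discarded.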

Quality Assessment for Genomic Epidemiology

The Daubert Standard

• Judge is gatekeeper: the task of assuring that scientific expert testimony truly proceeds from "scientific knowledge" rests on the trial judge.

• Relevance and reliability: the trial judge must ensure that the expert's testimony is "relevant to the task at hand" and that it rests "on a reliable foundation".

• Scientific knowledge = scientific method/methodology: a conclusion will qualify as scientific knowledge if the proponent can demonstrate that it is the product of sound "scientific methodology" derived from the scientific method.

• Factors relevant: the Court defined "scientific methodology" as the process of formulating hypotheses and then conducting experiments to prove or falsify the hypothesis, and provided a nondispositive, nonexclusive, "flexible" set of "general observations" (i.e. not a "test") that it considered relevant for establishing the "validity" of scientific testimony:
– Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.
– Whether it has been subjected to peer review and publication.
– The known or potential error rate.
– The existence and maintenance of standards and controls concerning its operation.
– The degree to which the theory and technique is generally accepted by a relevant scientific community.

Best Practices for Regulatory Bioinformatics

• Fitness-for-purpose (statement of quality requirements, validation)

• Gap: Standards for high-quality reference genomes/materials

• Gap: Minimum requirements for quality metrics

• Gap: Benchmarking and validation of pipelines

• Gap: Proficiency verification

• Traceability and auditability

• Gap: Organization of information in a manner facilitating retrospective analysis

• Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.

• Gap: Measures to prevent or mitigate procedural errors

• Gap: Standardization for bioinformatics analyses*

• Documentation

• Gap: Storage of raw reads vs assembled genomes

• Gap: Metadata

Bioinformatic Genomic Analytical Validation, and Best Practices for Microbial Forensics

User and File Management

Gap: Measures to prevent or mitigate procedural errors

Gap: Secure storage and sharing of information

Users, Projects, Samples, and Files

Federated Identity with OAuth2

• Designed to be a “simple” authorization protocol.

• Developed by a consortium of developers and industry leaders.

• Implemented by Facebook, Google, Twitter, Hotmail, Amazon, Dropbox...

• More of a description of a protocol than a protocol itself.

Ontologies and Data Standards

Analytical Subunit

Subunit Quality Control Module

Quality Report

Quality Control

Quality Assessment

• JSON-formatted report: [ { "metric": "n50", "score": "125000" }, ... ]

Quality Verification Logic

• User-modifiable runtime parameters.

• Modify JSON-formatted report to include assessment.

• Developer customized module

• Pass with or without warnings (continue workflow), or fail (halt workflow).

• Low-quality/invalid data filtering.
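The quality verification step described above could look roughly like the following Python sketch. It reads a JSON report of metric/score pairs, applies user-modifiable thresholds, writes an assessment back into the report, and returns an overall outcome that a workflow could use to continue or halt. The threshold values, the "num_contigs" metric, the "assessment" field, and all function names are hypothetical; the slides specify only the "metric" and "score" fields.

```python
import json

# Hypothetical, user-modifiable runtime parameters.
# "min": scores below the limit are flagged; "max": scores above it are flagged.
DEFAULT_THRESHOLDS = {
    "n50": {"direction": "min", "warn": 100000, "fail": 20000},
    "num_contigs": {"direction": "max", "warn": 300, "fail": 1000},
}


def assess(score, rule):
    """Return 'pass', 'warn', or 'fail' for one score under one rule."""
    if rule["direction"] == "min":
        if score < rule["fail"]:
            return "fail"
        if score < rule["warn"]:
            return "warn"
    else:
        if score > rule["fail"]:
            return "fail"
        if score > rule["warn"]:
            return "warn"
    return "pass"


def verify_report(report_json, thresholds=DEFAULT_THRESHOLDS):
    """Annotate each metric with an assessment and return the overall result.

    Returns a tuple (overall, annotated_json). An overall result of "fail"
    is meant to halt the workflow; "pass" and "pass_with_warnings" continue.
    """
    report = json.loads(report_json)
    for entry in report:
        rule = thresholds.get(entry["metric"])
        if rule is None:
            entry["assessment"] = "not_assessed"
        else:
            # Scores arrive as strings in the report, so convert first.
            entry["assessment"] = assess(float(entry["score"]), rule)
    outcomes = {entry["assessment"] for entry in report}
    if "fail" in outcomes:
        overall = "fail"
    elif "warn" in outcomes:
        overall = "pass_with_warnings"
    else:
        overall = "pass"
    return overall, json.dumps(report)
```

For example, `verify_report('[{"metric": "n50", "score": "125000"}]')` yields an overall result of "pass" under the default thresholds shown here. Keeping the thresholds in a plain dictionary is one way to let a user change them at run time without touching the module logic.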

Analytical Subunit

The IRIDA SNVPhyl Pipeline

[Figure: pipeline diagram. The user selects isolates and selects a reference (Reference Genome). Labels shown: Isolate Sequencing, Reads, Variant Calling, Variant Consolidation, HGT & Recombination Filtering, Repeat region filtering, Meta-alignment generation, SNP Matrix, Whole Genome Phylogeny, Phylogeny Viewer.]

The SNVPhyl Pipeline

Sequence Reads

• FastQC run for all read data uploaded to IRIDA

• Provides statistics about quality of reads
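To give a sense of the kind of read statistics referred to above, here is a small Python sketch that computes read count, mean read length, GC fraction, and mean base quality from FASTQ text. It is not FastQC and does not reproduce its reports; it assumes four-line FASTQ records and Phred+33 quality encoding, and the function and key names are made up for this example.

```python
import io


def read_stats(handle, phred_offset=33):
    """Compute basic statistics from a four-line-per-record FASTQ stream."""
    n_reads = 0
    total_bases = 0
    gc_bases = 0
    quality_sum = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        n_reads += 1
        total_bases += len(sequence)
        upper = sequence.upper()
        gc_bases += upper.count("G") + upper.count("C")
        quality_sum += sum(ord(char) - phred_offset for char in quality)
    if total_bases == 0:
        return {"reads": n_reads, "mean_length": 0.0,
                "gc_fraction": 0.0, "mean_quality": 0.0}
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads,
        "gc_fraction": gc_bases / total_bases,
        "mean_quality": quality_sum / total_bases,
    }


# Usage with an in-memory FASTQ; a real file opened in text mode works the same way.
example = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n!!!!\n")
stats = read_stats(example)
```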

Parameters UI

Files selected for analysis

Optional set of parameters can be defined

Or, parameters can be re-loaded from previously saved set (not shown)

Gap: Standardization for bioinformatics analyses; Measures to prevent or mitigate procedural errors

The SNVPhyl Pipeline

Reports

Pipeline Results

• Pipelines produce report or log files on the quality of the data

– E.g. SPAdes logs, Prokka logs, SNV Table, Assembly statistics

• These are stored along with results of each pipeline

Prokka summary
contigs: 243
bases: 2131392
rRNA: 2
gene: 2033
CDS: 1978
tRNA: 52
tmRNA: 1

SNV Table
#Chromosome  Position  Status  Reference  08-5923
08-5578v3    47737     valid   T          C
08-5578v3    113283    valid   C          T
08-5578v3    172841    valid   A          G
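A table like the SNV Table above can be summarized with a few lines of Python. The sketch below assumes a whitespace-delimited layout with the columns shown on the slide (#Chromosome, Position, Status, Reference, then one column per isolate) and counts, for each isolate, the positions with status "valid" where its base differs from the reference. The function names are hypothetical and this is not code from the SNVPhyl pipeline.

```python
def parse_snv_table(text):
    """Parse a whitespace-delimited SNV table into (isolates, rows)."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split()
    isolates = header[4:]  # columns after Chromosome, Position, Status, Reference
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    return isolates, rows


def snv_counts(isolates, rows):
    """Count valid positions where each isolate differs from the reference."""
    counts = {name: 0 for name in isolates}
    for row in rows:
        if row["Status"] != "valid":
            continue  # skip positions that did not pass filtering
        for name in isolates:
            if row[name] != row["Reference"]:
                counts[name] += 1
    return counts


table = """#Chromosome Position Status Reference 08-5923
08-5578v3 47737 valid T C
08-5578v3 113283 valid C T
08-5578v3 172841 valid A G
"""
isolates, rows = parse_snv_table(table)
counts = snv_counts(isolates, rows)
```

With the three rows shown on the slide, isolate 08-5923 differs from the reference at all three valid positions.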

Data Provenance

Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.

Provenance – Report

Auditing

• IRIDA tracks resources in its database as resources are created, modified, and removed from the system.

• Resource auditing in IRIDA is implemented as resource-level database auditing, tracking what, who, and when.

• [What]: User account credentials are captured whenever a resource in the database is modified. IRIDA securely exposes internal resources for use in external tools over a REST API.

• [Who]: If a human user modifies a resource via a non-human client operation or process, the client credentials are also captured within the database’s auditing information, allowing for traceability of any data manipulations.

• [When]: In addition to user credentials, IRIDA tracks the modification of resources over time.
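The what/who/when model described above can be illustrated with a minimal Python sketch of an append-only audit log. This is not IRIDA's implementation, which the slides describe as resource-level database auditing; the class names, field names, and action labels here are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical action labels mirroring "created, modified, and removed".
ACTIONS = ("created", "modified", "removed")


@dataclass(frozen=True)
class AuditEntry:
    resource: str            # what: identifier of the affected resource
    action: str              # what: one of ACTIONS
    user: str                # who: the human user account
    client: Optional[str]    # who: non-human client acting for the user, if any
    timestamp: datetime      # when: time of the change, in UTC


class AuditLog:
    """Append-only record of changes to resources."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, resource: str, action: str, user: str,
               client: Optional[str] = None) -> AuditEntry:
        if action not in ACTIONS:
            raise ValueError("unknown action: " + action)
        entry = AuditEntry(resource, action, user, client,
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def history(self, resource: str) -> List[AuditEntry]:
        """Return all entries for one resource, oldest first."""
        return [e for e in self._entries if e.resource == resource]
```

Recording the client alongside the user is what allows a change made through an external tool to be traced back to both the person and the software that acted for them.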

Transparency

• All provenance information captured by IRIDA is viewable by end-users with permission to view the data stored in IRIDA.

• If a user has permission to view some analysis executed by IRIDA, then the same user can also view the individual tool execution details.

• All provenance information captured by IRIDA is available for export.

HIV Genotyping

IRIDA Data Import / Export

IRIDA Data Source for Galaxy

Acknowledgements: PPP

• Thomas Matthews

• Eric Marinier

• Hellen Butungi

• Philip Mabon

• Franklin Bristow

• Heather Kent

• Shane Thiessen

• Morag Graham

• Shaun Tyler

• Geoff Peters

• Kim Melnychuk

• Christine Bonner

Acknowledgements: HyDRA

• Lead Bioinformatician:
– Eric Enns

• National HIV and Retro Viral Laboratories
– Dr. James Brooks, Hezhao Ji

• Bioinformatics Core Facility
– Dr. Ben Liang

• Co-op Students
– David Peddle, Rory Finnegan, Jonathan Boisvert

Acknowledgements: IRIDA

Project Leaders: Fiona Brinkman – SFU, Will Hsiao – PHMRL, Gary Van Domselaar – NML, Rob Beiko – Dalhousie University

University of Lisbon: João Carriço

National Microbiology Laboratory (NML): Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter

Laboratory for Foodborne Zoonoses (LFZ): Eduardo Toboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall

Simon Fraser University (SFU): Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo

BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC): Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella

University of Maryland: Lynn Schriml

Canadian Food Inspection Agency (CFIA): Burton Blais, Catherine Carrillo, Dominic Lambert

Dalhousie University: Rob Beiko, Alex Keddy

McMaster University: Andrew McArthur, Daim Sardar

European Nucleotide Archive: Guy Cochrane, Petra ten Hoopen, Clara Amid

European Food Safety Agency: Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina

Quality Assurance of High Throughput Sequencing in the Diagnostic Laboratory · 2015-07-15 ·...
