2015 06-12-beiko-irida-big data

Page 1: 2015 06-12-beiko-irida-big data


Page 2: 2015 06-12-beiko-irida-big data


“All of your answers are approximate, you might as well live with it…”

Andrew Rau-Chaplin, 1½ hours ago

Page 3: 2015 06-12-beiko-irida-big data

Integrated Rapid Infectious Disease Analysis
www.irida.ca

Rob Beiko
Faculty of Computer Science
Dalhousie University
June 12

Microbial genomics for rapid investigation of infectious disease

Image © Kenneth Todar

Page 4: 2015 06-12-beiko-irida-big data


2009 and Influenza A

Page 5: 2015 06-12-beiko-irida-big data


Page 6: 2015 06-12-beiko-irida-big data


Page 7: 2015 06-12-beiko-irida-big data

VIRUS
Influenza A
RNA genome (14,000 nucleotides)
Eight segments
(Image: Tao and Zheng, Science 2012)

BACTERIUM
S. Typhi CT18
DNA genome (~5,100,000 nucleotides)
One chromosome + two plasmids
Science (2001)

Page 8: 2015 06-12-beiko-irida-big data


Outbreak investigation

Similarities: place, time, genetics

fda.gov

2014

2010-2013

Inns et al. (2015)

Page 9: 2015 06-12-beiko-irida-big data


Outbreak investigation in Canada

NATIONAL MICROBIOLOGY LABORATORY

PROVINCIAL PUBLIC HEALTH LABORATORIES

CLINICAL ISOLATES

SENTINEL SURVEILLANCE (FoodNet Canada)

CLINICAL, FOOD, ENVIRONMENTAL

CANADIAN FOOD INSPECTION AGENCY

(Regulatory)

FOOD ISOLATES

LISTERIA - E. COLI O157:H7 - SALMONELLA - SHIGELLA

PFGE/MLVA

PUBLIC HEALTH ACTION

Page 10: 2015 06-12-beiko-irida-big data

Pulsed Field Gel Electrophoresis
Serratia - NICU

[Figure: PFGE gel. Lane groups: Hospital cases; Handwashes; Environmental (doors, etc.); Control (elsewhere in hospital)]

Jang et al., J Hosp Infect (2001)

Page 11: 2015 06-12-beiko-irida-big data

DNA Sequencing

1977: Sanger sequencing
< 1 kilobase per run
$2 / run, 1-3 hours (96 in parallel)
Tiny pieces (600 – 1000 bases)

2011: Illumina MiSeq
15 gigabases per run
$1000 - $1500 / run, 1 day
Tinier pieces (150 – 400 bases)
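
The per-run figures on this slide imply very different costs per base. The following is a rough back-of-the-envelope sketch in Python, not a benchmark. It assumes a Sanger run yields at most 1,000 bases (the slide says "< 1 kilobase"), uses the S. Typhi CT18 genome size quoted on the earlier slide, and counts raw yield only, ignoring library preparation and quality filtering. All function and variable names are illustrative.

```python
# Back-of-the-envelope comparison using only the figures quoted on the slides.
# Assumption: a Sanger run yields at most 1,000 bases ("< 1 kilobase per run").

SANGER_BASES_PER_RUN = 1_000            # upper bound taken from the slide
SANGER_COST_PER_RUN = 2.0               # dollars
MISEQ_BASES_PER_RUN = 15_000_000_000    # 15 gigabases
MISEQ_COST_PER_RUN = (1000.0, 1500.0)   # dollars, low and high figure
S_TYPHI_GENOME = 5_100_000              # nucleotides, S. Typhi CT18


def cost_per_megabase(cost_per_run: float, bases_per_run: int) -> float:
    """Dollars per million bases of raw sequence."""
    return cost_per_run / bases_per_run * 1_000_000


def fold_coverage(bases_per_run: int, genome_size: int, n_genomes: int = 1) -> float:
    """Average depth if one run is split evenly across n_genomes genomes."""
    return bases_per_run / (genome_size * n_genomes)


sanger = cost_per_megabase(SANGER_COST_PER_RUN, SANGER_BASES_PER_RUN)
miseq_low, miseq_high = (
    cost_per_megabase(cost, MISEQ_BASES_PER_RUN) for cost in MISEQ_COST_PER_RUN
)
depth = fold_coverage(MISEQ_BASES_PER_RUN, S_TYPHI_GENOME)

print(f"Sanger: at least ${sanger:,.0f} per megabase")
print(f"MiSeq:  ${miseq_low:.3f} to ${miseq_high:.3f} per megabase")
print(f"One MiSeq run is about {depth:,.0f}x the size of one S. Typhi genome")
```

Passing `n_genomes` to `fold_coverage` shows how many isolates could share a run at a chosen depth, which is the practical question for a public health lab.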

Page 12: 2015 06-12-beiko-irida-big data

10/10/2013 VanBUG

Page 13: 2015 06-12-beiko-irida-big data

MiSeq projects at Dalhousie
• Bedford Basin microbial monitoring
• Pediatric Crohn’s disease samples
• Global microbial air sampling
• Mink genomes
• Sequencing Lactobacillus genomes from the poop of old mice
• Wastewater diversity and function in the Arctic
• Verifying ingredients in dog food
• Exercise and the Microbiome

Page 14: 2015 06-12-beiko-irida-big data

Integrated Rapid Infectious Disease Analysis
www.irida.ca

1.56M, 3-year Genome Canada Large-Scale Applied Platform Grant

SFU / BCCDC / PHAC-NML / Dalhousie

DNA sequencing and downstream applications
• data management / federation
• analysis workflows
• ontologies
• APIs
• 3rd-party applications

Implementation in provincial public health labs
Training

Page 15: 2015 06-12-beiko-irida-big data


Five Pillars of IRIDA

Page 16: 2015 06-12-beiko-irida-big data

Ontologies and data standards
  NCBI, MIxS, vegetables

Metadata
  Data provenance
  Data quality
  Environmental information

Page 17: 2015 06-12-beiko-irida-big data

Data sharing!

• BIG challenges – different jurisdictions, “ownership” of epi data. Privacy!
• Health service providers – concerns about privacy and data breach
• Technology outstrips policy
• What digital records could we get TODAY?
• Canada lagging in data sharing

Page 18: 2015 06-12-beiko-irida-big data

Calling isolates based on genetic variation

Traditional:
  Pulsed-field
  Multi-locus (standards! mlst.net)

Whole genomes:
  Lots of information!
  Too much information!
  Lots of filtering and quality control required

Page 19: 2015 06-12-beiko-irida-big data


Workflow management

REST-like API (3rd-party applications)

Security: authentication / authorization

Data models & implementation

Page 20: 2015 06-12-beiko-irida-big data

IRIDA’s Federated Design

[Diagram labels: Local Storage; Remote APIs; List Samples]

Page 21: 2015 06-12-beiko-irida-big data

Each pipeline is implemented as a Galaxy workflow

Internal analysis pipelines
  Assembly and annotation
  Phylogenetics
  “Line list” management

3rd-party applications

Page 22: 2015 06-12-beiko-irida-big data

Single-Nucleotide Variant Phylogenetic Pipeline (SNVPhyl)

Sampled genomes
Quality control
Tree generation / visualization
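
To make the tree-generation step concrete, here is a minimal Python sketch of the pairwise single-nucleotide variant distances that a tree-building method could take as input. It is not SNVPhyl's implementation, which runs as a Galaxy workflow and starts from sequencing reads. This sketch assumes the genomes have already been aligned to equal length, and the isolate names and sequences are invented for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple

VALID_BASES = set("ACGT")


def snv_distance(a: str, b: str) -> int:
    """Count aligned positions where both sequences have an unambiguous base and differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a, b)
        if x in VALID_BASES and y in VALID_BASES and x != y
    )


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise SNV distances for every pair of isolates, keyed by sorted name pair."""
    return {
        (p, q): snv_distance(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    }


# Toy alignment; names and sequences are made up for this example.
isolates = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACGAACNTTT",
}

matrix = distance_matrix(isolates)
for pair, distance in matrix.items():
    print(pair, distance)
```

Positions with gaps or ambiguous calls (such as `N`) are skipped rather than counted as differences, one simple way to reflect the quality-control step listed on the slide.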

Page 23: 2015 06-12-beiko-irida-big data

GenGIS

Data from Haiti cholera outbreak, 2010
http://kiwi.cs.dal.ca/GenGIS

Page 24: 2015 06-12-beiko-irida-big data


IslandViewer

http://www.pathogenomics.sfu.ca/islandviewer/browse

Page 25: 2015 06-12-beiko-irida-big data

Interfaces / environment

Personas
  Researchers
  Epidemiologists
  Clinical microbiologists / lab technicians

Workflow design and execution

Page 26: 2015 06-12-beiko-irida-big data

Full Privileges

Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.

Phylogenetic Tree

Genetic Distance

Page 27: 2015 06-12-beiko-irida-big data

Limited Privileges

Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.

Phylogenetic Tree

Genetic Distance
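
One way to picture the difference between the full-privilege and limited-privilege views is as a filter applied to the same line list. The Python sketch below is hypothetical: the slides do not say which columns a limited-privilege user sees, so treating patient name and provincial health number as the hidden fields is an assumption, and the function and field names are invented for illustration rather than taken from IRIDA.

```python
from typing import Dict, List

# Hypothetical choice of identifying fields; the slides do not list which
# columns are actually withheld from limited-privilege users.
IDENTIFYING_FIELDS = {"patient_name", "prov_health_no"}


def view_line_list(rows: List[Dict[str, str]], full_privileges: bool) -> List[Dict[str, str]]:
    """Return a copy of the line list, masking identifying fields for limited users."""
    if full_privileges:
        return [dict(row) for row in rows]
    return [
        {
            key: ("[hidden]" if key in IDENTIFYING_FIELDS else value)
            for key, value in row.items()
        }
        for row in rows
    ]


# First row of the example line list shown on the slide.
line_list = [
    {
        "cluster": "A",
        "line_list_id": "1",
        "patient_name": "John Smith",
        "prov_health_no": "4513253244",
        "location": "Vancouver",
        "sample_id": "F14231",
        "culture_result": "Salmonella sp.",
    },
]

limited = view_line_list(line_list, full_privileges=False)
print(limited[0])
```

The function returns copies, so the stored records are never modified. Masking happens only in the view handed to the user, which keeps cluster and sample information usable for analysis.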

Page 28: 2015 06-12-beiko-irida-big data


Large-scale sequencing initiatives

en.wikipedia.org

Page 29: 2015 06-12-beiko-irida-big data


FDA GenomeTrakr

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

Page 30: 2015 06-12-beiko-irida-big data

Public Health England project (>10,000 Salmonella so far)

• As of 2015, sequencing every sampled Salmonella isolate collected in England
• Over 10,000 sequenced to date
• 8000 already available for download in the public databases

Page 31: 2015 06-12-beiko-irida-big data

Gary Van Domselaar, NML

The Global Microbial Identifier

Page 32: 2015 06-12-beiko-irida-big data

What’s next?

2015: Oxford Nanopore MinION
??? per run
$900 / run, 6 hours
Huge pieces (max so far – 200-300 kilobases)
Can stop / restart using same disposable flowcell

15 cm (-ish)

thehightechsociety.com

Page 33: 2015 06-12-beiko-irida-big data

Quick et al. (2015)

“Using a novel streaming phylogenetic placement method samples can be assigned to a serotype in 40 minutes and determined to be part of the outbreak in less than 2 h.”

Page 34: 2015 06-12-beiko-irida-big data

Ebola monitoring

blogs.biomedcentral.com
Joshua Quick, Nick Loman

Page 35: 2015 06-12-beiko-irida-big data

Example workflow

6 hrs

Change flowcell

Samples evaluated against reference in real time

Positive ID / placement

Load DNA

confidence
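
The idea of evaluating samples against references in real time, with confidence building until a positive ID, can be sketched as a sequential update. The Python below is a generic Bayesian illustration and not the streaming phylogenetic placement method of Quick et al. It assumes that each read comes with a strictly positive likelihood under every candidate reference; how those likelihoods would be computed is left out. The candidate names, likelihood values, and threshold are all invented.

```python
from typing import Dict, Iterable, Optional, Tuple


def stream_identify(
    read_likelihoods: Iterable[Dict[str, float]],
    threshold: float = 0.99,
) -> Tuple[Optional[str], float, int]:
    """Update a posterior over candidate references one read at a time.

    Each item maps candidate name -> likelihood of that read under the
    candidate (assumed strictly positive). Stops as soon as one candidate's
    posterior reaches the threshold.
    Returns (identified candidate or None, top posterior, reads consumed).
    """
    posterior: Dict[str, float] = {}
    confidence, n = 0.0, 0
    for n, likelihoods in enumerate(read_likelihoods, start=1):
        if not posterior:
            # Uniform prior over the candidates named in the first read.
            posterior = {name: 1.0 / len(likelihoods) for name in likelihoods}
        unnormalised = {name: posterior[name] * likelihoods[name] for name in posterior}
        total = sum(unnormalised.values())
        posterior = {name: value / total for name, value in unnormalised.items()}
        best = max(posterior, key=posterior.get)
        confidence = posterior[best]
        if confidence >= threshold:
            return best, confidence, n
    # Ran out of reads before any candidate reached the threshold.
    return None, confidence, n


# Toy stream: every read is three times more likely under "outbreak".
reads = [{"outbreak": 0.9, "background": 0.3}] * 10
result = stream_identify(reads, threshold=0.99)
print(result)
```

With this toy input the odds in favour of "outbreak" triple with each read, so the loop stops after five reads (odds of 243 to 1) without consuming the rest of the stream. Stopping early is what would allow a run to be halted and the flowcell reused.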

Page 36: 2015 06-12-beiko-irida-big data

Challenges

• Sample extraction: getting DNA from stuff
• Clinical-grade evaluation
• Training
• Equipment reliability
• Sequencing errors
• Quality of reference data / attribution algorithms
• Database updates in real time
• Ethics / privacy (Genomes Sequenced While U Wait)

Page 37: 2015 06-12-beiko-irida-big data

The Point

Comprehensive monitoring
Accurate typing
Rapid identification

Real-time decision making

Page 38: 2015 06-12-beiko-irida-big data

Acknowledgements

PIs
Fiona Brinkman – SFU
Will Hsiao – PHMRL
Gary Van Domselaar – NML
Morag Graham – NML
Rob Beiko – Dalhousie

University of Lisbon
João Carriço

National Microbiology Laboratory (NML)
Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Chrystal Berry, Lorelee Tschetter

Laboratory for Foodborne Zoonoses (LFZ)
Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall

Simon Fraser University (SFU)
Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo

BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC)
Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella

University of Maryland
Lynn Schriml

Canadian Food Inspection Agency (CFIA)
Burton Blais, Catherine Carrillo, Dominic Lambert

Dalhousie University
Alex Keddy

McMaster University
Andrew McArthur, Daim Sardar

European Nucleotide Archive
Guy Cochrane, Petra ten Hoopen, Clara Amid

European Food Safety Agency
Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina

Page 39: 2015 06-12-beiko-irida-big data

Seminar from Will Hsiao, BC Centre for Disease Control

Page 40: 2015 06-12-beiko-irida-big data

Materials to be available on http://bioinformatics.ca/

June 24-26, 2015

Page 41: 2015 06-12-beiko-irida-big data

The Bioinformatics Exam of the Future

tagc.com.au
commons.wikimedia.org/wiki/File:DNA_ahelatest_moodustunud_niit_katsuti_korgil..JPG
http://omicfrontiers.com/2014/06/11/diaryofaminion_part2/

Page 42: 2015 06-12-beiko-irida-big data


2009 was a long time ago

J. Craig Venter Institute

Page 43: 2015 06-12-beiko-irida-big data

Photo credit: Emma Allen-Vercoe
Some slides courtesy of Gary Van Domselaar, NML

FIN