© 2009 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, and GenomeStudio are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners. COMPANY CONFIDENTIAL INTERNAL USE ONLY Genome Informatics Alliance 2013 Defining Genomic Big Data and its Impact on Scientific Progress

Scott Kahn Genomic Big Data.gia.052913

DESCRIPTION

Dr. Scott Kahn, CIO of Illumina, presents challenges and progress on big data solutions and their impact on scientific research at the 2013 Genome Informatics Alliance meeting.

Page 1: Scott Kahn Genomic Big Data.gia.052913

Genome Informatics Alliance 2013

Defining Genomic Big Data and its Impact on Scientific Progress

Page 2: Scott Kahn Genomic Big Data.gia.052913

From Whence We Came…

[Figure: sequence reads (ATGCCGTTT…, CCGGTTAAT…, GAATTGCAG…), variant calls (6:A2567C, 12:C123T, 20:T4678A), and data sizes of 30-40TB, ~5TB, 600GB, and ~20GB.]

Page 3: Scott Kahn Genomic Big Data.gia.052913

Genomic Big Data

Large amounts of data generated in genomics; multiple samples, size of data, etc.

Integration of digital data to enrich context of samples; DNA, RNA, methylation, time courses, spatial distributions with samples, …

Fusion of digital data and categorical data; combination rules (categories), extraction from unstructured inputs

Tools and techniques appropriate for resultant data sets; visualization, model building, exploration, …

Advances require data mining rather than the one-at-a-time hypothesis testing approaches of today

Page 4: Scott Kahn Genomic Big Data.gia.052913

Genomic Big Data and Personal Genome Information

PERSONAL SEQUENCE (owned by individual/doctor)

Issued: 01 MAR 07 Recommended next check: 28 FEB 10

PGI id: 5910322 – 61215923014

RISK VARIANTS (approved for clinical use)

[Figure labels: Human Genome, Clinical studies, Populations, Sequencing, Functional annotation; PPARg track on chromosome 3 with axis marks at 12,300 and 12,400 kb.]

GENOMIC ANNOTATION (in public domain)

Variant: C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu

Medical consequence: Associated with severe insulin resistance, diabetes mellitus, hypertension

Pharmacological consequence: Resistant to thiazolidinediones

CLINICAL DECISION

Consultation; Consent; Clinical assessment; Selected risk information
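The variant line on this slide packs several fields into one colon-separated record. The sketch below shows one way such a record could be split into typed fields. It assumes that "C3" means chromosome 3 and that "T0.7/C0.3" gives allele frequencies; the slide does not define the format, and the `VariantRecord` class and `parse_variant` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantRecord:
    """Hypothetical container for the fields shown on the slide."""
    chromosome: str
    position: int
    allele_frequencies: dict
    gene: str
    amino_acid_change: str


def parse_variant(record: str) -> VariantRecord:
    """Split a 'C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu' style record."""
    # Drop empty fields so a trailing ':' does not produce a sixth entry.
    fields = [f.strip() for f in record.split(":") if f.strip()]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    chrom, pos, alleles, gene, change = fields

    # Assumption: a leading 'C' is a chromosome prefix.
    chromosome = chrom[1:] if chrom.startswith("C") else chrom

    # Assumption: each token is one allele letter followed by its frequency.
    frequencies = {tok[0]: float(tok[1:]) for tok in alleles.split("/")}

    return VariantRecord(
        chromosome=chromosome,
        position=int(pos.replace(",", "")),
        allele_frequencies=frequencies,
        gene=gene,
        amino_acid_change=change,
    )


example = parse_variant("C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu :")
```

Keeping the record as typed fields rather than free text is what would let the personal sequence, the public annotation, and the clinical decision layers refer to the same variant unambiguously.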

Page 5: Scott Kahn Genomic Big Data.gia.052913

Platinum Genomes as a Truth Reference

Creating a catalogue of highly-accurate SNPs, indels & SVs

Sequencing a 17-member three-generation pedigree.

– Ultra-deep sequencing improves sensitivity

– Leveraging inheritance information improves accuracy

– Data and results made publicly available

Identifying ultra-accurate genomic variants is enabling rapid improvements in technology and software

This data will allow us to assess accuracy for many FDA submissions

We are collaborating with NIST & CDC to develop a public resource for quantifying sequencing accuracy
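The point about inheritance information can be illustrated with a Mendelian consistency check: in a pedigree, a child's genotype at a site should be obtainable by taking one allele from each parent, so calls that violate this are suspect. The sketch below is a simplified illustration for diploid autosomal sites only. It ignores de novo mutations and phasing, and it is not the method used to build the Platinum Genomes catalogue; the function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking
    one allele from the mother and one from the father.

    Each genotype is a 2-tuple of alleles, e.g. ("A", "G").
    """
    for maternal, paternal in product(mother, father):
        # Genotypes are unordered, so compare as sorted pairs.
        if sorted((maternal, paternal)) == sorted(child):
            return True
    return False


# A heterozygous child of two opposite homozygotes is consistent.
consistent = is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G"))

# A G/G child cannot come from an A/A mother, so this call is flagged.
inconsistent = is_mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G"))
```

With 17 members across three generations, each site is checked against many parent and child combinations, which is why a pedigree can filter errors that a single sample cannot.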

Page 6: Scott Kahn Genomic Big Data.gia.052913

Reduction from 40 Q-scores to 8 Q-scores is becoming accepted

Sequencing output is still increasing exponentially, so further compression is likely to be required

Platinum genome work suggests ~95% of the genome is consistently called (this 95% is known as the platinum regions)

Regions which are reliably called may not need 8 Q-score resolution – we can reduce “well sequenced” regions to 2 Q-scores

Start with 8 Q-score bam file:

– Reduce the platinum regions to 2 Q-scores (keep non-platinum at 8 Q-scores)

– Reduce the platinum regions to 1 Q-score

– Whole genome 2 Q-score

– Reduce platinum region to 2 Q-scores but also keep original Q-scores of mismatches (MM) and anomalous reads

– ~40Gb (20Gb CRAM)
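The reduction described above amounts to quantizing each base quality into a small number of bins, with a coarser scheme inside the platinum regions. The sketch below shows the idea. The bin edges and representative values are illustrative assumptions rather than the scheme used for these builds, and the per-base `in_platinum` flags stand in for a real region lookup.

```python
import bisect

# Assumed 8-level scheme: lower edges of bins 1..7 and one value per bin.
EIGHT_BIN_EDGES = [2, 10, 20, 25, 30, 35, 40]
EIGHT_BIN_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]

# Assumed 2-level scheme: "low" below Q20, "high" at Q20 and above.
TWO_BIN_EDGES = [20]
TWO_BIN_VALUES = [10, 30]


def bin_quality(q, edges, values):
    """Map one Phred score to the representative value of its bin."""
    return values[bisect.bisect_right(edges, q)]


def compress_qualities(quals, in_platinum):
    """Quantize base qualities, using 2 levels inside platinum regions
    and 8 levels elsewhere.

    quals:       list of integer Phred scores for one read
    in_platinum: list of booleans, one per base (hypothetical lookup result)
    """
    out = []
    for q, platinum in zip(quals, in_platinum):
        if platinum:
            out.append(bin_quality(q, TWO_BIN_EDGES, TWO_BIN_VALUES))
        else:
            out.append(bin_quality(q, EIGHT_BIN_EDGES, EIGHT_BIN_VALUES))
    return out


compressed = compress_qualities([38, 12, 41, 5], [False, False, True, True])
```

Fewer distinct values means longer runs of identical symbols in the quality string, which is what lets general-purpose compressors in BAM and CRAM shrink the file.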

Data Reduction Via Vertical Compression (NA12882)

| Build | Total SNPs (>Q20) | SNPs diff genotype (>Q20) | Not called in Q-score compressed build (>Q20) | Not called in 8 Q-score build (>Q20) |
|---|---|---|---|---|
| 8 Q-score | 3,735,575 (3,627,165) | - | - | - |
| 8 Q-score technical replicate | 3,734,849 (3,626,485) | 45,584 (22,400) | 80,131 (29,211) | 79,405 (28,845) |
| Platinum Genome 2 Q-score | 3,732,568 (3,620,612) | 3,255 (161) | 3,417 (63) | 410 (127) |
| Platinum Genome 1 Q-score | 3,764,928 (3,626,468) | 4,002 (584) | 2,605 (75) | 31,958 (2,964) |
| Whole Genome 2 Q-score | 3,712,636 (3,598,400) | 25,175 (1,912) | 24,237 (166) | 1,298 (112) |
| Platinum 2 Q-score, keep MM and anom. reads | 3,735,684 (3,627,226) | 197 (123) | 142 (35) | 251 (102) |
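One way to read the two "Not called" columns is as set differences against the 8 Q-score build: SNPs the compared build lost, and SNPs it gained. Under that reading, which is an interpretation and not stated on the slide, each build's total should equal the 8 Q-score total minus the lost SNPs plus the gained ones. The sketch below checks this for the unfiltered totals copied from the table; the >Q20 values in parentheses are not included.

```python
# Total SNPs in the 8 Q-score build (unfiltered count from the table).
BASELINE_TOTAL = 3_735_575

# build -> (total SNPs, not called in this build, not called in 8 Q-score build)
BUILDS = {
    "8 Q-score technical replicate": (3_734_849, 80_131, 79_405),
    "Platinum Genome 2 Q-score": (3_732_568, 3_417, 410),
    "Platinum Genome 1 Q-score": (3_764_928, 2_605, 31_958),
    "Whole Genome 2 Q-score": (3_712_636, 24_237, 1_298),
    "Platinum 2 Q-score, keep MM and anom. reads": (3_735_684, 142, 251),
}


def expected_total(baseline, lost, gained):
    """Total implied by the baseline and the two set-difference columns."""
    return baseline - lost + gained


def check_builds(baseline=BASELINE_TOTAL, builds=BUILDS):
    """Return, per build, whether the reported total matches the implied one."""
    return {
        name: expected_total(baseline, lost, gained) == total
        for name, (total, lost, gained) in builds.items()
    }


results = check_builds()
```

Every row passes this check, which supports the set-difference reading. It also puts the compressed builds in context: the technical replicate differs from the 8 Q-score build at far more sites than the platinum-region 2 Q-score builds do.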

Page 7: Scott Kahn Genomic Big Data.gia.052913

Faster Data – DNA to Result in <2 Days

Fast turnaround is required for clinical applications

[Figure: workflow timeline with the stages Sample, Sequence, Analyze, and Annotate. Labels: PCR-free library, 4.5 hr; HiSeq2500, 27 hr; Isaac analysis overnight, 8 hr; 40 hr; 12-core server, 64 GB RAM.]

Page 8: Scott Kahn Genomic Big Data.gia.052913

Importance of Non-coding Mutations – Bigger Data!

WGS reveals somatic mutations in the TERT gene promoter of melanoma patients

The mutations form a novel transcription factor binding motif

Recurrence in melanoma is as high as any known coding mutation

[Figure: TERT gene with a position axis from -200 to +200.]

| Gene (mutation) | Incidence in melanoma |
|---|---|
| TERT (promoter) | 52% |
| BRAF (V600E) | 53% |
| CDKN2A | 50% |
| NRAS (Q61R) | 28% |
| TERT (coding) | 1% |

Horn et al. & Huang et al., Science 2013

Page 9: Scott Kahn Genomic Big Data.gia.052913

Complexity of Data

Page 10: Scott Kahn Genomic Big Data.gia.052913

Surveillance of Leukaemia (CLL) – More Data Complexity!

[Figure: event timeline with axis labels 0, 62, 63, 64, 65, 66 and the events Birth, Diagnosis, Treatment (three times), and Death, with sequencing time points marked.]

[Figure: abundance (0 to 250) of NORMAL, CLASS1, CLASS2, CLASS3, and CLASS4 populations at time points a to e, labelled "Changing subclonal populations". A second panel for time point c (scale 0 to 5) is labelled “Remission” has disease.]

Schuh et al., Oxford

Page 11: Scott Kahn Genomic Big Data.gia.052913

A Deeper Complexity of Genomic Data

Page 12: Scott Kahn Genomic Big Data.gia.052913

Utility Requires Complex Composite Information

Annotate

– Allele frequency in populations: www.1000genomes.org

– Medical/risk data (with expert review): HGMD, PharmGKB

– Genetic variants: dbSNP

– Functional effects: ensembl.org, genome.ucsc.edu, encode.org

– Disease association: genome.gov

Disseminate

– ANNOTATED GENOME (gVCF), <1Gbyte

– Cloud; Plug and Play; iPad

Interpret

– Ancestry; Tissue type; Risk; Carrier status; Diagnosis; Drug response
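The composite annotation described on this slide can be thought of as a join: one variant key looked up in several independent sources, with the hits collected into a single record. The sketch below illustrates that shape with in-memory dictionaries. The source names, the key format, and the `annotate` function are hypothetical; real resources such as 1000 Genomes, dbSNP, HGMD, and PharmGKB each have their own formats and access methods, which are not modeled here. The example values reuse the PPARG variant from the Personal Genome Information slide.

```python
def annotate(variant_key, sources):
    """Collect what each source knows about one variant.

    variant_key: hashable identifier, here a (chromosome, position) tuple
    sources:     mapping of source name -> {variant_key: annotation}
    """
    record = {"variant": variant_key}
    for name, table in sources.items():
        # Sources with no entry for this variant are simply skipped.
        if variant_key in table:
            record[name] = table[variant_key]
    return record


# Placeholder tables keyed by (chromosome, position); categories follow the slide.
PPARG_KEY = ("3", 12_450_610)
SOURCES = {
    "allele_frequency": {PPARG_KEY: {"T": 0.7, "C": 0.3}},
    "medical_risk": {
        PPARG_KEY: "severe insulin resistance, diabetes mellitus, hypertension"
    },
    "pharmacology": {PPARG_KEY: "resistant to thiazolidinediones"},
    "disease_association": {},  # empty: nothing recorded for this variant
}

annotated = annotate(PPARG_KEY, SOURCES)
```

Keeping each source as a separate table means a source can be updated or re-reviewed without touching the others, which matters when some layers need expert review and others are public domain.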

Page 13: Scott Kahn Genomic Big Data.gia.052913

Genomic Big Data Ecosystems

[Figure labels: Instruments, Apps, Public Genomic Databases, Users, EMR, Support & Engineering.]

Page 14: Scott Kahn Genomic Big Data.gia.052913

Genomic Big Data Status

[Figure labels: Researcher, Clinician, Patient; Information, Knowledge; Treatment choice.]

Page 15: Scott Kahn Genomic Big Data.gia.052913

Challenges for this Meeting to Address

What data frameworks and models are required?

How will genomes (DNA, RNA, methylation states, etc.) be aggregated and compared?

How will collaboration and data sharing evolve?

Where will the technology go, and how must the community respond to leverage the benefits?

Brainstorming of ideas

Sessions from groups that have experience from many fields

Next steps!!

Actively participate and enjoy the entire experience!