Dekker trog - learning outcome prediction models from cancer data - 2017


Learning outcome prediction models from cancer data

Andre Dekker
Department of Radiation Oncology (MAASTRO)
GROW - Maastricht University Medical Centre+
Maastricht, The Netherlands

SLIDES AVAILABLE ON SLIDESHARE (slideshare.net/AndreDekker)

Disclosures
• Research collaborations incl. funding and speaker honoraria
  – Varian (VATE, SAGE, ROO, chinaCAT, euroCAT), Siemens (euroCAT), Sohard (SeDI, CloudAtlas), Mirada Medical (CloudAtlas), Philips (EURECA, TraIT, SWIFT-RT, BIONIC), Xerox (EURECA), De Praktijkindex (DLRA), ptTheragnostic (DART, Strategy), CZ (My Best Treatment)
• Public research funding
  – Radiomics (USA-NIH/U01CA143062), euroCAT (EU-Interreg), duCAT & Strategy (NL-STW), EURECA (EU-FP7), SeDI & CloudAtlas & DART (EU-EUROSTARS), TraIT (NL-CTMM), DLRA (NL-NVRO), BIONIC (NWO)
• Spin-offs and commercial ventures
  – MAASTRO Innovations B.V. (CSO)
  – Various patents on medical machine learning

TROG 2017 talks
• Learning outcome prediction models from cancer data
  – Technical Research Workshop, Monday 8:40-9:10, followed by Panel Discussion
• Big Data in Radiation Oncology
  – Statistical Methods, Evidence Appraisal and Research for Trainees, Monday 14:50-15:20
• Knowledge Engineering in Oncology
  – TROG Plenary, Tuesday 9:25-10:00
• Radiomics for Oncology
  – TROG Plenary, Thursday 11:50-12:20

Some Overlap

No Overlap

Learning objectives
After the lecture, attendees should be able to
• Name the major sources of cancer data and their absolute and relative size
• Understand the challenges of sharing data and solutions to these
• Itemize steps in the methodology to go from data to models
• Appraise papers that describe models incl. using TRIPOD

The Data Part

Cancer Data?

Oncology 2005-2015
• 140M patients
• 0.1-10 GB per patient
• 14-1400 PB
• 80% unstructured
• 100k hospitals
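The 14-1400 PB range follows from multiplying the patient count by the per-patient volume. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^6 GB); the function name is illustrative only:

```python
PATIENTS = 140_000_000   # 140M patients, from the slide
GB_PER_PB = 1_000_000    # assumption: decimal units, 1 PB = 10^6 GB

def total_petabytes(patients, gb_per_patient):
    """Total data volume in PB for a given per-patient volume in GB."""
    return patients * gb_per_patient / GB_PER_PB

low = total_petabytes(PATIENTS, 0.1)   # lower bound, 0.1 GB per patient
high = total_petabytes(PATIENTS, 10)   # upper bound, 10 GB per patient
print(f"{low:.0f}-{high:.0f} PB")
```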

Barriers to sharing data
"[..] the problem is not really technical […]. Rather, the problems are ethical, political, and administrative." Lancet Oncol 2011;12:933

1. Administrative (I don’t have the resources)
2. Political (I don’t want to)
3. Ethical (I am not allowed to)
4. Technical (I can’t)

Common approaches to sharing
• Sharing standardized, highly curated data from clinical research programs
  • Very useful, but only 3% of patients (if that)
• Sharing standardized, highly curated data to clinical registries
  • Very useful, but limited amount of features and a lot of work
• Big Data companies, usually cloud based (Watson Health Cloud, Flatiron/Google, ASCO/SAP CancerLinq)
  • Worries about privacy, loss of control, limited reusability, silos

Data landscape
• Clinical research
  • 3% of patients
  • 100% of features
  • 5% missing
  • 285 data points
• Clinical registries
  • 100% of patients
  • 3% of features
  • 20% missing
  • 240 data points
• Clinical routine
  • 100% of patients
  • 100% of features
  • 80% missing
  • 2000 data points

[Figure axes: data elements vs. patients]
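The "data points" figures appear to be the product of the three percentages: patients x features x (100% - missing), on a 100 x 100 grid. That reading is an inference from the numbers, not something the slide states. A short check, with illustrative names:

```python
def relative_data_points(patients_pct, features_pct, missing_pct):
    """Filled cells on a 100 x 100 (patients x features) grid."""
    return patients_pct * features_pct * (100 - missing_pct) // 100

# (patients %, features %, missing %) as listed on the slide
landscape = {
    "Clinical research": (3, 100, 5),
    "Clinical registries": (100, 3, 20),
    "Clinical routine": (100, 100, 80),
}

for source, percentages in landscape.items():
    print(source, relative_data_points(*percentages))
```

Under this reading, routine clinical data holds roughly seven to eight times more usable data points than either curated source, despite 80% missingness.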

A different approach
• If sharing is the problem: Don’t share the data
• If you can’t bring the data to the research
  • You have to bring the research to the data
• Challenges
  – The research application has to be distributed (trains & track)
  – The data has to be understandable by an application (i.e. not a human) -> FAIR data stations

CORAL: Community in Oncology for RApid Learning

• meerCAT (Lung - Dyspnea): U Michigan, MAASTRO, The Christie
• canCAT (Lung SBRT - Control): Princess Margaret, MAASTRO
• BIONIC (Radiomics): MAASTRO, Tata Memorial
• duCAT (Lung - Dysphagia): MAASTRO, Radboud, NKI
• euroCAT (Lung - Survival): UK Aachen, LOC Hasselt, Catharina, MAASTRO, CHU Liege
• ozCAT (Head&Neck - Survival): Liverpool, Illawarra, Newcastle, Westmead, MAASTRO, RTOG/NRG
• worldCAT (Rectum - Local Control): Fudan, Rome/EU, RTOG/NRG
• Interest to join: Erasmus (Breast), BCCA (Breast), Bloemfontein (Cervix), Odense (HN, Lung), Aalst (Lung), McGill (Brain)

Map © Copyright Showeet.com

Typical Data Quality challenges
• Data are unstructured
• Data are not understandable
• Data are missing
• Data are incorrect
• Data are contradictory
• Data are biased
• Data are missing in a biased way
• Garbage in – Garbage out?

Examples
• 声门下区 (Chinese: subglottic region)
• T4N0M0 Stage IV patient
• Patient weighing 1000kg
• Grade 3+ toxicities
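Some of these problems can be caught by simple automated checks before modelling. A minimal sketch for the missing and implausible-value cases; the field name `weight_kg` and the plausibility range are hypothetical choices for illustration, not taken from the talk:

```python
def quality_flags(record, weight_range_kg=(20, 300)):
    """Return a list of data quality flags for one patient record.

    The range is an illustrative assumption, not a clinical standard.
    """
    flags = []
    weight = record.get("weight_kg")
    if weight is None:
        flags.append("weight missing")
    elif not (weight_range_kg[0] <= weight <= weight_range_kg[1]):
        flags.append("weight implausible")
    return flags

print(quality_flags({"weight_kg": 1000}))  # the 1000 kg patient is flagged
print(quality_flags({}))                   # no weight recorded
```

Checks like this only cover missing and incorrect values; contradictory or biased data need domain knowledge to detect.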

For the techies…

Horizontal Partitions
[Figure: data elements vs. patients, with the same data elements held separately for Maastricht patients and Shanghai patients]
• Reasonably well understood
• Distributed learning possible if data is FAIR
• No need for data to leave the hospital
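A minimal sketch of distributed learning on horizontally partitioned data, assuming a logistic regression trained by gradient descent. The slides do not say which algorithm or protocol is used, so this is one possible illustration, with toy data and hypothetical function names. Each site returns only a summed gradient and a patient count; patient rows never leave the site.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_gradient(weights, rows, labels):
    """Runs inside one hospital: returns the summed gradient and a count only."""
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return grad, len(rows)

def train_distributed(sites, n_features, lr=0.5, rounds=500):
    """Central coordinator: pools per-site gradients and updates the weights."""
    weights = [0.0] * n_features
    for _ in range(rounds):
        total, n = [0.0] * n_features, 0
        for rows, labels in sites:
            grad, count = local_gradient(weights, rows, labels)
            total = [t + g for t, g in zip(total, grad)]
            n += count
        weights = [w - lr * t / n for w, t in zip(weights, total)]
    return weights

# Toy data: first column is the intercept, second a single feature.
site_a = ([[1.0, 0.2], [1.0, 0.9], [1.0, 0.4]], [0, 1, 0])
site_b = ([[1.0, 0.8], [1.0, 0.1], [1.0, 0.7]], [1, 0, 1])
weights = train_distributed([site_a, site_b], n_features=2)
print(weights)
```

Because the pooled gradient is the sum of the site gradients, the result matches training on the combined data up to floating-point rounding. This only works when every site codes the features identically, which is where FAIR data comes in.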

Vertical and Complex Partitions
[Figure: data elements vs. patients, with some data elements held at MAASTRO and others at a registry]

A bit more technical detail
• Keep data locally
• Standardize it according to an ontology
• Make and send around learning “bots”
• Share the results - not the data!
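To illustrate "share the results, not the data", here is a hedged sketch of a very simple bot. It computes summary statistics at each site, and a coordinator combines them into a pooled mean and standard deviation. The site names, ages, and function names are made up for the example.

```python
import math

def bot_summary(values):
    """Runs at one site: returns count, sum and sum of squares, no patient rows."""
    known = [v for v in values if v is not None]
    return {"n": len(known), "sum": sum(known), "sum_sq": sum(v * v for v in known)}

def combine(summaries):
    """Runs centrally: pooled mean and sample standard deviation."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical ages at two sites; None marks a missing value.
site_ages = {"site_a": [61, 70, None, 58], "site_b": [66, 73, 64]}
summaries = [bot_summary(ages) for ages in site_ages.values()]
mean_age, sd_age = combine(summaries)
print(round(mean_age, 2), round(sd_age, 2))
```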

Even more technical details
• De-identification
• Semantic web, linked data
• Imaging/DICOM data & clinical data stream

The Modelling Part

Our modelling approach
• Hypothesis driven!!

How much data do you need?
• Rule of thumb: min. 10 events per input feature
• 200 NSCLC patients
• 25% survival at two years
• 50 events
• 10 input features
• More is better

Source: vitalflux.com (2017)
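A worked version of the rule of thumb, as a sketch. It assumes the common convention that "events" means the rarer of the two outcome classes, which matches the slide's 50 events from 200 patients. Read strictly, the rule supports 5 features for this cohort. Ten input features would leave 5 events per feature, which may be what "more is better" is pointing at; that reading is an interpretation, not something the slide states.

```python
def events(n_patients, outcome_fraction):
    """Number of events, taken as the rarer of the two outcome classes (assumption)."""
    k = round(n_patients * outcome_fraction)
    return min(k, n_patients - k)

def max_features(n_events, events_per_feature=10):
    """Largest number of input features the rule of thumb allows."""
    return n_events // events_per_feature

n_events = events(200, 0.25)        # 200 NSCLC patients, 25% survival at two years
print(n_events)                     # events available
print(max_features(n_events))       # features supported at 10 events per feature
print(n_events / 10)                # events per feature if 10 features are used
```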

Machine Learning
[Figure. Source: Jason Brownlee (2013)]

Considerations for machine learning
• Discrimination (AUC)
• Calibration (Brier)
• Interpretability (black box vs. transparent)
• Can it handle low data quality (of training and validation)?
• Can it be learned in a distributed setting?
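The two metrics named above can be computed in a few lines. This is a plain-Python sketch on made-up outcomes and predictions. AUC is the fraction of (event, non-event) pairs in which the event received the higher predicted probability, with ties counting half. The Brier score is the mean squared difference between predicted probability and outcome.

```python
def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy example: 1 = event observed, 0 = no event.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(round(auc(y_true, y_prob), 3), round(brier(y_true, y_prob), 3))
```

An AUC of 0.5 means no discrimination and 1.0 is perfect. The Brier score mixes calibration and discrimination, so it is usually read alongside a calibration plot.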

Choose already
Simple and quick, but need complete data
• Logistic regression
• Support Vector Machines

Intuitive and can handle missing data
• Bayesian Networks

All can be learned in a distributed setting

Review pending


TRIPOD

https://www.tripod-statement.org/

Model validation
• Discrimination: Is the model able to classify the population into two or more groups with different observed survival?
• Calibration: Is the estimated probability of survival equal to the observed survival probability?
• Clinical usefulness: Is the data on which the model is based representative for my patient and is the predicted outcome clinically relevant for my patient?
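The first two questions can be checked together by splitting patients into risk groups by predicted probability, then comparing the mean prediction with the observed outcome fraction in each group. This is a minimal sketch on made-up data, assuming a binary outcome at a fixed time point rather than a full survival analysis:

```python
def calibration_table(y_true, y_prob, n_groups=2):
    """Split patients into equal-sized risk groups by predicted probability.

    Returns one (mean predicted, observed fraction, group size) tuple per group.
    Assumes there are at least n_groups patients.
    """
    pairs = sorted(zip(y_prob, y_true))
    table = []
    for g in range(n_groups):
        start = g * len(pairs) // n_groups
        stop = (g + 1) * len(pairs) // n_groups
        group = pairs[start:stop]
        mean_pred = sum(p for p, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        table.append((mean_pred, observed, len(group)))
    return table

# Toy example: 1 = event observed, 0 = no event.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
table = calibration_table(y_true, y_prob)
for mean_pred, observed, size in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={size}")
```

Groups whose observed fractions differ indicate discrimination. Observed fractions close to the mean predictions indicate calibration.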

Laryngeal carcinoma model
• 994 MAASTRO patients
• 1990-2005
• www.predictcancer.org
• Input parameters
  – Age
  – Hemoglobin
  – T-stage
  – Radiotherapy Dose (Gy)
  – Gender
  – N+
  – Tumor location
• Output parameters
  – Overall survival

Discrimination / Calibration / Clinical Relevance?

• Discrimination: Is the model able to classify the population into two or more groups with different observed survival?

• Calibration: Is the estimated probability of survival equal to the observed survival probability?

• Clinical usefulness: Is the data on which the model is based representative for my patient and is the predicted outcome clinically relevant for my patient?



There is an app for that

Learning objectives
After the lecture, attendees should be able to
• Name the major sources of cancer data and their absolute and relative size
• Understand the challenges of sharing data and solutions to these
• Itemize steps in the methodology to go from data to models
• Appraise papers that describe models incl. using TRIPOD

Acknowledgements
• Fudan Cancer Center, Shanghai, China
• Varian, Palo Alto, CA, USA
• Siemens, Malvern, PA, USA
• RTOG, Philadelphia, PA, USA
• MAASTRO, Maastricht, Netherlands
• Policlinico Gemelli, Roma, Italy
• UH Ghent, Belgium
• UZ Leuven, Belgium
• Radboud, Nijmegen, Netherlands
• University of Sydney, Australia
• University of Michigan, Ann Arbor, USA
• Liverpool and Macarthur CC, Australia
• CHU Liege, Belgium
• Uniklinikum Aachen, Germany
• LOC Genk/Hasselt, Belgium
• Princess Margaret CC, Canada
• The Christie, Manchester, UK
• UH Leuven, Belgium
• State Hospital, Rovigo, Italy
• Illawarra Shoalhaven CC, Australia
• Catharina Zkh Eindhoven, Netherlands
• Philips, Eindhoven, Netherlands

More info on: www.predictcancer.org www.cancerdata.org www.eurocat.info www.mistir.info

Thank you for your attention



[Diagram: data model for a lung cancer radiotherapy patient, with the ontology code for each concept]

Concepts
• Patient (ncit:C16960)
• Hospital (ncit:C19326) (uri=http://www.uhn.ca/PrincessMargaret)
• Non-small cell lung carcinoma (ncit:C2926)
• Sex (nci:C20197 and nci:C16576): Value
• Age at diagnosis (roo:100002): Value, Year (uo:UO_0000036)
• Age at start RT (roo:100003): Value, Year (uo:UO_0000036)
• Survival (roo:100063): Value, Month (uo:UO_0000035)
• Vital Status (ncit:C37987 or ncit:28554)
• FEV1 (nci:C38084), Percentage FEV1 (nci:C112376): Value; units Liter (uo:UO_0000099), Percent (uo:UO_0000187)
• ECOG performance status (nci:105722, nci:105723, nci:105725, nci:105726, nci:105727, nci:105728): Value
• Positive Lymph Node Stations (roo:100049): Value, Count (uo:UO_0000189)
• Histology (nci:2926, nci:2852, nci:3780, nci:2929, nci:2852, nci:3915)
• Clinical TNM Finding (ncit:C48881)
• Generic T-stage 0-4 (ncit:48719) (ncit:48720) … (ncit:48732)
• AJCC Edition (roo:100052) (roo:100053)
• Diagnostic Procedure (ncit:C18020)
• Volume of primary tumor (roo:100054): Value, Cubic centimeter (uo:UO_0000097)
• RT Structure Set (sedi:RTStructureSet)
• MIA Version (mia:<version>)
• Radiation Therapy (ncit:C15313) OR SBRT (ncit:C118286)
• Prescribed Radiotherapy Dose (roo:100013): Value (xsd:double), Gray (uo:UO_0000134)
• Delivered Radiotherapy Dose (roo:100012): Value (xsd:double), Gray (uo:UO_0000134)
• No. RT Fractions Per Treatment (roo:100356): Value (xsd:integer)
• No. RT Fractions Per Day (roo:100355): Value (xsd:integer)
• First radiotherapy fraction (roo:100058)
• Last radiotherapy fraction (roo:100059)
• DateTimeDescription

Relations
• has_unit (roo:100027)
• has_clinical_t_stage (roo:100244)
• has_volume (roo:100315)
• at_date_time (roo:100041)

Toxicities (each linked to a DateTimeDescription via at_date_time roo:100041)
• Pneumonitis (ctcae:Pneumonitis)
• Fracture (ctcae:Fracture), Rib (fma:fma7574)
• Reaction (ctcae:Radiation_recall_reaction_dermatologic)
• Fatigue (ctcae:Fatigue)
• Dyspnea (ctcae:Dyspnea)
• Cough (ctcae:Couch)
• Anorexia (ctcae:Anorexia)
• Dysphagia (ctcae:Dysphagia)
• Hemoptysis (nci:C3094)
• Esophagitis (ctcae:Esophagitis)
• Pulmonary Fibrosis (ctcae:Pulmonary_fibrosis)
• Brachial plexopathy (ctcae:Brachial_plexopathy)
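A minimal sketch of how a few of these concepts could be linked as subject-predicate-object triples, so that an application can read them without human interpretation. Codes prefixed ncit:, roo: and uo: are copied from the diagram. Everything prefixed ex: (the patient identifier, the has_age and has_value links, and the age value) is hypothetical. A real deployment would more likely use an RDF store queried with SPARQL than Python tuples.

```python
# Triples as (subject, predicate, object).
triples = [
    ("ex:patient1", "rdf:type", "ncit:C16960"),   # Patient
    ("ex:patient1", "ex:has_age", "ex:age1"),     # hypothetical link
    ("ex:age1", "rdf:type", "roo:100002"),        # Age at diagnosis
    ("ex:age1", "roo:100027", "uo:UO_0000036"),   # has_unit Year
    ("ex:age1", "ex:has_value", 67),              # hypothetical value
]

def match(pattern, store):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

def age_at_diagnosis(patient, store):
    """Look up (value, unit) of the age-at-diagnosis node, or None if absent."""
    for _, _, node in match((patient, "ex:has_age", None), store):
        if match((node, "rdf:type", "roo:100002"), store):
            value = match((node, "ex:has_value", None), store)[0][2]
            unit = match((node, "roo:100027", None), store)[0][2]
            return value, unit
    return None

print(age_at_diagnosis("ex:patient1", triples))
```

Because the unit and the meaning of each node are explicit codes, the same query works at any site that follows the same ontology.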

Tech used
• ETL (Pentaho, Talend)
• DICOM de-identification (CTP)
• RDF store & SPARQL endpoint (Blazegraph, Sesame)
• Ontology editing (Protégé)
• Ontology publishing (BioPortal)
• Database (PostgreSQL)
• Database to RDF (D2R)
• DICOM to RDF (SeDI)
• PACS (dcm4chee)
• Image processing pipeline (MIA-MAASTRO)
• Distributed application (Varian, Docker)
• Generic & Machine learning (Matlab, R, Java, Python)