Data Science, Big Data and You

Presentation at George Mason University, April 2013

Joel Saltz MD, PhD, Emory University, February 2013

Data Science, Big Data and You

Center for Comprehensive Informatics

Big Data

• Social media: analysis of tweets and Facebook posts to observe trends in real time

• Local Walgreens stores stock their shelves according to local tweets about cold symptoms

• Credit card fraud: among lots of transactions, you get a flag that you shopped in a store that does not fit your profile, and within minutes your card is blocked

Big Data in Commerce - Fraud Detection

• Seek unexpected data: outliers

• Lots of data: all Amex, Visa or Mastercard transactions

• Look for individual outliers, e.g. a credit transaction involving a large amount of money purchasing an unusual product

• Look for sequence data with temporal or spatial relationships and find unusual sequences, e.g. intrusion detection and cyber security

• Define the “typical” regions in a data set – may be difficult

• “Typical” behavior may change with time. What is typical today may be considered anomalous in future and vice versa.

• (Smart) crooks will try to “keep under the radar” to stay undetected

Approaches

• Sometimes build a model from the training data and apply the model to detect outliers

• Sometimes use the existing data directly to detect outliers
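
The two approaches can be contrasted in a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from the talk: the model-based variant fits a mean and standard deviation to training amounts and flags values by z-score, while the direct variant flags a value that lies far from every existing data point. The function names, thresholds, and sample amounts are all hypothetical.

```python
import statistics

def fit_model(training_amounts):
    """Model-based approach: summarize training data as (mean, standard deviation)."""
    return statistics.mean(training_amounts), statistics.stdev(training_amounts)

def is_outlier_by_model(model, amount, z_threshold=3.0):
    """Flag an amount whose z-score under the fitted model exceeds the threshold."""
    mean, stdev = model
    return abs(amount - mean) / stdev > z_threshold

def is_outlier_by_neighbors(existing_amounts, amount, max_gap=100.0):
    """Direct approach: flag an amount that is far from every existing data point."""
    nearest_gap = min(abs(amount - other) for other in existing_amounts)
    return nearest_gap > max_gap

# Hypothetical transaction amounts in dollars.
history = [12.5, 40.0, 23.9, 18.0, 55.2, 31.4, 27.8, 44.1]
model = fit_model(history)
typical_flagged = is_outlier_by_model(model, 35.0)
large_flagged = is_outlier_by_model(model, 2500.0)
neighbor_flagged = is_outlier_by_neighbors(history, 2500.0)
```

The model-based variant needs a training set that is itself mostly free of fraud, while the direct variant needs no fitting step but compares each new value against all stored data.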

Big Data Ecosystem

Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/

Science and Engineering Applications

Sloan Sky Survey

Early Big Data, 1922: Lewis Richardson, Weather Forecasting

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Scientific Big Data Targets

• Multi-dimensional spatial-temporal datasets

– Biomedicine

– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation

– Biomass monitoring and disaster surveillance

– Weather prediction

– Analysis of Results from Large Scale Simulations

• Correlative and cooperative analysis of data from multiple sensor modalities and sources

• What-if scenarios and multiple design choices or initial conditions

Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)

Integrative Cancer Research with Digital Pathology

histology neuroimaging

clinical/pathology

Integrated Analysis

molecular

High-resolution whole-slide microscopy

Multiplex IHC

Integrative Analysis: OSU BISTI NBIB Center

Big Data (2005)

Associate genotype with phenotype

Big science experiments on cancer, heart disease, pathogen host response

Tissue specimen: 1 cm^3 at 0.3 μ resolution, roughly 10^13 bytes

Molecular data (spatial location) can add an additional significant factor, e.g. 10^2

Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD

Multiple tissue specimens: another factor of 10^3

Total: 10^18 bytes, an exabyte per big science experiment
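
The estimate above is a product of powers of ten, which a few lines of Python can check. This is a worked restatement of the slide's own figures; the variable names are illustrative, and the voxel check assumes roughly one byte per voxel, which the slide does not state.

```python
# Figures taken from the slide; names are illustrative.
bytes_per_specimen = 10**13   # 1 cm^3 of tissue at 0.3 micron resolution
molecular_factor = 10**2      # spatially located molecular data
specimens_factor = 10**3      # multiple tissue specimens

total_bytes = bytes_per_specimen * molecular_factor * specimens_factor
exabyte = 10**18
total_exabytes = total_bytes // exabyte

# Rough check of the per-specimen figure (assumption: about one byte per voxel).
voxels_per_edge = 1e-2 / 0.3e-6              # a 1 cm edge sampled every 0.3 micron
voxels_per_specimen = voxels_per_edge ** 3   # same order of magnitude as 10^13
```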

A Data Intense Challenge: The Instrumented Oil Field of the Future

The Tyranny of Scale (Tinsley Oden - U Texas)

process scale

field scale

km

cm

simulation scale

mm

pore scale

Why Applications Get Big

• Physical world or simulation results

• Detailed description of two, three (or more) dimensional space

• High resolution in each dimension, lots of timesteps

• e.g. oil reservoir code: simulate a 100 km by 100 km region to 1 km depth at a resolution of 10 cm:

– 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps: 10^24 bytes (a yottabyte) of data!
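
The mesh counts quoted above (10^6*10^6*10^4) correspond to a grid spacing of 10 cm over a 100 km by 100 km by 1 km region. The short sketch below reproduces the yottabyte total from those inputs; it uses integer centimeters to avoid floating-point rounding, and the variable names are illustrative.

```python
# All lengths in centimeters so the divisions stay exact.
extent_x_cm = 100 * 100_000   # 100 km
extent_y_cm = 100 * 100_000   # 100 km
depth_cm = 1 * 100_000        # 1 km
spacing_cm = 10               # spacing that matches the slide's mesh counts

points_x = extent_x_cm // spacing_cm
points_y = extent_y_cm // spacing_cm
points_z = depth_cm // spacing_cm
mesh_points = points_x * points_y * points_z

bytes_per_point = 10**2
timesteps = 10**6
total_bytes = mesh_points * bytes_per_point * timesteps
yottabyte = 10**24
```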

Detect and track changes in data during production

Invert data for reservoir properties

Detect and track reservoir changes

Assimilate data & reservoir properties into the evolving reservoir model

Use simulation and optimization to guide future production

Oil Field Management – Joint ITR with Mary Wheeler, Paul Stoffa

Coupled Ground Water and Surface Water Simulations

Multiple codes, e.g. fluid code, contaminant transport code

Different space and time scales

Data from a given fluid code run is used in different contaminant transport code scenarios

Bioremediation Simulation

Microbe colonies (magenta)

Dissolved NAPL (blue)

Mineral oxidation products (green)

Abiotic reactions compete with microbes, reduce extent of biodegradation

National Science Foundation Grand Challenge in Land Cover Dynamics - 1994

• Remote sensing analysis of high resolution satellite images.

• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling

• Maps of the world's tropical rain forest during the past three decades.

Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend

Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results

Dimitri Mavriplis, Raja Das, Joel Saltz, 1990s

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Whole Slide Imaging: Scale

Data per slide: 500MB to 100GB

Roughly 250-500M Slides/Year in USA

Total: 0.1-10 Exabytes/year

Using TCGA Data to Study Glioblastoma

Diagnostic Improvement

Molecular Classification

Predictors of Progression

Digital Pathology

Neuroimaging

TCGA Network

Morphological Tissue Classification

Nuclei Segmentation

Cellular Features

Lee Cooper, Jun Kong

Whole Slide Imaging

Oligodendroglioma Astrocytoma

Nuclear Qualities

Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?

Application: Oligodendroglioma Component in GBM

Millions of Nuclei Defined by n Features

• Top-down analysis: use the features with existing diagnostic constructs

• Bottom-up analysis: let features define and drive the analysis

TCGA Whole Slide Images

Jun Kong

Nuclear Analysis Workflow

Step 1: Nuclei Segmentation

• Identify individual nuclei and their boundaries

Step 2: Feature Extraction

• Describe individual nuclei in terms of size, shape, and texture

Step 3: Nuclei Classification

Nuclear Qualities: scale from 1 (Oligodendroglioma) to 10 (Astrocytoma)
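
To make the feature-extraction step concrete, here is a minimal sketch of two shape features computed from a segmented nuclear boundary. It is not the pipeline used in the study: it assumes the boundary is already available as a list of (x, y) vertices, defines circularity as 4πA/P², and estimates eccentricity from the covariance of the boundary points, which is only an approximation for irregular or unevenly sampled outlines. All function names are hypothetical.

```python
import math

def area_and_perimeter(boundary):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    twice_area, perimeter = 0.0, 0.0
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter

def circularity(boundary):
    """4*pi*area / perimeter^2: 1.0 for a perfect circle, lower for other shapes."""
    area, perimeter = area_and_perimeter(boundary)
    return 4.0 * math.pi * area / perimeter ** 2

def eccentricity(boundary):
    """Eccentricity of the ellipse implied by the boundary points' covariance."""
    n = len(boundary)
    mx = sum(x for x, _ in boundary) / n
    my = sum(y for _, y in boundary) / n
    sxx = sum((x - mx) ** 2 for x, _ in boundary) / n
    syy = sum((y - my) ** 2 for _, y in boundary) / n
    sxy = sum((x - mx) * (y - my) for x, y in boundary) / n
    mid = (sxx + syy) / 2.0
    half_gap = math.hypot((sxx - syy) / 2.0, sxy)
    major, minor = mid + half_gap, mid - half_gap   # covariance eigenvalues
    return math.sqrt(max(0.0, 1.0 - minor / major))

def ellipse_boundary(a, b, n=360):
    """Synthetic test outline: n points on an ellipse with semi-axes a and b."""
    return [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
            for k in range(n)]

round_nucleus = ellipse_boundary(1.0, 1.0)
elongated_nucleus = ellipse_boundary(2.0, 1.0)
```

On these synthetic outlines the round shape has circularity close to 1 and eccentricity close to 0, while the elongated shape has lower circularity and higher eccentricity.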

Survival Analysis

Human Machine

Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification

Oligo Related Genes

Myelin Basic Protein, Proteolipoprotein, HoxD1

Nuclear features most associated with Oligo Signature Genes:

Circularity (high), Eccentricity (low)

Millions of Nuclei Defined by n Features

• Top-down analysis: analyze features in context of existing diagnostic constructs

• Bottom-up analysis: let nuclear features define and drive the analysis

Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information

Lee Cooper, Carlos Moreno

Consensus clustering of morphological signatures

Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients

Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
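
The co-clustering idea can be sketched as follows: for one candidate number of clusters, run K-means many times and record how often each pair of items lands in the same cluster. This is a small illustration, not the study's code. It assumes each run uses a random 80% subsample and a random initialization, which the slide does not specify, and it defaults to 200 runs rather than the 2000 quoted above. Function and parameter names are hypothetical.

```python
import random

def kmeans_labels(points, k, rng, max_iter=50):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def consensus_matrix(points, k, iterations=200, sample_fraction=0.8, seed=0):
    """Fraction of runs in which each pair was clustered together, given both were sampled."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(sample_fraction * n))
    for _ in range(iterations):
        idx = rng.sample(range(n), size)
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical two-feature signatures forming two well-separated groups.
signatures = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
              (10.0, 10.1), (10.2, 10.0), (10.1, 10.3), (10.3, 10.2)]
consensus = consensus_matrix(signatures, k=2)
```

Entries near 1 or 0 indicate stable groupings. Repeating the procedure for each candidate k and comparing how crisp the matrices are is one way to choose the number of clusters.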

Nuclear Features Used to Classify GBMs

[Figure: plot of Silhouette Area against # Clusters; plot of Silhouette Value (0 to 1) by Cluster (1, 2, 3)]

Clustering identifies three morphological groups

• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)

• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)

• Prognostically-significant (logrank p=4.5e-4)

[Figure: Feature Indices for groups CC, CM, and PB; survival curves (Survival against Days) for CC, CM, and PB]

Molecular Correlates of MR Features Using TCGA Data

MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools

MR Features compared to TCGA Transcriptional Classes and Genetic Alterations

David Gutman

VASARI Feature Set

Principal Investigator and Director: Haian Fu

Co-Directors: Fadlo R. Khuri, Joel Saltz

Project Manager: Margaret Johns

Aim 1 Leader: Yuhong Du

Aim 2 Leader: Carlos Moreno

Cancer genomics-based HT PPI network discovery & validation

Genomics informatics and data integration

Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer

Winship Cancer Institute

Center for Comprehensive Informatics

Emory Chemical Biology Discovery Center

Emory Molecular Interaction Center for Functional Genomics (MicFG)

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!

HPC Segmentation and Feature Extraction Pipeline

Tony Pan and George Teodoro

Large Scale Data Management

Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.

Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships

Highly optimized spatial query and analyses

Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2

Spatial Centric: Pathology Imaging “GIS”

Point query: human marked point inside a nucleus

Window query: return markups contained in a rectangle

Spatial join query: algorithm validation/comparison

Containment query: nuclear feature aggregation in tumor regions
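
Three of the query types listed above can be sketched in plain Python. This is only an illustration of what each query computes, not the optimized CPU/GPU, Hadoop, or DB2 implementations the slide refers to. It assumes markups are stored as polygons (lists of (x, y) vertices) keyed by a hypothetical name, uses ray casting for the point query, and approximates containment by testing every vertex, which is exact only for convex regions.

```python
def point_in_polygon(point, polygon):
    """Point query: ray casting test for a point strictly inside a polygon."""
    x, y = point
    inside = False
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y0 > y) != (y1 > y):  # edge crosses the horizontal line through the point
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def bounding_box(polygon):
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)

def window_query(markups, window):
    """Window query: names of markups whose bounding box lies inside the rectangle."""
    wx0, wy0, wx1, wy1 = window
    found = []
    for name, polygon in markups.items():
        x0, y0, x1, y1 = bounding_box(polygon)
        if wx0 <= x0 and wy0 <= y0 and x1 <= wx1 and y1 <= wy1:
            found.append(name)
    return found

def containment_query(markups, region):
    """Containment query: names of markups with every vertex inside the region."""
    return [name for name, polygon in markups.items()
            if all(point_in_polygon(vertex, region) for vertex in polygon)]

# Hypothetical nuclei (small squares) and a hypothetical tumor region.
nuclei = {
    "n1": [(1, 1), (2, 1), (2, 2), (1, 2)],
    "n2": [(5, 5), (6, 5), (6, 6), (5, 6)],
    "n3": [(20, 20), (21, 20), (21, 21), (20, 21)],
}
tumor_region = [(0, 0), (10, 0), (10, 10), (0, 10)]

marked_point_hit = point_in_polygon((1.5, 1.5), nuclei["n1"])
in_window = window_query(nuclei, (0, 0, 7, 7))
in_tumor = containment_query(nuclei, tumor_region)
```

A production system would put a spatial index in front of these tests so that each query touches only nearby markups. The linear scans here are for clarity.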

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

• Example Project: Find hot spots in readmissions within 30 days

– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?

– What fraction of patients with a given set of diseases will be readmitted within 30 days?

– How do severity and time course of co-morbidities affect readmissions?

– Geographic analyses

• Compare and contrast with UHC Clinical Data Base

– Repeat analyses across all UHC hospitals

– Are we performing the same?

– How are UHC-curated groupings of patients (e.g., product lines) useful?
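
The first question in the list (the fraction of encounters with a given principal diagnosis that are followed by a readmission within 30 days) reduces to a grouping computation. The sketch below assumes a hypothetical record layout with `patient`, `admit`, `discharge`, and `diagnosis` fields and counts any return within the window as a readmission. Real measures apply exclusions, such as planned chemotherapy visits, that are not modeled here.

```python
from collections import defaultdict
from datetime import date

def readmission_fraction_by_diagnosis(encounters, window_days=30):
    """Fraction of encounters per diagnosis followed by a readmission within the window."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)

    totals, readmits = defaultdict(int), defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda e: e["admit"])
        for current, following in zip(visits, visits[1:] + [None]):
            totals[current["diagnosis"]] += 1
            if following is not None:
                gap = (following["admit"] - current["discharge"]).days
                if 0 <= gap <= window_days:
                    readmits[current["diagnosis"]] += 1
    return {dx: readmits[dx] / totals[dx] for dx in totals}

# Hypothetical encounters for three patients.
encounters = [
    {"patient": "p1", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5),
     "diagnosis": "heart failure"},
    {"patient": "p1", "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25),
     "diagnosis": "heart failure"},
    {"patient": "p2", "admit": date(2012, 2, 1), "discharge": date(2012, 2, 3),
     "diagnosis": "heart failure"},
    {"patient": "p3", "admit": date(2012, 3, 1), "discharge": date(2012, 3, 4),
     "diagnosis": "pneumonia"},
    {"patient": "p3", "admit": date(2012, 6, 1), "discharge": date(2012, 6, 2),
     "diagnosis": "pneumonia"},
]
rates = readmission_fraction_by_diagnosis(encounters)
```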

Clinical Phenotype Characterization and the Emory Analytic Information Warehouse

Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod

Overall System

i2b2 Web Server

i2b2 Database

Source data

Database Mapper

Source data

Source data

Data Processing

Metadata Manager

Metadata Repository

Query Specification

Investigator

Data Analyst

Data Analyst

Data Modeler

Investigator

Query tools

Study-specific Database

Investigator

5-year Datasets from Emory and University HealthSystem Consortium

• EUH, EUHM and WW (inpatient encounters)

• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)

• Encounter location (down to unit for Emory)

• Providers (Emory only)

• Discharge disposition

• Primary and secondary ICD9 codes

• Procedure codes

• DRGs

• Medication orders (Emory only)

• Labs (Emory only)

• Vitals (Emory only)

• Geographic information (CDW only + US Census and American Community Survey)

Analytic Information Warehouse

Using Emory & UHC Data to Find Associations With 30-day Readmits

• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining

– Too many diagnosis codes, procedure codes

– Continuous variables (e.g., labs) require interpretation

– Temporal relationships between variables are implicit

• Solution: Transform the data into a much smaller set of variables using heuristic knowledge

– Categorize diagnosis and procedure codes using code hierarchies

– Classify continuous variables using standard interpretations (e.g., high, normal, low)

– Identify temporal patterns (e.g., frequency, duration, sequence)

– Apply standard data mining techniques
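
The transformation steps above can be illustrated with a short sketch. It is not the Analytic Information Warehouse implementation. It assumes ICD-9 codes written with a decimal point, rolls them up to the category before the point, classifies a lab value against a hypothetical reference range, and uses the longest run of consecutive high results as one simple temporal pattern. All names and ranges are illustrative.

```python
from collections import Counter

# Hypothetical reference range; real ranges depend on the laboratory and assay.
REFERENCE_RANGES = {"potassium": (3.5, 5.0)}

def icd9_category(code):
    """Roll an ICD-9 code such as '428.21' up to its category, '428'."""
    return code.split(".")[0]

def classify_lab(name, value):
    """Map a continuous lab value to 'low', 'normal', or 'high'."""
    low, high = REFERENCE_RANGES[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def longest_run(states, target):
    """Temporal pattern: length of the longest consecutive run of one state."""
    best = run = 0
    for state in states:
        run = run + 1 if state == target else 0
        best = max(best, run)
    return best

def derive_variables(diagnosis_codes, lab_name, lab_values):
    """Turn raw codes and time-ordered lab values into a small set of derived variables."""
    variables = Counter("dx_" + icd9_category(code) for code in diagnosis_codes)
    states = [classify_lab(lab_name, value) for value in lab_values]
    for state in states:
        variables[f"{lab_name}_{state}"] += 1
    variables[f"{lab_name}_longest_high_run"] = longest_run(states, "high")
    return dict(variables)

derived = derive_variables(["428.0", "428.21", "250.00"], "potassium", [4.1, 5.6, 5.9, 4.8])
```

Here three raw diagnosis codes collapse to two categories and four numeric results collapse to a handful of counts, which is the kind of reduction that makes standard data mining techniques practical.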

Analytic Information Warehouse

30-Day Readmission Rates for Derived Variables

Emory Health Care

Geographic Analyses

UHC Medicine General Product Line (#15)

Analytic Information Warehouse

Predictive Modeling for Readmission

• Random forests (ensemble of decision trees)

– Create a decision tree using a random subset of the variables in the dataset

– Generate a large number of such trees

– All trees vote to classify each test example in a training dataset

– Generate a patient-specific readmission risk for each encounter

• Rank the encounters by risk for a subsequent 30-day readmission
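
The voting and ranking steps can be sketched in a few dozen lines. This is a deliberately reduced illustration, not the model used in the study: each "tree" is a one-split decision stump trained on a bootstrap sample and a random subset of variables, and the risk for an encounter is the fraction of stumps voting "readmitted". The variable layout (prior admissions, age, lab value) and all data are hypothetical.

```python
import random
from collections import Counter

def majority(labels, default=0):
    """Most common label, or the default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows, labels, features):
    """Pick the feature whose mean-threshold split classifies the sample best."""
    overall = majority(labels)
    best = None
    for f in features:
        threshold = sum(r[f] for r in rows) / len(rows)
        below = [y for r, y in zip(rows, labels) if r[f] <= threshold]
        above = [y for r, y in zip(rows, labels) if r[f] > threshold]
        vote_below, vote_above = majority(below, overall), majority(above, overall)
        correct = (sum(y == vote_below for y in below)
                   + sum(y == vote_above for y in above))
        if best is None or correct > best[0]:
            best = (correct, f, threshold, vote_below, vote_above)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples, each seeing a random subset of variables."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest

def readmission_risk(forest, row):
    """Fraction of stumps voting 1 (readmitted) for this encounter."""
    votes = [above if row[f] > t else below for f, t, below, above in forest]
    return sum(votes) / len(votes)

# Hypothetical training data: (prior_admissions, age, lab_value) -> readmitted?
rows, labels = [], []
for age, lab in [(50, 1.0), (60, 1.2), (70, 0.9), (80, 1.1), (65, 1.3)]:
    for prior, readmitted in [(0, 0), (3, 1)]:
        rows.append((prior, age, lab))
        labels.append(readmitted)

forest = fit_forest(rows, labels)
new_encounters = [(0, 65, 1.1), (3, 65, 1.1)]
ranked = sorted(new_encounters, key=lambda row: readmission_risk(forest, row), reverse=True)
```

Ranking by the vote fraction rather than a hard yes/no prediction is what allows the highest-risk encounters to be singled out for follow-up.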

Sharath Cholleti

Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest

Predictive Modeling for 180 UHC Hospitals, 35 Million Patients

Identify High Risk Patients! Readmission fraction of top 10% high risk patients

[Figure: readmission fraction (0 to 0.9) of the top 10% high risk patients by hospital, comparing the All Hospital Model with the Individual Hospital Model]
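
The quantity plotted, the readmission fraction among the 10% of patients ranked highest by predicted risk, can be computed as in this minimal sketch. The function name and the data are hypothetical, and the risks are assumed to come from whatever model is being evaluated.

```python
def top_fraction_readmission_rate(risks, readmitted, fraction=0.10):
    """Observed readmission rate among the highest-risk fraction of patients."""
    ranked = sorted(zip(risks, readmitted), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(outcome for _, outcome in ranked[:top_n]) / top_n

# Hypothetical predicted risks and observed outcomes (1 = readmitted) for 20 patients.
risks = [i / 20 for i in range(20)]
outcomes = [0] * 20
outcomes[19] = 1
outcomes[3] = 1

top_rate = top_fraction_readmission_rate(risks, outcomes)
overall_rate = sum(outcomes) / len(outcomes)
```

A model is useful for targeting interventions when the top-decile rate is well above the overall rate, as in this toy example.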

Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU

Numerics and Waveforms (240 Hz)

~ 2 sec latency

Burst of tachycardia, no desaturation

Two episodes of desaturation, no change in heart rate

HR

SpO2

This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.

We have started to construct alerts around desaturation behaviors

(this image courtesy IBM)

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Thanks to:

• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)

• Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)

• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod

• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz

• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe

• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos

• ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)

Thanks to

• National Cancer Institute

• National Library of Medicine

• National Science Foundation

• Cardiovascular Research Grid (NHLBI)

• Minority Health Grid (ARRA)

• Emory Health Care

• Kaiser Health Care

• Winship Cancer Institute

• Oak Ridge National Laboratory

• Woodruff Health Sciences

Thanks!
