MLconf NYC Pek Lum

Page 1: MLconf NYC Pek Lum

Data has Shape, Shape has Meaning

Pek Lum, Ph.D. Chief Data Scientist

VP of Solutions

Page 2: MLconf NYC Pek Lum

Company Timeline

2000: NSF funds Stanford Math Professor Gunnar Carlsson to research Topological Data Analysis

2005: DARPA invests in applying TDA to massive, complex, multi-modal DoD data

2008: AYASDI founded with DARPA and IARPA funding

2010: Floodgate and private investors provide seed capital

2013: Khosla Ventures and Institutional Venture Partners lead growth financing rounds

COMPANY CONFIDENTIAL 2

"Ayasdi's approach of using Topological Data Analysis is one of the top 10 innovations developed at DARPA in the last decade."

Tony Tether, Director, Defense Advanced Research Projects Agency (2001-2009)

Page 3: MLconf NYC Pek Lum

Recent Awards

Ayasdi named one of the Top 10 Most Innovative Companies in Big Data for 2013

Won the Gigaom Structure Data Award for Most Promising Machine Learning/Artificial Intelligence Startup

In collaboration with UCSF, won the NFL/GE Head Health Challenge for their work in mild traumatic brain injury

Page 4: MLconf NYC Pek Lum
Page 5: MLconf NYC Pek Lum

The Problem

How can the life sciences field fully leverage data to gain knowledge and extract insights towards better health outcomes?

Why is complex (and big) data not tractable or accessible to everyone who has a stake in the results, from computational folks to the physicians who collected the data?

The Solution

Easy access to game-changing analytical methods. Computer-augmented analysis. Automation. Visualization. No need to think of queries first: answers first, hypothesis building after that.

The Impact

Obtaining insights from data can become more democratized, more collaborative among disparate disciplines, more meaningful, and faster

Page 6: MLconf NYC Pek Lum

My talk today

TDA as a framework for ML methods

Automation

AYASDI TDA

Interactive visualization

What is TDA?

Why TDA

Shape of Data

Speed to insights

Leveraging big data for the rest of us

Page 7: MLconf NYC Pek Lum

Data has shape and shape has meaning.

Page 8: MLconf NYC Pek Lum

TDA methods will transform the way that doctors triage patients, through construction of non-linear, non-invasive medical statistics to assess patients in intensive and critical care situations.

Introducing “The Shape of Data”

A mathematical concept that began in the 1700s.

Uses the shape of data to find unknown phenomena.

Topological Data Analysis (TDA)

Math + Computer Science + User Experience

Automated discovery of shapes

Page 9: MLconf NYC Pek Lum

What do I mean by data having shape?

Age, Weight and Height sampled uniformly at random

In reality, age, weight and height are correlated and that data has a shape
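
The contrast between the two pictures can be sketched numerically. The example below is a toy illustration only: the value ranges and the linear "growth model" are made-up assumptions, not data from the talk. It draws one cloud with age, height and weight sampled independently and uniformly, and one where height depends on age and weight on height, then measures how much of the variance lies along a single direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Case 1: each variable drawn independently and uniformly
# (hypothetical ranges: years, cm, kg).
age_u = rng.uniform(2, 18, n)
height_u = rng.uniform(85, 180, n)
weight_u = rng.uniform(12, 80, n)
uniform_cloud = np.column_stack([age_u, height_u, weight_u])

# Case 2: a toy growth model (an assumption for illustration):
# height follows age, weight follows height, each with some noise.
age_c = rng.uniform(2, 18, n)
height_c = 85 + 5.5 * (age_c - 2) + rng.normal(0, 6, n)
weight_c = 12 + 0.7 * (height_c - 85) + rng.normal(0, 5, n)
correlated_cloud = np.column_stack([age_c, height_c, weight_c])


def explained_by_first_axis(points):
    """Fraction of total variance along the first principal axis."""
    standardized = (points - points.mean(axis=0)) / points.std(axis=0)
    singular_values = np.linalg.svd(standardized, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[0] / variances.sum())


uniform_share = explained_by_first_axis(uniform_cloud)
correlated_share = explained_by_first_axis(correlated_cloud)
print(f"uniform: {uniform_share:.2f}, correlated: {correlated_share:.2f}")
```

The uniform cloud fills a box, so no single direction dominates (the share stays near one third). The correlated cloud concentrates along a curve-like band, which is the kind of "shape" the slide refers to.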

Page 10: MLconf NYC Pek Lum
Page 11: MLconf NYC Pek Lum
Page 12: MLconf NYC Pek Lum

What does the TDA approach bring to the table?

1. Coordinate-free representations are vital when one is studying data collected with different technologies: studies done at different times, multivariate data types collected.

2. Deformation invariance introduces a degree of robustness into the analysis, which is important in the study of real-world data: human heterogeneity is complex and needs an approach that is resistant to deformation (variation).

3. Compact representations are important for visualizing large and complex datasets, so that signals can be easily identified in the form of "shapes" in the network.
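
One way to read "coordinate free" is that the analysis depends only on distances between data points, not on the axes used to record them. The sketch below illustrates that reading under stated assumptions: the point cloud is random toy data and `pairwise_distances` is a helper written for this example. It shows that the distance matrix is unchanged when the same points are re-expressed in a rotated and shifted coordinate system.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


rng = np.random.default_rng(1)
points = rng.normal(size=(50, 3))  # toy point cloud

# An orthogonal change of axes (from a QR decomposition) plus a shift.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = points @ q.T + np.array([10.0, -4.0, 2.5])

# The coordinates differ, but every pairwise distance is preserved.
same_coordinates = np.allclose(points, moved)
same_distances = np.allclose(pairwise_distances(points),
                             pairwise_distances(moved))
print(same_coordinates, same_distances)
```

Any method that only consumes the distance matrix therefore gives the same answer for both versions of the data.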

Page 13: MLconf NYC Pek Lum

TDA summarizes the shape of data with no pre-conceived model of what it should be

Page 14: MLconf NYC Pek Lum

Bringing TDA (math world) into the Real Data World = Ayasdi Software

Automation

Interactive, intuitive visualization

Handling “Big Data”

No coding needed

Topology for the rest of us

TDA as a framework for ML methods

Page 15: MLconf NYC Pek Lum

Ayasdi Network Orientation

A node represents a group of similar objects.

Edges between nodes are drawn when the nodes are very similar to each other. Nodes that are not connected are less similar to each other.

The coloring reflects values of interest. The position of a node on the screen is irrelevant: only its connection to other nodes matters.
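
A network of this kind can be sketched with the generic Mapper construction from the TDA literature. This is not Ayasdi's implementation, and every name and parameter below (`mapper_graph`, `connected_groups`, `n_intervals`, `overlap`, `eps`) is a hypothetical choice for illustration. Points are sliced into overlapping ranges of a "lens" value, each slice is clustered, each cluster becomes a node, and two nodes are joined by an edge when they share data points.

```python
import numpy as np


def connected_groups(points, indices, eps):
    """Group indices by chaining together points closer than eps."""
    remaining = set(indices)
    groups = []
    while remaining:
        seed = remaining.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in remaining
                    if np.linalg.norm(points[current] - points[j]) < eps]
            for j in near:
                remaining.remove(j)
                group.add(j)
                frontier.append(j)
        groups.append(frozenset(group))
    return groups


def mapper_graph(points, lens, n_intervals=6, overlap=0.3, eps=0.4):
    """Return (nodes, edges); a node is a frozenset of point indices."""
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        # Overlapping slices of the lens range.
        start = lo + k * width - overlap * width
        stop = lo + (k + 1) * width + overlap * width
        members = np.flatnonzero((lens >= start) & (lens <= stop))
        nodes.extend(connected_groups(points, members.tolist(), eps))
    # Connect two nodes when they contain at least one common point.
    edges = {(a, b)
             for a in range(len(nodes))
             for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges


# Toy data: a noisy circle, with the x coordinate as the lens.
rng = np.random.default_rng(2)
angles = rng.uniform(0, 2 * np.pi, 300)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
circle += rng.normal(0, 0.03, circle.shape)

nodes, edges = mapper_graph(circle, lens=circle[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```

The output carries no screen coordinates at all, only nodes and their connections, which matches the point that a node's position is irrelevant. Coloring would be added afterwards by summarizing a value of interest over each node's members.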

Page 16: MLconf NYC Pek Lum

© 2012 Ayasdi Inc.

Case Study: Identification of patient subtypes

Ayasdi Network Orientation

Topological map of patient-patient relationships according to their tumors' molecular characteristics (in this case, gene expression)

Each node contains subsets of patients

These patients are eccentric (away from the center of the data)

These patients are close to the center of the data

Color scheme
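
The slide does not say how "eccentric" is computed. One common definition in the TDA literature, used here as an assumption, is the mean distance from a point to all other points: patients far from the bulk of the data get a high value. The sketch below uses random toy data in place of real gene expression profiles.

```python
import numpy as np


def eccentricity(points):
    """Mean distance from each row to every other row."""
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    return distances.sum(axis=1) / (len(points) - 1)


rng = np.random.default_rng(3)
# Rows stand in for patients, columns for gene expression values (toy data).
core = rng.normal(0, 1, size=(200, 5))   # patients near the center
far_patient = np.full((1, 5), 6.0)       # one patient far from the rest
data = np.vstack([core, far_patient])

ecc = eccentricity(data)
most_eccentric = int(np.argmax(ecc))
print(most_eccentric)
```

A value like `ecc` can serve both as a lens for building the network and as a coloring, so that eccentric and central patients end up in visibly different regions.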

Page 17: MLconf NYC Pek Lum

Landscape of cancer and path towards more precise treatments

The Cancer Genome Atlas

The Problem

Cancer is very heterogeneous. Data generated from TCGA is extremely large, generally not viewable all at once, and very hard for non-computational folks to access and analyze.

The Approach

View 12 cancers at once before drilling down to sub-populations

The Results

The ability to view all cancers at once allows very quick scans of the landscape of unique and common mutations. We also show that TP53 mutation is the common link between triple-negative breast cancer and ovarian cancer.

Page 18: MLconf NYC Pek Lum

12 cancer types from The Cancer Genome Atlas DNA exome sequencing: High Volume, High Complexity

Over 2400 tumors, 12 cancer types, over half a million unique variants analyzed simultaneously

Breast invasive carcinoma

Kidney renal clear cell carcinoma

Bladder urothelial carcinoma

Cervical squamous cell carcinoma and endocervical adenocarcinoma

Lung squamous cell carcinoma

Ovarian serous cystadenocarcinoma

Uterine corpus endometrioid carcinoma

Colon adenocarcinoma

Glioblastoma multiforme

Prostate adenocarcinoma

Rectum adenocarcinoma

Acute myeloid leukemia

Page 19: MLconf NYC Pek Lum

Landscape of p53 mutations across all 12 cancers

Network regions labeled by cancer type: Breast invasive carcinoma; Kidney renal clear cell carcinoma; Bladder urothelial carcinoma; Cervical squamous cell carcinoma and endocervical adenocarcinoma; Lung squamous cell carcinoma; Ovarian serous cystadenocarcinoma; Uterine corpus endometrioid carcinoma; Colon adenocarcinoma; Glioblastoma multiforme; Prostate adenocarcinoma; Rectum adenocarcinoma; Acute myeloid leukemia

red = enriched for p53 mutations; blue = not enriched
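
The slide does not state how enrichment is determined. A minimal sketch, assuming enrichment simply means that a node's fraction of mutated tumors exceeds the overall fraction, could look like this. The sample IDs, the node contents and the helper names are all hypothetical, and a real analysis might use a statistical test instead of a plain threshold.

```python
def node_mutation_fraction(nodes, mutated):
    """Fraction of each node's samples that carry the mutation."""
    return [len(members & mutated) / len(members) for members in nodes]


def color_for(fraction, background):
    """Red when the node exceeds the background rate, blue otherwise."""
    return "red" if fraction > background else "blue"


# Hypothetical network nodes (sets of tumor sample IDs; nodes may overlap).
nodes = [{"s1", "s2", "s3"}, {"s3", "s4"}, {"s5", "s6", "s7", "s8"}]
# Hypothetical set of samples with a p53 mutation.
mutated = {"s1", "s2", "s3", "s5"}

all_samples = set().union(*nodes)
background = len(mutated & all_samples) / len(all_samples)

fractions = node_mutation_fraction(nodes, mutated)
colors = [color_for(f, background) for f in fractions]
print(list(zip(fractions, colors)))
```

Because the coloring is computed per node, the same network can be recolored by any other mutation or measurement without rebuilding it.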

Page 20: MLconf NYC Pek Lum

GATA3 mutations quite unique to breast cancer

Page 21: MLconf NYC Pek Lum

Identifying commonalities between breast and ovarian cancers

Labeled groups: Triple negative breast cancers; Ovarian cancers; Breast cancers (network regions marked A, A, B)

Merging both DNA mutations + gene expression data together
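
The talk does not describe how the two data types were combined. One simple possibility, shown here purely as an assumption, is to align both tables on tumor ID, standardize the expression columns, and place the 0/1 mutation indicators next to them so that each tumor becomes a single feature vector. All sample IDs and values below are made up.

```python
import numpy as np

# Hypothetical expression table: rows are tumors, columns are genes.
samples_expr = ["t1", "t2", "t3", "t4"]
expression = np.array([[5.1, 0.2],
                       [4.8, 0.1],
                       [1.0, 3.9],
                       [1.2, 4.2]])

# Hypothetical mutation table (1 = mutated), stored in a different row order.
samples_mut = ["t3", "t1", "t4", "t2"]
mutations = np.array([[1, 0],
                      [0, 1],
                      [1, 0],
                      [0, 0]])

# Reorder the mutation rows to match the expression table.
row_of = {sample: i for i, sample in enumerate(samples_mut)}
aligned = mutations[[row_of[sample] for sample in samples_expr]]

# Put expression columns on a common scale, then join the two blocks.
z_scores = (expression - expression.mean(axis=0)) / expression.std(axis=0)
merged = np.hstack([z_scores, aligned.astype(float)])
print(merged.shape)
```

How the two blocks are weighted against each other changes which tumors look similar, so the scaling step is a modeling decision rather than a formality.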

Page 22: MLconf NYC Pek Lum

TP53 mutation in the tumors is the common theme between TNBC and ovarian cancers

Coloring: mutations in p53 (red = many; dark blue = none)

Labeled groups: Triple negative breast cancers; Ovarian cancers; Breast cancers

Page 23: MLconf NYC Pek Lum

Common target for TNBC and Ovarian Cancer?

Network colorings shown: FOXA1 gene levels; GO_pathway_positive regulation of potassium ion transport; PSAT1 gene levels; Mutations in TP53

Labeled groups: Triple negative breast cancer; Ovarian tumors

Page 24: MLconf NYC Pek Lum

Data has Shape, Shape has Meaning

Pek Lum, Ph.D. Chief Data Scientist

VP of Solutions