

Machine Learning Applications to Translational Research: Titbits from NCI Cancer Center Projects

Srisairam Achuthan, PhD

Center for Informatics, City of Hope

Oct 3rd, 2018

CI4CC Fall 2018 Symposium, New Orleans

Outline

• SPIRIT – Scientific Analytics

• Case Studies

• Stem Cell Therapy

• Epigenetics (DNA Methylation)

• Radiology and Pathology Image Analysis

• Summary

Acknowledgements

• Eugene and Ruth Roberts Summer Academy Interns
  • Michelle Tran (MD/PhD Candidate, Icahn School of Medicine, Mount Sinai)
  • Kelsang Donyo (Bachelor’s Degree, Statistics, Harvard University)
  • Arin Jayasekara (USC Neuroscience, Tulane University School of Medicine)
  • Eric Jiang (Bachelor’s Degree, Data Science, UC San Diego)
  • Joseph Wong (Bachelor’s Degree, Computer Science, UC Santa Barbara)
  • Sid Rumalla (Master’s Degree, Public Health, UT Austin)

• Beckman Research Institute, City of Hope
  • Prof. Karen Aboody, Dr. Lucy Ghoda, Prof. Michael Barish: Stem Cells Project
  • Prof. Rama Natarajan, Chuo Zhen (Nancy): Epigenetics Project
  • Prof. Joyce Niland

• Center for Informatics, City of Hope
  • Karthik Seetharamu, Allen Mao, Lawrence Love
  • Zahra (Nasim) Eftekhari, Lorenzo Rossi (Data Science)

• Dr. Ajay Shah, Executive Director, Bristol Myers Squibb

• Sorena Nadaf, Chief Informatics Officer, Center for Informatics, City of Hope

Motivation

• Significant effort is spent when scientific analytical methods are applied to biomedical problems through independent, one-off deployments of computational pipelines.

• SPIRIT-SA is a comprehensive scientific analytics platform within SPIRIT [1] (Software Platform for Integrated Research Information and Transformation). It also provides a rule-based approach that simplifies data visualization and machine learning model selection.

[1] Achuthan S, Chang M, Shah A. SPIRIT-ML: A Machine learning platform for deriving knowledge from biomedical datasets. 11th International Conference, DILS. Los Angeles, CA, USA: Springer; 2015.

SPIRIT – Scientific Analytics (SA)

SPIRIT – SA Workflow

SPIRIT – SA Dataflow

Recommendation Rules

[2] Shneiderman, B. The eyes have it: A task by data type taxonomy for information visualizations. Proceedings of IEEE Symposium on Visual Languages, Boulder, CO; 1996.

[3] Tableau : https://www.tableau.com/sites/default/files/media/which_chart_v6_final_0.pdf

[4] https://eazybi.com/blog/data_visualization_and_chart_types/

Visualization Rules Machine Learning Rules

[5] Scikit : http://scikit-learn.org/stable/tutorial/machine_learning_map/
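As a rough illustration of how rule-based model recommendation can work, the sketch below encodes a handful of if/then rules in the spirit of the scikit-learn estimator map [5]. The `DatasetProfile` structure, the function name, the thresholds and the rule set are all assumptions made for illustration; they are not the rules implemented in SPIRIT-SA.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Minimal description of a dataset (hypothetical structure)."""
    n_samples: int
    has_labels: bool           # is an outcome column available?
    outcome_is_category: bool  # categorical outcome vs. numeric quantity


def recommend_model_family(profile: DatasetProfile) -> str:
    """Return a coarse model recommendation from simple if/then rules.

    Thresholds loosely follow the scikit-learn estimator map and are
    illustrative only; they are not SPIRIT-SA's rule base.
    """
    if profile.n_samples < 50:
        return "collect more data"
    if not profile.has_labels:
        return "clustering (e.g. KMeans)"
    if profile.outcome_is_category:
        if profile.n_samples < 100_000:
            return "classification (e.g. linear SVC)"
        return "classification (e.g. SGD classifier)"
    if profile.n_samples < 100_000:
        return "regression (e.g. ridge or lasso)"
    return "regression (e.g. SGD regressor)"


# A labelled dataset of about 1,000 rows with a categorical outcome.
example = DatasetProfile(n_samples=1000, has_labels=True, outcome_is_category=True)
print(recommend_model_family(example))
```

Keeping the rules in one function makes them easy to review with domain experts; a production rule engine would more likely store them as data.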

Example : Mammographic Mass Data

• Discrimination of benign and malignant mammographic masses based on BI-RADS attributes and the patient’s age [6].

• Dataset description
  Number of observations: 961
  Number of features: 5
  Missing values: Present
  Outcome (= Class): Severity
    Benign: 516 instances
    Malignant: 445 instances

[6] UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets/mammographic+mass

VisiRule for Distribution Visualization

[7] VisiRule : http://www.lpa.co.uk/vsr.htm

Data Visualization : Tableau

• Data Visualization based on distribution :


Distribution of Severity as a function of Age

Distribution of Features

VisiRule for Supervised Learning

SPIRIT-ML Implementation

• Predictive Model Building: 961 observations with 5 numeric features

• Data imputation was applied in this case (missing values replaced with the median).
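Median imputation can be sketched in a few lines of NumPy. The toy matrix below uses made-up values rather than the mammographic data, and `impute_median` is a hypothetical helper, not part of SPIRIT-ML.

```python
import numpy as np


def impute_median(X: np.ndarray) -> np.ndarray:
    """Replace NaNs in each column with that column's median."""
    X = np.asarray(X, dtype=float)
    medians = np.nanmedian(X, axis=0)          # per-feature median, ignoring NaNs
    return np.where(np.isnan(X), medians, X)   # broadcast medians into the gaps


# Toy matrix with made-up values: rows are observations, columns are features.
X = np.array([[67.0, 3.0, np.nan],
              [43.0, np.nan, 1.0],
              [58.0, 4.0, 5.0],
              [np.nan, 1.0, 3.0]])
X_filled = impute_median(X)
print(X_filled)
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for skewed clinical variables.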

Missing Data Analysis

Default parameters

Attribute Selection and Normalization

• Predictive Model Building: 70% of the data used for training

Accuracy, Performance Measures and Feature Ranking based on training data

Most important features (top 3), obtained by consensus polling of features: BI-RADS Assessment, Age and Margin
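One simple way to read "consensus polling" is as a vote: each model contributes its top-k features, and the features with the most votes win. The sketch below assumes that interpretation. The per-model rankings are hypothetical and are not the output of the study.

```python
from collections import Counter


def consensus_top_features(rankings, top_k=3):
    """Count how often each feature appears in the top_k of each ranking
    and return the top_k features with the most votes."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])
    # Sort by vote count (descending), then by name so ties are deterministic.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:top_k]]


# Hypothetical per-model rankings (most important first).
rankings = [
    ["BI-RADS", "Age", "Margin", "Shape", "Density"],
    ["Age", "BI-RADS", "Shape", "Margin", "Density"],
    ["BI-RADS", "Margin", "Age", "Density", "Shape"],
]
print(consensus_top_features(rankings))
```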

Training Data Results

Test Data Results

• Predictive Model Building: 15% of the data used for testing

Accuracy, Performance Measures and Feature Ranking based on test data
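The 70% training and 15% testing fractions can be reproduced with a shuffled index split. The sketch below assumes the remaining rows form a validation set, which the slides do not state explicitly; `split_indices` and the seed are hypothetical.

```python
import numpy as np


def split_indices(n, train_frac=0.70, test_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    n_test = int(test_frac * n)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    valid = idx[n_train + n_test:]   # whatever remains (assumed validation set)
    return train, valid, test


train, valid, test = split_indices(961)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the split reproducible, so accuracy figures on the training and test partitions can be compared across runs.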

• Case Study 1 : Stem Cells Project

In collaboration with Prof. Karen Aboody’s Lab @COH

Stem Cell Project

Determine all experimentally observed factors that may be influencing the neural stem cells’ coverage of tumor sites.

[8] Metz MZ, Gutova M, Lacey SF, Abramyants Y, Vo T, Gilchrist M, Tirughana R, Ghoda LY, Barish ME, Brown CE, Najbauer J, Potter PM, Portnow J, Synold TW, and Aboody KS: Neural stem cell mediated delivery of irinotecan-activating carboxylesterases to glioma: Implications for clinical use. Stem Cells Transl Med 2013 Dec;2(12):983-92.

Experimental Factors


• Results are based on training data, with % of tumor coverage as the class variable, divided into 2 classes (covered vs. not covered).

Summary of ML Results

• Using consensus analysis, we find that the factors influencing tumor coverage the most are:

Tumor_NumberCell

LogOne

LogTwo

Tumor_Age

Sex

Most Influential Factors

Test Data Results

Cluster Analysis (figure; labels: Threshold, Mice)

• Case Study 2 : DNA Methylation Project

In collaboration with Prof. Rama Natarajan’s Lab @COH

Epigenetics

• Epigenetics is the study of heritable changes in gene function during cell replication that do not involve any change in the underlying DNA sequence [9].

• Epigenetic mechanisms such as DNA methylation (DNAm) vary at specific genomic locations in human diseases such as cancer and diabetes.

• DNAm is known to change the DNA and chromatin structure in cancer and diabetic patients relative to healthy subjects.

[9] Armstrong, L. Epigenetics. Garland Science, New York (2014).

DNA Methylation

Adapted from Illumina data sheet


Applications of DNAm

• DNAm datasets generated on Infinium HumanMethylation 27K BeadChips and obtained from GEO [10] are used to validate our approach.

• The dataset contains DNA methylation profiles across approximately 27,000 CpGs in smear cells from the uterine cervix (liquid-based cytology samples), obtained from 48 women. All women tested positive for the human papilloma virus (HPV+). Of the 48 samples, 24 were cytologically normal, while the other 24 exhibited morphological transformation (cervical intraepithelial neoplasia of grade 2 or higher, CIN2+).

[10] Teschendorff, A. E. and Widschwendter, M. Differential variability improves the identification of cancer risk markers in DNA methylation studies profiling precursor cancer lesions. Bioinformatics, 28:11, 1487-94 (2012).

DNAm Results

• Bartlett’s test was applied to identify CpGs (features) whose variance differed between control and case patients (GSE37020).

• This reduced the number of dimensions to just over 22% of the original number of measured features (6114 of 27578 CpGs). The size of the reduction depends on the patient sample variance [11].

DNA Analysis Step | Initial # of CpGs | Final # of CpGs | Number of Groups
Variance Filtering (Bartlett’s test) | 27578 | 6114 | 2
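The variance-filtering step can be sketched as a per-feature Bartlett test between the two groups. The version below is a minimal NumPy implementation for exactly two groups, run on synthetic data rather than GSE37020. The function name, the synthetic spreads and the p-value threshold are assumptions, and the slides do not state which cutoff was used.

```python
import math
import numpy as np


def bartlett_two_groups(control, case):
    """Per-feature Bartlett test for equal variance between two groups.

    control, case: arrays of shape (n_samples, n_features).
    Returns (statistic, p_value), each of shape (n_features,).
    """
    n1, n2 = control.shape[0], case.shape[0]
    v1 = control.var(axis=0, ddof=1)
    v2 = case.var(axis=0, ddof=1)
    dof = n1 + n2 - 2
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
    numer = dof * np.log(pooled) - (n1 - 1) * np.log(v1) - (n2 - 1) * np.log(v2)
    denom = 1.0 + (1.0 / (n1 - 1) + 1.0 / (n2 - 1) - 1.0 / dof) / 3.0
    stat = np.maximum(numer / denom, 0.0)   # guard against tiny negative round-off
    # With two groups the statistic is chi-square with 1 degree of freedom,
    # whose survival function is erfc(sqrt(x / 2)).
    p = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
    return stat, p


# Synthetic data: 24 controls, 24 cases, 50 features; the first 10 features
# are given a 10x larger spread in the cases.
rng = np.random.default_rng(0)
control = rng.normal(0.5, 0.02, size=(24, 50))
case = rng.normal(0.5, 0.02, size=(24, 50))
case[:, :10] = rng.normal(0.5, 0.2, size=(24, 10))

stat, p = bartlett_two_groups(control, case)
selected = np.flatnonzero(p < 0.001)   # threshold is an arbitrary choice here
print(len(selected), "features retained")
```

For more than two groups, or to avoid hand-rolling the statistic, `scipy.stats.bartlett` computes the same test one feature at a time.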

[11] Zhuang, J., Widschwendter, M. and Teschendorff, A. E. A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics, 13:59, 1-14 (2012).

DNAm Results Contd.

• To refine the features, we use the variable importance option in H2O’s deep learning, which ranks all the features used to build the deep learning model.

• The variable importance method (Gedeon) considers the weights connecting the input features to the first two hidden layers.

DNA Analysis Step | Initial # of CpGs | Final # of CpGs | Number of Groups
Classification Analysis (multiple refinements) | 6114 | 368 | 2
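The idea of ranking inputs from the weights of the first two hidden layers can be sketched as follows. This is one simplified reading of a Gedeon-style importance measure: absolute weights are normalised per receiving unit and then chained across the two layers. It is not H2O’s implementation, and the random weights merely stand in for a trained network.

```python
import numpy as np


def gedeon_importance(w1, w2):
    """Gedeon-style input importance from the first two weight matrices.

    w1: (n_inputs, n_hidden1) weights, input layer -> first hidden layer
    w2: (n_hidden1, n_hidden2) weights, first -> second hidden layer
    """
    a1, a2 = np.abs(w1), np.abs(w2)
    p1 = a1 / a1.sum(axis=0, keepdims=True)   # share of each input in each hidden-1 unit
    p2 = a2 / a2.sum(axis=0, keepdims=True)   # share of each hidden-1 unit in each hidden-2 unit
    q = p1 @ p2                               # input -> hidden-2 contributions
    scores = q.sum(axis=1)
    return scores / scores.sum()              # normalise so the scores sum to 1


# Random weights stand in for a trained network (illustration only).
rng = np.random.default_rng(1)
w1 = rng.normal(size=(6, 4))
w1[5, :] = 0.0                 # an input with no outgoing weight
w2 = rng.normal(size=(4, 3))
importance = gedeon_importance(w1, w2)
ranking = np.argsort(importance)[::-1]   # most important input first
print(ranking)
```

Repeating "train, rank, keep the top features" several times is one way to picture the multiple refinements that took the feature set from 6114 to 368 CpGs.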

H2O-ML Results

Accuracy: Training & Cross Validation

Genes & Chromosomes: Top 10 CpGs

Top Features after Refinement | CHR | Gene Name
cg00025138 | 14 | MAP3K9
cg00080012 | 11 | EED
cg00112517 | 17 | PPP1R1B
cg00212549 | 3 | SEMA5B
cg00234616 | 2 | TLX2
cg00480356 | 15 | HYPK
cg00489401 | 5 | FLT4
cg00501366 | 17 | ALOX12B
cg00509670 | 14 | PAX9
cg00579393 | 2 | REG1B

• Case Study 3 : Image based detection use cases

Radiology Images

Pathology Images

In collaboration with Dr. Ammar Chaudhry & Dr. Raju Pillai @COH

Deep Learning Models

[12] Simonyan, K. and Zisserman, A. “Very Deep Convolutional Networks for Large-Scale Image Recognition”, arXiv:1409.1556.

[13] Kermany, D., Goldbaum, M., et al. “Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning”, Cell, 172(5), 1122-1131 (2018).

Example of Deep Learning Model Architecture with Convolutional Neural Network Layers

Example of Transfer Learning

Classification of Radiology Images

[14] Kermany, D., Zhang, K. and Goldbaum, M. “Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification”, Mendeley Data, v2, http://dx.doi.org/10.17632/rscbjbr9sj.2

• Chest X-ray images of pediatric patients (1-5 years old) from Guangzhou Women and Children’s Medical Center, Guangzhou [14].

Normal | Viral Pneumonia | Bacterial Pneumonia

• Trained a deep neural network [VGG16 CNN (minus the top layer) + Dense (256 neurons) + Dropout (0.5) + Dense (3 neurons)] with 3600 X-ray images and validated it with 300 images. A total of 623 images were held out as the testing dataset.
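The architecture described above maps naturally onto Keras. The sketch below is one plausible rendering, not the code used for the reported results. The input size, the frozen convolutional base, the ReLU and softmax activations, the optimizer and the function name `build_xray_classifier` are all assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_xray_classifier(input_shape=(224, 224, 3), n_classes=3, weights="imagenet"):
    """VGG16 base (no top) + Dense(256) + Dropout(0.5) + Dense(n_classes)."""
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False                       # assumption: keep pretrained filters frozen

    inputs = layers.Input(shape=input_shape)
    x = base(inputs, training=False)             # reuse VGG16 as a feature extractor
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_xray_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same layers with random initialisation, which is handy for checking shapes offline.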

Model Accuracy of training and validation datasets

Confusion Matrix and Metrics of testing dataset

Radiology Images – Results
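Per-class metrics follow directly from a confusion matrix. The sketch below uses a made-up 3-class matrix (Normal, Viral Pneumonia, Bacterial Pneumonia) purely to show the arithmetic; the numbers are not the results reported in the talk, and `per_class_metrics` is a hypothetical helper.

```python
import numpy as np


def per_class_metrics(cm):
    """Precision, recall and F1 per class, plus overall accuracy, from a
    confusion matrix whose rows are true classes and columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # column sums = predicted counts per class
    recall = tp / cm.sum(axis=1)      # row sums = true counts per class
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy


# Made-up matrix for illustration only.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 3, 7]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print(precision, recall, f1, accuracy)
```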

Classification of Pathology Images

[15] https://github.com/Narasimha1997/Blood-Cell-type-identification-using-CNN-classifier


• Trained a deep neural network [Customized CNN + Dense (128 neurons) + Dropout (0.5) + Dense (4 neurons)] with 4800 blood cell images and validated it with 960 images. A total of 400 images were held out as the testing dataset.

• Server with an NVIDIA Quadro P5000 GPU

Confusion Matrix and Metrics of testing dataset

Pathology Images – Results

Quantitative Image Analysis in Digital Pathology

[16] Coudray, N., Ocampo, P.S., et al. “Classification and Mutation Prediction from Non-Small Cell Lung Cancer Histopathology Images using Deep Learning”, Nature Medicine (Sept. 2018), https://doi.org/10.1038/s41591-018-0177-5

Summary

• SPIRIT-SA is an end-to-end scientific analytics platform developed to provide a seamless user experience for analyzing biomedical datasets.

• A key feature that highlights the utility of this platform is the rule-based approach for selecting visualization techniques and machine learning algorithms appropriate for a given dataset.

• A guided approach to choosing visualization methods and machine learning models makes prediction and trend analysis as seamless as possible for users in translational research projects.

Questions and Answers

• Thank You!

• Questions?

• Contact Information : [email protected]
