Machine Learning Applications to Translational
Research: Titbits from NCI Cancer Center Projects
Srisairam Achuthan, PhD
Center for Informatics, City of Hope
Oct 3rd, 2018
CI4CC Fall 2018 Symposium, New Orleans
Outline
• SPIRIT- Scientific Analytics
• Case Studies
• Stem Cell Therapy
• Epigenetics (DNA Methylation)
• Radiology and Pathology Image Analysis
• Summary
Acknowledgements
• Eugene and Ruth Roberts Summer Academy Interns
  • Michelle Tran (MD/PhD Candidate, Icahn School of Medicine, Mount Sinai)
  • Kelsang Donyo (Bachelor’s Degree, Statistics, Harvard University)
  • Arin Jayasekara (USC Neuroscience, Tulane University School of Medicine)
  • Eric Jiang (Bachelor’s Degree, Data Science, UC San Diego)
  • Joseph Wong (Bachelor’s Degree, Computer Science, UC Santa Barbara)
  • Sid Rumalla (Master’s Degree, Public Health, UT Austin)
• Beckman Research Institute, City of Hope
  • Prof. Karen Aboody, Dr. Lucy Ghoda, Prof. Michael Barish: Stem Cells Project
  • Prof. Rama Natarajan, Chuo Zhen (Nancy): Epigenetics Project
  • Prof. Joyce Niland
• Center for Informatics, City of Hope
  • Karthik Seetharamu, Allen Mao, Lawrence Love
  • Zahra (Nasim) Eftekhari, Lorenzo Rossi (Data Science)
• Dr. Ajay Shah, Executive Director, Bristol Myers Squibb
• Sorena Nadaf, Chief Informatics Officer, Center for Informatics, City of Hope
Motivation
• Significant effort is spent when scientific analytical methods
are applied to biomedical problems through independent, one-off
deployments of computational pipelines.
• SPIRIT-SA is a comprehensive scientific analytics platform within
SPIRIT [1] (Software Platform for Integrated Research
Information and Transformation). It also provides a rule-based
approach that simplifies the choice of data visualizations and
machine learning models.
[1] Achuthan S, Chang M, Shah A. SPIRIT-ML: A machine learning platform for deriving knowledge from
biomedical datasets. 11th International Conference, DILS. Los Angeles, CA, USA: Springer; 2015.
SPIRIT – Scientific Analytics (SA)
SPIRIT – SA Workflow
SPIRIT – SA Dataflow
Recommendation Rules
[2] Shneiderman, B. The eyes have it: A task by data type taxonomy for information visualizations. Proceedings
of IEEE Symposium on Visual Languages, Boulder, CO; 1996.
[3] Tableau : https://www.tableau.com/sites/default/files/media/which_chart_v6_final_0.pdf
[4] https://eazybi.com/blog/data_visualization_and_chart_types/
Visualization Rules Machine Learning Rules
[5] Scikit : http://scikit-learn.org/stable/tutorial/machine_learning_map/
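A rule-based recommendation of this kind can be pictured as a small chain of if/then checks on dataset properties. The sketch below is only an illustration loosely modeled on the public scikit-learn estimator map [5]; the `DatasetProfile` fields, the function name, and the returned labels are hypothetical and are not SPIRIT-SA's actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetProfile:
    """Hypothetical summary of a dataset; field names are illustrative only."""
    n_samples: int
    has_labels: bool
    target_is_categorical: Optional[bool] = None  # None when there is no target


def recommend_model_family(profile: DatasetProfile) -> str:
    """Return a coarse model-family suggestion from simple if/then rules."""
    if profile.n_samples < 50:
        return "collect more data"
    if not profile.has_labels:
        return "clustering"
    if profile.target_is_categorical:
        # Labeled categorical target: start with a linear classifier,
        # switching to a stochastic-gradient variant for very large data.
        return "linear SVC" if profile.n_samples < 100_000 else "SGD classifier"
    return "regression"


# A profile shaped like the mammographic mass example (961 labeled rows).
mammo = DatasetProfile(n_samples=961, has_labels=True, target_is_categorical=True)
print(recommend_model_family(mammo))  # prints "linear SVC"
```

Keeping the rules as plain conditionals makes each recommendation easy to trace back to the dataset property that triggered it.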
Example : Mammographic Mass Data
• Discrimination of benign and malignant mammographic masses
based on BI-RADS attributes and the patient’s age [6].
• Dataset description
  Number of observations: 961
  Number of features: 5
  Missing values: present
  Outcome (class): Severity
    Benign: 516 instances
    Malignant: 445 instances
[6] UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets/mammographic+mass
VisiRule for Distribution Visualization
[7] VisiRule : http://www.lpa.co.uk/vsr.htm
Data Visualization : Tableau
• Data Visualization based on distribution :
Distribution of Severity as a function of Age
Distribution of Features
VisiRule for Supervised Learning
SPIRIT- ML Implementation
• Predictive model building: 961 observations with 5 numeric features
• Data imputation was applied in this case
(missing values replaced with the median).
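Median imputation of a single numeric column can be sketched as follows. This is a minimal standard-library illustration, not the SPIRIT-ML implementation; the function name and the sample values are made up.

```python
from statistics import median
from typing import List, Optional


def impute_median(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in column]


# Made-up ages with two missing entries.
ages = [67, 43, None, 58, 28, None, 74]
print(impute_median(ages))  # prints [67, 43, 58, 58, 28, 58, 74]
```

The median is less sensitive to outliers than the mean, which is a common reason to prefer it for skewed clinical features.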
Missing Data Analysis
Default parameters
Attribute Selection and Normalization
• Predictive model building: 70% of the data used for training
Accuracy, Performance Measures and Feature Ranking based on training data
Top 3 features, obtained by consensus polling across the feature rankings:
BI-RADS Assessment, Age and Margin
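One way to read "consensus polling" is as a vote: every model contributes its own top-ranked features, and the features named most often win. The sketch below assumes that reading; the slides do not spell out the exact procedure, and the per-model rankings shown are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def consensus_top_features(rankings: Dict[str, List[str]], top_k: int = 3) -> List[str]:
    """Vote over each model's top_k features and return the top_k vote winners."""
    votes: Counter = Counter()
    for ranked in rankings.values():
        votes.update(ranked[:top_k])
    # Sort by vote count (descending), then by name for a deterministic tie-break.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:top_k]]


# Hypothetical per-model rankings, best feature first.
rankings = {
    "model_a": ["BI-RADS", "Age", "Margin", "Shape", "Density"],
    "model_b": ["BI-RADS", "Margin", "Shape", "Age", "Density"],
    "model_c": ["Age", "BI-RADS", "Margin", "Density", "Shape"],
}
print(consensus_top_features(rankings))  # prints ['BI-RADS', 'Margin', 'Age']
```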
Training Data Results
Test Data Results
• Predictive model building: 15% of the data used for testing
Accuracy, Performance Measures and Feature Ranking based on test data
• Case Study 1 : Stem Cells Project
In collaboration with Prof. Karen Aboody’s Lab @COH
Stem Cell Project
Determine all experimentally observed
factors that may be influencing the
neural stem cells’ coverage of tumor
sites.
[8] Metz MZ, Gutova M, Lacey SF, Abramyants Y, Vo T, Gilchrist M, Tirughana R, Ghoda LY, Barish ME, Brown
CE, Najbauer J, Potter PM, Portnow J, Synold TW, Aboody KS. Neural stem cell mediated delivery of
irinotecan-activating carboxylesterases to glioma: Implications for clinical use. Stem Cells Transl Med 2013
Dec;2(12):983-92.
Experimental Factors
• Results are based on training data, with % of tumor coverage as the
class variable, split into 2 classes (covered vs. not covered).
Summary of ML Results
• Using consensus analysis, we find that the factors that most
influence tumor coverage are:
Tumor_NumberCell
LogOne
LogTwo
Tumor_Age
Sex
Most Influential Factors
Test Data Results
Cluster Analysis
Threshold
Mice
• Case Study 2 : DNA Methylation Project
In collaboration with Prof. Rama Natarajan’s Lab @COH
Epigenetics
• Epigenetics is the study of heritable changes in gene
function during cell replication that do not involve any
change in the underlying DNA sequence [9].
• Epigenetic mechanisms such as DNA Methylation (DNAm) vary
at specific genomic locations in human diseases such as cancer
and diabetes.
• DNA methylation (DNAm) is known to change the DNA and
chromatin structure among cancer and diabetic patients relative
to healthy subjects.
[9] Armstrong, L. Epigenetics. Garland Science, New York (2014).
DNA Methylation
Adapted from Illumina data sheet
Applications of DNAm
• DNAm datasets generated on Infinium HumanMethylation 27K BeadChips
and obtained from GEO [10] are used to validate our approach.
• The dataset contains DNA methylation profiles across approximately
27,000 CpGs in smear cells from the uterine cervix (liquid-based
cytology samples) obtained from 48 women. All women tested positive
for the human papillomavirus (HPV+). Of the 48 samples, 24 were
cytologically normal, while the other 24 exhibited morphological
transformation (cervical intraepithelial neoplasia of grade 2 or higher, CIN2+).
[10] Teschendorff, A. E. and Widschwendter, M. Differential variability improves the identification of cancer risk
markers in DNA methylation studies profiling precursor cancer lesions. Bioinformatics, 28:11, 1487-94 (2012).
DNAm Results
• Bartlett’s test was applied to identify CpGs (features) whose
variance differed between control and case patients (GSE37020).
• This reduced the number of dimensions from 27578 to 6114 CpGs,
just over 22% of the originally measured features. The size of
the reduction depends on the variance across patient
samples [11].
DNA Analysis Step                      Initial # of CpGs   Final # of CpGs   Number of Groups
Variance Filtering (Bartlett’s test)   27578               6114              2
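The variance filter can be sketched per CpG as a two-group Bartlett test followed by a significance cutoff. The code below is a standard-library illustration under stated assumptions: the 0.05 cutoff, the function names, and the beta values are made up, and the slides do not say which threshold or software was used. In practice `scipy.stats.bartlett` provides the same test.

```python
import math
from statistics import variance
from typing import Dict, List, Sequence, Tuple


def bartlett_two_groups(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Bartlett's statistic and p-value for equal variances in two groups.

    Assumes both groups have non-zero sample variance.
    """
    groups = [a, b]
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    n_total = sum(sizes)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    stat = max(numerator / correction, 0.0)  # clamp tiny negative rounding errors
    # With two groups the statistic is chi-square with 1 degree of freedom,
    # whose upper tail probability is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def filter_by_variance(controls: Dict[str, List[float]],
                       cases: Dict[str, List[float]],
                       alpha: float = 0.05) -> List[str]:
    """Keep CpG ids whose variance differs between the groups at level alpha."""
    return [cpg for cpg in controls
            if bartlett_two_groups(controls[cpg], cases[cpg])[1] < alpha]


# Made-up methylation beta values for two hypothetical CpGs.
controls = {"cpg_a": [0.50, 0.51, 0.49, 0.50, 0.52, 0.48],
            "cpg_b": [0.30, 0.32, 0.29, 0.31, 0.30, 0.28]}
cases = {"cpg_a": [0.20, 0.80, 0.35, 0.65, 0.10, 0.90],
         "cpg_b": [0.31, 0.29, 0.30, 0.32, 0.28, 0.30]}
print(filter_by_variance(controls, cases))  # prints ['cpg_a']
```

Only `cpg_a` survives because its case values are far more dispersed than its control values, while `cpg_b` has the same spread in both groups.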
[11] Zhuang, J, Widschwendter, M. and Teschendorff, A. E. A comparison of feature selection and classification
methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics, 13:59, 1-14
(2012).
DNAm Results Contd.
• To refine the features we make use of the variable importance
option in H2O’s deep learning that ranks all the features used
to build the deep learning model.
• The variable importance method (Gedeon) considers the
weights connecting the input features to the first two hidden
layers.
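A simplified sketch of Gedeon-style variable importance is shown below. It assumes the commonly described form of the method: each absolute weight is expressed as a share of the total absolute weight entering its target unit, shares are multiplied along every input-to-hidden-to-hidden path, and the products are summed per input. It is not H2O's exact code, and the toy weights are invented.

```python
from typing import List

# weights[j][i] is the weight from unit i in one layer to unit j in the next.
Matrix = List[List[float]]


def gedeon_importance(w_in_h1: Matrix, w_h1_h2: Matrix) -> List[float]:
    """Score each input from the weights into the first two hidden layers.

    Assumes every hidden unit has at least one non-zero incoming weight.
    """
    n_inputs = len(w_in_h1[0])
    # Total absolute incoming weight of every unit in the two hidden layers.
    sum_h1 = [sum(abs(w) for w in row) for row in w_in_h1]
    sum_h2 = [sum(abs(w) for w in row) for row in w_h1_h2]
    importance = [0.0] * n_inputs
    for i in range(n_inputs):
        for k, row_k in enumerate(w_h1_h2):
            for j, w_jk in enumerate(row_k):
                share_ij = abs(w_in_h1[j][i]) / sum_h1[j]  # input i -> hidden-1 unit j
                share_jk = abs(w_jk) / sum_h2[k]           # hidden-1 unit j -> hidden-2 unit k
                importance[i] += share_ij * share_jk
    top = max(importance)
    return [v / top for v in importance]  # scale so the top feature scores 1.0


# Toy network: 3 inputs, 2 units in hidden layer 1, 1 unit in hidden layer 2.
w_in_h1 = [[2.0, 1.0, 0.0],
           [0.5, 0.5, 0.0]]
w_h1_h2 = [[1.0, -1.0]]
scores = gedeon_importance(w_in_h1, w_h1_h2)
print([round(s, 3) for s in scores])  # prints [1.0, 0.714, 0.0]
```

The third input has zero weight into every hidden unit, so it receives a score of zero and would be a candidate for removal in a refinement pass.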
DNA Analysis Step                                 Initial # of CpGs   Final # of CpGs   Number of Groups
Classification Analysis (multiple refinements)    6114                368               2
H2O-ML Results
Accuracy: Training & Cross Validation
Top Features after Refinement   CHR   Gene Name
cg00025138                      14    MAP3K9
cg00080012                      11    EED
cg00112517                      17    PPP1R1B
cg00212549                      3     SEMA5B
cg00234616                      2     TLX2
cg00480356                      15    HYPK
cg00489401                      5     FLT4
cg00501366                      17    ALOX12B
cg00509670                      14    PAX9
cg00579393                      2     REG1B
Genes & Chromosomes: Top 10 CpGs
• Case Study 3 : Image based detection use cases
Radiology Images
Pathology Images
In collaboration with Dr. Ammar Chaudhry & Dr. Raju Pillai @COH
Deep Learning Models
[12] Simonyan, K and Zisserman, A, “Very Deep Convolutional Networks for Large-Scale Image Recognition”,
arXiv: 1409.1556
[13] Kermany, D; Goldbaum, M, et al., “Identifying Medical Diagnoses and Treatable Diseases by Image-Based
Deep Learning”, Cell, 172 (5) P1122-1131 (2018).
Example of Deep Learning Model Architecture
with Convolutional Neural Network Layers
Example of Transfer Learning
Classification of Radiology Images
[14] Kermany, D; Zhang, K and Goldbaum, M, “Labeled Optical Coherence Tomography (OCT) and Chest X-
Ray Images for Classification”, Mendeley Data, v2 http://dx.doi.org/10.17632/rscbjbr9sj.2
• Chest X-ray images of pediatric patients (1-5 years old),
Guangzhou Women and Children’s Medical Center,
Guangzhou [14].
Normal, Viral Pneumonia, Bacterial Pneumonia
• Trained a deep neural network [VGG16 CNN (minus the top
layer) + Dense (256 neurons) + Dropout (0.5) + Dense (3
neurons)] with 3600 X-ray images and validated it with 300
images. A total of 623 images were held out as the testing dataset.
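The architecture listed above could be assembled in Keras roughly as follows. This is a sketch under assumptions the slides do not confirm: the 224x224 RGB input size, the `Flatten` layer between the convolutional base and the dense head, the frozen base, the ReLU and softmax activations, and the optimizer and loss are all illustrative choices, and `build_classifier` is a hypothetical name.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_classifier(input_shape=(224, 224, 3), n_classes=3, weights="imagenet"):
    """VGG16 convolutional base (no top) + Dense(256) + Dropout(0.5) + Dense(n_classes)."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # assumption: keep the pretrained base frozen

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = layers.Flatten()(x)  # assumption: the slide does not say how features are flattened
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    # Assumed training setup for one-hot encoded labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization, which is enough to inspect layer shapes.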
Model Accuracy of training and
validation datasets
Confusion Matrix and Metrics
of testing dataset
Radiology Images – Results
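For readers unfamiliar with how per-class metrics follow from a confusion matrix, a minimal standard-library sketch is given below. The labels and predictions are invented and do not reproduce the results on the slide.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_metrics(matrix: List[List[int]],
                      labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Precision and recall for each class, read off the confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        metrics[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return metrics


# Invented labels for the three chest X-ray classes.
labels = ["normal", "viral", "bacterial"]
y_true = ["normal", "normal", "viral", "viral", "bacterial", "bacterial"]
y_pred = ["normal", "viral", "viral", "viral", "bacterial", "viral"]
cm = confusion_matrix(y_true, y_pred, labels)
print(cm)  # prints [[1, 1, 0], [0, 2, 0], [0, 1, 1]]
print(per_class_metrics(cm, labels)["viral"])  # prints {'precision': 0.5, 'recall': 1.0}
```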
Classification of Pathology Images
[15] https://github.com/Narasimha1997/Blood-Cell-type-identification-using-CNN-classifier
• Trained a deep neural network [Customized CNN + Dense (128
neurons) + Dropout (0.5) + Dense (4 neurons)] with 4800
images and validated it with 960 images. A total of 400
images were held out as the testing dataset.
• Server with an NVIDIA Quadro P5000 GPU
Confusion Matrix and Metrics of testing dataset
Pathology Images – Results
Quantitative Image Analysis in
Digital Pathology
[16] Coudray, N; Ocampo, PS, et al., “Classification and Mutation Prediction from Non-Small Cell Lung
Cancer Histopathology Images using Deep Learning”, Nature Medicine (Sept. 2018),
https://doi.org/10.1038/s41591-018-0177-5.
Summary
• SPIRIT-SA is an end-to-end scientific analytics platform
developed to provide a seamless user experience for
analyzing biomedical datasets.
• Its key features are the rule-based approaches for selecting
visualization techniques and machine learning algorithms
appropriate for a given dataset.
• This guided approach to choosing visualization methods and
machine learning models makes prediction and trend analysis
as seamless as possible for users in translational research
projects.