Identifying Unproven Cancer Treatments on the Health Web: Addressing Accuracy, Generalizability, and Scalability

Yin Aphinyanaphongs MD, PhD; Lawrence Fu PhD; Constantin Aliferis MD, PhD
Center for Health Informatics and Bioinformatics, NYU Langone Medical Center
Medinfo, Copenhagen, Denmark, 08/19/2013

Page 2

Our Vision: A Safer Health Web

• We envision a health web that provides safe, reliable, validated medical information to health consumers.
• We focus on the lowest-quality websites and on building classifiers to identify them.
• Our initial pilot targeted unproven treatments in cancer.

Image From: http://www.bbc.co.uk/cbeebies/grownups/article/internet-use-and-safety/

Page 3

Cancer, Health Consumers, and the Health Web

• In the US, 65% of cancer patients searched for unproven treatments, and 12% purchased at least one unproven treatment.
• 83% of cancer patients used at least one unproven treatment.
• Patients with a dire prognosis are more susceptible to using these unproven treatments.
• Routine cancer search topics return sites with non-conventional treatments.
• These sites were of variable quality: 23% discouraged the use of conventional medicine, 15% discouraged adherence to physician advice, and 26% provided anecdotal experiences.
• Schmidt and Ernst found that nearly 10% of their sample had potentially harmful or definitely harmful material.
• Impact of bad advice:
  • Death
  • Financial costs with no benefit
  • Reduction in efficacy of known treatments
  • Delayed access to proven therapies

Page 4

How is the Health Web policed now?

• Manual efforts
• Rating agencies (HONCode)
• Self ratings
• Operation Cure.all
  • Effort led by the Federal Trade Commission in 1997 and 1999 in collaboration with 80 agencies and organizations from 25 countries.
  • Identified approximately 800 to 1200 web sites purporting to cure a variety of diseases.

Page 5

How Do We Get to a Safer Health Web?

• Automation
• In our prior work, we built models that successfully identified known unproven cancer treatments on the Internet (Area Under the Curve = 0.93). In that setting, the unproven treatment is known in advance and the goal is to identify new websites that may market it.
• Some questions left to consider:
  • What happens with unknown unproven treatments (i.e., treatments that are not known a priori)? Is it possible to build models that will identify these unknown unproven treatments?
  • What methods may scale a high-performing machine learning model to billions of web pages?
• This paper addresses the limitations of this prior work.

Page 6

Where Machine Learning May Generalize and Scale?

     w1   w2   w3
d1   1    0    1
d2   1    0    0

1. Preprocessing
2. Generalizability
3. Model Building
4. Model Application
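The small matrix on this slide appears to be a binary document-term matrix: each row is a document (d1, d2), each column a word (w1, w2, w3), and a 1 marks that the word occurs in the document. A minimal Python sketch of that representation follows, assuming whitespace tokenization. The vocabulary and document strings are hypothetical and chosen only to reproduce the slide's pattern of ones and zeros.

```python
def binary_document_term_matrix(documents, vocabulary):
    """Return one 0/1 row per document: 1 if the vocabulary word occurs in it."""
    matrix = []
    for text in documents:
        # Lowercase and split on whitespace (an assumption, not the authors' tokenizer).
        words = set(text.lower().split())
        matrix.append([1 if term in words else 0 for term in vocabulary])
    return matrix


# Hypothetical vocabulary (w1, w2, w3) and documents (d1, d2).
VOCABULARY = ["miracle", "chemotherapy", "cure"]
DOCUMENTS = ["Miracle cure", "miracle"]
MATRIX = binary_document_term_matrix(DOCUMENTS, VOCABULARY)
```

Each of the four numbered stages operates on or produces a matrix of this shape, which is why its width (the number of features) matters for the scalability results later in the deck.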

Page 7

Solution 1: Map Reduce for Pre-processing

Image from: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/
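The slide does not say what the pre-processing job computed, so the following is only an illustration of the map, shuffle, and reduce pattern on a single machine, not the authors' Hadoop job. It assumes the job counts how many documents contain each term; the function names and the sample corpus are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, 1) once per distinct term in one document."""
    return [(term, 1) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key (term)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(term, values):
    """Reduce step: sum the values emitted for one term."""
    return term, sum(values)


def document_frequencies(corpus):
    """Run map, shuffle, and reduce in sequence over a {doc_id: text} corpus."""
    emitted = []
    for doc_id, text in corpus.items():
        emitted.extend(map_document(doc_id, text))
    return dict(reduce_counts(term, values) for term, values in shuffle(emitted).items())


# Hypothetical two-document corpus.
CORPUS = {"d1": "miracle cure cure", "d2": "miracle diet"}
FREQUENCIES = document_frequencies(CORPUS)
```

Because each call to `map_document` depends on one document only, a framework such as Hadoop can run the map step on many machines at once, which is the property the slide relies on.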

Page 8

Solution 2: Feature Selection

Image from: Causal Graph Based analysis of genome wide association data in rheumatoid arthritis.

• Feature selection identifies a subset of relevant features for model construction that ideally maintains or improves predictive performance relative to a dataset that includes all features.
• Generalized Local Learning – Parents Children is one such algorithm that learns a subset of features for use in model construction.
  • Under several assumptions, it will identify the smallest subset of features that gives optimal classification performance.
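To make the idea of selecting a feature subset concrete, here is a minimal sketch of a much simpler technique: a univariate filter that ranks features by mutual information with the label and keeps the top k. This is not Generalized Local Learning – Parents Children, which is a causal-discovery-based method; the filter is swapped in only because it fits in a few lines. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in nats) between one discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    total = 0.0
    for (x, y), count in joint.items():
        total += (count / n) * math.log((count * n) / (feature_counts[x] * label_counts[y]))
    return total


def select_top_k(matrix, labels, k):
    """Return the column indices of the k features with the highest mutual information."""
    n_features = len(matrix[0])
    scores = [
        mutual_information([row[j] for row in matrix], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 matches the label, feature 1 is unrelated, feature 2 is constant.
TOY_MATRIX = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1]]
TOY_LABELS = [1, 1, 0, 0]
SELECTED = select_top_k(TOY_MATRIX, TOY_LABELS, 1)
```

A univariate filter scores each feature in isolation, so unlike the algorithm on the slide it makes no claim about finding the smallest subset with optimal performance.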

Page 9

Area Under the Curve

• Receiver Operating Characteristic (ROC) Curves
• Area Under the Curve

Centor RM. The Use of ROC Curves and Their Analyses. Med Decis Making 1991;11(2):102-106.

Page 10

Area Under the Curve as a Measure of Performance

• Perfect Separation: Area Under the Curve ~ 1.0
• Okay Separation: Area Under the Curve ~ 0.8
• Random Separation: Area Under the Curve ~ 0.5
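The three cases above can be checked numerically. The sketch below computes the Area Under the Curve through its rank interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name and the sample scores are hypothetical; the quadratic pair loop is chosen for clarity, not speed.

```python
def area_under_curve(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for four pages (1 = unproven claim, 0 = not).
PERFECT = area_under_curve([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
UNINFORMATIVE = area_under_curve([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

A classifier that scores every positive above every negative reaches 1.0, while one that assigns the same score to all pages sits at 0.5, matching the perfect and random cases on the slide.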

Page 11

Gold Standard

• 8 quack treatments identified by quackwatch.org.
• Each treatment name was submitted to Google with “cancer” and “treatment” appended.
• Top 30 results for each treatment labeled by the authors.
• Two authors labeled 191 out of 240 web pages as making unproven claims or not (inter-rater reliability: Kappa 0.76).
• Excluded:
  • not found (404 response code) error pages
  • no content pages
  • non-English pages
  • password-protected pages
  • pdf pages
  • redirect pages
  • pages where the actual treatment text does not appear in the document
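The slide reports inter-rater reliability as a kappa of 0.76 without naming the variant. Assuming it is Cohen's kappa for two raters, the sketch below shows how that statistic corrects observed agreement for the agreement expected by chance. The function name and the toy label lists are hypothetical and are not the authors' data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1.0 - expected)


# Toy labels for four pages (1 = makes unproven claims, 0 = does not).
KAPPA = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy example the raters agree on three of four pages, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.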

Page 12

Build 8 Independent Training/Testing Sets

Category                       Train Size   Test Size
Insulin Potentiation Therapy   146          18+/7-
Cure for All Cancers           147          16+/8-
Mistletoe                      145          8+/18-
Metabolic Therapy              153          11+/7-
Macrobiotic Diet               148          4+/19-
ICTH                           162          5+/4-
Krebiozen                      151          10+/10-
Cellular Health                154          9+/8-
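Each row of the table corresponds to holding out every page for one treatment and training on the pages for the remaining treatments. A minimal sketch of that split follows, assuming each labeled page is a record carrying its treatment name. The record layout, field names, and sample pages are hypothetical.

```python
def leave_one_treatment_out(pages):
    """Yield (held_out_treatment, train_pages, test_pages) once per treatment."""
    treatments = sorted({page["treatment"] for page in pages})
    for held_out in treatments:
        train = [page for page in pages if page["treatment"] != held_out]
        test = [page for page in pages if page["treatment"] == held_out]
        yield held_out, train, test


# Hypothetical labeled pages; "unproven" is the label assigned by the raters.
PAGES = [
    {"treatment": "Mistletoe", "text": "mistletoe extract cures tumors", "unproven": True},
    {"treatment": "Mistletoe", "text": "review of mistletoe trials", "unproven": False},
    {"treatment": "Krebiozen", "text": "krebiozen miracle serum", "unproven": True},
    {"treatment": "ICTH", "text": "icth clinic testimonials", "unproven": True},
]
SPLITS = list(leave_one_treatment_out(PAGES))
```

Because no page about the held-out treatment is seen during training, performance on the test set estimates how well a model handles a treatment it has never encountered.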

Page 13

Experimental Design

• 8-fold Leave One Treatment Out Cross Validation
• Classifiers
  • Linear Support Vector Machines
  • L1 Regularized Logistic Regression
  • L2 Regularized Logistic Regression
  • Classifiers were optimized over cost (support vector machines) and regularization coefficient (logistic regression), respectively.
• Text was pre-processed with a term frequency – inverse document frequency weighting scheme.
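Term frequency – inverse document frequency has several common variants, and the slide does not say which one was used. The sketch below assumes the plainest form: raw term count multiplied by the natural log of (number of documents / number of documents containing the term), with whitespace tokenization. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}
        )
    return weighted


# Hypothetical two-document corpus.
WEIGHTS = tf_idf(["miracle cure cure", "miracle diet"])
```

Under this variant a term that appears in every document receives weight zero, so words shared by all pages contribute nothing to the classifier's input.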

Page 14

Results – Generalizability to unknown unproven treatments

Category                       Train Set   Test Set   Linear SVM   L1 Regularized Logistic Regression   L2 Regularized Logistic Regression
Insulin Potentiation Therapy   146         18+/7-     0.96         0.97                                 0.97
Cure for All Cancers           147         16+/8-     0.89         0.85                                 0.85
Mistletoe                      145         8+/18-     0.90         0.88                                 0.90
Metabolic Therapy              153         11+/7-     0.97         0.99                                 0.96
Macrobiotic Diet               148         4+/19-     0.97         0.98                                 0.97
ICTH                           162         5+/4-      1.0          0.94                                 1.0
Krebiozen                      151         10+/10-    0.86         0.75                                 0.90
Cellular Health                154         9+/8-      0.89         0.91                                 0.93
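One way to compare the three classifiers across the eight held-out treatments is to average each column. The sketch below does that with an unweighted mean, assuming the table entries are Area Under the Curve values as the earlier slides suggest. The unweighted mean ignores the different test set sizes and is an illustration, not a summary the authors report; the variable and function names are hypothetical.

```python
from statistics import mean

CLASSIFIERS = (
    "Linear SVM",
    "L1 Regularized Logistic Regression",
    "L2 Regularized Logistic Regression",
)

# Values copied from the results table, one tuple per held-out treatment,
# in the same column order as CLASSIFIERS.
RESULTS = {
    "Insulin Potentiation Therapy": (0.96, 0.97, 0.97),
    "Cure for All Cancers": (0.89, 0.85, 0.85),
    "Mistletoe": (0.90, 0.88, 0.90),
    "Metabolic Therapy": (0.97, 0.99, 0.96),
    "Macrobiotic Diet": (0.97, 0.98, 0.97),
    "ICTH": (1.0, 0.94, 1.0),
    "Krebiozen": (0.86, 0.75, 0.90),
    "Cellular Health": (0.89, 0.91, 0.93),
}


def mean_by_classifier(results):
    """Average each classifier's column over all held-out treatments."""
    columns = zip(*results.values())
    return {name: mean(column) for name, column in zip(CLASSIFIERS, columns)}


MEANS = mean_by_classifier(RESULTS)
```

With only 9 to 26 test pages per treatment, individual entries are noisy, so small differences between the column means should be read with caution.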

Page 15

Results – Feature Selection

Feature/Classifier Combination                   Number of Features   Area Under the Curve Performance
All Features/Linear SVM                          9,187                0.946*
Generalized Local Learning – Parents Children    96                   0.974*

* These performances are calculated using 8-fold nested cross validation. Example documents are held out randomly without regard to the selected treatment.

Page 16

Computational Performance Point Estimates

Corpus Preparation   Number of Documents   Speed (Point Estimate)
Single Threaded      200,000               1,300 seconds
Hadoop/MapReduce     200,000               450 seconds

Model Building Performance                       Number of Features   Speed (Point Estimate)
All Features                                     9,187                0.20 seconds
Generalized Local Learning – Parents Children    96                   0.0069 seconds

Model Application Performance                    Number of Documents   Number of Features   Speed (Point Estimate)
All Features                                     145,000               9,187                98.6 seconds
Generalized Local Learning – Parents Children    145,000               96                   6 seconds
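The timings above translate into speedup ratios by dividing the baseline time by the improved time. The sketch below does that arithmetic for the three stages using the point estimates from the tables; the function and variable names are hypothetical, and since each figure is a single measurement the ratios are rough indications only.

```python
def speedup(baseline_seconds, improved_seconds):
    """Ratio of baseline time to improved time; values above 1 mean faster."""
    if improved_seconds <= 0:
        raise ValueError("improved time must be positive")
    return baseline_seconds / improved_seconds


# (baseline, improved) point estimates in seconds, copied from the tables.
TIMINGS = {
    "corpus preparation": (1300.0, 450.0),   # single threaded vs Hadoop/MapReduce
    "model building": (0.20, 0.0069),        # all features vs selected features
    "model application": (98.6, 6.0),        # all features vs selected features
}
SPEEDUPS = {stage: speedup(*pair) for stage, pair in TIMINGS.items()}
```

The model-building gain is large in relative terms but both absolute times are well under a second at this sample size, whereas the corpus preparation and model application gains are measured in minutes.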

Page 17

Future Work

• Use a larger sample size.
• Address building classification models in other languages.
• Improve the computational performance estimates (beyond point estimates).
• Address the adversarial nature of this application.
• Further explore how performance on the labeled sample relates to the ability to identify low-quality websites in the full web.
• Address efficient collection of web pages.

Page 18

Conclusions

•Generalization to unknown unproven treatments.

•Evidence that training in a production environment will need a small number of labeled documents.

• Scalability.
  • MapReduce for pre-processing.
  • Feature Selection
    • Speed up model building.
    • Speed up model application.

Page 19

References
• Y. Aphinyanaphongs and C. Aliferis, “Text categorization models for identifying unproven cancer treatments on the web,” presented at MEDINFO 2007, 2007.
• J. Metz, “A multi-institutional study of Internet utilization by radiation oncology patients,” International Journal of Radiation Oncology Biology Physics, vol. 56, no. 4, pp. 1201–1205, Jul. 2003.
• M. Richardson, T. Sanders, and J. Palmer, “Complementary/alternative medicine use in a comprehensive cancer center and the implications for oncology,” Journal of Clinical Oncology, 2000.
• K. Schmidt, “Assessing websites on complementary and alternative medicine for cancer,” Annals of Oncology, vol. 15, no. 5, pp. 733–742, May 2004.
• D. Y. Kim, H. R. Lee, and E. M. Nam, “Assessing cancer treatment related information online: unintended retrieval of complementary and alternative medicine web sites,” European Journal of Cancer Care, vol. 18, no. 1, pp. 64–68, Jan. 2009.
• M. Hainer, N. Tsai, and S. Komura, “Fatal hepatorenal failure associated with hydrazine sulfate,” Annals of internal …, 2000.
• J. Bromley, “Life-Threatening Interaction Between Complementary Medicines: Cyanide Toxicity Following Ingestion of Amygdalin and Vitamin C,” Annals of Pharmacotherapy, vol. 39, no. 9, pp. 1566–1569, Aug. 2005.
• A. Sparreboom, “Herbal Remedies in the United States: Potential Adverse Interactions With Anticancer Agents,” Journal of Clinical Oncology, vol. 22, no. 12, pp. 2489–2503, Jun. 2004.
• “‘Operation Cure.all’ Targets Internet Health Fraud.” Federal Trade Commission, 24-Jun-1999.
• J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
• C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, “Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation,” The Journal of Machine Learning Research, vol. 11, pp. 171–234, 2010.

Page 20

Thank you.

• NYU Langone Medical Center.
• Center for Health Informatics and Bioinformatics.
• Center for Translational Sciences Institute
  • Grant 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health
• Collaborators
  • Lawrence Fu
  • Constantin Aliferis

•Contact: Yin Aphinyanaphongs ([email protected]).