Copyright © 2004 Oracle Corporation

Oracle Life Sciences eSeminar
Oracle Data Mining for Life Sciences Problems

http://conference.oracle.com
Meeting Place:
US Toll Free: 1-888-967-2253
US Only: 1-650-607-2253
Asia/Pacific: +61 2 8817 6100
Europe/M. East/Africa: +44 118 924 9000
Meeting ID #: 407709
Meeting Password: 407709

Charlie Berger ([email protected]), Sr. Director of Product Management, Life Sciences and Data Mining
Pablo Tamayo ([email protected]), Consulting Member of Technical Staff, Data Mining Technologies
Pat Hoffman ([email protected]), Senior Principal Consultant, Oracle Consulting

Oracle Life Science Platform

1. Access distributed data: Gateways, External Tables, SQL Loader, Streams, Oracle Gateway to Lion SRS, etc.
2. Integrate a variety of data types: XML DB, Intermedia, Text, etc.
3. Manage vast quantities of data: RAC, Partitioning, Grid, etc.
4. Collaborate securely: Collaboration Suite, iFS (Oracle FilesOnline), Portal, Security, etc.
5. Find patterns and insights: Data Mining, BLAST, Statistics, Text, etc.

Genomics, Proteomics, Pathways, Cheminformatics, Clinical

Example Data Mining Applications

Life Sciences examples
• Leukemia AML/ALL (Golub et al.)
• NCI-60 ChemoSensitivity data

Database Marketing
• Target doctors likely to prescribe new drug(s)
• Target “best” patients

Discovery/Development
• Discover target genes and proteins
• Identify promising leads for new drugs
• Medline literature mining
• Pharmacovigilance

Health Care
• Predicting medical outcomes
  • Diabetes
  • Pneumonia
  • Response to treatment
• Fraud detection

Oracle Data Mining Algorithms & Example Applications

Attribute Importance
• Identify the most influential attributes for a target attribute
  • Factors associated with a disease
  • Promising leads

Classification and Prediction
• Predict who is most likely to:
  • Prescribe a new drug (doctors)
  • Respond to a treatment (patients)

Regression
• Predict a numeric value
  • Predict how much a tumor will be reduced in size

[Figure: attributes A1 to A7]

Oracle Data Mining Algorithms & Example Applications

Clustering
• Find naturally occurring groups
  • Gene clusters
  • Find disease subgroups
  • Distinguish normal from non-normal behavior

Association Rules
• Find co-occurring items
  • Suggest interactions

Feature Extraction
• Reduce a large dataset into representative new attributes
• Useful for clustering and text mining

[Figure: extracted features F1 to F4]
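To make "find co-occurring items" a little more concrete, the sketch below computes the two usual association-rule measures, support and confidence, for a pair of items. It is written in Python rather than with ODM, it does not show how ODM's Association Rules algorithm works internally, and the transactions and item names are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical data: each set is one patient's list of prescribed drugs.
transactions = [
    {"drug_a", "drug_b"},
    {"drug_a", "drug_b", "drug_c"},
    {"drug_a"},
    {"drug_c"},
]

pair_support = support(transactions, {"drug_a", "drug_b"})
rule_confidence = confidence(transactions, {"drug_a"}, {"drug_b"})
print(f"support(a, b) = {pair_support:.2f}")
print(f"confidence(a -> b) = {rule_confidence:.2f}")
```

A rule such as "drug_a implies drug_b" is usually reported only when both measures exceed thresholds chosen by the analyst.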

Oracle Data Mining Algorithms & Example Applications

Text Mining
• Combine data and text for better models
  • Add unstructured text (e.g. physician’s notes) to structured data (e.g. age, weight, height) to predict outcomes
• Classify and cluster documents
• Combined with Oracle Text to develop advanced text mining applications, e.g. Medline

BLAST
• Sequence matching and alignment
  • Find genes and proteins that are “similar”

[Figure: example sequence ATGCAATGCCAGGATTTCCA compared against longer sequences]
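The Text Mining bullets above describe adding unstructured text to structured attributes before building a model. A minimal sketch of that idea follows, in Python rather than Oracle Text and ODM. The bag-of-words features, the function names, and the patient record are all assumptions made for illustration; they are not how Oracle Text represents documents.

```python
import re
from collections import Counter


def text_features(note):
    """Turn free text into word-count features (a simple bag of words)."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return {f"word={tok}": n for tok, n in Counter(tokens).items()}


def combine(structured, note):
    """Merge structured attributes and text features into one case row."""
    row = dict(structured)
    row.update(text_features(note))
    return row


# Hypothetical patient record and physician's note.
patient = {"age": 54, "weight_kg": 81.0, "height_cm": 172.0}
note = "Patient reports chest pain; pain worse at night."

row = combine(patient, note)
print(row)
```

The combined row can then be passed to any classifier that accepts sparse attribute/value input.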

5. Discover Patterns and Insights

Deductive Analysis: Answer complex questions about the relationships in genomic, clinical and pharmacological data

Inductive Analysis: Finding relationships for classification, class discovery and prediction

Life Sciences data
• Pharmacological databases
• Proteomics Database
• Clinical Databases
• Functional Genomic Databases

[Figure: META Group METAspectrum 60.1 chart. Copyright © 2004 META Group, Inc. All rights reserved. metagroup.com]

DEMONSTRATION

Oracle Data Mining

Demo scenarios

• Gene expression analysis
• Chemosensitivity analysis
• Clinical data analysis
• Clinical data analysis with text mining
• Medline text mining

Oracle Data Miner
• Data miner uses Oracle Data Miner to build, evaluate, and apply ODM models
• Mining Activity Guide
• Wizards approach
• Generate Java and SQL code to “operationalize” applications
• Integrate “insights” into other applications

Oracle Text & Text Mining

Oracle 10g SVM Classification of Multiple Tumor Types

Multiple examples of tumor tissue (public data from Whitehead/MIT)
DNA Microarray Data
Oracle Data Mining

Confusion matrix (Actual \ Predicted). Predicted columns: BR PR LU CO LY BL ML UT LE RE PA OV MS BR. Counts are listed per actual class in column order, with empty cells omitted:

BREAST-BR: 1, 1
PROSTATE-PR: 1, 1
LUNG-LU: 1, 2
COLON-CO: 3
LYMPHOMA-LY: 6
BLADDER-BL: 1, 2
MELANOMA-ML: 1, 1
UTERUS-UT: 2
LEUKEMIA-LE: 1, 5
RENAL-RE: 3
PANCREAS-PA: 1, 2
OVARY-OV: 1, 2
MESOTHELIOMA-MS: 3
BRAIN-BR: 4

78.25% accuracy (Green = correct, Red = errors)

We load data for multiple cancer types into the Oracle database: 16,063 genes, 144 cancer patients and 10 samples per class.

We then mine the data using Support Vector Machines and create the confusion matrix.
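For readers unfamiliar with the confusion matrix, the following sketch shows how one is tallied from actual and predicted class labels, and how the accuracy figure follows from it. It is plain Python rather than ODM, and the labels are hypothetical; they do not reproduce the results on the slide.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of cases whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for six test samples (not the slide's data).
actual = ["BR", "BR", "PR", "LU", "LU", "CO"]
predicted = ["BR", "LU", "PR", "LU", "LU", "CO"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)

# Print one row per actual class, one column per predicted class.
labels = sorted(set(actual) | set(predicted))
print("actual\\predicted", labels)
for a in labels:
    print(a, [matrix[(a, p)] for p in labels])
print(f"accuracy = {acc:.2%}")
```

Diagonal cells hold the correct predictions; every off-diagonal cell is an error of one specific kind, which is what makes the matrix more informative than the accuracy alone.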

Classification of Multiple Tumor Types
• Multiple examples of 14 tumor types
• Training set: 144 samples. Test set: 46 samples
• Microarray gene expression profiles for 7,129 genes (features)
• Problem: how well can a model distinguish between multiple tumor types?
• Dataset composition:

Tumor Class          # Train   # Test
Breast (BR)              8        3
Prostate (PR)            8        2
Lung (LU)                8        3
Colorectal (CO)          8        5
Lymphoma (LY)           16        6
Bladder (BL)             8        3
Melanoma (ML)            8        2
Uterus (UT)              8        2
Leukemia (LE)           24        6
Renal (RE)               8        3
Pancreas (PA)            8        3
Ovary (OV)               8        3
Mesothelioma (MS)        8        3
Brain (BR)              16        4

Supervised Classification SVM Methodology

Step                                Oracle Task        Additional input
1. Multi-Tumor Dataset
2. Read into RDBMS as Table         SQLLDR
3. Data Preparation (Scaling)       SQL query
4. Build SVM Model (Training)       ODM Model Build    Tumor Labels (Train)
5. Evaluate Model on Test Set       ODM Model Apply    Tumor Labels (Test)
6. Prediction Results

Gene Expression Data Table and Rescaling

• The datasets were downloaded from the web site and stored in flat files before being loaded into the Oracle database.

• The data was loaded using SQLLDR to create a fact table with the following format:

column   type
sid      NUMBER
gene     VARCHAR2(30)
expr     NUMBER

• Rescaling: the values were divided by a constant (10000) to turn them into small numbers near 1, which keeps the dot products between all samples in the dataset inside the [-1, 1] range.
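The rescaling step can be illustrated with a small sketch. The slides perform it with a SQL query inside the database; the version below uses Python only so that it runs on its own. The column layout (sid, gene, expr) and the constant 10000 come from the slide, while the rows, gene names, and helper functions are hypothetical.

```python
from collections import defaultdict

SCALE = 10000.0  # rescaling constant quoted on the slide

# Hypothetical fact-table rows in the slide's layout: (sid, gene, expr).
rows = [
    (1, "GENE_A", 5000.0),
    (1, "GENE_B", -2000.0),
    (2, "GENE_A", 4000.0),
    (2, "GENE_B", 1000.0),
]


def rescale(fact_rows, scale=SCALE):
    """Divide every expression value by a constant."""
    return [(sid, gene, expr / scale) for sid, gene, expr in fact_rows]


def to_vectors(fact_rows):
    """Group fact rows into one {gene: expr} mapping per sample id."""
    vectors = defaultdict(dict)
    for sid, gene, expr in fact_rows:
        vectors[sid][gene] = expr
    return dict(vectors)


def dot(u, v):
    """Dot product over the genes present in both samples."""
    return sum(u[g] * v[g] for g in u.keys() & v.keys())


scaled = rescale(rows)
vectors = to_vectors(scaled)
d12 = dot(vectors[1], vectors[2])
print(f"dot(sample 1, sample 2) = {d12:.4f}")
```

With real data the dot product sums over thousands of genes, so whether it stays inside [-1, 1] depends on the magnitude of the raw values; the toy rows here only show the mechanics.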

Example of ODM PL/SQL Build and Apply Commands

BEGIN
  -- Build (train) the SVM classification model on the training table.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'SVM_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'multitumor_train',
    case_id_column_name => 'id',
    target_column_name  => 'class',
    settings_table_name => 'svm_settings');

  -- Apply the model to the test table and store the predictions.
  DBMS_DATA_MINING.APPLY(
    model_name          => 'SVM_model',
    data_table_name     => 'multitumor_test',
    case_id_column_name => 'id',
    result_table_name   => 'multitumor_apply_result');
END;
/

Algorithm Settings for Support Vector Machines

Setting                 Values
svms_kernel_function    Kernel: svms_linear (for Linear Kernel) or svms_gaussian (for Gaussian Kernel)
svms_target_type        Target Type for SVM, either svms_multi_target or svms_single_target
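As a reminder of what the two kernel choices compute, here is a sketch of the textbook linear and Gaussian kernels in Python. The function names and the `sigma` parameter are illustrative only and are not ODM setting names; the slide does not say how ODM parameterizes its Gaussian kernel.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Two hypothetical rescaled expression vectors.
x = [0.5, -0.2, 0.1]
y = [0.4, 0.1, 0.0]

print("linear:  ", linear_kernel(x, y))
print("gaussian:", gaussian_kernel(x, y))
```

The linear kernel grows with the magnitude of the inputs, which is one reason rescaling matters for it, while the Gaussian kernel always returns a value in (0, 1] and equals 1 only for identical vectors.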

SVM Results
• Entire methodology implemented in Oracle RDBMS and ODM.
• The SVM model works with all 7,129 input features (genes) and does not require feature selection.
• The SVM model is relatively fast: 9 minutes training time on a 500 MHz Netra.
• The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy. This is comparable to the published results in the Ramaswamy et al. PNAS 2001 paper, which also reported 63% accuracy for k-NN and 46% for Weighted Voting.

Oracle 10g SVM Classification of Multiple Tumor Types

Oracle Data Mining’s SVM models classify the multi-class tumor data with 78.25% accuracy.

Benefits of Oracle’s Approach

Oracle Data Mining Feature: Platform for Data Mining Applications
Benefit:
• Eliminates data movement and security exposure
• Fastest: Data → Information

Oracle Data Mining Feature: Wide range of data mining algorithms
Benefit:
• Supports most data mining problems

Oracle Data Mining Feature: Runs on multiple platforms
Benefit:
• Applications may be developed and deployed

Oracle Data Mining Feature: Built on Oracle Technology
Benefit:
• Grid, RAC, integrated BI,…
• SQL & PL/SQL available
• Leverage existing skills

InforSense Oracle Edition

[Figure: workflow diagram covering Data Acquisition, Data Analysis, Multi-Search Discovery. Legend: Oracle Component, InforSense Component, Web Services]

(Unified environment + heterogeneous components) enable complex process

InforSense/Oracle Integrated Advanced Analytics

InforSense analytics

Domain specific tools

Workflows

Warehousing

Deployment

Oracle Analytics

• Oracle Data Mining

• Oracle Text Mining

• Oracle Life Science

External Data

• Files

• XML

• SRS

Third-party analytics

Non-clinical Data (NIH)

Healthcare Data (HTB)

Value of InforSense Integration

Candidate Drug

LY317615

1 Candidate Drug

1 Deployable Model

Reusable Process

2 Weeks

8 Methodologies

100 Components (Nodes)

1 Analyst

1 Workflow

• Heterogeneous components
• Complex process encapsulation
• Smooth integration of web services
• Rapid build and build for reuse
• Adaptive to different users
• Leverage Oracle components

InforSense – the Power of Integration

QUESTIONS
ANSWERS