
Machine Learning Methods for Decision Support and Discovery


Page 1: Machine Learning Methods for Decision Support and Discovery

Machine Learning Methods for Decision Support and Discovery

Constantin Aliferis M.D., Ph.D., Ioannis Tsamardinos Ph.D.

Discovery Systems Laboratory,

Department of Biomedical Informatics,

Vanderbilt University

2004 MEDINFO Tutorial

7 September 2004

Page 2: Machine Learning Methods for Decision Support and Discovery

Acknowledgments

Alexander Statnikov for code and for putting together the Resource Web page and CD

Doug Hardin, Pierre Massion, Yindalon Aphinyanaphongs, Laura E. Brown, and Nafeh Fananapazir for access to data and results for case studies

Page 3: Machine Learning Methods for Decision Support and Discovery

Goal

The purpose of this tutorial is:
– To help participants develop a solid understanding of some of the most useful machine learning methods,
– To give several examples of how these methods can be applied in practice, and
– To provide resources for expanding the knowledge gained in the tutorial.

Page 4: Machine Learning Methods for Decision Support and Discovery

Outline

Part I: Overview and Foundations
1. Tutorial overview and goals
2. Importance of Machine Learning for discovery and decision-support system construction
3. A framework for inductive Machine Learning
4. Generalization and over-fitting
5. Quick review of data preparation and model evaluation
6. Families of methods:
   a. Bayesian classifiers
   (break)
   b. Neural Networks
   c. Support Vector Machines
   (break)
7. Quick review of additional families:
   a. K-Nearest Neighbors, b. Clustering, c. Decision Tree Induction, d. Genetic Algorithms

Page 5: Machine Learning Methods for Decision Support and Discovery

Outline (cont’d)

Part II: More Advanced Methods and Case Studies
1. More advanced methods:
   a. Causal discovery methods using Causal Probabilistic Networks
   b. Feature selection
   (break)
2. Case studies:
   a. Building a diagnostic model from gene expression data
   b. Building a diagnostic model from mass spectrometry data
   c. Categorizing text into content categories
   (break)
   d. Discovery of causal structure using Causal Probabilistic Network induction (demo)
3. Conclusions and wrap-up:
   a. Resources for machine learning
   b. Questions & feedback

Page 6: Machine Learning Methods for Decision Support and Discovery

Definitions & Importance of Machine Learning

Page 7: Machine Learning Methods for Decision Support and Discovery

A (Simplified) Motivating Example

Assume we wish to create a decision support system capable of diagnosing patients according to two categories: Lung Cancer and Normal.

The input to the system will be array gene expression measurements from tissue biopsies.

Page 8: Machine Learning Methods for Decision Support and Discovery

A (Simplified) Motivating Example

Little is currently known about how gene expression values differentiate human lung cancer tissue from normal tissue.

Thus we will use an automated approach in which a computer system will examine patients’ array gene expression measurements and the correct diagnosis (provided by a pathologist).

Page 9: Machine Learning Methods for Decision Support and Discovery

A (Simplified) Motivating Example

The system will produce a program that implements a function assigning the correct diagnostic label to any pattern of array gene expression data (and not just to the input-output patterns of the training data).

Thus the system will learn (i.e., generalize) from training data the general input-output function for our diagnosis problem.

Page 10: Machine Learning Methods for Decision Support and Discovery

A (Simplified) Motivating Example

What are the principles and specific methods that enable the creation of such learning systems?

What flavors of learning systems currently exist?

What are their capabilities and limitations?

…These are some of the questions we will be addressing in this tutorial

Page 11: Machine Learning Methods for Decision Support and Discovery

What is Machine Learning (ML)? How is it different from Statistics and Data Mining?

Machine Learning is the branch of Computer Science (Artificial Intelligence in particular) that studies systems that learn.

Systems that learn = systems that improve their performance with experience.

Page 12: Machine Learning Methods for Decision Support and Discovery

What is Machine Learning (ML)? How is it different from Statistics and Data Mining?

Typical tasks: image recognition, diagnosis, elicitation of the possible causal structure of a problem domain, game playing, solving optimization problems, prediction of the structure or function of biomolecules, text categorization, identification of relevant variables, etc.

Page 13: Machine Learning Methods for Decision Support and Discovery

Indicative Example Applications of ML in Biomedicine

Bioinformatics

• Prediction of Protein Secondary Structure
• Prediction of Signal Peptides
• Gene Finding and Intron/Exon Splice Site Prediction
• Diagnosis using cDNA and oligonucleotide array gene expression data
• Identification of molecular subtypes of patients with various forms of cancer

Clinical problem areas
• Survival after Pneumonia (CAP)
• Survival after Syncope
• Diagnosis of Acute M.I.
• Diagnosis of Prostate Cancer
• Diagnosis of Breast Cancer
• Prescription and monitoring in hemodialysis
• Prediction of renal transplant graft failure

Page 14: Machine Learning Methods for Decision Support and Discovery

Importance of ML: Task Types

– Diagnosis (what is the most likely disease given a set of clinical findings?)
– Prognosis (what will be the outcome after a certain treatment has been given to a patient?)
– Treatment selection (what treatment to give to a specific patient?)
– Prevention (what is the likelihood that a specific patient will develop disease X if preventable risk factor Y is present?)

ML has practically replaced Knowledge Acquisition for building Decision Support (“Expert”) Systems.

Page 15: Machine Learning Methods for Decision Support and Discovery

Importance of ML: Task Types (cont’d)

Discovery:
– Feature selection (e.g., what is a minimal set of laboratory values needed for pneumonia diagnosis?);
– Concept formation (e.g., what are patterns of genomic instability, as measured by array CGH, that constitute molecular subtypes of lung cancer capable of guiding development of new treatments?);
– Feature construction (e.g., how can mass-spectrometry signals be decomposed into individual variables that are highly predictive for detection of cancer and can be traced back to individual proteins that may play important roles in carcinogenesis?);
– Information retrieval query construction (e.g., what are PubMed MeSH terms that predict with high sensitivity and specificity whether medical journal articles talk about treatment?);
– Questions about function, interactions, and structure (e.g., how do genes and proteins regulate each other in the cells of lower and higher organisms? what is the most likely function of a protein given the sequence of its amino acids?), etc.

Page 16: Machine Learning Methods for Decision Support and Discovery

What is Machine Learning (ML)? How is it different from Statistics and Data Mining?

Broadly speaking, ML, DM, and Statistics have similar goals (modeling for classification and for hypothesis generation or testing).

Statistics has traditionally emphasized models that can be solved analytically (for example, various versions of the Generalized Linear Model, GLM). To achieve this, both restrictions on the expressive power of models and parametric distributional assumptions are heavily used.

Data Mining emphasizes very large-scale data storage, integration, retrieval, and analysis (with analysis typically a secondary focus).

Machine Learning seeks to use computationally powerful approaches to learn very complex non- or quasi-parametric models of the data. Some of these models are closer to human representations of the problem domain per se (or of problem solving in the domain).

Page 17: Machine Learning Methods for Decision Support and Discovery

Importance of ML: Data Types and Volume

Overwhelming production of data:
– Bioinformatics (mass-throughput assays for gene expression, protein abundance, SNPs, …)
– Clinical systems (EPR, POE)
– Bibliographic collections
– The Web: web pages, transaction records, …

Page 18: Machine Learning Methods for Decision Support and Discovery

Importance of ML: Reliance on Hard Data and Evidence

Machine learning has become critical for decision support system construction, given extensive cognitive biases and the corresponding need to base MDSSs on hard scientific evidence and high-quality data.

Page 19: Machine Learning Methods for Decision Support and Discovery

Supplementary: Cognitive Biases

Main thesis:
– Human cognitive abilities are tailored to support instinctive, reflexive, life-preserving reactions traced back to our evolution as a species. They are not designed for rational, rigorous reasoning such as the reasoning needed in science and engineering.

In other words, there is a disconnect between our innate cognitive ability and the complexity of reasoning tasks required by the explosive advances in science and technology in the last few hundred years.

Page 20: Machine Learning Methods for Decision Support and Discovery

Supplementary: But is the Cognitive Biases Thesis Correct?

• Psychology of Judgment and Decision Making (Plous)
• Tversky and Kahneman, Judgment under Uncertainty: Heuristics and Biases
• Methods of Influence (Cialdini)

And highly-recommended supplementary information can be found in:

• Professional Judgment (Elstein)
• Institute of Medicine’s report on medical errors (1999)

Page 21: Machine Learning Methods for Decision Support and Discovery

Supplementary: Tversky and Kahneman, “Judgment under Uncertainty: Heuristics and Biases”

This work (a constellation of psychological studies converging on a description of human decision making under uncertainty) is very highly regarded and influential. It was recently (2002) awarded the Nobel Prize in Economics.

Main points:
– People use a few simple heuristics when making judgments under uncertainty
– These heuristics are sometimes useful and other times lead to severe and systematic errors
– These heuristics are: representativeness, availability, and anchoring

Page 22: Machine Learning Methods for Decision Support and Discovery

Supplementary: Representativeness

E.g.: the probability P that patient X has disease D given that she has findings F is assessed by the similarity of X to a prototypical description of D (found in a textbook, or recalled from earlier practice and training).

Why is this wrong?
– Reason #1: similarity ignores the base rate of D

– Reason #2: similarity ignores sample size

– Reason #3: similarity ignores predictability

– Reason #4: similarity is affected by redundant features

Page 23: Machine Learning Methods for Decision Support and Discovery

Supplementary: Availability

E.g.: the probability P that patient X with disease D, given that she receives treatment T, will become healthy is assessed by recalling such occurrences in one’s prior experience.

Why is this wrong?
– Reason #1: availability is influenced by familiarity

– Reason #2: availability is influenced by salience

– Reason #3: availability is influenced by elapsed time

– Reason #4: availability is influenced by rate of abstract terms

– Reason #5: availability is influenced by imaginability

– Reason #6: availability is influenced by perceived association

Page 24: Machine Learning Methods for Decision Support and Discovery

Supplementary: Adjustment and Anchoring

E.g.: the probability P that patient X has disease D given that she has findings F is assessed by making an initial estimate P1 for findings F1 and updating it as new evidence F2, F3, …, and so on, becomes available.

What goes wrong?
– Problem #1: the initial estimate over-influences the final estimate

– Problem #2: the initial estimate is often based on quick calculations that are then extrapolated

– Problem #3: people overestimate the probability of conjunctive events

– Problem #4: people’s predictions are calibrated differently depending on the initial anchor

Page 25: Machine Learning Methods for Decision Support and Discovery

Supplementary: Additional

• Methods of Influence (Cialdini, 1993):

– Reciprocation

– Commitment & Consistency

– Social Proof

– Liking

– Authority

– Scarcity

• Professional Judgment (Dowie and Elstein 1988)

• Institute of Medicine’s report on medical errors (1999)

Page 26: Machine Learning Methods for Decision Support and Discovery

Supplementary: Putting MDSSs and Machine Learning in Historical Context

40s
– Foundations of formal decision-making theory by von Neumann and Morgenstern
50s
– Ledley and Lusted lay out how logic and probabilistic reasoning can help in diagnosis and treatment selection in medicine
60s
– Applications of Bayes’ theorem for diagnosis and treatment selection pioneered by Warner and DeDombal
– Medline (NLM)
Early 70s
– Ad-hoc systems (Myers et al.; Pauker et al.)
– Study of cognitive biases (Kahneman, Tversky)
Late 70s
– Rule-based systems (Buchanan & Shortliffe)

Page 27: Machine Learning Methods for Decision Support and Discovery

Supplementary: Milestones in MDSSs

• 80s
– Analysis of ad-hoc systems and RBSs (Heckerman et al.)
– Bayesian Networks (Pearl, Cooper, Heckerman et al.)
– Medical Decision Making as a discipline (Pauker)
– Literature-driven decision support (Rennels & Shortliffe)

• Early 90s
– Web-enabled decision support & widespread information retrieval
– Computational causal discovery (Pearl, Spirtes et al., Cooper et al.)
– Sound re-formulation of very large ad-hoc systems (Shwe et al.)
– Analysis of Bayesian systems (Domingos et al., Henrion et al.)
– Proliferation of focused Statistics and Machine Learning MDSSs
– First-order logics that combine classical FOL with probabilistic reasoning, causation, and planning (Haddawy)

Page 28: Machine Learning Methods for Decision Support and Discovery

Supplementary: Milestones in MDSSs

• Late 90s
– Efficient inference for very large probabilistic systems (Jordan et al.)
– Kernel-based methods for sample-efficient learning (Vapnik)
– Evidence-Based Medicine (Haynes et al.)

• 21st century
– Diagnosis, prognosis, and treatment selection (a.k.a. “personalized medicine” or “pharmacogenomics”) based on molecular information (proteomic spectra, gene expression arrays, SNPs) collected via mass-throughput assaying technology and modeled using machine learning methods
– Provider order entry delivery of advanced decision support
– Advanced representation, storage, retrieval, and application of EBM information (guidelines, journals, meta-analyses, clinical bioinformatics models)

Page 29: Machine Learning Methods for Decision Support and Discovery

Importance of ML

How often are ML techniques used? Number of articles in Medline (in parentheses, the last 2 years):
– Artificial Intelligence: 12,441 (2,358)
– Expert systems: 2,271 (121)
– Neural Networks: 5,403 (1,158)
– Support Vector Machines: 163 (121)
– Clustering: 17,937 (4,080)
– Genetic Algorithms: 2,798 (969)
– Decision Trees: 4,958 (752)
– Bayesian (Belief) Networks: 1,627 (585)
– Bayes (Bayesian Statistics + Nets): 4,369 (561)

Compare to:
– Regression: 164,305 (28,134)
– Knowledge acquisition: 310 (56)
– Knowledge representation: 227 (27)
– 4 major symbolic DSSs (Internist-I, QMR, ILIAD, DxPlain): 145 (10)
– Rule-based systems: 802 (151)

Page 30: Machine Learning Methods for Decision Support and Discovery

Importance of ML

The importance of ML becomes very evident in cases where:
» data analysis is too time consuming (e.g., classifying web pages or Medline documents into content or quality categories)
» there is little or no domain theory

What is the diagnosis?

Is this an early cancer?

Page 31: Machine Learning Methods for Decision Support and Discovery

A Framework for Inductive ML & Related Introductory Concepts

Page 32: Machine Learning Methods for Decision Support and Discovery

What is the difference between supervised and unsupervised ML methods?

Supervised learning:
– Give the learning algorithm several instances of input-output pairs; the algorithm learns to predict the correct output corresponding to inputs (not only previously seen but also previously unseen ones: “generalization”).
– In our original example: show the learning algorithm array gene expression measurements from several patient cases as well as normal subjects; the learning algorithm then induces a classifier that can classify a previously unseen subject into the correct diagnostic category given the gene expression values observed in that subject.

Page 33: Machine Learning Methods for Decision Support and Discovery

Classification

[Diagram: training instances, each a vector of values for variables A–E (A1, B1, C1, D1, E1 through An, Bn, Cn, Dn, En), are fed to a classifier-inductive algorithm, which outputs a classifier; the classifier is then applied to application instances and its classification performance is measured.]

Page 34: Machine Learning Methods for Decision Support and Discovery

What is the difference between supervised and unsupervised ML methods?

Unsupervised learning:
– Discover the categories (or other structural properties of the domain).
– Example: give the learning algorithm gene expression measurements of patients with lung cancer as well as normal subjects; the algorithm finds sub-types (“molecular profiles”) of patients that are very similar to each other and different from the rest of the types. Or another algorithm may discover how various genes interact among themselves to determine the development of cancer.

Page 35: Machine Learning Methods for Decision Support and Discovery

Discovery

[Diagram: training instances over variables A–E are fed to a structure-induction algorithm, which outputs a structure (a graph over A–E); performance is then assessed.]

Page 36: Machine Learning Methods for Decision Support and Discovery

A first concrete attempt at solving our hypothetical diagnosis problem using a particular type of learning approach (decision tree induction)

Page 37: Machine Learning Methods for Decision Support and Discovery

Decision Tree Induction

An example decision tree to solve the problem of how to classify subjects into lung cancer vs. normal:

[Figure: a decision tree rooted at Gene139. Over-expressed → Lung cancer; Normally-expressed → test Gene202; Under-expressed → test Gene8766. For Gene202 and Gene8766 alike: Over-expressed → Lung cancer, Normally-expressed → Normal.]

Page 38: Machine Learning Methods for Decision Support and Discovery

How Can I Learn Such A Decision Tree Automatically?

A basic induction procedure is very simple in principle (see the sketch below):
1. Start with an empty tree
2. Put at the root of the tree the variable that best classifies the training examples
3. Create branches under the variable corresponding to its values
4. Under each branch, repeat the process with the remaining variables
5. Until we run out of variables or sample
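To make the procedure concrete, here is a minimal, hypothetical Python sketch of the loop above. It scores a split by raw misclassification error, whereas practical learners (e.g., ID3/C4.5) use information gain and add pruning; instances are assumed to be dictionaries mapping gene names to discrete expression values.

    from collections import Counter

    def learn_tree(rows, labels, variables):
        # Step 5: stop when the node is pure or no variables remain.
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not variables:
            return majority
        def split_errors(v):
            # Misclassifications if each value-branch predicts its majority label.
            err = 0
            for val in set(r[v] for r in rows):
                ys = [y for r, y in zip(rows, labels) if r[v] == val]
                err += len(ys) - Counter(ys).most_common(1)[0][1]
            return err
        best = min(variables, key=split_errors)        # step 2: best classifier
        branches = {}
        for val in set(r[best] for r in rows):         # step 3: one branch per value
            keep = [(r, y) for r, y in zip(rows, labels) if r[best] == val]
            branches[val] = learn_tree([r for r, _ in keep],
                                       [y for _, y in keep],
                                       [v for v in variables if v != best])  # step 4
        return (best, branches, majority)

    def classify(tree, instance):
        # Walk from the root test down; fall back to the node's
        # majority label on values unseen in training.
        while isinstance(tree, tuple):
            var, branches, majority = tree
            tree = branches.get(instance.get(var), majority)
        return tree

A call such as classify(learn_tree(rows, labels, ["Gene139", "Gene202", "Gene8766"]), new_case) would then assign a diagnostic label to a previously unseen case.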

Page 39: Machine Learning Methods for Decision Support and Discovery

How Can We Generalize From This Example?

Page 40: Machine Learning Methods for Decision Support and Discovery

A General Description of Supervised Inductive ML

Inductive Machine Learning algorithms can be designed and analyzed using the following framework:
– A language L in which we express models. The set of all possible models expressible in L constitutes our hypothesis space H.
– A scoring metric M that tells us how good a particular model is.
– A search procedure S that helps us identify the best model in H.

[Diagram: the hypothesis space H of models, shown as a region of candidate models (×) within the space of all possible models.]

Page 41: Machine Learning Methods for Decision Support and Discovery

A General Description of Supervised Inductive ML

In our decision tree example:
– Language L in which we express models = decision trees
– Hypothesis space H = the space of all decision trees that can be constructed with genes 1 to n
– Scoring metric M telling us how good a particular model is = min(classification error + model complexity)
– Search procedure S = greedy search

Page 42: Machine Learning Methods for Decision Support and Discovery

How can ML methods fail?

Wrong language bias: the best model is not in H.
Example: we look for models expressible as discrete decision trees, but the domain is continuous.

[Diagram: the best model (×) lies outside the region of models in H.]

Page 43: Machine Learning Methods for Decision Support and Discovery

How can ML methods fail?

Search failure: the best model is in H, but the search fails to examine it.
Example: greedy search fails to capture a strong gene-gene interaction effect.

[Diagram: the best model (×) lies inside H but is missed by the search.]

Page 44: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

Page 45: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

It was mentioned previously that a good learning program learns something about the data beyond the specific cases that have been presented to it.

Indeed, it is trivial to just store and retrieve the cases that have been seen in the past (“rote learning” implemented as a lookup table). This does not address the problem of how to handle new cases, however.

Page 46: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

In supervised learning we typically seek to minimize “i.i.d.” error, that is error over future cases (not used in training). Such cases contain both previously encountered as well as new cases.

“i.i.d.” = independently sampled and identically distributed problem instances.

In other words, the training and application samples come from the same population (distribution) with identical probability to be selected for inclusion and this population/distribution is time-invariant.

(Note: if not time invariant then by incorporating time as independent variable or by other appropriate transformations we restore the i.i.d. condition)

Page 47: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Consider now the following simplified diagnostic classification problem: classify patients into cancer (red/vertical pattern) versus normal (green/no pattern) on the basis of the values of two genes (Gene1, Gene2).

[Scatter plot: samples plotted by their Gene1 and Gene2 expression values.]

Page 48: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

The diagonal line represents a perfect classifier for this problem (do not worry for the time being about how to mathematically represent or computationally implement the line; we will see how to do so in the Neural Network and Support Vector Machine segments):

[Plot: a diagonal line perfectly separating the two classes in the Gene1-Gene2 plane.]

Page 49: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Let’s solve the same problem from a small sample; one such possible small sample is:

[Plot: a small sample of points from the two classes in the Gene1-Gene2 plane.]

Page 50: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

We may be tempted to solve the problem with a fairly complicated line:

[Plot: a complicated, wiggly decision boundary that fits the small sample exactly.]

Page 51: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

In which case we get several errors:

[Plot: the complicated boundary misclassifies several points of the full population.]

Page 52: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

…whereas with a simpler line…:

[Plot: a simple, nearly straight boundary fitted to the same small sample.]

Page 53: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

…a much smaller error:

[Plot: the simple boundary commits far fewer errors on the full population.]

Page 54: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

In general, over-fitting a model to the data means that instead of general properties of the population from which the data are sampled, we learn idiosyncrasies (i.e., non-representative properties) of the sample. Over-fitting and poor generalization (i.e., large error in the overall population, the “true error”) are synonymous as long as we have learned the training data well (i.e., small “apparent error”). Over-fitting is affected not only by the “simplicity” of the classifier (e.g., straight vs. wiggly line) but also by:
– the size of the sample,
– the complexity of the function we wish to learn from data,
– the amount of noise, and
– the number and nature (continuous vs. discrete, ordered, distribution, etc.) of the variables.

Page 55: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

We wish to particularly emphasize the danger of grossly over-fitting when the number of predictive variables is large relative to the available sample. Consider for example the following situation:
– Assume we have 5 binary predictors and two samples, and wish to classify instances into two classes.
– The predictors can encode 2^5 = 32 possible distinct patterns. Assume all patterns are equally probable. Hence the chance that the two cases have different predictive patterns is 31/32 ≈ 97%.
– Thus in 97% of our samples of size two, the five variables suffice to identify each case perfectly. Combine this with a powerful enough learning algorithm (i.e., one that can effectively associate any pattern with the desired outcome) and it follows that in 97% of samples one gets optimal apparent error even when there is no relationship between the target variable and the predictive variables!
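The arithmetic above is easy to verify by simulation; this small, hypothetical Python sketch draws random predictor patterns (with no predictor-outcome relationship at all) and counts how often a memorizing, lookup-table learner would achieve zero apparent error:

    import random

    def fraction_with_perfect_apparent_accuracy(n_vars=5, n_cases=2, trials=100_000):
        # With all 2^5 = 32 patterns equally probable, two cases collide with
        # probability 1/32, so distinct patterns occur ~31/32 = 97% of the time.
        rng = random.Random(0)
        distinct = 0
        for _ in range(trials):
            patterns = {tuple(rng.randint(0, 1) for _ in range(n_vars))
                        for _ in range(n_cases)}
            # A lookup-table learner gets zero apparent error whenever all
            # training patterns are distinct, regardless of the labels.
            if len(patterns) == n_cases:
                distinct += 1
        return distinct / trials

    print(fraction_with_perfect_apparent_accuracy())   # ≈ 0.97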

Page 56: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

This situation is particularly relevant in bioinformatics, where we routinely have >10,000 continuous variables, noise, and <500 samples. Under these conditions every training instance almost always has a unique value set of the predictive variables; thus, if one is not careful, the learning algorithm can simply learn what amounts to a lookup table (i.e., by associating the unique predictor signature with the outcome of that case, for every case).

Page 57: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

So how does one avoid over-fitting? Via a variety of approaches:
– Use learning algorithms that intrinsically (by design) generalize well
– Pursue simple (“highly biased”) classifiers for small samples
– Choose unbiased and low-variance statistical estimators of the true error, and employ them sparingly

Very important rule:

Estimate the performance (true error) of a model with data you did not use to construct the model

Page 58: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

Avoiding over-fitting will be a primary concern of ours in this tutorial. We will outline here some specific cross-validation procedures and use them to build models in the case-studies segment.

Page 59: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

Hold-out cross-validation method:
– Split the data into Train and Test sets
– Learn with Train and estimate the true error with Test

[Diagram: the data set split into a train portion and a test portion.]
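A minimal Python sketch of the hold-out split (the learn/evaluate names in the usage comment are hypothetical placeholders for any classifier and any performance metric):

    import random

    def holdout_split(data, test_fraction=0.3, seed=0):
        # Shuffle, then carve off the last portion as the test set.
        items = list(data)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * (1 - test_fraction))
        return items[:cut], items[cut:]

    # usage:
    # train, test = holdout_split(data)
    # model = learn(train)
    # true_error_estimate = evaluate(model, test)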

Page 60: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

N-fold cross-validation:
– Split the data into Train and Test sets n times, such that the union of the test sets is the full data set
– Learn with Train and estimate the true error with Test in each split separately
– Average the test performance

[Diagram: the data set split into n folds; each fold serves once as the test set while the remaining folds form the train set.]
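A corresponding sketch of n-fold cross-validation, again with hypothetical learn/evaluate placeholders:

    def n_fold_splits(data, n=10):
        # Interleaved folds: the n test sets are disjoint and their
        # union is the full data set, as required.
        folds = [data[i::n] for i in range(n)]
        for i in range(n):
            train = [x for j in range(n) if j != i for x in folds[j]]
            yield train, folds[i]

    # usage: average the n test performances to estimate the true error
    # scores = [evaluate(learn(tr), te) for tr, te in n_fold_splits(data)]
    # cv_estimate = sum(scores) / len(scores)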

Page 61: Machine Learning Methods for Decision Support and Discovery

Generalization & Over-fitting

Leave-one-out = n-fold C.V. where n equals the number of data instances.

Page 62: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Stratified (balanced) Cross-validation:

An n-fold C.V. in which (by design) the target class has the same distribution as in the full dataset

Page 63: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Nested cross-validation:
– Assume we wish to apply cross-validation to find the best value of a classifier parameter C from the parameter value set [1, …, 100].
– One way to use C.V. to select the best value for C is to apply the hold-out method 100 times, once for each value of C, and select the value that gives the best error on the test sample.
– The problem with this approach is that the true error estimate is not reliable, since it is produced by running the best model on a test set that was used to derive that model.
– In other words, the data used to estimate the true error can no longer produce unbiased estimates, since it also guided the selection of the model.

Page 64: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Solution:
– Split the Train data into two parts (Train-train and Validation)
– Use the Validation set to find the best parameters
– Use the Test set to estimate the true error

[Diagram: the data set split into Train-train, Validation, and Test portions.]
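A minimal sketch of this nested scheme, assuming hypothetical learn(data, c) and evaluate(model, data) functions and a higher-is-better (accuracy-style) score:

    def nested_holdout(train, test, learn, evaluate, param_values, val_fraction=0.3):
        # Inner split: the validation part selects the parameter value...
        cut = int(len(train) * (1 - val_fraction))
        traintrain, validation = train[:cut], train[cut:]
        best = max(param_values,
                   key=lambda c: evaluate(learn(traintrain, c), validation))
        # ...and the untouched test set yields the true-error estimate
        # for the final model, refit on all of the training data.
        final_model = learn(train, best)
        return best, evaluate(final_model, test)

The key design point is that the test set never participates in choosing the parameter, so the returned estimate is not contaminated by model selection.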

Page 65: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

If the sample is small, the nesting can be repeated with different assignments of the test set (i.e., nested n-fold C.V.). One can also nest LOO with n-fold C.V., or LOO with LOO.

[Diagram: nested n-fold cross-validation: for each rotation of the outer test set (Te), the remaining data are split into train (T) and validation (V) parts, with the validation assignment also rotated.]

Page 66: Machine Learning Methods for Decision Support and Discovery

Supplementary: Generalization & Over-fitting

Important notes:
– Estimating the true error of the best model is a separate procedure from generating the best model; the former requires an additional layer of nesting in our cross-validation
– When there are several types of parameters to be selected (e.g., normalization, discretization, classifier parameters) one can:
  » do one n-fold cross-validation over the Cartesian product of all parameters, which uses more sample but yields more conservative true error estimates, or
  » nest the cross-validation to as many levels as there are distinct parameters needing optimization, which yields less biased true error estimates but uses less sample

Page 67: Machine Learning Methods for Decision Support and Discovery

Quick Notes On Data Preparation

Page 68: Machine Learning Methods for Decision Support and Discovery

Data preparation

Non-specific:
– Is the data lawfully at our disposal?
– Are there issues of patient privacy and confidentiality, as well as intellectual property issues, that need to be resolved?
– How were the data produced, by whom, when, and with what purpose in mind?
– Any known or plausible biases present?
– References in the literature?
– Is there a codebook with clear definitions of variables: location, date of creation, method of creation, value list, value meanings, missing value codes and meanings, history of the database and its elements?

Page 69: Machine Learning Methods for Decision Support and Discovery

Data preparation

Data specific– Valid values?– Variable distributions?– Descriptive statistics?– Mechanisms of missingness & imputation

Page 70: Machine Learning Methods for Decision Support and Discovery

Data preparation

Learner specific– De-noising– Scaling/Normalization– Discretization– Transforming variable distributions– Co-linearities– Homoskedasticity– Outliers– Feature selection

Page 71: Machine Learning Methods for Decision Support and Discovery

Data preparation

Task specific:
– Reconstruct hidden or distorted signals from observed ones
– Infer the presence of hidden variables; determine their cardinality and values
– Stem, normalize, extract terms
– Weight or project variables

Page 72: Machine Learning Methods for Decision Support and Discovery

Basic Evaluation Metrics

Page 73: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics (T: test, D: disease)

Accuracy (0/1 loss):

Accuracy = (number of correct classifications) / (number of total classifications) = (a + d) / (a + b + c + d)

        D+     D-
T+      a      b      a+b
T-      c      d      c+d
        a+c    b+d    a+b+c+d

Page 74: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics (T: test, D: disease)

Sensitivity: proportion of true positives identified by the test

Sensitivity = a / (a + c)

        D+     D-
T+      a      b      a+b
T-      c      d      c+d
        a+c    b+d    a+b+c+d

Page 75: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics (T: test, D: disease)

Specificity: proportion of true negatives identified by the test

Specificity = d / (b + d)

        D+     D-
T+      a      b      a+b
T-      c      d      c+d
        a+c    b+d    a+b+c+d

Page 76: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics (T: test, D: disease)

Positive predictive value (PPV): proportion of true positives over test positives

PPV = a / (a + b)

        D+     D-
T+      a      b      a+b
T-      c      d      c+d
        a+c    b+d    a+b+c+d

Page 77: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics (T: test, D: disease)

Negative predictive value (NPV): proportion of true negatives over test negatives

NPV = d / (c + d)

        D+     D-
T+      a      b      a+b
T-      c      d      c+d
        a+c    b+d    a+b+c+d

Page 78: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics

Mean squared error (MSE) (“quadratic loss”):

MSE = (1/|D|) × Σ_{i=1}^{|D|} (predicted_value(i) − true_value(i))²

where |D| = cardinality of the test dataset. Suitable for continuous outputs.
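All of the above metrics are simple to compute; a small Python sketch covering the 2x2-table metrics of the previous slides and the quadratic loss:

    def classification_metrics(a, b, c, d):
        # a, b, c, d follow the 2x2 tables above: rows T+/T-, columns D+/D-.
        return {
            "accuracy":    (a + d) / (a + b + c + d),
            "sensitivity": a / (a + c),
            "specificity": d / (b + d),
            "PPV":         a / (a + b),
            "NPV":         d / (c + d),
        }

    def mse(predicted, true):
        # Quadratic loss averaged over the |D| instances of the test set.
        return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(predicted)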

Page 79: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics

ROC area:

[Plot: ROC curve of Sensitivity vs. 1 − Specificity, both axes running from 0 to 1; the area under the curve is the metric.]
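The ROC area can also be computed without tracing the curve, via its rank-statistic (Mann-Whitney) interpretation; a small sketch, quadratic in the sample size but adequate for illustration:

    def roc_area(scores_pos, scores_neg):
        # Probability that a randomly chosen positive case scores higher
        # than a randomly chosen negative case (ties count one half).
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
                   for p in scores_pos for n in scores_neg)
        return wins / (len(scores_pos) * len(scores_neg))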

Page 80: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics

In information retrieval:
– “Precision” is the name for “PPV”, and
– “Recall” is the name for “Sensitivity”

Page 81: Machine Learning Methods for Decision Support and Discovery

Evaluation Metrics

Recall-precision curve (and area under it):

[Plot: Precision vs. Recall, both axes running from 0 to 100%.]

Page 82: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Note: we will discuss Bayesian classifiers in the diagnostic context (which, in terms of applications and the historical development of the related ideas, is representative). However, the ideas discussed readily translate to any type of learning for classification and the concomitant decision support function.

Page 83: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Bayes’ theorem (or formula) says that:

P(D | F) = P(D) × P(F | D) / P(F)

where:
– P(D) is the probability of some disease D in the general population (i.e., before obtaining some evidence F), a.k.a. the “disease prior probability”
– P(F) is the probability of some evidence in the form of findings, such as lab tests, physical examination findings, etc.
– P(F | D) is the probability of the same findings given that someone has disease D
– P(D | F) is the probability of disease D given that someone has the findings F (i.e., after obtaining the evidence F), a.k.a. the “disease posterior probability”
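A one-function Python sketch of the formula, with illustrative made-up numbers, shows how strongly the prior tempers the evidence:

    def posterior(p_d, p_f_given_d, p_f_given_not_d):
        # Bayes' theorem for a single disease, expanding P(F) over D and not-D.
        p_f = p_d * p_f_given_d + (1 - p_d) * p_f_given_not_d
        return p_d * p_f_given_d / p_f

    # Hypothetical numbers: 1% prior, a finding present in 90% of
    # diseased and 10% of healthy patients:
    print(posterior(0.01, 0.90, 0.10))   # ≈ 0.083: still an unlikely diagnosis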

Page 84: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Since the most likely diagnosis is the one with the maximum a posteriori probability, Bayes’ formula allows one to solve the differential diagnosis problem, as well as any classification problem that can be cast as supervised learning.

Indeed, in the sample limit there cannot be a better way to infer the most likely diagnosis than Bayes’ theorem, and thus it serves as the theoretical gold standard against which statistical and machine learning classifiers are measured in terms of true error.

In that context it is referred to as the “Bayes Optimal Classifier”.

Page 85: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Note that Bayes’ formula can be applied to diagnosis of multiple, possibly interdependent diseases and non-independent findings, since where there is “F” one can place a vector of findings (e.g., F1+, F2-, F3-, …, Fn+) and where there is “D” one can put a vector of diseases (e.g., D1-, D2-, D3+, …, Dm+).

Page 86: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Further, the intuitive interpretation of Bayes’ rule is that of updating belief about the patient’s true state: before seeing F we have some prior belief (measured as probability) that the patient has disease(s) D. After seeing F we update the prior belief (diagnosis) to reflect (incorporate) the new evidence F; the new belief is the posterior produced by Bayes’ rule.

Page 87: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

Unfortunately there is a significant drawback to the straightforward Bayes rule: we need numbers of probabilities, storage, and computational time that are exponential in the number of findings (i.e., |F|) and the number of diseases (i.e., |D|).

This means that for any diagnostic or other classification problem of non-trivial size (measured in terms of |F| and |D|), straight Bayes is not feasible.

Page 88: Machine Learning Methods for Decision Support and Discovery

Bayesian Classifiers

This has led to a simplified version in which we disallow multiple diseases (i.e., require that the patient may have only one disease at a time) and we require that findings are independent conditioned on the disease states (note: this does not mean that the findings are independent in general, but rather that they are conditionally independent).

The combination of these two assumptions yields required numbers of probabilities, storage, and computational time that are linear in the number of findings and the number of diseases.

Page 89: Machine Learning Methods for Decision Support and Discovery

Simple (a.k.a. “Naive”) Bayes

Application of Bayes’ rule with the Mutual Disease Exclusivity assumption (MEE) and the findings’ Conditional Independence assumption (FCID) is known as “Simple Bayes’ Rule”, “Naïve Bayes”, or, rather distastefully, as “Idiot’s Bayes”.

Simple Bayes can be implemented by plugging into the main formula:

P(F | Dj) = Π_i P(Fi | Dj)

and

P(F) = Σ_j P(F, Dj) = Σ_j [P(F | Dj) × P(Dj)]

where Fi is the i-th (singular) finding and Dj the j-th (singular) disease. Several other (mathematically) equivalent formulations exist using sensitivities and specificities, likelihood ratios, or other convenient building blocks.
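Plugging these two formulas into Bayes’ rule gives a complete classifier in a few lines; a hypothetical Python sketch with made-up probabilities:

    def simple_bayes(priors, p_pos, findings):
        # priors:   {disease: P(Dj)}  (MEE: exactly one disease holds)
        # p_pos:    {disease: {finding: P(Fi+ | Dj)}}  (FCID: findings
        #           independent given the disease)
        # findings: {finding: True/False}, the observed vector F
        joint = {}
        for d, prior in priors.items():
            p = prior
            for f, present in findings.items():
                p *= p_pos[d][f] if present else 1 - p_pos[d][f]
            joint[d] = p                   # P(Dj) * prod_i P(Fi | Dj)
        p_f = sum(joint.values())          # P(F) = sum_j P(F | Dj) * P(Dj)
        return {d: p / p_f for d, p in joint.items()}

    # illustrative, made-up probabilities:
    print(simple_bayes({"cancer": 0.01, "normal": 0.99},
                       {"cancer": {"F1": 0.8, "F2": 0.7},
                        "normal": {"F1": 0.1, "F2": 0.2}},
                       {"F1": True, "F2": True}))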

Page 90: Machine Learning Methods for Decision Support and Discovery

Simple (a.k.a. “Naive”) Bayes

Simple Bayes was applied very early (from the early 60’s and on) in Medical Informatics for diagnosis and optimal treatment selection as well as sequential testing.

See for example the classic papers by Warner et al (1961), DeDombal (1972), Leaper (1972), Gorry and Barnett (1968)

Page 91: Machine Learning Methods for Decision Support and Discovery

Variants of Simple Bayes

Since the MEE and FCID assumptions are clearly violated in many medical contexts, researchers early on sought to relax them and created modified Bayesian classifiers that approximated P(F | D) (Fryback 1978) or assumed independent diseases and multiple diagnoses (the “multi-membership model” of Ben-Bassat, 1980).

These models (and many others not mentioned here) currently have primarily historical significance, because:
– (a) It was shown (Domingos and Pazzani, 1997) that the MEE and FCID assumptions are sufficient but not necessary conditions for a wide variety of target functions under 0/1 loss
– (b) Bayesian Networks were invented, and as we will see next they allow flexible representation of dependencies so that parsimony and tractability are maintained without compromising soundness
– (c) Several other restricted Bayesian classifiers have been shown to perform well in a variety of practical settings

Page 92: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks

Page 93: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Overview

– A note on terminology
– Brief historical perspective
– The Bayesian Network model and its uses
– Learning BNs
– References & resources

Page 94: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: A Note On Terminology

– Bayesian Networks (or “Nets”): generic name
– Belief Networks: subjective probability-based, non-causal
– Causal Probabilistic Networks: frequentist probability-based, causal

Page 95: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: A Note On Terminology

Various other names for special model classes:
– Influence Diagrams (Howard and Matheson): incorporate decision and utility nodes. Used for decision analyses
– Dynamic Bayesian Networks (Dagum et al.): temporal semantics. Used as alternatives to multivariate time series models and for dynamic control
– Markov Decision Processes (Dean et al.): for decision policy formulation in temporally-evolving domains
– Modifiable Temporal Belief Networks (Aliferis et al.): for well-structured and very large problem models that involve time and causation and cannot be stored explicitly

Page 96: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Historical Perspective

The Naïve Bayesian model (mutually exclusive diseases, findings independent given diseases) was the predominant model for medical decision support systems in the 60s and early 70s because it requires a linear number of parameters and computational steps (in the total number of findings and diseases).

Theorem 1 (Minsky, Peot): the Naïve Bayes heuristic’s usefulness (expected classification performance) over all domains gets exponentially worse as the number of variables increases.

Theorem 2 (see Mitchell): full Bayesian classifier = perfect classifier. However, the FBC is impractical and serves as an analytical tool only.

Page 97: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Historical Perspective

In the late 70s and up to the mid-80s this led to Production Systems (i.e., rule-based systems, which are simplifications of first-order logic). The most influential version of PSs (Shortliffe, Buchanan) handled uncertainty through a modular account of subjective belief (the Certainty Factor Calculus).

Theorem 3 (Heckerman): the CFC is inconsistent with probability theory unless the rule-space search graph is a tree. Consequently, forward and backward reasoning cannot be combined in a CFC PS and still produce valid results.

Page 98: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Historical Perspective

[Diagram: a spectrum of expressiveness, from “variables conditionally independent given categories & categories mutually exclusive” (Naïve Bayes) to “variables conditionally dependent” (full dependency); Bayesian Networks span the range in between.]

That led to research (late 80s) in Bayesian Networks, which can vary expressiveness between the full dependency model (or even the full Bayesian classifier) and the Naïve Bayes model (Pearl, Cooper).

Page 99: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Historical Perspective

In the early 90s researchers developed the first algorithms for learning BNs from data (Herskovits, Cooper, Heckerman).

In the mid 90s researchers (Spirtes, Glymour, Scheines, Pearl, Verma) discovered methods to learn CPNs from observational data(!). We will cover the foundations of this in the causal discovery segment.

Overall, BNs are the brainchild of computer scientists, medical informaticians, artificial intelligence researchers, and industrial engineers, and are considered the representation language of choice for most biomedical Decision Support Systems today.

Page 100: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

BN = Graph (variables (nodes), dependencies (arcs)) + Joint Probability Distribution + Markov Property

The graph has to be a DAG (directed acyclic graph) in the standard BN model.

[Graph: node A with arcs to nodes B and C]

JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160

Theorem 4 (Neapolitan): any JPD can be represented in BN form.

Page 101: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

Markov Property: the probability distribution of any node N given its parents P is independent of any subset of the non-descendant nodes W of N.

[Graph: a DAG over nodes A, B, C, D, E, F, G, H, I, J]

e.g.:
D ⊥ {B, C, E, F, G} | A
F ⊥ {A, D, E, G, H, I, J} | {B, C}

Page 102: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

Theorem 5 (Pearl): the Markov property enables us to decompose (factor) the joint probability distribution into a product of prior and conditional probability distributions:

P(V) = Π_i P(Vi | Pa(Vi))

[Graph: node A with arcs to nodes B and C]

The original JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160

Becomes:
P(A+) = 0.2
P(B+ | A+) = 0.1
P(B+ | A-) = 0.5
P(C+ | A+) = 0.3
P(C+ | A-) = 0.6

(Note: P(A+) = 0.2 follows from summing the JPD over A+; e.g., P(A+, B+, C+) = 0.2 × 0.1 × 0.3 = 0.006.)

Up to exponential savings in the number of parameters!
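The factorization is easy to verify numerically; a tiny Python sketch that recovers the JPD entries of this slide’s A → B, A → C network from the five factored parameters:

    def joint(a, b, c):
        # P(A, B, C) = P(A) * P(B | A) * P(C | A); True = '+', False = '-'.
        p_a     = 0.2 if a else 0.8
        p_b_pos = 0.1 if a else 0.5      # P(B+ | A)
        p_c_pos = 0.3 if a else 0.6      # P(C+ | A)
        return (p_a
                * (p_b_pos if b else 1 - p_b_pos)
                * (p_c_pos if c else 1 - p_c_pos))

    print(joint(True, True, True))    # 0.2 * 0.1 * 0.3 = 0.006
    print(joint(False, True, True))   # 0.8 * 0.5 * 0.6 = 0.240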

Page 103: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

As we will see in the causal discovery segment, BNs are a useful language for automated causal discovery because the Markov property captures causality, thus:
– Revealing confounders
– Modeling “explaining away”
– Modeling/understanding selection bias
– Modeling causal pathways
– Modeling manipulation in the presence of confounders
– Modeling manipulation in the presence of selection bias
– Identifying targets for manipulation in causal chains

Page 104: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

Once we have a BN model of some domain we can ask questions:

[Graph: a DAG over nodes A through J]

• Forward: P(D+, I- | A+) = ?
• Backward: P(A+ | C+, D+) = ?
• Forward & backward: P(D+, C- | I+, E+) = ?
• Arbitrary abstraction / arbitrary predictor and predicted variables

Page 105: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: The Bayesian Network Model and Its Uses

The Markov property tells us which variables are important for predicting a variable (its Markov Blanket), thus providing a principled way to reduce variable dimensionality.

[Graph: the Markov Blanket of node A highlighted within the DAG over nodes A through J]

Page 106: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: Demonstration of Flexible Representation

A BN in which FCID holds

[Graph: disease nodes D1, D2, D3 with arcs into finding nodes F1, F2, F3, F4]

Page 107: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: Demonstration of Flexible Representation

A BN in which MEE holds

[Graph: disease nodes D1, D2, D3 and finding nodes F1 through F4, structured to encode mutually exclusive diseases]

Page 108: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: Demonstration of Flexible Representation

A BN in which MEE and FCID hold

[Graph: a single disease node D with arcs into findings F1, F2, F3, F4 (the Naïve Bayes structure)]

Page 109: Machine Learning Methods for Decision Support and Discovery

Bayesian Networks: Demonstration of Flexible Representation

Hybrid assumptions

[Graph: a hybrid structure over disease nodes D1 through D3 and finding nodes F1 through F9, mixing the above independence assumptions]

Page 110: Machine Learning Methods for Decision Support and Discovery

Inference Algorithms

Exact:
– Lauritzen & Spiegelhalter
– Cooper: recursive decomposition

Stochastic-approximate:
– Likelihood weighting
– Dagum and Luby

Variational (approximate but not stochastic):
– Jordan et al. (1998): solves queries in QMR-DT in seconds

References:
An Introduction to Variational Methods for Graphical Models (1998). Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K. Saul. Machine Learning.
An Optimal Approximation Algorithm for Bayesian Inference (1997). Paul Dagum, Michael Luby. Artificial Intelligence.
Probabilistic Reasoning in Expert Systems: Theory and Algorithms. Richard E. Neapolitan. John Wiley, 1990.

Page 111: Machine Learning Methods for Decision Support and Discovery

Theoretical Complexity

Inference is NP-hard (Cooper (exact), 1990; Dagum and Luby (stochastic), 1993).

Learning is NP-hard (Chickering, 1994; Bouckaert, 1995).

However: many widely-applicable algorithms are very efficient (allowing up to thousands of variables for inference and >100,000 variables for focused learning).

Page 112: Machine Learning Methods for Decision Support and Discovery

Automatic Construction of Bayesian Networks from Data

For causal discovery:
– Pearl, Verma (1988)
– Spirtes, Glymour, Scheines (1991)

For classification / automatic DSS construction:
– Herskovits, Cooper (1991): Kutato (entropy-based)
– Cooper, Herskovits (1992): K2 (Bayesian)

[To be discussed at length in the second part]

Reference:
Computation, Causation, and Discovery. Clark Glymour and Gregory F. Cooper (editors), 2000, AAAI Press.

Page 113: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Sparse Candidate Algorithm

Repeat:
  1. Restriction step: select candidate parents Ci for each variable Xi
  2. Maximization step: set the new best network B to the Gn that maximizes a Bayesian score Score(G | D), where G ranges over the class of BNs for which Pa_G(Xi) ⊆ Ci for every Xi (each Ci including Pa_Bprev(Xi), the variable’s parents in the previous best network)
Until convergence

Return B

Page 114: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Sparse Candidate Algorithm

SCA proceeds by selecting up to k candidate parents for each variable on the basis of pair-wise association. Then a search is performed for a best network within the space defined by the union of all potential parents identified in the previous step. The procedure is iterated by feeding the parents of the currently best network back into the restriction step.

Theorem 6 (Friedman): SCA monotonically improves the quality of the examined networks.

Convergence criterion: no gain in score, and a maximum number of cycles with no improvement in score.

Page 115: Machine Learning Methods for Decision Support and Discovery

Supplementary: Bayesian Networks: Learning Partial Models

Partial model: feature (Friedman et al.). Examples:
– Order relation (is X an ascendant or descendant of Y?)
– Markov Blanket membership (is A in the MB of B?)

We want:

P(f | D) = Σ_G f(G) × P(G | D)

and we approximate it by:

Conf(f) = (1/m) Σ_{i=1}^{m} f(Gi)
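The Conf(f) estimator is just an average of indicator values over m networks learned from replicates of the data; a minimal Python sketch, leaving the graph representation abstract:

    def confidence(feature, graphs):
        # Conf(f): fraction of the m learned networks in which feature f holds.
        return sum(1 for g in graphs if feature(g)) / len(graphs)

    # e.g., with graphs stored as edge sets, an edge feature could be:
    # has_edge_AB = lambda g: ("A", "B") in g
    # confidence(has_edge_AB, learned_graphs)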

Page 116: Machine Learning Methods for Decision Support and Discovery

Supplementary: BN Applications

Page 117: Machine Learning Methods for Decision Support and Discovery

Supplementary: Pathfinder

– Heckerman et al., early 90s
– Diagnosis and test selection for lymph-node pathology
– Assumes MEE but not FCID
– Similarity networks (a special enhancement to BNs) allow more efficient knowledge acquisition
– Myopic test selection strategy (similar to Gorry and Barnett) & combined monetary cost / expected survival utility measure
– Led to the Intellipath commercial product

References:
Heckerman DE, Horvitz EJ, Nathwani BN. Toward normative expert systems: Part I. The Pathfinder project. Methods Inf Med. 1992 Jun;31(2):90-105.
Heckerman DE, Nathwani BN. Toward normative expert systems: Part II. Probability-based representations for efficient knowledge acquisition and inference. Methods Inf Med. 1992 Jun;31(2):106-16.
Heckerman DE, Nathwani BN. An evaluation of the diagnostic accuracy of Pathfinder. Comput Biomed Res. 1992 Feb;25(1):56-74.

Page 118: Machine Learning Methods for Decision Support and Discovery

Supplementary: QMR-DT

– Stanford, late 80s / early 90s
– Probabilistic formulation of the QMR KB (a subsequent version of INTERNIST-I)
– Full scope of INTERNIST/QMR
– Uses a two-layered BN representation, no MEE or FCID assumptions, and stochastic inference

Reference:
Shwe M, Cooper G. An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network. Comput Biomed Res. 1991 Oct;24(5):453-75.

Page 119: Machine Learning Methods for Decision Support and Discovery

Supplementary: Analysis of the Sensitivity of BNs to Errors in Probability Specification

Henrion et al. 1996:
– System: CPCS (a subset of QMR)
– Results: average probabilities assigned to the actual diseases showed small sensitivity even to large amounts of noise.
– Explanation: one reason is that the criterion for performance is the average probability of the true hypotheses, which is insensitive to symmetric noise distributions. But even asymmetric, log-odds-normal noise has modest effects. A second reason is that the gold-standard posterior probabilities are often near zero or one, and are little disturbed by noise.

Reference:
Max Henrion, Malcolm Pradhan, Brendan Del Favero, Kurt Huang, Gregory Provan and Paul O’Rorke. Why is Diagnosis Using Belief Networks Insensitive to Imprecision in Probabilities? UAI, 1996.


Supplementary: Temporal, Causal and Spatial Reasoning with Probabilistic Methods

Haddawy 1995: temporal, causal, and probabilistic FOL
Aliferis (1997, 1998): temporal and causal Bayesian Networks with clear causal-temporal semantics
Spatio-temporal BNs for GI endoscopy

References:
Aliferis CF, Cooper GF. Temporal representation design principles: an assessment in the domain of liver transplantation. Proc AMIA Symp. 1998;:170-4.
Ngo L, Haddawy P, Krieger RA, Helwig J. Efficient temporal probabilistic reasoning via context-sensitive model construction. Comput Biol Med. 1997 Sep;27(5):453-76.


Supplementary: Dynamic construction of BNs from Knowledge Bases to solve problem instances of interest (KBMC)

Haddawy 1995: probabilistic FOL KB → BN
Aliferis (1996-8): Modifiable Temporal BNs, i.e., temporal and causal Bayesian Networks with adjustable hybrid granularities, a variable time horizon, and interacting subnetworks (“contexts”)
Koller et al. 1997: object-oriented BNs

References:
Haddawy P. Generating Bayesian Networks from Probability Logic Knowledge Bases. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, July 1994.
Koller D, Pfeffer A. Object-Oriented Bayesian Networks. UAI, 1997.
Aliferis CF. A Temporal Representation and Reasoning Model for Medical Decision Support Systems. Doctoral Thesis, 1998.


Supplementary: Other applications

Parsing of natural language with BNs
– Haug et al. 1999
– Charniak et al.

Extensive applications for classification and discovery: Margaritis et al. (1999), Aliferis & Tsamardinos et al. (2001-2003):
– focused causal discovery (parents-children sets or Markov Blankets)
– feature selection
[to be discussed at length in the second part]

References:
Fiszman M, Chapman WW, Evans SR, Haug PJ. Automatic identification of pneumonia related concepts on chest x-ray reports. Proc AMIA Symp. 1999;:67-71.
Charniak E. Bayesian Networks Without Tears. AI Magazine, 1991.


Simple Bayes Revisited

Domingos and Pazzani 1997:
– The Naïve Bayes assumptions are sufficient for accurate probability estimates in the sample limit, but they are not necessary for a wide variety of learning problems when accuracy is the evaluation metric
– In small samples, even if the assumptions are violated, SB can do better than more expressive representations, owing to the bias-variance decomposition of the error
– The best way to correct (extend) SB is not to join highly-associated findings

These results explain the excellent performance of SB in text categorization, with thousands of variables (words), and in many other learning/inference tasks, even against more expressive representations

Reference:
Domingos P, Pazzani M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 1997.
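The accuracy-versus-calibration point can be seen in a toy Python sketch (ours, not from the paper): duplicating a finding violates the independence assumption and pushes the posterior toward the extremes, yet the argmax, and hence the zero-one-loss decision, is unchanged:

```python
# Toy illustration: violating the Naive Bayes independence assumption
# (here, by duplicating a finding) distorts the posterior estimate but
# leaves the predicted class unchanged.

def nb_posterior(prior, likelihoods):
    """Naive Bayes posterior over two classes.

    prior       -- [P(class0), P(class1)]
    likelihoods -- list of (P(f | class0), P(f | class1)) per finding
    """
    joint = list(prior)
    for l0, l1 in likelihoods:
        joint = [joint[0] * l0, joint[1] * l1]
    z = sum(joint)
    return [j / z for j in joint]

prior = [0.5, 0.5]
finding = (0.9, 0.2)   # P(f | class0), P(f | class1)

print(nb_posterior(prior, [finding]))           # ~[0.82, 0.18]
print(nb_posterior(prior, [finding, finding]))  # ~[0.95, 0.05]: overconfident,
                                                # but argmax (class 0) is the same
```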


Other Restricted Bayesian Classifiers

The TAN (tree-augmented Naïve Bayes) classifier augments Naïve Bayes with edges among the findings, such that the resulting network among the findings is a tree

[Figure: a TAN network: the class node D points to each finding F1-F4, and tree edges connect the findings]
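In symbols (our restatement of the standard TAN factorization, not a formula from the slides): each finding has the class D and at most one other finding as parents, so classification computes

```latex
P(D \mid F_1,\ldots,F_n) \;\propto\; P(D)\,\prod_{i=1}^{n} P\big(F_i \mid D,\ \mathrm{Pa}(F_i)\big)
```

where Pa(F_i) is the tree parent of F_i among the findings (empty for the root of the tree).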


Other Restricted Bayesian Classifiers

The TAN multinet classifier uses a different TAN for each value of D, and then chooses the predicted class to be the value of D that has the highest posterior given the findings

[Figure: a TAN multinet: a separate TAN over findings F1-F4 for each class value D=1, D=2, D=3]
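In the same notation (again our restatement), with a separate tree T_d learned for each class value d, the multinet predicts

```latex
\hat{d} \;=\; \arg\max_{d}\; P(D=d)\; P(F_1,\ldots,F_n \mid T_d)
```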


Other Restricted Bayesian Classifiers

The main advantages of TANs and TAN multinets are superior performance to Naïve Bayes and efficient learning

Several other restricted Bayesian classifiers have been proposed, e.g., non-tree-augmented Naïve Bayes (BANs) and the corresponding multinets

References:
– Friedman et al., Machine Learning, 1997
– Cheng and Greiner, UAI 1999


Supplementary: Myths Surrounding Bayesian Decision Support in General and Classifiers in Particular

– Bayes’ Theorem requires that diseases are mutually exclusive and that findings are independent
– Corollary: Simple Bayes is wrong whenever the MEE and FICD assumptions do not hold
– Related: a good way to fix Simple Bayes is to “lump together” highly-dependent findings
– Bayes’ probabilities are difficult to assess, and because errors accumulate the final conclusions are wrong
– Related: to apply Bayesian inference you need an astronomical number of probabilities
– Related: Bayesian inference is too subjective because probabilities are not frequency-based


Supplementary: Conclusions

Since many of the previous advances are very recent, the part of the medical informatics community that does not specialize in probabilistic systems has not yet caught up with them

Dissemination issues aside, the significant progress in the theory and technology of Bayesian Networks during the 1990s has yielded:
– algorithms that allow efficient learning and inference with thousands of variables, without unrealistic assumptions
– formal handling of temporal and causal reasoning
– decision making compliant with decision theory and probability theory
– well-understood properties
– a plethora of tools for representation, inference, learning, and experimentation
– pioneering applications in many medical areas


Supplementary: Bayesian Networks: References

Simple Bayes weaknesses:
– M. Peot, Proc. UAI 96
– M. Minsky, Proceedings of the IRE, 49:8-30, 1961

Simple Bayes applications:
– H. Warner et al., Annals of the NYAS, 115:2-16, 1964
– F. de Dombal et al., BMJ, 1:376-380, 1972

Full Bayesian Classifier:
– T. Mitchell, Machine Learning, McGraw-Hill, 1997

Bayesian Networks as a knowledge representation:
– J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 1988

Certainty Factor/PS weaknesses:
– D. Heckerman et al., Proc. UAI 86


Supplementary: Bayesian Networks: References (cont’d)

Causal discovery using BNs:
– P. Spirtes et al., Causation, Prediction and Search, MIT Press, 2000
– C. Glymour, G. Cooper, Computation, Causation and Discovery, AAAI Press/MIT Press, 1999
– C. Aliferis, G. Cooper, Proc. UAI 94

Textbooks on BNs:
– R. Neapolitan, Probabilistic Reasoning in Expert Systems, John Wiley, 1990
– F. Jensen, An Introduction to Bayesian Networks, UCL Press, 1996
– E. Castillo et al., Expert Systems and Probabilistic Network Models, Springer, 1997

Learning BNs:
– G. Cooper et al., Machine Learning 9:309-347, 1992
– E. Herskovits, Report No. STAN-CS-91-1367 (Thesis)
– D. Heckerman, Technical Report MSR-TR-95-06, 1995
– J. Pearl, Causality, Cambridge University Press, 2001
– N. Friedman et al., J Comput Biol, 7(3/4):601-620, 2000, and Proc. UAI 99

Comparison to other learning algorithms:
– G. Cooper, C. Aliferis et al., Artificial Intelligence in Medicine, 9:107-138, 1997