
© Cox Associates, 2003. Ph. 303-388-1778; Fax 303-388-0609. [email protected]

Causal graphs for risk assessment and risk management

Tony Cox

10-2-03


Main themes

• Risk analysis is largely about how to use observed information on some variables to help predict the values of other, unobserved variables.
  – Exposure → Future response
  – Current response → Past exposures

• Causal graphs/Bayesian networks provide a systematic way to use information on some variables to predict (a) the probable values of others and (b) the likely effects of interventions.

• Influence diagrams are causal graphs with decision and value nodes.


Main themes (Cont.)

• Methods for learning causal graphs from data, inferring likely values of unobserved variables from values of observed ones, and optimizing decisions are heavily computational.
  – Good algorithms and effective approximate heuristics are available for inference and decision-making. (Mature technology, many choices.)

– Learning models from data automatically or with minimal user knowledge input is current cutting edge.

• Commercial software and academic shareware are closing the gap between theory and practice.– “Ready for prime time” in RA applications


Quantifying Uncertainty

• Uncertainty about a random variable, X, can be quantified by its entropy, denoted H(X).

• X and Y are statistically dependent if and only if each gives some information (i.e., reduces expected uncertainty/entropy) about the other.

• Conditional entropy: H(X | Y) ≤ H(X)
  – Equal iff X and Y are statistically independent

• X and Y always provide equal information about each other: H(X) – H(X | Y) = H(Y) – H(Y | X) = mutual information between X and Y.


Uncertainty and Bayes Rule

• If random variables (or uncertain quantities) are statistically independent, then their values can be predicted easily, e.g., by sampling from their PDFs.

• If variables are statistically dependent, then Bayes Rule allows their values to be predicted from whatever is observed (e.g., by sampling from their conditional distributions).
  – Predicts unobserved from observed values.
  – True risk is always unobserved.

• Conditional independence among variables helps reveal potential causal relations among them.


Statistical Independence for events and RVs

• Events A and B are statistically independent if and only if: Pr(A & B) = Pr(A)Pr(B)
  – Now, Pr(A & B) = Pr(A)*Pr(B | A)

– Independence: Pr(B | A) = Pr(B) = Pr(B | not-A).

• Random variables X and Y are statistically independent if Pr(X, Y) = Pr(X)Pr(Y) (“joint PDF equals product of marginals”)
  – Meaning: Pr(X = x and Y = y) = Pr(X = x)*Pr(Y = y) for all pairs of possible values, (x, y).


Example

• Joint probability table/PDF for X and Y.

• Q: Are X and Y statistically independent?

• Find marginal distributions Pr(X) and Pr(Y)

              Y = 0   Y = 1
X = Low (0)    0.0     0.2
X = High (1)   0.3     0.5


Marginalization Principle: “Joint determines marginal”

• Given a joint PDF table, the marginal probability distribution of any variable(s) can be found… how?

• A: Sum the joint PDF over values of the other variable(s).
• In this sense, a joint PDF is an “oracle” that can answer any query about any of its variables by summing over appropriate cells in the table.
• Could be useful…
• But the joint PDF is too big to estimate, store, and sum over!
• Bayesian networks provide a solution based on sparseness of the table / statistical independence.


“DAG model” notation: X → Y

• “Directed acyclic graph” (DAG) notation
  – “Bayes net”, “Bayesian belief network”, “causal graph”, “DAG model”, “influence diagram”
• X → Y means the same as [Pr(X), Pr(Y | X)]
• Pr(X) is stored at input node X.
  – Gives Pr(X = x) for each possible value x of X.
  – Pr(X) = “marginal” distribution of input X.
• Pr(Y | X) is stored at node Y.
  – Gives Pr(Y = y | X = x) for each x and y.
  – Called the conditional probability table (CPT) of Y.
• Extends to many variables: X → Y → Z → W


Example of a Bayesian Network (BN) for Health Risk Assessment

action → exposure → response → consequence → QALYs
(with behavior, susceptibility, and treatment as additional parent nodes)

Risk assessment estimates the probable consequences of actions, given all available information.

Bayes Rule estimates probabilities of values of unobserved variables conditioned on all observed ones.


Bayes Rule: What it does

• Bayes Rule lets us calculate Pr(X | Y) from Pr(X) and Pr(Y | X), for any X and Y.
  – X is unobserved, Y is observed
  – Needed inputs: Pr(X) = prior, Pr(Y | X) = likelihood
  – Output: Pr(X | Y) = posterior distribution for X given Y
  – “Reverses the arc” in the probability model X → Y.

• Multivariate Bayes Rule: X and Y can be sets of uncertain quantities (e.g., vectors)

• Provides a consistent, computationally practical framework for data-based modeling, estimation, prediction, inference, and decision-making!


Bayes Rule: How it Works

• Bayes Rule: [Pr(Y | X) and Pr(X)] → Pr(X | Y)
• Q: How is this possible?
• A: Use Pr(Y | X) and Pr(X) to get the joint distribution:
  – Q: How?
  – A: Multiply! Pr(X, Y) = Pr(Y | X)*Pr(X)
  – “Joint = conditional times marginal”
• Then use Pr(X, Y) to calculate Pr(X | Y) (How?)
  – Pr(X = x | Y = y) = Pr(x, y)/Pr(y) = Pr(cell)/Pr(column)
• Pr(X | Y) = Pr(X, Y)/Pr(Y) = Pr(Y | X)Pr(X)/Pr(Y)
  – “Conditional = joint divided by marginal”
  – “Posterior = likelihood times prior, normalized”


Pop Quiz: Find Pr(X = 8 | Y = 2)

          Y = 1   Y = 2   Y = 3
X = 4      0.0     0.2     0.1
X = 8      0.1     0.2     0.0
X = 16     0.1     0.0     0.3
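One way to check the quiz answer, using the “conditional = Pr(cell)/Pr(column)” rule from the previous slide (a Python sketch, not in the original deck):

```python
# Joint PDF from the pop-quiz table: X in {4, 8, 16}, Y in {1, 2, 3}.
joint = {
    (4, 1): 0.0,  (4, 2): 0.2,  (4, 3): 0.1,
    (8, 1): 0.1,  (8, 2): 0.2,  (8, 3): 0.0,
    (16, 1): 0.1, (16, 2): 0.0, (16, 3): 0.3,
}

def conditional(joint, x, y):
    """Pr(X = x | Y = y) = Pr(cell)/Pr(column)."""
    pr_column = sum(p for (_, yv), p in joint.items() if yv == y)
    return joint[(x, y)] / pr_column

print(conditional(joint, 8, 2))  # 0.2 / (0.2 + 0.2 + 0.0) = 0.5
```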


Details on Making Predictions

• DAG model: X → Y. We will predict Y: Pr(Y = y)
• If X = x is observed, then the prediction for Y is just the CPT column, Pr(Y = y | X = x), for each value of y.
• But what if X is not observed?
• Then Pr(Y = y) = Σx Pr(Y = y | X = x)Pr(X = x)
  = Σx Pr(Y = y & X = x) (since exactly one x will occur!)
  – “Marginalize out” X by summing over its possible values, x
• Fast, modern prediction: Monte Carlo simulation!
  – Sample the values of X from Pr(X = x)
  – Sample the values of Y from the CPT Pr(Y = y | X = x)


Example: Predict Pr(C) for random patient (Source: http://research.microsoft.com/~breese/)


Manual Solution for Pr(C = none)

• Model: S → C, i.e., Smoking → Cancer
• Rule: Pr(Y = y) = Σx Pr(Y = y | X = x)Pr(X = x)

Pr(C = c) = Σs Pr(C = c | S = s)Pr(S = s)

Pr(C = none) = Σs Pr(C = none | S = s)Pr(S = s)

= Pr(C = none | S = no)Pr(S = no) +

Pr(C = none | S = light)Pr(S = light) +

Pr(C = none | S = heavy)Pr(S = heavy) =

(0.96)(0.80) + (0.88)(0.15) + (0.60)(0.05) = 0.93

= [0.96, 0.88, 0.60]*[0.80, 0.15, 0.05]'


Better Living through linear algebra

• Rule: Pr(Y = y) = Σx Pr(Y = y | X = x)Pr(X = x)

• Matrix-vector notation: Pr(y) = Pr(y | x)*Pr(x)
  x = vector of probabilities for values of x, y = vector of probabilities for values of y, Pr(y | x) = CPT

Pr(S) = Pr[none, light, heavy] = s = [0.8, 0.15, 0.05];

Pr(C | S) = P = [.96, .88, .60; .03, .08, .25; .01, .04, .15];

• Pr(C) = Pr(C | S)Pr(S) → c = P*s'

• MATLAB™ program: c = P*s' = [0.93, 0.0485, 0.0215] for [none, benign, malignant]
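The same matrix-vector product, written out in plain Python as a sketch mirroring the MATLAB one-liner (no toolbox needed; the numbers are the s and P from the slide):

```python
# Pr(C) = Pr(C | S)Pr(S): CPT times marginal, as a matrix-vector product.
s = [0.80, 0.15, 0.05]            # Pr(S) for [none, light, heavy]
P = [[0.96, 0.88, 0.60],          # Pr(C = none      | S = none, light, heavy)
     [0.03, 0.08, 0.25],          # Pr(C = benign    | S)
     [0.01, 0.04, 0.15]]          # Pr(C = malignant | S)

c = [sum(P[i][j] * s[j] for j in range(3)) for i in range(3)]
print(c)  # approximately [0.93, 0.0485, 0.0215] for [none, benign, malignant]
```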


Bayes Applet Approach
http://www.cs.ubc.ca/labs/lci/CIspace/bayes

• Draw DAG model: Smoking → Cancer

• Populate marginal distributions and CPTs– Same information as s and P in MATLAB

• Click on a node (“query” it) to get its PDF!

• Program computes PDFs of nodes by exact formulas (marginalizing out, and generalizations):
  Pr(Y = y) = Σx Pr(Y = y | X = x)Pr(X = x)


Prediction: Summary

• Given a probability model Pr(Y = y | X = x) (usually summarized in a CPT, P)
• Given input probabilities Pr(X = x) = marginal PDF, x
• The predicted probability distribution of Y is found by summing products: [Pr(Y = y) for each y] = y = P*x'
• Bayes Applet automates the calculations
• MC simulation makes (approximate) calculations easy even when Pr(Y = y | X = x) cannot be stored as a table.
• Framework applies to arbitrary probabilistic models
  – The input-output relations can be very nonlinear, have high-order interactions, etc.
  – Applies approximately to simplified parametric models


Updating Beliefs by Conditioning

• Given a model X → Y with prior Pr(X), likelihood Pr(Y | X), and observed data or evidence Y = y, find the posterior: Pr(X | Y = y). (X = unobserved, Y = observed)
  – Probable diagnosis, explanation, etc. for observations
• Pr(Y = y) = Σv [Pr(Y = y & X = v)] = Σv [Pr(y | v)Pr(v)]
• Abbreviated notation: Pr(X = x | Y = y) = Pr(x | y)
• Method: Pr(x | y)Pr(y) = Pr(x, y) = Pr(y | x)Pr(x)
  So, Pr(x | y) = Pr(x, y)/Pr(y) (defn. of conditional PDF)
  = Pr(y | x)Pr(x)/Pr(y) (joint = conditional times marginal)
  = Pr(y | x)Pr(x)/[Σv Pr(y | v)Pr(v)] (Bayes Rule)
  where the sum is over all possible values, v, of X.
• This is Bayes Rule!


Pr(S | C) in Smoker Problem

• Q: What is the probability that someone is a heavy smoker (S = heavy), given that she is diagnosed with malignant lung cancer (C = malignant)?
• A: Pr(s | c) = Pr(c | s)Pr(s)/[Σv Pr(c | v)Pr(v)]
• Pr(heavy | malignant) =
  Pr(malignant | heavy)Pr(heavy)/Pr(malignant)
  = (0.15)*(0.05)/((.01)*(.8) + (.04)*(.15) + (.15)*(.05))
  = 0.3488. (Check with the Bayes Applet solver.)

• In practice, Probabilistic Logic Sampling (forward Monte Carlo simulation with rejection of samples that disagree with observations) and more efficient sampling-based methods can be used for inference in large BNs.
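A minimal sketch of the Probabilistic Logic Sampling idea for this two-node model (illustrative Python, not the applet's actual implementation): sample forward from the model, reject samples that disagree with the observation C = malignant, and count how often S = heavy among the survivors.

```python
import random

random.seed(0)

pr_s = {"none": 0.80, "light": 0.15, "heavy": 0.05}   # prior Pr(S)
pr_c_given_s = {                                      # CPT Pr(C | S) from the example
    "none":  {"none": 0.96, "benign": 0.03, "malignant": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malignant": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malignant": 0.15},
}

def draw(dist):
    """Sample one value from a discrete distribution {value: prob}."""
    u, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if u < cum:
            return value
    return value  # guard against floating-point rounding

kept = heavy = 0
for _ in range(200_000):
    s = draw(pr_s)
    c = draw(pr_c_given_s[s])
    if c == "malignant":        # reject samples that disagree with the evidence
        kept += 1
        heavy += (s == "heavy")

print(heavy / kept)  # should be near the exact answer, 0.3488
```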


Bayes Applet Approach
http://www.cs.ubc.ca/labs/lci/CIspace/bayes

• Draw DAG model, X → Y: Smoking → Cancer

• Populate marginal distributions and CPTs– Pr(x) = Pr(X = x), Pr(y | x) = Pr(Y = y | X = x)

• Click on a node (“query” it) to get its PDF!

• Program computes PDFs of nodes by exact formulas (marginalizing out, and generalizations):
  Predict: Pr(y) = Σx Pr(Y = y | X = x)Pr(X = x)
  Infer: Pr(x | y) = Pr(x, y)/Pr(y) = Pr(y | x)Pr(x)/Pr(y)


Example: Diagnostic Test

• 1% of population is sick, 3% has allergy, 96% is well

• An imperfect medical test will yield a positive (“sick”) reading as follows:
  – Pr(test = sick | person is sick) = 0.97
  – Pr(test = sick | person is well) = 0.05
  – Pr(test = sick | person has allergy) = 0.10
• Q: What is Pr(person is sick | test says sick)?


Solution to Diagnosis (Cont.)

Pr(test sick | sick)Pr(sick)/Pr(test sick) =

(0.97)(0.01)/0.0607 = 0.1598 (answer)

• Even this very reliable test is right only 16% of the time when it diagnoses illness!

• Explanation: Illness is unlikely!

• Many physicians are surprised by correct Bayesian inference in diagnosis.
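The arithmetic behind this surprise can be checked directly (a short sketch; the numbers are those given in the problem statement):

```python
# Bayes Rule for the diagnostic test: posterior = likelihood * prior / evidence.
prior = {"sick": 0.01, "allergy": 0.03, "well": 0.96}
likelihood = {"sick": 0.97, "allergy": 0.10, "well": 0.05}  # Pr(test says sick | state)

pr_test_sick = sum(likelihood[state] * prior[state] for state in prior)
posterior_sick = likelihood["sick"] * prior["sick"] / pr_test_sick

print(round(pr_test_sick, 4), round(posterior_sick, 4))  # 0.0607 0.1598
```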


Bayesian Modeling and Updating

• Likelihood Model: Pr(Y | X)
  – Pr(outputs | inputs) vs. Pr(data | model)
• Prior data/beliefs: Pr(X), X = predictive model
• Model-learning cycle:
  X = model assumptions + inputs
  Y = predicted outputs/observed data
  Model predictions: [Pr(X), Pr(Y | X)] → predictions Pr(Y)
  Model updating/learning: [Pr(X) + data on Y] → Pr(X | Y)
  – After observing enough data, we learn a Pr(X | data) that makes accurate predictions (if X ranges over enough models, data are adequately diverse, etc.)
  – Bayesian model-averaging (BMA) allows uncertain X


Applications of Bayes in RA

• Bayes Rule: X → Y ⇒ Pr(X | Y)
  – X = unobserved, Y = observed
  – E.g., Pr(Y | X) = Pr(data | exposure, model)
• Diagnosis: X = disease, Y = symptom
• Hazard identification: X = cause, Y = health effect
• Exposure assessment: X = true exposures, Y = measured (e.g., sample) values
• Dose-response modeling: X = true dose-response model, Y = observed dose-response data
• Risk characterization: X = true models, exposure distributions (and risk), Y = data


Bayesian Networks

• Generalizes the preceding to graphs (“DAGs”) with many nodes and arcs
• Example: X → Y → Z
  – (X → Y) = PBPK model = (exposure → internal dose)
  – (Y → Z) = PD model = (internal dose → response)
• Extension allows composition of probabilistic relations: (X → Z) = (X → Y)(Y → Z).
  Q: Find Pr(Z = z | X = x), Y unobserved
  A: Σy Pr(Z = z | Y = y)Pr(Y = y | X = x)
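This marginalize-over-Y composition can be sketched in Python. Only the pattern comes from the slide; the CPT numbers below are invented for illustration:

```python
# Chain X -> Y -> Z. Compose the two CPTs by marginalizing out Y:
# Pr(Z = z | X = x) = sum over y of Pr(Z = z | Y = y) Pr(Y = y | X = x).
pr_y_given_x = {"lo": {"lo": 0.9, "hi": 0.1},     # Pr(Y | X): exposure -> internal dose
                "hi": {"lo": 0.3, "hi": 0.7}}
pr_z_given_y = {"lo": {"no": 0.99, "yes": 0.01},  # Pr(Z | Y): internal dose -> response
                "hi": {"no": 0.80, "yes": 0.20}}

def compose(p_y_x, p_z_y):
    """Return the composed CPT Pr(Z | X), marginalizing out the middle variable Y."""
    return {x: {z: sum(p_z_y[y][z] * p_y_x[x][y] for y in p_y_x[x])
                for z in ("no", "yes")}
            for x in p_y_x}

pr_z_given_x = compose(pr_y_given_x, pr_z_given_y)
print(pr_z_given_x)  # each row still sums to 1
```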


Expert System for Many Variables

• How to design and build a probabilistic expert system to answer all questions about risk (given the joint PDF)?
• Goal: Answer user queries with probabilities, taking into account all known observations/facts/data.
• Method: Condition the joint distribution Pr(X) on any known facts (i.e., exclude all cells that are inconsistent with known facts). Then sum over all cells satisfying the query conditions and normalize to get the total conditional probability that the query is satisfied.

• Challenge: Joint PDF of all variables is impossible to store (or estimate)!


Key Idea: Factoring a Joint PDF

• For two variables, we have
  Pr(X = x, Y = y) = Pr(X = x)Pr(Y = y | X = x)
  More briefly: Pr(x, y) = Pr(x)Pr(y | x)
• For n variables, we have (perhaps after renumbering variables so that all of a variable’s direct parents precede it in the new ordering):
  Pr(x1, x2, …, xn) = Pr(x1)Pr(x2 | x1)Pr(x3 | x1, x2)…Pr(xn | x1, x2, …, xn-1)
• Most of these terms simplify for sparse DAGs!
• Example: Find Pr(x1, x2, x3, x4) in the following DAG:

  X1 → X2 → X3 ← X4

• A: Pr(x1, x2, x3, x4) = Pr(x1)Pr(x2 | x1)Pr(x4)Pr(x3 | x2, x4)
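Evaluating this factorization is straightforward. Only the factorization pattern comes from the slide; every probability number below is invented for illustration:

```python
import itertools

# Factored joint for the DAG X1 -> X2 -> X3 <- X4:
# Pr(x1, x2, x3, x4) = Pr(x1) Pr(x2 | x1) Pr(x4) Pr(x3 | x2, x4).
pr_x1 = {0: 0.6, 1: 0.4}
pr_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}            # Pr(x2 | x1)
pr_x4 = {0: 0.5, 1: 0.5}
pr_x3 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},  # Pr(x3 | x2, x4)
         (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}

def joint(x1, x2, x3, x4):
    return pr_x1[x1] * pr_x2[x1][x2] * pr_x4[x4] * pr_x3[(x2, x4)][x3]

# Sanity check: the factored joint sums to 1 over all 16 assignments.
total = sum(joint(*v) for v in itertools.product((0, 1), repeat=4))
print(total)  # 1.0 (up to rounding)
```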


Example of a Bayesian Network (BN) for Risk Assessment

a → x → r → c → Q (act → exposure → response → consequence → QALYs)
b → s → m (behavior → susceptibility → treatment), with b → x, s → r, m → c. All CPTs and Pr(b) given.

• Factor the joint probability Pr(b, s, m, x, r, c, Q) given a:
  Pr(b)Pr(s | b)Pr(m | s)Pr(x | b, a)Pr(r | s, x)Pr(c | r, m)Pr(Q | c)

• BN technology lets us find the PDF of any unobserved variable(s) given observed values of any others

• Monte Carlo (MC) composes components into whole


Monte Carlo Prediction for X → Y

• Rule: Pr(Y = y) = Σx Pr(Y = y | X = x)Pr(X = x)
• Suppose that Pr(Y = y | X = x) is hard to calculate but easy to simulate (e.g., it is a large, complex spreadsheet simulation model). What to do?
• Answer: Sample x values from Pr(X = x).
  – For univariate X, partition the unit interval into sub-intervals of lengths Pr(x), then use a uniform U[0, 1] random number generator
• For each x, calculate/simulate a corresponding value of y from Pr(Y = y | X = x)
• Fraction of sampled Y values equal to y → Pr(Y = y)
• Adaptive importance sampling and variance reduction techniques (e.g., Latin hypercube sampling) improve the efficiency of this basic idea.
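The partition-the-unit-interval sampler can be sketched as follows (illustrative; the distribution is invented):

```python
import random

# Lay sub-intervals of length Pr(x) end to end on [0, 1); a uniform draw
# picks the sub-interval it lands in.
def sample_discrete(dist, u):
    """dist: {value: prob}; u: a uniform [0, 1) random number."""
    cum = 0.0
    for value, p in dist.items():
        cum += p
        if u < cum:
            return value
    return value  # guard against floating-point rounding

random.seed(1)
pr_x = {"a": 0.2, "b": 0.5, "c": 0.3}
draws = [sample_discrete(pr_x, random.random()) for _ in range(100_000)]
print(draws.count("b") / len(draws))  # close to 0.5
```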


Gibbs Sampling for BNs

a → x → r → c → Q; b → s → m, with b → x, s → r, m → c. All CPTs and Pr(b) given.

• Factor the joint PDF for all variables:
  – Pr(b)Pr(s | b)Pr(m | s)Pr(x | b, a)Pr(r | s, x)Pr(c | r, m)Pr(Q | c)
• Gibbs sampling: Sample b from Pr(b), then s from Pr(s | b), then… then Q from Pr(Q | c)

• Repeat many times to obtain a random sample from the joint PDF!!! (Special case of MCMC)

• Can condition on observed values (with more or less efficiency) to obtain samples from posterior joint PDF of all unobserved values. (WinBUGS)
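The sample-each-variable-from-its-CPT loop described above can be sketched for just the b → s → m fragment of the factorization (illustrative Python; all CPT numbers are invented):

```python
import random

random.seed(2)

pr_b = {0: 0.7, 1: 0.3}                                      # Pr(b)
pr_s_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}    # Pr(s | b)
pr_m_given_s = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # Pr(m | s)

def draw(dist):
    u, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if u < cum:
            return value
    return value  # guard against floating-point rounding

samples = []
for _ in range(50_000):
    b = draw(pr_b)              # sample b from Pr(b)
    s = draw(pr_s_given_b[b])   # then s from Pr(s | b)
    m = draw(pr_m_given_s[s])   # then m from Pr(m | s)
    samples.append((b, s, m))

# Empirical Pr(s = 1) should approach the exact marginal:
exact = 0.7 * 0.1 + 0.3 * 0.6   # = 0.25
print(sum(s for _, s, _ in samples) / len(samples), exact)
```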


Remaining Challenges

• How to learn (and objectively validate) the causal structure of a BN from data?
  – Answer: Conditional independence tests
  – Software: Classification trees, BayesiaLab™, Bayesian Network Toolbox, etc.
• How to learn/estimate a BN’s CPTs from data?
• Once these things are known, WinBUGS, the Bayes Applet, etc. can be used to draw inferences within the BN about risk, exposure, etc.


Learning Causal DAG Structures from Data

• Can be very hard for people
  – “Often in error but never in doubt!”
• The general problem is computationally demanding
• Excellent sampling-based heuristics are now available (once we agree what a “causal structure” is!)
• Exact solutions are available for small problems and models with special structure/constraints
  – Includes many risk assessment models!


A Cognitive bias (Wason, 1970)

• Consider a deck of cards, each with a letter on one side and a number on the other.
• Claim: “If a card has a vowel on one side, then it has an even number on the other side.”

A D 4 7

Q: Which of these 4 cards must be turned over to determine whether the claim is true of them? (“Wason selection task”)


Structural vs. Reduced-Form Models

Suppose the truth (unknown to the risk analyst) is:

E(EFFECT) = AGE – EXPOSURE (1)
EXPOSURE = (1/3)*AGE (2)

(Q: What is non-causal about: AGE = 3*EXPOSURE?)

Q: What is E(Effect | Exposure)?
A: E(EFFECT | EXPOSURE) = 2*EXPOSURE (3)

[Diagram: EXPOSURE → E(EFFECT) with weight –1; AGE → E(EFFECT) with weight +1; AGE → EXPOSURE with weight 1/3]

The math is right but the implications are all wrong!
• (1) and (2) are structural equations (causal);
• (3) is a reduced-form equation (descriptive, not causal)


X is a potential direct cause of Y if and only if…

• X is informative about Y
  – I(X ; Y) > 0
• The information that X provides about Y cannot be fully explained away by any other variables
  – There is no W for which I(X ; Y | W) = 0
  – X → W → Y vs. W → X → Y vs. W → Y → X
• Composition, ordering, and conditional independence implications are all satisfied
• The history of X up through time t contains information about the future of Y after t.

These are all information (vs. intervention) criteria.


Tests for Potential Causality and Conditional Independence in Data

Different approaches:

• Time series: Granger causality, intervention analysis, change-point analysis
  – http://support.sas.com/rnd/app/examples/ets/granger/

• Systems of equations: Simon causal ordering

• Correlations: SEMs/path analysis

• Conditional independence: Causal graphs

• Our favorite: Classification trees


Classification Trees

• A powerful, popular method for data mining
• Background/software on classification trees:
  http://www.statsoftinc.com/textbook/stcart.html; http://www.stat.wisc.edu/~loh/quest.html

• Idea: “Always ask the most informative question” for reducing uncertainty (conditional entropy) about the dependent variable.

• Partitions a set of cases into groups or clusters (the leaf nodes) with similar conditional probabilities for dependent variable.

• Download applet from: http://www.cs.ubc.ca/labs/lci/CIspace/dTree/


What is meant by the “Most Informative” Question?

• The one that minimizes expected entropy about the quantity being predicted.

• Entropy measures the “surprise” from learning the correct value of a variable.

• Let X be a binary 0-1 variable with E(X) = p (in other words, with Pr(X = 1) = p.)

• How surprised will we be to find that in fact X = 1? The answer depends on p.


Quantifying Entropy

• Let S = the surprise from finding that X = 1.
  – So, S = information gained from this finding.
  – The expected value of S is the entropy of X.
• Assume S depends only on p = Pr(X = 1).
• Four axioms for entropy:
  – S(1) = 0
  – S(p) decreases with p
  – S(p) is a continuous function of p
  – S(pq) = S(p) + S(q)
• Theorem: S(p) = –k*log2(p). If we choose S(0.5) = 1, then S(p) = –log2(p) bits.


Entropy for any discrete X

• It is usual to use H(X) to denote the entropy of a discrete random variable X.

• Let X have possible values x1,…, xn with probabilities p1,…, pn. Then the entropy of X is the expected information gained from learning its value:

• H(X) = –Σi pi log2(pi) = E[log2(1/pi)] bits

• H(X) depends only on the probabilities of the different possible values of X.
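The formula is a one-liner in Python, written here in the E[log2(1/pi)] form (a sketch for illustration):

```python
import math

# Entropy of a discrete distribution, H(X) = sum_i p_i * log2(1/p_i) bits.
def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([1.0]))        # 0.0: a certain outcome carries no information
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely values
```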


Information for two variables, X and Y http://www.montefiore.ulg.ac.be/~lwh/Info/Transp2000/jour2.pdf

Conditional entropy H(X | Y) is defined as:

• H(X | Y) = Σy Pr(Y = y) H(X | Y = y) = EY[H(X | Y)]
  – H(X | Y = y) = –Σi Pr(X = xi | Y = y) log2 Pr(X = xi | Y = y)

Properties:
• H(X | Y) ≤ H(X)
• Joint entropy: H(X, Y) = H(X) + H(Y | X)

Mutual information:

I(X ; Y) = I(Y ; X) = H(X) – H(X | Y)
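These identities can be verified numerically on the joint table from the earlier example (a sketch; it uses the chain rule H(X, Y) = H(Y) + H(X | Y) to get the conditional entropies):

```python
import math

# Joint table from the earlier example: X in {Low, High}, Y in {0, 1}.
joint = {("Low", 0): 0.0, ("Low", 1): 0.2, ("High", 0): 0.3, ("High", 1): 0.5}

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

hx = H([0.2, 0.8])             # marginal entropy H(X)
hy = H([0.3, 0.7])             # marginal entropy H(Y)
hxy = H(list(joint.values()))  # joint entropy H(X, Y)

mi_x = hx - (hxy - hy)         # H(X) - H(X | Y)
mi_y = hy - (hxy - hx)         # H(Y) - H(Y | X)
print(mi_x, mi_y)              # equal: the mutual information I(X ; Y)
```

Both expressions give the same positive number, confirming that X and Y in that table are dependent.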


Information Theory Ideas

• Fano’s inequality: The conditional entropy H(X | Y) puts a lower bound on the probability of error in estimating X from Y.
• Kolmogorov complexity: H(X) ≈ length of the shortest binary program for computing samples from X.
• Maximum compression: the expected code length can be brought to within 1 bit of H(X), but no lower than H(X).
• Shannon’s coding theorem: A channel Pr(Y | X) has capacity C = maxPr(x) I(X ; Y). Asymptotically error-free communication through the channel is possible iff source entropy < channel capacity! C is the maximum possible rate of error-free communication using the channel.


Mutual Information in DAGs

• In a causal graph model X – Y – Z (with the arcs oriented in any directions), more remote ancestors can never be more informative than direct parents.

• Thus, I(X ; Z) ≤ I(X ; Y)
  – Information-processing inequality for causal graphs.
  – Special case: H[f(X)] ≤ H(X)

• Moreover, I(X ; Z | Y) = 0 (conditional independence of X and Z given Y) unless both X and Z point into Y.


Entropy and Classification Trees

• Let X be the dependent variable (the one we are predicting) in a classification tree. Before asking any questions, the marginal distribution of X has entropy H(X).
• After asking “What is Y?” and learning that “Y = y”, the conditional entropy of X becomes H(X | Y = y). This may be larger or smaller than H(X).
• The expected value, H(X | Y), is ≤ H(X): H(X | Y) < H(X) unless X and Y are independent.
• The question Y that maximizes I(X ; Y) is the most informative question to ask (about X).
  – A myopic criterion: the most informative single question
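A toy sketch of "ask the most informative question": estimate I(X ; Y) = H(X) − H(X | Y) for each candidate feature and pick the maximizer. The data set below is invented for illustration:

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def dist(values):
    n = len(values)
    return [values.count(v) / n for v in set(values)]

def info_gain(xs, ys):
    """Estimate I(X ; Y) = H(X) - H(X | Y) from paired samples."""
    gain = H(dist(xs))
    for y in set(ys):
        subset = [x for x, yv in zip(xs, ys) if yv == y]
        gain -= (len(subset) / len(xs)) * H(dist(subset))
    return gain

x = [1, 1, 0, 0, 1, 0, 1, 0]                 # dependent variable
features = {"y1": [1, 1, 0, 0, 1, 0, 1, 0],  # perfectly informative about x
            "y2": [1, 1, 1, 1, 0, 0, 0, 0]}  # carries no information about x here
best = max(features, key=lambda name: info_gain(x, features[name]))
print(best, info_gain(x, features["y1"]), info_gain(x, features["y2"]))
```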


Classification Tree Algorithms

Growing a tree requires splitting, stopping, and pruning.
• Splitting criteria
  – Entropy / mutual information
  – Impurity / Gini coefficient of inequality
  – Gain and error criteria
• Stopping criteria
  – Statistical tests for significant differences between conditional PDFs (F-test, chi-square)
  – Merging leaves
• Pruning criteria
  – Model cross-validation


Model Cross-Validation

• Key idea: Choose the model to minimize cross-validation error (not the naïve resubstitution error)
  – Prune trees to minimize CVE. Myopic: drop the node that most reduces CVE, until no further reduction is possible.

• The multiple training and test sets can be randomly selected or deterministically selected (e.g., “leave-one-out” sampling)

• k-fold cross-validation can give an unbiased estimate of true predictive performance of model

• Re-sampling data to obtain improved models is a key idea of modern statistics


Classification/Decision Tree Algorithms in Practice

• Different decision tree (= classification tree) algorithms use different splitting, stopping, and pruning criteria.
• Examples of tree algorithms:
  – CHAID: chi-square automatic interaction detection
  – CART: classification and regression trees; allows continuous as well as discrete variables
  – MARS: multivariate adaptive regression splines; smooths the relations at the tree tips so they fit together smoothly
  – KnowledgeSeeker: allows multiple splits, variable types
  – ID3, C5.0, etc.


Generating CPTs with Trees

• A classification tree grown using the parents of a node provides exactly the information needed to fill in its conditional probability table (CPT).

• Each leaf node gives the conditional probability distribution of the dependent variable, given the combination of values of its parents.


Learning Structure with Trees

• In a classification tree, the dependent variable (root node) is conditionally independent of all variables not in the tree, given the variables that are in the tree. (At least as far as the tree-growing heuristic can discover.)

• Generically, all causal parents of a node appear in every classification tree for that node.

• Starting with a childless node (output), we can recursively seek direct parents of all nodes.


Eureka! Learning BNs is possible!

• We can use classification trees to perform conditional independence tests to identify potential causal graph structures.
• The leaves of the tree for each node provide the numbers to put in its CPT.
• Many extensions, simplifications (e.g., parametric models), and computational efficiencies have been and are being developed.

• Current status: From theory to software products


Classification-tree Tests of CI

• Pop Quiz: How can classification trees be used to decide which of the following is correct, given (Exposure, Lifestyle, Response) data for many individuals?
• Alternative hypotheses:
  Causal model 1: Exposure → Response → Lifestyle
  Causal model 2: Exposure → Lifestyle → Response
  Causal model 3: Lifestyle → Exposure → Response
• If a test (e.g., a classification tree test for conditional independence) can discriminate among alternative models, perhaps it can be used to identify all possible models from the data?
• The PC algorithm builds on this idea.


The PC Algorithm
(Reference: Glymour C, Cooper GF, 1999. Computation, Causation & Discovery, MIT Press)

1. Start with a complete undirected graph.
2. Use conditional independence tests to successively remove edges.
3. Orient any triple X – Y – Z as X → Y ← Z if and only if X and Z are dependent when conditioned on Y.
4. Orient any triple X → Y – Z as X → Y → Z.
5. Orient any pair X – Y with a directed path from X to Y as X → Y.
6. Repeat steps 3-5 until no more directions can be assigned.

This two-phase approach (eliminate edges via conditional independence, orient remaining edges where possible) can be extended to latent variables and possible sample-selection biases → the Fast Causal Inference (FCI) algorithm.


Pragmatics: Multiple Models

Basic ideas: D = Data, M = Model, S = Prediction statement
1. Assign CI relations among nodes based on data.
2. Average predictions over all plausible models.

Pr(statement S | data D) and model uncertainty:
Pr(S | D) = ΣM Pr(S | M)Pr(M | D), M = models. This is model-averaging.

Occam’s Window for selective model averaging (Madigan, Raftery, 1994):
1. Create an initial set of models (e.g., the model with all nodes connected).
2. Consider all one-step changes (arc additions, deletions, reversals that do not create cycles). Quantify each model from the data. (Estimate CPTs by the “obvious” MLE. Estimate the likelihood of the data using DAG decomposition/Gibbs sampling.)
3. Keep any new graphs that (a) are at least 5% as probable as the most probable model known, and (b) contain no more-probable subgraphs. Repeat 2 and 3 until no more graphs are added.
4. Perform model-averaging using all the structures found.


Influence diagrams with Analytica

• Influence diagram software (e.g., Hugin, Analytica) allows causal graph models with decision and value nodes to be quickly assembled and managed if the required CPTs and input probabilities are known.
• Every influence diagram can be automatically converted to an equivalent Bayesian network and solved (e.g., Xiang and Ye, 2001).
  – http://citeseer.nj.nec.com/492444.html


EXAMPLE: Analytica Influence Diagram

[Diagram nodes: Emissions Reduction (decision), Population Exposed, Base Concentration, Concentration, Threshold, Health Damage Factor, Health Damage, Excess Deaths, “Value of a Life”, Control Cost Factor, Control Cost, Total Cost (value), Analysis]

Description: Total cost to society = value of the lives saved – cost of the Emissions Reduction required to save those lives.


Clicking on the “Total Cost” value node and selecting mean value yields:

[Chart: mean Total Cost ($/yr), 0 to 400M, vs. Emissions Reduction Factor, 0 to 1]


Selecting “Probability Bands” yields:

[Chart: Total Cost ($/yr), 0 to 1G, vs. Emissions Reduction, 0 to 1, with probability bands at 0.05, 0.25, 0.5, 0.75, 0.95]


Example Analytica Model Wrap-Up

• Run time = ~1 sec.
• Method = Monte Carlo propagation of input probability distributions through the influence diagram.

• Conclusion: An emissions reduction factor of 0.7 is a robust optimum for this decision problem, given the probabilistic model (DAG influence diagram model) mapping inputs to output probabilities.


Conclusion: Bayes Rule and Conceptual Models

Bayes’ Rule can help:
• Identify hazards/agents (sources of bad effects)
  – Test hypothesized causative agents, pathways, and adverse effects via CI tests
• “Put everything together”
  – Compose (modular, validated) components via MC
  – Conditional independence
• “Drill down” on components
  – Decompose relations into measurable components
  – BNs provide testable decompositions

• Construct and validate system of causal connections

Appendix. Decision Analysis Preview: Beyond Influence Diagrams

Individual Decision Analysis Credo
http://www.cs.mcgill.ca/~dprecup/courses/AI/Lectures/ai-lecture17x2.pdf

1. DECISIONS are represented by choices among risk profiles (probabilities of consequences), one per act.

2. CONSEQUENCES are represented by numerical values on one or more outcome scales.

3. The best decision maximizes expected utility (correctly defined).

4. UTILITIES reflect risk attitudes and preferences for consequences.

5. UNCERTAINTIES are represented by probabilities, often derived from published models.

6. LEARNING takes place by conditioning on data (via Bayes’ Rule and model-based likelihood functions).

7. MODELS can be represented by influence diagrams/BNs.
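Item 6, learning by conditioning on data, can be illustrated with a minimal two-hypothesis Bayes update. The prior and light probabilities below are illustrative numbers echoing the launch examples later in the deck.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses after one observation:
    Pr(h | data) is proportional to Pr(data | h) * Pr(h)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Illustrative numbers: Pr(problem) = 0.2, and a warning light that comes on
# with probability 0.5 given a problem and 1/3 given no problem.
posterior = bayes_update({"problem": 0.2, "no problem": 0.8},
                         {"problem": 0.5, "no problem": 1 / 3})
print(posterior)   # Pr(problem | light) = 3/11, about 0.27
```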

Decision Analysis Framework (Cont.)

• Justification of the DA paradigm: All people should prefer a higher probability of gaining a preferred outcome!
  – In the prospect [(p, x); (1 – p, 0)] with x > 0, pick the option with the highest p.

• Gives a well-defined value of information for multi-stage and sequential decision-making.

• Gives a universal recipe for prescribing what to do:
1. Identify alternative acts/choices/decisions.

2. Identify relevant consequences.

3. Create causal model(s) and quantify risk profiles (i.e., the probability of each consequence, given each act).

4. Clarify value trade-offs. Assign utilities to outcomes.

5. Optimize the decision (max EU subject to model constraints).

6. Implement the decision; monitor results.

7. Iterate!

Solving Normal-Form Problems

• Ingredients: states, acts, probabilities, consequences, utilities, expected utilities
  – c(a, s), Pr(s), u(c), EU(a)

• Once the probabilities of consequences for each act are known (given any available information), finding expected utilities of acts is easy!

• Picking the EU-maximizing act is justified by a powerful normative (prescriptive) theory of rational decision-making.

EU for “Normal Form” (Act-State-Result)

• Model: act → consequence ← state. Usually shown as a table.

• Act = decision rule or policy; state = all observed values. The model determines Pr(state) and Pr(consequence | act, state) = Pr(c | a, s).

• Q: Given Pr(c | a, s), Pr(s), and u(c), find EU(a).

• A: EU(a) = Σ_c Pr(c | a) u(c) = Σ_c u(c) [Σ_s Pr(c | a, s) Pr(s)]
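The two-step formula in the last bullet translates directly to code. A minimal sketch; the states, consequences, and probabilities below are hypothetical, chosen only for illustration.

```python
def expected_utility(act, p_state, p_cons, u):
    """EU(a) = sum over c of u(c) * [sum over s of Pr(c | a, s) Pr(s)]."""
    return sum(u[c] * sum(p_cons[(c, act, s)] * p for s, p in p_state.items())
               for c in u)

# Illustrative setup: a launch succeeds for certain in a "good" state and
# fails for certain in a "bad" one.
p_state = {"good": 0.8, "bad": 0.2}               # Pr(s)
p_cons = {("success", "launch", "good"): 1.0,     # Pr(c | a, s)
          ("failure", "launch", "good"): 0.0,
          ("success", "launch", "bad"): 0.0,
          ("failure", "launch", "bad"): 1.0}
u = {"success": 1.0, "failure": 0.0}              # u(c)

eu = expected_utility("launch", p_state, p_cons, u)
print(eu)   # 0.8
```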

Simple Example: What to Do?

                  State:   Problem    No Problem
  Act:
  Launch                      0            1
  Don’t launch               0.5          0.8

  Probs:                     0.2          0.8

Simple Example of Normal Form

                  State:   Problem    No Problem     EU
  Act:
  Launch                      0            1         0.80
  Don’t launch               0.5          0.8        0.74

  Probs:                     0.2          0.8
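As a check, the EU column can be recomputed directly from the table:

```python
# Utilities from the table; state probabilities Pr(Problem) = 0.2,
# Pr(No Problem) = 0.8.
u = {"Launch":       {"Problem": 0.0, "No Problem": 1.0},
     "Don't launch": {"Problem": 0.5, "No Problem": 0.8}}
p = {"Problem": 0.2, "No Problem": 0.8}

eu = {act: sum(p[s] * u[act][s] for s in p) for act in u}
best = max(eu, key=eu.get)
# EU(Launch) = 0.80, EU(Don't launch) = 0.74, so "Launch" maximizes EU.
```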

Rocket Launch Problem

• An unmanned rocket is being launched.
• An unreliable warning light comes on with probability 0.5 if the rocket has a problem, and with probability 0.333 even if it has no problem.
• The loss is 2 from not launching when there is no problem, 5 from launching when there is a problem, and 0 otherwise.
• The prior probability of no problem is p.
• How great must p be to justify launching even if the warning light comes on? (Assume that the goal is to minimize expected loss.)
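One way to set up the solution, assuming the stated losses and light probabilities: compare the posterior expected losses of the two acts via Bayes’ Rule. A sketch using exact rational arithmetic, under which the threshold comes out as p = 15/19:

```python
from fractions import Fraction as F

P_LIGHT_PROBLEM = F(1, 2)   # Pr(light | problem)
P_LIGHT_OK = F(1, 3)        # Pr(light | no problem); 0.333 read as 1/3
LOSS_LAUNCH_PROBLEM = 5     # loss from launching with a problem
LOSS_HOLD_OK = 2            # loss from not launching with no problem

def expected_losses_given_light(p):
    """Expected losses of (launch, don't launch) given the light is on;
    p = prior Pr(no problem)."""
    w_problem = P_LIGHT_PROBLEM * (1 - p)   # unnormalized posterior weights
    w_ok = P_LIGHT_OK * p
    z = w_problem + w_ok
    return (LOSS_LAUNCH_PROBLEM * w_problem / z,   # launch
            LOSS_HOLD_OK * w_ok / z)               # don't launch

# Launching is justified when its expected loss is no larger, i.e. when
# 5*(1/2)*(1 - p) <= 2*(1/3)*p, which solves to p >= 15/19 (about 0.789).
threshold = F(15, 19)
launch_loss, hold_loss = expected_losses_given_light(threshold)
assert launch_loss == hold_loss   # indifference exactly at the threshold
```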

A More Realistic/Complex Example: EU Table

                                  State:   Problem    No Problem      EU
  Rule:
  Launch if and only if no light              ?            ?           ?
  Always launch                               0            1           p
  Never launch                               0.5          0.8       .5 + .3p

  Probs:                                    1 – p          p

A More Realistic/Complex Example: EU Table

                                  State:   Problem           No Problem                     EU
  Rule:
  Launch if and only if no light   (.5)(.5) = 0.25    (1/3)(.8) + (2/3)(1) = .933    .25(1 – p) + .933p
  Always launch                           0                      1                          p
  Never launch                           0.5                    0.8                      .5 + .3p

  Probs:                                1 – p                    p
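The three EU expressions in the table can be compared numerically to see where each rule is optimal. A sketch using exact fractions (.933… = 14/15 and .25 = 1/4 exactly):

```python
from fractions import Fraction as F

# EU of each decision rule as a function of p = Pr(No Problem), from the table.
rules = {
    "launch iff no light": lambda p: F(1, 4) * (1 - p) + F(14, 15) * p,
    "always launch":       lambda p: p,
    "never launch":        lambda p: F(1, 2) + F(3, 10) * p,
}

def best_rule(p):
    return max(rules, key=lambda r: rules[r](p))

for p in (F(1, 2), F(7, 10), F(9, 10)):
    print(float(p), "->", best_rule(p))
# Heeding the light is worth it only for intermediate priors: the rule
# "launch iff no light" is optimal for 15/23 < p < 15/19
# (about 0.652 < p < 0.789).
```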

Extensions of Normal Form

Payoff matrix (Raiffa, 1968)
  Act: row.  State: column.  Payoff: entry in the cell.
  Optimization: eliminate dominated rows, then choose the row that maximizes expected utility.

Normal Form
  Act: a.  State: s.  Consequence model: c(a, s).
  Optimization (mathematical programming):
    max_{a ∈ A} Σ_c u(c) p(c | a)
    s.t.  p(c | a) = Σ_{s : c(a, s) = c} p(s | a)

Normal Form Extensions (Cont.)

Decision tree (Raiffa, 1968)
  Act: decision at each choice node.  State: outcome at each chance node.  Payoffs: at the tips of the tree.
  Optimization: averaging out and folding back (simple dynamic programming).

Influence diagram
  Act: decision at each choice node.  State: outcome at each chance node.  Payoff: value at the value node.
  Optimization: arc reversal, MCMC, variable elimination, etc.
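The “averaging out and folding back” algorithm fits in a few lines. The tree below encodes the earlier launch example (utilities 0/1 and 0.5/0.8, Pr(Problem) = 0.2), so the rolled-back value should match that table.

```python
# Averaging out and folding back: chance nodes average their children's
# values by probability; choice nodes take the maximum.
def rollback(node):
    if node["type"] == "leaf":
        return node["value"]
    if node["type"] == "chance":
        return sum(p * rollback(child) for p, child in node["branches"])
    return max(rollback(option) for option in node["options"])   # choice node

tree = {"type": "choice", "options": [
    {"type": "chance", "branches": [(0.2, {"type": "leaf", "value": 0.0}),   # launch
                                    (0.8, {"type": "leaf", "value": 1.0})]},
    {"type": "chance", "branches": [(0.2, {"type": "leaf", "value": 0.5}),   # don't
                                    (0.8, {"type": "leaf", "value": 0.8})]},
]}
print(rollback(tree))   # 0.8, i.e. launch, as in the normal-form table
```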

Normal Form Extensions (Cont.)

Markov decision process (MDP)
  Act: choice of act in each state.  State: transition time and next state at each visit to a state.  Payoff: reward per unit time in states and/or at transitions.
  Optimization: value iteration, policy iteration, MCMC, dynamic programming.

Stochastic optimal control
  Act: decision rule (maps observed history to a control trajectory).  State: evolution of the state trajectory, given the control trajectory.  Payoff: function of the state and/or control trajectory.
  Optimization (mathematical programming):
    max_{a ∈ A} Σ_c u(c) p(c | a)
    s.t.  p(c | a) = Σ_{s : c(a, s) = c} p(s | a)
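The value-iteration entry can be illustrated on a toy MDP. All states, acts, rewards, and transition probabilities below are made up for the sketch.

```python
# Value iteration on a tiny two-state MDP ("ok"/"degraded") with two acts
# ("run"/"maintain"); gamma is the per-step discount factor.
GAMMA = 0.9
P = {("ok", "run"):            {"ok": 0.9, "degraded": 0.1},
     ("ok", "maintain"):       {"ok": 1.0},
     ("degraded", "run"):      {"degraded": 1.0},
     ("degraded", "maintain"): {"ok": 0.6, "degraded": 0.4}}
R = {("ok", "run"): 10.0, ("ok", "maintain"): 2.0,
     ("degraded", "run"): 1.0, ("degraded", "maintain"): -5.0}
states, acts = ("ok", "degraded"), ("run", "maintain")

def q(s, a, V):
    """One-step lookahead value of taking act a in state s."""
    return R[(s, a)] + GAMMA * sum(pr * V[s2] for s2, pr in P[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(500):                       # Bellman backups to (near) fixpoint
    V = {s: max(q(s, a, V) for a in acts) for s in states}

policy = {s: max(acts, key=lambda a: q(s, a, V)) for s in states}
print(policy)   # run while "ok"; maintain once "degraded"
```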

Why use decision analysis?
A: Intuitive decisions are often bad/dominated!

Example:
A = gain $240 for sure vs. B = 25% chance to win $1,000 (else 0)
C = lose $750 for sure vs. D = 75% chance to lose $1,000 (else 0)

• You must choose one prospect from the pair (A, B) and one prospect from (C, D). What do you choose: (A, C), (A, D), (B, C), or (B, D)?
• Many prefer (A, D) to (B, C).
• But (A, D) gives a 25% chance of +$240 and a 75% chance of –$760.
• (B, C) gives a 25% chance of +$250 and a 75% chance of –$750.
• So (B, C) dominates (A, D).
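The dominance claim is easy to verify by enumerating the joint risk profiles of each pair of prospects:

```python
from itertools import product

# Each prospect is a list of (probability, payoff) branches.
A = [(1.00, 240)]
B = [(0.25, 1000), (0.75, 0)]
C = [(1.00, -750)]
D = [(0.75, -1000), (0.25, 0)]

def combine(x, y):
    """Joint risk profile of holding prospects x and y (independent draws)."""
    return sorted((px * py, vx + vy) for (px, vx), (py, vy) in product(x, y))

print(combine(A, D))   # [(0.25, 240), (0.75, -760)]
print(combine(B, C))   # [(0.25, 250), (0.75, -750)]
# At every probability level (B, C) pays strictly more, so it dominates (A, D).
```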

A Formula for Expected Utility

• Let Pr(c | a) be the probability of receiving consequence or result c if act or decision a is taken: a → c.

• Pr(c | a) for all c = the “risk profile” for act a.

• Let u(c) = the utility of consequence c. (If u(c) is a utility function representing preferences for consequences, so is r·u(c) + s, for any r > 0 and any s.)
  – c is equally preferred to {probability u(c) of the best outcome, probability 1 – u(c) of the least-preferred outcome}.

Then EU(a) = Σ_c Pr(c | a) u(c) is the expected utility of act a.