CS 7180: Behavioral Modeling and Decision-making in AI
Probability Theory Review
Prof. Amy Sliva
October 3, 2012
Decision-making under uncertainty • So far we have assumed perfect, complete, and reliable information in reasoning about behavior • Derive previously unknown facts/states from the current, known ones
• In many (most?) domains this is not the case… • Not always possible to have access to the entire set of facts for reasoning • Agent behavioral data is noisy, incomplete, and inconsistent • The complexity of these domains prevents complete representation
• Don’t know what we don’t know…
• Actions, effects, and the state of the world are all uncertain • Yet a decision must still be made!
Example: Uncertainty in modeling security
• Suppose we are trying to model the behaviors of groups involved in a civil conflict to determine a conflict management strategy
Action ArmedAttack(g, t) = Group g will engage in an armed attack by time t
• Will group g1 attack by a particular time t?
Example: Uncertainty in modeling security • Problems:
• Partial observability—group’s resources, other agents’ plans (parties in the conflict, external states, international organizations), etc.
• Noisy sensors—media or intelligence reports • Uncertainty in action outcomes—casualties, responses of the other group, etc. • Immense complexity of modeling and predicting human behavior
• A purely logical approach either… • Risks falsehood: ArmedAttack(g1, 10), i.e., asserting that group g1 will definitely attack at time 10
• Leads to conclusions that are too weak for decision making: “ArmedAttack(g1, 10) will occur if g1 has a consistent inflow of resources and g2 does not receive external state support and the attack is successful and etc., etc…”
Several sources of uncertainty in AI • Information is partial
• Information is not fully reliable • Representation language is inherently imprecise
• Information comes from multiple sources and it is conflicting
• Information is approximate
• Non-absolute cause-effect relationships exist (nondeterminism)
Sources of Uncertainty 1. Ignorance 2. Laziness (efficiency)
What we call uncertainty is a summary of all information that is not explicitly taken into account in our model or knowledge base.
Managing uncertainty in AI
How to represent uncertainty in knowledge?
How to perform inferences with uncertain knowledge?
Which action to choose under uncertainty?
Methods for uncertain reasoning • Default or nonmonotonic reasoning
• Assume the normal case unless or until it is contradicted by evidence
If I believe Tweety is a bird, then I think he can fly
If I learn Tweety is a penguin, then I think he can’t fly
• Worst-case reasoning • Assume and plan for the worst (i.e., adversarial search against optimal opponent)
More methods for uncertain reasoning • Evidential reasoning—how strongly do I believe P based on evidence? (confidence levels) • Quantitative: [0, 1], [-1, 1], 95% confidence interval • Qualitative: {definitely, very likely, likely, neutral, unlikely, very unlikely, definitely not}
• Fuzzy concepts—measure degree of “truth” not uncertainty
• Unemployment is high • The next season of Mad Men will start “soon” • Add a degree between 0 and 1 to fuzzy assertions
• We will mainly focus on probabilistic reasoning models
Musings on probability… “When it is not in our power to determine what is true, we ought to follow what is most probable.”
—Rene Descartes
“The idea was fantastically, wildly improbable. But like most fantastically, wildly improbable ideas it was at least as worthy of consideration as a more mundane one to which the facts had been strenuously bent to fit.”
—Douglas Adams
Probability Theory • World is not necessarily divided between “normal” or “abnormal,” nor is it adversarial
• Possible situations have associated likelihoods
• Probability theory enables us to make rational decisions • Will an armed attack happen at time t? • What is the probability of an attack in a given situation?
• Use probabilities to represent the structure of our knowledge and for reasoning over that knowledge
Syntax for probabilistic reasoning • Basic element: random variable
• Similar to propositional logic—possible worlds (sample space) deQined by assignment of values to random variables
• Boolean random variables • E.g., Attack (Is an attack occurring?)
• Discrete random variables • E.g., Direction is one of <north, south, east, west>
• Elementary proposition constructed by assignment of a value to a single random variable • E.g., Direction = west, Attack = false (abbreviated ¬attack)
• Complex propositions formed from elementary propositions and standard logical connectives • E.g., Direction = west ∨ Attack = false
• Atomic event—A complete specification of the state of the world about which the agent is uncertain • E.g., If the world consists of two boolean variables Attack and Propaganda, then there are four distinct atomic events:
Attack = false ∧ Propaganda = false Attack = false ∧ Propaganda = true Attack = true ∧ Propaganda = false Attack = true ∧ Propaganda = true
• Atomic events are mutually exclusive and exhaustive (often called “outcomes”)
• Events in general are sets of atomic events, such as Attack = true
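The enumeration above can be sketched in code. This is a minimal illustration (not from the lecture): one atomic event is one complete assignment of values to all the random variables.

```python
# Enumerate the atomic events for two boolean variables, Attack and Propaganda.
from itertools import product

variables = ["Attack", "Propaganda"]

# One atomic event = one complete assignment of values to all variables.
atomic_events = [dict(zip(variables, values))
                 for values in product([False, True], repeat=len(variables))]

for event in atomic_events:
    print(event)
```

With two boolean variables this yields the four mutually exclusive, exhaustive outcomes listed above; with n boolean variables it would yield 2^n.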
Axioms of probability theory • Basic notation for probability: P(A) is the probability of proposition A being true in the KB
OR P(A) is the probability that event A occurs in the world • For any propositions A, B
• 0 ≤ P(A) ≤ 1 • P(true) = 1 and P(false) = 0 • P(A ∨ B) = P(A) + P(B) − P(A ∧ B) (inclusion-exclusion principle)
[Venn diagram: two overlapping sets A and B; the overlap is A ∧ B]
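The inclusion-exclusion axiom can be checked on explicit sets of possible worlds. A small sketch, where the counts (32 worlds, |WA| = 10, |WB| = 6, overlap 2) are chosen to match the Venn-diagram examples on the later slides:

```python
# Check P(A ∨ B) = P(A) + P(B) − P(A ∧ B) on explicit sets of worlds.
all_worlds = set(range(32))
worlds_a = set(range(10))      # worlds where A holds (|WA| = 10)
worlds_b = set(range(8, 14))   # worlds where B holds (|WB| = 6); overlaps A in {8, 9}

def prob(event_worlds):
    # Probability = proportion of worlds in which the event holds.
    return len(event_worlds) / len(all_worlds)

lhs = prob(worlds_a | worlds_b)                                   # P(A ∨ B)
rhs = prob(worlds_a) + prob(worlds_b) - prob(worlds_a & worlds_b)
```

Both sides come out to 14/32 ≈ 0.44, the value computed on the union slide below.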
Multivalued Random Variables • Suppose A can take on more than 2 values • A is a random variable with arity k if it can take on a value out of the domain {v1, v2, …, vk}
• Then…
P(A = vi ∧ A = vj) = 0 if i ≠ j P(A = v1 ∨ A = v2 ∨ … ∨ A = vk) = 1
• Sum of probability over all possible values must equal 1
Σj=1…k P(A = vj) = 1
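A quick sketch of the sum-to-1 requirement, using the Direction distribution that appears on a later slide:

```python
# A discrete random variable Direction with arity 4; the probabilities
# over its domain must sum to 1 (values from the prior-probability slide).
direction_dist = {"north": 0.72, "south": 0.1, "east": 0.08, "west": 0.1}
total = sum(direction_dist.values())
```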
Set-theoretic interpretation of probability
• Remember the possible worlds interpretation from logic • A model (world) is a setting of true or false to every proposition • All possible worlds are all combinations of true and false
• Suppose all possible worlds can be represented by the following diagram:
W = set of all possible worlds
• The probability of A being true is the proportion of |WA| to |W| : P(A) = 10 / 32 ≈ 0.31
WA = set of worlds where A is true
Inclusion-‐exclusion axiom and possible worlds
• Inclusion-exclusion principle P(A ∨ B) = P(A) + P(B) − P(A ∧ B) • i.e., P(WA ∪ WB)
• The probability of WA ∪ WB is the proportion of |WA ∪ WB| to |W| : P(WA ∪ WB) = 14 / 32 ≈ 0.44
WB = set of worlds where B is true
• If A and B are mutually exclusive events, i.e., WA ∩ WB = ∅, then P(A ∨ B) = P(A) + P(B) = 16 / 32 = 0.5
Joint probability distributions • Prior or unconditional probability—value prior to any (new) evidence
• E.g., P(Attack = true) = 0.2 and P(Direction = north) = 0.72
• Probability distribution gives probabilities for each possible value • E.g., P(Direction) = <0.72, 0.1, 0.08, 0.1> (sums to 1)
• Joint probability distribution for a set of random variables gives probability for each combination of values (i.e., every atomic event) for those random variables • E.g., P(Direction, Attack) = a 4 × 2 matrix of values: • Sum of joint probabilities for each case (table) must equal 1
• Every question can be answered by the joint distribution!
Direction =      north    south    east     west
Attack = true    0.144    0.02     0.016    0.02
Attack = false   0.576    0.08     0.064    0.08
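As a sketch, the joint distribution P(Direction, Attack) above can be stored as a dictionary keyed by (direction, attack) pairs; the entries must sum to 1, and marginalizing out Direction recovers the prior on Attack:

```python
# The joint distribution P(Direction, Attack) from the table above.
joint = {
    ("north", True): 0.144,  ("south", True): 0.02,
    ("east", True): 0.016,   ("west", True): 0.02,
    ("north", False): 0.576, ("south", False): 0.08,
    ("east", False): 0.064,  ("west", False): 0.08,
}

# All joint entries must sum to 1.
total = sum(joint.values())

# Marginalizing out Direction recovers the prior P(Attack = true).
p_attack = sum(p for (d, a), p in joint.items() if a)
```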
Inference using joint probabilities • Start with the joint probability distribution
• For any proposition φ, sum the atomic events where: P(φ) = Σω:ω╞φ P(ω) (i.e., sum over all events where φ is true)
            propaganda              ¬propaganda
            election    ¬election   election    ¬election
attack      0.108       0.012       0.072       0.008
¬attack     0.016       0.064       0.144       0.576
• E.g., P(propaganda) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
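Inference by enumeration can be sketched directly: P(φ) is the sum of the probabilities of the atomic events in which φ holds, using the joint distribution P(Attack, Propaganda, Election) from the table above.

```python
# Joint distribution keyed by (attack, propaganda, election).
joint = {
    (True, True, True): 0.108,   (True, True, False): 0.012,
    (True, False, True): 0.072,  (True, False, False): 0.008,
    (False, True, True): 0.016,  (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(holds):
    # Sum the probabilities of all atomic events where the proposition holds.
    return sum(p for event, p in joint.items() if holds(event))

p_propaganda = prob(lambda e: e[1])  # 0.108 + 0.012 + 0.016 + 0.064
```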
The good and the bad with JPDs • Good news Once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty!
• Bad news Impossible to create for more than about 10 variables because there are so many numbers needed when you build the thing!
Conditional probability • Conditional or posterior probabilities—based on known information • E.g., P(attack | propaganda) = 0.8
Given that propaganda is known with certainty
• Formal definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b), if P(b) > 0
• That is, the proportion of |WA ∩ WB| to |WB| : P(A ∧ B) / P(B) = 2 / 6 ≈ 0.33
Computation of conditional probabilities • Start with the joint probability distribution
• Can also compute conditional probabilities:
P(¬attack | propaganda) = P(¬attack ∧ propaganda) / P(propaganda)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
= 0.08 / 0.2 = 0.4
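The same computation as a sketch in code, by enumeration over the joint distribution P(Attack, Propaganda, Election) used above:

```python
# Joint distribution keyed by (attack, propaganda, election).
joint = {
    (True, True, True): 0.108,   (True, True, False): 0.012,
    (True, False, True): 0.072,  (True, False, False): 0.008,
    (False, True, True): 0.016,  (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(holds):
    return sum(p for event, p in joint.items() if holds(event))

# P(¬attack | propaganda) = P(¬attack ∧ propaganda) / P(propaganda)
p_cond = prob(lambda e: not e[0] and e[1]) / prob(lambda e: e[1])
```

This yields 0.08 / 0.2 = 0.4.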
Independence allows simplification • A and B are independent iff P(A | B) = P(A) or P(B | A) = P(B) or P(A, B) = P(A)P(B) • E.g., P(Propaganda, Election, Attack, Direction) =
P(Propaganda, Election, Attack)P(Direction)
• If events in conditional or joint distribution are independent, we can decompose into smaller distributions
• Suppose we know that boolean variables A and S are independent: P(A) = 0.6, P(S) = 0.3, P(S | A) = P(S)
• We can derive the full JPD (assuming independence)
• Since we have the JPD, can make any query!
A   S   Probability
T   T   0.18
T   F   0.42
F   T   0.12
F   F   0.28
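Deriving the table above takes only P(A) and P(S) once independence is assumed. A minimal sketch:

```python
# Derive the full joint distribution of two independent boolean
# variables A and S from just P(A) and P(S).
p_a, p_s = 0.6, 0.3

joint = {(a, s): (p_a if a else 1 - p_a) * (p_s if s else 1 - p_s)
         for a in (True, False) for s in (True, False)}
```

The four entries reproduce the table: 0.18, 0.42, 0.12, 0.28.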
Absolute independence is rare • Absolute independence is powerful, but rare
• Behavioral models (in economics, gaming, security, etc.) require hundreds of variables, none of which are independent • In fact, interdependencies are often the interesting parts of our data!
• What to do??
Bayesian probability theory (1763 to now) • The basis of Bayesian Theory is conditional probabilities
• Bayesian Theory sees a conditional probability as a way to describe the structure and organization of knowledge
• In this view, A | B indicates the event A in the “context” of event B • E.g., the symptom A in the context of disease B the action A in the state of the world B
Adding new evidence to the current context • Additional evidence may change the environment, and hence the conditional probability
• P(attack | propaganda) = 0.8
If we know more, e.g., that attack is also given, then we have P(attack | propaganda, attack) = 1
• Evidence may be irrelevant (independent), allowing simplification
• E.g., attack does not depend on direction: P(attack | propaganda, north) = P(attack | propaganda) = 0.8
• This kind of inference, sanctioned by domain knowledge, is crucial for probabilistic reasoning!
The chain rule • The probability of a joint event (X1,…,Xn) can be computed using the conditional probabilities
• Product rule: P(a ∧ b) = P(a | b)P(b) = P(b | a)P(a)
• Chain rule derived by successive application of the product rule:
P(X1,…,Xn) = P(X1,…,Xn-1)P(Xn | X1,…,Xn-1)
= P(X1,…,Xn-2)P(Xn-1 | X1,…,Xn-2)P(Xn | X1,…,Xn-1)
= …
= P(X1)P(X2 | X1)P(X3 | X1, X2) … P(Xn | X1,…,Xn-1)
= Πi=1…n P(Xi | X1,…,Xi-1)
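The chain rule can be checked numerically on one atomic event of the joint distribution from the earlier inference slides. A sketch:

```python
# Verify P(a, p, e) = P(a) P(p | a) P(e | a, p) for one atomic event.
joint = {
    (True, True, True): 0.108,   (True, True, False): 0.012,
    (True, False, True): 0.072,  (True, False, False): 0.008,
    (False, True, True): 0.016,  (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(holds):
    return sum(p for event, p in joint.items() if holds(event))

p_a = prob(lambda e: e[0])                            # P(attack)
p_p_given_a = prob(lambda e: e[0] and e[1]) / p_a     # P(propaganda | attack)
p_e_given_ap = joint[(True, True, True)] / prob(lambda e: e[0] and e[1])

chain = p_a * p_p_given_a * p_e_given_ap              # equals joint[(T, T, T)]
```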
Evidential reasoning with conditional probabilities • Reasoning about “hypotheses” and “evidence” that do or do not support the hypotheses—Bayesian inference
P(H | e): given that I know about evidence e, the probability that my hypothesis H is true
P(H | e) = P(H ∧ e) / P(e)
• Might also have some extra or hidden context variables Y:
P(H | e) = Σy P(H ∧ e ∧ Y = y) / Σy P(e ∧ Y = y)
Bayesian reasoning in medical diagnosis • Causal model: Disease → Condition → Symptom (H → Y → E)
P(H = cancer | E = fatigue) = α [ P(H = cancer ∧ E = fatigue ∧ anemia) + P(H = cancer ∧ E = fatigue ∧ ¬anemia) ]
α = 1/P(E = fatigue) = 1/[ P(E = fatigue ∧ anemia) + P(E = fatigue ∧ ¬anemia) ]
[Diagrams: causal chains Cancer → Anemia → Fatigue and Kidney Disease → Anemia → Fatigue]
…but where do we find the numbers? • Assuming independence, doctors may be able to estimate P(symptom | disease) for each S/D pair (causal reasoning)
• Hard to estimate what we really need to know: P(disease | symptom)
• This is why Bayes rule is so important in probabilistic AI!
Bayes Rule • Product rule: P(a ∧ b) = P(a | b)P(b) = P(b | a)P(a)
⇒ Bayes rule: P(a | b) = P(b | a)P(a) / P(b)
• Useful for assessing diagnostic probability from causal probability:
• P(Cause | Effect) = P(Effect | Cause)P(Cause) / P(Effect)
• E.g., Let M be meningitis, S be stiff neck P(M | S) = P(S | M) P(M) / P(S) = 0.8 × 0.0001 / 0.1 = 0.0008
• Note: posterior probability of meningitis is still very small!
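The meningitis calculation as a one-line sketch, showing how Bayes rule turns the causal probability P(S | M) into the diagnostic probability P(M | S):

```python
p_s_given_m = 0.8    # P(stiff neck | meningitis), causal direction
p_m = 0.0001         # prior P(meningitis)
p_s = 0.1            # P(stiff neck)

# Bayes rule: P(M | S) = P(S | M) P(M) / P(S)
p_m_given_s = p_s_given_m * p_m / p_s
```

The posterior is 0.0008: even given the symptom, meningitis remains very unlikely because the prior is so small.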
More general forms of Bayes rule
• P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | ¬A)P(¬A)]
• P(A | B, e) = P(B | A, e)P(A | e) / P(B | e)
• P(A = vi | B) = P(B | A = vi)P(A = vi) / Σk=1…n P(B | A = vk)P(A = vk)
Conditional independence • If an attack occurs, the probability of casualties does not depend on whether or not there is a propaganda campaign:
• P(casualties | propaganda, attack) = P(casualties | attack)
• The same independence holds if there is no attack:
• P(casualties | propaganda, ¬attack) = P(casualties | ¬attack)
• Casualties is conditionally independent of propaganda given attack:
• P(Casualties | Propaganda, Attack) = P(Casualties | Attack)
P(Propaganda | Casualties, Attack) = P(Propaganda | Attack)
P(Propaganda, Casualties | Attack) = P(Propaganda | Attack)P(Casualties | Attack)
Conditional independence reduces size • P(Propaganda, Attack, Casualties) has 2³ − 1 = 7 independent JPD entries • Write out full joint distribution using chain rule: P(Propaganda, Casualties, Attack)
= P(Propaganda | Casualties, Attack)P(Casualties, Attack) = P(Propaganda | Casualties, Attack)P(Casualties | Attack)P(Attack) = P(Propaganda | Attack)P(Casualties | Attack)P(Attack)
i.e., 2 + 2 + 1 = 5 independent numbers
• In most cases, conditional independence reduces the size of the representation from exponential in n to linear in n
• Conditional independence is the most basic and robust form of knowledge about uncertain environments
Bayes rule and conditional independence • P(Attack | Propaganda, Casualties) ∝ P(Propaganda, Casualties | Attack)P(Attack) = P(Propaganda | Attack)P(Casualties | Attack)P(Attack)
• We say: “Propaganda and Casualties are independent, given Attack” • Attack separates Propaganda and Casualties because it is a direct cause of both • Example of a naïve Bayes model
• P(Cause, Effect1,…,Effectn) = P(Cause) Πi P(Effecti | Cause)
• Total number of parameters is linear in n (number of effects) • This is our first Bayesian inference net! More on Friday…
[Diagrams: naïve Bayes networks with Attack as parent of Propaganda and Casualties, and Cause as parent of Effect1, …, Effectn]
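Naive Bayes inference for P(Attack | propaganda, casualties) can be sketched as follows. The prior and conditional probabilities below are illustrative assumptions, not numbers from the lecture:

```python
# Naive Bayes: P(Attack | effects) ∝ P(Attack) * Π P(effect_i | Attack).
p_attack = 0.2
p_prop_given_attack = {True: 0.8, False: 0.1}   # assumed P(propaganda | Attack)
p_cas_given_attack = {True: 0.7, False: 0.05}   # assumed P(casualties | Attack)

# Unnormalized posterior for each value of Attack, given both effects observed.
unnormalized = {a: (p_attack if a else 1 - p_attack)
                   * p_prop_given_attack[a] * p_cas_given_attack[a]
                for a in (True, False)}

# Normalize so the posterior sums to 1.
z = sum(unnormalized.values())
posterior = {a: v / z for a, v in unnormalized.items()}
```

The normalization constant z plays the role of P(Propaganda, Casualties), so P(Effect) never needs to be estimated directly.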
Example of conditional independence • R: There is rioting in Nigeria • H: Supply of gasoline in U.S. reduced by hurricane • G: U.S. gas prices increase
• Assume gas prices are sometimes responsive to global events (e.g., riots in oil-‐producing countries like Nigeria)
• Start with knowledge we are conQident about: P(H | R) = P(H), P(H) = 0.3, P(R) = 0.6
• Gas prices are not independent of the weather and are not independent of the political situation in Nigeria
• Hurricanes in the U.S. and rioting in Nigeria are independent
Reduce complexity with independence
• Know the joint probability of H and R, so now need: • P(G | H, R) for the 4 cases where H and R are true/false
• Can derive a full JPD with “mere” 6 numbers instead of 7
• NOTE: Savings are larger for larger numbers of variables/values
• Same expressive and inference power as JPD
P(H | R) = P(H), P(H) = 0.3, P(R) = 0.6
P(G | R ∧ H) = 0.05
P(G | R ∧ ¬H) = 0.1
P(G | ¬R ∧ H) = 0.1
P(G | ¬R ∧ ¬H) = 0.2
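A sketch deriving the full joint P(R, H, G) from these six numbers, using the independence of R (rioting) and H (hurricane):

```python
# Build the full joint P(R, H, G) from six numbers.
p_r, p_h = 0.6, 0.3
p_g_given = {(True, True): 0.05, (True, False): 0.1,
             (False, True): 0.1, (False, False): 0.2}  # P(G | R, H)

joint = {}
for r in (True, False):
    for h in (True, False):
        # R and H independent: P(R, H) = P(R) P(H).
        p_rh = (p_r if r else 1 - p_r) * (p_h if h else 1 - p_h)
        for g in (True, False):
            p_g = p_g_given[(r, h)]
            joint[(r, h, g)] = p_rh * (p_g if g else 1 - p_g)
```

The eight joint entries sum to 1, so any query over R, H, and G can now be answered by enumeration, exactly as with a directly specified JPD.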