Probability and Statistics for Data Mining
COMP5318
Question 1
• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35
Probability
• Probability is the mathematical language to understand uncertainty.
• We need to make decisions in the presence of uncertainty which is ever present.
• Example: The Earth is warming, a phenomenon known as Global Warming (GW). Is modern human activity the cause of GW?
  – Physics-driven approach
  – Data-driven approach
Experiments and Observation
• When an experiment is carried out we observe the outcome, which is often uncertain.
  – If it were not uncertain, why carry out the experiment?
• We look into a random shopping basket. Does it contain a packet of “Tofu”?
• We toss a coin. Does it land on “Heads”?
• We ask a question: “Is it raining in Broome, WA, right now?”
Building Blocks of Probability
• The space of all possible outcomes is called the sample space.
  – Non-trivial to decide.
• Single Coin Toss. The space is {H,T}.
• Shopping Basket. The space of all possible combinations of all items sold in the store.
• Shopping Basket: {Tofu, Not-Tofu}.
Events
• Events are subsets of the sample space. Events are often defined in familiar terms.
• In the shopping basket scenario
  – A vegetarian shopping basket is an event.
  – It is the set of all possible vegetarian item combinations.
• Throw of a die. The event we are looking for could be: Even Number = {2,4,6}, where the sample space = {1,2,3,4,5,6}.
Events
• Let G be the set of all galaxies. Characterize each galaxy by three numbers
  – d: distance from Earth
  – a: major axis
  – b: minor axis
• Elliptic Galaxies (EG)
  – EG = {(a,b,d) | a/b > 1.5}
• Distant Spiral Galaxies (DSG)
  – DSG = {(a,b,d) | a/b <= 1.5 and d > 10}
Events
• Let G be the set of all genes. Each gene can be “on” or “off”. Let E correspond to the event: all genes which are “on” when the skin cells are “starved”.
Events are Sets
• At the most basic level events are sets. Therefore we can carry out set union, difference and intersection on events.
• For example:
  – E1: shopping baskets which contain Tofu
  – E2: shopping baskets which contain Milk
  – E1 U E2: shopping baskets which contain either Tofu or Milk (or both)
Probability
• Let S be the space of all possible elementary outcomes. Let Power(S) be the power set of S. Then the probability P is a function P : Power(S) → [0,1] that satisfies the following properties (axioms):
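The axioms are not listed in the text here. Assuming the standard Kolmogorov axioms are the ones intended, they can be written as follows, where A and the A_i denote events (subsets of S):

```latex
\begin{align*}
&\text{(1) Non-negativity:} && P(A) \ge 0 \quad \text{for every event } A \subseteq S \\
&\text{(2) Normalization:} && P(S) = 1 \\
&\text{(3) Additivity:} && P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{whenever } A_i \cap A_j = \emptyset \text{ for all } i \ne j
\end{align*}
```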
Interpretation of Probability
• Physical or Ontological: Long-term frequency
  – 50% chance that a coin will land on heads.
  – 20% of all Woolworth shopping baskets are vegetarian.
  – 22% of all Woolworth shopping baskets in Northbridge plaza are vegetarian.
• Epistemological: Degree of Belief
  – 20% chance that my neighbours are watering their lawn on “dry” days.
  – 99% chance that the green immovable object outside my house is a Tree.
  – 90% chance that Australia will win the cricket world cup.
Consequences of Axioms
Example
• Two coin tosses. Let H1 be the event that a heads occurs on toss 1 and H2 a heads on toss 2. All outcomes are equally likely.
• Sample space = {HH, HT, TH, TT}
  – H1 = {HH, HT}
  – H2 = {HH, TH}
  – P(H1 U H2) = P(H1) + P(H2) - P(H1 ∩ H2) = ½ + ½ - ¼ = 3/4
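The two-toss example can be checked by enumerating the sample space directly. The sketch below assumes equally likely outcomes; the helper name `prob` is illustrative and not from the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses; every outcome is assumed equally likely.
sample_space = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return Fraction(len(event), len(sample_space))

H1 = {s for s in sample_space if s[0] == "H"}  # heads on toss 1
H2 = {s for s in sample_space if s[1] == "H"}  # heads on toss 2

# Direct count of the union versus the inclusion-exclusion formula.
p_union = prob(H1 | H2)
p_incl_excl = prob(H1) + prob(H2) - prob(H1 & H2)
print(p_union, p_incl_excl)  # 3/4 3/4
```

Because events are plain Python sets, union and intersection map onto `|` and `&`.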
Example
• Two events A and B are independent if
  – P(A ∩ B) = P(A)P(B)
• P(A ∩ B) is also written as P(AB) and P(A,B).
• If A and B are disjoint events with P(A) > 0 and P(B) > 0, then A and B cannot be independent.
  – P(A ∩ B) = 0, yet P(A)P(B) > 0
• Except for this case, you cannot determine independence by looking at a Venn diagram.
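The definition of independence can be tested mechanically on a small sample space. This sketch reuses two fair coin tosses as an assumed setting; the events `both_heads` and `both_tails` are illustrative additions that show a disjoint pair failing the test.

```python
from fractions import Fraction

# Two fair coin tosses, all four outcomes equally likely (an assumption).
sample_space = {"HH", "HT", "TH", "TT"}

def prob(event):
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """True when P(A ∩ B) = P(A)P(B)."""
    return prob(a & b) == prob(a) * prob(b)

H1 = {"HH", "HT"}      # heads on toss 1
H2 = {"HH", "TH"}      # heads on toss 2
both_heads = {"HH"}    # disjoint from both_tails
both_tails = {"TT"}

print(independent(H1, H2))                  # True
print(independent(both_heads, both_tails))  # False
```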
Question
• A shopping basket can either be kosher or not. The probability that it will be kosher is 3/4. Examine 10 baskets at a check out counter. What is the probability that there will be at least one kosher basket?
Answer
• Let E be the event “At least one kosher basket.” Let NKi be the event that the i-th basket is non-kosher.
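The complement of E is "all ten baskets are non-kosher", so P(E) = 1 - P(NK1 ∩ NK2 ∩ … ∩ NK10). If the baskets are assumed independent (an assumption that the question does not state explicitly), this becomes 1 - (1/4)^10. A minimal numerical check:

```python
from fractions import Fraction

p_kosher = Fraction(3, 4)
n_baskets = 10

# Assuming independent baskets:
# P(all non-kosher) = P(NK1) * ... * P(NK10) = (1/4)^10
p_none = (1 - p_kosher) ** n_baskets
p_at_least_one = 1 - p_none

print(p_at_least_one)         # 1048575/1048576
print(float(p_at_least_one))  # very close to 1
```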
Independence
Example
• For an Online Book Seller (OBS) the conversion rate is 1/100, i.e., on average one in every 100 visitors ends up making a purchase. What is the probability that at least one purchase will be made in 10 consecutive visits (by distinct customers)?
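This is the same "at least one" pattern as the kosher basket question. The sketch below assumes that visits are independent and that each converts with probability 1/100.

```python
p_purchase = 1 / 100
n_visits = 10

# P(at least one purchase) = 1 - P(no purchase in any of the 10 visits),
# assuming the visits are independent.
p_at_least_one = 1 - (1 - p_purchase) ** n_visits
print(round(p_at_least_one, 4))  # 0.0956
```

So, under these assumptions, roughly one run of 10 visits in every 10 or 11 contains a purchase.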
Example
• Two people take turns to sink a basketball. P1 succeeds with probability 1/3 and P2 with ¼. What is the probability that P1 succeeds before P2?
• Requires clever setting up of the events.
  – Let E be the event that P1 succeeds before P2.
  – Let Ai be the event that P1 succeeds before P2 on the ith trial.
  – Ai ∩ Aj = Ø for i ≠ j, and E = U(i=1..∞) Ai, the union of all the Ai
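Since the Ai are disjoint, P(E) is the sum of the P(Ai). The sketch below rests on three assumptions not spelled out in the slide: P1 shoots first, a "trial" is one round in which each player shoots once, and all shots are independent. Then Ai requires both players to miss in rounds 1 to i-1 and P1 to succeed in round i, which gives a geometric series.

```python
from fractions import Fraction

p1 = Fraction(1, 3)  # P1's success probability per shot
p2 = Fraction(1, 4)  # P2's success probability per shot

# Probability that both players miss in one round (independent shots assumed).
q = (1 - p1) * (1 - p2)  # 2/3 * 3/4 = 1/2

def p_A(i):
    """P(Ai): both miss in rounds 1..i-1, then P1 succeeds in round i."""
    return q ** (i - 1) * p1

# Geometric series: sum over i >= 1 of q^(i-1) * p1 = p1 / (1 - q)
closed_form = p1 / (1 - q)
partial_sum = sum(p_A(i) for i in range(1, 51))

print(closed_form)         # 2/3
print(float(partial_sum))  # approaches 2/3
```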
Conditional Probability
• Very Important Concept
• P(A|B) is “fraction of occurrences of B in which A also occurs”
  – P(A|B) = P(A ∩ B)/P(B); P(B) > 0
• For a fixed B, P(.|B) is a probability
  – Therefore if A1 and A2 are disjoint then
  – P(A1 U A2 |B) = P(A1|B) + P(A2|B)
• Note, in general P(A|B U C) =/= P(A|B) + P(A|C)
• Also, in general P(A|B) =/= P(B|A)
Standard Example
      D        Dc
+     0.009    0.099
-     0.001    0.891

D: disease present; Dc: disease absent; +/-: test positive or negative.

P(+|D) = P(+ ∩ D)/P(D) = 0.009/(0.009 + 0.001) = 0.9
P(-|Dc) = P(- ∩ Dc)/P(Dc) = 0.891/(0.891 + 0.099) = 0.9

Suppose a test is positive. What is the probability of disease?

P(D|+) = P(D ∩ +)/P(+) = 0.009/(0.009 + 0.099) ≈ 0.08
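All three conditional probabilities follow from the joint table by dividing a cell by a row or column total. A minimal sketch, with the dictionary layout and variable names chosen here for illustration:

```python
# Joint probabilities P(test result ∩ disease status) from the table.
joint = {
    ("+", "D"): 0.009, ("+", "Dc"): 0.099,
    ("-", "D"): 0.001, ("-", "Dc"): 0.891,
}

# Marginals: sum the joint table over the other variable.
p_D = joint[("+", "D")] + joint[("-", "D")]      # P(D)
p_Dc = joint[("+", "Dc")] + joint[("-", "Dc")]   # P(Dc)
p_pos = joint[("+", "D")] + joint[("+", "Dc")]   # P(+)

p_pos_given_D = joint[("+", "D")] / p_D      # P(+|D)
p_neg_given_Dc = joint[("-", "Dc")] / p_Dc   # P(-|Dc)
p_D_given_pos = joint[("+", "D")] / p_pos    # P(D|+)

print(round(p_pos_given_D, 2), round(p_neg_given_Dc, 2), round(p_D_given_pos, 2))
# 0.9 0.9 0.08
```

The test is right 90% of the time in each group, yet a positive result implies disease only about 8% of the time, because the disease itself is rare (P(D) = 0.01).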
Standard Data Mining Example
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Suppose the data above closely resembles the behaviour of the population at large.
What is the chance that those who buy a Diaper will also buy Beer?
P(Beer | Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75
Is Diaper an Event?
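One way to make "Diaper" an event is to read it as the set of transactions whose basket contains a Diaper. Under that reading, and taking each of the five transactions as equally likely, the conditional probability can be computed as a sketch like this (the helper names `event` and `prob` are illustrative):

```python
from fractions import Fraction

transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def event(item):
    """The event 'basket contains item', as a set of transaction IDs."""
    return {tid for tid, items in transactions.items() if item in items}

def prob(ev):
    """Relative frequency of an event among the transactions."""
    return Fraction(len(ev), len(transactions))

diaper, beer = event("Diaper"), event("Beer")

# P(Beer | Diaper) = P(Diaper ∩ Beer) / P(Diaper)
p_beer_given_diaper = prob(diaper & beer) / prob(diaper)
print(prob(diaper & beer), prob(diaper), p_beer_given_diaper)  # 3/5 4/5 3/4
```

In association-rule terms, `prob(diaper & beer)` is the support of {Diaper, Beer} and the ratio is the confidence of the rule Diaper → Beer.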
Conditional Independence
• If A and B are independent then P(A|B) = P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability.
Bayes Theorem
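The formulas are not present in the text here. In their standard form, assuming events B_1, …, B_n that partition S with P(B_i) > 0 and an event A with P(A) > 0, the law of total probability and Bayes' theorem read:

```latex
\begin{align*}
P(A) &= \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
  && \text{(law of total probability)} \\
P(B_j \mid A) &= \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}
  && \text{(Bayes' theorem)}
\end{align*}
```

The numerator follows from P(AB) = P(A|B)P(B), and the denominator is P(A) expanded by the law of total probability.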
Question 1
• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35
Answer to Question 1
P(G=F|D=Y) = P(D=Y|G=F)P(G=F) / [P(D=Y|G=F)P(G=F) + P(D=Y|G=M)P(G=M)]
           = (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60) ≈ 0.30
But what do G=F and D=Y mean? We have not even formally defined them.
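The arithmetic in the answer can be checked numerically. This is a minimal sketch that takes the table's percentages as the probabilities P(G=g) and P(D=Y|G=g); the dictionary names are illustrative.

```python
from fractions import Fraction

# From the table: P(G=g) and P(D=Y | G=g)
p_gender = {"M": Fraction(60, 100), "F": Fraction(40, 100)}
p_default_given_gender = {"M": Fraction(55, 100), "F": Fraction(35, 100)}

# Law of total probability: P(D=Y) = sum over g of P(D=Y | G=g) P(G=g)
p_default = sum(p_default_given_gender[g] * p_gender[g] for g in p_gender)

# Bayes' theorem: P(G=F | D=Y) = P(D=Y | G=F) P(G=F) / P(D=Y)
p_female_given_default = p_default_given_gender["F"] * p_gender["F"] / p_default

print(p_default)                                # 47/100
print(p_female_given_default)                   # 14/47
print(round(float(p_female_given_default), 2))  # 0.3
```

Although 40% of card holders are female, only about 30% of defaulters are, because the female default rate in the table is lower than the male one.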