Aspects of Bayesian Inference and Statistical Disclosure Control in Python

Duncan Smith
Confidentiality and Privacy Group, CCSR, University of Manchester
Introduction

- Bayesian Belief Networks (BBNs): probabilistic inference
- Statistical Disclosure Control (SDC): deterministic inference (attribution)
Bayesian Belief Networks

- Decision-making in complex domains
- Hard and soft evidence
- Correlated variables
- Many variables
Bayes' Rule

P(A|B) = P(B|A) P(A) / P(B)

A prior belief and evidence are combined to give a posterior belief.
Venn Diagram

[Venn diagram of events A and B, with regions "A only", "B only", "Both A & B" and "Neither A nor B"]
1. Prior probability table P(A)

       a     ¬a
      0.7    0.3

2. Conditional probability table P(B|A)

             a     ¬a
      b     3/7    2/3
      ¬b    4/7    1/3
             1      1
Inference

3. Produce the joint probability table by multiplication

             a     ¬a
      b     0.3    0.2    0.5
      ¬b    0.4    0.1    0.5
            0.7    0.3    1

4. Condition on evidence (B = b is observed)

             a     ¬a
      b     0.3    0.2

5. Normalise the table probabilities to sum to 1

             a     ¬a
      b     0.6    0.4
def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables.

    'prior' is a Python list. 'conditional' is a list of lists ('column'
    variable conditional on 'row' variable). 'obs_level' is the index of
    the observed level of the row variable.
    """
    levels = len(prior)
    # condition on the observed level
    result = conditional[obs_level]
    # multiply values by prior probabilities
    result = [result[i] * prior[i] for i in range(levels)]
    # get the marginal probability of the observed level
    marg_prob = sum(result)
    # normalise the current values to sum to 1
    posterior = [value / marg_prob for value in result]
    return posterior
Note: conditioning can be carried out before calculating the joint probabilities, reducing the cost of inference
>>> A = [0.7, 0.3]
>>> B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]
>>> Bayes(A, B_given_A, 0)
[0.59999999999999998, 0.39999999999999997]
>>>
The posterior distribution can be used as a new prior and combined with evidence from further observed variables.

Although computationally efficient, this 'naïve' approach implies assumptions that can lead to problems.
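As a sketch of this sequential updating, the Bayes function from the earlier slide can be applied repeatedly, feeding each posterior back in as the prior. The second observed variable C and its conditional table are hypothetical, added only for illustration.

```python
def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables (as on the earlier slide)."""
    result = [conditional[obs_level][i] * prior[i] for i in range(len(prior))]
    marg_prob = sum(result)
    return [value / marg_prob for value in result]

A = [0.7, 0.3]
B_given_A = [[3.0 / 7, 2.0 / 3], [4.0 / 7, 1.0 / 3]]
# Hypothetical table for a further variable C, assumed conditionally
# independent of B given A (the 'naive' assumption).
C_given_A = [[0.5, 0.25], [0.5, 0.75]]

posterior = Bayes(A, B_given_A, 0)          # observe B = b -> ~[0.6, 0.4]
posterior = Bayes(posterior, C_given_A, 0)  # posterior becomes the new prior
print(posterior)                            # approximately [0.75, 0.25]
```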
Naive Bayes

P(A,B,C) = P(C|A) P(B|A) P(A)

[Graph: node A with arrows to B and C]
A 'correct' factorisation

P(A,B,C) = P(C|A,B) P(B|A) P(A)

[Graph: arrows A → B, A → C and B → C]
Conditional independence

The Naive Bayes example assumes:

P(C|A,B) = P(C|A)

But if valid, the calculation is easier and fewer probabilities need to be specified.

This conditional independence implies that if A is observed, then evidence on B is irrelevant in calculating the posterior of C.
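The claim can be checked numerically. The following sketch uses made-up distributions satisfying the Naive Bayes factorisation; once A is observed, the P(B|A) factor is a constant that cancels in normalisation, so the posterior of C is unchanged.

```python
# Made-up distributions satisfying P(A,B,C) = P(C|A) P(B|A) P(A).
P_A = {'a0': 0.7, 'a1': 0.3}
P_B_given_A = {'a0': {'b0': 0.2, 'b1': 0.8}, 'a1': {'b0': 0.5, 'b1': 0.5}}
P_C_given_A = {'a0': {'c0': 0.9, 'c1': 0.1}, 'a1': {'c0': 0.4, 'c1': 0.6}}

def posterior_C(a, b=None):
    """P(C | A=a), or P(C | A=a, B=b), under the Naive Bayes factorisation."""
    weights = {}
    for c in ('c0', 'c1'):
        p = P_A[a] * P_C_given_A[a][c]
        if b is not None:
            p *= P_B_given_A[a][b]  # a constant factor once A = a is fixed
        weights[c] = p
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

print(posterior_C('a0'))        # evidence on B is irrelevant once A is observed
print(posterior_C('a0', 'b1'))  # same posterior
```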
A Bayesian Belief Network

R and S are independent until H is observed.

[DAG: R → W, R → H, S → H]
A Markov Graph

The conditional independence structure is found by 'marrying' parents with common children.

[Undirected graph: W - R, R - H, S - H, and R - S (the married parents)]
Factoring

The following factorisation is implied:

P(W,H,R,S) = P(W|R) P(R) P(S) P(H|R,S)

So P(S) can be calculated as follows (although there is little point, yet):

P(S) = P(S) Σ_R P(R) Σ_H P(H|R,S) Σ_W P(W|R)

If H and W are observed to be in states h and w, then the posterior of S can be expressed as follows (where ε denotes 'the evidence'):

P(S|ε) ∝ P(S) Σ_R P(R) P(H=h|R,S) P(W=w|R)
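The last expression can be evaluated directly for a small network. The conditional probability tables below are hypothetical (the slides give none), chosen only to show the summation over R with H and W instantiated.

```python
# Hypothetical CPTs for the W <- R -> H <- S network (all variables binary).
P_R = [0.2, 0.8]            # P(R)
P_S = [0.1, 0.9]            # P(S)
P_W_given_R = [[0.9, 0.1],  # P(W=w | R=r), indexed [w][r]
               [0.1, 0.9]]
# P(H=h | R=r, S=s), indexed [h][r][s]; columns sum to 1 over h.
P_H_given_RS = [[[0.99, 0.9], [0.8, 0.05]],
                [[0.01, 0.1], [0.2, 0.95]]]

h, w = 0, 0  # observed states of H and W

# P(S=s | evidence) is proportional to
# P(S=s) * sum_r P(R=r) P(H=h | r, s) P(W=w | r)
unnorm = [P_S[s] * sum(P_R[r] * P_H_given_RS[h][r][s] * P_W_given_R[w][r]
                       for r in range(2))
          for s in range(2)]
total = sum(unnorm)
posterior_S = [u / total for u in unnorm]
print(posterior_S)
```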
Graph Triangulation

[Three graphs on nodes A, B, C, D, E, showing successive steps of triangulating the graph]
Belief Propagation

Message passing in a Clique Tree

[Clique tree: cliques {W,R} and {R,H,S} linked by separator {R}]

Message passing in a Directed Junction Tree

[Directed junction tree over cliques {W,R} and {R,H,S}, marginalising down to S]
A Typical BBN
Belief Network Summary

- Inference requires a decomposable graph
- Efficient inference requires a good decomposition
- Inference involves evidence instantiation, table combination and variable marginalisation
Statistical Disclosure Control

- Releases of small area population (census) data
- Attribution occurs when a data intruder can make inferences (with probability 1) about a member of the population
- Negative Attribution: an individual who is an accountant does not work for Department C
- Positive Attribution: an individual who works in Department C is a lawyer
                    Department
Profession       A     B     C   Row sum
Lawyer          18     4     2     24
Accountant       2     3     0      5
Col sum         20     7     2     29
Release of the full table is not safe from an attribute disclosure perspective (it contains a zero).

Each of the two marginal tables is safe (neither contains a zero).

Is the release of the two marginal tables 'jointly' safe?

The Bounds Problem

Given a set of released tables (relating to the same population), what inferences about the counts in the 'full' table can be made?

Can a data intruder derive an upper bound of zero for any cell count?
A non-graphical case

All 2 × 2 marginals of a 2 × 2 × 2 table

A maximal complete subgraph (clique) without an individual corresponding table

[Graph: triangle on A, B, C]
             Var1
Var2         A     B
  C          3     9
  D          2     2

             Var1
Var3         A     B
  E          1    10
  F          4     1

             Var2
Var3         C     D
  E          8     3
  F          4     1

             Var1 and Var2
Var3       A,C   A,D   B,C   B,D
  E         0     1     8     2
  F         3     1     1     0
The original cell counts can be recovered from the marginal tables.

Each cell's upper bound is the minimum of its relevant margins (Dobra and Fienberg).
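A minimal sketch of this rule in NumPy (Numeric's modern successor), using the margins from the Lawyer/Accountant example: broadcasting takes the pointwise minimum of each cell's row and column totals.

```python
import numpy as np

# Released margins from the example table.
prof = np.array([24, 5])       # Lawyer, Accountant row sums
dept = np.array([20, 7, 2])    # Department A, B, C column sums

# Each cell's upper bound is the minimum of its relevant margins.
upper = np.minimum(prof[:, None], dept[None, :])
print(upper)
# [[20  7  2]
#  [ 5  5  2]]
```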
Applying the rule to the released margins gives the cell upper bounds:

                    Department
Profession       A     B     C   Row sum
Lawyer          20     7     2     24
Accountant       5     5     2      5
Col sum         20     7     2     29
SDC Summary

- A set of released tables relating to a given population defines a graph (each table corresponds to a clique over its variables)
- If the resulting graph is both graphical and decomposable, then the upper bounds can be derived efficiently
Common aspects

Graphical representations
- Graphs / cliques / nodes / trees

Combination of tables
- Pointwise operations
  - BBNs: pointwise multiplication
  - SDC: pointwise minimum, plus pointwise addition and pointwise subtraction (for calculating exact lower bounds)
Coercing Numeric built-ins

A table is a numeric array with an associated list of variables.

Marginalisation is trivial, using the built-in Numeric.add.reduce() function and removing the relevant variable from the list.

Conditioning is easily achieved using a Numeric.take() slice, appropriately reshaping the array with Numeric.reshape() and removing the variable from the list.
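Numeric is long obsolete, but the same operations can be sketched in its successor NumPy, whose np.add.reduce and np.take mirror the Numeric calls named above. The counts are the example table's; the table/variable-list pairing is managed by hand here rather than by a class.

```python
import numpy as np

# A 'table': a numeric array plus an associated list of variables.
table = np.array([[18, 4, 2],
                  [2, 3, 0]])
variables = ['profession', 'department']

# Marginalisation: sum out 'department' and drop it from the variable list.
marg = np.add.reduce(table, axis=variables.index('department'))
# marg -> array([24, 5]), variables -> ['profession']

# Conditioning: take the department = A slice, reshape, drop the variable.
cond = np.take(table, [0], axis=variables.index('department')).reshape(2)
# cond -> array([18, 2]), variables -> ['profession']
print(marg, cond)
```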
Pointwise multiplication

Numeric.multiply() generates the appropriate table IF the two tables have identical ranks and variable lists.

This is ensured by adding new axes (Numeric.NewAxis) for the 'missing' axes and transposing one of the tables (Numeric.transpose()) so that the variable lists match.
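In NumPy the same alignment can be sketched with np.newaxis (the counterpart of Numeric.NewAxis); in this simple case no transpose is needed because the variable lists already end up in the same order.

```python
import numpy as np

# Two one-dimensional tables over different variables.
prof = np.array([24, 5])       # ['profession']
dept = np.array([20, 7, 2])    # ['department']

# Add 'missing' axes so both arrays carry both variables.
prof2 = prof[:, np.newaxis]    # shape (2, 1), ['profession', 'department']
dept2 = dept[np.newaxis, :]    # shape (1, 3), ['profession', 'department']

product = prof2 * dept2        # shape (2, 3)
print(product)
# [[480 168  48]
#  [100  35  10]]
```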
array([24, 5]) ['profession'] (2,)
array([20, 7, 2]) ['department'] (3,)

array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

>>> prof * dept
array([[480, 168, 48],
       [100, 35, 10]]) ['profession', 'department']

>>> (prof * dept).normalise(29)
array([[ 16.551, 5.793, 1.655],
       [ 3.448, 1.206, 0.344]]) ['profession', 'department']
Pointwise minimum / addition / subtraction

Numeric.minimum(), Numeric.add() and Numeric.subtract() generate the appropriate tables IF the two tables have identical ranks and variable lists AND the two tables also have identical shape.

This is ensured by a secondary preprocessing stage where the tables from the first preprocessing stage are multiplied by a 'correctly' shaped table of ones (this is actually quicker than using Numeric.concatenate()).
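A NumPy sketch of this second preprocessing stage: multiplying each array by a correctly shaped table of ones broadcasts both to a common shape, after which the pointwise minimum is well defined.

```python
import numpy as np

# Tables after the first preprocessing stage (axes added, variable lists match).
prof2 = np.array([[24], [5]])     # (2, 1) ['profession', 'department']
dept2 = np.array([[20, 7, 2]])    # (1, 3) ['profession', 'department']

# Second stage: multiply by a 'correctly' shaped table of ones
# to force identical shapes.
ones = np.ones((2, 3), dtype=int)
a = prof2 * ones                  # (2, 3)
b = dept2 * ones                  # (2, 3)

bounds = np.minimum(a, b)         # the Dobra/Fienberg cell upper bounds
print(bounds)
# [[20  7  2]
#  [ 5  5  2]]
```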
array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

(2nd stage preprocessing)
array([[20, 7, 2],
       [20, 7, 2]]) (2, 3) ['profession', 'department']

>>> prof.minimum(dept)
array([[20, 7, 2],
       [ 5, 5, 2]]) ['profession', 'department']
Summary

The Bayesian Belief Network software was originally implemented in Python for two reasons:

1. The author was, at the time, a relatively inexperienced programmer
2. Self-learning (albeit with some help) was the only option

The SDC software was implemented in Python because:

1. Python + Numeric turned out to be a wholly appropriate solution for BBNs (Python is powerful, Numeric is fast)
2. Existing code could be reused