
Aspects of Bayesian Inference and Statistical Disclosure Control in Python Duncan Smith Confidentiality and Privacy Group CCSR University of Manchester

Introduction

Bayesian Belief Networks (BBNs)

probabilistic inference

Statistical Disclosure Control (SDC)

deterministic inference (attribution)

Bayesian Belief Networks

Decision-making in complex domains

Hard and soft evidence

Correlated variables

Many variables

Bayes’ Rule

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

A prior belief and evidence combined to give a posterior belief

Venn Diagram

[Figure: Venn diagram of Event A and Event B, with regions labelled "A only", "Both A & B", "B only" and "Neither A nor B".]

Inference

1. Prior probability table P(A)

          a     ¬a
         0.7   0.3

2. Conditional probability table P(B|A)

          a     ¬a
   b     3/7   2/3
  ¬b     4/7   1/3
  sum     1     1

3. Produce joint probability table by multiplication

          a     ¬a    sum
   b     0.3   0.2   0.5
  ¬b     0.4   0.1   0.5
  sum    0.7   0.3    1

4. Condition on evidence

          a     ¬a
   b     0.3   0.2

5. Normalise table probabilities to sum to 1

          a     ¬a
   b     0.6   0.4

def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables.

    'prior' is a Python list. 'conditional' is a list of lists
    ('row' variable conditional on 'column' variable). 'obs_level'
    is the index of the observed level of the row variable.
    """
    levels = len(prior)
    # condition on observed level
    result = conditional[obs_level]
    # multiply values by prior probabilities
    result = [result[i] * prior[i] for i in range(levels)]
    # get marginal probability of observed level
    marg_prob = sum(result)
    # normalise the current values to sum to 1
    posterior = [value / marg_prob for value in result]
    return posterior

Note: conditioning can be carried out before calculating the joint probabilities, reducing the cost of inference

>>> A = [0.7, 0.3]
>>> B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]
>>> Bayes(A, B_given_A, 0)
[0.59999999999999998, 0.39999999999999997]
>>>

The posterior distribution can be used as a new prior and combined with evidence from further observed variables

Although computationally efficient, this ‘naïve’ approach implies assumptions that can lead to problems
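Reusing a posterior as the next prior can be sketched as below. This is an illustration only: the helper `bayes_update` mirrors the slide's function, and the table `c_given_a` for a second observed variable C is hypothetical (its numbers are invented for the example). The chaining is only valid if B and C are conditionally independent given A, which is the ‘naïve’ assumption.

```python
def bayes_update(prior, conditional, obs_level):
    """One Bayes step; conditional[j][i] = P(observed level j | level i of A)."""
    unnormalised = [conditional[obs_level][i] * prior[i]
                    for i in range(len(prior))]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]

prior_a = [0.7, 0.3]
b_given_a = [[3.0 / 7, 2.0 / 3], [4.0 / 7, 1.0 / 3]]  # table from the slides
c_given_a = [[0.9, 0.2], [0.1, 0.8]]                  # hypothetical P(C|A)

# Observe B at level 0, then reuse the posterior as the prior for C
posterior_after_b = bayes_update(prior_a, b_given_a, 0)
posterior_after_bc = bayes_update(posterior_after_b, c_given_a, 0)
```

Each step costs only one pass over the levels of A, which is why the approach is cheap.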

Naive Bayes

$$P(A, B, C) = P(C \mid A)\,P(B \mid A)\,P(A)$$

[Figure: graph with node A linked to nodes B and C; B and C are not linked to each other.]

A ‘correct’ factorisation

$$P(A, B, C) = P(C \mid A, B)\,P(B \mid A)\,P(A)$$

[Figure: graph with nodes A, B and C, all linked to one another.]

Conditional independence

The Naive Bayes example assumes:

$$P(C \mid A, B) = P(C \mid A)$$

But if valid, the calculation is easier and fewer probabilities need to be specified

The conditional independence implies that if A is observed, then evidence on B is irrelevant in calculating the posterior of C.
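The irrelevance of B once A is observed can be checked numerically. The sketch below builds a joint table from the naive factorisation and compares the posterior of C with and without evidence on B. The values in `p_c_given_a` are hypothetical, and the tables here are indexed `[a][b]` and `[a][c]`, which is the transpose of the slide's `B_given_A` layout.

```python
from itertools import product

p_a = [0.7, 0.3]
p_b_given_a = [[3.0 / 7, 4.0 / 7], [2.0 / 3, 1.0 / 3]]  # p_b_given_a[a][b]
p_c_given_a = [[0.9, 0.1], [0.2, 0.8]]                  # hypothetical, [a][c]

# Joint table from the naive factorisation P(A) P(B|A) P(C|A)
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]
         for a, b, c in product(range(2), repeat=3)}

def posterior_c(a, b=None):
    """Posterior of C given A=a and, optionally, B=b."""
    weights = []
    for c in range(2):
        weights.append(sum(p for (ai, bi, ci), p in joint.items()
                           if ai == a and ci == c
                           and (b is None or bi == b)))
    total = sum(weights)
    return [w / total for w in weights]

with_a_only = posterior_c(a=0)
with_a_and_b = posterior_c(a=0, b=1)  # same distribution up to rounding
```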

A Bayesian Belief Network

R and S are independent until H is observed

[Figure: directed graph on nodes W, H, R and S, with R a parent of W and H, and S a parent of H.]

A Markov Graph

The conditional independence structure is found by marrying parents with common children

[Figure: undirected graph on nodes W, H, R and S.]
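Marrying parents (moralisation) can be sketched in a few lines. The function name `moralise` and the dictionary representation are assumptions made for this illustration; the parent sets follow the factorisation P(W|R) P(R) P(S) P(H|R,S).

```python
from itertools import combinations

def moralise(parents):
    """Return the undirected edges of the moral graph.

    'parents' maps each node to a list of its parent nodes.
    """
    edges = set()
    for child, node_parents in parents.items():
        # keep every parent-child link, dropping its direction
        for parent in node_parents:
            edges.add(frozenset((parent, child)))
        # marry every pair of parents that share this child
        for first, second in combinations(node_parents, 2):
            edges.add(frozenset((first, second)))
    return edges

parents = {"R": [], "S": [], "W": ["R"], "H": ["R", "S"]}
moral_edges = moralise(parents)
```

For this network the only added edge joins R and S, the two parents of H.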

Factoring

The following factorisation is implied

$$P(H, R, S, W) = P(W \mid R)\,P(R)\,P(S)\,P(H \mid R, S)$$

So P(S) can be calculated as follows (although there is little point, yet)

$$P(S) = \sum_{W, H, R} P(S)\,P(R)\,P(H \mid R, S)\,P(W \mid R)$$

If H and W are observed to be in states h and w, then the posterior of S can be expressed as follows (where epsilon denotes ‘the evidence’)

$$P(S \mid \varepsilon) \propto P(S) \sum_{R} P(R)\,P(H = h \mid R, S)\,P(W = w \mid R)$$
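A minimal sketch of this calculation for binary variables follows. All probability values are hypothetical (the slides give none for this network), and the list layouts are assumptions stated in the comments.

```python
# Hypothetical tables; index 0 = "no", 1 = "yes"
p_r = [0.8, 0.2]                          # P(R)
p_s = [0.9, 0.1]                          # P(S)
p_w_given_r = [[0.8, 0.2], [0.0, 1.0]]    # p_w_given_r[r][w]
p_h_given_rs = [[[1.0, 0.0], [0.1, 0.9]],
                [[0.0, 1.0], [0.0, 1.0]]]  # p_h_given_rs[r][s][h]

def posterior_s(h, w):
    """Posterior of S given H=h and W=w, summing R out of the factors."""
    unnormalised = []
    for s in range(2):
        over_r = sum(p_r[r] * p_h_given_rs[r][s][h] * p_w_given_r[r][w]
                     for r in range(2))
        unnormalised.append(p_s[s] * over_r)
    total = sum(unnormalised)
    return [value / total for value in unnormalised]

posterior = posterior_s(h=1, w=1)
```

Because the evidence fixes H and W, only R has to be summed out, so the full joint table is never built.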

Graph Triangulation

[Figure: three drawings of a graph on nodes A, B, C, D and E.]
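One common way to triangulate a graph is node elimination: remove nodes one at a time and join the remaining neighbours of each removed node. The sketch below is not taken from the slides; the five-node cycle and the elimination order are hypothetical, chosen only because the figure's edges were not preserved.

```python
from itertools import combinations

def triangulate(adjacency, order):
    """Return the fill-in edges added when eliminating nodes in 'order'."""
    adj = {node: set(nbrs) for node, nbrs in adjacency.items()}
    fill_in = []
    for node in order:
        neighbours = adj[node]
        # join every pair of neighbours that is not already linked
        for u, v in combinations(sorted(neighbours), 2):
            if v not in adj[u]:
                adj[u].add(v)
                adj[v].add(u)
                fill_in.append((u, v))
        # remove the eliminated node from the working graph
        for nbr in neighbours:
            adj[nbr].discard(node)
        del adj[node]
    return fill_in

# Hypothetical cycle A-B-C-D-E-A
cycle = {"A": {"B", "E"}, "B": {"A", "C"}, "C": {"B", "D"},
         "D": {"C", "E"}, "E": {"D", "A"}}
fill_in = triangulate(cycle, ["A", "B", "C", "D", "E"])
```

Different elimination orders can add different numbers of edges, which is why the choice of order matters for efficiency.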

Belief Propagation

Message passing in a Clique Tree

[Figure: clique tree with cliques (W,R) and (R,H,S), joined through the separator R.]

Message passing in a Directed Junction Tree

[Figure: directed junction tree with node labels W,R; R,H,S; R,S; and S.]

A Typical BBN

Belief Network Summary

Inference requires a decomposable graph

Efficient inference requires a good decomposition

Inference involves evidence instantiation, table combination and variable marginalisation

Statistical Disclosure Control

Releases of small area population (census) data

Attribution occurs when a data intruder can make inferences (with probability 1) about a member of the population

Negative Attribution - An individual who is an accountant does not work for Department C

Positive Attribution - An individual who works in Department C is a lawyer

                  Department
Profession      A     B     C    Row sum
Lawyer         18     4     2      24
Accountant      2     3     0       5
Col sum        20     7     2      29
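Both kinds of attribution can be read off the zero cells of the table. The sketch below is illustrative; the function `attributions` and its return format are assumptions, not part of the software described in the slides.

```python
professions = ["Lawyer", "Accountant"]
departments = ["A", "B", "C"]
counts = [[18, 4, 2],
          [2, 3, 0]]

def attributions(counts, row_labels, col_labels):
    """Return (negative, positive) attributions implied by zero cells."""
    # negative: a zero cell rules out that row/column combination
    negative = [(row_labels[i], col_labels[j])
                for i, row in enumerate(counts)
                for j, n in enumerate(row) if n == 0]
    # positive: a column with a single non-zero cell fixes the row category
    positive = []
    for j, col_label in enumerate(col_labels):
        non_zero = [row_labels[i] for i in range(len(counts))
                    if counts[i][j] > 0]
        if len(non_zero) == 1:
            positive.append((col_label, non_zero[0]))
    return negative, positive

negative, positive = attributions(counts, professions, departments)
```

Here `negative` pairs Accountant with Department C, and `positive` pairs Department C with Lawyer, matching the two statements above.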

Release of the full table is not safe from an attribute disclosure perspective (it contains a zero)

Each of the two marginal tables is safe (neither contains a zero)

Is the release of the two marginal tables ‘jointly’ safe?

The Bounds Problem

Given a set of released tables (relating to the same population), what inferences about the counts in the ‘full’ table can be made?

Can a data intruder derive an upper bound of zero for any cell count?

A non-graphical case

All 2 × 2 marginals of a 2 × 2 × 2 table

A maximal complete subgraph (clique) without an individual corresponding table

[Figure: graph on nodes A, B and C.]

            Var1
Var2      A     B
C         3     9
D         2     2

            Var1
Var3      A     B
E         1    10
F         4     1

            Var2
Var3      C     D
E         8     3
F         4     1

            Var1 and Var2
Var3    A, C   A, D   B, C   B, D
E         0      1      8      2
F         3      1      1      0

Original cell counts can be recovered from the marginal tables
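The recovery can be verified by brute force. A 2 × 2 × 2 table with all three two-way margins fixed has a single free cell, so the sketch below tries every value of the (A, C, E) count and keeps the tables with no negative cells. The function name and dictionary layout are assumptions for illustration.

```python
# Two-way margins from the slides
n12 = {("A", "C"): 3, ("B", "C"): 9, ("A", "D"): 2, ("B", "D"): 2}
n13 = {("A", "E"): 1, ("B", "E"): 10, ("A", "F"): 4, ("B", "F"): 1}
n23 = {("C", "E"): 8, ("D", "E"): 3, ("C", "F"): 4, ("D", "F"): 1}

def consistent_tables():
    """List every non-negative full table that matches the margins."""
    solutions = []
    limit = min(n12["A", "C"], n13["A", "E"], n23["C", "E"])
    for x in range(limit + 1):
        t = {("A", "C", "E"): x}
        # each remaining cell follows from a margin and earlier cells
        t["A", "C", "F"] = n12["A", "C"] - x
        t["A", "D", "E"] = n13["A", "E"] - x
        t["B", "C", "E"] = n23["C", "E"] - x
        t["A", "D", "F"] = n12["A", "D"] - t["A", "D", "E"]
        t["B", "C", "F"] = n12["B", "C"] - t["B", "C", "E"]
        t["B", "D", "E"] = n23["D", "E"] - t["A", "D", "E"]
        t["B", "D", "F"] = n12["B", "D"] - t["B", "D", "E"]
        if min(t.values()) >= 0:
            solutions.append(t)
    return solutions

solutions = consistent_tables()
```

Only one table survives, so an intruder holding the three margins knows every cell, including the two zeros.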

Each cell’s upper bound is the minimum of its relevant margins (Dobra and Fienberg)

                  Department
Profession      A     B     C    Row sum
Lawyer         20     7     2      24
Accountant      5     5     2       5
Col sum        20     7     2      29
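For the two released margins, the bounds can be computed directly. The upper bound follows the rule above; the lower bound uses the standard Fréchet result for a two-way table, max(0, row sum + column sum - total), which is an addition here and is not stated on the slides.

```python
row_sums = {"Lawyer": 24, "Accountant": 5}
col_sums = {"A": 20, "B": 7, "C": 2}
total = sum(row_sums.values())

# Upper bound: minimum of the cell's row and column margins
upper = {(p, d): min(r, c)
         for p, r in row_sums.items() for d, c in col_sums.items()}

# Lower bound (Frechet): counts forced by the margins
lower = {(p, d): max(0, r + c - total)
         for p, r in row_sums.items() for d, c in col_sums.items()}

# Cells an intruder could prove to be empty
provably_zero = [cell for cell, bound in upper.items() if bound == 0]
```

No upper bound is zero in this example, so these two margins alone do not let an intruder infer the empty Accountant / Department C cell.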

SDC Summary

A set of released tables relating to a given population

If the resulting graph is both graphical and decomposable, then the upper bounds can be derived efficiently

Common aspects

Graphical representations

Graphs / cliques / nodes / trees

Combination of tables

Pointwise operations

BBNs

pointwise multiplication

SDC

pointwise minimum

and

pointwise addition and pointwise subtraction (for calculating exact lower bounds)

Coercing Numeric built-ins

A table is a numeric array with an associated list of variables

Marginalisation is trivial, using the built-in Numeric.add.reduce() function and removing the relevant variable from the list

Conditioning is easily achieved using a Numeric.take() slice, appropriately reshaping the array with Numeric.reshape() and removing the variable from the list
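Marginalisation and conditioning on such a table can be sketched with NumPy, used here as a stand-in for Numeric (an assumption, since the slides target Numeric). The `Table` class and its method names are hypothetical and are not the author's implementation.

```python
import numpy as np

class Table:
    """Hypothetical minimal table: an array plus one variable per axis."""

    def __init__(self, values, variables):
        self.values = np.asarray(values)
        self.variables = list(variables)

    def marginalise(self, variable):
        """Sum out 'variable' and drop it from the variable list."""
        axis = self.variables.index(variable)
        remaining = [v for v in self.variables if v != variable]
        return Table(np.add.reduce(self.values, axis), remaining)

    def condition(self, variable, level):
        """Keep only the slice where 'variable' is at index 'level'."""
        axis = self.variables.index(variable)
        remaining = [v for v in self.variables if v != variable]
        return Table(np.take(self.values, level, axis=axis), remaining)

counts = Table([[18, 4, 2], [2, 3, 0]], ["profession", "department"])
by_department = counts.marginalise("profession")
accountants = counts.condition("profession", 1)
```

With a scalar index, `np.take` already drops the conditioned axis, so no separate reshape step is needed in this sketch.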

Pointwise multiplication

Numeric.multiply() generates the appropriate table IF the two tables have identical ranks and variable lists

This is ensured by adding new axes (Numeric.NewAxis) for the ‘missing’ axes and transposing one of the tables (Numeric.transpose()) so that the variable lists match

array([24, 5]) ['profession'] (2,)

array([20, 7, 2]) ['department'] (3,)

array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

>>> prof * dept
array([[480, 168, 48],
       [100, 35, 10]])
['profession', 'department']
>>> (prof * dept).normalise(29)
array([[ 16.551, 5.793, 1.655],
       [ 3.448, 1.206, 0.344]])
['profession', 'department']
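The alignment step can be sketched with NumPy in place of Numeric (an assumption). The helper `align` is hypothetical, and the final rescaling assumes that `normalise(29)` scales the table so that its entries sum to 29, which is consistent with the output shown above.

```python
import numpy as np

def align(values, variables, target_variables):
    """Give 'values' one axis per target variable, in target order."""
    present = [v for v in target_variables if v in variables]
    # transpose the existing axes into the target order
    values = np.transpose(values, [variables.index(v) for v in present])
    # add a length-1 axis wherever the table lacks a target variable
    index = tuple(slice(None) if v in variables else np.newaxis
                  for v in target_variables)
    return values[index]

variables = ["profession", "department"]
prof = align(np.array([24, 5]), ["profession"], variables)     # (2, 1)
dept = align(np.array([20, 7, 2]), ["department"], variables)  # (1, 3)

product = np.multiply(prof, dept)
expected = product * 29.0 / product.sum()  # rescaled to sum to 29
```

The rescaled product gives the expected counts under independence of profession and department.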

Pointwise minimum / addition / subtraction

Numeric.minimum(), Numeric.add() and Numeric.subtract() generate the appropriate tables IF the two tables have identical ranks and variable lists AND the two tables also have identical shape

This is ensured by a secondary preprocessing stage where the tables from the first preprocessing stage are multiplied by a ‘correctly’ shaped table of ones (this is actually quicker than using Numeric.concatenate())

array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

array([[20, 7, 2],
       [20, 7, 2]]) (2, 3) (2nd stage preprocessing)

>>> prof.minimum(dept)
array([[20, 7, 2],
       [ 5, 5, 2]])
['profession', 'department']
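The second preprocessing stage can be reproduced with NumPy, again as a stand-in for Numeric (an assumption). Multiplying by a table of ones expands both tables to the full shape before the minimum is taken.

```python
import numpy as np

prof = np.array([[24], [5]])    # shape (2, 1) after the first stage
dept = np.array([[20, 7, 2]])   # shape (1, 3) after the first stage

# Second stage: expand both tables to the common (2, 3) shape
ones = np.ones((2, 3), dtype=int)
prof_full = prof * ones
dept_full = dept * ones

upper_bounds = np.minimum(prof_full, dept_full)

# NumPy's minimum also broadcasts on its own, so the expansion is optional
upper_bounds_direct = np.minimum(prof, dept)
```

In current NumPy the explicit expansion is mainly useful for seeing what the slides describe; the broadcasting version produces the same table.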

Summary

The Bayesian Belief Network software was originally implemented in Python for two reasons

1. The author was, at the time, a relatively inexperienced programmer

2. Self-learning (albeit with some help) was the only option

The SDC software was implemented in Python because:

1. Python + Numeric turned out to be a wholly appropriate solution for BBNs (Python is powerful, Numeric is fast)

2. Existing code could be reused