Aspects of Bayesian Inference and Statistical Disclosure Control in Python

Duncan Smith
Confidentiality and Privacy Group
CCSR, University of Manchester

Introduction

Bayesian Belief Networks (BBNs)

probabilistic inference

Statistical Disclosure Control (SDC)

deterministic inference (attribution)

Bayesian Belief Networks

Decision-making in complex domains

Hard and soft evidence

Correlated variables

Many variables

Bayes’ Rule

P(A|B) = P(B|A) P(A) / P(B)

A prior belief and evidence are combined to give a posterior belief

Venn Diagram

[Venn diagram of Event A and Event B, with regions "Both A & B", "A only", "B only" and "Neither A nor B"]

1. Prior probability table P(A)

         a     ā
        0.7   0.3

2. Conditional probability table P(B|A)

         a     ā
   b    3/7   2/3
   b̄    4/7   1/3
         1     1

Inference

3. Produce joint probability table by multiplication

         a     ā
   b    0.3   0.2   0.5
   b̄    0.4   0.1   0.5
        0.7   0.3    1

4. Condition on evidence (B = b observed)

         a     ā
   b    0.3   0.2

5. Normalise table probabilities to sum to 1

         a     ā
   b    0.6   0.4

def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables.

    'prior' is a Python list. 'conditional' is a list of lists
    ('column' variable conditional on 'row' variable). 'obs_level'
    is the index of the observed level of the row variable."""
    levels = len(prior)
    # condition on observed level
    result = conditional[obs_level]
    # multiply values by prior probabilities
    result = [result[i] * prior[i] for i in range(levels)]
    # get marginal probability of observed level
    marg_prob = sum(result)
    # normalise the current values to sum to 1
    posterior = [value / marg_prob for value in result]
    return posterior

Note: conditioning can be carried out before calculating the joint probabilities, reducing the cost of inference

>>> A = [0.7, 0.3]
>>> B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]
>>> Bayes(A, B_given_A, 0)
[0.59999999999999998, 0.39999999999999997]

The posterior distribution can be used as a new prior and combined with evidence from further observed variables
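This sequential updating can be sketched as follows. The prior and P(B|A) come from the slides; C_given_A is a hypothetical conditional table for a second observed variable, invented here for illustration.

```python
def bayes(prior, conditional, obs_level):
    """Posterior over A given the observed level of the child variable."""
    result = [conditional[obs_level][i] * prior[i] for i in range(len(prior))]
    marg_prob = sum(result)
    return [value / marg_prob for value in result]

A = [0.7, 0.3]
B_given_A = [[3.0 / 7, 2.0 / 3], [4.0 / 7, 1.0 / 3]]
C_given_A = [[0.9, 0.5], [0.1, 0.5]]        # assumed values, illustration only

posterior = bayes(A, B_given_A, 0)          # observe B = b  ->  [0.6, 0.4]
posterior = bayes(posterior, C_given_A, 0)  # observe C = c, posterior reused as prior
```

Each observation simply rescales the current belief over A, which is why the naive scheme is so cheap.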

Although computationally efficient, this ‘naïve’ approach implies assumptions that can lead to problems

Naive Bayes

P(A,B,C) = P(C|A) P(B|A) P(A)

[DAG: A with arrows to B and C]

A ‘correct’ factorisation

P(A,B,C) = P(C|A,B) P(B|A) P(A)

[DAG: A with arrows to B and C, and B with an arrow to C]

Conditional independence

The Naive Bayes example assumes:

P(C|A,B) = P(C|A)

But if valid, the calculation is easier and fewer probabilities need to be specified

This conditional independence implies that if A is observed, then evidence on B is irrelevant in calculating the posterior of C

A Bayesian Belief Network

R and S are independent until H is observed

[DAG: R → W, R → H, S → H]

A Markov Graph

The conditional independence structure is found by marrying parents with common children

[Undirected graph: W – R, R – H, S – H, plus the added R – S edge]

Factoring

The following factorisation is implied:

P(W,H,R,S) = P(W|R) P(R) P(S) P(H|R,S)

So P(S) can be calculated as follows (although there is little point, yet):

P(S) = Σ_{W,H,R} P(W|R) P(R) P(S) P(H|R,S)

If H and W are observed to be in states h and w, then the posterior of S can be expressed as follows (where ε denotes ‘the evidence’):

P(S|ε) ∝ P(S) Σ_R P(R) P(H=h|R,S) P(W=w|R)
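The summation above can be carried out directly in a few lines. The slides give no numbers for this network, so every CPT value below is invented purely for illustration (index 0 means rain / sprinkler-on / wet).

```python
# Posterior of S given H = h and W = w:
#   P(S|e) ∝ P(S) * sum_R P(R) * P(H=h|R,S) * P(W=w|R)
P_R = [0.2, 0.8]                  # P(R) -- assumed
P_S = [0.1, 0.9]                  # P(S) -- assumed
P_h_given_RS = [[1.0, 0.9],       # P(H = h | R, S), rows R, columns S -- assumed
                [0.95, 0.0]]
P_w_given_R = [1.0, 0.2]          # P(W = w | R) -- assumed

unnorm = []
for s in range(2):
    total = sum(P_R[r] * P_h_given_RS[r][s] * P_w_given_R[r] for r in range(2))
    unnorm.append(P_S[s] * total)

posterior_S = [v / sum(unnorm) for v in unnorm]
```

Only R needs to be summed over: the unobserved variables that sum to 1 have already dropped out of the expression.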

Graph Triangulation

[Three graphs over nodes A, B, C, D, E, showing the stages of triangulation]

Belief Propagation

Message passing in a Clique Tree

[Clique tree: cliques (W,R) and (R,H,S) joined by separator R]

Message passing in a Directed Junction Tree

[Directed junction tree over cliques (W,R) and (R,H,S), with separators (R,S) and S]

A Typical BBN

Belief Network Summary

Inference requires a decomposable graph

Efficient inference requires a good decomposition

Inference involves evidence instantiation, table combination and variable marginalisation

Statistical Disclosure Control

Releases of small area population (census) data

Attribution occurs when a data intruder can make inferences (with probability 1) about a member of the population

Negative Attribution - An individual who is an accountant does not work for Department C

Positive Attribution - An individual who works in Department C is a lawyer

Profession \ Department     A    B    C   Row sum
Lawyer                     18    4    2      24
Accountant                  2    3    0       5
Col sum                    20    7    2      29

Release of the full table is not safe from an attribute disclosure perspective (it contains a zero)

Each of the two marginal tables is safe (neither contains a zero)

Is the release of the two marginal tables ‘jointly’ safe?

The Bounds Problem

Given a set of released tables (relating to the same population), what inferences about the counts in the ‘full’ table can be made?

Can a data intruder derive an upper bound of zero for any cell count?

A non-graphical case

All 2 × 2 marginals of a 2×2×2 table

A maximal complete subgraph (clique) without an individual corresponding table

[Graph: triangle over A, B, C]

Var1 × Var2:
         A    B
   C     3    9
   D     2    2

Var1 × Var3:
         A    B
   E     1   10
   F     4    1

Var2 × Var3:
         C    D
   E     8    3
   F     4    1

Var1 and Var2:
        A,C  A,D  B,C  B,D
   E     0    1    8    2
   F     3    1    1    0

Original cell counts can be recovered from the marginal tables

Each cell’s upper bound is the minimum of its relevant margins (Dobra and Fienberg)

Upper bounds derived from the margins:

Profession \ Department     A    B    C   Row sum
Lawyer                     20    7    2      24
Accountant                  5    5    2       5
Col sum                    20    7    2      29
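The minimum-of-margins rule is easy to reproduce: broadcast each released margin over the axes it does not constrain, then take the pointwise minimum. A minimal NumPy sketch using the profession and department margins from the slides:

```python
import numpy as np

prof = np.array([24, 5])     # ['profession'] margin
dept = np.array([20, 7, 2])  # ['department'] margin

# Each (profession, department) cell is bounded above by the smaller of
# its two relevant margins.
upper = np.minimum(prof[:, None], dept[None, :])
# upper == [[20, 7, 2],
#           [ 5, 5, 2]]
```

No upper bound of zero appears, so neither margin release permits attribution on its own.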

SDC Summary

A set of released tables relating to a given population

If the resulting graph is both graphical and decomposable, then the upper bounds can be derived efficiently

Common aspects

Graphical representations

Graphs / cliques / nodes / trees

Combination of tables

Pointwise operations

BBNs

pointwise multiplication

SDC

pointwise minimum and pointwise addition

pointwise subtraction (for calculating exact lower bounds)

Coercing Numeric built-ins

A table is a numeric array with an associated list of variables

Marginalisation is trivial, using the built-in Numeric.add.reduce() function and removing the relevant variable from the list

Conditioning is easily achieved using a Numeric.take() slice, appropriately reshaping the array with Numeric.reshape() and removing the variable from the list
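The two operations above translate directly to modern NumPy (Numeric's successor). A sketch using the full profession × department table from the earlier slide:

```python
import numpy as np

table = np.array([[18, 4, 2],
                  [ 2, 3, 0]])
variables = ['profession', 'department']

# Marginalise out 'department': sum over its axis, then drop it from the list.
axis = variables.index('department')
margin = np.add.reduce(table, axis=axis)    # array([24, 5])

# Condition on department = A: take the slice, then drop the variable.
conditioned = np.take(table, 0, axis=axis)  # array([18, 2])
```

In both cases the array operation and the bookkeeping on the variable list together define the new table.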

Pointwise multiplication

Numeric.multiply() generates the appropriate table IF the two tables have identical ranks and variable lists

This is ensured by adding new axes (Numeric.NewAxis) for the ‘missing’ axes and transposing one of the tables (Numeric.transpose()) so that the variable lists match

array([24, 5]) ['profession'] (2,)
array([20, 7, 2]) ['department'] (3,)

After preprocessing:

array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

>>> prof * dept
array([[480, 168, 48],
       [100,  35, 10]]) ['profession', 'department']

>>> (prof * dept).normalise(29)
array([[ 16.551,  5.793,  1.655],
       [  3.448,  1.206,  0.344]]) ['profession', 'department']
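The same session can be sketched in plain modern NumPy. The `normalise` method belongs to the author's table class; scaling so the table sums to the population size (29) reproduces its numbers here, as an assumption about what it does.

```python
import numpy as np

# Add an axis for each 'missing' variable (np.newaxis is the modern
# spelling of Numeric.NewAxis), so both tables range over
# ['profession', 'department'].
prof = np.array([24, 5])[:, np.newaxis]     # shape (2, 1)
dept = np.array([20, 7, 2])[np.newaxis, :]  # shape (1, 3)

joint = prof * dept                         # [[480, 168, 48], [100, 35, 10]]
normalised = joint / joint.sum() * 29       # sums to the population size, 29
```

Broadcasting makes the explicit transpose unnecessary once the axes line up with the shared variable list.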

Pointwise minimum / addition / subtraction

Numeric.minimum(), Numeric.add() and Numeric.subtract() generate the appropriate tables IF the two tables have identical ranks and variable lists AND the two tables also have identical shape

This is ensured by a secondary preprocessing stage where the tables from the first preprocessing stage are multiplied by a ‘correctly’ shaped table of ones (this is actually quicker than using Numeric.concatenate())

array([[24],
       [ 5]]) (2, 1) ['profession', 'department']

array([[20, 7, 2]]) (1, 3) ['profession', 'department']

array([[20, 7, 2],
       [20, 7, 2]]) (2, 3) (2nd stage preprocessing)

>>> prof.minimum(dept)
array([[20, 7, 2],
       [ 5, 5, 2]]) ['profession', 'department']
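The ones-multiplication trick can be sketched in modern NumPy. Today `np.minimum` would broadcast the two shapes directly, but the version below mirrors the Numeric-era second preprocessing stage described above.

```python
import numpy as np

prof = np.array([[24], [5]])    # (2, 1) after first-stage preprocessing
dept = np.array([[20, 7, 2]])   # (1, 3)

# Multiply by a table of ones of the full shape so both operands end up
# with identical (2, 3) shapes before the pointwise minimum.
ones = np.ones((2, 3), dtype=int)
prof_full = prof * ones         # [[24, 24, 24], [5, 5, 5]]
dept_full = dept * ones         # [[20, 7, 2], [20, 7, 2]]

bounds = np.minimum(prof_full, dept_full)   # [[20, 7, 2], [5, 5, 2]]
```

The result matches the prof.minimum(dept) session output above.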

Summary

The Bayesian Belief Network software was originally implemented in Python for two reasons:

1. The author was, at the time, a relatively inexperienced programmer

2. Self-learning (albeit with some help) was the only option

The SDC software was implemented in Python because:

1. Python + Numeric turned out to be a wholly appropriate solution for BBNs (Python is powerful, Numeric is fast)

2. Existing code could be reused
