
  • Tony Jebara, Columbia University

    Machine Learning 4771

    Instructor: Tony Jebara

  • Tony Jebara, Columbia University

    Topic 15
    • Graphical Models
    • Maximum Likelihood for Graphical Models
    • Testing for Conditional Independence & D-Separation
    • Bayes Ball

  • Tony Jebara, Columbia University

    Learning Fully Observed Models
    • Easiest scenario: we have observed all the nodes
    • Want to learn the probability tables from data…
    • Simplest case: the least general graph; handle each dimension individually as a Bernoulli/Multinomial
    • Second simplest case: the most general graph; count each entry of the full joint pdf (a small counting sketch follows at the end of this slide)
    • What about learning graphs in between?
    • Have N i.i.d. patients:

    PATIENT   FLU   FEVER   SINUS   TEMP   SWELL   HEAD
       1       Y      Y       N      L      Y       Y
       2       N      N       N      M      N       Y
       3       Y      N       Y      H      Y       N
       4       Y      N       Y      M      N       N

    [Figure: graphical model over the patient variables (e.g. FLU, TEMP, SINUS); each observed patient adds +1 to the corresponding entry of each count table.]

    Divide by the total count, since $\sum_{x_1} \cdots \sum_{x_6} p(x) = 1$.
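
The two extreme cases above can be written in a few lines of code. Below is a minimal sketch (not from the lecture; the binarized toy data and the column ordering are my own assumptions) contrasting the per-dimension Bernoulli model with the full joint count table.

```python
# Contrast the two extreme models for fully observed discrete data:
# independent per-dimension Bernoullis vs. one big joint count table.
import numpy as np
from collections import Counter

# 4 patients x 6 binarized columns: FLU, FEVER, SINUS, TEMP(high=1), SWELL, HEAD
X = np.array([[1, 1, 0, 0, 1, 1],
              [0, 0, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0, 0]])
N, M = X.shape

# Simplest model: treat every dimension as an independent Bernoulli.
p_bernoulli = X.mean(axis=0)               # p(x_i = 1) for each of the M columns
print("per-dimension p(x_i = 1):", p_bernoulli)

# Most general model: count every full configuration and divide by the total
# count, so that sum_{x_1} ... sum_{x_M} p(x) = 1.
joint_counts = Counter(map(tuple, X))
p_joint = {cfg: c / N for cfg, c in joint_counts.items()}
print("joint table entries:", p_joint)     # needs up to 2^M entries in general
```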

  • Tony Jebara, Columbia University

    Maximum Likelihood CPTs
    • Each conditional probability table $\theta_i$ is part of our parameters
    • Given the tables, we have the pdf: $p(X_U \mid \theta) = \prod_{i=1}^{M} p(x_i \mid \pi_i, \theta_i)$
    • Have M variables: $X_U = \{x_1, \ldots, x_M\}$
    • Have an N x M dataset: $D = \{X_{U,1}, \ldots, X_{U,N}\}$
    • Maximum likelihood:
      $\theta^* = \arg\max_\theta \log p(D \mid \theta) = \arg\max_\theta \sum_{n=1}^{N} \log p(X_{U,n} \mid \theta) = \arg\max_\theta \sum_{n=1}^{N} \log \prod_{i=1}^{M} p(x_{i,n} \mid \pi_{i,n}, \theta_i) = \arg\max_\theta \sum_{n=1}^{N} \sum_{i=1}^{M} \log p(x_{i,n} \mid \pi_{i,n}, \theta_i)$

    Each $\theta_i$ appears independently, so we can do ML for each CPT alone! Efficient storage & efficient learning.
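
To make the factorized likelihood concrete, here is a small illustrative sketch (my own, with an assumed toy chain x1 → x2 → x3 and invented CPT values) that evaluates log p(X_U | θ) as a sum of per-CPT terms, one term per θ_i.

```python
# Evaluate log p(X_U | theta) = sum_i log p(x_i | pi_i, theta_i)
# for a toy 3-node chain with binary variables.
import numpy as np

parents = {0: (), 1: (0,), 2: (1,)}            # pi_i for each node i
theta = {
    0: {(): np.array([0.4, 0.6])},             # p(x1)
    1: {(0,): np.array([0.9, 0.1]),            # p(x2 | x1 = 0)
        (1,): np.array([0.3, 0.7])},           # p(x2 | x1 = 1)
    2: {(0,): np.array([0.8, 0.2]),
        (1,): np.array([0.5, 0.5])},           # p(x3 | x2)
}

def log_lik(sample, parents, theta):
    """Sum of one log-CPT lookup per node: the per-theta_i terms add up."""
    return sum(np.log(theta[i][tuple(sample[j] for j in parents[i])][sample[i]])
               for i in parents)

D = [(0, 0, 0), (1, 1, 1), (1, 0, 1)]          # N i.i.d. samples
print(sum(log_lik(x, parents, theta) for x in D))
```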

  • Tony Jebara, Columbia University

    Maximum Likelihood CPTs
    • Define the indicator
      $\delta(X_{U,n}, X_{U,m}) = \begin{cases} 1 & \text{if } X_{U,n} = X_{U,m} \\ 0 & \text{otherwise} \end{cases}$
    • Counts: the number of times whatever is in the bracket appeared in the data, for example:
      $m(x_i) = \sum_{n=1}^{N} \delta(x_i, x_{i,n}), \qquad m(X_U) = \sum_{n=1}^{N} \delta(X_U, X_{U,n}), \qquad m(X_C) = \sum_{X_U \setminus C} m(X_U)$
      $N = \sum_{x_1} m(x_1) = \sum_{x_1} \Big( \sum_{x_2} m(x_1, x_2) \Big) = \sum_{x_1} \Big( \sum_{x_2} \Big( \sum_{x_3} m(x_1, x_2, x_3) \Big) \Big)$
    • So the log-likelihood becomes
      $l(\theta) = \sum_{n=1}^{N} \log p(X_{U,n} \mid \theta) = \sum_{n=1}^{N} \log \prod_{X_U} p(X_U \mid \theta)^{\delta(X_U, X_{U,n})} = \sum_{n=1}^{N} \sum_{X_U} \delta(X_U, X_{U,n}) \log p(X_U \mid \theta)$
      $= \sum_{X_U} m(X_U) \log p(X_U \mid \theta) = \sum_{X_U} m(X_U) \log \prod_{i=1}^{M} p(x_i \mid \pi_i, \theta_i) = \sum_{X_U} m(X_U) \sum_{i=1}^{M} \log p(x_i \mid \pi_i, \theta_i)$
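
A brief sketch of the count notation (illustrative only; the tiny dataset is assumed): m(X_U) counts full configurations, any marginal count m(X_C) is obtained by summing m(X_U) over the remaining variables, and the counts always sum to N.

```python
# Full-configuration counts m(X_U) and marginal counts m(X_C).
from collections import Counter

D = [(1, 1, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1)]   # N = 4 samples of (x1, x2, x3)
m_full = Counter(D)                                 # m(x1, x2, x3)

def m(which, values):
    """Marginal count m(X_C): sum of m(X_U) over all variables not in `which`."""
    return sum(c for cfg, c in m_full.items()
               if all(cfg[i] == v for i, v in zip(which, values)))

print(m((0,), (1,)))         # m(x1 = 1) -> 3
print(m((0, 2), (1, 1)))     # m(x1 = 1, x3 = 1) -> 2
print(sum(m_full.values()))  # N = 4
```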

  • Tony Jebara, Columbia University

    Maximum Likelihood CPTs
    • Continuing:
      $l(\theta) = \sum_{X_U} m(X_U) \sum_{i=1}^{M} \log p(x_i \mid \pi_i, \theta_i) = \sum_{i=1}^{M} \sum_{x_i, \pi_i} \sum_{X_U \setminus x_i \setminus \pi_i} m(X_U) \log p(x_i \mid \pi_i, \theta_i) = \sum_{i=1}^{M} \sum_{x_i, \pi_i} m(x_i, \pi_i) \log p(x_i \mid \pi_i, \theta_i)$
    • Define: $\theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)$   Constraint: $\sum_{x_i} \theta(x_i, \pi_i) = 1$
    • Now we have the above with Lagrange multipliers:
      $l(\theta) = \sum_{i=1}^{M} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \log \theta(x_i, \pi_i) - \sum_{i=1}^{M} \sum_{\pi_i} \lambda_{\pi_i} \Big( \sum_{x_i} \theta(x_i, \pi_i) - 1 \Big)$
      $\dfrac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \dfrac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0 \;\Rightarrow\; \theta(x_i, \pi_i) = \dfrac{m(x_i, \pi_i)}{\lambda_{\pi_i}}$
    • Plug in the constraint:
      $\sum_{x_i} \dfrac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \;\Rightarrow\; \lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)$
    • Final solution (trivial!):
      $\theta(x_i, \pi_i) = \dfrac{m(x_i, \pi_i)}{m(\pi_i)}$
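
The closed-form answer is easy to implement. The following sketch (my own; the parent sets and toy samples are assumptions) computes θ(x_i, π_i) = m(x_i, π_i) / m(π_i) for every node by normalizing the child-and-parent counts by the parent counts.

```python
# Closed-form ML estimation of every CPT by counting and normalizing.
from collections import Counter

parents = {0: (), 1: (0,), 2: (0, 1)}              # pi_i for nodes x0, x1, x2
D = [(1, 1, 0), (0, 0, 0), (1, 0, 1), (1, 1, 1)]   # N i.i.d. fully observed samples

def ml_cpts(D, parents):
    cpts = {}
    for i, pa in parents.items():
        m_joint, m_parent = Counter(), Counter()
        for x in D:
            pi = tuple(x[j] for j in pa)
            m_joint[(x[i], pi)] += 1               # m(x_i, pi_i)
            m_parent[pi] += 1                      # m(pi_i)
        cpts[i] = {k: c / m_parent[k[1]] for k, c in m_joint.items()}
    return cpts

for i, cpt in ml_cpts(D, parents).items():
    print(f"theta_{i}:", cpt)
```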

  • Tony Jebara, Columbia University

    Maximum Likelihood CPTs (MAP version)
    • Same derivation as on the previous slide (same counts, same definition, constraint and Lagrange multipliers), but with a pseudo-count $\varepsilon$ added to every count.
    • Final solution (trivial!):
      $\theta(x_i, \pi_i) = \dfrac{m(x_i, \pi_i) + \varepsilon}{m(\pi_i) + \sum_{x_i} \varepsilon}$
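
For the MAP version the only change is adding ε everywhere; a short sketch follows (reading ε as a Dirichlet-style pseudo-count is my interpretation, and the toy data are assumed), so unseen configurations no longer give 0/0.

```python
# Smoothed (MAP-style) CPT: theta(x_i, pi_i) = (m(x_i, pi_i) + eps) / (m(pi_i) + eps * |x_i|).
from collections import Counter
from itertools import product

def map_cpt(D, child, pa, values, eps=1.0):
    m_joint = Counter((x[child], tuple(x[j] for j in pa)) for x in D)
    m_parent = Counter(tuple(x[j] for j in pa) for x in D)
    n_states = len(values[child])                  # |x_i|, number of child states
    return {(xi, pi): (m_joint[(xi, pi)] + eps) / (m_parent[pi] + eps * n_states)
            for pi in product(*[values[j] for j in pa]) for xi in values[child]}

D = [(1, 0), (0, 0), (1, 1), (1, 1)]               # samples of (x1, x3)
print(map_cpt(D, child=1, pa=(0,), values={0: (0, 1), 1: (0, 1)}))
```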

  • Tony Jebara, Columbia University

    Maximum Likelihood CPTs
    • Let's try an example: compute the CPT $p(x_3 \mid x_1)$
    • Using the formula: $\theta(x_i, \pi_i) = \dfrac{m(x_i, \pi_i)}{m(\pi_i)}$

    PATIENT   FLU   FEVER   SINUS   TEMP   SWELL   HEAD
       1       Y      Y       N      L      Y       Y
       2       N      N       N      M      N       Y
       3       Y      N       Y      H      Y       N
       4       Y      N       Y      M      N       N

    Counts $m(x_3, x_1)$:
                x_1 = 0   x_1 = 1
      x_3 = 0      1         1
      x_3 = 1      0         2

    Counts $m(x_1)$:   x_1 = 0: 1,   x_1 = 1: 3

    CPT $p(x_3 \mid x_1)$:
                x_1 = 0   x_1 = 1
      x_3 = 0      1        1/3
      x_3 = 1      0        2/3

    Note: here 0/0 = prior constant.
    Efficient: we only count over the subset of variables in $p(X_B \mid X_A)$, not over all of $p(x_1, \ldots, x_M)$.
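
As a sanity check, here is a tiny sketch (my own) that reproduces the worked example above: it counts m(x_3, x_1) and m(x_1) from the four patients and recovers the CPT entries 1, 0, 1/3 and 2/3.

```python
# Estimate p(SINUS | FLU) from the four patients by counting m(x3, x1) and m(x1).
from collections import Counter

flu   = [1, 0, 1, 1]          # x1 for patients 1..4 (Y = 1, N = 0)
sinus = [0, 0, 1, 1]          # x3 for patients 1..4

m_joint  = Counter(zip(sinus, flu))     # m(x3, x1)
m_parent = Counter(flu)                 # m(x1)

for x1 in (0, 1):
    for x3 in (0, 1):
        num, den = m_joint[(x3, x1)], m_parent[x1]
        print(f"p(x3={x3} | x1={x1}) =", num / den if den else "prior constant")
# prints 1 and 0 for x1 = 0, and 1/3 and 2/3 for x1 = 1, matching the table
```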

  • Tony Jebara, Columbia University

    Conditional Dependence Tests
    • Another thing we would like to do with a graphical model: check conditional independencies…
      "Is Temperature independent of Flu given Fever?"
      "Is Temperature independent of Sinus Infection given Fever?"
    • Try computing & simplifying marginals of $p(x)$:
      $p(X) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
      $p(x_4 \mid x_1, x_2, x_3) = \dfrac{p(x_1, x_2, x_3, x_4)}{p(x_1, x_2, x_3)} = \dfrac{\sum_{x_5} \sum_{x_6} p(X)}{\sum_{x_4} \sum_{x_5} \sum_{x_6} p(X)} = \dfrac{p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)}{p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)} = p(x_4 \mid x_2)$
      so $x_4 \perp x_1, x_3 \mid x_2$
    • In this case it was easy. What if we are checking $x_1 \perp x_6 \mid x_2, x_3$, i.e. $p(x_1 \mid x_2, x_3, x_6)$?
    • Hard to compute; we want an efficient algorithm…
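
The same simplification can be checked numerically by brute force. Below is a sketch (the CPT values are random placeholders, my own illustration rather than the lecture's code) that builds the six-node joint above, marginalizes out x5 and x6, and verifies that p(x4 | x1, x2, x3) matches p(x4 | x2) for every configuration.

```python
# Brute-force check that p(x4 | x1, x2, x3) = p(x4 | x2) for this graph.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def rand_cpt(n_parent_configs, k=2):
    t = rng.random((n_parent_configs, k))
    return t / t.sum(axis=1, keepdims=True)

p1          = rand_cpt(1)[0]     # p(x1)
p2_given_1  = rand_cpt(2)        # p(x2 | x1)
p3_given_1  = rand_cpt(2)        # p(x3 | x1)
p4_given_2  = rand_cpt(2)        # p(x4 | x2)
p5_given_3  = rand_cpt(2)        # p(x5 | x3)
p6_given_25 = rand_cpt(4)        # p(x6 | x2, x5), parent config indexed as 2*x2 + x5

def joint(x):
    x1, x2, x3, x4, x5, x6 = x
    return (p1[x1] * p2_given_1[x1, x2] * p3_given_1[x1, x3] *
            p4_given_2[x2, x4] * p5_given_3[x3, x5] * p6_given_25[2 * x2 + x5, x6])

def cond_x4(x1, x2, x3, x4):
    num = sum(joint((x1, x2, x3, x4, x5, x6))
              for x5, x6 in itertools.product((0, 1), repeat=2))
    den = sum(joint((x1, x2, x3, a, x5, x6))
              for a, x5, x6 in itertools.product((0, 1), repeat=3))
    return num / den

for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    assert np.isclose(cond_x4(x1, x2, x3, 1), p4_given_2[x2, 1])
print("p(x4 | x1, x2, x3) equals p(x4 | x2) for every configuration")
```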

  • Tony Jebara, Columbia University

    D-Separation & Bayes Ball
    • There is a graph algorithm for checking independence
    • Intuition: separation or blocking of some nodes by others
    • Example: if nodes $x_2, x_3$ "block" the path from $x_1$ to $x_6$, we might say that $x_1 \perp x_6 \mid x_2, x_3$
    • This is not exact for directed graphs (it is true for undirected graphs)
    • We need more than just simple separation
    • Need D-Separation (directed separation)
    • D-Separation is computed via the Bayes Ball algorithm
    • Use it to prove general statements over subsets of variables: $X_A \perp X_B \mid X_C$

  • Tony Jebara, Columbia University

    Bayes Ball Algorithm
    • The algorithm (to test $X_A \perp X_B \mid X_C$):
      1) Shade the nodes $X_C$
      2) Place a ball at each node in $X_A$
      3) Bounce the balls around the graph according to some rules
      4) If no balls reach $X_B$, then $X_A \perp X_B \mid X_C$ is true (else false)
    • Balls can travel along or against arrows. Pick any incoming & outgoing path and test each to see if the ball goes through or bounces back.
    • Look at canonical sub-graphs & leaf cases for the rules…
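
One common way to implement the test is as a reachability search over (node, direction) pairs, which is how the Bayes Ball / d-separation procedure is usually coded. The sketch below follows that formulation and is my own illustration rather than the lecture's code; the flu-network node numbering follows the patient-table column order.

```python
# D-separation as a reachability search (Bayes Ball style).
from collections import deque

def d_separated(parents, A, B, C):
    """Return True iff every trail from A to B is blocked given the observed set C."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # C and all ancestors of C (needed for the v-structure rule)
    anc, stack = set(), list(C)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])

    # "up" = ball arrived from a child (against an edge),
    # "down" = ball arrived from a parent (along an edge)
    visited = set()
    queue = deque((a, "up") for a in A)
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v not in C and v in B:
            return False                                  # an active trail reached B
        if d == "up" and v not in C:
            queue.extend((p, "up") for p in parents[v])   # keep going up
            queue.extend((c, "down") for c in children[v])
        elif d == "down":
            if v not in C:                                # chain / common cause: pass down
                queue.extend((c, "down") for c in children[v])
            if v in anc:                                  # v-structure with observed descendant
                queue.extend((p, "up") for p in parents[v])
    return True

# Flu network: x1=FLU, x2=FEVER, x3=SINUS, x4=TEMP, x5=SWELL, x6=HEAD
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
print(d_separated(parents, {1}, {6}, {2, 3}))   # True:  x1 independent of x6 given {x2, x3}
print(d_separated(parents, {2}, {3}, {1, 6}))   # False: x2 and x3 are dependent given {x1, x6}
```

Run on the flu network, this already reproduces the two examples worked out on the following slides.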

  • Tony Jebara, Columbia University

    Bayes Ball Algorithm
    1) Markov Chain ($x \rightarrow y \rightarrow z$): rule depends only on the shading of the middle node.
       $y$ shaded: ball stops, so $x \perp z \mid y$.   $y$ unshaded: ball goes through, so $x \not\perp z$.
    2) Two Effects ($x \leftarrow y \rightarrow z$): rule depends only on the shading of the middle node.
       $y$ shaded: ball stops, so $x \perp z \mid y$.   $y$ unshaded: ball goes through, so $x \not\perp z$.
    3) Two Causes (V-structure, $x \rightarrow y \leftarrow z$): rule depends only on the shading of the middle node, but is reversed.
       $y$ shaded: ball goes through, so $x \not\perp z \mid y$.   $y$ unshaded: ball stops, so $x \perp z$.
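
These three rules can also be confirmed numerically on tiny joint tables. The sketch below (random toy CPTs, my own construction, not from the slides) builds a chain, a common-cause graph, and a v-structure over binary x, y, z, then tests whether the two end variables are independent marginally and given the middle node.

```python
# Numeric check of the three canonical three-node cases.
import numpy as np

rng = np.random.default_rng(1)

def rnd(shape):
    """Random table normalized over the last axis (a conditional distribution)."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

def indep_marginal(p_xyz):
    """True iff p(x, z) == p(x) p(z)."""
    p_xz = p_xyz.sum(axis=1)
    return np.allclose(p_xz, np.outer(p_xz.sum(axis=1), p_xz.sum(axis=0)))

def indep_given_y(p_xyz):
    """True iff p(x, z | y) == p(x | y) p(z | y) for every y."""
    p_y = p_xyz.sum(axis=(0, 2))
    p_xz_y = p_xyz.transpose(1, 0, 2) / p_y[:, None, None]
    return np.allclose(p_xz_y, p_xz_y.sum(axis=2, keepdims=True) *
                               p_xz_y.sum(axis=1, keepdims=True))

px, py_x, pz_y = rnd(2), rnd((2, 2)), rnd((2, 2))
chain = np.einsum('x,xy,yz->xyz', px, py_x, pz_y)            # x -> y -> z
py, px_y, pz_y2 = rnd(2), rnd((2, 2)), rnd((2, 2))
cause = np.einsum('y,yx,yz->xyz', py, px_y, pz_y2)           # x <- y -> z
pz, py_xz = rnd(2), rnd((2, 2, 2))
vstr  = np.einsum('x,z,xzy->xyz', px, pz, py_xz)             # x -> y <- z

for name, p in [("chain", chain), ("common cause", cause), ("v-structure", vstr)]:
    print(name, " x_||_z:", indep_marginal(p), "  x_||_z|y:", indep_given_y(p))
# chain & common cause: dependent marginally, independent given y;
# v-structure: independent marginally, dependent given y.
```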

  • Tony Jebara, Columbia University

    Bayes Ball Algorithm
    • Also need to look at the special 'leaf' cases:
      ball arrives at an unshaded leaf along the arrow (from its parent): ball is stopped
      ball arrives at a shaded leaf along the arrow: ball bounces back
      ball arrives at an unshaded node against the arrow (from its child): ball bounces back
      ball arrives at a shaded node against the arrow: ball is stopped
    • Example: is $x_1 \perp x_6 \mid \{x_2, x_3\}$?
      The paths $x_1, x_2, x_4$; $x_1, x_2, x_6$; and $x_1, x_3, x_5$ are all stopped,
      $\therefore\; x_1 \perp x_6 \mid \{x_2, x_3\}$
      Flu is independent of headache given fever & sinus infection!
    • Example: $x_1 \perp x_5 \mid x_3$
      (the chain through the shaded node $x_3$ is blocked, and the path through the unshaded collider $x_6$ is also blocked)

  • Tony Jebara, Columbia University

    Bayes Ball Algorithm
    • Example: is $x_2 \perp x_3 \mid \{x_1, x_6\}$?
      The path $x_2, x_6, x_5$ goes through because of the V-structure at the shaded node $x_6$, and continues on to $x_3$ through the unshaded node $x_5$,
      $\therefore\; x_2 \not\perp x_3 \mid \{x_1, x_6\}$
    • Example: aliens $\perp$ watch, but aliens $\not\perp$ watch $\mid$ report.
      The ball bounces back from the report leaf and goes on to the right (toward watch) if report is shaded.
      Bob is waiting for Alice but can't know whether she is late. Instead, a security guard tells him whether she is. She can be late if aliens abduct her or if Bob's watch is ahead (daylight savings time). The guard reports she is late. If the watch is ahead, p(aliens = true) goes down, so the two causes are dependent.
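
The aliens/watch/report story is the classic "explaining away" effect, and it is easy to see with made-up numbers. In the sketch below all probabilities are invented for illustration: hearing the guard say "late" raises p(aliens), and then learning the watch is ahead lowers it again.

```python
# Explaining away with invented numbers for the aliens / watch / report example.
import itertools

p_aliens = {True: 0.05, False: 0.95}
p_watch  = {True: 0.20, False: 0.80}

def p_report(report, aliens, watch):
    """Guard says 'late' if either cause holds (with a little noise otherwise)."""
    late = 0.99 if (aliens or watch) else 0.01
    return late if report else 1.0 - late

def p_aliens_given(report, watch=None):
    num = den = 0.0
    for a, w in itertools.product((True, False), repeat=2):
        if watch is not None and w != watch:
            continue
        joint = p_aliens[a] * p_watch[w] * p_report(report, a, w)
        den += joint
        if a:
            num += joint
    return num / den

print(p_aliens_given(report=True))               # goes up once we hear "late"
print(p_aliens_given(report=True, watch=True))   # drops again once the watch explains it
```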