EE269 Signal Processing for Machine Learning - Lecture 6
Instructor: Mert Pilanci, Stanford University. Jan 9, 2019.
Source: web.stanford.edu/class/ee269/Lecture6.pdf

  • EE269 Signal Processing for Machine Learning

    Lecture 6

    Instructor: Mert Pilanci

    Stanford University

    Jan 9, 2019

  • Bayes classifiers

    Given the joint probability distribution P_xy of (x[n], y), what is the best classifier?

    - Signal classifier: a function f(x[n]) : R^N → {1, ..., K}
    - The best classifier depends on the performance measure
    - Risk (probability of error):

      R(f) = P_xy(f(x) ≠ y)

  • Bayes risk R*

    - The Bayes risk R* is the smallest risk of any classifier
    - If R(f) = R*, then f is a Bayes classifier
    - π_k := P_y(y = k)  (prior class probabilities)
    - g_k(x) := P_{x|y=k}  (class conditional distribution)
    - Define the posterior class probabilities:

      η_k(x) = P_{y|x}(y = k | x)

      Note that ∑_{k=1}^{K} η_k(x) = 1

  • Example: x[n] = x ∈ R

    - Suppose x | y is Gaussian with y ∈ {−1, +1}:

      x | y = −1 ∼ N(μ₋, σ₋²) and x | y = +1 ∼ N(μ₊, σ₊²)

    - Posterior by Bayes' rule:

      η₁(x) = P(y = 1 | x) = P(x | y = 1) P(y = 1) / P(x)
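
    As a concrete illustration of this Bayes-rule computation, here is a minimal Python sketch with assumed values for μ₋, μ₊, σ, and the prior (none of these numbers come from the slides):

    ```python
    from scipy.stats import norm

    def posterior_class1(x, mu_minus=-1.0, mu_plus=1.0, sigma=1.0, pi_plus=0.5):
        """Posterior P(y = +1 | x) for two Gaussian classes via Bayes' rule."""
        g_plus = norm.pdf(x, loc=mu_plus, scale=sigma)    # class conditional P(x | y = +1)
        g_minus = norm.pdf(x, loc=mu_minus, scale=sigma)  # class conditional P(x | y = -1)
        # Bayes' rule: P(y = +1 | x) = pi_+ g_+(x) / (pi_+ g_+(x) + pi_- g_-(x))
        return pi_plus * g_plus / (pi_plus * g_plus + (1 - pi_plus) * g_minus)

    print(posterior_class1(0.3))  # x is closer to mu_plus, so the posterior exceeds 0.5
    ```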

  • Bayes classifier

    π_k := P_y(y = k)  (prior class probabilities)
    g_k(x) := P_{x|y=k}  (class conditional distribution)
    η_k(x) = P_{y|x}(y = k | x)  (posterior class probabilities)

    Theorem: The classifier

      f*(x) = arg max_{k=1,...,K} η_k(x) = arg max_{k=1,...,K} π_k g_k(x)

    is a Bayes classifier.
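
    A minimal Python sketch (not from the slides) of this arg max rule, assuming three hypothetical 1-D Gaussian classes with chosen priors, means, and standard deviations:

    ```python
    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.2, 0.5, 0.3])   # pi_k (assumed)
    means = np.array([-2.0, 0.0, 3.0])   # class-conditional means (assumed)
    stds = np.array([1.0, 1.0, 2.0])     # class-conditional standard deviations (assumed)

    def bayes_classify(x):
        """Return arg max_k pi_k * g_k(x), i.e. the Bayes classifier label (1-indexed)."""
        scores = priors * norm.pdf(x, loc=means, scale=stds)  # pi_k g_k(x) for each k
        return int(np.argmax(scores)) + 1

    print([bayes_classify(x) for x in (-2.5, 0.1, 4.0)])  # [1, 2, 3] for these points
    ```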

  • Detecting a constant signal in Gaussian noise

    - x[n] = [x₁, ..., x_N] ∼ N(0, σ²) i.i.d. for y = 1 (noise)
    - x[n] = [x₁, ..., x_N] ∼ N(μ, σ²) i.i.d. for y = 2 (signal + noise)
    - P(y = 1) = π₁ and P(y = 2) = π₂

    Assume μ is known.

      g₁(x) := P_{x|y=1} = ∏_{i=1}^{N} (1/√(2πσ²)) e^{−x_i² / (2σ²)}

      g₂(x) := P_{x|y=2} = ∏_{i=1}^{N} (1/√(2πσ²)) e^{−(x_i − μ)² / (2σ²)}

    Bayes classifier: classify as signal + noise if π₁ g₁(x) < π₂ g₂(x), which reduces to

      μ ∑_{i=1}^{N} x_i > σ² log(π₁/π₂) + Nμ²/2
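
    A minimal sketch of the resulting threshold test, with assumed values for μ, σ, N, and the priors (none of these numbers come from the slides):

    ```python
    import numpy as np

    mu, sigma = 0.5, 1.0      # assumed signal level and noise standard deviation
    pi1, pi2 = 0.5, 0.5       # assumed priors
    N = 64

    def detect_constant_signal(x):
        """Return 2 (signal+noise) if mu * sum(x) exceeds the Bayes threshold, else 1."""
        statistic = mu * np.sum(x)
        threshold = sigma**2 * np.log(pi1 / pi2) + N * mu**2 / 2
        return 2 if statistic > threshold else 1

    rng = np.random.default_rng(0)
    x_noise = rng.normal(0.0, sigma, N)        # a draw from class y = 1
    x_signal = mu + rng.normal(0.0, sigma, N)  # a draw from class y = 2
    print(detect_constant_signal(x_noise), detect_constant_signal(x_signal))
    ```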

  • Detecting a change in variance

    - x[n] = [x₁, ..., x_N] ∼ N(0, σ₁²) i.i.d. for y = 1 (noise)
    - x[n] = [x₁, ..., x_N] ∼ N(0, σ₂²) i.i.d. for y = 2 (signal + noise)
    - P(y = 1) = π₁ and P(y = 2) = π₂

    Assume σ₁ and σ₂ are known.

      g₁(x) := P_{x|y=1} = ∏_{i=1}^{N} (1/√(2πσ₁²)) e^{−x_i² / (2σ₁²)}

      g₂(x) := P_{x|y=2} = ∏_{i=1}^{N} (1/√(2πσ₂²)) e^{−x_i² / (2σ₂²)}

    Bayes classifier: classify as class 2 if π₁ g₁(x) < π₂ g₂(x), which reduces to

      (1/(2σ₁²) − 1/(2σ₂²)) ∑_{i=1}^{N} x_i² > log(π₁/π₂) − (N/2) log(σ₁²/σ₂²)
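
    The same kind of sketch for this energy detector, again with assumed values for σ₁, σ₂, N, and the priors:

    ```python
    import numpy as np

    sigma1, sigma2 = 1.0, 2.0   # assumed noise / signal+noise standard deviations
    pi1, pi2 = 0.5, 0.5         # assumed priors
    N = 128

    def detect_variance_change(x):
        """Return 2 if the weighted signal energy exceeds the Bayes threshold, else 1."""
        statistic = (1 / (2 * sigma1**2) - 1 / (2 * sigma2**2)) * np.sum(x**2)
        threshold = np.log(pi1 / pi2) - (N / 2) * np.log(sigma1**2 / sigma2**2)
        return 2 if statistic > threshold else 1

    rng = np.random.default_rng(1)
    print(detect_variance_change(rng.normal(0, sigma1, N)),   # class y = 1
          detect_variance_change(rng.normal(0, sigma2, N)))   # class y = 2
    ```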

  • Detecting a known waveform

    - s[n] ∈ R^N known waveform
    - x[n] = [x₁, ..., x_N] ∼ N(0, σ²) i.i.d. for y = 1 (noise)
    - x[n] = [s₁, ..., s_N] + N(0, σ²) i.i.d. for y = 2 (signal + noise)
    - P(y = 1) = π₁ and P(y = 2) = π₂

    Assume s is known.

      g₁(x) := P_{x|y=1} = ∏_{i=1}^{N} (1/√(2πσ²)) e^{−x_i² / (2σ²)}

      g₂(x) := P_{x|y=2} = ∏_{i=1}^{N} (1/√(2πσ²)) e^{−(x_i − s_i)² / (2σ²)}

    Bayes classifier: classify as signal + noise if π₁ g₁(x) < π₂ g₂(x), which reduces to

      ∑_{i=1}^{N} x_i s_i > σ² log(π₁/π₂) + (1/2) ∑_{i=1}^{N} |s_i|²

    i.e.,

      ⟨x, s⟩ > σ² log(π₁/π₂) + (1/2) ||s||₂²

    - This is a matched filter (e.g., face detector, DFT basis)
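
    A minimal sketch of the matched-filter test, assuming a hypothetical sinusoidal waveform s and chosen values for σ and the priors:

    ```python
    import numpy as np

    N = 64
    n = np.arange(N)
    s = np.sin(2 * np.pi * 4 * n / N)   # assumed known waveform
    sigma = 1.0                         # assumed noise standard deviation
    pi1, pi2 = 0.5, 0.5                 # assumed priors

    def matched_filter_detect(x):
        """Return 2 (signal+noise) if <x, s> exceeds the Bayes threshold, else 1."""
        statistic = np.dot(x, s)                                       # <x, s>
        threshold = sigma**2 * np.log(pi1 / pi2) + 0.5 * np.dot(s, s)  # + ||s||^2 / 2
        return 2 if statistic > threshold else 1

    rng = np.random.default_rng(2)
    print(matched_filter_detect(rng.normal(0, sigma, N)),       # noise only
          matched_filter_detect(s + rng.normal(0, sigma, N)))   # waveform + noise
    ```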

  • Multivariate Gaussian

    x₁, ..., x_n ∼ i.i.d. N(μ, Σ), where each sample x ∈ R^N has density

      P(x; μ, Σ) = 1 / ((2π)^{N/2} |Σ|^{1/2}) · e^{−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)}
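
    A short sketch evaluating this log-density directly from the formula; the test point, mean, and covariance below are arbitrary choices:

    ```python
    import numpy as np

    def gaussian_log_density(x, mu, Sigma):
        """log P(x; mu, Sigma) = -(N/2) log(2 pi) - (1/2) log|Sigma| - (1/2)(x-mu)^T Sigma^{-1} (x-mu)."""
        N = x.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)            # log |Sigma|
        quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

    x = np.array([0.5, -0.2])
    mu = np.zeros(2)
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    print(gaussian_log_density(x, mu, Sigma))
    ```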

  • Detecting a correlated signal

    - [x₁, ..., x_N] ∼ N(0, σ² I_N) for y = 1 (noise)
    - [x₁, ..., x_N] ∼ N(0, Σ) + N(0, σ² I_N) for y = 2 (signal + noise)
    - P(y = 1) = π₁ and P(y = 2) = π₂

    Assume σ and Σ are known.

      g₁(x) := P_{x|y=1} = 1/(2πσ²)^{N/2} · e^{−(1/2) xᵀ (σ² I_N)⁻¹ x}

      g₂(x) := P_{x|y=2} = 1/((2π)^{N/2} |Σ + σ² I_N|^{1/2}) · e^{−(1/2) xᵀ (Σ + σ² I_N)⁻¹ x}

    Bayes classifier: classify as signal + noise if π₁ g₁(x) < π₂ g₂(x), i.e.

      log π₁ − (1/2) xᵀ (σ² I_N)⁻¹ x < log π₂ − (1/2) xᵀ (Σ + σ² I_N)⁻¹ x + constant
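
    A minimal sketch of this comparison in the log domain, using an assumed random covariance Σ, noise variance σ², and equal priors:

    ```python
    import numpy as np
    from scipy.stats import multivariate_normal

    N = 8
    rng = np.random.default_rng(3)
    A = rng.normal(size=(N, N))
    Sigma = A @ A.T / N         # assumed (positive semidefinite) signal covariance
    sigma2 = 0.5                # assumed noise variance sigma^2
    pi1 = pi2 = 0.5             # assumed priors

    def classify_correlated(x):
        """Return 2 if log(pi1) + log g1(x) < log(pi2) + log g2(x), else 1."""
        log_g1 = multivariate_normal.logpdf(x, mean=np.zeros(N), cov=sigma2 * np.eye(N))
        log_g2 = multivariate_normal.logpdf(x, mean=np.zeros(N), cov=Sigma + sigma2 * np.eye(N))
        return 2 if np.log(pi1) + log_g1 < np.log(pi2) + log_g2 else 1

    x_noise = rng.multivariate_normal(np.zeros(N), sigma2 * np.eye(N))
    x_signal = rng.multivariate_normal(np.zeros(N), Sigma + sigma2 * np.eye(N))
    print(classify_correlated(x_noise), classify_correlated(x_signal))
    ```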

  • Detecting a correlated signal

    - [x₁, ..., x_N] ∼ N(0, σ² I_N) for y = 1 (noise)
    - [x₁, ..., x_N] ∼ N(0, Σ) + N(0, σ² I_N) for y = 2 (signal + noise)

      log π₁ − (1/2) xᵀ (σ² I_N)⁻¹ x < log π₂ − (1/2) xᵀ (Σ + σ² I_N)⁻¹ x + constant

    - Matrix inversion lemma (A invertible):

      (A + B)⁻¹ = A⁻¹ − A⁻¹ (I + B A⁻¹)⁻¹ B A⁻¹

    - (σ² I_N)⁻¹ − (Σ + σ² I_N)⁻¹ = (1/σ²) (σ² I_N + Σ)⁻¹ Σ

    - Classify as signal when

      xᵀ (σ² I_N + Σ)⁻¹ Σ x > σ² log(π₁/π₂) + constant
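
    A quick numerical check (with an arbitrary Σ and σ²) of the matrix-inversion-lemma identity used in the simplification above:

    ```python
    import numpy as np

    N = 5
    rng = np.random.default_rng(4)
    A = rng.normal(size=(N, N))
    Sigma = A @ A.T             # arbitrary symmetric positive semidefinite covariance
    sigma2 = 0.7                # arbitrary noise variance sigma^2
    I = np.eye(N)

    # (sigma^2 I)^{-1} - (Sigma + sigma^2 I)^{-1}  ==  (1/sigma^2) (sigma^2 I + Sigma)^{-1} Sigma
    lhs = np.linalg.inv(sigma2 * I) - np.linalg.inv(Sigma + sigma2 * I)
    rhs = (1 / sigma2) * np.linalg.inv(sigma2 * I + Sigma) @ Sigma
    print(np.allclose(lhs, rhs))   # True
    ```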

  • Detecting a correlated signal

    - [x₁, ..., x_N] ∼ N(0, σ² I_N) for y = 1 (noise)
    - [x₁, ..., x_N] ∼ N(0, Σ) + N(0, σ² I_N) for y = 2 (signal + noise)
    - Classify as signal when xᵀ (σ² I_N + Σ)⁻¹ Σ x > σ² log(π₁/π₂) + constant

    - Eigenvalue decomposition: Σ = U Λ Uᵀ = ∑_{i=1}^{N} λ_i u_i u_iᵀ

      xᵀ (σ² I_N + U Λ Uᵀ)⁻¹ U Λ Uᵀ x = xᵀ U (σ² I_N + Λ)⁻¹ Λ Uᵀ x

    - Change of basis: y = Uᵀ x

    - yᵀ (σ² I_N + Λ)⁻¹ Λ y = ∑_{n=0}^{N−1} (λ_n / (λ_n + σ²)) |y[n]|²

      y_filtered[n] = √(λ_n / (λ_n + σ²)) · y[n]
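
    A short sketch verifying numerically that the quadratic-form statistic equals the per-component weighted energy after the change of basis y = Uᵀx (Σ, σ², and x below are arbitrary choices):

    ```python
    import numpy as np

    N = 6
    rng = np.random.default_rng(5)
    A = rng.normal(size=(N, N))
    Sigma = A @ A.T             # arbitrary symmetric positive semidefinite covariance
    sigma2 = 1.0                # arbitrary noise variance sigma^2
    x = rng.normal(size=N)

    # Direct quadratic form: x^T (sigma^2 I + Sigma)^{-1} Sigma x
    direct = x @ np.linalg.inv(sigma2 * np.eye(N) + Sigma) @ Sigma @ x

    # Eigendecomposition Sigma = U Lambda U^T, change of basis y = U^T x,
    # then sum_n lambda_n / (lambda_n + sigma^2) * |y[n]|^2
    lam, U = np.linalg.eigh(Sigma)
    y = U.T @ x
    weighted = np.sum(lam / (lam + sigma2) * np.abs(y) ** 2)

    print(np.allclose(direct, weighted))   # True
    ```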

  • Optimality of filtering for circulant covariances

      xᵀ U (σ² I_N + Λ)⁻¹ Λ Uᵀ x > σ² log(π₁/π₂)

    - Symmetric circulant covariances:

      Σ = [ s[0]  s[1]  s[2]  ...  s[2]  s[1]
            s[1]  s[0]  s[1]  ...  s[3]  s[2]
             ...                         ...
            s[2]  s[3]  s[4]  ...  s[0]  s[1]
            s[1]  s[2]  s[3]  ...  s[1]  s[0] ]

    - Σ_ij = Σ_ji
    - s[N − n] = s[n]

    Fourier basis: w_k[n] = (1/√N) e^{j 2πnk/N}

      w_kᴴ Σ = S[k] w_kᴴ
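
    A small numerical check that the Fourier basis vectors are eigenvectors of a symmetric circulant covariance, with an assumed first column s satisfying s[N − n] = s[n]:

    ```python
    import numpy as np
    from scipy.linalg import circulant

    N = 8
    s = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.5, 1.0, 2.0])   # assumed, with s[N - n] = s[n]
    Sigma = circulant(s)          # symmetric circulant covariance built from s
    S = np.fft.fft(s)             # eigenvalues S[k] (real here, since s is real and symmetric)

    k = 3
    n = np.arange(N)
    w_k = np.exp(1j * 2 * np.pi * n * k / N) / np.sqrt(N)      # Fourier basis vector w_k
    print(np.allclose(w_k.conj() @ Sigma, S[k] * w_k.conj()))  # w_k^H Sigma = S[k] w_k^H -> True
    ```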
