
An Asymptotic Analysis of Estimators:

Generative, Discriminative, Pseudolikelihood

ICML 2008, Helsinki, Finland
July 6, 2008

Percy Liang and Michael I. Jordan
UC Berkeley

Goal: structured prediction

x ⇒ y = (y_1, . . . , y_ℓ)

We focus on probabilistic models of x and y.

Many approaches:

• Discriminative (logistic regression, conditional random fields)
• Generative (Naive Bayes, Bayesian networks, HMMs)
• Pseudolikelihood [Besag, 1975]
• Composite likelihood [Lindsay, 1988]
• Multi-conditional learning [McCallum et al., 2006]
• Piecewise training [Sutton & McCallum, 2005]
• Variational relaxations [Wainwright, 2006]
• Agreement-based learning [Liang et al., 2008]

...how to choose among these approaches?

Our work:

• Put the first three in a unified composite likelihood framework
• Compare their statistical properties theoretically

Existing intuitions:

• Discriminative: lower bias
• Generative: lower variance
  [Ng & Jordan, 2002; Bouchard & Triggs, 2004]

• Pseudolikelihood: slower statistical convergence [Besag, 1975]

Our general result:

Derive the (excess) risk of composite likelihood estimators

Specific conclusions:

If the model is well-specified:

Risk(generative) < Risk(discriminative) < Risk(pseudolikelihood)

If the model is misspecified:

Risk(discriminative) < Risk(pseudolikelihood), Risk(generative)

Model-based estimators and neighborhoods

Generative:

θ_g = argmax_θ E[log p_θ(x, y)]        neighborhood: r(x, y) = {(∗, ∗)}

Discriminative:

θ_d = argmax_θ E[log p_θ(x, y) − log p_θ(x)]        neighborhood: r(x, y) = {(x, ∗)}

More generally:

θ = argmax_θ E[log p_θ(x, y) − log p_θ(r(x, y))]        neighborhood: r(x, y)

r(x, y) is the subset of the input-output space that we want to contrast (x, y) against.
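To make the neighborhood view concrete, here is a minimal runnable sketch (my addition, not from the talk; the toy spaces and feature function are illustrative assumptions) in which the generative and discriminative objectives are literally the same formula with different neighborhoods:

```python
import itertools
import numpy as np

X = [0, 1]                                  # toy input space (assumption)
Y = [0, 1]                                  # toy output space (assumption)
PAIRS = list(itertools.product(X, Y))

def phi(x, y):
    """Toy feature vector; any choice works for this sketch."""
    return np.array([x * y, y], dtype=float)

def log_p(theta, x, y):
    """log p_theta(x, y) = phi(x, y) . theta - log Z(theta)."""
    scores = np.array([phi(xp, yp) @ theta for xp, yp in PAIRS])
    return phi(x, y) @ theta - np.logaddexp.reduce(scores)

def objective(theta, data, r):
    """Mean of log p_theta(x, y) - log p_theta(r(x, y)) over the data,
    where p_theta(r(x, y)) sums p_theta over the neighborhood r(x, y)."""
    total = 0.0
    for x, y in data:
        log_p_nbhd = np.logaddexp.reduce(
            np.array([log_p(theta, xp, yp) for xp, yp in r(x, y)]))
        total += log_p(theta, x, y) - log_p_nbhd
    return total / len(data)

r_gen = lambda x, y: PAIRS                  # {(*, *)}: contrast term is log 1 = 0
r_dis = lambda x, y: [(x, yp) for yp in Y]  # {(x, *)}: contrast term is log p(x)

data = [(0, 1), (1, 1), (1, 0)]
theta = np.array([0.5, -0.2])
print(objective(theta, data, r_gen))        # mean log p_theta(x, y)
print(objective(theta, data, r_dis))        # mean log p_theta(y | x)
```

Maximizing the first objective over θ gives the generative estimate; maximizing the second gives the discriminative one.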

Composite likelihood estimators

Discriminative pseudolikelihood:

[Figure: graphical model with outputs y_1, ..., y_j, ..., y_ℓ all connected to the input x]

θ_p = argmax_θ ∑_{j=1}^ℓ E[log p_θ(y_j | x, y \ {y_j})]
    = argmax_θ ∑_{j=1}^ℓ E[log p_θ(x, y) − log p_θ(x, y \ {y_j})]

where the contrast term (x, y \ {y_j}) plays the role of r_j(x, y): the pairs that agree with (x, y) everywhere except possibly at y_j.

General composite likelihood:

θ = argmax_θ ∑_j w_j E[log p_θ(x, y) − log p_θ(r_j(x, y))]

[Figure: a point (x, y) with two overlapping neighborhoods r_1(x, y) and r_2(x, y)]
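As a worked instance (my own spelling-out, not a slide from the talk), take ℓ = 2, so y = (y_1, y_2); with weights w_1 = w_2 = 1, the pseudolikelihood fits the general template exactly:

```latex
\begin{align*}
\theta_p
 &= \arg\max_\theta\;
    \mathbb{E}\big[\log p_\theta(y_1 \mid x, y_2)
                 + \log p_\theta(y_2 \mid x, y_1)\big] \\
 &= \arg\max_\theta \sum_{j=1}^{2}
    \mathbb{E}\big[\log p_\theta(x, y) - \log p_\theta(r_j(x, y))\big]
\end{align*}
```

where r_1(x, y) = {(x, y_1′, y_2) : y_1′ ∈ 𝒴} and r_2(x, y) = {(x, y_1, y_2′) : y_2′ ∈ 𝒴}: each neighborhood varies one output coordinate and pins down the rest.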

Review of exponential families

log p_θ(x, y | r(x, y)) = φ(x, y) · θ − log ∑_{(x′, y′) ∈ r(x, y)} exp{φ(x′, y′) · θ}

(φ: features;  θ: parameters;  the last term is the log-partition function)

Moment-generating properties:

Mean:     ∇_θ log p_θ(x, y | r(x, y)) = φ − E_θ[φ | r]

Variance: ∇²_θ log p_θ(x, y | r(x, y)) = −var_θ[φ | r]

These derivatives are useful for asymptotic Taylor expansions.
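These identities are easy to sanity-check numerically. Below is a small self-contained check (my addition; the features and parameter values are arbitrary illustrative choices) comparing the analytic gradient φ − E_θ[φ | r] to central finite differences, using the discriminative neighborhood {(x, ∗)}:

```python
import numpy as np

Y = [0, 1]

def phi(x, y):                              # toy features (assumption)
    return np.array([x * y, float(y)])

def log_p_cond(theta, x, y, nbhd):
    """log p_theta(x, y | r(x, y)) for a neighborhood nbhd of (x, y) pairs."""
    scores = np.array([phi(xp, yp) @ theta for xp, yp in nbhd])
    return phi(x, y) @ theta - np.logaddexp.reduce(scores)

theta = np.array([0.3, -0.7])
x, y = 1, 0
nbhd = [(x, yp) for yp in Y]                # discriminative neighborhood {(x, *)}

# Analytic gradient: phi(x, y) - E_theta[phi | r].
scores = np.array([phi(xp, yp) @ theta for xp, yp in nbhd])
probs = np.exp(scores - np.logaddexp.reduce(scores))
feats = np.array([phi(xp, yp) for xp, yp in nbhd])
grad_analytic = phi(x, y) - probs @ feats

# Numerical gradient by central differences.
eps = 1e-6
grad_numeric = np.array([
    (log_p_cond(theta + eps * e, x, y, nbhd)
     - log_p_cond(theta - eps * e, x, y, nbhd)) / (2 * eps)
    for e in np.eye(2)])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-8)
```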

Sketch of arguments for comparing estimators

[Figure: a point (x, y) and its neighborhood r(x, y)]

Intuition:

Grow r ⇒ model more about the data ⇒ the data tells us more about the parameters

For exponential families:

[Figure: the map from parameters θ to mean features E_θ[φ]; noise in the
features feeds back into the parameters through the slope of this map,
where slope = variance of φ (def= sensitivity)]

Sensitivity ↑ ⇒ Risk ↓

Generative sensitivity: var(φ)        Discriminative sensitivity: E var(φ | X)

var(φ) = E var(φ | X) + var E(φ | X) ⪰ E var(φ | X)

⇒ Risk(generative) < Risk(discriminative)
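The inequality in the last line is just the law of total variance, var(φ) = E var(φ | X) + var E(φ | X); a quick empirical check (my addition, with a toy distribution) makes it concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)        # toy X, uniform on {0, 1} (assumption)
phi = x + rng.normal(size=x.size)           # toy scalar feature (assumption)

total = phi.var()                                        # var(phi)
within = np.mean([phi[x == v].var() for v in (0, 1)])    # ~ E var(phi | X) (X uniform)
between = np.var([phi[x == v].mean() for v in (0, 1)])   # ~ var E(phi | X)

print(total, within + between)              # the two sides nearly agree
assert total >= within                      # generative sensitivity >= discriminative
```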

Overview of asymptotic analysis

How accurately can we estimate the parameters?

ParameterError = O(Σ/√n)

Σ: asymptotic variance of the parameters
n: number of training examples

How fast can we drive the excess risk (expected log-loss) to 0?

In general, we get the normal rate:

Risk = O(Σ/√n)

But if a certain condition is satisfied, we get the fast rate:

Risk = O(Σ/n)

Issues:

• O(n^{−1/2}) or O(n^{−1})?
• Compare Σ

Agenda:

1. Well-specified, one component
2. Well-specified, multiple components
3. Misspecified

Well-specified case

Risk = O(Σ/n) for all consistent estimators

Thus, it suffices to compare the Σs of the different estimators...

Estimator:

θ = argmax_θ E[log p_θ(x, y) − log p_θ(r(x, y))]

Asymptotic variance:

Σ = Γ⁻¹, where Γ = E var(φ | r) is the sensitivity

Proof: by Taylor expansion and the moment-generating properties.
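A slightly expanded version of that Taylor argument, under the usual regularity conditions (my reconstruction; the slide only names the ingredients):

```latex
% Let m(theta) be the population objective and \hat m_n its empirical
% version, maximized by \hat\theta. Expanding the first-order condition
% around the truth theta^*:
\begin{align*}
0 = \nabla \hat{m}_n(\hat\theta)
  &\approx \nabla \hat{m}_n(\theta^*)
   + \nabla^2 m(\theta^*)\,(\hat\theta - \theta^*)
  \quad\Longrightarrow\quad
  \hat\theta - \theta^* \approx \Gamma^{-1}\,\nabla \hat{m}_n(\theta^*),
\end{align*}
% with Gamma = -\nabla^2 m(theta^*) = E var(phi | r) by the variance
% identity above. Well-specification makes the score phi - E_theta[phi | r]
% mean-zero with variance Gamma, so the CLT gives
\[
\sqrt{n}\,(\hat\theta - \theta^*) \;\Rightarrow\;
\mathcal{N}\!\big(0,\ \Gamma^{-1}\Gamma\,\Gamma^{-1}\big)
= \mathcal{N}\!\big(0,\ \Gamma^{-1}\big),
\qquad\text{i.e.}\quad \Sigma = \Gamma^{-1}.
\]
```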

Well-specified case: comparing two estimators

Two estimators:

θ_j = argmax_θ E[log p_θ(x, y) − log p_θ(r_j(x, y))]   for j = 1, 2

[Figure: a point (x, y) with nested neighborhoods r_1(x, y) ⊃ r_2(x, y)]

Comparison theorem:

If the model is well-specified and r_1(x, y) ⊃ r_2(x, y), then

Risk(θ_1) ≤ Risk(θ_2)

Proof:

Σ_j = [E var(φ | r_j)]⁻¹,  so  Σ_1 ⪯ Σ_2,  and  Risk = O(Σ_j/n).

Modeling more reduces error (when the model is well-specified).

Multiple components

Asymptotic variance:

Σ = Γ⁻¹ + Γ⁻¹ C_c Γ⁻¹

Γ = ∑_j w_j E var(φ | r_j) is the sensitivity

C_c ⪰ 0: correction due to multiple components

Comparison theorem:

If the model is well-specified and

θ_1: one component r_1        θ_2: multiple components {r_{2,j}}

r_1(x, y) ⊃ r_{2,j}(x, y) for all components j,

then

Risk(θ_1) ≤ Risk(θ_2)

Note: this does not apply if θ_1 has more than one component.
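One way to read the correction term (my gloss via the standard sandwich formula for M-estimators; the slide does not spell this out): the composite score is s = ∑_j w_j (φ − E_θ[φ | r_j]), and

```latex
\[
\Sigma \;=\; \Gamma^{-1}\,\operatorname{var}(s)\,\Gamma^{-1},
\qquad
\operatorname{var}(s) \;=\; \Gamma + C_c ,
\]
```

so C_c collects the cross-component covariances of the per-component scores. With a single well-specified component, var(s) = Γ and this reduces to Σ = Γ⁻¹ as on the earlier slide.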

Misspecified case

Result:

In general, any estimator gets the normal rate:

Risk = O(Σ/√n)

But the discriminative estimator gets the fast rate:

Risk = O(Σ/n)

Corollary:

Risk(discriminative) < Risk(pseudolikelihood), Risk(generative)

Key desirable property: training criterion = test criterion
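Why matching criteria buys the fast rate (a one-step expansion; my reconstruction of the reasoning behind the slide): for the discriminative estimator, the test risk is the population version of the training objective, so its gradient vanishes at the limit θ_d and the first-order term drops out:

```latex
\[
\mathrm{Risk}(\hat\theta) - \mathrm{Risk}(\theta_d)
= \tfrac{1}{2}\,(\hat\theta - \theta_d)^{\!\top}
  \nabla^2 \mathrm{Risk}(\theta_d)\,(\hat\theta - \theta_d)
  + o\big(\lVert\hat\theta - \theta_d\rVert^2\big)
= O\!\left(\frac{\Sigma}{n}\right),
\]
```

since θ̂ − θ_d = O_P(n^{−1/2}). Under misspecification, the generative and pseudolikelihood limits need not be stationary points of the test risk, so their linear term survives and only the O(Σ/√n) rate holds.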

Verifying the error rates empirically

Setup:

Learn a model of x and y = (y_1, y_2) from n training examples
[Figure: graphical model with outputs y_1, y_2 connected to the input x]

Estimate the (excess) risk from 10,000 trials

[Figure: two panels plotting n · var(Risk) against n = 20K, ..., 100K for the
generative, discriminative, and pseudolikelihood estimators.
Well-specified panel (data generated from the model): all estimators are O(n^{−1}).
Misspecified panel (data generated from a model outside the family): the fully
discriminative estimator is O(n^{−1}); the others are O(n^{−1/2}).]
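A condensed, self-contained version of this experiment can be sketched as follows (my code, not the talk's: the model, features, trial count, and sample sizes are illustrative stand-ins, and only the well-specified panel is reproduced):

```python
# Fit generative / discriminative / pseudolikelihood estimators on samples
# from a tiny joint model over x in {0,1}, y = (y1,y2) in {0,1}^2, and check
# that n * (excess conditional log-loss) stays roughly flat, i.e. O(1/n).
import itertools
import numpy as np
from scipy.optimize import minimize

XS = [0, 1]
YS = list(itertools.product([0, 1], [0, 1]))
CONFIGS = [(x, y) for x in XS for y in YS]          # all 8 configurations

def phi(x, y):                                      # toy features (assumption)
    y1, y2 = y
    return np.array([x, y1, y2, x * y1, x * y2, y1 * y2], dtype=float)

PHI = np.array([phi(x, y) for x, y in CONFIGS])

def log_joint(theta):
    s = PHI @ theta
    return s - np.logaddexp.reduce(s)               # log p_theta(x, y), all configs

def idx(x, y):
    return CONFIGS.index((x, y))

ALL = list(range(len(CONFIGS)))
NBHDS = {                                           # neighborhoods per config
    "generative":       [[ALL] for _ in CONFIGS],
    "discriminative":   [[[idx(x, yp) for yp in YS]] for x, y in CONFIGS],
    "pseudolikelihood": [[[idx(x, (v, y[1])) for v in (0, 1)],
                          [idx(x, (y[0], v)) for v in (0, 1)]]
                         for x, y in CONFIGS],
}

def neg_objective(theta, counts, nbhds):
    """- sum_i counts_i * sum_j [log p(x_i, y_i) - log p(r_j(x_i, y_i))]."""
    lp = log_joint(theta)
    val = 0.0
    for i, c in enumerate(counts):
        if c:
            val += c * sum(lp[i] - np.logaddexp.reduce(lp[r]) for r in nbhds[i])
    return -val

def cond_risk(theta, p):
    """Expected log-loss E_p[-log p_theta(y | x)]."""
    lp = log_joint(theta)
    risk = 0.0
    for x in XS:
        ix = [idx(x, y) for y in YS]
        risk -= p[ix] @ (lp[ix] - np.logaddexp.reduce(lp[ix]))
    return risk

rng = np.random.default_rng(0)
theta_true = rng.normal(size=PHI.shape[1])
p_true = np.exp(log_joint(theta_true))              # well-specified truth
best = cond_risk(theta_true, p_true)                # Bayes-optimal log-loss

for name, nbhds in NBHDS.items():
    for n in [1000, 4000, 16000]:
        excess = []
        for _ in range(100):                        # far fewer than 10,000 trials
            counts = np.bincount(
                rng.choice(len(CONFIGS), size=n, p=p_true),
                minlength=len(CONFIGS))
            theta_hat = minimize(neg_objective, np.zeros(PHI.shape[1]),
                                 args=(counts, nbhds), method="BFGS").x
            excess.append(cond_risk(theta_hat, p_true) - best)
        print(f"{name:16s} n={n:5d}  n*excess ≈ {n * np.mean(excess):.3f}")
```

In the well-specified setting, n times the mean excess risk is roughly constant across n for all three estimators, matching the O(n^{−1}) rate; repeating this with data drawn from outside the model family would, per the slide, leave only the discriminative column flat.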

Application: part-of-speech tagging

Task:

y: Det Noun Verb Det Noun

x: The cat ate a fish

Data: Wall Street Journal news articles (40K sentences)

[Figure: two bar charts of test error for the Gen., Dis., and Pseudo.
estimators. Left: synthetic data (well-specified), test error axis 4.0 to 12.0.
Right: real data (misspecified), test error axis 2.0 to 6.0.]

Summary

A unifying composite likelihood framework for generative, discriminative,
and pseudolikelihood estimators

Asymptotic statistics: a powerful tool for comparing estimators

General conclusions:

• Well-specified case: modeling more of the data reduces error
• Desirable: training criterion = test criterion