From Histograms to Multivariate Polynomial Histograms and Shape Estimation

Assoc Prof Inge Koch

Statistics, School of Mathematical Sciences

University of Adelaide

Inge Koch (UNSW, Adelaide) Poly Histograms 19 March 2009


Motivation: determine the shape of data

We have 12 measurements on each of 27,994 blood cells

How many clusters?

How big are they and where are they?

Data: Centre for Immunology, St Vincent's Hospital, Sydney

Immunologists want to distinguish healthy individuals from those who are HIV+.


Look at the (Log-Data)

[Figure: four panels showing the log-data on CD3, CD4 and CD8 axes, for 2000, 4000, 10000 and 27994 blood cells.]


Histograms of the (Log-Data)

[Figure: histograms of CD3 (log-data), one with 10 bins and one with 5 bins.]


How Many Clusters are in the Data?

One-dimensional data: 1 or 2 modes;

Two-dimensional data: 1 to 3 or 4 modes;

How many clusters are in the 12-dimensional data?

If the measurements were independent,

then the number of modes would be the product of the numbers of modes of the individual variables

→ but this is not the case in our data
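The independence remark can be illustrated with a small sketch that is not part of the talk. It assumes two independent coordinates that each follow the same bimodal density (an equal mixture of normals at -2 and 2, chosen here purely for illustration) and counts strict local maxima on a grid. The function names are hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def bimodal(x):
    """Illustrative marginal density: equal mixture of N(-2, 1) and N(2, 1)."""
    return 0.5 * normal_pdf(x, -2.0) + 0.5 * normal_pdf(x, 2.0)

def count_modes_1d(values):
    """Count strict local maxima of a sequence of density values."""
    return sum(values[i - 1] < values[i] > values[i + 1]
               for i in range(1, len(values) - 1))

def count_modes_2d(fx, fy):
    """Count strict local maxima (8-neighbourhood) of the product density fx[i] * fy[j]."""
    joint = [[a * b for b in fy] for a in fx]
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    count = 0
    for i in range(1, len(fx) - 1):
        for j in range(1, len(fy) - 1):
            if all(joint[i + di][j + dj] < joint[i][j] for di, dj in offsets):
                count += 1
    return count

grid = [-5 + 0.1 * k for k in range(101)]   # grid on [-5, 5]
marginal = [bimodal(x) for x in grid]

modes_1d = count_modes_1d(marginal)
modes_2d = count_modes_2d(marginal, marginal)
print(modes_1d, modes_2d)
```

Under independence the joint density is the product of the marginals, so every pair of marginal modes yields one joint mode.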

Can you think of a 3D example with k modes such that the 2D

projections have k − 1 modes?


Polynomial Histogram Estimators

Main idea

histograms have flat tops, so instead of

only estimating the number of points in each bin

estimate the shape separately in each bin


What are Polynomial Histogram Estimators?

Number of observations $n$, dimension $d$, binwidth $h$

$B_\ell$: a bin of volume $h^d$ with $n_\ell$ observations

The model for each bin $B_\ell$

1 histogram estimator (Hist): $f_0(x) = a_0$

2 first-order polynomial histogram estimator (Fophe): $f_1(x) = a_0 + a^T x$

3 second-order polynomial histogram estimator (Sophe): $f_2(x) = a_0 + a^T x + x^T A x$


Relationships for Coefficients

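In each bin $B_\ell$ the estimate $f_k$ satisfies

1 proportion of data: $\int_{B_\ell} f_k(x)\,dx = \frac{n_\ell}{n}$

2 local mean: $\int_{B_\ell} x\, f_k(x)\,dx = \frac{n_\ell}{n}\,\bar{x}_\ell$

3 local second moment: $\int_{B_\ell} x x^T f_k(x)\,dx = \frac{n_\ell}{n}\,M_\ell$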
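As a worked illustration that is not taken from the slides, constraints 1 and 2 already determine the first-order fit in one dimension. The sketch assumes $d = 1$, the bin $[t_\ell - h/2,\ t_\ell + h/2]$ with centre $t_\ell$, and the centred parametrisation $f_1(x) = a_0 + a_1 u$ with $u = x - t_\ell$ (a convenience chosen here). Subtracting $t_\ell$ times constraint 1 from constraint 2 gives the centred first moment used in the second line.

```latex
\begin{align*}
\int_{-h/2}^{h/2} (a_0 + a_1 u)\,du &= a_0 h = \frac{n_\ell}{n}
  &&\Rightarrow\quad a_0 = \frac{1}{h}\,\frac{n_\ell}{n},\\
\int_{-h/2}^{h/2} u\,(a_0 + a_1 u)\,du &= a_1 \frac{h^3}{12}
  = \frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)
  &&\Rightarrow\quad a_1 = \frac{12}{h^3}\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell),\\
f_1(x) &= \frac{1}{h^{3}}\,\frac{n_\ell}{n}
  \left[h^2 + 12\,(\bar{x}_\ell - t_\ell)(x - t_\ell)\right].
\end{align*}
```

The last line is the $d = 1$ case of the Fophe formula.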


The New Estimators

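In each bin $B_\ell$ with bin centre $t_\ell$

Fophe

$$\hat{f}_1(x) = \frac{1}{h^{d+2}}\,\frac{n_\ell}{n}\left[h^2 + 12(\bar{x}_\ell - t_\ell)^T (x - t_\ell)\right]$$

Sophe

$$\hat{f}_2(x) = \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{(4 + 5d)}{4}\,h^4 - 15h^2\,\mathrm{tr}(S_\ell) + 12h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[72 S_\ell + 108\,\mathrm{diag}(S_\ell) - 15h^2 I\right] (x - t_\ell) \right\}.$$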
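A minimal sketch of the two estimators in a single one-dimensional bin may help to read the formulas. It is an illustration under stated assumptions, not the authors' implementation: it takes $d = 1$ (so $72 S_\ell + 108\,\mathrm{diag}(S_\ell) = 180 S_\ell$ and $(4 + 5d)/4 = 9/4$), and it assumes that $S_\ell$ is the second moment of the bin's observations about the bin centre $t_\ell$. All function names and the toy data are hypothetical.

```python
def bin_moments(data, t, h):
    """Count, mean offset and second moment about t for points in [t - h/2, t + h/2)."""
    offsets = [x - t for x in data if -h / 2 <= x - t < h / 2]
    n_l = len(offsets)
    if n_l == 0:
        return 0, 0.0, 0.0
    m = sum(offsets) / n_l                   # x_bar - t
    s = sum(u * u for u in offsets) / n_l    # assumed meaning of S in one dimension
    return n_l, m, s

def fophe_1d(data, t, h):
    """First-order polynomial histogram estimate on one bin (d = 1)."""
    n_l, m, _ = bin_moments(data, t, h)
    p = n_l / len(data)
    return lambda x: p / h**3 * (h**2 + 12 * m * (x - t))

def sophe_1d(data, t, h):
    """Second-order polynomial histogram estimate on one bin (d = 1)."""
    n_l, m, s = bin_moments(data, t, h)
    p = n_l / len(data)
    def f(x):
        u = x - t
        return p / h**5 * (2.25 * h**4 - 15 * h**2 * s
                           + 12 * h**2 * m * u
                           + (180 * s - 15 * h**2) * u * u)
    return f

def integrate(g, a, b, steps=20000):
    """Midpoint rule, accurate enough here for low-degree polynomials."""
    w = (b - a) / steps
    return sum(g(a + (k + 0.5) * w) for k in range(steps)) * w

data = [0.05, 0.1, 0.12, 0.3, 0.35, 0.4, 0.45, 0.7, 0.9, 1.4]   # toy sample
t, h = 0.25, 0.5                                                # bin [0, 0.5)
n_l, m, s = bin_moments(data, t, h)
p = n_l / len(data)
f1, f2 = fophe_1d(data, t, h), sophe_1d(data, t, h)
lo, hi = t - h / 2, t + h / 2

mass_f1 = integrate(f1, lo, hi)
mass_f2 = integrate(f2, lo, hi)
first_f2 = integrate(lambda x: (x - t) * f2(x), lo, hi)
second_f2 = integrate(lambda x: (x - t) ** 2 * f2(x), lo, hi)
print(p, mass_f1, mass_f2)
```

In this sketch both estimates integrate to the bin proportion $n_\ell / n$, and the Sophe also reproduces the first and second moments about the bin centre.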


Roederer Data: 10,000 observations, CD4 & CD8


The performance of estimators

We assess the performance of estimators with the MSE.

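Let $\hat\theta$ be an estimator for a true quantity $\theta$. Then

$$\mathrm{MSE}(\hat\theta) = \left[\mathrm{bias}(\hat\theta)\right]^2 + \mathrm{var}(\hat\theta)$$

$$\mathrm{bias}(\hat\theta) = E\hat\theta - \theta$$

$$\mathrm{var}(\hat\theta) = E\left\{\left[\hat\theta - E\hat\theta\right]^2\right\} = E\left[\hat\theta^2\right] - \left[E\hat\theta\right]^2$$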
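The decomposition can be checked exactly on a toy example that is not from the talk. The sketch assumes a deliberately biased estimator: 0.8 times the mean of two fair die rolls, estimating the true mean 7/2. All 36 outcomes are enumerated with exact fractions.

```python
from fractions import Fraction
from itertools import product

theta = Fraction(7, 2)                           # true mean of a fair die
outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely pairs
prob = Fraction(1, len(outcomes))

# Hypothetical biased estimator: shrink the sample mean by a factor 4/5.
estimates = [Fraction(4, 5) * Fraction(a + b, 2) for a, b in outcomes]

expectation = sum(prob * e for e in estimates)
bias = expectation - theta
variance = sum(prob * (e - expectation) ** 2 for e in estimates)
mse = sum(prob * (e - theta) ** 2 for e in estimates)   # direct definition

print(bias, variance, mse)
print(mse == bias ** 2 + variance)
```

Because the arithmetic is exact, the final comparison tests the identity itself rather than a floating-point approximation of it.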


Sophe’s Performance

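For a fixed point $x \in B_\ell$ we want the bias of $\hat{f} = \hat{f}_2$ at $x$

Consider

$$E\left[\hat{f}(x)\right] = E\left( \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{(4 + 5d)}{4}\,h^4 - 15h^2\,\mathrm{tr}(S_\ell) + 12h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[72 S_\ell + 108\,\mathrm{diag}(S_\ell) - 15h^2 I\right] (x - t_\ell) \right\} \right)$$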


Some Expectation Calculations I

We show that

$$E\left[\frac{n_\ell}{n}(\bar{x}_\ell - t_\ell)\right] = \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$

and so

$$E\left[\frac{12h^2}{h^{d+4}} (x - t_\ell)^T \frac{n_\ell}{n} (\bar{x}_\ell - t_\ell)\right] = \frac{12}{h^{d+2}} (x - t_\ell)^T \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$

then use a Taylor expansion of $f$ about the bin centre $t_\ell$

$$f(y) = f(t_\ell) + (y - t_\ell)\,Df(t_\ell) + \frac{1}{2}(y - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(y - t_\ell)^3 D^3 f(t_\ell) + o\left(\|y - t_\ell\|^3\right)$$

Some Expectation Calculations II

The first non-zero integral gives

$$E\left[\frac{12}{h^{d+2}} (x - t_\ell)^T \frac{n_\ell}{n} (\bar{x}_\ell - t_\ell)\right] \approx (x - t_\ell)^T Df(t_\ell)$$

We prove similar results for all terms contributing to $E\left[\hat{f}(x)\right]$ . . . and finally get


The Bias

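$$E[\hat{f}(x)] = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{h^2}{12}(x - t_\ell)^T \left( \frac{\sum_i f_{uii}}{2} - \frac{f_{uuu}}{5} \right) + o(h^3)$$

Taylor expansion of $f$ about the bin centre $t_\ell$

$$f(x) = f(t_\ell) + (x - t_\ell)\,Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(x - t_\ell)^3 D^3 f(t_\ell) + o\left(\|x - t_\ell\|^3\right)$$

so $\mathrm{bias}[\hat{f}(x)]$ depends on a difference of 3rd order derivatives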
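The structure of the bias can be checked exactly in one dimension. The sketch below is an illustration, not the authors' proof. It assumes $d = 1$, treats $S_\ell$ as the second moment about the bin centre, and replaces the sample quantities by their expectations, which are integrals of a cubic Taylor polynomial over the bin. The Taylor coefficients and binwidth are arbitrary illustrative values. In one dimension the third-order term of the expansion reduces to $\frac{h^2}{12}\left(\frac{1}{2} - \frac{1}{5}\right) f'''\,(x - t_\ell) = \frac{h^2}{40} f'''\,(x - t_\ell)$.

```python
from fractions import Fraction as F

def moment(k, h):
    """Exact integral of u**k over the centred bin [-h/2, h/2]."""
    return F(0) if k % 2 else 2 * (h / 2) ** (k + 1) / (k + 1)

def expected_sophe_coeffs(taylor, h):
    """Coefficients (c0, c1, c2) of E[f_hat](u) = c0 + c1*u + c2*u**2 for d = 1.

    taylor = [f, f', f'', f'''] at the bin centre (illustrative values).
    """
    g = [taylor[0], taylor[1], taylor[2] / 2, taylor[3] / 6]   # monomial coefficients
    P = sum(c * moment(k, h) for k, c in enumerate(g))         # E[n_l / n]
    Q1 = sum(c * moment(k + 1, h) for k, c in enumerate(g))    # E[(n_l / n)(x_bar - t)]
    Q2 = sum(c * moment(k + 2, h) for k, c in enumerate(g))    # E[(n_l / n) S]
    c0 = (F(9, 4) * h**4 * P - 15 * h**2 * Q2) / h**5
    c1 = 12 * h**2 * Q1 / h**5
    c2 = (180 * Q2 - 15 * h**2 * P) / h**5
    return c0, c1, c2

h = F(1, 2)
taylor = [F(3), F(-1), F(4), F(6)]
c0, c1, c2 = expected_sophe_coeffs(taylor, h)

# Constant and quadratic terms match the Taylor polynomial exactly;
# the linear term picks up the extra h^2 * f''' / 40.
print(c0 - taylor[0], c1 - taylor[1], c2 - taylor[2] / 2)
```

For a cubic $f$ the remaining bias is the difference between this extra linear term and the cubic Taylor term, which is the "difference of 3rd order derivatives" in the one-dimensional case.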


Moving on . . .

and making some big leaps

We have the following steps in the performance calculations

1 pointwise bias and variance → MSE of $\hat{f}(x)$

2 integrated squared bias and integrated variance of $\hat{f}$ over all $x$

3 finally some asymptotics as $n \to \infty$

We want to know how Fophe and Sophe depend on the sample

size n, the binwidth h, and the dimension d


How Good are Fophe and Sophe

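Estimator   Bias$^2$      Variance                          Rate of Convergence
hist        $C_H h^2$     $\frac{1}{n h^d}$                 $n^{-2/(d+2)}$
kernel      $C_K h^4$     $\frac{R(K)}{n h^d}$              $n^{-4/(d+4)}$
fophe       $C_F h^4$     $\frac{d + 1}{n h^d}$             $n^{-4/(d+4)}$
sophe       $C_S h^6$     $\frac{(d + 1)(d + 2)}{2 n h^d}$  $n^{-6/(d+6)}$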
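The rates in the table follow from balancing squared bias against variance. As a hedged sketch, assume the criterion has the form $C h^q + V / (n h^d)$; setting its derivative in $h$ to zero gives $h^{q+d} = dV / (qCn)$, and substituting back yields the rate $n^{-q/(q+d)}$. The constants $C_H$, $C_F$ and $C_S$ are not given on the slide, so the code sets $C = 1$ purely for illustration. The kernel row is omitted because $R(K)$ depends on the kernel.

```python
from fractions import Fraction

def rate_exponent(q, d):
    """Exponent r in n**(-r) when squared bias ~ h**q and variance ~ 1 / (n * h**d)."""
    return Fraction(q, q + d)

def amise(h, C, V, q, d, n):
    """Assumed criterion: squared bias plus variance."""
    return C * h**q + V / (n * h**d)

def optimal_binwidth(C, V, q, d, n):
    """Minimiser of amise in h, from q*C*h**(q-1) = d*V / (n * h**(d+1))."""
    return (d * V / (q * C * n)) ** (1.0 / (q + d))

d = 12                                   # dimension of the blood-cell data
bias_power = {"hist": 2, "fophe": 4, "sophe": 6}
variance_const = {"hist": 1, "fophe": d + 1, "sophe": (d + 1) * (d + 2) // 2}

exponents = {name: rate_exponent(q, d) for name, q in bias_power.items()}
for name, q in bias_power.items():
    # C = 1.0 is a placeholder; the true constants depend on the unknown density.
    h_star = optimal_binwidth(1.0, variance_const[name], q, d, n=27994)
    print(name, exponents[name], round(h_star, 3))
```

For $d = 12$ the exponents are 1/7, 1/4 and 1/3, so the gain from the higher-order estimators grows in relative terms as the dimension increases.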


Performance for 200, 1000 and 10000 Observations

[Figure: three panels, one each for 200, 1000 and 10000 observations, comparing the kernel, Fophe, hist and Sophe estimators.]


27,994 observations: the kernel estimate takes 92× as long as the Sophe


Advantages of Fophe and Sophe

Computational advantages

1 a smaller number of bins is required

2 number of bins only needs to be approximately correct

Sophe better than Fophe in visual and computational aspects

→ use Sophe for data


Finding Modes with the Sophe

1 Fix binwidth $h_0$, number of bins $\nu_{bin}$, and thresholds $\theta_0$ and $\kappa$.

2 Find bins with high density.

    1 Find $n_\ell$ in each bin, and discard bins that contain fewer than $\theta_0$ observations. Let $\mathcal{B}_0 = \{B_\ell : n_\ell > \theta_0\}$.

    2 Sort bins in $\mathcal{B}_0$ by number of observations, starting with the largest.

3 Determine modes from $\mathcal{B}_0$ using (1) or (2) below.

    1 For $i, j = 1, \ldots, \kappa$ calculate pairwise distances $\Delta_{(i,j)}$ between the bin centres. For $i$ consider the set of nearest neighbours $nn(i) = \{(\Delta_{(i,j)}, n_{(j)}) : \Delta_{(i,j)} \le h_0\}$. $B_{(i)}$ contains a mode if $n_{(i)}$ is the maximum over $nn(i)$.

    2 If the matrix $A_{(j)}$ is negative definite, then $B_{(j)}$ contains a mode.
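The steps above, with option (1) in step 3, can be sketched as follows. This is one possible reading rather than the authors' code: it assumes a regular grid with a common origin, interprets $\kappa$ as the number of highest-count bins that are compared, uses Euclidean distance between bin centres, and counts ties as modes. On a regular grid $\Delta_{(i,j)} \le h_0$ holds exactly when the squared distance between integer bin indices is at most 1, which avoids floating-point comparisons. The function name, parameters and toy data are hypothetical.

```python
from collections import Counter
from math import floor

def find_mode_bins(points, h0, theta0, kappa, origin=0.0):
    """Return (bin centre, count) for bins whose count is maximal among grid neighbours."""
    # Step 2.1: count observations per bin and keep bins with n_l > theta0.
    counts = Counter(tuple(floor((x - origin) / h0) for x in p) for p in points)
    dense = {b: c for b, c in counts.items() if c > theta0}

    # Step 2.2: sort by count, largest first (ties broken by index), keep the top kappa.
    ranked = sorted(dense, key=lambda b: (-dense[b], b))[:kappa]

    # Step 3, option (1): compare each bin with neighbours at centre distance <= h0.
    modes = []
    for b in ranked:
        neighbours = [c for c in ranked
                      if sum((u - v) ** 2 for u, v in zip(b, c)) <= 1]
        if all(dense[b] >= dense[c] for c in neighbours):
            centre = tuple(origin + (k + 0.5) * h0 for k in b)
            modes.append((centre, dense[b]))
    return modes

# Toy 2D data: two clusters and one stray point.
points = [
    (0.1, 0.1), (0.2, 0.5), (0.5, 0.5), (0.7, 0.3), (0.9, 0.9),   # bin (0, 0)
    (1.1, 0.2), (1.5, 0.5), (1.8, 0.7),                           # bin (1, 0)
    (0.3, 1.2), (0.6, 1.7),                                       # bin (0, 1)
    (5.1, 5.1), (5.5, 5.5), (5.4, 5.2), (5.9, 5.8),               # bin (5, 5)
    (5.2, 6.3), (5.7, 6.6),                                       # bin (5, 6)
    (9.5, 9.5),                                                   # stray, discarded
]
modes = find_mode_bins(points, h0=1.0, theta0=1, kappa=10)
print(modes)
```

Option (2) would instead inspect the fitted quadratic coefficient matrix in each retained bin; it is not sketched here.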


Look at the (Log-Data)

[Figure: four panels showing the log-data on CD3, CD4 and CD8 axes, for 2000, 4000, 10000 and 27994 blood cells.]


Modes for 12-Dimensional Data

Use 5 bins in each variable

compare # of modes and % of non-empty bins

Variables           # modes   # of bins      % non-empty
CDs 3, 4, 8         3         125            39.2
+ CDs 14, 19, 56    5         15,625         2.6
all 12              9         244,140,625    0.0015
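The bin counts in the table are powers of 5, which a few lines of arithmetic confirm. The sketch below also derives two quantities that are not on the slide: the approximate number of non-empty bins implied by the rounded percentages, and the largest percentage of bins that 27,994 observations could possibly occupy.

```python
n_obs = 27994
bins_per_variable = 5

# (label, number of variables, % non-empty as reported on the slide)
rows = [("CDs 3, 4, 8", 3, 39.2), ("+ CDs 14, 19, 56", 6, 2.6), ("all 12", 12, 0.0015)]

summary = []
for label, d, pct_nonempty in rows:
    total_bins = bins_per_variable ** d
    approx_nonempty = round(total_bins * pct_nonempty / 100)   # implied by the rounded %
    max_pct = 100 * min(n_obs, total_bins) / total_bins        # each point fills at most one bin
    summary.append((label, total_bins, approx_nonempty, max_pct))
    print(f"{label:18s} {total_bins:>12,d} {approx_nonempty:>6d} {max_pct:10.4f}")
```

With all 12 variables, even one observation per bin could fill only about 0.011% of the bins, so almost all bins are necessarily empty.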


The End

J Jing, I Koch and K Naito (2009). Polynomial Histograms for Multivariate Density and Mode Estimation. Preprint.

Thank you
