
From Histograms to Multivariate Polynomial Histograms and Shape Estimation

Assoc Prof Inge Koch

Statistics, School of Mathematical Sciences

University of Adelaide

Inge Koch (UNSW, Adelaide) Poly Histograms 19 March 2009 1 / 27

Motivation: determine the shape of data

We have 12 measurements on each of 27,994 blood cells

How many clusters?

How big are they and where are they?

Data: Centre for Immunology, St Vincent Hospital, Sydney

Immunologists want to differentiate healthy individuals from those who are HIV+.


Look at the (Log-Data)

[Figure: four panels with axes CD3, CD4 and CD8, showing 2000, 4000, 10000 and 27994 blood cells.]


Histograms of the (Log-Data)

[Figure: two histograms of CD3, one with 10 bins and one with 5 bins.]




How Many Clusters are in the Data?

One-dimensional data: 1 or 2 modes;

Two-dimensional data: 1 to 3 or 4 modes;

How many clusters are in the 12-dimensional data?

If the measurements were independent,

then the number of modes would be the product of the numbers of modes of the individual variables

→ but this is not the case in our data

Can you think of a 3D example with k modes such that the 2D

projections have k − 1 modes?


Polynomial Histogram Estimators

Main idea

histograms have flat tops, so instead of

only estimating the number of points in each bin

estimate the shape separately in each bin


What are Polynomial Histogram Estimators?

Number of observations $n$, dimension $d$, binwidth $h$

$B_\ell$: a bin of volume $h^d$ with $n_\ell$ observations

The model for each bin $B_\ell$

1 histogram estimator (Hist)

$$f_0(x) = a_0$$

2 first-order polynomial histogram estimator (Fophe)

$$f_1(x) = a_0 + a^T x$$

3 second-order polynomial histogram estimator (Sophe)

$$f_2(x) = a_0 + a^T x + x^T A x$$


Relationships for Coefficients

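In each bin $B_\ell$ the estimate $f_k$ satisfies

1 proportion of data

$$\int_{B_\ell} f_k(x)\,dx = \frac{n_\ell}{n}$$

2 local mean

$$\int_{B_\ell} x f_k(x)\,dx = \frac{n_\ell}{n}\,\bar{x}_\ell$$

3 local second moment

$$\int_{B_\ell} x x^T f_k(x)\,dx = \frac{n_\ell}{n}\,M_\ell$$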
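As a worked illustration, here is a sketch for $d = 1$ only. The shorthand $u = x - t_\ell$ (with $t_\ell$ the bin centre), $p_\ell = n_\ell/n$, and the coefficient names $\alpha$, $\beta$ are introduced here for convenience and are not from the slides. Writing $f_1(x) = \alpha + \beta u$ on the bin $[t_\ell - h/2,\ t_\ell + h/2]$, relationships 1 and 2 fix both coefficients:

```latex
\int_{-h/2}^{h/2} (\alpha + \beta u)\,du = \alpha h = p_\ell
\quad\Longrightarrow\quad \alpha = \frac{p_\ell}{h},
\\[4pt]
\int_{-h/2}^{h/2} u\,(\alpha + \beta u)\,du = \beta\,\frac{h^3}{12} = p_\ell\,(\bar{x}_\ell - t_\ell)
\quad\Longrightarrow\quad \beta = \frac{12\,p_\ell\,(\bar{x}_\ell - t_\ell)}{h^3},
\\[4pt]
f_1(x) = \frac{1}{h^{3}}\,\frac{n_\ell}{n}\,\bigl[h^2 + 12(\bar{x}_\ell - t_\ell)(x - t_\ell)\bigr].
```

The second line uses $\int x f_1 = t_\ell \int f_1 + \int u f_1$, so the local-mean condition reduces to a condition on $\int u f_1$.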


The New Estimators

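In each bin $B_\ell$ with bin centre $t_\ell$

Fophe

$$\hat f_1(x) = \frac{1}{h^{d+2}}\,\frac{n_\ell}{n}\,\bigl[h^2 + 12(\bar{x}_\ell - t_\ell)^T (x - t_\ell)\bigr]$$

Sophe

$$\hat f_2(x) = \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \Bigl\{ \frac{(4 + 5d)}{4}h^4 - 15h^2\,\mathrm{tr}(S_\ell) + 12h^2 (x - t_\ell)^T(\bar{x}_\ell - t_\ell) + (x - t_\ell)^T\bigl[72 S_\ell + 108\,\mathrm{diag}(S_\ell) - 15h^2 I\bigr](x - t_\ell) \Bigr\}.$$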
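The two formulas can be sketched in code as follows. This is a minimal illustration, not the authors' implementation. It assumes that $S_\ell$ is the second-moment matrix of the bin's observations about the bin centre $t_\ell$ (the slides do not define it), and the helper `bin_integral` and all variable names are hypothetical. The demo integrates the Sophe over a 2D bin to check that it recovers the proportion $n_\ell/n$.

```python
import numpy as np

def fophe(x, pts, t, h, n):
    """First-order estimate at x from the points pts in the bin centred at t."""
    d = len(t)
    p = len(pts) / n                       # proportion of data in the bin
    m = pts.mean(axis=0) - t               # local mean relative to the bin centre
    return p / h ** (d + 2) * (h ** 2 + 12 * m @ (x - t))

def sophe(x, pts, t, h, n):
    """Second-order estimate at x from the points pts in the bin centred at t."""
    d = len(t)
    p = len(pts) / n
    u = x - t
    c = pts - t
    m = c.mean(axis=0)
    S = c.T @ c / len(pts)                 # ASSUMPTION: second moment about t
    A = 72 * S + 108 * np.diag(np.diag(S)) - 15 * h ** 2 * np.eye(d)
    const = (4 + 5 * d) / 4 * h ** 4 - 15 * h ** 2 * np.trace(S)
    return p / h ** (d + 4) * (const + 12 * h ** 2 * u @ m + u @ A @ u)

def bin_integral(g, t, h, nodes=3):
    """Integrate g over the square bin centred at t (2D Gauss-Legendre rule)."""
    z, w = np.polynomial.legendre.leggauss(nodes)
    total = 0.0
    for zi, wi in zip(z, w):
        for zj, wj in zip(z, w):
            x = t + 0.5 * h * np.array([zi, zj])
            total = total + wi * wj * g(x)
    return total * (0.5 * h) ** 2

# Demo: 30 of n = 100 observations fall in one bin of width h = 0.5
rng = np.random.default_rng(0)
t, h, n = np.array([1.0, 2.0]), 0.5, 100
pts = t + h * (rng.random((30, 2)) - 0.5)
mass = bin_integral(lambda x: sophe(x, pts, t, h, n), t, h)
print(mass, len(pts) / n)                  # the two numbers should agree
```

Under the stated assumption on $S_\ell$, the same integration check also reproduces the local mean and local second moment, which is what makes that reading of $S_\ell$ plausible.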


Roederer Data: 10,000 observations, CD4 & CD8


The performance of estimators

We assess the performance of estimators with the MSE.

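Let $\hat\theta$ be an estimator for a true quantity $\theta$. Then

$$\mathrm{MSE}(\hat\theta) = \bigl[\mathrm{bias}(\hat\theta)\bigr]^2 + \mathrm{var}(\hat\theta)$$

$$\mathrm{bias}(\hat\theta) = E\hat\theta - \theta$$

$$\mathrm{var}(\hat\theta) = E\bigl\{[\hat\theta - E\hat\theta]^2\bigr\} = E[\hat\theta^2] - [E\hat\theta]^2$$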
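A small simulation can illustrate the decomposition. This is only a sketch: the estimator (a deliberately shrunken sample mean), the sample sizes, and the seed are arbitrary choices made here for illustration. With empirical moments taken over the replications, the identity MSE = bias² + variance holds up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0                                   # true value being estimated

# 5000 replications of a biased estimator: 0.9 times the mean of 20 draws
estimates = 0.9 * rng.normal(theta, 1.0, size=(5000, 20)).mean(axis=1)

bias = estimates.mean() - theta               # empirical bias, close to -0.2
var = estimates.var()                         # empirical variance (ddof=0)
mse = np.mean((estimates - theta) ** 2)       # empirical mean squared error

print(bias ** 2 + var, mse)                   # the two values coincide
```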


Sophe’s Performance

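For a fixed point $x \in B_\ell$ we want the bias of $\hat f = \hat f_2$ at $x$

Consider

$$E\bigl[\hat f(x)\bigr] = E\Bigl(\frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \Bigl\{ \frac{(4 + 5d)}{4}h^4 - 15h^2\,\mathrm{tr}(S_\ell) + 12h^2 (x - t_\ell)^T(\bar{x}_\ell - t_\ell) + (x - t_\ell)^T\bigl[72 S_\ell + 108\,\mathrm{diag}(S_\ell) - 15h^2 I\bigr](x - t_\ell) \Bigr\}\Bigr)$$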


Some Expectation Calculations I

We show that

$$E\Bigl[\frac{n_\ell}{n}(\bar{x}_\ell - t_\ell)\Bigr] = \int_{B_\ell} (y - t_\ell) f(y)\,dy$$

and so

$$E\Bigl[\frac{12h^2}{h^{d+4}}(x - t_\ell)^T\,\frac{n_\ell}{n}(\bar{x}_\ell - t_\ell)\Bigr] = \frac{12}{h^{d+2}}(x - t_\ell)^T \int_{B_\ell} (y - t_\ell) f(y)\,dy$$

then use a Taylor expansion of $f$ about the bin centre $t_\ell$

$$f(y) = f(t_\ell) + (y - t_\ell)^T Df(t_\ell) + \frac{1}{2}(y - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(y - t_\ell)^3 D^3 f(t_\ell) + o(\|y - t_\ell\|^3)$$

Some Expectation Calculations II

The first non-zero integral gives

$$E\Bigl[\frac{12}{h^{d+2}}(x - t_\ell)^T\,\frac{n_\ell}{n}(\bar{x}_\ell - t_\ell)\Bigr] \approx (x - t_\ell)^T Df(t_\ell)$$

We prove similar results for all terms contributing to $E[\hat f(x)]$ . . . and finally get
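The leading-order step can be spelled out as follows. This is a sketch using the shorthand $v = y - t_\ell$ (introduced here), with $B_\ell$ taken to be the cube of side $h$ centred at $t_\ell$. The constant term $f(t_\ell)$ integrates to zero against $v$ by symmetry, so the first contribution comes from the linear term of the Taylor expansion:

```latex
\int_{B_\ell} v\,v^T\,dv = \frac{h^{d+2}}{12}\,I_d
\qquad\text{since}\qquad
\int_{-h/2}^{h/2} v_i^2\,dv_i = \frac{h^3}{12},
\quad
\int_{-h/2}^{h/2} v_i\,dv_i = 0,
\\[4pt]
\frac{12}{h^{d+2}}\,(x - t_\ell)^T \int_{B_\ell} v\,v^T Df(t_\ell)\,dv
= \frac{12}{h^{d+2}}\cdot\frac{h^{d+2}}{12}\,(x - t_\ell)^T Df(t_\ell)
= (x - t_\ell)^T Df(t_\ell).
```

The quadratic Taylor term again vanishes by symmetry, and the cubic term contributes at the next order in $h$.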


The Bias

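$$E[\hat f(x)] = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{h^2}{12}(x - t_\ell)^T\Bigl(\frac{\sum_i f_{uii}}{2} - \frac{f_{uuu}}{5}\Bigr) + o(h^3)$$

Taylor expansion of $f$ about the bin centre $t_\ell$

$$f(x) = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(x - t_\ell)^3 D^3 f(t_\ell) + o(\|x - t_\ell\|^3)$$

so $\mathrm{bias}[\hat f(x)]$ depends on the difference of 3rd order derivatives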
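Subtracting the two expansions term by term makes this explicit. The following is a sketch rather than a full proof: the terms up to second order cancel, and because $\|x - t_\ell\| \le \tfrac{\sqrt{d}}{2}h$ for $x \in B_\ell$, the remainder $o(\|x - t_\ell\|^3)$ can be absorbed into $o(h^3)$.

```latex
\operatorname{bias}[\hat f(x)] = E[\hat f(x)] - f(x)
= \frac{h^2}{12}(x - t_\ell)^T\Bigl(\frac{\sum_i f_{uii}}{2} - \frac{f_{uuu}}{5}\Bigr)
- \frac{1}{6}(x - t_\ell)^3 D^3 f(t_\ell) + o(h^3)
```

Both remaining terms are of order $h^3$ inside the bin, so the squared bias is of order $h^6$.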


Moving on . . .

and making some big leaps

We have the following steps in the performance calculations

1 pointwise bias and variance → MSE of $\hat f(x)$

2 integrated squared bias and integrated variance of $\hat f$ over all $x$

3 finally some asymptotics as $n \to \infty$

We want to know how Fophe and Sophe depend on the sample

size n, the binwidth h, and the dimension d


How Good are Fophe and Sophe

Estimator   Bias²        Variance                      Rate of Convergence

Hist        $C_H h^2$    $\frac{1}{nh^d}$              $n^{-2/(d+2)}$

Kernel      $C_K h^4$    $\frac{R(K)}{nh^d}$           $n^{-4/(d+4)}$

Fophe       $C_F h^4$    $\frac{d+1}{nh^d}$            $n^{-4/(d+4)}$

Sophe       $C_S h^6$    $\frac{(d+1)(d+2)}{2nh^d}$    $n^{-6/(d+6)}$
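The last column follows from the first two by a standard balancing argument. If the squared bias behaves like $C h^{a}$ and the variance like $c/(nh^d)$, minimising their sum over $h$ gives $h \propto n^{-1/(d+a)}$ and an error of order $n^{-a/(d+a)}$, with $a = 2, 4, 6$ for the rows above. The short sketch below tabulates the exponent for a few dimensions; the function name is hypothetical and the dimensions are chosen only for illustration.

```python
def rate_exponent(bias_power, d):
    """Exponent r in n**(-r) when bias^2 ~ h**bias_power and variance ~ 1/(n * h**d)."""
    return bias_power / (d + bias_power)

# Power of h in the squared bias for each estimator, as in the table
bias_powers = {"hist": 2, "kernel": 4, "fophe": 4, "sophe": 6}

for d in (1, 2, 12):
    row = {name: round(rate_exponent(a, d), 3) for name, a in bias_powers.items()}
    print(d, row)
```

A larger exponent means faster convergence, so the advantage of the higher-order estimators persists, but shrinks, as the dimension grows.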


Performance for 200, 1000 and 10000 Observations

[Figure: three panels, one per sample size (200, 1000 and 10000 observations), each with curves for the kernel, Fophe, hist and Sophe estimators.]


27,994 observations: the kernel estimate takes 92× as long as Sophe


Advantages of Fophe and Sophe

Computational advantages

1 a smaller number of bins is required

2 number of bins only needs to be approximately correct

Sophe better than Fophe in visual and computational aspects

→ use Sophe for data


Finding Modes with the Sophe

1 Fix binwidth $h_0$, number of bins $\nu_{\mathrm{bin}}$, thresholds $\theta_0$ and $\kappa$.

2 Find bins with high density.

   1 Find $n_\ell$ in each bin, and discard bins that contain fewer than $\theta_0$ observations. Let $B_0 = \{B_\ell : n_\ell > \theta_0\}$.

   2 Sort bins in $B_0$ by number of observations, starting with the largest.

3 Determine modes from $B_0$ using (1) or (2) below.

   1 For $i, j = 1, \ldots, \kappa$ calculate pairwise distances $\Delta_{(i,j)}$ between the bin centres. For $i$ consider the set of nearest neighbours $nn(i) = \{(\Delta_{(i,j)}, n_{(j)}) : \Delta_{(i,j)} \le h_0\}$. $B_{(i)}$ contains a mode if $n_{(i)}$ is the maximum over $nn(i)$.

   2 If the matrix $A_{(j)}$ is negative definite, then $B_{(j)}$ contains a mode.
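The steps above, using rule (1), can be sketched as follows. This is a minimal illustration under several assumptions that are not in the slides: the grid origin is placed at the coordinate-wise minimum of the data, $\kappa$ is read as the number of largest bins that are compared, and tied neighbouring bins are both flagged. The function name and the toy data are hypothetical. Rule (2) is not shown because it needs the fitted Sophe matrix of each bin.

```python
import numpy as np

def find_mode_bins(data, h0, theta0, kappa):
    """Return centres of bins flagged as modes by the neighbour-count rule (1)."""
    origin = data.min(axis=0)                               # ASSUMPTION: grid origin
    idx = np.floor((data - origin) / h0).astype(int)        # bin index of each point
    bins, counts = np.unique(idx, axis=0, return_counts=True)
    keep = counts > theta0                                  # B0 = {B_l : n_l > theta0}
    bins, counts = bins[keep], counts[keep]
    order = np.argsort(-counts, kind="stable")[:kappa]      # largest first, top kappa
    bins, counts = bins[order], counts[order]
    modes = []
    for i in range(len(bins)):
        # squared centre distance in units of h0**2, so <= 1 means Delta <= h0
        d2 = ((bins - bins[i]) ** 2).sum(axis=1)
        if counts[i] >= counts[d2 <= 1].max():              # maximum over nn(i)
            modes.append(origin + (bins[i] + 0.5) * h0)     # bin centre
    return np.array(modes)

# Toy data: two dense bins, each with smaller neighbours, plus one stray point
locations = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5],
                      [5.5, 5.5], [5.5, 4.5], [3.5, 3.5]])
demo = np.repeat(locations, [10, 5, 3, 8, 4, 1], axis=0)
modes = find_mode_bins(demo, h0=1.0, theta0=2, kappa=10)
print(modes)
```

Comparing integer bin indices rather than floating-point centres avoids rounding problems in the test $\Delta_{(i,j)} \le h_0$ for directly adjacent bins.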


Look at the (Log-Data)

[Figure: four panels with axes CD3, CD4 and CD8, showing 2000, 4000, 10000 and 27994 blood cells.]


Modes for 12-Dimensional Data

Use 5 bins in each variable

compare # of modes and % of non-empty bins

Variables            # modes    # of bins      % non-empty

CDs 3, 4, 8          3          125            39.2

+ CDs 14, 19, 56     5          15,625         2.6

all 12               9          244,140,625    0.0015
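The bin counts in the table are $5^d$ for $d = 3, 6, 12$. A short sketch also gives a simple upper bound on the share of non-empty bins, since each of the 27,994 observations can occupy at most one bin. The function name is hypothetical; the bound is elementary arithmetic and not a result from the slides.

```python
def bin_stats(d, bins_per_var=5, n_obs=27994):
    """Total number of bins and an upper bound on the percentage that can be non-empty."""
    n_bins = bins_per_var ** d
    max_pct_nonempty = 100 * min(n_obs, n_bins) / n_bins   # one bin per observation at most
    return n_bins, max_pct_nonempty

for d in (3, 6, 12):
    n_bins, max_pct = bin_stats(d)
    print(d, n_bins, round(max_pct, 4))
```

For all 12 variables the bound is already far below 1%, so almost all bins must be empty whatever the shape of the data.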


The End

J Jing, I Koch and K Naito (2009). Polynomial Histograms for Multivariate Density and Mode Estimation, preprint.

Thank you

