From Histograms to Multivariate Polynomial Histograms and Shape Estimation
Assoc Prof Inge Koch
Statistics, School of Mathematical Sciences
University of Adelaide
Inge Koch (UNSW, Adelaide) Poly Histograms 19 March 2009 1 / 27
Motivation: determine the shape of data
We have 12 measurements on each of 27,994 blood cells
How many clusters?
How big are they and where are they?
Data: Centre for Immunology, St Vincent Hospital, Sydney
Immunologists want to differentiate
healthy individuals from those who are HIV+.
Look at the (Log-Data)
[Figure: four 3D scatterplots of the log-data on axes CD3, CD4 and CD8, for 2000, 4000, 10000 and 27994 blood cells]
Histograms of the (Log-Data)
[Figure: histograms of CD3 (log-data) with 10 bins and with 5 bins]
How Many Clusters Are in the Data?
One-dimensional data: 1 or 2 modes;
Two-dimensional data: 1 to 3 or 4 modes;
How many clusters are in the 12-dimensional data?
If the measurements were independent,
then the number of modes would be the product of the numbers of modes in the individual variables
→ but this is not the case in our data
Can you think of a 3D example with k modes such that the 2D
projections have k − 1 modes?
Polynomial Histogram Estimators
Main idea
histograms have flat tops, so instead of
only estimating the number of points in each bin
estimate the shape separately in each bin
What are Polynomial Histogram Estimators?
Number of observations $n$, dimension $d$, binwidth $h$
$B_\ell$: a bin of volume $h^d$ containing $n_\ell$ observations
The model for each bin $B_\ell$
1. histogram estimator (Hist): $f_0(x) = a_0$
2. first-order polynomial histogram estimator (Fophe): $f_1(x) = a_0 + a^T x$
3. second-order polynomial histogram estimator (Sophe): $f_2(x) = a_0 + a^T x + x^T A x$
Relationships for Coefficients
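In each bin $B_\ell$ the estimate $f_k$ satisfies
1. proportion of data: $\int_{B_\ell} f_k(x)\,dx = \frac{n_\ell}{n}$
2. local mean: $\int_{B_\ell} x\, f_k(x)\,dx = \frac{n_\ell}{n}\,\bar{x}_\ell$
3. local second moment: $\int_{B_\ell} x x^T f_k(x)\,dx = \frac{n_\ell}{n}\,M_\ell$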
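As a worked check of how the conditions pin down the coefficients, here is a sketch for the first-order model in one dimension ($d = 1$). It assumes the bin is $B_\ell = [t_\ell - h/2,\, t_\ell + h/2]$ and writes the model in centred form, $f_1(x) = a_0 + a_1 (x - t_\ell)$. The centred parametrisation is a convenience chosen here, not notation from the slides.

```latex
% Condition 1 (proportion of data): the linear term integrates to zero by symmetry
\int_{B_\ell} f_1(x)\,dx = a_0 h = \frac{n_\ell}{n}
\quad\Longrightarrow\quad
a_0 = \frac{1}{h}\,\frac{n_\ell}{n}

% Condition 2 (local mean): substitute u = x - t_\ell, using \int u^2\,du = h^3/12
\int_{B_\ell} x\, f_1(x)\,dx = t_\ell\, a_0 h + a_1 \frac{h^3}{12}
  = \frac{n_\ell}{n}\,\bar{x}_\ell
\quad\Longrightarrow\quad
a_1 = \frac{12}{h^3}\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)

% Putting the two together
f_1(x) = \frac{1}{h^{3}}\,\frac{n_\ell}{n}
  \left[ h^2 + 12\,(\bar{x}_\ell - t_\ell)(x - t_\ell) \right]
```

This agrees with the general Fophe formula $\frac{1}{h^{d+2}} \frac{n_\ell}{n} [h^2 + 12 (\bar{x}_\ell - t_\ell)^T (x - t_\ell)]$ when $d = 1$.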
The New Estimators
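In each bin $B_\ell$ with bin centre $t_\ell$

Fophe
$$\hat f_1(x) = \frac{1}{h^{d+2}}\,\frac{n_\ell}{n}\left[ h^2 + 12\,(\bar{x}_\ell - t_\ell)^T (x - t_\ell) \right]$$

Sophe
$$\hat f_2(x) = \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{4 + 5d}{4}\,h^4 - 15 h^2 \operatorname{tr}(S_\ell) + 12 h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[ 72 S_\ell + 108 \operatorname{diag}(S_\ell) - 15 h^2 I \right] (x - t_\ell) \right\}.$$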
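The two formulas can be evaluated directly for a single bin. The following is a minimal sketch, not the authors' code. The function names and arguments are hypothetical. The slides do not define $S_\ell$, so the sketch assumes it is the second-moment matrix of the bin's observations about the bin centre $t_\ell$, which is the reading under which the three moment conditions hold.

```python
import numpy as np

def fophe_bin(x, data_in_bin, t, h, n):
    """First-order polynomial histogram estimate at points x inside one bin.

    x: (m, d) evaluation points, data_in_bin: (n_l, d) observations in the bin,
    t: (d,) bin centre, h: binwidth, n: total sample size.
    """
    x = np.atleast_2d(x)
    d = t.size
    n_l = data_in_bin.shape[0]
    xbar = data_in_bin.mean(axis=0)          # local mean
    u = x - t                                # coordinates centred at the bin centre
    return (n_l / n) / h ** (d + 2) * (h ** 2 + 12.0 * (u @ (xbar - t)))

def sophe_bin(x, data_in_bin, t, h, n):
    """Second-order polynomial histogram estimate at points x inside one bin."""
    x = np.atleast_2d(x)
    d = t.size
    n_l = data_in_bin.shape[0]
    c = data_in_bin - t
    xbar_c = c.mean(axis=0)                  # local mean minus bin centre
    # ASSUMPTION: S is the second moment about the bin centre (not defined in the slides)
    S = c.T @ c / n_l
    A = 72.0 * S + 108.0 * np.diag(np.diag(S)) - 15.0 * h ** 2 * np.eye(d)
    u = x - t
    const = (4 + 5 * d) / 4.0 * h ** 4 - 15.0 * h ** 2 * np.trace(S)
    linear = 12.0 * h ** 2 * (u @ xbar_c)
    quadratic = np.einsum("ij,jk,ik->i", u, A, u)   # u^T A u for every row of u
    return (n_l / n) / h ** (d + 4) * (const + linear + quadratic)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, h, n = np.array([0.5]), 1.0, 50
    data = rng.uniform(0.0, 1.0, size=(20, 1))       # 20 of 50 points fall in this bin
    grid = ((np.arange(1000) + 0.5) / 1000).reshape(-1, 1)
    # Both estimates should integrate to roughly n_l / n = 0.4 over the bin
    print(fophe_bin(grid, data, t, h, n).mean(), sophe_bin(grid, data, t, h, n).mean())
```

Neither polynomial is constrained to be non-negative, so an estimate built this way can dip below zero in sparsely populated bins.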
Roederer Data: 10,000 observations, CD4 & CD8
The performance of estimators
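We assess the performance of estimators with the mean squared error (MSE).
Let $\hat\theta$ be an estimator for a true quantity $\theta$. Then
$$\mathrm{MSE}(\hat\theta) = \left[\mathrm{bias}(\hat\theta)\right]^2 + \mathrm{var}(\hat\theta)$$
$$\mathrm{bias}(\hat\theta) = E\hat\theta - \theta$$
$$\mathrm{var}(\hat\theta) = E\left[(\hat\theta - E\hat\theta)^2\right] = E\left[\hat\theta^2\right] - \left[E\hat\theta\right]^2$$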
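The decomposition can be checked numerically. The sketch below uses a deliberately biased toy estimator, 0.9 times the sample mean of normal data. This estimator and its parameters are illustrative assumptions and have nothing to do with the blood-cell data. In theory its bias is $-0.1\,\theta$ and its variance is $0.81/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0                 # true quantity being estimated
n, reps = 10, 20000         # sample size per replicate, number of replicates

# Hypothetical estimator: a shrunken sample mean, biased on purpose
samples = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = 0.9 * samples.mean(axis=1)

bias = theta_hat.mean() - theta             # Monte Carlo estimate of E[theta_hat] - theta
var = theta_hat.var()                       # population form (ddof=0) so the identity is exact
mse = np.mean((theta_hat - theta) ** 2)     # Monte Carlo estimate of the MSE

print("bias^2 + var:", bias ** 2 + var)
print("mse         :", mse)
```

With the `ddof=0` variance the two printed numbers agree up to floating-point error, because the identity holds for the empirical distribution of the replicates as well as in expectation.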
Sophe’s Performance
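For a fixed point $x \in B_\ell$ we want the bias of $\hat f = \hat f_2$ at $x$
Consider
$$E\left[\hat f(x)\right] = E\left( \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{4 + 5d}{4}\,h^4 - 15 h^2 \operatorname{tr}(S_\ell) + 12 h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[ 72 S_\ell + 108 \operatorname{diag}(S_\ell) - 15 h^2 I \right] (x - t_\ell) \right\} \right)$$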
Some Expectation Calculations I
We show that
$$E\left[\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] = \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$
and so
$$E\left[\frac{12 h^2}{h^{d+4}}\,(x - t_\ell)^T\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] = \frac{12}{h^{d+2}}\,(x - t_\ell)^T \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$
then use a Taylor expansion of $f$ about the bin centre $t_\ell$
$$f(y) = f(t_\ell) + (y - t_\ell)^T Df(t_\ell) + \frac{1}{2}(y - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(y - t_\ell)^3 D^3 f(t_\ell) + o\left(\|y - t_\ell\|^3\right)$$
Some Expectation Calculations II
The first non-zero integral gives
$$E\left[\frac{12}{h^{d+2}}\,(x - t_\ell)^T\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] \approx (x - t_\ell)^T Df(t_\ell)$$
We prove similar results for all terms contributing to $E[\hat f(x)]$ ... and finally get
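A sketch of why the first non-zero integral produces exactly the gradient term, assuming the bin $B_\ell$ is a cube of side $h$ centred at $t_\ell$ (consistent with the bin volume $h^d$ used throughout):

```latex
% Zeroth-order Taylor term: vanishes by symmetry of the bin about t_\ell
f(t_\ell) \int_{B_\ell} (y - t_\ell)\,dy = 0

% First-order term: the coordinates are uncorrelated over the cube, so
\int_{B_\ell} (y - t_\ell)(y - t_\ell)^T\,dy = \frac{h^{d+2}}{12}\, I
\quad\Longrightarrow\quad
\int_{B_\ell} (y - t_\ell)\, f(y)\,dy \approx \frac{h^{d+2}}{12}\, Df(t_\ell)

% Multiplying by the prefactor cancels the constant
\frac{12}{h^{d+2}}\,(x - t_\ell)^T \cdot \frac{h^{d+2}}{12}\, Df(t_\ell)
  = (x - t_\ell)^T Df(t_\ell)
```

The second-order Taylor term contributes nothing to this particular integral, since it is odd in $y - t_\ell$ after multiplication by $(y - t_\ell)$. The next correction therefore comes from the third derivatives.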
The Bias
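$$E[\hat f(x)] = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{h^2}{12}\,(x - t_\ell)^T \left( \sum_i \frac{f_{uii}}{2} - \frac{f_{uuu}}{5} \right) + o(h^3)$$
Taylor expansion of $f$ about the bin centre $t_\ell$
$$f(x) = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(x - t_\ell)^3 D^3 f(t_\ell) + o\left(\|x - t_\ell\|^3\right)$$
so $\mathrm{bias}[\hat f(x)]$ depends on a difference of third-order derivatives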
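To make the last statement concrete, here is the one-dimensional case as a sketch. It assumes $d = 1$, writes $u = x - t_\ell$, and reads the bracketed expression as $\sum_i f_{uii}/2 - f_{uuu}/5$ with the sum running over all $i$, so that in one dimension it collapses to a multiple of $f'''(t_\ell)$.

```latex
% d = 1: the bracket reduces to a multiple of the third derivative
\frac{f'''(t_\ell)}{2} - \frac{f'''(t_\ell)}{5} = \frac{3}{10}\, f'''(t_\ell)

% Subtract the Taylor expansion of f(x) from E[\hat f(x)];
% the constant, linear and quadratic terms cancel
\mathrm{bias}[\hat f(x)]
  \approx \frac{h^2}{12}\cdot\frac{3}{10}\, u\, f'''(t_\ell) - \frac{1}{6}\, u^3 f'''(t_\ell)
  = u\, f'''(t_\ell) \left( \frac{h^2}{40} - \frac{u^2}{6} \right)
```

Since $|u| \le h/2$ inside the bin, the leading bias is of order $h^3$, which is in line with a squared bias of order $h^6$ for Sophe.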
Moving on . . .
and making some big leaps
We have the following steps in the performance calculations
1. pointwise bias and variance → MSE of $\hat f(x)$
2. integrated squared bias and integrated variance of $\hat f$ over all $x$
3. finally some asymptotics as $n \to \infty$
We want to know how Fophe and Sophe depend on the sample
size n, the binwidth h, and the dimension d
How Good are Fophe and Sophe?

Estimator   Bias²        Variance                    Rate of convergence
hist        $C_H h^2$    $1/(n h^d)$                 $n^{-2/(d+2)}$
kernel      $C_K h^4$    $R(K)/(n h^d)$              $n^{-4/(d+4)}$
fophe       $C_F h^4$    $(d+1)/(n h^d)$             $n^{-4/(d+4)}$
sophe       $C_S h^6$    $(d+1)(d+2)/(2 n h^d)$      $n^{-6/(d+6)}$
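The rates in the last column follow from balancing the squared bias, of order $h^{2p}$, against the variance, of order $1/(n h^d)$. This gives $h \propto n^{-1/(d+2p)}$ and an error of order $n^{-2p/(d+2p)}$. The sketch below tabulates the exponents. The function name and the convention $p = \text{order} + 1$ are illustrative choices, not notation from the slides.

```python
def optimal_rate(order, d):
    """Exponents of n for the balancing binwidth and the resulting error.

    order: polynomial order (0 = hist, 1 = fophe, 2 = sophe), d: dimension.
    Squared bias ~ h^(2p) and variance ~ 1/(n h^d), with p = order + 1.
    """
    p = order + 1
    h_exponent = -1.0 / (d + 2 * p)         # h ~ n^h_exponent
    error_exponent = -2.0 * p / (d + 2 * p)  # error ~ n^error_exponent
    return h_exponent, error_exponent

if __name__ == "__main__":
    for name, order in (("hist", 0), ("fophe", 1), ("sophe", 2)):
        for d in (1, 2, 12):
            h_exp, err_exp = optimal_rate(order, d)
            print(f"{name:5s} d={d:2d}  h ~ n^{h_exp:.3f}  error ~ n^{err_exp:.3f}")
```

The kernel row has the same exponents as Fophe, so it corresponds to `order=1` in this sketch. The constants $C_H$, $C_K$, $C_F$, $C_S$ and the variance factors are ignored here.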
Performance for 200, 1000 and 10000 Observations
[Figure: three panels, one per sample size (200, 1000 and 10000 observations), each comparing kernel, Fophe, hist and Sophe]
27,994 observations: the kernel estimate takes 92× as long as Sophe
Advantages of Fophe and Sophe
Computational advantages
1. a smaller number of bins is required
2. the number of bins only needs to be approximately correct
Sophe is better than Fophe in visual and computational aspects
→ use Sophe for data
Finding Modes with the Sophe
1. Fix binwidth $h_0$, number of bins $\nu_{bin}$, and thresholds $\theta_0$ and $\kappa$.
2. Find bins with high density.
   1. Find $n_\ell$ in each bin, and discard bins that contain fewer than $\theta_0$ observations. Let $\mathcal{B}_0 = \{B_\ell : n_\ell \ge \theta_0\}$.
   2. Sort bins in $\mathcal{B}_0$ by number of observations, starting with the largest.
3. Determine modes from $\mathcal{B}_0$ using (1) or (2) below.
   1. For $i, j = 1, \ldots, \kappa$ calculate pairwise distances $\Delta_{(i,j)}$ between the bin centres. For $i$ consider the set of nearest neighbours $nn(i) = \{(\Delta_{(i,j)}, n_{(j)}) : \Delta_{(i,j)} \le h_0\}$. $B_{(i)}$ contains a mode if $n_{(i)}$ is the maximum over $nn(i)$.
   2. If the matrix $A_{(j)}$ is negative definite, then $B_{(j)}$ contains a mode.
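The steps above, with option (1) for the final step, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation. The function name is hypothetical, the grid is anchored at the componentwise minimum of the data instead of being set by $\nu_{bin}$, and ties between neighbouring bins are all reported as modes.

```python
import numpy as np

def find_mode_bins(data, h0, theta0, kappa):
    """Return centres of bins that are local count maxima among their neighbours.

    data: (n, d) array, h0: binwidth, theta0: minimum count to keep a bin,
    kappa: number of highest-count bins that are compared with each other.
    """
    data = np.asarray(data, dtype=float)
    origin = data.min(axis=0)                       # ASSUMPTION: grid anchored at data minimum
    idx = np.floor((data - origin) / h0).astype(int)

    # Step 2.1: count observations per occupied bin, discard sparse bins
    bins, counts = np.unique(idx, axis=0, return_counts=True)
    keep = counts >= theta0
    bins, counts = bins[keep], counts[keep]

    # Step 2.2: sort by count, largest first, and keep the top kappa bins
    order = np.argsort(-counts, kind="stable")[:kappa]
    bins, counts = bins[order], counts[order]
    centres = origin + (bins + 0.5) * h0

    # Step 3, option (1): pairwise distances between bin centres
    dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    modes = []
    for i in range(len(counts)):
        neighbours = dist[i] <= h0 * (1 + 1e-9)     # small tolerance for rounding
        if counts[i] >= counts[neighbours].max():   # bin i has the largest count nearby
            modes.append(centres[i])
    return np.array(modes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(500, 2)),
                      rng.normal([6.0, 6.0], 0.5, size=(500, 2))])
    print(find_mode_bins(data, h0=1.0, theta0=20, kappa=50))
```

With the neighbourhood radius equal to $h_0$, only bins that share a face count as neighbours. Two diagonally adjacent bins near the same peak can therefore both be flagged, which is one reason the curvature test in option (2) may be preferable.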
Look at the (Log-Data)
[Figure: four 3D scatterplots of the log-data on axes CD3, CD4 and CD8, for 2000, 4000, 10000 and 27994 blood cells]
Modes for 12-Dimensional Data
Use 5 bins in each variable
compare # of modes and % of non-empty bins

Variables            # modes   # of bins      % non-empty
CDs 3, 4, 8          3         125            39.2
+ CDs 14, 19, 56     5         15,625         2.6
all 12               9         244,140,625    0.0015
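The bin counts in the table are $5^d$ for $d = 3, 6, 12$. Since each observation occupies at most one bin, the sample size of 27,994 also caps the share of non-empty bins. The short sketch below reproduces the counts and that upper bound. The function name is a hypothetical helper.

```python
def bin_budget(d, bins_per_variable=5, n_obs=27994):
    """Total number of bins and an upper bound on the percentage that can be non-empty."""
    total = bins_per_variable ** d
    max_nonempty_pct = 100.0 * min(n_obs, total) / total   # at most one bin per observation
    return total, max_nonempty_pct

if __name__ == "__main__":
    for d in (3, 6, 12):
        total, bound = bin_budget(d)
        print(f"d={d:2d}  bins={total:>11,d}  non-empty <= {bound:.4f}%")
```

For all 12 variables the bound is about 0.0115%, and the reported 0.0015% corresponds to roughly 3,660 occupied bins out of 244,140,625.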
The End
J Jing, I Koch and K Naito (2009). Polynomial Histograms for Multivariate Density and Mode Estimation. Preprint.
Thank you