Compression-based Unsupervised Clustering of Spectral Signatures D. Cerra, J. Bieniarz, J. Avbelj, P. Reinartz, and R. Mueller WHISPERS, Lisbon, 8.06.2011


Page 1: Compression-based Unsupervised Clustering of Spectral Signatures

Compression-based Unsupervised Clustering of Spectral Signatures

D. Cerra, J. Bieniarz, J. Avbelj, P. Reinartz, and R. Mueller

WHISPERS, Lisbon, 8.06.2011

Page 2

Contents

- Introduction
- Compression-based Similarity Measures
  - How to quantify information?
  - Normalized Compression Distance
- CBSM as Spectral Distances
  - Traditional spectral distances
  - NCD as spectral distance

Page 3

Contents (current section: Introduction)

- Introduction
- Compression-based Similarity Measures
- CBSM as Spectral Distances

Page 4

Introduction

Many applications in hyperspectral remote sensing rely on quantifying the similarities between two pixels, represented by spectra:

- Classification / Segmentation
- Target Detection
- Spectral Unmixing

Spectral distances are mostly based on vector processing. Is there any different (and effective) similarity measure out there?

(Plots: two pairs of spectra, one labelled "Similar!", the other "Not similar!")

Page 5

Contents (current section: Compression-based Similarity Measures)

- Introduction
- Compression-based Similarity Measures
  - How to quantify information?
  - Normalized Compression Distance
- CBSM as Spectral Distances

Page 6

How to quantify information? Two approaches

Probabilistic (classic) vs. Algorithmic

Probabilistic: information as uncertainty, measured by the Shannon entropy

    H(X) = -Σ_x p(x) log p(x)

- Related to a random variable X with probability mass function p(x)
- Measure of the average uncertainty in X
- Measures the average number of bits required to describe X
- Computable

Algorithmic: information as complexity, measured by the Kolmogorov complexity

    K(x) = min { l(q) : q ∈ Q_x }

the length of the shortest program q among the Q_x programs which output the string x.

- Related to a single object (string) x
- Measures how difficult it is to describe x from scratch
- Uncomputable
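The probabilistic column is directly computable. A minimal Python sketch (an illustration, not from the slides; the function name is ours) estimating H(X) from the empirical symbol distribution of a string:

```python
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    """Average number of bits per symbol needed to describe s,
    using the empirical symbol probabilities p(x)."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("aaaa"))  # 0.0: a single symbol, no uncertainty
print(shannon_entropy("abab"))  # 1.0 bit per symbol
print(shannon_entropy("abcd"))  # 2.0 bits per symbol
```

No such shortcut exists for the algorithmic column: K(x) is uncomputable, which motivates the compression-based approximation introduced later.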

Page 7

Mutual Information in Shannon/Kolmogorov

Probabilistic (classic) vs. Algorithmic

(Statistic) Mutual Information:

    I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ] = H(X) + H(Y) - H(X,Y)

- Measure in bits of the amount of information a random variable X has about another variable Y
- The joint entropy H(X,Y) is the entropy of the pair (X,Y) with a joint distribution p(x,y)
- Symmetric, non-negative
- If I(X;Y) = 0 then H(X,Y) = H(X) + H(Y): X and Y are statistically independent

Algorithmic Mutual Information:

    I_w(x:y) = K(x) + K(y) - K(x,y),  with I_w(x:y) ≥ 0

- Amount of computational resources shared by the shortest programs which output the strings x and y
- The joint Kolmogorov complexity K(x,y) is the length of the shortest program which outputs x followed by y
- Symmetric, non-negative
- If I_w(x:y) = 0 then K(x,y) = K(x) + K(y): x and y are algorithmically independent

Page 8

Normalized Information Distance (NID)

    NID(x,y) = [ K(x,y) - min{K(x), K(y)} ] / max{K(x), K(y)}   (Li - Vitányi)

Normalized length of the shortest program that computes x knowing y, as well as computing y knowing x.

- Similarity metric: NID(x,y) = 0 iff x = y; NID(x,y) = 1 -> maximum distance between x and y
- The NID minimizes all normalized admissible distances

Page 9

Compression: Approximating Kolmogorov Complexity

Big problem: the Kolmogorov complexity K(x) is uncomputable! K(x) represents a lower bound for what an off-the-shelf compressor can achieve when compressing x. What if we use the approximation

    K(x) ≈ C(x)

where C(x) is the size of the file obtained by compressing x with a standard lossless compressor (such as Gzip)?

Example (two images of the same original size):
- A: original size 65 KB, compressed size 47 KB
- B: original size 65 KB, compressed size 2 KB
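The approximation is easy to try out. A sketch using Python's zlib (DEFLATE, the same algorithm family as gzip); the helper name C is ours:

```python
import os
import zlib

def C(x: bytes) -> int:
    """Size of x after lossless compression with zlib --
    a computable stand-in for the uncomputable K(x)."""
    return len(zlib.compress(x, 9))

regular = b"ab" * 1000          # highly redundant: compresses very well
random_like = os.urandom(2000)  # no exploitable structure

print(len(regular), "->", C(regular))          # compressed size is tiny
print(len(random_like), "->", C(random_like))  # compressed size stays near 2000
```

As with the two 65 KB images above, the gap between original and compressed size reflects how much structure the compressor can exploit.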

Page 10

Normalized Compression Distance (NCD)

Approximate the NID by replacing complexities with compression factors:

    NID(x,y) = [ K(x,y) - min{K(x), K(y)} ] / max{K(x), K(y)}
    NCD(x,y) = [ C(x,y) - min{C(x), C(y)} ] / max{C(x), C(y)}

where C(x,y) is the compressed size of the concatenation of x and y.

(Diagram: x, y, and the concatenation xy each go through a coder, producing C(x), C(y), and C(xy), which are combined into the NCD.)

If two objects compress better together than separately, it means they share common patterns and are similar!

Advantages:
- Basically parameter-free (data-driven)
- Applicable with any off-the-shelf compressor to diverse datatypes
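A minimal sketch of this computation, again with zlib as the off-the-shelf compressor (the function name ncd is ours):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size under a lossless compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy cat " * 20
r = bytes(range(256)) * 4

print(ncd(a, b))  # smaller: the two texts share most of their patterns
print(ncd(a, r))  # larger: little structure in common
```

Because real compressors are imperfect, NCD(x,x) is close to but not exactly 0, and values can slightly exceed 1.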

Page 11

Evolution of CBSM

- 1993, Ziv & Merhav: first use of relative entropy to classify texts
- 2000, Frank et al., Khmelev: first compression-based experiments on text categorization
- 2001, Benedetto et al.: intuitively defined compression-based relative entropy; caused a rise of interest in compression-based methods
- 2002, Watanabe et al.: Pattern Representation based on Data Compression (PRDC); first in classifying general data, with a first step of conversion into strings
- 2004, NCD: solid theoretical foundations (Algorithmic Information Theory)
- 2005-2010, many things came next:
  - Chen-Li Metric for DNA classification (Chen & Li, 2005)
  - Compression-based Dissimilarity Measure (Keogh et al., 2006)
  - Cosine Similarity (Sculley & Brodley, 2006)
  - Dictionary Distance (Macedonas et al., 2008)
  - Fast Compression Distance (Cerra and Datcu, 2010)

Page 12

Compression-Based Similarity Measures: Applications

Clustering and classification of:

- Simple texts
- Dictionaries from different languages
- Music
- DNA genomes
- Volcanology
- Chain letters
- Authorship attribution
- Images
- …

Page 13

How to visualize a distance matrix?

An unsupervised clustering of a distance matrix related to a dataset can be carried out with a dendrogram (binary tree).

- A dendrogram represents a distance matrix in two dimensions
- It recursively splits the dataset in two groups containing similar objects
- The most similar objects appear as siblings

Example distance matrix:

       a     b     c     d     e     f
  a    0     1     1     1     1     1
  b    1     0     0.1   0.3   0.4   0.6
  c    1     0.1   0     0.4   0.4   0.7
  d    1     0.3   0.4   0     0.2   0.5
  e    1     0.4   0.4   0.2   0     0.5
  f    1     0.6   0.7   0.5   0.5   0
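The recursive grouping a dendrogram visualizes can be sketched as single-linkage agglomerative clustering of the example matrix above: repeatedly merge the two clusters whose closest members are nearest. (Illustrative sketch only; actual dendrogram plots are typically produced with a library such as scipy.cluster.hierarchy.)

```python
labels = ["a", "b", "c", "d", "e", "f"]
D = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("a", "d"): 1.0, ("a", "e"): 1.0, ("a", "f"): 1.0,
    ("b", "c"): 0.1, ("b", "d"): 0.3, ("b", "e"): 0.4, ("b", "f"): 0.6,
    ("c", "d"): 0.4, ("c", "e"): 0.4, ("c", "f"): 0.7,
    ("d", "e"): 0.2, ("d", "f"): 0.5, ("e", "f"): 0.5,
}

def dist(p, q):
    """Symmetric lookup into the upper-triangular matrix D."""
    return D[(p, q)] if (p, q) in D else D[(q, p)]

def single_linkage(labels, dist):
    """Return the merge order: at each step, join the two clusters
    whose closest members are nearest (single linkage)."""
    clusters = [frozenset([l]) for l in labels]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] | clusters[j]]
    return merges

merges = single_linkage(labels, dist)
for x, y in merges:
    print(sorted(x), "+", sorted(y))
```

The closest pair b and c (distance 0.1) merges first, so they end up as siblings in the tree, while the outlier a joins only in the final merge.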

Page 14

An all-purpose method: application to DNA genomes

(Dendrogram: the genomes are clustered into two groups, Rodents and Primates.)

Page 15

Volcanology: separating explosions (Ex) from landslides (Ls) at the Stromboli Volcano.

(Dendrogram: the seismic signals are clustered into Explosions and Landslides.)

Page 16

Optical Images: Hierarchical Clustering

60 SPOT 5 subsets, spatial resolution 5 m.

(Dendrogram: the subsets are clustered into Forest, Desert, City, Fields, Clouds, and Sea.)

Page 17

SAR Scene: Hierarchical Clustering

32 TerraSAR-X subsets acquired over Paris, spatial resolution 1.8 m.

(Dendrogram: the subsets form coherent clusters, with one false alarm marked.)

Page 18

Contents (current section: CBSM as Spectral Distances)

- Introduction
- Compression-based Similarity Measures
- CBSM as Spectral Distances
  - Traditional spectral distances
  - NCD as spectral distance

Page 19

Rocks Categorization

41 spectra from the ASTER 2.0 Spectral Library. Spectra belonging to different rocks may present a similar behaviour or overlap.

(Plot: example spectra of the three classes Mafic, Felsic, and Shale.)

Page 20

Some well-known Spectral Distances

- Euclidean Distance
- Spectral Angle
- Spectral Correlation
- Spectral Information Divergence
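The slide shows these distances as formulas (lost in this transcript); as a reference, Python sketches of three of them from their standard definitions (Spectral Correlation, the Pearson-based variant, is analogous):

```python
from math import acos, log, sqrt

def euclidean(x, y):
    """Euclidean distance between two spectra."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def spectral_angle(x, y):
    """Spectral Angle: angle between the spectra viewed as vectors;
    insensitive to a global brightness (scaling) factor."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return acos(max(-1.0, min(1.0, dot / (nx * ny))))

def sid(x, y):
    """Spectral Information Divergence: symmetrized Kullback-Leibler
    divergence between the spectra normalized to sum to one."""
    p = [a / sum(x) for a in x]
    q = [b / sum(y) for b in y]
    return sum(pi * log(pi / qi) + qi * log(qi / pi) for pi, qi in zip(p, q))

s1 = [0.2, 0.4, 0.6, 0.8]
s2 = [0.1, 0.2, 0.3, 0.4]  # same shape as s1, half the brightness
print(euclidean(s1, s2))       # nonzero: sensitive to brightness
print(spectral_angle(s1, s2))  # ~0: scaled copies have the same shape
print(sid(s1, s2))             # ~0: identical after normalization
```

The example shows why the choice of distance matters: for two spectra that differ only by a brightness factor, the Euclidean distance is large while the angle- and divergence-based measures see them as identical.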

Page 21

Results

Evaluation of the dendrogram through visual inspection:
- Is it possible to cut the dendrogram to separate the classes?
- How many objects would be misplaced, given the best cuts?

(Dendrograms: clustering results over the numbered rock spectra; the numeric leaf labels are omitted here.)

Page 22

Conclusions

The NCD can be employed as a spectral distance, and may provide surprising results.

Why?
- The NCD is resistant to noise: differences between minerals of the same class may be regarded as noise
- The NCD (implicitly) focuses on the relevant information within the data: we conjecture that the analysis benefits from considering the general behaviour of the spectra

Drawbacks:
- Computationally intensive (spectra have to be analyzed sequentially)
- Dependent to some extent on the compressor used; in every case, the compressor that best approximates the Kolmogorov complexity of the data at hand should be used

Page 23