Clustering Theory Applications and Algorithms

  • 8/10/2019 Clustering Theory Applications and Algorithms

    1/9

    Contents

    List of Figures
    List of Tables
    List of Algorithms
    Preface

    I Clustering, Data, and Similarity Measures

    1 Data Clustering
      1.1 Definition of Data Clustering
      1.2 The Vocabulary of Clustering
        1.2.1 Records and Attributes
        1.2.2 Distances and Similarities
        1.2.3 Clusters, Centers, and Modes
        1.2.4 Hard Clustering and Fuzzy Clustering
        1.2.5 Validity Indices
      1.3 Clustering Processes
      1.4 Dealing with Missing Values
      1.5 Resources for Clustering
        1.5.1 Surveys and Reviews on Clustering
        1.5.2 Books on Clustering
        1.5.3 Journals
        1.5.4 Conference Proceedings
        1.5.5 Data Sets
      1.6 Summary

    2 Data Types
      2.1 Categorical Data
      2.2 Binary Data
      2.3 Transaction Data
      2.4 Symbolic Data
      2.5 Time Series
      2.6 Summary

    3 Scale Conversion
      3.1 Introduction
        3.1.1 Interval to Ordinal
        3.1.2 Interval to Nominal
        3.1.3 Ordinal to Nominal
        3.1.4 Nominal to Ordinal
        3.1.5 Ordinal to Interval
        3.1.6 Other Conversions
      3.2 Categorization of Numerical Data
        3.2.1 Direct Categorization
        3.2.2 Cluster-based Categorization
        3.2.3 Automatic Categorization
      3.3 Summary

    4 Data Standardization and Transformation
      4.1 Data Standardization
      4.2 Data Transformation
        4.2.1 Principal Component Analysis
        4.2.2 SVD
        4.2.3 The Karhunen-Loève Transformation
      4.3 Summary

    5 Data Visualization
      5.1 Sammon's Mapping
      5.2 MDS
      5.3 SOM
      5.4 Class-preserving Projections
      5.5 Parallel Coordinates
      5.6 Tree Maps
      5.7 Categorical Data Visualization
      5.8 Other Visualization Techniques
      5.9 Summary

    6 Similarity and Dissimilarity Measures
      6.1 Preliminaries
        6.1.1 Proximity Matrix
        6.1.2 Proximity Graph
        6.1.3 Scatter Matrix
        6.1.4 Covariance Matrix
      6.2 Measures for Numerical Data
        6.2.1 Euclidean Distance
        6.2.2 Manhattan Distance
        6.2.3 Maximum Distance
        6.2.4 Minkowski Distance
        6.2.5 Mahalanobis Distance
        6.2.6 Average Distance
        6.2.7 Other Distances
      6.3 Measures for Categorical Data
        6.3.1 The Simple Matching Distance
        6.3.2 Other Matching Coefficients
      6.4 Measures for Binary Data
      6.5 Measures for Mixed-type Data
        6.5.1 General Similarity Coefficient
        6.5.2 General Distance Coefficient
        6.5.3 Generalized Minkowski Distance
      6.6 Measures for Time Series Data
        6.6.1 The Minkowski Distance
        6.6.2 Time Series Preprocessing
        6.6.3 Dynamic Time Warping
        6.6.4 Measures Based on Longest Common Subsequences
        6.6.5 Measures Based on Probabilistic Models
        6.6.6 Measures Based on Landmark Models
        6.6.7 Evaluation
      6.7 Other Measures
        6.7.1 The Cosine Similarity Measure
        6.7.2 Link-based Similarity Measure
        6.7.3 Support
      6.8 Similarity and Dissimilarity Measures between Clusters
        6.8.1 The Mean-based Distance
        6.8.2 The Nearest Neighbor Distance
        6.8.3 The Farthest Neighbor Distance
        6.8.4 The Average Neighbor Distance
        6.8.5 Lance-Williams Formula
      6.9 Similarity and Dissimilarity between Variables
        6.9.1 Pearson's Correlation Coefficients
        6.9.2 Measures Based on the Chi-square Statistic
        6.9.3 Measures Based on Optimal Class Prediction
        6.9.4 Group-based Distance
      6.10 Summary

    II Clustering Algorithms

    7 Hierarchical Clustering Techniques
      7.1 Representations of Hierarchical Clusterings
        7.1.1 n-tree
        7.1.2 Dendrogram
        7.1.3 Banner
        7.1.4 Pointer Representation
        7.1.5 Packed Representation
        7.1.6 Icicle Plot
        7.1.7 Other Representations
      7.2 Agglomerative Hierarchical Methods
        7.2.1 The Single-link Method
        7.2.2 The Complete Link Method
        7.2.3 The Group Average Method
        7.2.4 The Weighted Group Average Method
        7.2.5 The Centroid Method
        7.2.6 The Median Method
        7.2.7 Ward's Method
        7.2.8 Other Agglomerative Methods
      7.3 Divisive Hierarchical Methods
      7.4 Several Hierarchical Algorithms
        7.4.1 SLINK
        7.4.2 Single-link Algorithms Based on Minimum Spanning Trees
        7.4.3 CLINK
        7.4.4 BIRCH
        7.4.5 CURE
        7.4.6 DIANA
        7.4.7 DISMEA
        7.4.8 Edwards and Cavalli-Sforza Method
      7.5 Summary

    8 Fuzzy Clustering Algorithms
      8.1 Fuzzy Sets
      8.2 Fuzzy Relations
      8.3 Fuzzy k-means
      8.4 Fuzzy k-modes
      8.5 The c-means Method
      8.6 Summary

    9 Center-based Clustering Algorithms
      9.1 The k-means Algorithm
      9.2 Variations of the k-means Algorithm
        9.2.1 The Continuous k-means Algorithm
        9.2.2 The Compare-means Algorithm
        9.2.3 The Sort-means Algorithm
        9.2.4 Acceleration of the k-means Algorithm with the kd-tree
        9.2.5 Other Acceleration Methods
      9.3 The Trimmed k-means Algorithm
      9.4 The x-means Algorithm
      9.5 The k-harmonic Means Algorithm
      9.6 The Mean Shift Algorithm
      9.7 MEC
      9.8 The k-modes Algorithm (Huang)
        9.8.1 Initial Modes Selection
      9.9 The k-modes Algorithm (Chaturvedi et al.)
      9.10 The k-probabilities Algorithm
      9.11 The k-prototypes Algorithm
      9.12 Summary

    10 Search-based Clustering Algorithms
      10.1 Genetic Algorithms
      10.2 The Tabu Search Method
      10.3 Variable Neighborhood Search for Clustering
      10.4 Al-Sultan's Method
      10.5 Tabu Search-based Categorical Clustering Algorithm
      10.6 J-means
      10.7 GKA
      10.8 The Global k-means Algorithm
      10.9 The Genetic k-modes Algorithm
        10.9.1 The Selection Operator
        10.9.2 The Mutation Operator
        10.9.3 The k-modes Operator
      10.10 The Genetic Fuzzy k-modes Algorithm
        10.10.1 String Representation
        10.10.2 Initialization Process
        10.10.3 Selection Process
        10.10.4 Crossover Process
        10.10.5 Mutation Process
        10.10.6 Termination Criterion
      10.11 SARS
      10.12 Summary

    11 Graph-based Clustering Algorithms
      11.1 Chameleon
      11.2 CACTUS
      11.3 A Dynamic System-based Approach
      11.4 ROCK
      11.5 Summary

    12 Grid-based Clustering Algorithms
      12.1 STING
      12.2 OptiGrid
      12.3 GRIDCLUS
      12.4 GDILC
      12.5 WaveCluster
      12.6 Summary

    13 Density-based Clustering Algorithms
      13.1 DBSCAN
      13.2 BRIDGE
      13.3 DBCLASD
      13.4 DENCLUE
      13.5 CUBN
      13.6 Summary

    14 Model-based Clustering Algorithms
      14.1 Introduction
      14.2 Gaussian Clustering Models
      14.3 Model-based Agglomerative Hierarchical Clustering
      14.4 The EM Algorithm
      14.5 Model-based Clustering
      14.6 COOLCAT
      14.7 STUCCO
      14.8 Summary

    15 Subspace Clustering
      15.1 CLIQUE
      15.2 PROCLUS
      15.3 ORCLUS
      15.4 ENCLUS
      15.5 FINDIT
      15.6 MAFIA
      15.7 DOC
      15.8 CLTree
      15.9 PART
      15.10 SUBCAD
      15.11 Fuzzy Subspace Clustering
      15.12 Mean Shift for Subspace Clustering
      15.13 Summary

    16 Miscellaneous Algorithms
      16.1 Time Series Clustering Algorithms
      16.2 Streaming Algorithms
        16.2.1 LSEARCH
        16.2.2 Other Streaming Algorithms
      16.3 Transaction Data Clustering Algorithms
        16.3.1 LargeItem
        16.3.2 CLOPE
        16.3.3 OAK
      16.4 Summary

    17 Evaluation of Clustering Algorithms
      17.1 Introduction
        17.1.1 Hypothesis Testing
        17.1.2 External Criteria
        17.1.3 Internal Criteria
        17.1.4 Relative Criteria
      17.2 Evaluation of Partitional Clustering
        17.2.1 Modified Hubert's Γ Statistic
        17.2.2 The Davies-Bouldin Index
        17.2.3 Dunn's Index
        17.2.4 The SD Validity Index
        17.2.5 The S_Dbw Validity Index
        17.2.6 The RMSSTD Index
        17.2.7 The RS Index
        17.2.8 The Calinski-Harabasz Index
        17.2.9 Rand's Index
        17.2.10 Average of Compactness
        17.2.11 Distances between Partitions
      17.3 Evaluation of Hierarchical Clustering
        17.3.1 Testing Absence of Structure
        17.3.2 Testing Hierarchical Structures
      17.4 Validity Indices for Fuzzy Clustering
        17.4.1 The Partition Coefficient Index
        17.4.2 The Partition Entropy Index
        17.4.3 The Fukuyama-Sugeno Index
        17.4.4 Validity Based on Fuzzy Similarity
        17.4.5 Compact and Separate Fuzzy Validity Criterion
        17.4.6 Partition Separation Index
        17.4.7 An Index Based on the Mini-max Filter Concept and Fuzzy Theory
      17.5 Summary

    III Applications of Clustering

    18 Clustering Gene Expression Data
      18.1 Background
      18.2 Applications of Gene Expression Data Clustering
      18.3 Types of Gene Expression Data Clustering
      18.4 Some Guidelines for Gene Expression Clustering
      18.5 Similarity Measures for Gene Expression Data
        18.5.1 Euclidean Distance
        18.5.2 Pearson's Correlation Coefficient
      18.6 A Case Study
        18.6.1 C++ Code
        18.6.2 Results
      18.7 Summary

    IV MATLAB and C++ for Clustering

    19 Data Clustering in MATLAB
      19.1 Read and Write Data Files
      19.2 Handle Categorical Data
      19.3 M-files, MEX-files, and MAT-files
        19.3.1 M-files
        19.3.2 MEX-files
        19.3.3 MAT-files
      19.4 Speed up MATLAB
      19.5 Some Clustering Functions
        19.5.1 Hierarchical Clustering
        19.5.2 k-means Clustering
      19.6 Summary

    20 Clustering in C/C++
      20.1 The STL
        20.1.1 The vector Class
        20.1.2 The list Class
      20.2 C/C++ Program Compilation
      20.3 Data Structure and Implementation
        20.3.1 Data Matrices and Centers
        20.3.2 Clustering Results
        20.3.3 The Quick Sort Algorithm
      20.4 Summary

    A Some Clustering Algorithms

    B The kd-tree Data Structure

    C MATLAB Codes
      C.1 The MATLAB Code for Generating Subspace Clusters
      C.2 The MATLAB Code for the k-modes Algorithm
      C.3 The MATLAB Code for the MSSC Algorithm

    D C++ Codes
      D.1 The C++ Code for Converting Categorical Values to Integers
      D.2 The C++ Code for the FSC Algorithm

    Bibliography
    Subject Index
    Author Index