Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
Advisor : Dr. Hsu
Presenter : Keng-Wei Chang
Authors: Yehuda Koren and David Harel
A Two-Way Visualization Method for Clustered Data
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Outline
─ Motivation
─ Objective
─ Introduction
─ Basic Notions
─ Computing The x-Coordinates
─ Computing The y-Coordinates
─ Result
─ Related Work
─ Conclusions
─ Personal Opinion

Motivation
A number of technological developments have led to an explosion of raw data that has to be analyzed.
We are especially interested in two families of tools in this domain: clustering algorithms and data visualization methods.

Objective
In this paper, we integrate the two approaches:
─ hierarchical clustering depicted as a dendrogram
─ low-dimensional embedding

Introduction
Clustering methods can be broadly classified as hierarchical or partitional.
Our main interest here is hierarchical clustering.
The clustering hierarchy is often visualized as a dendrogram, which is a full binary tree.
The dendrogram has a significant disadvantage: it does not provide an exploratory visual representation of the data itself.
Another issue is that of cluster validity.
We are particularly interested in methods for achieving a low-dimensional embedding of data:
─ principal component analysis (PCA)
─ multidimensional scaling (MDS)
─ force-directed placement
These methods overcome some limitations of the dendrogram, but they cannot utilize external clustering information.
Demonstration of the relative merits of the two approaches: a dendrogram vs. a low-dimensional embedding.

Basic Notions
We are given data about n elements {1,…,n}.
Relationships between pairs of elements are described either by distances $d_{ij} \ge 0$ or by similarities $w_{ij} \ge 0$.
A 2-dimensional embedding of the data is defined by two vectors $x, y \in \mathbb{R}^n$; the coordinates of element i are $(x_i, y_i)$.

Computing The x-Coordinates
The embedding must place each element exactly below its corresponding leaf in the dendrogram.
This means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram.
We therefore face the problem of computing the x-coordinates of the dendrogram leaves in a way that preserves the relationships among the data as much as possible.
Rather than exhausting all existing methods, we opt for a twofold process:
─ Find the best orientation of the dendrogram. This step determines the ordering of the leaves.
─ Decide on the exact gaps between consecutive leaves in the ordering.

Dendrogram orientation
A dendrogram with n leaves has $2^{n-1}$ different orientations, because the two subtrees of each of its n-1 internal nodes can be swapped independently.
One way of formally defining what should be considered a "good" ordering is to associate a cost function with the dendrogram, such that finding the best ordering is equivalent to optimizing this function.
For similarities, this is the classical minimum linear arrangement problem: find the ordering x that minimizes
$$LA^{sim}(x) \stackrel{def}{=} \sum_{i,j} w_{ij} \cdot |x_i - x_j|$$
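To make the linear arrangement cost concrete, here is a minimal Python sketch that evaluates the sum of w_ij · |x_i − x_j| for two orderings. The similarity matrix is made up for illustration, and the sketch assumes each unordered pair is counted once (counting ordered pairs would only double every cost).

```python
from itertools import combinations

def la_sim(w, x):
    """Linear arrangement cost: sum over pairs of w[i][j] * |x[i] - x[j]|."""
    n = len(x)
    return sum(w[i][j] * abs(x[i] - x[j]) for i, j in combinations(range(n), 2))

# Hypothetical similarities for 4 elements: 0-1 and 2-3 are strongly related.
w = [
    [0, 5, 1, 0],
    [5, 0, 1, 0],
    [1, 1, 0, 4],
    [0, 0, 4, 0],
]

good = [1, 2, 3, 4]  # x[i] is the position of element i; similar elements are adjacent
bad = [1, 3, 2, 4]   # elements 1 and 2 swap places, pulling similar pairs apart

cost_good = la_sim(w, good)
cost_bad = la_sim(w, bad)
```

The ordering that keeps strongly similar elements next to each other gets the lower cost, which is what minimizing the function rewards.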
In our particular problem we are also faced with an ordering task: finding a permutation of {1, …, n}.
However, here we should not consider all possible permutations, only those that agree with the dendrogram's structure. This reduces the search space from $n!$ to $2^{n-1}$ orderings.
Using dynamic programming, the running time is exponential in the dendrogram's height, not in its size.
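The restriction to orderings that agree with the dendrogram can be illustrated with a small exhaustive search. This is only a brute-force baseline under assumed data, not the dynamic programming algorithm mentioned on the slide; the nested-pair tree encoding and the similarity values are hypothetical.

```python
from itertools import combinations
from math import factorial

def orderings(tree):
    """Yield every leaf ordering obtainable by flipping internal nodes.

    A tree is either a leaf (an int) or a pair (left, right).
    """
    if isinstance(tree, int):
        yield [tree]
        return
    left, right = tree
    for a in orderings(left):
        for b in orderings(right):
            yield a + b  # children in the given order
            yield b + a  # children flipped

def la_sim(w, order):
    """Cost of an ordering: sum of w[i][j] * |pos(i) - pos(j)| over pairs."""
    pos = {leaf: p for p, leaf in enumerate(order, start=1)}
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i, j in combinations(sorted(pos), 2))

# Hypothetical dendrogram on leaves 0..3 and hypothetical similarities.
tree = ((0, 1), (2, 3))
w = [
    [0, 5, 0, 3],
    [5, 0, 0, 0],
    [0, 0, 0, 4],
    [3, 0, 4, 0],
]

candidates = list(orderings(tree))   # 2^(n-1) orderings
all_permutations = factorial(4)      # n! orderings without the restriction
best = min(candidates, key=lambda order: la_sim(w, order))
```

For this tree only 8 of the 24 permutations are candidates, and the best one places leaves 0 and 3 next to each other because they share a nonzero similarity across the two subtrees.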
We introduce an additional form of the cost function, for distances; here the best ordering x maximizes
$$LA^{dist}(x) \stackrel{def}{=} \sum_{i,j} d_{ij} \cdot |x_i - x_j|$$
Given an ordered dendrogram T and a node v:
─ Leaves(v) is the set of leaves in the subtree rooted at v
─ x is the ordering on the leaves
─ S is Leaves(v)
─ L is the set of leaves to the left of S
─ R is the set of leaves to the right of S
If |L| = l and |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}.
A key concept of the algorithm is the local arrangement cost, defined as:
$$LocalLA^{T}(v) \stackrel{def}{=} \sum_{i,j \in S} w_{ij} \cdot |x_i - x_j| + \sum_{i \in S, j \in L} w_{ij} \cdot (x_i - l) + \sum_{i \in S, j \in R} w_{ij} \cdot (l + s - x_i) + \sum_{i \in L, j \in R} w_{ij} \cdot s$$
where |L| = l and |S| = s, so that x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}.
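As a sanity check on the local arrangement cost, the following Python sketch evaluates it for a node whose leaves S occupy consecutive positions. It assumes the four-term form: pairs inside S contribute w_ij · |x_i − x_j|, pairs from S to L contribute w_ij · (x_i − l), pairs from S to R contribute w_ij · (l + s − x_i), and pairs from L to R contribute w_ij · s. The matrix and ordering are hypothetical.

```python
def local_la(w, pos, S):
    """Local arrangement cost of a node whose leaves S are consecutive.

    w   : symmetric similarity matrix
    pos : dict mapping each leaf to its 1-based position in the ordering
    S   : set of leaves under the node
    """
    s = len(S)
    l = min(pos[i] for i in S) - 1                  # number of leaves left of S
    L = {i for i in pos if pos[i] <= l}
    R = {i for i in pos if pos[i] > l + s}
    inside = sum(w[i][j] * abs(pos[i] - pos[j]) for i in S for j in S if i < j)
    to_left = sum(w[i][j] * (pos[i] - l) for i in S for j in L)
    to_right = sum(w[i][j] * (l + s - pos[i]) for i in S for j in R)
    across = sum(w[i][j] * s for i in L for j in R)
    return inside + to_left + to_right + across

# Hypothetical similarities and an ordering that places leaves 2, 3, 0, 1.
w = [
    [0, 5, 0, 3],
    [5, 0, 0, 0],
    [0, 0, 0, 4],
    [3, 0, 4, 0],
]
order = [2, 3, 0, 1]
pos = {leaf: p for p, leaf in enumerate(order, start=1)}

cost_root = local_la(w, pos, {0, 1, 2, 3})  # L and R are empty at the root
cost_01 = local_la(w, pos, {0, 1})
cost_23 = local_la(w, pos, {2, 3})
```

At the root both L and R are empty, so the local cost reduces to the full linear arrangement cost. In this example the costs of the two children also add up to the cost of the root.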
Two additional related terms will be used:
$$LeftCut^{T}(v) \stackrel{def}{=} \sum_{i \in S, j \in L} w_{ij}, \qquad RightCut^{T}(v) \stackrel{def}{=} \sum_{i \in S, j \in R} w_{ij}$$
Another term that will be used in the algorithm:
$$InnerCut(v) = \sum_{i \in Leaves(v.left), j \in Leaves(v.right)} w_{ij}$$
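The three cut terms are plain sums of similarities between leaf sets, which the short Python sketch below computes for one internal node. The nested-pair tree encoding, the helper names, and the data are assumptions made for illustration.

```python
def leaves(tree):
    """Leaves of a nested-pair tree, left to right."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def cut(w, A, B):
    """Total similarity between leaf sets A and B."""
    return sum(w[i][j] for i in A for j in B)

def cuts(w, order, node):
    """Return (LeftCut, RightCut, InnerCut) of an internal node.

    order is the full leaf ordering; the leaves of node must be consecutive in it.
    """
    S = leaves(node)
    start = min(order.index(i) for i in S)
    L = order[:start]              # leaves to the left of S
    R = order[start + len(S):]     # leaves to the right of S
    inner = cut(w, leaves(node[0]), leaves(node[1]))
    return cut(w, S, L), cut(w, S, R), inner

# Hypothetical similarities and ordering.
w = [
    [0, 5, 0, 3],
    [5, 0, 0, 0],
    [0, 0, 0, 4],
    [3, 0, 4, 0],
]
order = [2, 3, 0, 1]

cuts_01 = cuts(w, order, (0, 1))
cuts_23 = cuts(w, order, (2, 3))
```

Here the node over leaves 0 and 1 sits at the right end of the ordering, so its right cut is zero and its left cut collects the single cross-subtree similarity between leaves 0 and 3.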

Determining coordinates of the leaves
Once the ordering is fixed, the remaining task is to compute the exact gap between each two consecutive leaves.
A better approach is to take a weighted average over all influenced leaf pairs: the gap between consecutive leaves i and i+1 is computed from the distances $d_{jk}$ of all pairs with $j \le i < k$, weighted according to their separation $k - j$.
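The idea of averaging over all leaf pairs that span a gap can be sketched in Python as follows. The exact weights are not legible in this copy of the slides, so the scheme here is an assumption: each pair (j, k) with j ≤ i < k estimates the gap as d_jk / (k − j) and votes with weight 1 / (k − j), so that nearby pairs count more than distant ones.

```python
def gaps(d):
    """Gap between consecutive leaves i and i+1 (0-based, in leaf order).

    d is the distance matrix with rows and columns sorted by leaf order.
    Assumed scheme: pair (j, k) estimates the gap as d[j][k] / (k - j)
    and is weighted by 1 / (k - j).
    """
    n = len(d)
    result = []
    for i in range(n - 1):
        pairs = [(j, k) for j in range(i + 1) for k in range(i + 1, n)]
        total_weight = sum(1 / (k - j) for j, k in pairs)
        weighted = sum((d[j][k] / (k - j)) * (1 / (k - j)) for j, k in pairs)
        result.append(weighted / total_weight)
    return result

def x_coordinates(d):
    """Accumulate the gaps into x-coordinates, starting at 0."""
    xs = [0.0]
    for g in gaps(d):
        xs.append(xs[-1] + g)
    return xs

# Hypothetical data: four equally spaced points on a line.
points = [0, 2, 4, 6]
d = [[abs(p - q) for q in points] for p in points]
xs = x_coordinates(d)
```

For equally spaced collinear points every pair gives the same estimate, so the sketch recovers the original spacing. With real data the estimates disagree and the weights decide how much long-range pairs influence each gap.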

Computing The y-Coordinates
─ Principal component analysis
─ Classical multidimensional scaling
─ Eigen-projection
─ Stress minimization
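Of the four options listed, stress minimization is the easiest to sketch in a few lines. The Python example below assumes the x-coordinates are already fixed and adjusts only y so that the embedded distances approach the target distances. It uses plain gradient descent with step halving, which is a swapped-in optimizer chosen for brevity and not necessarily the one used in the paper; the data are hypothetical.

```python
import math
import random

def stress(x, y, d):
    """Sum over pairs of (embedded distance - target distance)^2."""
    n = len(x)
    return sum((math.hypot(x[i] - x[j], y[i] - y[j]) - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def gradient(x, y, d):
    """Partial derivatives of the stress with respect to each y[i]."""
    n = len(x)
    g = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.hypot(x[i] - x[j], y[i] - y[j])
            if dist > 0:
                g[i] += 2 * (dist - d[i][j]) * (y[i] - y[j]) / dist
    return g

def fit_y(x, d, iterations=200, seed=0):
    """Minimize stress over y only; x stays fixed. Returns (y, final stress)."""
    rng = random.Random(seed)
    y = [rng.uniform(-1, 1) for _ in x]
    current = stress(x, y, d)
    step = 0.1
    for _ in range(iterations):
        g = gradient(x, y, d)
        trial = [yi - step * gi for yi, gi in zip(y, g)]
        trial_stress = stress(x, trial, d)
        if trial_stress < current:
            y, current = trial, trial_stress  # accept only improving steps
        else:
            step /= 2                         # otherwise shrink the step
    return y, current

# Hypothetical 2-D points; their x-coordinates play the role of the fixed leaf positions.
true_points = [(0, 0), (1, 2), (2, -1), (3, 1)]
x = [p[0] for p in true_points]
d = [[math.dist(p, q) for q in true_points] for p in true_points]
y, final_stress = fit_y(x, d)
```

Because only improving steps are accepted, the stress never increases from its starting value. The recovered y is determined only up to a reflection and a vertical shift, since neither changes the pairwise distances.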

Result
Odors dataset
─ consists of 30 volatile odorous pure chemicals
─ contains 262 elements; natural clusters: 30
─ the dendrogram is constructed with UPGMA agglomerative clustering
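For readers unfamiliar with UPGMA, here is a minimal pure-Python sketch of average-linkage agglomerative clustering that returns the dendrogram as nested pairs. The distance matrix is a toy example, not the odors data, and the nested-pair output format is an assumption made for illustration.

```python
def upgma(d):
    """Agglomerative clustering with average linkage (UPGMA).

    d is a symmetric distance matrix. Returns the dendrogram as nested
    pairs, e.g. ((0, 1), 2), where integers are the original elements.
    """
    clusters = {i: (i, 1) for i in range(len(d))}   # id -> (tree, size)
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(d)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)              # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Size-weighted average distance from the merged cluster to the others.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances involving the merged clusters, then add the new ones.
        dist = {pair: v for pair, v in dist.items()
                if a not in pair and b not in pair}
        for c, v in new_dist.items():
            dist[(c, next_id)] = v                  # c < next_id always holds
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy distances: elements 0 and 1 are close, as are 2 and 3.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
dendrogram = upgma(d)
```

The result is a full binary tree over the elements, which is the kind of structure whose orientation and leaf gaps the method then optimizes.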
Iris dataset
─ an example of discriminant analysis
─ contains 150 elements; natural clusters: 3
Gene expression data: CDC15-synchronized cell cycle
─ a much larger dataset of gene-expression data
─ contains 6113 elements

Related Work
TreeView: a dendrogram over a color-coded matrix

Discussion
The method succeeds in integrating two key methods of exploratory data analysis: cluster analysis and low-dimensional embedding.
It has two unique properties:
─ guaranteed separation between any kind of given clusters
─ the ability to deal with a predefined hierarchical clustering

Personal Opinion
Advantages
─ Successfully integrates two approaches to analyzing clustered data.
─ Makes the analysis more intuitive.
Application
─ Clustering and analyzing real data.
─ May solve the problem of a lack of clustering information.
Limitation
─ Cannot show the real shape of clusters.