
Data Min Knowl Disc (2012) 25:243–269, DOI 10.1007/s10618-012-0272-z

Finding density-based subspace clusters in graphs with feature vectors

Stephan Günnemann · Brigitte Boden · Thomas Seidl

Received: 31 October 2011 / Accepted: 16 May 2012 / Published online: 3 June 2012
© The Author(s) 2012

Abstract Data sources representing attribute information in combination with network information are widely available in today's applications. To realize the full potential for knowledge extraction, mining techniques like clustering should consider both information types simultaneously. Recent clustering approaches combine subspace clustering with dense subgraph mining to identify groups of objects that are similar in subsets of their attributes as well as densely connected within the network. While those approaches successfully circumvent the problem of full-space clustering, their limited cluster definitions are restricted to clusters of certain shapes. In this work we introduce a density-based cluster definition, which takes into account the attribute similarity in subspaces as well as a local graph density and enables us to detect clusters of arbitrary shape and size. Furthermore, we avoid redundancy in the result by selecting only the most interesting non-redundant clusters. Based on this model, we introduce the clustering algorithm DB-CSC, which uses a fixed point iteration method to efficiently determine the clustering solution. We prove the correctness and complexity of this fixed point iteration analytically. In thorough experiments we demonstrate the strength of DB-CSC in comparison to related approaches.

Keywords Graph clustering · Dense subgraphs · Networks

Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

S. Günnemann (B) · B. Boden · T. Seidl
Data Management and Data Exploration Group, RWTH Aachen University, Aachen, Germany
e-mail: [email protected]

B. Boden
e-mail: [email protected]

T. Seidl
e-mail: [email protected]


1 Introduction

In the past few years, data sources representing attribute information in combination with network information have become more numerous. Such data describes single objects via attribute vectors and also relationships between different objects via edges. Examples include social networks, where friendship relationships are available along with the users' individual interests (cf. Fig. 1); systems biology, where interacting genes and their specific expression levels are recorded; and sensor networks, where connections between the sensors as well as individual measurements are given. There is a need to extract knowledge from such complex data sources, e.g. for finding groups of homogeneous objects, i.e. clusters. Throughout the past decades, a multitude of clustering techniques were introduced that either solely consider attribute information or network information. However, simply applying one of these techniques misses the potential given by such combined data sources. To detect more informative patterns it is preferable to simultaneously consider relationships together with attribute information. In this work, we focus on the mining task of clustering to extract meaningful groups from such complex data.

Former techniques considering both data types aim at a combination of dense subgraph mining (regarding relationships) and traditional clustering (regarding attributes). The detected clusters are groups of objects showing high graph connectivity as well as similarity w.r.t. all of their attributes. However, using traditional, i.e. full-space, clustering approaches on such complex data mostly does not lead to meaningful patterns. Usually, for each object a multitude of different characteristics is recorded; though not all of these characteristics are relevant for each cluster and thus clusters are located only in subspaces of the attributes. In, e.g., social networks it is very unlikely that people are similar within all of their characteristics. In Fig. 1b the customers show similarity in two of their three attributes. In these scenarios, applying full-space clustering is futile or leads to very questionable clustering results since irrelevant dimensions strongly obfuscate the clusters and distances are not discriminable anymore (Beyer et al. 1999). Consequently, recent approaches (Moser et al. 2009; Günnemann et al. 2010) combine the paradigms of dense subgraph mining and subspace clustering, i.e. clusters are identified in locally relevant subspace projections of the attribute data.

Joining these two paradigms leads to clusters useful for many applications: In social networks, closely related friends with similar interests in some product-relevant attributes are useful for target marketing.

Fig. 1 Clusters in a social network with attributes age, tv consume and web consume: a attribute view (2D subspace), b graph view (extract)


In systems biology, functional modules with partly similar expression levels can be used for novel drug design. In sensor networks, long distance reports of connected sensors sharing some similar measurements can be accumulated and transferred by just one representative to reduce energy consumption. Overall, by using subspace clustering the problems of full-space similarity are, in principle, circumvented for these complex data.

However, the cluster models of the existing approaches (Moser et al. 2009; Günnemann et al. 2010) have a severe limitation: they are restricted to clusters of certain shapes. This holds for the properties a cluster has to fulfill w.r.t. the network information as well as w.r.t. the attribute information. While for the graph structure models like quasi-cliques are used, for the vector data the objects' attribute values are only allowed to deviate within an interval of fixed width that the user has to specify. Simply stated, the clustered objects have to be located in a rectangular hypercube of given width. For real world data, such cluster definitions are usually too restrictive since clusters can exhibit more complex shapes. Consider for example the 2D attribute subspace of the objects in Fig. 1a: The two clusters cannot be correctly detected by the rectangular hypercube model of these approaches (Moser et al. 2009; Günnemann et al. 2010). Either both clusters are merged or some objects are lost. In Fig. 1b an extract of the corresponding network structure is shown. Since the restrictive quasi-clique property would only assign a low density to this cluster, the cluster would probably be split by the model of Günnemann et al. (2010).

In this work, we combine dense subgraph mining with subspace clustering based on a more sophisticated cluster definition, thus solving the drawbacks of previous approaches. Established for other data types, density-based notions of clusters have shown their strength in many cases. Thus, we introduce a density-based clustering principle for the considered combined data sources. Our clusters correspond to dense regions in the attribute space as well as in the graph. Based on local neighborhoods taking the attribute similarity in subspaces as well as the graph information into account, we model the density of single objects. By merging all objects located in the same dense region, the overall clusters are obtained. Thus, our model is able to detect the clusters in Fig. 1 correctly.

Besides the sound definition of clusters based on density values, we achieve a further advantage. In contrast to previous approaches, the clusters in our model are not limited in their size or shape but can show arbitrary shapes. For example, the diameter of the clusters detected by Günnemann et al. (2010) is a priori restricted by a parameter-dependent value (Pei et al. 2005), leading to a bias towards clusters of small size and little extent. Such a bias is avoided in our model, where the sizes and shapes of clusters are automatically detected. Overall, our contributions are:

– We introduce a novel density-based cluster definition taking attributes in subspaces and graph information into account.

– We ensure an unbiased cluster detection since our clusters can have arbitrary shape and size.

– We propose a fixed point iteration method to determine the clusters in a single subspace and we prove the correctness, convergence and runtime complexity of this method.


– We develop the algorithm DB-CSC to efficiently determine the overall clustering solution.

Note that an earlier version of this article was published in Günnemann et al. (2011). In this extended version we additionally provide a thorough theoretical analysis of the proposed model, we present a detailed discussion of our model's parameters, and we show how our approach generalizes well-known clustering principles. Furthermore, we prove the correctness of our fixed point iteration technique, its convergence and its runtime complexity.

2 Related work

Various clustering methods have been proposed in the literature. Clustering vector data is traditionally done by using all attributes of the feature space. Density-based techniques (Ester et al. 1996; Hinneburg and Keim 1998) have shown their strength in contrast to other full-space clustering approaches like k-means. They do not require the number of clusters as an input parameter and are able to find arbitrarily shaped clusters. However, full-space clustering does not scale to high dimensional data since locally irrelevant dimensions obfuscate the clustering structure (Beyer et al. 1999; Kriegel et al. 2009). As a solution, subspace clustering methods detect an individual set of relevant dimensions for each cluster (Parsons et al. 2004; Kriegel et al. 2009; Müller et al. 2009; Günnemann et al. 2010). Also for this mining paradigm, density-based clustering approaches (Kailing et al. 2004; Assent et al. 2008) have resolved the drawbacks of previous subspace cluster definitions. However, none of the introduced subspace clustering techniques considers graph data.

Mining graph data can be done in various ways (Aggarwal and Wang 2010). The task of finding groups of densely connected objects in one large graph, as needed for our method, is often referred to as "graph clustering" or "dense subgraph mining". An overview of the various dense subgraph mining techniques is given by Aggarwal and Wang (2010). Methods for graph partitioning have been presented by Long et al. (2006, 2007) and Ruan and Zhang (2007). Furthermore, several different notions of density are used for graph data including cliques, γ-quasi-cliques (Pei et al. 2005), and k-cores (Dorogovtsev et al. 2006; Janson and Luczak 2007). However, none of the introduced dense subgraph mining techniques considers attribute data annotated to the vertices.

Apart from the approaches discussed above, there exist methods considering graph data and attribute data. In the approaches of Du et al. (2007) and Kubica et al. (2003), attribute data is only used in a post-processing step. Thus, attributes do not influence the resulting subgraph structures. In Hanisch et al. (2002) the network topology is transformed into a (shortest path) distance and is combined with the feature distance such that any distance-based clustering algorithm can be applied afterwards. Using combined distance functions like this leads to results that are difficult to interpret, as no conclusions about cluster structures in the graph are possible. In the work of Ulitsky and Shamir (2007), contrarily, the attribute information is transformed into a graph. Ester et al. (2006) extend the k-center problem by requiring that each group has to be a connected subgraph. These approaches (Hanisch et al. 2002; Ulitsky and Shamir 2007; Ester et al. 2006) perform full-space clustering on the attributes.


The approach of Zhou et al. (2009, 2010) enriches the graph by further nodes based on the vertices' attribute values and connects them to vertices showing this value. The clustered objects are only pairwise similar and no specific relevant dimensions can be defined.

Recently, two approaches (Moser et al. 2009; Günnemann et al. 2010) were introduced that deal with subspace clustering and dense subgraph mining. However, both approaches use too simple cluster definitions. Similar to grid-based subspace clustering (Agrawal et al. 1998), a cluster (w.r.t. the attributes) is simply defined by taking all objects located within a given grid cell, i.e. whose attribute values differ by at most a given threshold. The methods are biased towards small clusters with little extent. This drawback is even worsened by the used notions of dense subgraphs: e.g. by using quasi-cliques as in the approach of Günnemann et al. (2010), the diameter is a priori constrained to a fixed threshold (Pei et al. 2005). Very similar objects located just slightly next to a cluster are lost due to such restrictive models. Overall, finding meaningful patterns based on the cluster definitions of Moser et al. (2009) and Günnemann et al. (2010) is questionable. Furthermore, the method of Moser et al. (2009) does not eliminate redundancy, which usually occurs in the analysis of subspace projections due to the exponential search space containing highly similar clusters (cf. Moise and Sander 2008; Müller et al. 2009; Günnemann et al. 2009).

Our novel model is the first approach using a density-based notion of clusters for the combination of subspace clustering and dense subgraph mining. This allows arbitrarily shaped and arbitrarily sized clusters, hence leading to an unbiased definition of clusters. We remove redundant clusters induced by similar subspace projections, resulting in meaningful result sizes.

3 A density-based clustering model for combined data

In this section we introduce our density-based clustering model for the combined clustering of graph data and attribute data. The clusters in our model correspond to dense regions in the graph as well as in the attribute space. For simplicity, we first introduce in Sect. 3.1 the cluster model for the case that only a single subspace, e.g. the full-space, is considered. The extension to subspace clustering and the definition of a redundancy model to confine the final clustering to the most interesting subspace clusters is introduced in Sect. 3.2.

Formally, the input for our model is a vertex-labeled graph G = (V, E, l) with vertices V, edges E ⊆ V × V and a labeling function l : V → R^D, where Dim = {1, . . . , D} is the set of dimensions. We assume an undirected graph without self-loops, i.e. (v, u) ∈ E ⇔ (u, v) ∈ E and (u, u) ∉ E. Furthermore, we use x[i] to refer to the i-th component of a vector x ∈ R^D.

3.1 Cluster model for a single subspace

In this section we introduce our density-based cluster model for the case of a single subspace. The basic idea of density-based clustering is that clusters correspond to connected dense regions in the data space that are separated by sparse regions. Therefore,


Fig. 2 Dense region in the attribute space but sparse region in the graph

each clustered object x has to exceed a certain minimal density, i.e. its local neighborhood has to contain a sufficiently high number of objects that are also located in x's cluster. Furthermore, in order to form a cluster, a set of objects has to be connected w.r.t. their density, i.e. objects located in the same cluster have to be connected via a chain of objects from the cluster such that each object lies within the neighborhood of its predecessor. Consequently, one important aspect of our cluster model is the proper definition of a node's local neighborhood.

If we simply determined the neighborhood based on the attribute data, as done by Ester et al. (1996) and Hinneburg and Keim (1998), we could define the attribute neighborhood of a vertex v by its ε-neighborhood, i.e. the set of all objects whose distances to v do not exceed a threshold ε. Formally,

N^V_ε(v) = {u ∈ V | dist(l(u), l(v)) ≤ ε}

with an appropriate distance function like the maximum norm dist(x, y) = max_{i∈{1,...,D}} |x[i] − y[i]|. Just considering attribute data, however, leads to the problem illustrated in Fig. 2. Considering the attribute space, the red triangles form a dense region. (The edges of the graph and the parameter ε are depicted in the figure.) However, this group is not a meaningful cluster since in the graph this vertex set is not densely connected. Accordingly, we have to consider the graph data and attribute data simultaneously to determine the neighborhood.

Intuitively, taking the graph into account can be done by just using adjacent vertices for density computation. The resulting simple combined neighborhood of a vertex v would be the intersection

N^V_{ε,adj}(v) = N^V_ε(v) ∩ {u ∈ V | (u, v) ∈ E}

In Fig. 2 the red triangles would not be dense anymore because their simple combined neighborhoods are empty. However, using just the adjacent vertices leads to a too restrictive cluster model, as the next example in Fig. 3 shows.

Assuming that each vertex has to contain three objects in its neighborhood (including the object itself) to be dense, we get two densely connected sets, i.e. two clusters, in Fig. 3a. In Fig. 3b, we have the same vertex set, the same set of attribute vectors and the same graph density. The example only differs from the first one by the interchange of the attribute values of the vertices v3 and v4, which both belong to the same cluster in the first example. Intuitively, this set of vertices should also be a valid cluster in our definition. However, it is not, because the neighborhood of v2 contains just the vertices {v1, v2}.


Fig. 3 Robust cluster detection by using k-neighborhoods. Adjacent vertices (k = 1) are not always sufficient for correct detection (left: successful; right: fails). a successful with k ≥ 1, minPts = 3, b successful with k ≥ 2, minPts = 3

The vertex v4 is not considered since it is just similar w.r.t. the attributes but not adjacent.

The missing tolerance w.r.t. interchanges of the attribute values is one problem induced by using just adjacent vertices. Furthermore, this approach would not be tolerant w.r.t. small errors in the edge set. For example in social networks, some friendship links are not present in the current snapshot although the people are aware of each other. Such errors should not prevent a good cluster detection. Thus, in our approach we consider all vertices that are reachable over at most k edges to obtain a more error-tolerant model. Formally, the neighborhood w.r.t. the graph data is given by:

Definition 1 (Graph k-neighborhood) A vertex u is k-reachable from a vertex v (over a set of vertices V) if

∃ v1, . . . , vk ∈ V : v1 = v ∧ vk = u ∧ ∀i ∈ {1, . . . , k − 1} : (vi, vi+1) ∈ E

The graph k-neighborhood of a vertex v ∈ V is given by

N^V_k(v) = {u ∈ V | u is x-reachable from v (over V) ∧ x ≤ k} ∪ {v}

Please note that the object v itself is contained in its neighborhood N^V_k(v) as well.

Overall, the combined neighborhood of a vertex v ∈ V considering graph and attribute data can be formalized by intersecting v's graph k-neighborhood with its ε-neighborhood.

Definition 2 (Combined local neighborhood) The combined neighborhood of v ∈ V is:

N^V(v) = N^V_k(v) ∩ N^V_ε(v)

Using the combined neighborhood N^V(v) and k ≥ 2, we get in Fig. 3a and Fig. 3b the same two clusters. In both examples v2's neighborhood contains three vertices, for example, N^V(v2) = {v1, v2, v4} in Fig. 3b. So to speak, we "jump over" the vertex v3 to find further vertices that are similar to v2 in the attribute space.
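To make the two neighborhood notions concrete, the following sketch computes the graph k-neighborhood (Definition 1, following the prose reading of reachability over at most k edges) and the combined neighborhood (Definition 2). This is a minimal Python illustration with names of our own choosing, not pseudocode from the paper; `adj` maps each vertex to its set of neighbors, `labels` maps each vertex to its attribute vector, and `restrict` is the vertex set over which reachability is computed (V here, later a candidate cluster O).

```python
from collections import deque

def graph_k_neighborhood(adj, restrict, v, k):
    """Vertices of `restrict` reachable from v over at most k edges, using only
    intermediate vertices from `restrict`; v itself is always included."""
    reached = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, hops = frontier.popleft()
        if hops == k:
            continue
        for w in adj[u]:
            if w in restrict and w not in reached:
                reached.add(w)
                frontier.append((w, hops + 1))
    return reached

def eps_neighborhood(labels, restrict, v, eps):
    """Vertices of `restrict` whose attribute vectors deviate from v's vector
    by at most eps in every dimension (maximum norm)."""
    return {u for u in restrict
            if max(abs(a - b) for a, b in zip(labels[u], labels[v])) <= eps}

def combined_neighborhood(adj, labels, restrict, v, k, eps):
    """Combined local neighborhood (Definition 2): intersection of both notions."""
    return graph_k_neighborhood(adj, restrict, v, k) & eps_neighborhood(labels, restrict, v, eps)
```

For instance, with the vertex set of Fig. 3b as `restrict` and k = 2, v2's combined neighborhood would contain v1, v2 and v4, as discussed above.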

Local density calculation. As mentioned, the "jumping" principle is necessary to get meaningful clusters. However, it leads to a novel challenge not given in previous density-based clustering approaches. Considering Fig. 4, the vertices on the right hand side form a combined cluster and are clearly separated from the remaining vertices by their attribute values.


Fig. 4 Novel challenge by using local density computation and k-neighborhoods

However, on the left hand side we have two separate clusters, one consisting of the vertices depicted as dots and one consisting of the vertices depicted as triangles. If we only considered attribute data for clustering, these two clusters would be merged into one big cluster as the vertices are very similar w.r.t. their attribute values. If we just used adjacent vertices, this big cluster would not be valid as there are no edges between the dot vertices and the triangle vertices. However, since we allow "jumps" over vertices, the two clusters could be merged, e.g. for k = 2.

This problem arises because the dots and triangles are connected via the vertices on the right hand side, i.e. we "jump" over vertices that actually do not belong to the final cluster. Thus, a "jump" has to be restricted to the objects of the same cluster. In Fig. 3b, e.g., the vertices {v2, v3, v4} belong to the same cluster. Thus, reaching v4 over v3 to increase the density of v2 is meaningful. Formally, we have to restrict the k-reachability used in Definition 1 to the vertices contained in the cluster O. In this case, the vertices on the right hand side of Fig. 4 cannot be used for "jumping" and thus the dots are not in the triangles' neighborhoods and vice versa. Thus, we are able to separate the two clusters.

Overall, for computing the neighborhoods and hence the densities, only vertices v ∈ O of the same cluster O can be used. While previous clustering approaches calculate the densities w.r.t. the whole database (all objects in V), our model calculates the densities within the clusters (objects in O). Instead of calculating global densities, we determine local densities based on the cluster O. While this sounds nice in theory, it is difficult to solve in practice, as obviously the set of clustered objects O is not known a priori but has to be determined. The cluster O depends on the density values of the objects, while the density values depend on O. So we get a cyclic dependency of both properties.

In our theoretical clustering model, we can solve this cyclic dependency by assuming a set of clustered objects O as given. The algorithmic solution is presented in Sect. 4. Formally, given the set of clustered objects O, three properties have to be fulfilled: First, each vertex v ∈ O has to be dense w.r.t. its local neighborhood. Second, the spanned region of these objects has to be (locally) connected since otherwise we would have more than one cluster (cf. Fig. 4). Last, the set O has to be maximal w.r.t. the previous properties since otherwise some vertices of the cluster are lost. Overall,

Definition 3 (Density-based combined cluster) A combined cluster in a graph G = (V, E, l) w.r.t. the parameters k, ε and minPts is a set of vertices O ⊆ V that fulfills the following properties:


(1) high local density: ∀v ∈ O : |N^O(v)| ≥ minPts
(2) locally connected: ∀u, v ∈ O : ∃ w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : wi ∈ N^O(wi+1)
(3) maximality: ¬∃ O′ ⊃ O : O′ fulfills (1) and (2)

Please note that the neighborhood calculation N^O(v) is always done w.r.t. the set O and not w.r.t. the whole graph V. Based on this definition and by using minPts = 3, k ≥ 2, we can e.g. detect the three clusters from Fig. 4. Using too small a value for k (as for example k = 1 in Fig. 3b), clusters are often split or even not detected at all. On the other hand, if we choose k too high, we run the risk that the graph structure is not adequately considered anymore. However, for the case depicted in Fig. 4 our model detects the correct clusters even for arbitrarily large k values, as the cluster on the right hand side is clearly separated from the other clusters by its attribute values. The two clusters on the left hand side are never merged by our model as we do not allow jumps over vertices outside the cluster.
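Properties (1) and (2) of Definition 3 can be checked directly for a candidate vertex set O; the sketch below reuses the hypothetical combined_neighborhood helper from the earlier sketch and deliberately leaves out the maximality check (3), which would require comparing O against supersets. Again, this is only an illustration under our own naming, not the algorithm of Section 4.

```python
def satisfies_density_and_connectivity(adj, labels, O, k, eps, min_pts):
    """Check properties (1) and (2) of Definition 3 for a candidate cluster O.
    All neighborhoods are computed locally, i.e. w.r.t. O itself, not w.r.t. V."""
    O = set(O)
    if not O:
        return False
    nbh = {v: combined_neighborhood(adj, labels, O, v, k, eps) for v in O}
    # (1) high local density: every vertex of O has at least min_pts vertices of O
    #     (including itself) in its combined local neighborhood.
    if any(len(nbh[v]) < min_pts for v in O):
        return False
    # (2) locally connected: O forms a single component under the (symmetric)
    #     neighborhood relation.
    seen, stack = set(), [next(iter(O))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(u for u in nbh[v] if u not in seen)
    return seen == O
```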

In this section we introduced our combined cluster model for the case of a single subspace. As shown in the examples, the model can detect clusters that are dense in the graph as well as in the attribute space and that often cannot be detected by previous approaches.

3.2 Overall subspace clustering model

In this section we extend our cluster model to a subspace clustering model. Besides the adapted cluster definition we have to take care of redundancy problems due to the exponentially many subspace projections. As mentioned in the last section, we are using the maximum norm in the attribute space. If we just want to analyze subspace projections, we can simply define the maximum norm restricted to a subspace S ⊆ Dim as

dist_S(x, y) = max_{i∈S} |x[i] − y[i]|

In principle any L_p norm can be restricted in this way and can be used within our model. Based on this distance function, we can define a subspace cluster which fulfills the cluster properties just in a subset of the dimensions:

Definition 4 (Density-based combined subspace cluster) A combined subspace cluster C = (O, S) in a graph G = (V, E, l) consists of a set of vertices O ⊆ V and a set of relevant dimensions S ⊆ Dim such that O forms a combined cluster (cf. Definition 3) w.r.t. the local subspace neighborhood N^O_S(v) = N^O_k(v) ∩ N^O_{ε,S}(v) with N^O_{ε,S}(v) = {u ∈ O | dist_S(l(u), l(v)) ≤ ε}.
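The only change for subspaces is which dimensions enter the maximum norm. A small sketch of dist_S and the local subspace neighborhood N^O_S(v), reusing graph_k_neighborhood from the earlier sketch (the relevant dimensions S are given as a set of indices; names are ours):

```python
def dist_subspace(x, y, S):
    """dist_S(x, y): maximum norm restricted to the relevant dimensions S."""
    return max(abs(x[i] - y[i]) for i in S)

def subspace_neighborhood(adj, labels, O, v, k, eps, S):
    """Local subspace neighborhood N^O_S(v) of Definition 4: graph k-neighborhood
    within O intersected with the eps-neighborhood w.r.t. dist_S."""
    similar = {u for u in O if dist_subspace(labels[u], labels[v], S) <= eps}
    return graph_k_neighborhood(adj, O, v, k) & similar
```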

As we can show, our subspace clusters have the anti-monotonicity property: For a subspace cluster C = (O, S), for every S′ ⊆ S there exists a vertex set O′ ⊇ O such that (O′, S′) is a valid cluster. This property is used in our algorithm to find the valid clusters more efficiently.


Proof For every two subspaces S, S′ with S′ ⊆ S and every pair of vertices u, v it holds that dist_{S′}(l(u), l(v)) ≤ dist_S(l(u), l(v)). Thus for every vertex v ∈ O we get N^O_{S′}(v) ⊇ N^O_S(v). Accordingly, the properties (1) and (2) from Definition 3 are fulfilled by (O, S′). If (O, S′) is maximal w.r.t. these properties, then (O, S′) is a valid combined subspace cluster. Else, by definition there exists a vertex set O′ ⊃ O such that (O′, S′) is a valid subspace cluster. □

Redundancy removal. Because of the density-based subspace cluster model, there cannot be an overlap between clusters in the same subspace. However, a vertex can belong to several subspace clusters in different subspaces, thus clusters from different subspaces can overlap. Due to the exponential number of possible subspace projections, an overwhelming number of (very similar) subspace clusters may exist. Whereas allowing overlapping clusters in general makes sense in many applications, allowing too much overlap can lead to highly redundant information. Thus, for a meaningful interpretation of the result, removing redundant clusters is crucial. Instead of simply reporting the highest dimensional subspace clusters or the cluster with maximal size, we use a more sophisticated redundancy model to confine the final clustering to the most interesting clusters.

However, defining the interestingness of a subspace cluster is a non-trivial task since we have two important properties, "number of vertices" and "dimensionality". Obviously, optimizing both measures simultaneously is not possible as clusters with higher dimensionality usually consist of fewer vertices than their low-dimensional counterparts (cf. anti-monotonicity). Thus we have to realize a trade-off between the size and the dimensionality of a cluster. We define the interestingness of a single cluster as follows:

Definition 5 (Interestingness of a cluster) The interestingness of a combined subspace cluster C = (O, S) is computed as

Q(C) = |O| · |S|

For the selection of the final clustering based on this interestingness definition we use the redundancy model introduced by Günnemann et al. (2010). This model defines a binary redundancy relation between two clusters as follows:

Definition 6 (Redundancy between clusters) Given the redundancy parameters r_obj, r_dim ∈ [0, 1], the binary redundancy relation ≺_red is defined by:

For all combined clusters C = (O, S), C′ = (O′, S′):

C ≺_red C′ ⇔ Q(C) < Q(C′) ∧ |O ∩ O′| / |O| ≥ r_obj ∧ |S ∩ S′| / |S| ≥ r_dim

Using this relation, we consider a cluster C as redundant w.r.t. another cluster C′ if C's quality is lower and the overlap of the clusters' vertices and dimensions exceeds a certain threshold. It is crucial to require both overlaps for the redundancy. E.g. two clusters that contain similar vertices, but lie in totally different dimensions, represent different information and thus should both be considered in the output.
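In code, Definitions 5 and 6 become two small functions. The sketch below assumes a cluster is represented as a pair (vertex set, dimension set) and uses our own parameter names r_obj and r_dim:

```python
def quality(cluster):
    """Interestingness Q(C) = |O| * |S| (Definition 5)."""
    vertices, dims = cluster
    return len(vertices) * len(dims)

def is_redundant_to(c, other, r_obj, r_dim):
    """True iff c is redundant w.r.t. `other` (c ≺_red other, Definition 6):
    c has strictly lower quality and overlaps strongly with `other`
    in both its vertices and its dimensions."""
    (O, S), (O2, S2) = c, other
    return (quality(c) < quality(other)
            and len(O & O2) / len(O) >= r_obj
            and len(S & S2) / len(S) >= r_dim)
```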


Please note that the redundancy relation ≺_red is non-transitive, i.e. we cannot just discard every cluster that is redundant w.r.t. any other cluster. Therefore we have to select the final clustering such that it does not contain two clusters that are redundant w.r.t. each other. At the same time, the result set has to be maximal with this property, i.e. we cannot just leave out a cluster that is non-redundant. Overall, the final clustering has to fulfill the properties:

Definition 7 (Optimal density-based combined subspace clustering) Given the set of all combined clusters Clusters, the optimal combined clustering Result ⊆ Clusters fulfills

– redundancy-free property: ¬∃ Ci, Cj ∈ Result : Ci ≺_red Cj

– maximality property: ∀ Ci ∈ Clusters\Result : ∃ Cj ∈ Result : Ci ≺_red Cj

Our clustering model enables us to detect arbitrarily shaped subspace clusters based on attribute and graph densities without generating redundant results.
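One simple way to obtain a result set with the two properties of Definition 7 is a greedy pass over all clusters in descending quality order, keeping a cluster unless it is redundant w.r.t. an already kept one. This is only an illustrative selection strategy consistent with Definition 7, not necessarily the selection procedure used inside DB-CSC:

```python
def select_final_clustering(clusters, r_obj, r_dim):
    """Greedy selection of a redundancy-free and maximal result (cf. Definition 7).
    Since ≺_red only relates a cluster to strictly higher-quality clusters, processing
    clusters by descending quality keeps the selected set redundancy-free, and every
    discarded cluster is redundant w.r.t. some selected cluster (maximality)."""
    result = []
    for c in sorted(clusters, key=quality, reverse=True):
        if not any(is_redundant_to(c, kept, r_obj, r_dim) for kept in result):
            result.append(c)
    return result
```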

3.3 Discussion of the cluster model

Our cluster model can be adapted based on the three parameters ε, k, and minPts, whose effects we discuss in the following. We use the gained insights to develop an algorithm that resolves the cyclic dependency of our cluster definition in the next section.

Let us first consider the parameter ε: By increasing ε, the neighborhood sizes regarding the attribute domain increase. For the extremal value of ε → ∞, all objects are considered similar to each other, i.e. N^V_ε(v) = V. Even more precisely, we get N^O_ε(v) = O for all O ⊆ V. In this case, the combined neighborhood N^O(v) (cf. Definition 2) can be simplified to the graph k-neighborhood N^O_k(v) and, thus, in our overall cluster definition we just consider the graph data. Accordingly, the results for each subspace S are identical and our method just has to analyze the full-dimensional space.

Fixing this scenario and choosing k = 1, we can show that the clusters determined by our method correspond to the well-known notion of x-cores (Dorogovtsev et al. 2006; Janson and Luczak 2007).

Definition 8 (x-core) An x-core O in a graph G = (V, E, l) is a maximal connected subgraph O ⊆ V in which all vertices connect to at least x other vertices in O, i.e. ∀v ∈ O : deg_G(v, O) ≥ x, where deg_G(v, O) = |{u ∈ O | (v, u) ∈ E}|.
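x-cores can be extracted by repeatedly deleting vertices whose degree inside the remaining vertex set drops below x and then splitting the survivors into connected components; this mirrors the procedure referenced later in Section 4.1.2 for (minPts − 1)-core detection. A compact sketch for an undirected graph given as a dict of neighbor sets (our own representation choice):

```python
def x_cores(adj, x):
    """Return the x-cores of an undirected graph {vertex: set_of_neighbors}:
    maximal connected subgraphs in which every vertex has >= x neighbors inside."""
    alive = set(adj)
    changed = True
    while changed:                      # iteratively peel off low-degree vertices
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < x:
                alive.discard(v)
                changed = True
    cores, seen = [], set()
    for start in alive:                 # split the survivors into connected components
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend((adj[v] & alive) - component)
        seen |= component
        cores.append(component)
    return cores
```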

Using this definition, we show the following theorem:

Theorem 1 (Relation between clusters and cores) If ε → ∞ and k = 1, each combined subspace cluster C = (O, S) in the graph G = (V, E, l) (cf. Definitions 3 and 4) corresponds to a (minPts − 1)-core in G.

Proof Due to property (1) of Definition 3 we get: high local density ⇔ ∀v ∈ O : |N^O(v)| ≥ minPts ⇔_{k=1} ∀v ∈ O : |N^O_k(v)| ≥ minPts ⇔_{ε→∞} ∀v ∈ O : |{u ∈ O | (v, u) ∈ E} ∪ {v}| ≥ minPts ⇔ ∀v ∈ O : deg_G(v, O) ≥ minPts − 1


Due to property (2) of Definition 3 we get: locally connected ⇔_{ε→∞, k=1} ∀u, v ∈ O : ∃ w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : (wi, wi+1) ∈ E ⇔ O is connected in G.
Due to property (3) of Definition 3, C = (O, S) is a maximal subgraph fulfilling the above properties. Thus, C = (O, S) is a (minPts − 1)-core in G. □

This observation only holds for the simple case k = 1, but we will show in the next section how to extend this idea to solve the general setting with k > 1. Before this, however, we analyze the parameter k: Similar to ε, increasing k leads to larger neighborhood sizes, in this case w.r.t. the graph information. Assuming a connected graph and the extremal value of k → ∞, all vertices are located in the graph neighborhood, i.e. N^V_k(v) = V. However, we cannot infer that N^O_k(v) = O for all O ⊆ V. Due to the local density principle, we are only allowed to consider the vertices of O and we (implicitly) remove the vertices V\O, resulting in a potentially disconnected graph where we cannot reach all vertices, i.e. N^O_k(v) ≠ O might hold. As a consequence, for the scenario k → ∞ it is not possible to simplify the combined neighborhood. Thus, even for k → ∞ we exploit both sources: the graph data and the attribute data (in contrast to ε → ∞, where we only consider the graph data).

A special case in which we can simplify the combined neighborhood is assuming a complete input graph. Obviously, in this case, the graph structure is not (and need not be) considered by our model. Thus, any choice of k leads to the same result and we only detect clusters based on the attribute values. For this setting, our model is related to traditional density-based clustering methods such as DBSCAN (Ester et al. 1996) or its extension to subspace clustering, SUBCLU (Kailing et al. 2004). However, our model differs in two aspects: First, our model removes redundant clusters to prevent an overwhelming result size. Second, the methods of Ester et al. (1996) and Kailing et al. (2004) include so-called border objects. These objects correspond to objects that do not exceed the minimal density minPts but are located in the neighborhood of a dense object. To include border objects in our model, we can use a tuple (O, O_b) to distinguish between the different types of objects in a cluster, and we have to slightly modify Definition 3 to

(1) high local density or border object: ∀v ∈ O : |N^{O∪O_b}(v)| ≥ minPts and ∀v ∈ O_b : ∃u ∈ O : v ∈ N^{O∪O_b}(u)
(2) locally connected: as before
(3) maximality: ¬∃ (O′, O′_b) with (O′ ∪ O′_b) ⊃ (O ∪ O_b) such that (O′, O′_b) fulfills (1) and (2)

Please note that property (2) remains unaffected: one has to ensure the local connectivity of O, i.e. one cannot use border objects to connect the vertices from O. Otherwise, an object with low density could potentially merge two different dense regions into a single cluster (similar to Fig. 4). As a side effect of this modification, the resulting clusters in a single subspace are not necessarily disjoint anymore. Instead, a border object can belong to more than one cluster as it does not merge the two clusters.

Finally, we discuss the parameter minPts. While increasing the values of k or ε leads to larger clusters due to larger neighborhood sizes, the parameter minPts causes the inverse effect: the larger minPts, the smaller the clusters.


The minimal density threshold is harder to reach and is fulfilled by fewer objects. Consequently, the choice minPts → ∞ is not meaningful. For the case minPts ≤ 2 (and again k = 1) our model returns for each subspace S the connected components of the graph G′_S that results from removing every edge (of the original graph) connecting dissimilar nodes. Formally,

Theorem 2 (Relation between clusters and connected components) If minPts ≤ 2 and k = 1, any combined subspace cluster C = (O, S) in the graph G = (V, E, l) (cf. Definitions 3 and 4) corresponds to a connected component of the graph G′_S = (V, E′_S, l) with E′_S = {(v, u) ∈ E | dist_S(l(v), l(u)) ≤ ε}.

Proof Due to property (2) of Definition 3 we get: locally connected ⇔_{k=1} ∀u, v ∈ O : ∃ w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : (wi, wi+1) ∈ E′_S ⇔ O is connected in G′_S.
Since O is connected, each v ∈ O fulfills |N^O(v)| ≥ 2 ≥ minPts. Thus, property (1) of Definition 3 is directly fulfilled.
Due to property (3) of Definition 3, C = (O, S) is a maximal connected subgraph of G′_S, that is, one of its connected components. □

Note that Theorem 2 holds for arbitrary ε values, while Theorem 1 requires ε → ∞ but allows arbitrary minPts values. Combining both theorems, i.e. ε → ∞ and minPts ≤ 2, we obtain as a nice byproduct the well-known observation that any 1-core in a graph G corresponds to one of its connected components. In the next section we utilize the notion of cores to solve our general clustering model for arbitrary parameter settings and especially for the complex case of k ≥ 2.

4 The DB-CSC algorithm

In the following section we describe the density-based combined subspace clustering (DB-CSC) algorithm for detecting the optimal clustering result. In Sect. 4.1 we present the detection of our density-based clusters in a single subspace S and in Sect. 4.2 we introduce the overall processing scheme using different pruning techniques to enhance the efficiency.

4.1 Finding clusters in a single subspace

Before we present our method to detect the combined clusters in subspace S, we first introduce some basic definitions. Theoretically, each subset of objects O ⊆ V could be evaluated as to whether it forms a valid cluster in the subspace S, and selecting some of these sets results in a potential clustering C. We denote with A the set of all possible groups in subspace S, i.e. A = {(O, S) | O ⊆ V}, and with 𝒜 the set of all possible groupings in subspace S, i.e. 𝒜 = P(A). Obviously, one of these groupings R ∈ 𝒜 has to correspond to the (so far unknown) clustering solution in the subspace S. Given the set 𝒜, we first define the following binary relation to compare different groupings:


Definition 9 (Strict partial order of groupings) Let A = {(O, S) | O ⊆ V} be the set of all possible groups in subspace S and 𝒜 = P(A) be the set of all possible groupings in subspace S. The strict partial order <^𝒜_S over the set of all possible groupings 𝒜 in subspace S is defined as

∀ C1, C2 ∈ 𝒜 : C1 <^𝒜_S C2 ⇔ (Cov(C1) ⊂ Cov(C2)) ∨ (Cov(C1) = Cov(C2) ∧ |C1| > |C2|)

with Cov(C) = ⋃_{Ci=(Oi,S) ∈ C} Oi

Intuitively, the grouping C2 is "better" than the grouping C1 (C1 <^𝒜_S C2) if the object groups contained in C2 cover a larger amount of objects, or if the coverage is identical but C2 needs fewer sets to obtain such a coverage. In the following, we use this partial order to prove our main theorem for detecting the clustering solution.

4.1.1 Clustering by fixed point iteration

Given the set of all possible groupings 𝒜, we are interested in finding the correct element R ∈ 𝒜 that corresponds to the clustering solution in subspace S, i.e. we want to detect all and only the object sets O ⊆ V fulfilling Definitions 3 and 4. To achieve this aim, we transfer the observations of Theorem 1 to the general case of arbitrary ε values and k > 1.

To make Theorem 1 applicable, we first introduce a graph transformation that represents attribute and graph information simultaneously.

Definition 10 (Enriched subgraph) Given a set of vertices O ⊆ V, a subspace S, and the original graph G = (V, E, l), the enriched subgraph G^O_S = (V′, E′) is defined by V′ = O and E′ = {(u, v) | v ∈ N^O_S(u) ∧ v ≠ u} using the distance function dist_S.

Please note that the enriched subgraph is calculated based on a given vertex set O, which is not necessarily the whole vertex set V. Two vertices are adjacent in the enriched subgraph iff their attribute values are similar in S and the vertices are connected by at most k edges in the original graph (using just vertices from O). In Fig. 5b the enriched subgraph for the whole set of vertices V is computed, while Fig. 5c just considers the subsets O1 and O2, respectively.
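A straightforward way to materialize the enriched subgraph G^O_S is to run the depth-bounded reachability search within O from every vertex and to keep only the attribute-similar pairs. The sketch reuses the hypothetical graph_k_neighborhood and dist_subspace helpers from the earlier sketches:

```python
def enriched_subgraph(adj, labels, O, S, k, eps):
    """Build G^O_S (Definition 10) as {vertex: set_of_neighbors}: vertices O, and an
    edge between u and v iff v lies in the local subspace neighborhood N^O_S(u), v != u."""
    O = set(O)
    enriched = {v: set() for v in O}
    for u in O:
        for v in graph_k_neighborhood(adj, O, u, k):
            if v != u and dist_subspace(labels[u], labels[v], S) <= eps:
                enriched[u].add(v)
                enriched[v].add(u)   # the neighborhood relation is symmetric
    return enriched
```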

Fig. 5 Finding clusters by (minPts − 1)-cores in the enriched graphs (k = 2, minPts = 3): a original graph, b enriched graph G^V_S, c G^{Oi}_S for subsets Oi


Since in the enriched subgraph the neighborhood information based on the attribute domain is directly encoded in the graph structure (using the original value of ε), we can select ε → ∞ in the transformed graph. Similarly, the (original) graph k-neighborhood is encoded by adjacent vertices in the enriched subgraph; we can select k = 1. Thus, Theorem 1 is applicable on the enriched subgraph, i.e. in this graph we can concentrate on the detection of (minPts − 1)-cores for our cluster detection. However, the detected cores do not exactly represent our clusters!

While constructing the enriched subgraph G^O_S, we consider the neighborhoods based on all vertices of O. However, due to our local density calculation, we are only allowed to use the clustered objects O′ for the neighborhood calculation, which are potentially a subset of the objects O. Thus, as illustrated in Fig. 5b, the detected cores can still merge several clusters. However, the following does hold: Each cluster O′ is a subset of some (minPts − 1)-core in G^V_S. More precisely, we can prove this property for each enriched subgraph G^O_S with O ⊇ O′.

Theorem 3 (Subset property w.r.t. cores) Let C = (O′, S) be a combined subspace cluster, O ⊇ O′, and G^O_S the enriched subgraph w.r.t. O. There exists a (minPts − 1)-core X ⊆ O in G^O_S with O′ ⊆ X.

Proof (a) Since O ⊇ O′, we get N^O_S(u) ⊇ N^{O′}_S(u) for all u ∈ V and thus G^O_S contains all edges of G^{O′}_S.
(b) Due to property (1) of Definition 3 for the cluster C it holds: high local density ⇔ ∀v ∈ O′ : |N^{O′}(v)| ≥ minPts ⇔ ∀v ∈ O′ : |N^{O′}(v)\{v}| ≥ minPts − 1 ⇔ ∀v ∈ O′ : deg_{G^{O′}_S}(v, O′) ≥ minPts − 1 ⇒ (by (a)) ∀v ∈ O′ : deg_{G^O_S}(v, O′) ≥ minPts − 1 ⇒ there exists a maximal set X ⊇ O′ with ∀v ∈ X : deg_{G^O_S}(v, X) ≥ minPts − 1.
(c) Due to property (2) of Definition 3 for the cluster C it holds: locally connected ⇔ ∀u, v ∈ O′ : ∃ w1, . . . , wl ∈ O′ : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : wi ∈ N^{O′}_S(wi+1) ⇔ ∀u, v ∈ O′ : ∃ w1, . . . , wl ∈ O′ : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : (wi, wi+1) are adjacent in G^{O′}_S ⇔ G^{O′}_S is connected ⇒ (by (a)) O′ belongs to a single connected component of G^O_S.
(d) Based on (b) and (c): there exists a connected maximal set X ⊇ O′ with ∀v ∈ X : deg_{G^O_S}(v, X) ≥ minPts − 1 ⇒ X is a (minPts − 1)-core in G^O_S with O′ ⊆ X. □

Theorem 3 implies that our algorithm only has to analyze vertex sets that potentially lead to (minPts − 1)-cores. If we were able to guess the cluster O′, then the detected core in G^{O′}_S would exactly correspond to our cluster, since in this case O = O′ and thus O′ ⊆ X ⊆ O′. Based on these observations, we show:

Theorem 4 (Fixed point solution) Let <^𝒜_S be the strict partial order over the set of all possible groupings 𝒜 in subspace S.
(1) The set of all combined clusters in subspace S corresponds to the largest fixed point of the function f_S : 𝒜 → 𝒜 with

f_S(C) = {(O, S) | O is a (minPts − 1)-core in G^{Oi}_S with Ci = (Oi, S) ∈ C}

(2) The sequence f^∞_S({(V, S)}) converges to this fixed point.


Proof Let R ∈ 𝒜 be the set of all combined clusters in subspace S.
(a) R is a fixed point of f_S. Due to Theorem 3 there exists for each cluster Ci ∈ R a (minPts − 1)-core X ⊆ Oi in G^{Oi}_S with Oi ⊆ X. Thus, X = Oi has to hold. Since all vertices of G^{Oi}_S are covered by X, there cannot exist further cores in G^{Oi}_S. Thus, f_S({Ci}) = {Ci} and overall f_S(R) = R.
(b) R is the largest fixed point. Assume R is not the largest fixed point, i.e. there exists C ∈ 𝒜 with f_S(C) = C and R <^𝒜_S C. Thus, due to Definition 9 either Cov(C) ⊃ Cov(R) or Cov(C) = Cov(R) with |C| < |R| has to hold. The first case cannot hold since R contains all maximal clusters (property (3), Definition 3), i.e. the coverage of R is maximal, and thus Cov(C) ⊆ Cov(R). Thus, the second case with |C| < |R| has to hold. To realize this, at least two clusters Ci, Cj ∈ R have to be (partially) merged into a single group D ∈ C. In this case, however, D contains at least two (minPts − 1)-cores, i.e. |f_S({D})| ≥ 2 and especially f_S({D}) ≠ {D}. Thus, {D} and therefore C are not fixed points of f_S. This is a contradiction to the assumption. Hence, R is the largest fixed point.
(c) It holds f^∞_S({(V, S)}) = R. First, we do not miss (or split up) any cluster C ∈ R by applying the sequence f^∞_S. Due to Theorem 3 each cluster (O′, S) ∈ R is contained in a (minPts − 1)-core of the graph G^O_S with O ⊇ O′. In particular, it has to be contained within a core of the graph G^V_S. Since in each iteration f_S(C_i) = C_{i+1} the whole set of cores is returned, there exists due to Theorem 3 one core (O, S) ∈ C_{i+1} with O ⊇ O′. Furthermore, since cores are disjoint due to their maximality, the result of f^∞_S({(V, S)}) cannot contain subsets of clusters. Second, the result of f^∞_S({(V, S)}) only contains the clusters of R. Assume a group D ∈ f^∞_S({(V, S)}) which is not a valid cluster (and not one of its subsets). Based on statement (1) of Theorem 4 we know that {D} cannot be a fixed point. Thus, there exists a k such that D ∉ f^m_S({(V, S)}) for all m > k. This is a contradiction to the assumption. Overall, f^∞_S({(V, S)}) = R. □

Overall, statement (2) of Theorem 4 provides us with a solution to our clustering problem. The clusters in a single subspace S can be detected by a fixed point iteration. Thus, we are able to resolve the cyclic dependency. Algorithmically, we start by extracting from G^V_S all (minPts − 1)-cores, since only these sets could lead to valid clusters. In Fig. 5b these sets are highlighted. However, keep in mind that these (minPts − 1)-cores do not necessarily correspond to valid clusters. Theorem 4 requires a fixed point. In Fig. 5b for example, we will get two cores but, as already discussed, the left set O1 is not a valid cluster but has to be split up. Thus, we recursively have to apply the function until the fixed point is reached. Intuitively, in each iteration the clusters are refined until the final clusters are obtained.
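Combining the enriched subgraph with core extraction yields a compact sketch of this fixed point iteration: start with the single group V, replace every current group by the (minPts − 1)-cores of its enriched subgraph, and stop once nothing changes. The helper names come from the earlier sketches and are ours, not the paper's pseudocode; the optional `seeds` argument anticipates Section 4.2, where the iteration is started from the clusters of a lower-dimensional subspace instead of V:

```python
def clusters_in_subspace(adj, labels, S, k, eps, min_pts, seeds=None):
    """Fixed point iteration of Theorem 4. Returns the vertex sets of all combined
    clusters in subspace S. By default the iteration starts with the whole set V."""
    current = [set(adj)] if seeds is None else [set(s) for s in seeds]
    while True:
        refined = []
        for O in current:
            sub = enriched_subgraph(adj, labels, O, S, k, eps)
            refined.extend(x_cores(sub, min_pts - 1))
        if sorted(map(sorted, refined)) == sorted(map(sorted, current)):
            return refined      # fixed point reached: each group is a valid cluster
        current = refined
```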

In Fig. 6 we depict a schematic representation of the vertex sets to be analyzed in our method. Similar to a dendrogram, we start in the root with the whole set of vertices V. By applying the function f_S, smaller cores are detected. These cores are depicted as child nodes with solid lines in the dendrogram. Vertices that do not belong to any core are depicted with dashed lines. By recursively applying f_S the cores are refined until the final clusters are obtained (leaf nodes with solid lines in the tree). Thus, in the example, four different clusters Ci with their corresponding object sets Oi are detected.


Fig. 6 Illustration of the fixed point iteration method

In summary, detecting density-based clusters in combined data sources is far more complex than in traditional vector data. While existing clustering approaches can simply determine the density values for each object independently (Ester et al. 1996; Kailing et al. 2004), in our method such an independent computation is not possible due to the local density calculation. Please note that the toy example in Fig. 5a might lead to the idea that we can simply remove all edges connecting dissimilar nodes and then perform a simple clustering on the residual graph. This a priori rejection of edges, however, is not possible. As discussed, Fig. 3b contains two clusters which, however, would not be detected if we removed the edge between v2 and v3. Indeed, this edge belongs to the cluster even if the two vertices are not directly similar. Overall, our fixed point iteration provides us with a solution to detect all combined clusters in the subspace S.

4.1.2 Convergence and complexity

In this section we analyze the convergence and complexity of the fixed point iteration method. Important for practical applications is the convergence of the sequence f^∞_S({(V, S)}) after a finite number of iterations. It holds:

Theorem 5 (Convergence) Let R be the set of all combined clusters in subspace S. The fixed point iteration started with {(V, S)} converges after at most |R| + |V\Cov(R)| iterations.

Proof Using the representation of our method as illustrated in Fig. 6 or 7, this upper bound is easy to show. Obviously, the height of such a tree corresponds to the number of required iterations. For the worst case, we have to determine the maximal height among all possible trees (valid for the given clustering scenario): First, the number of leaf nodes in the tree is given by |R| + |V\Cov(R)|. Since each valid cluster corresponds to a solid leaf node in the tree representation, each inner node of the tree is the union of the object sets of its child nodes, and the root has to represent all objects V, we must have |V\Cov(R)| dashed leaf nodes. Second, the maximal height of a tree with |R| + |V\Cov(R)| leaf nodes is achieved by a binary tree as illustrated in Fig. 7. This corresponds to the bound given in Theorem 5. □

To determine the overall complexity of the fixed point iteration we furthermore need to bound the size of the cores to be analyzed. Since an inner node O in the tree represents the union of all its child nodes, we can distinguish two cases:


Fig. 7 Worst case scenario

First, O contains objects that do not belong to any valid cluster, i.e. O is an inner node whose subtree contains leaves with dashed lines. In this case, f_S({(O, S)}) must lead to cores that are subsets of O, and for each (O′, S) ∈ f_S({(O, S)}) we have |O′| < |O|. Accordingly, the size of the largest child node of O in the tree is bounded by |O| − 1. Second, O contains only objects that belong to valid clusters. That is, the inner node O is exactly the union of several valid clusters {C1, . . . , Cn} = R′ ⊆ R. In this case the largest core of f_S({(O, S)}) can be bounded by |O| − min_{(Oi,S)∈R′} |Oi|, since f_S({(O, S)}) cannot be a fixed point, the resulting cores must be supersets of the true clusters, and the cores need to be disjoint. Intuitively, this happens if the smallest cluster is split from O, while the remaining clusters are still merged into a single core. Based on these upper bounds for the largest core we can show the following theorem:

Theorem 6 (Complexity) Given the graph G = (V, E, l), the set R of all combined clusters in subspace S can be computed in time

O(|V\Cov(R)| · |V|^3 + |R|^4 · |Omax|^3)

with (Omax, S) = arg max_{(O,S)∈R} |O|.

Proof We first show that, given a set of vertices O ⊆ V, the cores of the enriched subgraph G^O_S can be computed in cubic time: The graph G^O_S can be constructed by a restricted breadth-first search in the original graph, started at each node v ∈ O. Thus, in the worst case it requires time O(|O|^3). Detecting cores can simply be done by first recursively removing nodes with a node degree less than minPts − 1 from G^O_S and then returning the connected components (Dorogovtsev et al. 2006; Janson and Luczak 2007). The first step requires time O(|O|^2), since the removal of one node leads to an update of the degrees of the adjacent nodes. Determining the connected components also requires time O(|O|^2). Overall, the worst case runtime is dominated by constructing the enriched subgraph, i.e. time O(|O|^3).

Now, we prove the overall complexity. First, as in the proof of Theorem 5, the worst case complexity occurs for a binary tree because each non-binary tree can be transformed into a binary one with larger runtime:

123

Page 19: Finding density-based subspace clusters in graphs with ...cloud.politala.ac.id/politala/Jurnal/JurnalTI... · Finding density-based subspace clusters in graphs ... clustering with

Finding density-based subspace clusters in graphs 261

transformed to a binary one with larger runtime: W.l.o.g. we assume an inner nodeO with three child nodes O1, O2, O3 is given. By replacing O’s child nodes with O1and X = O2 ∪ O3 as well as appending O2, O3 as child nodes to X , we get a binarytree representing the same clustering structure as before. Obviously, determining theclustering in this way leads to a higher runtime since the additional inner node X hasto be analyzed. Second, the worst case occurs if the cardinality of the object sets repre-sented by inner nodes of the tree is as large as possible. Obviously, this happens if in thefirst |V \Cov(R)| iterations, only the non-clustered vertices are removed. That is, thefirst case of the above discussion applies and in iteration i = 0, . . . , |V \Cov(R)| − 1we are just able to bound the size of the core to be analyzed by |V | − i . After theseiterations, all non-clustered vertices are removed and the second case can be used: Inthe remaining |R| iterations, we split up in the worst case only the smallest of theclusters from the core. Let x1, . . . , x|R| be the cluster sizes in increasing order, then

we have to analyze in iteration j a core of at most size∑|R|

n= j xn . Since each analysisstep requires at most cubic time, we get

∑|V \Cov(R)|−1i=0 (|V | − i)3 + ∑|R|

j=1

(∑|R|n= j xn

)3

≤ ∑|V \Cov(R)|−1i=0 |V |3 + ∑|R|

j=1

((|R| − j + 1) · x|R|

)3

≤ |V \Cov(R)| · |V |3 + ∑|R|j=1

(|R|3 · |Omax|3)

≤ |V \Cov(R)| · |V |3 + |R|4 · |Omax|3 ��Please note that our worst case analysis assumes that in each iteration the enriched

subgraph is computed from scratch, which results in the cubic complexity. In practice,however, we can extremely speed up the computation by exploiting the property thatfor each O ′ ⊆ O the edges of G O ′

S need to be a subset of the edges of G OS . Further-

more, we assume that for each subspace S we start with the whole vertex set V . Inthe following section, we show that the fixed point iteration can be initialized withfar smaller object sets if we analyze multiple subspaces. In any case, however, theclustering result in a single subspace can be computed in polynomial time w.r.t. theinput graph.

4.2 Finding clusters in different subspaces

This section describes how we efficiently determine the clusters located in different subspaces. In principle, our algorithm has to analyze each subspace. We enumerate these subspaces by a depth first traversal through the subspace lattice. To avoid enumerating the same subspace several times we assume an order d_1, d_2, . . . , d_D on the dimensions. We denote the dimension with the highest index in subspace S by max{S} and extend the subspace only by dimensions that are ordered behind max{S}. This principle has several advantages:

By using a depth first search, the subspace S is analyzed before the subspace S′ = S ∪ {d} for d > max{S}. Based on the anti-monotonicity we know that each cluster in S′ has to be a subset of a cluster in S. Thus, in subspace S′ we do not have to start with the enriched subgraph G_{S′}^V but it is sufficient to start with the vertex sets of the known clusters, i.e. if the clusters of subspace S are given by Clus = {C_1, . . . , C_m}, we will determine the fixed point f_{S′}^∞({(O_i, S′)}) for each cluster C_i = (O_i, S) ∈ Clus. This is far more efficient since the vertex sets are smaller, i.e. Theorems 5 and 6 are now based on the set O_i instead of V. In Fig. 8 the depth first traversal based on the clusters of the previous subspace is shown in lines 21, 23 and 29.

Fig. 8 The DB-CSC algorithm
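The following sketch illustrates this traversal scheme. It assumes a function cluster_in_subspace(vertices, subspace) implementing the fixed point iteration of the previous section and returning the detected vertex sets; the function names and the omission of the redundancy handling are simplifications for illustration, not the pseudocode of Fig. 8.

```python
def enumerate_subspaces(all_vertices, num_dims, cluster_in_subspace):
    """Depth-first traversal of the subspace lattice (dimensions 0..num_dims-1).

    A subspace S is only extended by dimensions d > max(S), so no subspace is
    visited twice.  In the extended subspace the fixed point iteration is
    seeded with the clusters of S (anti-monotonicity) instead of all vertices."""
    results = []

    def traverse(subspace, seed_sets):
        clusters = []
        for vertices in seed_sets:
            clusters.extend(cluster_in_subspace(vertices, subspace))
        results.extend((frozenset(c), tuple(subspace)) for c in clusters)
        if not clusters:
            return
        for d in range(subspace[-1] + 1, num_dims):   # extend only behind max(S)
            traverse(subspace + [d], clusters)        # seed with parent clusters

    for d in range(num_dims):                         # all 1-dimensional subspaces
        traverse([d], [all_vertices])
    return results
```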

The actual detection of clusters based on the vertices O is realized in lines 13–19, which corresponds to the fixed point iteration described in the previous section. In our algorithm, we use the following observation to calculate the fixed point: If C is the largest fixed point of the function fS, then obviously any individual cluster C_i ∈ C is also a fixed point, i.e. fS({C_i}) = {C_i}. Thus, if a graph G_S^O contains a single (minPts − 1)-core O_1 with O_1 = O, we have reached a fixed point for this individual cluster. In this case, we can add the cluster to the set of detected clusters (line 18). In the other cases, we recursively have to repeat the procedure for each (minPts − 1)-core {O_1, . . . , O_m} detected in G_S^O (line 19). However, instead of applying fS to the whole set of cores simultaneously, we just have to apply it for each core individually (line 15), until all clusters are detected.
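A compact sketch of this per-core computation is given below. It reuses the find_cores routine sketched earlier; build_enriched_subgraph stands in for the construction of G_S^O via restricted breadth-first search and is an assumed helper, not an interface defined in the paper.

```python
def fixed_point(vertices, subspace, build_enriched_subgraph, min_pts):
    """Recursively apply f_S to each core individually until every core is
    a fixed point, i.e. a density-based combined cluster in `subspace`."""
    clusters = []
    adj = build_enriched_subgraph(vertices, subspace)   # enriched subgraph G_S^O
    cores = find_cores(adj, min_pts)                    # (min_pts - 1)-cores
    for core in cores:
        if core == set(vertices):
            clusters.append(core)                       # single core equal to O: fixed point
        else:
            clusters.extend(fixed_point(core, subspace,
                                        build_enriched_subgraph, min_pts))
    return clusters
```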

Using a depth first search enables us to store a set of parent clusters (beforehand detected in lower dimensional subspaces) that a new cluster is based on (cf. line 22). Furthermore, given a set of vertices O in the subspace S we know that by traversing the current subtree only clusters of the kind C_reach = (O_reach, S_reach) with O_reach ⊆ O and S ⊆ S_reach ⊆ S ∪ {max{S}+1, . . . , D} can be detected. This information together with the redundancy model allows a further speed-up of the algorithm. The overall aim is to stop the traversal of a subtree if each of the reachable (potential) clusters is redundant to one parent cluster, i.e. if there exists C ∈ Parents such that C_reach ≺_red C for each C_reach. Traversing such a subtree is not worthwhile since the contained clusters are probably excluded from the result later on due to their redundancy.

Redundancy of a subtree occurs if the three properties introduced in Definition 6 hold. The second property (object overlap) is always fulfilled since each O_reach is a subset of any cluster from Parents (cf. anti-monotonicity). The maximal possible quality of the clusters C_reach can be estimated by Q_max = |O| · |S ∪ {max{S}+1, . . . , D}|. By focusing on the clusters C_p = (O_p, S_p) ∈ Parents with Q_max < Q(C_p) we ensure the first redundancy property. The third property (dimension overlap) is ensured if |S_p| ≥ |S ∪ {max{S}+1, . . . , D}| · r_dim holds. In this case we get for each C_reach: |S_p| ≥ |S_reach| · r_dim ⇔ |S_p ∩ S_reach| ≥ |S_reach| · r_dim ⇔ |S_p ∩ S_reach| / |S_reach| ≥ r_dim. Those parent clusters fulfilling all three properties are stored within Parents_red (line 24).

If Parents_red is empty, we have to traverse the subtree (else case, line 28). If it is not empty (line 25), the current subtree is redundant to at least one parent cluster. We currently stop traversing this subtree. However, we must not directly prune the subtree ST: if the clusters from Parents_red themselves are not included in the result, clusters from the subtree would become interesting again. Thus, we do not finally reject the subtree ST but we store the required information and add the subtree to a priority queue.
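A hedged sketch of this pruning test is shown below. Here parents is assumed to be a list of (vertex set, subspace, quality) triples for the already detected parent clusters, dimensions are 0-indexed, and the quality function and r_dim follow the redundancy model of Definition 6, whose exact form is not repeated here.

```python
def redundant_parents(O, S, D, parents, r_dim):
    """Return the parent clusters to which every cluster reachable in the
    current subtree is necessarily redundant (properties 1 and 3 of Def. 6;
    the object overlap, property 2, holds automatically since each reachable
    cluster is a subset of O)."""
    reachable_dims = len(S | set(range(max(S) + 1, D)))  # |S ∪ {max(S)+1, …, D}|
    q_max = len(O) * reachable_dims                       # bound on reachable quality
    return [(O_p, S_p, q_p)
            for (O_p, S_p, q_p) in parents
            if q_max < q_p                                # property 1: quality
            and len(S_p) >= reachable_dims * r_dim]       # property 3: dimension overlap
```

If the returned list is non-empty, the subtree is not traversed immediately but deferred, as described next.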

Processing this queue is the core of the DB-CSC algorithm. The priority queue contains clusters (lines 6, 20) and non-traversed subtrees (lines 9, 27). We successively take the object with the highest (estimated) quality from the queue. If it is a cluster that is non-redundant to the current result, we add it to the result (lines 7–8). If it is a subtree, we check if some cluster from Parents_red is already included in the result: if so, we finally reject this subtree (line 10); otherwise, we have to restart traversing this subtree (line 11).
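A sketch of this main loop follows. The entries are assumed to carry an (estimated) quality and to be either finished clusters or deferred subtrees; is_redundant and traverse_subtree stand in for the redundancy check against the current result and for restarting the depth first traversal, and are not spelled out here.

```python
import heapq
from itertools import count

def process_queue(entries, is_redundant, traverse_subtree):
    """entries: iterable of (quality, kind, payload), kind in {"cluster", "subtree"}.
    Pop by descending (estimated) quality: add non-redundant clusters to the
    result, re-expand or finally reject deferred subtrees."""
    tie = count()                       # tie-breaker avoids comparing payloads
    heap = [(-q, next(tie), kind, payload) for q, kind, payload in entries]
    heapq.heapify(heap)
    result = []
    while heap:
        _, _, kind, payload = heapq.heappop(heap)
        if kind == "cluster":
            if not is_redundant(payload, result):
                result.append(payload)
        else:                           # deferred subtree
            subtree, parents_red = payload
            if any(p in result for p in parents_red):
                continue                # finally reject the subtree
            for q, k, p in traverse_subtree(subtree):   # restart traversal; it may
                heapq.heappush(heap, (-q, next(tie), k, p))  # yield clusters/subtrees
    return result
```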

Overall, our algorithm efficiently determines the optimal clustering solution because only small vertex sets are analyzed for clusters and whole subtrees (i.e. sets of clusters) are pruned using the redundancy model.

5 Experimental evaluation

5.1 Setup

We compare DB-CSC to GAMer (Günnemann et al. 2010) and CoPaM (Moser et al. 2009), two approaches that combine subspace clustering and dense subgraph mining. Furthermore, as straightforward solutions to our clustering problem, we implemented three versions of (traditional) density-based clustering (Ester et al. 1996) using different distance functions. In our experiments we use real world data sets as well as synthetic data. By default, the synthetic datasets have 20 attribute dimensions and contain 80 combined clusters, each with 15 vertices and five relevant dimensions. Additionally we add random vertices and edges to represent noise in the data. The clustering quality is measured by the F1 measure (cf. Günnemann et al. 2011), which compares the detected clusters to the "hidden" clusters. The efficiency is measured by the algorithms' runtime.
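The sketch below shows one common way to compute such an F1 value by matching every hidden cluster to its best counterpart among the detected vertex sets; the measure of Günnemann et al. (2011) additionally takes the relevant subspaces into account, so this is only an illustrative approximation.

```python
def f1_score(hidden, detected):
    """Average, over the hidden clusters, of the best F1 value achieved by
    any detected cluster (vertex sets only; subspaces are ignored here)."""
    if not hidden:
        return 0.0
    total = 0.0
    for h in hidden:
        best = 0.0
        for d in detected:
            overlap = len(h & d)
            if overlap == 0:
                continue
            precision = overlap / len(d)
            recall = overlap / len(h)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += best
    return total / len(hidden)
```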

5.2 Comparison with traditional density-based clustering

One possible straightforward solution to our clustering problem could be to use a traditional density-based clustering approach like DBSCAN (Ester et al. 1996) in combination with an appropriate similarity/distance measure. Figure 9 illustrates that DB-CSC clearly outperforms such simple solutions. In this experiment we use DBSCAN with three different distance functions: (a) We combine the shortest path distance with the Euclidean distance on the attribute vectors, thus exploiting the graph and the vector data simultaneously. The normalization and combination of the two distance values to a single measure is done as proposed in Hanisch et al. (2002). (b) We use the Euclidean distance only. (c) We use the shortest path distance only. The latter two distance functions are also normalized according to Hanisch et al. (2002) to realize comparable absolute distance values. In Fig. 9 we vary the ε parameter while using the same minPts value for all approaches. As shown, DB-CSC achieves high clustering quality for a broad range of parameter values. The competing methods, however, obtain only very low quality and are sensitive w.r.t. the selected parameter value. Using traditional density-based clustering fails for the considered data for two reasons: First, by modeling the network information in a distance function one cannot ensure the connectivity of the resulting clusters. Second, this issue also tends to detect merged clusters as illustrated in Fig. 4. In contrast, DB-CSC avoids these problems by explicitly operating on the graph and by using a local density computation. Since none of these simple solutions is able to detect the clustering structure in the data, in the following experiments we compare DB-CSC to GAMer and CoPaM only.

Fig. 9 DB-CSC and DBSCAN variants
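As an illustration of variant (a), a combined measure can be formed as a convex combination of a normalized shortest path distance and a normalized Euclidean attribute distance; the concrete normalization of Hanisch et al. (2002) differs in its details, so the simple scaling and weighting below are assumptions made only for this sketch.

```python
import math

def combined_distance(u, v, attrs, sp_dist, max_sp, max_attr, alpha=0.5):
    """Convex combination of normalized graph distance and attribute distance.

    sp_dist(u, v): shortest path (hop) distance in the graph,
    attrs[v]: attribute vector of vertex v,
    max_sp, max_attr: normalization constants (e.g. largest observed values),
    alpha: weighting of the graph part (a free parameter in this sketch)."""
    d_graph = min(sp_dist(u, v), max_sp) / max_sp
    d_attr = math.dist(attrs[u], attrs[v]) / max_attr
    return alpha * d_graph + (1 - alpha) * d_attr
```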

5.3 Varying characteristics of the data

In the next experiment (Fig. 10) we vary the database size of our synthetic datasets by varying the number of generated combined clusters. The runtime of all algorithms increases with increasing database size (please note the logarithmic scale on both axes). For the datasets with more than 7000 vertices, CoPaM is not applicable any more due to heap overflows (the heap size was set to 4 GB). While the runtimes of the different algorithms are very similar, in terms of clustering quality DB-CSC obtains significantly better results than the other approaches. The competing approaches tend to output only subsets of the hidden clusters due to their restrictive cluster models.

Fig. 10 Quality (left) and Runtime (right) w.r.t. varying database size

Fig. 11 Quality (left) and Runtime (right) w.r.t. varying dimensionality

In the next experiment (Fig. 11) we vary the dimensionality of the hidden clusters. The runtime of all algorithms increases for higher dimensional clusters. The clustering qualities of DB-CSC and CoPaM slightly decrease. This can be explained by the fact that for high dimensional clusters it is likely that additional clusters occur in subsets of the dimensions. However, DB-CSC still has the best clustering quality and runtime in this experiment.

In Fig. 12 the cluster size (i.e. the number of vertices per cluster) is varied. The runtimes of DB-CSC and GAMer are very similar to each other, whereas the runtime of CoPaM increases dramatically until it is not applicable any more. The clustering quality of DB-CSC remains relatively stable while the qualities of the other approaches decrease constantly. For increasing cluster sizes the expansion of the clusters in the graph as well as in the attribute space increases, thus the restrictive cluster models of GAMer and CoPaM can only detect subsets of them.


Fig. 12 Quality (left) and Runtime (right) w.r.t. varying cluster size

5.4 Robustness

In Fig. 13a we analyze the robustness of the methods w.r.t. the number of "noise" vertices in the datasets. The clustering quality of all approaches decreases for noisy data; however, the quality of DB-CSC is still reasonably high even for 1000 noise vertices (which is nearly 50 % of the overall dataset).

In the next experiment (Fig. 13b) we vary the clustering parameter ε for a fixed value of k = 2. For GAMer and CoPaM we vary the allowed width of a cluster in the attribute space instead of ε. As shown in the figure, by choosing ε too small we cannot find all clusters and thus get smaller clustering qualities. However, for ε > 0.05 the clustering quality of DB-CSC remains stable. The competing methods have lower quality.

To further evaluate the detected clusterings, we depict in Fig. 13c the number of found clusters for varying ε-values and different values for the parameter k. The synthetic dataset consists of 80 hidden clusters. For k = 1, DB-CSC does not detect any clusters, so only the results for k = 2, 3, 4 are included. For k = 3, DB-CSC finds slightly more clusters than for k = 2. However, by increasing the parameter further to k = 4 the number of clusters remains nearly stable. As seen before, if ε is chosen too small DB-CSC cannot find all the clusters and thus outputs only small numbers of clusters. For ε = 0.05, more than the hidden 80 clusters are found, which can be explained by the fact that subsets of the hidden clusters can form additional clusters in higher dimensionalities. However, for even larger ε-values the number of found clusters remains stable and all of the hidden clusters, and only these, are found.

In the last experiment (Fig. 13d) we evaluate the robustness of DB-CSC w.r.t. the parameter minPts. For too small values of minPts, many vertex sets are falsely detected as clusters, and thus we obtain small clustering qualities. However, for sufficiently high minPts values the quality remains relatively stable, similar to the previous experiment.

Overall, the experiments show that DB-CSC obtains significantly higher clustering qualities. Even though it uses a more sophisticated cluster model than GAMer and CoPaM, the runtimes of DB-CSC are comparable to (and in some cases even better than) those of the other approaches.


Fig. 13 Robustness of the methods w.r.t. noise and parameter values: (a) quality versus noise, (b) quality versus ε, (c) number of found clusters versus ε and k, (d) quality versus minPts

5.5 Real world data

As real world data sets we use gene data (http://thebiogrid.org/ and http://genomebiology.com/2005/6/3/R22), patent data (http://www.nber.org/patents/) and a co-author network extracted from the DBLP database (http://dblp.uni-trier.de), as also used in Günnemann et al. (2010). Since for real world data no "hidden" clusters are given with which we could compare our clustering results, we compare the properties of the clusters found by the different methods. For the gene data DB-CSC detects 9 clusters with an average size of 6.3 and an average dimensionality of 13.2. In contrast, GAMer detects 30 clusters (average size: 8.8 vertices, average dimensionality: 15.5) and CoPaM 115581 clusters (average size: 9.7 vertices, average dimensionality: 12.2), which are far too many to be interpretable. In the patent data, DB-CSC detects 17 clusters with an average size of 19.2 vertices and an average dimensionality of three. In contrast, GAMer detects 574 clusters with an average size of 11.7 vertices and an average dimensionality of three. CoPaM did not finish on this dataset within two days. The clusters detected by DB-CSC are more expanded than the clusters of GAMer, which often are simply subsets of the clusters detected by DB-CSC. In the DBLP data, we observe the largest difference between the algorithms in terms of cluster sizes: DB-CSC detects three clusters with an average size of 670.3 vertices and an average dimensionality of 11. In contrast, GAMer detects 83 clusters with an average size of 10.1 and an average dimensionality of 3.0. CoPaM detects 341333 clusters with an average of 10.8 vertices and an average dimensionality of three. Again, the clustering result of CoPaM is far too large to allow an easy interpretation. In contrast, our DB-CSC method avoids redundant clusters and thus ensures meaningful result sizes.

6 Conclusion

We introduced a combined clustering model that simultaneously considers graph data and attribute data in subspaces. Our model is the first approach that exploits the advantages of density-based clustering in both domains. Based on the novel notion of local densities, our clusters correspond to dense regions in the graph as well as in the attribute space. To avoid redundancy in the result, our model selects only the most interesting clusters for the final clustering. We developed the algorithm DB-CSC, which uses a fixed point iteration method to efficiently determine the combined clustering solution. The correctness, convergence, and complexity of this iteration method are proven analytically. In the experimental analysis we demonstrate the high clustering quality and the efficiency of our DB-CSC method.

Acknowledgements This work has been supported by the UMIC Research Centre, RWTH Aachen University, Germany, and the B-IT Research School of the Bonn-Aachen International Center for Information Technology.

References

Aggarwal C, Wang H (2010) Managing and mining graph data. Springer, New York
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD, pp 94–105
Assent I, Krieger R, Müller E, Seidl T (2008) EDSC: efficient density-based subspace clustering. In: CIKM, pp 1093–1102
Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful? In: ICDT, pp 217–235
Dorogovtsev S, Goltsev A, Mendes J (2006) K-core organization of complex networks. Phys Rev Lett 96(4):040601
Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: WebKDD/SNA-KDD, pp 16–25
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231
Ester M, Ge R, Gao BJ, Hu Z, Ben-Moshe B (2006) Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In: SDM
Günnemann S, Müller E, Färber I, Seidl T (2009) Detection of orthogonal concepts in subspaces of high dimensional data. In: CIKM, pp 1317–1326
Günnemann S, Färber I, Boden B, Seidl T (2010) Subspace clustering meets dense subgraph mining: a synthesis of two paradigms. In: ICDM, pp 845–850
Günnemann S, Kremer H, Seidl T (2010) Subspace clustering for uncertain data. In: SDM, pp 385–396
Günnemann S, Boden B, Seidl T (2011) DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors. In: ECML/PKDD (1), pp 565–580
Günnemann S, Färber I, Müller E, Assent I, Seidl T (2011) External evaluation measures for subspace clustering. In: CIKM, pp 1363–1372
Hanisch D, Zien A, Zimmer R, Lengauer T (2002) Co-clustering of biological networks and gene expression data. Bioinformatics 18:145–154
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: KDD, pp 58–65
Janson S, Luczak M (2007) A simple solution to the k-core problem. Random Struct Algorithms 30(1–2):50–62
Kailing K, Kriegel HP, Kröger P (2004) Density-connected subspace clustering for high-dimensional data. In: SDM, pp 246–257
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. Trans Knowl Discov Data 3(1):1–58
Kubica J, Moore AW, Schneider JG (2003) Tractable group detection on large link data sets. In: ICDM, pp 573–576
Long B, Wu X, Zhang ZM, Yu PS (2006) Unsupervised learning on k-partite graphs. In: KDD, pp 317–326
Long B, Zhang ZM, Yu PS (2007) A probabilistic framework for relational clustering. In: KDD, pp 470–479
Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp 533–541
Moser F, Colak R, Rafiey A, Ester M (2009) Mining cohesive patterns from graphs with feature vectors. In: SDM, pp 593–604
Müller E, Assent I, Günnemann S, Krieger R, Seidl T (2009) Relevant subspace clustering: mining the most interesting non-redundant concepts in high dimensional data. In: ICDM, pp 377–386
Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. In: VLDB, pp 1270–1281
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD Explor 6(1):90–105
Pei J, Jiang D, Zhang A (2005) On mining cross-graph quasi-cliques. In: KDD, pp 228–238
Ruan J, Zhang W (2007) An efficient spectral algorithm for network community discovery and its applications to biological and social networks. In: ICDM, pp 643–648
Ulitsky I, Shamir R (2007) Identification of functional modules using network topology and high-throughput data. BMC Syst Biol 1(1):8
Zhou Y, Cheng H, Yu JX (2009) Graph clustering based on structural/attribute similarities. PVLDB 2(1):718–729
Zhou Y, Cheng H, Yu JX (2010) Clustering large attributed graphs: an efficient incremental approach. In: ICDM, pp 689–698
