Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

Embed Size (px)

Citation preview

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    1/20

    Turning Big data into tiny data:

    Constant-size coresets for k-means, PCA and projective clustering

    Dan Feldman Melanie Schmidt Christian Sohler

    Abstract

    We prove that the sum of the squared Euclideandistances from then rows of anndmatrixA to anycompact set that is spanned by k vectors in Rd canbe approximated up to (1+ )-factor, for an arbitrarysmall >0, using theO(k/2)-rank approximation ofAand a constant. This implies, for example, that theoptimalk-means clustering of the rows ofA is (1+ )-approximated by an optimal k-means clustering oftheir projection on the O(k/2) first right singularvectors (principle components) ofA.

    A (j, k)-coreset for projective clustering is a smallset of points that yields a (1 + )-approximation tothe sum of squared distances from the n rows of Atoanyset ofk affine subspaces, each of dimension atmost j. Our embedding yields (0, k)-coresets of sizeO(k) for handling k-means queries, (j, 1)-coresets ofsizeO(j) for PCA queries, and (j, k)-coresets of size(log n)O(jk) for anyj, k 1 and constant (0, 1/2).Previous coresets usually have a size which is linearlyor even exponentially dependent of d, which makes

    them useless whend n.Using our coresets with the merge-and-reduce ap-proach, we obtain embarrassingly parallel streamingalgorithms for problems such as k-means, PCA andprojective clustering. These algorithms use updatetime per point and memory that is polynomial in log nand only linear in d.

    For cost functions other than squared Euclideandistances we suggest a simple recursive coreset con-

    struction that produces coresets of size k1/O(1)

    fork-means and a special class of bregman divergencesthat is less dependent on the properties of the squaredEuclidean distance.

    MIT, Di str ibuted Roboti cs Lab. Emai l: dan-

    [email protected] Dortmund, Germany, Email: {melanie.schmidt,

    christian.sohler}@tu-dortmund.de

    1 Introduction

    Big Data. Scientists regularly encounter limitationsdue to large data sets in many areas. Data sets growin size because they are increasingly being gatheredby ubiquitous information-sensing mobile devices,aerial sensory technologies (remote sensing), genomesequencing, cameras, microphones, radio-frequencyidentification chips, finance (such as stocks) logs,

    internet search, and wireless sensor networks [30,38].The worlds technological per-capita capacity to storeinformation has roughly doubled every 40 monthssince the 1980s [31]; as of 2012, every day 2.5etabytes(2.5 1018) of data were created [4]. Datasets as the ones described above and the challengesinvolved when analyzing them is often subsumed inthe term Big Data.

    Gartner, and now much of the industry use the3Vs model for describing Big Data[14]: increasingvolumen(amount of data), its velocity (update timeper new observation) and its variety d (dimension,or range of sources). The main contribution of this

    paper is that it deals with cases where both n and dare huge, and does not assume d n.Data analysis. Classical techniques to analyzeand/or summarize data sets include clustering, i.e.the partitioning of data into subsets of similar char-acteristics, and dimension reduction which allows toconsider the dimensions of a data set that have thehighest variance. In this paper we mainly considerproblems that minimize the sum of squared error, i.e.we try to find a set of geometric centers (points, linesor subspaces), such that the sum of squared distancesfrom every input point to its nearest center is mini-

    mized.Examples are thek-meansor sum of squares clus-

    tering problem where the centers are points. An-other example is thej-subspace meanproblem, wherek = 1 and the center is a j-subspace, i. e., the sumof squared distances to the points is minimized overall j-subspaces. The j-rank approximationof a ma-trix is the projection of its rows on their j-subspacemean. Principal component analysis (PCA) is an-other example where k = 1, and the center is an

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    2/20

    affine subspace. Constrained versions of this prob-lem (that are usually NP-hard) include the non-negative matrix factorization (NNMF) [2], when the

    j-subspace should be spanned by positive vectors, andLatent Dirichlet Allocation (LDA) [3]which general-izes NNMF by assuming a more general prior distri-

    bution that defines the probability for every possiblesubspace.The most general version of the problem that we

    study is the linear or affine j-Subspace k-Clusteringproblem, where the centers are j-dimensional linearor affine subspaces and k 1 is arbitrary.

    In the context of Big Data it is of high interest tofind methods to reduce the size of the streaming databut keep its main characteristics according to theseoptimization problems.

    Coresets. A small set of points that approximatelymaintains the properties of the original point with

    respect to a certain problem is called a coreset (orcore-set). Coresets are a data reduction technique,which means that they tackle the first two Vs of BigData, volumenand velocity (update time), for a largefamily of problems in machine learning and statistics.Intuitively, a coreset is a semantic compression of agiven data set. The approximation is with respect toa given (usually infinite) set Q of query shapes: forevery shape in Q the sum of squared distances fromthe original data and the coreset is approximatelythe same. Running optimization algorithms on thesmall coreset instead of the original data allows usto compute the optimal query much faster, under

    different constraints and definition of optimality.Coresets are usually of size at most logarithmic

    in the number n of observations, and have similarupdate time per point. However, they do not handlethe variety of sources d in the sense that their size islinear or even exponential ind. In particular, existingcoreset are useless for dealing with Big Data whend n. In this paper we suggest coresets of sizeindependent ofd , while still independent or at mostlogarithmic in n.

    Non-SQL databases. Big data is difficult to workwith using relational databases of n records in d

    columns. Instead, in NOSQL is a broad class ofdatabase management systems identified by its non-adherence to the widely used relational databasemanagement system model. In NoSQL, every pointin the input stream consists of tuples (object, feature,value), such as (Scott, age, 25). More generally,the tuple can be decoded as (i,j,value) which meansthat the entry of the input matrix in the ith row and

    jth column is value. In particular, the total numberof observations and dimensions (nand d) is unknown

    while passing over the streaming data. Our papersupport this non-relational model in the sense that,unlike most of previous results, we do not assumethat eitherd or n are bounded or known in advance.We only assume that the first coordinates (i, j) areincreasing for every new inserted value.

    2 Related work

    Coresets. The term coreset was coined by Agarwal,Har-Peled and Varadarajan [10] in the context ofextend measures of point sets. They proved thatevery point set P contains a small subset of pointssuch that for any direction, the directional width ofthe point set will be approximated. They used theirresult to obtain kinetic and streaming algorithmsto approximately maintain extend measures of pointsets.

    Application to Big Data. An off-line coreset

    construction can immediately turned into streamingand embarrassingly parallel algorithms that use smallamount of memory and update time. This is doneusing a merge-and-reduce technique as explained inSection10. This technique makes coresets a practi-cal and provably accurate tool for handling Big data.The technique goes back to the work of Bentley andSaxe [11] and has been first applied to turn core-set constructions into streaming algorithms in [10].Popular implementations of this technique includeHadoop[40].

    Coresets for k-points clustering (j = 0). The

    first coreset construction for clustering problems wasdone by Badoiu, Har-Peled and Indyk [13], whoshowed that fork-center, k-median andk-means clus-tering, an approximate solution has a small witnessset (a subset of the input points) that can be usedto generate the solution. This way, they obtained im-proved clustering algorithms. Har-Peled and Mazum-dar [29] gave a stronger definition of coresets for k-median and k-means clustering. Given a point set Pthis definition requires that for any setofk centersCthe cost of the weighted coresetSwith respect toP isapproximated upto a factor of (1 + ). Here, Sis notnecessarily a subset ofP. We refer to their definition

    as a strong coreset.Har-Peled and Kushal[28] showed strong coresets

    for low-dimensional space, of size independent ofthe number of input points n. Frahling and Sohlerdesigned a strong coreset that allows to efficientlymaintain a coreset for k-means in dynamic datastreams[24]. The first construction of a strong coresetfor k-median and k-means of size polynomial in thedimension was done by Chen [15]. Langberg andSchulman[35] defined the notion ofsensitivityof an

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    3/20

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    4/20

    size O(klog n/2 log(||k log n)) by Ackermann andBlomer [6].

    3 Background and Notation

    We deal with clustering problems on point sets inEuclidean space, where a point set is represented by

    the rows of a matrix A. The input matrixA and allother matrices in this paper are over the reals.

    Notations and Assumptions. The number ofinput points is denoted by n. For simplicity ofnotation, we assume that d = n and thus A is ann nmatrix. Otherwise, we add (n d) columns (ord n rows) containing all zeros to A. We label theentries ofA byai,j. Theith row ofA will be denotedasAi and thej th columns asAj.

    The identity matrix ofRj is denoted asI Rjj .For a matrix X with entries xi,j , we denote the

    Frobenius normofX by

    X

    2= i,jx2i,j. We say

    that a matrix X Rnj has orthonormal columnsif its columns are orthogonal unit vectors. Such amatrix is also called orthogonal matrix. Notice thatevery orthogonal matrixX satisfies XTX= I.

    The columns of a matrix X span a linear sub-space L if all points in L can be written as linearcombinations of the columns ofA. This implies thatthe columns contain a basis ofL.

    A j-dimensional linear subspace L Rn willbe represented by an n j basis matrix X withorthonormal columns that span L. The projectionof a point set (matrix) A on a linear subspace Lrepresented by X will be the matrix (point set)AX Rnj . These coordinates are with respect tothe column space ofX. The projections ofA on Lusing the coordinates ofRn are the rows ofAX XT.

    Distances to Subspaces. We will often computethe squared Euclidean distance of a point set given asa matrix A to a linear subspace L. This distance isgiven by AY22, whereYis ann(nj) matrix withorthonormal columns that span L, the orthogonalcomplement ofL. Therefore, we will also sometimesrepresent a linear subspace L by such a matrix Y.

    An affine subspace is a translation of a linearsubspace and as such can be written as p +L, where

    p Rn is the translation vector and L is a linearsubspace.

    For a compact set S Rd and a vector p in Rd,we denote the Euclidean distance between p and (itsclosest points in)Sby

    dist2(p, S) := minsS

    p s22.

    For an n d matrixA whose rows are p1, , pn, wedefine the sum of the squared distances from A toS

    by

    (3.1) dist2(A, S) =

    ni=1

    dist2(pi, S).

    Thus, if L is a linear j-subspace and Y is a

    matrix with n j orthonormal columns spanningL, then dist(p, L) =pTY2. We generalize thenotation to n n matrices and define, dist(A, L) =n

    i=1dist(Ai, L) for any compact setL. HereAi istheith row ofA. Furthermore, we write dist2(A, L) =n

    i=1

    dist(Ai, L)

    2.

    Definition 1. (Linear (Affine) j-Subspace k-Clustering)Given a set ofn points inn-dimensionalspace as an n n matrix A, the k j-subspaceclustering problem is to find a set L of k linear(affine) j-dimensional subspaces L1, . . . , Lk of Rd

    that minimizes the sum of squared distances to the

    nearest subspace, i.e.,

    cost(A, L) =n

    i=1

    minj=1...,k

    dist2(Ai, Lj )

    is minimized over every choice ofL1, , Lk.

    Singular Value Decomposition. An importanttool from linear algebra that we will use in thispaper is the singular value decomposition of a matrixA. Recall that A = U DVT is the Singular ValueDecomposition (SVD) of A if U, V Rnn areorthogonal matrices, and D

    Rnn is a diagonal

    matrix with non-increasing entries. Let (s1, , sn)denote the diagonal ofD2.

    For an integer j between 0 to n, the first jcolumns ofVspan a linear subspaceL that minimizethe sum of squared distances to the points (rows) ofA, over allj-dimensional linear subspaces in Rn. Thissum equals sj+1 + . . .+sn, i.e. for any n (n j)matrixY with orthonormal columns, we have

    (3.2) AY22n

    i=j+1

    si.

    The projection of the points ofA on L are the rowsof U D. The sum of the squared projection of thepoints ofA on L iss1+ . . . + sj . Note that the sumof squared distances to the origin (the optimal andonly 0-subspace) is s1+. . .+sn.

    Coresets. In this paper we introduce a new notionof coresets, which is a small modification of theearlier definition by Har-Peled and Mazumdar [29]for thek-median andk-means problem (here adaptedto the more general setting of subspace clustering)

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    5/20

    that is commonly used in coreset constructions forthese problems. The new idea is to allow to add aconstantC to the cost of the coreset. Interestingly,this simple modification allows us to obtain improvedcoreset constructions.

    Definition 2. LetA be ann

    d matrix whose rows

    represents n points in Rd. An m n matrix Mis called (k, )-coreset for thej-subspacek-clusteringproblem of A, if there is a constant c such that forevery choice ofk j-dimensional subspacesL1, . . . , Lkwe have

    (1) cost(A, L) cost(M, L)+c (1+) cost(A, L).

    4 Our results

    In this section we summarize our results.The main technical result is a proof that the sum

    of squared distances from a set of points in Rd (rowsof an n

    d matrix A) to any other compact set

    that is spanned by k vectors of Rd can be (1 + )-approximated using the O(k/)-rank approximationofA, together with a constant that depends only onA. Here, a distance between a point p to a set is theEuclidean distance ofpto the closest point in this set.TheO(k/)-rank approximation ofAis the projectionof the rows ofA on the k -dimensional subspace thatminimizes their sum of squared distances. Hence,we prove that the low rank approximation ofA canbe considered as a coreset for its n rows. Whilethe coreset also has n rows, its dimensionality isindependent ofd, but only onk and the desired error.

    Formally:Theorem 4.1. Let A be ann d matrix, k 1 bean integer and 0 < < 1. Suppose that Am is them-rank approximation ofA, wherem:= bk/2 n

    for a sufficiently large constant b. Then for everycompact set S that is contained in a k-dimensionalsubspace ofRd, we have

    (1 )dist2(A, S) dist2(Am, S) + A Am2(1 +)dist2(A, S),

    wheredist2(A, S)is the sum of squared distances fromeach row onA to its closest point inS.

    Note that A takes nd space, while the pair Amand the constant AAm can be stored usingnm+1space.

    Coresets. Combining our main theorem with knownresults[21, 19,36]we gain the following coresets forprojective clustering. All the coresets can handleBig data (streaming, parallel computation and fastupdate time), as explained in Section10 and later inthis section.

    Corollary 4.1. LetPbe a set of points inRd, andk, j 0 be a pair of integers. There is a set Q inR

    d, a weight functionw : Q [0, ) and a constantc > 0 such that the following holds.

    For every set B which is the union of k affinej-subspaces ofRd we have

    (1 )

    pP

    dist2(p, B)

    pQ

    w(p)dist2(p, B) +c

    (1 +)

    pP

    dist2(p, B),

    and

    1.|Q| =O(j/) ifk= 12.|Q| =O(k2/4) ifj = 03.|Q| = poly(2k log n, 1/) ifj = 1,4.

    |Q

    | = poly(2kj, 1/) if j, k > 1, under the

    assumption that the coordinates of the points inPare integers between1 andnO(1).

    In particular the size ofQ is independent ofd.

    PCA and k-rank approximation. For the firstcase of the last theorem, we obtain a small coresetfor k-dimensional subspaces with no multiplicativeweights, which contains only O(k/) points in Rd.That is, its size is independent of both d and n.

    Corollary 4.2. Let A be an n d matrix. Letm =k/ +k 1 for some k 1 and 0 < < 1and suppose thatm n 1. Then, there is anm dmatrixA and a constant c 1, such that for everyk-subspaceSofRd we have

    (1 )dist2(A, S) dist2(A, S) +c(1 +)dist2(A, S)(4.3)

    Equality(4.3) can also be written using matrix nota-tion: for everyd (d k) matrix Y whose columnsare orthonormal we have

    (1 )AY2 AY +c(1 +)AY2.

    Notice that in Theorem 4.1, Am is still n-dimensional and holds n points. We obtain Corol-lary4.2 by obbserving thatAmX =D(m)VTXfor every X Rnnj . As the only non-zero entriesofD(m)VT are in the firstm rows, we can store theseas the matrix A. This can be considered as an exactcoreset forAm.

    Streaming. In the streaming model, the inputpoints (rows of A) arrive on-line (one by one) and

    5

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    6/20

    we need to maintain the desired output for the pointsthat arrived so far. We aim that both the updatetime per point and the required memory (space) willbe small. Usually, linear ind and polynomial in log n.

    For computing the k-rank approximation of amatrix A efficiently, in parallel or in the streaming

    model we cannot use Corollary 4.2 directly: First,because it assumes that we already have theO(k/)-rank approximation ofA, and second, thatAassumedto be in memory which takes nd space. However,using merge-and-reduce we only need to apply theconstruction of the theorem on very small matricesA of size independent of d in overall time that islinear in both n and d, and space that is logarithmicin O(log n). The construction is also embarrassinglyparallel; see Fig.3 and discussion in Section10.

    The following corollary follows from Theo-rem 10.1 and the fact that computing the SVD foran m

    d matrix takesO(dm2) time when m

    d.

    Corollary 4.3. LetA be then d matrix whosenrows are vectors seen so far in a stream of vectors inRd. For everyn 1we can maintain a matrixAandc 0 that satisfies (4.3) where

    1. A is of size2m 2m form= k/2. The update time per row insertion, and overall

    space used is

    d

    k log n

    O(1)

    Using the last corollary, we can efficiently com-pute a (1 + )-approximation to the subspace thatminimizes the sum of squared distances to the rowsof a huge matrix A. After computing A and c forA as described in Corollary 4.3, we compute the k-subspace S that minimizes the sum of squared dis-tances to the small matrix A. By (4.3), S approxi-mately minimizes the sum of squared distances to therows ofA. To obtain an approximation to the k-rankapproximation ofA, we project the rows ofA on S

    in O(ndk) time.Since A approximates dist2(A, S) for any k-

    subspace of Rd (not only S), computing the sub-

    space that minimizes dist2

    (A, S) under arbitrary con-straints would yield a (1 +)-approximation to thesubspace that minimizes dist2(A, S) under the sameconstraints. Such problems include the non-negativematrix factorization (NNMF, also called pLSA, orprobabilistic LSA) which aims to compute a k-subspaceS that minimizes sum of squared distancesto the rows of A as defined above, with the addi-tional constraint that the entries of S will all benon-negative.

    Latent Drichlet analysis (LDA)[3] is a generaliza-tion of NNMF, where a prior (multiplicative weight)is given for every possiblek-subspace in Rd. In prac-tice, especially when the corresponding optimizationproblem is NP-hard (as in the case of NNMF andLDA), running popular heuristics on the coreset pair

    A and c may not only turn them into faster, stream-ing and parallel algorithms. It might actually yieldbetterresults (i.e, 1 approximations) comparedto running the heuristics on A; see [33].

    In principle component analysis (PCA) we usu-ally interested in the affine k-subspace that mini-mizes the sum of squared distances to the rows ofA. That is, the subspace may not intersect the ori-gin. To this end, we replacek byk +1 in the previoustheorems, and compute the optimal affine k-subspacerather than the (k + 1) optimal subspace of the smallmatrix A.

    For the k -rank approximation we use the follow-

    ing corollary with Zas the empty set of constraints.Otherwise, for PCA, NNMF, or LDA we use the cor-responding constraints.

    Corollary 4.4. Let A be an n d matrix. LetAk denote an n k matrix of rank at most k thatminimizesAAk2 among a given (possibly infinite)set Z of such matrices. LetA and c be defined asin Corollary 4.3, and letAk denote the matrix thatminimizesA Ak2 among the matrices inZ. Then

    A Ak2 (1 +)A Ak2.Moreover,

    A

    Ak

    2 can be approximated using

    A andc, as

    (1 )A Ak2 A Ak2 +c(1 +)A Ak2.

    k-means. Thek -mean ofA is the set S ofk pointsin Rd that minimizes the sum of squared distancesdist2(A, S) to the rows ofA among everyk points inRd. It is not hard to prove that the k-mean of the k-rank approximationAk ofA is a 2-approximation forthek-mean ofA in term of sum of squared distancesto the k centers [18]. Since every set ofk points is

    contained in a k-subspace ofRd, the k-mean ofAmis a (1 + )-approximation to the k-means of A interm of sum of squared distances. Since thek-meanofAis clearly in the span ofA, we conclude from ourmain theorem the following corollary that generalizesthe known results from 2-approximation to (1 + )-approximation.

    Corollary 4.5. LetAm denote them = bk/2 rank

    approximation of an n d matrix A, where b is

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    7/20

    a sufficiently large constant. Then the sum of thesquared distances from the rows ofA to thek-mean ofAm is a(1 + )-approximation for the sum of squareddistances to thek-mean ofA.

    While the last corollary projects the input pointsto a O(k/)-dimensional subspace, the number ofrows (points) is still n. We use existing coresetconstructions for k-means on the lower dimensionalpoints. These constructions are independent of thenumber of points and linear in the dimension. Sincewe apply these coresets on Am, the resulting coresetsize is independent of both n and d.

    Notice the following coreset of a similar type forthe case k = 1. Let A be the mean ofA. Then thefollowing triangle inequality holds for every points Rd:

    dist2(A, s) = dist2(A,

    A

    ) +n dist2(A, s).

    Thus, A forms an exact coreset consisting of oned-dimensional point together with the constantdist2(A,

    A

    ).As in our result, and unlike previous coreset con-

    structions, this coreset for 1-mean is of size indepen-dent ofdand uses additive constant on the right handside. Our results generalizes this exact simple coresetsfor k-means where k 2 while introducing (1 +)-approximation.

    Unlike the k-rank approximation that can becomputed using the SVD of A, the k-means prob-lem is NP-hard when k is not a constant, analog toconstrained versions (e.g. NNMF or LDA) that are

    also NP-hard. Again, we can run (possibly inefficient)heuristics or constant factor approximations for com-puting the k-mean of A under different constraintsin the streaming and parallel model by running thecorresponding algorithms on A.

    Comparison to Johnson-Lindestrauss Lemma.The JL-Lemma states that projecting the n rowsof an n d matrix A on a random subspace ofdimension (log(n)/2) in Rd preserves the Euclideandistance between every pair of the rows up to afactor of (1 + ), with high (arbitrary small constant)probability 1

    . In particular, the k-mean of the

    projected rows minimizes the k-means cost of theoriginal rows up to a factor of (1+ ). This is becausethe minimal cost depends only on the

    n2

    distances

    between the n rows[34]. Note that we get the sameapproximation by the projection Am ofA on an m-dimensional subspace.

    While our embedding also projects the rows ontoa linear subspace of small dimension, its constructionand properties are different from a random subspace.First, our embedding is deterministic (succeeds with

    probability 1, no dependency of time and space on). Second, the dimension of our subspace isk/,that is, independent of n and only linear in 1/.Finally, the sumof squared distances toanyset thatis spanned by k-vectors is approximated, rather thanthe inter distance between each pair of input rows.

    Generalizations for Bregman divergences. Ourlast result is a simple recursive coreset constructionthat yields the first strong coreset for k-clusteringwith -similar Bregman divergences. The coresetwe obtain has size kO(1/). The idea behind theconstruction is to distinguish between the cases thatthe input point set can be clustered into k clusters ata cost of at most (1) times the cost of a 1-clusteringor not. We apply the former case recursively on theclusters of an optimal (or approximate) solution untileither the cost of the clustering has dropped to atmost times the cost of an optimal solution for the

    input point set or we reach the other case. In bothcases we can then replace the clustering by a so-calledclustering feature containing the number of points,the mean and the sum of distances to the mean. Notethat our result uses a stronger assumption than [6]because we assume that the divergence is -similaron the complete Rd while in [6] this only needs tohold for a subset X Rd. However, compared to[6] we gain a strong coreset, and the coreset size isindependent of the number of points.

    5 Implementation in Matlab

    Our main coreset construction is easy to implementif a subroutine for SVD is already available, forexample, Algorithm 1 shows a very short Matlabimplementation. The first part initializes the n d-matrix A with random entries, and also sets theparameters j and . In the second part, the actualcoreset construction happens. We setm :=j/ +

    j 1 according to Corollary6.1. Then we calculatethe first m singular vectors ofA and store it in thematricesU,D and V. Notice that for small matrices,it is better to use the full svd svd(A,0). Then, wecompute the coreset matrix Cand the constant c =

    ni=m+1si. In the third part, we check the quality

    of our coreset. For this, we compute a random querysubspace represented by a matrix Q with j randomorthogonal columns, calculate the sum of the squareddistances ofA to Q, and ofC to Q. In the end, errcontains the multiplicative error of our coreset.

    The Matlab code of Algorithm 2 is based onCorollary 4.5 for constructing a coreset C for k-mean queries. This coreset has n points that arelying on O(k/2) subspace, rather than the originald-dimensional space. Note that further reduction for

    7

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    8/20

    a coreset of size independent of both n and d can beobtained by applying additional construction on C;see Corollary9.1. The code in Algorithms1 and 2 isonly for demonstration and explanation purposes. Inparticular, it should be noted that

    For simplicity, we choose A as a random Gaus-sian matrix. Real data sets usually contain morestructure.

    We choose the value of m according to thecorresponding theorems, that are based on worstcase analysis. In practice, as can be seen in ourexperiments, significantly smaller values for mcan be used to obtain the same error bound.The desired value of m can be chosen usinghill climbing or binary search techniques on theintegerm.

    In Algorithm1we compare the coreset approx-imation for a fixed query subspace, and not themaximum error over all such subspaces, which isbounded in Corollary6.1.

    In Algorithm2 we actually get a better solutionusing the coreset, compared to the original set( > %% Coreset Computation %%%

    >> %% for j-Subspace Approximation %%%

    >> % 1. Creating Random Input

    >> n=10000;

    >> d=2000;

    >> A=rand(n,d);

    >> j=2;

    >> epsilon=0.1;

    >> % 2. The Actual Coreset Construction

    >> m=j+ceil(j/epsilon)-1; % i.e., m=21.

    >> [U, D, V]=svds(A,m);

    >> C = D*V; % C is an m-by-m matrix

    >> c=norm(A,fro)^2-norm(D,fro)^2;

    >> % 3. Compare sum of squared distances of j

    >> % random orthogonal columns Q to A and C

    >> Q=orth(rand(d,d-j));

    >> costA= norm(A*Q,fro)^2;>> costC= c+norm(C*Q,fro) 2;

    >> errJsubspaceApprox = abs(costC/costA - 1)

    errJsubspaceApprox = 2.4301e-004

    Algorithm 1: A Matlab implementation of the coresetconstruction forj -subspace approximation.

    value decomposition A = U DVT. Our first stepis to replace the matrix A by its best rank m

    approximation with respect to the squared Frobeniusnorm, namely, by U D(m)VT, where D(m) is thematrix that contains the m largest diagonal entriesofD and that is 0 otherwise. We show the followingsimple lemma about the error of this approximationwith respect to squared Frobenius norm.

    Lemma 6.1. LetA Rnn be ann n matrix withthe singular value decompositionA = U DVT, and letX be an j matrix whose columns are orthonormal.Let (0, 1] and m N with n 1 m

    j +j/1, and letD(m)

    be the matrix that containsthe firstm diagonal entries ofD and is0 otherwise.Then

    0 U DVTX22 U D(m)VTX22 n

    i=j+1

    si.

    Proof. We first observe that U DVTX22

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    9/20

    >> %% Coreset Computation for 2-means %%%

    >> % 1. Creating Random Input

    >> n=10000;

    >> d=2000;

    >> A=rand(n,d); % n-by-d random matrix

    >> j=2;

    >> epsilon=0.1;

    >> %2. Coreset Construction

    >> m=ceil(j/epsilon 2)+j-1; % i.e, m=201

    >> [U, D, V]=svds(A,m);

    >> c=norm(A,fro)^2-norm(D,fro)^2;

    >> C = U*D*V; % C is an n-by-m matrix

    >> % 3. Compute k-mean for A and its coreset

    >> k=2;

    >> [~,centersA]=...

    kmeans(A,k,onlinephase,off);

    >> [~,centersC]=...kmeans(C,k,onlinephase,off);

    >> % 4.Evaluate sum of squared distances

    >> [~,distsAA]=knnsearch(centersA,A);

    >> [~,distsCA]=knnsearch(centersC,A);

    >> [~,distsCC]=knnsearch(centersC,C);

    >> costAA=sum(distsAA.^2); % opt. k-means

    >> costCA=sum(distsCA.^2); % appr. cost

    >> costCC=sum(distsCC.^2);

    >> % Evaluate quality of coreset solution

    >> epsApproximation = costCA/costAA - 1

    epsApproximation = -4.2727e-007

    >> % Evaluate quality of cost estimation

    >> % using coreset

    >> epsEstimation = abs(costCA/(costCC+c) - 1)

    epsEstimation = 5.1958e-014

    Algorithm 2: A Matlab implementation of the coresetconstruction for k-means queries.

    U D(m)VTX22 is always non-negative. Then

    U DVTX22 U D(m)VTX22= DVTX22 D(m)VTX22= (D D(m))VTX22 (D D(m))22X2S= j sm+1

    Figure 1: Vizualization of a point set that is projecteddown to a 1-dimensional subspace. Notice that bothsubspace here and in all other pictures are of the samedimension to keep the picture 2-dimensional, but thequery subspace should have smaller dimension.

    where the first equality follows since U has or-thonormal columns, the second inequality since forM = VTX we have DM22 D(m)M22 =n

    i=1

    nj=1sim

    2ij

    mi=1

    nj=1sim

    2ij =n

    i=m+1

    nj=1sim

    2ij = (D D(m))M2, and

    the inequality follows because the spectral norm isconsistent with the Euclidean norm. It follows forour choice ofm that

    jsm+1 (mj +1)sm+1 m+1

    i=j+1si

    n

    i=j+1si.

    In the following we will show that one can usethe first m rows of D(m)VT as a coreset for thelinear j-dimensional subspace 1-clustering problem.We exploit the observation that by the Pythagoreantheorem it holds that AY22 +AX22 = A22.Thus, AY22 can be decomposed as the differ-ence between A22, the squared lengths of thepoints in A, and AX22, the squared lengths ofthe projection of A on L. Now, when usingU D(m)VT instead ofA,

    U D(m)VTY

    22 decomposes

    into U D(m)VT22 U D(m)VTX22 in the sameway. By noting thatA22 is actually

    ni=1si and

    U D(m)VT22 ism

    i=1si, it is clear that storing theremaining terms of the sum is sufficient to accountfor the difference between these two norms. Approxi-mating AY22 by U D(m)VTY22 thus reduces to ap-proximate AX22 byU D(m)VTX22. We show thatthe difference between these two is only an-fractionofAY22, and explain after the corollary why thisimplies the desired coreset result.

    9

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    10/20

    Figure 2: The distances of a point and its projectionto a query subspace.

    Lemma 6.2. (Coreset for Linear Subspace 1-

    Clustering) Let A Rnn

    be an n n matrix withsingular value decomposition A = U DVT, Y bean n (n j) matrix with orthonormal columns,0 1, andm Nwithn 1 m j +j/1.Then

    0 U D(m)VTY22+n

    i=m+1

    si AY22 AY22.

    Proof. We haveAY22 U D(m)VTY22+ U(D D(m))VTY22 U D(m)VTY22+

    ni=m+1si, which

    proves thatU D(m)VTY22+n

    i=m+1si AY22 isnon-negative. We now follow the outline sketched

    above. By the Pythagorean Theorem, AY22 =A22 AX22, where Xhas orthonormal columns

    and spans the space orthogonal to the space spannedbyY . Using,A22 =

    ni=1si and,U D(m)VT22 =m

    i=1si, we obtain

    U D(m)VTY22+n

    i=m+1

    si AY22

    = U D(m)VT22 U D(m)VTX22+

    ni=m+1

    si A22+ AX22

    = AX22 U D(m)VTX22

    ni=j+1

    si AY22

    where the first inequality follows from Lemma6.1.

    Now we observe that by orthonormality of thecolumns ofUwe have U M22= M22for any matrixM, which implies thatU DVTX22 =DVTX22.

    Thus, we can replace the matrix U D(m)VT in theabove corollary by D(m)VT. This is interesting,because all rows except the first m rows of thisnew matrix have only 0 entries and so they dontcontribute to D(m)VTX22. Therefore, we will defineour coreset matrix Sto be matrix consisting of the

    first m= O(j/) rows ofD(m)

    VT

    . The rows of thismatrix will be the coreset points.In the following, we summarize our results and

    state them forn points in d-dimensional space.

    Corollary 6.1. LetA be ann n matrix whosenrows representn points in n-dimensional space. LetA= U DVT be the SVD ofA and letD(m) be a matrixthat contains the first m = O(j/) diagonal entriesofD and is0 otherwise. Then the rows ofD(m)VT

    form a coreset for the linear j-subspace 1-clusteringproblem.

    If one is familiar with the coreset literature it

    may seem a bit strange that the resulting point setis unweighted, i.e. we replace n unweighted pointsbym unweighted points. However, for this problemthe weighting can be implicitly done by scaling.Alternatively, we could also define our coreset to bethe set of the first m rows ofVT where theith row isweighted bysi.

    7 Dimensionality Reduction for ProjectiveClustering Problems

    In order to deal with k subspaces we will use a formof dimensionality reduction. To define this reduction,

    let L be a linear j -dimensional subspace representedby ann j matrixXwith orthonormal columns andwithYbeing ann (n j) matrix with orthonormalcolumns that spans L. Notice that for any matrixMwe can write the projection of the points in therows ofM toL as M XXT, and that these projectedpoints are still n-dimensional, but lie within the j-dimensional subspace L. Our first step will be toshow that if we project both U DVT and A :=U D(m)VT on X by computing U DVTXXT andU D(m)VTXXT, then the sum of squared distancesof the corresponding rows of the projection is smallcompared to the cost of the projection. In other

    words, after the projection the points of A will berelatively close to their counterparts of A. Noticethe difference to the similar looking Lemma 6.1: InLemma 6.1, we showed that if we project A to L andsum up the squared lengthsof the projections, thenthis sum is similar to the sum of the lengths of theprojections ofA. In the following corollary, we lookat the distances between a projection of a point fromA and the projection of the corresponding point inA, then we square these distances and show that the

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    11/20

    sum of them is small.

    Corollary 7.1. LetA Rnn and letA= U DVTbe its singular value decomposition. Let D(m) be amatrix whose firstm diagonal entries are the same asthat ofD and let it be0, otherwise. LetX Rnj bea matrix with orthonormal columns and letn

    1m j + j/ 1 and let Y Rn(nj) a matrix

    with orthonormal columns that spans the orthogonalcomplement of the column space ofX. Then

    U DVTXXT U D(m)VTXXT22 AY22Proof. We have U DVTXXT U D(m)VTXXT22 =DVTX D(m)VTX22 = DVTX22 D(m)VTX22 =U DVTX22 U D(m)VTX22 n

    i=j+1si n

    i=1si AY22, where the thirdlast inequality follows from Lemma6.1.

    Now assume we want to use A = U D(m)VT

    to estimate the cost of an j-dimensional affine sub-space k-clustering problem. Let L1, . . . , Lk be a setof affine subspaces and let L be a j-dimensionalsubspace containing C = L1 Lk, j k(j+ 1). Then by the Pythagorean theorem we canwrite dist2(A, C) = dist2(A, L) + dist2(AXXT, C),where X is a matrix with orthonormal columnswhose span is L. We claim that dist2(A, C) +n

    i=m+1si is a good approximation for dist2(A, C).

    We also know that dist2(A, C) = dist2(A, L) +dist2(AXXT, C). Furthermore, by Corollary6.2, |dist2(A, L) +ni=m+1si dist2(A, L)| dist2(A, L)

    dist2(A, C). Thus, if we can

    prove that |dist2(AXXT, C) dist2(AXXT)|22 dist(A, C) we have shown that |dist(A, C)(dist(A, C) +

    ni=m+1si)| 2dist(A, C), which will

    prove our dimensionality reduction result.In order to do so, we can use the following weak

    triangle inequality, which is well known in the coresetliterature, and can be generalized for other norms anddistance functions (say, m-estimators).

    Lemma 7.1. For any 1 > > 0, a compact setC Rn, andp, q Rn,

    |dist2(p, C)

    dist2(q, C)

    |

    12p q2

    +

    2

    dist2(p, C).

    8 Proof of Lemma7.1

    Proof. Using the triangle inequality,

    |dist2(p, B) dist2(q, B)|= |dist(p, B) dist(q, B)| (dist(p, B) + dist(q, B)) p q (2dist(p, B) + p q) p q2 + 2dist(p, B)p q.

    (8.4)

    Either dist(p, B) pq/or pq < dist(p, B).Hence,

    dist(p, B)p q p q2

    +dist2(p, B).

    Combining the last inequality with (8.4) yields

    |dist2(p, B) dist2(q, B)| p q2 +2p q

    2

    + 2dist2(p, B)

    3p q2

    + 2dist2(p, B).

    Finally, the lemma follows by replacing with/4.

    We can combine the above lemma with Corollary7.1 by replacing in the corollary with 2/30 andsumming the error of approximating an input point

    p by its projection q. If C is contained in a j-dimensional subspace, the error will be sufficiently

    small for m j

    + 30j

    /2

    1. This is done inthe proof of the theorem below.

    Theorem 8.1. Let A Rnn be a matrix withsingular value decomposition A = U DVT and let > 0. Let j 1 be an integer, j = j + 1 andm = j +30j/2 1 such that m n 1.Furthermore, let A = U D(m)VT, where D(m) is adiagonal matrix whose diagonal is the firstm diagonalentries ofD followed byn m zeros.

    Then for any compact setC, which is containedin aj-dimensional subspace, we have

    dist2(A, C) dist2(A, C) +n

    i=m+1si

    dist2(A, C).Proof. Suppose that X Rn(j+1) has orthonormalcolumns that spanC. LetY Rn(n(j+1)) denote amatrix whose orthonormal columns are also orthogo-nal to the columns ofX. SinceAY2 is the sum ofsquared distances to the column space ofX, we have

    (8.5) AY2 dist2(A, C).Fix an integer i, 1 i n, and let pi denote

    the ith row of the matrix AX XT. That is, pi is the

    projection of a row ofA on X. Letpi denote the ithrow ofAXXT. Using Corollary7.1, while replacing with =2/30

    ni=1

    pi pi2(8.6)

    =UDV XXT U D(m)V XXT2

    2

    30 AY2.

    11

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    12/20

    Notice that if ui is the ith row of U, then theith row of A can be written as Ai = uiDVT,and thus for its projection p to X it holds p =uiDV

    TXXT. Similarly,p =uiD(m)VTXXT. Thus,

    the sum of the squared distances between all pand their corresponding p is just||U DVTXXT U D

    (m)

    VT

    XXT

    ||2

    2, which is bounded by

    ||AY||2

    byCorollary7.1and since we setm = j+30j/21in the precondition of this Theorem. Together, we get

    dist2(A, C) dist2(A, C) ni=j+1

    si

    =AY2 AY2(8.7)

    ni=j+1

    si+

    pP

    (dist2(p, C) dist2(p, C)) AY2(8.8)

    +

    ni=1

    12pi pi

    2

    +

    2 dist2

    (pi, C)

    +12

    +

    2

    dist2(A, C)(8.9)

    dist2(A, C),

    where (8.7) follows from the Pythagorean Theo-rem, (8.8) follows from Lemma7.1and Corollary6.2,and (8.9) follows from (8.5) and (8.6).

    9 Small Coresets for Projective Clustering

    In this section we use the result of the previous

    section to prove that there is a strong coreset of sizeindependent of the dimension of the space. In thelast section, we showed that the projection A of Ais a coreset for A. A still has n points, which are n-dimensional but lie in anm-dimensional subspace. Toreduce the number of points, we want to apply knowncoreset constructions toA within the low dimensionalsubspace. However, this would mean that our coresetonly holds for centers that are also from the lowdimensional subspace, but of course we want thatthe centers can be chosen from the full dimensionalspace. We get around this problem by applyingthe coreset constructions to a slightly larger space

    than the subspace that A lives in. The followinglemma provides us the necessary tool to complete theargumentation.

    Lemma 9.1. LetL1 be ad1-dimensional subspace ofRn and letLbe a(d1+d2)-dimensional subspace ofRn

    that containsL1. LetL2be ad2-dimensional subspaceofRn. Then there is an orthonormal matrixU suchthat U x = x for any x L1 and U x L for anyx L2.

    Proof. Let B1 be an nn matrix whose columnsare an orthonormal basis of Rn such that the firstd1 columns span L1 and the first d1 + d2 columnsspan L2. Let B2 an nn matrix whose columnsare an orthonormal basis ofRn such that the first d1columns are identical with B1 and the first d1 + d2

    columns span L2. Define U = B1BT

    2. It is easy toverify thatUsatisfies the conditions of the lemma.

    Let L be any m+j-dimensional subspace thatcontains the m-dimensional subspace in which thepoints in A lie, let C be an arbitrary compact set,which can, for example, be a set of centers (points,lines, subspaces, rings, etc.) and let L2 be a j-dimensional subspace that contains C. If we now setd1 := m and d2 := j and let L1 be the subspacespanned by the rows ofA, then Lemma9.1 implies

    that every given compact set (and so any set ofcenters) can be rotated into L. We could now proceedby computing a coreset in a m + j-dimensionalsubspace and then rotate every set of centers to thissubspace. However, the last step is not necessarybecause of the following.

    The mapping defined by any orthonormal matrixUis an isometry, and thus applying it does not changeEuclidean distances. So, Lemma 9.1 implies thatthe sum of squared distances of C to the rows ofA is the same as the sum of squared distances ofU(C) :={U x : x C} to A. Now assume that wehave a coreset for the subspaceL. U(C) is still a unionofk subspaces, and it is in L. This implies that thesum of squared distances to U(C) is approximatedby the coreset. But this is identical to the sum ofsquared distances to C and so this is approximatedby the coreset as well.

    Thus, in order to construct a coreset for a setof n points in Rn we proceed as follows. In a firststep we use the dimensionality reduction to reducethe input point set to a set ofn points that lie in anm-dimensional subspace. Then we construct a coresetfor a (d +j)-dimensional subspace that contains thelow-dimensional point set. By the discussion above,

    this will be a coreset.Finally, combining this with known results [21,19,36] we gain the following corolarry.

    Corollary 9.1. LetPbe a set of points inRd, andk, j 0 be a pair of integers. There is a set Q inRd, a weight functionw : Q [0, ) and a constantc > 0 such that the following holds.

    For every set B which is the union of k affine

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    13/20

    j-subspaces ofRd we have

    (1 )

    pP

    dist2(p, B)

    pQ

    w(p)dist2(p, B) +c

    (1 +)

    pP

    dist2(p, B),

    and

    1.|Q| =O(j/) ifk= 12.|Q| =O(k2/4) ifj = 03.|Q| = poly(2k log n, 1/)ifj = 1,4.|Q| = poly(2kj , 1/) if j, k > 1, under the

    assumption that the coordinates of the points inPare integers between1 andnO(1).

    In particular the size ofQ is independent ofd.

    10 Fast, streaming, and parallelimplementations

    One major advantage of coresets is that they canbe constructed in parallel as well as in a streamingsetting.

    That are can be constructed (embarrassingly) inparallel is due to the fact that the union of coresetsis again a coreset. More precisely, if a data setis split into subsets, and we compute a (1 + )-coreset of size s for every subset, then the unionof the coresets is a (1 + )-coreset of the wholedata set and has size s. This is especially helpfulif the data is already given in a distributed fashionon different machines. Then, every machine willsimply compute a coreset. The small coresets canthen be sent to a master device which approximatelysolves the optimization problem. Our coreset resultsfor subspace approximation for one j-dimensionalsubspace, for k-means and for general projectiveclustering thus directly lead to distributed algorithmsfor solving these problems approximately.

    If we want to compute a coreset with size indepen-dent of on the master device or in a parallel settingwhere the data was split and the coresets are latercollected, then we can compute a (1 +)-coreset of

    this merged coreset, gaining a (1 +)2-coreset. For := /3, this results in a (1 + )-coreset because(1 +/3)2

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    14/20

    Figure 3: Tree construction for generating coresets inparallel or from a data stream. Black arrows indi-cate merge-and-reduce operations. The (intermedi-ate) coresets C1, . . . , C 7 are enumerated in the orderin which they would be generated in the streamingcase. In the parallel case,C1, C2, C4 and C5 would beconstructed in parallel, followed by parallel construc-tion ofC3 andC6, finally resulting in C7. The figure

    is taken from [33].

    (i) it holds|C| a(/9, m() O(log n)) for an =O(/ log n).

    (ii) The construction ofC takes at most

    s(, 2m()) +s(/9, m() O(log n))

    space with additionalm() O(log n) items.(iii) Update time of C per point insertion to the

    stream is

    t(, 2m()) O(log n) +t(/9, O(m()log n))

    (iv) The amortized update time can be divided byMusingM 1 processors in parallel.

    Proof. We first show how to construct an -coresetonly for the firstn itemsp1, , pn in the stream, fora given n 2, using space and update time as in (i)and(ii).

    Put = /(6log n) and denote m = m().The-coreset of the first 2m 1 items in the streamis simply their union. After the insertion of p2m

    we replace the first 2m items by their -coresetC1.The coreset forp1, , p2m+i is the union ofC1 and

    p2m+1, , p2m+i for every i = 1, , 2m. Whenp4m is inserted, we replacep2m+1, , p4m by their-coresetC2. Using the assumption of the theorem,we have|C1| + |C2| 2m. We can thus compute an-coreset of size at mostm for the union of coresetsC1 C2, and delete C1 andC2 from memory.

    We continue to construct a binary tree of coresetsas in Fig.3. That is, every leaf and node contains at

    mostm items. Whenever we have two coresets in thesame level, we replace them by a coreset in a higherlevel. The height of the tree is bounded from abovebylog n. Hence, in every given moment during thestreaming ofn items, there are at most O(log n) -coresets in memory, each of size at most m. For any

    n, we can obtain a coreset for the first n points atthe moment after they the nth point was added bycomputing an (/3)-coresetC for the union of theseat mostO(log n) coresets in the memory.

    The approximation error of a coreset in the treewith respect to its leaves (the original items) isincreased by a multiplicative factor of (1+) in everytree level. Hence, the overall multiplicative error ofthe union of coresets in memory is (1+)log n. Usingthe definition of, we obtain

    (1 +)log n = (1 +

    6log n )log n e 6

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    15/20

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    16/20

    1. If cost(P) (1 + f1())k

    i=1cost(Pi) for allpartitionings P1, . . . , P k of P into k subsets,then there exists a coreset Z of size g(k, )such that for any set of k centers we have|cost(P, C) costc(Z, C)| cost(P, C). So, if an optimalk-clustering ofPis at most a (1+ )-

    factor cheaper than the best 1-clustering, thenthis must induce a coreset for P.

    2. If opt(P, f2(k)) f3()opt(P, k) for P P,then there exists a set Z of size h(f2(k), )such that for any set of centers C we have|cost(P, C) costc(Z, C)| cost(P, C).

    Algorithm.Given the above conditions, we use the following

    algorithm (the value ofis defined later), P is theinput point set:

    Partition(P, k)

    1. Let C1 V be a set with |C| = k thatsatisfies opt(P, k) = cost(P, C1). Consider thepartitioning definined byPc= {p P| n(cp) =mincCn(c p)} into k subsets (ties brokenarbitrarily).

    (a) If cost(P) (1+f1())

    cCcost(Pc), stop.

    (b) Else, if this is theth level of the recursion,still stop.If it is not, call Partition(Pc, k) for allc Cand then stop.

    Note that the first time that we reach step 2,P is the input point set, and thus opt(P, k) =opt(P, k) and

    cCcost(Pc)opt(P, k)/(1 +) =

    opt(P, k)/(1 + ). Let Q denote the set of allsubsets generated by the algorithm on level andfor which the algorithm stopped in line 1b. Onthe ith level of the recursion, the sum of all sets isopt(P, k)/(1 +f1())

    i. For =log1+f1() 1f3() ,this is smaller than f3() opt(P, k). Thus, we haveat most f(k) := k sets inQ, andUQcost(U)opt(P, k). By Condition2,this implies the existenceof a setZof sizeh(k, ) which has an error of at most

    opt(P, k).For all sets where we stop in step1a,Condition1

    directly gives a coreset of sizeg(k, ) forP. The unionof these coresets give a coreset for the union of all setswhich stoped in step 1a. Alltogether, they inducean error of less than opt(P, k). Together with the opt(P, k) error induced by the sets in Q, this gives atotal error of 2. So, if we start every thing with /2,we get a coreset forPwith error opt(P, k). The sizeof the coreset then is k g(k,/2) +h(k, /2).

    Lemma 11.1. If the cost function satisfies the abovetwo conditions, then there exists a coreset of sizekg(k,/2)+h(k, /2)for= log1+f1(/2) 1f3(/2).

    For k-means, we can use clustering features andachieve that g

    1 and h(k, ) = k. Thus, the

    overall coreset size is 2klog1+f1()

    1f3() . We do not

    present this in detail as the coreset is larger than thek-means coreset coming from our first construction.However, the proof can be deduced from the followingproof for -similar bregman divergences, as the k-means case is easier.

    11.1 Coresets for -similar Bregman diver-gences Let P be the input point set. Let d :S S R be a m-similar Bregman divergence, i. e.,dis defined on a convex set S Rd and there existsa Mahalanobis distance dB such that mdB(p, q)d(p, q) dB(p, q) for all points p, q R

    d

    and anm (0, 1] (note that we use m-similar instead of-similar in order to prevent confusion with the centroid). We need the additional restriction on the convexsetSthat for every pair p, qof points fromP,Scon-tains all points within a ball of radius (4/m) d(p, q)aroundp for a constant m, and we call such a set P-covering. Thus, in addition to the convex hull of thepoint set, a P-covering set may have to be larger bya factor dependent on m, and the diameter of P.Because of this additional restriction, our setting ismuch more restricted than in [6]. It is an interestingopen question how to remove this restriction or even

    relax the -similarity further.Notice that becausedB is a Mahalanobis distance

    there exists a regular matrix B with dB(x, y) =B(x y)2 for all points x, y Rn. In particular,m B(x y)2 d(x, y) B(x y)2. By[12],Bregman divergences (also if they are notm-similar)satisfy the Bregman version of equation (11.10), i.e.,

    pP

    d(p, z) =

    pP

    d(p, ) + |P|d(, z).(11.11)

    Condition 1. To show that Condition 1 holds, we

    setf1() = 1

    (1+ 4m )2 and assume that we are given a

    point set Sthat satisfies that for every partitioningofS into k subsets S1, . . . , S k it holds that

    sS

    d(s, (S)) (1 +f())k

    j=1

    xSj

    d(x, (Sj ))

  • 8/13/2019 Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Pojective Clustering

    17/20

    k

    j=1

    sS

    d(s, (Sj )) +k

    j=1

    |Sj |d((Sj ), (S))

    (1 +f())k

    j=1

    xSj

    d(x, (Sj ))

    k

    j=1

    |Sj |d((Sj ), (S))(4)

    1(1 + 4m )

    2

    kj=1

    xSj

    d(x, (Sj )).

    We show that this restricts the error of clusteringall points inSwith the same center, more specifically,with the centerc((S)), the center closest to(S). Todo so, we virtually add points to S. For every j =1, . . . , k, we add one point with weight 14m|Sj | withcoordinate (S) + 4m((S) (Sj)) to Sj. Noticethat this points lies within the convex setAthatdB is

    defined on because we assumed thatS is P-covering.The additional point shifts the centroid ofSj to(S)because

    |Sj | (Sj ) + m4|Sj |

    (S) + 4m((S) (Sj ))

    (1 + m4 )|Sj |

    =m4|Sj |

    (S) + 4m (S)

    (1 + m4 )|Sj |

    =(S).

    We name the set consisting ofSj together with theweighted added pointSj and the union of allS

    j isS

    .Now, clusteringS with centerc((S)) is certainly anupper bound for the clustering cost ofSwithc((S)).

    Additionally, when clusteringSj with only one center,thenc((S)) is optimal, so clusteringSj withc((Sj ))can only be more expensive. Thus, clustering allSjwith the centersc((Sj )) gives an upper bound on thecost of clusteringSwithc((S)). So, to complete theproof, we have to upper bound the cost of clusteringall Sj with the respective centers c((Sj )). We dothis by bounding the additional cost of clustering theadded points with c((Sj )), which is

    kj=1

    m

    4 |Sj | d

    (S) +

    4

    m ((S)

    (Sj )), c((Sj ))

    kj=1

    m

    4 |Sj | B((S) + 4

    m ((S)

    (Sj )) c((Sj )))2=a2

    for the k-dimensional vector a defined byaj :=

    m|Sj|/4B((S) + 4m((S) (Sj ))

    c((Sj ))). By the triangle inequality,aj

    m|Sj |/4B((1 + 4m )((S) (Sj ))) +

    m|Sj|/4B((Sj ) c((Sj ))) = bj + dj withbj =

    m|Sj|/4B((1 + 4m )((S)(Sj ))) and

    dj =

    m|Sj |/4B((Sj ) c((Sj ))). Then,

    a b +d b + d,where we use the triangle inequality again for thesecond inequality. Now we observe that

\[ \begin{aligned} \|b\|^2 &= \sum_{j=1}^{k} \frac{m}{4}|S_j|\, \left\|B\!\left(\left(1 + \tfrac{4}{m}\right)(\mu(S) - \mu(S_j))\right)\right\|^2 \;=\; \frac{m}{4} \sum_{j=1}^{k} |S_j| \left(1 + \frac{4}{m}\right)^2 \left\|B(\mu(S_j) - \mu(S))\right\|^2 \\ &\le \frac{m}{4}\left(1 + \frac{4}{m}\right)^2 \sum_{j=1}^{k} |S_j|\, \frac{1}{m}\, d(\mu(S_j), \mu(S)) \;\le\; \frac{\epsilon}{4} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)), \end{aligned} \]

where the last inequality uses the bound on Σ_j |S_j| d(μ(S_j), μ(S)) derived above.

Additionally, by the definition of m-similarity and by Equation (11.11) it holds that

\[ \|d\|^2 \;=\; \sum_{j=1}^{k} \frac{m}{4}|S_j|\, \left\|B(\mu(S_j) - c(\mu(S_j)))\right\|^2 \;\le\; \frac{1}{4} \sum_{j=1}^{k} |S_j|\, d(\mu(S_j), c(\mu(S_j))) \;\le\; \frac{\epsilon}{4} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)). \]

This implies that

\[ \|a\| \;\le\; \|b\| + \|d\| \;\le\; 2\sqrt{\frac{\epsilon}{4} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j))} \]

and thus

\[ \|a\|^2 \;\le\; \epsilon \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)). \]

This means that Condition 1 holds: if a k-clustering of S is not much cheaper than a 1-clustering, then assigning all points in S to the same center yields a (1 + ε)-approximation for arbitrary center sets. Thus, we can use a clustering feature to store S, which means that g(k, f_1(ε)) = 1.
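To make the notion of a clustering feature concrete, here is a minimal sketch (an assumed interface, not code from the paper): it stores only the weight, the centroid and the cost of S with respect to its centroid, and evaluates the cost of S against any center via identity (11.11). The example uses the squared Euclidean distance, which is a 1-similar Bregman divergence.

    import numpy as np

    class ClusteringFeature:
        """Stores |S|, mu(S) and the cost of S to mu(S) for a Bregman divergence d."""
        def __init__(self, points, d):
            self.d = d
            self.weight = len(points)
            self.centroid = points.mean(axis=0)
            self.inner_cost = sum(d(p, self.centroid) for p in points)

        def cost_to(self, z):
            # sum_{p in S} d(p, z) = sum_{p in S} d(p, mu(S)) + |S| * d(mu(S), z)
            return self.inner_cost + self.weight * self.d(self.centroid, z)

    sq_dist = lambda p, q: float(np.dot(p - q, p - q))  # squared Euclidean distance
    S = np.random.default_rng(2).normal(size=(200, 4))
    cf = ClusteringFeature(S, sq_dist)
    z = np.ones(4)
    assert abs(cf.cost_to(z) - sum(sq_dist(p, z) for p in S)) < 1e-6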


Condition 2. For the second condition, assume that 𝒮 is a set of subsets of P representing the f_2(k) subsets according to an optimal f_2(k)-clustering. Let a set C of k centers be given, and define the partitioning S_1, …, S_k for every S ∈ 𝒮 according to C as above. By Equation (11.11) and by the precondition of Condition 2,

\[ \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, d(\mu(S_j), \mu(S)) \;=\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S)) \;-\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)) \;\le\; f_3(\epsilon) \cdot \operatorname{opt}(P, k). \]

We use the same technique as in the proof that Condition 1 holds. There are two changes: first, there are |𝒮| sets whose subsets' centroids must be moved to the centroid of the respective S (whereas in the above proof, we only had one set S); second, the bound depends on opt(P, k) instead of the clustering cost of the sets in 𝒮, so the approximation guarantee is stated in terms of opt(P, k) as well, but this is consistent with the statement of Condition 2. The complete proof that Condition 2 holds can be found in the appendix; it is very similar to the proof of Condition 1, with minor changes due to these two differences. We can even set f_3(ε) = f_1(ε).

Theorem 11.1. If d : S × S → R is an m-similar Bregman divergence on a convex and P-covering set S with m ∈ (0, 1], then there exists a coreset consisting of clustering features whose size is constant, i.e., the size depends only on k and ε.

Proof. We have seen that the two conditions hold with f_1(ε) = f_3(ε) = ε/(1 + 4/m)², g ≡ 1 and h(k', ε) = k'. By Lemma 11.1, this implies a coreset size of

\[ 2^{k'} \;=\; 2^{\,k \log_{1 + f_1(\epsilon/2)} \frac{1}{f_3(\epsilon/2)}}, \qquad \text{where } f_1(\epsilon/2) = f_3(\epsilon/2) = \frac{\epsilon}{2\,(1 + 4/m)^2}. \]
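For intuition about this bound, the exponent k' can be evaluated directly; the sketch below does so with f_1(ε/2) = f_3(ε/2) = ε/(2(1 + 4/m)²) as used above (the chosen values of k, ε and m are illustrative, and the constants are not optimized).

    import math

    def coreset_size_exponent(k, eps, m):
        # k' = k * log_{1 + f_1(eps/2)} (1 / f_3(eps/2)) with f_1 = f_3 as above
        f = (eps / 2.0) / (1.0 + 4.0 / m) ** 2
        return k * math.log(1.0 / f, 1.0 + f)

    # The coreset consists of at most 2**k' clustering features; the bound
    # depends only on k, eps and m -- not on n or the dimension d.
    print(coreset_size_exponent(k=2, eps=0.5, m=1.0))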

Acknowledgements. The project CG Learning acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 255827.

This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project A2.

    References

[1] Community cleverness required. Nature, 455(7209): 1, 4 September 2008. doi:10.1038/455001

[2] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, volume 13, pp. 556-562, 2001.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, pp. 993-1022, 2003.

[4] IBM: What is big data? Bringing big data to the enterprise. ibm.com/software/data/bigdata/, accessed on the 3rd of October 2012.

[5] Sandia sees data management challenges spiral. HPC Projects, 4 August 2009.

[6] M. Ackermann and J. Blömer. Coresets and approximate clustering for Bregman divergences. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1088-1097, 2009.

[7] M. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms, 6(4), 2010.

[8] M. W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2): 123-224, 2011, Now Publishers Inc.

[9] M. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, and K. Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 173-187, 2010.

[10] P. Agarwal, S. Har-Peled, and K. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51(4): 606-635, 2004.

[11] J. L. Bentley and J. B. Saxe. Decomposable searching problems I: Static-to-dynamic transformation. Journal of Algorithms, 1(4): 301-358, 1980.

[12] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, volume 6: 1705-1749, 2005.

[13] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pp. 396-407, 2002.

[14] M. Beyer. Gartner Says Solving Big Data Challenge Involves More Than Just Managing Volumes of Data. Gartner. Retrieved 13 July 2011. http://www.gartner.com/it/page.jsp?id=1731916.

[15] K. Chen. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM Journal on Computing, 39(3): 923-947, 2009.

[16] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix Approximation and Projective Clustering via Volume Sampling. Theory of Computing, 2(1): 225-247, 2006.

[17] A. Deshpande and K. Varadarajan. Sampling-based dimension reduction for subspace approximation. Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), pp. 641-650, 2007.

    [18] P. Drineas, A. Frieze, R. Kannan, S. Vempala,


V. Vinay. Clustering large graphs via the Singular Value Decomposition. Machine Learning, 56(1-3): 9-33, 2004.

[19] D. Feldman, A. Fiat, and M. Sharir. Coresets for Weighted Facilities and Their Applications. Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 315-324, 2006.

[20] D. Feldman, M. Monemizadeh, C. Sohler, and D. Woodruff. Coresets and Sketches for High Dimensional Subspace Approximation Problems. Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 630-649, 2010.

[21] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), pp. 569-578, 2011.

[22] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. Proceedings of the 23rd Annual ACM Symposium on Computational Geometry, pp. 11-18, 2007.

[23] D. Feldman and L. Schulman. Data reduction for weighted and outlier-resistant clustering. Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1343-1354, 2012.

[24] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pp. 209-217, 2005.

[25] G. Frahling and C. Sohler. A Fast k-Means Implementation Using Coresets. International Journal of Computational Geometry and Applications, 18(6): 605-625, 2008.

[26] S. Har-Peled. How to Get Close to the Median Shape. Computational Geometry: Theory and Applications, 36(1): 39-51, 2007.

[27] S. Har-Peled. No Coreset, No Cry. Proceedings of the 24th Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pp. 324-335, 2004.

[28] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Proceedings of the 21st Annual ACM Symposium on Computational Geometry, pp. 126-134, 2005.

[29] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 291-300, 2004.

[30] J. Hellerstein. Parallel Programming in the Age of Big Data. Gigaom Blog, 9 November 2008.

[31] M. Hilbert and P. Lopez. The World's Technological Capacity to Store, Communicate, and Compute Information. Science, 332(6025): 60-65, 2011.

[32] A. Jacobs. The Pathologies of Big Data. ACM Queue, 6 July 2009.

[33] D. Feldman, A. Krause, and M. Faulkner. Scalable Training of Mixture Models via Coresets. Proceedings of the 25th Conference on Neural Information Processing Systems (NIPS), 2011.

[34] J. Matousek. On Approximate Geometric k-Clustering. Discrete & Computational Geometry, 24(1): 66-84, 2000.

[35] M. Langberg and L. J. Schulman. Universal ε-approximators for Integrals. Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 598-607, 2010.

[36] K. Varadarajan and X. Xiao. A near-linear algorithm for projective clustering integer points. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2012.

[37] O. J. Reichman, M. B. Jones, and M. P. Schildhauer. Challenges and Opportunities of Open Data in Ecology. Science, 331(6018): 703-705, doi:10.1126/science.1197962, 2011.

[38] T. Segaran and J. Hammerbacher. Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media, p. 257, 2009.

[39] N. Shyamalkumar and K. Varadarajan. Efficient subspace approximation algorithms. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 532-540, 2007.

[40] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

    A Complete Proof for Condition 2.

For the second condition, assume that 𝒮 is a set of subsets of P representing the f_2(k) subsets according to an optimal f_2(k)-clustering. Let a set C of k centers be given, and define the partitioning S_1, …, S_k for every S ∈ 𝒮 according to C as above. By Equation (11.11) and by the precondition of Condition 2,

\[ \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, d(\mu(S_j), \mu(S)) \;=\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S)) \;-\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)) \;\le\; f_3(\epsilon) \cdot \operatorname{opt}(P, k). \]

We use the same technique as in the proof that Condition 1 holds, with two changes: first, there are |𝒮| sets whose subsets' centroids must be moved to the centroid of the respective S (whereas in the above proof, we only had one set S); second, the bound depends on opt(P, k) instead of the clustering cost of the sets in 𝒮, so the approximation guarantee is stated in terms of opt(P, k) as well, which is consistent with the statement of Condition 2.

We set f_3(ε) = f_1(ε) and again virtually add points. For each S ∈ 𝒮 and each subset S_j of S, we add a point with weight (m/4)|S_j| and coordinate μ(S) + (4/m)(μ(S) − μ(S_j)) to S_j.


Notice that these points lie within the convex set on which d_B is defined because we assumed that this set is P-covering.

We name the new sets S'_j, S' and 𝒮'. Notice that the centroid of S'_j is now

\[ \frac{|S_j|\,\mu(S_j) + \frac{m}{4}|S_j|\left(\mu(S) + \frac{4}{m}(\mu(S) - \mu(S_j))\right)}{\left(1 + \frac{m}{4}\right)|S_j|} \;=\; \mu(S) \]

in all cases. Again, clustering S' with c(μ(S)) is an upper bound for the clustering cost of S with c(μ(S)), and because the centroid of S'_j is μ(S), clustering every S'_j with c(μ(S_j)) is an upper bound on clustering S' with c(μ(S)). Finally, we have to upper bound the cost of clustering all S'_j in all S' with c(μ(S_j)), which we again do by bounding the additional cost incurred by the added points. Adding this cost over all S yields

\[ \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{m}{4}|S_j|\; d\!\left(\mu(S) + \tfrac{4}{m}(\mu(S) - \mu(S_j)),\, c(\mu(S_j))\right) \;\le\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{m}{4}|S_j|\, \left\|B\!\left(\mu(S) + \tfrac{4}{m}(\mu(S) - \mu(S_j)) - c(\mu(S_j))\right)\right\|^2 \;=\; \|a\|^2. \]

For the last equality, we define |𝒮| vectors a^S by a^S_j := √(m|S_j|/4) · ‖B(μ(S) + (4/m)(μ(S) − μ(S_j)) − c(μ(S_j)))‖ and concatenate them in an arbitrary but fixed order to get a (k·|𝒮|)-dimensional vector a. By the triangle inequality, a^S_j ≤ √(m|S_j|/4) · ‖B((1 + 4/m)(μ(S) − μ(S_j)))‖ + √(m|S_j|/4) · ‖B(μ(S_j) − c(μ(S_j)))‖ = b^S_j + d^S_j, with b^S_j = √(m|S_j|/4) · ‖B((1 + 4/m)(μ(S) − μ(S_j)))‖ and d^S_j = √(m|S_j|/4) · ‖B(μ(S_j) − c(μ(S_j)))‖. Define b and d by concatenating the vectors b^S and d^S, respectively, in the same order as used for a. Then we can again conclude that

\[ \|a\| \;\le\; \|b + d\| \;\le\; \|b\| + \|d\|, \]

where we use the triangle inequality for the second inequality. Now we observe that

\[ \begin{aligned} \|b\|^2 &= \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{m}{4}|S_j|\, \left\|B\!\left(\left(1 + \tfrac{4}{m}\right)(\mu(S) - \mu(S_j))\right)\right\|^2 \;=\; \frac{m}{4} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j| \left(1 + \frac{4}{m}\right)^2 \left\|B(\mu(S_j) - \mu(S))\right\|^2 \\ &\le \frac{m}{4}\left(1 + \frac{4}{m}\right)^2 \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, \frac{1}{m}\, d(\mu(S_j), \mu(S)) \;\le\; \frac{\epsilon}{4}\, \operatorname{opt}(P, k). \end{aligned} \]

Additionally, by the definition of m-similarity and by Equation (11.11) it holds that

\[ \|d\|^2 \;=\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{m}{4}|S_j|\, \left\|B(\mu(S_j) - c(\mu(S_j)))\right\|^2 \;\le\; \frac{1}{4} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, d(\mu(S_j), c(\mu(S_j))) \;\le\; \frac{\epsilon}{4} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d(x, \mu(S_j)) \;\le\; \frac{\epsilon}{4}\, \operatorname{opt}(P, k). \]

This implies that

\[ \|a\| \;\le\; \|b\| + \|d\| \;\le\; 2\sqrt{\frac{\epsilon}{4}\, \operatorname{opt}(P, k)} \]

and thus

\[ \|a\|^2 \;\le\; \epsilon \cdot \operatorname{opt}(P, k). \]