
Page 1: K-means: Select k centroids. Assign pts to closest centroid. Calc new centroids (mean, vom,...). 1 Iterate until stop_cond. PK1-means: (One scan - to reassign

K-means: Select k centroids. Assign pts to closest centroid. Calc new centroids (mean, vom, ...). Iterate until stop_cond.

PK1-means: (One scan - to reassign pts to centroids). Same as above but means calculated without scanning.

PK0-means: (Zero scans) Same as above, but both clusters and means are calculated without scanning.
1. Pick K centroids, {Ci} i=1..K.
2. Calc SPTreeSet, Di = D(X,Ci) (col of distances from all x to Ci), to get P(Di≤Dj), i<j (the predicate is dis(x,Ci) ≤ dis(x,Cj)).
4. Calculate the mask-pTrees for the clusters as follows:
PC1 = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤DK)
PC2 = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤DK) & ~PC1
PC3 = P(D3≤D4) & ... & P(D3≤DK) & ~PC1 & ~PC2
...
PCK = ~PC1 & ~PC2 & ... & ~PCK-1
5. Calculate the new centroids, Ci = Sum(X&PCi)/count(PCi).
6. If stop_cond=false, start the next iteration with the new centroids.
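A minimal sketch of steps 2–5, modeling each mask pTree as a numpy boolean array (one bit per point); the data, centroids, and Euclidean distance below are illustrative, not from a pTree library:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])
C = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])   # K=3 initial centroids
K = len(C)

# Step 2: D_i = column of distances from every x to C_i
D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # shape (n, K)

# Step 4: cluster mask pTrees PC_i (ties broken toward the lower index)
PC = []
for i in range(K):
    mask = np.ones(len(X), dtype=bool)
    for j in range(i + 1, K):
        mask &= D[:, i] <= D[:, j]          # P(D_i <= D_j)
    for prev in PC:
        mask &= ~prev                       # & ~PC_1 & ... & ~PC_{i-1}
    PC.append(mask)

# Step 5: new centroids C_i = Sum(X & PC_i) / count(PC_i)
newC = np.array([X[m].mean(axis=0) for m in PC if m.any()])
```

The real method replaces the `D` computation and the `&`/`~` loops with pTree operations, so no horizontal scan of X is needed.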

Note: In PK0 step 2 above, Md's 2's-complement formulas can be used to get the mask pTrees, P(Di≤Dj). FAUST (midpoint or ?, which uses Md's dot-product formulas) can also be used. Is one faster than the other?

PKL-means: (P-K-Less means, pronounced "pickle means") For all K:
4'. Calculate cluster mask pTrees. For K=2..n,
PC1K = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤DK)
PC2K = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤DK) & ~PC1
...
PCK = P(X) & ~PC1 & ... & ~PCK-1
6'. If ∃k s.t. stop_cond=true, stop and choose that k; else start the next iteration with these new centroids.
3.5'. Continue with certain k's only (e.g., top t? Top means?
a. Sum of cluster diams (use max, min of D(Clusteri, Cj), or D(Clusteri, Clusterj)).
b. Sum of diams of cluster gaps (use D(listPCi, Cj) or D(listPCi, listPCj)).
c. other?

In 4', 1st round, pick n2 centroids and find all PChK? (e.g., for K=2, find PCh2, h=n1..n2), then on the PCh2's do it again...

PKLD-means: (P-K-Less Divisive, pronounced "pickled means") 1. PKL-means with K=2. 2. Repeat 1 on each cluster until stop_cond=true on each branch.

Page 2:

Additional thoughts: Calculate PTreeSet(dis(x, X−x)), then sort descending; this gives a full ranking of singleton-outlier-ness. Or take the global medoid, C, and increase r until ct(dis(x, Disk(C,r))) > ct(X)−n, then declare the complement to be outliers. Or loop thru all pts, x, one at a time; the algorithm is O(n) (vs. O(n²) for horizontal: for each x, find dis(x,y) ∀y≠x, O(n(n−1)/2) = O(n²)). Or predict C so it is not X−x but a fixed subset? Or create a 3-column "distance table", DIS(x, y, d(x,y)) (limit it to only those distances < thresh?), where dis(x,y) is a PTreeSet of those distances. If we have DIS as a PTreeSet both ways, we have one for "y-pTrees" and another for "x-pTrees". [Sketch: triangular distance matrix, x's across, y's down.] The y's close to x are in its cluster. If that cluster is small, and the next larger d(x,y) is large, the x-cluster members are outliers.
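The singleton-outlier ranking above can be sketched horizontally (the O(n²) version; the pTree version avoids the scan). The data is illustrative:

```python
import numpy as np

# Score each point by dis(x, X - x): distance to its nearest other point.
# Sorting the scores descending ranks points by singleton-outlier-ness.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # all pairwise distances
np.fill_diagonal(D, np.inf)                           # exclude d(x, x)
score = D.min(axis=1)                                 # dis(x, X - x)
ranking = np.argsort(-score)                          # most outlier-ish first
```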

Tidying up each round? Fusion: Check for clusters that should be fused, and fuse them (decreasing k):
1. Fuse any empty cluster with another and reduce k (this is probably assumed in all k-means methods, since an empty cluster has no mean).
2. For some a>1, if max(D(CLUSTi,Cj)) < a*D(Ci,Cj) and max(D(CLUSTj,Ci)) < a*D(Ci,Cj), fuse CLUSTi and CLUSTj. Is avg better?
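Fusion test 2 can be sketched as follows; `a` is a hypothetical tuning constant and the clusters are illustrative:

```python
import numpy as np

# Fuse CLUST_i and CLUST_j when every point of each cluster lies within
# a * d(C_i, C_j) of the OTHER cluster's centroid (a > 1, user-chosen).
def should_fuse(clust_i, clust_j, Ci, Cj, a=1.2):
    dCC = np.linalg.norm(Ci - Cj)
    return (np.linalg.norm(clust_i - Cj, axis=1).max() < a * dCC and
            np.linalg.norm(clust_j - Ci, axis=1).max() < a * dCC)

A = np.array([[0.0, 0.0], [0.2, 0.0]])
B = np.array([[1.0, 0.0], [1.2, 0.0]])
fuse = should_fuse(A, B, A.mean(axis=0), B.mean(axis=0))
```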

Fission: Split a cluster (increasing k) if a. its mean and vom are quite far apart, or b. the cluster is quite sparse. Sparsity measure? max(D(CLUS,C))/count(CLUS). (Pick fission centroid y at max distance from C (use max D(x,C)). Pick z at max distance from y. (y and z are close to diametric opposites in the cluster.))

If the focus is on outliers or anomalies, the following variations might be useful:

PKLD2-means: 1. Start with K=2. 2. Each round, if the fission_condition=true (a or b above, or ???), pick y and z as above. Then pick the new fission centroids to be the points on the y-z line 1/8 and 7/8 of the way from y to z (or some other fraction).

PKLD1-means: 1. Start with K=1. 2. Each round, if the fission_condition=true (a or b above, or ???), pick y and z as above. Then pick the new fission centroids to be the points on the y-z line 1/8 and 7/8 of the way from y to z (or some other fraction).
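The fission-centroid choice shared by the two variants above can be sketched as follows (the helper name and the sample cluster are illustrative):

```python
import numpy as np

# y = farthest point from the centroid C, z = farthest point from y
# (near-diametric opposites); new centroids sit frac and 1-frac of the
# way along the y-z line.
def fission_centroids(X, C, frac=1/8):
    y = X[np.argmax(np.linalg.norm(X - C, axis=1))]
    z = X[np.argmax(np.linalg.norm(X - y, axis=1))]
    return y + frac * (z - y), y + (1 - frac) * (z - y)

X = np.array([[0.0, 0.0], [1.0, 0.0], [7.0, 0.0], [8.0, 0.0]])
c1, c2 = fission_centroids(X, X.mean(axis=0))   # C = (4, 0)
```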

Page 3:

1. no clusters determined yet.

[Figure: 2-D scatter on dim1, dim2 with points 0, r1, r2, r3, v1, v2, v3, v4, and O.]

Alg1: Choose a dim. Get 3 clusters, {r1,r2,r3,O}, {v1,v2,v3,v4}, {0}, by: 1.a: when d(mean,median) > c*width, declare a cluster. 1.b: Same alg on subclusters.
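The 1.a trigger can be sketched on one projection; the constant c and the sample points are hypothetical:

```python
import numpy as np

# Declare a cluster split on this dim when the mean-median gap exceeds
# c * width of the projection (the cluster is taken on the median side
# of the mean).
def gap_on_dim(X, dim, c=0.2):
    proj = X[:, dim]
    width = proj.max() - proj.min()
    gap = abs(proj.mean() - np.median(proj))
    return gap > c * width

X = np.array([[0.0, 0], [1, 0], [2, 0], [10, 0], [11, 0]])
```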


Declare {r1,r2,r3,O}


Declare {0,v1} or {v1,v2}? Take {v1,v2} (on the median side of the mean). This makes {0} a cluster (an outlier, since it's a singleton). Continuing with {v1,v2}:


Declare {v1,v2,v3,v4}. Have to loop, but not on next m projs if close?

Oblique: grid of oblique dir_vects, e.g., for 3D, a DirVect from each PTM triangle. With projections onto those lines, do 1 or 2 above. Order = any sphere grid: Sn ≡ {x=(x1,...,xn) ∈ Rn | Σ xi² = 1}, polar coords.

Can skip doubletons, since the mean is always the same as the median.

Alg3: Calc mean and vom. Do 1a or 1b on line connecting them. Repeat on each cluster, use another line? Adjust proj lines, stop cond

Alg2: 2.a density > Density_Thresh, declare (density≡count/size).

Alg4: Project onto the mean-vom line; mn=(6.3,5.9), vom=(6,5.5) ((11,10) = outlier). 4.b: perp line?

[Figure: 2-D scatter (dim1, dim2) of points (4,9), (2,8), (5,8), (4,6), (3,4), (10,5), (9,4), (8,3), (7,2), outlier (11,10); mean (6.3,5.9) and vom (6,5.5) marked.]

Lexicographical polar coords? 180^n too many? Use e.g. 30-degree units, giving 6^n vectors for dim=n. Attribute relevance is important! Alg1-2: Use the 1st criteria to trigger, from 1.a, 2.a, declaring clusters.

MASTERMINE (Medoid-based Affinity Subset deTERMINEr)

[Figure: 3-D point set (points such as (4,3,5), (5,2,4), ..., (9,2,4), (7,5,2)); mean=(8.18, 3.27, 3.73), vom=(7,4,3) marked.]

2. (9,2,4) determined as an outlier cluster.

3. Use the red dim line; (7,5,2) is an outlier cluster. The maroon pts are determined to be a cluster, the purple pts too. 3.a: If we used mean-vom again, would the same be determined?

Other option? Use a p-Kmeans approach. Could use K=2 and divisive (using a GA mutation at various times to get us off a non-convergent track)?

Notes: Each round, reduce dim by one (a lower bound on the loop). Each round, we just need a good line (in the remaining hyperplane) to project the cluster (so far).
1. Pick the line thru the proj'd mean and vom (vom is dependent on the basis used; better way?)
2. Pick the line thru the longest diameter? (or a diam > 1/2 the previous diam?)
3. Try a direction vector, then hill-climb it in the direction of increase in diam of the proj'd set.

Page 4:

[Figure: class-r and class-v points projected on the d-line, with means mR and mV marked.]

FAUST Oblique (our best classifier?)

Separate class R using the midpoint-of-means method: Calc a:
D ≡ mRmV (the vector from mR to mV), d = D/|D|
(mR + (mV − mR)/2) ∘ d = a = ((mR + mV)/2) ∘ d (works also if D = mVmR)
PR = P(X ∘ dR < aR): 1 pass gives the classR pTree.
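The midpoint-of-means cut can be sketched with numpy; the toy classes R and V are illustrative:

```python
import numpy as np

# Two labeled training classes (toy data).
R = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
V = np.array([[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])

mR, mV = R.mean(axis=0), V.mean(axis=0)
d = (mV - mR) / np.linalg.norm(mV - mR)    # unit vector from mR to mV
a = (mR + mV) / 2 @ d                      # a = ((mR + mV)/2) o d

# One dot product + one compare classifies all unclassified points
# at once: the boolean mask plays the role of the class-R pTree.
Xtest = np.array([[1.5, 1.5], [8.5, 8.5]])
classR = Xtest @ d < a                     # P(X o d < a)
```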

Training ≡ placing cut-hyper-plane(s) (CHP) (= an (n−1)-dim hyperplane cutting space in two). Classification is 1 horizontal program (AND/OR) across the pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at-a-time). Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use the:

1. vector_of_medians, vom, to represent each class, rather than the mean mV, where vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...)
2. midpt_std, vom_std methods: project each class on the d-line; then calculate the std (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).

[Figure: classes R and V on dim1, dim2 with vomR and vomV marked; the std of distances (vod) from the origin is measured along the d-line.]

Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data to find a and d (one time), then apply the formula to the test data (as pTrees).

Page 5:

APPENDIX: The PTreeSet Genius for Big Data. Big Vertical Data: the PTreeSet (Dr. G. Wettstein's) is perfect for BVD! (pTrees both horiz and vert)

PTreeSets include methods for horizontal querying and vertical DM, multihopQuery/DM, and XML. T(A1...An) is a PTreeSet data structure = a bit matrix with (typically) each numeric attr converted to fixed-pt(?) (negs?) and bit-sliced (pt_pos schema), and each category attr bitmapped; or coded then bitmapped; or num-coded then bit-sliced (or kept as-is, i.e., a char(25) NAME col stored outside the PTreeSet?). With A1..Ak numeric w bitwidths bw1..bwk and Ak+1..An categorical w counts cck+1..ccn, the PTreeSet is the bit matrix:

[Bit-matrix sketch: row numbers 1..N down; bit columns A1,bw1, A1,bw1-1 ... A1,0, then A2,bw2 ..., then bitmap columns Ak+1,c1 ... An,ccn across; each cell a 0 or 1.]

Methods for this data structure can provide fast horizontal row access, e.g., an FPGA could (with zero delay) convert each bit-row back to the original data row.

Methods already exist to provide vertical (level-0 or raw pTree) access.

Any level-1 PTreeSet can be added: given any row partition (e.g., equiwidth=64 row intervalization) and a row predicate (e.g., ≥50% 1-bits).

Add "level-1 only" DM methods, e.g., an FPGA converts unclassified rowsets to equiwidth=64, 50% level-1 pTrees; then the entire batch would be FAUST-classified in one horizontal program. Or level-1 pCKNN.
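Building a level-1 pTree from a raw (level-0) bit column can be sketched as follows; the interval width and predicate threshold match the equiwidth=64, 50% example above, and the zero-padding of a ragged last interval is an assumption:

```python
import numpy as np

# Partition the rows into equal-width intervals and set the level-1 bit
# when the interval satisfies the predicate (here: >= 50% 1-bits).
def level1_ptree(bits, width=64, threshold=0.5):
    pad = (-len(bits)) % width              # pad last interval with 0s
    padded = np.concatenate([bits, np.zeros(pad, dtype=bits.dtype)])
    intervals = padded.reshape(-1, width)   # one row per interval
    return intervals.mean(axis=1) >= threshold

raw = np.concatenate([np.ones(64, dtype=np.uint8), np.zeros(64, dtype=np.uint8)])
lev1 = level1_ptree(raw)
```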

[Level-1 bit-matrix sketch: interval numbers 1..roof(N/64) down; same column layout A1,bw1 ... A1,0, A2,bw2 ..., Ak+1,c1 ... An,ccn across.]

pDGP (pTree Darn Good Protection): permute the col order (the permutation = the key). A random pre-pad for each bit-column would make it impossible to break the code by simply focusing on the first bit row.

Relationships (rolodex cards) are 2 PTreeSets: AHGPeoplePTreeSet (shown) and AHGBasePairPositionPTreeSet (a rotation of the one shown). Vertical Rule Mining, Vertical Multi-hop Rule Mining, and Classification/Clustering methods (viewing AHG as either a People table (cols=BPPs) or as a BPP table (cols=People)). MRM and Classification done in combination? Any table is a relationship between row and column entities (heterogeneous entities) - e.g., an image = a [reflectance-labelled] relationship between the pixel entity and the wavelength-interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).

More security?: all pTrees same (max) depth, and intron-like pads randomly interspersed...

Most bioinformatics done so far is not really data mining but is more toward the database-querying side (e.g., a BLAST search). A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair-positions (ordered by chromosome first, then gene region?).

AHG [THG/GHG/CHG] is relationship between People and adenine(A) [thymine(T)/guanine(G)/cytosine(C)] (1/0 for yes/no)

Order bpp? By chromosome and by gene or region (level2 is chromosome, level1 is gene within chromosome.) Do it to facilitate cross-organism bioinformatics data mining?

Create both People- and BPP-PTreeSets with a human health records feature table (a training set for classification and multi-hop ARM), and a comprehensive decomp (ordering of bpps) for cross-species genomic DM. If there are separate PTreeSets for each chromosome (even each region - gene, intron, exon...), then we may be able to data-mine horizontally across all of these vertical pTrees.

[Figure: AHG(P,bpp) bit matrix - people P=1..7B down, bpp=1..3B across (grouped by chromosome, then gene), with a feature table (pc bc lc cc pe age ht wt) alongside.]

The red person's features are used to define the classes; the AHGp pTrees are for data mining. We can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, overall, or anything else.

Page 6:

A facebook Member, m, purchases Item, x, tells all friends. Let's make everyone a friend of him/her self. Each friend responds back with the Items, y, she/he bought and liked.

Facebook-Buys:

F ≡ Friends(M,M), members 1-4 down, members 4 3 2 1 across:
0 1 1 1
1 0 1 1
0 1 1 0
1 1 0 1

P ≡ Purchase(M,I), members 1-4 down, items 2 3 4 5 across:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

X⊆I: MX ≡ &x∈X Px = the people that purchased everything in X.
FX ≡ ORm∈MX Fm = the friends of some MX person.
So, for X={x}: is "Mx purchases x" strong? Mx = ORm∈Px Fm is frequent if Mx is large. This is a tractable calculation: take one x at a time and do the OR. Mx = ORm∈Px Fm is confident if ct(Mx & Px)/ct(Mx) > minconf.


K2 = {1,2,4}, P2 = {2,4}, ct(K2) = 3, ct(K2&P2)/ct(K2) = 2/3.

To mine X, start with X={x}. If it is not confident, then no superset is. Closure: X={x,y} for x and y forming confident rules themselves...
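The Mx/confidence calculation can be sketched with plain Python sets, reading the friend rows off the Friends matrix above (row i = friends of member i, with everyone a friend of him/herself) and the item columns off the Purchase matrix:

```python
# F[m] = friends of member m; buyers[x] = P_x, the members who bought item x.
F = {1: {1, 2, 3}, 2: {1, 2, 4}, 3: {2, 3}, 4: {1, 3, 4}}
buyers = {2: {2, 4}, 3: {3}, 4: {1, 4}, 5: {2, 4}}

def confidence(x):
    Px = buyers[x]
    Mx = set().union(*(F[m] for m in Px))   # M_x = OR_{m in P_x} F_m
    return len(Mx & Px) / len(Mx)           # ct(M_x & P_x) / ct(M_x)

c2 = confidence(2)
```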

ct(ORm∈Px Fm & Px) / ct(ORm∈Px Fm) > mncnf
Kx = ORg∈ORb∈PxFb Ogx, frequent if Kx is large (tractable - one x at a time and OR).

Kiddos/Buddies/Groupies version: F ≡ Friends(K,B) and P ≡ Purchase(B,I) as above, plus Others(G,K), groupies 1-4 down, kiddos 4 3 2 1 across:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

K2 = {1,2,3,4}, P2 = {2,4}, ct(K2) = 4, ct(K2&P2)/ct(K2) = 2/4.


A Fcbk buddy, b, purchases x and tells his friends; each friend tells all of his friends. Strong purchase possibility? Intersect rather than union (AND rather than OR), and send the ad to friends of friends.

Same F ≡ Friends(K,B) and P ≡ Purchase(B,I) as above, plus Compatriots(G,K), groupies 1-4 down, kiddos 4 3 2 1 across:
0 0 1 0
1 0 0 1
0 1 0 0
1 0 1 1

K2 = {2,4}, P2 = {2,4}, ct(K2) = 2, ct(K2&P2)/ct(K2) = 2/2.


Page 7:

The Multi-hop Closure Theorem A hop is a relationship, R, hopping from entities E to F.

A condition is downward [upward] closed if, when it is true of A, it is true for all subsets [supersets], D, of A.

Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies to frequency and confidence.

A pTree, X, is said to be "covered by" a pTree, Y, if for every 1-bit in X, there is a 1-bit at that same position in Y.

Lemma-0: For any two pTrees, X and Y, X&Y is covered by X, and thus ct(X&Y) ≤ ct(X) and list(X&Y) ⊆ list(X). Proof-0: ANDing with Y may zero some of X's ones, but it will never change any zeros to ones.

Lemma-1: Let A⊆D; then &a∈A Xa covers &a∈D Xa.

Proof-1&2: Let Z = &a∈D−A Xa; then &a∈D Xa = Z & (&a∈A Xa). Lemma-1 now follows from Lemma-0, as does

Lemma-2: Let A⊆D; then &c∈list(&a∈D Xa) Yc covers &c∈list(&a∈A Xa) Yc.

(Set D' = list(&a∈A Xa) and A' = list(&a∈D Xa); then A'⊆D' by Lemma-1, and applying Lemma-1 to A', D' gives Lemma-2.)

Lemma-3: Let A⊆D; then &e∈list(&c∈list(&a∈A Xa) Yc) We covers &e∈list(&c∈list(&a∈D Xa) Yc) We.
Proof-3: Lemma-3 follows in the same way from Lemma-1 and Lemma-2. Continuing this establishes: if there

are an odd number of nested &'s, then the expression with D is covered by the expression with A. Therefore the count with D is ≤ the count with A. Thus, if the frequency expression and the confidence expression are > threshold for D, then the same is true for every subset A. This establishes downward closure.

Exactly analogously, if there are an even number of nested &'s we get the upward closures.
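Lemma-0 can be checked directly on small bit masks (pTrees modeled here as Python integers, which is an illustrative encoding only):

```python
# X & Y is covered by X: every 1-bit of X&Y is a 1-bit of X,
# so ct(X & Y) <= ct(X).
X = 0b101101
Y = 0b100110

XY = X & Y
covered = (XY | X) == X          # coverage test: OR-ing back changes nothing
ct_XY = bin(XY).count("1")       # ct(X & Y)
ct_X = bin(X).count("1")         # ct(X)
```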

Page 8:

[Figure: the level-1 LLRR peel pattern on the triangulated sphere, and the level-2 and level-3 patterns built from it.]

Level-2 follows the level-1 LLRR pattern with another LLRR pattern.


For PTM, Peel from south to north pole along quadrant great circles and the equator.

Level-3 follows level-2 with LR when level-2 pattern is L and RL when level-2 pattern is R


What ordering is best for spherical data (e.g., astronomical bodies on the celestial sphere, sharing an origin and equatorial plane with earth, and with no radius)? The Hierarchical Triangle Mesh (HTM) orders equilateral triangulations (recursively). The Ptree Triangle Mesh (PTM) does also. (RA = Right Ascension; dec = declination)

[Figure: HTM sub-triangle ordering (1; 1,0; 1,1; 1,2; 1,3; 1,1,0 ... 1,1,3; ...) vs. the PTM_LLRR_LLRR_LR... ordering.]

Mathematical theorem: ∀n, ∃ an n-sphere-filling (n−1)-sphere?

Corollary: ∃ a sphere-filling circle (a 2-sphere-filling 1-sphere).

Proof-2: Let Cn ≡ the level-n circle; then C ≡ lim(n→∞) Cn is a circle which fills the 2-sphere!

Proof: Let x be any point on the 2-sphere.
distance(x, Cn) ≤ sidelength (= diameter) of the level-n triangles.
sidelength(n+1) = ½ · sidelength(n).
d(x,C) ≡ lim d(x,Cn) ≤ lim sidelength(n) = sidelength(1) · lim ½^(n−1) = 0.

See 2012_05_07 notes for level-4 circle.

Page 9:

Mark Silverman: I start randomly - it converges in 3 cycles. Here I increase k from 3 to 5. The 5th centroid could not find a member (at 0,0); the 4th centroid picks up 2 points that look remarkably anomalous. Treeminer, Inc. (240) 389-0750

WP: Start with large k? Each round, "tidy up" by fusing pairs of clusters when max(P(dis(CLUSi, Cj))) < dis(Ci, Cj) and max(P(dis(CLUSj, Ci))) < dis(Ci, Cj)? Eliminate empty clusters and reduce k. (Is avg better than max in the above?) Mark: Curious about one other state it converges to. It seems like when we exceed the optimal k, there is some instability.

WP: Tidying up would fuse Series4 and Series3 into Series34, then calc centroid34. Next, fuse Series34 and Series1 into Series134, and calc centroid134.

Also?: Each round, split a cluster (create a 2nd centroid) if its mean and vector_of_medians are far apart. (A second go at this: mitosis based on the density of the cluster. If a cluster is too sparse, split it. A pTree (no looping) sparsity measure: max(dis(CLUSTER, CENTROID)) / count(CLUSTER).)
