
Machine Learning is based on Near Neighbor Sets (NNS). Clustering, even density based, identifies near neighbor cores 1st (round NNSs, about a center).



Page 1

Machine Learning is based on Near Neighbor Sets (NNS).

Clustering, even density based, identifies near neighbor cores 1st (round NNSs, about a center).

Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity:

∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector; i.e., the ε-NNS of f(a) has a δ-NNS of a in its pre-image.

If f(Dom) is categorical: ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a).

Caution: For classification, it may be the case that one has to use continuity in lower dimensions to get a prediction (due to data sparseness). E.g., suppose points 1,2,3,4,5,6,7,8 are all at distance ε from a, with 1,2,3,4 → class C and 5,6,7,8 → class D (1-4 above a, 7 and 8 below, 5 and 6 level with a on either side). Any ε that gives us a vote gives us a tie vote. However, projecting onto the vertical subspace and taking ε/2, the ε/2-nbrhd about a contains only 5 and 6, so it gives us class D.
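A minimal sketch of this tie-breaking idea, with hypothetical points and classes (not the slide's exact figure): vote in the full-space ε-disk, then re-vote in an ε/2-disk of the vertical (y) projection.

```python
import math

# Hypothetical labeled 2-D points: all at distance eps from a, arranged so the
# full-space vote ties but the vertical-projection eps/2 vote does not.
train = [(( 1.2,  1.6), 'C'), ((-1.2,  1.6), 'C'), (( 1.6,  1.2), 'C'), ((-1.6,  1.2), 'C'),
         (( 2.0,  0.0), 'D'), ((-2.0,  0.0), 'D'), (( 1.2, -1.6), 'D'), ((-1.2, -1.6), 'D')]
a = (0.0, 0.0)

def dist(p, q, dims=None):
    dims = range(len(p)) if dims is None else dims
    return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))

def vote(eps, dims=None):
    tally = {}
    for x, cls in train:
        if dist(x, a, dims) <= eps + 1e-9:   # small tolerance for points exactly at eps
            tally[cls] = tally.get(cls, 0) + 1
    return tally

eps = 2.0
print(vote(eps))                 # full space: {'C': 4, 'D': 4}, a tie
print(vote(eps / 2, dims=[1]))   # vertical projection, eps/2: {'D': 2}, class D wins
```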

Using horizontal data, NNS derivation requires at least one scan (at least O(n)).

The L∞ disk NNS can be derived using vertical data in O(log₂n), yet usually Euclidean disks are preferred. (Note: Euclidean and Manhattan disks coincide on binary data sets.)

Our solution in a sentence: Circumscribe the desired Euclidean ε-nbrhd with functional contours (sets of the type f⁻¹([b,c])) until the intersection is scannable, then scan it for Euclidean-ε-nbrhd membership.

Advantage: the intersection can be determined before scanning - create and AND the functional-contour P-trees.
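A minimal sketch of "circumscribe, AND, then scan", assuming a made-up table and radius; plain Python bit lists stand in for the contour P-trees.

```python
import math

# Toy training table and an unclassified sample a (made-up values).
R = [(2, 7), (3, 7), (2, 6), (5, 2), (7, 0), (7, 1), (6, 1), (2, 2)]
a, eps = (6, 1), 1.5

# Functional contours f_d^-1([a_d - eps, a_d + eps]) for each coordinate projection f_d,
# materialized as bit masks (stand-ins for contour P-trees).
def contour_mask(d):
    return [1 if abs(x[d] - a[d]) <= eps else 0 for x in R]

mask = [m1 & m2 for m1, m2 in zip(contour_mask(0), contour_mask(1))]  # AND the contours

# Only the surviving rows are scanned for true Euclidean eps-membership.
neighbors = [x for bit, x in zip(mask, R) if bit and math.dist(x, a) <= eps]
print(mask)        # [0, 0, 0, 1, 1, 1, 1, 0]
print(neighbors)   # the Euclidean eps-neighborhood of a
```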

Data Mining can be broken down into 2 areas, Machine Learning and Assoc. Rule Mining

Machine Learning can be broken down into 2 areas, Clustering and Classification.

Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based

Classification can be broken down into 2 types, Model-based and Neighbor-based

Database analysis can be broken down into 2 areas, Querying and Data Mining.

Page 2

Given f:R(A1..An) → Y and S ⊆ Y, the f-contour(S) = f⁻¹(S). Equivalently, the Af-contour(S) = SELECT x1..xn FROM R* WHERE x.Af ∈ S.

If S = {a}, we use f-Isobar(a), equivalently Af-Isobar(a).

If f is a local density and {Sk} is a partition of Y, then {f⁻¹(Sk)} partitions R. (E.g., in OPTICS, f = reachability distance and {Sk} is the partition produced by intersections of graph(f), with respect to a walk of R, with a horizontal line. A weather map uses an equiwidth interval partition of S = Reals (barometric pressure or temperature contours). A grid is the intersection partition with respect to the dimension projection functions (next slide). A class is a contour under f:R→C, the class map. An L∞-disk about a is the intersection of the ε-dimension-projection contours containing a.)

Contours: f:R(A1..An) → Y

[Diagram: S ⊆ Y, f:R → Y; graph(f) = { (x, f(x)) | x∈R }; the f-contour(S) is the pre-image f⁻¹(S) ⊆ R. The table R*(A1..An, Af) lists each tuple x1..xn together with its derived value f(x1..xn).]

Equivalently, use a derived attribute, Af, with domain = Y (the equivalence is x.Af = f(x), ∀x∈R).
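A small sketch of the derived-attribute view, assuming a made-up table and functional f: append Af = f(x) to get R*, and the f-contour(S) is then just a selection on Af.

```python
# Made-up table R and functional f (here a simple sum), purely for illustration.
R = [(2, 7, 6), (3, 7, 6), (2, 6, 5), (5, 2, 1), (7, 0, 1)]

def f(x):                                  # any functional on the tuples
    return sum(x)

R_star = [x + (f(x),) for x in R]          # R* = R with derived attribute A_f appended

S = set(range(8, 13))                      # S, a subset of Y = f's codomain
f_contour_S = [row[:-1] for row in R_star if row[-1] in S]   # SELECT ... WHERE x.A_f in S
print(R_star)
print(f_contour_S)                         # the pre-image f^-1(S)
```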


Page 3

f:R → Y and a partition S = {Sk} of Y: {f⁻¹(Sk)} ≡ the S,f-grid of R (the grid cells = contours).

If Y = Reals, the j.lo f-grid is produced by agglomerating over the j lo bits of Y, for each fixed (b-j) hi-bit pattern.

The j lo bits walk the [isobars of] cells; the b-j hi bits identify the cells (lo = extension / hi = intention). Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = { S_{y(b-1)..y(b-j)} | S_{y(b-1)..y(b-j)} = [y(b-1)y(b-2)..y(b-j)0..0, y(b-1)y(b-2)..y(b-j)1..1] }, a partition of Y = Reals.

If F = {f_h}, the j.lo F-grid is the intersection partition of the j.lo f_h-grids (intersection of partitions).

The canonical j.lo grid is the j.lo Π-grid, where Π = {π_d : R → R[A_d] | π_d = the dth coordinate projection}. j.hi gridding is similar (the b-j lo bits walk the cell contents / the j hi bits identify the cells).

If the horizontal and vertical dimensions have bitwidths 3 and 2 respectively:

[Figure (GRIDs): the 2.lo grid vs. the 1.hi grid over a 3-bit horizontal axis (000..111) and a 2-bit vertical axis (00..11). Want square cells or a square pattern?]

Page 4


j.lo and j.hi gridding continued

If horizontal_bitwidth = vertical_bitwidth = b, then the j.lo grid = the (b-j).hi grid;

e.g., for hb=vb=b=3 and j=2:

[Figure: for hb = vb = b = 3 and j = 2, the 2.lo grid and the 1.hi grid are the same partition; both axes are labeled 000..111.]
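A small sketch of j.lo gridding on b-bit values (here b = 3, j = 2, matching the slide's example): the hi bits name the cell, the lo bits walk within it.

```python
# j.lo gridding on one b-bit dimension: (b-j) hi bits identify the cell,
# j lo bits walk within it. With b = 3 and j = 2, 2.lo grid = 1.hi grid.
b, j = 3, 2

def cell_id(value):
    return value >> j                  # the (b-j) hi bits identify the cell

def position_in_cell(value):
    return value & ((1 << j) - 1)      # the j lo bits walk the cell

for v in range(2 ** b):
    print(f"{v:0{b}b}: cell {cell_id(v):0{b-j}b}, offset {position_in_cell(v):0{j}b}")

# For an n-dimensional point, the canonical j.lo grid cell is the tuple of
# per-dimension cell ids:
def grid_cell(point):
    return tuple(cell_id(v) for v in point)

print(grid_cell((6, 1)), grid_cell((7, 0)))   # both land in cell (1, 0)
```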

Page 5

SOME Useful NearNeighborSets (NNS). Given a similarity s:R×R → Reals and a C ⊆ R

(i.e., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) ∀x,y∈R):

Cardinal disk, skins and rings:

disk(C,r) ≡ {x∈R | s(x,C) ≥ r}; also = the functional contour f⁻¹([r,∞)), where f(x) = s_C(x) = s(x,C)

skin(C,r) ≡ disk(C,r) - C

ring(C,r2,r1) ≡ disk(C,r2) - disk(C,r1) = skin(C,r2) - skin(C,r1); also = the functional contour s_C⁻¹((r1,r2]).

For C = {a}: [figure: the disk of radius r1 about a and the ring between r1 and r2 about a, and the same about a set C].

Ordinal disks, skins and rings:

disk(C,k) ≡ a set containing C with |disk(C,k) ∩ C′| = k and s(x,C) ≥ s(y,C) ∀x∈disk(C,k), y∉disk(C,k) (i.e., C together with its k nearest neighbors).

skin(C,k) = disk(C,k) - C (the skin consists of C's k immediate neighbors and is a kNNS of C).

ring(C,k) = cskin(C,k) - cskin(C,k-1), where closeddisk(C,k) ≡ the union of all disk(C,k)'s and closedskin(C,k) ≡ the union of all skin(C,k)'s (ties included).

L∞ skins: skin(a,k) = {x | ∃d such that x_d is one of the k-NNs of a_d} - a local normalization?

A distance, d, generates a similarity in many ways, e.g., s(x,y) = 1/(1+d(x,y)) (or, if the relationship varies by location, a location-dependent factor can multiply 1/(1+d(x,y))). [Graph: s vs. d.]

Note: closeddisk(C,k) and closedskin(C,k) are redundant, since closeddisk(C,k) = disk(C, s(C,y)) where y is any kth NN of C.

s(x,y) = e^(-d(x,y)²). [Graph: s vs. d.]

s(x,y) = e^(-d(x,y)²/std) - e^(-ε²/std) for d(x,y) ≤ ε, and s(x,y) = 0 for d(x,y) > ε. (Vote weighting IS a similarity assignment, so the similarity-to-distance graph IS a vote weighting for classification.) [Graph: s vs. d, dropping to 0 at d = ε; the value at d = 0 is 1 - e^(-ε²/std).]
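A sketch of the cardinal and ordinal NNS above, assuming a made-up point set and the similarity s(x,y) = 1/(1+d(x,y)), with C = {a}.

```python
import math

# Made-up point set; C = {a}, a single center.
R = [(0, 0), (1, 0), (0, 2), (2, 2), (3, 1), (5, 5)]
a = (1, 1)

def s(x, y):                       # similarity generated from Euclidean distance
    return 1.0 / (1.0 + math.dist(x, y))

def disk(r):                       # cardinal disk: similarity to a at least r
    return {x for x in R if s(x, a) >= r}

def skin(r):
    return disk(r) - {a}

def ring(r2, r1):                  # points between the two similarity levels (r2 < r1)
    return disk(r2) - disk(r1)

def kNN_disk(k):                   # ordinal disk: C = {a} plus its k most similar points
    ranked = sorted((x for x in R if x != a), key=lambda x: s(x, a), reverse=True)
    return {a} | set(ranked[:k])

print(disk(0.4), skin(0.4), ring(0.3, 0.4), kNN_disk(3), sep="\n")
```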

Page 6

0Pf,S is called a P-tree for short and is just the existential R*-bit map of S ⊆ R*.Af.

The Compressed P-tree, sPf,S is the compression of 0Pf,S with equi-width leaf size, s, as follows

1. Choose a walk of R (converts 0Pf,S from a bit map to a bit vector).
2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short).
3. Eliminate, and mask to 0, all pure-zero segments (call the mask the NotPure0 Mask or EM).
4. Eliminate, and mask to 1, all pure-one segments (call the mask the Pure1 Mask or UM).

Compressing each leaf of s1Pf,S with leafsize = s2 gives s1,s2Pf,S. Recursively: s1,s2,s3Pf,S, then s1,s2,s3,s4Pf,S, ...

(builds an EM and a UM tree)

BASIC P-trees: If Ai is Real or Binary and fi,j(x) ≡ the jth bit of xi, then {(*)Pfi,j,{1} ≡ (*)Pi,j}j=b..0 are the basic (*)P-trees of Ai, * = s1..sk.

If Ai is Categorical and fi,a(x) = 1 if xi = a, else 0, then {(*)Pfi,a,{1} ≡ (*)Pi,a}a∈R[Ai] are the basic (*)P-trees of Ai.

Notes: The UM masks (e.g., of 2^k,...,2^0 Pi,j, with k = roof(log2|R|)) form a (binary) tree. Whenever an EM bit marks a pure0 segment, that entire subtree can be eliminated; then a 0-node at level k (lowest level = level 0) with no sub-tree indicates a 2^k-run of zeros. In this construction the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the one after it a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.

(EM=existential aggregation UM=universal aggregation)

f:R(A1..An) → Y, S ⊆ Y. The (uncompressed) Predicate-tree 0Pf,S is defined by: 0Pf,S(x) = 1 (true) iff f(x) ∈ S.
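A sketch of the one-level compression steps above (walk, equi-width segments, EM and UM masks), assuming a plain Python bit list as the uncompressed 0P_{f,S}.

```python
# One-level equi-width compression of a bit vector: segment with leafsize s,
# record an EM (NotPure0) bit and a UM (Pure1) bit per segment, keep only mixed leaves.
def compress(bits, s):
    segments = [bits[i:i + s] for i in range(0, len(bits), s)]
    EM = [int(any(seg)) for seg in segments]        # 0 marks a pure-zero segment
    UM = [int(all(seg)) for seg in segments]        # 1 marks a pure-one segment
    leaves = {i: seg for i, seg in enumerate(segments)
              if any(seg) and not all(seg)}         # only mixed segments are stored
    return EM, UM, leaves

bitmap = [0, 0, 0, 0, 1, 0, 1, 1]                   # e.g. the R11 bit slice used later
EM, UM, leaves = compress(bitmap, s=2)
print("EM:", EM)                 # [0, 0, 1, 1]
print("UM:", UM)                 # [0, 0, 0, 1]
print("mixed leaves:", leaves)   # {2: [1, 0]}
```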

Page 7

Vertical basic binary Predicate-trees (P-trees): vertically partition the table; compress each vertical bit slice into a basic binary P-tree; then process using multi-operand logical ANDs.

A data table, R(A1..An), containing horizontal structures (records) is processed vertically (vertical scans). E.g., with four 3-bit attributes:

R(A1 A2 A3 A4) =
  010 111 110 001
  011 111 110 000
  010 110 101 001
  010 111 101 111
  101 010 001 100
  010 010 001 101
  111 000 001 100
  111 000 001 100

The vertical bit slices are R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43; e.g., R11 = 00001011.

The basic binary P-tree, P1,1, for R11 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity:

1. The whole file is not pure1 → 0.
2. The 1st half is not pure1 → 0; but it is pure (pure0), so this branch ends.
3. The 2nd half is not pure1 → 0.
4. The 1st half of the 2nd half is not pure1 → 0 (and not pure0, so recurse).
5. The 2nd half of the 2nd half is pure1 → 1.
6. The 1st half of the 1st half of the 2nd half is pure1 → 1.
7. The 2nd half of the 1st half of the 2nd half is not pure1 → 0 (pure0, so this branch ends).

Each bit slice R11..R43 gets such a P-tree (P11..P43). E.g., to count the number of occurrences of 111 000 001 100, AND: P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43; the only 1-bit of the result is at the 2^1 level, so the count = 2 (worked out on Page 9).
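A sketch of the top-down construction just described, run on R11 = 00001011 (assumes the bit-vector length is a power of two).

```python
import json

# Top-down basic binary P-tree: record the truth of "pure1" on the whole bit vector,
# then recurse on halves until a half is pure (pure1 or pure0).
def ptree_top_down(bits):
    node = {"bit": int(all(bits))}
    if all(bits) or not any(bits):          # pure1 or pure0: the branch ends here
        return node
    half = len(bits) // 2
    node["children"] = [ptree_top_down(bits[:half]), ptree_top_down(bits[half:])]
    return node

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
print(json.dumps(ptree_top_down(R11), indent=1))
```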

Page 8

R11 = 0 0 0 0 1 0 1 1

Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient.

Bottom-up construction of P11 is done using in-order tree traversal and the collapsing of pure siblings, as follows:

(The data table R and its bit slices R11..R43 are as on the previous page.)

P11 (bottom-up, identical to the top-down result): the root is 0; its children are 0 (pure0: the collapsed first half 0000) and 0; that right child's children are 0 (an internal node with leaves 1 and 0, from the bits 10) and 1 (pure1: the collapsed pair 11).
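A sketch of the bottom-up construction: build leaves from the bits and repeatedly collapse pure sibling pairs (assumes a power-of-two length); on R11 it reproduces the tree above.

```python
# Bottom-up basic binary P-tree with collapsing of pure siblings.
PURE0, PURE1 = ("pure", 0), ("pure", 1)

def ptree_bottom_up(bits):
    level = [PURE1 if b else PURE0 for b in bits]     # leaf nodes
    while len(level) > 1:
        nxt = []
        for left, right in zip(level[0::2], level[1::2]):
            if left == right and left in (PURE0, PURE1):
                nxt.append(left)                      # pure siblings collapse into one node
            else:
                nxt.append(("mixed", [left, right]))  # otherwise keep an internal node
        level = nxt
    return level[0]

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
print(ptree_bottom_up(R11))
```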

Page 9

(Again using R(A1 A2 A3 A4) and the basic P-trees P11..P43 from the previous pages.)

To count occurrences of the tuple (7,0,1,4), i.e., of the bit pattern 111 000 001 100, AND the corresponding basic P-trees, complemented (P') wherever the pattern bit is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The AND is evaluated level by level on the compressed trees: a 0 high in one operand makes the entire branch of the result 0, while matching 1s propagate upward. In the resulting P-tree the only 1-bit is at the 2^1 level, so the 1-count = 1·2^1 = 2 (the tuple (7,0,1,4) occupies the last two rows of R).

Processing efficiencies? (The prefixed leaf-sizes have been removed.)
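A sketch of the AND-based count, assuming the slide's table read from its binary form; plain bit lists stand in for the (compressed) basic P-trees.

```python
# Count occurrences of (7,0,1,4) = 111 000 001 100 by ANDing basic bit slices
# (complemented where the pattern bit is 0) and summing the resulting mask.
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
b = 3                                               # bit width of every attribute

def bit_slice(d, i):                                # P_{d,i}: bit i of attribute A_d (i=2 is hi)
    return [(row[d] >> i) & 1 for row in R]

def count_tuple(pattern):
    mask = [1] * len(R)
    for d, value in enumerate(pattern):
        for i in range(b):
            slice_di = bit_slice(d, i)
            want = (value >> i) & 1                 # use P_{d,i} or its complement P'_{d,i}
            mask = [m & (s if want else 1 - s) for m, s in zip(mask, slice_di)]
    return sum(mask)

print(count_tuple((7, 0, 1, 4)))   # 2
```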

Page 10

A useful functional: TV(a) ≡ Σ_{x∈R} (x-a)∘(x-a). Using d as an index over the dimensions and i,j,k as bit-slice indexes:

TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² - 2 a_d x_d + a_d²)
      = Σ_x Σ_d (Σ_k 2^k x_dk)² - 2 Σ_x Σ_d a_d (Σ_k 2^k x_dk) + |R||a|²
      = Σ_x Σ_d (Σ_i 2^i x_di)(Σ_j 2^j x_dj) - 2 Σ_x Σ_d a_d (Σ_k 2^k x_dk) + |R||a|²
      = Σ_{x,d,i,j} 2^(i+j) x_di x_dj - 2 Σ_d a_d Σ_{x,k} 2^k x_dk + |R||a|²

TV(a) = Σ_{i,j,d} 2^(i+j) |P_di ^ P_dj| - Σ_k 2^(k+1) Σ_d a_d |P_dk| + |R||a|²

Note that the first term does not depend upon a. Thus the derived attribute TV(a) - TV(μ) (μ = the mean of R; subtracting eliminates the 1st term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)).

We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).

Since Σ_x x_dk summed over k with weights 2^k gives Σ_x x_d = |R| μ_d, the same quantity can be written

TV(a) = Σ_{x,d,i,j} 2^(i+j) x_di x_dj - 2|R| Σ_d a_d μ_d + |R| Σ_d a_d a_d
      = Σ_{x,d,i,j} 2^(i+j) x_di x_dj + |R|( -2 Σ_d a_d μ_d + Σ_d a_d a_d )
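A small check of the bit-slice form of TV(a), assuming the example table from the P-tree slides (read from its binary form); the root counts |P_di ^ P_dj| and |P_dk| are computed here by direct bit counting.

```python
# Verify TV(a) = sum_{i,j,d} 2^(i+j)|P_di^P_dj| - sum_k 2^(k+1) sum_d a_d|P_dk| + |R||a|^2
# against the definition TV(a) = sum_{x in R} (x-a).(x-a).
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
b, n = 3, 4                                       # bit width, number of dimensions

def bit(x, i):
    return (x >> i) & 1

def root_count_and(d, i, j):                      # |P_{d,i} ^ P_{d,j}|
    return sum(bit(row[d], i) & bit(row[d], j) for row in R)

def root_count(d, k):                             # |P_{d,k}|
    return sum(bit(row[d], k) for row in R)

def TV_direct(a):
    return sum(sum((row[d] - a[d]) ** 2 for d in range(n)) for row in R)

def TV_ptree(a):
    term1 = sum(2 ** (i + j) * root_count_and(d, i, j)
                for d in range(n) for i in range(b) for j in range(b))
    term2 = sum(2 ** (k + 1) * a[d] * root_count(d, k)
                for d in range(n) for k in range(b))
    return term1 - term2 + len(R) * sum(ad ** 2 for ad in a)

a = (3, 2, 4, 1)
print(TV_direct(a), TV_ptree(a))                  # the two values agree
```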

Page 11

The value of g(a) depends only on the length of a-μ, so its isobars are hyper-circles centered at μ.

The graph of g is a log-shaped hyper-funnel:

From equation 7,

f(a) ≡ TV(a) - TV(μ) = |R|( -2 Σ_d (a_d μ_d - μ_d μ_d) + Σ_d (a_d a_d - μ_d μ_d) )
     = |R|( Σ_d a_d² - 2 Σ_d μ_d a_d + Σ_d μ_d² )
     = |R| |a-μ|²,  so f(μ) = 0

(recall TV(a) = Σ_{x,d,i,j} 2^(i+j) x_di x_dj + |R|( -2 Σ_d a_d μ_d + Σ_d a_d a_d ))

g(a) ≡ HDTV(a) = ln( f(a) ) = ln|R| + ln|a-μ|²

For an ε-contour ring (radius ε about a): go inward and outward along a-μ by ε to the points: the inner point b = μ + (1 - ε/|a-μ|)(a-μ) and the outer point c = μ + (1 + ε/|a-μ|)(a-μ).

Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval. Then we use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).

[Figure: the graph of g(x) = HDTV(x) over (x1, x2); the ε-contour (radius ε about a) corresponds to the horizontal slab between g(b) and g(c).]
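A minimal sketch of the ring pruning (made-up 2-D points; plain comprehensions stand in for the EIN-formula / P-tree interval mask). Assumes a ≠ μ.

```python
import math

# HDTV ring pruning: compute mu, the inner/outer points b and c, and keep only the
# training points whose g-value falls in [g(b), g(c)]; finish with a scan.
R = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (6.0, 5.0), (7.0, 7.0), (4.0, 4.5), (5.0, 5.5)]
mu = tuple(sum(col) / len(R) for col in zip(*R))

def g(x):                                   # g(x) = HDTV(x) = ln|R| + ln|x-mu|^2
    return math.log(len(R)) + math.log(math.dist(x, mu) ** 2)

def ring_candidates(a, eps):
    r = math.dist(a, mu)                    # assumes a != mu
    inner = tuple(m + (1 - eps / r) * (ad - m) for ad, m in zip(a, mu))   # b
    outer = tuple(m + (1 + eps / r) * (ad - m) for ad, m in zip(a, mu))   # c
    lo, hi = g(inner), g(outer)
    return [x for x in R if lo <= g(x) <= hi]

a, eps = (6.0, 6.0), 1.0
cands = ring_candidates(a, eps)                            # pruned superset of the eps-disk
true_nbrs = [x for x in cands if math.dist(x, a) <= eps]   # final scan of the pruned set
print(cands, true_nbrs, sep="\n")
```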

Page 12

As pre-processing, calculate the basic P-trees for the HDTV derived attribute (or for another hyper-circular-contour derived attribute).

To classify a:
1. Calculate b and c (they depend on a and ε).
2. Form the mask P-tree for the training points with HDTV-values in [HDTV(b), HDTV(c)].
3. Use that P-tree to prune out the candidate NNS.
4. If the count of candidates is small, proceed to scan and assign class votes using a Gaussian vote function; else prune further using a dimension projection (the contour of a dimension projection, e.g., f(a) = a1).

If the HDTV circumscribing contour of a is still too populous, use a circumscribing A_d-contour. (Note: A_d is not a derived attribute at all, but just A_d, so we already have its basic P-trees.)

We can also note that HDTV can be further simplified (retaining the same contours) using h(a) = |a-μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., h_b(a) = |a-b|.

(Use the voting function G(x) = Gauss(|x-a|) - Gauss(ε), where Gauss(r) = (1/(std·√(2π))) e^(-(r-mean)²/2var); std, mean, var are w.r.t. the set of distances from a of the voters, i.e., {r = |x-a| : x a voter}.)

[Figure: HDTV(x) over (x1, x2), the ε-contour about a between HDTV(b) and HDTV(c), intersected with the contour of a dimension projection.]
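A sketch of steps 1-4, assuming made-up training data and using the simplified hyper-circular functional h(x) = |x-μ| mentioned above as the precomputed derived attribute; a zero-mean Gaussian with a made-up std stands in for the slide's Gauss voting function.

```python
import math

# Classification loop: ring-prune on the derived attribute h, then Gaussian-vote
# inside the eps-disk among the surviving candidates.
train = [((1.0, 1.0), 'C'), ((1.5, 2.0), 'C'), ((2.0, 1.0), 'C'),
         ((6.0, 6.0), 'D'), ((6.5, 5.0), 'D'), ((7.0, 6.5), 'D'), ((5.5, 6.5), 'D')]
pts = [x for x, _ in train]
mu = tuple(sum(col) / len(pts) for col in zip(*pts))
h = {x: math.dist(x, mu) for x in pts}            # "pre-processed" derived attribute

def gauss(r, std):
    return math.exp(-r * r / (2 * std * std)) / (std * math.sqrt(2 * math.pi))

def classify(a, eps, std=1.0):
    r = math.dist(a, mu)
    lo, hi = r - eps, r + eps                     # 1. the ring [h(b), h(c)] about a
    candidates = [(x, c) for (x, c) in train if lo <= h[x] <= hi]   # 2.-3. mask and prune
    votes = {}                                    # 4. Gaussian-weighted votes in the eps-disk
    for x, c in candidates:
        d = math.dist(x, a)
        if d <= eps:
            votes[c] = votes.get(c, 0.0) + gauss(d, std) - gauss(eps, std)
    return max(votes, key=votes.get) if votes else None

print(classify((6.2, 6.1), eps=1.5))   # 'D'
```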

Page 13

Graphs of functionals with hyper-circular contours:

[Figure: surface plots over a small X-Y grid (values 1..5) of TV, of TV - TV(μ) (with TV(μ) = TV(x33) and TV(x15) marked), of HDTV, of h(a) = |a-μ|, and of h_b(a) = |a-b|.]

Page 14

Angular Variation functionals: e.g., AV(a) ≡ (1/|a|) Σ_{x∈R} x∘a. With d an index over the dimensions:

AV(a) = (1/|a|) Σ_{x∈R} Σ_{d=1..n} x_d a_d
      = (1/|a|) Σ_d (Σ_x x_d) a_d          (factor out a_d)
      = (|R|/|a|) Σ_{d=1..n} ((Σ_x x_d)/|R|) a_d
      = (|R|/|a|) Σ_{d=1..n} μ_d a_d
      = (|R|/|a|) μ∘a

COS(a) ≡ AV(a)/(|μ||R|) = μ∘a/(|μ||a|) = cos(μ∠a)

COS (and AV) have hyper-conic isobars centered on μ. COS and AV have ε-contour(a) = the space between two hyper-cones, centered on μ, which just circumscribes the Euclidean ε-hyperdisk at a.

Graphs of functionals with hyper-conic contours: e.g., COS_b(a) for any vector b. [Figure: the COS(a) cones about μ, and the intersection (in pink) of the COS_b(a) contour with the HDTV ε-contour.]
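A sketch of the conic pruning, assuming made-up points: keep the training points whose COS value (angle to μ) lies in the angular band that circumscribes the Euclidean ε-disk about a, intersect with a radial band, then scan.

```python
import math

# Conic pruning with the COS functional; vectors are taken from the origin.
R = [(1.0, 0.2), (2.0, 0.5), (3.0, 1.0), (4.0, 4.0), (0.5, 3.0), (3.5, 0.8)]
mu = tuple(sum(col) / len(R) for col in zip(*R))

def dot(u, v):   return sum(ui * vi for ui, vi in zip(u, v))
def norm(u):     return math.sqrt(dot(u, u))
def angle(u, v): return math.acos(max(-1.0, min(1.0, dot(u, v) / (norm(u) * norm(v)))))

def conic_candidates(a, eps):
    half_width = math.asin(min(1.0, eps / norm(a)))   # cone half-angle circumscribing the eps-disk
    theta_a = angle(a, mu)
    band = [x for x in R if abs(angle(x, mu) - theta_a) <= half_width]
    return [x for x in band if abs(norm(x) - norm(a)) <= eps]   # intersect with a radial band

a, eps = (3.0, 0.9), 0.6
cands = conic_candidates(a, eps)
print(cands)                                                    # pruned candidate set
print([x for x in cands if math.dist(x, a) <= eps])             # final scan
```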

Page 15

f(a)_x ≡ (x-a)∘(x-a). With d an index over the dimensions and i,j,k bit-slice indexes:

f(a)_x = Σ_{d=1..n} (x_d² - 2 a_d x_d + a_d²)
       = Σ_d (Σ_k 2^k x_dk)² - 2 Σ_d a_d (Σ_k 2^k x_dk) + |a|²
       = Σ_d (Σ_i 2^i x_di)(Σ_j 2^j x_dj) - 2 Σ_d a_d (Σ_k 2^k x_dk) + |a|²
       = Σ_{d,i,j} 2^(i+j) x_di x_dj - 2 Σ_{d,k} 2^k a_d x_dk + |a|²

f(a)_x = Σ_{i,j,d} 2^(i+j) (P_di ^ P_dj)_x - Σ_k 2^(k+1) Σ_d a_d (P_dk)_x + |a|²

β exp( -f(a)_x ) = β exp( -Σ_{i,j,d} 2^(i+j) (P_di ^ P_dj)_x ) · exp( -|a|² ) · exp( Σ_k 2^(k+1) Σ_d a_d (P_dk)_x )

Adding up the Gaussian-weighted votes for class c:

Σ_{x∈c} β exp( -f(a)_x ) = β Σ_{x∈c} ( exp( -Σ_{i,j,d} 2^(i+j) (P_di ^ P_dj)_x ) · exp( -|a|² ) · exp( Σ_k 2^(k+1) Σ_d a_d (P_dk)_x ) )

Dropping the constant factor β·exp(-|a|²), this is proportional to

Σ_{x∈c} exp( -Σ_{i,j,d} 2^(i+j) (P_di ^ P_dj)_x + Σ_{k,d} 2^(k+1) a_d (P_dk)_x )

Collecting the diagonal (i = j) terms inside the exp:

= Σ_{x∈c} exp( Σ_{i≠j,d} -2^(i+j) (P_di ^ P_dj)_x + Σ_{i=j,d} (a_d 2^(i+1) - 2^(2i)) (P_di)_x )

For each i,j,d, the coefficient inside the exp does not involve x; it is multiplied by a 1-bit or a 0-bit depending on x. Thus, for fixed i,j,d, we either have the x-independent coefficient (if the bit is 1) or we don't (if the bit is 0):

= Σ_{x∈c} ( Π_{i≠j,d} exp( -2^(i+j) (P_di ^ P_dj)_x ) · Π_{i=j,d} exp( (a_d 2^(i+1) - 2^(2i)) (P_di)_x ) )
= Σ_{x∈c} ( Π_{i≠j,d: (P_di^P_dj)_x=1} exp( -2^(i+j) ) · Π_{i=j,d: (P_di)_x=1} exp( a_d 2^(i+1) - 2^(2i) ) )    (eq 1)
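A small numeric check of eq (1), assuming a made-up table with two 2-bit attributes: the Gaussian vote weight exp(-|x-a|²) of each training point is reassembled purely from its bit-slice values with x-independent coefficients, then summed per class.

```python
import math

# Check eq (1): exp(-|x-a|^2) rebuilt from the basic P-tree bits (P_{d,i})_x.
train = [((1, 2), 'C'), ((0, 3), 'C'), ((2, 1), 'D'), ((3, 0), 'D'), ((3, 3), 'D')]
b, n = 2, 2                                   # bit width, number of dimensions

def bit(v, i):
    return (v >> i) & 1

def weight_from_bits(x, a):
    expo = -sum(ad * ad for ad in a)          # the exp(-|a|^2) factor
    for d in range(n):
        for i in range(b):
            for j in range(b):
                if i != j and bit(x[d], i) & bit(x[d], j):
                    expo += -2 ** (i + j)                      # off-diagonal terms
            if bit(x[d], i):
                expo += a[d] * 2 ** (i + 1) - 2 ** (2 * i)     # collected diagonal terms
    return math.exp(expo)

a = (2, 2)
votes = {}
for x, c in train:
    w_bits = weight_from_bits(x, a)
    w_direct = math.exp(-sum((xd - ad) ** 2 for xd, ad in zip(x, a)))
    assert math.isclose(w_bits, w_direct)     # the bit-slice form matches the direct weight
    votes[c] = votes.get(c, 0.0) + w_bits
print(votes)                                  # class D accumulates the larger vote here
```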

Page 16

Suppose there are two classes, red (-) and green (+), on the ε-cylinder shown. Then the vector connecting the medians (vcm) in YZ-space is shown in purple. The unit vector in the direction of the vector connecting the medians (uvcm) in YZ-space is shown in blue. The vector from the midpoint of the medians to s is shown in orange. The inner product of the blue and the orange vectors is the same as the inner product we would get by doing it in 3D. The point is that the x-component of the red vector of medians and that of the green are identical, so the x-component of the vcm is zero. (A small vcm component means prune that dimension out!)

[Figure: the point s and the two class medians in (x, y, z) space.]
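A minimal sketch of this median-pruning idea, with made-up red/green points whose x-medians coincide: the near-zero vcm components are pruned, and the sign of the inner product (which side of the midpoint s falls on) is unchanged.

```python
import statistics

# Dimensions where the two class medians agree (near-zero vcm component) carry no
# separating information; drop them and classify s by the inner product with the vcm.
red   = [(1.0, 2.0, 5.0), (1.5, 2.5, 6.0), (0.5, 1.5, 5.5)]
green = [(1.0, 6.0, 1.0), (1.5, 7.0, 2.0), (0.5, 6.5, 1.5)]
s = (1.2, 5.0, 2.5)

def median_vec(pts):
    return tuple(statistics.median(col) for col in zip(*pts))

m_red, m_green = median_vec(red), median_vec(green)
vcm = tuple(g - r for r, g in zip(m_red, m_green))            # vector connecting medians
mid = tuple((r + g) / 2 for r, g in zip(m_red, m_green))      # midpoint of the medians

keep = [d for d, comp in enumerate(vcm) if abs(comp) > 1e-6]  # prune near-zero components
print("vcm:", vcm, "-> keep dimensions", keep)

proj = sum((s[d] - mid[d]) * vcm[d] for d in keep)            # same value as the full inner product
print("class:", "green" if proj > 0 else "red")
```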