
Page 1: KDD tutorial

TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets

(Most recent copy) available at:

http://www.cs.yale.edu/homes/mmahoney

http://www.cs.rpi.edu/~drinep

Petros Drineas
RPI, Dept. of Computer Science

Michael W. Mahoney
Yale University, Dept. of Mathematics

(now at Yahoo! Research)

Page 2: KDD tutorial

2

Randomized Linear Algebra Algorithms

Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large matrices and tensors.

• Matrix Multiplication
• Computation of the Singular Value Decomposition
• Computation of the CUR Decomposition
• Testing Feasibility of Linear Programs
• Least Squares Approximation
• Tensor computations: SVD generalizations
• Tensor computations: CUR generalization

Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n × n matrices.

Page 3: KDD tutorial

3

Randomized Linear Algebra Algorithms

Motivation:

• (Algorithmic) To speed up computations in applications where extremely large data sets are modeled by matrices/tensors and, e.g., O(n3) time is not an option.

• (Algorithmic) To reduce the memory requirements in applications where the data sets are modeled by matrices/tensors and storing the whole data is not an option, because the data sets are extremely large.

• (Equivalent to the above) To provide some analysis of the accuracy of simple algorithms when only a small sample of the full data set is available.

• (Structural) To reveal novel structural properties of the datasets, given sufficient computational time.

Page 4: KDD tutorial

4

Example: the CUR decomposition

O(1) columns, O(1) rows

Carefully chosen U

Goal: make (some norm) of A-CUR small.

Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and “reconstruct” A as CUR; if the sampling probabilities are not “too bad”, we get provably good accuracy.

Why? After making two passes over A, we can compute provably good C, U, and R and store them (a "sketch") instead of A: O(m+n) vs. O(n^2) space.

Why? Given sufficient time, we can find C, U and R such that A – CUR is “very” small.

This might lead to better understanding of the data.

Page 5: KDD tutorial

5

Applications of such algorithms

Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A.

• Covariance Matrices
• Latent Semantic Indexing
• DNA Microarray Data
• Eigenfaces and Image Recognition
• Similarity Queries
• Matrix Reconstruction
• LOTS of other data applications!!

More generally,
• Linear and Nonlinear Programming Applications
• Design of Approximation Algorithms
• Statistical Learning Theory Applications

Page 6: KDD tutorial

6

Overview (1/3)

• Data Streaming Models and Random Sampling

• Matrix Multiplication
• Feasibility testing of Linear Programs

• Singular Value Decomposition
• Sampling rows/columns to approximate singular vectors/values
• Sampling elements to approximate singular vectors/values

• CUR Matrix Decomposition
• Lower bounds on sampling complexities

Page 7: KDD tutorial

7

Overview (2/3)

• Applications of Matrix CUR
• Designing approximation algorithms
• Data mining
• Microarray data
• Recommendation Systems

• Tensor-based data sets
• Tensor-CUR

• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems

Page 8: KDD tutorial

8

Overview (3/3)

• Kernel-CUR and the Nystrom Method

• Regression problems
• Least squares problems

• The CUR algorithm revisited
• A general bound
• Iterative sampling of rows/columns
• "Optimal" choices of rows/columns
• Application: finding ht-SNPs

• Conclusions & open problems

Page 9: KDD tutorial

9

Computation on Massive Data Sets

Data are too large to fit into main memory; they are either not stored or are stored in external memory.

Algorithms that compute on data streams examine the stream, keep a small “sketch” of the data, and perform computations on the sketch.

Algorithms are (typically) randomized and approximate.

Evaluate performance by measures such as:
• the time to process an item
• the number of passes over the data
• the additional workspace and time
• the quality of the approximation returned

Munro & Paterson ’78: studied “the relation between the amount of internal storage available and the number of passes required to select the k-th highest of n inputs.”

Page 10: KDD tutorial

10

The Pass Efficient Model

Motivation: The amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly.
• We can store large amounts of data, but
• we cannot process these data with traditional algorithms.

In the Pass-Efficient Model:
• Data are assumed to be stored on disk/tape.
• The algorithm has access to the data via a pass over the data.
• The algorithm is allowed additional RAM space and additional computation time.

An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution.

Note: If the data are an m x n matrix A, then algorithms that require additional time and space of O(m+n) or O(1) are pass-efficient.

Page 11: KDD tutorial

11

Random Sampling

Random sampling is used to estimate some parameter defined over a very large set by looking at only a very small subset.

Uniform Sampling: every piece of data is equally likely to be picked.

Advantages:
• "Coins" can be tossed "blindly."
• Even if the number of data elements is not known in advance, we can select one element u.a.r. in one pass over the data.
• Much recent work on quantities that may be approximated with a small uniformly drawn sample.

Disadvantages:
• Many quantities cannot be approximated well with a small random sample that is uniformly drawn.
• E.g., estimating the sum of N numbers, where one number is very large and the others very small.

Page 12: KDD tutorial

12

Random Sampling (cont’d)

Nonuniform Sampling: Advantages:

• Can obtain much more generality and big gains, e.g., can approximately solve problems in sparse as well as dense matrices. • Smaller sampling complexity for similar bounds.

Disadvantages:
• Must determine the nonuniform probabilities; multiple passes over the data are usually needed.

Randomized Algorithms for Linear Algebra problems:

A "sketch" consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations.

Page 13: KDD tutorial

13

Overview (1/3)

• Data Streaming Models and Random Sampling

• Matrix Multiplication
• Feasibility testing of Linear Programs

• Singular Value Decomposition
• Sampling rows/columns to approximate singular vectors/values
• Sampling elements to approximate singular vectors/values

• CUR Matrix Decomposition
• Lower bounds on sampling complexities

Page 14: KDD tutorial

14

Approximating Matrix Multiplication …

(D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’05)

Problem Statement

Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B,

OR, equivalently,

Approximate the sum of n rank-one matrices:

A·B = Σ_{i=1}^{n} A^{(i)} B_{(i)},

where A^{(i)} is the i-th column of A, B_{(i)} is the i-th row of B, and each term in the summation is a rank-one matrix.

Page 15: KDD tutorial

15

…by random sampling

Algorithm

1. Fix a set of probabilities pi, i=1…n, summing up to 1.

2. For t=1 up to s,

set jt = i, where Pr(jt = i) = pi;

(Pick s terms of the sum, with replacement, with respect to the pi.)

3. Approximate AB by the sum of the s terms, after scaling.


Page 16: KDD tutorial

16

Random sampling (cont’d)

Keeping only the sampled terms j_1, j_2, …, j_s (suitably rescaled) gives the approximation

A·B ≈ Σ_{t=1}^{s} (1/(s p_{j_t})) A^{(j_t)} B_{(j_t)}.

Page 17: KDD tutorial

17

The algorithm (matrix notation)

Algorithm

1. Pick s columns of A to form an m-by-s matrix C and the corresponding s rows of B to form an s-by-p matrix R.

2. (discard A and B) Approximate A · B by C · R.
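A minimal numpy sketch of this column/row sampling scheme (not the authors' code; the probabilities used here, proportional to |A^{(i)}| |B_{(i)}|, are one natural nonuniform choice and should be treated as an assumption):

import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by s sampled (and rescaled) outer products."""
    n = A.shape[1]
    # nonuniform probabilities proportional to |A^(i)| * |B_(i)|
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum() if w.sum() > 0 else np.full(n, 1.0 / n)
    idx = rng.choice(n, size=s, replace=True, p=p)   # s i.i.d. trials
    scale = 1.0 / np.sqrt(s * p[idx])                # rescaling
    C = A[:, idx] * scale                            # m x s
    R = B[idx, :] * scale[:, None]                   # s x p
    return C @ R                                     # unbiased estimate of A @ B

# Example: relative error of the sketch on random matrices.
A = np.random.randn(1000, 500)
B = np.random.randn(500, 800)
err = np.linalg.norm(A @ B - approx_matmul(A, B, 200)) / np.linalg.norm(A @ B)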

Page 18: KDD tutorial

18

The algorithm (matrix notation, cont’d)

• Create C and R by performing s i.i.d. trials, with replacement.

• For t = 1 up to s, pick a column A^{(j_t)} and the corresponding row B_{(j_t)} with probability p_{j_t}.

• Include A^{(j_t)} / (s p_{j_t})^{1/2} as a column of C, and B_{(j_t)} / (s p_{j_t})^{1/2} as a row of R.

Page 19: KDD tutorial

19

Simple Lemmas

The input matrices are given in “sparse unordered representation”; e.g., their non-zero entries are presented as triples (i, j, Aij) in any order.

• The expectation of CR (element-wise) is AB.

• Our nonuniform sampling minimizes the variance of the estimator.

• It is easy to implement the sampling in two passes.

• If the matrices are dense the algorithm runs in O(smp) time, instead of O(nmp) time,

• It requires O(sm+sp) RAM space.

• Does not tamper with the sparsity of the matrices.

Page 20: KDD tutorial

20

Error Bounds

For the above algorithm,

If ||AB||_F = Ω(||A||_F ||B||_F), then the above bound is a relative-error bound.

This happens if there is “not much cancellation” in the multiplication.
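For reference, with nonuniform probabilities p_i proportional to |A^{(i)}| |B_{(i)}| (the choice analyzed in the cited Drineas-Kannan-Mahoney papers), the expected error satisfies, up to constants,

E || A·B - C·R ||_F ≤ (1/√s) ||A||_F ||B||_F,

so the bound is a relative-error bound precisely when ||AB||_F = Ω(||A||_F ||B||_F).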

Page 21: KDD tutorial

21

Error Bounds (tight concentration)

For the above algorithm, with probability at least 1-δ,

Notice that we removed the expectation (by applying a martingale argument) at the cost of an extra log(1/δ) factor.

(Markov's inequality would also remove the expectation, but would introduce an extra 1/δ factor.)

Page 22: KDD tutorial

22

Special case: B = A^T

If B = A^T, then the sampling probabilities are proportional to the squared column lengths, p_i = |A^{(i)}|^2 / ||A||_F^2.

Also, R = C^T, and the error bounds are stated in terms of ||A||_F^2.

Page 23: KDD tutorial

23

Special case: B = A^T (cont'd)

Improvement for the spectral norm bound for the special case B = AT.

(Rudelson & Vershynin ’04, Vershynin ’04)

• Uses a result of M. Rudelson for random vectors in isotropic position.

• Tight concentration results can be proven using Talagrand’s theory.

• The sampling procedure is slightly different: each column is kept independently with probability proportional to its squared length, so that s columns/rows are kept in expectation.

Page 24: KDD tutorial

24

Empirical evaluation: setup

Database Dense document-concept matrix A with 5000 documents and 320 concepts.

Experiment

Our goal was to identify all document-document matches, i.e., compute AA^T and identify all entries larger than some threshold ε (the proximity problem).

(Data from D. Lewis & E. Cohen, SODA ’97 & J. of Algorithms ’99)

The algorithm

Approximate AA^T (using our method) by CC^T. Find all entries of CC^T that are larger than ε - δ, for some δ > 0 ("candidate entries"). Deterministically compute the dot products for the candidate entries. Return the ones that are larger than ε as matches.

Page 25: KDD tutorial

25

Empirical evaluation: results (ε = 0.85)

Page 26: KDD tutorial

26

Empirical evaluation: results (ε = 0.75)

Page 27: KDD tutorial

27

Experiments: uniform sampling

Page 28: KDD tutorial

28

Random sampling of Linear Programs

Let P^{(i)} denote the i-th column of the r-by-n matrix P and suppose that the following Linear Program is feasible:

With probability at least 1-δ, the following LP is feasible as well:

Form Q, a random non-uniform subset of {1…n}, with |Q| = q. Q is formed in q i.i.d. trials with respect to nonuniform probabilities p_i.

(D., Kannan, & M. TR ‘04 & STACS ’05)

Page 29: KDD tutorial

29

Sample a few variables!

If feasible …

…then feasible (with high probability).

The picture …

Page 30: KDD tutorial

30

Another picture: the i-th constraint is feasible. Discarding variables "perturbs" the constraint; relaxing the constraint guarantees feasibility.

The bound on the relaxation of b emerges by applying the Matrix Multiplication result to the matrix-vector product Px.

Page 31: KDD tutorial

31

A converse LP sampling lemma

Let P^{(i)} denote the i-th column of the r-by-n matrix P and suppose that the following Linear Program is infeasible:

With probability at least 1-δ, the following LP is infeasible as well:

Form Q, a random non-uniform subset of {1…n}, with |Q| = q. Q is formed in q i.i.d. trials with respect to nonuniform probabilities p_i.

Page 32: KDD tutorial

32

Sample a few variables!

If infeasible …

…then infeasible (with high probability).

The picture …

Page 33: KDD tutorial

33

Another picture: the i-th constraint is infeasible. Discarding variables "perturbs" the constraint; tightening the constraint guarantees infeasibility.

The bound on the tightening of b emerges by applying the Matrix Multiplication result to the matrix-vector product Px.

Page 34: KDD tutorial

34

Overview (1/3)

• Data Streaming Models and Random Sampling

• Matrix Multiplication
• Feasibility testing of Linear Programs

• Singular Value Decomposition
• Sampling rows/columns to approximate singular vectors/values
• Sampling elements to approximate singular vectors/values

• CUR Matrix Decomposition
• Lower bounds on sampling complexities

Page 35: KDD tutorial

35

Singular Value Decomposition (SVD)

1. Exact computation of the SVD takes O(min{m n^2, m^2 n}) time.

2. The top few singular vectors/values can be approximated faster (Lanczos/ Arnoldi methods).

A = U Σ V^T, where:

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A.

Page 36: KDD tutorial

36

Rank k approximations (Ak)

Ak is a matrix of rank k such that ||A-Ak||2,F is minimized over all rank k matrices!

This property is very useful in the context of Principal Component Analysis.

U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.

Σ_k: diagonal matrix containing the top k singular values of A.

Also, A_k = U_k U_k^T A.

Page 37: KDD tutorial

37

Approximating SVD in O(n) time

(Frieze, Kannan & Vempala FOCS ‘98, D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D. Kannan, & M. TR ’04, SICOMP ’05)

Given: m x n matrix A

• Sample c columns from A and rescale to form the m x c matrix C.

• Compute the m x k matrix H_k of the top k left singular vectors of C.

Structural Theorem: For any probabilities and number of columns,

||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||A A^T - C C^T||_F.

Algorithmic Theorem: If p_i = |A^{(i)}|^2 / ||A||_F^2 and c ≥ 4 η^2 k / ε^2, then

||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2.

Proof: via matrix multiplication theorem and matrix perturbation theory.
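A small numpy sketch of this column-sampling SVD approximation (assumptions: length-squared probabilities as in the Algorithmic Theorem, and an exact SVD of the small sampled matrix C):

import numpy as np

def linear_time_svd(A, c, k, rng=np.random.default_rng(0)):
    """Sample c columns of A with length-squared probabilities, rescale,
    and return H_k, the top-k left singular vectors of the sampled matrix C."""
    col_norms2 = np.sum(A * A, axis=0)
    p = col_norms2 / col_norms2.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])          # m x c rescaled sample
    H, _, _ = np.linalg.svd(C, full_matrices=False)
    return H[:, :k]                              # approximate top-k left singular vectors

# A is then approximated by the rank-k projection H_k H_k^T A.
A = np.random.randn(512, 512)
Hk = linear_time_svd(A, c=100, k=10)
A_approx = Hk @ (Hk.T @ A)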

Page 38: KDD tutorial

38

Example of randomized SVD

[Figure: the original image matrix A (the "baboon" test image) and, next to it, the matrix C obtained after sampling columns.]

Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix H_k.

Page 39: KDD tutorial

39

Example of randomized SVD (cont’d)

[Figure: the original image A and its approximation H_k H_k^T A.]

A and H_k H_k^T A are close.

Page 40: KDD tutorial

40

Element-wise sampling

The Algorithm in 2 lines:

• To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements.

• Compute a rank k approximation to this sparse matrix (using Lanczos methods).

(Achlioptas & McSherry, STOC ’01, JACM ’05)

More details:

Let p_ij ∈ [0,1] for all i,j. Create the matrix S from A by keeping element A_ij (suitably rescaled) with probability p_ij and zeroing it out otherwise.

If ||A - S||_2 is bounded, then (i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.
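A minimal numpy sketch of element-wise sampling under these assumptions (keep entry A_ij with probability p_ij proportional to A_ij^2 and rescale by 1/p_ij so the sparsified matrix is unbiased; the published scheme has additional refinements):

import numpy as np

def elementwise_sparsify(A, s, rng=np.random.default_rng(0)):
    """Keep roughly s entries of A (probability p_ij ~ A_ij^2), rescaled so E[S] = A."""
    p = np.minimum(s * A**2 / np.sum(A**2), 1.0)   # element-wise keep probabilities
    keep = rng.random(A.shape) < p
    safe_p = np.where(p > 0, p, 1.0)
    return np.where(keep, A / safe_p, 0.0)

A = np.random.randn(2000, 1000)
S = elementwise_sparsify(A, s=50_000)   # sparse surrogate; feed to an iterative SVD solver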

Page 41: KDD tutorial

41

How to use it

Approximating singular values fast:

• Zero out (a large number of) elements of A, scale the remaining ones appropriately.

• Compute the singular values of the resulting sparse matrix using iterative techniques.

• (Good choice for p_ij: p_ij = s A_ij^2 / Σ_{i,j} A_ij^2, where s denotes the expected number of elements that we seek to keep in S.)

• Note: Each element is kept or discarded independently of the others.

Similar ideas have been used to:

• explain the success of Latent Semantic Indexing (LSI)(Papadimitriou, Raghavan, Tamaki, Vempala, PODS ’98 & Azar, Fiat, Karlin, McSherry, Saia STOC ’01)

• design recommendation systems(Azar, Fiat, Karlin, McSherry, Saia STOC ’01 )

• speedup kernel computations(Achlioptas, McSherry, and Schölkopf, NIPS ’02)

Page 42: KDD tutorial

42

Element-wise vs. row/column sampling

Quite different techniques!

• Row/column sampling preserves structural properties of the matrices.

• Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors.

• The two techniques could be complementary!

Some similarities and differences:

• Similar error bounds.

• Element-wise sampling is doable in one pass!

• Running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods.

Page 43: KDD tutorial

43

Overview (1/3)

• Data Streaming Models and Random Sampling

• Matrix Multiplication
• Feasibility testing of Linear Programs

• Singular Value Decomposition
• Sampling rows/columns to approximate singular vectors/values
• Sampling elements to approximate singular vectors/values

• CUR Matrix Decomposition
• Lower bounds on sampling complexities

Page 44: KDD tutorial

44

A novel CUR matrix decomposition

Create an approximation to the original matrix of the following form:

O(1) columns, O(1) rows

Carefully chosen U

An application: Given a query vector x, instead of computing A · x, compute CUR · x to identify its nearest neighbors.

(D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’05)

Page 45: KDD tutorial

45

The CUR decomposition

Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that:

1. C consists of c = O(k/ε^2) columns of A.

2. R consists of r = O(k/ε^2) rows of A.

3. C (R) is created using importance sampling, e.g., columns (rows) are picked in i.i.d. trials with respect to probabilities proportional to their squared Euclidean lengths.

Page 46: KDD tutorial

46

The CUR decomposition (cont’d)

Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that:

4. C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time.

5. The product CUR satisfies (with high probability)

||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

Page 47: KDD tutorial

47

Computing U

Intuition (which can be formalized):

The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A.

• This small subset consists of the rows in R.

• Given a row of A - say A_{(i)} - the algorithm computes a good fit for the row A_{(i)} using the rows in R as the basis, by approximately solving min_u || A_{(i)} - u R ||_2.

Notice that only c = O(1) elements of the i-th row are given as input.

However, a vector of coefficients u can still be computed.
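An illustrative numpy sketch of a CUR-type decomposition. This is an assumption-laden simplification, not the constant-space construction of U analyzed here: it uses length-squared sampling and sets U = C^+ A R^+, which is the Frobenius-optimal U for a given C and R but requires access to all of A:

import numpy as np

def simple_cur(A, c, r, rng=np.random.default_rng(0)):
    """Sample c columns and r rows of A (length-squared probabilities, rescaled)
    and pick U = pinv(C) @ A @ pinv(R)."""
    pc = np.sum(A * A, axis=0); pc = pc / pc.sum()
    pr = np.sum(A * A, axis=1); pr = pr / pr.sum()
    cols = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.randn(1000, 400)
C, U, R = simple_cur(A, c=60, r=60)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)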

Page 48: KDD tutorial

48

Computing U (cont’d)

However, our CUR decomposition approximates the vectors u instead of exactly computing them.

Later in this talk: new error bounds using the optimal coefficients.

Given c elements of A_{(i)}, the algorithm computes a good fit for the row A_{(i)} using the rows in R as the basis, by approximately solving the induced (sampled) least-squares problem.

Page 49: KDD tutorial

49

Error bounds for CUR

Assume A_k is the "best" rank-k approximation to A (through the SVD). Then, if we pick O(k/ε^2) rows and O(k/ε^2) columns,

||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F.

If we pick O(1/ε^2) rows and O(1/ε^2) columns,

||A - CUR||_2 ≤ ||A - A_k||_2 + ε ||A||_F.

Page 50: KDD tutorial

50

Other (randomized) CUR decompositions

Computing U in constant (O(1) instead of O(m+n)) space and time: (D., Kannan, & M. TR '04, SICOMP '05)

• samples O(poly(k, 1/ε)) rows and columns of A & needs an extra pass.

• significantly improves the error bounds of Frieze, Kannan, and Vempala, FOCS '98.

Computing U and R for any C: (D., M., & Muthukrishnan '05)

• For any subset of the columns, denoted C (e.g., chosen by the practitioner).

• Obtain bounds of the form: ||A - CUR||_F ≤ (1 + ε) ||A - C C^+ A||_F.

• Uses ideas from approximating l2 regression problems by random sampling.

• Can combine with recent algorithms/heuristics for choosing columns; more details later.

Page 51: KDD tutorial

51

Other CUR decompositions

Compute sparse low-rank approximations to sparse matrices.

The quasi-Gram-Schmidt (variant of QR) algorithm:
• Input: m x n matrix A.
• Output: m x k matrix C (k columns of A) and upper-triangular k x k matrix S_C (that orthogonalizes these columns).

Apply this algorithm to A and A^T:
• C = Q_C S_C and R^T = Q_R S_R.
• Q_C and Q_R are not explicitly computed.

Then A ≈ CUR, where U is chosen to minimize ||A - CUR||_F^2 for this C and R.

(G.W. Stewart, Num. Math. ’99, TR ’04)

Page 52: KDD tutorial

52

Other CUR decompositions, cont’d

Notes:

• Scattering applications with large matrices with low-rank blocks.
• Related to the maximum-volume concept in interpolation theory.
• They call the CUR decomposition the pseudoskeleton component of A.

If A is approximately rank k to accuracy ε (for sufficiently small ε), then there exist C, R, and W such that CWR approximates A to comparable accuracy.

(Goreinov, Tyrtyshnikov, and Zamarashkin, LAA ’97 & Goreinov, Tyrtyshnikov, Cont. Math. ’01)

Page 53: KDD tutorial

53

Lower Bounds

Question: How many queries does a sampling algorithm need to approximate a given function accurately with high probability?

(Ziv Bar-Yossef '03, '04) Lower bounds for the low-rank matrix approximation problem and the matrix reconstruction problem.

• Any sampling algorithm that w.h.p. finds a good low-rank approximation requires Ω(m+n) queries.

• Even if the algorithm is given the exact weight distribution over the columns of a matrix, it will still require Ω(k/ε^4) queries.

• Finding a matrix D such that ||A-D||_F ≤ ε ||A||_F requires Ω(mn) queries, and finding a D such that ||A-D||_2 ≤ ε ||A||_2 requires Ω(m+n) queries.

Applied to our results:

• The LinearTimeSVD algorithm is optimal w.r.t. ||·||_F bounds.

• The ConstantTimeSVD algorithm is optimal w.r.t. ||·||_2 bounds up to poly factors.

• The CUR algorithm is optimal for constant ε.

Page 54: KDD tutorial

54

Overview (2/3)

• Applications of Matrix CUR
• Designing approximation algorithms
• Data mining
• Microarray data
• Recommendation Systems

• Tensor-based data sets
• Tensor-CUR

• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems

Page 55: KDD tutorial

55

Approximating Max-Cut

Max-Cut (NP-hard for general and dense graphs, Max-SNP)

Given a graph G=(V,E), |V|=n, partition V in two disjoint subsets V1 and V2 such that the number of edges of E that have one endpoint in V1 and one endpoint in V2 is maximized.

Goemans & Williamson '94: .878-approximation algorithm (might be tight!)

Arora, Karger & Karpinski '95: LP-based, εn^2 additive error (*)

De La Vega '96: combinatorial methods, εn^2 additive error

Frieze & Kannan '96: linear-algebraic methods, εn^2 additive error (*)

Goldreich, Goldwasser & Ron '96: property testing, εn^2 additive error

Alon, De La Vega, Kannan & Karpinski '02, '03: cut decomposition, εn^2 additive error (*)

All of the above algorithms run in constant time/space.

(*) More generally, they achieve εn^r additive error for all problems in Max-r-CSP (= Max-SNP).

Page 56: KDD tutorial

56

Weighted Max-Cut via CUR

(D., Kannan, & M. TR ’04 and STACS ’05)

Let A denote the adjacency matrix of G. All previous algorithms also guarantee an εn^2 A_max additive-error approximation for the weighted Max-Cut problem, where A_max is the maximum edge weight in the graph, in constant time and space after reading the graph a small number of times.

Our result:

We can approximate the weighted Max-Cut of G(V,E) up to additive error ε n ||A||_F.

Page 57: KDD tutorial

57

The algorithm (sketch)

Let A denote the adjacency matrix of the weighted graph G=(V,E).

Step 1: Replace A by its ConstantTimeCUR decomposition, which guarantees ||A - CUR||_2 ≤ ε ||A||_F.

Step 2: Compute the Max-Cut of CUR, thus getting an approximation to the Max-Cut of A; the additive error follows from the bound on ||A - CUR||_2.

It turns out that we can solve the problem on CUR in constant time and constant space.

Page 58: KDD tutorial

58

Data mining with CUR

We are given m (> 10^6) objects and n (> 10^5) features describing the objects.

Database

An m-by-n matrix A (Aij shows the “importance” of feature j for object i).

E.g., m documents, represented w.r.t. n terms.

Queries

Given a new object x, find similar objects in the database (nearest neighbors).

Page 59: KDD tutorial

59

Data mining with CUR (cont’d)

Key observation: The exact value x^T · d might not be necessary.

1. The feature values in the vectors are set by coarse heuristics.

2. It is in general enough to check whether x^T · d > threshold.

[Figure: objects x and d as vectors in feature space (axes: feature 1, feature 2), with the angle θ(d,x) between them.]

Two objects are "close" if the angle between their corresponding (normalized) vectors is small. So,

x^T · d = cos(θ(x,d))

is high when the two objects are close.

A · x computes all the angles and answers the query.

Page 60: KDD tutorial

60

Data mining with CUR (cont’d)

Assume that CUR is an approximation to A, such that CUR is stored efficiently (e.g. in RAM).

Given a query vector x, instead of computing A · x, compute CUR · x to identify its nearest neighbors.

The CUR algorithm guarantees a bound on the worst case choice of x.

(Also recall how we used the matrix multiplication algorithm to solve the proximity problem, namely find all pairs of objects that are “close”.)
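A tiny sketch of answering such a query from the stored sketch (function and variable names are illustrative):

import numpy as np

def query_with_cur(C, U, R, x, threshold):
    """Approximate the scores A @ x by C @ (U @ (R @ x)) and return the
    indices of objects whose approximate score exceeds the threshold."""
    scores = C @ (U @ (R @ x))    # never touches the full matrix A
    return np.nonzero(scores > threshold)[0], scores

The parenthesization matters: computing R @ x first keeps every intermediate vector small, so (for constant c and r) a query costs time linear in m + n instead of O(mn).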

Page 61: KDD tutorial

61

Genetic microarray data and CUR

Exploit structural properties of CUR in biological applications:

Common in Biological/Chemical/Medical applications of PCA:

• Explain the singular vectors, by mapping them to meaningful biological processes.

• This is a “challenging” task (think: reification) !

CUR is a low-rank decomposition in terms of the data that practitioners understand.

• Use it to explain the data and do dimensionality reduction, classification, clustering.

Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).

Can we find a "good" set of genes and arrays to include in C and R? Provable and/or heuristic strategies are acceptable.

[Figure: the data matrix, with genes along one axis and experimental conditions along the other.]

Page 62: KDD tutorial

62

Recommendation Systems

• (Kumar, Raghavan, Rajagopalan & Tomkins FOCS ’98)

Assuming strong clustering of the products, they offer competitive algorithms even with only 2 samples/customer.

• (Azar, Fiat, Karlin, McSherry & Saia STOC '01)

Assuming sampling of Ω(mn) entries of A and a certain gap requirement, they (very) accurately recreate A.

The problem:

Assume the existence of m customers and n products. Then, A is an (unknown) matrix s.t. Aij is the utility of product j for customer i.

Our goal is to recreate A from a few samples, s.t. we can recommend high utility products to customers.

(D., Raghavan, & Kerenidis, STOC ’02)

Page 63: KDD tutorial

63

Recommendation Systems

Critical assumption (supported by experimental evidence):

There exists a small, constant (say, k) number of different customer types or, equivalently, the rows of the matrix A are (approximately) linearly dependent or, equivalently, the matrix A has a "good" low-rank (rank-k) approximation.

The problem:

Assume the existence of m customers and n products. Then, A is an (unknown) matrix s.t. Aij is the utility of product j for customer i.

Our goal is to recreate A from a few samples, s.t. we can recommend high utility products to customers.

Page 64: KDD tutorial

64

Recommendation Systems (cont’d)

Question:

Can we get competitive performance by sampling fewer than Ω(mn) elements?

[Figure: the customers × products utility matrix, with a sample of customers ("guinea pigs") and a sample of entries obtained from purchases and small surveys.]

Answer:

Apply the CUR decomposition.

Page 65: KDD tutorial

65

Recommendation Systems (cont’d)

Details:

Sample a constant number of rows and columns of A and compute A' = CUR. Assuming that A has a "good" low-rank approximation, the CUR error bound guarantees that A' is close to A.

This implies that we can make relatively accurate recommendations with far fewer samples.

Page 66: KDD tutorial

66

Overview (2/3)

• Applications of Matrix CUR
• Designing approximation algorithms
• Data mining
• Microarray data
• Recommendation Systems

• Tensor-based data sets
• Tensor-CUR

• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems

Page 67: KDD tutorial

67

Datasets modeled as tensors

Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices) using a small number of samples.

[Figure: an m × n × p tensor A, with modes 1, 2, and 3 labeled.]

Q. What do we know about tensor decompositions?

A. Not much, although tensors arise in numerous applications.

Page 68: KDD tutorial

68

Tensors in Applications

Tensors appear both in Math and CS.

• Represent high dimensional functions.

• Connections to complexity theory (i.e., matrix multiplication complexity).

• Statistical applications (i.e., Independent Component Analysis, higher order statistics, etc.).

• Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging)

Problem: There does not exist a definition of tensor rank (and an associated tensor SVD) with the nice properties found in the matrix case.

(Lek-Heng Lim ’05: strong impossibility results!)

Heuristic solution: “unfold” the tensor along a mode and apply Linear Algebra.

Page 69: KDD tutorial

69

Tensor rank

A definition of tensor rank

Given a tensor A ∈ R^{n_1 × n_2 × … × n_d}, find the minimum number of rank-one tensors (outer products of d vectors) into which it can be decomposed.

• agrees with matrix rank for d = 2
• related to computing bilinear forms and algebraic complexity theory
• some approximations are possible, see DKKV '05

BUT

• computing it is NP-hard
• only weak bounds are known
• tensor rank depends on the underlying ring of scalars
• successive rank-one approximations are no good

(Hastad, J of Algorithms ’90, DeLaVega, Kannan, Karpinski, Vempala STOC ’05)

Page 70: KDD tutorial

70

Tensor α-rank

The α-rank of a tensor

Given A ∈ R^{n_1 × … × n_d}, "unfold" it along mode α to create the matrix A_[α], and compute the rank of that matrix, called the α-rank.

Pros:

• Easily computable.

Cons:

• the different α-ranks are, in general, different,

• information is lost.

Page 71: KDD tutorial

71

Tensors in real applications

3 classes of tensors in data applications

1. All modes are comparable (e.g., tensor faces, chemometrics)

2. A priori dominant mode (e.g., coarse scales, microarrays vs. time, images vs. frequency)

3. All other combinations

(D. and M. ’05) TensorSVD paper deals with (1).

We will focus on (2), where there is a preferred mode.

Page 72: KDD tutorial

72

The TensorCUR algorithm (3-modes)

Choose the preferred mode (time)

Pick a few representative snapshots:

Express all the other snapshots in terms of the representative snapshots.

[Figure: an m genes × n environmental conditions × p time/frequency slices tensor; 2 representative time slices (snapshots) are randomly sampled.]

Page 73: KDD tutorial

73

The TensorCUR algorithm (cont’d)

Let R denote the tensor of the sampled snapshots.

First, pick a constant number of "fibers" of the tensor A (the red dotted lines in the figure). Then express the remaining snapshots as linear combinations of the sampled snapshots.

[Figure: the m genes × n environmental conditions × p time steps tensor, with the 2 sampled snapshots and the sampled fibers highlighted.]

Page 74: KDD tutorial

74

The TensorCUR algorithm (cont’d)

Theorem: the error of the TensorCUR approximation is bounded in terms of the best rank-k_α approximation to the unfolded matrix A_[α].

The reconstruction: unfold R along the α dimension and pre-multiply by CU.

How to proceed:

Can recurse on each sub-tensor in R,

or do SVD, exact or approximate,

or do kernel-based diffusion analysis,

or do wavelet-based methods.

TensorCUR:

Framework for dealing with very large tensor-based data sets,

to extract a “sketch” in a principled and optimal manner,

which can be coupled with more traditional methods of data analysis.

Page 75: KDD tutorial

75

Overview (2/3)

• Applications of Matrix CUR
• Designing approximation algorithms
• Data mining
• Microarray data
• Recommendation Systems

• Tensor-based data sets
• Tensor-CUR

• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems

Page 76: KDD tutorial

76

TensorCUR on scientific data sets

Apply random sampling methodology and kernel-based Laplacian methods to large physical and chemical and biological data sets.

Common application areas of large data set analysis:
• telecommunications,
• finance,
• web-based modeling, and
• astronomy.

Scientific data sets are quite different:
• with respect to their size,
• with respect to their noise properties, and
• with respect to the available field-specific intuition.

Page 77: KDD tutorial

77

Data sets being considered

Sequence and mutational data from G-protein coupled receptors • to identify mutants with enhanced stability properties.

Genomic microarray data • to understand large-scale genomic and cellular behavior.

Hyperspectral colon cancer data • for improved detection of anomalous behavior.

Time-resolved fMRI data • to better represent large, complex visual brain-imaging data.

Simulational data • to more efficiently conduct large scale computations.

Page 78: KDD tutorial

78

Page 79: KDD tutorial

79

Page 80: KDD tutorial

80

Sampling hyperspectral data

Sample slabs depending on total absorption:

Sample fibers uniformly (since intensity depends on stain).

Page 81: KDD tutorial

81

Look at the exact 65-th slab.

Page 82: KDD tutorial

82

The 65-th slab approximately reconstructed

This slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.

Page 83: KDD tutorial

83

Tissue Classification - Exact Data

Page 84: KDD tutorial

84

Tissue Classification - Ns=12 & Nf=1000

Page 85: KDD tutorial

85

TensorCUR on Recommendation Systems

Our previous setup:

Assume the existence of m customers and n products. Then, A is an (unknown) matrix s.t. Aij is the utility of product j for customer i.

Our goal is to recreate A from a few samples, s.t. we can recommend high utility products to customers.

[Figure: the customers × products utility matrix, with a sample of customers ("guinea pigs") and a sample of entries obtained from purchases and small surveys.]

Page 86: KDD tutorial

86

Recommendation Systems (cont’d)

Comment:

It is folklore knowledge in economics literature that product utility is an ordinal and not a cardinal quantity.

Thus, it is more natural to compare products than assign utility values.

Model revisited:

Every customer has an n-by-n matrix (whose entries are ±1) that represents pairwise product comparisons.

Overall, there are m such matrices, forming an n-by-n-by-m 3-mode array or a three-dimensional tensor, denoted by A.

We seek to extract the “structure” of this tensor by sampling.

Page 87: KDD tutorial

87

Recommendation Systems (cont’d)

Our goal:

Recreate the tensor A from a few samples in order to recommend high utility products to the customers.

[Figure: an n products × n products × m customers tensor of pairwise comparisons, one n × n slice per customer.]

(Similar to previous discussion.)

Page 88: KDD tutorial

88

Overview (3/3)

• Kernel-CUR and the Nystrom Method

• Regression problems
• Least squares problems

• The CUR algorithm revisited
• A general bound
• Iterative sampling of rows/columns
• "Optimal" choices of rows/columns
• Application: finding ht-SNPs

• Conclusions & open problems

Page 89: KDD tutorial

89

Motivation for Kernels (1 of 3)

Methods to extract linear structure from the data:

• Support Vector Machines (SVMs).

• Gaussian Processes (GPs).

• Singular Value Decomposition (SVD) and the related PCA.

Kernel-based learning methods to extract non-linear structure:

• Choose features to define a (dot product) space F.

• Map the data, X, to F by Φ: X → F.

• Do classification, regression, and clustering in F with linear methods.

Page 90: KDD tutorial

90

Motivation for Kernels (2 of 3)

• Use dot products for information about mutual positions.

• Define the kernel or Gram matrix: G_ij = k_ij = (Φ(X^{(i)}), Φ(X^{(j)})).

• Algorithms that are expressed in terms of dot products may be given as input the Gram matrix G instead of the data covariance matrix XTX.

Note: Isomap, LLE, graph Laplacian eigenmaps, Hessian eigenmaps, SDE (dimensionality reduction methods for nonlinear manifolds) are kernel-PCA for particular Gram matrices.

Note: for Mercer kernels, G is SPSD.

Page 91: KDD tutorial

91

Motivation for Kernels (3 of 3)

If the Gram matrix G, where G_ij = k_ij = (Φ(X^{(i)}), Φ(X^{(j)})), is dense but has low numerical rank, then calculations of interest still need O(n^2) space and O(n^3) time:

• matrix inversion in GP prediction,

• quadratic programming problems in SVMs,

• computation of eigendecomposition of G.

Relevant recent work using low-rank methods:

• (Achlioptas, McSherry, and Schölkopf, NIPS ’02), randomized kernels.

• Nystrom method for out-of-sample extensions (Williams and Seeger, NIPS '01, and others).

Page 92: KDD tutorial

92

Kernel-CUR

Main algorithm:

• Randomized algorithm to approximate a Gram matrix.

• Low-rank approximation in terms of columns (and rows) of G=XTX.

Main quality-of-approximation theorem:

• Provably good approximation if nonuniform probabilities are used.

Discussion of the Nystrom method:

• Nystrom method for integral equations and matrix problems.

• Relationship to randomized SVD and CUR algorithms.

(D. & M., COLT ’05, TR ’05)

Page 93: KDD tutorial

93

Kernel-CUR Algorithm

Input: n x n SPSD matrix G, probabilities {p_i, i = 1,…,n}, c ≤ n, and k ≤ c.

Output: n x c matrix C, and c x c matrix W_k^+ (s.t. C W_k^+ C^T ≈ G).

Algorithm:

• Pick c columns of G in i.i.d. trials, with replacement and with respect to the {p_i}; let L be the set of indices of the sampled columns.

• Scale each sampled column (with index i ∈ L) by dividing it by (c p_i)^{1/2}.

• Let C be the n x c matrix containing the rescaled sampled columns.

• Let W be the c x c submatrix of G with entries G_ij / (c p_i^{1/2} p_j^{1/2}), for i, j ∈ L.

• Compute W_k^+.
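A numpy sketch of this construction under the stated assumptions (diagonal-based probabilities; exact rank-k pseudoinverse of the small matrix W):

import numpy as np

def kernel_cur(G, c, k, rng=np.random.default_rng(0)):
    """Approximate an SPSD Gram matrix G by C @ pinv_k(W) @ C.T."""
    n = G.shape[0]
    d = np.diag(G)
    p = d**2 / np.sum(d**2)                        # probabilities from the diagonal of G
    idx = rng.choice(n, size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                          # n x c rescaled sampled columns
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)   # c x c rescaled block
    U, s, _ = np.linalg.svd(W)                     # W is SPSD, so this is its eigendecomposition
    s_k = s[:k]
    s_inv = np.zeros_like(s_k)
    s_inv[s_k > 1e-12] = 1.0 / s_k[s_k > 1e-12]
    Wk_pinv = U[:, :k] @ np.diag(s_inv) @ U[:, :k].T
    return C @ Wk_pinv @ C.T

# Example on a small RBF Gram matrix.
X = np.random.randn(500, 5)
sq = np.sum(X**2, axis=1)
G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))
G_approx = kernel_cur(G, c=80, k=20)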

Page 94: KDD tutorial

94

Notes on the algorithm

Note the structural simplicity of the algorithm:

• C consists of a small number of representative data points.

• W consists of the induced subgraph defined by those points.

Computational resource requirements:

•Assume the data X (or Gram matrix G) are stored externally.

•Algorithm performs two passes over the data.

•Algorithm uses O(n) additional scratch space and additional computation time.

Page 95: KDD tutorial

95

Analysis of Kernel-CUR

Let ε > 0 and η = O(log(1/δ)).

Construct an approximation C W_k^+ C^T with the Kernel-CUR algorithm by sampling c columns of G with probabilities p_i = G_ii^2 / Σ_i G_ii^2.

If c = O(k log(1/δ)/ε^4), then with probability at least 1-δ, the Frobenius-norm error of C W_k^+ C^T exceeds that of the best rank-k approximation G_k by only an additive ε-term.

If c = O(log(1/δ)/ε^4), then with probability at least 1-δ, the analogous spectral-norm guarantee holds.

Page 96: KDD tutorial

96

Interpreting the sampling probabilities

Our sampling probabilities ignore correlations:

p_i = G_ii^2 / Σ_i G_ii^2 = |X^{(i)}|^2 / ||X||_F^2

If the sampling probabilities were:

p_i = |G^{(i)}|^2 / ||G||_F^2

• they would provide a bias towards data points that are more "important" - longer and/or more representative.

• the additional error would be ε ||G||_F and not ε Σ_i G_ii^2 = ε ||X||_F^2 (where G = X^T X).

Page 97: KDD tutorial

97

The Nystrom Method (1 of 3)

Consider the eigenfunction problem:

Discretize with a quadrature rule ({wj} are the weights and {sj} are the quadrature points):

Choose a set {xi} of Nystrom points (often the same as {si}):

Extend the eigenvectors from the Nystrom points to all of D:
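A standard way to write the classical Nystrom recipe sketched above (the eigenfunction problem, its quadrature discretization, and the extension formula):

∫_D k(x,y) φ(y) dy = λ φ(x)

Σ_{j=1}^{q} w_j k(x, s_j) φ(s_j) = λ φ(x)

φ̃(x) = (1/λ) Σ_{j=1}^{q} w_j k(x, x_j) φ̃(x_j),

so an eigenvector computed only at the Nystrom points {x_j} can be evaluated at any x in D.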

Page 98: KDD tutorial

98

The Nystrom Method (2 of 3)

Page 99: KDD tutorial

99

The Nystrom Method (3 of 3)

Randomized SVD Algorithms

• Randomly sample columns (or rows).

• Compute/approximate low-dimensional singular vectors.

• Nystrom-extend to approximate Hk, the high-dimensional singular vectors.

• Bound ||A - H_k H_k^T A||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

Randomized CUR Algorithms

• Randomly sample columns and rows.

• Bound ||A - CUR||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

• Do not need or use the SPSD property.

Page 100: KDD tutorial

100

Fast Computation with Kernels

To construct a coarse-grained version of the data graph:

Construct landmarks,

Partition/Quantization,

Diffusion wavelets.

Kernel-based diffusion

Adjacency matrix, t=0

To construct landmarks, randomly sample with the "right" probabilities: nonuniform sampling to catch outliers, or uniform sampling otherwise.

Adjacency matrix, t=t*

Page 101: KDD tutorial

101

Overview (3/3)

• Kernel-CUR and the Nystrom Method

• Regression problems
• Least squares problems

• The CUR algorithm revisited
• A general bound
• Iterative sampling of rows/columns
• "Optimal" choices of rows/columns
• Application: finding ht-SNPs

• Conclusions & open problems

Page 102: KDD tutorial

102

Regression problems

We seek sampling-based algorithms for solving l2 regression: given an n x d matrix A and an n-vector b, find x minimizing ||Ax - b||_2.

We are interested in overconstrained problems, n >> d. Typically, there is no x such that Ax = b.

(D., M. & Muthukrishnan, ’05)

Page 103: KDD tutorial

103

"Induced" regression problems

[Figure: the induced problem is built from sampled "rows" of b and the corresponding sampled rows of A, with scaling to account for undersampling.]

Page 104: KDD tutorial

104

Regression problems, definition

There is work by K. Clarkson in SODA 2005 on sampling-based algorithms for l1 regression (p = 1) for overconstrained problems.

Page 105: KDD tutorial

105

Exact solution

x_opt = A^+ b, where A^+ is the pseudoinverse of A; A x_opt = A A^+ b is the projection of b on the subspace spanned by the columns of A.

Page 106: KDD tutorial

106

Singular Value Decomposition (SVD)

A = U Σ V^T, where U (V) is the orthogonal matrix containing the left (right) singular vectors of A, Σ is the diagonal matrix containing the singular values of A, and ρ is the rank of A.

Computing the SVD takes O(d^2 n) time. The pseudoinverse of A is A^+ = V Σ^{-1} U^T.

Page 107: KDD tutorial

107

Questions …

Can sampling methods provide accurate estimates for l2 regression?

Is it possible to approximate the optimal vector and the optimal value Z_2 by looking only at a small sample from the input?

(Even if it takes some sophisticated oracles to actually perform the sampling …)

Equivalently, is there an induced subproblem of the full regression problem whose optimal solution and value Z_{2,s} approximate the optimal solution and its value Z_2?

Page 108: KDD tutorial

108

Creating an induced subproblem

Algorithm

1. Fix a set of probabilities pi, i=1…n, summing up to 1.

2. Pick r indices from {1…n} in r i.i.d. trials, with respect to the pi’s.

3. For each sampled index j, keep the j-th row of A and the j-th element of b; rescale both by (1/rpj)1/2.
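A numpy sketch of this sampling scheme. It uses probabilities based on the row norms of the left singular factor of A (in line with the conditions discussed a few slides below); computing them exactly costs O(nd^2), which the tutorial itself notes is too expensive, so treat this purely as an illustration of the sampling and rescaling steps:

import numpy as np

def sampled_l2_regression(A, b, r, rng=np.random.default_rng(0)):
    """Sample r rows of (A, b), rescale by 1/sqrt(r p_j), and solve the induced problem."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # n x d left singular vectors
    lev = np.sum(U * U, axis=1)                      # squared row norms of U (sum to d)
    p = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    A_s = A[idx, :] * scale[:, None]
    b_s = b[idx] * scale
    x_s, *_ = np.linalg.lstsq(A_s, b_s, rcond=None)
    return x_s

n, d = 100_000, 20
A = np.random.randn(n, d)
b = A @ np.random.randn(d) + 0.1 * np.random.randn(n)
x_approx = sampled_l2_regression(A, b, r=2000)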

Page 109: KDD tutorial

109

The induced subproblem

sampled elements of b, rescaled

sampled rows of A, rescaled

Page 110: KDD tutorial

110

Our results

If the p_i satisfy certain conditions, then with probability at least 1-δ,

The sampling complexity is

Page 111: KDD tutorial

111

Our results, cont’d

If the p_i satisfy certain conditions, then with probability at least 1-δ,

The sampling complexity is

κ(A): condition number of A

Page 112: KDD tutorial

112

Back to induced subproblems …

sampled elements of b, rescaled

sampled rows of A, rescaled

The relevant information for l2 regression if n >> d is contained in an induced subproblem of size O(d2)-by-d.

(upcoming writeup: we can reduce the sampling complexity to r = O(d).)

Page 113: KDD tutorial

113

Conditions on the probabilities, SVD

Let U(i) denote the i-th row of U.

Let U^⊥ ∈ R^{n × (n-ρ)} denote the orthogonal complement of U.

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A.

ρ: rank of A.

Page 114: KDD tutorial

114

Conditions for the probabilities

The conditions that the p_i must satisfy, for some β_1, β_2, β_3 ∈ (0,1]: the p_i must not be too much smaller than natural SVD-based quantities, namely the (squared) lengths of the rows of the matrix of left singular vectors of A, and the size of the component of b not in the span of the columns of A.

The sampling complexity grows as the β_i shrink: small β_i ⇒ more sampling.

Page 115: KDD tutorial

115

Computing “good” probabilities

Open question: can we compute “good” probabilities faster, in a pass efficient manner?

Some assumptions might be acceptable (e.g., bounded condition number of A, etc.)

In O(nd^2) time we can easily compute p_i's that satisfy all three conditions, with β_1 = β_2 = β_3 = 1/3.

(Too expensive!)

Page 116: KDD tutorial

116

Overview (3/3)

• Kernel-CUR and the Nystrom Method

• Regression problems
• Least squares problems

• The CUR algorithm revisited
• A general bound
• Iterative sampling of rows/columns
• "Optimal" choices of rows/columns
• Application: finding ht-SNPs

• Conclusions & open problems

Page 117: KDD tutorial

117

Revisiting the CUR decomposition

1. How do we draw the rows and columns of A to include in C and R?

2. How do we construct U?

Create an approximation to A, using rows and columns of A

O(1) columns, O(1) rows

Carefully chosen U

Goal: provide (good) bounds for some norm of the error matrix

A – CUR

(D., M. & Muthukrishnan, ’05)

Page 118: KDD tutorial

118

A simple algorithm

Create an approximation to A, using rows and columns of A

O(1) columns, O(1) rows

Carefully chosen U

Algorithm

Step 1: pick a few columns of A and include them in C (“basis” columns).

(Any set of columns will do; “good” choices will be discussed later)

Step 2: express all columns of A as linear combinations of the “basis” columns.

Page 119: KDD tutorial

119

Step 2

The second step is trivial; it amounts to solving an l2 regression problem, expressing the i-th column of A as a linear combination of the "basis" columns.

Unfortunately, it necessitates looking at every element of A^{(i)}, and it expresses A as C C^+ A.

Can we approximately find the vector of coefficients by looking only at a few elements of A^{(i)}? Yes!

Page 120: KDD tutorial

120

Step 3

Pick a few (r) rows of A and solve the "induced" sampled l2 regression problems, using these rows, for every i ∈ {1…n} (i.e., for every column of A).

(Not straightforward from the l2 regression results.)

[Figure: the i-th column of R (the sampled elements of A^{(i)}) is fit against the sampled rows of C; D is a diagonal rescaling matrix.]

Page 121: KDD tutorial

121

Error bounds (given C)

In O(c^2 m + cmn) = O(mn) time, we can construct D and R such that the resulting CUR approximation satisfies the (1+ε)-type bound with probability at least 1-δ. We need to pick r = O(c^2 log(1/δ)/ε^2) rows.

(upcoming writeup: we can reduce c^2 to c.)

Constructing D and R: we compute a set of probabilities p_i and sample and rescale rows of A in r i.i.d. trials with respect to the p_i's.

Page 122: KDD tutorial

122

Forming the pi (1/3)

Let U(i) denote the i-th row of U.

Let U^⊥ ∈ R^{m × (m-ρ)} denote the orthogonal complement of U.

U (V): orthogonal matrix containing the left (right) singular vectors of C.

Σ: diagonal matrix containing the singular values of C.

ρ: rank of C.

Page 123: KDD tutorial

123

Forming the pi (2/3)

Compute p_i that satisfy, for some β_1, β_2, β_3 ∈ (0,1], conditions analogous to those for l2 regression: the p_i must not be too much smaller than the (squared) lengths of the rows of the matrix of left singular vectors of C, and the size of the component of A not in the span of the columns of C.

The sampling complexity grows as the β_i shrink: small β_i ⇒ more sampling.

Page 124: KDD tutorial

124

Forming the pi (3/3)

In O(c^2 m + mnc) = O(mn) time we can easily compute p_i that satisfy all the constraints, with β_1, β_2, β_3 all equal to 1/3.

Page 125: KDD tutorial

125

Overall decomposition

[Figure: A ≈ C U R, where C consists of columns of A, R consists of rows of A, the "intersection" of C and R is an r × c matrix, and a diagonal rescaling matrix (r × c) accounts for the sampling.]

Page 126: KDD tutorial

126

Overall decomposition (a variant)

[Figure: A ≈ C U R, with C columns of A, R rows of A, and U built from the "intersection" of C and R.]

(Goreinov, Tyrtyrshnikov, & Zamarashkin LAA ’98)

Error bounds hold if C and R contain the columns and rows of A that define a parallelepiped of maximal volume.

Page 127: KDD tutorial

127

A (first) choice for C

(D., Frieze, Kannan, Vempala & Vinay ’99, ‘03 and D., Kannan, & M. TR ‘04 & SICOMP ‘05)

(For any k) Assume A_k is the "best" rank-k approximation to A (through the SVD). Then, after making two passes through A, we can construct a matrix C such that, with probability at least 1-δ,

||A - C C^+ A||_F ≤ ||A - A_k||_F + ε ||A||_F.

We need to pick O(k log(1/δ) / ε^2) columns to include in C.

(Columns are picked with probability proportional to their Euclidean length squared; recall the SVD algorithm early in this tutorial.)

Page 128: KDD tutorial

128

CUR error

(For any k) In O(mn) time, we can construct C, U, and R such that

||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F

holds with probability at least 1-δ, by picking

O(k log(1/δ) / ε^2) columns, and

O(k^2 log^3(1/δ) / ε^6) rows.

Page 129: KDD tutorial

129

A (second) choice for C

(Rademacher, Vempala & Wang ’04, D. & M. TR ’05)

(For any k) Assume A_k is the "best" rank-k approximation to A (through the SVD). Then, after making 2t passes through A, we can construct a matrix C such that, with probability at least 1-δ, the error ||A - C C^+ A||_F is correspondingly smaller (the bound improves with the number of rounds t).

We need to pick columns of A in t rounds, picking O(k log(1/δ) / ε^2) columns per round.

Page 130: KDD tutorial

130

The MultiPass sampling algorithm

R = A; S = ∅;

For i = 1, …, t:

1. Pick c columns of A, sampling each column with probability proportional to its squared Euclidean length in the current residual R.

2. S = S ∪ {sampled columns from step 1};

3. R = A - P_S A;

End For

In words, in each round we subtract from A its projection onto the subspace spanned by the columns included so far in the sample.
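A numpy sketch of this multi-pass adaptive sampling, under the reading that each round samples columns of A with probabilities proportional to the squared column lengths of the current residual R:

import numpy as np

def multipass_column_sample(A, c, t, rng=np.random.default_rng(0)):
    """t rounds: sample c columns (residual length-squared probabilities),
    add them to S, then set the residual R = A - P_S A."""
    S_idx = []
    cols = []
    R = A.copy()
    for _ in range(t):
        norms2 = np.sum(R * R, axis=0)
        if norms2.sum() == 0:
            break
        p = norms2 / norms2.sum()
        S_idx.extend(rng.choice(A.shape[1], size=c, replace=True, p=p))
        cols = sorted(set(int(j) for j in S_idx))
        Q, _ = np.linalg.qr(A[:, cols])      # orthonormal basis for span(S)
        R = A - Q @ (Q.T @ A)                # residual: A minus its projection onto span(S)
    return cols

A = np.random.randn(1000, 300)
selected_columns = multipass_column_sample(A, c=10, t=3)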

Page 131: KDD tutorial

131

Intuition

R = A; S = ∅;

For i = 1, …, t:

1. Pick c columns of A, sampling each column with probability proportional to its squared Euclidean length in the current residual R.

2. S = S ∪ {sampled columns from step 1};

3. R = A - P_S A;

End For

A simple idea: step 3 allows us to focus on the subspace of A that is not "sufficiently" spanned by the current selection of columns.

Page 132: KDD tutorial

132

CUR error

(For any k) In O(mnt) time, we can construct C, U, and R such that the CUR error bound holds with probability at least 1-δ, by picking

O(t k log(1/δ) / ε^2) columns, and

O(t^2 k^2 log^3(1/δ) / ε^6) rows.

Page 133: KDD tutorial

133

A (third) choice for C

(Rademacher, Vempala & Wang ’04)

(For any k) Assume A_k is the "best" rank-k approximation to A (through the SVD). Then, there exist c = O(k^2/ε^2) columns of A such that

||A - C C^+ A||_F ≤ (1 + ε) ||A - A_k||_F.

( It seems possible to use the maximal-volume work by Goreinov et al. to design randomized strategies to pick such columns. )

Page 134: KDD tutorial

134

CUR error

(For any k) In O(nc + mn) time, we can construct C, U, and R such that the corresponding error bound holds with probability at least 1-δ, by picking

O(k^2 / ε^2) columns, and

O(k^4 log(1/δ) / ε^6) rows.

Notice that we can pick the rows efficiently, given the columns.

Page 135: KDD tutorial

135

Example: finding ht-SNPs

Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome.

(Locations at the human genome where typically two alternate nucleotide bases are observed.)

Why are SNPs important: they occur quite frequently within the genome allowing the tracking of disease genes and population histories.

Thus, they are effective markers for genomic research!

There are ∼10 million SNPs in the human genome (under some assumptions).

(data from K. Kidd’s lab at Yale University, joint work with Dr. Paschou at Yale University)

Page 136: KDD tutorial

136

Research topics

Research Topics (working within a specific chromosomal region of the human genome):

(i) Are different SNPs correlated, either within a population, or across different populations? (Folklore knowledge: yes).

(ii) Find a “good” set of haplotype tagging-SNPs capturing the diversity of a chromosomal region of the human genome.

(iii) Is extrapolation feasible? (Recall Nystrom/CUR extrapolation.) This would save (a lot of) time and money!

Existing literature

Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and co-occurrences.

Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very difficult to generalize.

Exhaustive and semi-exhaustive algorithms in order to pick “good” ht-SNPs that have small LD distance with all other SNPs.

Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.

Page 137: KDD tutorial

137

The raw data

Notes:

• Each SNP genotype consists of two alleles.

• Observed SNPs are in known positions in the human genome.

• Exactly two nucleotides appear in each column.

[Table: raw genotype data; rows are individuals, columns are SNPs.]

AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG

GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA

GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG

GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG

GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA

GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA

GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA

Page 138: KDD tutorial

138

Encoding the data

How?

• Exactly two nucleotides (out of A,G,C,T) appear in each column.

• Thus, the two alleles might be both equal to the first one (encode by +1), both equal to the second one (encode by -1), or different (encode by 0).

[The encoded individuals × SNPs matrix is shown below.]

Notes

• Order of the alleles is irrelevant, so TG is the same as GT.

• Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes) from encoding GG to -1 and TT to +1. (Flipping the sign of a column of a matrix leaves its singular values and left singular vectors unchanged; only the corresponding entries of the right singular vectors change sign.)

0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1

-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1
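A small Python sketch of this encoding rule (the choice of which observed nucleotide maps to +1 is arbitrary, per the note above; here the alphabetically first one is used):

def encode_snp(column):
    """Encode one SNP column of two-letter genotypes (e.g. 'AG', 'GG') to {+1, 0, -1}."""
    alleles = sorted({base for genotype in column for base in genotype})
    assert len(alleles) == 2, "exactly two nucleotides are assumed per SNP"
    first, second = alleles
    codes = []
    for genotype in column:
        if genotype[0] != genotype[1]:
            codes.append(0)            # heterozygous
        elif genotype[0] == first:
            codes.append(+1)           # both alleles equal the first nucleotide
        else:
            codes.append(-1)           # both alleles equal the second nucleotide
    return codes

# First SNP of the seven individuals shown above:
print(encode_snp(["AG", "GG", "GG", "GG", "GG", "GG", "GG"]))   # [0, -1, -1, -1, -1, -1, -1]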

Page 139: KDD tutorial

139

Evaluating (linear) structure

For each population

• We examined samples from 38 different populations, with an average size of 50 subjects. For each subject, we were given 63 SNPs, covering a chromosomal region of the human genome which is ≈ 900,000 bases long, close to the end of the long arm of chromosome 17.

• For each population, we ran SVD to determine the “optimal” number k of principal components that are necessary in order to cover 90% of its spectrum.

Expressing columns of A as linear combinations of

the top k left singular vectors

This means that if we select the top k left singular vectors U_k, we can express every column (i.e., SNP) of A as a linear combination of the left singular vectors, losing at most 10% of the information in the matrix.

BUT: U_k consists of "superpositions" of the columns of A. Is it possible to pick a small number (e.g., roughly k) of columns of A and express every column (i.e., SNP) of A as a linear combination of the picked columns (matrix C), losing at most 10% of the information in the matrix?

Expressing columns of A as linear combinations of

a few columns of A

Page 140: KDD tutorial

140

Algorithms

• We ran various schemes to select "good" columns (i.e., ht-SNPs).

• A greedy Multi-Pass heuristic scheme gave the best results.

(Select one column in each round, subtracting from A the projection of A on this column and repeating.)

Notice that some provable results exist for similar algorithms.

Nice feature: SVD provides a non-trivial (maybe not achievable) lower bound.

In many cases, the lower bound is attained by the greedy heuristic!

(In our data, at most k+5 columns suffice to extract 90% of the structure.)

Page 141: KDD tutorial

141

[Figure: results for the sampled populations, grouped by geographic region: America, Oceania, Asia, Europe, Africa.]

Page 142: KDD tutorial

142

Extrapolation

Given a small number of SNPs for all subjects, and all SNPs for some subjects, extrapolate the values of the missing SNPs.

[Figure: the individuals × SNPs matrix, split into "training" data (for a few subjects, we are given all SNPs) and an SNP sample (for all subjects, we are given a small number of SNPs).]

Page 143: KDD tutorial

143

Extrapolation

We split our data in “training ” (90%) and “test” (10%).

The training set corresponds to R; for the “test” set we are given data on a small number of SNPs, picked using our greedy multipass heuristic on the “training” set.

[Figure: the individuals × SNPs matrix, split into "training" data (for a few subjects, we are given all SNPs) and an SNP sample (for all subjects, we are given a small number of SNPs).]

Page 144: KDD tutorial

144

Rondonian Surui

Page 145: KDD tutorial

145

Arizona Pima

Page 146: KDD tutorial

146

Atayal

Page 147: KDD tutorial

147

Samaritan

Page 148: KDD tutorial

148

European American

Page 149: KDD tutorial

149

African American

Page 150: KDD tutorial

150

Overview (3/3)

• Kernel-CUR and the Nystrom Method

• Regression problems
• Least squares problems

• The CUR algorithm revisited
• A general bound
• Iterative sampling of rows/columns
• "Optimal" choices of rows/columns
• Application: finding ht-SNPs

• Conclusions & open problems

Page 151: KDD tutorial

151

Conclusions & Open Problems

• Empirical evaluation of different sampling probabilities
• Uniform
• Row/column lengths squared
• Lengths of the rows of the left/right singular vectors
• Inverse squares of row/column lengths to pick outliers
• Others, depending on the problem to be solved?

• Empirical evaluation for different learning problems: matrix reconstruction, regression, classification, clustering, etc.

• Use CUR for improved interpretability for data matrices (structural vs. algorithmic issues)

• Use C W^+ C^T for improved interpretability of Gram matrices (structural vs. algorithmic issues)

Page 152: KDD tutorial

152

Conclusions & Open Problems

• Impose other structural properties in CUR-type decompositions
• Non-negativity
• Element quantization, e.g., to {0, 1, -1}
• Block-SVD type structure

• Robust heuristics and robust extensions, especially for noisy data

• Quasi-Monte-Carlo vs. Monte-Carlo vs. deterministic heuristics for constructing the sketch, especially for not-so-large data sets

• Relate the pass-efficient framework to practice
• passes over the data
• a priori knowledge regarding data generation
• oracle-based models