
Knowledge Discovery and Data Mining 1 (VO) (706.701): Data Matrices and Vector Space Model

Denis Helic

ISDS, TU Graz

October 30, 2018

kti.tugraz.at/staff/denis/courses/kddm1/vspace.pdf


Big picture: KDDM

[Diagram: the Knowledge Discovery process (Preprocessing, Transformation, Model), supported by mathematical tools (Linear Algebra, Probability Theory, Information Theory, Statistical Inference) and by infrastructure (Hardware & Programming)]


Outline

1 Recap

2 Data Representation

3 Data Matrix

4 Vector Space Model: Document-Term Matrix

5 Recommender Systems: Utility Matrix


Recap

Review of the preprocessing, feature extraction, and selection phase


Recap – Preprocessing

Initial phase of the Knowledge Discovery process:
Acquire the data to be analyzed, e.g. by crawling the data from the Web
Prepare the data, e.g. by cleaning and removing outliers


Recap – Feature Extraction

Example of features:
Images → colors, textures, contours, ...
Signals → frequency, phase, samples, spectrum, ...
Time series → ticks, trends, self-similarities, ...
Biomed → DNA sequences, genes, ...
Text → words, POS tags, grammatical dependencies, ...

Features encode these properties in a way suitable for a chosen algorithm


Recap – Feature Selection

Approach: select the subset of all features without redundant or irrelevant features

Unsupervised, e.g. heuristics (black & white lists, score for ranking, …)
Supervised, e.g. using a training data set (calculate a measure to assess how discriminative a feature is, e.g. information gain)
Penalty for more features, e.g. regularization


Recap – Text mining


Data Representation

Representing data

Once we have the features that we want: how can we represent them?
Two representations:

1 Mathematical (model, analyze, and reason about data and algorithms in a formal way)

2 Representation in computers (data structures, implementation, performance)

In this course we concentrate on mathematical models; KDDM2 is about the second representation


Representing data

Given: preprocessed data objects as a set of features, e.g. for text documents a set of words, bigrams, n-grams, …
Given: feature statistics for each data object, e.g. number of occurrences, magnitudes, ticks, …
Find: a mathematical model for calculations, e.g. similarity, distance, add, subtract, transform, …


Representing data

Let us think (geometrically) about features as dimensions (coordinates) in an m-dimensional space
For two features we have a 2D space, for three features a 3D space, and so on
For numeric features the feature statistics will be numbers coming from e.g. ℝ
Then each data object is a point in the m-dimensional space
The value of a given feature is then the value of its coordinate in this m-dimensional space


Representing data: Example

Text documents
Suppose we have the following documents:

DocID  Document
d1     Chinese Chinese
d2     Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing
d3     Chinese Tokyo Tokyo Tokyo Beijing Beijing
d4     Tokyo Tokyo Tokyo
d5     Tokyo


Representing data: Example

Now, we take words as features
We take word occurrences as the feature values
Three features: Beijing, Chinese, Tokyo

Doc \ Feature  Beijing  Chinese  Tokyo
d1             0        2        0
d2             3        2        2
d3             2        1        3
d4             0        0        3
d5             0        0        1


Representing data: Example

[Figure: geometric representation of the data. The documents are plotted as points in the 3D feature space with axes x (Beijing), y (Chinese), z (Tokyo): d1 (0,2,0), d2 (3,2,2), d3 (2,1,3), d4 (0,0,3), d5 (0,0,1)]


Data Matrix

Representing data

Now, an intuitive representation of the data is a matrix
In the general case:
Columns correspond to features, i.e. dimensions or coordinates in an m-dimensional space
Rows correspond to data objects, i.e. data points
An element d_ij in the i-th row and the j-th column is the j-th coordinate of the i-th data point
The data is represented by a matrix D ∈ ℝ^(n×m), where n is the number of data points and m the number of features


Representing data

For text:
Columns correspond to terms, words, and so on
I.e. each word is a dimension or a coordinate in an m-dimensional space
Rows correspond to documents
An element d_ij in the i-th row and the j-th column is e.g. the number of occurrences of the j-th word in the i-th document
This matrix is called the document-term matrix


Representing data: example

D =
⎛ 0  2  0 ⎞
⎜ 3  2  2 ⎟
⎜ 2  1  3 ⎟
⎜ 0  0  3 ⎟
⎝ 0  0  1 ⎠
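The same matrix can be built directly from the toy corpus; a minimal sketch in Python/numpy (not part of the slides, the variable names are illustrative):

```python
import numpy as np

docs = {
    "d1": "Chinese Chinese",
    "d2": "Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing",
    "d3": "Chinese Tokyo Tokyo Tokyo Beijing Beijing",
    "d4": "Tokyo Tokyo Tokyo",
    "d5": "Tokyo",
}
vocab = ["Beijing", "Chinese", "Tokyo"]  # feature (column) order used on the slides

# Count how often each term occurs in each document (raw term frequencies)
D = np.array([[text.split().count(term) for term in vocab]
              for text in docs.values()])
print(D)
# [[0 2 0]
#  [3 2 2]
#  [2 1 3]
#  [0 0 3]
#  [0 0 1]]
```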


Vector Space Model: Document-Term Matrix

Vector interpretation

An alternative and very useful interpretation of data matrices is in terms of vectors
Each data point, i.e. each row in the data matrix, can be interpreted as a vector in an m-dimensional space
This allows us to work with:

1 the vector direction, i.e. the line connecting the origin and a given data point

2 the vector magnitude (length), i.e. the distance from the origin to a given data point → feature weighting

3 similarity, distance, correlations between vectors

4 vector manipulation, and so on.

Vector space model


Vector interpretation: Example

[Figure: the same geometric representation of the documents d1 (0,2,0), d2 (3,2,2), d3 (2,1,3), d4 (0,0,3), d5 (0,0,1), now read as vectors from the origin; axes x (Beijing), y (Chinese), z (Tokyo)]


Vector space model: feature weighting

What we did so far is that we counted word occurrences and used these counts as the coordinates in a given dimension
This weighting scheme is referred to as term frequency (TF)
Three features: Beijing, Chinese, Tokyo

Doc \ Feature  Beijing  Chinese  Tokyo
d1             0        2        0
d2             3        2        2
d3             2        1        3
d4             0        0        3
d5             0        0        1


Vector space model: TF

Please note the difference between the vector space model with TF weighting and e.g. the standard Boolean model in information retrieval
The standard Boolean model uses binary feature values
If the feature j occurs in the document i (regardless of how many times), then d_ij = 1
Otherwise d_ij = 0
This allows very simple manipulation of the model, but does not capture the relative importance of the feature j for the document i
If a feature occurs more frequently in a document, then it is more important for that document, and TF captures this intuition
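Continuing the numpy sketch from above (again illustrative, not from the slides), the Boolean model simply thresholds the count matrix, which throws away exactly this relative-importance information:

```python
# Boolean (presence/absence) model: every non-zero count collapses to 1
B = (D > 0).astype(int)
print(B)
# [[0 1 0]
#  [1 1 1]
#  [1 1 1]
#  [0 0 1]
#  [0 0 1]]
# d2 and d3 become indistinguishable here, although their term counts differ
```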


Vector space model: feature weighting

However, raw TF has a critical problem: all terms are considered equally important for assessing relevance
In fact, certain terms have no discriminative power at all considering relevance
For example, a collection of documents on the auto industry will most probably have the term "auto" in every document
We need to penalize terms which occur too often in the complete collection
We start with the notion of document frequency (DF) of a term
This is the number of documents that contain a given term


Vector space model: IDF

If a term is discriminative then it will only appear in a small number of documents
I.e. its DF will be low, or its inverse document frequency (IDF) will be high
In practice we typically calculate IDF as:

IDF_j = log(N / DF_j),

where N is the number of documents and DF_j is the document frequency of the term j.


Vector space model: TFxIDF

Finally, we combine TF and IDF to produce a composite weight for a given feature:

TFxIDF_ij = TF_ij · log(N / DF_j)

Now, this quantity will be high if the feature j occurs only a couple of times in the whole collection, and most of these occurrences are in the document i


TFxIDF: Example

Text documents
Suppose we have the following documents:

DocID  Document
d1     Chinese Chinese
d2     Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing
d3     Chinese Tokyo Tokyo Tokyo Beijing Beijing
d4     Tokyo Tokyo Tokyo
d5     Tokyo


TFxIDF: Example

Three features: Beijing, Chinese, Tokyo
TF:

Doc \ Feature  Beijing  Chinese  Tokyo
d1             0        2        0
d2             3        2        2
d3             2        1        3
d4             0        0        3
d5             0        0        1


TFxIDF: Example

Three features: Beijing, Chinese, Tokyo
DF:

DF_Beijing = 2
DF_Chinese = 3
DF_Tokyo = 4


TFxIDF: Example

Three features: Beijing, Chinese, Tokyo
IDF:

IDF_j = log(N / DF_j)

IDF_Beijing = 0.3979
IDF_Chinese = 0.2218
IDF_Tokyo = 0.0969


TFxIDF: Example

Three features: Beijing, Chinese, Tokyo
TFxIDF:

TFxIDF_ij = TF_ij · log(N / DF_j)

Doc \ Feature  Beijing  Chinese  Tokyo
d1             0        0.4436   0
d2             1.1937   0.4436   0.1938
d3             0.7958   0.2218   0.2907
d4             0        0        0.2907
d5             0        0        0.0969
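These values can be reproduced from the count matrix D of the earlier sketch; a hedged example assuming base-10 logarithms (which matches the IDF values above up to rounding):

```python
N = D.shape[0]                 # number of documents: 5
DF = (D > 0).sum(axis=0)       # document frequencies per term: [2 3 4]
IDF = np.log10(N / DF)         # [0.3979 0.2218 0.0969]
TFIDF = D * IDF                # broadcast the IDF weights over the rows of D
print(np.round(TFIDF, 4))
# [[0.      0.4437  0.    ]
#  [1.1938  0.4437  0.1938]
#  [0.7959  0.2218  0.2907]
#  [0.      0.      0.2907]
#  [0.      0.      0.0969]]
```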


TFxIDF: Example

[Figure: geometric representation of the data after TF×IDF weighting; the documents d1, ..., d5 are plotted in the space spanned by Beijing (x), Chinese (y), Tokyo (z) with coordinates between 0.0 and 1.0]


TFxIDF: Example

[Figure: for comparison, the geometric representation with raw TF coordinates: d1 (0,2,0), d2 (3,2,2), d3 (2,1,3), d4 (0,0,3), d5 (0,0,1); axes x (Beijing), y (Chinese), z (Tokyo)]


Vector space model: TFxIDF

Further TF variants are possible:
Sub-linear TF scaling, e.g. √TF or 1 + log(TF)
It is unlikely that e.g. 20 occurrences of a single term carry 20 times the weight of a single occurrence
Also very often TF normalization, e.g. by dividing by the maximal TF in the collection
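A small sketch of these variants applied to the count matrix D from before (illustrative only; the 1 + log form is defined here only for non-zero counts, and the normalization follows the slide in dividing by the collection-wide maximum):

```python
# Sub-linear TF scaling
sqrt_tf = np.sqrt(D)

log_tf = np.zeros_like(D, dtype=float)
nonzero = D > 0
log_tf[nonzero] = 1.0 + np.log10(D[nonzero])   # 1 + log(TF) for TF > 0, else 0

# TF normalization by the maximal TF in the collection
max_tf = D / D.max()
```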


Vector space model: vector manipulation

Vector addition: concatenate two documents (see the sketch below)
Vector subtraction: diff two documents
Transform the vector space into a new vector space: projection and feature selection
Similarity calculations
Distance calculations
Scaling for visualization
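For example, adding two raw count vectors gives exactly the count vector of the concatenated text; a quick illustrative check with the matrix D from before:

```python
d2, d3 = D[1], D[2]
print(d2 + d3)   # [5 3 5]: the term counts of the concatenation of d2 and d3
print(d2 - d3)   # [ 1  1 -1]: per-term count difference between d2 and d3
```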


Vector space model: similarity and distance

Similarity and distance allow us to order documents
This is useful in many applications:

1 Relevance ranking (information retrieval)
2 Classification
3 Clustering
4 Transformation and projection


Properties of similarity and distance

Definition
Let D be the set of all data objects, e.g. all documents. Similarity and distance are functions sim : D × D → ℝ and dist : D × D → ℝ with the following properties:

Symmetry

sim(d_i, d_j) = sim(d_j, d_i)
dist(d_i, d_j) = dist(d_j, d_i)

Self-similarity (self-distance)

sim(d_i, d_j) ≤ sim(d_i, d_i) = sim(d_j, d_j)
dist(d_i, d_j) ≥ dist(d_i, d_i) = dist(d_j, d_j) = 0


Properties of similarity and distance

Additional properties
Some additional properties of the sim function:

Normalization of similarity to the interval [0, 1]

sim(d_i, d_j) ≥ 0
sim(d_i, d_i) = 1

Normalization of similarity to the interval [−1, 1]

sim(d_i, d_j) ≥ −1
sim(d_i, d_i) = 1


Properties of similarity and distance

Additional properties
Some additional properties of the dist function:

Triangle inequality

dist(d_i, d_j) ≤ dist(d_i, d_k) + dist(d_k, d_j)

If the triangle inequality is satisfied, then the dist function defines a metric on the vector space


Euclidean distance

Norm
We can apply the Euclidean or ℓ2 norm to calculate a distance metric in the vector space model:

||x||_2 = √(Σ_{i=1}^n x_i²) = √(xᵀ x)

The Euclidean distance of two vectors d_i and d_j is the length of their displacement vector x = d_i − d_j


Euclidean distance: Example

[Figure: geometric representation of the documents d1 (0,2,0), d2 (3,2,2), d3 (2,1,3), d4 (0,0,3), d5 (0,0,1); axes x (Beijing), y (Chinese), z (Tokyo)]


Euclidean distance

D =
⎛ d_1ᵀ ⎞
⎜ d_2ᵀ ⎟
⎜  ⋮   ⎟
⎝ d_nᵀ ⎠


Euclidean distance

D =
⎛ 0  2  0 ⎞
⎜ 3  2  2 ⎟
⎜ 2  1  3 ⎟
⎜ 0  0  3 ⎟
⎝ 0  0  1 ⎠


Euclidean distance

dist =
⎛ 0.          3.60555128  3.74165739  3.60555128  2.23606798 ⎞
⎜ 3.60555128  0.          1.73205081  3.74165739  3.74165739 ⎟
⎜ 3.74165739  1.73205081  0.          2.23606798  3.         ⎟
⎜ 3.60555128  3.74165739  2.23606798  0.          2.         ⎟
⎝ 2.23606798  3.74165739  3.          2.          0.         ⎠
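This pairwise distance matrix can be reproduced from D with plain numpy broadcasting (an illustrative sketch; scipy.spatial.distance.cdist would give the same result):

```python
# Displacement vectors between every pair of rows of D, shape (5, 5, 3)
diff = D[:, None, :] - D[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean length of each displacement
print(np.round(dist, 4))
# [[0.     3.6056 3.7417 3.6056 2.2361]
#  [3.6056 0.     1.7321 3.7417 3.7417]
#  [3.7417 1.7321 0.     2.2361 3.    ]
#  [3.6056 3.7417 2.2361 0.     2.    ]
#  [2.2361 3.7417 3.     2.     0.    ]]
```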


Cosine similarity

Angle between vectors
We can think of the angle φ between two vectors d_i and d_j as a measure of their similarity.

Smaller angles mean that the vectors point in two directions that are close to each other
Thus, smaller angles mean that the vectors are more similar
Larger angles imply smaller similarity
To obtain a numerical value from the interval [0, 1] we can calculate cos φ
The cosine of an angle can also be negative! Why do we always get values from [0, 1] here?


Cosine similarity: Example

[Figure: geometric representation of the documents d1 (0,2,0), d2 (3,2,2), d3 (2,1,3), d4 (0,0,3), d5 (0,0,1), with the angles φ1,5 (between d1 and d5) and φ2,3 (between d2 and d3) marked]


Cosine similarity

We can calculate the cosine similarity by making use of the dot product:

sim(d_i, d_j) = (d_iᵀ d_j) / (||d_i||_2 ||d_j||_2)


Cosine similarity

D =
⎛ 0  2  0 ⎞
⎜ 3  2  2 ⎟
⎜ 2  1  3 ⎟
⎜ 0  0  3 ⎟
⎝ 0  0  1 ⎠


Cosine similarity

sim =
⎛ 1.          0.48507125  0.26726124  0.          0.         ⎞
⎜ 0.48507125  1.          0.90748521  0.48507125  0.48507125 ⎟
⎜ 0.26726124  0.90748521  1.          0.80178373  0.80178373 ⎟
⎜ 0.          0.48507125  0.80178373  1.          1.         ⎟
⎝ 0.          0.48507125  0.80178373  1.          1.         ⎠
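The same matrix follows from normalizing the rows of D to unit length and taking all pairwise dot products; a short illustrative sketch:

```python
norms = np.linalg.norm(D, axis=1, keepdims=True)   # row-wise Euclidean norms
Dn = D / norms                                     # unit-length document vectors
sim = Dn @ Dn.T                                    # cosine similarities of all pairs
print(np.round(sim, 4))
# [[1.     0.4851 0.2673 0.     0.    ]
#  [0.4851 1.     0.9075 0.4851 0.4851]
#  [0.2673 0.9075 1.     0.8018 0.8018]
#  [0.     0.4851 0.8018 1.     1.    ]
#  [0.     0.4851 0.8018 1.     1.    ]]
```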


Recommender Systems: Utility Matrix

Representing data as matrices

There are many other sources of data that can be represented as large matrices
We also use matrices to represent social networks, etc.
Measurement data, and much more
In recommender systems, we represent user ratings of items as a utility matrix


Recommender systems

Recommender systems predict user responses to options
E.g. recommend news articles based on predictions of user interests
E.g. recommend products based on predictions of what a user might like
Recommender systems are necessary whenever users interact with huge catalogs of items
E.g. thousands, even millions of products, movies, music, and so on.


Recommender systems

These huge numbers of items arise because online you can also interact with the "long tail" items
These are not available in retail because of e.g. limited shelf space
Recommender systems try to support this interaction by suggesting certain items to the user
The suggestions or recommendations are based on what the systems know about the user and about the items


Formal model: the utility matrix

In a recommender system there are two classes of entities: users and items

Utility function
Let us denote the set of all users with U and the set of all items with I. We define the utility function u : U × I → R, where R is a set of ratings and is a totally ordered set.
For example, R = {1, 2, 3, 4, 5} (a set of star ratings), or R = [0, 1] (the set of real numbers from that interval).


The Utility Matrix

The utility function maps pairs of users and items to numbers
These numbers can be represented by a utility matrix M ∈ ℝ^(n×m), where n is the number of users and m the number of items
The matrix gives a value for each user-item pair where we know about the preference of that user for that item
E.g. the values can come from an ordered set (1 to 5) and represent a rating that a user gave for an item


The Utility Matrix

We assume that the matrix is sparse
This means that most entries are unknown
The majority of the user preferences for specific items is unknown
An unknown rating means that we do not have explicit information
It does not mean that the rating is low
Formally: the goal of a recommender system is to predict the blanks in the utility matrix


The Utility Matrix: example

User \ Movie: HP1, HP2, HP3, Hobbit, SW1, SW2, SW3

A has rated three of the movies: 4, 5, 1
B has rated three of the movies: 5, 5, 4
C has rated three of the movies: 2, 4, 5
D has rated two of the movies: 3, 3

All remaining user-movie pairs are blank (unknown)
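A minimal sketch of holding such a utility matrix in code, with NaN marking the blank (unknown) ratings; the placement of the known ratings below is one illustrative possibility, not taken from the slides:

```python
import numpy as np

users  = ["A", "B", "C", "D"]
movies = ["HP1", "HP2", "HP3", "Hobbit", "SW1", "SW2", "SW3"]
n = np.nan   # blank entry: no rating observed (which does not mean a low rating)

M = np.array([
    [4, n, n, 5, 1, n, n],   # user A (hypothetical placement of the ratings 4, 5, 1)
    [5, 5, 4, n, n, n, n],   # user B
    [n, n, n, 2, 4, 5, n],   # user C
    [n, n, 3, n, n, n, 3],   # user D
])

print(np.isnan(M).mean())        # fraction of blank entries (sparsity)
print(np.nanmean(M, axis=1))     # mean of the known ratings per user
```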


Recommender systems: short summary

Three key problems in recommender systems:

Collecting data for the utility matrix
1 Explicit, e.g. by collecting ratings
2 Implicit, e.g. by interactions (a user buying a product implies a high rating)

Predicting missing values in the utility matrix
1 Problems are sparsity, cold start, and so on
2 Content-based, collaborative filtering, matrix factorization

Evaluating predictions
1 Training dataset, test dataset, error
