Introduction to String Kernels Blaz Fortuna JSI, Slovenija

What is a Kernel?

Inner product
Similarity between documents
Documents are mapped into some higher-dimensional feature space
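
In symbols, as a sketch that assumes φ denotes the feature map and s, t are two documents (the slide itself does not fix any notation):

```latex
K(s, t) = \langle \phi(s), \phi(t) \rangle
```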

Why use Kernels?

Mapped documents are not explicitly calculated
Linear algorithms can be applied to the mapped documents
Input documents can be anything (not necessarily vectors)!

Algorithms using Kernels

Support Vector Machine (classification, regression, …)
Kernel Principal Component Analysis
Kernel Canonical Correlation Analysis
Nearest Neighbour
…
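
To illustrate how such an algorithm can run on kernel values alone: nearest neighbour only needs distances in feature space, and the squared distance expands into kernel evaluations, K(x,x) - 2K(x,y) + K(y,y). The Python sketch below is a minimal illustration; the function names and the (document, label) pair layout are assumptions made here, not something taken from the slides.

```python
def kernel_distance_sq(kernel, x, y):
    """Squared feature-space distance, computed from kernel values only."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def nearest_neighbour(kernel, query, labelled):
    """Return the label of the training item closest to `query`.

    `labelled` is a list of (document, label) pairs (hypothetical layout).
    """
    closest = min(labelled, key=lambda pair: kernel_distance_sq(kernel, query, pair[0]))
    return closest[1]


# Toy usage with the plain product of two numbers as the kernel,
# for which the distance reduces to (x - y) ** 2.
toy_kernel = lambda x, y: x * y
label = nearest_neighbour(toy_kernel, 2.1, [(1.0, "a"), (2.0, "b"), (5.0, "c")])
```

Any kernel, including a string kernel, can be passed in place of `toy_kernel` without changing the algorithm.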

Representation of text

Vector-space model (bag of words)
  Most commonly used
  Each document is encoded as a feature vector with word frequencies as elements
  IDF weighting, normalized
  Similarity is the inner product (cosine similarity)
  Can be viewed as a kernel
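
A minimal Python sketch of the bag-of-words representation viewed as a kernel. It assumes whitespace tokenisation and lowercasing, and it leaves out IDF weighting for brevity; the function names are illustrative only.

```python
import math
from collections import Counter


def bow_vector(text):
    """Term-frequency vector of a document (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def bow_kernel(a, b):
    """Inner product of the L2-normalised term-frequency vectors (cosine similarity)."""
    va, vb = bow_vector(a), bow_vector(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


similarity = bow_kernel("the cat sat", "the cat ran")  # two of three words shared
```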

Basic Idea of String Kernels

Words -> Substrings
Each document is encoded as a feature vector with substring frequencies as elements
More contiguous substrings receive higher weighting (through λ < 1)

        ca     ar     cr     ba     br     ap     cp
car     λ^2    λ^2    λ^3    0      0      0      0
bar     0      λ^2    0      λ^2    λ^3    0      0
cap     λ^2    0      0      0      0      λ^2    λ^3
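
The table can be reproduced by brute force. The Python sketch below assumes that each feature is a (possibly non-contiguous) subsequence of length n, weighted by λ raised to the length of the span it covers in the word; the function names are illustrative. It enumerates every subsequence, so it is only practical for very short strings.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, n, lam):
    """Explicit feature vector: subsequence -> sum of lam ** span over its occurrences."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # characters covered, gaps included
        feats[u] += lam ** span
    return dict(feats)


def explicit_kernel(s, t, n, lam):
    """Inner product of the two explicit feature vectors."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


car = subsequence_features("car", 2, 0.5)  # ca and ar get 0.5**2, cr gets 0.5**3
```

With these features, "car" and "bar" share only "ar", so their kernel value is λ^4.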

Kernel Trick

Computation of feature vectors is very expensive
For algorithms that use kernels, only the inner product is needed
This can be computed efficiently without explicit use of feature vectors (dynamic programming)
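
One way the dynamic programming can look is sketched below in Python. It assumes the kernel meant here is the gap-weighted subsequence kernel of Lodhi et al., with subsequence length n and decay λ; the function name and the two-layer table layout are choices made for this sketch, not taken from the slides.

```python
def subsequence_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n (n >= 1), without feature vectors.

    prev[a][b] holds the auxiliary value K'_{i-1} for the prefixes s[:a], t[:b].
    Only two layers are kept, so memory stays O(|s||t|); time is O(n|s||t|).
    """
    S, T = len(s), len(t)
    prev = [[1.0] * (T + 1) for _ in range(S + 1)]  # K'_0 is 1 everywhere
    for i in range(1, n):
        cur = [[0.0] * (T + 1) for _ in range(S + 1)]
        for a in range(i, S + 1):
            kpp = 0.0  # running K''_i for the prefix s[:a]
            for b in range(i, T + 1):
                match = lam * lam * prev[a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                cur[a][b] = lam * cur[a - 1][b] + kpp
        prev = cur
    # Close every subsequence on a matching final character.
    return sum(
        lam * lam * prev[a - 1][b - 1]
        for a in range(n, S + 1)
        for b in range(n, T + 1)
        if s[a - 1] == t[b - 1]
    )


k_car_bar = subsequence_kernel("car", "bar", 2, 0.5)  # shared feature: "ar"
```

For "car" and "bar" with n = 2 the only shared subsequence is "ar", weighted λ^2 in each word, so the result is λ^4.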

Advantage of String Kernel

Detection of words with different suffixes or prefixes
Example: ‘microcomputer’, ‘computers’, ‘computerbased’
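
To make the example concrete, the Python sketch below swaps in a simpler kernel than the one on the previous slides: a p-spectrum kernel that counts only contiguous substrings of length p. As whole words the three examples have nothing in common, yet they share most of their character substrings. The function names and the choice p = 3 are assumptions for illustration.

```python
import math
from collections import Counter


def spectrum(s, p):
    """Counts of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def spectrum_kernel(s, t, p):
    """Inner product of the two substring-count vectors."""
    fs, ft = spectrum(s, p), spectrum(t, p)
    return sum(c * ft[u] for u, c in fs.items())


def normalised_spectrum_kernel(s, t, p):
    """Kernel value scaled to [0, 1]; 0.0 if either string is shorter than p."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0


shared = spectrum_kernel("microcomputer", "computers", 3)
similarity = normalised_spectrum_kernel("microcomputer", "computers", 3)
```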

Extensions 1/2

Use of syllables or words
  Documents are viewed as a sequence of syllables or words instead of characters
  Reduces length of documents
  Syllables still eliminate the need for a stemmer
Convex Combinations of Kernels
  Use of substrings with different lengths
  No extra computational cost
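
Written out, with K_i assumed to denote the string kernel for substrings of length i and w_i the mixing weights (both symbols are introduced here, not on the slide):

```latex
K(s, t) = \sum_{i=1}^{n} w_i \, K_i(s, t), \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1
```

A plausible reading of "no extra computational cost" is that a dynamic program for length n already produces the intermediate values needed for every shorter length.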

Extensions 2/2

Different weighting for symbols
  Introduction of weighting similar to IDF
  Low computational cost
Soft-Matching
  Similar symbols are matched
  Use of WordNet for matching synonyms
  Computational cost comes from matching

Speed performance

The string kernel is much slower and uses more memory than the BOW text representation
DP implementation is O(n|s||t|)
  n: length of substring
  |s|, |t|: lengths of documents s and t
Memory consumption is O(|s||t|)

How to be Faster

TRIE: only count more contiguous substrings
Dimension reduction: documents are projected into the subspace spanned by the most frequent contiguous substrings
Incomplete Cholesky Decomposition: approximation of the kernel matrix
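
The last idea can be sketched as a pivoted incomplete Cholesky factorisation, written here in plain Python. It is a generic textbook variant, not necessarily the one used for the experiments; the function name, the `max_rank` and `tol` parameters, and the list-of-lists matrix layout are assumptions. It returns columns g_1, ..., g_m such that the kernel matrix is approximated by the sum of the outer products g_k g_k^T.

```python
import math


def incomplete_cholesky(K, max_rank, tol=1e-10):
    """Low-rank factor of a symmetric positive semi-definite matrix K.

    Reads only the diagonal of K and the columns of the chosen pivots,
    so the full kernel matrix never has to be evaluated.
    """
    n = len(K)
    residual = [K[i][i] for i in range(n)]  # diagonal of K minus current approximation
    columns = []
    for _ in range(min(max_rank, n)):
        p = max(range(n), key=lambda i: residual[i])  # largest remaining error
        if residual[p] <= tol:
            break
        root = math.sqrt(residual[p])
        col = [(K[i][p] - sum(g[i] * g[p] for g in columns)) / root for i in range(n)]
        for i in range(n):
            residual[i] -= col[i] ** 2
        columns.append(col)
    return columns


K = [[4.0, 2.0], [2.0, 2.0]]
G = incomplete_cholesky(K, max_rank=2)
approx = [[sum(g[i] * g[j] for g in G) for j in range(2)] for i in range(2)]
```

Choosing `max_rank` well below the number of documents trades accuracy for speed, which matches the ICD (200), ICD (450) and ICD (750) rows reported in the approximation results.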

Experiments

Subset of the Reuters-21578 dataset
BOW vs. String Kernel
  300 train + 700 test
  600 train + 400 test
Approximation techniques

BOW vs. String Kernel

                   CE*   F1    NSV*    CE*   F1    NSV*    Time
String Kernel      15    87    184     3     97    305     208
Syllable Kernel    12    89    218     4     97    379     29
Word Kernel        18    85    157     3     97    281     9
BOW, TF only       18    84    150     3     97    287     1/6
BOW, TFIDF         8     93    252     4     97    443     1/6

* CE: classification error, NSV: number of support vectors

Approximations

             Prec [%]   Rec [%]   Time [sec]
TFIDF        95         97        24
DR (1500)    87         90        24
DR (2500)    87         91        48
DR (3500)    89         91        64
ICD (200)    86         92        49
ICD (450)    88         92        114
ICD (750)    90         94        244