Introduction to String Kernels
Blaz Fortuna, JSI, Slovenia
What is a Kernel?
- An inner product: a similarity measure between documents
- Documents are mapped into some higher-dimensional feature space
Why use Kernels?
- The mapped documents are never explicitly calculated
- Linear algorithms can be applied to the mapped documents
- Input documents can be anything (not necessarily vectors)!
Algorithms using Kernels
- Support Vector Machine (classification, regression, …)
- Kernel Principal Component Analysis
- Kernel Canonical Correlation Analysis
- Nearest Neighbour
- …
Representation of Text
- Vector-space model (bag of words), the most commonly used representation
- Each document is encoded as a feature vector with word frequencies as elements
- TF-IDF weighting, normalized
- Similarity is the inner product (cosine distance); this can be viewed as a kernel
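For reference, this bag-of-words kernel can be sketched in a few lines. The function names and the exact IDF formula below are illustrative assumptions, not taken from the slides:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode each document as a normalized sparse TF-IDF vector."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vecs.append({w: v / norm for w, v in vec.items()})
    return vecs

def bow_kernel(u, v):
    """Inner product of two sparse vectors; with normalized vectors
    this is exactly the cosine similarity."""
    if len(v) < len(u):
        u, v = v, u
    return sum(val * v.get(w, 0.0) for w, val in u.items())
```

Because each vector is normalized, the kernel of a document with itself is 1, i.e. the inner product coincides with the cosine similarity mentioned above.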
Basic Idea of String Kernels
Words → Substrings
- Each document is encoded as a feature vector with substring frequencies as elements
- More contiguous substrings receive a higher weight: gaps are penalized through a decay factor λ < 1
        ca    ar    cr    ba    br    ap    cp
car     λ²    λ²    λ³    0     0     0     0
bar     0     λ²    0     λ²    λ³    0     0
cap     λ²    0     0     0     0     λ²    λ³
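The feature map in the table can be reproduced by brute force: enumerate every way a length-n (possibly gapped) subsequence occurs in the document, and weight each occurrence by λ raised to the length of the span it covers. This sketch is for intuition only (it is far too slow in practice) and the function names are mine:

```python
from collections import defaultdict
from itertools import combinations

def feature_map(s, n=2, lam=0.5):
    """Brute-force phi(s): each occurrence of a length-n (possibly gapped)
    subsequence u adds lam**span, span = last index - first index + 1."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return dict(phi)

def kernel_by_phi(s, t, n=2, lam=0.5):
    """Inner product of the two explicit feature vectors."""
    ps, pt = feature_map(s, n, lam), feature_map(t, n, lam)
    return sum(w * pt.get(u, 0.0) for u, w in ps.items())
```

With λ = 0.5, feature_map("car") yields {'ca': λ², 'cr': λ³, 'ar': λ²}, matching the 'car' row of the table, and the kernel of 'car' with 'bar' is λ⁴ from their one shared feature 'ar'.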
Kernel Trick
- Explicit computation of the feature vectors is very expensive
- Algorithms that use kernels need only the inner product
- The inner product can be computed efficiently without explicit feature vectors (dynamic programming)
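The dynamic programme can be sketched with the standard recursion for the gap-weighted subsequence kernel (a sketch in the style of Lodhi et al.; variable names are mine). Each level i of the recursion reuses the previous level, giving the O(n·|s|·|t|) running time mentioned later in the slides:

```python
def ssk(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t) by dynamic programming.
    Kp[j][k] holds the auxiliary kernel K'_i on prefixes s[:j], t[:k]."""
    ls, lt = len(s), len(t)
    Kp = [[1.0] * (lt + 1) for _ in range(ls + 1)]  # K'_0 = 1 everywhere
    for i in range(1, n):
        Kp_new = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        Kpp = [[0.0] * (lt + 1) for _ in range(ls + 1)]  # K''_i
        for j in range(i, ls + 1):
            for k in range(i, lt + 1):
                Kpp[j][k] = lam * Kpp[j][k - 1] + (
                    lam * lam * Kp[j - 1][k - 1]
                    if s[j - 1] == t[k - 1] else 0.0)
                Kp_new[j][k] = lam * Kp_new[j - 1][k] + Kpp[j][k]
        Kp = Kp_new
    # Final level: sum over all matching symbol pairs
    return sum(lam * lam * Kp[j - 1][k - 1]
               for j in range(1, ls + 1)
               for k in range(1, lt + 1)
               if s[j - 1] == t[k - 1])
```

On the toy example from the table above, ssk('car', 'bar', 2) equals λ⁴, the inner product of the two feature rows. In practice the kernel is usually normalized as K(s,t)/√(K(s,s)·K(t,t)); the same code also works at the word or syllable level if s and t are lists of tokens instead of strings.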
Advantage of String Kernels
- Detection of words with different suffixes or prefixes
- Example: ‘microcomputer’, ‘computers’, ‘computerbased’
Extensions 1/2
- Use of syllables or words
  - Documents are viewed as sequences of syllables or words instead of characters
  - Reduces the length of documents
  - Syllables still eliminate the need for a stemmer
- Convex combinations of kernels
  - Use of substrings with different lengths
  - No extra computational cost
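The convex-combination idea can be sketched generically: any non-negative weighted sum of kernels is itself a kernel, so subsequence kernels for several substring lengths can be blended into one similarity measure (the helper name is mine). The "no extra computational cost" remark holds because the dynamic programme for length n already produces the kernels for all shorter lengths along the way:

```python
def convex_combination(kernels, weights):
    """Return K(s, t) = sum_i w_i * K_i(s, t); valid whenever all
    w_i >= 0, since non-negative combinations of kernels are kernels."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    def K(s, t):
        return sum(w * k(s, t) for w, k in zip(weights, kernels))
    return K
```

For instance, kernels for substring lengths 1 and 2 could be combined with weights 0.5 and 0.25 to favour single-symbol matches less.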
Extensions 2/2
- Different weighting for symbols
  - Introduces a weighting similar to IDF
  - Low computational cost
- Soft matching
  - Similar symbols are matched
  - WordNet can be used for matching synonyms
  - The computational cost comes from the matching
Speed Performance
- The string kernel is much slower and more memory-consuming than the BOW text representation
- The DP implementation is O(n·|s|·|t|), where n is the substring length and |s|, |t| are the lengths of documents s and t
- Memory consumption is O(|s|·|t|)
How to be Faster
- TRIE: count only the more contiguous substrings
- Dimension reduction: documents are projected onto the subspace spanned by the most frequent contiguous substrings
- Incomplete Cholesky decomposition: low-rank approximation of the kernel matrix
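Of the three speed-ups, incomplete Cholesky decomposition is easy to sketch: a pivoted, rank-limited factorization K ≈ GGᵀ of the kernel matrix, so downstream algorithms can work with the n×k matrix G instead of the full n×n kernel matrix. This is a generic sketch of the technique, not the implementation used in the experiments:

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Pivoted incomplete Cholesky: approximate a symmetric PSD kernel
    matrix K (n x n) by G @ G.T with G of shape (n, rank) at most."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.array(np.diag(K), dtype=float)  # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))              # pivot: largest residual entry
        if d[i] < tol:                     # K is already well approximated
            return G[:, :j]
        G[:, j] = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G
```

The approximation error is bounded by the remaining residual diagonal, so the chosen rank trades accuracy against time and memory, which is exactly what the ICD rows in the results table below vary.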
Experiments
- Subset of the Reuters-21578 dataset
- BOW vs. string kernel, on two splits: 300 train + 700 test and 600 train + 400 test
- Approximation techniques

BOW vs. String Kernel

                  300 train + 700 test    600 train + 400 test
                  CE*   F1    NSV*        CE*   F1    NSV*       Time
String Kernel     15    87    184         3     97    305        208
Syllable Kernel   12    89    218         4     97    379        29
Word Kernel       18    85    157         3     97    281        9
BOW – TF only     18    84    150         3     97    287        1/6
BOW – TFIDF       8     93    252         4     97    443        1/6

*CE – classification error, NSV – number of support vectors
Approximations

            Prec [%]   Rec [%]   Time [sec]
TFIDF       95         97        24
DR (1500)   87         90        24
DR (2500)   87         91        48
DR (3500)   89         91        64
ICD (200)   86         92        49
ICD (450)   88         92        114
ICD (750)   90         94        244