Introduction to String Kernels
Blaz Fortuna, JSI, Slovenia
What is a Kernel?
- An inner product: a similarity measure between documents
- Documents are mapped into some higher-dimensional feature space
Why use Kernels?
- The mapped documents are never explicitly calculated
- Linear algorithms can be applied to the mapped documents
- Input documents can be anything (not necessarily vectors)!
Algorithms using Kernels
- Support Vector Machine (classification, regression, …)
- Kernel Principal Component Analysis
- Kernel Canonical Correlation Analysis
- Nearest Neighbour
- …
Representation of Text
- Vector-space model (bag of words), the most commonly used representation
- Each document is encoded as a feature vector with word frequencies as elements
- TF-IDF weighting, normalized
- Similarity is the inner product (cosine distance); this can be viewed as a kernel
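For reference, this bag-of-words kernel can be sketched in a few lines. The function names and the exact IDF formula below are illustrative assumptions, not taken from the slides:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode each document as a normalized sparse TF-IDF vector."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vecs.append({w: v / norm for w, v in vec.items()})
    return vecs

def bow_kernel(u, v):
    """Inner product of two sparse vectors; with normalized vectors
    this is exactly the cosine similarity."""
    if len(v) < len(u):
        u, v = v, u
    return sum(val * v.get(w, 0.0) for w, val in u.items())
```

Because each vector is normalized, the kernel of a document with itself is 1, i.e. the inner product coincides with the cosine similarity mentioned above.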
Basic Idea of String Kernels
Words → Substrings
- Each document is encoded as a feature vector with substring frequencies as elements
- More contiguous substrings receive a higher weight: gaps are penalized through a decay factor λ < 1
        ca    ar    cr    ba    br    ap    cp
car     λ²    λ²    λ³    0     0     0     0
bar     0     λ²    0     λ²    λ³    0     0
cap     λ²    0     0     0     0     λ²    λ³
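The feature map in the table can be reproduced by brute force: enumerate every way a length-n (possibly gapped) subsequence occurs in the document, and weight each occurrence by λ raised to the length of the span it covers. This sketch is for intuition only (it is far too slow in practice) and the function names are mine:

```python
from collections import defaultdict
from itertools import combinations

def feature_map(s, n=2, lam=0.5):
    """Brute-force phi(s): each occurrence of a length-n (possibly gapped)
    subsequence u adds lam**span, span = last index - first index + 1."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return dict(phi)

def kernel_by_phi(s, t, n=2, lam=0.5):
    """Inner product of the two explicit feature vectors."""
    ps, pt = feature_map(s, n, lam), feature_map(t, n, lam)
    return sum(w * pt.get(u, 0.0) for u, w in ps.items())
```

With λ = 0.5, feature_map("car") yields {'ca': λ², 'cr': λ³, 'ar': λ²}, matching the 'car' row of the table, and the kernel of 'car' with 'bar' is λ⁴ from their one shared feature 'ar'.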
Kernel Trick
- Explicit computation of the feature vectors is very expensive
- Algorithms that use kernels need only the inner product
- The inner product can be computed efficiently without explicit feature vectors (dynamic programming)
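The dynamic programme can be sketched with the standard recursion for the gap-weighted subsequence kernel (a sketch in the style of Lodhi et al.; variable names are mine). Each level i of the recursion reuses the previous level, giving the O(n·|s|·|t|) running time mentioned later in the slides:

```python
def ssk(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t) by dynamic programming.
    Kp[j][k] holds the auxiliary kernel K'_i on prefixes s[:j], t[:k]."""
    ls, lt = len(s), len(t)
    Kp = [[1.0] * (lt + 1) for _ in range(ls + 1)]  # K'_0 = 1 everywhere
    for i in range(1, n):
        Kp_new = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        Kpp = [[0.0] * (lt + 1) for _ in range(ls + 1)]  # K''_i
        for j in range(i, ls + 1):
            for k in range(i, lt + 1):
                Kpp[j][k] = lam * Kpp[j][k - 1] + (
                    lam * lam * Kp[j - 1][k - 1]
                    if s[j - 1] == t[k - 1] else 0.0)
                Kp_new[j][k] = lam * Kp_new[j - 1][k] + Kpp[j][k]
        Kp = Kp_new
    # Final level: sum over all matching symbol pairs
    return sum(lam * lam * Kp[j - 1][k - 1]
               for j in range(1, ls + 1)
               for k in range(1, lt + 1)
               if s[j - 1] == t[k - 1])
```

On the toy example from the table above, ssk('car', 'bar', 2) equals λ⁴, the inner product of the two feature rows. In practice the kernel is usually normalized as K(s,t)/√(K(s,s)·K(t,t)); the same code also works at the word or syllable level if s and t are lists of tokens instead of strings.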
Advantage of String Kernels
- Detection of words with different suffixes or prefixes
- Example: ‘microcomputer’, ‘computers’, ‘computerbased’
Extensions 1/2
- Use of syllables or words
  - Documents are viewed as sequences of syllables or words instead of characters
  - Reduces the length of documents
  - Syllables still eliminate the need for a stemmer
- Convex combinations of kernels
  - Use of substrings with different lengths
  - No extra computational cost
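The convex-combination idea can be sketched generically: any non-negative weighted sum of kernels is itself a kernel, so subsequence kernels for several substring lengths can be blended into one similarity measure (the helper name is mine). The "no extra computational cost" remark holds because the dynamic programme for length n already produces the kernels for all shorter lengths along the way:

```python
def convex_combination(kernels, weights):
    """Return K(s, t) = sum_i w_i * K_i(s, t); valid whenever all
    w_i >= 0, since non-negative combinations of kernels are kernels."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    def K(s, t):
        return sum(w * k(s, t) for w, k in zip(weights, kernels))
    return K
```

For instance, kernels for substring lengths 1 and 2 could be combined with weights 0.5 and 0.25 to favour single-symbol matches less.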
Extensions 2/2
- Different weighting for symbols
  - Introduces a weighting similar to IDF
  - Low computational cost
- Soft matching
  - Similar symbols are matched
  - WordNet can be used for matching synonyms
  - The computational cost comes from the matching
Speed Performance
- The string kernel is much slower and more memory-consuming than the BOW text representation
- The DP implementation is O(n·|s|·|t|), where n is the substring length and |s|, |t| are the lengths of documents s and t
- Memory consumption is O(|s|·|t|)
How to be Faster
- TRIE: count only the more contiguous substrings
- Dimension reduction: documents are projected onto the subspace spanned by the most frequent contiguous substrings
- Incomplete Cholesky decomposition: low-rank approximation of the kernel matrix
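Of the three speed-ups, incomplete Cholesky decomposition is easy to sketch: a pivoted, rank-limited factorization K ≈ GGᵀ of the kernel matrix, so downstream algorithms can work with the n×k matrix G instead of the full n×n kernel matrix. This is a generic sketch of the technique, not the implementation used in the experiments:

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Pivoted incomplete Cholesky: approximate a symmetric PSD kernel
    matrix K (n x n) by G @ G.T with G of shape (n, rank) at most."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.array(np.diag(K), dtype=float)  # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))              # pivot: largest residual entry
        if d[i] < tol:                     # K is already well approximated
            return G[:, :j]
        G[:, j] = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G
```

The approximation error is bounded by the remaining residual diagonal, so the chosen rank trades accuracy against time and memory, which is exactly what the ICD rows in the results table below vary.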
Experiments
- Subset of the Reuters-21578 dataset
- BOW vs. string kernel, on two splits: 300 train + 700 test and 600 train + 400 test
- Approximation techniques

BOW vs. String Kernel

                  300 train + 700 test    600 train + 400 test
                  CE*   F1    NSV*        CE*   F1    NSV*       Time
String Kernel     15    87    184         3     97    305        208
Syllable Kernel   12    89    218         4     97    379        29
Word Kernel       18    85    157         3     97    281        9
BOW – TF only     18    84    150         3     97    287        1/6
BOW – TFIDF       8     93    252         4     97    443        1/6

*CE – classification error, NSV – number of support vectors
Approximations

            Prec [%]   Rec [%]   Time [sec]
TFIDF       95         97        24
DR (1500)   87         90        24
DR (2500)   87         91        48
DR (3500)   89         91        64
ICD (200)   86         92        49
ICD (450)   88         92        114
ICD (750)   90         94        244