
Page 1: Multi-Task Kernel Learning

Multi-Task Kernel Learning

J. Saketha Nath

CSE, IIT-Bombay

MLSS, IIT-Madras

Page 2: Multi-Task Kernel Learning

Multi-Task Learning

Setting:

Multiple related learning tasks

E.g., object recognition

Exploit task relatedness for better generalization

The problem:

Learn shared features across tasks

If possible, sparse feature representations

Page 4: Multi-Task Kernel Learning

A simple case...

Suppose:

Tasks share a few input features.

Formulation:

\min_{w,b,\xi}\quad \tfrac{1}{2}\Big(\sum_{f=1}^{d}\|w^f\|_2\Big)^{2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\big(w_t^\top x_{ti} - b_t\big)\;\ge\;1-\xi_{ti},\qquad \xi_{ti}\ge 0
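Not from the talk, but to make the objective concrete: a minimal sketch of this formulation using the off-the-shelf convex solver cvxpy, assuming T tasks with data X[t] (an m_t x d matrix) and labels y[t] in {-1, +1}; the variable names, solver and value of C are placeholders.

```python
# Minimal sketch: joint l1-l2 regularized multi-task SVM via cvxpy (illustrative only).
# Assumes T tasks with data (X[t], y[t]); X[t] is (m_t, d), y[t] has entries in {-1, +1}.
import numpy as np
import cvxpy as cp

def mtl_l1l2_svm(X, y, C=1.0):
    T = len(X)
    d = X[0].shape[1]
    W = cp.Variable((d, T))          # column t = weight vector of task t
    b = cp.Variable(T)

    # (sum_f ||w^f||_2)^2 : l2 across tasks (rows of W), l1 across features, then squared.
    reg = 0.5 * cp.square(cp.sum(cp.norm(W, 2, axis=1)))

    # Hinge losses, one term per task and example (slack variables are implicit).
    loss = 0
    for t in range(T):
        margins = cp.multiply(y[t], X[t] @ W[:, t] - b[t])
        loss += cp.sum(cp.pos(1 - margins))

    prob = cp.Problem(cp.Minimize(reg + C * loss))
    prob.solve()
    return W.value, b.value
```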

Page 6: Multi-Task Kernel Learning

A simple case...

Suppose:

Tasks share a few input features.

Formulation:

\min_{w,b,\xi}\quad \tfrac{1}{2}\sum_{t=1}^{T}\|w_t\|_2^{2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\big(w_t^\top x_{ti} - b_t\big)\;\ge\;1-\xi_{ti},\qquad \xi_{ti}\ge 0

(standard per-task l2 regularization: equivalent to training the T SVMs independently, with no sharing across tasks)


Page 7: Multi-Task Kernel Learning

A simple case...

Suppose:

Tasks share a few input features.

Formulation:

\min_{w,b,\xi}\quad \tfrac{1}{2}\sum_{t=1}^{T}\|w_t\|_1^{2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\big(w_t^\top x_{ti} - b_t\big)\;\ge\;1-\xi_{ti},\qquad \xi_{ti}\ge 0

(per-task l1 regularization: promotes sparsity within each task, but the tasks still do not share features)

Page 9: Multi-Task Kernel Learning

l1-l2 Regularizer

The regularizer \sum_{f=1}^{d}\|w^f\|_2 is an l1 norm over features of the vector \big(\|w^1\|_2,\dots,\|w^d\|_2\big), each entry of which is an l2 norm across tasks:

W = [\,w_1 \;\cdots\; w_T\,] =
\begin{pmatrix}
w_{11} & \cdots & w_{T1} \\
\vdots &        & \vdots \\
w_{1d} & \cdots & w_{Td}
\end{pmatrix},
\qquad \text{rows } w^1,\dots,w^d \text{ (features)}, \quad \text{columns } w_1,\dots,w_T \text{ (tasks)}.

The l1 part drives entire rows of W to zero, so the same few features are selected jointly for all tasks.
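A quick, purely illustrative numerical check of this structure (made-up data and names):

```python
# Illustrative computation of the l1-l2 regularizer on a d x T weight matrix W,
# whose columns are per-task weight vectors and rows are per-feature loadings.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 3
W = rng.standard_normal((d, T))
W[2:, :] = 0.0                           # rows 2..4 zeroed: those features are unused by all tasks

row_norms = np.linalg.norm(W, axis=1)    # ||w^f||_2, one value per feature (l2 across tasks)
l1_l2 = row_norms.sum()                  # sum_f ||w^f||_2  (l1 across features)
print(row_norms)                         # zeros for the jointly unused features
print(0.5 * l1_l2 ** 2)                  # value of the regularizer 1/2 (sum_f ||w^f||_2)^2
```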


Page 10: Multi-Task Kernel Learning

Interpretation of lp regularization

Consider \min_{x:\|x\|_2\le 1} f(x)

Page 12: Multi-Task Kernel Learning

Interpretation of lp regularization

Consider \min_{x:\|x\|_1\le 1} f(x)

Page 14: Multi-Task Kernel Learning

Interpretation of lp regularization

Consider \min_{x:\|x\|_\infty\le 1} f(x)


Page 15: Multi-Task Kernel Learning

Interpretation of lp regularization

Summary:

1 \le p < 2: promotes sparsity

p = 2: induces robustness; rotation-invariant

2 < p < \infty: promotes non-sparse combinations

p = \infty: promotes equal weights
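A small illustration of the sparsity claim, not from the slides: for f(x) = \tfrac{1}{2}\|x-c\|_2^2 the constrained minimizer is just the Euclidean projection of c onto the norm ball, so the l1 and l2 cases can be compared directly. The vector c and the Duchi-style l1-ball projection below are illustrative choices.

```python
# Illustrative example: minimize f(x) = 1/2 ||x - c||^2 subject to a norm-ball constraint.
# The minimizer is the Euclidean projection of c onto the ball, so we can compare
# the l1 ball (sparse solution) with the l2 ball (rescaled, dense solution) directly.
import numpy as np

def project_l2_ball(v, z=1.0):
    n = np.linalg.norm(v)
    return v if n <= z else (z / n) * v

def project_l1_ball(v, z=1.0):
    # Projection onto {x : ||x||_1 <= z} via sorting (Duchi et al., 2008).
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

c = np.array([0.9, 0.5, 0.1, -0.05])
print(project_l2_ball(c))   # all coordinates shrink proportionally -> no exact zeros
print(project_l1_ball(c))   # small coordinates are thresholded to exactly zero
```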


Page 16: Multi-Task Kernel Learning

A Bit More Realistic Case...

Suppose: [Argyriou et al., 08]

Tasks share a few (possibly learnt) features.

Rotationally transformed features

Formulation:

\min_{w,b,\xi,L}\quad \tfrac{1}{2}\Big(\sum_{f=1}^{d}\|w^f\|_2\Big)^{2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\big(w_t^\top L^\top x_{ti} - b_t\big)\;\ge\;1-\xi_{ti},\qquad \xi_{ti}\ge 0,\qquad L\in O^{d}

Page 19: Multi-Task Kernel Learning

Multi-task Sparse Feature Learning (MTSFL) Formulation

Summary:

Though non-convex, the global optimum can be obtained (via an equivalent convex reformulation)

Can be kernelized

Efficient alternating-minimization algorithm (one EVD per iteration; sketched below this summary)

Achieves state-of-the-art performance on benchmarks
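A minimal sketch of an alternating scheme in the spirit of [Argyriou et al., 08], with squared loss standing in for the hinge loss used in the talk and with synthetic data; per-task solves alternate with a closed-form shared-metric update that needs one EVD per iteration. Names, constants and iteration counts are illustrative assumptions, not the talk's implementation.

```python
# Sketch of MTSFL-style alternating minimization (squared loss for brevity):
#   (i) per-task weights given a shared feature metric D; (ii) D from an EVD of W W^T.
import numpy as np

def mtsfl_alternating(X, y, gamma=1.0, eps=1e-3, n_iter=20):
    T, d = len(X), X[0].shape[1]
    D = np.eye(d) / d                                  # shared feature metric, trace(D) = 1
    W = np.zeros((d, T))
    for _ in range(n_iter):
        D_inv = np.linalg.inv(D + eps * np.eye(d))     # smoothed inverse for stability
        for t in range(T):                             # (i) ridge-like solve per task
            A = X[t].T @ X[t] + gamma * D_inv
            W[:, t] = np.linalg.solve(A, X[t].T @ y[t])
        # (ii) closed-form metric update: D = (W W^T)^(1/2) / trace((W W^T)^(1/2))
        evals, evecs = np.linalg.eigh(W @ W.T)         # one EVD per iteration
        sqrt_evals = np.sqrt(np.clip(evals, 0.0, None))
        S = (evecs * sqrt_evals) @ evecs.T
        D = S / max(np.trace(S), 1e-12)
    return W, D

# Toy usage (synthetic data; X[t] is (m_t, d), y[t] is (m_t,), all tasks share two features)
rng = np.random.default_rng(0)
X = [rng.standard_normal((15, 8)) for _ in range(4)]
w_shared = np.array([1.0, -1.0, 0, 0, 0, 0, 0, 0])
y = [Xt @ w_shared + 0.1 * rng.standard_normal(15) for Xt in X]
W, D = mtsfl_alternating(X, y)
print(np.round(np.diag(D), 2))   # mass concentrates on the two shared features
```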

Discussion:

Rotationally transformed features — too restrictive

Essential for convexity

Idea: Enrich the input space itself

Multiple Kernel Learning (MKL)?

Page 22: Multi-Task Kernel Learning

Central Idea

Pose the problem as that of learning a shared kernel

Outline:

Two formulations:

Learn a kernel shared across tasks (MK-MTFL)

Extension of standard MKL to multi-task case

Learn a sparse representation from the shared kernel (MK-MTSFL)

Extension of MTSFL to multiple base kernels

Page 24: Multi-Task Kernel Learning

Notational stuff...

k_1,\dots,k_n: base kernels

\phi_j(\cdot): implicit feature map associated with k_j

w_{tjf}: loading of the f-th feature of the j-th kernel for the t-th task

w_{\cdot jf},\; w_{t\cdot f},\; w_{tj\cdot}: vectors obtained by varying the dotted index with the other two indices fixed

Linear model: f_t(x) = \sum_{j=1}^{n} w_{tj\cdot}^\top \phi_j(x) - b_t


Page 25: Multi-Task Kernel Learning

MK-MTFL Formulation

Primal:

\min_{w,b,\xi}\quad \tfrac{1}{2}\overbrace{\Bigg(\sum_{j=1}^{n}\Big(\sum_{t=1}^{T}\|w_{tj\cdot}\|_2^{2}\Big)^{1/2}\Bigg)^{2}}^{l_1\text{-}l_2\text{-}l_2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\Big(\sum_{j=1}^{n} w_{tj\cdot}^\top \phi_j(x_{ti}) - b_t\Big) \;\ge\; 1-\xi_{ti},\qquad \xi_{ti}\ge 0

Partial Dual:

\min_{\gamma\in\Delta_n}\; \max_{\alpha_t\in S_{m_t}(C)}\; \sum_{t=1}^{T}\Bigg[\mathbf{1}^\top\alpha_t \;-\; \tfrac{1}{2}\,\alpha_t^\top Y_t\Big(\sum_{j=1}^{n}\gamma_j K_{tj}\Big)Y_t\,\alpha_t\Bigg]

Page 27: Multi-Task Kernel Learning

MK-MTFL Formulation

Primal (2≤p≤∞):

\min_{w,b,\xi}\quad \tfrac{1}{2}\overbrace{\Bigg(\sum_{j=1}^{n}\Big(\sum_{t=1}^{T}\|w_{tj\cdot}\|_2^{\,p}\Big)^{1/p}\Bigg)^{2}}^{l_1\text{-}l_p\text{-}l_2,\ p\ge 2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\Big(\sum_{j=1}^{n} w_{tj\cdot}^\top \phi_j(x_{ti}) - b_t\Big) \;\ge\; 1-\xi_{ti},\qquad \xi_{ti}\ge 0

Partial Dual (\bar p = \tfrac{p}{p-2}):

\min_{\gamma\in\Delta_n}\; \max_{\lambda_j\in\Delta_{T,\bar p}}\; \max_{\alpha_t\in S_{m_t}(C)}\; \sum_{t=1}^{T}\Bigg[\mathbf{1}^\top\alpha_t \;-\; \tfrac{1}{2}\,\alpha_t^\top Y_t\Big(\sum_{j=1}^{n}\frac{\gamma_j K_{tj}}{\lambda_{jt}}\Big)Y_t\,\alpha_t\Bigg]
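One plausible route from this primal to the partial dual (my reconstruction; the slides do not spell it out) uses two standard variational identities:

\Big(\sum_{j=1}^{n} c_j\Big)^{2} \;=\; \min_{\gamma\in\Delta_n}\; \sum_{j=1}^{n} \frac{c_j^2}{\gamma_j},
\qquad
\|a\|_p^2 \;=\; \max_{\lambda\ge 0,\ \|\lambda\|_{\bar p}\le 1}\; \sum_{t=1}^{T} \lambda_t a_t^2
\quad \big(p\ge 2,\ \bar p = \tfrac{p}{p-2}\big).

Applying the first with c_j = \big(\sum_t \|w_{tj\cdot}\|_2^{\,p}\big)^{1/p} and the second with a_t = \|w_{tj\cdot}\|_2, and then taking the standard SVM dual for each task, yields kernel coefficients of the form \gamma_j/\lambda_{jt}, which matches the partial dual above (for p = 2, \bar p = \infty and the optimal \lambda_{jt} = 1, recovering the previous dual).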

Page 29: Multi-Task Kernel Learning

MK-MTFL Formulation

Summary:

Novel formulation for learning shared kernel

Extension of MKL to multi-task case

Tasks can be unequally reliable

Efficient mirror-descent based algorithm (sketched below)

Each step solves T regular SVMs; cost O\big(\sum_{t=1}^{T} m_t^2\, d\, n\big)
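A hedged sketch of such a wrapper for the p = 2 case: mirror descent with the entropy mirror map over \gamma\in\Delta_n, where each step solves the T SVMs with combined kernels \sum_j \gamma_j K_{tj} (here via scikit-learn) and the gradient comes from Danskin's theorem. The data layout, step size and iteration count are assumptions, not the talk's implementation.

```python
# Sketch of the MK-MTFL (p = 2) outer loop: minimize over gamma in the simplex the sum of
# per-task SVM dual objectives; each iteration solves T SVMs and takes an
# exponentiated-gradient (entropic mirror-descent) step on gamma.
import numpy as np
from sklearn.svm import SVC

def mk_mtfl_p2(K, y, C=1.0, step=0.1, n_iter=50):
    # K[t][j]: (m_t, m_t) Gram matrix of base kernel j on task t's data; y[t] in {-1, +1}.
    T, n = len(K), len(K[0])
    gamma = np.full(n, 1.0 / n)                 # kernel weights on the simplex
    for _ in range(n_iter):
        grad = np.zeros(n)
        for t in range(T):
            K_eff = sum(gamma[j] * K[t][j] for j in range(n))
            clf = SVC(C=C, kernel='precomputed').fit(K_eff, y[t])
            sv = clf.support_
            yalpha = clf.dual_coef_.ravel()     # y_i * alpha_i on the support vectors
            # Danskin: d(dual objective)/d gamma_j = -1/2 (Y a)^T K_tj (Y a)
            for j in range(n):
                grad[j] -= 0.5 * yalpha @ K[t][j][np.ix_(sv, sv)] @ yalpha
        # entropic mirror-descent (exponentiated gradient) step on the simplex
        gamma = gamma * np.exp(-step * grad)
        gamma /= gamma.sum()
    return gamma
```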


Page 30: Multi-Task Kernel Learning

MK-MTSFL Formulation

Primal (1≤q≤2):

\min_{w,b,\xi,L}\quad \tfrac{1}{2}\overbrace{\Bigg(\sum_{j=1}^{n}\Big(\sum_{f=1}^{d_j}\|w_{\cdot jf}\|_2\Big)^{q}\Bigg)^{2/q}}^{l_q\text{-}l_1\text{-}l_2} \;+\; C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}

\text{s.t.}\quad y_{ti}\Big(\sum_{j=1}^{n} w_{tj\cdot}^\top L_j^\top \phi_j(x_{ti}) - b_t\Big) \;\ge\; 1-\xi_{ti},\qquad \xi_{ti}\ge 0,\qquad L_j\in O^{d_j}

Partial Dual (\bar q = \tfrac{q}{2-q}):

\min_{Q}\; \sum_{t=1}^{T}\; \max_{\alpha_t\in S_{m_t}(C)}\; \mathbf{1}^\top\alpha_t \;-\; \tfrac{1}{2}\,\alpha_t^\top Y_t\Big(\sum_{j=1}^{n} M_{tj}^\top Q_j M_{tj}\Big)Y_t\,\alpha_t

\text{s.t.}\quad Q_j \succeq 0,\qquad \sum_{j=1}^{n}\big(\mathrm{trace}(Q_j)\big)^{\bar q} \;\le\; 1

Page 32: Multi-Task Kernel Learning

MK-MTSFL Formulation

Summary:

Novel formulation for learning shared sparse feature representations

Trace-norm constraints lead to low-rank matrices

Extension of MTSFL [Argyriou et al., 08] to multiple base kernels

Though non-convex, the global optimum can be efficiently obtained

Efficient mirror-descent based algorithm

Each step solves T regular SVMs and n EVDs of full matrices

Faster convergence in practice than alternating minimization

Page 34: Multi-Task Kernel Learning

Solving MK-MTSFL

Partial Dual:

\min_{Q}\; \overbrace{\sum_{t=1}^{T}\; \max_{\alpha_t\in S_{m_t}(C)}\; \mathbf{1}^\top\alpha_t \;-\; \tfrac{1}{2}\,\alpha_t^\top Y_t\Big(\sum_{j=1}^{n} M_{tj}^\top Q_j M_{tj}\Big)Y_t\,\alpha_t}^{g(Q)}

\text{s.t.}\quad Q_j \succeq 0,\qquad \sum_{j=1}^{n}\big(\mathrm{trace}(Q_j)\big)^{\bar q} \;\le\; 1

g(Q) cannot be computed analytically

Danskin's theorem provides \nabla g(Q); evaluating it involves solving T regular SVMs
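Making the Danskin step explicit (my reconstruction, assuming each inner maximizer \alpha_t^* is unique so that g is differentiable):

\nabla_{Q_j}\, g(Q) \;=\; -\tfrac{1}{2}\sum_{t=1}^{T}\big(M_{tj}\, Y_t\, \alpha_t^*\big)\big(M_{tj}\, Y_t\, \alpha_t^*\big)^\top,

where \alpha_t^* is the optimal dual vector of the t-th SVM solved with kernel \sum_{j=1}^{n} M_{tj}^\top Q_j M_{tj}; each gradient evaluation therefore costs T SVM solves.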

Page 36: Multi-Task Kernel Learning

Projected (Sub-)Gradient Descent

minx∈X f(x) (f is convex, Lipschitz, X is compact)

At iteration k:

f is approx. by linear func. f(x) = f(xk) +∇f(xk)>(x− xk)valid only when ‖x− xk‖2 is small

xk+1

= arg minx∈X

sk∇f(xk)>(x− xk) +12‖x− xk‖22

= arg minx∈X

12‖x− (xk − sk∇f(xk))‖22

= ΠX (xk − sk∇f(xk))

Convergence guarantees with some choices of step-sizes (sk)

“Optimal” for Euclidean geometry
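A minimal sketch of this update, taking the unit l2 ball as \mathcal{X}; the objective and step sizes are illustrative choices, not from the slides.

```python
# Projected (sub-)gradient descent: gradient step followed by projection onto X.
import numpy as np

def projected_gradient_descent(grad_f, project, x0, steps):
    x = x0
    for s in steps:
        x = project(x - s * grad_f(x))   # gradient step, then projection back onto X
    return x

# Example: minimize f(x) = 1/2 ||x - c||^2 over the unit l2 ball.
c = np.array([2.0, 1.0])
grad_f = lambda x: x - c
project = lambda v: v / max(1.0, np.linalg.norm(v))
x_star = projected_gradient_descent(grad_f, project, np.zeros(2), steps=[0.5] * 100)
print(x_star)   # approaches c / ||c||, the projection of c onto the ball
```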

Page 41: Multi-Task Kernel Learning

Mirror Descent

Key Idea:

Use a Bregman-divergence-based regularizer so that the per-step problem is easy

x_{k+1} \;=\; \arg\min_{x\in\mathcal{X}}\; s_k \nabla f(x_k)^\top (x - x_k) + D_{x_k}(x)

Bregman Divergence:

For a strongly convex \omega(\cdot):\quad D_x(y) = \omega(y) - \omega(x) - \nabla\omega(x)^\top (y - x)

Common choices:

Sphere: \omega(x) = \tfrac{1}{2}\|x\|_2^2

Simplex: \omega(x) = \sum_i x_i \log(x_i)

Spectrahedron: \omega(x) = \mathrm{trace}(x\log(x))
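For the simplex choice above, the per-step problem has a closed-form multiplicative update (exponentiated gradient). A small illustrative sketch with a made-up linear objective:

```python
# Mirror descent on the probability simplex with the entropy mirror map
# omega(x) = sum_i x_i log x_i: the per-step problem reduces to a multiplicative update.
import numpy as np

def mirror_descent_simplex(grad_f, x0, steps):
    x = x0
    for s in steps:
        x = x * np.exp(-s * grad_f(x))   # argmin of s*<grad, .> + KL divergence from current x
        x = x / x.sum()                  # renormalize back onto the simplex
    return x

# Example: minimize a linear function <c, x> over the simplex (optimum puts mass on argmin c).
c = np.array([0.3, 0.1, 0.7])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, steps=[0.5] * 200)
print(np.round(x, 3))   # concentrates on coordinate 1, the smallest cost
```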

Page 43: Multi-Task Kernel Learning

Solving MK-MTSFL

Our Finding:

The entropy function \mathrm{trace}(x\log(x)) is good enough for our problem

Per-step problem:

\min_{Q}\; \sum_{j=1}^{n}\big\{\mathrm{trace}(\zeta_j Q_j) + \mathrm{trace}(Q_j \log(Q_j))\big\}

\text{s.t.}\quad Q_j \succeq 0,\qquad \sum_{j=1}^{n}\big(\mathrm{trace}(Q_j)\big)^{\bar q} \;\le\; 1

After EVDs of Qj:

\min_{\rho}\; \sum_{j=1}^{n}\big(\rho_j \log(\rho_j) + \rho_j \pi_j\big)

\text{s.t.}\quad \rho_j \ge 0,\qquad \sum_{j=1}^{n} \rho_j^{\,\bar q} \;\le\; 1
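An illustrative numerical solve of this reduced problem in \rho (the \pi_j values and \bar q below are made-up placeholders, and SLSQP is just a convenient off-the-shelf choice, not the talk's solver):

```python
# Reduced per-step problem:
#   min_rho  sum_j rho_j log rho_j + rho_j pi_j   s.t.  rho_j >= 0,  sum_j rho_j^qbar <= 1
import numpy as np
from scipy.optimize import minimize

pi = np.array([0.5, -0.2, 1.0])     # placeholder coefficients from the EVD step
qbar = 3.0                          # e.g. q = 1.5  =>  q-bar = q / (2 - q) = 3

def objective(rho):
    rho = np.maximum(rho, 1e-12)    # keep the log well-defined near the boundary
    return np.sum(rho * np.log(rho) + rho * pi)

cons = [{'type': 'ineq', 'fun': lambda rho: 1.0 - np.sum(np.maximum(rho, 0.0) ** qbar)}]
res = minimize(objective, x0=np.full(3, 0.3), bounds=[(0.0, None)] * 3,
               constraints=cons, method='SLSQP')
print(np.round(res.x, 3))
```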

Page 46: Multi-Task Kernel Learning

Simulations

Datasets:

School: Multi-task benchmark. Prediction of student performance in various schools.

139 regression tasks, 28 input features, 15 training examples per task

Letters: OCR dataset. Each letter is considered as a task.

9 binary classification tasks, 128 input features, 10 training examples per task

Dermatology: Bio-informatics dataset. Predicting one of six skin diseases.

15 binary classification tasks, 33 input features, 10 training examples per task


Page 47: Multi-Task Kernel Learning

Simulations

Table: Comparison of generalization performance

                    SVM      MTSFL    MK-MTFL                  MK-MTSFL
                                      p=2    p=7    p=Inf      q=1    q=1.5   q=1.99
School (S)        -45.88     13.94    10.76  13.80  10.52      14.07  13.80   13.94
Letters (L)        74.89     75.54    78.28  78.30  78.31      76.38  76.93   74.57
Dermatology (D)     8         6        0      0      0          8      7       5.33

Training times: MTSFL – 179 sec, MK-MTFL – 192 sec, MK-MTSFL – 15445 sec.


Page 49: Multi-Task Kernel Learning

Conclusions

Two novel formulations for multi-task feature learning:

Extension of MKL to the multi-task case (non-sparse)

Simple, good generalization, scalable

Extension of MTSFL to multiple base kernels (sparse)

Better generalization than the state of the art

Efficient mirror-descent based algorithm

Faster convergence

Sparse representations may not always be desirable


Page 50: Multi-Task Kernel Learning

Questions?


Page 51: Multi-Task Kernel Learning

Thank You
