
An Online Learning Algorithm for Bilinear Models

Yuanbin Wu, Shiliang Sun

East China Normal University


Introduction

Bilinear models
Online learning
Regret analysis



Introduction: bilinear models

Linear model for multi-class classification:
$h(x) = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x, y)$

Matrix-form linear model (matrix feature $\Phi(x, y)$):
$h(x) = \arg\max_{y \in \mathcal{Y}} \operatorname{Tr}(W^\top \Phi(x, y))$

Bilinear model (a rank-1 constraint on $W$):
$h(x) = \arg\max_{y \in \mathcal{Y}} \alpha^\top \Phi(x, y)\, \beta$


Introduction: online learning

Online convex optimization: convexity is violated by rank constraints; $\Omega_1 = \{W \mid \operatorname{rank}(W) \le 1\}$ is not a convex set.

The primal-dual perspective can help: the dual problem is always convex.

Gradients of matrix norms: the singular value decomposition.


Introduction: regret analysis

The regret of an online algorithm w.r.t. strategy U

$R_N(U) = \frac{1}{N}\sum_{t=1}^{N} L_t(W_t) - \frac{1}{N}\sum_{t=1}^{N} L_t(U).$

Bound on the Hessian (strong smoothness):
$f(x + y) \le f(x) + \nabla f(x)^\top y + \frac{\beta}{2}\|y\|^2$

Can we obtain similar bounds for rank-constrained problems?


Outline

1 Bilinear Model

2 Online Learning Algorithm

3 Regret Analysis

4 Experiments

5 Conclusion


Bilinear Model

Definition. We define the bilinear model with the discriminant function
$h(x) = \arg\max_{y \in \mathcal{Y}} \alpha^\top \Phi(x, y)\, \beta$
where $\alpha \in \mathbb{R}^m$, $\beta \in \mathbb{R}^n$. The model parameter $W = \alpha\beta^\top$ is a rank-1 matrix.

Why the bilinear formulation:
- semantic relations among features
- a more compact model
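To make the definition concrete, here is a minimal numpy sketch of bilinear prediction; the label set, dimensions, and the feature map `phi` are illustrative placeholders, not the paper's actual features.

```python
import numpy as np

def predict(x, labels, phi, alpha, beta):
    """h(x) = argmax_y alpha^T Phi(x, y) beta, with Phi(x, y) an m x n matrix."""
    scores = [alpha @ phi(x, y) @ beta for y in labels]
    return labels[int(np.argmax(scores))]

# Toy usage with random m x n feature matrices for each label.
rng = np.random.default_rng(0)
m, n, labels = 4, 5, ["a", "b", "c"]
feats = {y: rng.normal(size=(m, n)) for y in labels}
alpha, beta = rng.normal(size=m), rng.normal(size=n)
print(predict("dummy input", labels, lambda x, y: feats[y], alpha, beta))
```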



Bilinear Model

Example: sequential labelling. The linear model:
$h(x) = \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^{n} w^\top \Phi(x, y_i, y_{i-1})$

The pairwise indicator feature, reshaped as a matrix, factorizes into two per-position indicators (here $y_i y_{i-1} = BO$):

$\underbrace{\begin{bmatrix} {\scriptstyle BB} & {\scriptstyle BI} & {\scriptstyle BO} & {\scriptstyle IB} & {\scriptstyle II} & {\scriptstyle IO} & {\scriptstyle OB} & {\scriptstyle OI} & {\scriptstyle OO} \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}}_{\Phi(x,\, y_i,\, y_{i-1})\ \text{(vector form)}} \;\Rightarrow\; \underbrace{\begin{bmatrix} & {\scriptstyle B} & {\scriptstyle I} & {\scriptstyle O} \\ {\scriptstyle B} & 0 & 0 & 1 \\ {\scriptstyle I} & 0 & 0 & 0 \\ {\scriptstyle O} & 0 & 0 & 0 \end{bmatrix}}_{\Phi(x,\, y_i,\, y_{i-1})\ \text{(matrix form)}} = \underbrace{\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}}_{\zeta_1(x,\, y_i)} \underbrace{\begin{bmatrix} 0 & 0 & 1 \end{bmatrix}}_{\zeta_2^\top(x,\, y_{i-1})}$

The bilinear model:
$h(x) = \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^{n} \alpha^\top \left[\zeta(x, y_i) \otimes \zeta(x, y_{i-1})\right] \beta$

Number of parameters reduced from $O(n^2)$ to $O(n)$.

[Figure: chain-structured sequence model with labels $y_0, y_1, \dots, y_n$ over input $x$]
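A small numpy sketch of the factorization above (the label set and indicator encoding are just for illustration): the pairwise indicator feature, viewed as a matrix, is exactly the outer product of two per-position indicators.

```python
import numpy as np

labels = ["B", "I", "O"]
idx = {y: i for i, y in enumerate(labels)}

def zeta(y):
    """One-hot indicator for a single label position."""
    v = np.zeros(len(labels))
    v[idx[y]] = 1.0
    return v

# Pairwise feature for (y_i, y_{i-1}) = (B, O): a 3x3 matrix with a single 1,
# which is exactly the outer product zeta_1 zeta_2^T from the slide.
Phi = np.outer(zeta("B"), zeta("O"))
print(Phi)

# A bilinear score alpha^T Phi beta therefore touches only O(n) parameters
# (alpha and beta) instead of the O(n^2) entries of a full weight matrix W.
```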



Online Learning Algorithm

Large margin optimization problem:
$\min_{W = \alpha\beta^\top \in \Omega_1} \frac{1}{2}\|W\|_F^2 + C \sum_{j=1}^{N} [1 - \langle W, \Delta\Phi_j \rangle]_+,$
where $\Delta\Phi_j \triangleq \Phi(x_j, y_j) - \Phi(x_j, h(x_j))$ and $\Omega_1$ is the set of rank-1 matrices.

Biconvex problem:
$\min_{\alpha, \beta} \frac{1}{2}\|\alpha\|^2 + \frac{1}{2}\|\beta\|^2 + C \sum_{j=1}^{N} [1 - \alpha^\top \Delta\Phi_j \beta]_+$
Blockwise coordinate descent applies (see the sketch below), but it has degenerate cases: it may end up solving only a 0-order model on $\zeta(x, y_i)$.
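A rough sketch of the blockwise coordinate descent baseline under simplifying assumptions (precomputed $\Delta\Phi_j$ matrices, plain subgradient steps; the step size and iteration counts are illustrative):

```python
import numpy as np

def bcd(delta_phis, m, n, C=1.0, lr=0.01, epochs=20, seed=0):
    """Alternating (blockwise) minimization of
    1/2||alpha||^2 + 1/2||beta||^2 + C * sum_j [1 - alpha^T dPhi_j beta]_+ ."""
    rng = np.random.default_rng(seed)
    alpha, beta = rng.normal(size=m), rng.normal(size=n)
    for _ in range(epochs):
        for update_alpha in (True, False):   # fix one block, step on the other
            for _ in range(5):               # a few subgradient steps per block
                g_a, g_b = alpha.copy(), beta.copy()  # regularizer gradients
                for dphi in delta_phis:
                    if 1.0 - alpha @ dphi @ beta > 0.0:  # hinge is active
                        g_a -= C * (dphi @ beta)
                        g_b -= C * (dphi.T @ alpha)
                if update_alpha:
                    alpha -= lr * g_a
                else:
                    beta -= lr * g_b
    return alpha, beta
```

With one factor fixed, the objective in the other factor is a convex hinge-loss problem, which is why plain subgradient steps suffice for each block.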



Online Learning Algorithm

Our plan: work from the dual, with mirror descent style updates:
$W_{t-1} \xrightarrow{\nabla F} \Theta_{t-1} \xrightarrow{-\eta_t \nabla L_t} \Theta_t \xrightarrow{\nabla F^*} W_t$


Online Learning Algorithm

Define $F_1(W) = \frac{1}{2}\|W\|_F^2$ if $W \in \Omega_1$, $+\infty$ otherwise.

The dual problem:
$D(\eta) = \sum_{j=1}^{N} \eta_j - \max_{W \in \Omega_1}\left(\Big\langle W, \sum_{j=1}^{N} \eta_j \Delta\Phi_j \Big\rangle - \frac{1}{2}\|W\|_F^2\right) = \sum_{j=1}^{N} \eta_j - F_1^*(\Theta_N), \qquad \eta_j \in [0, C],$
where
$\Theta_N = \Theta_{N-1} + \eta_N \Delta\Phi_N$ (accumulated gradients of the hinge loss; the mirror space),
$F_1^*(\Theta) = \max_{W \in \Omega_1} \langle W, \Theta \rangle - \frac{1}{2}\|W\|_F^2$ (the Fenchel dual).




Online Learning Algorithm

The dual problem:
$D(\eta) = \sum_{j=1}^{N} \eta_j - F_1^*(\Theta_N) = \sum_{j=1}^{N} \eta_j - \frac{1}{2}\|\Theta_N\|_2^2,$
with $\Theta_N = \Theta_{N-1} + \eta_N \Delta\Phi_N$ and $F_1^*(\Theta) = \max_{W \in \Omega_1} \langle W, \Theta \rangle - \frac{1}{2}\|W\|_F^2$.

Proposition: $F_1^*(\Theta) = \frac{1}{2}\|\Theta\|_2^2 = \frac{1}{2}\|\Theta\|_{s(\infty)}^2 = \frac{1}{2}\sigma_1(\Theta)^2$
(the SVD gives the best low-rank approximation).
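A quick numerical sanity check of this proposition (a sketch, not from the paper): the maximizer over rank-1 matrices is $W = \sigma_1 u_1 v_1^\top$ with value $\frac{1}{2}\sigma_1(\Theta)^2$, and random rank-1 candidates never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
Theta = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(Theta)

def obj(W):
    """<W, Theta> - 1/2 ||W||_F^2."""
    return np.sum(W * Theta) - 0.5 * np.sum(W * W)

# The claimed maximizer over rank-1 matrices: W = sigma_1 u_1 v_1^T.
W_star = s[0] * np.outer(U[:, 0], Vt[0])
assert np.isclose(obj(W_star), 0.5 * s[0] ** 2)

# Random rank-1 candidates stay below 1/2 sigma_1(Theta)^2.
best = max(obj(np.outer(rng.normal(size=6), rng.normal(size=4)))
           for _ in range(10_000))
print(best, "<=", 0.5 * s[0] ** 2)
```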

A series of dual problems: $D_{t+1}(\eta) = \sum_{j=1}^{t} \eta_j - F_1^*(\Theta_t)$, $t = 1, 2, \dots, N$. At round $t$, the algorithm:

uses $W_{t-1} = \alpha_{t-1}\beta_{t-1}^\top$ to predict $x_t$, $\hat{y}_t = h(x_t)$;

sets the dual variable $\eta_t = 0$ if $\hat{y}_t = y_t$, and $\eta_t = C$ if $\hat{y}_t \ne y_t$;

updates $W_t$:
$W_t = \nabla F_1^*(\Theta_t) = \arg\max_{W \in \Omega_1} \langle W, \Theta_t \rangle - \frac{1}{2}\|W\|_F^2 \overset{\sigma_1 \ne \sigma_2}{=} \sigma_1 u_1 v_1^\top$



Online Learning Algorithm

uses $W_{t-1} = \alpha_{t-1}\beta_{t-1}^\top$ to predict $x_t$:
$\hat{y}_t = h(x_t) = \arg\max_{y \in \mathcal{Y}} \alpha_{t-1}^\top \Phi(x_t, y)\, \beta_{t-1}$

sets the dual variable $\eta_t = 0$ if $\hat{y}_t = y_t$, and $\eta_t = C$ if $\hat{y}_t \ne y_t$;

updates $W_t$:
$\Theta_t = \Theta_{t-1} + \eta_t \Delta\Phi_t = \sum_{i=1}^{p} \sigma_i u_i v_i^\top, \qquad W_t = \sigma_1 u_1 v_1^\top$
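A minimal numpy sketch of one such round, assuming a placeholder feature map `phi` and using a full SVD for the rank-1 projection (the power iteration on the next slide replaces it in practice):

```python
import numpy as np

def online_round(Theta, alpha, beta, x, y_true, labels, phi, C=1.0):
    """One round: predict with W_{t-1} = alpha beta^T; on a mistake set
    eta_t = C, add eta_t * dPhi_t to Theta, and take the best rank-1
    approximation of Theta as the new W_t."""
    scores = [alpha @ phi(x, y) @ beta for y in labels]
    y_hat = labels[int(np.argmax(scores))]
    if y_hat != y_true:                          # eta_t = C only on a mistake
        Theta = Theta + C * (phi(x, y_true) - phi(x, y_hat))
    U, s, Vt = np.linalg.svd(Theta)              # W_t = sigma_1 u_1 v_1^T
    alpha, beta = s[0] * U[:, 0], Vt[0]          # fold sigma_1 into alpha
    return Theta, alpha, beta
```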


Online Learning Algorithm

$W_t = \nabla F_1^*(\Theta_t) = \sigma_1 u_1 v_1^\top$

A full SVD is expensive; only the leading singular vectors are needed.

Power iteration (if $\sigma_1(\Theta) \ne \sigma_2(\Theta)$):
$\alpha^{(\tau+1)} = \Theta\Theta^\top \alpha^{(\tau)}, \quad \frac{\alpha^{(\tau+1)}}{\|\alpha^{(\tau+1)}\|} \to u_1, \qquad \beta^{(\tau+1)} = \Theta^\top\Theta\, \beta^{(\tau)}, \quad \frac{\beta^{(\tau+1)}}{\|\beta^{(\tau+1)}\|} \to v_1$

Initial value and normalization:
⋆ $\Theta_t = \Theta_{t-1} + \eta_t \Delta\Phi_t$
⋆ if $\Delta\Phi_t$ is "small", $\alpha_t$ is close to $\alpha_{t-1}$, so the iteration can warm-start at $\alpha_{t-1}$
⋆ if $\Delta\Phi_t$ is sparse, the normalization can be computed efficiently
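A sketch of the warm-started power iteration under these assumptions (the iteration cap and tolerance are illustrative; the warm start should not be orthogonal to $u_1$):

```python
import numpy as np

def leading_pair(Theta, alpha_init, iters=100, tol=1e-10):
    """Estimate (sigma_1, u_1, v_1) of Theta by power iteration on
    Theta Theta^T, warm-started at the previous round's alpha."""
    u = alpha_init / np.linalg.norm(alpha_init)
    for _ in range(iters):
        u_next = Theta @ (Theta.T @ u)           # one (Theta Theta^T) step
        u_next /= np.linalg.norm(u_next)
        if np.linalg.norm(u_next - u) < tol:     # no sign flips to worry
            u = u_next                           # about: Theta Theta^T is PSD
            break
        u = u_next
    v = Theta.T @ u                              # Theta^T u_1 = sigma_1 v_1
    sigma = np.linalg.norm(v)
    return sigma, u, v / sigma

# Usage: sigma, u, v = leading_pair(Theta, prev_alpha); W = sigma * outer(u, v).
```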



Regret Analysis

The regret w.r.t. strategy U

$R_N(U) = \frac{1}{N}\sum_{t=1}^{N} L_t(W_t) - \frac{1}{N}\sum_{t=1}^{N} L_t(U).$

$W_t$ are the weights at each round; $L_t$ is the hinge loss.



Regret Analysis

Previous analysis (mirror descent):
$W_{t-1} \xrightarrow{\nabla F} \Theta_{t-1} \xrightarrow{-\eta_t \nabla L_t} \Theta_t \xrightarrow{\nabla F^*} W_t$
If $L_t$ is convex and $F$ is strongly convex, then $R_N(U) = O(1/\sqrt{N})$.

In the bilinear model, $F_1(W) = \frac{1}{2}\|W\|_F^2$ if $W \in \Omega_1$, $+\infty$ otherwise:
$F_1$ is not convex, and $F_1^{**}(W) = \frac{1}{2}\|W\|_2^2 \ne F_1$.

The analysis of mirror descent is not directly applicable.


Regret Analysis

Lower bound on the dual objective + weak duality: bound the increase of the dual objective,
$\Delta_t = D_{t+1}(\eta_1, \dots, \eta_t) - D_t(\eta_1, \dots, \eta_{t-1}) = C - \frac{1}{2}\|\Theta_{t-1} + C\Delta\Phi_t\|_2^2 + \frac{1}{2}\|\Theta_{t-1}\|_2^2.$

By the Taylor expansion,
$\frac{1}{2}\|\Theta + E\|_2^2 \le \frac{1}{2}\|\Theta\|_2^2 + \langle \nabla\tfrac{1}{2}\|\Theta\|_2^2, E \rangle + \operatorname{vec}(E)^\top H(\bar{\Theta})\operatorname{vec}(E),$
where $\bar{\Theta} = \Theta + \theta E$, $\theta \in (0, 1)$.

Then bound the Hessian term.


Regret Analysis

Our result (by bounding the Hessian): if $\sigma_1(\bar{\Theta}) \ne \sigma_2(\bar{\Theta}) > 0$,
$\frac{1}{2}\|\Theta + E\|_2^2 \le \frac{1}{2}\|\Theta\|_2^2 + \langle \nabla\tfrac{1}{2}\|\Theta\|_2^2, E \rangle + \frac{2l\,\|E\|_F^2}{1 - \sigma_2/\sigma_1},$
where $[\sigma_1, \dots, \sigma_l] = \sigma(\bar{\Theta})$, $\bar{\Theta} = \Theta + \theta E$, $\theta \in (0, 1)$.

Known result on Schatten norms (Ball et al., 1994; Kakade et al., 2012). The Schatten norm is $\|\Theta\|_{s(p)} = \|\sigma(\Theta)\|_p$, so $\|\Theta\|_{s(\infty)} = \|\Theta\|_2 = \sigma_1(\Theta)$. For $p \in [2, \infty]$ and $\frac{1}{p} + \frac{1}{q} = 1$,
$\frac{1}{2}\|\Theta + E\|_{s(p)}^2 \le \frac{1}{2}\|\Theta\|_{s(p)}^2 + \langle \nabla\tfrac{1}{2}\|\Theta\|_{s(p)}^2, E \rangle + \frac{\|E\|_{s(q)}^2}{2(q - 1)}.$
That bound is trivial if $p = \infty$.



Regret Analysis

Proposition (Regret). Assume that for all $\Theta = \Theta_{t-1}$, $E = C\Delta\Phi_t$, the Hessian bound above holds. Then
$R_N(U) \le \frac{1}{2CN}\|U\|_F^2 + \frac{2lC}{N}\sum_{t=1}^{N} \frac{\|\Delta\Phi_t\|_F^2}{1 - \sigma_2^t/\sigma_1^t}.$

The ratio $\sigma_2^t/\sigma_1^t$ controls both the speed of the power iteration and the regret bound.


Regret Analysis

Bounding $\sigma_2/\sigma_1$: a margin requirement plus "$\sigma_1$ is uniformly greater than $\sigma_2$".

Proposition. Assume that $\sup_{j,W} \|\Delta\Phi_j\|_2 \le M_1$ and $\sup_{j,W} \|\Delta\Phi_j\|_{s(2)} \le M_2$. If $M_1 > \frac{M_2}{2}$ and there exists a $W$ with margin $\gamma$ w.r.t. $\|\cdot\|_{s(1)}$, where $\gamma \in (\frac{M_2}{2}, M_1)$, then
$\frac{\sigma_2^t}{\sigma_1^t} \le \frac{M_2 - \gamma}{\gamma}.$

Corollary. The regret is bounded by
$R_N(U) \le \frac{1}{2CN}\|U\|_F^2 + \frac{2Cl^2 M_1^2\, \gamma}{2\gamma - M_2}.$


Experiments

Two sequential labelling tasks: Chinese word segmentation and text chunking.

Baselines: the linear model (structured perceptron), blockwise coordinate descent on the biconvex problem, and batch learners (CRF+L2, CRF+L1).


Experiments

[Figure: four panels (pku, msr, cityu, as) of relative performance vs. training set fraction 0.1–1.0; legend: bol, bcd, sp]

Figure: Chinese word segmentation.

Experiments

[Figure: one panel (Chunking) of relative performance vs. training set fraction 0.1–1.0; legend: bol, bcd, sp]

Figure: Text chunking.


Experiments

Compared with linear models: the advantage of bol is more obvious when the training set is small, and the model is more compact.

Compared with blockwise coordinate descent: bol avoids being attracted to solutions of the 0-order model.


Experiments

[Figure: loss vs. training rounds, two panels; left legend: bol, crf2, crf1; right legend: bol, sp, bcd]

Figure: Convergence.


Conclusion

An online learning algorithm for bilinear models.
A second-order approximation of the squared spectral norm.
Future work: rank-$k$ constraints; roughly, this requires computing the leading $k$ singular vectors.


Thanks

