Mining Big Data
Lijun Zhang ([email protected])
Nanjing University, China
December 23, 2015
http://cs.nju.edu.cn/zlj
Outline

1. Introduction: Big Data; Supervised Learning
2. Stochastic Optimization: Time Reduction; Space Reduction
3. Distributed Optimization: Distributed Gradient Descent; ADMM
4. Online Learning: Full-Information Setting; Bandit Setting
5. Summary
Big Data
https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg
The Four V’s of Big Data
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Supervised Learning (I)
Training
Input: a set of training data (x1, y1), . . . , (xn, yn)
Classification vs. Regression; Batch Setting vs. Online Setting
Output
A function g(·) such that yi ≈ g(xi), ∀i
Testing
Given a testing point x, predict its label by g(x)
Assumption
Training and testing data are independent and identically distributed (i.i.d.)
Supervised Learning (II)
Loss Function
Measure the discrepancy between y and g(x)
Binary loss: ℓ(u, v) = 1(uv < 0)
Hinge loss: ℓ(u, v) = max(0, 1 − uv)
Squared loss: ℓ(u, v) = (u − v)2
Empirical Risk

$$\frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, g(x_i)\big)$$

Risk

$$\mathbb{E}_{(x,y)}\big[\ell\big(y, g(x)\big)\big]$$
Our goal is to minimize the risk rather than the empirical risk.
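To make these definitions concrete, here is a minimal sketch (illustrative names and toy data, not from the slides) that evaluates the hinge loss and the empirical risk of a linear predictor g(x) = x^T w:

```python
import numpy as np

def hinge_loss(u, v):
    """Hinge loss l(u, v) = max(0, 1 - u*v)."""
    return max(0.0, 1.0 - u * v)

def empirical_risk(w, X, y, loss=hinge_loss):
    """Average loss of the linear predictor g(x) = x^T w over the training set."""
    return np.mean([loss(y_i, x_i @ w) for x_i, y_i in zip(X, y)])

# toy data: 5 examples in 3 dimensions with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.sign(rng.standard_normal(5))
print(empirical_risk(np.zeros(3), X, y))   # hinge loss of the zero predictor is exactly 1.0
```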
What is Stochastic Optimization? (I)
Definition 1: A Special Objective [Nemirovski et al., 2009]
$$\min_{w \in \mathcal{W}} f(w) = \mathbb{E}_{\xi}\big[\ell(w, \xi)\big] = \int_{\Xi} \ell(w, \xi)\, dP(\xi)$$

where ξ is a random variable
It is possible to generate an i.i.d. sample ξ1, ξ2, . . .
Risk Minimization
$$\min_{w \in \mathcal{W}} f(w) = \mathbb{E}_{(x,y)}\big[\ell\big(y, x^{\top} w\big)\big]$$
ℓ(·, ·) is a loss function, e.g., hinge loss ℓ(u, v) = max(0, 1− uv)
Training data (x1, y1), . . . , (xn, yn) are i.i.d.
What is Stochastic Optimization? (II)
Definition 2: A Special Access Model [Hazan and Kale, 2011]
$$\min_{w \in \mathcal{W}} f(w)$$

There exists a stochastic oracle that produces an unbiased gradient m(·):

$$\mathbb{E}[m(w)] = \nabla f(w)$$
Empirical Risk Minimization
$$\min_{w \in \mathcal{W}} f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, x_i^{\top} w\big)$$

Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^{n} uniformly at random,

$$\mathbb{E}\big[\nabla \ell\big(y_t, x_t^{\top} w\big)\big] = \nabla \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, x_i^{\top} w\big)$$
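A small numerical sanity check of this access model (an illustrative sketch with squared loss so the gradient is simple; all names are made up): averaging the per-example gradient over many uniform draws should approach the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w = rng.standard_normal(d)

def grad_example(i, w):
    """Gradient of the squared loss (y_i - x_i^T w)^2 with respect to w."""
    return -2.0 * (y[i] - X[i] @ w) * X[i]

full_grad = np.mean([grad_example(i, w) for i in range(n)], axis=0)
stochastic_avg = np.mean([grad_example(rng.integers(n), w) for _ in range(20000)], axis=0)
print(np.linalg.norm(full_grad - stochastic_avg))   # small: the oracle is unbiased
```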
Time Reduction (I)
Empirical Risk Minimization
$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, x_i^{\top} w\big)$$
n is the number of training data
d is the dimensionality
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, . . . , T do
2:   w'_{t+1} = w_t − η_t ( (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i^⊤ w_t) )
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
The Challenge
Time complexity per iteration: O(nd) + O(poly(d))
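As a reference point for the comparison with SGD below, a minimal sketch of this projected gradient descent loop (illustrative: squared loss and an ℓ2-ball as the constraint set W are assumptions, not taken from the slides):

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto W = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def gradient_descent(X, y, T=100, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        # full gradient of the empirical risk with squared loss: O(n d) work per iteration
        grad = -2.0 / n * X.T @ (y - X @ w)
        w = project_l2_ball(w - eta * grad)
    return w
```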
Time Reduction (II)
Empirical Risk Minimization
$$\min_{w \in \mathcal{W} \subseteq \mathbb{R}^d} f(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, x_i^{\top} w\big)$$
Stochastic Optimization—Stochastic Gradient Descent (SGD)
1: for t = 1, 2, . . . , T do
2:   Sample a training example (x_t, y_t) uniformly at random
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for
The Advantage—Time Reduction
Time complexity per iteration: O(d) + O(poly(d))
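Under the same illustrative squared-loss assumption as the GD sketch above, an SGD loop touches a single example per iteration (the projection onto W is passed in as a function; the default keeps W = R^d):

```python
import numpy as np

def sgd(X, y, T=10000, eta0=0.1, project=lambda w: w):
    """Stochastic gradient descent: O(d) work per iteration (plus the projection)."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                        # uniform sampling gives an unbiased gradient
        grad = -2.0 * (y[i] - X[i] @ w) * X[i]     # gradient of (y_i - x_i^T w)^2
        w = project(w - (eta0 / np.sqrt(t)) * grad)
    return w
```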
Space Reduction (I)
Nuclear Norm Regularized Optimization over Matrices
$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification
Both m and n can be very large
The optimal solution W_* is low-rank
Collaborative Filtering—Matrix Completion
$$\min_{W \in \mathbb{R}^{m \times n}} F(W) = \sum_{(i,j) \in \Omega} (W_{ij} - M_{ij})^2 + \lambda \|W\|_*$$
There are m users and n items
M ∈ Rm×n is the underlying user-item rating matrix
Ω is the set of observed indices
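As a concrete instance of this objective, a small sketch (illustrative data; the nuclear norm is computed from the singular values):

```python
import numpy as np

def mc_objective(W, M, omega, lam):
    """Squared error on the observed entries plus the nuclear-norm penalty lambda * ||W||_*."""
    sq_err = sum((W[i, j] - M[i, j]) ** 2 for (i, j) in omega)
    return sq_err + lam * np.linalg.norm(W, ord='nuc')   # ||W||_* = sum of singular values

rng = np.random.default_rng(0)
m, n = 50, 40
M = np.outer(rng.standard_normal(m), rng.standard_normal(n))                 # rank-1 "true" ratings
omega = [(i, j) for i in range(m) for j in range(n) if rng.random() < 0.2]   # observed indices
print(mc_objective(np.zeros((m, n)), M, omega, lam=0.1))
```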
Space Reduction (II)
Nuclear Norm Regularized Optimization over Matrices
$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} F(W) = f(W) + \lambda \|W\|_*$$

Matrix completion, multi-class classification
Both m and n can be very large
The optimal solution W_* is low-rank
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, . . . , T do
2:   W'_{t+1} = W_t − η_t ∇F(W_t)
3:   W_{t+1} = Π_W(W'_{t+1})
4: end for
The Challenge
Space complexity per iteration: O(mn)
Space Reduction (III)
Nuclear Norm Regularized Optimization over Matrices
$$\min_{W \in \mathcal{W} \subseteq \mathbb{R}^{m \times n}} F(W) = f(W) + \lambda \|W\|_*$$
Stochastic Proximal Gradient Descent (SPGD) [Zhang et al., 2015]

1: for t = 1, 2, . . . , T do
2:   Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:   W_{t+1} = argmin_{W ∈ R^{m×n}}  (1/2) ||W − (W_t − η_t G_t)||_F^2 + η_t λ ||W||_*
4: end for
5: return W_{T+1}
The Advantage—Space Reduction
Space complexity per iteration is low: O((m + n)r)
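The update in step 3 is the proximal operator of the nuclear norm, which soft-thresholds singular values. A minimal sketch (illustrative: it forms dense matrices and a full SVD for clarity, whereas the O((m + n)r) footprint on the slide comes from keeping G_t and the low-rank pieces in factored form):

```python
import numpy as np

def svt(A, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    keep = s > 0
    return (U[:, keep] * s[keep]) @ Vt[keep]   # the result's rank equals the number of kept values

def spgd_step(W, G, eta, lam):
    """One SPGD iteration: gradient step on f, then the nuclear-norm proximal step."""
    return svt(W - eta * G, eta * lam)
```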
Limitations

Iteration Complexity

The number of iterations T to ensure

$$f(w_T) - \min_{w \in \Omega} f(w) \le \epsilon$$

Iteration Complexity of GD and SGD (example with ε = 10^-6)

        Convex & Smooth          Strongly Convex & Smooth
GD      O(1/√ε)  ≈ 10^3          O(√κ log(1/ε))  ≈ 6√κ
SGD     O(1/ε²)  ≈ 10^12         O(1/(µε))       ≈ 10^6/µ

Total Time Complexity of GD and SGD (example with ε = 10^-6)

        Convex & Smooth          Strongly Convex & Smooth
GD      O(n/√ε)  ≈ 10^3 n        O(n√κ log(1/ε)) ≈ 6n√κ
SGD     O(1/ε²)  ≈ 10^12         O(1/(µε))       ≈ 10^6/µ
Introduction to Distributed Optimization (I)
Empirical Risk Minimization in Distributed Setting
$$\min_{w \in \mathcal{W}} f(w) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \ell\big(y_i^j, w^{\top} x_i^j\big)$$

(x_1^j, y_1^j), . . . , (x_{n_j}^j, y_{n_j}^j) are the training data on the j-th machine
Introduction to Distributed Optimization (II)
General Distributed Optimization
$$\min_{w \in \mathcal{W}} f(w) = \sum_{j=1}^{k} f_j(w)$$

For empirical risk minimization, we have

$$f_j(w) = \sum_{i=1}^{n_j} \ell\big(y_i^j, w^{\top} x_i^j\big)$$
Distributed Gradient Descent
Gradient Descent
1: for t = 1, 2, . . . , T do
2:   w'_{t+1} = w_t − η_t ( ∑_{j=1}^{k} ∇f_j(w_t) )
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
Distributed Gradient Descent
1: for t = 1, 2, . . . , T do
2:   Server: send w_t to each worker
3:   Each worker j: calculate ∇f_j(w_t) and send it to the server
4:   Server: w'_{t+1} = w_t − η_t ( ∑_{j=1}^{k} ∇f_j(w_t) )
5:   Server: w_{t+1} = Π_W(w'_{t+1})
6: end for
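A single-process simulation of this server–worker loop (illustrative: quadratic local objectives, and plain Python lists standing in for the actual communication layer):

```python
import numpy as np

def worker_gradient(shard, w):
    """Worker j: gradient of its local objective f_j(w) = sum_i (y_i^j - w^T x_i^j)^2."""
    X_j, y_j = shard
    return -2.0 * X_j.T @ (y_j - X_j @ w)

def distributed_gd(shards, d, T=100, eta=0.01, project=lambda w: w):
    w = np.zeros(d)
    for t in range(T):
        grads = [worker_gradient(s, w) for s in shards]   # steps 2-3: broadcast w_t, gather local gradients
        w = project(w - eta * np.sum(grads, axis=0))      # steps 4-5: aggregate, update, project
    return w
```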
Reformulation of the Distributed Optimization (I)
General Distributed Optimization
$$\min_{w \in \mathcal{W}} f(w) = \sum_{j=1}^{k} f_j(w)$$

Global Consensus Problem

$$\min_{y \in \mathcal{W},\; w_j = y,\, j=1,\ldots,k} \;\; \sum_{j=1}^{k} f_j(w_j)$$

The Lagrangian

$$\max_{\lambda_1, \ldots, \lambda_k} \; \min_{y \in \mathcal{W},\, w_1, \ldots, w_k} \;\; \sum_{j=1}^{k} \Big( f_j(w_j) - \langle \lambda_j, w_j - y \rangle \Big)$$
Reformulation of the Distributed Optimization (II)
The Augmented Lagrangian [Roux et al., 2012]
$$\max_{\lambda_1, \ldots, \lambda_k} \; \min_{y \in \mathcal{W},\, w_1, \ldots, w_k} \;\; \sum_{j=1}^{k} \Big( f_j(w_j) - \langle \lambda_j, w_j - y \rangle + \frac{1}{2\gamma} \|w_j - y\|_2^2 \Big)$$
Alternating Direction Method of Multipliers (ADMM)
1: for t = 1, 2, . . . , T do
2:   Server: send y_t and λ_j^t to worker j = 1, . . . , k
3:   Each worker j: calculate w_j^{t+1} and send it to the server,
       w_j^{t+1} = argmin_w  f_j(w) + (1/(2γ)) ||y_t + γ λ_j^t − w||_2^2,   j = 1, . . . , k
4:   Server: y_{t+1} = Π_W( (1/k) ∑_{j=1}^{k} (w_j^{t+1} − γ λ_j^t) )
5:   Server: λ_j^{t+1} = λ_j^t + (1/γ)(y_{t+1} − w_j^{t+1}),   j = 1, . . . , k
6: end for
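A single-process simulation of this consensus ADMM loop (illustrative assumptions: quadratic local objectives f_j(w) = ||X_j w − y_j||², so the step-3 subproblem has a closed form, and W = R^d, so the step-4 projection is the identity):

```python
import numpy as np

def admm_consensus(shards, d, T=100, gamma=1.0):
    k = len(shards)
    y = np.zeros(d)                        # global (consensus) variable y_t
    lam = [np.zeros(d) for _ in range(k)]  # dual variables, one per worker
    for t in range(T):
        w = []
        for j, (Xj, yj) in enumerate(shards):
            # step 3: minimize ||X_j w - y_j||^2 + (1/(2*gamma)) ||y + gamma*lam_j - w||^2 in closed form
            A = 2.0 * Xj.T @ Xj + np.eye(d) / gamma
            b = 2.0 * Xj.T @ yj + (y + gamma * lam[j]) / gamma
            w.append(np.linalg.solve(A, b))
        y = np.mean([w[j] - gamma * lam[j] for j in range(k)], axis=0)   # step 4 (no projection needed)
        lam = [lam[j] + (y - w[j]) / gamma for j in range(k)]            # step 5
    return y
```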
Challenges of Distributed Optimization
Communication
Synchronization
Online Learning (I)
Online Learning [Shalev-Shwartz, 2011]
1: for t = 1, 2, . . . , T do
2:   Receive a question x_t ∈ X
3:   Predict p_t ∈ D
4:   Receive the true answer y_t ∈ Y
5:   Suffer loss ℓ(y_t, p_t)
6: end for
Online Learning (II)
An Example—Online Classification
1: for t = 1, 2, . . . , T do
2:   Receive a question x_t ∈ X ⊆ R^d
3:   Predict p_t ∈ D = {0, 1}
4:   Receive the true answer y_t ∈ Y = {0, 1}
5:   Suffer loss ℓ(y_t, p_t) = |y_t − p_t|
6: end for
Online Learning (III)
Cumulative Loss

$$\text{Cumulative Loss} = \sum_{t=1}^{T} \ell(y_t, p_t)$$
Online Learning (IV)
Regret with respect to a hypothesis class H

$$\text{Regret} = \sum_{t=1}^{T} \ell(y_t, p_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell\big(y_t, h(x_t)\big)$$

where h : X → D
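A small sketch of these quantities for the online classification example above (illustrative: the finite hypothesis class of threshold rules on the first coordinate is an assumption made purely to keep the example short):

```python
import numpy as np

def regret(xs, ys, preds, hypotheses):
    """Cumulative loss of the learner minus the loss of the best fixed hypothesis in hindsight."""
    loss = lambda y, p: abs(y - p)                       # l(y_t, p_t) = |y_t - p_t|
    learner = sum(loss(y, p) for y, p in zip(ys, preds))
    best = min(sum(loss(y, h(x)) for x, y in zip(xs, ys)) for h in hypotheses)
    return learner - best

xs = [np.array([0.2]), np.array([0.9]), np.array([0.4])]
ys = [0, 1, 0]
preds = [1, 1, 0]                                                      # what the learner predicted
hypotheses = [lambda x, c=c: int(x[0] > c) for c in (0.1, 0.5, 0.8)]   # threshold rules
print(regret(xs, ys, preds, hypotheses))                               # 1 - 0 = 1 on this toy run
```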
Online Convex Optimization (I)
Online Convex Optimization [Shalev-Shwartz, 2011]
1: for t = 1, 2, . . . , T do
2:   Predict a vector w_t ∈ W ⊆ R^d
3:   Receive a convex loss function f_t : W → R
4:   Suffer loss f_t(w_t)
5: end for
Online Convex Optimization (II)
An Example—Online SVM
1: for t = 1, 2, . . . , T do
2:   Predict a vector w_t ∈ W ⊆ R^d
3:   Receive a training instance (x_t, y_t)
4:   Suffer loss f_t(w_t) = ℓ(y_t, x_t^⊤ w_t) = max(0, 1 − y_t x_t^⊤ w_t)
5: end for
Online Convex Optimization (III)
Regret

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w)$$
Online Gradient Descent
Full-Information Setting
The learner observes not only ft(wt) but also ft(·).
Online Gradient Descent [Zinkevich, 2003, Hazan et al., 2007]
1: for t = 1, 2, . . . , T do
2:   Predict a vector w_t ∈ W ⊆ R^d
3:   Receive a convex loss function f_t : W → R
4:   Suffer loss f_t(w_t)
5:   w_{t+1} = w_t − η_t ∇f_t(w_t)
6: end for
Regret Bound [Zinkevich, 2003, Hazan et al., 2007]

$$\text{Regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w) = O\big(\sqrt{T}\big) \;\text{ or }\; O(\log T)$$
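Combining this with the online SVM example above, a minimal OGD sketch (illustrative choices: a hinge-loss subgradient and the step size η_t = η_0/√t; no projection step, i.e., W = R^d):

```python
import numpy as np

def ogd_online_svm(stream, d, eta0=1.0):
    """Online gradient descent on the hinge losses f_t(w) = max(0, 1 - y_t * x_t^T w)."""
    w = np.zeros(d)
    cumulative_loss = 0.0
    for t, (x_t, y_t) in enumerate(stream, start=1):
        loss = max(0.0, 1.0 - y_t * (x_t @ w))           # suffer f_t(w_t)
        cumulative_loss += loss
        grad = -y_t * x_t if loss > 0 else np.zeros(d)   # a subgradient of f_t at w_t
        w = w - (eta0 / np.sqrt(t)) * grad
    return w, cumulative_loss
```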
Bandit Setting
Bandit Setting
The learner only observes f_t(w_t).
Generation Process of the Loss Functions
Stochastic: f_1, . . . , f_t are i.i.d. sampled
Adversarial:
    Oblivious: f_t is independent of w_1, . . . , w_{t−1}
    Nonoblivious: f_t may depend on w_1, . . . , w_{t−1}
Types of the Loss Functions
Convex
Linear
Multi-armed Bandit
Multi-armed Bandit
http://blog.mynaweb.com/wp-content/uploads/2013/02/Bandits.png
Stochastic Multi-armed Bandit
Setting
There are K arms
Successive pulls of arm i yield rewards X_1^i, X_2^i, . . .
Those rewards are i.i.d. according to an unknown distribution D_i with unknown mean µ_i
The Learning Process
1: for t = 1, 2, . . . , T do
2:   Pull an arm i ∈ [K]
3:   Receive a reward X^i sampled from distribution D_i
4: end for

How to maximize the rewards? Always pull arm i = argmax_{i∈[K]} µ_i
Exploitation-Exploration Dilemma
A Naive Algorithm
Exploration
1. Pull each arm m = 100 times
2. Calculate the estimated mean µ̂_i of each arm i

Exploitation
1. Always pull arm i = argmax_{i∈[K]} µ̂_i
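A sketch of this naive explore-then-exploit strategy (illustrative: Bernoulli rewards with the given means):

```python
import numpy as np

def explore_then_exploit(true_means, T, m=100, seed=0):
    """Pull every arm m times, then always pull the arm with the highest estimated mean."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    pull = lambda i: float(rng.random() < true_means[i])          # Bernoulli reward with mean mu_i
    rewards = [[pull(i) for _ in range(m)] for i in range(K)]     # exploration phase
    best = int(np.argmax([np.mean(r) for r in rewards]))          # arm with the highest estimated mean
    total = sum(map(sum, rewards)) + sum(pull(best) for _ in range(T - m * K))   # exploitation phase
    return best, total

print(explore_then_exploit([0.3, 0.5, 0.7], T=10000))
```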
Upper Confidence Bounds
Exploitation-Exploration Tradeoff by Upper Confidence Bounds (UCB) [Auer et al., 2002]

1: for t = 1, 2, . . . , T do
2:   Pull arm i = argmax_{i∈[K]} u_i   (with high probability, u_i ≥ µ_i)
3:   Receive a reward X^i sampled from distribution D_i
4:   Update the upper confidence bound u_i for arm i
5: end for
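A minimal sketch of the UCB1 rule from Auer et al. [2002] under the same illustrative Bernoulli-reward assumption; the upper bound u_i is the empirical mean plus the confidence radius sqrt(2 ln t / n_i):

```python
import numpy as np

def ucb1(true_means, T, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_means)
    pull = lambda i: float(rng.random() < true_means[i])   # Bernoulli reward with mean mu_i
    counts = np.zeros(K)                                   # n_i: number of pulls of arm i
    sums = np.zeros(K)                                     # cumulative reward of arm i
    for i in range(K):                                     # pull each arm once to initialize u_i
        counts[i], sums[i] = 1, pull(i)
    for t in range(K + 1, T + 1):
        u = sums / counts + np.sqrt(2.0 * np.log(t) / counts)   # upper confidence bounds
        i = int(np.argmax(u))
        counts[i] += 1
        sums[i] += pull(i)
    return sums.sum()

print(ucb1([0.3, 0.5, 0.7], T=10000))
```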
Summary
Stochastic Optimization
Stochastic Gradient Descent (SGD)—Time Reduction
Stochastic Proximal Gradient Descent (SPGD)—Space Reduction
Distributed Optimization
Distributed Gradient Descent
Alternating Direction Method of Multipliers (ADMM)
Online Learning
Full-Information Setting—Online Gradient Descent
Bandit Setting—Exploitation-Exploration Dilemma
Reference I
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Roux, N. L., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672–2680.
Reference II
Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936.