Large-Scale Learning with Less RAM via Randomization [Golovin+ ICML13], ICML Reading Group, 2013/07/09, Hidekazu Oiwa (@kisa12012)


Page 1: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Large-Scale Learning with Less RAM via Randomization [Golovin+ ICML13]

ICML Reading Group, 2013/07/09, Hidekazu Oiwa (@kisa12012)

Page 2: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Self-introduction

• Hidekazu Oiwa (大岩 秀和)
• Second-year PhD student (D2) at the University of Tokyo, Nakagawa Lab
• Fields: machine learning and natural language processing
• Specialties: large-scale data optimization and sparsification
• Twitter: @kisa12012

2

Page 3: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Paper covered
• Large-Scale Learning with Less RAM via Randomization (ICML13)

• D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google)

• http://arxiv.org/abs/1303.4664 (paper)
• First appeared at a NIPS 2012 workshop
• Figures and tables below are taken from the paper

3

Page 4: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

One-slide summary
• Memory-efficient storage of the weight vector
• The number of bits is adjusted automatically so the model fits on a GPU or in L1 cache
• SGD-based algorithms are proposed

• Memory usage is reduced with almost no loss of accuracy: roughly 50% savings at training time and 95% at prediction time
• Regret-based theoretical guarantees are also given

4

Example: float (32 bits): β = (1.0, ..., -1.52)  ->  Q2.2 (5 bits): β = (1.0, ..., -1.50)
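As a worked version of this example (my own arithmetic, assuming the Q2.2 grid with resolution ε = 2^{-2} = 0.25 and the unbiased randomized rounding used in the paper; the neighbouring grid points of -1.52 are a = -1.75 and b = -1.50):

\[
\Pr[\hat\beta_i = -1.50] = \frac{-1.52 - (-1.75)}{0.25} = 0.92,
\qquad
\Pr[\hat\beta_i = -1.75] = 0.08,
\]
\[
\mathbb{E}[\hat\beta_i] = 0.92\cdot(-1.50) + 0.08\cdot(-1.75) = -1.52 .
\]

So the value is usually stored as -1.50, as on the slide, and the rounding is unbiased in expectation.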

Page 5: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Introduction

Page 6: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Background: big data!!
• Memory capacity becomes a critical constraint
• The whole data set does not fit in memory
• Can the data be processed on a GPU or in L1 cache?
• This matters not only at training time but also at prediction time
• It affects the latency of search advertising and e-mail filtering
• We therefore want to reduce the memory taken by the weight vector

6

Page 7: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

The float type is overkill in practice

7

(Paper excerpt shown on this slide)

Figure 1. Histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous.

Contributions. The paper gives the following theoretical and empirical results:
1. Using a pre-determined fixed-point representation of coefficient values reduces cost from 32 to 16 bits per value, at the cost of a small linear regret term.
2. The cost of a per-coordinate learning-rate schedule can be reduced from 32 to 8 bits per coordinate using a randomized counting scheme.
3. Using an adaptive per-coordinate coarse representation of coefficient values reduces memory cost further and yields a no-regret algorithm.
4. Variable-width encoding at prediction time allows coefficients to be encoded even more compactly (less than 2 bits per value in experiments) with negligible added loss.
Approaches 1 and 2 are particularly attractive, as they require only small code changes and negligible additional CPU time; approaches 3 and 4 require more sophisticated data structures.

Related work: a classic way to reduce model memory is to encourage sparsity, e.g. via the Lasso (Tibshirani, 1996) and L1 regularizers (Duchi et al., 2008; Langford et al., 2009; Xiao, 2009; McMahan, 2011); a more recent trend is feature hashing (Weinberger et al., 2009). Both families are effective, and the coarse encoding schemes in this paper can be combined with them. Randomized rounding is widely used in numerical computing and algorithm design (Raghavan & Tompson, 1987), and randomized counting has enabled compact language models (Van Durme & Lall, 2009); this paper gives the first algorithms and analysis for online learning with randomized rounding and counting. Per-coordinate learning rates (Duchi et al., 2010; McMahan & Streeter, 2010) greatly boost prediction accuracy but normally require an extra 32-bit statistic per coordinate, which the paper replaces by an 8-bit randomized counter based on Morris's algorithm (Morris, 1978).

Setup (Section 3, Learning with Randomized Rounding and Probabilistic Counting): logistic regression with binary feature vectors x ∈ {0,1}^d and labels y ∈ {0,1}. The model has coefficients β ∈ R^d and gives predictions p_β(x) ≡ σ(β·x), where σ(z) ≡ 1/(1 + e^{-z}) is the logistic function. The logistic loss on a labeled example (x, y) is

    L(x, y; β) ≡ -y log(p_β(x)) - (1 - y) log(1 - p_β(x)),

with 0 log 0 = 0 and natural logarithms. ‖x‖_p is the ℓ_p norm (ℓ_2 when p is omitted), and g_{1:t} ≡ Σ_{s=1}^t g_s (similarly for functions). The basic algorithm is a variant of online gradient descent (OGD) that stores β in a limited-precision format, the discrete set (εZ)^d: each OGD update is computed in 64-bit floating point and then projected back to the coarse representation by randomized rounding. A useful representation for (εZ)^d is the Qn.m fixed-point format, which uses n bits for the integer part and m bits for the fractional part; adding a sign bit gives K = n + m + 1 bits per value. The value m may be fixed in advance or set adaptively. The added CPU cost of fixed-point encoding and randomized rounding is low; typically K is chosen to match a machine integer (say K = 8 or 16).

Presenter's annotations: do we really need to keep the values at 32-bit precision? The figure is a histogram of the weight-vector values of a linear classifier; the vertical axis is the number of distinct features.

Page 8: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Reducing the number of bits: how?
• Is a fixed bit length enough on its own? If the optimal weight vector cannot be represented in the fixed bit length, the iterates never converge; they keep oscillating around the optimum β*.
• Idea: during training, change the representation according to the step size.

8

(Slide illustration: β oscillating around the optimum β*.)

Page 9: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Algorithm

Page 10: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Definition of the bit-length notation
• Qn.m : notation for a fixed-point, fixed-bit-length representation
• n : number of bits for the integer part
• m : number of bits for the fractional part
• Qn.m uses (n + m + 1) bits, one of which is the sign bit
• ε : the gap between representable points (ε = 2^{-m})
• Example value shown on the slide: 1.5 × 2^{-1}

10
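As an aside, a minimal Python sketch of what a Qn.m fixed-point encode/decode might look like (my own illustration, not the paper's implementation; the deterministic rounding and clamping here are assumptions, and training instead uses the randomized rounding described on the next slides):

def qnm_encode(value, n, m):
    # Store a float as a signed (n + m + 1)-bit fixed-point integer with eps = 2**-m.
    eps = 2.0 ** -m
    max_int = 2 ** (n + m) - 1                 # largest stored integer magnitude
    q = round(value / eps)                     # deterministic rounding, for illustration only
    return max(-max_int, min(max_int, q))      # clamp into the representable range

def qnm_decode(q, m):
    # Back to float: a single integer-float multiplication by eps = 2**-m.
    return q * (2.0 ** -m)

# Q2.2 uses 2 + 2 + 1 = 5 bits and a grid spacing of eps = 0.25.
q = qnm_encode(-1.52, n=2, m=2)
print(q, qnm_decode(q, m=2))                   # -6  -1.5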

Page 11: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Algorithm

11

(Paper excerpt shown on this slide)

Algorithm 1 OGD-Rand-1d
  input: feasible set F = [-R, R], learning-rate schedule η_t, resolution schedule ε_t
  define fun Project(β) = max(-R, min(β, R))
  Initialize β_1 = 0
  for t = 1, ..., T do
    Play the point β_t, observe g_t
    β'_{t+1} = Project(β_t - η_t g_t)
    β_{t+1} <- RandomRound(β'_{t+1}, ε_t)

  function RandomRound(β, ε)
    a <- ε⌊β/ε⌋;  b <- ε⌈β/ε⌉
    return b with probability (β - a)/ε, a otherwise

Converting back to a floating-point representation requires a single integer-float multiplication (by ε = 2^{-m}). Randomized rounding requires a call to a pseudo-random number generator, which may be done in 18-20 flops. Overall the added CPU overhead is negligible, especially as many large-scale learning methods are I/O bound reading from disk or network rather than CPU bound.

Section 3.1, Regret Bounds for Randomized Rounding: the paper proves upper bounds on regret for a variant of OGD that uses randomized rounding on an adaptive grid together with per-coordinate learning rates (the bounds also apply to a fixed grid), using the standard definition

    Regret ≡ Σ_{t=1}^T f_t(β_t) - min_{β* ∈ F} Σ_{t=1}^T f_t(β*)

for a sequence of convex loss functions f_t. The β_t played by the algorithm are random variables, and since the adversary may adapt to the previously observed β_t, the f_t and the post-hoc optimal β* are random as well; the bounds are on expected regret over the algorithm's own randomization (high-probability bounds are also possible), against the best model in the non-discretized comparison class F = [-R, R]^d. Following the usual reduction from convex to linear functions (Zinkevich, 2003; see also Shalev-Shwartz, 2012, Sec. 2.4), and because the feasible set is the hyper-rectangle [-R, R]^d, the linear problem decomposes into d independent one-dimensional problems. OGD with randomized rounding is run with a grid of resolution ε_t and learning rate η_t on round t, one copy per coordinate, so the η_t and ε_t schedules can be chosen per coordinate; for simplicity the ε_t are chosen so that -R and +R are always grid points. Algorithm 1 is the one-dimensional version, run independently on each coordinate (with a different learning-rate and discretization schedule) inside Algorithm 2. The core result is a regret bound for Algorithm 1 (omitted proofs are in the appendix):

Theorem 3.1. Run Algorithm 1 with an adaptive non-increasing learning-rate schedule η_t and a discretization schedule ε_t with ε_t ≤ γη_t for a constant γ > 0. Then, against any sequence of gradients g_1, ..., g_T (possibly selected by an adaptive adversary) with |g_t| ≤ G, and any comparator point β* ∈ [-R, R],

    E[Regret(β*)] ≤ (2R)^2 / (2η_T) + (1/2)(G^2 + γ^2) η_{1:T} + γR√T.

Choosing γ sufficiently small gives an expected regret bound indistinguishable from the non-rounded version (obtained by taking γ = 0); in practice simply choosing γ = 1 yields excellent results. With some care in the choice of norms, the result extends to d dimensions; applying the algorithm per coordinate yields:

Corollary 3.2. Run Algorithm 2 on the feasible set F = [-R, R]^d, which runs Algorithm 1 on each coordinate, with per-coordinate learning rates η_{t,i} = α/√τ_{t,i} and α = √2·R/√(G^2 + γ^2), where τ_{t,i} ≤ t is the number of non-zero g_{s,i} seen on coordinate i on rounds s = 1, ..., t. Then, against convex losses f_t with subgradients g_t at β_t satisfying ‖g_t‖_∞ ≤ G for all t,

    E[Regret] ≤ Σ_{i=1}^d ( 2R √(2 τ_{T,i} (G^2 + γ^2)) + γR √τ_{T,i} ).

The proof sums the bound from Theorem 3.1 over coordinates, counting only the rounds where g_{t,i} ≠ 0, and uses Σ_{t=1}^T 1/√t ≤ 2√T to handle the sum of learning rates on each coordinate. The core intuition: for features with little data (τ_i small, e.g. rare words in a bag-of-words representation identified by a binary feature), a fine-precision coefficient is unnecessary, since the correct coefficient cannot be estimated with much confidence anyway; this is the same reason a larger learning rate is appropriate, so it is no coincidence that the theory suggests choosing ε_t and η_t of the same magnitude. (Footnote: extension to arbitrary feasible sets is possible, but the hyper-rectangle simplifies the analysis; in practice, projection onto the feasible set rarely helps performance.)

Presenter's annotations: the update line is ordinary SGD; RandomRound then drops the stored value down to the Qn.m bit width.
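To make the pseudocode above concrete, here is a minimal Python sketch of Algorithm 1 (OGD-Rand-1d). It is my own illustration, not the authors' code; the toy squared-loss example and the parameter values are assumptions for demonstration only.

import math
import random

def random_round(beta, eps):
    # Unbiased randomized rounding of beta onto the grid eps * Z: E[result] = beta.
    a = eps * math.floor(beta / eps)      # grid point just below beta
    b = eps * math.ceil(beta / eps)       # grid point just above beta
    if a == b:                            # beta is already a grid point
        return beta
    return b if random.random() < (beta - a) / eps else a

def ogd_rand_1d(grad_fn, T, R=1.0, alpha=0.5, gamma=1.0):
    # Algorithm 1 sketch with eta_t = alpha / sqrt(t) and eps_t = gamma * eta_t.
    # grad_fn(t, beta) returns the gradient g_t observed after playing beta.
    beta = 0.0
    for t in range(1, T + 1):
        g = grad_fn(t, beta)
        eta = alpha / math.sqrt(t)
        beta = max(-R, min(R, beta - eta * g))     # Project onto F = [-R, R]
        beta = random_round(beta, gamma * eta)     # store only a coarse value
    return beta

# Toy usage: minimize the 1-d squared loss 0.5 * (beta - 0.3)**2.
random.seed(0)
print(ogd_rand_1d(lambda t, b: b - 0.3, T=2000))   # prints a value near 0.3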

Page 12: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

RandomRound illustrated

If β splits the gap between the two neighbouring grid points in the ratio a : b (a = distance to the lower point, b = distance to the upper point), then β is rounded to the far point with probability proportional to the near distance: up with probability a/(a+b) %, down with probability b/(a+b) %.

Example: a ratio of 1 : 4 means rounding down with probability 80% and up with probability 20%.

12
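A quick way to check this ratio-based behaviour and the unbiasedness of the rounding (my own toy check; the grid spacing 0.25 and the point 0.05 are chosen to reproduce the 1 : 4 example above):

import math, random
random.seed(1)

def random_round(beta, eps):
    # Same unbiased rounding as in the Algorithm 1 sketch on the previous slide.
    a = eps * math.floor(beta / eps)
    b = eps * math.ceil(beta / eps)
    return beta if a == b else (b if random.random() < (beta - a) / eps else a)

eps, beta = 0.25, 0.05                               # beta splits the gap [0.0, 0.25] in ratio 1 : 4
draws = [random_round(beta, eps) for _ in range(100000)]
print(sum(d == 0.25 for d in draws) / len(draws))    # about 0.20: rounds up 20% of the time
print(sum(draws) / len(draws))                       # about 0.05: the rounding is unbiased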

Page 13: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

How to choose the gap ε_t
• From the SGD step size η_t and an arbitrary constant γ > 0, set the bit length so that ε_t ≤ γη_t.
• Under this condition, the regret upper bound is O(√T).

13


Thm. 3.1. With a non-increasing learning-rate schedule η_t, a discretization schedule satisfying ε_t ≤ γη_t for a constant γ > 0, gradients bounded by |g_t| ≤ G, and any comparator point β* ∈ [-R, R],

    E[Regret(β*)] ≤ (2R)^2 / (2η_T) + (1/2)(G^2 + γ^2) η_{1:T} + γR√T,

where η_{1:T} ≡ Σ_{s=1}^T η_s; with the usual η_t ∝ 1/√t schedule this is O(√T).

Presenter's note: as γ -> 0, the bound recovers the regret upper bound of the (un-rounded) float version.
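For intuition on the O(√T) rate (my own back-of-the-envelope step, not from the paper), plug the standard schedule η_t = α/√t with ε_t = γη_t into Thm. 3.1:

\[
\eta_{1:T} = \sum_{t=1}^{T} \frac{\alpha}{\sqrt{t}} \le 2\alpha\sqrt{T}
\quad\Longrightarrow\quad
\mathbb{E}[\mathrm{Regret}(\beta^*)] \le \frac{2R^2}{\alpha}\sqrt{T} + \alpha\,(G^2+\gamma^2)\sqrt{T} + \gamma R\sqrt{T} = O(\sqrt{T}),
\]

and letting γ -> 0 removes the two γ terms, leaving the usual bound for un-rounded OGD.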

Page 14: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Per-coordinate learning rates (a.k.a. AdaGrad) [Duchi+ COLT10]

• Vary the step size per feature: for features that appear frequently, shrink the step size quickly; for features that appear only rarely, keep the step size large.

• A technique for reaching the optimum very fast. It requires keeping an occurrence count for every feature, normally a 32-bit int; approximating the count with Morris' algorithm [Morris+ 78] brings this down to 8 bits. (A small sketch of the per-coordinate step size follows below.)

14
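A minimal sketch of the count-based per-coordinate step size described above (my own illustration; the sparse-dict gradient format, the α value, and the +1 smoothing are assumptions):

import math
from collections import defaultdict

counts = defaultdict(int)        # tau_i: how often feature i has had a non-zero gradient
alpha = 0.5

def per_coordinate_eta(i):
    # Frequent features get a rapidly shrinking step size; rare features keep a large one.
    return alpha / math.sqrt(counts[i] + 1)      # +1 avoids division by zero

def observe_gradient(grad):      # grad: dict mapping feature index -> gradient value
    for i, g in grad.items():
        if g != 0.0:
            counts[i] += 1

observe_gradient({0: 0.3, 7: -1.2})
observe_gradient({0: 0.1})
print(per_coordinate_eta(0), per_coordinate_eta(7), per_coordinate_eta(42))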

Page 15: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Morris’ Algorithm

• Make the frequency counter a random variable.
• Each time the feature occurs, do the following: increment C by 1 with probability p(C) = b^{-C}.
• Return the value τ(C) = (b^C - b)/(b - 1) as the frequency.
• τ(C) is an unbiased estimator of the true frequency count.

15

(Paper excerpt shown on this slide; Algorithm 2 itself is reproduced on the next slide.)

Fixed Discretization: rather than an adaptive discretization schedule, it is more straightforward and more efficient to choose a fixed grid resolution; a 16-bit Qn.m representation is sufficient for many applications. (If x is rescaled to 2x then β must become β/2 to make the same predictions, so appropriate choices of n and m are data-dependent.) The same theory applies if the learning rate simply stops decreasing once it reaches ε (= 2^{-m}); the η_{1:T} term in the regret bound then yields a linear term like O(εT), which is unavoidable with a fixed resolution ε. Letting the learning rate keep decreasing like 1/√t would provide no benefit; in fact, lower-bounding the learning rate is known to allow online gradient descent to give regret bounds against a moving comparator (Zinkevich, 2003).

Data Structures: several approaches can store models with variable-sized coefficients. One can store all keys at a fixed (low) precision and maintain a sequence of maps (e.g. hash tables) from keys to coefficients of increasing precision. Alternatively, a simple linear-probing hash table for variable-length keys is efficient for a wide variety of key-length distributions (Thorup, 2009), treating keys and coefficient values as strings over 4-bit or 8-bit bytes; Blandford & Blelloch (2008) provide a compact dictionary for variable-length keys. Finally, for a fixed model, one can write out the string s of all coefficients (without end-of-string delimiters), store a second binary string of length |s| with ones at the coefficient boundaries, and index into it with a rank/select data structure, e.g. that of Patrascu (2008).

Section 3.2, Approximate Feature Counts: online convex optimization typically uses a learning rate that decreases over time, e.g. η_t proportional to 1/√t. Per-coordinate learning rates require storing a count τ_i for each coordinate: the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counter rather than a 32-bit (or 64-bit) integer for the d counts. The paper uses a variant of Morris's probabilistic counting algorithm (1978), analyzed by Flajolet (1985): initialize a counter C = 1 and, on each increment operation, increment C with probability p(C) = b^{-C}, where the base b is a parameter. The count is estimated as τ(C) = (b^C - b)/(b - 1), an unbiased estimator of the true count. The learning rates are then η_{t,i} = α/√(τ_{t,i} + 1), which avoids division by zero when τ_{t,i} = 0. High-probability bounds on this counter (Lemma A.1), combined with Theorem 3.1, give:

Theorem 3.3. Run the algorithm of Corollary 3.2 under the assumptions specified there, but with approximate counts (computed by the randomized counter with any base b > 1) in place of the exact counts τ_i, and per-coordinate learning rates η_{t,i} = α/√(τ_{t,i} + 1). With an appropriate choice of α,

    E[Regret] = o( R √(G^2 + γ^2) · T^{0.5+δ} )  for all δ > 0,

where the o-notation hides a small constant factor and the dependence on the base b. (A non-asymptotic but more cumbersome bound is given in the appendix.)

Section 4, Encoding During Prediction Time: many real-world problems require large-scale prediction, which may require replicating a trained model to multiple machines (Bucilua et al., 2006). Saving RAM via rounding is especially attractive here because, unlike during training, accumulated round-off error is no longer an issue, so even more aggressive rounding can be used safely.

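A minimal Python sketch of the randomized counter described in the excerpt (my own illustration; the base 1.1 follows the experiments reported later, while the cap at 255 simply models an 8-bit register):

import random

class MorrisCounter:
    # Probabilistic counter: C starts at 1, is incremented with probability b**-C,
    # and (b**C - b) / (b - 1) is an unbiased estimate of the number of increment calls.
    def __init__(self, base=1.1):
        self.b = base
        self.c = 1                      # stays small, so it fits easily in 8 bits

    def increment(self):
        if self.c < 255 and random.random() < self.b ** (-self.c):
            self.c += 1

    def estimate(self):
        return (self.b ** self.c - self.b) / (self.b - 1.0)

random.seed(0)
counter = MorrisCounter(base=1.1)
for _ in range(10000):
    counter.increment()
print(counter.c, round(counter.estimate()))    # stored C stays small while the estimate is near 10000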

Page 16: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Per-coordinate version of the algorithm

16

(Paper excerpt shown on this slide)

Algorithm 2 OGD-Rand
  input: feasible set F = [-R, R]^d, parameters α, γ > 0
  Initialize β_1 = 0 ∈ R^d; ∀i, τ_i = 0
  for t = 1, ..., T do
    Play the point β_t, observe loss function f_t
    for i = 1, ..., d do
      let g_{t,i} = ∇f_t(β_t)_i
      if g_{t,i} = 0 then continue
      τ_i <- τ_i + 1
      let η_{t,i} = α/√τ_i and ε_{t,i} = γη_{t,i}
      β'_{t+1,i} = Project(β_{t,i} - η_{t,i} g_{t,i})
      β_{t+1,i} <- RandomRound(β'_{t+1,i}, ε_{t,i})

Presenter's annotation: counting feature frequencies; Morris's algorithm makes an 8-bit counter possible, and the frequency information is used to set the step size.
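Putting the pieces together, a minimal Python sketch in the spirit of Algorithm 2 above (my own illustration: sparse gradients are passed as dicts, exact counts are used instead of Morris counters for brevity, and the parameter values are arbitrary):

import math
import random

def random_round(beta, eps):
    # Unbiased rounding onto the grid eps * Z (same as in the earlier 1-d sketch).
    a = eps * math.floor(beta / eps)
    b = eps * math.ceil(beta / eps)
    return beta if a == b else (b if random.random() < (beta - a) / eps else a)

def ogd_rand(grad_stream, d, R=1.0, alpha=0.5, gamma=1.0):
    # grad_stream yields sparse gradients {i: g_t_i}; beta is kept on per-coordinate grids.
    beta = [0.0] * d
    tau = [0] * d                                    # per-coordinate non-zero-gradient counts
    for grad in grad_stream:
        for i, g in grad.items():
            if g == 0.0:
                continue
            tau[i] += 1
            eta = alpha / math.sqrt(tau[i])          # per-coordinate learning rate
            eps = gamma * eta                        # per-coordinate grid resolution
            beta[i] = max(-R, min(R, beta[i] - eta * g))
            beta[i] = random_round(beta[i], eps)
    return beta

# Toy usage: two sparse gradient updates over d = 5 coordinates.
random.seed(0)
print(ogd_rand([{0: 0.4, 3: -0.2}, {0: 0.1}], d=5))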

Page 17: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Even coarser approximation is possible at prediction time
• At prediction time, as long as the effect on the predictions is small, the number of bits can be cut back much more aggressively.

• Lemma 4.1, 4.2 and Theorem 4.3: an analysis of the error that can arise for the logistic loss, as a function of how coarse the approximation is.

• Using compression on top of this, memory can be reduced down to the information-theoretic lower bound.

17


Presenter's annotation: note, however, that this lower bound is not all that small.

Page 18: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Experiments

Page 19: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

RCV1 Dataset

19

(Paper excerpt shown on this slide)

Figure 2. Rounding at Training Time. The fixed q2.13 encoding is 50% smaller than control with no loss. Per-coordinate learning rates significantly improve predictions but use 64 bits per value. Randomized counting reduces this to 40 bits. Using adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive precision is seen more on the larger CTR data.

Consider rounding a trained model β to some β̂. Both the additive and the relative effect on the logistic loss L(·) can be bounded in terms of the quantity δ = |β·x - β̂·x|:

Lemma 4.1 (Additive Error). Fix β, β̂ and (x, y), and let δ = |β·x - β̂·x|. Then L(x, y; β̂) - L(x, y; β) ≤ δ. (Proof: it is well known that |∂L(x, y; β)/∂β_i| ≤ 1 for all x, y, β and i, which implies the result.)

Lemma 4.2 (Relative Error). Fix β, β̂ and (x, y) ∈ {0,1}^d × {0,1}, and let δ = |β·x - β̂·x|. Then (L(x, y; β̂) - L(x, y; β)) / L(x, y; β) ≤ e^δ - 1. (Proof in the appendix.)

Now suppose the model coefficients are stored in fixed-precision numbers such as the Qn.m encoding, with precision ε, which induces a grid of feasible coefficient vectors. If each coefficient β_i (with |β_i| ≤ 2^n) is independently randomly rounded up or down to the nearest feasible value β̂_i such that E[β̂_i] = β_i, then for any x ∈ {0,1}^d the predicted log-odds ratio β̂·x is distributed as a sum of independent random variables {β̂_i | x_i = 1}. Let k = ‖x‖_0. Then |β̂·x - β·x| ≤ ε‖x‖_1 = εk, since |β̂_i - β_i| ≤ ε for all i, so Lemma 4.1 gives L(x, y; β̂) - L(x, y; β) ≤ ε‖x‖_1. Similarly, Lemma 4.2 immediately gives an upper bound of e^{εk} - 1 on the relative logistic error; this bound is relatively tight for small k and holds with probability one, but it does not exploit the fact that the randomness is unbiased and errors should cancel out when k is large. The following bound on expected relative error is much tighter for large k:

Theorem 4.3. Let β̂ be obtained from β by unbiased randomized rounding to a precision-ε grid as described above. Then the expected logistic-loss relative error of β̂ on any input x is at most 2√(2πk) · exp(ε^2 k / 2) · ε, where k = ‖x‖_0.

Additional Compression: Figure 1 reveals that coefficient values are not uniformly distributed, so with a fixed-point representation individual values occur many times, and basic information theory shows that the more common values may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is -(Σ_{i=1}^d log p(β_i))/d bits per value, where p(v) is the probability of occurrence of v in β across all d dimensions. Variable-length encoding schemes may approach this limit and achieve further RAM savings.

RCV1: 20,242 training examples, 677,399 test examples, 47,236 features
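To get a feel for the size of the additive bound from Lemma 4.1 above (my own numerical example; the feature count k = 100 and the q2.7 choice are hypothetical):

\[
|\hat\beta\cdot x - \beta\cdot x| \le \epsilon\,\|x\|_1 = \epsilon k,
\qquad\text{so for } \epsilon = 2^{-7},\ k = 100:\quad
L(x,y;\hat\beta) - L(x,y;\beta) \le 2^{-7}\cdot 100 \approx 0.78 .
\]

This worst-case bound is loose; Theorem 4.3 shows the expected error of unbiased rounding is far smaller, which is consistent with the tiny AucLoss changes in Table 1 on a later slide.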

Page 20: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

CTR Dataset: a private (non-public) search-ad click-log data set

20

CTR: roughly 30M examples, 20M features


Presenter's note: the authors say that even on data with billions of examples and billions of features the results are almost the same.

Page 21: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Approximation quality of the prediction model

21

(Paper excerpt shown on this slide)

Table 1. Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating-point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.

  Encoding | AucLoss | Opt. Bits/Val
  q2.3     | +5.72%  | 0.1
  q2.5     | +0.44%  | 0.5
  q2.7     | +0.03%  | 1.5
  q2.9     | +0.00%  | 3.3

Section 5, Experimental Results: evaluation uses both public and private large data sets. The public RCV1 text classification data set (Chang & Lin, 2011) is used in the usual way: the smaller "train" split of 20,242 examples for parameter tuning and the larger "test" split of 677,399 examples for the full online learning experiments. The private CTR data set has roughly 30M examples and 20M features, sampled from real ad click data from a major search engine; even larger experiments on billions of examples and billions of dimensions gave similar results. The metrics are error rate for RCV1 and AucLoss (1 - AUC) relative to a control model for CTR; lower is better. Metrics are computed by progressive validation (Blum et al., 1999), as is standard for online learning: on each round a prediction is made and recorded for evaluation, and only then is the model allowed to train on the example. The number of bits per coordinate is also reported.

Rounding During Training (Figure 2): the comparison baseline is online logistic regression with a single global learning rate and 32-bit floats for coefficients. Per-coordinate learning rates are tested with 32-bit integer exact counts and with 8-bit randomized counts, and the trade-off curves for fixed-precision rounding (varying m in q2.m) and adaptive-precision rounding (varying the precision scalar γ) are plotted, all with randomized-count base 1.1 and otherwise identical algorithms. With a single global learning rate, a fixed q2.13 encoding saves 50% of the RAM with no added loss over the baseline. Per-coordinate learning rates significantly improve prediction but raise memory from 32 to 64 bits per coordinate in the baselines; randomized counts reduce this to 40 bits, and both the fixed-precision and adaptive-precision methods match the predictive performance of the 64-bit method with 24 bits per coefficient or less, saving 62.5% of the RAM versus the 64-bit method and still using less than 32-bit floats with a global rate. The benefit of adaptive precision is apparent only on the larger CTR data, which has a long-tailed distribution of support across features; the simpler fixed-precision method is already very useful, e.g. q2.13 coefficients with 8-bit randomized counters allow full-byte alignment in naive data structures.

Rounding at Prediction Time (Table 1): coarser randomized rounding of a fully trained model on the CTR data, compared with a 32-bit floating-point representation, clearly supports the theory that more aggressive rounding is possible at prediction time. Surprisingly coarse precision gives excellent results with little or no loss, and the memory savings are considerable: less than two bits per value for q2.7 with theoretically optimal encoding of the discrete values.

Section 6, Conclusions: randomized storage of coefficient values provides an efficient way to achieve significant RAM savings both during training and at prediction time. While the work focuses on OGD, similar randomized rounding schemes may be applied to other learning algorithms; the extension to algorithms that efficiently handle L1 regularization, like RDA (Xiao, 2009) and FTRL-Proximal (McMahan, 2011), is relatively straightforward.

Presenter's annotation: "Opt. Bits/Val" is the size when shrunk down to the information-theoretic lower bound.
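A minimal sketch of how such an "optimal bits per value" figure can be computed from a quantized model, using the -(1/d)·Σ log2 p(β_i) bound quoted in the excerpt above (the toy coefficient distribution is made up):

import math
from collections import Counter

def optimal_bits_per_value(coefficients):
    # Empirical entropy of the quantized values: the average code length an
    # optimal coder (e.g. arithmetic coding) could approach, in bits per coefficient.
    d = len(coefficients)
    freq = Counter(coefficients)
    return -sum(c * math.log2(c / d) for c in freq.values()) / d

# Toy quantized model: most coefficients are exactly zero, as in Figure 1.
model = [0.0] * 900 + [0.25] * 60 + [-0.25] * 30 + [0.5] * 10
print(round(optimal_bits_per_value(model), 2))     # about 0.6 bits per value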

Page 22: ICML2013読み会 Large-Scale Learning with Less RAM via Randomization

Summary
• Memory-efficient storage of the weight vector via randomized rounding.

• Learning and prediction become possible with the model sitting in GPU memory or L1 cache.
• The authors write that the extension to FOBOS and similar methods is also straightforward, with only a proof sketch in a footnote, so whether it really holds is something each reader probably needs to check for themselves.

22

Example: float (32 bits): β = (1.0, ..., -1.52)  ->  Q2.2 (5 bits): β = (1.0, ..., -1.50)