
Page 1: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Efficient Logistic Regression with Stochastic Gradient Descent

William Cohen

Page 2: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Admin
• Phrase assignment extension
  – We forgot about discarding phrases with stop words
  – Aside: why do stop words hurt performance?
• How many map-reduce assignments does it take to screw in a lightbulb?
  – ?
• AWS mechanics
  – I will circulate keys for $100 of time
  – …soon
  – You'll need to sign up for an account with a credit/debit card anyway as a guarantee
• Let me know if this is a major problem
  – $100 should be plenty
• Let us know if you run out

Page 3: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Reminder: Your map-reduce assignments are mostly done

Old NB learning
• Stream & sort
• Stream – sort – aggregate
  – Counter update "message"
  – Optimization: in-memory hash, periodically emptied
  – Sort of messages
  – Logic to aggregate them
• Workflow is on one machine

New NB learning
• Map-reduce
• Map – shuffle – reduce
  – Counter update Map
  – Combiner
  – (Hidden) shuffle & sort
  – Sum-reduce
• Workflow is done in parallel

Page 4: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

NB implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

train.dat (input):
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  …

counts.dat (output; one count per event):
  X=w1^Y=sports     …
  X=w1^Y=worldNews  …
  X=w2^Y=…          …
  …

Map to generate counter updates + Sum combiner + Sum reducer
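A sketch of this counting step in Python (Hadoop-streaming style, shown as a single process): the mapper emits one counter-update message per event, and the sum step aggregates them and doubles as the combiner. The input format "id label w1 w2 …" and the key strings are illustrative, not the assignment's exact format.

    import sys
    from collections import defaultdict

    def mapper(lines):
        """Emit one counter-update message per event: (key, 1)."""
        for line in lines:
            line = line.strip()
            if not line:
                continue
            doc_id, y, *words = line.split()     # assumed format: id label w1 w2 ...
            yield "Y=%s" % y, 1                  # class count
            for w in words:
                yield "X=%s^Y=%s" % (w, y), 1    # word-and-class count

    def sum_reducer(pairs):
        """Sum counts per key; the same logic serves as the combiner."""
        counts = defaultdict(int)
        for key, c in pairs:
            counts[key] += c
        for key in sorted(counts):
            yield key, counts[key]

    if __name__ == "__main__":
        for key, count in sum_reducer(mapper(sys.stdin)):
            print("%s\t%d" % (key, count))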

Page 5: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

words.dat

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Map

Reduce

Page 6: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

Input looks like this:

w         Counts
aardvark  C[w^Y=sports]=2
agent     …
zynga     …

found     ~ctr to id1
aardvark  ~ctr to id2
…
today     ~ctr to idi

Output looks like this:

w         Counts
aardvark  C[w^Y=sports]=2
aardvark  ~ctr to id1
agent     C[w^Y=sports]=…
agent     ~ctr to id345
agent     ~ctr to id9854
…         ~ctr to id345
agent     ~ctr to id34742
zynga     C[…]
zynga     ~ctr to id1

words.dat

Map + IdentityReduce; save output in temp files

Identity Map with two sets of inputs; custom secondary sort (used in shuffle/sort)

Reduce, output to temp

Page 7: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

test.dat:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  id4  …
  id5  …
  …

Output looks like this:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is …
  …

Identity Map with two sets of inputs; custom secondary sort

Page 8: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

Input looks like this:

Key  Value
id1  found aardvark zynga farmville today
     ~ctr for aardvark is C[w^Y=sports]=2
     ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
id2  w2,1 w2,2 w2,3 ….
     ~ctr for w2,1 is …
…    …

Reduce

Page 9: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Reminder: The map-reduce assignments are mostly done

Old NB testing
• Request-answer
  – Produce requests
  – Sort requests and records together
  – Send answers to requestors
  – Sort answers and sending entities together
  – …
  – Reprocess input with request answers
• Workflows are isomorphic

New NB testing
• Two map-reduces
  – Produce requests w/ Map
  – (Hidden) shuffle & sort with custom secondary sort
  – Reduce, which sees records first, then requests
  – Identity map with two inputs
  – Custom secondary sort
  – Reduce that sees old input first, then answers
• Workflows are isomorphic
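One way to get the "records first, then requests" ordering inside a reduce is a composite secondary-sort key. A small Python sketch of the idea; the key layout, tag values, and record strings are illustrative, not the assignment's exact format.

    # Composite keys: sorting by (word, tag) puts the counter record for a word
    # (tag 0) ahead of every request for that word (tag 1).
    def record_key(word):
        return (word, 0)

    def request_key(word, doc_id):
        return (word, 1, doc_id)

    lines = [
        (request_key("agent", "id345"),  "agent\t~ctr to id345"),
        (record_key("agent"),            "agent\tC[w^Y=sports]=1027,..."),
        (request_key("aardvark", "id2"), "aardvark\t~ctr to id2"),
        (record_key("aardvark"),         "aardvark\tC[w^Y=sports]=2"),
    ]
    for _, line in sorted(lines):
        print(line)
    # prints the aardvark record, then its request; the agent record, then its request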

Page 10: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Outline

• Reminder about early/next assignments
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 11: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: warmup
• Goal: Learn the parameter θ of a binomial
• Dataset: D={x1,…,xn}, xi is 0 or 1
• MLE estimate of θ = Pr(xi=1):
  – #[flips where xi=1] / #[flips xi] = k/n
• Now: reformulate as optimization…

Page 12: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: warmup
• Goal: Learn the parameter θ of a binomial
• Dataset: D={x1,…,xn}, xi is 0 or 1, k of them are 1
• Easier to optimize:
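Written out, the likelihood and its log (the form that is easier to optimize) are the standard binomial expressions:

    \Pr(D \mid \theta) = \theta^{k} (1-\theta)^{n-k}
    \qquad
    \log \Pr(D \mid \theta) = k \log\theta + (n-k)\log(1-\theta)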

Page 13: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: warmup
• Goal: Learn the parameter θ of a binomial
• Dataset: D={x1,…,xn}, xi is 0 or 1, k are 1

Page 14: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: warmup
• Goal: Learn the parameter θ of a binomial
• Dataset: D={x1,…,xn}, xi is 0 or 1, k of them are 1
• Set the derivative of the log-likelihood to zero:
  k/θ - (n-k)/(1-θ) = 0
• For θ other than 0 or 1, multiply through by θ(1-θ):
  k - kθ - nθ + kθ = 0
  nθ = k
  θ = k/n

Page 15: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: general procedure

• Goal: Learn the parameter θ of a classifier
  – probably θ is a vector
• Dataset: D={x1,…,xn}
• Write down Pr(D|θ) as a function of θ
• Maximize by differentiating and setting to zero

Page 16: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: general procedure

• Goal: Learn the parameter θ of …
• Dataset: D={x1,…,xn}
  – or maybe D={(x1,y1),…,(xn,yn)}
• Write down Pr(D|θ) as a function of θ
  – Maybe easier to work with log Pr(D|θ)
• Maximize by differentiating and setting to zero
  – Or, use gradient descent: repeatedly take a small step in the direction of the gradient

Page 17: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization: general procedure for SGD

• Goal: Learn the parameter θ of …
• Dataset: D={x1,…,xn}
  – or maybe D={(x1,y1),…,(xn,yn)}
• Write down Pr(D|θ) as a function of θ
• Big-data problem: we don't want to load all the data D into memory, and the gradient depends on all the data
• Solution:
  – pick a small subset of examples B << D
  – approximate the gradient using them
    • "on average" this is the right direction
  – take a step in that direction
  – repeat…
• B = one example is a very popular choice
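A minimal sketch of this recipe in Python; the names sgd, grad_loglik, lr, and batch_size are illustrative, and the gradient computation is left to the caller.

    import random

    def sgd(data, grad_loglik, theta, lr=0.1, batch_size=1, epochs=10):
        """Stochastic gradient ascent on log Pr(D | theta).

        grad_loglik(batch, theta) returns an approximate gradient of the
        log-likelihood computed from a small subset B of the data."""
        data = list(data)
        for _ in range(epochs):
            random.shuffle(data)                     # visit examples in random order
            for i in range(0, len(data), batch_size):
                B = data[i:i + batch_size]           # B << D; B = one example is common
                g = grad_loglik(B, theta)
                theta = [t + lr * gj for t, gj in zip(theta, g)]  # small step along g
        return theta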

Page 18: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for logistic regression

• Goal: Learn the parameter θ of a classifier
  – Which classifier?
  – We've seen y = sign(x · w), but sign is not continuous…
  – Convenient alternative: replace sign with the logistic function

Page 19: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for logistic regression

• Goal: Learn the parameter w of the classifier
• Probability of a single example (y,x) would be:
• Or with logs:
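With p = σ(x·w) and labels y ∈ {0,1} (the encoding that matches the (y - p) updates used later in these slides), these are:

    p = \sigma(\mathbf{x} \cdot \mathbf{w}) = \frac{1}{1 + e^{-\mathbf{x} \cdot \mathbf{w}}}
    \qquad
    \Pr(y \mid \mathbf{x}, \mathbf{w}) = p^{y} (1-p)^{1-y}

    \log \Pr(y \mid \mathbf{x}, \mathbf{w}) = y \log p + (1-y) \log(1-p)
    \qquad
    \frac{\partial}{\partial w_j} \log \Pr(y \mid \mathbf{x}, \mathbf{w}) = (y - p)\, x_j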

Page 20: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen
Page 21: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen


Page 22: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen
Page 23: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

An observation

Key computational point:
• if xj = 0 then the gradient of wj is zero
• so when processing an example you only need to update weights for the non-zero features of that example.

Page 24: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for logistic regression

• Final algorithm (process the examples in random order):
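A sketch of this per-example update loop in Python, assuming sparse examples stored as {feature: value} dicts and labels in {0, 1}; lam stands in for the learning rate λ.

    import math, random
    from collections import defaultdict

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def sgd_logistic(examples, lam=0.1, epochs=5):
        """examples: list of (x, y) with x a dict {feature: value}, y in {0, 1}."""
        W = defaultdict(float)
        for _ in range(epochs):
            random.shuffle(examples)                  # do this in random order
            for x, y in examples:
                p = sigmoid(sum(W[j] * xj for j, xj in x.items()))
                for j, xj in x.items():               # only touch non-zero features
                    W[j] += lam * (y - p) * xj
        return W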

Page 25: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Another observation

• Consider averaging the gradient over all the examples D={(x1,y1)…,(xn , yn)}

Page 26: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for logistic regression

• Goal: Learn the parameter θ of a classifier
  – Which classifier?
  – We've seen y = sign(x · w), but sign is not continuous…
  – Convenient alternative: replace sign with the logistic function
• Practical problem: this overfits badly with sparse features
  – e.g., if feature j appears only in positive examples, the gradient of wj is always positive!

Page 27: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Outline

• [other stuff]
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 28: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Regularized logistic regression
• Replace LCL
• with LCL + penalty for large weights, e.g.
• So:
• becomes:

Page 29: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Regularized logistic regression
• Replace LCL
• with LCL + penalty for large weights, e.g.
• So the update for wj becomes:
• Or:
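Written out, assuming the penalty is the quadratic μ Σj wj² (the form that matches the (1 - λ2μ) shrinkage factor used on the following slides), the per-example update is:

    w_j \leftarrow w_j + \lambda\bigl((y_i - p_i)\, x_{ij} - 2\mu\, w_j\bigr)
    \quad\text{i.e.}\quad
    w_j \leftarrow (1 - 2\lambda\mu)\, w_j + \lambda (y_i - p_i)\, x_{ij}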

Page 30: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for logistic regression

• Algorithm (process the examples in random order):

Page 31: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Algorithm: time goes from O(nT) to O(nmT), where
  – n = number of non-zero entries
  – m = number of features
  – T = number of passes over the data

Page 32: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        – W[j] = W[j] - λ2μW[j]
        – If xij > 0 then
          » W[j] = W[j] + λ(yi - pi)xj

Page 33: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        – W[j] *= (1 - λ2μ)
        – If xij > 0 then
          » W[j] = W[j] + λ(yi - pi)xj

Page 34: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        – If xij > 0 then
          » W[j] *= (1 - λ2μ)^A
          » W[j] = W[j] + λ(yi - pi)xj

A is number of examples seen since the last time we did an x>0 update on W[j]

Page 35: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Final algorithm:
  • Initialize hashtables W, A and set k=0
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = … ; k++
      • For each feature W[j]
        – If xij > 0 then
          » W[j] *= (1 - λ2μ)^(k-A[j])
          » W[j] = W[j] + λ(yi - pi)xj
          » A[j] = k

k-A[j] is number of examples seen since the last time we did an x>0 update on W[j]
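A sketch of this lazy-regularization version in Python, under the same sparse-dict assumptions as before; W, A, and k mirror the pseudocode, lam and mu are the learning rate and regularization strength, and the final loop applies any shrinkage still owed when training ends.

    import math, random
    from collections import defaultdict

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def sgd_lazy_regularized(examples, lam=0.1, mu=0.01, epochs=5):
        """examples: list of (x, y) with x a dict {feature: value}, y in {0, 1}."""
        W = defaultdict(float)   # weights
        A = defaultdict(int)     # A[j] = value of k at the last update of W[j]
        k = 0                    # number of examples seen so far
        decay = 1.0 - 2.0 * lam * mu
        for _ in range(epochs):
            random.shuffle(examples)
            for x, y in examples:
                k += 1
                p = sigmoid(sum(W[j] * xj for j, xj in x.items()))
                for j, xj in x.items():               # only non-zero features
                    W[j] *= decay ** (k - A[j])       # catch up on the skipped shrink steps
                    W[j] += lam * (y - p) * xj
                    A[j] = k
        for j in W:                                   # apply outstanding shrinkage at the end
            W[j] *= decay ** (k - A[j])
        return dict(W)

The catch-up multiplication by decay^(k-A[j]) is exactly the shrinkage the dense algorithm would have applied on the examples where feature j was zero, which is why the total work drops back to O(nT).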

Page 36: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Final algorithm:
  • Initialize hashtables W, A and set k=0
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = … ; k++
      • For each feature W[j]
        – If xij > 0 then
          » W[j] *= (1 - λ2μ)^(k-A[j])
          » W[j] = W[j] + λ(yi - pi)xj
          » A[j] = k

• Time goes from O(nmT) back to O(nT), where
  – n = number of non-zero entries
  – m = number of features
  – T = number of passes over the data
• Memory use doubles (m → 2m)

Page 37: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Outline

• [other stuff]
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 38: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Pop quiz

• In text classification, most words are
  a. rare
  b. not correlated with any class
  c. given low weights in the LR classifier
  d. unlikely to affect classification
  e. not very interesting

Page 39: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Pop quiz

• In text classification, most bigrams are
  a. rare
  b. not correlated with any class
  c. given low weights in the LR classifier
  d. unlikely to affect classification
  e. not very interesting

Page 40: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Pop quiz

• Most of the weights in a classifier are
  – important
  – not important

Page 41: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

How can we exploit this?
• One idea: combine uncommon words together randomly
• Examples:
  – replace all occurrences of "humanitarianism" or "biopsy" with "humanitarianismOrBiopsy"
  – replace all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"
  – replace all occurrences of "gynecologist" or "constrictor" with "gynecologistOrConstrictor"
  – …
• For Naïve Bayes this breaks independence assumptions
  – it's not obviously a problem for logistic regression, though
• I could combine
  – two low-weight words (won't matter much)
  – a low-weight and a high-weight word (won't matter much)
  – two high-weight words (not very likely to happen)
• How much of this can I get away with?
  – certainly a little
  – is it enough to make a difference?

Page 42: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

How can we exploit this?
• Another observation:
  – the values in my hash table are weights
  – the keys in my hash table are strings for the feature names
    • We need them to avoid collisions
• But maybe we don't care about collisions?
  – Allowing "schizoid" & "duchy" to collide is equivalent to replacing all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"

Page 43: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Algorithm:
  • Initialize hashtables W, A and set k=0
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • pi = … ; k++
      • For each feature j with xij > 0:
        » W[j] *= (1 - λ2μ)^(k-A[j])
        » W[j] = W[j] + λ(yi - pi)xj
        » A[j] = k

Page 44: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Algorithm:
  • Initialize arrays W, A of size R and set k=0
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • Let V be a hash table so that V[hash(j) mod R] += xij for each feature j of xi
      • pi = … ; k++
      • For each hash value h with V[h] > 0:
        » W[h] *= (1 - λ2μ)^(k-A[h])
        » W[h] = W[h] + λ(yi - pi)V[h]
        » A[h] = k
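A sketch of this hashed variant in Python, under the same sparse-dict assumptions; R and the use of Python's built-in hash are illustrative (a real implementation would want a hash that is stable across runs).

    import math, random
    from collections import defaultdict

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def sgd_hashed(examples, R=2 ** 20, lam=0.1, mu=0.01, epochs=5):
        """examples: list of (x, y) with x a dict {feature_string: value}, y in {0, 1}."""
        W = [0.0] * R            # weights, indexed by hash bucket
        A = [0] * R              # A[h] = value of k at the last update of W[h]
        k = 0
        decay = 1.0 - 2.0 * lam * mu
        for _ in range(epochs):
            random.shuffle(examples)
            for x, y in examples:
                k += 1
                V = defaultdict(float)                # hashed copy of the example
                for j, xj in x.items():
                    V[hash(j) % R] += xj              # collisions just merge features
                p = sigmoid(sum(W[h] * vh for h, vh in V.items()))
                for h, vh in V.items():
                    W[h] *= decay ** (k - A[h])       # lazy regularization, per bucket
                    W[h] += lam * (y - p) * vh
                    A[h] = k
        return W

Two features that land in the same bucket are treated exactly like "schizoidOrDuchy" above: their counts are merged and they share one weight.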

Page 45: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Learning as optimization for regularized logistic regression

• Algorithm:
  • Initialize arrays W, A of size R and set k=0
  • For each iteration t=1,…,T
    – For each example (xi,yi)
      • Let V be a hash table so that V[hash(j) mod R] += xij for each feature j of xi
      • pi = … ; k++
      • For each hash value h with V[h] > 0:
        » W[h] *= (1 - λ2μ)^(k-A[h])
        » W[h] = W[h] + λ(yi - pi)V[h]
        » A[h] = k

???