Optimization via (too much?) Randomization
Peter Richtárik
Why parallelizing like crazy and being lazy can be good
Optimization as Mountain Climbing
Optimization with Big Data = Extreme* Mountain Climbing
* in a billion-dimensional space on a foggy day
Big Data
• digital images & videos
• transaction records
• government records
• health records
• defence
• internet activity (social media, Wikipedia, ...)
• scientific measurements (physics, climate models, ...)
BIG Volume BIG Velocity BIG Variety
God’s Algorithm = Teleportation
If You Are Not a God...
[Figure: iterates x0, x1, x2, x3 climbing from the start point; in practice you settle for a good point rather than the holy grail]
Randomized Parallel Coordinate Descent
• Western General Hospital (Creutzfeldt–Jakob Disease)
• Arup (Truss Topology Design)
• Ministry of Defence, Dstl lab (Algorithms for Data Simplicity)
• Royal Observatory (Optimal Planet Growth)
Optimization as Lock Breaking
A Lock with 4 Dials
Setup: Combination maximizing F opens the lock
x = (x1, x2, x3, x4)
F(x) = F(x1, x2, x3, x4): a function representing the "quality" of a combination
Optimization Problem: Find combination maximizing F
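As a toy illustration of this setup (the quality function below is an assumption, not from the talk), a 4-dial lock with 10 positions per dial can be opened by brute force over all 10^4 combinations:

```python
import itertools

# Hypothetical quality function for a 4-dial lock (illustrative only).
def F(x1, x2, x3, x4):
    return -(x1 - 3) ** 2 - (x2 - 7) ** 2 - (x3 - 1) ** 2 - (x4 - 5) ** 2

# Each dial has positions 0..9; brute force checks every combination.
best = max(itertools.product(range(10), repeat=4), key=lambda x: F(*x))
print(best)  # the combination that opens the lock: (3, 7, 1, 5)
```

Brute force is hopeless once n is large (10^n combinations), which is exactly why the talk turns to randomized algorithms.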
Optimization Algorithm
A System of Billion Locks with Shared Dials
# dials = n = # locks

[Figure: a graph whose nodes are the dials x1, x2, ..., xn; one lock per node]

1) Nodes in the graph correspond to dials
2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge
How do we Measure the Quality of a Combination?
F : R^n → R
• Each lock j has its own quality function Fj, depending only on the dials it owns
• However, the system does NOT open when a single Fj is maximized
• The system of locks opens when
F = F1 + F2 + ... + Fn
is maximized
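A toy instance of such a partially separable F (the graph, targets, and quadratic form below are all assumptions for illustration): lock j owns dial j plus the dials joined to node j by an edge, and its quality Fj peaks when its owned dials sum to a target.

```python
# Hypothetical setup: 4 dials on a path graph; lock j owns dial j and
# its graph neighbours, and wants those dials to sum to t[j].
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
t = [3.0, 6.0, 9.0, 7.0]  # made-up per-lock targets

def F_j(j, x):
    # Quality of lock j: depends ONLY on the dials it owns.
    return -(sum(x[i] for i in [j] + adjacency[j]) - t[j]) ** 2

def F(x):
    # The system opens when the SUM of the per-lock qualities is maximized.
    return sum(F_j(j, x) for j in adjacency)

print(F([1.0, 2.0, 3.0, 4.0]))  # 0.0: this combination opens the system
```

Each Fj touches only a few coordinates, which is the structural property ("partial separability") that the randomized algorithm exploits.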
An Algorithm with (too much?) Randomization
1) Randomly select a lock
2) Randomly select a dial belonging to the lock
3) Adjust the value on the selected dial based only on the information corresponding to the selected lock
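The three steps can be sketched as follows. Everything concrete here (the graph, the per-lock quality, the candidate moves) is an assumed toy setup, not the talk's actual method; the point is only the selection-and-local-update pattern.

```python
import random

random.seed(0)  # reproducibility of this sketch

# Toy setup: lock j owns dial j and its graph neighbours,
# and its quality penalizes disagreement among owned dials.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def F_j(j, x):
    return -sum((x[i] - x[j]) ** 2 for i in [j] + adjacency[j])

def F(x):
    # Global objective: sum of the per-lock qualities.
    return sum(F_j(j, x) for j in adjacency)

def step(x):
    j = random.choice(list(adjacency))     # 1) randomly select a lock
    i = random.choice([j] + adjacency[j])  # 2) randomly select one of its dials
    # 3) adjust that dial using ONLY lock j's information:
    #    try a few small moves, keep whichever is best for F_j alone.
    x[i] = max((x[i] + d for d in (-0.1, 0.0, 0.1)),
               key=lambda v: F_j(j, x[:i] + [v] + x[i + 1:]))

x = [0.0, 1.0, 2.0, 3.0]   # F(x) = -6.0 at the start
for _ in range(2000):
    step(x)
print(F(x))  # much closer to the optimum F = 0 than the starting -6.0
```

No step ever looks at the whole system, yet the sum F steadily improves; this locality is what makes each iteration cheap enough to run billions of times.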
Synchronous Parallelization

[Figure: Processors 1–3 execute jobs J1–J9 in synchronized rounds over time; each processor sits IDLE until the slowest job of the round finishes. WASTEFUL]
Crazy (Lock-Free) Parallelization

[Figure: Processors 1–3 execute jobs J1–J9 back-to-back over time, with no synchronization and no idle gaps. NO WASTE]
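The contrast can be made numerical with hypothetical job durations (the numbers below are invented for illustration): in the synchronous scheme every round lasts as long as its slowest job, while in the lock-free scheme each processor starts its next job immediately.

```python
# Hypothetical durations of jobs J1..J9, three per processor.
jobs = [[2.0, 1.0, 3.0],   # Processor 1: J1, J2, J3
        [1.0, 4.0, 1.0],   # Processor 2: J4, J5, J6
        [3.0, 2.0, 2.0]]   # Processor 3: J7, J8, J9

# Synchronous: every round ends when its slowest job ends.
sync_time = sum(max(round_jobs) for round_jobs in zip(*jobs))

# Lock-free ("crazy"): each processor runs its jobs back-to-back.
async_time = max(sum(p) for p in jobs)

print(sync_time, async_time)  # 3+4+3 = 10.0 vs max(6, 6, 7) = 7.0
```

The gap widens as job durations become more variable, which is the regime big-data workloads actually live in.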
Crazy Parallelization
Theoretical Result
[Speedup formula not recoverable from the slide; the quantities it involves:]
• average # of dials in a lock
• average # of dials common between 2 locks
• # locks
• # processors
Computational Insights
Theory vs Reality
Why parallelizing like crazy and being lazy can be good?
Randomization
• Effectiveness
• Tractability
• Efficiency
• Scalability (big data)
• Parallelism
• Distribution
• Asynchronicity
Parallelization
Optimization Methods for Big Data
• Randomized Coordinate Descent
  – P. Richtárik and M. Takáč: Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 [can solve a problem with 1 billion variables in 2 hours using 24 processors]
• Stochastic (Sub)Gradient Descent
  – P. Richtárik and M. Takáč: Randomized lock-free methods for minimizing partially separable convex functions [can be applied to optimize an unknown function]
• Both of the above
  – M. Takáč, A. Bijral, P. Richtárik and N. Srebro: Mini-batch primal and dual methods for support vector machines, arXiv:1303.xxxx
Final 2 Slides
Tools
• Probability
• Machine Learning
• Matrix Theory
• HPC