

Improving the life of a data scientist

[Speculation sketch: run GD on a sample until error > 0.05, fit a distribution to the observed error sequence, and extrapolate the estimated #iterations needed to reach error < 0.05.]

Data scientist today

What people think he does · What he thinks he does · What he actually does

Observation

Gradient Descent (GD) comes in several flavors:
• Batch GD
• Stochastic GD
• Mini-batch GD

Optimization problem

ML tasks • classification • clustering • …

Processing Platform

express → solve → execute

min_w Σ_{i ∈ data} f_i(w) + g(w)

Solution: a cost-based GD optimizer
• Input: GD TASK (a declarative query)
• Components: GD Abstraction, Planner, Rewriter, GD Cost Model, GD Iteration Estimator, GD Plan Space
• General problem: select the algorithm, tune hyperparameters, implement it (our focus: algorithm selection and implementation)
• Key observation: no GD variant is an all-times winner

A GD plan operates on data units in three phases:
• Preparation phase (1): Transform → Stage
• Processing phase (2): Sample → Compute → Update
• Convergence phase (3): Convergence → Loop (true: iterate again; false: emit the Model)

Example: an SGD plan

Transform parses each input record into a (label, indices, values) data unit:

  raw input                      data unit (label, indices, values)
  +1 2:0.1 4:0.4 10:0.3    →     ( 1, [2, 4, 10], [0.1, 0.4, 0.3])
  -1 3:0.3 4:0.5  9:0.5    →     (-1, [3, 4, 9],  [0.3, 0.5, 0.5])
  +1 1:0.1 2:0.7  6:0.2    →     ( 1, [1, 2, 6],  [0.1, 0.7, 0.2])

Sample then draws a data unit per iteration, e.g. (1, [2, 4, 10], [0.1, 0.4, 0.3]).

How to model ML tasks?

How to optimize GD plans?

Operators: Transform, Staging, Sample, Compute, Update, Convergence, Loop.

Sampling techniques (over data partitions):
• Bernoulli: scan all partitions [1 2 3 4 5] and keep each point with some probability
• Random-partition: pick a partition at random [1 2 3 4 5] and sample within it
• Shuffle-partition: shuffle the partitions once [3 1 2 4 5], then read them sequentially

Lazy transformation: transform only the data units that are actually sampled, instead of the whole dataset up front.
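A toy, single-machine Python sketch of the three sampling strategies (the real operators work on distributed partitions; names and details here are illustrative):

```python
import random

def bernoulli_sample(partitions, p):
    """Scan every partition; keep each point independently with probability p."""
    return [x for part in partitions for x in part if random.random() < p]

def random_partition_sample(partitions, b):
    """Jump to one random partition and draw the whole batch from it."""
    part = random.choice(partitions)
    return random.sample(part, min(b, len(part)))

class ShufflePartitionSampler:
    """Shuffle the data once up front; afterwards every batch is a cheap
    sequential read instead of a full scan."""
    def __init__(self, partitions):
        self.data = [x for part in partitions for x in part]
        random.shuffle(self.data)
        self.pos = 0

    def next_batch(self, b):
        batch = self.data[self.pos:self.pos + b]
        self.pos = (self.pos + b) % max(len(self.data), 1)
        return batch
```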

GD plans (the plan space):
• BGD: eager or lazy transformation
• SGD/MGD: eager or lazy transformation × Bernoulli, random-partition, or shuffle-partition sampling
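Enumerating this plan space is mechanical; a small sketch (assuming exactly the choices shown above):

```python
from itertools import product

ALGORITHMS = ("BGD", "SGD", "MGD")
TRANSFORMATIONS = ("eager", "lazy")
SAMPLERS = ("bernoulli", "random-partition", "shuffle-partition")

def gd_plan_space():
    """Yield candidate GD plans; BGD reads the full data, so the
    sampling choice only applies to SGD/MGD."""
    for algorithm, transformation in product(ALGORITHMS, TRANSFORMATIONS):
        if algorithm == "BGD":
            yield (algorithm, transformation, None)
        else:
            for sampler in SAMPLERS:
                yield (algorithm, transformation, sampler)

print(len(list(gd_plan_space())))  # 2 BGD plans + 12 SGD/MGD plans = 14
```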

How to get # iterations?

Initialization: i₀ = 0, α = 0.1, w₀ = [0.0, 0.0, ..., 0.0]

Each iteration computes (∇f, w):
  w_{k+1} = w_k − α∇f(w_k)
  i_{k+1} = i_k + 1
  δ = ‖w_{k+1} − w_k‖
and loops while δ > 0.01.
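The same loop as runnable Python/NumPy (the quadratic objective is only an example):

```python
import numpy as np

def gradient_descent(grad_f, dim, alpha=0.1, tol=0.01, max_iter=100_000):
    """Iterate w_{k+1} = w_k - alpha * grad_f(w_k) until the weight delta
    ||w_{k+1} - w_k|| drops to tol, counting iterations along the way."""
    w = np.zeros(dim)               # w_0 = [0.0, ..., 0.0]
    for i in range(max_iter):       # i_0 = 0
        w_next = w - alpha * grad_f(w)
        delta = np.linalg.norm(w_next - w)
        w = w_next
        if delta <= tol:            # loop condition: delta > 0.01
            return w, i + 1
    return w, max_iter

# Example: minimize ||w - 1||^2, whose gradient is 2(w - 1).
w_star, iterations = gradient_descent(lambda w: 2 * (w - np.ones(3)), dim=3)
print(iterations, w_star)
```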

Rheem

More info

Visit our web page: http://da.qcri.org/rheem
Source code: https://github.com/rheem-ecosystem/rheem


A Cross-Platform System

ML4all

• Rheem: Enabling Multi-Platform Task Execution SIGMOD 2016, San Francisco, USA (demo paper)

• Road to Freedom in Data Analytics EDBT 2016, Bordeaux, France (vision paper)

Join us at the Spark Summit 2017

San Francisco, USA

A DB-like Machine Learning System

[Figure 9 panels — training time (sec, log scale) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2, svm3) for MLlib, SystemML (with its conversion overhead shown separately), and our system: (a) BGD, (b) MGD, (c) SGD.]

Figure 9: Training time (sec). Our system significantly outperforms both MLlib and SystemML, thanks to its novel sampling mechanisms and its lazy transformation technique.


[Figure 10(a) — training time (sec) of MLlib, ML4all (eager-random), and ML4all (lazy-shuffle) when scaling the number of points from 2.7M (5GB) to 88M (160GB).]

(a) Scaling #points.


[Figure 10(b) — training time (sec) of MLlib, ML4all (eager-random), and ML4all (lazy-shuffle) when scaling the number of features from 1k (180MB) to 500k (90GB).]

(b) Scaling #features.

Figure 10: Our system's scalability compared to MLlib. It scales gracefully with both the number of data points and features.

(1) For BGD (Figure 9(a)), we observe that even though sampling and lazy transformation are not used in BGD, our system is still faster than MLlib. This is because we used mapPartitions and reduce instead of treeAggregate, which resulted in better data locality and hence better response times for larger datasets. Notice that SystemML is slightly faster than our system for the small datasets, because SystemML processes them locally. The largest bottleneck of SystemML for small datasets is the time to convert the dataset to its binary format. However, we observe that our system significantly outperforms SystemML for larger datasets, when SystemML runs on Spark. In fact, we had to stop SystemML after 3 hours for the higgs dataset, while for the three dense synthetic datasets SystemML failed with out-of-memory exceptions.
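A toy PySpark sketch of that aggregation pattern (a least-squares gradient is used purely for illustration; this is not the system's actual code):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="bgd-mappartitions")

def partition_gradient(points, w):
    """Sum the per-point gradients of one partition locally, so only a
    single dense vector per partition crosses the network."""
    acc = np.zeros_like(w)
    for y, x in points:
        acc += 2.0 * (x.dot(w) - y) * x   # least-squares gradient
    yield acc

w = np.zeros(3)
data = [(1.0, np.array([0.1, 0.4, 0.3])),
        (-1.0, np.array([0.3, 0.5, 0.5]))]
points = sc.parallelize(data, numSlices=2)

# Full-batch gradient via mapPartitions + reduce instead of treeAggregate.
grad = points.mapPartitions(lambda it: partition_gradient(it, w)).reduce(np.add)
```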

(2) For MGD (Figure 9(b)), we observe that our system outperforms, on average, both MLlib and SystemML. For the small datasets it has similar performance to MLlib and SystemML; however, SystemML additionally requires the overhead of converting the data to its binary representation. Our system is up to 28 times faster than MLlib and more than 2 orders of magnitude faster than SystemML for large datasets (higgs, svm1, and svm2). Especially for the dataset svm3, which does not fit entirely into Spark's cache, MLlib incurred disk IOs in each iteration, resulting in a training time per iteration of 6 min; thus, we had to terminate the execution after 3 hours. The large benefits of our system come from the shuffle-partition sampling technique, which significantly saves IO costs.

(3) For SGD (Figure 9(c)), we observe that our system is significantly superior to MLlib (by a factor of 2 for small datasets up to 46 for larger datasets). In fact, similarly to MGD, MLlib incurred many disk IOs for svm3, and we had to stop the execution after 3 hours. In contrast, SystemML has lower training times for the very small datasets (adult, covtype, and yearpred), thanks to its binary data representation that makes local processing faster. However, the cost of converting the data to this binary representation is higher than the training time itself, which makes SystemML slower than our system (except for covtype). Things get worse for SystemML as the data grows: our system is more than 2 orders of magnitude faster than SystemML. The benefit of our system on SGD is mainly due to its lazy transformation. In fact, as for BGD and MGD, we had to stop SystemML after 3 hours for the higgs dataset, while it failed with out-of-memory exceptions for the three dense datasets. Notice that the training time for a larger dataset may be smaller if the number of iterations to converge is smaller. For example, this is the case for the dataset covtype, which required 923 iterations to converge using SGD, in contrast to rcv1, which required only 196. This resulted in a smaller training time for rcv1 than for covtype.

8.4.2 Scalability

Figure 10 shows the scalability results for SGD for the two largest synthetic datasets (SVM A and SVM B), when increasing the number of data points (Figure 10(a)) and the number of features (Figure 10(b)). Notice that we discarded SystemML as it was not able to run on these dense datasets. We plot the runtimes of the eager-random and the lazy-shuffle GD plans. We observe that both plans outperform MLlib by more than one order of magnitude in both cases. In particular, we observe that our system scales gracefully with both the number of data points and the number of features, while MLlib does not. This is even more prominent for the datasets that do not fit in Spark's cache memory. Especially, we observe that the lazy-shuffle plan scales better than the eager-random one. This shows the high efficiency of our shuffle-partition sampling mechanism in combination with the lazy transformation. Note that we had to stop the execution of MLlib after 24 hours for the largest dataset of 88 million points in Figure 10(a). MLlib took 4.3 min for each iteration and thus would require 3 days to complete, while our GD plan took only 25 minutes. This leads to more than 2 orders of magnitude improvement over MLlib.

8.4.3 Benefits and overhead of abstraction

We also evaluate the benefits and overhead of using the ML4all abstraction. For this, we implemented the plan produced by ML4all directly on top of Spark. We also implemented the Bismarck abstraction [12], which comes with a Prepare UDF while the Compute and Update are combined, on Spark. Recall that a key advantage of separating Compute from Update is that the former can be parallelized, whereas the latter has to be effectively serialized. When these two operators are combined into one, parallelization cannot be leveraged. Its Prepare UDF, however, can be parallelized.
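A compact sketch of why the split matters (an illustrative least-squares gradient again; not the Bismarck or ML4all code):

```python
import numpy as np

def compute(batch, w):
    """Compute: a pure function (data units, model) -> partial gradient,
    so it can run on many workers in parallel."""
    return sum(2.0 * (x.dot(w) - y) * x for y, x in batch)

def update(w, gradient, alpha=0.1):
    """Update: folds the gradient into the model; effectively serial
    (one writer advances the weights)."""
    return w - alpha * gradient

# With a fused Compute+Update operator (as in the Bismarck abstraction),
# the parallelizable gradient work is forced through this serial path too.
batch = [(1.0, np.array([0.1, 0.4, 0.3])), (-1.0, np.array([0.3, 0.5, 0.5]))]
w = update(np.zeros(3), compute(batch, np.zeros(3)))
```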

Figure 11 illustrates the results of these experiments. We observe that ML4all adds almost no overhead to plan execution, as it has very similar runtimes to the pure Spark implementation. We also observe that our system and Bismarck have similar runtimes for SGD and MGD(1k), and for all three datasets. This is because our prototype runs in a hybrid mode and parts of the plan are executed in a centralized fashion, thus negating the separation of the Compute and the Update step.

[Figure 11 panels — training time (sec) for SGD, MGD(1K), MGD(10K), and BGD with a hard-coded Spark implementation, our system, and Bismarck-on-Spark, plus a small distributed-vs-hybrid comparison on adult and rcv1: (a) adult dataset, (b) rcv1 dataset, (c) svm1 dataset.]

Figure 11: ML4all abstraction benefits and overhead. The proposed abstraction has negligible overhead w.r.t. hard-coded Spark programs while it allows for exhaustive distributed execution.

As the dataset cardinality or dimensionality increases, the advantages of ML4all become clear. Our system is (i) slightly faster for MGD(10k) on a small dataset (Figure 11(a)), (ii) more than 3 times faster for MGD(10k) in Figure 11(c), because of the distribution of the gradient computation, and (iii) able to run MGD(10k) in Figure 11(b), where the Bismarck abstraction fails due to the large number of features of rcv1. This is also the reason the Bismarck abstraction fails to run BGD for the same dataset rcv1; for svm1, it fails because of the large number of data points. This clearly shows that the Bismarck abstraction cannot scale with the dataset size. In contrast, our system scales gracefully in all cases, as it executes the algorithms in a distributed fashion whenever required.

8.4.4 Summary

The high efficiency of our system comes from its (i) lazy transformation technique, (ii) novel sampling mechanisms, and (iii) efficient execution operators. All these results not only show the high efficiency of our optimization techniques, but also the power of the ML4all abstraction that allows for such optimizations without adding any overhead.

8.5 Accuracy

The reader might think that our system achieves high performance at the cost of sacrificing accuracy. However, this is far from the truth. To demonstrate this, we measured the testing error of each system and each GD algorithm. We used the test datasets from LIBSVM when available; otherwise we randomly split the initial dataset into training (80%) and testing (20%). We then apply the model (i.e., the weights vector) produced on the training dataset to each example in the testing dataset to determine its output label, and plot the mean square error of the output labels compared to the ground truth. Recall that we used the same parameters (e.g., step size) in all systems.
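In sketch form, that measurement boils down to the following (dense matrices are assumed for brevity; the actual datasets are sparse):

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Random 80/20 train/test split, used when LIBSVM provides no test set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    cut = int(0.8 * len(y))
    return (X[idx[:cut]], y[idx[:cut]]), (X[idx[cut:]], y[idx[cut:]])

def testing_mse(w, X_test, y_test):
    """Apply the trained weights vector to each test example and report the
    mean square error of the predicted labels vs. the ground truth."""
    return float(np.mean((X_test @ w - y_test) ** 2))
```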

Let us first note that, as expected, all systems return the same model for BGD; hence we omit the graph, as the testing error is exactly the same. Figure 12 shows the results for MGD and SGD. We omit the results for svm3, as only our system could converge in a reasonable amount of time. Although our system uses aggressive sampling techniques in some cases, such as shuffle-partition for the large datasets in MGD⁴, its error is very close to the ones of MLlib and SystemML. The only case where shuffle-partition influences the testing error is for rcv1 in SGD: the testing error for MLlib is 0.08, while in our case it is 0.18. This is due to the skewness of the data. SystemML, having a testing error of 0.3, also seems to suffer from this problem.

⁴ Table 4 in Appendix E shows the plan chosen in each case.

[Figure 12 panels — testing error (MSE) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2) for MLlib, SystemML, and our system: (a) MGD, (b) SGD.]

Figure 12: Testing error (mean square error). For SGD/MGD, our system achieves an error close to MLlib even if it uses different sampling methods.

[Figure 13(a) — training time (sec, log scale) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2) for MGD with eager transformation, comparing Bernoulli, random-partition, and shuffle-partition sampling.]

(a) Eager transformation

[Figure 13(b) — training time (sec, log scale) per dataset for MGD with lazy transformation, comparing random-partition and shuffle-partition sampling.]

(b) Lazy transformation

Figure 13: Sampling effect in MGD for eager and lazy transformation.

We are currently working to improve this sampling technique for such cases. However, in cases where the data is not skewed, our testing error even for SGD is very close to the one of MLlib. Thus, we can conclude that ML4all decreases training times without affecting the accuracy of the model.

8.6 In-Depth

We analyze in detail how the sampling and the transformation techniques affect performance when running MGD with 1,000 samples and SGD until convergence, with the tolerance set to 0.001 and a maximum of 1,000 iterations.

8.6.1 Varying the sampling technique

We first fix the transformation and vary the sampling technique. Figure 13 shows how the sampling technique affects MGD when using eager and lazy transformation. First, with eager transformation on small datasets, Bernoulli sampling is more beneficial (Figure 13(a)). This is because MGD needs a thousand samples per iteration and thus a full scan of the whole dataset per iteration does not penalize the total execution time. However, for larger datasets that consist of more partitions, shuffle-partition is faster in all cases, as it accesses only a few partitions.

For the lazy transformation (Figure 13(b)), we ran only the random-partition and shuffle-partition sampling techniques; a plan with Bernoulli sampling and lazy transformation is always inefficient, as explained in Section 6. We observe that for MGD and the two small datasets of …

Time estimates

[Figure 6 panels — #iterations (log scale) vs. tolerance (0.1, 0.01, 0.001) for BGD, MGD, and SGD, real vs. estimated: (a) adult dataset, (b) covtype dataset, (c) rcv1 dataset.]

Figure 6: ML4all obtains good estimates for the number of iterations for all GD algorithms.

[Figure 7(a) — training time (sec): real vs. estimated per dataset (adult, covtype, yearpred, rcv1) for runs of 1,000 iterations.]

(a) Run of 1,000 iterations

[Figure 7(b) — training time (sec, log scale): real vs. estimated per dataset and tolerance for runs to convergence.]

(b) Run to convergence

Figure 7: ML4all obtains accurate time estimates.

GD Plan choice

[Figure 8 — training time (sec, log scale) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2, svm3): best (Min) and worst (Max) GD plan vs. the plan ML4all selects (speculation + plan execution). Chosen plans: BGD for adult, MGD lazy-random for covtype, SGD eager-shuffle for yearpred, and SGD lazy-shuffle for rcv1, higgs, svm1, svm2, and svm3.]

Figure 8: ML4all always performs very close to the best plan: it chooses that plan and adds only a small overhead.

We also evaluate the GD plans chosen by our optimizer. For this, we used a larger variety of real and synthetic datasets and measured the training time.

Figure 8 illustrates the training times of the best (min) and worst (max) GD plan as well as of the GD plan selected by ML4all for each dataset. Notice that the latter time includes the time taken by our optimizer to choose the GD plan (the speculation part) plus the time to execute it. The legend above the green bars indicates which GD plan our optimizer chose. Although for most datasets SGD was the best choice, other GD algorithms can be the winner for different tolerance values and tasks, as we showed in the introduction. We make two observations from these results. First, ML4all always selects the fastest GD plan and, second, ML4all incurs a very low overhead due to the speculation. Therefore, even with the optimization overhead, ML4all still achieves very low training times, close to the ones a user would achieve if she knew which plan to run. In fact, the optimization time is between 4.6 and 8 seconds for all datasets. Of this overhead, around 4 sec is Spark's job initialization for collecting the sample. Given that the training time of ML models is usually in the order of hours, a few seconds are negligible. It is worth noting that we observed an optimization time of less than 100 milliseconds when just the number of iterations is given.

All the above results show the efficiency of our cost model and the accuracy of ML4all in estimating the number of iterations that a GD algorithm requires to converge, while keeping the optimization cost negligible.

8.4 The Power of Abstraction

We proceed to demonstrate the power of the ML4all abstraction. We show how (i) commuting the Transform and the Loop operators (i.e., lazy vs. eager transformation) can yield rich performance dividends, and (ii) decoupling the Compute operator from the choice of the sampling method for MGD and SGD can yield substantial performance gains too. In particular, we show how these optimization techniques allow our system to outperform baseline systems as well as to scale in the number of data points and features. Moreover, we show the benefits and overhead of the proposed GD abstraction.

8.4.1 System performance

We compare our system with MLlib and SystemML. As neither of these systems has an equivalent of a GD optimizer, we ran BGD, MGD, and SGD, and used ML4all just to find the best plan for a given GD algorithm, i.e., which sampling to use and whether to use lazy transformation or not. We ran BGD, SGD, and MGD with a batch size of 1,000 in all three systems until convergence. We considered a tolerance of 0.001 and a maximum of 1,000 iterations.

Let us stress three important points. First, the API of MLlib allows users to specify the fraction of the data that will be processed in each iteration. Thus, we set this fraction to 1 for BGD while, for SGD and MGD, we compute the fraction as the batch size over the total size of the dataset. However, the Bernoulli sampling mechanism implemented in Spark (and used in MLlib) does not return exactly the number of sample points requested. For this reason, for SGD, we set the fraction slightly higher to reduce the chances that the sample will be empty. We found this to be more efficient than checking whether the sample is empty and, in case it is, running the sampling process again. Second, we used the DeveloperApi in order to be able to specify a convergence condition instead of a constant number of iterations. Third, as SystemML does not support the LIBSVM format, we had to convert all our real datasets into SystemML's binary representation. We used the source code provided to us by the authors of [8], which first converts the input file into a Spark RDD using the MLlib tools and then converts it into matrix binary blocks. The performance results for SystemML show the breakdown between the training time and this few-seconds conversion time.
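In sketch form, the fraction handed to MLlib (its miniBatchFraction knob) would be computed roughly as follows; the padding factor is our illustrative stand-in for "slightly higher":

```python
def mllib_fraction(algorithm, batch_size, n_points, pad=1.2):
    """Fraction of the data MLlib processes per iteration: 1.0 for BGD,
    batch/total for MGD, and a padded batch/total for SGD so that Spark's
    Bernoulli sampler is unlikely to return an empty sample."""
    if algorithm == "BGD":
        return 1.0
    fraction = batch_size / n_points
    return min(1.0, fraction * pad) if algorithm == "SGD" else fraction
```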

Figure 9 shows the training time in log scale for different real datasets and two larger synthetic ones. Note that for our system, the plots of SGD and MGD show the runtime of the best plan for the specific GD algorithm. Details on these plans as well as the number of iterations required to converge can be found in Table 4 in Appendix E. From these results we can draw three observations, discussed with Figure 9 above.



A Cost-based Optimizer for Gradient Descent Optimization
Zoi Kaoudi · Jorge Quiané-Ruiz · Saravanan Thirumuruganathan · Sanjay Chawla · Divy Agrawal

Key observations:
1. The error sequence follows a known distribution.
2. The shape of the error sequence on a sample D′ ≪ D approximates the shape of the error sequence over D.

Speculative approach:
1. Take a sample D′ ≪ D.
2. Run GD on it up to a larger (looser) error.
3. Fit the distribution and extrapolate the number of iterations needed to reach the target error on D.
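A minimal sketch of step 3 (the power-law model err(k) = c·k^(−a) is an assumption chosen for illustration, not the distribution family ML4all actually fits):

```python
import numpy as np
from scipy.optimize import curve_fit

def estimate_iterations(errors, target_error):
    """Fit the error sequence observed during a short GD run on a small
    sample D' (stopped at a loose tolerance) and extrapolate how many
    iterations reaching `target_error` would take."""
    ks = np.arange(1, len(errors) + 1, dtype=float)
    (c, a), _ = curve_fit(lambda k, c, a: c * k ** (-a), ks, errors,
                          p0=(errors[0], 0.5), maxfev=10000)
    return int(np.ceil((c / target_error) ** (1.0 / a)))

# Errors recorded while running GD on the sample for error > 0.05,
# then extrapolated down to the target error 0.05.
observed = [0.9, 0.5, 0.35, 0.28, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14]
print(estimate_iterations(observed, target_error=0.05))
```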

[Chart — training time (sec, log scale) of batch GD, stochastic GD, and mini-batch GD on adult, covtype, and rcv1; the gaps span orders of magnitude (e.g., 19,086 sec for batch GD on rcv1), so choosing the right GD plan matters.]