Analysis of Algorithms for Distributed Optimization

Sanjay Krishnan

Virginia Smith

ABSTRACT

Gradient descent (GD) and coordinate descent (CD) are two competing families of optimization algorithms used to solve numerous machine learning tasks. The proliferation of large, web-scale datasets has led researchers to investigate minibatch variants of these algorithms in parallel and distributed settings. However, there is a lack of consensus in the community about the relative merits of these two algorithm families in various settings, and no best practice for evaluating performance. The number of free parameters associated with each algorithm makes it difficult not only to choose between algorithms, but also to tune each independently. To this end, we formalize a methodology to test both algorithms in a distributed setting. We implement the algorithms using Spark and analyze the results on a number of large, real-world datasets. Our results suggest several systems-related issues that are crucial to performance, including batch size, data skew, and patterns of communication. We comment on plausible remedies for these issues and also formalize new, open research questions within this area.

1. INTRODUCTION

Learning the parameters of a statistical model often involves solving a convex optimization program. Many models can be formulated as the minimization of a convex regularized loss over the examples x_i and model parameters θ:

\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \xi(x_i; \theta) + \lambda \phi(\theta) \qquad (1)

We can think of ξ as measuring the model's error for a given example and φ as a penalty on the parameters to prevent over-fitting. Gradient descent (GD) and coordinate descent (CD) are two competing algorithm families for solving (1); both have been successfully applied in a number of problem domains including regression, classification, and collaborative filtering. The growing size of data has led many researchers to investigate parallel and distributed variants of algorithms from both families [1, 2, 3, 4, 5].

However, there is still a lack of consensus about when to apply either algorithm or its variants. This paper proposes a methodology for comparing two state-of-the-art algorithms, one from each family: Pegasos (GD) and Stochastic Dual Coordinate Ascent (CD). We focus on a single end-to-end use case: large-scale classification. In particular, we compare the performance of these two algorithms on L2-norm regularized, L1 hinge-loss Support Vector Machines:

\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \theta^T x_i) + \frac{\lambda}{2} \|\theta\|_2^2 \qquad (2)

We implement the algorithms on Spark [6] and explore tradeoffs between running time, communication costs, and parameter selection. These results suggest new systems-related issues that are crucial to algorithm performance, and are a first step towards developing the type of heuristics that would allow for automated algorithm selection and tuning. Our main contributions are:

• A detailed distributed systems analysis of Pegasos and SDCA.

• Experimental results on a number of real-world datasets to compare the tradeoffs between these algorithms.

• A concluding proposal to adaptively tune batch size, cluster size, and partitioning meta-parameters.

In Section (2), we survey similar analyses on distributed machine learning algorithms and other related work. In Section (3), we introduce the theory of both algorithms and our specific application to SVM learning. In Sections (4) and (5), we describe the distributed architecture and a general form for both algorithms. In Section (6), we describe our experimental methodology. In Section (7), we discuss our experiments, datasets, and results, and we conclude in Section (8). All of our experiments, implementations, and results are available online.1

2. RELATED WORK

In recent years, there have been many advances in algorithms for distributed machine learning [3, 4, 5]. This is particularly true in the area of optimization, which is a crucial component in solving most machine learning tasks. The goal in distributed optimization is to use the resources of multiple machines in order to optimize an objective simultaneously, alleviating the computational load of traditional algorithms and in some cases enabling a solution to be reached at all.

1 https://github.com/gingsmith/DistributedOptimization.git


Some of the most popular methods for distributed optimization include variants of gradient descent (GD) and coordinate descent (CD). There are other families of optimization algorithms, including interior point and simplex methods, which have been extensively studied in large-scale settings [7, 8, 9, 10]; however, there is a general consensus that first-order gradient methods are the state-of-the-art for unconstrained loss minimization.

Despite being designed with serial execution in mind, GD and CD algorithms can be adapted to a distributed environment by processing the data in minibatches. A minibatch B ⊆ Ω is a subset of all training examples, Ω. This schema lends itself to a MapReduce-type environment, as it is possible to process all of the datapoints of the minibatch in parallel, and then aggregate the final result on the master. The size of the minibatch can vary from 1 to n, the size of the training set. Interestingly, the size of the minibatch plays an important role in determining the convergence of the algorithm at hand [1, 11, 2]. Indeed, choosing the batch size can be a somewhat delicate process, a phenomenon that has been shown theoretically by Takac et al. [12]. A batch size that is too small can result in slow updates, but a batch size that is too big can cause divergence. There are also clear systems-related issues with the batch size, as it presents a tradeoff between computation and communication in a distributed environment. Though the systems-related implications of this tradeoff have been discussed [13], it is still unclear how to tune the batch size parameter in practice for best performance.

Of course, the idea of studying systems-related behavior for algorithms running in a distributed environment is not new. Many have recently analyzed and proposed strategies specifically for improving systems performance of distributed or parallel machine learning algorithms [14, 15, 4, 16]. Though none of these provide an in-depth analysis of the optimization algorithms discussed, we keep the general approach of these works in mind when developing our experimental methodology.

The focus of this study is on two particular variants of gradient descent and coordinate descent, Pegasos and SDCA. Pegasos is a recently popularized sub-gradient descent method used to solve Support Vector Machines in the primal. SDCA is a general-purpose coordinate descent algorithm, where updates are performed stochastically by maximizing the dual objective. These algorithms, as well as the use case we consider, Support Vector Machines, are described in detail in Section 3.

3. REGULARIZED LOSS MINIMIZATION

Many machine learning tasks can be cast as regularized loss minimization problems. These optimization problems take the following form:

\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \xi(x_i; \theta) + \lambda \phi(\theta) \qquad (3)

where n is the training set size, θ is the parameter vector over which we minimize, ξ is a non-negative convex loss, and λ is a regularization parameter used to penalize some function, φ, of the parameter vector. This general formula describes the optimization objective for a number of machine learning algorithms, including L1-regularized (LASSO) and L2-regularized (ridge) regression, L1-loss and L2-loss SVMs, structural SVMs, and logistic regression. We consider a specific type of regularized loss minimization, the L1-loss, L2-regularized Support Vector Machine.

3.1 Support Vector Machines

Support Vector Machines (SVMs) are an effective and popular method used for classification [17]. We consider SVMs over a set of training examples {(x_i, y_i)}_{i=1}^{n}, where each x_i is a k-dimensional feature vector, and y_i ∈ {−1, +1} is a classification label. Formally, the goal is to find a parameter θ ∈ R^k such that:

\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \theta^T x_i) + \frac{\lambda}{2} \|\theta\|_2^2 \qquad (4)

is minimized, where ‖·‖_2 represents the L2 norm, and θ^T x_i denotes a standard inner product. The loss max(0, ·) is known as the hinge-loss, and the associated objective is referred to as the L1-SVM. We study L2-regularized SVMs, as indicated by the norm used to constrain the parameter vector, θ.
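
For concreteness, the following is a minimal sketch, not our actual implementation, of evaluating objective (4); it assumes a dense parameter vector and examples stored as sparse feature maps (the names SvmObjective, dot, and objective are illustrative).

// Sketch: evaluating the L1-loss, L2-regularized SVM objective (4).
// The sparse representation (Map[Int, Double]) is an illustrative assumption.
object SvmObjective {
  // sparse-dense inner product theta^T x: touch only the non-zero entries of x
  def dot(theta: Array[Double], x: Map[Int, Double]): Double =
    x.iterator.map { case (idx, value) => theta(idx) * value }.sum

  def objective(theta: Array[Double],
                data: Seq[(Double, Map[Int, Double])], // (y_i, x_i) pairs
                lambda: Double): Double = {
    val hinge = data.map { case (y, x) => math.max(0.0, 1.0 - y * dot(theta, x)) }.sum / data.size
    val reg   = 0.5 * lambda * theta.map(w => w * w).sum
    hinge + reg
  }
}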

A geometric intuition for the SVM objective is given in Figure 1. The goal is to find a hyperplane θ^T x = 0 such that the distance between the classification groups, 2/‖θ‖, is maximized. The insight and assumption of SVMs is that maximizing the size of this margin will lead to a better generalization on test data.

Figure 1: Support Vector Machine. (Illustration of the separating hyperplane θ^T x = 0 and the margin 2/‖θ‖ between the two classes.)

Solving the objective (4) requires the use of an optimization algorithm. Many algorithms have been proposed to tackle this problem, but for large-scale SVMs (where the training set size is large and computation is a bottleneck) two variants of first-order methods have been particularly popular: stochastic sub-gradient descent (Pegasos) and stochastic dual coordinate ascent (SDCA). We describe both algorithms and their distributed variants below.

3.2 Stochastic Sub-Gradient Descent

Pegasos is an algorithm that performs stochastic sub-gradient descent on the SVM primal objective. Stochastic gradient descent methods have been shown to perform well empirically, and generalize well to test data despite relatively poor performance optimizing the objective function [18, 19]. Pegasos implements classic sub-gradient descent, but with a specifically chosen step size. This is beneficial in tuning the algorithm because it alleviates the need to determine the step size, while still ensuring theoretical convergence guarantees [20].

Pegasos works in the following way. The algorithm requires a training set containing features x_i ∈ R^k and labeled points y_i. The user must also specify a stopping criterion, such as a total number of iterations T, and the regularization parameter λ. First, the parameter vector w ∈ R^k is initialized to zero. At each subsequent iteration, the step size is updated, the gradient is calculated, and an update is made that moves the parameter vector in the direction opposite the gradient according to the specified step size. Running Pegasos with minibatches in a distributed environment requires selecting a random subset of the data stored across multiple machines, finding the gradient of that subset in parallel, and then aggregating the gradients to make the final update to the parameter vector. Pseudocode for distributed Pegasos with minibatches is given in Algorithm 1.

Algorithm 1 Distributed Pegasos with Minibatches

Input: {(x_i, y_i)}_{i=1}^{n}, λ > 0, T ≥ 1, b ≤ n
Initialize: w_1 = 0
for t = 1, 2, ..., T:
    set η_t = 1/(λt)
    choose B_t ⊆ {1, 2, ..., n} at random, where |B_t| = b
    map: B_t^+ = {i ∈ B_t : y_i ⟨w_t, x_i⟩ < 1}
    reduce: γ_t = Σ_{i ∈ B_t^+} y_i x_i
    update: w_{t+1} = (1 − η_t λ) w_t + (η_t / b) γ_t
Output: w_{T+1}
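
To make the map/reduce structure of Algorithm 1 concrete, the following is a minimal Spark sketch of a single Pegasos minibatch step. It assumes the examples are cached as (label, sparse-feature-map) pairs; the helper names (dot, step, fraction) are illustrative and not the exact code of our implementation.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DistributedPegasos {
  // sparse-dense inner product w^T x
  def dot(w: Array[Double], x: Map[Int, Double]): Double =
    x.iterator.map { case (i, v) => w(i) * v }.sum

  // one minibatch update of Algorithm 1
  def step(sc: SparkContext, data: RDD[(Double, Map[Int, Double])],
           w: Array[Double], lambda: Double, t: Int, fraction: Double): Array[Double] = {
    val eta = 1.0 / (lambda * t)                        // step size eta_t = 1 / (lambda * t)
    val wB  = sc.broadcast(w)
    val batch = data.sample(false, fraction, t.toLong)  // minibatch B_t, sampled without replacement
    val b = math.max(batch.count(), 1L).toDouble        // realized minibatch size

    // map: keep margin violators (y <w, x> < 1); reduce: gamma_t = sum of y_i x_i
    val gamma = batch
      .filter { case (y, x) => y * dot(wB.value, x) < 1.0 }
      .map { case (y, x) => x.map { case (i, v) => i -> y * v } }
      .fold(Map.empty[Int, Double]) { (a, c) =>
        c.foldLeft(a) { case (acc, (i, v)) => acc.updated(i, acc.getOrElse(i, 0.0) + v) }
      }

    // update: w_{t+1} = (1 - eta * lambda) w_t + (eta / b) gamma_t
    val wNext = w.map(_ * (1.0 - eta * lambda))
    gamma.foreach { case (i, v) => wNext(i) += (eta / b) * v }
    wNext
  }
}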

3.3 Stochastic Dual Coordinate Ascent

An alternative to Pegasos is stochastic dual coordinate ascent. SDCA performs a stochastic version of coordinate ascent on the dual objective. Recent work has shown that SDCA outperforms SGD, both theoretically and empirically [21, 22]. However, it is still unclear how the distributed variants of these algorithms perform in practice.

As with Pegasos, SDCA works first by initializing the parameter vector w ∈ R^k to zero and iterating over a training set {(x_i, y_i)}_{i=1}^{n} of size n. The algorithm requires that the user specify a stopping criterion, such as a total number of iterations T, and the regularization parameter λ. However, SDCA has the added benefit of providing a certificate of optimality using the duality gap.

At each iteration of SDCA, the step size is updated, and for each training point (x_i, y_i) and associated dual variable α_i, the variable is updated so as to maximize the dual objective while keeping all other coordinates fixed. This update can be performed in a distributed environment by selecting a minibatch of data points at random, iteratively updating the associated dual coordinates locally on each machine, and then applying the final update to the parameter vector w on the master. Pseudocode for distributed SDCA with minibatches is given in Algorithm 2.

Algorithm 2 Distributed SDCA with Minibatches

Input: {(x_i, y_i)}_{i=1}^{n}, λ > 0, T ≥ 1, b ≤ n
Initialize: w_1 = 0, C = 1/(nλ), α^{(1)} = 0
for t = 1, 2, ..., T:
    choose B_t ⊆ {1, 2, ..., n} at random, where |B_t| = b
    for each i ∈ B_t:
        set G = y_i w_t^T x_i − 1
        set PG = min(G, 0) if α_i = 0; max(G, 0) if α_i = C; G if 0 < α_i < C
        if PG ≠ 0:
            update α_i^{(t+1)} = min(max(α_i^{(t)} − G / ‖x_i‖_2^2, 0), C)
    reduce: γ_t = Σ_{i ∈ B_t} (α_i^{(t+1)} − α_i^{(t)}) y_i x_i
    update: w_{t+1} = w_t + γ_t
Output: w_{T+1}
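
The per-coordinate logic of Algorithm 2 is easiest to see on a single partition. The following is a minimal, local sketch of the dual updates (again assuming sparse feature maps; the names LocalSdca and updatePartition are illustrative, not our actual code).

object LocalSdca {
  def dot(w: Array[Double], x: Map[Int, Double]): Double =
    x.iterator.map { case (i, v) => w(i) * v }.sum

  def sqNorm(x: Map[Int, Double]): Double = x.valuesIterator.map(v => v * v).sum

  // Processes one partition's slice of the minibatch and returns the partial update
  // gamma = sum_i (alpha_i^new - alpha_i^old) * y_i * x_i as a sparse map.
  def updatePartition(examples: Iterator[(Int, Double, Map[Int, Double])], // (index i, y_i, x_i)
                      w: Array[Double], alpha: Array[Double], c: Double): Map[Int, Double] = {
    val gamma = scala.collection.mutable.HashMap.empty[Int, Double]
    for ((i, y, x) <- examples) {
      val g = y * dot(w, x) - 1.0
      val pg =                                          // projected gradient under 0 <= alpha_i <= C
        if (alpha(i) <= 0.0) math.min(g, 0.0)
        else if (alpha(i) >= c) math.max(g, 0.0)
        else g
      if (pg != 0.0) {
        val newAlpha = math.min(math.max(alpha(i) - g / sqNorm(x), 0.0), c)
        val delta = newAlpha - alpha(i)
        alpha(i) = newAlpha
        x.foreach { case (j, v) => gamma(j) = gamma.getOrElse(j, 0.0) + delta * y * v }
      }
    }
    gamma.toMap
  }
}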

3.4 Extended Theory Discussion

Pegasos is one popular variant of stochastic sub-gradient descent (SGD). Many other methods aimed at large-scale problems have been proposed in the literature [23, 24, 25, 26]. In this work, we treat Pegasos as a representative of the SGD family. It is also important to note that many variants of SGD include additional optimizations or averaging schemes; the same optimizations can be applied to SDCA. Consequently, we felt it was fair to compare a batch version of SDCA against Pegasos. For an overview of algorithms for classification, variants of these algorithms, and parameters to tune, see Figure (15) in the appendix.

There are situations in which an optimizer has to use a dual method like SDCA. If a user wants to use a kernel SVM instead of a standard linear SVM, a dual method allows the user to do so. A notion of duality also allows for confidence intervals on the optimization: the duality gap gives a guarantee of how much the objective value at the current iterate can still improve.

4. DISTRIBUTED ARCHITECTURE

We implement Pegasos and SDCA in Spark running on a cluster of m1.large Amazon EC2 nodes. This experimental setting contrasts with the distributed HPC framework used in Takac et al. [27]. Our primary motivation for experimenting on commodity systems is twofold: (1) accessibility and (2) scalability.

Classification is often part of a larger OLAP pipeline involving Hadoop/Spark/MapReduce. A system on HPC hardware or an optimized MPI architecture forces users to load processed data from their existing pipeline into a new system. Instead, we integrate our system directly with the common storage engines available on Spark (HDFS, S3, Tachyon). Similarly, we test the scale-out (increasing the number of nodes) and scale-up (increasing the CPU/memory of each node) tradeoffs. HPC systems are more difficult to scale, and we hope to provide results that users can extrapolate to their own datasets and hardware.

In this section, we discuss the three primary steps of both SDCA and Pegasos: (1) sampling, (2) mapping, and (3) aggregation. Surprisingly, these algorithms are functionally very similar both in implementation and per-update performance.

Figure 2: We load the data into an RDD, which is subsequently sampled into partitioned minibatches.

4.1 Sampling/Minibatches

In the previous section, we introduced minibatch optimization. Figure 2 shows how we construct the minibatches. We assume that the data is stored on a type of distributed storage with partitions. Spark supports flat files, the Hadoop File System, Tachyon, and Amazon S3 storage media. We use Spark's RDD as an abstraction for the underlying storage.

We model each example as a sparse vector (implemented with a hash table), and the parameters as dense vectors (implemented with a float array). Upon loading the data into an RDD[SparseClassificationPoint], we cache it in memory. This way the parsed data is partitioned and kept in distributed memory.

We construct the minibatches by sampling from the RDD[SparseClassificationPoint]. The theory for stochastic optimization has been developed for uniform random sampling with replacement. In practice, however, sampling without replacement is used, both for mathematical and for performance reasons [28]. Thus, we choose to sample without replacement. The resulting RDD[SparseClassificationPoint] contains a partitioned minibatch, typically consisting of hundreds or thousands of examples.
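
A minimal sketch of this load/cache/sample pattern is shown below; the parser, the input path, and the field names of SparseClassificationPoint are placeholder assumptions rather than our actual code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MinibatchSampling {
  case class SparseClassificationPoint(label: Double, features: Map[Int, Double])

  // hypothetical parser for a LibSVM-style line: "label idx:val idx:val ..."
  def parse(line: String): SparseClassificationPoint = {
    val tokens = line.trim.split("\\s+")
    val feats = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt - 1) -> v.toDouble
    }.toMap
    SparseClassificationPoint(tokens.head.toDouble, feats)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("minibatch-sampling"))
    // cache the parsed data so that every iteration reuses the in-memory RDD
    val data: RDD[SparseClassificationPoint] = sc.textFile(args(0)).map(parse).cache()
    // one minibatch: sample a fraction of the examples without replacement
    val batch = data.sample(withReplacement = false, fraction = 1.0 / 2048, seed = 42L)
    println(s"minibatch size: ${batch.count()}")
    sc.stop()
  }
}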

Figure 3: Each mapper applies an iterator to the partitioned minibatch. The iterator largely performs sparse vector operations.

4.2 Mapping over the Partitions

The next step is to compute the necessary model updates. In Figure 3, we show a schematic of this update. Given the current model parameters, we iterate over each example in the partition, computing an update to the model.

In our implementation, we form an iterator which we pass as a closure to each one of the map tasks. The tasks then apply the iterator to each partition of the minibatch. Spark does provide an abstraction to avoid direct interaction with the partitions; however, we found that the iterator-based updates were considerably faster. We defer analysis of this to future work.

Within the iterator, we calculate the model updates using sparse linear algebra primitives. The updates are simple inner products, which correspond to joining two hash maps in our sparse vector implementation. Using the Scala hash map as the base data structure for linear algebra is a potential performance bottleneck, but it is not a fundamental limitation of our implementation: we could easily apply another sparse vector library such as BiDMat, Breeze, or ScalaNLP.
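
As an illustration of this iterator-based update, the following sketch computes a per-partition hinge-loss subgradient contribution with mapPartitions; the broadcast parameter vector and the particular update computed here are assumptions for illustration only.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object PartitionUpdates {
  // Each map task receives the iterator over its partition of the minibatch and
  // performs only sparse vector operations against the broadcast parameters.
  def mapOverPartitions(batch: RDD[(Double, Map[Int, Double])],
                        wB: Broadcast[Array[Double]]): RDD[Map[Int, Double]] =
    batch.mapPartitions { iter =>
      val w = wB.value
      var local = Map.empty[Int, Double]
      for ((y, x) <- iter) {
        // sparse-dense inner product w^T x
        val margin = y * x.iterator.map { case (i, v) => w(i) * v }.sum
        if (margin < 1.0)                             // hinge-loss subgradient contribution y_i x_i
          local = x.foldLeft(local) { case (acc, (i, v)) =>
            acc.updated(i, acc.getOrElse(i, 0.0) + y * v)
          }
      }
      Iterator.single(local)                          // one partial update per partition
    }
}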

Figure 4: After the minibatch iteration, the results need to be aggregated, and the model parameters need to be updated.

4.3 Aggregation and Convergence Testing

After the map tasks run the iterator over each partition, we aggregate the results on the master node (Figure 4). This aggregation is a sum reduction for Pegasos. In SDCA, we not only have to sum the updates, but also have to update the dual variables. We implicitly assume the model parameters can fit in memory. The master keeps copies of the model parameters at each iteration, and restarts the process after the update.

In the current implementation, we set a constant number of iterations, empirically determined for a 10^-5 accuracy on the objective value. In an automated system, we would want to automatically determine stopping conditions. Unfortunately, calculating the objective value or test error can be quite time consuming, requiring a full pass over the data. For our experiments, we calculated this value every k iterations; it is an open question how to do this most efficiently.
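
A minimal sketch of this aggregation step is shown below; the names and the scaling factor are illustrative assumptions, and the periodic convergence check is indicated in the trailing comment.

import org.apache.spark.rdd.RDD

object Aggregation {
  // element-wise sum of two sparse updates
  def sumSparse(a: Map[Int, Double], b: Map[Int, Double]): Map[Int, Double] =
    b.foldLeft(a) { case (acc, (i, v)) => acc.updated(i, acc.getOrElse(i, 0.0) + v) }

  // sum the per-partition updates on the master and apply them to the dense parameters
  def aggregateAndUpdate(partials: RDD[Map[Int, Double]],
                         w: Array[Double], scale: Double): Array[Double] = {
    val gamma = partials.fold(Map.empty[Int, Double])(sumSparse)
    val wNext = w.clone()
    gamma.foreach { case (i, v) => wNext(i) += scale * v }
    wNext
  }
  // A full objective evaluation requires a pass over all of the data, so it would only
  // be computed every k iterations, e.g. if (t % k == 0) evaluate the objective.
}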

4.4 Analysis

To analyze the performance of both techniques, we first need to define some notation:

• N: the number of examples

• D: the number of features

• N_b: the number of batches

• b: the average number of samples in a batch

• s_i: the sparsity of example i, i.e., the fraction of non-zero features

Both techniques start by calculating the gradient with respect to the current sampled example. This is a sparse-dense inner product that requires O(D s_i) time. Pegasos does not require additional vector computation, so each update is completed in O(D s_i) time.

As SDCA updates both the primal and dual variables, it does require an additional sparse-sparse inner product. However, SDCA runs this calculation only for examples where the box constraints described in the previous section are not tight. Consequently, SDCA requires O(D(1 + s_i)) / Ω(D s_i) time per example. In practice, as the result converges, the constraints are tight for more of the examples, and we find that Pegasos and SDCA have similar computation per batch.
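
For concreteness, these are the two inner-product kernels referred to above, written against the hash-map representation of Section 4.2 (a sketch; the complexity comments count only the non-zero entries actually visited).

object SparseKernels {
  // sparse-dense inner product w^T x used by both algorithms: O(D s_i) work
  def sparseDenseDot(w: Array[Double], x: Map[Int, Double]): Double =
    x.iterator.map { case (i, v) => w(i) * v }.sum

  // sparse-sparse product x^T x (i.e. ||x||_2^2) needed by the SDCA coordinate update
  def sparseSqNorm(x: Map[Int, Double]): Double =
    x.valuesIterator.map(v => v * v).sum
}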

4.4.1 Data Skew

The updates within each iterator are a potential point of vulnerability to data skew. If there is a lot of variation in s_i, then different batches may complete at different rates.

Even if we assume perfectly random partitioning, data skew can be quite significant. Applying the CLT and a tail bound on normal random variables, we can show that the expected completion time for Pegasos is:

O\left(D b \mu_s + \sigma_s \sqrt{2 D^2 b \log N_b}\right) \qquad (5)

For SDCA it is:

O\left(D b (\mu_s + 1) + \sigma_s \sqrt{2 D^2 b \log N_b}\right) \qquad (6)

where μ_s and σ_s denote the mean and standard deviation of the example sparsities.

Increasing the number of batches increases the chance of stragglers. This problem can be compounded if the data is not partitioned randomly, e.g., under range-based horizontal partitioning.

4.4.2 Communication

The map step is a key point of communication, as we have to transfer the model parameters to each one of the mappers. In Pegasos, we only have to transfer a dense vector w whose size is the number of features D. In SDCA, we have to transfer w and, in addition, a dense vector α whose size is the number of examples N. Therefore, the outgoing communication (in floating point numbers) from the master to the workers is O(N_b D) for Pegasos and O(N_b(N + D)) for SDCA. Spark uses map-side aggregation for iterated mappers, so we only have to process O(N_b) updates for both Pegasos and SDCA. Therefore, the per-iteration communication for Pegasos is:

O(N_b (D + 1)) \qquad (7)

and for SDCA:

O(N_b (D + N + 1)) \qquad (8)

Interestingly, the communication is not affected by the layout of the data. If we did not have map-side aggregation, this might not be the case.
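
As a rough worked example of equations (7) and (8) (a sketch: the choice of N_b = 8 batches is an assumption, and the rcv1 sizes are taken from Table 1):

object CommunicationCounts {
  // per-iteration communication in floating point numbers, following (7) and (8)
  def pegasosFloats(numBatches: Long, d: Long): Long       = numBatches * (d + 1)
  def sdcaFloats(numBatches: Long, d: Long, n: Long): Long = numBatches * (d + n + 1)

  // rcv1 (D = 47,236 features, N = 20,242 training examples) with N_b = 8:
  //   pegasosFloats(8, 47236)     -> 377,896 (~3.8e5 floats per iteration)
  //   sdcaFloats(8, 47236, 20242) -> 539,832 (~5.4e5 floats per iteration)
}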

5. EXTENDED ARCHITECTURAL DISCUSSION

In this section, we consider a few speculative points about how different system architectures may affect these algorithms.

5.1 More powerful cluster nodes

To fully take advantage of more powerful cluster nodes, we would have to optimize the sparse linear algebra. In SVMs, the gradients are dot products, which can be written as matrix multiplications. Matrix multiplications can take advantage of specialized hardware such as GPUs or of optimized vendor libraries such as Intel MKL.

5.2 Limit Communication in SDCA

Since each mapper touches samples that come from a single partition, we can limit communication by having each mapper maintain the dual variables for those examples. This makes SDCA's communication exactly the same as Pegasos's, which is surprising given that SDCA updates both primal and dual variables. That said, we could modify Pegasos to include tracking of the dual variable. However, as the algorithm is not a dual method, this tracking would only be useful for measuring convergence and would not help the algorithm converge to the right solution any faster.

5.3 Effect of Storage Media

We cache the parsed data in an RDD, and we have found that forcing this cache makes our subsequent iterations much faster. However, this is only possible if the data fits in memory. See our experiments for a characterization of iteration times for different storage media.

5.4 Data Acquisition

We motivated our system with a data pipeline where the data is featurized and resides in data-warehouse-style storage. However, some classification tasks may have different acquisition pipelines. For example, in NLP tasks, preprocessing and featurization is a significant bottleneck. We may want to featurize and clean the data if and when we sample the minibatch. If a classifier with 80% test accuracy does not require a full pass over all of the data, then we can save some effort.

Likewise, if we change the data warehouse assumption to a streaming context, we would have to modify our approach. Using a fraud detection scenario as an example, we may want to retrain our model quickly based on new data streaming in. In this case, we would have to revise our iterative workflow to prioritize more recent data and would need a more complicated minibatch architecture.

6. METHODOLOGY

Based on the analysis in the previous section, we designed our experiments to understand the tradeoffs between the two algorithms. The analysis identifies a few parameters that play a role: (1) the number of partitions, (2) the batch sampling fraction, (3) the sparsity of the dataset, (4) the variability of the dataset, and (5) the usability of the learned model. Accordingly, we design our experiments to measure these variables. In this section, we detail our experimental methodology and how we measure these values.

6.1 Partitions

An RDD blocks its data according to the file system from which it loads the data. For example, in HDFS, files are broken into 64 MB blocks. We manually override this default behavior in testing and set the number of partitions to at least the number of mappers in the cluster. As far as we know, our changes do not trigger an additional shuffle, and the locality of the data is preserved.
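
A minimal sketch of this override is shown below; the path and mapper count are placeholders, and the second argument of textFile is the partition hint (minPartitions in recent Spark versions).

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Partitioning {
  // load the dataset with at least as many partitions as there are mappers, instead of
  // relying on the file system's default block-based partitioning
  def loadWithPartitions(sc: SparkContext, path: String, numMappers: Int): RDD[String] =
    sc.textFile(path, numMappers)
}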

6.2 Sampling Fraction

Partitions may not be evenly blocked, and there is randomness in the sampling, so batch sizes may differ between partitions and iterations. Consequently, we specify batch sizes by the fraction of examples that we sample. For large datasets, these fractions are small (e.g., 1/2048), and we ensure that within one partition we are iterating through no more than 1000 examples.
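
One way to enforce this cap is to bound the requested fraction by the per-partition budget, as in the sketch below (the function and its defaults are illustrative assumptions, not our actual code).

object BatchFraction {
  // cap the sampling fraction so that no partition iterates over more than
  // maxPerPartition examples on average
  def cappedFraction(requested: Double, n: Long, numPartitions: Int,
                     maxPerPartition: Int = 1000): Double = {
    val examplesPerPartition = n.toDouble / numPartitions
    math.min(requested, maxPerPartition / examplesPerPartition)
  }
  // e.g. cappedFraction(1.0 / 64, n = 8407752L, numPartitions = 10) caps kdd's batch at
  // roughly 1000 examples per partition (fraction ~ 0.0012) instead of 1/64
}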

6.3 Communication and Computation

We directly apply the analysis presented in Section (4) to count communication in the algorithms. The unit of measure is the number of floating point numbers sent between the master and the workers. This ignores overheads due to serialization in Spark, but we assume that this overhead is a constant factor that is consistent for both algorithms. Furthermore, our analysis could be extended to Hadoop by modifying that constant factor. Similarly, for computation counts, we estimated the number of floating point operations required for each iteration.

6.4 Usability

While the other characteristics give us generalizable properties of the algorithms, we also wanted to understand how practical our implementations are. By nature, these results are specific to running the algorithms on Spark. For example, we considered the wall clock time of each algorithm.

We also looked at test accuracy as opposed to the value of the objective function. The test accuracy measures the accuracy of the model on held-out data, which is the important metric for a user. We further consider the cost of running the system by combining the running time, the number of nodes, and the hourly cost of the on-demand EC2 instances.

Finally, we also measured the failure modes of the algorithms: for a poor choice of parameters, does the algorithm give incorrect solutions, poor performance, or misleading performance?

6.5 Dataset Characteristics

As in Section (4), the primary characteristics that we study are: (1) the number of examples, (2) the number of features, and (3) the sparsity pattern. We measured these in our experimental datasets offline, and did not include these measurements in our time/computation/communication calculations. Takac et al. [12] consider a further parameter for weighting minibatch aggregations: the spectral norm of the data. This value is very difficult to calculate or estimate, so we set it to 1, with the understanding that for large batch sizes SDCA could go unstable (Section (7)).

7. RESULTS

7.1 Datasets and Parameters

Using the described methodology, we analyze the tradeoffs between SGD and SDCA using five real-world datasets with varying characteristics. A summary of the training and testing size, number of features, and sparsity of each dataset is given in Table 1.

Table 1: Datasets for Empirical Study

Dataset      Training    Testing      Features      Sparsity
astro-ph       29,882     32,487        99,757         0.08%
cov1          522,911     58,101            54        22.22%
rcv1           20,242    677,399        47,236         0.16%
kdd         8,407,752    510,302    20,216,830      0.00018%
imagenet       21,576      5,426       160,000          100%

The astro-ph dataset aims to classify abstracts of papers from arXiv as belonging to the field of astrophysics. Cov1 classifies the first level of forest coverage from USGS data. RCV1 is the first class of the Reuters text classification dataset, aiming to classify documents based on language. KDD is an education-related classification task from the KDD Cup 2010. Finally, Imagenet is a massive image database organized according to the WordNet hierarchy; we use a subset of 10 classes and perform 1-vs-all classification. We use the λ parameters for these datasets as specified in [21, 20]. We compare the algorithms on the following parameters:

• Batch Fraction: the fraction of examples sampled from the partition in each update.

• Cluster Size: the number of m1.large EC2 nodes in the cluster.

• Iterations: each minibatch update is counted as one iteration.

• Time: the aggregate wall clock time of the updates, excluding any time taken to diagnose or instrument the methods.

7.2 Batch Fractions and Convergence

Larger batch sizes make both algorithms converge faster. However, if the batch size is too large, SDCA tends to go unstable; that is, the result does not converge. Pegasos does oscillate for larger batch sizes as well, but we did not encounter divergence in our experiments.

We looked at the number of iterations to reach within 20% of the optimal objective value. Figure 5 shows the result of this experiment. We see that the techniques converge increasingly fast until the batch sizes cause instability. It is clear that a system optimally tuned for convergence in relation to the number of iterations would set its batch size as high as possible.

The rcv1 dataset had interesting results: for a larger batch fraction SGD performed better, and for a smaller one SDCA performed better. In the other datasets, SDCA was always faster up to the point that it diverged.


Figure 5: Increasing the batch fraction makes the algorithms converge faster. However, this may also lead to oscillation in the objective value. SDCA may diverge if the batch size is too large. (Panels: astro-ph, cov1, and rcv1; iterations to reach within 20% of the optimal objective vs. batch fraction, for SGD and SDCA. Diverging settings are marked as failures.)

Figure 6: We empirically determined the largest batch size for which both algorithms converge. Given this setting, we compare the suboptimality in the objective value to the number of iterations. The hardware used: (1) astro-ph, 4-node cluster; (2) cov1, 8-node cluster; (3) rcv1, 8-node cluster; and (4) kdd, 10-node cluster. (Panels: astro-ph, rcv1, cov1, kdd1, and imagenet; log suboptimality vs. iteration number, for SDCA and SGD.)

7.3 Convergence Comparison

In our next experiment, we compared the convergence of Pegasos and SDCA in relation to the number of iterations. We accordingly chose the empirically largest batch fraction for both algorithms before the onset of instability. At each iteration, we measured the suboptimality² of the solution. Figure 6 suggests that SDCA is generally faster than Pegasos.

However, the instability and oscillation can be a problem. We see that for the cov1 and astro-ph datasets the oscillation can lead to SDCA results that are less accurate than Pegasos for the same number of iterations. We also found that on the kdd and rcv1 datasets SDCA gave a more significant improvement. We conjecture this is related to the sparsity of the dataset.

7.4 Test Error

To make the convergence results more concrete, we also measured test error on a held-out set of data. We compared the test error to wall clock time (Figure 7). To an end user, test error is important since it measures the efficacy of the learned classifier. We found that the test errors somewhat mirrored the results for the objective values, but the two algorithms were much closer. As expected with test error, the behavior was also much less predictable.

7.5 Batches and Performance

Since the last experiment involved wall clock times, we also measured the average time taken to process each batch. Figure 8 shows results on the rcv1 dataset. Larger batches require more time to process, and each iteration has to wait until all of the mappers for the batch complete their tasks. We found that our implementation of SDCA was affected less by the batch size than Pegasos.

2 We defined the optimal as the result with a precision of 1e-5.

Figure 8: We looked at the mean time required for a single minibatch update (rcv1; mean iteration time vs. batch fraction, for SGD and SDCA).

We believe that this behavior can be explained by a combination of SDCA's regularization and Spark implementation issues. SDCA converts the regularization into box constraints, which require progressively less computation as the answer converges. However, we also observed that Spark's handling of mappers with iterators is not consistent, and we defer diagnosis of this to future work.

While Pegasos does more updates to the model parameters, SDCA does slightly more computation (Figure 9). We can also see that the box constraints add more variance to SDCA's computation, since in some iterations more or fewer examples will be subject to the constraints.


Figure 7: In the same setting as Figure 6, we compared test error to wall clock time. We found the algorithms were much closer with respect to test error. (Panels: astro-ph, rcv1, cov1, kdd1, and imagenet, plotted against wall clock time in seconds, for SDCA and SGD.)

Figure 9: We looked at the number of floating point operations needed for one minibatch update (rcv1; mean iteration computation vs. batch fraction, for SGD and SDCA).

7.6 Data and Compute Skew

Figure 10: The time difference between the quickest iteration and the slowest iteration (max iteration time minus min iteration time, in seconds, vs. batch fraction, for rcv1 and cov1). Iteration times in sparse datasets can vary quite a bit.

We previously mentioned that in each batch update, we need to wait for all of the mappers to finish their tasks. Within each task, we have to do a series of sparse vector dot products. If the vectors in one of the partitions are denser than the rest, then we can face serious data skew problems. In Figure 10, we look at the range of iteration completion times (max − min) and compare it to the batch fraction. We can see that in a uniformly dense dataset like cov1, the batch size does not affect the variation in completion times. However, in rcv1, we see significant variance. Empirically, we found that for our sparse datasets there was a lot of idle time in the cluster.

7.7 Communication

We looked at the communication costs of both algorithms on an 8-node EC2 cluster. As our analysis suggests, SDCA has a slightly higher communication cost when there is map-side aggregation. However, we also calculated the communication as if there were no map-side aggregation. Interestingly enough, SDCA requires on average less communication in this regime. This can be attributed to the variable updates saved due to the box constraints. We measure this behavior on two datasets, cov1 and rcv1, and show the results in Figure 11.

7.8 Scalability

Finally, we compared how the algorithms ran while varying the number of nodes in the cluster (Figure 12). We found that for astro-ph, cov1, and rcv1 there were significant diminishing returns to adding more nodes to the cluster. Our results do not appear to be communication bound; rather, the bottleneck is idle tasks in the cluster. The sparser datasets (astro-ph and rcv1) benefited much more from parallelization. We hypothesize that this is due to smaller tasks, which are more robust to data skew problems.

7.9 Other Measured System Parameters

7.9.1 CPU Cores

Figure 13: Our experiments suggest that we are not CPU bound (time per iteration in seconds vs. number of CPU cores, for astro-ph and cov1, SDCA and SGD).

There is a question of whether we can make our iterations faster with more powerful nodes. We varied the number of cores on a single node, allocating a mapper to each core, and tested the performance. We found a very sharp drop-off after two cores (Figure 13) on the sparser dataset, astro-ph. On the denser dataset, cov1, we did see some gains. This suggests that our sparse vector implementation could be more efficient, or that astro-ph has more idle tasks.

7.9.2 Storage Media Latency

In our implementation, we overrode the default Spark caching behavior and cached the parsed dataset in memory. This assumes that there is enough memory in the cluster to hold the parsed dataset. We measured the time for one update in each iteration for different storage media. Table 2 shows our results for the astro-ph dataset:


Figure 11: Communication compared for both algorithms (cumulative communication in KB vs. batch fraction, for rcv1 and cov1). To generalize our analysis to systems other than Spark, we include extrapolated results for systems without map-side aggregation.

Figure 12: There are quickly diminishing returns to scaling up the system (total time in seconds vs. number of m1.large nodes, for astro-ph, cov1, and rcv1).

Table 2: Media and Iteration Time

Media         SGD       SDCA
Memory        0.35 s    0.27 s
EBS (Disk)    0.58 s    0.79 s
HDFS          1.38 s    1.60 s
S3            8.79 s    11.21 s

7.9.3 Cost

We can also parametrize our results by cost instead of time. Each m1.large EC2 instance costs 24 cents per hour. Figure 14 shows the tradeoff between using a larger cluster and obtaining the result faster. Interestingly enough, on the cov1 dataset it is actually more cost efficient, not just faster, to run the optimization on a 4-node cluster.
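
The underlying calculation is simple; a sketch is shown below (the per-hour rate is the on-demand price quoted above, and rounding of partial hours in EC2 billing is ignored).

object ClusterCost {
  // total cost = number of nodes * wall clock hours * hourly rate per m1.large node
  def cost(numNodes: Int, wallClockSeconds: Double, hourlyRate: Double = 0.24): Double =
    numNodes * (wallClockSeconds / 3600.0) * hourlyRate
  // e.g. cost(4, 1800.0) == 4 * 0.5 * 0.24 = $0.48
}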

8. CONCLUSION AND FUTURE WORK

We implemented two state-of-the-art convex optimization algorithms on Spark to solve large-scale Support Vector Machine problems. We compared the tradeoffs between SDCA and SGD to better understand the best systems-level practices for these algorithms. We found that SDCA generally converged faster; however, it was prone to instability for large batch sizes. We also found that SDCA did particularly well on sparse datasets.

Our results also suggest a few improvements to our implementation: (1) batch sizes can be dynamically changed to prevent oscillation, (2) sparse datasets are a cause of data skew, and appropriate partitioning can more efficiently use all of the mappers, and (3) more communication-optimized implementations are possible with variables stored locally on the mappers.

Our results also introduce a few puzzling open questions: (1) How can we reconcile objective values and test error? (2) Can we determine the optimal size of a cluster beforehand with respect to cost and completion time? (3) How would these results look on a system other than Spark, such as Hadoop?

In future work, we will be working to implement a general regularized loss optimizer for MLBase. This will be a direct generalization of this work, which focused only on SVMs. We will also work to incorporate the insights learned into an autotuner for system parameters such as batch sizes. We will further investigate partitioning details in Spark that we conjecture will make our implementations much faster.

*Special thanks to Martin Jaggi, Ameet Talwalkar, and Shivaram Venkataraman for their help on this project.

Figure 14: We can alternatively parametrize our scale-out by cost rather than completion time (cost in dollars vs. number of m1.large nodes, for astro-ph, cov1, and rcv1).

9. REFERENCES

[1] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In International Conference on Machine Learning (ICML), June 2011.

[2] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Neural Information Processing Systems (NIPS), December 2013.

[3] Peter Richtarik and Martin Takac. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.

[4] Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (NIPS), 2011.

[5] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM, 2011.

[6] Matei Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, June 2010.

[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[8] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.

[9] Philip E. Gill, Walter Murray, and Michael A. Saunders. SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Review, 47(1):99–131, 2005.

[10] Robert E. Bixby, John W. Gregory, Irvin J. Lustig, Roy E. Marsten, and David F. Shanno. Very large-scale linear programming: A case study in combining interior point and simplex methods. Operations Research, 40(5):885–897, 1992.

[11] Peter Richtarik and Martin Takac. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.

[12] Martin Takac, Avleen Bijral, Peter Richtarik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In International Conference on Machine Learning (ICML), March 2013.

[13] T. Yang. Trading computation for communication: Distributed stochastic coordinate ascent. In Neural Information Processing Systems (NIPS), 2013.

[14] Huasha Zhao and John Canny. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In SIAM International Conference on Data Mining (SDM), 2013.

[15] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. In Neural Information Processing Systems (NIPS), 2011.

[16] Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In Neural Information Processing Systems (NIPS), 2013.

[17] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[18] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NIPS), 2008.

[19] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In International Conference on Machine Learning (ICML), 2008.

[20] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. 2010.

[21] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning (ICML), July 2008.

[22] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, February 2013.

[23] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 392–400, 2013.

[24] Yu-Hong Dai and Yaxiang Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on Optimization, 10(1):177–182, 1999.

[25] Elad Hazan, Alexander Rakhlin, and Peter L. Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2007.

[26] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[27] Martin Takac, Avleen Bijral, Peter Richtarik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning (ICML), 2013.

[28] Benjamin Recht and Christopher Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, pages 1–26, 2011.


Figure 15: Classification Flowchart: If a user is faced with a large-scale classification problem, this explains all of the free parameters. In this work, we focus on the parameters in the yellow boxes, and measure how the environment impacts the selection of these parameters. (The flowchart walks from the model family (SVMs, logistic regression, decision trees, neural nets) through the loss/regularization choice, the environment (distributed, parallel shared-memory, serial), the algorithm family (gradient descent vs. coordinate descent), and the primal/dual choice, down to specific algorithms (minibatch Pegasos, SDCA, local SGD). The free parameters include the number of iterations, minibatch size, regularization λ, data layout, update pattern, type of convergence, cluster size, and data sparsity, under time, accuracy, and cost constraints.)


Figure 16: Ganglia plots from the EC2 compute cluster. These screenshots show how data skew can cause uneven compute load on the worker nodes.