If you can't read please download the document

View

212Download

0

Embed Size (px)

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. Y, MONTH Z 1

Distributed Algorithms for Large ScaleLearning and Inference in Graphical Models

Alexander G. Schwing, Tamir Hazan, Marc Pollefeys, Fellow, IEEE andRaquel Urtasun, Member, IEEE

AbstractOver the past few years we have witnessed an increasing popularity in the use of graphical models for applications incomputational biology, computer vision and natural language processing. Despite the large body of work, most existing learningand inference algorithms can neither cope with large scale problems which require huge amounts of memory and computation,nor can they effectively parallelize structured learning with a small number of training instances. In this paper, we propose afamily of model-parallel learning and inference algorithms which exploit distributed computation and memory storage to handlelarge scale graphical models as well as a small number of training instances. Our algorithms combine primal-dual methods anddual decomposition to preserve the same theoretical guarantees (and for certain parameter choices the same solution) as if theywere run on a single machine. We demonstrate the capabilities of our algorithms in a wide variety of scenarios, showing theadvantages and disadvantages of model-parallelization.

Index Termsgraphical models, big data, sample parallelism, model parallelism

F

1 INTRODUCTION

MOST real-world applications are structured, i.e.,they are composed of multiple random vari-ables which are related. For example, in natural lan-guage processing, we might be interested in parsingsentences syntacticly. In computer vision, we mightwant to predict the depth of each pixel, or its semanticcategory. In computational biology, given a sequenceof proteins (e.g., lethal and edema factors, protectiveantigen) we might want to predict the 3D docking ofthe anthrax toxin. While individual variables could beconsidered independently, it has been demonstratedthat taking dependencies into account improves pre-diction performance significantly.

Prediction in structured models is typically per-formed by maximizing a scoring function over thespace of all possible outcomes, an NP-hard taskfor general graphical models. Notable exceptions aregraphical models with sub-modular energies or lowtree-width structure where inference is performedexactly in polynomial time. Unfortunately, most real-world problems do not belong to this category andonly approximate solutions can be obtained.

A wide variety of approaches have been proposedto obtain approximate solutions to the general infer-ence problem, e.g., via sampling [1] or via iterativealgorithms that solve a max-flow problem [2]. While

A. G. Schwing and R. Urtasun are with the Computer ScienceDepartment, University of Toronto, Canada.E-mail: see http://alexander-schwing.de

T.Hazan and M. Pollefeys are with University of Haifa, Israel andETH Zurich, Switzerland.

Fig. 1. 283 label disparity map computed from a12 MPixel stereo pair.

aforementioned approaches generally consider theproblem more globally, message passing approachesare interesting due to the local structure. While ef-fective in the case of small inference problems, mostapproaches assume that the graphical model and theinvolved variables fit into memory. This is an issueas the size of real-world graphical models is growingrapidly when operating with big data. These days,cameras for example have several mega-pixels, evenin low-cost cellphones. Holistic approaches on theother hand aim at solving multiple related tasks,hence the complexity also grows with the numberof jointly considered problems. But the challenge oflarge-scale models is not only related to memoryconsumption but also to computational complexity.Typical message passing algorithms for example scalewith the number of dimensions involved in the largestfunction, e.g., they are already inherently quadratic forgraphical models involving pairwise connections.

Learning with these models is typically done in thecontext of regularized risk minimization, where a log-linear model is employed, and a convex surrogate loss

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. Y, MONTH Z 2

is optimized. Most learning algorithms have the samelimitations as inference techniques, since they performinference for each parameter update step. On the onehand, max-margin Markov networks (M3Ns) or struc-tured SVMs (SSVMs) [3], [4] minimize the structuredhinge-loss, and perform a loss-augmented inferenceusing cutting plane optimization. Conditional randomfields (CRFs) [5] on the other hand, minimize the log-loss, and thus require the computation of the nor-malizer of a probability distribution, i.e., the partitionfunction, during each iteration. Primal-dual methods,such as [6], [7], have the advantage of interleavinglearning and inference. As a consequence they do notrequire convergence of the inference task in order toupdate the weights. While this results in reduced com-putational complexity, they assume, just like CRFs,SSVMs and M3Ns, that the graphical model and themessages can be stored on a single computer. More-over training with mini-batches containing a smallnumber of training instances is increasingly popularthese days. However, the predominant data-parallelscheme is not very effective for this setting. Thisagain limits the applicability of learning algorithmsfor large-scale problems and/or small mini-batch size.

To overcome these issues, in this work we proposea family of learning and inference algorithms whichexploit distributed computation and memory storageto handle large scale graphical models like the oneinvolved in the disparity computation illustrated inFig. 1. Our algorithms combine primal-dual methodsand dual decomposition to preserve the same theoret-ical guarantees (and for certain choices of parametersthe same solution) as if they were run on a singlemachine. For the learning task we make use of inter-leaving previously suggested in [6], [7]. All in all thepresented work extends [8], [9] to region graphs andprovides a unified implementation. We demonstratethe capabilities of our algorithms in a wide variety ofscenarios, showing the advantages and disadvantagesof model-parallelization.

In the remainder, we first review both non-distributed inference and non-distributed learning al-gorithms. Afterwards we discuss a dual decomposi-tion approach for parallelization of both task beforedescribing some features of our publicly availableimplementation. We then illustrate the applicabilityof the proposed approaches via experiments beforediscussing related work.

2 REVIEW: INFERENCE AND LEARNINGIn this section we review inference and learning algo-rithms that employ message passing techniques. Wefirst define the general setting.

Let x X be the input data, e.g., images orsentences, and s S the output elements, e.g., imagesegmentations or parse trees. We assume the outputspace elements s to lie within a discrete product space,

i.e., s = (s1, s2, . . .) S =i Si with Si = {1, . . . , |Si|}.

Let : X S RF be a feature mapping whichprojects from the object and output product space toan F -dimensional feature space.

During inference, we are interested in finding themost likely output s S given some data x X . Tothis end we maximize a distribution p ranging over alloutput symbols. We model p to be log-linear in someweights w RF , i.e.,

p(s | x,w) exp(w>(x, s)/

). (1)

We introduced the temperature to adjust the smooth-ness of the distribution, e.g., for = 0 the distribu-tion is concentrated on the maximizers of the scorew>(x, s), often also referred to as negative energy.

The learning task is concerned with finding thebest parameters w, given sample pairs (x, s)containing data symbol x and its correspondinggroundtruth output s. In the following we brieflyreview both, inference and learning in greater detail,and provide the intuitions necessary to follow thederivations of the distributed algorithms.

2.1 InferenceDuring inference, we are interested in finding themost likely output space configuration s S givendata x X . More formally we aim at solving

s = arg maxsS

w>(x, s). (2)

This task is generally intractable, i.e., NP-hard, giventhe exponentially many configurations within S.

To overcome this issue, we follow a variationalapproach and approximate the distribution over out-comes with a simpler factorized distribution. Towardsthis goal, we rephrase the task as the one of finding adistribution p(s) which minimizes the KL-divergenceDKL(p | exp(w>(x, s)/)) between the modeled dis-tribution p(s | x,w) given in Eq. (1) and p(s). Thecorresponding program is equivalent to

maxp

sS

p(s)(s | x) + H(p), (3)

with score (s | x) = w>(x, s), the entropy H(p) =s p(s) ln p(s) and the probability simplex of

corresponding dimension. Note that the task givenin Eq. (3) is identical to the program given inEq. (2) when = 0 since the distribution p will be any distribution over the set of maximizersarg maxsS (s | x). However, we introduce smooth-ness via > 0 which admits optimality guarantees ofapproximations that are easier to achieve [10], [11].

The exponentially sized output spaces render theoptimization of even the smooth objective intractablein general, and assumptions as well as approxima-tions are required. For many applications a singlemeasurement k, k {1, . . . , F}, i.e., an element ofthe feature vector , consists of functions that depend

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. Y, MONTH Z 3

s1 s2 s3

s{1,2}

s{1,2,3}

Fig. 2. Hasse diagram of a distribution that cannot bevisualized by a factor graph. Note the sets of regionswi