
An Object-oriented Spatial and Temporal Bayesian Network for Managing Willows in an American Heritage River Catchment

Lauchlin Wilkinson, Faculty of IT, Monash Univ., AUS

Yung En Chee, School of Botany, Univ. of Melbourne, AUS

Ann E. Nicholson, Faculty of IT, Monash Univ., AUS

Pedro Quintana-Ascencio, Department of Biology, Univ. of Central Florida, USA

Abstract

Willow encroachment into the naturally mixed landscape of vegetation types in the Upper St. Johns River Basin in Florida, USA, impacts upon biodiversity, aesthetic and recreational values. To control the extent of willows and their rate of expansion into other extant wetlands, spatial context is critical to decision making. Modelling the spread of willows requires spatially explicit data on occupancy, an understanding of seed production, dispersal and how the key life-history stages respond to environmental factors and management actions. Nicholson et al. (2012) outlined the architecture of a management tool to integrate GIS spatial data, an external seed dispersal model and a state-transition dynamic Bayesian network (ST-DBN) for modelling the influence of environmental and management factors on temporal changes in willow stages. That paper concentrated on the knowledge engineering and expert elicitation process for the construction and scenario-based evaluation of the prototype ST-DBN. This paper extends that work by using object-oriented techniques to generalise the knowledge organisational structure of the willow ST-DBN and to construct an object-oriented spatial Bayesian network (OOSBN) for modelling the neighbourhood spatial interactions that underlie seed dispersal processes. We present an updated architecture for the management tool together with algorithms for implementing the dispersal OOSBN and for combining all components into an integrated tool.

1 INTRODUCTION

The highly-valued Upper St. Johns River in Florida, USA has been the focus of considerable restoration investment (Quintana-Ascencio et al., 2013). However, woody shrubs, primarily Carolina willow (Salix caroliniana Michx.), have invaded areas that were historically herbaceous marsh (Kinser et al., 1997). This change to the historical composition of mixed vegetation types is considered undesirable, as extensive willow thickets detract from biodiversity, aesthetic and recreational values. Overabundance of willows reduces local vegetation heterogeneity and habitat diversity. People also prefer open wetlands that offer a viewshed, navigable access and scope for recreation activities such as wildlife viewing, fishing and hunting.

Managers seek to control the overall extent of willows, their rate of expansion into other extant wetland types and encroachment into recently restored floodplain habitats. Spatial context is critical to decision-making as areas differ in terms of biodiversity, aesthetic and recreational value, "invasibility" and applicable interventions. For instance, vegetation communities that are intact or distant from willow populations (seed sources) are less susceptible to invasion. With respect to interventions, mechanical clearing is restricted to areas where the substrate can support heavy machinery; prescribed fire depends on water levels and the quantity of "burnable" understorey vegetation.

Modelling willow spread requires spatially explicit data on willow occupancy, an understanding of seed production, dispersal, germination and survival, and how the key life-history stages respond to environmental factors and management actions. Data and knowledge on these pieces of the puzzle are available from ecological and physiological theory, surveys, field and laboratory experiments and domain experts.

State-transition (ST) models are a convenient means of organising information and synthesising knowledge to represent system states and transitions that are of management interest. We build on recent studies that combine ST models with BNs to incorporate uncertainty in hypothesised states and transitions, and enable sensitivity, diagnostic and scenario analysis for decision support in ecosystem management (e.g. Bashari et al., 2009; Rumpff et al., 2011). Our approach uses the template described by Nicholson and Flores (2011) to explicitly model temporal changes in willow stages.

Nicholson et al. (2012) outlined the architecture of a management tool that would integrate GIS spatial data, a seed dispersal model and a state-transition dynamic Bayesian network (ST-DBN) for modelling the influence of environmental and management factors on temporal changes in willow stages. That paper described the knowledge engineering and expert elicitation process for the construction and scenario-based evaluation of the prototype ST-DBN. This paper extends that work, using object-oriented techniques to generalise the knowledge organisational structure of the willow ST-DBN and to construct an object-oriented spatial Bayesian network (OOSBN) for modelling the neighbourhood spatial interactions that underlie seed dispersal processes. We present an updated architecture for the management tool that incorporates GIS data and the new ST-OODBN and OOSBN structures, together with algorithms for implementing the dispersal OOSBN and for combining all components into an integrated tool.

2 BACKGROUND

Dynamic Bayesian Networks (DBNs) are a variant of ordinary BNs (Dean and Kanazawa, 1989; Nicholson, 1992) that allow explicit modelling of changes over time. A typical DBN has nodes for N variables of interest, with copies of each node for each time slice. Links in a DBN can be divided into those between nodes in the same time slice and those linking nodes across consecutive time slices. While DBNs have been used in some environmental applications (e.g. Shihab, 2008), their uptake has been limited.

State-and-transition models (STMs) have been used to model changes over time in ecological systems that have clear transitions between distinct states (e.g., in rangelands and woodlands, see Bestelmeyer et al., 2003; Rumpff et al., 2011). Nicholson and Flores (2011) proposed a template for state-transition dynamic Bayesian networks (ST-DBNs) which formalised and extended Bashari et al.'s model, combining BNs with qualitative STMs.

The influence of environmental and management factors on the main willow stages of management interest and their transitions is shown in our updated version of the Nicholson et al. (2012) ST-DBN (see Figure 1).

For each cell (spatial unit), data on attributes such as soil, vegetation type and information about landscape position and context is supplied from GIS data. This data provides inputs to parameterise the ST-DBN and dispersal model. A cell size of 100m x 100m (1 ha) was chosen to represent a modelling unit. This reflects the resolution of available spatial data for environmental attributes, makes the computational demand associated with seed dispersal modelling feasible, and is a reasonable scale with respect to candidate management actions. A time step of one year was considered appropriate given the willow's growth and seed production cycle (Nicholson et al., 2012).

Seed production depends on the size and number of reproductive (adult) stems within each cell. However, Seed Availability, the amount of seed available for germination within a cell, depends on willow seed production and dispersal from surrounding cells. As these processes are not accounted for in the ST-DBN (Figure 1), a key focus of this paper is the development and integration of an object-oriented spatial Bayesian network (OOSBN) to model the neighbourhood spatial interactions that underlie this process.

The purpose of the integrated tool is to synthesise current understanding and quantify important sources of uncertainty to support decisions on where, when and how to control willows most effectively. The ST-DBN models willow state transitions and characteristics in response to environmental and management factors within a single spatial unit and time step. For coherent, effective and well-coordinated landscape-scale management, however, we want to be able to predict willow response across space (at every cell) in the target area and across time frames of management interest (e.g. 10-20 years). Such predictions can then be mapped and also aggregated across the target area to produce evaluation metrics for managers. Such a tool would enable managers to "test", visually compare and quantitatively evaluate different candidate management strategies.

This real-world management problem is naturally described in terms of hierarchies of components that include similar, repetitive structures. Object-oriented (OO) modelling has obvious advantages in this context. We apply OO techniques to generalise the knowledge organisational structure of the willow ST-DBN and design and construct the seed production and dispersal spatial network.

3 AN ST-OODBN FOR WILLOWS

Various authors have advocated the use of OO modelling techniques to: a) help manage BN complexity via abstraction and encapsulation, b) facilitate the


[Figure 1 diagram: the ST-DBN. Node groups visible in the figure include environmental factors (Soil_Type, Vegetation, Canal_or_Center, Spring/Summer/GrowingSeason precipitation and the Available_Water nodes), management options (Burn_Decision, Burn_Intensity, Mech_Clearing, Burn_Effect_on_Willow), process nodes (Seed_Availability, Proportion_Germinating, Seedling_Survival_Proportion, Number_Germinating, Number_Surviving, Seed_Production), state nodes (Stage, Level of Cover and Rooted_Basal_Stem_Diameter at T and T+1, with stage states Unoccupied, Yearling, Sapling, Adult and Burnt) and the stage transition nodes (UnOccupied_, Yearling_, NonInterv_Yearling_, Sapling_, Adult_ and Burnt_Adult_Transition).]
Figure 1: State-and-transition dynamic Bayesian network (ST-DBN) for modelling the response of key willow stages to environmental factors and management actions within a single spatial unit. The willow stages of management interest are: unoccupied, yearling, sapling, adult and burnt adult. Colours indicate: (i) aspects of willow state in tan; (ii) seed availability, germination and seedling survival processes in orange; (iii) environmental factors in green; (iv) management options in red; and (v) willow state-transitions in purple.

construction of classes of objects that are internally cohesive and potentially more reusable, and c) formalise interfaces prior to integration (Koller and Pfeffer, 1997; Neil et al., 2000; Kjærulff and Madsen, 2008; Korb and Nicholson, 2010; Molina et al., 2010). However, examples in ecological and environmental management are scant (Molina et al., 2010; Carmona et al., 2011; Johnson and Mengersen, 2012).

We follow the definition of OOBNs used in Kjærulff and Madsen (2008), and implemented in the Hugin BN software package. A standard BN is made up of ordinary nodes, representing random variables. An OOBN class is made up of both nodes and objects, which are instances of other classes. Thus an object may encapsulate multiple sub-networks, giving a composite and hierarchical structure. Objects are connected to other nodes via some of their own ordinary nodes, called their interface nodes. The rest of the nodes are not visible to the outside world, thus hiding information detail, another key OO concept. A class can be thought of as a self-contained 'template' for an OOBN object, described by its name, its interface and its hidden part.

Finally, interface nodes are divided into input nodes and output nodes. Input nodes are the root nodes within an OOBN class, and when an object (instance) of that class becomes part of another class, each input node may be mapped to a single node (with the same state space) in the encapsulating class. The output nodes are the only nodes that may become parents of nodes in the encapsulating class. When displaying an OOBN, we show Hugin screen shots, where input nodes are indicated with a dotted and shaded outline, and output nodes with a bold and shaded outline. (Note that Hugin OOBNs do not support inheritance, so there are no super-classes or sub-classes.)
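To make this vocabulary concrete, the following minimal Java sketch (ours, purely illustrative; it is not the Hugin API and all names are hypothetical) mirrors the structure just described: a class exposes input and output nodes, hides its remaining nodes, and may contain objects that are instances of other classes.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative data structure only, mirroring the OOBN concepts above.
    class OOBNClass {
        final String name;
        final List<String> inputNodes  = new ArrayList<>(); // root nodes; mapped to parents in the encapsulating class
        final List<String> outputNodes = new ArrayList<>(); // the only nodes visible as parents outside
        final List<String> hiddenNodes = new ArrayList<>(); // private detail, invisible to the outside world
        final List<OOBNClass> objects  = new ArrayList<>(); // instances of other classes (composition)

        OOBNClass(String name) { this.name = name; }
    }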

We converted the ST-DBN (Figure 1) into an ST-OODBN as follows. Using the five conceptual categories of nodes from the original network as a guide, the network was split into two abstract class types. The first represents abstract influencing factors and consists of three sub-classes that define Environmental Conditions, Management Options and the germination and seedling survival Process Factors. The second type represents state transitions of willows (see Figure 2) and defines five sub-classes: TransitionFromUnoccupied, TransitionFromYearling, TransitionFromSapling, TransitionFromAdult and TransitionFromBurntAdult. Figure 3 illustrates the definition of the TransitionFromYearling class.


Figure 2: An abstract OOBN class for state transitions. Each implementation of a state transition output node Tn is defined by a combination of environmental conditions E1...i, management options M1...j, previous state variables S1...k, process factors P1...l and any number of Xm hidden nodes. The only required input is the node that defines the previous state; all others are optional and are based on the implementing class. Dotted arrows indicate possible connections between input, hidden and output nodes.


Figure 3: The TransitionFromYearling implementation of the abstract TransitionFromX type showing Management Options in red, State Variables in yellow and Environmental Conditions in green. The output Transition node "Transition from Yearling to:" is shown in blue with a bold and shaded outline.

These classes are instantiated as objects within an ST-OODBN class and, when integrated with the seed dispersal OOSBN (described below), define the complete ST model over a single time step (Figure 4). Recasting the network as an ST-OODBN makes the knowledge organisational structure explicit, whilst allowing network complexity to be hidden and integration efforts to focus on the interfaces between components of the network. Note that in the ST-OODBN (Figure 4) Seed Availability is an input node. Seed Production and its spatial dispersal is modelled by a separate OO network, which we present next.

4 AN OOSBN FOR SEED PRODUCTION AND DISPERSAL

S. caroliniana flowers in early spring and produces very large numbers of small seeds (∼165,000 per average adult) that disperse by wind and water. Seed production is modelled by the Willow Seed Production OOBN (Figure 5), which is embedded in a broader seed dispersal model described below.

The number of seeds produced by an adult is given by the product of the number of Inflorescences, the number of Fruits per inflorescence and the number of Seeds per fruit. Fruits per inflorescence and Seeds per fruit are defined by distributions estimated from empirical data. The number of Inflorescences increases as a function of adult size (represented by Rooted Basal Stem Diameter) and this relationship has also been estimated from empirical data.

Cover is the percentage of a 1 hectare cell that is occupied by willows and Average Canopy Area is modelled as a function of Rooted Basal Stem Diameter. Together these two variables provide an estimate of the number of reproductive stems. Overall seed production within a cell, Seeds per Hectare, is then simply the product of the seed production per stem and the number of reproductive stems. This Willow Seed Production OOBN models seed production processes explicitly rather than implicitly as in the ST-DBN prototype (Figure 1); an example of iterative and incremental knowledge engineering.
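The arithmetic in the preceding two paragraphs can be sketched as follows. This Java fragment is ours and purely illustrative: the linear inflorescence function, the Gaussian placeholders for fruits per inflorescence and seeds per fruit, and all constants are stand-ins, not the empirically estimated distributions used in the actual OOBN.

    import java.util.Random;

    class SeedProductionSketch {
        static final Random RNG = new Random(42);

        // Hypothetical placeholder: inflorescence count increasing with stem diameter.
        static double inflorescences(double basalStemDiameterCm) {
            return 50.0 * basalStemDiameterCm;
        }

        // Seeds per stem = inflorescences x fruits/inflorescence x seeds/fruit.
        static double seedsPerStem(double basalStemDiameterCm) {
            double fruitsPerInflorescence = 20 + RNG.nextGaussian() * 2; // placeholder distribution
            double seedsPerFruit          = 18 + RNG.nextGaussian() * 2; // placeholder distribution
            return inflorescences(basalStemDiameterCm) * fruitsPerInflorescence * seedsPerFruit;
        }

        // Seeds per hectare = seeds per stem x number of reproductive stems,
        // where stem count is estimated from cover and average canopy area.
        static double seedsPerHectare(double coverFraction, double avgCanopyAreaM2,
                                      double basalStemDiameterCm) {
            double stems = (10000.0 * coverFraction) / avgCanopyAreaM2; // 1 ha = 10,000 m^2
            return seedsPerStem(basalStemDiameterCm) * stems;
        }
    }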

Willow seeds do not exhibit dormancy and have only a short period of viability – those that fail to germinate in the year they are produced are lost. The amount of seed available for germination within a cell depends on seed production and dispersal from surrounding cells. Thus, neighbourhood seed production and dispersal in combination with environmental and management factors determines patterns of willow spread and colonization.

Our approach to modelling seed dispersal is phenomenological rather than mechanistic. Wind-mediated seed dispersal is calculated using the Clark et al. (1999) dispersal kernel:

$$SD_{x,y}^{x',y'} = SP_{x',y'} \times \frac{1}{2\pi\alpha^2}\, e^{-(d/\alpha)} \tag{1}$$

where SD_{x,y}^{x',y'} is the number of seeds arriving at cell (x, y) from those produced at a cell (x', y'); it is the product of seed produced SP_{x',y'} and an exponential



Figure 4: The resultant ST-OODBN class showing the four input nodes (Seed Availability, Rooted Basal Stem Diameter (T), Stage (T) and Cover (T)) along with the Environmental Conditions, Management Options, Process Factors and five TransitionFromX objects that define the outputs (Stage (T+1), Rooted Basal Stem Diameter (T+1) and Cover (T+1)) after one time step. Input nodes are illustrated with a dotted and shaded outline, instances of OOBN classes as round-cornered boxes and output nodes with a bold and shaded outline.

kernel where d is the distance between cells (x, y) and (x', y'), and α is a distance parameter. To simulate stochasticity in dispersal events, α is a random variable that can be sampled from distributions designed to reflect the expected nature of dispersal (e.g. short versus long distance dispersal) (Fox et al., 2009). This seed dispersal model is captured within the WindDispersalKernel_{x',y',t} OOBN (Figure 5), where the input Distance node is set for the particular (x, y) and α is set as a discretised normal distribution with a mean of 1 and a variance of 0.25.
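As a concrete illustration, the kernel of equation (1) with the stochastic α just described can be computed as below. This Java sketch is ours; in the actual tool the computation happens inside the WindDispersalKernel OOBN over discretised nodes rather than as a point calculation.

    import java.util.Random;

    class DispersalKernelSketch {
        static final Random RNG = new Random();

        // Seeds arriving at distance d (in cells) from a cell producing seedsProduced seeds,
        // per equation (1): SP x (1 / 2*pi*alpha^2) * exp(-(d/alpha)).
        static double seedsDispersed(double seedsProduced, double d) {
            // alpha ~ Normal(mean 1, variance 0.25); clamped to stay positive (our choice).
            double alpha = Math.max(0.05, 1.0 + Math.sqrt(0.25) * RNG.nextGaussian());
            return seedsProduced * Math.exp(-d / alpha) / (2 * Math.PI * alpha * alpha);
        }
    }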

For our purposes, we want to compute the seed availability SA_{x,y} for a target cell (x, y), which is the sum of the seeds dispersed to it from every cell in the study area:

$$SA_{x,y} = \sum_{(x',y') \in \mathrm{Area}} SD_{x,y}^{x',y'} \tag{2}$$

A naive BN model of this additive function would mean a Seed Availability node with all the Seeds Dispersed nodes (one for every cell) as its parents! For a study area with width w cells, height h cells, a Seed Availability node discretized to n states, and the Seeds Dispersed nodes discretized to m states, the CPT for Seed Availability would include n × m^{w×h} probabilities – clearly infeasible.
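As a worked illustration of the scale (numbers ours, not from the original text): even with modest discretisations n = m = 10 and a small 100 × 100 cell study area, the table would contain

$$10 \times 10^{100 \times 100} = 10^{10001}$$

probabilities, far beyond anything that could ever be stored.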

From an ecological perspective, however, not all cells within a study area are expected to contribute towards final Seed Availability at any given cell. Indeed, because the number of seeds dispersed from a seed producing cell declines exponentially with increasing distance from that cell, we can make a simplifying assumption that after a certain distance, the number of seeds dispersed is effectively negligible. As a starting point we assume a circular region of influence for the target cell, defined by a dispersal mask with radius r. So for instance, a radius of eight cells (800 meters) implies π·8², or ∼201 cells providing parents to the final Seed Availability node. This is still far too many, particularly as we are using a standard BN software package with discrete nodes and exact inference.
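A circular dispersal mask of this kind is straightforward to enumerate; the Java sketch below (ours, illustrative) lists the cell offsets within radius r. Note that the exact lattice count for r = 8 is 197 cells, close to the continuous-area estimate π·8² ≈ 201 quoted above.

    import java.util.ArrayList;
    import java.util.List;

    class DispersalMaskSketch {
        // Offsets (dx, dy) of all cells whose centres lie within r cells of the target.
        static List<int[]> circularMask(int r) {
            List<int[]> offsets = new ArrayList<>();
            for (int dx = -r; dx <= r; dx++)
                for (int dy = -r; dy <= r; dy++)
                    if (dx * dx + dy * dy <= r * r)
                        offsets.add(new int[]{dx, dy});
            return offsets;
        }

        public static void main(String[] args) {
            System.out.println(circularMask(8).size()); // prints 197
        }
    }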

However, since Seed Availability is a simple additive function, we use the simple modelling trick of adding the Seed Availability from each cell to the cumulative seed availability so far (via node Cumulative Seed Availability). This is equivalent to repeatedly divorcing parents to reduce the size of the state space. We can think of this as sequentially scanning over the spatial dimension in a similar way to rolling out a DBN over time. We call this a spatial Bayesian network (SBN), and the object-oriented variety an OOSBN.



Figure 5: The OOSBN architecture showing how Cover, Rooted Basal Stem Diameter, Cumulative Seed Availability and Distance at locations (x', y') to (x'+n, y'+n) are combined to provide total seed availability for each cell at location (x, y) at time t. Dashed arcs indicate that nodes are connected via the seed availability PTLayer. Multiple TotalSeedAvailability objects are illustrated within the SeedDispersal OOSBN reflecting that the total seeds available at a given (x, y) is dependent on seeds being produced in multiple locations. These synthetic TotalSeedAvailability objects are shown as collapsed objects, hiding their private nodes, and showing just the four input nodes that define the seeds dispersed from each producing cell to the target cell.

Our approach to integrating the seed dispersal makes use of such an OOSBN, which is run πr² times (i.e. once for every cell (x', y') within dispersal range) for each cell (x, y) in the study area to disperse and then sum the seed availability at each cell. The dispersal mask is flexible and can be designed to take on different shapes to reflect potentially important influences on wind dispersal such as wind direction, wind strength and terrain characteristics. However, in the results given in Section 6, we use a radius of eight cells.

Overall, our approach here is to mitigate the problem of large CPT sizes by turning the problem into one of computation time. This has the added benefit of potentially being computable in parallel, thus regaining some computational efficiency.

5 INTEGRATING THE ST-OODBN WITH THE OOSBN

Figure 6 is an abstract representation of the system architecture, specifically, the interactions between the GIS layers and the OO networks as the tool is used for prediction at yearly intervals over the required management time frame.

For each cell in the GIS, there is conceptually one state-transition network (ST-OODBN) and one seed production and dispersal spatial network (OOSBN). In practice, we do not store all these as separate networks, but rather re-use a single network structure, whose input nodes are re-parameterised for each cell, for each time step.

For the first time step, the system takes GIS data, which represents the initial conditions of the study area. These are stored in an internal data structure, PTLayer, that combines the spatial structure of the GIS with distributions for the (discrete) nodes in the networks. PTLayers are used to store and pass the spatially referenced prior distributions of input nodes and posterior distributions of output nodes for both the OOSBN and ST-OODBN (Figure 6). Each PTLayer contains a number of fields, one for each of the node states of the linked input and output nodes. Each field stores the probability mass of the corresponding node state.
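A minimal sketch of such a structure in Java (ours; field names hypothetical, not the tool's actual implementation):

    // A raster whose cells each hold a discrete probability distribution
    // over the states of one linked network node.
    class PTLayerSketch {
        final String nodeName;
        final int width, height, numStates;
        final double[][][] mass; // mass[x][y][s] = P(node = state s) at cell (x, y)

        PTLayerSketch(String nodeName, int width, int height, int numStates) {
            this.nodeName = nodeName;
            this.width = width;
            this.height = height;
            this.numStates = numStates;
            this.mass = new double[width][height][numStates];
        }

        double[] distributionAt(int x, int y) { return mass[x][y]; }

        void setDistribution(int x, int y, double[] p) {
            System.arraycopy(p, 0, mass[x][y], 0, numStates);
        }
    }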

The PTLayer distributions are used to set the priors for the input nodes at each time step t. Then inference (belief updating) is performed within the OO network, producing new posterior probability



Figure 6: An abstract representation of the management tool architecture.

distributions from the output nodes for each OO network, which are then transferred back to the corresponding PTLayer. This is equivalent to the standard "roll-out" followed by "roll-up" steps done in prediction with two time slice DBNs (Boyen and Koller, 1998) to avoid the computational complexity of rolling out a DBN over a large number of time steps. But here, in addition, the PTLayers are being used as intermediate storage across the spatial grid, with the inputs for the next time t+1 coming not only from the network for the same cell, but also from outputs of networks for other cells. This is done via the seed dispersal OOSBN, which uses the Seed Availability PTLayer to accumulate the seed availability arising from seed production in other cells within the dispersal mask. In effect, the PTLayer replaces both the temporal arcs (if the network were rolled out over many time steps) and the spatial arcs between the networks for different cells, which are essentially the cross-network arcs from seed production in one place to seed availability in another. Note that this method is limited to prediction only – we cannot use the model for diagnosis, or to identify the starting states and management actions needed to achieve a preferred end-state.

More formally, using the notation from Figure 6, at time t and at each cell location (x, y), PTLayers_{1...l} are used to initialise the priors of nodes I_{1...m} and I_{1...M} of ST-OODBN_{x,y,t} and OOSBN_t respectively. After propagation within OOSBN_t, beliefs from the output node SA are stored in PTLayer_{SeedAvailability} and then used to update the priors of input node SA of ST-OODBN_{x,y,t}. Then belief updating is done for the ST-OODBN_{x,y,t} for each cell (x, y) and the beliefs from output nodes O_{1...n} propagated back to PTLayers_{1...l} at time t+1. Although the mapping between the set of PTLayers_{1...l} and input nodes I_{1...m} is one-to-one, there may be cases where there are no GIS layers available for an input, in which case a prior distribution is used. Finally, with respect to OOSBN propagation, the range of locations (x', y') ... (x'+n, y'+n) (i.e. neighbourhood cells) that are included is defined by the dispersal mask described in the previous section.

Algorithm 1: An algorithm for propagating a GIS-coupled ST-OODBN with spatial OOSBN sub-networks

    function PROPAGATE(ST-OODBN, OOSBN, ptlayers, T)
        I ← I(ST-OODBN)
        O ← O(ST-OODBN)
        SA ← getLayer(ptlayers, SeedAvailability)
        for t := 0 to T do
            PROPAGATE(OOSBN, ptlayers, dispMask)   // computes all Seed Availabilities
            for all (x, y) ∈ Area do
                p(SeedAvailability) ← SA_{x,y}
                for all I_i ∈ I do
                    L_j ← getLayer(ptlayers, I_i)
                    p(I_i) ← L_j(x, y)
                end for
                update beliefs in ST-OODBN
                for all O_i ∈ O do
                    L_j ← getLayer(ptlayers, O_i)
                    L_j(x, y) ← Bel(O_i)
                end for
            end for
        end for
    end function

The process is detailed in Algorithm 1, as a function PROPAGATE which takes the ST-OODBN (shown in Fig. 4) and OOSBN (shown in Fig. 5) networks, a list of ptlayers (previously initialised from the GIS layers) whose cells correspond to the area under consideration, and the number of time steps T over which to propagate the network. In the algorithm, we use I(OOBN) (respectively O(OOBN)) to denote a function that returns the input (resp. output) nodes of the interface of the OOBN, and getLayer(ptlayers, V) to denote a function that returns the PTLayer corresponding to a node V.

Algorithm 2: An algorithm for dispersing seeds by wind using an OOSBN class

    function PROPAGATE(OOSBN, ptlayers, dispMask)
        SA ← getLayer(ptlayers, SeedAvailability)
        for all (x, y) ∈ Area do
            P(SA_{x,y} = none) ← 1
        end for
        for all (x, y) ∈ Area do
            for all I_i ∈ I_{x,y}(OOSBN) do
                L_j ← getLayer(ptlayers, I_i)
                p(I_i) ← L_j(x, y)
            end for
            for all (x', y') ∈ dispMask do
                p(CumSeedAvail) ← SA_{x,y}
                d ← sqrt((x − x')² + (y − y')²)
                p(Distance = d) ← 1
                update beliefs in OOSBN
                SA_{x,y} ← Bel(SeedAvailability)
            end for
        end for
    end function

First, seed dispersal is done with the OOSBN using Algorithm 2. Then for each cell (x, y) in the area, for each input node I_i, the distribution for that cell from its corresponding ptlayer is set as the prior of the input node. Belief updating of the ST-OODBN is done, propagating the new priors through to updated posterior distributions for the output nodes. Finally the beliefs for each output node (Bel(O_i)) are copied back to the distribution at the (x, y) cell for the corresponding ptlayer (i.e. into SA_{x,y}).

Algorithm 2 details the propagation process using the SeedDispersal OOSBN class illustrated in Figure 5. The algorithm starts by taking the OOSBN, a list of PTLayers whose cells correspond to the area under consideration, and a dispersal mask (dispMask). Before starting the dispersal process, the provided Seed Availability PTLayer is initialised with no seeds available at all (x, y) co-ordinates. It then loops through each cell (x, y) in the area, setting the distribution for each input node I_i (i.e., Cover, Rooted Basal Stem Diameter and Cumulative Seed Availability) from its corresponding PTLayer cell. The algorithm then enters a second loop over every cell that is a possible seed source based on the dispersal mask. The Distance node is set using the Euclidean distance between the current cell and the target co-ordinates. Finally, belief updating is done within the OOSBN (Figure 5) and the beliefs (i.e. the posterior probability distribution) from the Seed Availability node at each point are transferred into the Cumulative Seed Availability for the subsequent cell at each iteration. After all the cells (x', y') within the dispersal mask have been visited, the Seed Availability PTLayer SA_{x,y} contains the overall seed availability in that cell, which is used for modelling germination in the ST-OODBN in Algorithm 1.

6 PRELIMINARY RESULTS

To implement the software architecture described, we chose Hugin Researcher 7.7 (2013) to develop the ST-OODBN and dispersal model OOSBN, the Hugin Researcher Java API 7.7 (2013) to provide programmatic access to the developed networks, the Image-IO-ext (2013) Java library to provide access to GIS raster layer formats, and the Java programming language to implement the algorithms tying the components together. Hugin was chosen as the OOBN development platform as it currently has one of the most complete OOBN implementations. Java was chosen as the implementing language as it is platform independent and provides a well-established and understood OO development environment. We implemented the tool as a standalone program, allowing pre-processing of GIS data to be performed in whatever program the end user is most familiar with. In our case we used a combination of ArcGIS (2013), Quantum GIS (2013) and SAGA GIS (2013).

To demonstrate our working implementation, we ran the model for the Blue Cypress Marsh Conservation Area (138 x 205 cells) within the Upper St. Johns River basin. We used a simplified (and unrealistic) management rule set that says if a cell is next to a canal, mechanical clearing is carried out; otherwise, for landlocked cells, burning is prescribed (with a probability of 0.1). Maps of willow cover and seed production were generated at yearly intervals for a 25 year prediction window. This took about 8 hours of computation on a 64-bit machine with a 2.8GHz processor. Figure 7 shows seed availability across the study area using output from the SeedAvailability nodes in the OOSBN at 5 yearly intervals. To produce the maps, the seed availability interval with the highest posterior probability is used to produce a grayscale value, where zero seeds is black and 10^12 is white. In the run shown, seed availability decreases over time as the level of willow cover is reduced by the management regime.

7 DISCUSSION AND FUTURE WORK

For coherent, coordinated and effective landscape-scale decision support, managers need the capability to predict willow state changes across space and time. We have tackled the challenges of this real-world problem by synthesising ideas and techniques from object-oriented knowledge engineering, dynamic BNs, GIS-


Figure 7: Seed availability predicted by the Willows ST-OODBN at t = 0, 5, 10, 15, 20, 25 years, across the Blue Cypress Marsh Conservation Area (138 x 208 cells). Adult willow occupancy at t = 0 is shown in the bottom right panel; black indicates absence, grey presence.

coupled BNs and dispersal modelling. To our knowledge, this is the first environmental management application in which OOBNs are used to model spatially-explicit process interactions.

Further work in the development of this management tool includes developing a water dispersal model and updating the parametrisation of the ST-OODBN and OOSBN using judgements from a larger pool of domain experts, together with specific empirical data where available. In addition, we will work with managers and domain experts to identify: i) realistic management scenarios, ii) useful summary descriptors for the various model outputs and iii) desirable features for a tool interface.

Throughout the research and integration process we encountered challenges with the development, management and use of OOBNs. While there have been advances in OOBN software, it still lacks many of the useful features available in other development tools. For instance, modern software engineering IDEs provide easy-to-use refactoring, documentation and integration with version control tools. The tools we used to design and implement the underlying OOBNs for our tool still lack powerful refactoring, making the management of object interface changes a time consuming and error prone task. Integrated source control is non-existent and documentation tools rudimentary. Improvements in these areas would make working with OOBNs far more accessible to the type of user that wishes to make use of OOBNs for natural resource management.

With respect to spatialising the ST-OODBN with the use of OOSBNs, there is currently no graphical tool up to the task of facilitating the integration of the required components. This means that anyone wanting to replicate our work would need to make use of the available APIs, and this constitutes a barrier to usage by people with little or no programming background.

Acknowledgements

This work was supported by ARC Linkage Project LP110100304 and the Australian Centre of Excellence for Risk Analysis. John Fauth (UCF), and Dianne Hall, Kimberli Ponzio and Ken Snyder (SJRWMD) provided critical information about the St Johns River ecosystem, and knowledge about willow invasion.


References

Bashari, H., Smith, C., and Bosch, O. (2009). Developing decision support tools for rangeland management by combining state and transition models and Bayesian belief networks. Agricultural Systems, 99(1):23–34.

Bestelmeyer, B. T., Brown, J. R., Havstad, K. M., Alexander, R., Chavez, G., and Herrick, J. E. (2003). Development and use of state-and-transition models for rangelands. Journal of Range Management, (2):114–126.

Boyen, X. and Koller, D. (1998). Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, pages 33–42, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Carmona, G., Molina, J., and Bromley, J. (2011). Object-Oriented Bayesian networks for participatory water management: two case studies in Spain. Journal of Water Resources Planning and Management, 137(4):366–376.

Clark, J. D., Silman, M., Kern, R., Macklin, E., and Hille Ris Lambers, J. (1999). Seed dispersal near and far: patterns across temperate and tropical forests. Ecology, 80(5):1475–1494.

Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5:142–150.

Esri (2013). ArcGIS (Version 10.1) [Software]. http://www.esri.com/software/arcgis.

Fox, J. C., Buckley, Y. M., Panetta, F. D., Bourgoin, J., and Pullar, D. (2009). Surveillance protocols for management of invasive plants: modelling Chilean needle grass (Nassella neesiana) in Australia. Diversity and Distributions, 15(4):577–589.

GeoSolutions (2013). Image-IO-ext Java Library (Version 1.1.7) [Software]. http://github.com/geosolutions-it/.

Hugin Expert A/S (2013). Hugin Researcher (Version 7.7) [Software]. http://www.hugin.com.

Johnson, S. and Mengersen, K. (2012). Integrated Bayesian network framework for modeling complex ecological issues. Integrated Environmental Assessment and Management, 8(3):480–490.

Kinser, P., Lee, M. A., Dambek, G., Williams, M., Ponzio, K., and Adamus, C. (1997). Expansion of Willow in the Blue Cypress Marsh Conservation Area, Upper St. Johns River Basin. Technical report, St. Johns River Water Management District, Palatka, Florida.

Kjærulff, U. B. and Madsen, A. (2008). Bayesian Networks and Influence Diagrams: A Guide to Construction and Analysis. Springer Verlag, New York.

Koller, D. and Pfeffer, A. (1997). Object-oriented Bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI'97, pages 302–313, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Korb, K. B. and Nicholson, A. (2010). Bayesian Artificial Intelligence. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition.

Molina, J., Bromley, J., García-Aróstegui, J., Sullivan, C., and Benavente, J. (2010). Integrated water resources management of overexploited hydrogeological systems using Object-Oriented Bayesian Networks. Environmental Modelling & Software, 25(4):383–397.

Neil, M., Fenton, N., and Nielson, L. (2000). Building large-scale Bayesian networks. The Knowledge Engineering Review, 15(3):1–33.

Nicholson, A., Chee, Y., and Quintana-Ascencio, P. (2012). A state-transition DBN for management of willows in an American heritage river catchment. In Nicholson, A., Agosta, J. M., and Flores, J., editors, Ninth Bayesian Modeling Applications Workshop at the Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA. http://ceur-ws.org/Vol-962/paper07.pdf.

Nicholson, A. E. (1992). Monitoring Discrete Environments using Dynamic Belief Networks. PhD thesis, Department of Engineering Sciences, Oxford.

Nicholson, A. E. and Flores, M. J. (2011). Combining state and transition models with dynamic Bayesian networks. Ecological Modelling, 222(3):555–566.

Quantum GIS Development Team (2013). Quantum GIS (Version 1.8.0) [Software]. http://www.qgis.org/.

Quintana-Ascencio, P. F., Fauth, J. E., Castro Morales, L. M., Ponzio, K. J., Hall, D., and Snyder, K. (2013). Taming the beast: managing hydrology to control Carolina Willow (Salix caroliniana) seedlings and cuttings. Restoration Ecology.

Rumpff, L., Duncan, D., Vesk, P., Keith, D., and Wintle, B. (2011). State-and-transition modelling for adaptive management of native woodlands. Biological Conservation, 144(4):1224–1236.

SAGA Development Team (2013). SAGA GIS (Version 2.0.8) [Software]. http://www.saga-gis.org/.

Shihab, K. (2008). Dynamic modeling of groundwater pollutants with Bayesian networks. Applied Artificial Intelligence, 22(4):352–376.


Product Trees for Gaussian Process Covariance in Sublinear Time

David A. Moore, Computer Science Division, University of California, Berkeley, Berkeley, CA 94709, [email protected]

Stuart Russell, Computer Science Division, University of California, Berkeley, Berkeley, CA 94709, [email protected]

Abstract

Gaussian process (GP) regression is a powerful technique for nonparametric regression; unfortunately, calculating the predictive variance in a standard GP model requires time O(n²) in the size of the training set. This is cost prohibitive when GP likelihood calculations must be done in the inner loop of the inference procedure for a larger model (e.g., MCMC). Previous work by Shen et al. (2006) used a k-d tree structure to approximate the predictive mean in certain GP models. We extend this approach to achieve efficient approximation of the predictive covariance using a tree clustering on pairs of training points. We show empirically that this significantly increases performance at minimal cost in accuracy. Additionally, we apply our method to "primal/dual" models having both parametric and nonparametric components and show that this enables efficient computations even while modeling longer-scale variation.

1 Introduction

Complex Bayesian models often tie together many smaller components, each of which must provide its output in terms of probabilities rather than discrete predictions. As a natively probabilistic technique, Gaussian process (GP) regression (Rasmussen and Williams, 2006) is a natural fit for such systems, but its applications in large-scale Bayesian models have been limited by computational concerns: training a GP model on n points requires O(n³) time, while computing the predictive distribution at a test point requires O(n) and O(n²) operations for the mean and variance respectively.

This work focuses specifically on the fast evaluation of GP likelihoods, motivated by the desire for efficient inference in models that include a GP regression component. In particular, we focus on the predictive covariance, since this computation time generally dominates that of the predictive mean. In our setting, training time is a secondary concern: the model can always be trained offline, but the likelihood evaluation occurs in the inner loop of an ongoing inference procedure, and must be efficient if inference is to be feasible.

One approach to speeding up GP regression, common especially to spatial applications, is the use of covariance kernels with short lengthscales to induce sparsity or near-sparsity in the kernel matrix. This can be exploited directly using sparse linear algebra packages (Vanhatalo and Vehtari, 2008) or by more structured techniques such as space-partitioning trees (Shen et al., 2006; Gray, 2004); the latter approaches create a query-dependent clustering to avoid considering regions of the data not relevant to a particular query. However, previous work has focused on efficient calculation of the predictive mean, rather than the variance, and the restriction to short lengthscales also inhibits application to data that contain longer-scale variations.

In this paper, we develop a tree-based method to efficiently compute the predictive covariance in GP models. Our work extends the weighted sum algorithm of Shen et al. (2006), which computes the predictive mean. Instead of clustering points with similar weights, we cluster pairs of points having similar weights, where the weights are given by a kernel-dependent distance metric defined on the product space consisting of all pairs of training points. We show how to efficiently build and compute using a product tree constructed in this space, yielding an adaptive covariance computation that exploits the geometric structure of the training data to avoid the need to explicitly consider each pair of training points. This enables us to present what is to our knowledge the first account of GP regression in which the major test-time operations (predictive mean, covariance, and likelihood) run in time sublinear in the training set size, given a suitably sparse kernel matrix. As an extension, we show how our approach can be applied to GP models that combine both parametric and nonparametric components, and argue that such models present a promising option for modeling global-scale structure while maintaining the efficiency of short-lengthscale GPs. Finally, we present empirical results that demonstrate significant speedups on synthetic data as well as a real-world seismic dataset.

2 Background

2.1 GP Regression Model

We assume as training input a set of labeled points {(x_i, y_i) | i = 1, ..., n}, where we suppose that

$$y_i = f(x_i) + \varepsilon_i$$

for some unknown function f(·) and i.i.d. Gaussian observation noise ε_i ~ N(0, σ_n²). Treating the estimation of f(·) as a Bayesian inference problem, we consider a Gaussian process prior distribution f(·) ~ GP(0, k), parameterized by a positive-definite covariance or kernel function k(x, x'). Given a set X* containing m test points, we derive a Gaussian posterior distribution f(X*) ~ N(μ*, Σ*), where

$$\mu^* = K^{*T} K_y^{-1} \mathbf{y} \tag{1}$$

$$\Sigma^* = K^{**} - K^{*T} K_y^{-1} K^* \tag{2}$$

and K_y = K(X, X) + σ_n²I is the covariance matrix of training set observations, K* = k(X, X*) denotes the n×m matrix containing the kernel evaluated at each pair of training and test points, and similarly K** = k(X*, X*) gives the kernel evaluations at each pair of test points. Details of the derivations, along with general background on GP regression, can be found in Rasmussen and Williams (2006).

In this work, we make the additional assumption that the input points x_i and test points x*_p lie in some metric space (M, d), and that the kernel is a monotonically decreasing function of the distance metric. Many common kernels fit into this framework, including the exponential, squared-exponential, rational quadratic, and Matérn kernel families; anisotropic kernels can be represented through choice of an appropriate metric.

2.2 k-d and Metric Trees

Tree structures such as k-d trees (Friedman et al., 1977) form a hierarchical, multiresolution partitioning of a dataset, and are commonly used in machine learning for efficient nearest-neighbor queries. They

Figure 1: Cover tree decomposition of seismic event locations recorded at Fitzroy Crossing, Australia (with X marking the station location).

have also been adapted to speed up nonparametric regression (Moore et al., 1997; Shen et al., 2006); the general approach is to view the regression computation of interest as a sum over some quantity associated with each training point, weighted by the kernel evaluation against a test point. If there are sets of training points having similar weight – for example, if the kernel is very wide, if the points are very close to each other, or if the points are all far enough from the query to have effectively zero weight – then the weighted sum over the set of points can be approximated by an unweighted sum (which does not depend on the query and may be precomputed) times an estimate of the typical weight for the group, saving the effort of examining each point individually. This is implemented as a recursion over a tree structure augmented at each node with the unweighted sum over all descendants, so that recursion can be cut off with an approximation whenever the weight function is shown to be suitably uniform over the current region.

Major drawbacks of k-d trees include poor performance in high dimensions and a limitation to Euclidean spaces. By contrast, we are interested in non-Euclidean metrics both as a matter of practical application (e.g., in a geophysical setting we might consider points on the surface of the earth) and because some choices of kernel function require our algorithm to operate under a non-Euclidean metric even if the underlying space is Euclidean (see section 3.2). We therefore consider instead the class of trees having the following properties: (a) each node n is associated with some point x_n ∈ M, such that all descendants of n are contained within a ball of radius r_n centered at x_n, and (b) for each leaf L we have x_L ∈ X, with exactly one leaf node for each training point x_i ∈ X. We call any tree satisfying these properties a metric tree.
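Properties (a) and (b) translate directly into a data structure; a minimal Java sketch (ours, illustrative, with hypothetical field names):

    import java.util.ArrayList;
    import java.util.List;

    class MetricTreeNode {
        double[] point;    // x_n: the point associated with this node
        double radius;     // r_n: all descendants lie within this ball around point
        List<MetricTreeNode> children = new ArrayList<>(); // empty iff this is a leaf

        // Leaves correspond one-to-one with training points.
        boolean isLeaf() { return children.isEmpty(); }
    }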


    function WeightedMetricSum(node n, query points (x*_i, x*_j),
                               accumulated sum S, tolerances ε_rel, ε_abs)
        δ_n ← δ((x*_i, x*_j), (n_1, n_2))
        if n is a leaf then
            S ← S + (K_y^{-1})_n · (k(d(x*_i, n_1)) · k(d(x*_j, n_2)))
        else
            w_min ← k^prod_lower(δ_n + r_n)
            w_max ← k^prod_upper(max(δ_n − r_n, 0))
            if w_max · S^Abs_n ≤ (ε_rel · |S + w_min · S^UW_n| + ε_abs) then
                S ← S + (1/2)(w_max + w_min) · S^UW_n
            else
                for each child c of n, sorted by ascending δ((x*_i, x*_j), (c_1, c_2)) do
                    S ← S + WeightedMetricSum(c, (x*_i, x*_j), S, ε_rel, ε_abs)
                end for
            end if
        end if
        return S
    end function

Figure 2: Recursive algorithm for computing GP covariance entries using a product tree. Abusing notation, we use n to represent both a tree node and the pair of points n = (n_1, n_2) associated with that node.

Examples of metric trees include many structures designed specifically for nearest-neighbor queries, such as ball trees (Uhlmann, 1991) and cover trees (Beygelzimer et al., 2006), but in principle any hierarchical clustering of the dataset, e.g., an agglomerative clustering, might be augmented with radius information to create a metric tree. Although our algorithms can operate on any metric tree structure, we use cover trees in our implementation and experiments. A cover tree on n points can be constructed in O(n log n) time, and the construction and query times scale only with the intrinsic dimensionality of the data, allowing for efficient nearest-neighbor queries in higher-dimensional spaces (Beygelzimer et al., 2006). Figure 1 shows a cover-tree decomposition of one of our test datasets.

3 Efficient Covariance using ProductTrees

We consider efficient calculation of the GP covariance (2). The primary challenge is the multiplication K^{*T} K_y^{-1} K^*. For simplicity of exposition, we will focus on computing the (i, j)th entry of the resulting matrix, i.e., on the multiplication k*_i^T K_y^{-1} k*_j, where k*_i denotes the vector of kernel evaluations between the training set and the ith test point, or equivalently the ith column of K*. Note that a naive implementation of this multiplication requires O(n²) time.

We might be tempted to apply the vector multiplication primitive of Shen et al. (2006) separately for each row of K_y^{-1} to compute K_y^{-1} k*_j, and then once more to multiply the resulting vector by k*_i. Unfortunately, this requires n vector multiplications and thus scales (at least) linearly in the size of the training set. Instead, we note that we can rewrite k*_i^T K_y^{-1} k*_j as a weighted sum of the entries of K_y^{-1}, where the weight of the (p, q)th entry is given by k(x*_i, x_p) k(x*_j, x_q):

$$k_i^{*T} K_y^{-1} k_j^* = \sum_{p=1}^{N} \sum_{q=1}^{N} (K_y^{-1})_{pq}\, k(x_i^*, x_p)\, k(x_j^*, x_q). \tag{3}$$
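For reference, the naive O(n²) evaluation of (3) that the product tree is designed to avoid looks like this (our illustrative Java sketch; names are hypothetical):

    class NaiveCovarianceSketch {
        interface Kernel { double k(double[] a, double[] b); }

        // Computes k*_i^T Kinv k*_j by summing over all n^2 pairs of training points,
        // where Kinv = K_y^{-1} and X holds the training points.
        static double entry(double[][] Kinv, double[][] X,
                            double[] xi, double[] xj, Kernel kernel) {
            int n = X.length;
            double s = 0.0;
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++)
                    s += Kinv[p][q] * kernel.k(xi, X[p]) * kernel.k(xj, X[q]);
            return s;
        }
    }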

Our goal is to compute this weighted sum efficiently using a tree structure, similar to Shen et al. (2006), except that instead of clustering points with similar weights, we now want to cluster pairs of points having similar weights.

To do this, we consider the product space M × M consisting of all pairs of points from M, and define a product metric δ on this space. The details of the product metric will depend on the choice of kernel function, as discussed in section 3.2 below. For the moment, we will assume an SE kernel, of the form k_SE(d) = exp(−d²), for which a natural choice is the 2-product metric:

$$\delta((x_a, x_b), (x_c, x_d)) = \sqrt{d(x_a, x_c)^2 + d(x_b, x_d)^2}.$$

Note that this metric, taken together with the SE kernel, has the fortunate property

$$k_{SE}(d(x_a, x_b))\, k_{SE}(d(x_c, x_d)) = k_{SE}(\delta((x_a, x_c), (x_b, x_d))),$$

i.e., the property that evaluating the kernel in the product space (rhs) gives us the correct weight for our weighted sum (3) (lhs).
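For clarity (our verification, not in the original text): writing d_1 = d(x_a, x_b) and d_2 = d(x_c, x_d), the property is a one-line check,

$$k_{SE}(d_1)\, k_{SE}(d_2) = e^{-d_1^2} e^{-d_2^2} = e^{-(d_1^2 + d_2^2)} = k_{SE}\!\left(\sqrt{d_1^2 + d_2^2}\right) = k_{SE}(\delta((x_a, x_c), (x_b, x_d))).$$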

Now we can run any metric tree construction algorithm (e.g., a cover tree) using the product metric to build a product tree on all pairs of training points. In principle, this tree contains n² leaves, one for each pair of training points. In practice it can often be made much smaller; see section 3.1 for details. At each leaf node L, representing a pair of training points, we store the element (K_y^{-1})_L corresponding to those two training points, and at each higher-level node n we cache the unweighted sum S^UW_n of these entries over all of its descendant leaf nodes, as well as the sum of absolute values S^Abs_n (these cached sums will be used to determine when to cut off recursive calculations):

SUWn =

∑L∈leaves(n)

(K−1y )L (4)

SAbsn =

∑L∈leaves(n)

∣∣(K−1y )L∣∣ . (5)
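A minimal sketch of this augmentation (field names are ours, not the paper's) stores the cached quantities on each node and fills them in with one bottom-up pass:

```python
from dataclasses import dataclass, field

@dataclass
class ProductTreeNode:
    """Product-tree node; a leaf represents one pair of training points and
    carries the corresponding entry of K_y^{-1}."""
    center1: tuple = ()     # representative pair of points for this node
    center2: tuple = ()
    radius: float = 0.0     # covering radius in the product metric
    entry: float = 0.0      # (K_y^{-1})_L, meaningful at leaves
    children: list = field(default_factory=list)
    s_uw: float = 0.0       # cached unweighted sum, eq. (4)
    s_abs: float = 0.0      # cached sum of absolute values, eq. (5)

def cache_sums(node):
    """One bottom-up pass filling the cached sums (4) and (5)."""
    if not node.children:   # leaf
        node.s_uw, node.s_abs = node.entry, abs(node.entry)
    else:
        for child in node.children:
            cache_sums(child)
        node.s_uw = sum(c.s_uw for c in node.children)
        node.s_abs = sum(c.s_abs for c in node.children)
    return node
```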

Given a product tree augmented in this way, the weighted-sum calculation (3) is performed by the WeightedMetricSum algorithm of Figure 2. This algorithm is similar to the WeightedSum and WeightedXtXBelow algorithms of Shen et al. (2006) and Moore et al. (1997) respectively, but adapted to the non-Euclidean and non-binary tree setting, and further adapted to make use of bounds on the product kernel (see section 3.2). It proceeds by a recursive descent down the tree, where at each non-leaf node it computes upper and lower bounds on the weight of any descendant, and applies a cutoff rule to determine whether to continue the descent. Many cutoff rules are possible; for predictive mean calculation, Moore et al. (1997) and Shen et al. (2006) maintain an accumulated lower bound on the total overall weight, and cut off whenever the difference between the upper and lower weight bounds at the current node is a small fraction of the lower bound on the overall weight. However, our setting differs from theirs: since we are computing a weighted sum over entries of $K_y^{-1}$, which we expect to be approximately sparse, we expect that some entries will contribute much more than others. Thus we want our cutoff rule to account for the weights of the sum and the entries of $K_y^{-1}$ that are being summed over. We do this by defining a rule in terms of the current running weighted sum,

$$w_{\max} \cdot S^{Abs}_n \;\leq\; \left( \epsilon_{rel} \left| S + w_{\min} \cdot S^{UW}_n \right| + \epsilon_{abs} \right), \qquad (6)$$

which we have found to significantly improve performance in covariance calculations compared to the weight-based rule of Moore et al. (1997) and Shen et al. (2006). Here $S$ is the weighted sum accumulated thus far, and $\epsilon_{abs}$ and $\epsilon_{rel}$ are tunable approximation parameters. We interpret the left-hand side of (6) as computing an upper bound on the contribution of node $n$'s descendants to the final sum, while the absolute value on the right-hand side gives an estimated lower bound on the magnitude of the final sum (note that this is not a true bound, since the sum may contain both positive and negative terms, but it appears effective in practice). If the leaves below the current node $n$ appear to contribute a negligible fraction of the total sum, we approximate the contribution from $n$ by $\frac{1}{2}(w_{\max} + w_{\min}) \cdot S^{UW}_n$, i.e., by the average weight times the unweighted sum. Otherwise, the computation continues recursively over $n$'s children. Following Shen et al. (2006), we recurse to child nodes in order of increasing distance from the query point, so as to accumulate large sums early on and increase the chance of cutting off later recursions.
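Putting these pieces together, a hedged Python sketch of the recursion (following the description in the text rather than the authors' actual implementation, and reusing the ProductTreeNode fields above) might look as follows; the triangle inequality gives the weight bounds, since $k_{prod}$ is monotonically decreasing in the product metric:

```python
def weighted_metric_sum(node, query_pair, k_prod, delta,
                        eps_rel=1e-4, eps_abs=1e-8, S=0.0):
    """Approximate the k-weighted sum of K_y^{-1} entries below `node`.

    k_prod must be monotonically decreasing in the product metric `delta`,
    so the triangle inequality bounds the weight of every descendant leaf.
    """
    d = delta(query_pair, (node.center1, node.center2))
    if not node.children:                      # leaf: exact contribution
        return S + k_prod(d) * node.entry
    w_max = k_prod(max(0.0, d - node.radius))  # upper bound on leaf weights
    w_min = k_prod(d + node.radius)            # lower bound on leaf weights
    # Cutoff rule (6): descendants contribute negligibly to the running sum.
    if w_max * node.s_abs <= eps_rel * abs(S + w_min * node.s_uw) + eps_abs:
        return S + 0.5 * (w_max + w_min) * node.s_uw
    # Recurse, visiting nearer children first to accumulate large sums early.
    for child in sorted(node.children,
                        key=lambda c: delta(query_pair,
                                            (c.center1, c.center2))):
        S = weighted_metric_sum(child, query_pair, k_prod, delta,
                                eps_rel, eps_abs, S)
    return S
```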

3.1 Implementation

A naive product tree on $n$ points will have $n^2$ leaves, but we can reduce this and achieve substantial speedups by exploiting the structure of $K_y^{-1}$ and of the product space $\mathcal{M} \times \mathcal{M}$:

Sparsity. If $K_y$ is sparse, or can be well-approximated by a sparse matrix, then $K_y^{-1}$ is often also sparse (or well-approximated as sparse) in practice. This occurs in the case of compactly supported kernel functions (Gneiting, 2002; Rasmussen and Williams, 2006), but also even when using standard kernels with short lengthscales. Note that although there is no guarantee that the inverse of a sparse matrix must itself be sparse (with the exception of specific structures, e.g., block diagonal matrices), it is often the case that when $K_y$ is sparse many entries of $K_y^{-1}$ will be very near to zero, since points with negligible covariance generally also have negligibly small correlations in the precision matrix, so $K_y^{-1}$ can often be well-approximated as sparse. When this is the case, our product tree need include only those pairs $(x_p, x_q)$ for which $(K_y^{-1})_{pq}$ is non-negligible. This is often a substantial advantage.

Symmetry. Since $K_y^{-1}$ is a symmetric matrix, it is redundant to include leaves for both $(x_p, x_q)$ and $(x_q, x_p)$ in our tree. We can decompose $K_y^{-1} = U + D + U^T$, where $D = \mathrm{diag}(K_y^{-1})$ is a diagonal matrix and $U = \mathrm{triu}(K_y^{-1})$ is a strictly upper triangular (zero-diagonal) matrix. This allows us to rewrite

$$k_{*i}^T K_y^{-1} k_{*j} = k_{*i}^T U k_{*j} + k_{*i}^T D k_{*j} + k_{*i}^T U^T k_{*j},$$

in which the first and third terms can be implemented as calls to WeightedMetricSum on a product tree built from $U$; note that this tree will be half the size of a tree built for $K_y^{-1}$, since we omit zero entries. The second (diagonal) term can be computed using a separate (very small) product tree built from the nonzero entries of $D$. The accumulated sum $S$ can be carried over between these three computations, so we can speed up the later computations by accumulating large weights in the earlier computations.
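The decomposition is easy to check numerically; in this small numpy sketch (dimensions and the stand-in matrix are ours), the three terms recover the full quadratic form:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
Kinv = A @ A.T + np.eye(5)          # stand-in for a symmetric K_y^{-1}

U = np.triu(Kinv, k=1)              # strictly upper triangular part
D = np.diag(np.diag(Kinv))          # diagonal part
assert np.allclose(Kinv, U + D + U.T)

ki, kj = rng.normal(size=5), rng.normal(size=5)
# In the product-tree setting, the first and third terms become calls to
# WeightedMetricSum on a tree built from U, and the second term uses a
# small tree built from the nonzero entries of D.
total = ki @ U @ kj + ki @ D @ kj + ki @ U.T @ kj
assert np.isclose(total, ki @ Kinv @ kj)
```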

Factorization of product distances. In general, computing the product distance $\delta$ will usually involve two calls to the underlying distance metric $d$; these can often be reused. For example, when calculating both $\delta((x_a, x_b), (x_c, x_d))$ and $\delta((x_a, x_e), (x_c, x_d))$, we can reuse the value of $d(x_a, x_c)$ for both computations. This reduces the total number of calls to the distance function during tree construction from a worst-case $n^4$ (for all pairs of pairs of training points) to a maximum of $n^2$, and in general much fewer if other optimizations such as sparsity are implemented as well. This can dramatically speed up tree construction when the distance metric is slow to evaluate. It can also speed up test-time evaluation, if distances to the same point must be computed at multiple levels of the tree.
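One simple way to get this reuse in Python, shown purely as an illustration of the idea, is to memoize the underlying metric so that repeated product-distance evaluations share calls to $d$:

```python
from functools import lru_cache

@lru_cache(maxsize=None)             # points must be hashable (e.g. tuples)
def d(p, q):
    """Underlying metric; each (p, q) evaluation is computed only once."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def delta(pair1, pair2):
    """2-product metric built from cached calls to d."""
    (xa, xb), (xc, xd) = pair1, pair2
    return (d(xa, xc) ** 2 + d(xb, xd) ** 2) ** 0.5
```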

3.2 Other Kernel Functions

As noted above, the SE kernel has the lucky property that, if we choose product metric $\delta = \sqrt{d_1^2 + d_2^2}$, then the product of two SE kernels is equal to the kernel of the product metric $\delta$:

$$k_{SE}(d_1)\, k_{SE}(d_2) = \exp(-d_1^2 - d_2^2) = k_{SE}(\delta).$$


| Kernel | $k(d)$ | $k(d_1)k(d_2)$ | $\delta(d_1, d_2)$ | $k^{prod}_{lower}(\delta)$ | $k^{prod}_{upper}(\delta)$ |
| SE | $\exp(-d^2)$ | $\exp(-d_1^2 - d_2^2)$ | $\sqrt{d_1^2 + d_2^2}$ | $\exp(-\delta^2)$ | $\exp(-\delta^2)$ |
| $\gamma$-exponential | $\exp(-d^\gamma)$ | $\exp(-d_1^\gamma - d_2^\gamma)$ | $(d_1^\gamma + d_2^\gamma)^{1/\gamma}$ | $\exp(-\delta^\gamma)$ | $\exp(-\delta^\gamma)$ |
| Rational quadratic | $\left(1 + \frac{d^2}{2\alpha}\right)^{-\alpha}$ | $\left(1 + \frac{d_1^2 + d_2^2}{2\alpha} + \frac{d_1^2 d_2^2}{4\alpha^2}\right)^{-\alpha}$ | $\sqrt{d_1^2 + d_2^2}$ | $\left(1 + \frac{\delta^2}{2\alpha} + \frac{\delta^4}{16\alpha^2}\right)^{-\alpha}$ | $\left(1 + \frac{\delta^2}{2\alpha}\right)^{-\alpha}$ |
| Matérn ($\nu = 3/2$) | $(1 + \sqrt{3}d)\exp(-\sqrt{3}d)$ | $\left(1 + \sqrt{3}(d_1{+}d_2) + 3 d_1 d_2\right)\exp(-\sqrt{3}(d_1{+}d_2))$ | $d_1 + d_2$ | $(1 + \sqrt{3}\delta)\exp(-\sqrt{3}\delta)$ | $\left(1 + \sqrt{3}\delta + 3(\delta/2)^2\right)\exp(-\sqrt{3}\delta)$ |
| Piecewise polynomial (compact support), $q = 1$, dimension $D$, $j = \lfloor D/2 \rfloor + 2$ | $(1 - d)_+^{j+1}\left((j{+}1)d + 1\right)$ | $\left((1{-}d_1)_+(1{-}d_2)_+\right)^{j+1}\left((j{+}1)^2 d_1 d_2 + (j{+}1)(d_1{+}d_2) + 1\right)$ | $d_1 + d_2$ | $(1 - \delta)_+^{j+1}\left((j{+}1)\delta + 1\right)$ | $\left(1 - \delta + \frac{\delta^2}{4}\right)_+^{j+1}\left((j{+}1)^2(\delta/2)^2 + (j{+}1)\delta + 1\right)$ |

Table 1: Bounds for products of common kernel functions. All kernel functions are from Rasmussen and Williams (2006).


In general, however, we are not so lucky: it is not the case that every kernel we might wish to use has a corresponding product metric such that a product of kernels can be expressed in terms of the product metric. In such cases, we may resort to upper and lower bounds in place of computing the exact kernel value. Note that such bounds are all we require to evaluate the cutoff rule (6), and that when we reach a leaf node representing a specific pair of points we can always evaluate the exact product of kernels directly at that node. As an example, consider the Matérn kernel

$$k_M(d) = (1 + \sqrt{3}\,d)\, \exp(-\sqrt{3}\,d)$$

(where we have taken $\nu = 3/2$); this kernel is popular in geophysics because its sample paths are once-differentiable, as opposed to infinitely smooth as with the SE kernel. Considering the product of two Matérn kernels,

$$k_M(d_1)\, k_M(d_2) = \left(1 + \sqrt{3}(d_1 + d_2) + 3 d_1 d_2\right) \exp\left(-\sqrt{3}(d_1 + d_2)\right),$$

we notice that this is almost equivalent to $k_M(\delta)$ for the choice of $\delta = d_1 + d_2$, but with an additional pairwise term of $3 d_1 d_2$. We bound this term by noting that it is maximized when $d_1 = d_2 = \delta/2$ and minimized whenever either $d_1 = 0$ or $d_2 = 0$, so we have $3(\delta/2)^2 \geq 3 d_1 d_2 \geq 0$. This yields the bounds $k^{prod}_{lower}$ and $k^{prod}_{upper}$ shown in Table 1. Bounds for other common kernels are obtained analogously in Table 1.
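These bounds are straightforward to verify numerically; the following snippet (parameters and ranges are our choices) checks that the product of two Matérn kernels always lies between $k^{prod}_{lower}(\delta)$ and $k^{prod}_{upper}(\delta)$ with $\delta = d_1 + d_2$:

```python
import math, random

SQ3 = math.sqrt(3.0)

def k_matern(d):                          # Matérn kernel, nu = 3/2
    return (1 + SQ3 * d) * math.exp(-SQ3 * d)

def k_lower(delta):                       # bound from 3*d1*d2 >= 0
    return (1 + SQ3 * delta) * math.exp(-SQ3 * delta)

def k_upper(delta):                       # bound from 3*d1*d2 <= 3*(delta/2)^2
    return (1 + SQ3 * delta + 3 * (delta / 2) ** 2) * math.exp(-SQ3 * delta)

random.seed(0)
for _ in range(10000):
    d1, d2 = random.uniform(0, 3), random.uniform(0, 3)
    delta = d1 + d2
    prod = k_matern(d1) * k_matern(d2)
    assert k_lower(delta) - 1e-12 <= prod <= k_upper(delta) + 1e-12
```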

4 Primal/Dual and Mixed GP Representations

In this section, we extend the product tree approach to models combining a long-scale parametric component with a short-scale nonparametric component. We introduce these models, which we refer to as mixed primal/dual GPs, and demonstrate how they can mediate between the desire to model long-scale structure and the need to maintain a short lengthscale for efficiency. (Although this class of models is well known, we have not seen this particular use case described in the literature.) We then show that the necessary computations in these models can be done efficiently using the techniques described above.

4.1 Mixed Primal/Dual GP Models

Although GP regression is commonly thought of as nonparametric, it is possible to implement parametric models within the GP framework. For example, a Bayesian linear regression model with Gaussian prior,

$$y = x^T \beta + \epsilon, \qquad \beta \sim \mathcal{N}(0, I), \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2),$$

is equivalent to GP regression with a linear kernel $k(x, x') = \langle x, x' \rangle$, in the sense that both models yield the same (Gaussian) predictive distributions (Rasmussen and Williams, 2006). However, the two representations have very different computational properties: the primal (parametric) representation allows computation of the predictive mean and variance in $O(D)$ and $O(D^2)$ time respectively, where $D$ is the input dimensionality, while the dual (nonparametric) representation requires time $O(n)$ and $O(n^2)$ respectively for the same calculations. When learning simple models on large, low-dimensional (e.g., spatial) datasets, the primal representation is obviously more attractive, since we can store and compute with model parameters directly, in constant time relative to $n$.
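The equivalence, and the contrast in costs, can be seen in a small numpy sketch of our own: the primal form solves a $D \times D$ system while the dual form solves an $n \times n$ system, yet both yield the same predictive mean and variance under the linear kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, s2 = 50, 3, 0.1                     # training size, input dim, noise var
X = rng.normal(size=(n, D))
y = X @ rng.normal(size=D) + np.sqrt(s2) * rng.normal(size=n)
xs = rng.normal(size=D)                   # a single test point

# Primal: Bayesian linear regression with prior beta ~ N(0, I).
A = X.T @ X / s2 + np.eye(D)              # posterior precision of beta
mean_primal = xs @ np.linalg.solve(A, X.T @ y / s2)
var_primal = xs @ np.linalg.solve(A, xs)  # cost depends on D, not n

# Dual: GP with linear kernel k(x, x') = <x, x'>.
Ky = X @ X.T + s2 * np.eye(n)
ks = X @ xs                               # kernel vector for the test point
mean_dual = ks @ np.linalg.solve(Ky, y)
var_dual = xs @ xs - ks @ np.linalg.solve(Ky, ks)

assert np.isclose(mean_primal, mean_dual)
assert np.isclose(var_primal, var_dual)
```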

Of course, simple parametric models by themselves cannot capture the complex local structure that often appears in real-world datasets. Fortunately it is possible to combine a parametric model with a nonparametric GP model in a way that retains the advantages of both approaches.


Figure 3: A primal/dual mixture approximating a longer-scale GP. Posterior means of four models: (a) polynomial(5), MAD = 1.11; (b) GP with $\ell = 0.02$, MAD = 1.77; (c) mixed polynomial(5) + GP with $\ell = 0.02$, MAD = 0.73; (d) true GP with $\ell_1 = 0.5$, $\ell_2 = 0.02$, MAD = 0.65.

To define a combined model, we replace the standard zero-mean GP assumption with a parametric mean function $h(x)^T \beta$, yielding the model

$$y = f(x) + h(x)^T \beta + \epsilon$$

where $h(x)$ is a vector of feature values $[h_1(x), \ldots, h_D(x)]$. The GP model is then learned jointly along with a posterior distribution on the coefficients $\beta$. Assuming a Gaussian prior $\beta \sim \mathcal{N}(b, B)$ on the coefficients, the predictive distribution $g(X_*) \sim \mathcal{N}(\mu'_*, \Sigma'_*)$ can be derived (Rasmussen and Williams, 2006) as

$$\mu'_* = H_*^T \bar\beta + K_*^T K_y^{-1} (y - H^T \bar\beta) \qquad (7)$$

$$\Sigma'_* = K_{**} - K_*^T K_y^{-1} K_* + R^T (B^{-1} + H K_y^{-1} H^T)^{-1} R \qquad (8)$$

where we define $H_{ij} = h_i(x_j)$ for each training point $x_j$ (so that the $j$th column of $H$ is $h(x_j)$), similarly $H_*$ for the test points, and we have $\bar\beta = (B^{-1} + H K_y^{-1} H^T)^{-1} (H K_y^{-1} y + B^{-1} b)$ and $R = H_* - H K_y^{-1} K_*$. Section 2.7 of Rasmussen and Williams (2006) gives further details.
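For reference, a dense (and deliberately naive, $O(n^3)$) numpy sketch of (7) and (8), with an illustrative SE kernel and linear mean features of our own choosing, is:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 40, 5                                    # training / test sizes
X, Xs = rng.uniform(size=(n, 2)), rng.uniform(size=(m, 2))
y = rng.normal(size=n)

def kern(A, B):                                 # illustrative SE kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / 0.1)

def feats(A):                                   # h(x) = [1, x1, x2]
    return np.vstack([np.ones(len(A)), A.T])

Ky = kern(X, X) + 0.1 * np.eye(n)
Ks, Kss = kern(X, Xs), kern(Xs, Xs)
H, Hs = feats(X), feats(Xs)                     # D x n and D x m
b, B = np.zeros(3), np.eye(3)                   # prior beta ~ N(b, B)

Kinv = np.linalg.inv(Ky)                        # fine at this toy scale
M = np.linalg.inv(B) + H @ Kinv @ H.T
beta = np.linalg.solve(M, H @ Kinv @ y + np.linalg.solve(B, b))
R = Hs - H @ Kinv @ Ks
mu = Hs.T @ beta + Ks.T @ Kinv @ (y - H.T @ beta)              # eq. (7)
Sigma = Kss - Ks.T @ Kinv @ Ks + R.T @ np.linalg.solve(M, R)   # eq. (8)
print(mu.shape, Sigma.shape)                    # (5,) and (5, 5)
```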

Note that linear regression in this framework corresponds to a choice of basis functions $h_1(x) = 1$ and $h_2(x) = x$; it is straightforward to extend this to polynomial regression and other models that are linear in their parameters. In general, any kernel which maps to a finite-dimensional feature space can be represented parametrically in that feature space, so this framework can efficiently handle kernels of the form $k(x, x') = \sum_i k_i(x, x') + k_S(x, x')$, where $k_S$ is a short-lengthscale or compactly supported kernel, monotonically decreasing w.r.t. some distance metric as assumed above, and each $k_i$ either has an exact finite-dimensional feature map or can be approximated using finite-dimensional features (Rahimi and Recht, 2007; Vedaldi and Zisserman, 2010).

As an example, Figure 3 compares several approaches for inferring a function from a GP with long and short-lengthscale components. We drew training data from a GP with a mixture of two SE kernels at lengthscales $\ell_1 = 0.5$ and $\ell_2 = 0.02$, sampled at 1000 random points in the unit square. Figure 3 displays the posterior means of four models on a 100 by 100 point grid, reporting the mean absolute deviation (MAD) of the model predictions relative to the "true" values (drawn from the same GP) at 500 random test points. Note that although the short-scale GP (3b) cannot by itself represent the variation from the longer-scale kernel, when combined with a parametric polynomial component (3a) the resulting mixed model (3c) achieves accuracy approaching that of the true model (3d).

4.2 Efficient Operations in Primal/Dual Models

Likelihood calculation in primal/dual models is a straightforward extension of the standard case. The predictive mean (7) can be accommodated within the framework of Shen et al. (2006) using a tree representation of the vector $K_y^{-1}(y - H^T \bar\beta)$, then adding in the easily evaluated parametric component $H_*^T \bar\beta$. In the covariance (8) we can use a product tree to approximate $K_*^T K_y^{-1} K_*$ as described above; of the remaining terms, $\bar\beta$ and $B^{-1} + H K_y^{-1} H^T$ can be precomputed at training time, and $H_*$ and $K_{**}$ don't depend on the training set. This leaves $H K_y^{-1} K_*$ as the one remaining challenge; we note that this quantity can be computed efficiently using $mD$ applications of the vector multiplication primitive from Shen et al. (2006), re-using the same tree structure to multiply each column of $K_*$ by each row of $H K_y^{-1}$. Thus, all of the operations required for likelihood computation can be implemented efficiently with no explicit dependence on $n$ (i.e., with no direct access to the training set except through space-partitioning tree structures).


Figure 4: Mean runtimes for dense, sparse, hybrid, and product tree calculation of GP variance on a 2D synthetic dataset.

5 Evaluation

We compare calculation of the predictive variance using a product tree to several other approaches: a naive implementation using dense matrices, a direct calculation using a sparse representation of $K_y^{-1}$ and a dense representation of $k_{*i}$, and a hybrid tree implementation that attempts to also construct a sparse $k_{*i}$ by querying a cover tree for all training points within distance $r$ of the query point $x_{*i}$, where $r$ is chosen such that $k(r')$ is negligible for $r' > r$, and then filling in only those entries of $k_{*i}$ determined to be non-negligible.

Our product tree implementation is a Python extension written in C++, based on the cover tree implementation of Beygelzimer et al. (2006) and implementing the optimizations from section 3.1. The approximation parameters $\epsilon_{rel}$ and $\epsilon_{abs}$ were set appropriately for each experiment so as to ensure that the mean approximation error is less than 0.1% of the exact variance. All sparse matrix multiplications are in CSR format using SciPy's sparse routines; we impose a sparsity threshold of $10^{-8}$, such that any entry less than the threshold is set to zero.

Figure 4 compares performance of these approaches on a simple two-dimensional synthetic data set, consisting of points sampled uniformly at random from the unit square. We train a GP on $n$ such points and then measure the average time per point to compute the predictive variance at 1000 random test points. The GP uses an SE kernel with observation noise $\sigma_n^2 = 0.1$ and lengthscale $\ell = \sqrt{v/(\pi n)}$, where $v$ is a parameter indicating the average number of training points within a one-lengthscale ball of a random query point (thus, on average there will be $4v$ points within two lengthscales, $9v$ within three lengthscales, etc.).

The results of Figure 4 show a significant advantage for the tree-based approaches, which are able to take advantage of the geometric sparsity structure in the training data. The dense implementation is relatively fast on small data sets but quickly blows up, while the sparse calculation holds on longer (except in the relatively dense $v = 5.0$ setting) but soon succumbs to linear growth, since it must evaluate the kernel between the test point and each training point. The hybrid approach has higher overhead but scales very efficiently until about $n = 48000$, where the sparse matrix multiplication's $\Omega(n)$ runtime (Bank and Douglas, 1993) begins to dominate. Conversely, the product tree remains efficient even for very large, sparse datasets, with $v = 0.25$ runtimes growing from 0.08ms at $n = 1000$ to just 0.13ms at $n = 200000$. Due to memory limitations we were unable to evaluate $v = 1.0$ and $v = 5.0$ for values of $n$ greater than 32000.

Our second experiment uses amplitude data from 3105 seismic events (earthquakes) detected by a station in Fitzroy Crossing, Australia; the event locations are shown in Figure 1. The amplitudes are normalized for event magnitude, and the task is to predict the recorded amplitude of a new event given that event's latitude, longitude, and depth. Here our distance metric is the great-circle distance, and we expect our data to contain both global trends and local structure, since events further away from the detecting station will generally have lower amplitudes, but this may vary locally, as signals from a given source region generally travel along the same paths through the earth and are dampened or amplified in the same ways as they travel to the detecting station.

Table 2 considers several models for this data. A simple parametric model, the fifth-order polynomial in event-to-station distance shown in Figure 5, is not very accurate but does allow for very fast variance evaluations. The GP models are more accurate, but the most accurate GP model uses a relatively long lengthscale of 50km, with correspondingly slow variance calculations. Depending on application requirements, the most appealing tradeoff might be given by the mixed model combining a fifth-degree polynomial with a 10km SE GP: this model achieves accuracy close to that of the 50km models, but with significantly faster variance calculations due to the shorter lengthscale, especially when using a product tree.


Figure 5: Normalized amplitude as a function of event-station distance, with a fifth-degree polynomial fit; the shaded region shows ±2 standard deviations.

| Model | Error | Sparse (ms) | Tree (ms) |
| Polynomial in distance (deg 5) | 0.78 | 0.050 | n/a |
| GP, SE, $\ell$ = 10km | 0.67 | 0.722 ± 0.032 | 0.216 ± 0.224 |
| poly/GP, deg 5, SE, 10km | 0.62 | 0.795 ± 0.033 | 0.413 ± 0.307 |
| GP, Matérn, $\ell$ = 10km | 0.65 | 1.256 ± 0.592 | 0.337 ± 0.365 |
| poly/GP, deg 5, Matérn, 10km | 0.62 | 1.327 ± 0.602 | 0.654 ± 0.499 |
| GP, SE, $\ell$ = 50km | 0.61 | 1.399 ± 0.661 | 1.168 ± 1.242 |
| poly/GP, deg 5, SE, 50km | 0.60 | 1.490 ± 0.677 | 1.551 ± 1.409 |

Table 2: Models for Fitzroy Crossing amplitude data, with mean absolute prediction error from five-fold cross-validation and (mean ± std) time to compute the predictive variance via a direct sparse calculation versus a product tree.

6 Related Work

Previous approximations for GP mean prediction (Moore et al., 1997; Shen et al., 2006; Gray, 2004), which inspired this work, use tree structures to implement an efficient matrix-vector multiplication (MVM); the Improved Fast Gauss Transform (Morariu et al., 2008) also implements fast MVM for the special case of the SE kernel. It is possible to accelerate GP training by combining MVM methods with a conjugate gradient solver, but models thus trained do not allow for the computation of predictive variances. One argument against MVM techniques (and, by extension, our product tree approach) is that their efficiency requires shorter lengthscales than are common in machine learning applications (Murray, 2009); however, we have found them quite effective on datasets which do have genuinely sparse covariance structure (e.g., geospatial data), or in which the longer-scale variation can be represented by a parametric component.

Another set of approaches to speeding up GP regression, sparse approximations (Csato and Opper, 2002; Seeger et al., 2003; Snelson and Ghahramani, 2006; Quinonero-Candela and Rasmussen, 2005), attempt to represent $n$ training points using a smaller set of $m$ points, allowing training in $O(nm^2)$ time and predictive covariance (thus likelihood) computation in $O(m^2)$ time. This is philosophically a different approach from that of this paper, where we generally want to retain all of our training points in order to represent local structure. However, there is no formal incompatibility: many sparse approaches, including all of those discussed by Quinonero-Candela and Rasmussen (2005), yield predictive covariances of the form $k_{*i}^T Q k_{*j}$ for some matrix $Q$ (or a sum of terms of this form), where this product could be computed straightforwardly using a product tree. Several non-sparse approximations, e.g., the Nystrom approximation (Williams and Seeger, 2001), also yield predictive covariances of this form.

More closely related to our setting are local approximations, in which different GPs are trained in different regions of the input space. There is some evidence that these can provide accurate predictions which are very fast to evaluate (Chalupka et al., 2013); however, they face boundary discontinuities and inaccurate uncertainty estimates if the data do not naturally form independent clusters. Since training multiple local GPs is equivalent to training a single global GP with a block diagonal covariance matrix, it should be possible to enhance local GPs with global parametric components as in section 4, similarly to the combined local/global approximation of Snelson and Ghahramani (2007).

7 Conclusion and Future Work

We introduce the product tree structure for efficient adaptive calculation of GP covariances using a multiresolution clustering of pairs of training points. Specific contributions of this paper include product metrics and bounds for common kernels, the adaptation to metric trees, a novel cutoff rule incorporating both the weights and the quantity being summed over, and covariance-specific performance optimizations. Additionally, we describe efficient calculation in GP models incorporating both primal and dual components, and show how such models can model global-scale variation while maintaining the efficiency of short-lengthscale GPs.

A limitation of our approach is the need to explicitly invert the kernel matrix during training; this can be quite difficult for large problems. One avenue for future work could be an iterative factorization of $K_y$ analogous to the CG training performed by MVM methods (Shen et al., 2006; Gray, 2004; Morariu et al., 2008). Another topic would be a better understanding of cutoff rules for the weighted sum recursion, e.g., an empirical investigation of different rules or a theoretical analysis bounding the error and/or runtime of the overall computation.

Finally, although our work has been focused primarily on low-dimensional applications, the use of cover trees instead of k-d trees ought to enable an extension to higher dimensions. We are not aware of previous work applying tree-based regression algorithms to high-dimensional data, but as high-dimensional covariance matrices are often sparse, this may be a natural fit. For high-dimensional data that do not lie on a low-dimensional manifold, other nearest-neighbor techniques such as locality-sensitive hashing (Andoni and Indyk, 2008) may have superior properties to tree structures; the adaptation of such techniques to GP regression is an interesting open problem.

References

Andoni, A. and Indyk, P. (2008). Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122.

Bank, R. E. and Douglas, C. C. (1993). Sparse matrix multiplication package (SMMP). Advances in Computational Mathematics, 1(1):127–137.

Beygelzimer, A., Kakade, S., and Langford, J. (2006). Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 97–104.

Chalupka, K., Williams, C. K., and Murray, I. (2013). A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14:333–350.

Csato, L. and Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 14(3):641–668.

Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226.

Gneiting, T. (2002). Compactly supported correlation functions. Journal of Multivariate Analysis, 83(2):493–508.

Gray, A. (2004). Fast kernel matrix-vector multiplication with application to Gaussian process learning. Technical Report CMU-CS-04-110, School of Computer Science, Carnegie Mellon University.

Moore, A. W., Schneider, J., and Deng, K. (1997). Efficient locally weighted polynomial regression predictions. In Proceedings of the 14th International Conference on Machine Learning (ICML).

Morariu, V., Srinivasan, B. V., Raykar, V. C., Duraiswami, R., and Davis, L. (2008). Automatic online tuning for fast Gaussian summation. Advances in Neural Information Processing Systems (NIPS), 21:1113–1120.

Murray, I. (2009). Gaussian processes and fast matrix-vector multiplies. In Numerical Mathematics in Machine Learning workshop at the 26th International Conference on Machine Learning (ICML 2009).

Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.

Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems (NIPS), 20:1177–1184.

Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press.

Seeger, M., Williams, C. K., and Lawrence, N. D. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics (AISTATS), volume 9.

Shen, Y., Ng, A., and Seeger, M. (2006). Fast Gaussian process regression using kd-trees. In Advances in Neural Information Processing Systems (NIPS), volume 18, page 1225.

Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS).

Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics (AISTATS), volume 11.

Uhlmann, J. K. (1991). Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179.

Vanhatalo, J. and Vehtari, A. (2008). Modelling local and global phenomena with sparse Gaussian processes. In Proceedings of Uncertainty in Artificial Intelligence (UAI).

Vedaldi, A. and Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3539–3546. IEEE.

Williams, C. and Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS).


Latent Topic Analysis for Predicting Group Purchasing Behavior on the Social Web

Feng-Tso Sun, Martin Griss, and Ole Mengshoel
Electrical and Computer Engineering Department

Carnegie Mellon University

Yi-Ting Yeh
Computer Science Department

Stanford University

Abstract

Group-deal websites, where customers purchase products or services in groups, are an interesting phenomenon on the Web. Each purchase is kicked off by a group initiator, and other customers can join in. Customers form communities with people with similar interests and preferences (as in a social network), and this drives bulk purchasing (similar to online stores, but in larger quantities per order, thus customers get a better deal). In this work, we aim to better understand what factors influence customers' purchasing behavior on such social group-deal websites. We propose two probabilistic graphical models, i.e., a product-centric inference model (PCIM) and a group-initiator-centric inference model (GICIM), based on Latent Dirichlet Allocation (LDA). Instead of merely using customers' own purchase history to predict purchasing decisions, these two models include other social factors. Using a lift curve analysis, we show that by including social factors in the inference models, PCIM achieves 35% of the target customers within 5% of the total number of customers, while GICIM is able to reach 85% of the target customers. Both PCIM and GICIM outperform random guessing and models that do not take social factors into account.

1 Introduction

Group purchasing is a business model that offers various deals-of-the-day and an extra discount depending on the size of the purchasing group. After group-deal websites, such as Groupon and LivingSocial, gained attention, similar websites, such as ihergo (http://www.ihergo.com) and Taobao (http://www.taobao.com), have introduced social networks as a feature for their users. These group-deal websites provide an interesting hybrid of social networks (e.g., Facebook.com and LinkedIn.com) and online stores (e.g., Amazon.com and Buy.com). Customers form communities with people with similar interests and preferences (as in a social network), and this drives bulk purchasing (similar to online stores, but in larger quantities per order, thus customers get a better deal). As we see more and more social interactions among customers on group-deal websites, it is critical to understand the interplay between social factors and purchasing preferences.

In this paper, we analyze a transactional dataset from the largest social group-deal website in Taiwan, ihergo.com. Figure 1 shows a screenshot from the group-deal page of ihergo.com. Each group-purchasing event on ihergo.com consists of three major components: (1) a group initiator, (2) a number of group members, and (3) a group-deal product. A group initiator starts a group-purchasing event for a specific group-deal product. While this event will be posted publicly, the group initiator's friends will also be notified. A user can choose to join the purchasing event to become a group member.

Group initiators play important roles on this kind of group-deal website. Usually, the merchants offer incentives for the group initiators to initiate group-purchasing events by giving them products for free if the size of the group exceeds some threshold. In addition, to save shipping costs, the group can choose to have the whole group-deal order shipped to the initiator. In this case, the initiator needs to distribute the products to group members in person. Hence, the group members usually reside or work in the proximity of the group initiator. Sometimes, they are friends or co-workers of the initiator.

Understanding customers' purchasing behavior in this kind of social group-purchasing scenario could help group-deal websites strategically design their offerings. Traditionally, customers search for or browse products of interest on websites like Amazon.com. However, on social group-deal websites, customers can not only perform product search, but can also browse group deals and search for initiators by ratings and locations. Therefore, a good recommender system [1] for social group-deal websites should take this into account. If the website can predict which customers are more likely to join a group-purchasing event started by a specific initiator, it can maximize group sizes and merchants' profits in a shorter period of time by delivering targeted advertising. For example, instead of spamming everyone, the website can send out notifications or coupons to the users who are most likely to join the group-purchasing events.

In this work, we aim to predict potential customers who are most likely to join a group-purchasing event. We apply Latent Dirichlet Allocation (LDA) [2] to capture customers' purchasing preferences, and evaluate our proposed predictive models on a one-year group-purchasing dataset from ihergo.com.

Our contributions in understanding the importance of social factors for group-deal customers' decisions are the following:

• A new type of group-purchasing dataset. We introduce and analyze a new type of group-purchasing dataset, which consists of 5,602 users, 26,619 products and 13,609 group-purchasing events.

• Predictive models for group-deal customers. Based on topic models, we propose two predictive models that include social factors. They achieve higher prediction accuracy compared to the baseline models.

In the next section, we describe related work in the areas of group purchasing behavior, social recommendations, and topic models for customer preferences. Section 3 introduces and analyzes the characteristics of our real-world group-purchasing dataset. In Section 4, we first review LDA, then present our two proposed predictive models for group-deal customer prediction. Experimental results are given in Section 5. Finally, conclusions and future research directions are presented in Section 6.

2 Related Work

In this section, we review related work in three areas: (1) group purchasing behavior, (2) social recommendations, and (3) topic models for customer preferences.

Figure 1: Screenshot of the group-deal page from ihergo.com, annotated with its main elements: the initiator, the number of group members, the group-deal product, the active period of the group deal, filters by initiator's rating and meet-up location, and a button to initiate a group.

Group Purchasing Behavior. Since group-deal websites such as Groupon and LivingSocial gained attention, several studies have been conducted to understand factors influencing group purchasing behavior. Byers et al. analyzed purchase histories of Groupon and LivingSocial [3]. They showed that Groupon optimizes deal offers strategically by giving "soft" incentives, such as deal scheduling and duration, to encourage purchases. Byers et al. also compared Groupon and LivingSocial sales with additional datasets from Yelp's reviews and Facebook's like counts [4]. They showed that group-deal sites benefit significantly from word-of-mouth effects on users' reviews during sales events. Edelman et al. studied the benefits and drawbacks of using Groupon from the point of view of the merchants [6]. Their work modeled whether advertising and price discrimination effects can make discounts profitable. Ye et al. introduced a predictive dynamic model for group purchasing behavior. This model incorporates social propagation effects to predict the popularity of group deals as a function of time [19]. In this work, we focus on potential customer prediction, as opposed to modeling the overall deal purchasing sales over time.

Social Recommendations. In real life, a customer's purchasing decision is influenced by his or her social ties. Guo et al. analyzed the dataset from the largest Chinese e-commerce website, Taobao, to study the relationship between information passed among buyers and purchasing decisions [7]. Leskovec et al. used a stochastic model to explain the propagation of recommendations and cascade sizes [11]. They showed that social factors have a different level of impact on user purchasing decisions for different products. Moreover, previous work has also tried to incorporate social information into existing recommendation techniques, such as collaborative filtering [13, 14, 20, 12]. Recently, many recommendation systems have been implemented that take advantage of social network information in addition to users' preferences to improve recommendation accuracy. For example, Yang et al. proposed a Bayesian-inference based movie recommendation system for online social networks [18]. Our work considers the relationship between the group initiator and the group members as a social tie to augment customer prediction for group-purchasing events.

Topic Models for Customer Preference. Topic models such as LDA have been widely and successfully used in many applications, including language modeling [2], text mining [17], human behavior modeling [9], social network analysis [5], and collaborative filtering [8]. Researchers have also proposed new topic models for purchasing behavior modeling. For example, topic models have been extended with price information to analyze purchase data [10]. By estimating the mean and the variance of the price for each product, the proposed model can cluster related items by taking their price ranges into account. Iwata and Watanabe proposed a topic model for tracking time-varying consumer purchase behavior, in which consumer interests and item trends change over time [9]. In this paper, we use LDA to learn topic proportions from purchase histories to represent customers' purchasing preferences.

3 Group-Purchasing Dataset

The dataset for our analysis comes from users' transactional data on a group-deal website, ihergo, the largest social group-deal website in Taiwan. We collected longitudinal data between October 1st 2011 and October 1st 2012. From the users' geographical profiles, we are able to group them based on their living area. For this study, we include all 5,602 users living in Taipei, the capital of Taiwan. In total, our dataset contains 26,619 products and 13,609 group-purchasing events.

On ihergo, users can purchase a product by joining a group-purchasing event. There are two roles among the users: 1) the group initiator and 2) the group member. A group initiator initiates a purchase group which other users can join to become group members. Once the group size exceeds some threshold, the group members can get a discount on the product while the initiator can get the product for free. Sometimes the group initiator and the group members already know each other before they join the same group-purchasing event; sometimes they become friends after the event. Moreover, each user can become a follower of a group initiator. When a new group-purchasing event is initiated by an initiator, the system will notify his or her followers.

Figure 2: Part of the group-deal graph for the ihergo dataset: illustration of the member-centric relationships between group members and initiators from a subset of randomly sampled joined group members. A directed edge points from a joined customer (dark blue) to an initiator (light orange).

Each group-purchasing deal in our dataset is composed of a set of attributes: the product description (e.g., discounted price, limited quantity, and product category), the group size, the group initiator, the group members, and the time period in which the deal is active. Group-purchasing deals are defined by a time series $D = (D_1, D_2, \ldots, D_n)$, where $D_i$ is a tuple $(t, p, o, m)$ denoting that a group-purchasing deal for product $p$ is initiated by an organizer (initiator) $o$ with joined group members $m = \{m_1, \ldots, m_k\}$ at time $t$.

We represent group-purchasing events as a directed graph. Each user is a vertex in the graph. For every group-purchasing deal, we build directed edges from each group member to the initiator. There are 5,602 vertices and 16,749 edges in our ihergo dataset. The directed edges are defined by $E = \bigcup_{i \in [1,n]} \bigcup_{j \in [1,d(i)]} \{(m_{i,j}, o_i)\}$, where $d(i)$ is the number of joined customers for group deal $i$. The vertices in the graph are defined by $V = M \cup O$, where $M$ denotes all group members, $M = m_1 \cup \ldots \cup m_n$, and $O$ denotes all group-purchasing organizers (initiators), $O = o_1 \cup \ldots \cup o_n$.
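As an illustration of this construction (using hypothetical deal tuples and plain Python containers rather than the authors' data), the edge set and the out-degree counts discussed below can be built as follows:

```python
from collections import defaultdict

# Hypothetical deal tuples (t, p, o, m): time, product, initiator, members.
deals = [(1, "p1", "alice", ["bob", "carol"]),
         (2, "p2", "alice", ["bob"]),
         (3, "p3", "dave", ["carol"])]

edges = set()                         # directed edges: member -> initiator
for t, p, o, members in deals:
    for member in members:
        edges.add((member, o))

# Out-degree = number of distinct initiators a member has co-bought with,
# the quantity plotted in Figure 3.
out_degree = defaultdict(int)
for member, o in edges:
    out_degree[member] += 1
print(dict(out_degree))               # {'bob': 1, 'carol': 2}
```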


Figure 3: Node out-degree distribution for the group-purchasing graph of ihergo. 80% of the users follow five or fewer initiators.

Figure 2 illustrates the joined-customer-centric graph structure by showing the relationships among a subset of randomly sampled joined customers. Light orange and dark blue vertices represent the group initiators and group members, respectively. According to this dataset, each user has joined 84 group-purchasing events on average. However, one interesting observation from the graph is that the number of outgoing edges from dark blue vertices is far less than 84. This property can be seen even more clearly in the out-degree distribution for the overall group-purchasing graph shown in Figure 3. We see that 80% of the users only join group-purchasing events initiated by 5 or fewer different initiators. Group members have a tendency to repeatedly join group-purchasing events initiated by a relatively small number of initiators they have co-bought with before.

Therefore, we hypothesize that customers' purchasing decisions are not only influenced by their own purchasing preferences but also strongly influenced by who the group initiator is. In the next section, we propose two new models to predict which customers are most likely to join a particular group-purchasing event.

4 Methodology

In this section, we first describe in Section 4.1 how we apply topic modeling to learn user purchasing preferences under the group-purchasing scenario. During the training phase, we compute for each user a mixture topic proportion by combining the topic proportions of this user and of the initiators with whom this user has co-bought products.

Given a new group-purchasing event, we would like to predict which customers are more likely to join. We propose two predictive models in Section 4.2. One model, which we denote the product-centric inference model (PCIM), computes the posterior probability that a user would purchase this product given his or her mixture topic proportion. The other model, which we denote the group-initiator-centric inference model (GICIM), computes the posterior probability that a user would join the group-purchasing event initiated by this initiator, again given his or her mixture topic proportion.

Figure 4: Graphical model representation of the latent Dirichlet allocation model.

4.1 Topic Model for User Purchasing Preference

We use topic modeling to characterize a user's purchasing preference. In particular, we apply LDA to our group-purchasing dataset. In a typical LDA model for text mining [2], a document is a mixture of a number of hidden topics, which can be represented by a multinomial distribution, i.e., the topic proportion. A word can belong to one or more hidden topics with different probabilities. Figure 4 shows the graphical model for LDA. LDA is a generative model where each word in a document is generated in two steps: 1) sample a topic from the document's topic distribution, and 2) draw a word from that topic. One can also use Bayesian inference to learn the particular topic proportion of each document.

In our model, we treat a user's purchase history as a document. Each purchased product can be seen as a word in a document. We make an analogy between text documents and purchasing patterns as shown in Table 1. We replace words with purchased products, and a document is one user's purchase history. Assume that there are $U$ users in our training data, and let $\mathcal{U}$ denote the set of users. Each user $u \in \mathcal{U}$ has a vector of purchased products $x_u = \{x_{un}\}_{n=1}^{N_u}$, where $N_u$ is the number of products that user $u$ purchased.

The generative process of the LDA model for learning a user's purchasing preferences is described as follows. Each user $u$ has his or her own topic proportion (i.e., purchasing preference) $\theta_u$ that is sampled from a Dirichlet distribution. Next, for each product $x_{un}$ purchased by user $u$, a topic $z_{un}$ is first chosen from the user's topic proportion $\theta_u$. Then, a product $x_{un}$ is drawn from the multinomial distribution $\phi_{z_{un}}$. To estimate $\theta_u$ and $\phi_{z_{un}}$, we use the collapsed Gibbs sampling method [15].
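To make the purchase-history-as-document analogy concrete, a topic model of this form can be fit with an off-the-shelf LDA implementation. The sketch below uses gensim (our choice of library; note that gensim fits LDA by variational inference rather than the collapsed Gibbs sampler used in the paper) on toy purchase histories:

```python
from gensim import corpora, models

# Each "document" is one user's purchase history; "words" are product IDs.
histories = [["pasta_cream", "pasta_pesto", "pork_steak"],
             ["knit_hat", "wool_scarf", "legging", "knit_scarf"],
             ["ham_sandwich", "cheese_roll", "pasta_cream"]]

dictionary = corpora.Dictionary(histories)
corpus = [dictionary.doc2bow(h) for h in histories]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      alpha="auto", passes=10, random_state=0)

# theta_u: user u's topic proportion, i.e., their purchasing preference.
theta_u = lda.get_document_topics(corpus[0], minimum_probability=0.0)
print(theta_u)
```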


| Symbol | Description for group purchase history | Description for text documents |
| $U$ | Number of users | Number of documents |
| $K$ | Number of latent topics | Number of latent topics |
| $N_u$ | Number of purchased products of user $u$ | Number of words of document $u$ |
| $z_{un}$ | Latent co-purchasing category of $n$th product of user $u$ | Latent topic of $n$th word of document $u$ |
| $x_{un}$ | $n$th purchased product of user $u$ | $n$th word of document $u$ |
| $\theta_u$ | Latent co-purchasing category proportion for user $u$ | Topic proportion for document $u$ |
| $\phi_k$ | Multinomial distribution over products for topic $k$ | Multinomial distribution over words for topic $k$ |
| $\alpha$ | Dirichlet prior parameters for all $\theta_u$ | Dirichlet prior parameters for all $\theta_u$ |
| $\beta$ | Dirichlet prior parameters for all $\phi_k$ | Dirichlet prior parameters for all $\phi_k$ |

Table 1: Latent Dirichlet allocation plate model notation.

4.2 Proposed Models for Predicting Group-Deal Customers

A group-purchasing event contains two kinds of critical information: who the group initiator is and what the group-deal product is. Our goal is to predict which customers are more likely to join a specific group-purchasing event. Intuitively, one may think that whether a customer would join a group-purchasing event solely depends on what the group-deal product is. However, from our observations of the dataset, we hypothesize a correlation between a customer's purchasing decision and who the group initiator is. Therefore, we would like to study how these two kinds of group-purchasing information affect prediction accuracy by asking two questions:

1. What is the likelihood that a customer would join the event given what the group-deal product is?

2. What is the likelihood that a customer would join the event given who the group initiator is?

This leads to our two proposed predictive models, the product-centric inference model (PCIM) and the group-initiator-centric inference model (GICIM).

4.2.1 Product Centric Inference Model (PCIM)

Figure 5(a) shows the graphical structure of PCIM. For each user, we train a PCIM. PCIM computes the posterior probability that a user would purchase a product given his or her mixture topic proportion. Let $C$ denote the user's own topic proportion, which we learned from LDA. Supposing that this user has joined group-purchasing events initiated by $n$ group initiators, we use $\{I_i\}_{i=1}^{n}$ to denote the learned topic proportions of these initiators. Our model computes the weighted topic proportions of initiators, $W$, by linearly combining $\{I_i\}_{i=1}^{n}$ with the frequency distribution of the user's co-purchases with them.

Intuitively, if a user joins a group-purchasing event initiated by a group initiator, they might share similar interests. Therefore, our model characterizes the user's purchasing preferences by a weighting scheme that combines $C$ and $W$ with a weighting parameter $w$. We use $M$ to denote this mixture topic proportion, which encodes the overall purchasing preferences of the user.
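A minimal sketch of this weighting scheme, with toy numbers and names of our own, computes $W$ from the initiators' topic proportions and co-purchase frequencies and then mixes it with $C$:

```python
import numpy as np

C = np.array([0.7, 0.1, 0.1, 0.1])        # user's own topic proportion (K=4)
I = np.array([[0.2, 0.6, 0.1, 0.1],       # topic proportions of the n = 2
              [0.1, 0.1, 0.7, 0.1]])      # initiators the user co-bought with
freq = np.array([3.0, 1.0])               # co-purchase counts per initiator

W = (freq / freq.sum()) @ I               # weighted initiator proportions
w = 0.5                                   # weighting parameter
M = w * C + (1 - w) * W                   # mixture topic proportion
assert np.isclose(M.sum(), 1.0)
```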

Let $P$ denote the product random variable, with $\Omega(P) = \{p_1, \ldots, p_m\}$, where $p_i$ is a product. From each data record $D_i \in D$, we have a tuple $(t_i, p_i, o_i, m_i)$ and know which group-deal product $p_i$ corresponds to a particular group-purchasing event $e_i$. Our goal is to compute $\Pr(P = p_i)$, the probability that the user would join a group-purchasing event $e_i$ to buy a product $p_i$.

Given the topic proportions $C$ and $\{I_i\}_{i=1}^{n}$ corresponding to the user, and the weighting parameter $w$, we are able to compute

$$\Pr(P) = \sum_{Y = X_p \setminus P} \Pr(P, Y) \qquad (1)$$

where $X_p = \{P, M, C, W, I_1, \ldots, I_n\}$; $P$ is a product random variable; and $M$ is a mixture topic proportion.

To predict which users are more likely to join a group-purchasing event, we rank $P^{(u_1)}_{p_i}, \ldots, P^{(u_U)}_{p_i}$ in descending order, where $u_1, \ldots, u_U$ denotes the set of users in our dataset and $P^{(u_j)}_{p_i}$ denotes $\Pr(P = p_i)$ for user $u_j$.
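Under one natural reading of this procedure (our simplification, not necessarily the authors' exact inference), scoring reduces to reading off each product's probability under the user's mixture proportion and sorting users by that score:

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_products, n_users = 4, 10, 5
phi = rng.dirichlet(np.ones(n_products), size=K)   # phi_k: products per topic
M = rng.dirichlet(np.ones(K), size=n_users)        # users' mixture proportions

def pcim_score(M_u, product_id):
    # Pr(P = p | M_u) = sum_k M_uk * phi_k[p]: one reading of eq. (1).
    return float(M_u @ phi[:, product_id])

pi = 7                                             # the group-deal product
scores = [pcim_score(M[u], pi) for u in range(n_users)]
ranking = np.argsort(scores)[::-1]                 # most likely joiners first
print(ranking)
```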

4.2.2 Group Initiator Centric Inference Model (GICIM)

The graphical illustration of GICIM is shown in Figure 5(b). GICIM computes the posterior probability that a user would join a group-purchasing event initiated by a particular initiator, given his or her mixture topic proportion. GICIM and PCIM differ only in their leaf nodes. While PCIM considers only what the group-deal product is, GICIM models our observation that the decision of whether or not a user joins a group-purchasing event is strongly influenced by who the group initiator of that event is.


Figure 5: Our proposed models for predicting potential customers given a group-purchasing event. (a) Product centric inference model (PCIM). (b) Group initiator centric inference model (GICIM).

Let $I$ denote the initiator random variable and $o_i$ denote the initiator of the group-purchasing event $e_i$. Again, from each data record $D_i \in D$, we know who the group initiator is. Instead of evaluating $\Pr(P = p_i)$ as in PCIM, we use GICIM to compute

$$\Pr(I) = \sum_{Y = X_p \setminus I} \Pr(I, Y) \qquad (2)$$

where $X_p = \{I, M, C, W, I_1, \ldots, I_n\}$; $I$ is an initiator random variable; and $M$ is a mixture topic proportion.

To predict which users are more likely to join a group-purchasing event $e_i$, we rank $P^{(u_1)}_{o_i}, \ldots, P^{(u_U)}_{o_i}$ in descending order, where $u_1, \ldots, u_U$ denotes the set of users in our dataset and $P^{(u_j)}_{o_i}$ denotes $\Pr(I = o_i)$ for user $u_j$.

5 Experimental Evaluation

5.1 Data Pre-processing

We evaluate the proposed PCIM and GICIM models on the ihergo group-purchasing dataset. In order to capture meaningful user purchasing preferences, we remove users who purchased fewer than 10 products during the pre-processing step. We use ten-fold cross-validation to generate our training and testing datasets.

5.2 LDA Topic Modeling on the Group-Purchasing Dataset

To measure the performance of LDA for different numbers of topics (20, 40, 60, 80, 100) on our group-purchasing dataset, we compute the perplexity.

Figure 6: Perplexity as a function of the number of collapsed Gibbs sampling iterations, for different numbers of topics (20, 40, 60, 80, 100) used in LDA.

Perplexity measures how well the model generalizes and predicts new documents [2]. Figure 6 shows that the perplexity decreases as the number of iterations increases, converging within 200 iterations. In addition, as we increase the number of topics, the perplexity decreases. Unless mentioned specifically, all topic proportions used in our experiments are learned with LDA using 100 topics.

Figure 7 shows three example product topics learned by LDA using 100 topics. Each table lists the ten products that are most likely to be bought in that topic. The columns give the product name, the probability of the product being purchased in that topic, and the ground-truth category of the product, respectively. We see that Topic 1 is about "pasta": it contains a variety of cooked pasta and pasta sauces. Topics 18 and 53 are respectively about "bread and cakes" and "women accessories".

Figure 8 shows the topic proportions of four randomly selected users, learned by LDA using 60 topics. We see that different users have distinguishable topic proportions, representing their purchasing preferences. For example, user #3617 purchased many products that are about "beauty" and "clothing", so her or his topic proportion has higher probabilities at topic 4 and topic 17. Similarly, user #39 tends to buy products in the "dim sum" category, which is reflected in her or his topic proportion.


Figure 8: Examples illustrating the learned topic proportions of four randomly selected users: (a) User #3617, (b) User #5, (c) User #2618, (d) User #39. Each panel plots probability against topic index. For user #3617, the two most probable topics are "beauty" and "clothing".

Figure 7: Illustration of product topics created by LDA. Each panel lists the ten most probable products for one topic; the category column is ground truth. (a) Topic 1, "pasta". (b) Topic 18, "bread and cakes". (c) Topic 53, "women accessories".


Product                        Prob.    Category
Chicken pasta (cream sauce)    0.0197   Pasta
Chicken pasta (pesto sauce)    0.0193   Pasta
Pork pasta (tomato sauce)      0.0167   Pasta
Pork steak                     0.0155   Meat
Bacon pasta (cream sauce)      0.0153   Pasta
Spicy pasta (tomato sauce)     0.0149   Pasta
Clam garlic linguine           0.0146   Pasta
Tomato sauce pasta             0.0142   Pasta
German sausage sauce           0.0132   Pasta
Italian pasta (cooked)         0.0129   Pasta

Table 2: Illustration of product topics created by LDA (Topic 1, "pasta"). Category is from ground truth.

Product                  Prob.    Category
Ham sandwich             0.0101   Bread
Cheese sandwich          0.0089   Bread
Milk bar cookie          0.0080   Cookie
Cherry chocolate tart    0.0078   Cake
Cheese roll              0.0077   Cake
Cheese almond tart       0.0074   Cake
Taro toast               0.0073   Bread
Creme Brulee             0.0073   Cake
Raisin toast             0.0071   Bread
Wheat ham sandwich       0.0070   Bread

Table 3: Topic 18, "bread and cakes".

Product             Prob.    Category
Knit Hat            0.0169   Accessory
Knit Scarf          0.0165   Accessory
Legging             0.0133   Clothing
Wool scarf          0.0120   Accessory
Long Pant           0.0111   Clothing
Cotton Socks        0.0099   Accessory
Wool Gloves         0.0097   Accessory
Facial Masks        0.0090   BodyCare
Wool socks          0.0088   Accessory
Brown knit scarf    0.0081   Accessory

Table 4: Topic 53, "women accessories".



Figure 7: Illustration of product topics learned by LDA using 100 topics. Category is from ground truth.

5.3 Performance of PCIM and GICIM

We use lift charts to measure the effectiveness of PCIM and GICIM for predicting group-purchasing customers. In a lift chart, the x-axis represents the percentage of users sorted by our prediction score and the y-axis represents the cumulative percentage of the ground-truth customers we would predict. For all lift charts shown in this section, we also include two baseline models for comparison. One baseline model is to predict potential customers by randomly sampling from the set of users; therefore, it is a straight line with slope 1.0 on the lift chart. The other baseline model, which we call category frequency, is to predict the customers with the most frequent purchase history in a given product category. Specifically, to predict potential customers given a group-deal product category, we rank each customer in descending order of their normalized purchase frequency for the given product category.
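As a concrete illustration (our sketch, not the paper's code), a lift curve can be computed by sorting users by prediction score and accumulating the fraction of ground-truth buyers covered:

import numpy as np

def lift_curve(scores, is_buyer):
    """scores: prediction score per user; is_buyer: 1 if the user truly
    joined the group-purchasing event, else 0. Returns the lift-chart
    x (% of users predicted) and y (% of positive responses) values."""
    order = np.argsort(scores)[::-1]                # descending by score
    hits = np.cumsum(np.asarray(is_buyer)[order])   # buyers covered so far
    x = np.arange(1, len(order) + 1) / len(order)
    y = hits / hits[-1]                             # assumes at least one buyer
    return x, y

# The random-sampling baseline corresponds to the diagonal y = x.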

Effect of w. We first measure the effect of the weighting parameter w in PCIM, which is shown in Figure 9. The parameter w controls how much the user's own topic proportion is used in the mixture topic proportion. For example, w = 1 means that only the user's own topic proportion is used as the mixture topic proportion. We see that for all w values, PCIM performs much better than the baseline models between 0% and 25% of the customers predicted. For instance, PCIM is able to reach 50% of the targeted customers while the two baseline models only reach 25% and 40% of the customers, respectively.
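The role of w can be sketched as follows (our reconstruction; the paper's exact scoring rule may differ, and the inner-product affinity below is a stand-in):

import numpy as np

def mixture_proportion(theta_user, theta_initiators, w):
    """Blend the user's own topic proportion with the average proportion
    of the initiators the user has co-bought with; w = 1 keeps only the
    user's own proportion, w = 0 keeps only the initiators'."""
    theta_init = np.mean(theta_initiators, axis=0)
    return w * np.asarray(theta_user) + (1.0 - w) * theta_init

def prediction_score(theta_mix, theta_product):
    # One plausible affinity between the user mixture and product topics.
    return float(np.dot(theta_mix, theta_product))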

We also see that with w = 1, the curve first rises very fast, then flattens between 25% and 50%. It even performs worse than the baseline models starting at around 60% of the customers predicted; however, this is the least interesting part of the curve. The intuition behind this behavior is that with w = 1, PCIM is good at predicting customers who have strong purchasing preferences that match the targeted group-deal product. On the other hand, for users without such strong purchasing preferences, the model is not able to perform well.



Figure 9: Lift chart of PCIM with different weighting parameter values. With w = 1, the model only includes the user's own topic proportion.


Figure 10: Lift chart of GICIM with different weighting parameter values. With w = 1, the model only includes the user's own topic proportion.


Figure 11: Lift charts of (a) PCIM and (b) GICIM over different product categories. The zoomed-in windows on the right show that performance is slightly better on frequently purchased items (Food) than on infrequently purchased items (Body Care).

In general, by introducing the topic proportions of initiators with whom the user has co-bought products (w < 1), PCIM is able to reduce the flattening effect. With w < 1, we see that the lift curve is always above the baseline.

Figure 10 shows the effect of w in GICIM. We see that GICIM always performs better than PCIM and the baseline models, even for the case where w = 1. In particular, for the cases where w < 0.8, GICIM achieves 90% positive response with only 10% of the customers predicted. The high prediction success of GICIM can be explained by the fact that whether a user chooses to join a group-purchasing event or not depends on who the group initiator is. We also note that the performance change due to different w values is not as significant as for PCIM.

Performance on different product categories. We next investigate whether frequently purchased items (e.g., drinks and food items) make PCIM and GICIM perform differently. We test on five different categories of group-deal products: food, body care, apparel, toys, and living. The ground-truth categories are from the dataset.



Figure 12: Lift charts of PCIM and GICIM over different numbers of topics k ∈ {20, 40, 60, 80, 100}: (a) PCIM, w = 0; (b) PCIM, w = 1; (c) GICIM, w = 0; (d) GICIM, w = 1.

Figure 11 shows the results. We see that, for all product categories tested, GICIM still performs better than PCIM. Moreover, from Figure 11, we see that both models are slightly better at predicting potential customers for the food category than for the body care category. We hypothesize that this may be related to the fact that purchases in the food category are more frequent and predictable compared to purchases in the body care category. A customer may buy one or more products in the food category repeatedly, while the same does not appear to be the case for all products in the body care category. For example, once someone has purchased a sunscreen spray (a product in body care), they are probably unlikely to buy it again, at least within the time span that our dataset covers. However, note that the differences between categories in Figure 11 are small, and developing a better understanding of them is an area of future research.

Effect of different numbers of topics. We ran PCIM and GICIM with different numbers of topics used in LDA. Results are given in Figure 12. We find that increasing the number of topics increases prediction accuracy for both models. This agrees with the perplexity analysis above, in which a higher number of topics results in better performance.
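The sweep over k can be reproduced in outline as follows (our sketch, under the assumption that held-out perplexity is the selection criterion; toy data only):

from gensim import corpora
from gensim.models import LdaModel

docs = [["pasta", "steak"], ["hat", "scarf"], ["bread", "cake"]]  # toy stand-in
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in (2, 4, 8):  # the paper sweeps k over {20, 40, 60, 80, 100}
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    # Per-word likelihood bound; gensim's perplexity is 2 ** (-bound).
    print(k, lda.log_perplexity(corpus))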

6 Conclusion

In this paper, we study group-purchasing patterns with social information. We analyze a real-world group-purchasing dataset (5,602 users, 26,619 products, and 13,609 events) from ihergo.com. To the best of our knowledge, we are the first to analyze the group-purchasing scenario where each group-purchasing event is started by an initiator. Under this kind of social group-purchasing framework, each user builds up social ties with a set of group initiators. Our analysis of the dataset shows that a user usually joins group-purchasing events initiated by a relatively small number of initiators. That is, if a user has co-bought a group-deal product with a group initiator, he or she is more likely to join a group-purchasing event started by that initiator again.

We develop two models to predict which users are most likely to join a group-purchasing event. Experimental results show that by including the weighted topic proportions of the initiators, we achieve higher prediction accuracy. We also find that whether a user decides to join a group-purchasing event is strongly influenced by who the group initiator of that event is.

Our model can be further improved in several ways. First, we can use Labeled LDA [16] by exploiting the ground-truth category of the products or the user profiles from the dataset. Second, we can incorporate other information, such as the geographical and demographic information of users and the seasonality of products, in a more complex topic model. We are also interested in extending the model to deal with cold start, where a new user or group-deal product is added to the system.

Acknowledgments

We want to thank Kuo-Chun Chang from ihergo.com for his cooperation and the use of the ihergo dataset. This material is based, in part, upon work supported by NSF award CCF0937044 to Ole Mengshoel.

References

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng., 17(6), June 2005.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3, Mar. 2003.

[3] J. W. Byers, M. Mitzenmacher, M. Potamias, and G. Zervas. A month in the life of Groupon. CoRR, 2011.


[4] J. W. Byers, M. Mitzenmacher, and G. Zervas. Daily deals: Prediction, social diffusion, and reputational ramifications. CoRR, 2011.

[5] Y. Cha and J. Cho. Social-network analysis using topic models. SIGIR '12, 2012.

[6] B. Edelman, S. Jaffe, and S. D. Kominers. To Groupon or not to Groupon: The profitability of deep discounts. Harvard Business School working papers, Harvard Business School, Dec. 2010.

[7] S. Guo, M. Wang, and J. Leskovec. The role of social networks in online shopping: Information passing, price of trust, and consumer choice. CoRR, 2011.

[8] T. Hofmann. Collaborative filtering via Gaussian probabilistic latent semantic analysis. SIGIR '03, 2003.

[9] T. Iwata, S. Watanabe, T. Yamada, and N. Ueda. Topic tracking model for analyzing consumer purchase behavior. IJCAI, 2009.

[10] T. Iwata, T. Yamada, and N. Ueda. Modeling social annotation data with content relevance using a topic model. NIPS, 2009.

[11] J. Leskovec, L. A. Adamic, and B. A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 2007.

[12] T. Lu and C. E. Boutilier. Matching models for preference-sensitive group purchasing. EC '12. ACM, 2012.

[13] H. Ma, I. King, and M. R. Lyu. Learning to recommend with social trust ensemble. SIGIR '09. ACM, 2009.

[14] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. WSDM '11. ACM, 2011.

[15] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. KDD '08. ACM, 2008.

[16] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP '09. Association for Computational Linguistics, 2009.

[17] H. M. Wallach. Topic modeling: Beyond bag-of-words. ICML '06. ACM, 2006.

[18] X. Yang, Y. Guo, and Y. Liu. Bayesian-inference based recommendation in online social networks. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2012.

[19] M. Ye, C. Wang, C. Aperjis, B. A. Huberman, and T. Sandholm. Collective attention and the dynamics of group deals. CoRR, 2011.

[20] L. Yu, R. Pan, and Z. Li. Adaptive social similarities for recommender systems. RecSys '11. ACM, 2011.


A Lightweight Inference Method for Image Classification

John Mark
[email protected]

Preeti J.
[email protected]

Abstract

We demonstrate a two-phase classification method, first of individual pixels, then of fixed regions of pixels, for scene classification—the task of assigning posteriors that characterize an entire image. This can be realized with a probabilistic graphical model (PGM), without the segmentation and aggregation tasks characteristic of visual object recognition. Instead, the spatial aspects of the reasoning task are determined separately by a segmented partition of the image that is fixed before feature extraction. The partition generates histograms of pixel classifications that are treated as virtual evidence to the PGM. We implement a sampling method to learn the PGM using virtual evidence. Tests on a provisional dataset show good (above 70%) classification accuracy for almost all classes.

1 Introduction

Scene recognition is a field of computer vision concerned with the classification of scene types by analysis of a visual image. The techniques employed for scene recognition are well known, relying on methods for image analysis and automated inference. The fundamental process is to assign probabilities over a defined set of categories—the scene characteristics—based on analysis of the current visual state. This paper shows the practicability of a lightweight approach that avoids much of the complexity of object recognition methods by reducing the problem to a sequence of empirical machine learning tasks.

The problem we have applied this to is the classification of scene type by analysis of a video stream from a moving platform, specifically from a car. In this paper we address aspects of spatial reasoning—clearly there is also a temporal reasoning aspect, which is not considered here. In figurative terms the problem may be compared with Google's Streetview® application. Streetview's purpose is to tell you what your surroundings look like by knowing your location. The scene recognition problem is the opposite: to characterize your location from what your surroundings look like.

In this paper we consider a classification scheme for images where the image is subject to classification in multiple categories. We will consider outdoor roadway scenes, and these classification categories:

1. surroundings, zoning, development (urban, residential, commercial, mountainous, etc.),

2. visibility (e.g., illumination and weather),

3. roadway type,

4. traffic and other transient conditions,

5. roadway driving obstacles.

An image will be assigned one label from each of the set of five categories.

1.1 Uses of Scene Classification

There are numerous uses where the automated classification assigned to a scene can help. The purpose of scene classification is to capture the gist of the current view from its assigned category labels. For example, how would you describe a place from what you see? Certainly this is different from what you would know from just the knowledge of your lat-long coordinates. These are some envisioned uses:

• A scene classification provides context. For example, in making a recommendation, the context could be to consider the practicality of the request: For instance, "Do you want to get a latte now? This is not the kind of neighborhood for that."


• Supplement search by the local surroundings. For example, "Find me a winery in a built-up area." "Find me a restaurant in a remote place." "Find a park in a less-travelled residential area."

• Coming up with a score for the current conditions. How is the view from this place? How shaded or sunny is the area? What fraction of the surroundings is natural versus artificial? Taking this one step further, given an individual driver's ratings of preferred locations, suggest other desirable routes to take, possibly out of the way from a "best" route.

• Distributed systems could crowd-source their findings about nearby locations to form a comprehensive picture of an area. For example, "How far does this swarm (road-race, parade) extend?"

1.2 Relevant previous work

One of the earliest formulations of image understanding as a PGM is found in Levitt, Agosta, and Binford (1989) and Agosta (1990). The approach assumed an inference hierarchy from object categories to low-level image features, and proposed aggregation operators that melded top-down (predictive) with bottom-up (diagnostic) reasoning over the hierarchy.

The uses of PGMs in computer vision have expanded into a vast range of applications. To mention a couple of examples, L. Fei-Fei, R. Fergus, and P. Perona (2003) developed a Bayesian model for learning new object categories using a "constellation" model with terms for different object parts. In a paper that improved upon this, L. Fei-Fei and P. Perona (2005) proposed a Bayesian hierarchical theme model that automatically recognizes natural scene categories such as forest, mountains, highway, etc., based on a generalization of the original texton model by T. Leung and J. Malik (2001) and L. Fei-Fei, R. VanRullen, C. Koch, and P. Perona (2002). In another application of a Bayesian model, Sidenbladh, Black, and Fleet (2000) develop a generative model for the appearance of human figures. Both of these examples apply model selection methods to what are implicitly PGMs, if not explicitly labeled as such.

Computer vision approaches specifically to scene recognition recognize the need to analyze the image as a whole. Hoiem, Efros, and Hebert (2008) approach the problem by combining results from a set of intrinsic images, each a map of the entire image for one aspect of the scene. Oliva and Torralba (2006) develop a set of scene-centered global image features that capture the spatial layout properties of the image. Similar to our approach, their method does not require segmentation or grouping steps.

1.3 How Scene Classification differs from Object Recognition

Scene classification implies a holistic, image-level inference task, as opposed to the task of recovering the identity, presence, and pose of objects within an image. Central to object recognition is distinguishing the object from the background of the rest of the image. Typically this is done by segmenting the image into regions of smoothly varying values separated by abrupt boundaries, using a bottom-up process. Pixels may be grouped into "super-pixels" whose grouping is further refined into regions that are distinguished as part of the foreground or background. Object recognition then considers the features and relationships among foreground regions to associate them with parts to be assembled into the object, or directly with an entire object to be recovered.

Scene classification as we approach it does not necessarily depend on segmenting the image into regions, or identifying parts of the image. Rather, it achieves a computational economy by treating the image as a whole; for example, to assign the image to the class of "indoor," "outdoor," "urban landscape," or "rural landscape," etc., from a set of pre-defined categories. We view classification as assigning a posterior to class labels, where the image may be assigned a value over multiple sets of labels; equivalently, the posterior may be a joint distribution over several scene variables.

Despite the lack of a bottom-up segmentation step in our approach, our method distinguishes regions of the image by a partition that is prior to analyzing the image contents. This could be a fixed partition, which is appropriate for a camera in a fixed location such as a security camera, or it could depend on inferring the geometry of the location from sources distinct from the image contents, such as indicators of altitude and azimuth of the camera. In our case, the prior presumption is that the camera is on the vehicle, facing forward, looking at a road.

The rest of this paper is organized as follows. Section 2 describes the inference procedure cascade; the specific design and learning of the Bayes network PGM is the subject of Section 3, and the results of the learned model applied to classification of a set of images are presented in Section 4.

2 Lightweight inference with virtual evidence

In treating the image as a whole, our approach to inference for scene classification takes place by a sequence of two classification steps:


• First, the image's individual pixels are classified, based on pixel-level features. This classifier resolves each pixel into one of n discrete types, representing the kind of surface that generated it. In our examples n = 8: sky, foliage, building-structure, road-surface, lane, barrier-sidewalk, vehicle, and pedestrian.

• In the second step, the pre-defined partitions are applied to the image and, in each partition, the pixel types are histogrammed to generate a likelihood vector for the partition. These likelihoods are interpreted as virtual evidence¹ for the second-level image classifier, the scene classifier, implemented as a PGM. The classifier returns a joint distribution over the scene variables, inferred from the partitions' virtual evidence.

There is labeled data for both steps, making it possible to learn a supervised classifier for each. Each training image is marked up into labeled regions using the open-source LabelMe tool (Russell, Torralba, K. Murphy and Freeman, 2007), and is also labeled by one label from each category of scene characteristics. From the region labelings, a dataset of pixels can be created, with color and texture as features and the region they belong to as labels. In the second step we learn the structure and parameters of a Bayes network—a discrete-valued PGM—from the set of training images that have been manually labeled with scene characteristics. Each image has one label assigned for each scene characteristic. The training images are reduced to a set of histograms of the predicted labels for the pixels, one for each partition. The supervised data for an image consists of the histogram distributions and the label set.

The scene recognition output summarizes the visual input as an admittedly modest amount of information drawn from an input source orders of magnitude greater—even more so than for the object recognition task. From on the order of 10^6 pixel values we infer a probability distribution over a small number of discrete scene classification variables. To obtain computational efficiency, we have devised an approach that summarizes the information content of the image at an early stage of the process, in a form that is adequate at later stages for the classification task.

2.1 Inference Cascade

The two phases in the inference cascade can be formalized as follows, starting from the pixel image and resulting in a probability distribution over scene characteristics.

¹ Sometimes called "soft evidence." We prefer the term virtual evidence, since soft evidence is also used to mean an application of Jeffrey's rule of conditioning that can change the CPTs in the network.

Consider an image of pixels p_ij over i × j, each pixel described by a vector of features f_ij. The features are derived by a set of filters, e.g. for color and texture, centered at coordinate (i, j). A pixel-level classifier is a function from the domain of f to one of a discrete set of n types, C : f → {c^(1), ..., c^(n)}. The result is an array of classified image pixels.

A pre-determined segmentation G_m partitions the pixels in the image into M regions by assigning each pixel to one region, r_m = {p_ij | p_ij ∈ G_m}, m = 1, ..., M, to form regions that are contiguous sets of pixels. Each region is described by a histogram of the pixel types it contains: H_m = (|C(f_ij) = c^(1)|, ..., |C(f_ij) = c^(n)|) s.t. f_ij ∈ G_m, for which we introduce the notation H_m = (|c^(1)_ij|_m, ..., |c^(n)_ij|_m), where |c^(i)|_m denotes the count of pixels of type c^(i) in region m. The scene classifier is a PGM with virtual evidence nodes corresponding to the M regions of the image; see Figure 3. Each evidence node receives virtual evidence in the form of a lambda message λ_m, with likelihoods in the ratios given by H_m. The PGM model has a subset of nodes S = {S_1, ..., S_v}, distinct from its evidence nodes, for the scene characteristic variables, each with a discrete state space. Scene classification is completely described by P(S | λ_1, ..., λ_M), the joint distribution of S when the λ_m are applied, or by a characterization of the joint by the MAP configuration over S, or just the posterior marginals of S.
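The cascade above can be sketched compactly (our illustration; names are hypothetical): classify the pixels, histogram the classes within each fixed region G_m, and normalize each histogram into the likelihood vector for that region's λ_m message.

import numpy as np

def region_likelihoods(pixel_classes, region_ids, n_types, n_regions):
    """pixel_classes: integer type C(f_ij) per pixel (flattened array);
    region_ids: region index m per pixel, given by the fixed partition.
    Returns one normalized histogram H_m per region, used as the
    virtual-evidence (lambda) message for that region's node."""
    lambdas = np.zeros((n_regions, n_types))
    for m in range(n_regions):
        counts = np.bincount(pixel_classes[region_ids == m],
                             minlength=n_types)
        lambdas[m] = counts / max(counts.sum(), 1)  # lambda_m proportional to H_m
    return lambdas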

2.2 Partitions of Pixel-level Data

As mentioned, we avoid segmenting the image based on pixel values by using a fixed partition to group classified pixels. Using such a segmentation introduces a significant simplification over conventional object recognition methods. This makes sense because we are not interested in identifying things that are in the image, but only in treating the image as a whole. For instance, in the example we present here, the assumption is that the system is classifying an outdoor roadway scene, with sky above, road below, and surroundings characteristic of the scene to either side. The partitions approximate this division. The image is partitioned symmetrically into a set of twelve wedges, formed by rays emanating from the image center.
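A fixed wedge partition of this kind can be constructed from the pixel coordinates alone, as in the following sketch (ours; the paper places the wedge vertex at the approximate vanishing point, whereas the geometric center is used here by default):

import numpy as np

def wedge_partition(height, width, n_wedges=12, center=None):
    """Assign each pixel to one of n_wedges wedges by the angle of the
    ray from the chosen center to the pixel."""
    cy, cx = center if center is not None else (height / 2.0, width / 2.0)
    ys, xs = np.mgrid[0:height, 0:width]
    angles = np.arctan2(ys - cy, xs - cx)               # in (-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_wedges).astype(int)
    return np.clip(bins, 0, n_wedges - 1)               # wedge id per pixel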

For greater efficiency the same method could be applied over a smoothed, or down-sampled, image, so that not every pixel need be touched, only pixels on a regular grid. The result of the classification step is a discrete class-valued image array; see Figure 2. Despite the classifier ignoring local dependencies, neighboring pixels tend to be classed similarly, and the class-valued image resembles a cartoon version of the original.


Figure 1: The original image. The barriers bordering the lane are a crucial feature that the system is trained to recognize.

Figure 2: The image array of C(f_ij), the pixel classifier, on an image down-sampled to 96 by 54. Rays emanating from the image center show the wedge-shaped regions. Colors are suggestive of the pixel class, e.g. green indicates foliage and beige indicates barriers.

2.2.1 Inferring partition geometry

The point chosen as the image center, where the vertices of the wedges converge, approximates the vanishing point of the image. Objects in the roadway scene tend to conform (very) roughly to the wedge outlines, so that their contents are more uniform and, hence, the likelihoods are more informative. For example, the contents of the image along the horizon will fall within one wedge, and the road surface within another.

2.3 The image as a source of virtual evidence

For each wedge that partitions the image, the evidence applied to the Bayes network from wedge m is: λ_m ∝ |c^(1)_ij|_m : |c^(2)_ij|_m : ... : |c^(n)_ij|_m. One typically thinks of virtual evidence as a consequence of measurements coming from a sensor that garbles the precise value of the quantity of interest—where the actual observed evidence value is obscured by an inaccuracy in the sensor reading. Semantically, one should not think of the virtual image evidence as a garbled sensor variable. Rather, it is the evidence that describes the region.

3 Bayes network design

Formally, a Bayes network is a factorization of a joint probability distribution into local probability models, each corresponding to one node in the network, with directed arcs between the nodes showing the conditioning of one node's probability model on another's (Koller and Friedman, 2010). Inference—for example, classification—operates in the direction against the causal direction of the arcs. In short, inference flows from lower-level evidence in the network upward to the class nodes at the top of the network, where it generates the posterior distributions over the class variables, in this case the scene characteristics. We learn a fully observable Bayes network with virtual evidence for scene classification.

3.1 How the structure and parameters are defined

The design of the Bayes network model is fluid: it is easily re-learned under different partition inputs, output categories and structural constraints. The ability to easily modify the model to test different kinds of evidence as inputs, or differently defined nodes as outputs, is an advantage of this approach. The structure of the model discovers dependencies among the model variables that reveal properties of the domain.

Learning the Bayes network is composed of two aspects: first, learning the network structure; second, learning the parameters of the variables' conditional probability tables. The algorithm used is SMILE's Bayesian Search (Druzdzel et al., 1997), a conventional fully observable learning algorithm, with a Bayesian scoring rule used to select the preferred model. Learning of structure and parameters occurs simultaneously.

The model is structured into two levels, a top level of outputs and a lower level of inputs, as shown in Figure 3. This is the canonical structure for classification with a Bayes network, in this case a multi-classifier with multiple output nodes. In the learning procedure this node ordering is imposed as a constraint on the structure, so that conditioning arcs cannot go from the lower level to the upper level.

Further constraints are used to limit in-degree and node ordering. The in-degree of evidence nodes is limited to two. Node ordering of output nodes follows common-sense causal reasoning: for instance, the "Surroundings" variable influences the "Driving Conditions" and not the other way around. The model consequently follows an approximately naive Bayes structure for each scene variable, but with additional arcs that are a consequence of the model selection performed during learning. The resulting network is relatively sparse, and hence learning a network of this size, let alone running inference on it, can be done interactively.
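For illustration, the same kind of constrained search can be expressed with the open-source pgmpy library (the paper itself uses SMILE's Bayesian Search; the data, node names, and exact API details below are our assumptions): the level ordering is imposed by blacklisting arcs from the evidence level to the scene level, and the in-degree cap is passed directly.

import itertools
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BDeuScore

# Hypothetical discretized training data: wedge evidence columns plus
# scene-variable columns.
rng = np.random.default_rng(0)
cols = ["wedge_1", "wedge_2", "surroundings", "driving_conditions"]
data = pd.DataFrame(rng.integers(0, 3, size=(200, len(cols))), columns=cols)

evidence = ["wedge_1", "wedge_2"]
scene = ["surroundings", "driving_conditions"]

# Constraint: no conditioning arcs from the evidence level up to the
# scene level.
black_list = list(itertools.product(evidence, scene))

est = HillClimbSearch(data)
model = est.estimate(scoring_method=BDeuScore(data),
                     max_indegree=2,          # the paper's in-degree cap
                     black_list=black_list)
print(model.edges())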

3.2 Bayes Network Learning Dataset

An interesting challenge in learning this model is that there is no conventional procedure for learning from virtual evidence, such as the histogram data.

3.2.1 Consideration of partition contents as virtual evidence

We considered three ways to approximate learning the Bayes network from samples that include virtual evidence.

1) Convert the dataset into an approximately equivalent observed-evidence dataset by generating multiples of each evidence row, in proportion to the likelihood fraction for each state of the virtual evidence. If there are multiple virtual evidence nodes, then to capture dependencies among virtual evidence nodes this could result in a combinatorial explosion of row sets, one multiple for each combination of virtual evidence node states, with multiplicities in proportion to the likelihood of the state combination. This is equivalent in complexity to combining all virtual evidence nodes into one node for sampling.

Similarly, one could sample from the combination of all virtual evidence nodes and generate a set of rows based on the items in the sample. This is a bit like logic sampling the virtual states.

Both these methods make multiple copies of a row in the learning set as a way to emulate a training weight. Instead, one could apply a weight to each row in the sampled training set, in proportion to its likelihood.

2) One could also consider a mixture, a "multi-net," of learned deterministic-evidence models. The models would have the same structure, so the result would be a mixture of CPTs, weighted (in some way) by the likelihoods. It appears this would also suffer a combinatorial explosion of mixture components, and might be amenable to reducing the set by sampling.

3) Alternatively, one could represent the virtual evidence by a virtual node that is added as a child to the evidence node, which is then instantiated to send the equivalent lambda message to its parent. This is the method used in Refaat, Choi and Darwiche (2012). With many cases, there would be a set of virtual nodes added to the network for each case, again resulting in a possibly unmanageable method. Perhaps there is an incremental learning method that would apply: build a network with one set of nodes, do one learning step, then replace the nodes with the next set, and repeat a learning step.

4 Results on a sample dataset

In this section we present the evaluation of the Bayes network as a classifier. We argue that the first-stage pixel-level classifier, whose accuracy approaches 90%, is a minor factor in the scene classification results, since the partition-level inputs to the Bayes network average over a large number of pixels, although this premise could be tested.

4.1 Learning from a sampled dataset

The sample dataset used to learn the model was a further approximation of alternative 1), where each virtual evidence node was sampled independently to convert the problem into an equivalent one with sampled data. Each histogram was sampled according to its likelihood distribution, to generate a set of conventional evidence samples that approximated the histogram (as sketched after the list below). The result was an expanded dataset that multiplied the number of rows by the sample size for each row in the histogram dataset. The resulting dataset description is:

1. Original data set: 122 rows of 12 region histograms of images, labeled by 5 scene labels.

2. Each region histogram is sampled 10 times, to generate 1,220 rows.

3. Final data set: 5 labels and 12 features by 1,220 rows.
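A sketch of this expansion (ours; names hypothetical): each region histogram is treated as a distribution over the n = 8 pixel types and sampled independently, turning one virtual-evidence row into ten hard-evidence rows.

import numpy as np

rng = np.random.default_rng(0)

def expand_row(histograms, labels, n_samples=10):
    """histograms: the 12 normalized region histograms of one image;
    labels: its 5 scene labels. Yields hard-evidence training rows."""
    for _ in range(n_samples):
        states = [int(rng.choice(len(h), p=h)) for h in histograms]
        yield states + list(labels)

# Applied to all 122 images, this produces the 1,220-row set above.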

4.2 Inference Results

As mentioned, the second-stage Bayes network classifier infers a joint probability distribution over the set of scene characteristic nodes—the nodes shown in orange in Figure 3. We will evaluate the scene classifier by the accuracy of the predicted marginals, comparing the highest-posterior prediction for each scene variable with the true value.²

² The "dynamic environment" variable is not counted in the evaluation results, since almost all labeled data was collected under overcast conditions, making the predicted results almost always correct, and uninteresting.



Figure 3: The entire Bayes network used for scene classification. Input nodes, corresponding to the wedges that partition the image, are shown in light blue, and output nodes for the scene variables are in orange. The input nodes are arranged roughly in the positions of the corresponding wedges in the image. The input node histograms show the virtual evidence applied from that wedge. The labels used here for the virtual evidence states, s_1, ..., s_8, correspond to the classifier outputs c^(1), ..., c^(8).

The matrix of counts of the true class by the predicted class is called a confusion matrix. The row sum of the confusion matrix for any class, divided into the diagonal (true count), is the fraction of correct cases out of those possible, known as the recall or the coverage for that class. The column sum divided into the diagonal element is the fraction classified with that column's class label that truly belong to that class, which is called the precision. Tables 1-4 show the recall and precision for each class, for each of the scene variables. As may be expected, "Surroundings," which takes in the entire image, performs better than "Road obstacles," which requires attention to detail in just the car's lane. This poor performance is even more pronounced with "Bicycles and pedestrians" in Table 3, which appear in small areas of the image. In the other classes either precision or recall approaches 1.0, except for "Local" roads, where all cases were confused with "Curves and grades," again due to the limited variety in the training set.
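In code (our sketch), recall and precision per class follow directly from a confusion matrix whose rows are true classes and whose columns are predicted classes:

import numpy as np

def recall_precision(confusion):
    cm = np.asarray(confusion, dtype=float)
    diag = np.diag(cm)
    recall = diag / cm.sum(axis=1)       # diagonal over row sums
    precision = diag / cm.sum(axis=0)    # diagonal over column sums
    accuracy = diag.sum() / cm.sum()
    return recall, precision, accuracy

# An empty column (a label never predicted) yields NaN precision,
# exactly as reported for the "Local" class in Table 2 below.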

Beyond evaluating the accuracy of marginal predictions, we can also make observations about the structure learned for the Bayes network. Arcs in the learned model show which wedge histograms are relevant to which scene variables. These arcs are relatively sparse, in part due to the aforementioned design constraint limiting in-degree to two. The arcs chosen by the structure learning algorithm show a strong association between the location of the partitions and different scene variables. We see this in the associations where the "Driving conditions" scene variable connects to partitions at the base of the image, and "Surroundings" connects to partitions on the image periphery. The relevance of the two wedges at the bottom of the diagram is limited, since their only incoming arcs are from other wedges, indicating that their evidence is supported entirely by neighboring wedges. We leave them in the model, since in the case of virtual evidence they will still have some information value for classification. Further along these lines, in terms of wedge dependencies, only one arc was learned between wedge histograms, indicating that the evidence contributed to the scene is conditionally independent in all but this case. The sub-network of scene variables is more connected, indicating strong dependencies among the scene variables. Some of these are to be expected; for instance, "Curves and grades" correlates strongly with "Mountainous" surroundings. Some are spurious, a result of biased selection of the training sample images (e.g. all divided-highway images corresponded to overcast scenes), and have been corrected by adding more samples.

4.3 Discussion and Conclusion

We have demonstrated a novel scene classification algorithm that takes advantage of the presumed geometry of the scene to avoid the computationally expensive image processing steps characteristic of object recognition methods, such as pixel segmentation, by using a cascade of a pixel-level and a fixed-partition-level multi-classifier, for which we learn a Bayes network. As a consequence of the partition-level data, we learn the Bayes network with virtual evidence.

The Bayes network classifies the scene in several dependent dimensions, corresponding to a set of categories over which a joint posterior of scene characteristics is generated. Here we have only considered the marginals over categories; however, it is a valid question whether a MAP interpretation—of the most likely combination of labels—is more appropriate.

The use of virtual evidence also raises questions about whether it is proper to consider the virtual evidence likelihood as a convex combination of "pure" image data. Another interpretation is that the histograms we are using are better "sliced and diced" to generate strong evidence from certain ratios of partition content. For instance, a partition that includes a small fraction of evidence of roadway obstacles—think evidence of a small person—may be a larger concern than a partition obviously full of obstacles, and should not be considered a weaker version of the extreme partition contents. These subtleties could be considered as we expand the applicability of the system. In this early work it suffices that, given the approximations, useful and accurate results can be achieved at modest computational cost.

Acknowledgements

This work would not have been possible without valuable discussions with Ken Oguchi and Joseph Dugash, and with Naiwala Chandrasiri, Saurabh Barv, Ganesh Yalla, and Yiwen Wan.

References

Agosta, J. M., 1990. The structure of Bayes networks for visual recognition. UAI (1988), North-Holland Publishing Co., pp. 397-406.

Druzdzel, M. et al., 1997. SMILE (Structural Modeling, Inference, and Learning Engine). http://genie.sis.pitt.edu/.

Fei-Fei, L., R. Fergus, P. Perona, 2003. A Bayesian approach to unsupervised one-shot learning of object categories. ICCV 2003, pp. 1134-1141, vol. 2.

Fei-Fei, L. and P. Perona, 2005. A Bayesian hierarchical model for learning natural scene categories. CVPR 2005, pp. 524-531.

Fei-Fei, L., R. VanRullen, C. Koch, and P. Perona, 2002. Natural scene categorization in the near absence of attention. PNAS 2002, 99(14):9596-9601.

Hoiem, D., A. A. Efros, and M. Hebert, 2008. Closing the loop in scene interpretation. 2008 IEEE Conference on Computer Vision and Pattern Recognition (June): 1-8.

Koller, D., and Friedman, N., 2010. Probabilistic Graphical Models: Principles and Techniques. Cambridge, Massachusetts: The MIT Press.

Leung, T. and J. Malik, 2001. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV 2001, 43(1):29-44.

Levitt, T., J. M. Agosta, T. O. Binford, 1989. Model-based influence diagrams for machine vision. UAI 1989.

Oliva, A., and A. Torralba, 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research 155 (January): 23-36.

Oliva, A., A. Torralba, 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145-175.

Refaat, K. S., A. Choi and A. Darwiche, 2012. New advances and theoretical insights into EDML. UAI 2012, pp. 705-714.

Russell, B., A. Torralba, K. Murphy, W. T. Freeman, 2007. "LabelMe: A database and web-based tool for image annotation." International Journal of Computer Vision.

Sidenbladh, H., M. J. Black, and D. J. Fleet, 2000. Stochastic tracking of 3D human figures using 2D image motion. ECCV 2000: 702-718.


             Mountainous   Open rural   Residential   Urban
Recall       1.0           0.9          0.794         0.45
Precision    0.642         1.0          0.964         1.0
Accuracy     0.784

Table 1: Surroundings.

             Curves and   Limited access   Local   No shoulder   Streetside
             grades       highway                                parking
Recall       1.0          0.75             0.0     0.85          0.56
Precision    0.61         1.0              NaN     1.0           1.0
Accuracy     0.770

Table 2: Roadways.

             Bicycles and   Traffic and   Unimpeded
             pedestrians    congestion
Recall       0.56           0.6           0.98
Precision    0.875          0.93          0.67
Accuracy     0.754

Table 3: Driving Conditions.

             Clear road   Construction   Merge or       Tree trunks
                                         intersection   and poles
Recall       0.42         0.96           0.6            1.0
Precision    1.0          0.92           0.86           0.55
Accuracy     0.738

Table 4: Road Obstacles.


Exploring Multiple Dimensions of Parallelism in Junction Tree Message Passing

Lu Zheng
Electrical and Computer Engineering
Carnegie Mellon University

Ole J. Mengshoel
Electrical and Computer Engineering
Carnegie Mellon University

Abstract

Belief propagation over junction trees is known to be computationally challenging in the general case. One way of addressing this computational challenge is to use node-level parallel computing, and parallelize the computation associated with each separator potential table cell. However, this approach is not efficient for junction trees that mainly contain small separators. In this paper, we analyze this problem, and address it by studying a new dimension of node-level parallelism, namely arithmetic parallelism. In addition, on the graph level, we use a clique merging technique to further adapt junction trees to parallel computing platforms. We apply our parallel approach to both marginal and most probable explanation (MPE) inference in junction trees. In experiments with a Graphics Processing Unit (GPU), we obtain for marginal inference an average speedup of 5.54x and a maximum speedup of 11.94x; speedups for MPE inference are similar.

1 INTRODUCTION

Bayesian networks (BNs) are frequently used to represent and reason about uncertainty. The junction tree is a secondary data structure which can be compiled from a BN [2, 4, 5, 9, 10, 19]. Junction trees can be used for both marginal and most probable explanation (MPE) inference in BNs. Sum-product belief propagation on junction trees is perhaps the most popular exact marginal inference algorithm [8], and max-product belief propagation can be used to compute the most probable explanations [2, 15]. However, belief propagation is computationally hard, and the computational difficulty increases dramatically with the density of the BN, the number of states of each network node, and the treewidth of the BN, which is upper-bounded by the size of the largest clique in the generated junction tree minus one [13]. This computational challenge may hinder the application of BNs in cases where real-time inference is required.

Parallelization of Bayesian network computation is a feasible way of addressing this computational challenge [1, 6, 7, 9, 11, 12, 14, 19, 20]. A data-parallel implementation for junction tree inference has been developed for a cache-coherent shared-address-space machine with physically distributed main memory [9]. Parallelism in the basic sum-product computation has been investigated for Graphics Processing Units (GPUs) [19]. The efficiency of using disk memory for exact inference has been improved using parallelism and other techniques [7]. An algorithm for parallel BN inference using pointer jumping has been developed [14]. Both parallelization based on graph structure [12] and node-level primitives for parallel computing based on a table extension idea [20] have been introduced; the latter idea was later implemented on a GPU [6]. Gonzalez et al. developed a parallel belief propagation algorithm based on parallel graph traversal to accelerate the computation [3].

A parallel message computation algorithm for junction tree belief propagation, based on the cluster-sepset mapping method [4], has been introduced [22]. Cluster-sepset based node-level parallelism (denoted element-wise parallelism in this paper) can accelerate the junction tree algorithm [22]; unfortunately, the performance varies substantially between different junction trees. In particular, for small separators in junction trees, element-wise parallelism [22] provides limited parallel opportunity, as explained in this paper.

Our work aims at addressing the small separator issue. Specifically, this paper makes the following contributions, which further speed up computation and make performance more robust across BNs from different applications:

• We discuss another dimension of parallelism, namely arithmetic parallelism (Section 3.1). Integrating arithmetic parallelism with element-wise parallelism, we develop an improved parallel sum-product propagation algorithm, as discussed in Section 3.2.

• We also develop and test a parallel max-product propagation algorithm (Section 3.3) based on the two dimensions of parallelism.

• On the graph level, we use a clique merging technique (Section 4), which leverages the two dimensions of parallelism, to adapt various Bayesian networks to the parallel computing platform.

In our GPU experiments, we test the novel two-dimensional parallel approach for both regular sum-propagation and max-propagation. Results show that our algorithms improve the performance of both kinds of belief propagation significantly.

Our paper is organized as follows: In Section 2, we review BNs, junction trees, parallel computing using GPUs, and the small-separator problem. In Section 3 and Section 4, we describe our parallel approach to message computation for belief propagation in junction trees. Theoretical analysis of our approach is in Section 5. Experimental results are discussed in Section 6, while Section 7 concludes and outlines future research.

2 BACKGROUND

2.1 Belief Propagation in Junction Trees

A BN is a compact representation of a joint distribution over a set of random variables X. A BN is structured as a directed acyclic graph (DAG) whose vertices are the random variables. The directed edges induce dependence and independence relationships among the random variables. The evidence in a Bayesian network consists of instantiated random variables.

The junction tree algorithm propagates beliefs (or posteriors) over a derived graph called a junction tree. A junction tree is generated from a BN by means of moralization and triangulation [10]. Each vertex Ci of the junction tree contains a subset of the random variables and forms a clique in the moralized and triangulated BN, denoted by Xi ⊆ X. Each vertex of the junction tree has a potential table φXi. With the above notation, a junction tree can be defined as J = (T, Φ), where T represents a tree and Φ represents all the potential tables associated with this tree. Assuming Ci and Cj are adjacent, a separator Sij is induced on the connecting edge. The variables contained in Sij are defined to be Xi ∩ Xj.

The junction tree size, and hence also junction tree computation, can be lower-bounded by the treewidth, which is defined to be the minimal size of the largest junction tree clique minus one. Considering a junction tree with treewidth tw, the amount of computation is lower-bounded by O(exp(c · tw)), where c is a constant.

Belief propagation is invoked when we get new evidence e for a set of variables E ⊆ X. We need to update the potential tables Φ to reflect this new information. To do this, belief propagation over the junction tree is used. This is a two-phase procedure: evidence collection and evidence distribution. In the evidence collection phase, messages are collected from the leaf vertices all the way up to a designated root vertex. In the evidence distribution phase, messages are distributed from the root vertex back to the leaf vertices.
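To make the two-phase schedule concrete, the following host-side sketch (our illustration, not the authors' code; all type and function names are hypothetical) runs the collect and distribute passes over a junction tree stored as an adjacency list, invoking a message-passing callback once per directed edge:

// Host-side sketch of the two-phase schedule, assuming a junction tree
// stored as an adjacency list. passMessage() is a placeholder for the
// per-edge message computation (Algorithm 5 below); names are ours.
#include <cstdio>
#include <vector>

using Tree = std::vector<std::vector<int>>;   // adjacency list of clique ids

static void passMessage(int src, int dst) {   // placeholder
    std::printf("message %d -> %d\n", src, dst);
}

// Collect phase: messages flow from the leaves towards the root.
static void collect(const Tree& t, int v, int parent) {
    for (int child : t[v])
        if (child != parent) { collect(t, child, v); passMessage(child, v); }
}

// Distribute phase: messages flow from the root back to the leaves.
static void distribute(const Tree& t, int v, int parent) {
    for (int child : t[v])
        if (child != parent) { passMessage(v, child); distribute(t, child, v); }
}

int main() {
    Tree t = {{1, 2}, {0}, {0, 3}, {2}};      // small 4-clique tree, root 0
    collect(t, 0, -1);
    distribute(t, 0, -1);
    return 0;
}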

2.2 Junction Trees and Parallelism

Current emerging many-core platforms, like the recent Graphics Processing Units (GPUs) from NVIDIA and Intel's Knights Ferry, are built around an array of processors running many threads of execution in parallel. These chips employ a Single Instruction Multiple Data (SIMD) architecture. Threads are grouped using a SIMD structure, and each group shares a multithreaded instruction unit. The key to good performance on such platforms is finding enough parallel opportunities.

We now consider opportunities for parallel computing in junction trees. Associated with each junction tree vertex Ci and its variables Xi, there is a potential table φXi containing non-negative real numbers that are proportional to the joint distribution of Xi. If each variable j contains sj states, the minimal size of the potential table is |φXi| = ∏_{j=1}^{|Xi|} sj, where |Xi| is the cardinality of Xi.

Message passing from Ci to an adjacent vertex Ck, with separator Sik, involves two steps:

1. Reduction step. In sum-propagation, the potential table φSik of the separator is updated to φ∗Sik by reducing the potential table φXi:

   φ∗Sik = ∑_{Xi \ Sik} φXi.   (1)

2. Scattering step. The potential table of Ck is updated using both the old and new tables of Sik:

   φ∗Xk = φXk · φ∗Sik / φSik.   (2)

   We define 0/0 = 0 in this case; that is, if the denominator in (2) is zero, then we simply set the corresponding φ∗Xk to zero.


Figure 1: Histograms of the separator potential table sizes of junction trees Pigs and Munin3. For both junction trees, the great majority of the separator tables contain 20 or fewer elements.

Equations (1) and (2) reveal two dimensions of parallelism opportunity. The first dimension, which we return to in Section 3, is arithmetic parallelism. The second dimension is element-wise parallelism [22].

Element-wise parallelism in junction trees is based on the fact that the computations related to each separator potential table cell are independent, and takes advantage of an index mapping table, see Figure 2. In Figure 2, this independence is illustrated by the white and grey coloring of cells in the cliques, the separator, and the index mapping tables. More formally, an index mapping table µX,S stores the index mappings from φX to φS [4]. We create |φSik| such mapping tables. In each mapping table µXi,φSik(j) we store the indices of the elements of φXi mapping to the j-th separator table element. Mathematically,

µXi,φSik(j) = { r ∈ [0, |φXi| − 1] : φXi(r) is mapped to φSik(j) }.

With the index mapping tables, element-wise parallelism is obtained by assigning one thread per mapping table of a separator potential table, as illustrated in Figure 2 and Figure 3. Consequently, belief propagation over junction trees can often be sped up by using a hybrid CPU/GPU approach [22].
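As a concrete reference for Equations (1) and (2) and the mapping tables, the following sequential (CPU-side) sketch computes one reduction and one scattering; the flattened-table layout and all names are our assumptions, not the authors' implementation:

// Sequential (CPU) reference for Equations (1) and (2) via mapping tables.
// mu[j] lists the indices of phiXi (or phiXk) entries mapped to separator
// cell j; the layout and names are our assumptions, for illustration only.
#include <cstddef>
#include <vector>

using Table = std::vector<float>;
using MapTables = std::vector<std::vector<int>>;  // one list per separator cell

// Reduction, Equation (1): phiSnew[j] = sum of phiXi over mu[j].
void reduce(const Table& phiXi, const MapTables& mu, Table& phiSnew) {
    for (std::size_t j = 0; j < mu.size(); ++j) {
        float sum = 0.0f;
        for (int r : mu[j]) sum += phiXi[r];
        phiSnew[j] = sum;
    }
}

// Scattering, Equation (2): phiXk[r] *= phiSnew[j] / phiSold[j], with 0/0 = 0.
void scatter(Table& phiXk, const MapTables& mu,
             const Table& phiSold, const Table& phiSnew) {
    for (std::size_t j = 0; j < mu.size(); ++j) {
        float ratio = (phiSold[j] == 0.0f) ? 0.0f : phiSnew[j] / phiSold[j];
        for (int r : mu[j]) phiXk[r] *= ratio;
    }
}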

2.3 Small Separator Problem

Figure 1 contains the histograms of two real-world BNs, Pigs and Munin3.¹ We see that most separators in these two BNs are quite small and have a potential table size of less than 20.

In general, small separators can be found in three scenarios: (i) two small neighboring cliques (we call it the S-S-S pattern); (ii) a small intersection set of two big neighboring cliques (the B-S-B pattern); and (iii) one small neighboring clique and one big neighboring clique (the B-S-S pattern).² Due to parallel computing issues, detailed next, these three patterns characterize what we call the small-separator problem.

¹ These BNs can be downloaded at http://bndg.cs.aau.dk/html/bayesian_networks.html

Figure 2: Due to the small separator in the B-S-S pattern, a long index mapping table is produced. If only element-wise parallelism is used, there is just one thread per index mapping table, resulting in slow sequential computation.

Unfortunately, element-wise parallelism may not provide enough parallel opportunities when a separator is very small and the mapping tables to one or both cliques are very long. A small-scale example, reflecting the B-S-S pattern, is shown in Figure 2.³ While the mapping tables may be processed in parallel, the long mapping tables result in a significant amount of sequential computation within each mapping table.

A state-of-the-art GPU typically supports more than one thousand concurrent threads, so message passing through small separators will leave most of the GPU resources idle. This is a major performance bottleneck, which we address next.

In this paper, to handle the small separator problem, we use clique merging (see Section 4) to eliminate small separators resulting from the S-S-S pattern, and arithmetic parallelism (see Section 3) to attack the B-S-S and B-S-B patterns.

² These patterns, when read left-to-right, describe the size of a clique, the size of a neighboring separator, and the size of a neighboring clique (different from the first) as found in a junction tree. For example, the pattern B-S-S describes a big clique, a small separator, and a small clique.

³ In fact, both the "long" and "short" index mapping tables have for presentation purposes been made short; in a realistic junction tree a "long" table can have more than 10,000 entries.

Figure 3: Example of data structure for two GPU cliques and their separator. Arithmetic parallelism (the reverse pyramid at the bottom) is integrated with element-wise parallelism (mapping tables at the top). The arithmetic parallelism achieves parallel summation of 2^d terms in d cycles.

3 PARALLEL MESSAGE PASSING IN JUNCTION TREES

In order to handle the small-separator problem, we discuss another dimension of parallelism in addition to element-wise parallelism, namely arithmetic parallelism. Arithmetic parallelism exploits the parallel opportunity in the summation of (1) and in the multiplication of (2). By also considering arithmetic parallelism, we can better match the junction tree and the many-core GPU platform by optimizing the computing resources allocated to the two dimensions of parallelism.

Mathematically, this optimization can be modeled as a computation time minimization problem:

min_{pe,pa} T(pe, pa, Ci, Cj, Φ),  subject to: pe + pa ≤ ptot,   (3)

where T(·) is the time consumed for a message passing from clique Ci to clique Cj; pe and pa are the numbers of parallel threads allocated to the element-wise and arithmetic dimensions, respectively; ptot is the total number of parallel threads available on the GPU; and Φ is a collection of GPU-related parameters, such as the cache size. Equation (3) is a formulation of optimizing algorithm performance on a parallel computing platform. Unfortunately, traditional optimization techniques can typically not be applied to this problem, because the analytical form of T(·) is usually not available, due to the complexity of the hardware platform. So in our work we choose pe and pa empirically for our implementation.

In the rest of this section, we describe our algorithm design, seeking to exploit both element-wise and arithmetic parallelism.

3.1 Arithmetic Parallelism

Arithmetic parallelism needs to be exploited in different ways for reduction and scattering, and also integrated with element-wise parallelism, as we discuss now.

For reduction, given a fixed element j, Equation (1) is essentially a summation over all the clique potential table φXi elements indicated by the corresponding mapping table µXi,φSik(j). The number of sums is |µXi,φSik(j)|. We compute the summation in parallel using the approach illustrated in Figure 3. This improves the handling of the long index mapping tables induced, for example, by the B-S-B and B-S-S patterns. The summation is done in several iterations. In each iteration, the numbers are divided into two groups, and the corresponding two numbers in each group are added in parallel. At the end of this recursive process, the sum is obtained, as shown in Algorithm 1. The input parameter d is discussed in Section 3.2; the remaining inputs are an array of floats.

Algorithm 1 ParAdd(d, op(0), . . . , op(2^d − 1))
Input: d, op(0), . . . , op(2^d − 1).
for i = 1 to d do
  for j = 0 to 2^(d−i) − 1 in parallel do
    op(j) = op(j) + op(j + 2^(d−i))
  end for
end for
return op(0)
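A minimal CUDA rendering of Algorithm 1, under the assumption that one thread block of 2^(d-1) threads handles one operand group (launch configuration and names are ours, for illustration only):

// CUDA sketch of Algorithm 1: summation of 2^d operands in d cycles.
// Launch as parAdd<<<1, 1 << (d - 1)>>>(op, d, result); names are ours.
__global__ void parAdd(float* op, int d, float* result) {
    int j = threadIdx.x;
    for (int i = 1; i <= d; ++i) {
        int half = 1 << (d - i);            // 2^(d-i) pair sums this cycle
        if (j < half) op[j] += op[j + half];
        __syncthreads();                    // finish cycle i before i+1
    }
    if (j == 0) *result = op[0];
}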

For scattering, Equation (2) updates the elements of φXk independently, even though φSik and φ∗Sik are re-used to update different elements. Therefore, we can compute each multiplication in (2) with a single thread. The parallel multiplication algorithm is given in Algorithm 2. The input parameter p is discussed in Section 3.2; the remaining inputs are an array of floats.

Algorithm 2 ParMul(p, op1, op2(0), . . . , op2(p − 1))
Input: p, op1, op2(0), . . . , op2(p − 1).
for j = 0 to p − 1 in parallel do
  op2(j) = op1 ∗ op2(j)
end for
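The corresponding CUDA sketch of Algorithm 2 assigns one multiplication per thread; op1 stands in for the per-cell ratio φ∗Sik(n)/φSik(n) (launch configuration and names are ours, for illustration only):

// CUDA sketch of Algorithm 2: p independent multiplications, one per thread.
// Launch as parMul<<<(p + 255) / 256, 256>>>(op1, op2, p); names are ours.
__global__ void parMul(float op1, float* op2, int p) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < p) op2[j] = op1 * op2[j];
}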

3.2 Parallel Belief Propagation

Combining element-wise and arithmetic parallelism, we design the reduction and scattering operations as shown in Algorithm 3 and Algorithm 4. Both of these algorithms take advantage of arithmetic parallelism. In Algorithm 3, the parameter d is the number of cycles used to compute the reduction (summation), while in Algorithm 4, p is the number of threads operating in parallel. Both p and d are parameters that determine the degree of arithmetic parallelism. They can be viewed as a special form of pa in (3). Based on Algorithm 3 and Algorithm 4, junction tree message passing can be written as shown in Algorithm 5.

Belief propagation can be done using both breadth-first and depth-first traversal over a junction tree. We use the Hugin algorithm [5], which adopts depth-first belief propagation. Given a junction tree J with root vertex Croot, we first initialize the junction tree by multiplying together the Bayesian network potential tables (CPTs). Then, two-phase belief propagation is adopted [10]: collect evidence and then distribute evidence [22].

Algorithm 3 Reduction(d, φXi, φSik, µXi,Sik)
Input: d, φXi, φSik, µXi,Sik.
for n = 1 to |φSik| in parallel do
  sum = 0
  for j = 0 to ⌈|µXi,Sik(n)|/2^d⌉ − 1 do
    sum = sum + ParAdd(d, φXi(µXi,Sik(n)(j · 2^d)), . . . , φXi(µXi,Sik(n)((j + 1) · 2^d − 1)))
  end for
  φ∗Sik(n) = sum
end for

Algorithm 4 Scattering(p, φXk, φSik, µXk,Sik)
Input: p, φXk, φSik, µXk,Sik.
for n = 1 to |φSik| in parallel do
  for j = 0 to ⌈|µXk,Sik(n)|/p⌉ − 1 do
    ParMul(p, φ∗Sik(n)/φSik(n), φXk(µXk,Sik(n)(j · p)), . . . , φXk(µXk,Sik(n)((j + 1) · p − 1)))
  end for
end for
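Putting the pieces together, the sketch below shows one plausible device-side realization of Algorithms 3 and 4, mapping the element-wise dimension to thread blocks (one block per separator cell) and the arithmetic dimension to the threads within a block. The CSR-style storage of the mapping tables (muOff/muIdx) and all names are our assumptions, not the authors' implementation:

// CUDA sketch of Algorithms 3-5: block n handles separator cell n
// (element-wise parallelism); the threads of a block supply arithmetic
// parallelism. Mapping tables are stored in CSR form (muOff/muIdx).
__global__ void reductionKernel(const float* phiXi, float* phiSnew,
                                const int* muOff, const int* muIdx) {
    extern __shared__ float part[];         // one partial sum per thread
    int n = blockIdx.x, t = threadIdx.x;
    float sum = 0.0f;
    for (int r = muOff[n] + t; r < muOff[n + 1]; r += blockDim.x)
        sum += phiXi[muIdx[r]];             // strided pass over mu(n)
    part[t] = sum;
    __syncthreads();
    // d-cycle pairwise summation (ParAdd); blockDim.x must be a power of two.
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (t < half) part[t] += part[t + half];
        __syncthreads();
    }
    if (t == 0) phiSnew[n] = part[0];       // Equation (1)
}

__global__ void scatteringKernel(float* phiXk, const float* phiSold,
                                 const float* phiSnew,
                                 const int* muOff, const int* muIdx) {
    int n = blockIdx.x, t = threadIdx.x;
    float ratio = (phiSold[n] == 0.0f) ? 0.0f        // 0/0 := 0, as in (2)
                                       : phiSnew[n] / phiSold[n];
    for (int r = muOff[n] + t; r < muOff[n + 1]; r += blockDim.x)
        phiXk[muIdx[r]] *= ratio;           // ParMul: one product per thread
}

// PassMessage (Algorithm 5): pe = |phiS| blocks, pa threads per block.
// Kernels on the same stream run in order, so reduction finishes first.
void passMessage(const float* phiXi, float* phiXk, const float* phiSold,
                 float* phiSnew, const int* muOffI, const int* muIdxI,
                 const int* muOffK, const int* muIdxK, int sepSize, int pa) {
    reductionKernel<<<sepSize, pa, pa * sizeof(float)>>>(
        phiXi, phiSnew, muOffI, muIdxI);
    scatteringKernel<<<sepSize, pa>>>(phiXk, phiSold, phiSnew, muOffK, muIdxK);
}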

3.3 Max-product Belief Propagation

In this paper, we also apply our parallel techniques to max-product propagation (or, in short, max-propagation), which is also referred to as the Viterbi algorithm. Max-propagation solves the problem of computing a most probable explanation. For max-propagation, the ∑ in (1) is replaced by max [2, 15]. In this paper we use sum-product propagation to explain our parallel algorithms; the explanation can generally be adapted to max-propagation by replacing add with max.

4 CLIQUE MERGING FOR JUNCTION TREES

The performance of our parallel algorithm is to a large extent determined by the degree of parallelism available in message passing, which intuitively can be measured by the separator size |φSik|, which determines the element-wise parallelism, and the mapping table size |µXi,Sik(n)|, which upper-bounds the arithmetic parallelism. In other words, the larger |φSik| and |µXi,Sik(n)| are, the greater the parallelism opportunity. Therefore, message passing between small cliques (the S-S-S pattern), where |φSik| and |µXi,Sik(n)| are small, is not expected to perform well: there is not enough parallelism to make full use of a GPU's computing resources.

Algorithm 5 PassMessage(p, d, φXi, φXk, φSik, µXi,Sik)
Input: p, d, φXi, φXk, φSik, µXi,Sik.
Reduction(d, φXi, φSik, µXi,Sik)
Scattering(p, φXk, φSik, µXk,Sik)

In order to make better use of the GPU computing power, we propose to remove small separators (those that follow the S-S-S pattern) by selectively merging neighboring cliques. This increases the length of the mapping tables; however, the arithmetic parallelism techniques introduced in Section 3 can handle this. Clique merging can be done offline according to the following theorem [8].

Theorem 4.1 Two neighboring cliques Ci and Cj in a junction tree J with separator Sij can be merged into an equivalent new clique Cij with the potential function

φ(xCij) = φ(xCi) φ(xCj) / φ(xSij),   (4)

while keeping all other parts of the junction tree unchanged.

The result of merging cliques is three-fold: (i) it produces larger clique nodes and thus longer mapping tables; (ii) it eliminates small separators; and (iii) it reduces the number of cliques. Larger clique nodes result in more computation and therefore longer processing time for each single thread, but getting rid of small separators improves utilization of the GPU and reduces computation time. We have to manage these two conflicting objectives to improve the overall performance of our parallel junction tree algorithm.

Our algorithm for clique merging is shown in Algorithm 6. It uses two heuristic thresholds, the separator size threshold τs and the index mapping table size threshold τµ, to control the above-mentioned two effects. We only merge two neighboring cliques Ci and Cj into a new clique Cij when |φSij| < τs and |µXj,φSij| < τµ.

Algorithm 6 MergeCliques(J, τs, τµ)
merge_flag = 1
while merge_flag do
  merge_flag = 0
  for each adjacent clique pair (Ci, Cj) in J do
    if |φSij| < τs and |µXj,φSij| < τµ then
      Merge(J, Ci, Cj)
      merge_flag = 1
    end if
  end for
end while
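A host-side sketch of this merge loop, operating on junction tree metadata only (the table sizes that drive the thresholds); the potential update of Theorem 4.1 is abstracted to its effect on table sizes, using |µ| = |φXj|/|φSij| as in Section 5, and all structures and names are ours, for illustration only:

// Host-side sketch of Algorithm 6 on junction tree metadata. A pair is
// merged when its separator table is below tauS and its index mapping
// table is below tauMu; all structures and names are ours.
#include <vector>

struct Edge { int ci, cj; long sepSize; };    // separator between ci and cj

struct JTree {
    std::vector<long> cliqueSize;             // |phiX| for each clique
    std::vector<Edge> edges;
};

void mergeCliques(JTree& jt, long tauS, long tauMu) {
    bool merged = true;
    while (merged) {
        merged = false;
        for (auto& e : jt.edges) {
            if (e.ci == e.cj) continue;        // edge already collapsed
            long mapLen = jt.cliqueSize[e.cj] / e.sepSize;  // |mu| = |phiXj|/|phiS|
            if (e.sepSize < tauS && mapLen < tauMu) {
                // Theorem 4.1: merged table size is |phiXi| * |phiXj| / |phiSij|.
                jt.cliqueSize[e.ci] =
                    jt.cliqueSize[e.ci] * jt.cliqueSize[e.cj] / e.sepSize;
                for (auto& f : jt.edges) {     // re-home cj's edges to ci
                    if (f.ci == e.cj) f.ci = e.ci;
                    if (f.cj == e.cj) f.cj = e.ci;
                }
                merged = true;                 // rescan until a fixed point
            }
        }
    }
}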

Given an S-S-S-S-S pattern, Algorithm 6 may merge two S cliques and produce an S-B-S pattern. Here, B is the merged clique. Note that the B-S sub-pattern creates a long index mapping table, which is exactly what arithmetic parallelism handles. There is, in other words, potential synergy between clique merging and arithmetic parallelism, as is further explored in Section 6.

5 ANALYSIS AND DISCUSSION

In this section, we analyze the theoretical speedup of our two-dimensional parallel junction tree inference algorithm under the idealized assumption that there are unlimited parallel threads available on the many-core computing platform.

The degree of parallelism opportunity is jointly determined by the size of the separator's potential table, |φS|, and the size of the index mapping table, |µX,φS|. Consider a message passed from Ci to Ck. Since we employ separator table element-wise parallelism in our algorithm, we only need to focus on the computation related to one particular separator table element. With the assumption of unlimited parallel threads, we can choose d = ⌈log |µXi,φS|⌉. The time complexity for the reduction is then ⌈log |µXi,φS|⌉, due to our use of summation parallelism.⁴ Note that since |µXi,φS| = |φXi| / |φS|, the time complexity can be written as ⌈log |φXi| − log |φS|⌉. For the scattering phase, we choose p = |µXk,φS|, and the time complexity is given by |µXk,φS|/p + 1 = 2, due to the multiplication parallelism. Thus the overall time complexity of the two-dimensional belief propagation algorithm is

⌈log |φXi| − log |φS|⌉ + 2,   (5)

which is the theoretically optimal time complexity under the assumption of an infinite number of threads. Nevertheless, this value is hard to achieve in practice, since the values of d and p are subject to the concurrency limit of the computing platform. For example, in the above-mentioned BN Pigs, some message passing requires p = 1120, while the GTX460 GPU supports at most 1024 threads per thread block.

⁴ We assume, for simplicity, sum-propagation. The analysis for max-propagation is similar.

Belief propagation is a sequence of messages passed in a certain order [10], for both CPU and GPU [22]. Let Ne(Ci) denote the neighbors of Ci in the junction tree. The time complexity for belief propagation is

∑_i ∑_{k∈Ne(Ci)} (⌈log |φXi| − log |φS|⌉ + 2).

Kernel invocation overhead, incurred each time Algorithm 5 is invoked, turns out to be an important performance factor. If we model the invocation overhead for each kernel call as a constant τ, then the time complexity becomes

∑_i di · τ + ∑_i ∑_{k∈Ne(Ci)} (⌈log |φXi| − log |φS|⌉ + 2),

where di is the degree of node Ci. In a tree structure, ∑_i di = 2(n − 1). Thus the GPU time complexity is

2(n − 1)τ + ∑_i ∑_{k∈Ne(Ci)} (⌈log |φXi| − log |φS|⌉ + 2).

From this equation, we can see that the junction tree topology impacts GPU performance in at least two ways: the total invocation overhead is proportional to the number of nodes in the junction tree, while the separator and clique table sizes determine the degree of parallelism.

The overall speedup of our parallel belief propagation approach is determined by the equation

Speedup = ∑_i ∑_{k∈Ne(Ci)} (|φXi| + |φXk|) / [ 2(n − 1)τ + ∑_i ∑_{k∈Ne(Ci)} (⌈log(|φXi| / |φS|)⌉ + 2) ].
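For a given junction tree, this speedup estimate can be evaluated directly from the per-message clique and separator table sizes, as in the following sketch (base-2 logarithm assumed, matching the d-cycle summation; names are ours, for illustration only):

// Sketch evaluating the speedup expression above for one junction tree,
// given per-directed-message clique and separator table sizes (each tree
// edge contributes two messages). Names are ours, for illustration only.
#include <cmath>
#include <vector>

struct Msg { double srcClique, dstClique, sep; };  // |phiXi|, |phiXk|, |phiS|

double estimateSpeedup(const std::vector<Msg>& msgs, double tau) {
    double seq = 0.0;                       // numerator: sequential work
    double par = tau * msgs.size();         // 2(n - 1) * tau overhead term
    for (const auto& m : msgs) {
        seq += m.srcClique + m.dstClique;
        par += std::ceil(std::log2(m.srcClique / m.sep)) + 2.0;
    }
    return seq / par;
}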

Clearly, the speedup depends on the distribution of the sizes of the separators' and cliques' potential tables. That is the reason we propose the clique merging technique. Using clique merging, we change the number of nodes in the junction tree, and the distribution of the sizes of the separators' and cliques' potential tables as well, adapting the junction tree to the specific parallel computing platform.

From the equations above, we can estimate the overall belief propagation speedup. However, taking into account that the CPU/GPU platform incurs invocation overhead and long memory latency when loading data from slow device memory to fast shared memory, the theoretical speedup is hard to achieve in practice. We take an experimental approach to studying how the structure of the junction tree affects the performance of our parallel technique on the CPU/GPU setting in Section 6.

6 EXPERIMENTAL RESULTS

In the experiments, we study Bayesian networks compiled into junction trees. We not only want to compare the two-dimensional parallel junction tree algorithm to the sequential algorithm, but also to study how effective the arithmetic parallelism and clique merging methods are. Consequently, we experiment with different combinations of element-wise parallelism (EP), arithmetic parallelism (AP), and clique merging (CM).

6.1 Computing Platform

We use the NVIDIA GeForce GTX460 as the platform for our implementation. This GPU consists of seven multiprocessors, and each multiprocessor has 48 cores and 48 KB of on-chip shared memory per thread block. The peak thread-level parallelism achieves 907 GFlop/s. In addition to the fast shared memory, a much larger but slower off-chip global memory (785 MB), shared by all multiprocessors, is provided. The bandwidth between the global and shared memories is about 90 Gbps. In the junction tree computations we use single precision on the GPU, and the thread block size is set to 256.

6.2 Methods and Data

For the purpose of comparison, we use the same set of BNs as used previously [22] (see http://bndg.cs.aau.dk/html/bayesian_networks.html). They are from different domains, with varying structures and state spaces. These differences lead to very different junction trees, see Table 1, resulting in varying opportunities for element-wise and arithmetic parallelism. Thus, we use clique merging to carefully control our two dimensions of parallelism to optimize performance. The Bayesian networks are compiled into junction trees and merged offline, and then junction tree propagation is performed.

6.3 GPU Optimization: Arithmetic Parallelism

Arithmetic parallelism gives us more freedom to match the parallelism in message passing to the concurrency provided by a GPU: when there is not enough potential table element-wise parallelism available, we can increase the degree of arithmetic parallelism. The number of threads assigned to arithmetic parallelism affects the performance significantly. The parameter p in parallel scattering and the parameter 2^d in parallel reduction should be chosen carefully (see Algorithms 1 and 2). Since the GPU can provide only limited concurrency, we need to balance the arithmetic parallelism and the element-wise parallelism for each message passing to get the best performance.

Consider message passing between big cliques, for example according to the B-S-B pattern. Intuitively, the values of the arithmetic parallelism parameters p and d should be set higher than for message passing between smaller cliques. Thus, based on extensive experimentation, we currently employ a simple heuristic parameter selection scheme for the scattering parameter p:

p = 4 if |µXi,Sik(n)| ≤ 100;  p = 128 if |µXi,Sik(n)| > 100,   (6)


Dataset                       Pigs    Munin2   Munin3   Munin4    Mildew    Water     Barley    Diabetes
# of original JT nodes        368     860      904      872       28        20        36        337
# of JT nodes after merge     162     553      653      564       22        18        35        334
Avg. CPT size before merge    1,972   5,653    3,443    16,444    341,651   173,297   512,044   32,443
Avg. CPT size after merge     5,393   10,191   7,374    26,720    447,268   192,870   527,902   33,445
Avg. SPT size before merge    339     713      533      2,099     9,273     26,065    39,318    1,845
Avg. SPT size after merge     757     1,104    865      3,214     11,883    29,129    40,475    1,860
GPU time (sum-prop) [ms]      22.61   86.40    74.99    141.08    41.31     16.33     81.82     68.26
GPU time (max-prop) [ms]      22.8    86.8     72.6     114.9     38.6      12.1      94.3      78.3
CPU time (sum-prop) [ms]      51      210      137      473       355       120       974       397
CPU time (max-prop) [ms]      59      258      119      505       259       133       894       415
Speedup (sum-prop)            2.26x   2.43x    1.82x    3.35x     8.59x     7.35x     11.94x    5.81x
Speedup (max-prop)            2.58x   2.97x    1.64x    4.39x     6.71x     10.99x    9.48x     5.30x

Table 1: Junction tree (JT) statistics and belief propagation (BP) performance for various junction trees, with speedup for our GPU approach (GPU EP + AP + CM) compared to CPU-only in the two bottom rows. The row "CPU time (sum-prop)" gives previous results [22].

Figure 4: Comparison of combinations of junction tree optimization techniques CM, AP, and EP for (a) junction tree sum-propagation and (b) junction tree max-propagation. Best performance is achieved for GPU EP + AP + CM.

Figure 5: GPU execution times with CM (GPU EP + AP + CM) and without CM (GPU EP + AP) for junction trees compiled from sparse BNs representing electrical power systems.

and the reduction parameter d:

d = 2 if |µXi,Sik(n)| ≤ 100;  d = 7 if |µXi,Sik(n)| > 100.   (7)
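Equations (6) and (7) amount to a two-regime lookup, as in the following sketch (the function name is ours, for illustration only):

// The heuristic parameter selection of Equations (6) and (7); the function
// name is ours. mapLen is the index mapping table length |mu(n)|.
void selectParams(long mapLen, int* p, int* d) {
    *p = (mapLen <= 100) ? 4 : 128;   // Equation (6): scattering width
    *d = (mapLen <= 100) ? 2 : 7;     // Equation (7): reduction depth
}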

We compare the execution time when using element-wise parallelism alone with the case when element-wise parallelism is used in combination with arithmetic parallelism. Results, for both sum-propagation and max-propagation, are shown in Figure 4(a) and Figure 4(b). In all cases, GPU EP + AP + CM outperforms all the other approaches.⁵

6.4 GPU Optimization: Clique Merging

Clique merging is based on the observation that many junction trees mostly consist of small cliques. This lack of parallelism opportunity hinders efficient use of the available computing resources, since a single message passing will not be able to occupy the GPU. Merging neighboring small cliques, found in the S-S-S pattern, can help to increase the average size of separators and cliques. Clique merging also reduces the total number of nodes in the junction tree, which in turn reduces the invocation overhead.

⁵ The GPU EP + CM combination is included for completeness, but as expected it often performs very poorly. The reason for this is that CM, by merging cliques, creates larger mapping tables that EP alone is not equipped to handle.

We use two parameters to determine which cliques should be merged: one is τs, the threshold for the separator potential table size, and the other is τµ, the threshold for the index mapping table size. These parameters are set manually in this paper; however, in a companion paper [21] this parameter optimization process is automated by means of machine learning.

In these experiments, we used both arithmetic parallelism and element-wise parallelism. The experiment shows how much extra speedup can be obtained by adding clique merging to arithmetic parallelism. The experimental results can be found in Table 1, in the rows showing the GPU execution times for both sum-propagation and max-propagation. In the junction trees Pigs, Munin2, Munin3 and Munin4, a considerable fraction of the cliques (and consequently, separators) are small; in other words, the S-S-S pattern is common. By merging cliques, we can significantly increase the average separator and clique potential table sizes and thus provide more parallelism.

Comparing GPU EP + AP with GPU EP + AP + CM, the speedup for these junction trees ranges from 10% to 36% when clique merging is used. However, for the junction trees Mildew, Water, Barley and Diabetes, clique merging does not help much, since they mainly consist of large cliques to start with.

Another set of experiments was performed with junction trees that represent an electrical power system, ADAPT [16–18]. These junction trees contain many small cliques, due to their underlying BNs being relatively sparsely connected.⁶ The experimental results are shown in Figure 5. Using clique merging, the GPU execution times are shortened by 30%-50% for these BNs compared to not using clique merging.

6.5 Performance Comparison: CPU

As a baseline, we implemented a sequential program on an Intel Core 2 Quad CPU with 8 MB cache and a 2.5 GHz clock. The execution time of this program is comparable to that of GeNie/SMILE, a widely used C++ software package for BN inference.⁷ We do not directly use GeNie/SMILE as the baseline here, because we do not know the implementation details of GeNie/SMILE.

⁶ http://works.bepress.com/ole_mengshoel/
⁷ http://genie.sis.pitt.edu/

In Table 1, the bottom six rows give the execution time comparison for our CPU/GPU hybrid versus a traditional CPU implementation. The CPU/GPU hybrid uses arithmetic parallelism, element-wise parallelism and clique merging. The obtained speedup for sum-propagation ranges from 1.82x to 11.94x, with an arithmetic average of 5.44x and a geometric average of 4.42x.

The speedup for max-propagation is similar, but differs from sum-propagation in non-trivial ways. The performance is the overall effect of many factors, such as parallelism, memory latency and kernel invocation overhead. These factors, in turn, are closely correlated with the underlying structures of the junction trees. The speedup for max-propagation ranges from 1.64x to 10.99x, with an arithmetic average of 5.51x and a geometric average of 4.61x.

6.6 Performance Comparison: Previous GPU

We now compare the GPU EP + AP + CM technique introduced in this paper with our previous GPU EP approach [22]. From the results in Table 1, compared with the GPU EP approach [22], the arithmetic average cross-platform speedup increases from 3.38x (or 338%) to 5.44x (or 544%) for sum-propagation. For max-propagation,⁸ the speedup increases from 3.22x (or 322%) to 5.51x (or 551%).

7 CONCLUSION AND FUTURE WORK

In this paper, we identified small separators as bottlenecks for parallel computing in junction trees, and developed a novel two-dimensional parallel approach for belief propagation over junction trees. We enhanced these two dimensions of parallelism by careful clique merging in order to make better use of the parallel computing resources of a given platform.

In experiments with a CUDA implementation on an NVIDIA GeForce GTX460 GPU, we explored how the performance of our approach varies over different junction trees from applications, and how clique merging can improve the performance for junction trees that contain many small cliques. For sum-propagation, the average speedup is 5.44x and the maximum speedup is 11.94x. The average speedup for max-propagation is 5.51x, while the maximum speedup is 10.99x.

In the future, we would like to see research on parameter optimization for both clique merging and message passing. It would be useful to automatically adjust the merging parameters for different junction trees based on the size distribution of the cliques and separators. In addition, we also want to automatically tune the kernel launch parameters for each single message passing according to the size of a message. In fact, we have already made progress along these lines, taking a machine learning approach [21].

⁸ We implemented max-propagation based on the approach developed previously [22].

Acknowledgments

This material is based, in part, upon work supported by NSF awards CCF0937044 and ECCS0931978.

References

[1] R. Bekkerman, M. Bilenko, and J. Langford, editors. Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.

[2] A. P. Dawid. Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2:25–36, 1992.

[3] J. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron. Distributed parallel inference on large factor graphs. In Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.

[4] C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. International Journal of Approximate Reasoning, (3):225–263, 1994.

[5] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics, pages 269–282, 1990.

[6] H. Jeon, Y. Xia, and V. K. Prasanna. Parallel exact inference on a CPU-GPGPU heterogeneous system. In Proc. of the 39th International Conference on Parallel Processing, pages 61–70, 2010.

[7] K. Kask, R. Dechter, and A. Gelfand. BEEM: Bucket elimination with external memory. In Proc. of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 268–276, 2010.

[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[9] A. V. Kozlov and J. P. Singh. A parallel Lauritzen-Spiegelhalter algorithm for probabilistic inference. In Proc. of the 1994 ACM/IEEE Conference on Supercomputing, pages 320–329, 1994.

[10] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, 50(2):157–224, 1988.

[11] M. D. Linderman, R. Bruggner, V. Athalye, T. H. Meng, N. B. Asadi, and G. P. Nolan. High-throughput Bayesian network learning using heterogeneous multicore computers. In Proc. of the 24th ACM International Conference on Supercomputing, pages 95–104, 2010.

[12] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. GraphLab: A new framework for parallel machine learning. In Proc. of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 340–349, 2010.

[13] O. J. Mengshoel. Understanding the scalability of Bayesian network inference using clique tree growth curves. Artificial Intelligence, 174:984–1006, 2010.

[14] V. K. Namasivayam and V. K. Prasanna. Scalable parallel implementation of exact inference in Bayesian networks. In Proc. of the 12th International Conference on Parallel and Distributed Systems, pages 143–150, 2006.

[15] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, pages 159–173, 1998.

[16] B. W. Ricks and O. J. Mengshoel. The diagnostic challenge competition: Probabilistic techniques for fault diagnosis in electrical power systems. In Proc. of the 20th International Workshop on Principles of Diagnosis (DX-09), pages 415–422, Stockholm, Sweden, 2009.

[17] B. W. Ricks and O. J. Mengshoel. Methods for probabilistic fault diagnosis: An electrical power system case study. In Proc. of the Annual Conference of the PHM Society, 2009 (PHM-09), San Diego, CA, 2009.

[18] B. W. Ricks and O. J. Mengshoel. Diagnosing intermittent and persistent faults using static Bayesian networks. In Proc. of the 21st International Workshop on Principles of Diagnosis (DX-10), Portland, OR, 2010.

[19] M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache. In Proc. of the 22nd ACM International Conference on Supercomputing, pages 309–318, 2008.

[20] Y. Xia and V. K. Prasanna. Node level primitives for parallel exact inference. In Proc. of the 19th International Symposium on Computer Architecture and High Performance Computing, pages 221–228, 2007.

[21] L. Zheng and O. J. Mengshoel. Optimizing parallel belief propagation in junction trees using regression. In Proc. of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-13), Chicago, IL, August 2013.

[22] L. Zheng, O. J. Mengshoel, and J. Chong. Belief propagation by message passing in junction trees: Computing each message faster using GPU parallelization. In Proc. of the 27th Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 822–830, Barcelona, Spain, 2011.