Herding Dynamical Weights
Max Welling, Bren School of Information and Computer Science, UC Irvine


Motivation

• Xi=1 means that pin i will fall during a Bowling round. Xi=0 means that pin i will still stand.

• You are given pairwise probabilities P(Xi,Xj).

• Task: predict the distribution Q(n), n = 0, ..., 10, of the total number of pins that will fall.

Stock market: Xi=1 means that company i defaults. You are interested in the probability of n companies defaulting in your portfolio.

Sneak Preview

Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown).

(Note: herding is a deterministic algorithm, no noise was added)

Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments.

Quiz: which is which [top/bottom]?

• data in random order
• herding sequence in the order received

Traditional Approach: Hopfield Nets & Boltzmann Machines

$w_{ij}$ is a weight; $s_i$ is a state value (say 0/1).

Energy:
$E(s, w) = -\sum_{ij} w_{ij}\, s_i s_j$

Probability of a joint state:
$P(s) = \frac{1}{Z(w)} \exp\Big(\sum_{ij} w_{ij}\, s_i s_j\Big)$

Coordinate descent on energy:
$s_i \leftarrow I\Big(\sum_j w_{ij}\, s_j > 0\Big)$
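A minimal Python sketch of this coordinate-descent update (illustrative only; the symmetric weight matrix W and the 0/1 state vector s are assumed inputs):

import numpy as np

def coordinate_descent(W, s, n_sweeps=10):
    # Greedy coordinate descent on the Hopfield energy E(s, w) = -sum_ij w_ij s_i s_j.
    # W: symmetric (d, d) weight matrix with zero diagonal; s: length-d 0/1 state vector.
    s = s.copy()
    for _ in range(n_sweeps):
        for i in range(len(s)):
            # set s_i = 1 exactly when its local field sum_j w_ij s_j is positive
            s[i] = 1 if W[i] @ s > 0 else 0
    return s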

Traditional Learning Approach

Model:
$P(X) = \frac{1}{Z} \exp\Big(\sum_{ij} W_{ij} X_i X_j + \sum_i \alpha_i X_i\Big)$

Maximum-likelihood learning (gradient ascent):
$W_{ij} \leftarrow W_{ij} + \eta\,\big(\langle X_i X_j\rangle_{data} - \langle X_i X_j\rangle_P\big)$
$\alpha_i \leftarrow \alpha_i + \eta\,\big(\langle X_i\rangle_{data} - \langle X_i\rangle_P\big)$

Then estimate
$Q(n) = \big\langle I\big(\textstyle\sum_i S_i = n\big)\big\rangle_{S \sim P}, \quad n = 0, \ldots, 10.$

Use CD instead!
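A hedged Python sketch of one such gradient step (the step size eta and the use of model samples, e.g. from Gibbs sampling or CD, are illustrative assumptions, not prescribed by the slides):

import numpy as np

def ml_gradient_step(W, data, model_samples, eta=0.01):
    # W_ij <- W_ij + eta * (<Xi Xj>_data - <Xi Xj>_P)
    # data: (N, d) array of 0/1 training vectors; model_samples: (M, d) samples drawn from P.
    data_corr = data.T @ data / len(data)                              # <Xi Xj>_data
    model_corr = model_samples.T @ model_samples / len(model_samples)  # <Xi Xj>_P
    return W + eta * (data_corr - model_corr)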

What’s Wrong With This?

• E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent).

• Slow convergence & local minima (only w/ hidden vars)

• Sampling can get stuck in local modes (slow mixing).

Solution in a Nutshell

Feed the data moments
$\langle X_i\rangle_{data}, \qquad \langle X_i X_j\rangle_{data}$
into a Nonlinear Dynamical System that directly generates a sequence of states S whose sample averages match them:
$\langle S_i\rangle_S = \langle X_i\rangle_{data}, \qquad \langle S_i S_j\rangle_S = \langle X_i X_j\rangle_{data}$
and read off
$Q(n) = \big\langle I\big(\textstyle\sum_i S_i = n\big)\big\rangle_S, \quad n = 0, \ldots, 10.$

(sidestep learning + sampling)

Herding Dynamics

$S_i = I\Big(\sum_j W_{ij}\, S_j > 0\Big)$
$W_{ij} \leftarrow W_{ij} + \langle X_i X_j\rangle_{data} - S_i S_j$
$w_i \leftarrow w_i + \langle X_i\rangle_{data} - S_i$

• no stepsize
• no random numbers
• no exponentiation
• no point estimates

[Figure: states $S_i$, $S_j$ coupled by weight $W_{ij}$]
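A compact Python sketch of these pairwise herding updates (the initialization, the number of coordinate sweeps, and the function name are my own illustrative choices):

import numpy as np

def herd_pairwise(m1, m2, T=1000, n_sweeps=5):
    # m1: length-d vector of <Xi>_data; m2: (d, d) matrix of <Xi Xj>_data.
    # Returns a (T, d) array of 0/1 pseudo-samples whose running averages approach m1, m2.
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    d = len(m1)
    w, W = m1.copy(), m2.copy()          # weights start at the data moments
    s = np.zeros(d)
    samples = np.zeros((T, d))
    for t in range(T):
        # local maximization by coordinate updates S_i = I(w_i + sum_j W_ij S_j > 0)
        for _ in range(n_sweeps):
            for i in range(d):
                s[i] = 1.0 if w[i] + W[i] @ s > 0 else 0.0
        # herding weight updates: no step size, no randomness
        w += m1 - s
        W += m2 - np.outer(s, s)
        samples[t] = s
    return samples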

Piston Analogy: weights = pistons

Pistons move up at a constant rate (proportional to observed correlations)

When a piston gets too high, the “fuel” combusts and the piston is pushed down (depression).

“Engine driven by observed correlations”

Herding Dynamics with General Features

$S^{(t)} = \arg\max_S \sum_k w_k f_k(S)$
$w_k \leftarrow w_k + \langle f_k(X)\rangle_{data} - f_k(S^{(t)})$

• no stepsize

• no random numbers

• no exponentiation

• no point estimates
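A small Python sketch of herding with general feature functions (the brute-force argmax over all 0/1 states is an illustrative shortcut that only works for small d):

import numpy as np
from itertools import product

def herd_general(features, data_means, d, T=1000):
    # features: list of functions f_k mapping a length-d 0/1 state to a scalar.
    # data_means: array of <f_k(X)>_data, one entry per feature.
    data_means = np.asarray(data_means, dtype=float)
    w = data_means.copy()                                   # weights start at the data moments
    states = [np.array(s, dtype=float) for s in product([0, 1], repeat=d)]
    samples = []
    for t in range(T):
        # S_t = argmax_S sum_k w_k f_k(S)
        scores = [sum(w[k] * f(s) for k, f in enumerate(features)) for s in states]
        s_t = states[int(np.argmax(scores))]
        # w_k <- w_k + <f_k>_data - f_k(S_t)
        w += data_means - np.array([f(s_t) for f in features])
        samples.append(s_t)
    return np.array(samples)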

Features as New Coordinates

[Figure: candidate states map to points $f(S_1), \ldots, f(S_5)$ in feature space; the data mean $\langle f(X)\rangle_{data}$ lies among them, and the weight trajectory $w_1, w_2, \ldots, w_t, w_{t+1}$ moves around it.]

If $\langle f(X)\rangle_{data} \neq \frac{1}{N}\sum_{b=1}^{B} n_b f_b$ for any non-negative integer counts $(n_1, \ldots, n_B)$ with $\sum_b n_b = N$, then the period of the herding sequence is infinite.

thanks to Romain Thibaux

Example: $X \in [-1, 1]$
$f_1(X) = X$
$f_2(X) = \sin(10\, X)$

weights initialized in a grid

red ball tracks 1 weight

convergence onto a fractal attractor set with Hausdorff dimension 1.5
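A toy Python sketch of this two-feature system (the target moments, the grid-based argmax, and the initialization are my own illustrative assumptions):

import numpy as np

def herd_two_features(target, T=20000, w_init=(0.0, 0.0)):
    # Toy herding with f1(x) = x, f2(x) = sin(10 x) on x in [-1, 1].
    # target: assumed pair of target moments (<f1>_data, <f2>_data).
    grid = np.linspace(-1.0, 1.0, 2001)
    F = np.stack([grid, np.sin(10.0 * grid)])           # feature values on a fine grid
    w = np.array(w_init, dtype=float)
    traj = np.zeros((T, 2))
    for t in range(T):
        x = grid[np.argmax(w @ F)]                      # x_t = argmax_x w1*f1(x) + w2*f2(x)
        w += np.asarray(target) - np.array([x, np.sin(10.0 * x)])
        traj[t] = w                                     # the weight trajectory traces the attractor
    return traj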

The Tipi Function

Herding is gradient descent on G(w) with stepsize 1, where
$G(w) = \sum_k w_k \langle f_k\rangle_{data} - \max_S \sum_k w_k f_k(S)$

This function is:
• Concave
• Piecewise linear
• Non-positive
• Scale free

$S^{(t)} = \arg\max_S \sum_k w_k f_k(S)$
$w_k \leftarrow w_k + \langle f_k\rangle_{data} - f_k(S^{(t)})$

Coordinate ascent is replaced with full maximization.

Scale free property implies that stepsize will not affect state sequence S.
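As a sanity check (a step I have filled in, consistent with the updates above): at the maximizing state $S^* = \arg\max_S \sum_k w_k f_k(S)$, a (sub)gradient of G is

$\frac{\partial G}{\partial w_k} = \langle f_k\rangle_{data} - f_k(S^*),$

so one gradient ascent step with stepsize 1 is exactly the herding update $w_k \leftarrow w_k + \langle f_k\rangle_{data} - f_k(S^*)$.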

Recurrence Thm: If we can find the optimal state S, then the weights will stay within a compact region.

Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.

Ergodicity

[Figure: candidate states s = 1, ..., 6; herding visits the sequence s = [1, 1, 2, 5, 2, ...]]

$\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k\rangle_{data}$

Thm: If the 2-norm of the weights grows sublinearly in T, then feature averages over herding trajectories converge to the data averages.
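The reason follows directly from the weight update (a one-line derivation I have filled in): telescoping $w_k \leftarrow w_k + \langle f_k\rangle_{data} - f_k(s_t)$ over T steps gives

$w_k^{(T)} = w_k^{(0)} + T\, \langle f_k\rangle_{data} - \sum_{t=1}^{T} f_k(s_t)
\;\;\Rightarrow\;\;
\frac{1}{T}\sum_{t=1}^{T} f_k(s_t) = \langle f_k\rangle_{data} + \frac{w_k^{(0)} - w_k^{(T)}}{T},$

so the moment error vanishes at rate $O(1/T)$ whenever $\|w^{(T)}\|_2$ grows sublinearly.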

Relation to Maximum Entropy

Maximum entropy (primal):
Maximize $H[P]$
Subject to: $\langle f_k\rangle_P = \langle f_k\rangle_{data}$

Dual:
Maximize $L(W) = \sum_k W_k \langle f_k\rangle_{data} - \log \sum_x \exp\Big(\sum_k W_k f_k(x)\Big)$

Tipi function (zero-temperature limit):
$G(W) = \lim_{T \to 0} T\, L(W/T)$

Herding dynamics satisfies the constraints but not maximal entropy.

Advantages / Disadvantages

• Learning & Inference have merged into one dynamical system.• Fully tractable – although one should monitor whether local maximization is enough to keep weights finite.• Very fast: no exponentation, no random number generation.• No fudge factors (learning rates, momentum, weight decay..).• Very efficient mixing over all “modes” (attractor set).

• Moments preserved, but what is our “inductive bias”? (i.e. what happens to remaining degrees of freedom?).

Back to Bowling
Data collected by P. Cotton. 10 pins, 298 bowling runs.
X=1 means a pin has fallen in two subsequent bowls.
H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities.

[Plot: P(total nr. of pins falling)]

More Results
Datasets:
Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6242)
8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest=600)

Task: given only pairwise probabilities, compute the probability of the total nr. of 1’s in a data-vector, Q(n).

Solution: apply herding and compute Q(n) through sample averages.

Error: KL[P_data || P_est]
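For example (a small illustrative helper, not from the slides), Q(n) can be read off the herding pseudo-samples like this:

import numpy as np

def estimate_Q(samples):
    # samples: (T, d) 0/1 array of herding pseudo-samples.
    # Returns Q(n) for n = 0..d as the fraction of pseudo-samples with exactly n ones.
    d = samples.shape[1]
    counts = samples.sum(axis=1).astype(int)
    return np.bincount(counts, minlength=d + 1) / len(samples)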

Task: given only pairwise probabilities, compute the classifier P(Y|X).

Solution: train a logistic regression (LR) classifier on the herding sequence.

Error: fraction of misclassified test cases.

LR is too simple; PL on the herding sequence also gives 0.04. In higher dimensions herding loses its advantage in accuracy.
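A hedged Python sketch of this step, assuming the herding pseudo-samples include the class label as one of the binary variables (the slides only say that LR is trained on the herding sequence):

import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_on_herding(samples, label_col, X_test, y_test):
    # samples: (T, d) 0/1 herding pseudo-samples; column `label_col` is assumed to hold the label Y.
    # X_test: (Ntest, d-1) test features; y_test: test labels.
    X = np.delete(samples, label_col, axis=1)
    y = samples[:, label_col]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return 1.0 - clf.score(X_test, y_test)              # fraction of misclassified test cases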

Conclusions

• Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.

• Model for “neural computation”
– similar to dynamical synapses
– quasi-random sampling of state space (chaotic?)
– local updates
– efficient (no random numbers, no exponentiation)
