

Forecast Hedging and Calibration

Dean P. Foster

Amazon

Sergiu Hart

Hebrew University of Jerusalem

Calibration means that forecasts and average realized frequencies are close. We develop the concept of forecast hedging, which consists of choosing the forecasts so as to guarantee that the expected track record can only improve. This yields all the calibration results by the same simple basic argument while differentiating between them by the forecast-hedging tools used: deterministic and fixed point based versus stochastic and minimax based. Additional contributions are an improved definition of continuous calibration, ensuing game dynamics that yield Nash equilibria in the long run, and a new calibrated forecasting procedure for binary events that is simpler than all known such procedures.

I. Introduction

Weather forecasters nowadays no longer say that “it will rain tomorrow” or “it will not rain tomorrow”; rather, they state that “the chance that it will rain tomorrow is x.” As long as x lies strictly between 0 and 1, they cannot be proven wrong tomorrow, whether it rains or not. However, they can be proven wrong over time. This is the case when a forecast, say,

Previous versions: April 2016, November 2019 (Center for Rationality DP-731), June 2020. We thank Benjy Weiss for useful discussions; John Levy, Efe Ok, Sylvain Sorin, and Bernhard von Stengel for references related to theorem 4; and the editor and referees for very helpful suggestions. This paper was edited by Emir Kamenica.

Electronically published October 7, 2021

Journal of Political Economy, volume 129, number 12, December 2021. © 2021 The University of Chicago. All rights reserved. Published by The University of Chicago Press. https://doi.org/10.1086/716559

3447

x = 70%, is repeated many times, and the proportion of rainy days among those days when the forecast was 70% is far from 70%.

A forecaster is said to be (classically) calibrated if, in the long run, the actual proportions of rainy days are close to the forecasts (formally, the average difference between frequencies and forecasts—the calibration score—is small). A surprising result of Foster and Vohra (1998) shows that one may always generate forecasts that are guaranteed to be calibrated, no matter what the weather will actually be.¹ These forecasts must necessarily be stochastic; that is, in each period, the forecast x is chosen by a randomization (e.g., with probability 1/3 the forecaster announces that the chance of rain tomorrow is x = 70%, and with probability 2/3 the forecaster announces that the chance is x = 50%),² since deterministic forecasts cannot be calibrated against all possible future rain sequences (cf. Dawid 1982 and Oakes 1985).³ The analysis is thus from a worst-case point of view, which is the same as if one were facing an adversarial “rainmaker.”⁴
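The impossibility claim for deterministic forecasts is easy to check numerically. The sketch below (our own illustration; the function name is ours) plays the sequence of footnote 3, rain if and only if the forecast of rain is below 50%, against an arbitrary deterministic forecaster; every forecast value then incurs a per-bin error of at least 0.5, so the calibration score can never fall below 0.5:

```python
def adversarial_calibration_error(forecaster, T):
    """Play footnote 3's sequence (rain iff the forecast of rain is < 50%)
    against a deterministic forecaster; return the calibration score K_T."""
    n, tot = {}, {}          # per-forecast counts and rainy-day totals
    history = []
    for t in range(T):
        c = forecaster(history)
        a = 1 if c < 0.5 else 0          # rain exactly when the forecast is low
        history.append((c, a))
        n[c] = n.get(c, 0) + 1
        tot[c] = tot.get(c, 0) + a
    # K_T = sum over forecast values x of (n(x)/T) * |average outcome - x|
    return sum(n[x] / T * abs(tot[x] / n[x] - x) for x in n)
```

For instance, the constant forecast 0.5 sees no rain at all and scores exactly 0.5, and no deterministic rule can do better against this sequence.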

Now the calibration score is discontinuous with respect to the forecasts, as it considers days when the forecast was, say, 69.9% separately from the days when the forecast was 70%. Smoothing out the calibration score by combining, in a continuous manner, the days when the forecast was close to x before comparing the frequency of rain to x yields a continuous calibration score, which we introduce in section II.B. The advantage of continuous calibration is that it may be guaranteed by deterministic forecasts (i.e., after every history, there is a single x that is forecasted—in contrast to a probabilistic distribution over x in the classic calibration setup of the previous paragraph). Similar concepts that appear in the literature—weak calibration (Kakade and Foster 2004; Foster and Kakade 2006) and smooth calibration (Foster and Hart 2018)—are encompassed by continuous calibration (see app. sec. A2). While the existing proofs of deterministic smooth and weak calibration are complicated, in this paper we provide a simple proof of deterministic continuous calibration—and so of smooth and weak calibration as well. We thus propose continuous calibration as the more appropriate concept: more natural and easier to analyze and guarantee.

In this paper, we identify specific conditions, which we refer to as forecast-hedging conditions, that guarantee that the calibration score will essentially not increase, whatever tomorrow’s weather will be.⁵ Roughly speaking, they

¹ There are many proofs of the classic calibration result, some relatively simple: besides Foster and Vohra (1998), see Hart (1995; presented in sec. 4 of Foster and Vohra 1998), Foster (1999), Foster and Vohra (1999), Fudenberg and Levine (1999), Hart and Mas-Colell (2000, 2013), and the survey of Olszewski (2015).

² The randomization may depend on the history of weather and forecasts.

³ Consider the sequence where each day there is rain if and only if the forecast of rain is less than 50%.

⁴ Which connects to the related literature on the manipulability of tests; see Dekel and Feinberg (2006), Olszewski and Sandroni (2008), and the survey of Olszewski (2015).

⁵ The use of the term “hedging” here is akin to its use in finance, where one deals with portfolios that are hedged against risks (by using, say, appropriate options and derivatives).


amount to making sure that today’s calibration errors will tend to go in the opposite direction of past calibration errors (thus overshooting, where the forecast is higher than the frequency of rain, is followed by undershooting, and the other way around). This is illustrated in section I.B by a stylized simple version of forecast hedging in the basic binary rain/no rain setup. Interestingly, it turns out to yield a new calibrated procedure in this one-dimensional case that is as simple as can be (and is simpler than the one in Foster 1999); see section V for the formal analysis.

We show, first, that the main calibration results in the literature (classic, smooth, weak, almost deterministic, and continuous, introduced here) all follow from the same simple argument based on forecast hedging. Second, we provide the appropriate forecast-hedging tools. In the classic calibration setup, they correspond to optimal strategies in finite two-person zero-sum games, whose existence follows from von Neumann’s (1928) minimax theorem, and which are mixed (i.e., stochastic) in general. In the continuous calibration setup, they correspond to fixed points of continuous functions, whose existence follows from Brouwer’s (1912) fixed point theorem, and which are deterministic. We refer to the resulting procedures as procedures of type MM and type FP, respectively. This forecast-hedging approach integrates the existing calibration results by deriving them all from the same proof scheme while clearly differentiating between the MM procedures and the FP procedures, both in terms of the tool they use—minimax versus fixed point—and in terms of being stochastic versus deterministic. Thus, classic calibration is obtained by MM procedures, whereas continuous calibration as well as almost deterministic calibration are obtained by FP procedures. A further benefit of our approach is the simple and straightforward proof that it provides of deterministic continuous calibration and thus of deterministic smooth calibration (in contrast to the long and complicated existing proof).

While calibration is stated in terms of forecasting, our forecast hedging makes it clear that this is a misnomer, as there is no actual prediction of rain or no rain tomorrow (indeed, such a prediction cannot be accomplished without making some assumptions on the behavior of the rainmaker). Rather, calibration obtains by what can be referred to as “backcasting” (instead of forecasting): forecast hedging guarantees that the past track record can essentially only improve, no matter what the weather will be.

A. The Economic Utility of Calibration

Now, why would one consider calibration at all? Though some forecasts are created just for fun (say, predicting a sports winner or a presidential election), other forecasts drive decision-making (say, predicting the chance of rain or the chance of selling a million widgets). We will focus on forecasts that have decisions attached to them. If the forecaster is the same person


as the decision maker, then he can interpret the forecast in any fashion he likes and still be consistent. But when the forecaster is different from the decision maker, it is desirable for them to be speaking the same language. To make this concrete, consider the rain forecast that a traveler hears on landing in a new city. Should an umbrella be unpacked and made ready? Or is the weather nice enough not to need one? Locals may be perfectly happy with a forecast that implies some set U such that if x ∈ U, then carrying an umbrella makes sense.⁶ But pity our poor traveler who has to figure out the set U without any history. Contrast this with the world where the forecast in each city is known to be calibrated. Then our traveler can figure out a rule, say, x > 70%, and dig out his umbrella if the forecast is higher than 70%. Further, this works for both the timid traveler who has a rule of x > 20% and the outdoors person with a rule of x > 99%. There can be many other wonderful properties of forecasts that we could hope to have (accuracy or martingality, to name two), but by merely having calibration, the forecasts are connected enough to outcomes to be useful to decision makers.

Calibration thus allows one to separate the problem into two pieces: the first is providing a forecast of the world, and the second is taking an action that is rational, given that forecast. This model is a good way of factoring a business since a forecasting team does not need to understand the nuances that go into the decision-making, nor does the decision team need to know the details of the most current statistical methods that go into making the forecasts. There are details that the forecasting team will be continuously worrying about, like whether a neural net is more accurate than a decision tree or a simple regression. Likewise, there are details that the decision-making team will be stressing over, like changing costs and updating constraints. But as long as they are communicating via calibrated forecasts, these worries do not need to be exposed to the other team. The forecasting team generates calibrated forecasts, and the optimization team treats these forecasts as if they were probabilities and solves their optimization problem. This factorization localizes information but still generates a globally optimal outcome.⁷ For a concrete example, consider figure 1, from Foster and Stine (2004). It shows two forecasts of when a customer will go bankrupt. The calibrated forecast (right) is easy to use: a customer with a forecasted high chance of bankruptcy should not be extended further credit.

⁶ That is, the expected benefit of not being wet on a rainy day exceeds the expected cost of carrying the umbrella—and perhaps losing it someplace—on a sunny day.

⁷ A real-life story from a large online retailer is that an old-fashioned autoregressive moving average (ARMA) forecasting model was used for years. It was not calibrated, and so the optimization team had learned to buy more than the forecast suggested. When the ARMA model was replaced by a modern neural net that was much more accurate and also calibrated, the retailer lost money—until the optimization team caught up with the change in the forecasting model. If both forecasts had been calibrated, there would have been much less internal stress, and the newer model would have been an easy immediate improvement.


FIG. 1.—In Foster and Stine (2004), the business problem was to forecast the chance of a person going bankrupt in the next month. Both of the above forecasts are based on a large linear model. The one on the left was obtained by a logistic regression and the one on the right by a monotone regression. The left-hand forecast is not calibrated, whereas the right-hand forecast is calibrated and so can be used directly for decision-making. BR = bankruptcy.

The cutoff point can be created using the costs and benefits to the firm. By contrast, constructing a rule based on the uncalibrated forecast (left) requires actually doing some statistics to figure out what a forecast of, say, 70% means. The optimization team would have to do some empirical statistics, and thus we have failed at factoring the problem into two clean pieces.

Figure 1 may incorrectly suggest that all we need to do is map a forecast through an appropriate link function that gives the corresponding average realization and then all will be well. This is true for cross-sectional data and for time-series data where the link function is evaluated at a single point in time. But in general, we would need different such functions at different points in time. Phrased in terms of our intrepid traveler, if he arrives for a second time at the same foreign city, the rule he used on the first visit may no longer apply. But if the forecasts were calibrated, the same trivial rule would work for both visits. Mathematically, this means that a calibrated forecast must divide an arbitrary sequence into a collection of subsequences (one for each forecast value),⁸ all of which have a limit. This is the hard part. The fact that we also require a calibrated forecast to know what this limit is on each of these subsequences is a small restriction compared with guaranteeing that there are no fluctuations over time and all these limits exist.

Let us turn to the decision side of the problem.

Sometimes the forecast is so strong for rain that not carrying an umbrella would entail a huge cost.⁹ Likewise, it might be that the chance of rain is so low that carrying one would be too costly. Both of these costs are relative to the best possible action one could take. But sometimes the forecast is close to the fence and it does not really matter which action is taken. This indifference (equipoise in biostatistics) allows one to consider randomizing between these two actions. This would cheaply allow estimating the actual costs of each action. It would allow one to compare what would happen if the counterfactual action were taken with what happens if the action that is believed to be the correct action is taken. For these reasons, there are many arguments for randomizing at the boundary. Mathematically, it can be thought of as continuously switching from taking an umbrella (at the boundary plus epsilon) to never taking an umbrella (at the boundary minus epsilon). If such a continuous response function is used, then the classic definition of calibration is stronger than it needs to be. Indeed, we care only about what the approximate value of the forecast is since we will behave similarly for all such values. This is where continuous calibration comes in.

Now what is the advantage of using a weaker notion of calibration (continuous calibration is implied by classic calibration), which is also more

⁸ We refer to this as binning (see sec. II.B).

⁹ While we continue to phrase the discussion in terms of rain for simplicity, think of more meaningful circumstances, such as contextual bandits in machine learning and personalized medicine in clinical trials.


difficult to obtain (it requires a fixed point rather than a minimax computation every period; see sec. III.D)? The answer is that weakening the calibration requirement allows one to achieve the important property of leakiness of Foster and Hart (2018); namely, the forecasts remain calibrated even if the action in each period depends on the forecast (which is the case when the forecast is revealed—i.e., leaked—before the action is chosen). Indeed, for deterministic procedures that yield continuous calibration, the fact that at the start of each period t the forecast at t is already known (as it is fully determined by the history before t) does not matter, as continuous calibration is guaranteed for any action. By contrast, for stochastic procedures that yield classic calibration, at the start of period t, only the distribution of the random forecast at t is known and not its actual realization; if the actual realization were known, there would be action choices that would invalidate calibration, as in footnote 3. This distinction is underscored by forecast hedging, which holds for sure in the deterministic case and only in expectation in the stochastic case. It is just as in a two-person zero-sum game, where an optimal mixed strategy is no longer optimal if the opponent knows its pure realization, whereas an optimal pure strategy remains so even if known (the same holds for mixed vs. pure Nash equilibria). So to answer our question, we can trade off this weaker requirement of calibration for a guarantee of leakiness. Since the weakening does not decrease the value of the forecast for decision-making, we have gained leakiness at minimal cost.

Leakiness turns out to be the crucial property that is needed for game dynamics in general n-person games to give Nash equilibria rather than correlated equilibria. Specifically, while best replying to calibrated forecasts yields correlated equilibria as the long-run time average of play (see Foster and Vohra 1997), we show in section VI.A that best replying to deterministic continuously calibrated forecasts yields Nash equilibria being played in most of the periods¹⁰ (for earlier, somewhat more complicated variants of this result, see Kakade and Foster 2004; Foster and Hart 2018).

To return to forecasting, in numerous situations Bayesian methods are optimal.¹¹ But if you are using the wrong prior, a lot of the charm of Bayesian methods is lost, and estimators that provide robust minimax protection might be preferred. If we could estimate the prior, then a Bayesian approach sounds pretty good. This is one of the motivations for empirical Bayesian methods (see Berger 1985). Unfortunately, unless we are observing a sequence of independent and identically distributed problems for which we can truly believe there is a single prior that is common

¹⁰ The statements here should be understood with appropriate “approximate” adjectives throughout.

¹¹ Dawid (1982) discusses the connection of calibration to posterior probabilities, whereas here we want to connect it to the priors.


across a string of problems (see Robbins 1956), then figuring out the prior to use for the next problem is not easy. This is where calibration can play a part (see George and Foster 2000). By guaranteeing the connection between the beliefs (our forecasts) and the actual parameters, we can use a calibrated forecast to make stronger claims about priors that are estimated in a sequential empirical Bayes setting.

For a statistician or econometrician, not being calibrated is one of the most embarrassing mistakes to make. Suppose we are trying to predict some variable Y on the basis of a bunch of Xi’s. If it turns out that we could get a much better fit by looking at X17/X12 than we currently are getting, that would be considered a great scientific result and no one would fault the previous work that missed it. But if 3Y or Y³ were better forecasts than the Y provided by the statistician, that would be an embarrassing error. Given the numerous ways of correcting uncalibrated forecasts (see Zadrozny and Elkan 2001), people would ask, “Didn’t you look at your forecast at all?” Of course, when dealing with out-of-sample forecasts, this can occur since the world might change. Hence, the value of these calibration methods, which sequentially adapt to a changing world, is to ensure that we can avoid this embarrassment.

Finally, regarding forecast hedging: as it is an elementary principle, it might perhaps help dispel some of the mystery behind the prevalence of well-calibrated forecasts, such as the “superforecasters” of the Good Judgement Project (see Mellers et al. 2015; Tetlock and Gardner 2015), FiveThirtyEight (see fig. 2), ElectionBettingOdds (see fig. 3),¹² and others. Indeed, in most of these cases, one forecasts binary yes/no events, where forecast hedging is extremely simple and straightforward to implement (see secs. I.B, V).¹³

¹² In such betting/market models, we see that calibration goes part way toward the weak efficient market hypothesis (wEMH). For example, take the sequence of times where a stock price is above its 7-day average and we are considering whether to buy it (momentum) or to sell it (mean reversion). If we had a forecast of the “correct price,” then these could be expressed as saying “buy” when the forecast is above the price and “sell” when it is below. The property we would then want such a forecast to have is merely calibration. Given how simple it is for forecast hedging to generate calibration, it is reasonable to expect many traders to all discover something close to the same calibrated forecast and hence push the market in that direction until the price is the same as the forecast (while this would not generate the full wEMH, which requires its holding for all price patterns, it does go in that direction).

¹³ Of course, we are not implying that forecast hedging is what these forecasters consciously do. What we are saying is that since calibration is very easy to achieve, we should not be surprised by its being often obtained. At the same time, it might be of interest to check whether there is any balancing of current and past forecasting errors, as in forecast hedging (see the discussion above where forecast hedging is defined and the illustration in sec. I.B). Finally, we note that forecasters are tested not only by their calibration scores but also by stronger measures of “accuracy” or “skill” (specifically, their Brier scores).


FIG. 2.—Calibration plots of FiveThirtyEight (projects.fivethirtyeight.com/checking-our-work, updated June 26, 2019). For example, in the “Everything” plot, the 10% data point (which lies slightly below the diagonal) has the following attached description: “We thought the 107,962 observations in this bin had a 10% chance of happening. They happened 9% of the time.” A color version of this figure is available online.

B. Forecast Hedging: A Simple Illustration

Consider the basic rain/no rain setup—or, for that matter, any sequence of arbitrary, possibly unrelated yes/no events (as in the abovementioned projects)—and let the forecasts lie on the equally spaced grid 0, 1/N, 2/N, ..., 1 for some integer N ≥ 1. Take period t. For each forecast x, let n(x) ≡ n_{t−1}(x) be the number of days that x has been used in the past t − 1 periods, and let r(x) ≡ r_{t−1}(x) be the number of rainy days out of those n(x) days. If the forecast x is correct, there should have been rain on x · n(x) out of the n(x) days, and so the excess number of rainy days at x is G(x) ≡ G_{t−1}(x) ≔ r(x) − x · n(x).¹⁴ For simplicity, consider the sum of squares score S ≡ S_{t−1} ≔ Σ_x G(x)².¹⁵

rain and a 5 0 for no rain, and let c ; ct in the interval ½0, 1� denotethe forecast at time t. The change in the score S from time t 2 1 to time

FIG. 3.—Calibration plot of ElectionBettingOdds (electionbettingodds.com/TrackRecord.html, updated November 13, 2018), which “tracked some 462 different candidate chances across dozens of races and states in 2016 and 2018.” A color version of this figure is available online.

¹⁴ Think of G(x) as the total “gap” at x; it may be positive, zero, or negative. The vertical distance from the diagonal in the calibration plot (as in figs. 1–3) is the normalized gap G(x)/n(x).

¹⁵ We abstract away from technical details, such as the appropriate normalizations, in this illustration (for the precise analysis, see secs. IV, V). For the expert reader, we note that the calibration score at time t is K_t = Σ_x |G_t(x)|/t (see sec. II), which is small when S_t/t² is small (by the Cauchy–Schwarz inequality). Note that a constant forecast of, say, c = 1/2 yields S_t = t²/4 in the worst case (where all days are rainy or all days are sunny) and thus a calibration score that is bounded away from zero.


t is S_t − S_{t−1} = (G(c) + a − c)² − G(c)² (the only term that changes in the sum S is the G(c) term for the forecasted c), whose first-order approximation equals 2D for¹⁶

D ≔ G(c) · (a − c). (1)
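As a concrete illustration (our own sketch, not the paper’s code; the class name `GapTracker` is ours), the quantities n(x), r(x), G(x), and the increment D can be maintained incrementally:

```python
from collections import defaultdict

class GapTracker:
    """Track, for each forecast value x, the gap G(x) = r(x) - x * n(x):
    the excess number of rainy days among the n(x) days x was forecast."""

    def __init__(self):
        self.n = defaultdict(int)    # n(x): times forecast x was used
        self.r = defaultdict(int)    # r(x): rainy days among those
        self.G = defaultdict(float)  # G(x) = r(x) - x * n(x)

    def delta(self, c, a):
        """D = G(c) * (a - c): first-order change of the score S, up to the factor 2."""
        return self.G[c] * (a - c)

    def update(self, c, a):
        """Record forecast c and outcome a (1 = rain, 0 = no rain)."""
        self.n[c] += 1
        self.r[c] += a
        self.G[c] = self.r[c] - c * self.n[c]
```

For example, after forecasting 70% on two days with one rainy day, the gap at 0.7 is 1 − 0.7 · 2 = −0.4, so a rainy day tomorrow (a = 1) would give D < 0 and shrink the score.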

We would like to choose the forecast c so that

D ≡ G(c) · (a − c) ≤ 0 for any a, (2)

that is, no matter what the weather will be. This is easy to do when there is a point c on the grid with G(c) = 0: just forecast this c. In general, however, we can aim only to make the inequality D ≤ 0 hold on average by choosing the forecast at random:¹⁷

E[D] ≡ E[G(c) · (a − c)] ≤ 0 for any a. (3)

This is what we call the forecast-hedging condition (condition (2) is a special case of (3)). Interestingly, this inequality seems to express the idea discussed in the introduction that the errors a − c of the current forecast would tend to have the opposite sign of the errors G(c) of the past forecasts.

How can (3) be obtained? Randomizing between two forecasts, say, c1 with probability p1 and c2 with probability p2 = 1 − p1, yields

E[D] = p1 G(c1) · (a − c1) + p2 G(c2) · (a − c2)
     = [p1 G(c1) + p2 G(c2)] · (a − c2) + p1 G(c1) · (c2 − c1).

We can thus guarantee E[D] to be small, no matter what a will be, by choosing the ck and pk so that, first,

p1 G(c1) + p2 G(c2) = 0, (4)

and, second, c2 − c1 is small.¹⁸
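Condition (4), together with p1 + p2 = 1, pins the weights down: p1 is proportional to |G(c2)| and p2 to |G(c1)|. A minimal sketch (the function name is ours):

```python
def hedge_weights(G1, G2):
    """Given gaps G1 > 0 > G2 at two forecasts c1 < c2, return (p1, p2) with
    p1 + p2 = 1 and p1*G1 + p2*G2 = 0, i.e. condition (4)."""
    assert G1 > 0 > G2
    total = G1 - G2          # = |G1| + |G2|
    p1 = -G2 / total         # p1 proportional to |G2|
    p2 = G1 / total          # p2 proportional to |G1|
    return p1, p2
```

For example, gaps of +3 and −1 give weights 1/4 and 3/4: the forecast with the larger outstanding gap is used less often, so the expected gap contribution cancels.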

Specifically, working on the grid 0, 1/N, 2/N, ..., 1, we obtain these forecasts as follows. If G(j/N) = 0 for some j, then take c = j/N, which makes D = 0. Otherwise, G(i/N) ≠ 0 for all i, and so let j ≥ 1 be any index with G(j/N) < 0 (such a j exists because G(0) > 0 and G(1) < 0) and take c1 = (j − 1)/N and c2 = j/N (and thus G(c1) > 0 > G(c2)), with the

¹⁶ We ignore the term (a − c)², which is bounded by 1, since the total contribution to S_t of all these terms is at most t and thus negligible relative to t² (see n. 15).

¹⁷ This condition is reminiscent of the Blackwell (1956) approachability condition in the regret-based approach to calibration of Hart and Mas-Colell (2000).

¹⁸ This turns out to suffice because c2 − c1 is multiplied by p1 G(c1), which is of the order of t (for details, see secs. IV, V). The size of the calibration error is determined by the distance between c1 and c2.


pk inversely proportional to |G(ck)| (i.e., as given by (4)). Figure 4 provides two examples of graphs of G (for N = 6; dotted lines provide linear interpolation). In the left panel, we have c on the grid with G(c) = 0, which yields the perfect deterministic hedging of (2). In the right panel, we have adjacent c1, c2 on the grid with G(c1) > 0 > G(c2), which yields the approximate stochastic hedging of (3).

One of course needs to keep track of all the approximation errors, but, surprisingly, the procedure described here does work: it guarantees an average calibration error that goes to zero as the grid size N increases (see sec. V).¹⁹ It turns out to be a new addition to the literature and simpler than any existing calibrated procedure in this one-dimensional binary setup (i.e., rain/no rain). Moreover, while it involves randomizations (it must!), the randomizations are all between two neighboring grid points ((j − 1)/N and j/N), and so this procedure is an almost deterministic procedure (Foster 1999; Kakade and Foster 2008).

Forecast hedging is central to all the calibration results in this paper in higher dimensions as well. Specifically, we have the following:

• For classic calibration, probabilistic weights pk that ensure that E[D] is small are obtained using a minimax result; this is stochastic forecast hedging.

• For continuous calibration, where the corresponding function G becomes continuous, a deterministic point c that ensures D ≤ 0 (a

FIG. 4.—Examples of forecast hedging. The forecasts x are marked on the horizontal axis, and the gaps G(x) are marked on the vertical axis. On the left, deterministic forecast hedging is obtained by forecasting c with G(c) = 0; on the right, stochastic forecast hedging is obtained by randomizing among the forecasts c1 and c2 with probabilities p1 and p2 such that p1 G(c1) + p2 G(c2) = 0. A color version of this figure is available online.

¹⁹ We provide there a slight variant that uses the normalized errors e instead of the gaps G; it is just as simple and guarantees the minimal possible calibration error of 1/(2N) (whereas, for the G-based procedure here, the error is of the order of 1/√N).


special case of which is G(c) = 0) is obtained by a fixed point result; this is deterministic forecast hedging.

• Again for classic calibration, an almost deterministic forecast is obtained by replacing the fixed point with an appropriate distribution on nearby grid points (as done above in the one-dimensional case); this is almost deterministic forecast hedging.
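Putting the pieces of section I.B together, the grid procedure can be sketched as follows (our own minimal implementation, not the paper’s code). Note that, per the setup, the action sequence may depend on the history but not on the realized forecast, so the test run below uses an oblivious alternating rain sequence:

```python
import random

def grid_forecast(G, N, rng):
    """One period of the grid procedure on {0, 1/N, ..., 1}.
    G[j] holds the current gap G(j/N); returns the (possibly random) forecast."""
    for j in range(N + 1):
        if G[j] == 0:                      # perfect deterministic hedging: D = 0
            return j / N
    j = next(j for j in range(1, N + 1) if G[j] < 0)  # G((j-1)/N) > 0 > G(j/N)
    p1 = -G[j] / (G[j - 1] - G[j])         # condition (4) on the two adjacent gaps
    return (j - 1) / N if rng.random() < p1 else j / N

def run(T, N, seed=0):
    """Run for T periods against an oblivious alternating rain sequence;
    return the calibration score K_T = sum_x |G_T(x)| / T."""
    rng = random.Random(seed)
    G = [0.0] * (N + 1)
    for t in range(T):
        c = grid_forecast(G, N, rng)
        a = t % 2                          # oblivious action: rain every other day
        G[round(c * N)] += a - c           # update: G(x) = r(x) - x * n(x)
    return sum(abs(g) for g in G) / T
```

The forecast-hedging guarantee is that K_T stays small, of the order of 1/√N for this G-based variant, for any action sequence chosen without seeing the realized forecast.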

C. The Organization of the Paper

The paper is organized as follows. Section II presents the general calibration setup and introduces the new concept of “continuous calibration.” Section III is devoted to what we call “outgoing” theorems, which provide the forecast-hedging tools that are used to obtain the calibration results in section IV. The simple procedure in the one-dimensional case is given in section V. In section VI, we show that the game dynamics of best replying to continuously calibrated forecasts—“continuously calibrated learning”—yield Nash equilibria, and we conclude in section VII with the significant distinction made here between the minimax and the fixed point universes. The appendix provides further details, proofs, and extensions.

II. The Calibration Setup

Let A be a set of possible outcomes, which we call actions (such as A = {0, 1}, with a = 1 standing for rain and a = 0 for shine), and let C be the set of forecasts about these actions (such as C = [0, 1], with c in C standing for “the chance of rain is c”). We assume that C ⊂ R^m is a nonempty compact convex subset of a Euclidean space, and A ⊆ C.²⁰ Some special cases of interest are as follows: (1) C is the set of probability distributions Δ(A) over a finite set A, which is identified with the set of unit vectors in C (and then C is a unit simplex); (2) C is the convex hull conv(A) of a finite set of points A ⊂ R^m (and then C is a polytope); and (3) C = A (and then A is already convex). Let γ ≔ diam(C) ≡ max_{c,c′∈C} ‖c − c′‖ be the diameter of the set C. Let δ > 0; a subset D of C is a δ-grid of C if for every c ∈ C there is d ∈ D at distance less than δ from c, that is, ‖d − c‖ < δ; a compact set C always has a finite δ-grid (obtained from a finite subcover by open δ-balls).
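For instance (our illustration, with function names of our choosing), for C = [0, 1] the equally spaced grid {0, 1/N, ..., 1} is a finite δ-grid whenever 1/(2N) < δ, which can be checked numerically on a fine mesh of test points:

```python
def delta_grid(N):
    """Equally spaced finite grid {0, 1/N, ..., 1} of C = [0, 1]."""
    return [i / N for i in range(N + 1)]

def is_delta_grid(D, delta, samples=1000):
    """Check the covering property on a uniform mesh: every tested c in [0, 1]
    must have some grid point within distance strictly less than delta."""
    return all(min(abs(d - c) for d in D) < delta
               for c in (k / samples for k in range(samples + 1)))
```

With N = 10, the farthest points of [0, 1] from the grid are the midpoints, at distance 1/20 = 0.05, so the grid is a δ-grid for δ = 0.051 but not for δ = 0.049.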

A be the action. The forecast at time t may well depend on the historyht21 5 ðc1, a1; c2, a2; ::: ; ct21, at21Þ ∈ ðC � AÞt21 of past forecasts and actions.A deterministic (forecasting) procedure j is thus a mapping j : [t≥1ðC � AÞt21 →C that assigns to every history ht21 a forecast ct 5 jðht21Þ ∈ C at time t. A

20 Rm denotes the m-dimensional Euclidean space, with the usual Euclidean (‘2) normk�k.

forecast hedging and calibration 3459

stochastic ( forecasting) procedure j is a mapping j : [t≥1ðC � AÞt21 → DðCÞthat assigns to every history ht21 a probability distribution jðht21Þ onC ac-cording to which the forecast ct at time t is chosen. Let r > 0; a stochasticprocedure j is r-almost deterministic if for every history ht21 the support ofthe distribution jðht21Þ of ct is included in a closed ball of radius r; thatis, the forecast ct is deterministic within a precision r.We refer to a 5 ðatÞ∞t51, where at ∈ A for every t, as an action sequence; the

sequencemay be anything from a fixed (oblivious) sequence all the way toan adaptive (adversarial) sequence; the latter allows the action at at time tto be determined by the history ht21 as well as by the forecasting procedure(i.e., the mapping j).21 Let at 5 ðasÞts51 denote the first t coordinates of a.

A. Classic Calibration

Fix a time $t \ge 1$ and a sequence $(c_s, a_s)_{s=1,\ldots,t} \in (C \times A)^t$ of forecasts and actions up to time $t$. For every $x$ in $C$, let^22

$$n_t(x) := |\{1 \le s \le t : c_s = x\}| = \sum_{s=1}^{t} \mathbf{1}_x(c_s)$$

be the number of times that the forecast $x$ has been used, and for every $x$ with $n_t(x) > 0$, let

$$\bar{a}_t(x) := \frac{1}{n_t(x)} \sum_{s=1}^{t} \mathbf{1}_x(c_s)\, a_s$$

be the average of the actions in all the periods that the forecast $x$ has been used. The calibration error $e_t(x)$ of the forecast $x$ is then

$$e_t(x) := \bar{a}_t(x) - x$$

(when the forecast $x$ has not been used, i.e., $n_t(x) = 0$, we put for convenience $e_t(x) := 0$).

The classic calibration score is the average calibration error, namely,

$$K_t := \sum_{x \in C} \Big( \frac{n_t(x)}{t} \Big)\, \|e_t(x)\|; \tag{5}$$

thus, the error of each $x$ is weighted in proportion to the number of times $n_t(x)$ that $x$ has been used (the weights add up to 1 because $\sum_x n_t(x) = t$).^23

^21 See the remark following the definition of forecast hedging in sec. IV.A. In the setup of the calibration game (see Foster and Hart 2018), which is a repeated simultaneous game of perfect monitoring and perfect recall between the action player and the calibrating player (the forecaster), the statement "for every action sequence $a$" translates to "for every (pure) strategy of the action player."

^22 We write $\mathbf{1}_x$ for the $x$-indicator function, i.e., $\mathbf{1}_x(c) = 1$ for $c = x$ and $\mathbf{1}_x(c) = 0$ for $c \ne x$. The number of elements of a finite set $Z$ is denoted by $|Z|$.

^23 The sum (5) is finite, as it goes over all $x$ with $n_t(x) > 0$, i.e., over $x$ in the set $\{c_1, \ldots, c_t\}$. In line with standard statistics usage, one may average the squared Euclidean norms $\|e_t(x)\|^2$ instead (cf. $X_t$ in the proof of theorem 9(S)); this will not affect the results.


Let $\varepsilon > 0$; a (stochastic) procedure $\sigma$ is $\varepsilon$-calibrated (Foster and Vohra 1998) if

$$\lim_{t \to \infty} \left( \sup_{a^t} \mathbf{E}[K_t] \right) \le \varepsilon$$

(the expectation $\mathbf{E}$ is taken over the random forecasts of $\sigma$).^24 In appendix section A5, we show that one may make $K_t$ small with probability 1 (i.e., almost surely), not just in expectation.
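To make the definitions concrete, the score (5) is straightforward to compute from a finite record of forecasts and actions. The following sketch is our own illustration (not part of the paper) for the binary case $A = \{0,1\}$, $C = [0,1]$, where $\|e_t(x)\|$ is simply $|\bar{a}_t(x) - x|$:

```python
from collections import defaultdict

def classic_calibration_score(forecasts, actions):
    """Classic calibration score K_t of eq. (5): group periods by the exact
    forecast used, average the actions in each group, and weight each
    group's error |average action - forecast| by its relative size."""
    t = len(forecasts)
    count = defaultdict(int)        # n_t(x): number of times x was used
    action_sum = defaultdict(float)
    for c, a in zip(forecasts, actions):
        count[c] += 1
        action_sum[c] += a
    return sum((count[x] / t) * abs(action_sum[x] / count[x] - x)
               for x in count)

# Always forecasting 0.5 while it rains in 6 periods out of 10:
K = classic_calibration_score([0.5] * 10, [1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
# K = |0.6 - 0.5| = 0.1
```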

B. Binning and Continuous Calibration

The calibration error $e_t(x)$ can be rewritten as

$$e_t(x) = \sum_{s=1}^{t} \Big( \frac{\mathbf{1}_x(c_s)}{n_t(x)} \Big) (a_s - c_s)$$

(because $\sum_{s=1}^{t} \mathbf{1}_x(c_s)\, c_s = n_t(x)\, x$); thus, $e_t(x)$ is the average of the differences $a_s - c_s$ between actions and forecasts, where only the periods $s$ where the forecast was $x$ count.

The calibration score, as defined by (5), can then be interpreted as follows. For each $x$ in $C$, there is a bin, call it the $x$-bin, which tracks the errors of the forecast $x$; namely, if at time $s$ the forecast is $c_s$ and the action is $a_s$, then the difference $a_s - c_s$ between the action and the forecast is assigned to the $c_s$-bin. At time $t$, one computes the average error $e_t(x)$ of each $x$-bin, and then the calibration score $K_t$ is the average norm of these errors, where the weight of each $x$-bin is proportional to its size, that is, to the number of elements $n_t(x)$ that it contains.

As discussed in the introduction, the resulting calibration score is highly discontinuous: forecasts $c$ and $c'$, even when slightly apart, are tracked separately in distinct bins. To smooth this out and treat them similarly, we have to, first, allow for fractional assignments into bins and, second, make these assignments depend continuously on the forecast $c$.

What then is a general binning system? It is given by the fraction $0 \le \pi_i(c) \le 1$ of each forecast $c$ that goes into each bin $i$, where these fractions add up to 1 over all bins (for each $c$). We assume for convenience that the number of bins is countable, that is, finite or countably infinite; there is no loss of generality in this assumption, as we show in appendix section A1. A binning is thus a collection $\Pi = (\pi_i)_{i=1}^{I}$, with $I$ finite or $I = \infty$, of functions $\pi_i : C \to [0,1]$ such that

^24 The calibration score $K_t$ depends on the actions and forecasts up to time $t$ and is thus a function $K_t \equiv K_t(a, \sigma)$ of the action sequence $a$ and the forecasting procedure $\sigma$ (in fact, only $a^t$ and $\sigma^t$, the restriction of $\sigma$ to histories up to time $t$, matter for $K_t$). The same applies to the other scores throughout the paper.


$$\sum_{i=1}^{I} \pi_i(c) = 1$$

for every $c \in C$; the binning is continuous if all the functions $\pi_i$ are continuous functions of $c$.

A continuous binning is obtained, for instance, by taking points $y_i$ in $C$ and letting the fraction of forecast $c$ that goes into the $y_i$-bin decrease continuously with the distance between $c$ and $y_i$. For a specific example in which only small neighborhoods matter, take $\{y_1, \ldots, y_I\}$ to be a finite $\delta$-grid of $C$ and put $\pi_i(c) := \Lambda(c, y_i)/\sum_{j=1}^{I} \Lambda(c, y_j)$ for each $1 \le i \le I$, where $\Lambda(c, y) := [\delta - \|c - y\|]^{+}$.^25

Next, what is the calibration score $K_t^{\Pi}$ with respect to a (continuous) binning $\Pi = (\pi_i)_{i=1}^{I}$? As for the classic calibration score $K_t$, one first computes the average error in each bin and then takes the average norm of these errors in proportion to the total weights accumulated in the bins. The total weight of bin $i$ is

$$n_t^i := \sum_{s=1}^{t} \pi_i(c_s),$$

the average error of bin $i$ is

$$e_t^i := \sum_{s=1}^{t} \Big( \frac{\pi_i(c_s)}{n_t^i} \Big) (a_s - c_s)$$

(again, put $e_t^i := 0$ when $n_t^i = 0$), and the $\Pi$-calibration score is

$$K_t^{\Pi} := \sum_{i=1}^{I} \Big( \frac{n_t^i}{t} \Big)\, \|e_t^i\| \tag{6}$$

(the weights $n_t^i/t$ add up to 1, because $\sum_{i=1}^{I} n_t^i = \sum_{s=1}^{t} \sum_{i=1}^{I} \pi_i(c_s) = t$).^26

A deterministic procedure $\sigma$ is $\Pi$-calibrated if

$$\lim_{t \to \infty} \left( \sup_{a^t} K_t^{\Pi} \right) = 0,$$

and it is continuously calibrated if it is $\Pi$-calibrated for every continuous binning $\Pi$.

^25 For fixed $y$, the graph of the so-called tent function $\Lambda(c, y)$ looks like the symbol $\Lambda$ (with the peak at $c = y$).

^26 For a continuous binning, the bin errors $e_t^i$ are continuous averages of the classic calibration errors $e_t(x)$, namely,

$$e_t^i = \sum_{x \in C} \Big( \frac{\pi_i(x)\, n_t(x)}{n_t^i} \Big)\, e_t(x);$$

thus, continuous binnings do indeed capture the idea of smoothing out the calibration errors (as in the above example with the $\delta$-grid $\{y_i\}$ on $C$).

Compared with classic calibration, continuous calibration requires the convergence to be to zero (rather than $\le \varepsilon$) simultaneously for all continuous $\Pi$.^27
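For the one-dimensional case $C = [0,1]$, the $\delta$-grid example above and the score (6) can be sketched in code as follows (our own illustration; the function names are ours):

```python
def tent_binning(grid, delta):
    """Continuous binning on C = [0,1] built from a finite delta-grid
    {y_i}: the share of forecast c assigned to the y_i-bin is
    pi_i(c) = L(c, y_i) / sum_j L(c, y_j), with the tent function
    L(c, y) = max(delta - |c - y|, 0)."""
    def pi(i, c):
        tents = [max(delta - abs(c - y), 0.0) for y in grid]
        return tents[i] / sum(tents)  # sum > 0 since the grid is a delta-grid
    return pi

def binned_calibration_score(forecasts, actions, grid, delta):
    """Pi-calibration score K_t^Pi of eq. (6) for the tent binning."""
    pi = tent_binning(grid, delta)
    t = len(forecasts)
    score = 0.0
    for i in range(len(grid)):
        weights = [pi(i, c) for c in forecasts]
        n_i = sum(weights)                       # total weight of bin i
        if n_i == 0.0:
            continue
        e_i = sum(w * (a - c)
                  for w, c, a in zip(weights, forecasts, actions)) / n_i
        score += (n_i / t) * abs(e_i)            # bin weight times bin error
    return score
```

Each forecast is split fractionally among the bins whose tents cover it, so nearby forecasts are scored together, in contrast to the classic score.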

C. Gaps and Preliminary Results

Rather than working with the normalized errors, it is convenient to work with unnormalized "gaps." For every real function on $C$, that is, $\pi : C \to \mathbb{R}$, and $t \ge 1$, let

$$g_t(\pi) := \frac{1}{t} \sum_{s=1}^{t} \pi(c_s)(a_s - c_s)$$

be the (per-period) gap at time $t$ with respect to $\pi$ (when $\pi = \mathbf{1}_x$, this is the total gap $G(x)$ of sec. I.B divided by the number of periods). We extend the definitions of $n_t$ and $e_t$ by

$$n_t(\pi) := \sum_{s=1}^{t} \pi(c_s) \quad \text{and} \quad e_t(\pi) := \frac{1}{n_t(\pi)} \sum_{s=1}^{t} \pi(c_s)(a_s - c_s)$$

for every $\pi$, and then the relation $g_t(\pi) = (n_t(\pi)/t)\, e_t(\pi)$ immediately yields

$$K_t = \sum_{x \in C} \|g_t(\mathbf{1}_x)\| \quad \text{and} \quad K_t^{\Pi} = \sum_{i=1}^{I} \|g_t(\pi_i)\|$$

(indeed, for $K_t$ we have $n_t(x) \equiv n_t(\mathbf{1}_x)$ and $e_t(x) \equiv e_t(\mathbf{1}_x)$, and for $K_t^{\Pi}$ we have $n_t^i \equiv n_t(\pi_i)$ and $e_t^i \equiv e_t(\pi_i)$).

For every function $\pi$, the vectors $e_t(\pi)$ and $g_t(\pi)$ are proportional; they differ in that the denominator is $n_t(\pi)$ in the former and $t$, which is larger, in the latter. The calibration scores are averages of the norms of $e_t$ and sums of the norms of $g_t$. One advantage of the $g_t$ representation is that we do not need to keep track explicitly of the total weights $n_t$.^28 Another is that, fixing the sequence of actions and forecasts, we find that the mapping $g_t$ is a linear bounded operator: $g_t(\alpha \pi + \alpha' \pi') = \alpha\, g_t(\pi) + \alpha'\, g_t(\pi')$ for scalars $\alpha, \alpha' \in \mathbb{R}$, and using the supremum norm $\|\pi\| := \sup_{c \in C} |\pi(c)|$ for functions $\pi : C \to \mathbb{R}$, we have

$$\|g_t(\pi)\| \le \gamma\, \|\pi\|$$

(because $g_t(\pi)$ is an average of vectors $\pi(c_s)(a_s - c_s)$ of norm $\le \|\pi\|\, \mathrm{diam}(C) = \|\pi\|\, \gamma$); therefore,

$$\big|\, \|g_t(\pi)\| - \|g_t(\pi')\| \,\big| \le \|g_t(\pi) - g_t(\pi')\| = \|g_t(\pi - \pi')\| \le \gamma\, \|\pi - \pi'\|. \tag{7}$$

^27 One could get uniformity over binnings $\Pi$ by restricting them to a compact space (e.g., by imposing a uniform Lipschitz condition on the $\pi_i$, as in weak and smooth calibration).

^28 In particular, when $n_t(\pi)$ vanishes, so does $g_t(\pi)$.

Returning to the binning condition, which we can write as $\sum_{i=1}^{I} \pi_i = \mathbf{1}$,^29 it says that $\Pi = (\pi_i)_{i=1}^{I}$ is a "partition of unity," and so the resulting calibration score $K_t^{\Pi}$ may be viewed as the "variation" of $g_t$ with respect to the partition $\Pi$. In particular, the classic calibration score $K_t$ is the variation of $g_t$ with respect to the partition $\sum_{x \in C} \mathbf{1}_x = \mathbf{1}$ into indicator functions. Since the indicator partition is the finest partition,^30 it stands to reason that $K_t$ would be the maximal possible variation, that is, the "total variation" of $g_t$. This is indeed so: for every binning $\Pi$, we have

$$K_t^{\Pi} \le K_t, \tag{8}$$

which immediately follows from applying lemma 1 below to $\Pi$. Thus, any notion based on binning—in particular, continuous calibration—is a weakening of classic calibration: if $K_t$ is small, then so are all the relevant $K_t^{\Pi}$.

Lemma 1. Let $(\pi_j)_{j \in J}$ be a countable collection of nonnegative functions on $C$, that is, $\pi_j : C \to \mathbb{R}_{+}$ for every $j \in J$. Then

$$\sum_{j \in J} \|g_t(\pi_j)\| \le \Big\| \sum_{j \in J} \pi_j \Big\|\; K_t.$$

Proof.—Put $W := \sum_{j \in J} \pi_j$; using $\pi_j = \sum_{x \in C} \pi_j(x)\, \mathbf{1}_x$ and the linearity of $g_t$, we have^31

$$\sum_{j \in J} \|g_t(\pi_j)\| \le \sum_{j \in J} \sum_{x \in C} \pi_j(x)\, \|g_t(\mathbf{1}_x)\| = \sum_{x \in C} \sum_{j \in J} \pi_j(x)\, \|g_t(\mathbf{1}_x)\| = \sum_{x \in C} W(x)\, \|g_t(\mathbf{1}_x)\| \le \|W\| \sum_{x \in C} \|g_t(\mathbf{1}_x)\| = \|W\|\, K_t.$$

QED

For another use of this lemma, let $\Pi = (\pi_i)_{i=1}^{\infty}$ be an infinite continuous binning. The increasing sequence of continuous functions $\sum_{i=1}^{k} \pi_i$ converges pointwise, as $k \to \infty$, to the continuous function $\mathbf{1}$ on the compact set $C$, and so by Dini's theorem (see, e.g., Rudin 1976, theorem 7.13), the convergence is uniform:

$$\lim_{k \to \infty} \Big\| \sum_{i=k+1}^{\infty} \pi_i \Big\| = 0. \tag{9}$$

Using lemma 1 for every $a^t$ together with $K_t \le \gamma$ (by (5)) yields

$$\lim_{k \to \infty} \left( \sup_{a^t} \sum_{i=k+1}^{\infty} \|g_t(\pi_i)\| \right) = 0. \tag{10}$$

^29 We write $\mathbf{1}$ for the constant $1$ function; all indicator and $\pi$ functions are defined on $C$ only.

^30 Any further split into fractions of indicators does not matter since $g_t(\alpha \mathbf{1}_x) = \alpha\, g_t(\mathbf{1}_x)$.

^31 The sum $\sum_{x \in C}$ in the proof is a finite sum (over $x \in \{c_1, \ldots, c_t\}$), and so it commutes with $\sum_{j \in J}$.

Thus, for continuous binning, only finitely many $\pi_i$ matter, which leads to a simpler characterization of continuous calibration in terms of pointwise-in-$\pi$ convergence.

Proposition 2. A deterministic forecasting procedure $\sigma$ is continuously calibrated if and only if

$$\lim_{t \to \infty} \left( \sup_{a^t} \|g_t(\pi)\| \right) = 0 \tag{11}$$

for every continuous function $\pi : C \to [0,1]$.

Proof.—Given a continuous function $\pi : C \to [0,1]$, let $\Pi$ be the continuous binning $(\pi, \mathbf{1} - \pi)$. Since $\|g_t(\pi)\| \le K_t^{\Pi}$, continuous calibration implies (11).

Conversely, let $\Pi = (\pi_i)_{i=1}^{I}$ be a continuous binning. When $I$ is finite, we have $\sup_{a^t} K_t^{\Pi} = \sup_{a^t} \sum_{i=1}^{I} \|g_t(\pi_i)\| \le \sum_{i=1}^{I} \sup_{a^t} \|g_t(\pi_i)\|$, which converges to zero as $t \to \infty$ by (11). When $I$ is infinite, for every $\varepsilon > 0$ there is by (10) a finite $k$ such that $\sup_{a^t} K_t^{\Pi} \le \sum_{i=1}^{k} \sup_{a^t} \|g_t(\pi_i)\| + \varepsilon$, which converges to $\varepsilon$ as $t \to \infty$ by (11); since $\varepsilon$ is arbitrary, the limit is zero. QED

We now construct a continuous binning $\Pi_0$ such that $\Pi_0$-calibration implies $\Pi$-calibration for all continuous $\Pi$ (and so $\Pi_0$ plays, for continuous calibration, the same role that the indicator binning plays for classic calibration; see (8)).

Proposition 3. There exists a continuous binning $\Pi_0$ such that a deterministic forecasting procedure $\sigma$ is continuously calibrated if and only if it is $\Pi_0$-calibrated.

Proof.—The space of continuous functions from the compact set $C$ to $[0,1]$ is separable with respect to the supremum norm; let $(u_i)_{i=1}^{\infty}$ be a dense sequence. Take $\alpha_i > 0$ such that $\sum_{i=1}^{\infty} \alpha_i \|u_i\| \le 1$ (e.g., $\alpha_i = 1/(2^i \|u_i\|)$), and put $\pi_i := \alpha_i u_i$ for all $i \ge 1$ and $\pi_0 := \mathbf{1} - \sum_{i=1}^{\infty} \pi_i$ (the function $\pi_0$ is continuous because $\sum_{i=1}^{\infty} \pi_i(c) \le \sum_{i=1}^{\infty} \alpha_i \|u_i\| \le 1$). Thus, $\Pi_0 = (\pi_i)_{i=0}^{\infty}$ is a continuous binning, and so continuous calibration implies $\Pi_0$-calibration.

Conversely, $\Pi_0$-calibration implies (11) for each $\pi_i$ in $\Pi_0$ (because $\|g_t(\pi_i)\| \le K_t^{\Pi_0}$) and, hence, for each $u_i$ by the linearity of $g_t$. This extends from the dense sequence $(u_i)_i$ to any continuous $\pi : C \to [0,1]$ by (7), and proposition 2 completes the proof. QED

Proposition 2 also implies that continuous calibration is a strengthening of existing Lipschitz-based notions of weak calibration (Kakade and Foster 2004; Foster and Kakade 2006) and smooth calibration (Foster and Hart 2018). Indeed, a continuously calibrated procedure—a simple construction of which we provide in section IV—is "universally" weakly and smoothly calibrated (by contrast, the known constructions depend on the Lipschitz bound $L$ and the desired calibration error $\varepsilon$; see proposition 15 in app. sec. A2).^32 Thus, continuous calibration may well be used instead of weak and smooth calibration.

III. Forecast-Hedging Tools

In this section, we provide useful variants of Brouwer's (1912) fixed point theorem and von Neumann's (1928) minimax theorem; they are used in section IV to obtain forecasts that satisfy the forecast-hedging conditions. These conditions, of the form (2) and (3) (see sec. I.B), are referred to as "outgoing" because of their geometric interpretation (see the paragraph following the statement of theorem 4). The reader may skip the proofs in this section at first reading; however, see the important distinction between fixed point and minimax procedures in section III.D.

Throughout this section, $f : C \to \mathbb{R}^m$ is a function from the nonempty compact and convex subset $C$ of $\mathbb{R}^m$ into $\mathbb{R}^m$ (with the same dimension $m$), which may be interpreted as a vector field flow (i.e., think of $x$ as moving to $x + f(x)$ or to $x + \varepsilon f(x)$ for some $\varepsilon > 0$).

A. Outgoing Fixed Point

When the function $f$ is continuous, we obtain the following:

Theorem 4 (Outgoing fixed point). Let $C \subset \mathbb{R}^m$ be a nonempty compact convex set and $f : C \to \mathbb{R}^m$ be a continuous function. Then there exists a point $y$ in $C$ such that

$$f(y) \cdot (x - y) \le 0 \tag{12}$$

for all $x \in C$.

Thus, $f(y) \cdot y = \max_{x \in C} f(y) \cdot x$. If $y$ is an interior point of $C$, then we must have $f(y) = 0$ (because $x - y$ can be proportional to any vector in $\mathbb{R}^m$), and if $y$ is on the boundary of $C$, then $f(y)$ is an outgoing normal to the boundary of $C$ at $y$. This result is the "variational inequalities" lemma 8.1 in Border (1985), who attributes it to Hartman and Stampacchia (1966, lemma 3.1). We provide a short direct proof using Brouwer's (1912) fixed point theorem.^33

^32 The traditional way to obtain universal procedures is by restarting them at appropriate times with new values of the parameters (as in sec. 4.4 of Kakade and Foster 2004). The procedures that we construct in this paper are much simpler.

^33 Theorem 4 is in fact equivalent to Brouwer's fixed point theorem, as the latter is easily proved from the former; see app. sec. A3, which contains various comments on the outgoing results.

Proof.—For every $z \in \mathbb{R}^m$, let $y(z) \in C$ be the closest point to $z$ in the set $C$, that is, $\|y(z) - z\| = \min_{x \in C} \|x - z\|$. As is well known, because $C$ is a convex and compact set, $y(z)$ is well defined (i.e., it exists and is unique), the function $y$ is continuous, and

$$(z - y(z)) \cdot (x - y(z)) \le 0 \tag{13}$$

for every $x \in C$ (when $z \in C$, it trivially holds because then $y(z) = z$, and when $z \notin C$, the vector $z - y(z)$ is an outward normal to $C$ at the boundary point $y(z)$).

The function $x \mapsto y(x + f(x))$ is thus a continuous function from $C$ to $C$, and so by Brouwer's fixed point theorem, there is $y \in C$ such that $y = y(y + f(y))$. Applying (13) to the point $z = y + f(y)$, for which $y(z) = y$, yields the result. QED
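In dimension $m = 1$ with $C = [0,1]$, the conclusion of theorem 4 can be reached by elementary means, which may help intuition: either an endpoint of the interval is already outgoing, or $f$ changes sign and a zero can be located by bisection. A sketch (ours, assuming only that $f$ is continuous):

```python
def outgoing_point(f, tol=1e-9):
    """Find y in C = [0,1] with f(y) * (x - y) <= 0 for all x in [0,1]
    (the m = 1 case of theorem 4, f continuous). Either an endpoint is
    outgoing (f(0) <= 0 or f(1) >= 0), or f changes sign from + to -
    and bisection locates an interior zero of f."""
    if f(0.0) <= 0.0:
        return 0.0            # x - y >= 0 and f(y) <= 0 for all x
    if f(1.0) >= 0.0:
        return 1.0            # x - y <= 0 and f(y) >= 0 for all x
    lo, hi = 0.0, 1.0         # now f(lo) > 0 > f(hi): a sign change
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2      # f vanishes here, up to tol

y = outgoing_point(lambda c: 0.6 - c)   # interior zero of f, near 0.6
```

In higher dimensions no such one-dimensional search is available, which is exactly why Brouwer's theorem is invoked.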

B. Outgoing Minimax

For functions $f$ that need not be continuous, we have the following:

Theorem 5 (Outgoing minimax). Let $C \subset \mathbb{R}^m$ be a nonempty compact convex set, let $D \subset C$ be a finite $\delta$-grid of $C$ for some $\delta > 0$, and let $f : D \to \mathbb{R}^m$. Then there exists a probability distribution $\eta$ on $D$ such that

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - y)\,] \le \delta\, \mathbf{E}_{y \sim \eta}[\,\|f(y)\|\,] \tag{14}$$

for all $x \in C$. Moreover, the support of $\eta$ can be taken to consist of at most $m + 3$ points of $D$.

When $f$ is bounded, by taking $\delta = \varepsilon / \sup_{x \in C} \|f(x)\|$, we have the following:

Corollary 6. Let $C \subset \mathbb{R}^m$ be a nonempty compact convex set and $f : C \to \mathbb{R}^m$ a bounded function. Then for every $\varepsilon > 0$, there exists a probability distribution $\eta$ on $C$ such that

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - y)\,] \le \varepsilon$$

for all $x \in C$. Moreover, the support of $\eta$ can be taken to consist of at most $m + 3$ points of $C$.

Unlike in the outgoing fixed point theorem 4, in the outgoing minimax theorem 5, $y$ is a random variable and no longer a constant, and the outgoing inequality holds in expectation (within an arbitrarily small error). The proof is a finite construct that uses von Neumann's (1928) minimax theorem and thus amounts to solving a linear programming problem.^34

^34 As we will see in app. sec. A3, corollary 6 is equivalent to the minimax theorem (as theorem 4 is equivalent to Brouwer's fixed point theorem).

Proof of theorem 5.—Let $\delta_0 \equiv \delta_0(D) := \max_{x \in C} \mathrm{dist}(x, D)$ be the farthest away a point in $C$ may be from the $\delta$-grid $D$; the maximum is attained on the compact set $C$, and so $\delta_0 < \delta$. Put $\delta_1 := \delta - \delta_0 > 0$, and take $B \subset C$ to be a finite $\delta_1$-grid of $C$. Consider the finite two-person zero-sum game where the maximizer chooses $b \in B$, the minimizer chooses $y \in D$, and the payoff is $f(y) \cdot (b - y) - \delta_0 \|f(y)\|$. For every mixed strategy $\nu \in \Delta(B)$ of the maximizer, let $\bar{b} := \mathbf{E}_{b \sim \nu}[b] \in C$ be its expectation; the minimizer can make the payoff $\le 0$ by choosing a point $y$ on the grid $D$ that is within $\delta_0$ of $\bar{b}$:

$$\mathbf{E}_{b \sim \nu}[\,f(y) \cdot (b - y) - \delta_0 \|f(y)\|\,] = f(y) \cdot (\bar{b} - y) - \delta_0 \|f(y)\| \le 0$$

(because $f(y) \cdot (\bar{b} - y) \le \|f(y)\| \cdot \|\bar{b} - y\| \le \|f(y)\|\, \delta_0$). Therefore, by the minimax theorem, the minimizer can guarantee that the payoff is $\le 0$; that is, there is a mixed strategy $\eta \in \Delta(D)$ such that

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (b - y) - \delta_0 \|f(y)\|\,] \le 0 \tag{15}$$

for every $b \in B$. Since for every $x \in C$ there is $b \in B$ with $\|x - b\| < \delta_1$, and so $f(y) \cdot (x - b) \le \delta_1 \|f(y)\|$ for every $y$, adding this inequality to (15) yields, by $\delta_0 + \delta_1 = \delta$, the inequality (14) for every $x \in C$.

For the moreover statement, (14) says that the vector $\mathbf{E}_{y \sim \eta}[F(y)]$ satisfies $\mathbf{E}_{y \sim \eta}[F(y)] \cdot (x, -1, -\delta) \le 0$ for every $x \in C$, where

$$F(y) := (\,f(y),\; f(y) \cdot y,\; \|f(y)\|\,) \in \mathbb{R}^{m+2}$$

for each $y \in D$. By Carathéodory's theorem, $\mathbf{E}_{y \sim \eta}[F(y)]$ can be expressed as a convex combination of at most $m + 3$ points in $\{F(y) : y \in D\}$, and so the support of $\eta$ can be taken to be of size at most $m + 3$. QED

C. Almost Deterministic Outgoing Fixed Point

We can improve the result of the outgoing minimax theorem and obtain a probability distribution that is almost deterministic—that is, the randomization is between nearby points—by using a fixed point.

A probability distribution $\eta$ is said to be $\rho$-local if its support is included in a closed ball of radius $\rho$; that is, there exists $x$ such that $\eta(\bar{B}(x; \rho)) = 1$, where $B(x; \rho) = \{z : \|z - x\| < \rho\}$ and $\bar{B}(x; \rho) = \{z : \|z - x\| \le \rho\}$ denote, respectively, the open and closed balls of radius $\rho$ around $x$.

Theorem 7 (Almost deterministic outgoing fixed point). Let $C \subset \mathbb{R}^m$ be a nonempty compact convex set, let $D \subset C$ be a finite $\delta$-grid of $C$ for some $\delta > 0$, and let $f : D \to \mathbb{R}^m$. Then there exists a $\delta$-local probability distribution $\eta$ on $D$ such that

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - y)\,] \le \delta\, \mathbf{E}_{y \sim \eta}[\,\|f(y)\|\,]$$

for all $x \in C$. Moreover, the support of $\eta$ can be taken to consist of at most $m + 1$ points of $D$.

When $f$ is bounded, by taking $\delta = \min\{\varepsilon / \sup_{x \in C} \|f(x)\|,\; \rho\}$, we have the following:

Corollary 8. Let $C \subset \mathbb{R}^m$ be a nonempty compact convex set and $f : C \to \mathbb{R}^m$ a bounded function. Then for every $\varepsilon > 0$ and $\rho > 0$, there exists a $\rho$-local probability distribution $\eta$ on $C$ such that

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - y)\,] \le \varepsilon$$

for all $x \in C$. Moreover, the support of $\eta$ can be taken to consist of at most $m + 1$ points of $C$.

Proof of theorem 7.—From the values of $f$ on $D$, one can generate a continuous function $\tilde{f} : C \to \mathbb{R}^m$ such that $\tilde{f}(x)$ is a weighted average of the values of $f$ on grid points that are within $\delta$ of $x$, that is,

$$\tilde{f}(x) \in \mathrm{conv}\{\,f(d) : d \in D \cap B(x; \delta)\,\}, \tag{16}$$

for all $x \in C$. For instance, put

$$\tilde{f}(x) := \frac{\sum_{d \in D} \Lambda(x, d)\, f(d)}{\sum_{d \in D} \Lambda(x, d)},$$

where $\Lambda(x, d) := [\delta - \|d - x\|]^{+}$ (the so-called tent function); $\tilde{f}$ is continuous because $D$ is finite, $\Lambda(x, d)$ is continuous in $x$, and the denominator is always positive since $D$ is a $\delta$-grid of $C$; as for (16), it follows since $\|d - x\| \ge \delta$ implies $\Lambda(x, d) = 0$.

Theorem 4 applied to $\tilde{f}$ yields a point $z \in C$ such that $\tilde{f}(z) \cdot (x - z) \le 0$ for all $x \in C$, and then (16) yields a probability distribution $\eta$ on $D \cap B(z; \delta)$ such that $\tilde{f}(z) = \mathbf{E}_{y \sim \eta}[f(y)]$. The distribution $\eta$ is thus $\delta$-local, and its support can be taken to be of size at most $m + 1$ by Carathéodory's theorem (because $f(y) \in \mathbb{R}^m$). Now

$$\mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - y)\,] = \mathbf{E}_{y \sim \eta}[\,f(y) \cdot (x - z)\,] + \mathbf{E}_{y \sim \eta}[\,f(y) \cdot (z - y)\,];$$

the first term is $\mathbf{E}_{y \sim \eta}[f(y)] \cdot (x - z) = \tilde{f}(z) \cdot (x - z) \le 0$ (by the choice of $z$), and the second term is $\le \delta\, \mathbf{E}_{y \sim \eta}[\,\|f(y)\|\,]$ (because $\|y - z\| \le \delta$ for every $y$ in the support of $\eta$), which completes the proof. QED

D. FP Procedures and MM Procedures

The calibration proofs that we provide below construct procedures where the forecast in each period is given by appealing either to the outgoing fixed point theorems 4 and 7 or to the outgoing minimax theorem 5 in order to satisfy the corresponding forecast-hedging conditions. We will refer to these two kinds of procedures as procedures of type FP and procedures of type MM, respectively.

This distinction is not just a matter of proof technique. It goes the other way around as well (for details and relevant literature, see Hazan and Kakade 2012): calibration that is obtained by FP procedures, such as continuous calibration, may be used to get approximate Nash equilibria in non-zero-sum games.^35 Therefore, this kind of calibration falls essentially in the PPAD complexity class, which is believed to go beyond the class of polynomially solvable problems, such as minimax problems. The distinction between FP-obtainable calibration and MM-obtainable calibration is a significant distinction of the nonpolynomial versus polynomial variety (see also sec. VII).

IV. Calibrated Procedures

In this section, we prove the three main calibration results: deterministic continuous calibration, stochastic classic calibration, and almost deterministic classic calibration. The proofs all run along the same lines: first, we show that appropriate forecast-hedging conditions yield calibration (theorem 9); and second, we construct, using the outgoing results of section III, procedures that satisfy the forecast-hedging conditions (theorem 10).

We illustrate the idea of the proof (see also sec. I.B) by showing how to construct a deterministic procedure that guarantees that $g_t(\pi) \to 0$ as $t \to \infty$ (see (11)) for a single continuous function $\pi : C \to [0,1]$. By the definition of $g_t$, we have $t\, g_t(\pi) = (t-1)\, g_{t-1}(\pi) + \pi(c_t)(a_t - c_t)$, and so

$$\|t\, g_t(\pi)\|^2 = \|(t-1)\, g_{t-1}(\pi)\|^2 + 2(t-1)\, g_{t-1}(\pi) \cdot \pi(c_t)(a_t - c_t) + \pi(c_t)^2\, \|a_t - c_t\|^2. \tag{17}$$

The last term is $\le \gamma^2$ (since $\pi(c_t) \in [0,1]$ and $a_t, c_t$ belong to $C$, whose diameter is $\gamma$). The middle term is $2(t-1)\, \Phi(c_t) \cdot (a_t - c_t)$, where $\Phi(c) := \pi(c)\, g_{t-1}(\pi)$ is a continuous function of $c$ that takes values in $\mathbb{R}^m$ (because $g_{t-1}(\pi) \in \mathbb{R}^m$). The outgoing fixed point theorem 4 then yields a point in $C$—which will be our forecast $c_t$—that guarantees that $\Phi(c_t) \cdot (a_t - c_t) \le 0$ for any action $a_t \in A \subseteq C$.^36 Therefore, (17) yields the inequality $\|t\, g_t(\pi)\|^2 \le \|(t-1)\, g_{t-1}(\pi)\|^2 + \gamma^2$, which applied recursively gives $\|t\, g_t(\pi)\|^2 \le t\, \gamma^2$, and thus $\|g_t(\pi)\| \le \gamma/\sqrt{t} \to 0$ as $t \to \infty$. The proof is easily extended to handle continuous binnings $(\pi_i)_i$, such as $\Pi_0$ of proposition 3, which yields continuous calibration. For classic calibration, where the function $\Phi$ above is in general not continuous, we use the outgoing minimax theorem 5 (for a variant $\psi$ of $\Phi$); finally, using the outgoing almost deterministic fixed point theorem 7 instead yields an almost deterministic procedure for classic calibration.

^35 This should come as no surprise since game dynamics where players best reply to continuously calibrated forecasts yield in the long run approximate Nash equilibria for general $n$-person games (see sec. VI).

^36 In this simple case of a single $\pi$, a fixed point is not really needed: take $c_t$ in $C$ that is maximal in the direction $g_{t-1}(\pi)$, i.e., $c_t \in \arg\max_{x \in C} x \cdot g_{t-1}(\pi)$. However, the fixed point is needed once we consider multiple $\pi$'s.
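The one-function argument above is easy to simulate. The sketch below (ours) takes $C = [0,1]$, $A = \{0,1\}$, and $\pi \equiv 1$, uses the shortcut of footnote 36 (a forecast maximal in the direction of the current gap $g_{t-1}$) instead of a fixed point, and plays against an adversary who always chooses the action farthest from the forecast; the guarantee $\|g_t\| \le \gamma/\sqrt{t}$ (here $\gamma = 1$) holds at every period:

```python
import math

def run(T):
    """Forecast hedging for C = [0,1], A = {0,1}, and the single function
    pi = 1: forecast c_t maximal in the direction of the current gap
    g_{t-1} (footnote 36), against an adversary who plays the action
    farthest from the forecast. Returns the per-period gaps g_t."""
    total, gaps = 0.0, []             # total = t * g_t = sum of (a_s - c_s)
    for t in range(1, T + 1):
        c = 1.0 if total >= 0 else 0.0    # argmax over [0,1] of x * g_{t-1}
        a = 1.0 - c                       # adversarial: farthest action
        total += a - c
        gaps.append(total / t)
    return gaps

gaps = run(1000)   # |g_t| <= 1 / sqrt(t) at every period t
```

The key inequality is visible in the middle line: whenever the gap is nonnegative the forecast is 1, so the adversary's $a_t - c_t$ is nonpositive, and the middle term of (17) cannot increase the squared gap.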

A. Forecast Hedging

Let $\Pi = (\pi_i)_{i=1}^{I}$ be a binning. For every period $t \ge 2$ and history $h_{t-1}$, we define two functions, $\Phi_{t-1}$ and $\psi_{t-1}$, from $C$ to $\mathbb{R}^m$, by

$$\Phi_{t-1}(c) := \sum_{i=1}^{I} \pi_i(c)\, g_{t-1}(\pi_i), \qquad \psi_{t-1}(c) := \sum_{i=1}^{I} \pi_i(c)\, e_{t-1}(\pi_i)$$

for every $c \in C$. Thus, $\Phi_{t-1}$ and $\psi_{t-1}$ are averages of the vectors $g_{t-1}(\pi_i)$ and $e_{t-1}(\pi_i)$, respectively, with weights that vary with $c$ and are given by the binning $\Pi$. We define the following:

(D) A deterministic forecasting procedure $\sigma$ satisfies the $\Pi$-deterministic forecast-hedging condition if, for every $t \ge 2$ and history $h_{t-1}$,

$$\Phi_{t-1}(c_t) \cdot (a - c_t) \le 0 \quad \text{for every } a \in A, \tag{D-FH}$$

where $c_t = \sigma(h_{t-1})$ is the forecast at time $t$.

(S) A stochastic procedure $\sigma$ satisfies the $(\Pi, \varepsilon)$-stochastic forecast-hedging condition for $\varepsilon > 0$ if, for every $t \ge 2$ and history $h_{t-1}$,

$$\mathbf{E}_{t-1}[\,\psi_{t-1}(c_t) \cdot (a - c_t)\,] \le \varepsilon\, \mathbf{E}_{t-1}[\,\|\psi_{t-1}(c_t)\|\,] \quad \text{for every } a \in A, \tag{S-FH}$$

where $\mathbf{E}_{t-1}$ denotes expectation with respect to the distribution $\sigma(h_{t-1})$ of the forecast $c_t$ at time $t$.

Remark. The forecast-hedging conditions (D-FH) and (S-FH) require, for each history $h_{t-1}$, that the corresponding inequality hold for every $a \in A$. This allows the action $a_t$ that follows the history $h_{t-1}$ to depend on $h_{t-1}$ and thus also on $\sigma(h_{t-1})$, which is determined by $h_{t-1}$. Therefore, when $\sigma$ is a deterministic procedure, $a_t$ may depend on $c_t$ as well; this is the leaky setup of Foster and Hart (2018) (when $\sigma$ is stochastic, it may depend on the distribution $\sigma(h_{t-1})$ of $c_t$ but not on the actual realization of $c_t$); see footnote 21 and section VI.

Theorem 9.

(D) If a deterministic procedure $\sigma$ satisfies the $\Pi$-deterministic forecast-hedging condition for a continuous binning $\Pi = (\pi_i)_{i=1}^{I}$, then

$$\lim_{t \to \infty} \left( \sup_{a^t} K_t^{\Pi} \right) = 0. \tag{18}$$

(S) If a stochastic procedure $\sigma$ satisfies the $(\Pi, \varepsilon)$-stochastic forecast-hedging condition for a finite binning $\Pi = (\pi_i)_{i=1}^{I}$ and $\varepsilon > 0$, then

$$\lim_{t \to \infty} \left( \sup_{a^t} \mathbf{E}\big[K_t^{\Pi}\big] \right) \le \varepsilon. \tag{19}$$

Proof.—(D) Put $S_t := \sum_{i=1}^{I} \|t\, g_t(\pi_i)\|^2$; we will show that $\lim_{t \to \infty} (1/t^2)\, S_t = 0$.^37

Using (17) for each $\pi_i$, summing over $i$, and recalling the definition of $\Phi_{t-1}$ gives

$$S_t \le S_{t-1} + 2(t-1)\, \Phi_{t-1}(c_t) \cdot (a_t - c_t) + \gamma^2$$

(the last term is $\sum_i \pi_i(c_t)^2\, \|a_t - c_t\|^2 \le \gamma^2 \sum_i \pi_i(c_t) = \gamma^2$ since $\pi_i(c_t) \in [0,1]$). This inequality becomes $S_t \le S_{t-1} + \gamma^2$ when $\sigma$ satisfies (D-FH); by recursion (starting with $S_0 = 0$), we get $S_t \le t\, \gamma^2$. All the inequalities hold for every action sequence $a^t$, because for every history $h_{t-1}$, inequality (D-FH) holds for every $a$. Thus, dividing by $t^2$, we have

$$\sup_{a^t} \sum_{i=1}^{I} \|g_t(\pi_i)\|^2 \le \frac{\gamma^2}{t} \xrightarrow[t \to \infty]{} 0.$$

Therefore, $\sup_{a^t} \|g_t(\pi_i)\| \to 0$ as $t \to \infty$ for every $i \in I$, which yields (18) (by the same argument as in the second part of the proof of proposition 2, because the binning $\Pi$ is continuous).

(S) Put $X_t := \sum_{i=1}^{I} n_t(\pi_i)\, \|e_t(\pi_i)\|^2$. We will show that

$$\lim_{t \to \infty} \left( \sup_{a^t} \mathbf{E}\Big[\frac{1}{t} X_t\Big] \right) \le \varepsilon^2;$$

this yields (19) since $(K_t^{\Pi})^2 \le (1/t)\, X_t$ by Jensen's inequality.^38

^37 The score $S_t$ is precisely $S$ of sec. I.B.

^38 The score $(1/t)\, X_t$ is the square-calibration score for $\Pi$, namely, the average of the squared norms of the errors (i.e., replace $\|e_t^i\|$ with $\|e_t^i\|^2$ in formula (6) of $K_t^{\Pi}$).

The proof consists of expressing the one-period increment of $X_t$ as a sum of two terms, a $Y_t$-term, which, by forecast hedging, is at most $\varepsilon^2$ in expectation, and a $Z_t$-term, which converges to zero:

$$X_t - X_{t-1} = Y_t + Z_t, \tag{20}$$

$$\mathbf{E}_{t-1}[Y_t] \le \varepsilon^2, \tag{21}$$

$$\sup_{a^t} \sum_{s=1}^{t} Z_s \le O(\log t) \tag{22}$$

for every $t \ge 1$ (where $X_0 = 0$). This proves the result, since taking overall expectation of (21) yields $\mathbf{E}[Y_t] \le \varepsilon^2$, and thus

$$\mathbf{E}\Big[\frac{1}{t} X_t\Big] = \mathbf{E}\Big[\frac{1}{t} \sum_{s=1}^{t} (X_s - X_{s-1})\Big] = \frac{1}{t} \sum_{s=1}^{t} \mathbf{E}[Y_s] + \frac{1}{t} \sum_{s=1}^{t} \mathbf{E}[Z_s] \le \varepsilon^2 + O\Big(\frac{\log t}{t}\Big) \to \varepsilon^2$$

as $t \to \infty$, uniformly over $a^t$.

Proof of (20).—We start with the following easy-to-check identity for scalars $a, b \ge 0$ and vectors $u, v$:

$$(a + b)\, \Big\| \frac{a u + b v}{a + b} \Big\|^2 - a\, \|u\|^2 = 2b\, u \cdot v - b\, \|u\|^2 + \frac{b^2}{a + b}\, \|u - v\|^2.$$
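The identity can be verified by expanding both sides; a quick numerical spot check (ours):

```python
import random

def check_identity(trials=100, dim=3):
    """Spot-check the identity: for scalars a, b >= 0 and vectors u, v,
    (a+b)*||(a*u+b*v)/(a+b)||^2 - a*||u||^2
        == 2*b*(u.v) - b*||u||^2 + (b^2/(a+b))*||u-v||^2."""
    rng = random.Random(0)
    sq = lambda x: sum(xi * xi for xi in x)
    for _ in range(trials):
        a, b = rng.uniform(0.1, 5.0), rng.uniform(0.1, 5.0)
        u = [rng.uniform(-1, 1) for _ in range(dim)]
        v = [rng.uniform(-1, 1) for _ in range(dim)]
        mean = [(a * ui + b * vi) / (a + b) for ui, vi in zip(u, v)]
        diff = [ui - vi for ui, vi in zip(u, v)]
        dot = sum(ui * vi for ui, vi in zip(u, v))
        lhs = (a + b) * sq(mean) - a * sq(u)
        rhs = 2 * b * dot - b * sq(u) + b * b / (a + b) * sq(diff)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True
```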

Using this for $a = n_{t-1}(\pi)$, $b = \pi(c_t)$, $u = e_{t-1}(\pi)$, and $v = a_t - c_t$ yields

$$n_t(\pi)\, \|e_t(\pi)\|^2 - n_{t-1}(\pi)\, \|e_{t-1}(\pi)\|^2 = y_t(\pi) + z_t(\pi),$$

where

$$y_t(\pi) := 2\, \pi(c_t)\, e_{t-1}(\pi) \cdot (a_t - c_t) - \pi(c_t)\, \|e_{t-1}(\pi)\|^2,$$

$$z_t(\pi) := \frac{\pi(c_t)^2}{n_t(\pi)}\, \|e_{t-1}(\pi) - (a_t - c_t)\|^2 \le 4\gamma^2\, \frac{\pi(c_t)^2}{n_t(\pi)}$$

(the last inequality because $\|e_{t-1}(\pi)\| \le \gamma$ and $\|a_t - c_t\| \le \gamma$). Applying this to each $\pi_i$, summing over $i$, and recalling the definitions of $X_t$ and $\psi_{t-1}$ gives $X_t - X_{t-1} = Y_t + Z_t$, where

$$Y_t := \sum_{i=1}^{I} y_t(\pi_i) = 2\, \psi_{t-1}(c_t) \cdot (a_t - c_t) - \sum_{i=1}^{I} \pi_i(c_t)\, \|e_{t-1}(\pi_i)\|^2,$$

$$Z_t := \sum_{i=1}^{I} z_t(\pi_i) \le 4\gamma^2 \sum_{i=1}^{I} \frac{\pi_i(c_t)^2}{n_t(\pi_i)}.$$

Proof of (21).—By the stochastic forecast-hedging condition (S-FH), we have $\mathbf{E}_{t-1}[\,2\, \psi_{t-1}(c_t) \cdot (a_t - c_t)\,] \le \mathbf{E}_{t-1}[\,2\varepsilon\, \|\psi_{t-1}(c_t)\|\,]$; now

$$2\varepsilon\, \|\psi_{t-1}(c_t)\| \le \sum_{i=1}^{I} \pi_i(c_t)\, \big( 2\varepsilon\, \|e_{t-1}(\pi_i)\| \big) \le \sum_{i=1}^{I} \pi_i(c_t)\, \big( \varepsilon^2 + \|e_{t-1}(\pi_i)\|^2 \big) = \varepsilon^2 + \sum_{i=1}^{I} \pi_i(c_t)\, \|e_{t-1}(\pi_i)\|^2.$$
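The proof of (22), next, rests on the logarithmic bound (23): for any weights $w_s \in [0,1]$ with running totals $n_s = w_1 + \cdots + w_s$, the sum $\sum_s w_s^2 / n_s$ grows only logarithmically. A numerical spot check (ours, writing $\ln^{+}$ for the positive part of $\ln$):

```python
import math
import random

def check_log_bound(trials=200, t_max=300):
    """For weights w_s in [0,1] with running totals n_s = w_1 + ... + w_s,
    check sum_s w_s^2 / n_s < ln+(n_t) + 2, where ln+(z) = max(ln z, 0);
    this is the uniform logarithmic bound behind (22)."""
    rng = random.Random(1)
    for _ in range(trials):
        n, lhs = 0.0, 0.0
        for _ in range(rng.randint(1, t_max)):
            w = rng.uniform(0.05, 1.0)
            n += w                       # n_s: total weight so far
            lhs += w * w / n
        if not lhs < max(math.log(n), 0.0) + 2:
            return False
    return True
```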

Proof of (22).—We claim that

$$\sum_{s=1}^{t} \frac{\pi(c_s)^2}{n_s(\pi)} < \ln^{+} n_t(\pi) + 2 \le \ln t + 2 \tag{23}$$

for every $\pi : C \to [0,1]$ and $t \ge 1$ with $n_t(\pi) > 0$, where $\ln^{+} z := \max\{\ln z, 0\}$.^39 Indeed, both $\pi(c_s)$ and $\pi(c_s)/n_s(\pi)$ are between 0 and 1, and so for every $1 \le r \le t$ we have

$$\sum_{s=1}^{r} \frac{\pi(c_s)^2}{n_s(\pi)} \le \sum_{s=1}^{r} \pi(c_s) = n_r(\pi),$$

$$\sum_{s=r+1}^{t} \frac{\pi(c_s)^2}{n_s(\pi)} \le \sum_{s=r+1}^{t} \frac{\pi(c_s)}{n_s(\pi)} = \sum_{s=r+1}^{t} \Big( 1 - \frac{n_{s-1}(\pi)}{n_s(\pi)} \Big) \le \sum_{s=r+1}^{t} \ln\Big( \frac{n_s(\pi)}{n_{s-1}(\pi)} \Big) = \ln\Big( \frac{n_t(\pi)}{n_r(\pi)} \Big)$$

(we used $1 - 1/x \le \ln x$ for $x \ge 1$). Taking $r \le t$ such that $1 \le n_r(\pi) < 2$ yields $< 2$ in the first inequality and $\le \ln n_t(\pi) \le \ln t$ in the second, and thus (23); if there is no such $r$, then $n_t(\pi) < 1$, and the first inequality with $r = t$ gives $< 1$, and thus (23). Applying (23) to each $\pi_i$ and summing over $i$ yields $\sum_{s=1}^{t} Z_s \le 4\gamma^2\, I\, (\ln t + 2)$, and thus (22).

This completes the proof of (S). QED

Remark. In (S), using (21) one gets the stronger almost sure convergence (see app. sec. A5).

The reason that the two proofs are slightly different—we use $S_t$ and thus $\Phi_{t-1}$ in (D), and we use $X_t$ and thus $\psi_{t-1}$ in (S)—has to do with the limit being zero in the former and $\varepsilon$ in the latter. Roughly speaking, for vectors $u$ in $I$-dimensional space, $\|u\| = (\sum_i u_i^2)^{1/2} \to 0$ implies $\|u\|_1 = \sum_i |u_i| \to 0$ regardless of the size of $I$, whereas $\|u\| \le \varepsilon$ yields $\|u\|_1 \le I^{1/2}\, \varepsilon$, which may not be small when $I$ increases with $\varepsilon$ (for further details, see app. sec. A4).

We now show that the outgoing results of section III yield the existence of forecast-hedging procedures.

^39 One can easily obtain a bound of $o(t)$ in (23), since $\pi(c_t)^2 / n_t(\pi) \le \pi(c_t)/n_t(\pi) \to 0$ as $t \to \infty$ (indeed, if $n_t(\pi) \to \infty$, then $\pi(c_t)/n_t(\pi) \le 1/n_t(\pi) \to 0$, and if $n_t(\pi) \to N < \infty$, then $\pi(c_t)/n_t(\pi) = 1 - n_{t-1}(\pi)/n_t(\pi) \to 1 - N/N = 0$). Inequality (23) provides a better bound, uniform over all $\pi$ and sequences $c_t$.


Theorem 10.

(D) For every continuous binning $\mathcal{P}$, there exists a deterministic procedure of type FP that satisfies the $\mathcal{P}$-deterministic forecast-hedging condition.

(S) For every finite binning $\mathcal{P}$, every $\varepsilon > 0$, and every finite $\varepsilon$-grid $D$ of $C$, there exists a stochastic procedure of type MM with forecasts in $D$ that satisfies the $(\mathcal{P}, \varepsilon)$-stochastic forecast-hedging condition.

(AD) For every finite binning $\mathcal{P}$, every $\varepsilon > 0$, and every finite $\varepsilon$-grid $D$ of $C$, there exists an $\varepsilon$-almost deterministic procedure of type FP with forecasts in $D$ that satisfies the $(\mathcal{P}, \varepsilon)$-stochastic forecast-hedging condition.

Proof.—(D) When $\mathcal{P}$ is a continuous binning, each function $J_{t-1}$ is continuous (since each $w_i$ is continuous and $\|g_{t-1}(w_i)\| \le g$; when $I$ is infinite, use the uniform convergence of the corresponding finite sums, as in the second part of the proof of proposition 2). Apply the outgoing fixed point theorem 4 to $J_{t-1}$ for each history $h_{t-1}$.

(S) Apply the outgoing minimax theorem 5 to $w_{t-1}$ and $\delta = \varepsilon$ for each history $h_{t-1}$.

(AD) Apply the outgoing almost deterministic fixed point theorem 7 to $w_{t-1}$ and $\delta = \varepsilon$ for each history $h_{t-1}$. QED

B. Calibration

We now immediately obtain the existence of appropriate calibrated procedures.

Theorem 11.

(D) There exists a deterministic procedure of type FP that is continuously calibrated.

(S) For every $\varepsilon > 0$, there exists a stochastic procedure of type MM that is $\varepsilon$-calibrated; moreover, all its forecasts are in $D$ for any given finite $\varepsilon$-grid $D$ of $C$.

(AD) For every $\varepsilon > 0$, there exists an $\varepsilon$-almost deterministic procedure of type FP that is $\varepsilon$-calibrated; moreover, all its forecasts are in $D$ for any given finite $\varepsilon$-grid $D$ of $C$.

Part (D) implies (by proposition 15 in app. sec. A2) the results of Foster and Hart (2018) for smooth calibration and of Kakade and Foster (2004) and Foster and Kakade (2006) for weak calibration. Part (S) yields the classic calibration result of Foster and Vohra (1998), and part (AD) yields the result of Kakade and Foster (2004) for almost deterministic classic calibration.

Proof.—(D) Apply theorem 10(D) and theorem 9(D) with the continuous binning $\mathcal{P}_0$ given by proposition 3.

(S) Let $D = \{d_1, \ldots, d_I\}$ be a given finite $\varepsilon$-grid of $C$. Put $w_i := \mathbf{1}_{d_i}$ for $i = 1, \ldots, I$ and $w_0 := \mathbf{1}_{C \setminus D}$, and let $\mathcal{P}$ be the finite binning $(w_i)_{i=0}^{I}$. When all forecasts are in $D$, we have $K^{\mathcal{P}}_t = \sum_{i=1}^{I} \|g_t(\mathbf{1}_{d_i})\| = K_t$ (since $g_t(w_0) = 0$). Apply theorem 10(S) and theorem 9(S).

(AD) Same as (S), applying theorem 10(AD). QED

V. A Simple Calibrated Procedure for Binary Events

This section shows how to obtain classic calibration in the one-dimensional case, where the actions are binary yes/no outcomes (such as win/lose in politics and sports events, rain/shine, and so on), by a procedure that is as simple as can be; it is simpler than any existing procedure, including the one in Foster (1999). The procedure is moreover almost deterministic, with all randomizations being between two neighboring points on a fixed grid. It is essentially the procedure described in section I.B, except that we work with the normalized errors $e$ instead of the gaps $G$.

We are thus in the one-dimensional case ($m = 1$), with $A = \{0, 1\}$ (with, say, 1 for rain and 0 for no rain) and $C = [0, 1]$. Fix an integer $N \ge 1$, and let $D := \{0, 1/N, 2/N, \ldots, 1\}$ be the grid on which the forecasts lie. Consider a history $h_{t-1}$. For every $i = 0, 1, \ldots, N$, the error of the forecast $i/N$ is $e^i := e_{t-1}(i/N) = r^i/n^i - i/N$, where $n^i$ is the number of times that the forecast $i/N$ has been used in the first $t - 1$ periods and $r^i$ is the number of rainy periods among these $n^i$ periods (with $e^i = 0$ when $n^i = 0$). The procedure $\sigma$ chooses the forecast $c_t$ as follows (as in fig. 4, with $e$ instead of $G$):

Case 1.—There is $j$ such that $e^j = 0$. Put $y := j/N$ and let the (deterministic) forecast be $c_t = y$.40

Case 2.—$e^i \ne 0$ for all $i$. In this case, $e^0 > 0$ (because $r^0 \ge 0$) and $e^N < 0$ (because $r^N \le n^N$), and so let $j \ge 1$ be, for concreteness, the smallest index with $e^j < 0$; thus, $e^{j-1} > 0 > e^j$.41 Put $y_1 := (j - 1)/N$ and $y_2 := j/N$, and let the forecast be $c_t = y_1$ with probability $p_1 := |e^j|/(|e^{j-1}| + |e^j|)$ and $c_t = y_2$ with the remaining probability $p_2 := |e^{j-1}|/(|e^{j-1}| + |e^j|)$; thus, $p_1\,e_{t-1}(y_1) + p_2\,e_{t-1}(y_2) = 0$ (cf. (4)), and $y_2 - y_1 = 1/N$.

The above construction amounts to linearly interpolating the function $e_{t-1}$ from the finite grid $D$ to the whole interval $[0, 1]$ and then taking a point where this function vanishes ($y$ in case 1 and $p_1 y_1 + p_2 y_2$ in case 2) and using it for the forecast ($y$ itself in case 1 and the $p_1, p_2$ probabilistic mixture of $y_1$ and $y_2$ in case 2). We thus have $E_{t-1}[e_{t-1}(c_t)] = 0$ in both cases, where $E_{t-1}$ stands for $E[\,\cdot \mid h_{t-1}]$.

40 Since $e^j = 0$ for unused forecasts $j/N$, in the first periods we try each point on the grid once; alternatively, assume that there is some initial data for each possible forecast (all this does not matter, of course, in the long run).

41 Any $j$ for which $e^{j-1}$ and $e^j$ have opposite signs will work here. In fact, a $j$ for which the signs are reversed, i.e., $e^{j-1} < 0 < e^j$ (however, such a $j$ need not exist in general), will work even better, as it yields zero on the right-hand side of the forecast-hedging condition (S-FH).


Theorem 12. The above procedure $\sigma$ is $1/(2N)$-almost deterministic and $1/(2N)$-calibrated.

Proof.—Put $\bar{y} := y$ in case 1 and $\bar{y} := (y_1 + y_2)/2$ in case 2. Then $|\bar{y} - c_t| \le 1/(2N)$ in both cases, which implies that $E_{t-1}[e_{t-1}(c_t) \cdot (\bar{y} - c_t)] \le (1/(2N))\,E_{t-1}[|e_{t-1}(c_t)|]$. Now $E_{t-1}[e_{t-1}(c_t) \cdot (a - \bar{y})] = 0$ for every $a$ (because $a - \bar{y}$ is constant given $h_{t-1}$, and $E_{t-1}[e_{t-1}(c_t)] = 0$ by the construction of $\sigma$); adding this to the previous inequality gives the $(\mathcal{P}, \varepsilon)$-stochastic forecast-hedging condition (S-FH), where $\mathcal{P}$ is the same as in the proof of theorem 11(S), and $\varepsilon = 1/(2N)$. Therefore, $\sigma$ is $1/(2N)$-calibrated by theorem 9(S); in addition, $\sigma$ is $1/(2N)$-almost deterministic because we always have $|c_t - \bar{y}| \le 1/(2N)$. QED

The calibration bound of $1/(2N)$ is the best that one can achieve with forecasts on the grid $D$: consider, for instance, the action sequence where $a_t$ equals 1 with probability $1/(2N)$ independently over $t$.
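The procedure is short enough to implement directly. The sketch below is our illustrative code, not the authors' (class and method names are ours): it maintains the counts $n^i$ and $r^i$, applies cases 1 and 2, and reports the classic calibration score $K_t = \sum_i (n^i/t)\,|e^i|$.

```python
import random

class BinaryCalibratedForecaster:
    """Illustrative sketch of the section V procedure: forecasts lie on the
    grid {0, 1/N, ..., 1}, and randomization occurs only between two
    neighboring grid points."""

    def __init__(self, N, rng=None):
        self.N = N
        self.n = [0] * (N + 1)  # n^i: times forecast i/N has been used
        self.r = [0] * (N + 1)  # r^i: rainy periods among those n^i uses
        self.rng = rng or random.Random()

    def error(self, i):
        # e^i = r^i/n^i - i/N, with e^i = 0 when n^i = 0
        return 0.0 if self.n[i] == 0 else self.r[i] / self.n[i] - i / self.N

    def forecast(self):
        """Return the grid index i of the forecast c_t = i/N."""
        e = [self.error(i) for i in range(self.N + 1)]
        for j, ej in enumerate(e):
            if ej == 0:          # case 1: deterministic forecast j/N
                return j
        # case 2: e^0 > 0 and e^N < 0, so a sign change exists; take the
        # smallest j with e^j < 0, so that e^{j-1} > 0 > e^j
        j = next(k for k in range(1, self.N + 1) if e[k] < 0)
        p1 = abs(e[j]) / (abs(e[j - 1]) + abs(e[j]))
        return j - 1 if self.rng.random() < p1 else j

    def update(self, i, rained):
        """Record the realized binary outcome for the forecast i/N."""
        self.n[i] += 1
        self.r[i] += int(rained)

    def calibration_score(self):
        # classic calibration score K_t = sum_i (n^i / t) * |e^i|
        t = sum(self.n)
        return sum(self.n[i] / t * abs(self.error(i))
                   for i in range(self.N + 1))
```

Note that the first periods automatically try each grid point once (unused forecasts have $e^j = 0$, so case 1 applies), matching n. 40, and that the procedure never randomizes outside two adjacent grid points, which is exactly the $1/(2N)$-almost determinism of theorem 12.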

VI. Calibration and Game Dynamics

Forecasts are a useful tool for dynamic multiplayer interactions. Consider a game that is played repeatedly. A natural type of game dynamic is one where in each period the players make forecasts on what will happen next and then choose their actions in response to these forecasts. Interesting long-run behavior obtains when the forecasts are "good" (i.e., calibrated) and the responses to the forecasts are "good" (i.e., best responses).

The calibrated learning of Foster and Vohra (1997), on the one hand, and the publicly calibrated learning of Kakade and Foster (2004) and the smooth calibrated learning of Foster and Hart (2018), on the other hand, are two such types of game dynamics. The main difference between the two types is that in the former, each player uses a stochastic classically calibrated forecasting procedure, whereas in the latter, all players use the same deterministic weakly or smoothly calibrated forecasting procedure. In the long run, the former yields correlated equilibria as the time average of play, whereas the latter yields Nash equilibria as the period-by-period behavior (of course, everything should be understood with appropriate "approximate" adjectives; for a more extensive discussion, see Foster and Hart 2018). If we replace the deterministic weakly and smoothly calibrated procedures with the stronger but easier to obtain deterministic continuously calibrated procedures (see proposition 15 in app. sec. A2), we obtain the same long-run result: period-by-period behavior that is close to Nash equilibria. The simplicity of continuous calibration allows for a simple result and proof (see theorem 13).

The game dynamics results underscore the importance of deterministic procedures, which are "leaky" (see Foster and Hart 2018) and thus remain calibrated even if in each period the forecast is revealed before the action is chosen. By contrast, stochastic procedures are no longer calibrated if the actual realization of the random forecast is revealed before the action is chosen.

A. Continuously Calibrated Learning

A finite game is given by a finite set of players $N$ and, for each player $i \in N$, a finite set of pure strategies $A^i$ and a payoff function $u^i : A \to \mathbb{R}$, where $A := \prod_{i \in N} A^i$ denotes the set of strategy combinations of all players. Let $n := |N|$ be the number of players, $m^i := |A^i|$ the number of pure strategies of player $i$, and $m := \sum_{i \in N} m^i$. The set of mixed strategies of player $i$ is $X^i := \Delta(A^i)$, the unit simplex (i.e., the set of probability distributions) on $A^i$; we identify the pure strategies in $A^i$ with the unit vectors of $X^i$, and so $A^i \subseteq X^i$. Put $C \equiv X := \prod_{i \in N} X^i$ for the set of mixed-strategy combinations (i.e., $N$-tuples of mixed strategies). The payoff functions $u^i$ are multilinearly extended to $X$, and thus $u^i : X \to \mathbb{R}$.

For each player $i$ and combination of mixed strategies of the other players $x^{-i} = (x^j)_{j \ne i} \in \prod_{j \ne i} X^j =: X^{-i}$, let $\bar{u}^i(x^{-i}) := \max_{y^i \in X^i} u^i(y^i, x^{-i}) = \max_{a^i \in A^i} u^i(a^i, x^{-i})$ be the maximal payoff that $i$ can obtain against $x^{-i}$; for every $\varepsilon \ge 0$, let $BR^i_\varepsilon(x^{-i}) := \{x^i \in X^i : u^i(x^i, x^{-i}) \ge \bar{u}^i(x^{-i}) - \varepsilon\}$ denote the set of $\varepsilon$-best replies of $i$ to $x^{-i}$. A (mixed) strategy combination $x \in X$ is a Nash $\varepsilon$-equilibrium if $x^i \in BR^i_\varepsilon(x^{-i})$ for every $i \in N$; let $NE(\varepsilon) \subseteq X$ denote the set of Nash $\varepsilon$-equilibria of the game.

A (discrete-time) dynamic consists of each player $i \in N$ playing a pure strategy $a^i_t \in A^i$ at each time period $t = 1, 2, \ldots$; put $a_t = (a^i_t)_{i \in N} \in A$. There is perfect monitoring: at the end of period $t$, all players observe $a_t$. The dynamic is uncoupled (Hart and Mas-Colell 2003, 2006, 2013) if the play of every player $i$ may depend on only player $i$'s payoff function $u^i$ (and not on the other players' payoff functions). Formally, such a dynamic is given by a mapping for each player $i$ from the history $h_{t-1} = (a_1, \ldots, a_{t-1})$ and his own payoff function $u^i$ into $X^i = \Delta(A^i)$ (player $i$'s choice may be random); we will call such mappings uncoupled. Let $x^i_t \in X^i$ denote the mixed action that player $i$ plays at time $t$, and put $x_t = (x^i_t)_{i \in N} \in X$.

The dynamics we consider are continuous variants of the calibrated learning introduced by Foster and Vohra (1997). Calibrated learning consists of each player best replying to calibrated forecasts on the other players' strategies; it results in the joint distribution of play (i.e., the time average of the $N$-tuples of strategies $a_t$) converging in the long run to the set of correlated equilibria of the game. We consider continuously calibrated learning, where stochastic classic calibration is replaced with deterministic continuous calibration and best replying is replaced with continuous approximate best replying. Moreover, the forecasts are now $N$-tuples of mixed strategies (in $\prod_i \Delta(A^i)$) rather than correlated mixtures (in $\Delta(\prod_i A^i)$).

Formally, given $\varepsilon > 0$, a continuously calibrated $\varepsilon$-learning dynamic is given by the following:

I. A deterministic continuously calibrated procedure on $X$, which yields at each time $t$ a forecast $c_t = (c^i_t)_{i \in N} \in X$ on the distribution of strategies of each player.

II. For each player $i \in N$, a continuous $\varepsilon$-best-reply function $b^i : X \to X^i$; that is, $b^i(x) \in BR^i_\varepsilon(x^{-i})$ for every $x \in X$.

The dynamic consists of each player running the procedure in I, generating at time $t$ a forecast $c_t \in X$; then each player $i$ plays at period $t$ the mixed strategy $x^i_t := b^i(c_t) \in X^i$, where $b^i$ is given by II.42 All players observe the strategy combination $a_t = (a^i_t)_{i \in N} \in A$ that has actually been played and remember it. Let $b(x) = (b^i(x))_{i \in N}$; thus, $b : X \to X$ is a continuous function. We refer to $c_t \in X$ as the forecasts, $x_t = b(c_t) \in X$ as the behaviors (i.e., the mixed strategies played), and $a_t \in A$ as the actions (i.e., the realized pure strategies played); $c_t$, $x_t$, and $a_t$ depend on the history.

Since for each player $i$ the approximate best reply condition in II makes use of only player $i$'s payoff function $u^i$, we can without loss of generality choose $b^i$ so as to depend on only $u^i$, which makes the dynamic uncoupled (see above).
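Schematically, one period of the dynamic is: forecast, best reply, realize, observe. The sketch below is ours, with hypothetical `forecaster` and `best_replies` callables standing in for the continuously calibrated procedure of I and the $\varepsilon$-best-reply functions of II:

```python
import random

def run_dynamic(forecaster, best_replies, T, rng):
    """Run T periods of the learning dynamic: forecast c_t from the public
    history, play the mixed strategies x_t = b(c_t), realize pure actions
    a_t, and record (c_t, a_t). `forecaster` maps a history to an N-tuple of
    mixed strategies; `best_replies` is the list of functions b^i."""
    history = []
    for _ in range(T):
        c_t = forecaster(history)                      # joint forecast in X
        x_t = [b_i(c_t) for b_i in best_replies]       # behaviors x_t = b(c_t)
        a_t = [rng.choices(range(len(x_i)), weights=x_i)[0]
               for x_i in x_t]                         # realized pure actions
        history.append((c_t, a_t))
    return history
```

Note that the same (deterministic) forecast $c_t$ is handed to all players, and only the realized actions $a_t$ are appended to the public history, matching the perfect-monitoring assumption.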

1 is given by theorem 11(D); the existence of ε-approximate continuousbest-reply mappings in II is well known.Our result is as follows:Theorem 13. Let G 5 ðN , ðAiÞi∈N , ðuiÞi∈N Þ be a finite game. For every

ε > 0, a continuously calibrated ε-learning dynamic is an uncoupled dy-namic and satisfies almost surely

limt →∞

1

ts ≤ t : xs ∈ NEðε0Þf gj j 5 1 (24)

for every ε0 > ε.43

The proof goes by the following three claims. (1) If the forecasts $c_t$ are continuously calibrated for the sequence of pure strategies $a_t$, they are continuously calibrated also for the sequence of mixed strategies $x_t$ (because, by the law of large numbers, the long-run averages of the $a_t$'s and of the $x_t$'s are close, as $x_t$ is the expectation of $a_t$ conditional on the history). (2) For every $c$, in every period where the forecast is $c$, the mixed play is the same, namely, $x = b(c)$, and so if the sequence $c_t$ is continuously calibrated for the sequence $x_t$, then $c_t \approx x_t = b(c_t)$. (3) From $c_t \approx x_t$ we immediately get $x_t = b(c_t) \approx b(x_t)$ (apply the continuous map $b$ to both sides), which says that the approximate best reply to $x_t$ is $x_t$ itself, and thus $x_t$ is an approximate Nash equilibrium.

42 Thus, $P[a_t = a \mid h_{t-1}] = \prod_{i \in N} x^i_t(a^i)$ for every $a = (a^i)_{i \in N} \in A$, where $h_{t-1}$ is the history and $x^i_t(a^i)$ is the probability that $x^i_t \in \Delta(A^i)$ assigns to the pure strategy $a^i \in A^i$.

43 It does not follow that we can take $\varepsilon' = \varepsilon$; for instance, consider the case where at time $t$ we have an $(\varepsilon + 1/t)$-equilibrium. "Almost surely" applies to all $\varepsilon' > \varepsilon$ simultaneously (take a sequence $\varepsilon'_n$ decreasing to $\varepsilon$).

The crucial feature of our dynamic is that continuous calibration is preserved despite the fact that the actions depend on the forecasts (this leakiness property does not hold for classic, probabilistic calibration); in addition, in each period all players have the same (deterministic) forecast. In appendix section A6, we provide a number of comments and extensions.

Proof.—For every $w : X \to [0, 1]$, let $\tilde{g}_t(w)$ be the per-period gap for the mixed $x_t$ instead of the pure $a_t$, that is,
$$\tilde{g}_t(w) := \frac{1}{t} \sum_{s=1}^{t} w(c_s)(x_s - c_s) = g_t(w) + \frac{1}{t} \sum_{s=1}^{t} w(c_s)(a_s - x_s). \tag{25}$$

Claim 1. Let $W_0$ be a countable collection of continuous functions $w : X \to [0, 1]$. Then for almost all infinite histories $h_\infty = (c_t, a_t)_{t=1}^{\infty}$, we have44
$$\lim_{t \to \infty} \tilde{g}_t(w) = 0 \quad \text{for all } w \in W_0.$$

Proof.—First, for every $h_\infty$, we have $\lim_{t \to \infty} g_t(w) = 0$ for all $w \in W_0$ by continuous calibration (see proposition 2).

Second, for each $w$, we have $E[w(c_s)\,a_s \mid h_{s-1}] = w(c_s)\,E[a_s \mid h_{s-1}] = w(c_s)\,x_s$ (given $h_{s-1}$, the forecast $c_s$ [and thus $w(c_s)$] is determined, and so only $a_s$ is random; its conditional expectation is $E[a_s \mid h_{s-1}] = b(c_s) = x_s$).45 The strong law of large numbers for dependent random variables (theorem 32.1.E in Loève 1978) says that
$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \bigl(Y_s - E[Y_s \mid h_{s-1}]\bigr) = 0 \tag{26}$$
almost surely for bounded random variables $Y_t$; since the $w(c_s)\,a_s$ are all bounded by $g$ and there are countably many $w$ in $W_0$, we obtain
$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \bigl(w(c_s)\,a_s - w(c_s)\,x_s\bigr) = 0 \quad \text{for all } w \in W_0$$
almost surely. Using (25) yields the claim. QED

Claim 2. For every $\delta > 0$, we have
$$\lim_{t \to \infty} \frac{1}{t}\,\bigl|\{s \le t : \|b(c_s) - c_s\| \ge \delta\}\bigr| = 0$$
for almost every $h_\infty$.

44 One can show (as in sec. II.B) that $\lim_t \tilde{g}_t(w) = 0$ for all continuous $w : X \to [0, 1]$ holds for almost all infinite histories (however, there is no uniformity over the action sequences).

45 Recall that we identify the pure actions $a^i \in A^i$ with the unit vectors in the simplex $X^i$.


Proof.—For every $d \in X$ and $\ell > 0$, let $w_{d,\ell}(x) := [1 - \ell\,\|x - d\|]^+$ (a tent function on $X$); thus, $w_{d,\ell}(x) > 0$ if and only if $x \in B(d; 1/\ell)$. Let $D$ be the set of points in $X$ with rational coordinates; put $W_0 := \{w_{d,\ell} : d \in D,\ \ell \ge 1\}$; then $W_0$ is a countable collection of continuous functions from $X$ to $[0, 1]$, and so claim 1 applies to it.

Take $\delta > 0$; the function $\alpha(x) := b(x) - x$ is uniformly continuous on the compact set $X$, and so there is an integer $\ell > 0$ such that $\|x - y\| \le 1/\ell$ implies $\|\alpha(x) - \alpha(y)\| \le \delta$. If $d \in D$ satisfies $\|\alpha(d)\| \ge 2\delta$, then for every $x$ with $w_{d,\ell}(x) > 0$, that is, $x \in B(d; 1/\ell)$, we have $\|\alpha(x) - \alpha(d)\| \le \delta$, which yields
$$\Bigl\| \sum_{s=1}^{t} w_{d,\ell}(c_s)\,\alpha(c_s) - \sum_{s=1}^{t} w_{d,\ell}(c_s)\,\alpha(d) \Bigr\| \le \delta \sum_{s=1}^{t} w_{d,\ell}(c_s),$$
that is, $\|t\,\tilde{g}_t(w_{d,\ell}) - \alpha(d)\,n_t(w_{d,\ell})\| \le \delta\,n_t(w_{d,\ell})$. Therefore,
$$\|t\,\tilde{g}_t(w_{d,\ell})\| \ge \bigl(\|\alpha(d)\| - \delta\bigr)\,n_t(w_{d,\ell}) \ge \delta\,n_t(w_{d,\ell}).$$
By claim 1, this implies that
$$\frac{1}{t}\,n_t(w_{d,\ell}) \to 0 \tag{27}$$
almost surely as $t \to \infty$.

Take a finite set $D_0 \subset D$ such that $\bigcup_{d \in D_0} B(d; 1/\ell) \supset X$, and put $D_1 := \{d \in D_0 : \|\alpha(d)\| \ge 2\delta\}$. The compact set $Y := \{x \in X : \|\alpha(x)\| \ge 3\delta\}$ is covered by $\bigcup_{d \in D_1} B(d; 1/\ell)$ (because $\|\alpha(x)\| \ge 3\delta$ implies that there is $d \in D_0$ such that $x \in B(d; 1/\ell)$, and then $\|\alpha(d)\| \ge \|\alpha(x)\| - \delta \ge 2\delta$), and the continuous function $\sum_{d \in D_1} w_{d,\ell}(x)$ is positive on $Y$, and thus it is $\ge \eta$ for some $\eta > 0$, yielding
$$\sum_{d \in D_1} n_t(w_{d,\ell}) = \sum_{s=1}^{t} \sum_{d \in D_1} w_{d,\ell}(c_s) \ge \eta \cdot \bigl|\{s \le t : \|\alpha(c_s)\| \ge 3\delta\}\bigr|.$$
Using (27) and replacing $\delta$ with $\delta/3$ completes the proof. QED

Claim 3. For every $\varepsilon' > \varepsilon$, there is $\delta > 0$ such that $\|b(c) - c\| \le \delta$ implies that $b(c)$ is a Nash $\varepsilon'$-equilibrium.

Proof.—By the uniform continuity of the functions $b^i$ and $u^i$, let $\delta > 0$ be such that $\|x - y\| \le \delta$ implies $|u^i(b^i(x), x^{-i}) - u^i(b^i(y), x^{-i})| \le \varepsilon' - \varepsilon$ for every $i$. Taking $x = b(c)$ and $y = c$ yields $|u^i(b^i(x), x^{-i}) - u^i(x)| \le \varepsilon' - \varepsilon$, which together with $u^i(b^i(x), x^{-i}) \ge \max_{y^i} u^i(y^i, x^{-i}) - \varepsilon$ (by the choice of $b^i$ as an $\varepsilon$-best reply) proves the claim. QED

The theorem follows from claims 2 and 3. QED


VII. The Minimax Universe versus the Fixed Point Universe

The forecast-hedging integration of the various calibration approaches that we have carried out has pointed to a clear distinction between two separate, parallel universes: the minimax universe and the fixed point universe.46 Table 1 summarizes the differences exhibited in this paper.

Appendix

A1. General Binnings

In this section, we show that the limitation to countable binnings is without loss of generality.

Sums over arbitrary sets are defined, as usual, as the supremum over all finite sums, that is, $\sum_{i \in I} z_i := \sup\{\sum_{i \in J} z_i : J \subseteq I,\ |J| < \infty\}$ (for real $z_i$).

Define a general binning as $\mathcal{P} = (w_i)_{i \in I}$, where $I$ is an arbitrary set of bins and $w_i : C \to [0, 1]$ for every $i \in I$, such that $\sum_{i \in I} w_i(c) = 1$ for every $c \in C$. The general binning $\mathcal{P}$ is continuous if all $w_i$ are continuous functions. The $\mathcal{P}$-calibration score is $K^{\mathcal{P}}_t := \sum_{i \in I} \|g_t(w_i)\|$.

For classic calibration, $K_t$ is the maximal score, that is,
$$K_t = \max_{\mathcal{P}} K^{\mathcal{P}}_t,$$
where $\mathcal{P}$ ranges over all general binnings. Indeed, lemma 1 holds for arbitrary collections $(w_j)_{j \in J}$ (apply it to finite sets and then take the supremum), and so $K^{\mathcal{P}}_t \le K_t$ for every general binning $\mathcal{P}$.

For continuous calibration, which is defined as $\mathcal{P}$-calibration for every countable continuous binning $\mathcal{P}$, we show that it implies $\mathcal{P}$-calibration for every continuous general binning $\mathcal{P}$ as well.

Proposition 14. If the deterministic procedure $\sigma$ is continuously calibrated, then it is $\mathcal{P}$-calibrated for every continuous general binning $\mathcal{P}$.

Proof.—Let $\mathcal{P} = (w_i)_{i \in I}$ be a continuous general binning. We claim that for every $\varepsilon > 0$ there is a finite set $J^* \subseteq I$ such that

46 This applies to dimension $m \ge 2$ (there is no distinction for dimension $m = 1$, where both minimax and fixed point reduce to the intermediate value theorem).

TABLE 1
Minimax and Fixed Point Universes

                     Minimax         Fixed Point
Forecast hedging     Stochastic      Deterministic
Procedure type       MM              FP
Calibration          Classic         Continuous
Equilibrium          Correlated      Nash
Dynamic result       Time average    Period by period


$$\Bigl\| \sum_{i \in I \setminus J^*} w_i \Bigr\| \le \varepsilon. \tag{A1}$$

This follows from Dini's theorem for nets (instead of sequences); the proof is the same, and because it is short, we provide it here for completeness. Let $\mathcal{J}$ denote the collection of finite subsets of $I$. For every $J \in \mathcal{J}$, let $D_J := \{c \in C : \sum_{i \in J} w_i(c) > 1 - \varepsilon\}$; then $D_J$ is an open set (because $J$ is finite and so $\sum_{i \in J} w_i$ is continuous), and $\bigcup_{J \in \mathcal{J}} D_J = C$ (because for every $c$ we have $\sup_{J \in \mathcal{J}} \sum_{i \in J} w_i(c) = 1$, and so there is $J \in \mathcal{J}$ for which the sum is $> 1 - \varepsilon$). The set $C$ is compact, and so there is a finite subcover $\bigcup_{k=1}^{r} D_{J_k} = C$. Put $J^* := \bigcup_{k=1}^{r} J_k$; then $J^*$ is a finite set, and $D_{J^*} = C$ (because $D_{J^*} \supseteq D_{J_k}$ follows from $J^* \supseteq J_k$). Thus, for every $c \in C$, we have $\sum_{i \in J^*} w_i(c) > 1 - \varepsilon$, and so $\sum_{i \in I \setminus J^*} w_i < \varepsilon$, which yields (A1).

Therefore, by lemma 1,
$$\sum_{i \in I \setminus J^*} \|g_t(w_i)\| \le \varepsilon K_t \le g\varepsilon.$$
For any $J \in \mathcal{J}$, we then have
$$\sup_{a^t} \sum_{i \in J} \|g_t(w_i)\| \le \sum_{i \in J \cap J^*} \sup_{a^t} \|g_t(w_i)\| + \sup_{a^t} \sum_{i \in J \setminus J^*} \|g_t(w_i)\| \le \sum_{i \in J^*} \sup_{a^t} \|g_t(w_i)\| + g\varepsilon.$$
Taking the supremum over $J \in \mathcal{J}$ yields
$$\sup_{a^t} \sum_{i \in I} \|g_t(w_i)\| \le \sum_{i \in J^*} \sup_{a^t} \|g_t(w_i)\| + g\varepsilon;$$
the right-hand side converges to $g\varepsilon$ as $t \to \infty$ by (11) of proposition 2 (as $J^*$ is finite). Since $\varepsilon > 0$ is arbitrary, the limit of the left-hand side is zero. QED

A2. Continuous Calibration Implies Smooth and Weak Calibration

This section recalls the definitions of the existing concepts of smooth and weak calibration and proves that they are both implied by the stronger concept of continuous calibration (see sec. II).

Let $\varepsilon \ge 0$ and $L < \infty$. For a collection $\Lambda = (\Lambda_x)_{x \in C}$ of $L$-Lipschitz functions $\Lambda_x : C \to [0, 1]$, let47
$$\tilde{K}^{\Lambda}_t := \frac{1}{t} \sum_{x \in C} n_t(x)\,\|e_t(\Lambda_x)\|. \tag{A2}$$
A deterministic procedure is $(\varepsilon, L)$-smoothly calibrated (Foster and Hart 2018) if
$$\lim_{t \to \infty} \Bigl( \sup_{a^t, \Lambda} \tilde{K}^{\Lambda}_t \Bigr) \le \varepsilon,$$
where the supremum is over all action sequences $a$ and all collections of $L$-Lipschitz functions $\Lambda = (\Lambda_x)_{x \in C}$ as above; it is $(\varepsilon, L)$-weakly calibrated (Kakade and Foster 2004; Foster and Kakade 2006) if
$$\lim_{t \to \infty} \Bigl( \sup_{a^t, w} \|g_t(w)\| \Bigr) \le \varepsilon,$$
where the supremum is over all action sequences $a$ and all $L$-Lipschitz functions $w : C \to [0, 1]$.

47 A function $f$ is $L$-Lipschitz if $|f(z) - f(z')| \le L\,\|z - z'\|$ for all $z, z'$ in the domain of $f$.

While formula (A2) for $\tilde{K}^{\Lambda}_t$ resembles formula (6) for $K^{\mathcal{P}}_t$, there are two differences. The first is that the weight of $\|e_t(\Lambda_x)\|$ in $\tilde{K}^{\Lambda}_t$ is not the total weight $n_t(\Lambda_x)$ of $\Lambda_x$ (which is the denominator of $e_t(\Lambda_x)$) but rather the number of times $n_t(x)$ that $x$ has been used as a forecast up to time $t$ (the sum in $\tilde{K}^{\Lambda}_t$ is thus the finite sum over $x \in \{c_1, \ldots, c_t\}$). The second is that the functions $\Lambda_x$ do not form a binning; that is, they do not add up to 1. The second difference does not really matter (e.g., it can be addressed by rescaling the $\Lambda_x$ functions, which does not affect the $e_t(\Lambda_x)$, because $e_t(w)$ is homogeneous of degree 0 in $w$). The first difference is more significant; it necessitates the use of certain approximations, such as the small cubes in lemma 11 in Foster and Hart (2018) and the resulting proposition 13 there.48

By contrast, continuous calibration uses the more appropriate weights $n_t(\Lambda_x)$; this streamlines the analysis and simplifies the proofs. Moreover, continuous calibration yields a universal smoothly and weakly calibrated procedure for all parameter values $(\varepsilon, L)$ at once (recall n. 32).

Proposition 15. A deterministic procedure $\sigma$ that is continuously calibrated is $(0, L)$-smoothly calibrated and $(0, L)$-weakly calibrated for every $0 < L < \infty$.

Proof.—The convergence to zero in (11) is uniform over any finite set of continuous $w$'s and thus, by (7), over any compact set of $w$'s—in particular, the set of $L$-Lipschitz functions $w : C \to [0, 1]$, which is compact by the Arzelà-Ascoli theorem. This is precisely $(0, L)$-weak calibration; by proposition 13 in Foster and Hart (2018), it implies $(0, L)$-smooth calibration. QED

A3. Outgoing Results

We provide here a number of comments and extensions to the results of section III.

A3.1. Remarks on Theorem 4

1. Theorem 4 was proved using Brouwer's fixed point theorem; conversely, Brouwer's theorem can be proved using theorem 4. Indeed, let $g : C \to C$ be a continuous function. Theorem 4 applied to $f(x) = g(x) - x$ yields $y \in C$ such that, in particular, $f(y) \cdot (g(y) - y) \le 0$ (because $g(y) \in C$); this is $f(y) \cdot f(y) \le 0$, and so $f(y) = 0$, that is, $g(y) = y$.

2. Brouwer's fixed point theorem is widely used to prove results in many areas. Most such proofs use ingenious constructions, which are needed to make the values of the continuous function lie in its domain, that is, have the function map $C$ into $C$. By contrast, theorem 4 puts no restriction on the range of the function (beyond it being in the Euclidean space of the same dimension); one needs to ensure only that a point $y$ that satisfies (12) has the desired properties.

To demonstrate how theorem 4 may yield simpler proofs, consider the famous result on the existence of Nash equilibria in finite games (Nash 1951). Let $(N, (S^i)_{i \in N}, (u^i)_{i \in N})$ be a finite game in strategic form. Let $C := \prod_{i \in N} \Delta(S^i) \subset \mathbb{R}^m$, where $m := \sum_{i \in N} |S^i|$, and for every $x = (x^i)_{i \in N} \in C$, put $f^i(x) := (u^i(s^i, x^{-i}))_{s^i \in S^i}$ (this is the vector of $i$'s payoffs for all his pure strategies against $x^{-i}$) and $f(x) := (f^i(x))_{i \in N}$. The function $f : C \to \mathbb{R}^m$ is a polynomial and thus continuous, and so theorem 4 gives $y \in C$ such that $f(y) \cdot (c - y) \le 0$ for every $c \in C$. Taking in particular $c = (x^i, y^{-i})$ for any $i \in N$ and $x^i \in \Delta(S^i)$, we get $0 \ge f(y) \cdot (c - y) = f^i(y) \cdot (x^i - y^i) = u^i(x^i, y^{-i}) - u^i(y^i, y^{-i})$, which shows that $y$ is a Nash equilibrium. Moreover, when the game is symmetric, putting $C := \Delta(S^1)$ and $f(x) := (u^1(s, x, \ldots, x))_{s \in S^1}$ for every $x \in C$ yields the existence of a symmetric Nash equilibrium. Compare this short proof with the usual proofs that are based directly on Brouwer's fixed point theorem, which are much more intricate.

48 The bound on $\tilde{K}^{\Lambda}_t$ that is obtained in the proof of proposition 13 in Foster and Hart (2018) plays the same role as proposition 2 here.

A3.2. Remarks on Theorem 5

1. The factor $\delta$ on the right-hand side of (14) can be lowered to $\delta_0 \equiv \delta_0(D) < \delta$ (see the proof of theorem 5) by a limit argument, which is, however, no longer a finite minimax construct. Indeed, take a sequence $B_n$ of finite $\delta_n$-grids of $C$ with $\delta_n$ decreasing to 0; we then get a sequence of probability distributions $\eta_n \in \Delta(D)$ such that
$$E_{y \sim \eta_n}[f(y) \cdot (x - y)] \le (\delta_0 + \delta_n)\,E_{y \sim \eta_n}[\|f(y)\|] \tag{A3}$$
for every $n \ge 1$ and every $x \in C$. Since $D$ is a finite set, the sequence $\eta_n$ has a limit point $\eta \in \Delta(D)$, say, $\eta_{n'} \to \eta$, for a subsequence $n' \to \infty$; for each $x \in C$, taking the limit of (A3) as $n' \to \infty$ then yields49
$$E_{y \sim \eta}[f(y) \cdot (x - y)] \le \delta_0\,E_{y \sim \eta}[\|f(y)\|]. \tag{A4}$$

2. The bound in (A4) is tight: $\delta_0$ cannot be lowered. Indeed, take a point $x_0 \in C$ for which $\mathrm{dist}(x_0, D) = \delta_0$ and consider the function $f : D \to \mathbb{R}^m$ defined by $f(y) = (x_0 - y)/\|x_0 - y\|$ for every $y \in D$; we have $\|f(y)\| = 1$ and $f(y) \cdot (x_0 - y) = \|x_0 - y\| \ge \delta_0$ for every $y \in D$.

A3.3. Remarks on Corollary 6

1. In corollary 6, one can get $\eta \in \Delta(C)$ with support of size at most $m + 2$ (rather than $m + 3$), because with Carathéodory's theorem, the last coordinate of $F(y)$, namely, $\|f(y)\|$, is no longer needed, as it is replaced by the constant $\sup_{x \in C} \|f(x)\|$.

49 The subsequence $n'$ is such that $\eta_{n'}(y)$ is a convergent subsequence, with limit $\eta(y)$, for each one of the finitely many elements $y$ of $D$; then $E_{y \sim \eta_{n'}}[g(y)] = \sum_{y \in D} \eta_{n'}(y)\,g(y) \to \sum_{y \in D} \eta(y)\,g(y) = E_{y \sim \eta}[g(y)]$ as $n' \to \infty$ for every real function $g$ on $D$.

2. If $f$ is a continuous function, then the result of corollary 6 holds also for $\varepsilon = 0$.50 Indeed, take a sequence $\varepsilon_n \to 0^+$. For each $n$, corollary 6 yields a distribution $\eta_n$ on $C$ such that $E_{y \sim \eta_n}[f(y) \cdot (c - y)] \le \varepsilon_n$ for every $c \in C$. All the distributions $\eta_n$ can be taken to have support of size at most $m + 2$ (see remark 1 above), and so the sequence $\eta_n$ has a limit point $\eta$, which is also a distribution on $C$ with support of size at most $m + 2$.51 Then $E_{y \sim \eta}[f(y) \cdot (c - y)] \le 0$ for every $c \in C$ (because $\eta_{n'} \to \eta$ implies $E_{y \sim \eta_{n'}}[f(y) \cdot (c - y)] \to_{n'} E_{y \sim \eta}[f(y) \cdot (c - y)]$, since $f(y) \cdot (c - y)$ is a continuous function of $y$).

3. If $f$ is not continuous, the result of corollary 6 need not hold for $\varepsilon = 0$; take, for example, $C = [0, 2]$, and $f(x) = 1$ if $x < 1$ and $f(x) = -1$ if $x \ge 1$. Assume that $\eta \in \Delta(C)$ satisfies $E_{y \sim \eta}[f(y) \cdot (c - y)] \le 0$ for all $c \in C$. Taking $c = 1$ gives $E_{y \sim \eta}[f(y) \cdot (1 - y)] \le 0$, but $f(y) \cdot (1 - y) \ge 0$ for all $y \in [0, 2]$, with equality only for $y = 1$, and so $\eta$ must put unit mass on $y = 1$; but then $E_{y \sim \eta}[f(y) \cdot (c - y)] = f(1) \cdot (c - 1) = 1 - c$, which is positive for $c < 1$.

4. The minimax theorem follows from corollary 6. First, consider a symmetric finite two-person zero-sum game, given by an $m \times m$ payoff matrix $B$ that is skew symmetric (i.e., $B^\top = -B$). Take $C$ to be the unit simplex in $\mathbb{R}^m$ (i.e., the set of mixed strategies), and let $f : C \to \mathbb{R}^m$ be given by $f(x) := Bx$. Corollary 6 together with remark 2 above implies that there exists a distribution $\eta$ on $C$ (with finite support) such that $E_{y \sim \eta}[y^\top B^\top (c - y)] = E_{y \sim \eta}[By \cdot (c - y)] \le 0$ for every $c \in C$. Now $y^\top B^\top y = 0$ for every $y \in C$ by symmetry (i.e., $B^\top = -B$), and so $E_{y \sim \eta}[y^\top B^\top c] \le 0$ for every $c \in C$. Thus, $z := E_{y \sim \eta}[y] \in C$ satisfies $z^\top B c = -z^\top B^\top c \ge 0$ for every $c \in C$, and so $z$ is a minimax strategy that guarantees the value 0; by symmetry, $z$ is also a maximin strategy that guarantees the value 0, and we are done. Finally, for a general two-person zero-sum game, use a standard symmetrization argument (e.g., Luce and Raiffa 1957, A6.8).

A3.4. Remark on Theorem 7

If $C$ is a convex polytope and the set $D$ consists of the vertices of a simplicial subdivision of $C$, then we can define $\tilde{f}$ by linearly interpolating inside each simplex; this implies that we moreover have $E_{y \sim \eta}[y] = z$ (however, to keep satisfying this additional property may require $\eta$ to have support of size $2m + 1$ instead of $m + 1$).

A4. Deterministic and Stochastic Forecast Hedging

We explain here why the proofs of (D) and (S) of theorem 9 are somewhat different; specifically, we use $S_t$ and the derived $J_{t-1}$ in (D) and $X_t$ and the derived $w_{t-1}$ in (S).

50 Of course, theorem 4 yields in this case a stronger result, i.e., a point $y$ rather than a distribution $\eta$. However, the result for $\varepsilon = 0$ is obtained here by a minimax (rather than a fixed point) theorem.

51 Take a subsequence $n'$ where all the $m + 2$ values and all the $m + 2$ probabilities converge (thus, we do not need to appeal to Prokhorov's theorem); denote by $\eta$ the limit distribution. Then $E_{y \sim \eta_{n'}}[g(y)] \to_{n'} E_{y \sim \eta}[g(y)]$ for any continuous function $g$ (because then $p_{n'} \to p$ and $y_{n'} \to y$ implies $p_{n'}\,g(y_{n'}) \to p\,g(y)$).


One can check that the $S_t$ approach in the (S) setup gives $\lim_t (1/t^2) S_t = \lim_t \sum_{i=1}^{I} \|g_t(w_i)\|^2 \le \varepsilon^2$. What this yields is $\lim_t E[K^{\mathcal{P}}_t] = \lim_t E[\sum_{i=1}^{I} \|g_t(w_i)\|] \le \varepsilon\sqrt{I}$ (e.g., consider the case where the $\|g_t(w_i)\|^2$ are all equal to $\varepsilon^2/I$), which, however, does not suffice. Indeed, for classic calibration, the binning comes from an $\varepsilon$-grid of $C$ (see the proof of theorem 11(S)), and so its size $I$ is of the order of $1/\varepsilon^m$, which makes the bound $\varepsilon\sqrt{I}$ not useful beyond dimension $m = 1$. The more delicate approach with $X_t$ gets rid of this annoying $\sqrt{I}$ factor. The issue does not arise in (D), since there we have $\varepsilon = 0$, and so $\varepsilon\sqrt{I} = 0$ for every finite binning, which extends to countable continuous binnings by (10).

Going in the other direction, while we could use the $X_t$ approach for (D) as well (it will not affect the result), the $S_t$ approach is preferable, as it is shorter and simpler.

A5. Calibration with Probability 1

In this section, we show how to strengthen the results on classic calibration (the-orem 11(S) and (AD) in sec. IV) from convergence in expectation to convergencealmost surely (a.s.).

The definition of classic calibration in section II.A requires that the calibra-tion score Kt be small in expectation (i.e., that E½Kt � be less than ε in the limit).One may require in addition that Kt be small almost surely (i.e., with probability 1);that is, for every action sequence a,

$$\lim_{t \to \infty} K_t \le \varepsilon \quad \text{(a.s.)}. \tag{A5}$$

We now show that the procedures constructed in section IV do indeed satisfy thisadditional requirement.

In the proof of theorem 9(S), the sequence $Y_t$ is uniformly bounded (by $2g \cdot g + g^2 = 3g^2$), and so we can apply the strong law of large numbers for dependent random variables (see (26)):

$$\frac{1}{t} \sum_{s=1}^{t} \bigl(Y_s - \mathbb{E}[Y_s \mid h_{s-1}]\bigr) \to_{t \to \infty} 0 \quad \text{(a.s.)}.$$

Since $\mathbb{E}[Y_s \mid h_{s-1}] = \mathbb{E}_{s-1}[Y_s] \le \varepsilon^2$ by (21), it follows that $\lim_{t \to \infty} (1/t) \sum_{s=1}^{t} Y_s \le \varepsilon^2$ (a.s.). Together with $\lim_{t \to \infty} (1/t) \sum_{s=1}^{t} Z_s = 0$ by (22), we get $\lim_{t \to \infty} (1/t) X_t \le \varepsilon^2$ (a.s.), and thus $\lim_{t \to \infty} K^P_t \le \varepsilon$ (a.s.) (because $(K^P_t)^2 \le (1/t) X_t$). Applying this to the binning $P$ of theorem 10(S) yields (A5) for stochastic classic calibration (theorem 11(S)) as well as for almost deterministic classic calibration (theorem 11(AD)).
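The strong law for dependent random variables invoked here is, in essence, a strong law for bounded martingale differences. A minimal simulation (a hypothetical toy sequence, not the actual $Y_t$ of the proof) illustrates that the averaged differences $Y_s - \mathbb{E}[Y_s \mid h_{s-1}]$ vanish even when the $Y_s$ depend on the history:

```python
import random

random.seed(0)

def averaged_martingale_differences(t):
    """Average of Y_s - E[Y_s | h_{s-1}] for a bounded, history-dependent toy Y_s."""
    total, past = 0.0, 0.0
    for _ in range(t):
        p = 0.3 if past < 0 else 0.7       # conditional mean depends on the history
        y = 1.0 if random.random() < p else 0.0
        total += y - p                      # bounded martingale difference
        past += y - 0.5
    return total / t

print(abs(averaged_martingale_differences(100_000)))  # close to 0, as the SLLN predicts
```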

A6. Continuously Calibrated Learning

In this section, we provide a number of comments and extensions on the resulton game dynamics of section VI.


A6.1. Remarks on Theorem 13

1. The forecasts are also approximate Nash equilibria:52

$$\lim_{t \to \infty} \frac{1}{t}\,\bigl|\{s \le t : c_s \in NE(\varepsilon')\}\bigr| = 1 \quad \text{(a.s.)}$$

for every $\varepsilon' > \varepsilon$. This follows by replacing claim 3 with:

Claim 3′. For every $\varepsilon' > \varepsilon$, there is $\delta > 0$ such that $\|b(c) - c\| \le \delta$ implies that $c \in NE(\varepsilon')$.

Proof.—Take $\delta > 0$ such that $\|x - y\| \le \delta$ implies $|u^i(x^i, y^{-i}) - u^i(y)| \le \varepsilon' - \varepsilon$ for every $i$.

2. A statement that is equivalent to (24) is

$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \operatorname{dist}(x_s, NE(\varepsilon)) = 0, \tag{A6}$$

which is the way it appears in Kakade and Foster (2004) (and the same applies to the statement in remark 1 above). Indeed, for every $\varepsilon' > \varepsilon$, let $d(\varepsilon') := \inf_{x \notin NE(\varepsilon')} \operatorname{dist}(x, NE(\varepsilon))$ and $r(\varepsilon') := \sup_{x \in NE(\varepsilon')} \operatorname{dist}(x, NE(\varepsilon))$; then it is straightforward to see that $d(\varepsilon') > 0$ and $\lim_{\varepsilon' \searrow \varepsilon} r(\varepsilon') = 0$ (use the compactness of $X$ and the continuity of the functions $u^i$). Therefore, $d(\varepsilon')\,\mathbf{1}_{x \notin NE(\varepsilon')} \le \operatorname{dist}(x, NE(\varepsilon)) \le r(\varepsilon') + m^{1/2}\,\mathbf{1}_{x \notin NE(\varepsilon')}$ (because $\sup_{x,y \in X} \|x - y\| \le m^{1/2}$). Using the first inequality for each $x_s$ shows that (A6) implies (24), and using the second inequality for each $x_s$ shows that (24) implies (A6) (the limit is $\le r(\varepsilon')$ for every $\varepsilon' > \varepsilon$ and thus zero, because $\lim_{\varepsilon' \searrow \varepsilon} r(\varepsilon') = 0$).

3. The forecasting procedure in I depends only on the sizes of the strategy sets $(m^i)_{i \in N}$.

4. The play in each period $t$ need not be independent across the players, so long as the marginals are $(b^i(c_t))_{i \in N}$.
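For intuition on the sandwich inequality in remark 2, one can check it on a hypothetical one-dimensional example in which $NE(\varepsilon)$ is played by the interval $[0, \varepsilon] \subset [0, 1]$, so that $d(\varepsilon') = r(\varepsilon') = \varepsilon' - \varepsilon$ and the diameter bound is 1 (a sketch with made-up sets, not the equilibrium sets of an actual game):

```python
def dist_to_interval(x, a, b):
    """Euclidean distance from the point x to the interval [a, b]."""
    return max(a - x, x - b, 0.0)

eps, eps_prime = 0.1, 0.2
d = eps_prime - eps    # inf of dist(x, [0, eps]) over x outside [0, eps']
r = eps_prime - eps    # sup of dist(x, [0, eps]) over x inside [0, eps']
diam = 1.0             # plays the role of the diameter bound m**(1/2)

for x in (i / 100 for i in range(101)):
    outside = 1.0 if x > eps_prime else 0.0
    dist = dist_to_interval(x, 0.0, eps)
    # the two inequalities of remark 2, checked pointwise on [0, 1]
    assert d * outside <= dist <= r + diam * outside
print("sandwich holds on [0, 1]")
```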

References

Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis. 2nd ed. New York: Springer.

Blackwell, D. 1956. "An Analog of the Minimax Theorem for Vector Payoffs." Pacific J. Mathematics 6:1–8.

Border, K. 1985. Fixed Point Theorems with Applications to Economics and Game Theory. New York: Cambridge Univ. Press.

Brouwer, L. E. J. 1912. "Über Abbildung von Mannigfaltigkeiten." Mathematische Annalen 71:97–115.

Dawid, A. 1982. "The Well-Calibrated Bayesian." J. American Statis. Assoc. 77:605–13.

Dekel, E., and Y. Feinberg. 2006. "Non-Bayesian Testing of a Stochastic Prediction." Rev. Econ. Studies 73:893–906.

52 Which is not surprising, as $c_t$ and $x_t$ are close (see claim 2). Of course, what we care about are not the forecasts, but the behaviors; this is why the result in theorem 13 is stated for $x_t$.


Foster, D. P. 1999. "A Proof of Calibration via Blackwell's Approachability Theorem." Games and Econ. Behavior 29:73–78.

Foster, D. P., and S. Hart. 2018. "Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics." Games and Econ. Behavior 109:271–93.

Foster, D. P., and S. M. Kakade. 2006. "Calibration via Regression." IEEE Information Theory Workshop, Punta del Este, Uruguay, March 13–17.

Foster, D. P., and R. Stine. 2004. "Variable Selection in Data Mining." J. American Statis. Assoc. 99:303–13.

Foster, D. P., and R. V. Vohra. 1997. "Calibrated Learning and Correlated Equilibrium." Games and Econ. Behavior 21:40–55.

———. 1998. "Asymptotic Calibration." Biometrika 85:379–90.

———. 1999. "Regret in the On-Line Decision Problem." Games and Econ. Behavior 29:7–35.

Fudenberg, D., and D. K. Levine. 1999. "An Easier Way to Calibrate." Games and Econ. Behavior 29:131–37.

George, E. I., and D. P. Foster. 2000. "Calibration and Empirical Bayes Variable Selection." Biometrika 87:731–47.

Hart, S. 1995. "Calibrated Forecasts: The Minimax Proof." http://www.ma.huji.ac.il/hart/publ.html#calib-minmax.

Hart, S., and A. Mas-Colell. 2000. "A Simple Adaptive Procedure Leading to Correlated Equilibrium." Econometrica 68:1127–50; also in S. Hart and A. Mas-Colell 2013, chap. 2.

———. 2003. "Uncoupled Dynamics Do Not Lead to Nash Equilibrium." A.E.R. 93:1830–36; also in S. Hart and A. Mas-Colell 2013, chap. 7.

———. 2006. "Stochastic Uncoupled Dynamics and Nash Equilibrium." Games and Econ. Behavior 57:286–303; also in S. Hart and A. Mas-Colell 2013, chap. 8.

———. 2013. Simple Adaptive Strategies: From Regret-Matching to Uncoupled Dynamics. Singapore: World Scientific.

Hartman, P., and G. Stampacchia. 1966. "On Some Non-Linear Elliptic Differential Equations." Acta Mathematica 115:271–310.

Hazan, E., and S. M. Kakade. 2012. "(Weak) Calibration Is Computationally Hard." 25th Annual Conference on Learning Theory. J. Machine Learning Res. 23:3.1–3.10.

Kakade, S. M., and D. P. Foster. 2004. "Deterministic Calibration and Nash Equilibrium." 17th Annual Conference on Learning Theory (COLT).

———. 2008. "Deterministic Calibration and Nash Equilibrium." J. Computer and System Sci. 74:115–30.

Loève, M. 1978. Probability Theory, vol. 2. 4th ed. New York: Springer.

Luce, R. D., and H. Raiffa. 1957. Games and Decisions. New York: Wiley.

Mellers, B., E. Stone, T. Murray, et al. 2015. "Identifying and Cultivating Superforecasters as a Method of Improving Probabilistic Predictions." Perspectives Psychological Sci. 10:267–81.

Nash, J. 1951. "Non-Cooperative Games." Ann. Math. 54:286–95.

Oakes, D. 1985. "Self-Calibrating Priors Do Not Exist." J. American Statis. Assoc. 80:339.

Olszewski, W. 2015. "Calibration and Expert Testing." In Handbook of Game Theory, vol. 4, edited by H. P. Young and S. Zamir, 949–84. New York: Springer.

Olszewski, W., and A. Sandroni. 2008. "Manipulability of Future-Independent Tests." Econometrica 76:1437–66.

Robbins, H. 1956. "An Empirical Bayes Approach to Statistics." In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Contributions to the Theory of Statistics, edited by J. Neyman, 157–63. Berkeley: Univ. California Press.


Rudin, W. R. 1976. Principles of Mathematical Analysis. 3rd ed. New York: McGraw-Hill.

Tetlock, P. E., and D. Gardner. 2015. Superforecasting: The Art and Science of Prediction. New York: Crown.

von Neumann, J. 1928. "Zur Theorie der Gesellschaftsspiele." Mathematische Annalen 100:295–320.

Zadrozny, B., and C. Elkan. 2001. "Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers." In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, edited by C. E. Brodley and A. P. Danyluk, 609–16. San Francisco: Morgan Kaufmann.
