From Decoupling and Self-Normalization to Machine Learning

Victor H. de la Pena

Introduction

We now live in the age of the digital revolution, where machine learning and artificial intelligence have transformed the way data is generated, processed, and analyzed to solve complex research problems. One of the goals in the development of machine-learning algorithms is to extract as much information from the data as possible, including information that cannot be obtained through traditional techniques. Decoupling and self-normalization are statistical tools that handle complex problems lying beyond the reach of classical statistical methods. Self-normalization can be traced back to the seminal work of Gosset [9], which is considered a breakthrough in science. Notably, his Student t-statistic allowed statistical inference about the mean of a (Gaussian) distribution without knowledge of the actual value of the variance, provided one has a random sample from the target population. Remarkably, the self-normalization approach provides techniques to extend the t-statistic to the case of non-Gaussian distributions and nonindependent variables.

Victor H. de la Pena is a professor of statistics at Columbia University. His email address is [email protected].

Communicated by Notices Associate Editor Noah Simon.

For permission to reprint this article, please contact: [email protected].

DOI: https://doi.org/10.1090/noti1978

Then what do decoupling, self-normalization, and machine learning have in common? They enable the development of algorithms with minimal dependence or parametric assumptions. In essence, decoupling and self-normalization are areas that grew out of the need to extend martingale methods to high and infinite dimensions and to complex nonlinear dependence structures. Decoupling provides tools for treating dependent variables as if they were independent. In particular, it provides a natural tool for developing sharp exponential inequalities for self-normalized martingales. Prototypical examples of self-normalized processes are the t-statistic based on dependent random variables as well as (self-normalized) extensions of Kolmogorov's law of the iterated logarithm. As will be described in the last section, these results have been used to establish important machine-learning tools, including efficient algorithms for the stochastic multiarmed bandit problem and for the so-called learning-to-rank (LTR) model.

Complete Decoupling

Let $\{d_i,\ i=1,\dots,n\}$ be a sequence of dependent random variables with $E|d_i|<\infty$. Let $\{y_i,\ i=1,\dots,n\}$ be a sequence of independent variables where, for each $i$, $d_i$ and $y_i$ have the same marginal distribution. Since $Ed_i = Ey_i$, linearity of expectations provides the first "complete decoupling" equality:

\[ E\sum_{i=1}^{n} d_i \;=\; E\sum_{i=1}^{n} y_i. \tag{1} \]

As a concrete example for constructing the sequence $\{y_i\}$, let $\{d_i^{(j)},\ i=1,\dots,n;\ j=1,\dots,n\}$ be independent copies of $\{d_i\}$ and take $y_i = d_i^{(i)}$. Then it is easy to see that $\{y_i\}$ is a sequence of independent random variables, since each row in the array is independent of the others, and

\[ E\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} E d_i = \sum_{i=1}^{n} E d_i^{(i)} = \sum_{i=1}^{n} E y_i = E\sum_{i=1}^{n} y_i. \]
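To make the diagonal construction concrete, here is a minimal simulation sketch (the model and all names are illustrative, not from the article): it generates $n$ independent copies of a dependent vector, takes $y_i = d_i^{(i)}$ from the $i$th copy, and checks that the two sums have (approximately) equal means, as in (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def dependent_vector(n, rng):
    # One draw of a dependent vector (d_1, ..., d_n): partial sums of i.i.d. signs,
    # so the coordinates are strongly dependent (illustrative choice).
    return np.cumsum(rng.choice([-1.0, 1.0], size=n))

n, reps = 5, 100_000
sum_d = sum_y = 0.0
for _ in range(reps):
    # n independent copies of the whole vector; row j is the copy d^{(j)}
    copies = np.array([dependent_vector(n, rng) for _ in range(n)])
    d = copies[0]          # a single dependent vector
    y = np.diag(copies)    # diagonal: y_i = d_i^{(i)}, independent with the same marginals
    sum_d += d.sum()
    sum_y += y.sum()

# Complete decoupling equality (1): both Monte Carlo means estimate E sum_i d_i (here 0).
print(sum_d / reps, sum_y / reps)
```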

In complete decoupling, one compares $E f(\sum d_i)$ to $E f(\sum y_i)$ for more general functions than $f(x) = x$. In statistics, Hoeffding's [11] inequality for comparing sampling without replacement to sampling with replacement provides an early example. More recent work involves extensions to the case of negatively associated variables by Shao [19], as well as to the case of sequences of nonnegative random variables with arbitrary dependence structures. Applications of complete decoupling include, among others, tools for the optimization of stochastic processes such as the scheduling of dependent computer servers (see Makarychev and Sviridenko [17]).

Let the population $C$ consist of $N$ values $c_1, c_2, \dots, c_N$. Let $d_1, d_2, \dots, d_n$ denote a random sample without replacement from $C$, and let $y_1, y_2, \dots, y_n$ denote a random sample with replacement from $C$. The random variables $y_1, \dots, y_n$ are independent and identically distributed. Moreover, for all $i$, $d_i$ and $y_i$ have the same marginal distributions. Hoeffding [11] developed the following widely used complete decoupling inequality: for every continuous convex function $\Phi$,

\[ E\,\Phi\Big(\sum_{i=1}^{n} d_i\Big) \;\le\; E\,\Phi\Big(\sum_{i=1}^{n} y_i\Big). \tag{2} \]

Shao [19] extended it to the case of negatively associated random variables.
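A quick simulation sketch of (2) (illustrative population, sample size, and convex function, all chosen arbitrarily): compare $E\Phi(\sum d_i)$ for sampling without replacement against $E\Phi(\sum y_i)$ for sampling with replacement from the same population.

```python
import numpy as np

rng = np.random.default_rng(1)
population = np.array([0.0, 1.0, 1.0, 3.0, 7.0, 10.0])   # the N population values c_1, ..., c_N
n, reps = 4, 100_000
phi = lambda s: (s - 10.0) ** 2    # an arbitrary continuous convex function

mean_without = mean_with = 0.0
for _ in range(reps):
    d = rng.choice(population, size=n, replace=False)   # sample without replacement
    y = rng.choice(population, size=n, replace=True)    # sample with replacement
    mean_without += phi(d.sum())
    mean_with += phi(y.sum())

# Hoeffding's inequality (2): the "without replacement" average should not exceed the other.
print(mean_without / reps, mean_with / reps)
```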

In what follows we present a (sharp) complete decoupling inequality for sums of nonnegative dependent random variables that provides a reverse Hoeffding inequality for $L_p$ moments. The price one pays is a constant.

Theorem. Let $\pi(1)$ be a Poisson random variable with mean $1$. Assume the $d_i$'s are a sequence of arbitrarily dependent nonnegative random variables. Let $y_1, \dots, y_n$ be independent random variables with $y_i$ having the same distribution as $d_i$ for all $i$. Then, for $p \ge 1$,

\[ E\Big(\sum_{i=1}^{n} y_i\Big)^{p} \;\le\; E\big(\pi(1)^{p}\big)\, E\Big(\sum_{i=1}^{n} d_i\Big)^{p}. \tag{3} \]

In particular, when $p = 1$ we recover the original complete decoupling equality in (1), since $E\pi(1) = 1$:

\[ E\sum_{i=1}^{n} y_i = E\pi(1)\, E\sum_{i=1}^{n} d_i = E\sum_{i=1}^{n} d_i. \]
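As a sanity-check sketch (illustrative setup, not from the article): for $p = 2$ the constant is $E\pi(1)^2 = 2$, and a Monte Carlo comparison with strongly dependent nonnegative variables should respect (3).

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, p = 5, 100_000, 2
const = 2.0      # E[pi(1)^p] for p = 2: second moment of a Poisson(1) variable

lhs = rhs = 0.0
for _ in range(reps):
    u = rng.uniform()
    d = np.full(n, u)            # extreme dependence: d_1 = ... = d_n = U ~ Uniform(0, 1)
    y = rng.uniform(size=n)      # independent variables with the same Uniform(0, 1) marginals
    lhs += y.sum() ** p
    rhs += d.sum() ** p

# Inequality (3): E(sum y_i)^p <= E[pi(1)^p] * E(sum d_i)^p
print(lhs / reps, const * (rhs / reps))
```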

The above result was introduced by de la Pena [4] in the case of more general functions, including powers, with different constants. The sharp constants were first obtained by Makarychev and Sviridenko [17].

Conditionally Independent (Tangent) Decoupling

The theory of martingale inequalities has been central to the development of modern probability theory. Recently it has been expanded widely through the introduction of the theory of conditionally independent (tangent) decoupling. This approach to decoupling can be traced back to a result of Burkholder and McConnell included in Burkholder [2] that represents a step in extending the theory of martingales to Banach spaces.

Let $\{d_i\}$ and $\{e_i\}$ be two sequences of random variables adapted to the $\sigma$-fields $\{\mathcal{F}_i\}$. Then $\{d_i\}$ and $\{e_i\}$ are said to be tangent with respect to $\{\mathcal{F}_i\}$ if, for all $i$,

\[ \mathcal{L}(d_i \mid \mathcal{F}_{i-1}) = \mathcal{L}(e_i \mid \mathcal{F}_{i-1}), \tag{4} \]

where $\mathcal{L}(d_i \mid \mathcal{F}_{i-1})$ denotes the conditional probability law of $d_i$ given $\mathcal{F}_{i-1}$.

Let $d_1, \dots, d_n$ be an arbitrary sequence of dependent random variables adapted to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_i\}$. Then, as shown in de la Pena and Gine [6], one can construct a sequence $e_1, \dots, e_n$ of random variables that is conditionally independent given $\mathcal{G} = \mathcal{F}_n$. The construction proceeds as follows. First we take $e_1$ and $d_1$ to be two independent copies of the same random mechanism. Having constructed $d_1, \dots, d_{i-1}$; $e_1, \dots, e_{i-1}$, the $i$th pair of variables $d_i$ and $e_i$ comes from i.i.d. copies of the same random mechanism given $\mathcal{F}_{i-1}$. It is easy to see that using this construction and taking

\[ \mathcal{F}'_i = \mathcal{F}_i \vee \sigma(e_1, \dots, e_i), \]

the sequences $\{d_i\}$, $\{e_i\}$ satisfy

\[ \mathcal{L}(d_i \mid \mathcal{F}'_{i-1}) = \mathcal{L}(e_i \mid \mathcal{F}'_{i-1}) = \mathcal{L}(e_i \mid \mathcal{G}), \]

and the sequence $e_1, \dots, e_n$ is conditionally independent given $\mathcal{G} = \mathcal{F}_n$.

A sequence $\{e_i\}$ of random variables satisfying the above conditions is said to be a decoupled $\mathcal{F}'_i$-tangent version of $\{d_i\}$.
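A minimal simulation sketch of this construction (the autoregressive model below is an illustrative choice, not from the article): at each step $e_i$ is drawn as a fresh copy from the same conditional law given the $d$-past, so that given all the $d$'s the $e_i$ are conditionally independent.

```python
import numpy as np

rng = np.random.default_rng(3)

def tangent_pair(n, rng):
    # Adapted sequence (illustrative): d_i = 0.5 * d_{i-1} + eps_i with standard normal noise,
    # so L(d_i | F_{i-1}) = N(0.5 * d_{i-1}, 1).  At each step, e_i is a fresh, independent
    # draw from that same conditional law, which makes {e_i} a decoupled tangent version of {d_i}.
    d, e = np.zeros(n), np.zeros(n)
    prev = 0.0
    for i in range(n):
        d[i] = 0.5 * prev + rng.normal()
        e[i] = 0.5 * prev + rng.normal()
        prev = d[i]
    return d, e

# Decoupling equality (5) below: E sum d_i = E sum e_i (both are 0 in this model).
reps, n = 50_000, 10
totals = np.array([[seq.sum() for seq in tangent_pair(n, rng)] for _ in range(reps)])
print(totals.mean(axis=0))
```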

As in the case of complete decoupling, linearity of expectations provides the canonical example of a decoupling "equality." In conditionally independent decoupling one replaces dependent random variables with decoupled (conditionally independent) random variables. If $E|d_i| < \infty$ for all $i$, then

\[ E\sum_{i=1}^{n} d_i = E\sum_{i=1}^{n} e_i. \tag{5} \]

To see this, note that

\[ E\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} E d_i = \sum_{i=1}^{n} E\big(E(d_i \mid \mathcal{F}'_{i-1})\big) = \sum_{i=1}^{n} E\big(E(e_i \mid \mathcal{F}'_{i-1})\big) = \sum_{i=1}^{n} E\big(E(e_i \mid \mathcal{G})\big) = E\Big(E\Big(\sum_{i=1}^{n} e_i \,\Big|\, \mathcal{G}\Big)\Big) = E\sum_{i=1}^{n} e_i. \]

The first general decoupling inequality for tangent sequences was obtained by Zinn [20] and extended by Hitczenko [10].

A turning point in the theory of decoupling for tangent sequences was a 1986 result of Kwapien and Woyczynski (see Kwapien and Woyczynski [14] for the exact reference). It is shown in that paper, for the first time and in a precise manner, that one can always obtain a decoupled tangent sequence to any adapted sequence, hence making general decoupling inequalities widely applicable.

Developments of the theory and applications are found in hard copy in Kwapien and Woyczynski [14] and in de la Pena and Gine [6].

Decoupling and Self-Normalization

Next, we present a sharp decoupling inequality with constraints from de la Pena [5]. This result will be used later to obtain a sharp extension of Bernstein's inequality for independent random variables to the case of self-normalized martingales.

Let $\{d_i\}$ and $\{e_i\}$ be two tangent sequences with $\{e_i\}$ decoupled. Then for all $g > 0$ adapted to $\sigma(\{d_i\})$,

\[ E\, g \exp\Big\{\lambda \sum_{i=1}^{n} d_i\Big\} \;\le\; \sqrt{E\, g^2 \exp\Big\{2\lambda \sum_{i=1}^{n} e_i\Big\}}. \tag{6} \]

As a first application we use (6) in the context of the sampling schemes discussed above.

Example (conditionally independent sampling). In this example we show how to decouple a sample without replacement, and how the decoupled sequence relates to sampling without replacement and sampling with replacement. (In survey sampling we treat draws without replacement as if they were independent, though they are actually weakly coupled.) As before, consider drawing samples of size $n$ from a population $C$ that consists of $N$ values. Let $d_1, \dots, d_n$ denote a sample without replacement and let $y_1, \dots, y_n$ denote a sample with replacement. A conditionally independent sample can be constructed as follows. At the $i$th stage of a simple random sampling without replacement, both $d_i$ and $e_i$ are obtained by sampling uniformly from $\{c_1, \dots, c_N\}$, excluding $\{d_1, \dots, d_{i-1}\}$. This may be attained by selectively returning items to $C$. More precisely, at the $i$th stage first draw $e_i$ and return it to the population; then draw $d_i$ and put its value aside. It is easy to see that the above procedure makes $\{e_i\}$ tangent to $\{d_i\}$ with $\mathcal{F}_n = \sigma(d_1, \dots, d_n; e_1, \dots, e_n)$. Moreover, the sequence $\{e_i\}$ is conditionally independent given $\mathcal{G} = \sigma(d_1, \dots, d_n)$.
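A minimal sketch of this sampling scheme (the population values, sample size, and $\lambda$ below are arbitrary illustrative choices): at each stage $e_i$ is drawn from the remaining items and returned, then $d_i$ is drawn and removed. The final line is a Monte Carlo check of inequality (6) with $g = 1$ for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
population = [0.0, 1.0, 2.0, 5.0, 9.0]   # the N population values c_1, ..., c_N
n, lam, reps = 3, 0.1, 100_000

def cond_indep_sample(pop, n, rng):
    # Stage i: draw e_i uniformly from the items not yet set aside and return it,
    # then draw d_i from the same remaining items and remove it.
    remaining = list(pop)
    d, e = [], []
    for _ in range(n):
        e.append(remaining[rng.integers(len(remaining))])      # drawn and returned
        d.append(remaining.pop(rng.integers(len(remaining))))   # drawn and set aside
    return np.array(d), np.array(e)

lhs = rhs = 0.0
for _ in range(reps):
    d, e = cond_indep_sample(population, n, rng)
    lhs += np.exp(lam * d.sum())
    rhs += np.exp(2.0 * lam * e.sum())

# Inequality (6) with g = 1: E exp(lam * sum d_i) <= sqrt(E exp(2 * lam * sum e_i)).
print(lhs / reps, np.sqrt(rhs / reps))
```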

A use of the exponential decoupling inequality (6) (with $g = 1$) gives

\[ E\exp\Big\{\lambda \sum_{i=1}^{n} d_i\Big\} \;\le\; \sqrt{E\exp\Big\{2\lambda \sum_{i=1}^{n} e_i\Big\}}. \tag{7} \]

In some sense, conditionally independent sampling canbe viewed as a sampling scheme in between sampling withreplacement and sampling without replacement.

Next we state Bernstein's inequality for sums of independent random variables and its self-normalized counterpart for martingales.

Bernstein's inequality. Let $\{x_i\}$ be a sequence of independent random variables. Assume that $E(x_j) = 0$ and $E(x_j^2) = \sigma_j^2 < \infty$, and set $v_n^2 = \sum_{j=1}^{n} \sigma_j^2$. Furthermore, assume that there exists a constant $0 < c < \infty$ such that $E(|x_j|^k) \le (k!/2)\,\sigma_j^2\, c^{k-2}$ for all $k > 2$. Then for all $x > 0$,

\[ P\Big(\sum_{i=1}^{n} x_i > x\Big) \;\le\; \exp\Big(-\frac{x^2}{2(v_n^2 + cx)}\Big). \tag{8} \]

The following is a self-normalized inequality from de la Pena [5]:

Self-normalized Bernstein's inequality. Let $\{d_i, \mathcal{F}_i\}$ be a martingale difference sequence. Assume that $E(d_j \mid \mathcal{F}_{j-1}) = 0$ and $E(d_j^2 \mid \mathcal{F}_{j-1}) = \sigma_j^2 < \infty$ (satisfied by subexponential random variables), and set $V_n^2 = \sum_{j=1}^{n} \sigma_j^2$. Furthermore, assume that there exists a constant $0 < c < \infty$ such that, almost surely, $E(|d_j|^k \mid \mathcal{F}_{j-1}) \le (k!/2)\,\sigma_j^2\, c^{k-2}$ for all $k > 2$. Then for all $x, y > 0$,

\[ P\Big(\frac{\sum_{i=1}^{n} d_i}{V_n^2} > x,\ \frac{1}{V_n^2} \le y\Big) \;\le\; \exp\Big(-\frac{x^2}{2y(1 + cx)}\Big). \tag{9} \]

Remark. The inequality is sharp, since when $V_n^2 = v_n^2$ (nonrandom) the two inequalities are equivalent. The key steps in obtaining this result involve the use of Markov's inequality followed by the decoupling inequality presented in (6). We then (conditionally) apply the standard results for sums of independent random variables to complete the proof.
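A simulation sketch of (9) with an illustrative martingale (predictable $\pm$-scaled fair coin flips; the scaling keeps $|d_j| \le 1$, so $c = 1$ satisfies the moment condition, and $\sigma_j^2 \ge 1/4$ on every path, so $y = 4/n$ is a valid choice): the empirical tail probability should stay below the stated bound.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, x, c = 100, 20_000, 0.3, 1.0
y = 4.0 / n     # here sigma_j^2 >= 1/4 always, so 1/V_n^2 <= 4/n holds on every path

hits = 0
for _ in range(reps):
    signs = rng.choice([-1.0, 1.0], size=n)    # fair coin flips
    S = V2 = 0.0
    scale = 1.0                                # predictable scale factor, in {1/2, 1}
    for s in signs:
        d = scale * s                          # martingale difference given the past
        S += d
        V2 += scale ** 2                       # sigma_j^2 = scale^2 (known before the flip)
        scale = 0.5 if s > 0 else 1.0          # next scale depends only on the past
    if S / V2 > x and 1.0 / V2 <= y:
        hits += 1

bound = np.exp(-x ** 2 / (2 * y * (1 + c * x)))
print(hits / reps, bound)    # empirical tail probability vs. the bound in (9)
```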

Self-Normalization

This area goes back to the seminal work of Gosset [9] ("Student"), in which he introduced the t-statistic and the t-distribution. For more than a century, the t-statistic has evolved into much more general Studentized statistics and self-normalized processes. Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. normal random variables. Gosset considered the problem of statistical inference on the mean $\mu$ when the standard deviation $\sigma$ of the underlying distribution is unknown. Let

\[ \bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma}_n^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}{n-1} \]

be the sample mean and the sample variance, respectively. In his 1908 paper Gosset derived the distribution of the t-statistic $T_n = \sqrt{n}\,(\bar{X}_n - \mu)/\hat{\sigma}_n$ for normal $X_i$; this is the t-distribution with $n-1$ degrees of freedom. The t-distribution converges to the standard normal distribution, and in fact $T_n$ has a limiting standard normal distribution even when the $X_i$ are nonnormal.

It has been noted that for "fat-tailed" distributions with infinite second or even first absolute moments, the t-test of $\mu = \mu_0$ is robust against nonnormality in terms of the type I error probability. Furthermore, directly plugging in the true variance would actually result in a substantially worse statistic (one that is extremely anticonservative in the case of, e.g., tests involving Cauchy random variables). Without loss of generality, consider testing the hypothesis $\mu_0 = 0$, so that

\[ T_n = \frac{\sqrt{n}\,\bar{X}_n}{\hat{\sigma}_n} = \frac{S_n}{V_n}\left\{\frac{n-1}{\,n - (S_n/V_n)^2\,}\right\}^{1/2}, \tag{10} \]

where $S_n = \sum_{i=1}^{n} X_i$ and $V_n^2 = \sum_{i=1}^{n} X_i^2$.

In view of the above equality, if $T_n$ or $S_n/V_n$ has a limiting distribution, then so does the other, and it is well known that they coincide. In de la Pena et al. [8], a more complete historical perspective of the general theory is provided, as well as numerous applications in statistics.
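As a quick numerical sketch (illustrative data; the identity (10) is purely algebraic, so any sample works), one can check (10) directly against the usual definition of the t-statistic:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_cauchy(size=25)   # any data will do; heavy tails chosen for emphasis

n = len(x)
t_direct = np.sqrt(n) * x.mean() / x.std(ddof=1)   # sqrt(n) * xbar_n / sigma_hat_n

S = x.sum()
V = np.sqrt(np.sum(x ** 2))
r = S / V
t_identity = r * np.sqrt((n - 1) / (n - r ** 2))

print(t_direct, t_identity)   # the two expressions in (10) agree to rounding error
```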

Pseudo-maximization (Method of Mixtures)

The method of pseudo-maximization (also known as the method of mixtures) was used in de la Pena et al. [7] and is based on the following assumption. Let $(A, B)$ be an arbitrarily dependent vector of random variables with $B > 0$, and let $-\infty < \lambda < \infty$. Then the pair is said to satisfy the canonical assumption if

\[ E\exp(\lambda A - \lambda^2 B^2/2) \;\le\; 1. \tag{11} \]

The appendix provides a wide class of processes that satisfy this assumption. These include martingales, randomly stopped processes, and sums of conditionally symmetric random variables. For some applications the range of $\lambda$ can be smaller, e.g., $0 < \lambda < \lambda_0$. Note that if the integrand in $E\exp(\lambda A - \lambda^2 B^2/2)$ could be maximized over $\lambda$ inside the expectation (as can be done if $A/B^2$ is nonrandom), taking $\lambda = A/B^2$ would yield $E\exp\big(\tfrac{A^2}{2B^2}\big) \le 1$. This in turn would give the optimal Chebyshev-type bound

\[ P\Big(\frac{A}{B} > x\Big) \;\le\; \exp\Big(-\frac{x^2}{2}\Big). \tag{12} \]

However, since $A/B^2$ cannot in general be taken to be nonrandom, we need to find an alternative method for dealing with this maximization. One approach for attaining a similar effect involves integrating over a probability measure $F$ and using Fubini's theorem to interchange the order of integration with respect to $P$ and $F$:

\[ \int E\exp(\lambda A - \lambda^2 B^2/2)\, F(d\lambda) = E\int \exp(\lambda A - \lambda^2 B^2/2)\, F(d\lambda) \le \int F(d\lambda) \le 1. \tag{13} \]

To be effective for all possible pairs $(A, B)$, the $F$ chosen would need to be as uniform as possible, so as to include the maximum value of $\exp(\lambda A - \lambda^2 B^2/2)$ regardless of where it might occur. Thus some mass is certain to be assigned to and near the random value $\lambda = A/B^2$ that maximizes $\exp(\lambda A - \lambda^2 B^2/2)$. Since all uniform measures are multiples of Lebesgue measure (which is infinite), we construct a finite measure (or a sequence of finite measures) that tapers off to zero as $\lambda \to \infty$ as slowly as we can manage. One can obtain different results by changing the measure $F$.

For example, as shown in de la Pena et al. [7], by integrating over a Gaussian measure in (13) one can develop the following self-normalized exponential inequality:

\[ P\Big(|A|\big/\sqrt{B^2 + (EB)^2} \ge x\Big) \;\le\; \sqrt{2}\,\exp(-x^2/4) \tag{14} \]

for all $x > 0$. We remark that, up to constants and the term $(EB)^2$, this is almost the optimal Chebyshev-type bound, the ideal inequality (12).
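To make the mixture step concrete, here is a minimal worked instance (the choice of the standard Gaussian law for $F$ is for illustration only; the mixing measures actually used in [7] to obtain (14) may differ). Taking $F(d\lambda) = (2\pi)^{-1/2} e^{-\lambda^2/2}\, d\lambda$ in (13) and completing the square gives

\[ \int \exp(\lambda A - \lambda^2 B^2/2)\, \frac{e^{-\lambda^2/2}}{\sqrt{2\pi}}\, d\lambda = \frac{1}{\sqrt{B^2+1}}\, \exp\Big(\frac{A^2}{2(B^2+1)}\Big), \]

so (13) yields

\[ E\left[\frac{1}{\sqrt{B^2+1}}\, \exp\Big(\frac{A^2}{2(B^2+1)}\Big)\right] \;\le\; 1, \]

a "pseudo-maximized" version of $E\exp(A^2/2B^2) \le 1$ in which the random denominator has been smoothed. Tail bounds in the spirit of (14) can then be extracted via Markov's inequality, with some additional work to handle the factor $1/\sqrt{B^2+1}$.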

Under the following refinement of the canonical assumption we obtain an LIL bound. Assume that

\[ \{\exp(\lambda A_t - \lambda^2 B_t^2/2),\ t \ge 0\} \ \text{is a supermartingale with mean} \le 1. \tag{15} \]

Then

\[ \limsup_{t\to\infty} \frac{A_t}{\sqrt{2 B_t^2 \log\log B_t^2}} \;\le\; 1 \tag{16} \]

on the set $\{\lim_{t\to\infty} B_t^2 = \infty\}$.

As formalized in Lemma A.3, for conditionally symmetric increments $d_i$ we can use this result to get

\[ \limsup_{n\to\infty} \frac{\sum_{i=1}^{n} d_i}{\sqrt{2\sum_{i=1}^{n} d_i^2\, \log\log \sum_{i=1}^{n} d_i^2}} \;\le\; 1 \tag{17} \]

on the set $\{\lim_{n\to\infty} \sum_{i=1}^{n} d_i^2 = \infty\}$, which is a sharp extension of Kolmogorov's LIL without moment assumptions. In particular, the result is valid for i.i.d. centered Cauchy random variables.
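As an illustrative simulation (not a proof; the sample sizes below are arbitrary): even for centered Cauchy increments, which have no finite mean, the self-normalized ratio in (17) remains of moderate size.

```python
import numpy as np

rng = np.random.default_rng(7)
d = rng.standard_cauchy(size=1_000_000)   # i.i.d. centered Cauchy increments (no finite mean)

S = np.cumsum(d)
V2 = np.cumsum(d ** 2)
with np.errstate(invalid="ignore"):        # ignore early indices where log log is undefined
    ratio = S / np.sqrt(2.0 * V2 * np.log(np.log(V2)))

# By (17), the limsup of this self-normalized ratio is at most 1; inspect it along a grid of n.
for n in [10 ** k for k in range(2, 7)]:
    print(n, round(float(ratio[n - 1]), 3))
```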

Applications to Machine Learning

An important area of application in machine learning is the stochastic multiarmed bandit problem. Following the formulation by Kaufmann et al. [12], the model involves sequentially sampling from a set of $k$ probability distributions, called arms, each having an unknown mean. At time $t$, an arm is selected according to a sampling strategy that depends on the history of past arm selections and samples, and then a sample $X_t$ is drawn from the associated distribution. A key objective is to adjust the sampling strategy in order to maximize the expected value of the rewards gathered up to a specified time horizon $T$. In their paper Kaufmann et al. provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. In a related paper, Kaufmann and Koolen [13] establish asymptotic optimality of a class of sequential tests, generalizing the track-and-stop method to problems beyond best-arm identification. Their approach uses the method of mixtures and a self-normalized law of the iterated logarithm (see (16) above).
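As an illustrative sketch only (a generic UCB-type bandit loop, not the algorithms of [12] or [13]; the arm means, horizon, and confidence width are arbitrary choices), the following shows where confidence bounds of the kind discussed above enter a bandit strategy:

```python
import numpy as np

rng = np.random.default_rng(8)
true_means = np.array([0.3, 0.5, 0.7])    # arm means, unknown to the learner
k, horizon = len(true_means), 5_000

counts = np.zeros(k)   # number of pulls per arm
sums = np.zeros(k)     # accumulated reward per arm

def index(a, t):
    # Optimistic index: empirical mean plus a confidence width.  A simple UCB-style width
    # is used here; the methods discussed above rely on sharper self-normalized / LIL-type bounds.
    if counts[a] == 0:
        return np.inf
    return sums[a] / counts[a] + np.sqrt(2.0 * np.log(t) / counts[a])

for t in range(1, horizon + 1):
    a = int(np.argmax([index(arm, t) for arm in range(k)]))
    reward = float(rng.random() < true_means[a])   # Bernoulli reward from the chosen arm
    counts[a] += 1
    sums[a] += reward

print(counts)   # the bulk of the pulls should go to the best arm (mean 0.7)
```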

Self-normalization techniques are also being used in machine learning in the development of efficient algorithms for so-called learning to rank (LTR). The primary objective of LTR, which is used in web search and recommender systems (Liu [16]), is to optimally select a subset from a large set of documents that maximizes the satisfaction of the user. It is well known that some of the commonly proposed approaches have limited applicability, including lack of convergence in certain situations.

In a recent paper, Lattimore et al. [15] introduced an algorithm, called TopRank, with several desirable features, such as performance superior to that of many competing algorithms. A major step in the development of the proposed algorithm is the use of self-normalization principles; for example, the proof of their Lemma 6 relies on this principle in the construction of bounds on relevant quantities.

The theory of self-normalized sums has also been applied in the formulation of regression models for high-dimensional data. Notably, Belloni et al. [1] used the theory to achieve Gaussian-like results under weak conditions for a self-tuning, square-root version of the lasso method that works well with unknown scale, heteroscedasticity, and nonnormality of the error terms.

Further Applications

The tools developed have been successfully applied in diverse areas such as the extension of martingale results to infinite dimensions, including Banach spaces; self-normalized martingales; stochastic integration; empirical processes, including U-statistics and U-processes; density estimation; sequential analysis; survival analysis; efficient Monte Carlo methods; matrix completion; large deviations; robust estimation; likelihood; and Bayesian inference. See Kwapien and Woyczynski [14], de la Pena and Gine [6], and de la Pena et al. [8] for details.

More recent applications of conditionally independent decoupling include Candes and Recht [3], which deals with exact matrix completion via convex optimization. There has been a recent surge of interest in applying and developing the methods presented in this survey. In particular, Rakhlin et al. [18] use conditionally independent decoupling techniques to study sequential complexities and exponential inequalities for martingales in Banach spaces, and Makarychev and Sviridenko [17] use complete decoupling inequalities to develop stochastic optimization tools for energy-efficient routing and load balancing on parallel machines.

Concluding Remarks

As can be seen from the broad range of results and applications, it is worth looking at problems using the decoupling and self-normalization perspectives. In fact, there are still multiple open problems in these areas, including the development of new sharp algorithms in the area of machine learning.

Appendix

Lemma A.1. Let $W_t$ be a standard Brownian motion. Assume that $T$ is a stopping time such that $T < \infty$ a.s. Then for all $-\infty < \lambda < \infty$,

\[ E\exp\{\lambda W_T - \lambda^2 T/2\} \;\le\; 1. \tag{18} \]

Lemma A.2. Let $M_t$ be a continuous, square-integrable martingale with $M_0 = 0$. Then, for all $-\infty < \lambda < \infty$,

\[ E\exp\{\lambda M_t - \lambda^2 \langle M\rangle_t/2\} \;\le\; 1. \tag{19} \]

If $M_t$ is only assumed to be a continuous local martingale, the inequality remains valid (by an application of Fatou's lemma).

Lemma A.3. Let $\{d_i\}$ be a sequence of variables adapted to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_i\}$. Assume that the $d_i$'s are conditionally symmetric (i.e., $\mathcal{L}(d_i \mid \mathcal{F}_{i-1}) = \mathcal{L}(-d_i \mid \mathcal{F}_{i-1})$). Then

\[ E\exp\Big\{\lambda \sum_{i=1}^{n} d_i - \lambda^2 \sum_{i=1}^{n} d_i^2/2\Big\} \;\le\; 1 \tag{20} \]

for all $-\infty < \lambda < \infty$.
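For the reader's convenience, here is a short sketch of why conditional symmetry yields (20) (a standard argument, filled in here; it is not reproduced from the article). Conditionally on $\mathcal{F}_{i-1}$ and $|d_i|$, the sign of $d_i$ is a fair coin flip, so

\[ E\big[\exp(\lambda d_i - \lambda^2 d_i^2/2) \,\big|\, \mathcal{F}_{i-1}, |d_i|\big] = \cosh(\lambda |d_i|)\, e^{-\lambda^2 d_i^2/2} \;\le\; 1, \]

since $\cosh(u) \le e^{u^2/2}$ for all real $u$. Iterating this bound over $i = n, n-1, \dots, 1$ via the tower property gives (20).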


References

[1] Belloni A, Chernozhukov V, Wang L. Pivotal estimation via square-root Lasso in nonparametric regression, Ann. Statist. (42, no. 2): 757–788, 2014. MR3210986

[2] Burkholder DL. A geometric condition that implies the existence of certain singular integrals of Banach-space-valued functions. Conference on harmonic analysis in honor of Antoni Zygmund, Vol. I, II (Chicago, Ill., 1981), Wadsworth Math. Ser., Wadsworth, Belmont, CA; 1983: 270–286. MR730072

[3] Candes EJ, Recht B. Exact matrix completion via convex optimization, Found. Comput. Math. (9, no. 6): 717–772, 2009. MR2565240

[4] de la Pena VH. Bounds on the expectation of functions of martingales and sums of positive RVs in terms of norms of sums of independent random variables, Proc. Amer. Math. Soc. (108, no. 1): 233–239, 1990. MR990432

[5] de la Pena VH. A general class of exponential inequalities for martingales and ratios, Ann. Probab. (27, no. 1): 537–564, 1999. MR1681153

[6] de la Pena VH, Gine E. Decoupling: From Dependence to Independence, Springer, New York, 1999. MR1666908

[7] de la Pena VH, Klass MJ, Lai TL. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws, Ann. Probab. (32, no. 3A): 1902–1933, 2004. MR2073181

[8] de la Pena VH, Lai TL, Shao Q-M. Self-Normalized Processes: Limit Theory and Statistical Applications, Springer, 2009. MR2488094

[9] Gosset (Student) WS. The probable error of the mean, Biometrika (6): 1–25, 1908.

[10] Hitczenko P. Comparison of moments for tangent sequences of random variables, Probab. Theory Related Fields (78, no. 2): 223–230, 1988. MR945110

[11] Hoeffding W. Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. (58, no. 301): 13–30, 1963. MR0144363

[12] Kaufmann E, Cappe O, Garivier A. On the complexity of best-arm identification in multi-armed bandit models, J. Mach. Learn. Res. (17): 1–42, 2016. MR3482921

[13] Kaufmann E, Koolen W. Mixture martingales revisited with applications to sequential tests and confidence intervals, arXiv:1811.11419v1, 2018.

[14] Kwapien S, Woyczynski W. Random Series and Stochastic Integrals: Single and Multiple, Birkhauser, Boston, 1992. MR1167198

[15] Lattimore T, Kveton B, Li S, Szepesvari C. TopRank: A practical algorithm for online stochastic ranking, arXiv:1806.02248v2, Neural Information Processing Systems (NIPS), Montreal, Canada, 2018.

[16] Liu T. Learning to Rank for Information Retrieval, Springer, 2011.

[17] Makarychev K, Sviridenko M. Solving optimization problems with diseconomies of scale via decoupling, J. ACM (65, no. 6): Article 42, 2018. MR3882265

[18] Rakhlin A, Sridharan K, Tewari A. Sequential complexities and uniform martingale laws of large numbers, Probab. Theory Related Fields (161, no. 1–2): 111–153, 2015. MR3304748

[19] Shao Q-M. A comparison theorem on moment inequalities between negatively associated and independent random variables, J. Theoret. Probab. (13, no. 2): 343–356, 2000. MR1777538

[20] Zinn J. Comparison of martingale difference sequences. In: Probability in Banach Spaces V, Lecture Notes in Math. (1153), Springer; 1985: 453–457. MR821997

Victor H. de la Pena

Credits

Opening image is courtesy of Getty.
Photo of the author is courtesy of the author.
