

0018-9448 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2019.2903249, IEEE Transactions on Information Theory

IEEE TRANSACTIONS ON INFORMATION THEORY 1

Feeling the Bern: Adaptive Estimators for Bernoulli Probabilities of Pairwise Comparisons

Nihar B. Shah, Sivaraman Balakrishnan, Martin J. Wainwright

Abstract—We study methods for aggregating pairwise comparison data among a collection of n items with the goal of estimating the outcome probabilities for future comparisons. Working within a flexible model that only imposes a form of strong stochastic transitivity (SST), we introduce an ‘adaptivity index’ which compares the risk of our estimator to that of an oracle, over appropriate sub-models, where the oracle knows the specific sub-model in the ground truth. In addition to measuring the usual worst-case risk of an estimator, this adaptivity index also captures the extent to which the estimator adapts to instance-specific difficulty relative to an oracle estimator. First, we propose a three-step estimator termed Count-Randomize-Least squares (CRL), and show that it has adaptivity index upper bounded by √n up to logarithmic factors. We then show that, conditional on the planted clique hypothesis, no computationally efficient estimator can achieve an adaptivity index smaller than √n. Second, we show that a regularized least squares estimator can achieve a poly-logarithmic adaptivity index, thereby demonstrating a √n-gap between optimal and computationally achievable adaptivity. Finally, we prove that the standard least squares estimator, which is known to be optimally adaptive in several closely related problems, fails to adapt in the context of estimating pairwise probabilities.

Index Terms—Pairwise comparisons, adaptivity, stochastic transitivity, Bernoulli probabilities, negative result for least squares.

I. INTRODUCTION

There is an extensive literature on modeling and analyzing data in the form of pairwise comparisons between items, with much of the earliest literature focusing on applications in voting, social choice theory, and tournaments. The advent of new internet-scale applications, particularly search engine ranking [2], online gaming [3], and crowdsourcing [4], has renewed interest in ranking problems, particularly in the statistical and computational challenges that arise from the aggregation of large data sets of paired comparisons.

The problem of aggregating pairwise comparisons, which may be non-transitive and/or noisy, presents a number of core challenges, including: (i) how to produce a consensus ranking from the paired comparisons [5], [6], [7], [8]; (ii) how to estimate a notional “quality” for each of the underlying objects [9], [10], [4]; and (iii) how to estimate the probability of the outcomes of subsequent comparisons [11], [12]. In this paper, we focus on the third task—that is, the problem of estimating the probability with which one object is preferred to another. Accurate knowledge of such pairwise comparison probabilities is useful in various applications, including (in operations research) estimating the probability of a customer picking one product over another, or (in sports prediction and tournament design) estimating the probability of one team beating another.

N. B. Shah is with the Machine Learning Department and the Computer Science Department at Carnegie Mellon University, Pittsburgh, PA 15213, USA. S. Balakrishnan is with the Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213. Martin J. Wainwright is with the departments of EECS and Statistics at the University of California Berkeley, Berkeley, CA 94720, USA. Email: [email protected], [email protected], [email protected]

Manuscript received September 2, 2017; revised October 29, 2018. This paper was presented in part at the International Symposium on Information Theory [1].

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

In more detail, given a set of n items {1, . . . , n}, the paired comparison probabilities can be described by an (n × n) matrix M* whose (i, j)th entry M*_{ij} corresponds to the probability that item i beats item j. From this perspective, the problem of estimating the comparison probabilities amounts to estimating the unknown matrix M*. In practice, one expects that the pairwise comparison probabilities exhibit some form of structure, and in this paper, in line with some past work on the problem, we assume that the entries of the matrix M* satisfy the strong stochastic transitivity (SST) constraint. The SST constraint is quite flexible, and models satisfying this constraint often provide excellent fits to paired comparison data in a variety of applications. There is also a substantial body of empirical work that validates the SST assumption; for instance, see the papers [13], [14], [15] in the psychology and economics literatures. The SST constraint is considerably weaker than standard parametric assumptions that are often made in the literature, such as the Bradley-Terry-Luce [16], [17] or the Thurstone [18] models (see [12] for a more detailed comparison between the models).

On the theoretical front, some past work [11], [12] has studied the problem of estimating SST matrices in the Frobenius norm. These papers focus exclusively on the (global) minimax error, meaning that the performance of any estimator is assessed in a worst-case sense globally over the entire SST class. In our past work [12], we derived upper and lower bounds on the minimax error that are sharp up to logarithmic factors. The upper bounds are obtained via a careful analysis of the least squares estimator; however, it remains unknown whether or not this estimator can be computed in polynomial time. In the same paper, we also analyzed algorithms that are suboptimal in terms of their statistical performance, but are computationally efficient, and this analysis also includes sharper results for a singular value thresholding estimator investigated earlier in [11]. While the question of the existence of a computationally efficient estimator attaining minimax optimality remains open, in the present paper we provide a tight characterization of the tradeoffs between the statistical and computational aspects under the more stringent, local notion of error.

It is well-known that the criterion of (global) minimax can lead to a poor understanding of an estimator, especially in situations where the intrinsic difficulty of the estimation task is highly variable over the parameter space (see, for instance, the discussion and references in Donoho et al. [19]). In such situations, it can be fruitful to benchmark the risk of an estimator against that of a so-called oracle estimator that is provided with side-information about the local structure of the parameter space. Such a benchmark can be used to show that a given estimator is adaptive, in the sense that even though it is not given side-information about the problem instance, it is able to achieve lower risk for “easier” problems (e.g., see the papers [20], [21], [22] for results of this type). In this paper, we study the problem-specific difficulty of estimating a pairwise comparison matrix M* by introducing an adaptivity index that involves the size of the “indifference sets” in the matrix M*. These indifference sets, which arise in many relevant applications, correspond to subsets of items that are all equally desirable (that is, lead to a block structure in the matrix M*). Our work also makes contributions to a growing body of work (e.g., [23], [24], [25], [26]) that studies the notion of a computationally-constrained statistical risk.

The main contributions of this paper are as follows:

• We show that the risk of estimating a pairwise comparison probability matrix M* depends strongly on the size of its largest indifference set. This fact motivates us to define an adaptivity index that benchmarks the performance of an estimator relative to that of an oracle estimator that is given additional side information about the size of the indifference sets in M*. By definition, an estimator with lower values of this index is said to exhibit better adaptivity, and the oracle estimator has an adaptivity index of 1.

• We characterize the fundamental limits of adaptivity, in particular by proposing a regularized least squares estimator with a carefully chosen regularization function. With a suitable choice of regularization parameter, we prove that this estimator achieves an Õ(1) adaptivity index, which matches the best possible up to poly-logarithmic factors.¹

• We then show that, conditional on the planted clique hardness conjecture, the adaptivity index achieved by any polynomial-time algorithm must be lower bounded as Ω(√n). This result exhibits an interesting gap between the adaptivity of polynomial-time versus statistically optimal estimators.

• We propose a computationally-efficient three-step “Count–Randomize–Least squares” (CRL) estimator for estimation of SST matrices, and show that its adaptivity index is upper bounded as O(√n). Due to the aforementioned lower bound, the CRL estimator achieves the best possible adaptivity (up to logarithmic factors) among all computationally efficient estimators.

¹We augment the standard Landau notation with a tilde to mean the corresponding Landau notation up to logarithmic factors.

• Finally, we investigate the adaptivity of the standard (unregularized) least squares estimator. This estimator is found to have good, or even optimal, adaptivity in several related problems in shape-constrained estimation, and is also minimax-optimal for the problem of estimating SST matrices. We prove that, surprisingly, the adaptivity of the least squares estimator for estimating SST matrices is of the order Θ(n), which is as bad as a constant estimator that is independent of the data.

As an important corollary of our results, we show that the CRL estimator, in addition to the aforementioned adaptivity properties, also has strong guarantees in terms of the standard minimax error. Moreover, up to logarithmic factors, it attains the smallest minimax error for estimation in the SST class among known polynomial-time estimators for this problem. The SVT estimator studied in [12] also attains the same minimax rate up to logarithmic factors, but unlike the SVT estimator, the CRL estimator is guaranteed to return a matrix in the SST class.

In independent and concurrent work [27], Chatterjee and Mukherjee consider adaptivity under various alternative notions of smoothness of the true underlying pairwise-probability matrix M*. We discuss their work in Section III-C and in Appendix A.

We provide two additional related results in the appendices. In Appendix A, we show that the CRL estimator also automatically adapts to parameter-based model subclasses, such as the popular Thurstone and BTL models, and attains the minimax-optimal risk under these models. In Appendix B, we consider matrices that may only approximately follow a certain indifference set structure, and for this setting we prove “oracle inequalities” for the CRL estimator. This result imparts additional robustness to the results presented in the main body of the paper.

The remainder of this paper is organized as follows. We begin in Section II with background on the problem. Section III is devoted to the statement of our main results, as well as discussion of their consequences. In Section IV, we provide the proofs of our main results. We conclude the main body of the paper with a discussion in Section V. In the appendix, we consider a popular class of “parametric” models that fall within the SST class, and prove that the CRL estimator achieves minimax-optimal rates for estimation over this class.

Notation: Throughout this paper, we use the standard Landau order notation, so that a_n = O(b_n) means that there is a universal positive constant c such that a_n ≤ c b_n. Similarly, we write a_n = Ω(b_n) to mean that a_n ≥ c′ b_n for some universal positive constant c′, and we write a_n = Θ(b_n) when both of the relations a_n = O(b_n) and a_n = Ω(b_n) hold.

II. BACKGROUND AND PROBLEM SETTING

In this section, we provide background and a more precise problem statement.

A. Estimation from pairwise comparisons

Given a collection of n items, suppose that we arrange the paired comparison probabilities in a matrix M* ∈ [0, 1]^{n×n}, where M*_{ij} is the probability that item i is preferred to item j in a comparison between the two items. Accordingly, the upper and lower halves of M* are related by the shifted-skew-symmetry condition M*_{ji} = 1 − M*_{ij} for all i, j ∈ [n], where we assume that M*_{ii} = 0.5 for all i ∈ [n] for concreteness. In other words, the shifted matrix M* − (1/2)11^T is skew-symmetric. Here we have adopted the standard shorthand [n] := {1, 2, . . . , n}.

Suppose that we observe a random matrix Y ∈ {0, 1}^{n×n} with (upper-triangular) independent Bernoulli entries, in particular, with

P[Y_{ij} = 1] = M*_{ij} for every 1 ≤ i ≤ j ≤ n, (1)

and Y_{ji} = 1 − Y_{ij} except on the diagonal. We take the diagonal entries Y_{ii} to be 0 or 1 with equal probability, for every i ∈ [n]. The focus of this work is not to evaluate the effects of the choice of the pairs compared, but to understand the effects of the noise models. Consequently, we restrict attention to the case of a single observation per pair, keeping in mind that one may extend the results to other observation models via techniques similar to those proposed in our past work [4], [12]. Based on observing Y, our goal in this paper is to recover an accurate estimate, in the squared Frobenius norm, of the full matrix M*.
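To make the observation model concrete, the following NumPy sketch draws one such matrix Y from a given M*; the function name `sample_comparisons` is our own, not from the paper.

```python
import numpy as np

def sample_comparisons(M_star, rng=None):
    """Draw one Bernoulli observation per pair under model (1).

    Upper-triangular entries Y_ij ~ Bernoulli(M*_ij) independently,
    the lower triangle is filled via Y_ji = 1 - Y_ij, and each
    diagonal entry is set to 0 or 1 with equal probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = M_star.shape[0]
    # Shifted skew-symmetry of the model: M* + M*^T must be all ones.
    assert np.allclose(M_star + M_star.T, 1.0)
    Y = (rng.random((n, n)) < M_star).astype(float)
    iu = np.triu_indices(n, k=1)
    Y[(iu[1], iu[0])] = 1.0 - Y[iu]      # enforce Y_ji = 1 - Y_ij
    np.fill_diagonal(Y, rng.integers(0, 2, size=n))  # fair coin on diagonal
    return Y
```

A single call returns one full observation matrix; repeated calls give i.i.d. replicates of the experiment.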

We consider matrices M* that satisfy the constraint of strong stochastic transitivity (SST), which reflects the natural transitivity of any total ordering. Formally, suppose that the set of all items [n] is endowed with a total ordering π*. We use the notation π*(i) < π*(j) to indicate that item i is preferred to item j in the total ordering π*. We say that M* satisfies the SST condition with respect to the permutation π*—or that M* is π*-SST for short—if

M*_{ik} ≥ M*_{jk} for every triple (i, j, k) such that π*(i) < π*(j). (2)

The intuition underlying this constraint is as follows: since i dominates j in the true underlying order, when we make noisy comparisons, the probability that i is preferred to k should be at least as large as the probability that j is preferred to k. The class of all SST matrices is given by

C_SST := {M ∈ [0, 1]^{n×n} | M_{ba} = 1 − M_{ab} ∀ (a, b), and ∃ π such that M is π-SST}. (3)
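Condition (2) can be checked mechanically: once the rows of M are listed in the preference order given by π, every column must be nonincreasing from top to bottom. A small illustrative sketch (the helper name `is_pi_sst` is ours):

```python
import numpy as np

def is_pi_sst(M, ranks):
    """Check condition (2): M_ik >= M_jk whenever pi(i) < pi(j).

    ranks[i] is pi(i), the position of item i in the total order
    (smaller rank means more preferred).
    """
    order = np.argsort(ranks)      # items from most to least preferred
    rows = M[order, :]             # rows listed in preference order
    # Every column must be nonincreasing as we move down the rows.
    return bool(np.all(np.diff(rows, axis=0) <= 0))
```

Verifying full membership in C_SST additionally requires the shifted skew-symmetry M_ba = 1 − M_ab and a search over permutations π; the sketch only tests a given π.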

The goal of this paper is to design estimators that can estimate the true underlying matrix M* ∈ C_SST from the observed matrix Y. We note in passing that an accurate estimate of M* leads to an accurate estimate of the underlying permutation as well [12, Appendix A].

B. Indifference sets

We now turn to the notion of indifference sets, which allows for a finer-grained characterization of the difficulty of estimating a particular matrix. Suppose that the set [n] of all items is partitioned into the union of s disjoint sets {P_i}_{i=1}^s of sizes k_1, . . . , k_s such that Σ_{i=1}^s k_i = n. We let k = {k_1, . . . , k_s} denote the multiset containing the sizes of the partitions. For reasons to be clarified in a moment, we term each of these sets an indifference set. We say that a matrix M* ∈ R^{n×n} respects the indifference set partition {P_i}_{i=1}^s if

M*_{ij} = M*_{i′j′}

for all quadruples (i, j, i′, j′) such that i and i′ lie in the same indifference set, and j and j′ lie in the same indifference set.²

Fig. 1. A task on the Amazon Mechanical Turk crowdsourcing platform where the worker is shown two points on an image and is asked to choose the point at greater depth. One can see that there are various sets of indistinguishable points in the image; images often have one such large indifference set comprising the background.

As an example, in the special case of a two-contiguous-block partition, the matrix M* must have a (2 × 2) block structure, with all entries equaling 1/2 in the two diagonal blocks, all entries equaling α ∈ [0, 1] in the upper right block, and equaling (1 − α) in the lower left block. Intuitively, matrices with this type of block structure should be easier to estimate.
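As an illustration, the two-block matrix described above can be constructed as follows (a hypothetical helper, not code from the paper):

```python
import numpy as np

def two_block_sst(n1, n2, alpha):
    """Build the (2 x 2)-block matrix from the text: the first n1 items
    form one indifference set and the remaining n2 items another, with
    every entry in the upper-right block equal to alpha."""
    n = n1 + n2
    M = np.full((n, n), 0.5)      # both diagonal blocks are all 1/2
    M[:n1, n1:] = alpha           # upper-right block
    M[n1:, :n1] = 1.0 - alpha     # lower-left block
    return M
```

For alpha ≥ 1/2 the identity ordering satisfies the SST condition (2); for alpha < 1/2 the reversed ordering does.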

Indifference sets arise in various applications of ranking: for instance, in buying cars, frugal customers may be indifferent between high-priced cars; or in ranking news items, people from a certain country may be indifferent to the domestic news from other countries. They also arise in crowdsourcing tasks; one such example is illustrated in Figure 1. Pairwise comparisons are now increasingly being used in peer evaluations [28], [29], and there is evidence of the existence of indifference sets (or equivalence classes) in these evaluation-based applications [30].

Block structures of this type are also studied in a number of other matrix estimation problems, in which contexts they have been termed communities, blocks, or level sets, depending on the application under consideration. Each of these applications has a long line of work, including isotonic regression [31], [32], sparse PCA [33], [34], [35], [36], community detection [37], [38], [39], and other lines of work on estimating or detecting matrices with hidden blocks [24], [40], [41], [42].

²Note that we do not impose a restriction of the form M*_{ij} ≠ M*_{i′j′} if either of the pairs (i, i′) or (j, j′) belongs to different indifference sets. All of our results, with minor modifications, carry over even under this stricter notion of adherence to indifference set partitions.


Given the number of partitions s and their size multiset k = {k_1, . . . , k_s}, we let C_SST(s, k) denote the subset of C_SST comprising all SST matrices that respect some indifference set partition {P_i}_{i=1}^s of sizes k. With a slight abuse of notation, for any multiset k, we denote the size of the largest indifference set as ‖k‖_∞ := max_{i ∈ {1,...,s}} k_i. This quantity plays an important role in our analysis. For any positive integer k_0, we also use the notation C_SST(k_0) to denote all SST matrices that have at least one indifference set of size at least k_0, that is,

C_SST(k_0) := ∪_{‖k‖_∞ ≥ k_0} C_SST(s, k).

Finally, for any matrix M ∈ C_SST, we let k_max(M) denote the size of the largest indifference set in M.
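Since items in a common indifference set have identical rows in M, the quantity k_max(M) can be computed by counting the largest group of identical rows; the following sketch (our own helper, not the paper's notation) does exactly that:

```python
import numpy as np

def kmax(M):
    """Size of the largest indifference set of M, recovered by counting
    the largest group of identical rows (items in the same indifference
    set share a common row of comparison probabilities)."""
    _, counts = np.unique(M, axis=0, return_counts=True)
    return int(counts.max())
```

For the all-half matrix this returns n, corresponding to the degenerate single-indifference-set case.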

C. An oracle estimator and the adaptivity index

We begin by defining a benchmark based on the performance of the best estimator that has side-information that M* ∈ C_SST(s, k), along with the values of (s, k). We evaluate any such estimator M̂(s, k) based on its mean-squared Frobenius error

E[|||M̂(s, k) − M*|||²_F] = E[ Σ_{i,j=1}^n ( M̂_{ij}(s, k) − M*_{ij} )² ], (4)

where the expectation is taken with respect to the random matrix Y ∈ {0, 1}^{n×n} of noisy comparisons and also with respect to any randomness in the estimator M̂(s, k). With this notation, the (s, k)-oracle risk is given by

R_n(s, k) := inf_{M̂(s,k)} sup_{M* ∈ C_SST(s,k)} (1/n²) E[|||M̂(s, k) − M*|||²_F], (5)

where the infimum is taken over all measurable functions M̂(s, k) of the data Y. Note that this oracle estimator M̂(s, k) has access to the number and sizes of the indifference sets, but not to the identities of which items belong to which indifference sets.

For a given estimator M̂ that does not know the values of (s, k), we can then compare its performance to this benchmark via the (s, k)-adaptivity index given by

α_n(M̂; s, k) := [ sup_{M* ∈ C_SST(s,k)} (1/n²) E[|||M̂ − M*|||²_F] ] / R_n(s, k). (6a)

The adaptivity index α_n(M̂) of an estimator M̂ is the worst-case value

α_n(M̂) := max_{s,k : ‖k‖_∞ < n} α_n(M̂; s, k). (6b)

In this definition, we restrict the maximum to the range ‖k‖_∞ < n since in the (degenerate) case of ‖k‖_∞ = n, the only valid matrix M* is the all-half matrix, and hence an estimator with knowledge of the parameters trivially achieves zero error.

Given these definitions, the goal is to construct estimators that are computable in polynomial time and possess a low adaptivity index. Finally, we note that an estimator with a low adaptivity index also achieves a good worst-case risk: any estimator M̂ with adaptivity index α_n(M̂) ≤ γ is minimax-optimal within a factor γ.
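The numerator of the adaptivity index (6a) is a mean-squared Frobenius error, which for a fixed M* can be approximated by simulation under the observation model (1). A rough Monte Carlo sketch, assuming an `estimator` callable that maps an observation matrix Y to an estimate of M* (the function name and trial count are our choices):

```python
import numpy as np

def frobenius_risk(estimator, M_star, trials=200, seed=0):
    """Monte Carlo estimate of (1/n^2) E|||Mhat - M*|||_F^2 for a fixed
    M*, simulating Y under the observation model (1) in each trial."""
    rng = np.random.default_rng(seed)
    n = M_star.shape[0]
    total = 0.0
    for _ in range(trials):
        Y = (rng.random((n, n)) < M_star).astype(float)
        iu = np.triu_indices(n, k=1)
        Y[(iu[1], iu[0])] = 1.0 - Y[iu]          # Y_ji = 1 - Y_ij
        np.fill_diagonal(Y, rng.integers(0, 2, size=n))
        total += np.sum((estimator(Y) - M_star) ** 2)
    return total / (trials * n * n)
```

Taking a supremum over M* ∈ C_SST(s, k), as in (6a), would additionally require a search over that class; the sketch only handles one instance.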

III. MAIN RESULTS

In this section, we present the main results of this paper on both statistical and computational aspects of the adaptivity index. We begin with an auxiliary result on the risk of the oracle estimator, which is useful for our subsequent analysis.

A. Risk of the oracle estimator

Recall from Section II-C that the oracle estimator has access to additional side information on the number s and the sizes k = {k_1, . . . , k_s} of the indifference sets of the true underlying matrix M*. The oracle estimator is defined as the estimator that achieves the lowest possible risk (5) among all such estimators. The following result provides tight bounds on the risk of the oracle estimator.

Proposition 1. There are positive universal constants c_ℓ and c_u such that the (s, k)-oracle risk (5) is sandwiched as

c_ℓ (n − ‖k‖_∞) / n² ≤ R_n(s, k) ≤ c_u (n − ‖k‖_∞ + 1) (log n)² / n². (7)

Proposition 1 provides a characterization of the minimax risk of estimation for various subclasses of C_SST. Remarkably, the minimax risk depends only on the size ‖k‖_∞ of the largest indifference set: given this value, it is affected neither by the number of indifference sets s nor by their sizes k. This property is in sharp contrast to known results for the related problem of bivariate isotonic regression [31], in which the number s of indifference sets plays a strong role. As a consequence of this disparity, there are significant differences between the adaptive rates obtained for bivariate isotonic regression and the rates we obtain in the SST setting. In bivariate isotonic regression, faster rates are obtained when the number of indifference sets s is much smaller than n, whereas for the SST problem we obtain faster rates when the largest indifference set is sufficiently large, that is, when ‖k‖_∞ = n − o(n).

Note that when ‖k‖_∞ < n, we have (1/2)(n − ‖k‖_∞ + 1) ≤ (n − ‖k‖_∞), and consequently the lower bound in (7) can be replaced by (c_ℓ/2)(n − ‖k‖_∞ + 1)/n².

B. Fundamental limits on adaptivity

Proposition 1 provides a sharp characterization of the denominator in the adaptivity index (6a). In this section, we investigate the fundamental limits of adaptivity by studying the numerator while disregarding computational constraints. The main result of this section is to show that a suitably regularized form of least-squares estimation has optimal adaptivity up to logarithmic factors.

More precisely, recall that kmax(M) denotes the size of thelargest indifference set in the matrix M . Given the observedmatrix Y , consider the M -estimator

$$\widehat{M}_{\mathrm{REG}} \in \arg\min_{M \in C_{\mathrm{SST}}} \Big\{ |||M - Y|||_F^2 - k_{\max}(M)(\log n)^3 \Big\}. \quad (8)$$


Here the inclusion of the term −kmax(M), along with its logarithmic weight, serves to “reward” the estimator for returning a matrix with a relatively large maximum indifference set. As our later analysis in Section III-E will clarify, the inclusion of this term is essential: the unregularized form of least squares has very poor adaptivity properties.
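For intuition, the quantity kmax(M) appearing in the regularizer can be computed directly from a matrix: under the SST model, two items lie in the same indifference set exactly when their rows of comparison probabilities coincide. A minimal sketch (the function name and row-grouping criterion are our own illustration, not notation from the paper):

```python
import numpy as np

def kmax(M, tol=1e-9):
    """Size of the largest indifference set of M: the largest group
    of items whose rows of comparison probabilities coincide."""
    n = M.shape[0]
    counts = []
    seen = np.zeros(n, dtype=bool)
    for i in range(n):
        if seen[i]:
            continue
        # items whose entire row matches row i are indifferent to item i
        group = np.all(np.abs(M - M[i]) <= tol, axis=1)
        seen |= group
        counts.append(int(group.sum()))
    return max(counts)

# An SST matrix with indifference sets {0,1,2} and {3}:
M = np.array([[0.5, 0.5, 0.5, 0.8],
              [0.5, 0.5, 0.5, 0.8],
              [0.5, 0.5, 0.5, 0.8],
              [0.2, 0.2, 0.2, 0.5]])
```

Here `kmax(M)` returns 3, since items 0, 1, 2 have identical rows.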

The following theorem provides an upper bound on the estimation error and the adaptivity of the estimator MREG.

Theorem 1. There are universal constants $c_u$ and $c'_u$ such that for every M∗ ∈ CSST, the regularized least squares estimator (8) has squared Frobenius error at most

$$\frac{1}{n^2} |||M^* - \widehat{M}_{\mathrm{REG}}|||_F^2 \;\le\; c_u \, \frac{n - k_{\max}(M^*) + 1}{n^2} \,(\log n)^3, \quad (9a)$$

with probability at least $1 - e^{-\frac{1}{2}(\log n)^2}$. Consequently, its adaptivity index is upper bounded as

$$\alpha_n(\widehat{M}_{\mathrm{REG}}) \;\le\; c'_u (\log n)^3. \quad (9b)$$

Since the adaptivity index of any estimator is at least 1 by definition, we conclude that the regularized least squares estimator MREG is optimal up to logarithmic factors.

Notice that the optimization problem (8) defining the regularized least squares estimator MREG is non-trivial to solve: it involves both a nonconvex regularizer and a nonconvex constraint set. We shed light on the intrinsic complexity of computing this estimator in Section III-D, where we establish a lower bound on the adaptivity index achievable by estimators that are computable in polynomial time.

Dependence on largest indifference set. Our results show that the risk (7), (9) of estimating M∗ under the SST model depends only on the size of the largest indifference set, and is independent of the number and sizes of the other indifference sets. Furthermore, the rate for estimating SST matrices improves only when the largest indifference set is quite large.

From a practical perspective, large indifference sets are often the norm rather than the exception in many applications. For instance, in buying cars, frugal customers may be indifferent among all of the high-priced cars (and, conversely, affluent customers among all of the low-priced ones), so that these sets of cars form large indifference sets. Or, for example, in crowdsourced depth-recognition tasks, people are shown an image and asked to select the one of two specified points that is at a greater depth; the pixels in the image representing the same depth form indifference sets, with the background usually forming a very large indifference set.

On the theoretical side, by focusing on problems with large indifference sets we are able to uncover several interesting phenomena. We establish a connection with the planted clique conjecture, which in turn provides a (conditional) lower bound on the adaptivity of computationally tractable estimators. This may help shed some light on the hardness of related problems [26], [43] on estimation involving permutations (Section III-D). Furthermore, we uncover a surprising negative result: the least squares estimator (Section III-E) fails to adapt to the intrinsic difficulty of problems with large indifference sets. This result sheds new light on the adaptivity properties of the least-squares estimator over non-convex sets, and shows a stark contrast to its behavior over convex sets.

C. The CRL estimator

In this section, we propose an estimator that is computable in polynomial time, which we term the Count-Randomize-Least-Squares (CRL) estimator, and prove an upper bound on its adaptivity index. As we discuss subsequently, this upper bound on the adaptivity index also translates to an upper bound on the minimax risk of the CRL estimator. Unlike the SVT estimator studied in [12], the CRL estimator is guaranteed to return a matrix in the SST class. Moreover, our guarantees for the CRL estimator match the best known error guarantees (up to logarithmic factors) for polynomial-time estimation of SST matrices.

In order to define the CRL estimator, we require some additional notation. For any permutation π on n items, let CSST(π) ⊆ CSST denote the set of all SST matrices that are faithful to the permutation π, that is,

$$C_{\mathrm{SST}}(\pi) := \big\{ M \in [0,1]^{n \times n} \;\big|\; M_{ba} = 1 - M_{ab} \;\; \forall\, (a,b), \;\; M_{ik} \ge M_{jk} \;\; \forall\, i, j, k \in [n] \text{ s.t. } \pi(i) < \pi(j) \big\}. \quad (10)$$

One can verify that the sets CSST(π), taken over all permutations π on n items, together comprise the SST class CSST.
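To make the definition concrete, here is a small sketch (our own illustration, not code from the paper) that checks whether a matrix belongs to CSST(π) in the sense of (10):

```python
import numpy as np

def in_csst(M, pi, tol=1e-9):
    """Check M in C_SST(pi): entries in [0,1], skew symmetry
    M[b,a] = 1 - M[a,b], and M[i,k] >= M[j,k] whenever pi(i) < pi(j)."""
    if np.any(M < -tol) or np.any(M > 1 + tol):
        return False
    if not np.allclose(M + M.T, 1.0, atol=tol):
        return False
    order = np.argsort(pi)          # items from highest-ranked to lowest
    R = M[order]                    # rows sorted by rank
    # every column must be non-increasing down the rank-sorted rows
    return bool(np.all(np.diff(R, axis=0) <= tol))

pi = np.array([0, 1, 2])            # identity ranking: item 0 strongest
M = np.array([[0.5, 0.6, 0.9],
              [0.4, 0.5, 0.7],
              [0.1, 0.3, 0.5]])
```

With this M, `in_csst(M, pi)` is True for the identity ranking but False for the reversed ranking `[2, 1, 0]`.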

The CRL estimator acts on the observed matrix Y and outputs an estimate MCRL ∈ CSST via a three-step procedure:

Step 1 (Count): For each i ∈ [n], compute the total number $N_i = \sum_{j=1}^{n} Y_{ij}$ of pairwise comparisons that item i wins. Order the n items in terms of $\{N_i\}_{i=1}^{n}$, with ties broken arbitrarily.

Step 2 (Randomize): Find the largest subset of items S such that $|N_i - N_j| \le \sqrt{n}\,\log n$ for all i, j ∈ S. Taking the ordering computed in Step 1, permute this (contiguous) subset of items uniformly at random within the subset. Denote the resulting permutation as $\widehat{\pi}_{\mathrm{CRL}}$.

Step 3 (Least squares): Compute the least squares estimate assuming that the permutation $\widehat{\pi}_{\mathrm{CRL}}$ is the true permutation of the items:

$$\widehat{M}_{\mathrm{CRL}} \in \arg\min_{M \in C_{\mathrm{SST}}(\widehat{\pi}_{\mathrm{CRL}})} |||Y - M|||_F^2. \quad (11)$$

It is not hard to see that computing the first two steps of the algorithm requires at most order $n^2$ computational complexity. The optimization problem (11) in the third step corresponds to a projection onto the polytope of bi-isotone matrices contained within the hypercube [0, 1]^{n×n}, along with skew-symmetry constraints. Problems of the form (11) have been studied in past work [44], [45], [11], [46], and the estimator MCRL is indeed computable in polynomial time. By construction, the estimator MCRL is agnostic to the values of (s, k).
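Steps 1 and 2 are straightforward to implement; a minimal sketch follows (variable names are our own, and Step 3's bi-isotonic projection is omitted, since it requires a dedicated solver). Since the counts are sorted, the largest subset with pairwise gaps at most the threshold is a contiguous run, which a sliding window finds in linear time:

```python
import numpy as np

def crl_permutation(Y, rng=None):
    """Steps 1 (Count) and 2 (Randomize) of the CRL estimator:
    order items by win counts, then shuffle the largest contiguous
    block of items whose counts lie within sqrt(n)*log(n)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = Y.shape[0]
    N = Y.sum(axis=1)                    # win counts N_i
    order = np.argsort(-N)               # items sorted by wins (ties arbitrary)
    sorted_N = N[order]
    thresh = np.sqrt(n) * np.log(n)
    # sliding window over the sorted counts
    best_len, best_start, lo = 1, 0, 0
    for hi in range(n):
        while sorted_N[lo] - sorted_N[hi] > thresh:
            lo += 1
        if hi - lo + 1 > best_len:
            best_len, best_start = hi - lo + 1, lo
    block = order[best_start:best_start + best_len].copy()
    rng.shuffle(block)                   # randomize within the block
    order[best_start:best_start + best_len] = block
    pi = np.empty(n, dtype=int)
    pi[order] = np.arange(n)             # pi[i] = rank of item i (0 = best)
    return pi
```

The returned permutation would then be passed to the least-squares projection of Step 3.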

The second step, involving randomization, serves to discard “non-robust” information from the ordering computed in Step 1. To clarify our choice of threshold $\sqrt{n}\,\log n$ in Step 2: the factor $\sqrt{n}$ corresponds to the standard deviation of a typical win count $N_i$ (as a sum of Bernoulli variables), whereas the $\log n$ serves to control fluctuations in a union bound. An ordering of items whose counts are within this threshold is likely to arise from the noise due to the Bernoulli sampling process, as opposed to structural information about the matrix. If we do not perform this second


step—effectively retaining considerable bias from Step 1—then the isotonic regression procedure in Step 3 may amplify it, leading to a poorly performing estimator. In particular, the randomization step helps the estimator adapt to the situation in which there is a large indifference set of size at least n/2. Such situations arise in various practical applications, for instance, in depth recognition via crowdsourcing. In this application, the n items are pixels of an image, and workers in crowdsourcing compare pairs of points (pixels) and choose the one that seems closer.

The following theorem provides an upper bound on the adaptivity index achieved by the CRL estimator.

Theorem 2. There are universal constants $c_u$ and $c'_u$ such that for every M∗ ∈ CSST, the CRL estimator MCRL has squared Frobenius norm error at most

$$\frac{1}{n^2} |||\widehat{M}_{\mathrm{CRL}} - M^*|||_F^2 \;\le\; c_u \, \frac{n - k_{\max}(M^*) + 1}{n^{3/2}} \,(\log n)^8, \quad (12a)$$

with probability at least $1 - n^{-20}$. Consequently, its adaptivity index is upper bounded as

$$\alpha_n(\widehat{M}_{\mathrm{CRL}}) \;\le\; c'_u \sqrt{n} \,(\log n)^8. \quad (12b)$$

We now augment the guarantees on the CRL estimator given by Theorem 2 with an upper bound on its minimax (worst-case) risk.

Corollary 1. For every M∗ ∈ CSST, the CRL estimator has mean-squared Frobenius error upper bounded as

$$\frac{1}{n^2} |||\widehat{M}_{\mathrm{CRL}} - M^*|||_F^2 \;\le\; c_u \, \frac{(\log n)^8}{\sqrt{n}}, \quad (13)$$

with probability at least $1 - 2n^{-20}$, where $c_u$ is a universal constant.

A few remarks are in order. First, up to logarithmic factors, the upper bound (13) matches the best known upper bound on the minimax rate of polynomial-time estimators, previously known [12, Theorem 2] to be attained by the singular value thresholding estimator. The singular value thresholding estimator investigated in past works [11], [12] does not guarantee an estimate in the SST class, whereas the CRL estimate is guaranteed to lie in the SST class. Second, if one is concerned only with attaining the upper bound (13) on the worst-case error, then the randomization step in the CRL estimator is unnecessary: the Count and Least-squares steps alone suffice to achieve this bound.

In Appendix A of this paper, we consider a popular class of “parametric” models for pairwise comparisons. We show that the CRL estimator is minimax optimal over this model class.

The guarantees for the CRL estimator in Theorem 2 assume that the matrix M∗ belongs to the class CSST. In Appendix B we develop an oracle inequality which guarantees that similar results can be obtained more generally, even when M∗ is not in the SST class. This oracle inequality establishes, for instance, that the CRL estimator obtains favorable rates when M∗ is close, in the Frobenius norm, to an SST matrix with a large indifference set.

Finally, in an independent and concurrent work [27], Chatterjee and Mukherjee consider adaptivity in estimating the matrix M∗ in the SST class. They propose an estimator similar to the CRL estimator of the present paper (without the randomize step), and under various notions of smoothness they establish the adaptive rates of this estimator. For each of these smoothness notions, their results guarantee a risk of order $O(\tfrac{1}{n})$ or higher, whereas the CRL estimator of the present paper can adapt and guarantee an error as low as order $O(\tfrac{1}{n^{3/2}})$ when the true matrix M∗ is smooth enough.

D. A lower bound on adaptivity for polynomial-time algorithms

By comparing the guarantee (12b) for the CRL estimator with the corresponding guarantee (9b) for the regularized least-squares estimator, we see that (apart from logarithmic factors and constants) their adaptivity indices differ by a factor of $\sqrt{n}$. Given this polynomial gap, it is natural to wonder whether our analysis of the CRL estimator might be improved, or, if not, whether there is another polynomial-time estimator with a lower adaptivity index than the CRL estimator. In this section, we answer both of these questions in the negative, at least conditionally on a certain well-known conjecture in average-case complexity theory.

More precisely, we prove a lower bound that relies on the average-case hardness of the planted clique problem [47], [48]. The use of this conjecture as a hardness assumption is widespread in the literature [49], [50], [51], and there is now substantial evidence supporting the conjecture [47], [52], [53], [54]. It has also been used as a tool in proving hardness results for sparse PCA and related matrix recovery problems [23], [24].

Informally, the planted clique conjecture asserts that it is hard to detect the presence of a planted clique in an Erdős–Rényi random graph. In order to state it more precisely, let G(n, κ) be a random graph on n vertices constructed in one of the following two ways:

H0: Every edge is included in G(n, κ) independently with probability 1/2.

H1: Every edge is included in G(n, κ) independently with probability 1/2. In addition, a set of κ vertices is chosen uniformly at random, and all edges with both endpoints in the chosen set are added to G.

The planted clique conjecture then asserts that when κ = o(√n), there is no polynomial-time algorithm that can correctly distinguish between H0 and H1 with an error probability strictly bounded below 1/2.

Using this conjectured hardness as a building block, we have the following result:
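The two distributions H0 and H1 are easy to simulate; a minimal sketch (the function name and interface are our own illustration):

```python
import numpy as np

def sample_graph(n, kappa=None, rng=None):
    """Adjacency matrix of G(n, kappa): Erdos-Renyi(1/2) under H0
    (kappa=None), with a planted kappa-clique added under H1."""
    rng = np.random.default_rng(0) if rng is None else rng
    A = rng.integers(0, 2, size=(n, n))
    A = np.triu(A, 1)                 # keep i < j entries, zero diagonal
    A = A + A.T                       # symmetrize
    if kappa is not None:             # H1: plant a clique on kappa vertices
        S = rng.choice(n, size=kappa, replace=False)
        A[np.ix_(S, S)] = 1
        A[S, S] = 0                   # no self-loops
    return A
```

The conjecture asserts that for κ = o(√n), no polynomial-time test applied to such a matrix can reliably tell the two cases apart.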

Theorem 3. Suppose that the planted clique conjecture holds. Then there is a universal constant $c_\ell > 0$ such that for any polynomial-time computable estimator $\widehat{M}$, its adaptivity index is lower bounded as

$$\alpha_n(\widehat{M}) \;\ge\; c_\ell \sqrt{n} \,(\log n)^{-3}.$$

Together, the upper and lower bounds of Theorems 2 and 3 imply that the estimator MCRL achieves the optimal adaptivity index (up to logarithmic factors) among all computationally efficient estimators.


E. Negative result for the least squares estimator

In this section, we study the adaptivity of the (unregularized) least squares estimator given by

$$\widehat{M}_{\mathrm{LS}} \in \arg\min_{M \in C_{\mathrm{SST}}} |||Y - M|||_F^2. \quad (14)$$

Least squares estimators of this type are known to possess very good properties in various contexts closely related to the setting of this paper:
• They exhibit excellent adaptivity in various other problems of shape-constrained estimation. See the papers [55], [32], [56], [31], [57] and references therein for various examples of such phenomena. In particular, the least squares estimator adapts very well for estimation over the class of bivariate monotone matrices, which is a subset of the SST class where the underlying permutation is fixed and known.
• The least squares estimator (14) is also minimax optimal for estimating SST matrices, as shown in our own past work [12].
Given this context, the negative result in the following theorem is surprising, in that it shows that the least-squares estimator (14) has remarkably poor adaptivity:

Theorem 4. There is a universal constant $c_\ell > 0$ such that the adaptivity index of the least squares estimate (14) is lower bounded as

$$\alpha_n(\widehat{M}_{\mathrm{LS}}) \;\ge\; c_\ell \, n \,(\log n)^{-2}. \quad (15)$$

In order to understand why the lower bound (15) is very strong, consider the trivial estimator $\widehat{M}_0$ that simply ignores the data and returns the constant matrix $\widehat{M}_0 = \tfrac{1}{2}\mathbf{1}\mathbf{1}^T$. It can be verified that we have

$$\frac{1}{n^2} |||M^* - \tfrac{1}{2}\mathbf{1}\mathbf{1}^T|||_F^2 \;\le\; \frac{3n\,(n - k_{\max}(M^*) + 1)}{n^2}$$

for every M∗ ∈ CSST. Thus, for this trivial estimator $\widehat{M}_0$, we have $\alpha_n(\widehat{M}_0) \le c_u n$. Comparing to the lower bound (15), we see that, apart from logarithmic factors, the adaptivity of the least squares estimator is no better than that of the trivial estimator $\widehat{M}_0$.³
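The stated bound for the trivial estimator can be checked numerically on a small SST instance (the matrix below is our own example, with two indifference sets of sizes 3 and 1):

```python
import numpy as np

# SST matrix with indifference sets {0,1,2} and {3}: kmax = 3, n = 4
M_star = np.array([[0.5, 0.5, 0.5, 0.8],
                   [0.5, 0.5, 0.5, 0.8],
                   [0.5, 0.5, 0.5, 0.8],
                   [0.2, 0.2, 0.2, 0.5]])
n, kmax = 4, 3
lhs = np.sum((M_star - 0.5) ** 2) / n ** 2   # (1/n^2) ||M* - (1/2)11^T||_F^2
rhs = 3 * n * (n - kmax + 1) / n ** 2        # claimed upper bound
```

Here lhs ≈ 0.034 while rhs = 1.5, consistent with the bound.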

The negative result of Theorem 4 is shown via a construction in which M∗ is an (almost) constant matrix. The fact that an almost constant matrix is a hard instance for ordinary least squares stands in sharp contrast to a number of other settings in which the least squares estimator is known to adapt best when the underlying target of estimation is constant-valued [55], [32], [56], [31], [57]. The disparate behavior of the least squares estimator may be explained by observing that, while the underlying target of estimation in the problem settings of earlier works was assumed to lie in a convex set, the class of SST matrices constitutes a non-convex set. This non-convexity is induced by the presence of an unknown permutation, and this added complexity causes the least squares estimator to overfit to a noise-driven permutation.

³Of course, the trivial estimator $\widehat{M}_0$ incurs high error when the true matrix M∗ is far from the all-half matrix. Our regularized least squares estimator MREG proposed in Section III-B achieves the best of both worlds.

IV. PROOFS

In this section, we present the proofs of all our results. We note in passing that our proofs additionally lead to some auxiliary results that may be of independent interest. These auxiliary results pertain to the problem of bivariate isotonic regression—that is, estimating M∗ when the underlying permutation is known—which is itself an important problem in the field of shape-constrained estimation [45], [58], [11]. Prior works restrict attention to the expected error and assume that the underlying permutation is correctly specified; our results provide exponential tail bounds and also address settings in which the permutation is misspecified.

A few comments about assumptions and notation are in order. In all of our proofs, so as to avoid degeneracies, we assume that the number of items n is greater than a universal constant. (The cases in which n is smaller than some universal constant all follow by suitably adjusting the pre-factors in our results.) For any matrix M, we use kmax(M) to denote the size of the largest indifference set in M, and we define k∗ = kmax(M∗). The notation c, c1, cu, c_ℓ, etc. all denotes positive universal constants. For any two square matrices A and B of the same size, we let $\langle\langle A, B \rangle\rangle = \mathrm{trace}(A^T B)$ denote their trace inner product. For an (n × n) matrix M and any permutation π on n items, we let π(M) denote the (n × n) matrix obtained by permuting the rows and columns of M by π. For a given class C of matrices, metric ρ, and tolerance ε > 0, we use N(ε, C, ρ) to denote the ε-covering number of the class C in the metric ρ. The metric entropy is the logarithm of the covering number—namely, log N(ε, C, ρ).

It is also convenient to introduce a linearized form of the observation model (1). Observe that we can write the observation matrix Y in a linearized fashion as

$$Y = M^* + W, \quad (16a)$$

where $W \in [-1, 1]^{n \times n}$ is a random matrix with independent zero-mean entries for every i > j, and $W_{ji} = -W_{ij}$ for every i < j. For every i ≥ j, its entries follow the distribution

$$W_{ij} \sim \begin{cases} 1 - M^*_{ij} & \text{with probability } M^*_{ij} \\ -M^*_{ij} & \text{with probability } 1 - M^*_{ij}. \end{cases} \quad (16b)$$

Note that all entries of the matrix W above the main diagonal are independent, zero-mean, and uniformly bounded by 1 in absolute value. This fact plays an important role in several parts of our proofs.
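Under the linearization (16), the observation matrix Y can be sampled entrywise above the diagonal and filled in by skew symmetry, after which the noise W = Y − M∗ is recovered; a minimal sketch (function name is our own):

```python
import numpy as np

def sample_Y(M_star, rng=None):
    """Sample Y_ij ~ Bernoulli(M*_ij) for i < j, set Y_ji = 1 - Y_ij
    and Y_ii = 1/2, matching the linearized model Y = M* + W."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = M_star.shape[0]
    Y = np.full((n, n), 0.5)
    iu = np.triu_indices(n, k=1)
    Y[iu] = rng.random(len(iu[0])) < M_star[iu]   # Bernoulli draws (cast 0/1)
    Y[(iu[1], iu[0])] = 1.0 - Y[iu]               # skew symmetry below diagonal
    return Y

M_star = np.full((4, 4), 0.5)
Y = sample_Y(M_star)
W = Y - M_star
```

By construction W + Wᵀ = 0, each above-diagonal entry of W takes the two values in (16b), and all entries are bounded by 1 in absolute value.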

A. A general upper bound on regularized M -estimators

In this section, we prove a general upper bound that applies to a relatively broad class of regularized M-estimators for SST matrices. Given a matrix Y generated from the model (16a), consider an estimator of the form

$$\widehat{M} \in \arg\min_{M \in C} \Big\{ |||Y - M|||_F^2 + \lambda(M) \Big\}. \quad (17)$$

Here $\lambda : [0,1]^{n \times n} \to \mathbb{R}_+$ is a regularization function to be specified by the user, and $C \subseteq [0,1]^{n \times n}$ is some set of [0, 1]-valued matrices. Our goal is to derive a high-probability bound


on the Frobenius norm error $|||\widehat{M} - M^*|||_F$ for any M∗ ∈ C. As is well known from the theory of M-estimators [59], [60], [61], doing so requires studying the empirical process in a localized sense.

In order to do so, it is convenient to consider sets of the form

$$C_{\mathrm{DIFF}}(M^*, t, C) := \big\{ \alpha(M - M^*) \;\big|\; M \in C, \; \alpha \in [0,1], \; |||\alpha(M - M^*)|||_F \le t \big\},$$

where t ∈ [0, n]. In the analysis to follow, we assume that for each $\varepsilon \ge n^{-8}$, the ε-metric entropy of $C_{\mathrm{DIFF}}(M^*, t, C)$ satisfies an upper bound of the form

$$\log N(\varepsilon, C_{\mathrm{DIFF}}(M^*, t, C), |||\cdot|||_F) \;\le\; \frac{t^{2b}\,(g(M^*))^2}{\varepsilon^2} + (h(M^*))^2, \quad (18a)$$

where $g : \mathbb{R}^{n \times n} \to \mathbb{R}_+$ and $h : \mathbb{R}^{n \times n} \to \mathbb{R}_+$ are some functions, and $b \in \{0, 1\}$ is a binary value. In the sequel, we provide concrete examples for which a bound of this form holds.

Given the quadruplet (g, h, b, λ), we then define a critical radius $\delta_n \ge 0$ via

$$\delta_n^2 = c \Big( \big( g(M^*) \log n \big)^{1+b} + (h(M^*))^2 + \lambda(M^*) + n^{-7} \Big), \quad (18b)$$

where c > 0 is a universal constant. The following result, to be proven in this section, guarantees that the Frobenius norm error can be controlled by the square of this critical radius:

where c > 0 is a universal constant. The following result, tobe proven in this section, guarantees that the Frobenius normcan be controlled by the square of this critical radius:

Lemma 1. For any set C satisfying the metric entropybound (18a), and any M∗ ∈ C, the Frobenius norm errorof the estimator (17) can be controlled as

P[|||M −M∗|||2F > uδ2

n

]≤ e−uδ

2n for all u ≥ 1, (19)

where δn is the critical radius (18b).

The significance of this claim is that it reduces the problem of controlling the error of the M-estimator to bounding the metric entropy (as in equation (18a)) and then computing the critical radius (18b). The remainder of this section is devoted to the proof of this claim.

1) Proof of Lemma 1: Define the difference $\widehat{\Delta} = \widehat{M} - M^*$ between M∗ and the optimal solution $\widehat{M}$ of the constrained least-squares problem (17). Since $\widehat{M}$ is optimal and M∗ is feasible for the optimization problem (17), we have

$$|||Y - \widehat{M}|||_F^2 + \lambda(\widehat{M}) \;\le\; |||Y - M^*|||_F^2 + \lambda(M^*).$$

Following some algebra, and using the assumed non-negativity condition λ(·) ≥ 0, we arrive at the basic inequality

$$\tfrac{1}{2} |||\widehat{\Delta}|||_F^2 \;\le\; \langle\langle \widehat{\Delta}, W \rangle\rangle + \tfrac{1}{2}\lambda(M^*),$$

where $W \in [-1,1]^{n \times n}$ is the noise matrix in the linearized observation model (16a), and $\langle\langle \widehat{\Delta}, W \rangle\rangle$ denotes the trace inner product between $\widehat{\Delta}$ and W.

Now for any value t > 0, let us define

$$Z(t) := \sup_{D \in C_{\mathrm{DIFF}}(M^*, t, C)} \langle\langle D, W \rangle\rangle.$$

With this definition, we find that the error matrix $\widehat{\Delta}$ satisfies the inequality

$$\tfrac{1}{2} |||\widehat{\Delta}|||_F^2 \;\le\; \langle\langle \widehat{\Delta}, W \rangle\rangle + \tfrac{1}{2}\lambda(M^*) \;\le\; Z\big( |||\widehat{\Delta}|||_F \big) + \tfrac{1}{2}\lambda(M^*). \quad (20)$$

Thus, in order to obtain a high-probability bound, we need to understand the behavior of the random quantity Z(t).

By definition, the set $C_{\mathrm{DIFF}}(M^*, t, C)$ is star-shaped; that is, $\alpha D \in C_{\mathrm{DIFF}}(M^*, t, C)$ for every α ∈ [0, 1] and every $D \in C_{\mathrm{DIFF}}(M^*, t, C)$. Using this star-shaped property, it is straightforward to verify that Z(t) grows at most linearly with t, ensuring that there is a non-empty set of scalars t > 0 satisfying the critical inequality

$$\mathbb{E}[Z(t)] + \tfrac{1}{2}\lambda(M^*) \;\le\; \frac{t^2}{2}. \quad (21)$$

Our interest is in an upper bound on the smallest (strictly) positive solution $\delta_n$ to the critical inequality (21). Moreover, our goal is to show that for every $t \ge \delta_n$, we have $|||\widehat{\Delta}|||_F^2 \le c\, t \delta_n$ with probability at least $1 - c_1 e^{-c_2 t \delta_n}$.

Define a “bad” event

$$\mathcal{A}_t := \Big\{ \exists\, \Delta \in C_{\mathrm{DIFF}}(M^*, t, C) \;\Big|\; |||\Delta|||_F \ge \sqrt{t \delta_n} \;\text{ and }\; \langle\langle \Delta, W \rangle\rangle + \tfrac{1}{2}\lambda(M^*) \ge 2 |||\Delta|||_F \sqrt{t \delta_n} \Big\}. \quad (22)$$

Now suppose that the event $\mathcal{A}_t$ holds for some $t \ge \delta_n$, and let $\Delta_0 \in C_{\mathrm{DIFF}}(M^*, t, C)$ be a matrix satisfying the two conditions required for $\mathcal{A}_t$ to occur. Recall that Z(t) grows at most linearly in t; more precisely, due to the star-shaped nature of the set $C_{\mathrm{DIFF}}(M^*, t, C)$, we have $Z(t_1)/t_1 \le Z(t_2)/t_2$ whenever $t_1 \ge t_2$. (Indeed, any D in the set at radius $t_1$, when rescaled to $(t_2/t_1)D$, lies in the set at radius $t_2$, whence $Z(t_2) \ge (t_2/t_1) Z(t_1)$.) Now since λ(·) ≥ 0 and $|||\Delta_0|||_F \ge \delta_n$, we have that whenever the event $\mathcal{A}_t$ holds,

$$Z(\delta_n) + \tfrac{1}{2}\lambda(M^*) \;\ge\; \frac{\delta_n}{|||\Delta_0|||_F} Z(|||\Delta_0|||_F) + \tfrac{1}{2}\lambda(M^*) \;\ge\; \frac{\delta_n}{|||\Delta_0|||_F} \Big( Z(|||\Delta_0|||_F) + \tfrac{1}{2}\lambda(M^*) \Big) \;\ge\; \frac{\delta_n}{|||\Delta_0|||_F} \Big( \langle\langle \Delta_0, W \rangle\rangle + \tfrac{1}{2}\lambda(M^*) \Big) \;\ge\; 2 \delta_n \sqrt{t \delta_n},$$

where the final inequality uses the second condition in the definition of the event $\mathcal{A}_t$. As a consequence, we obtain the following bound on the probabilities of the associated events:

$$\mathbb{P}[\mathcal{A}_t] \;\le\; \mathbb{P}\big[ Z(\delta_n) + \tfrac{1}{2}\lambda(M^*) \ge 2 \delta_n \sqrt{t \delta_n} \big] \quad \text{for all } t \ge \delta_n.$$

The entries of W lie in [−1, 1], have mean zero, are independent on and above the diagonal, and satisfy skew symmetry. Moreover, the function $W \mapsto Z(u)$ is convex and Lipschitz with parameter u. Consequently, from known concentration bounds (e.g., [62, Theorem 5.9], [63]) for convex Lipschitz functions, we have

$$\mathbb{P}\big[ Z(\delta_n) \ge \mathbb{E}[Z(\delta_n)] + \delta_n \sqrt{t \delta_n} \big] \;\le\; e^{-c_1 t \delta_n} \quad \text{for all } t \ge \delta_n,$$


for some universal constant $c_1 > 0$. By the definition of $\delta_n$, we have $\mathbb{E}[Z(\delta_n)] + \tfrac{1}{2}\lambda(M^*) \le \delta_n^2 \le \delta_n \sqrt{t \delta_n}$ for any $t \ge \delta_n$, and consequently

$$\mathbb{P}[\mathcal{A}_t] \;\le\; \mathbb{P}\big[ Z(\delta_n) + \tfrac{1}{2}\lambda(M^*) \ge 2 \delta_n \sqrt{t \delta_n} \big] \;\le\; e^{-c_1 t \delta_n} \quad \text{for all } t \ge \delta_n.$$

Now it must be that either $|||\widehat{\Delta}|||_F \le \sqrt{t \delta_n}$ or $|||\widehat{\Delta}|||_F > \sqrt{t \delta_n}$. In the latter case, conditioning on the complement $\mathcal{A}_t^c$, the basic inequality (20) implies that $\tfrac{1}{2}|||\widehat{\Delta}|||_F^2 \le 2 |||\widehat{\Delta}|||_F \sqrt{t \delta_n}$. Putting together the pieces yields that

$$|||\widehat{\Delta}|||_F \;\le\; 4 \sqrt{t \delta_n},$$

with probability at least $1 - e^{-c_1 t \delta_n}$ for every $t \ge \delta_n$. Substituting $u = t/\delta_n$, we get

$$\mathbb{P}\big( |||\widehat{\Delta}|||_F^2 > c_2 u \delta_n^2 \big) \;\le\; e^{-c_1 u \delta_n^2}, \quad (23)$$

for every u ≥ 1.

In order to determine a feasible $\delta_n$ satisfying the critical inequality (21), we need to bound the expectation $\mathbb{E}[Z(\delta_n)]$. To this end, we introduce an auxiliary lemma:

Lemma 2. There is a universal constant c such that, for any set C satisfying the metric entropy bound (18a), we have

$$\mathbb{E}[Z(t)] \;\le\; c \big( t^b g(M^*) \log n + t\, h(M^*) + n^{-7} \big) \quad (24)$$

for all t ≥ 0.

See Section IV-A2 for the proof of this claim. Using Lemma 2, we see that the critical inequality (21) is satisfied for

$$\delta_n = c_0 \Big( \big( g(M^*) \log n \big)^{\frac{1}{2}(b+1)} + h(M^*) + \sqrt{\lambda(M^*)} + n^{-\frac{7}{2}} \Big),$$

for a positive universal constant $c_0$. With this choice, our claim follows from the tail bound (23), absorbing the constants $c_1$ and $c_2$ into $c_0$.

It remains to prove Lemma 2.

2) Proof of Lemma 2: By the truncated form of Dudley's entropy inequality, we have

$$\mathbb{E}[Z(t)] \;\le\; c \inf_{\delta \in [0,n]} \Big\{ n\delta + \int_{\delta/2}^{t} \sqrt{\log N(\varepsilon, C_{\mathrm{DIFF}}(M^*, t, C), |||\cdot|||_F)} \, d\varepsilon \Big\} \;\le\; c \Big\{ 2n^{-7} + \int_{n^{-8}}^{t} \sqrt{\log N(\varepsilon, C_{\mathrm{DIFF}}(M^*, t, C), |||\cdot|||_F)} \, d\varepsilon \Big\}, \quad (25)$$

where the second step follows by setting $\delta = 2n^{-8}$. Combining our assumed upper bound (18a) on the metric entropy with the earlier inequality (25) yields

$$\mathbb{E}[Z(t)] \;\le\; c \big\{ 2n^{-7} + t^b g(M^*) \log(nt) + t\, h(M^*) \big\} \;\le\; 2c \big\{ 2n^{-7} + t^b g(M^*) \log n + t\, h(M^*) \big\},$$

where the final step uses the upper bound t ≤ n. We have thus established the claimed bound (24).

B. Proof of Proposition 1

We are now equipped to prove bounds on the risk achieved by the oracle estimator from equation (5).

1) Upper bound: Let k∗ = ‖k‖∞ denote the size of the largest indifference set in M∗, and recall that the oracle estimator knows the value of k∗. For our upper bound, we use Lemma 1 from the previous section with the choices C = CSST(k∗) and λ(M) = 0. With these choices, the estimator (17) for which Lemma 1 provides guarantees is equivalent to the oracle estimator (5). We then have

$$C_{\mathrm{DIFF}}(M^*, t, C_{\mathrm{SST}}(k^*)) = \big\{ \alpha(M - M^*) \;\big|\; M \in C_{\mathrm{SST}}(k^*), \; \alpha \in [0,1], \; |||\alpha(M - M^*)|||_F \le t \big\}.$$

In order to apply the result of Lemma 1, we need to compute the metric entropy of the set $C_{\mathrm{DIFF}}$. Consider the set

$$\widetilde{C}_{\mathrm{SST}}(k) := \big\{ \alpha M \;\big|\; M \in C_{\mathrm{SST}}(k), \; \alpha \in [0,1] \big\}.$$

Since M∗ ∈ CSST(k∗), the metric entropy of $C_{\mathrm{DIFF}}$ is at most twice the metric entropy of $\widetilde{C}_{\mathrm{SST}}(k^*)$. The following lemma provides an upper bound on the metric entropy of the set $\widetilde{C}_{\mathrm{SST}}(k)$:

Lemma 3. For every ε > 0 and every integer k ∈ [n], the metric entropy of the set $\widetilde{C}_{\mathrm{SST}}(k)$ is bounded as

$$\log N(\varepsilon, \widetilde{C}_{\mathrm{SST}}(k), |||\cdot|||_F) \;\le\; c\, \frac{(n-k+1)^2}{\varepsilon^2} \Big( \log \frac{n}{\varepsilon} \Big)^2 + c\,(n-k+1) \log n,$$

where c > 0 is a universal constant.

See Section IV-B3 for the proof of this claim. With this lemma, we are now equipped to prove the upper bound in Proposition 1. The bound of Lemma 3 implies that

$$\log N(\varepsilon, C_{\mathrm{DIFF}}(M^*, t, C_{\mathrm{SST}}(k^*)), |||\cdot|||_F) \;\le\; \frac{2c\,(n-k^*+1)^2}{\varepsilon^2} \log^2 n + 2c\,(n-k^*+1) \log n,$$

for all $\varepsilon \ge n^{-8}$. Consequently, a bound of the form (18a) holds with $g(M^*) = \sqrt{2c}\,(n-k^*+1) \log n$, $h(M^*) = \sqrt{2c\,(n-k^*+1) \log n}$, and b = 0. Applying Lemma 1 with λ(·) = 0 and u = 1/(3c) yields

$$\mathbb{P}\big( |||\widehat{M}(s,k) - M^*|||_F^2 > c'(n-k^*+1)(\log n)^2 \big) \;\le\; e^{-(n-k^*+1)(\log n)^2},$$

where c′ > 0 is a universal constant. Integrating this tail bound (and using the fact that the Frobenius norm is bounded as $|||\widehat{M}(s,k) - M^*|||_F \le n$) gives the claimed result.

2) Lower bound: We now turn to proving the lower bound in Proposition 1. By re-ordering as necessary, we may assume without loss of generality that k1 ≥ · · · ≥ ks, so that ‖k‖∞ = k1. The proof relies on the following technical preliminary, which establishes a lower bound on the minimax rate of estimation when there are two indifference sets.


Lemma 4. If there are s = 2 indifference sets (say, of sizes k1 ≥ k2), then any estimator $\widehat{M}$ has error lower bounded as

$$\sup_{M^* \in C_{\mathrm{SST}}(2, \{k_1, k_2\})} \frac{1}{n^2} \mathbb{E}\big[ |||\widehat{M} - M^*|||_F^2 \big] \;\ge\; c_\ell \, \frac{n - k_1}{n^2}. \quad (26)$$

See Section IV-B4 for the proof of this claim. Let us now complete the proof of the lower bound in Proposition 1. We split the analysis into two cases, depending on the size of the largest indifference set.

Case I: First, suppose that k1 > n/3. We then observe that $C_{\mathrm{SST}}(2, \{k_1, n-k_1\})$ is a subset of CSST(k): indeed, every matrix in $C_{\mathrm{SST}}(2, \{k_1, n-k_1\})$ can be seen as a matrix in CSST(k) that has identical values in the entries corresponding to all items not in the largest indifference set. Since $C_{\mathrm{SST}}(2, \{k_1, n-k_1\})$ is a subset of CSST(k), the lower bound for estimating a matrix in CSST(k) is at least as large as the lower bound for estimating a matrix in the class $C_{\mathrm{SST}}(2, \{k_1, n-k_1\})$. Now applying Lemma 4 to the set $C_{\mathrm{SST}}(2, \{k_1, n-k_1\})$ yields a lower bound of $c_\ell \, \frac{\min\{n-k_1,\, k_1\}}{n^2}$. Since k1 > n/3, we have $k_1 \ge \frac{n-k_1}{2}$, and as a result we get a lower bound of $\frac{c_\ell}{2} \, \frac{n-k_1}{n^2}$.

Case II: Alternatively, suppose that k1 ≤ n/3. In this case, we claim that there exists a value u ∈ [n/3, 2n/3] such that $C_{\mathrm{SST}}(2, \{u, n-u\})$ is a subset of the set CSST(k) with k1 ≤ n/3. Observe that for any collection of indifference sets with sizes k such that k1 ≤ n/3, there is a grouping of the sets into two groups, each of total size between n/3 and 2n/3; this is possible because the largest set has size at most n/3. Denoting the size of either of these groups by u, we have established the claim.

As in the previous case, we can now apply the result of Lemma 4 to the subset $C_{\mathrm{SST}}(2, \{u, n-u\})$ to obtain a lower bound of $c_\ell \, \frac{1}{3n} \ge c_\ell \, \frac{n-k_1}{3n^2}$.

3) Proof of Lemma 3: In order to upper bound the metric entropy of CSST(k), we first separate out the contributions of the permutation and the bivariate monotonicity conditions. Let CSST(id)(k) denote the subset of matrices in CSST(k) that are faithful to the identity permutation. With this notation, the ε-metric entropy of CSST(k) is upper bounded by the sum of two parts:
(a) the ε-metric entropy of the set CSST(id)(k); and
(b) the logarithm of the number of distinct permutations of the n items.
Due to the presence of an indifference set of size at least k, the quantity in (b) is upper bounded by log(n!/k!) ≤ (n−k) log n.

We now upper bound the ε-metric entropy of the set

CSST(id)(k). We do so by partitioning the n² positions in the matrix into four sets, computing the ε/2-metric entropy of each partition separately, and then adding up these metric entropies. More precisely, letting Sk ⊆ [n] denote some set of k items that belong to the same indifference set, let us partition the entries of each matrix into four sub-matrices as follows:
(i) the (k × k) sub-matrix comprising entries (i, j) where both i ∈ Sk and j ∈ Sk;
(ii) the (k × (n−k)) sub-matrix comprising entries (i, j) where i ∈ Sk and j ∈ [n]\Sk;
(iii) the ((n−k) × k) sub-matrix comprising entries (i, j) where i ∈ [n]\Sk and j ∈ Sk; and
(iv) the ((n−k) × (n−k)) sub-matrix comprising entries (i, j) where both i ∈ [n]\Sk and j ∈ [n]\Sk.

By construction, the ε-metric entropy of CSST(id)(k) is at most the sum of the ε/2-metric entropies of these sub-matrices. The set of sub-matrices in (i) comprises only constant matrices, and hence its ε/2-metric entropy is at most log(2n/ε). Any sub-matrix from set (ii) has constant-valued columns, and so the ε/2-metric entropy of this set is upper bounded by 2(n−k) log(2n/ε). An identical bound holds for the set of sub-matrices in (iii). Finally, the sub-matrices in (iv) are all contained in the set of all ((n−k) × (n−k)) SST matrices. The metric entropy of the SST class is analyzed in Theorem 1 of our past work [12], where we showed that the ε/2-metric entropy of this set is at most

2(2(n−k)/ε)² (log(2(n−k)/ε))² + (n−k) log n.

Summing up each of these metric entropies, some algebraic manipulations yield the claimed result.
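Collecting the pieces, the bound prior to the final simplification reads as follows (a restatement of the terms derived above, with the permutation count included):

```latex
\log N\bigl(\varepsilon, \mathcal{C}_{\mathrm{SST}}(k)\bigr)
\;\le\; \underbrace{(n-k)\log n}_{\text{permutations}}
\;+\; \underbrace{\log\frac{2n}{\varepsilon}}_{\text{(i)}}
\;+\; \underbrace{4(n-k)\log\frac{2n}{\varepsilon}}_{\text{(ii) and (iii)}}
\;+\; \underbrace{2\Bigl(\frac{2(n-k)}{\varepsilon}\Bigr)^{2}
      \Bigl(\log\frac{2(n-k)}{\varepsilon}\Bigr)^{2} + (n-k)\log n}_{\text{(iv)}}.
```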

4) Proof of Lemma 4: For the first part of the proof, we assume that k2 is greater than a universal constant. (See the analysis of Case 2 below for how to handle small values of k2.) Under this condition, the Gilbert-Varshamov bound [64], [65] guarantees the existence of a binary code B of length k2, minimum Hamming distance c0 k2, and number of codewords card(B) = T = 2^{c k2}. (As usual, the quantities c and c0 are positive numerical constants.)

We now construct a set of T matrices contained within the set CSST(2, (k1, k2)), whose constituents are in one-to-one correspondence with the T codewords of the binary code constructed above. Let the items S = {1, . . . , k1} correspond to the first indifference set, so that the complementary set Sc := {k1+1, . . . , n} indexes the second indifference set.

Fix some δ ∈ (0, 1/3], whose precise value is to be specified later. Define the base matrix M(0) with entries

Mij(0) = 1/2 if i, j ∈ S or i, j ∈ Sc;  1/2 + δ if i ∈ S and j ∈ Sc;  1/2 − δ if i ∈ Sc and j ∈ S.

For any other codeword z ∈ B, the matrix M(z) is defined by starting with the base matrix M(0), and then swapping row/column i with row/column (k1 + i) if and only if zi = 1. For instance, if the codeword is z = [1 1 0 · · · 0], then the new ordering in the matrix M(z) is obtained by swapping the first two items of the two indifference sets, and is given by (k1+1), (k1+2), 3, . . . , k1, 1, 2, (k1+3), . . . , n.
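To make the packing construction concrete, the following sketch (our own illustrative code, assuming k2 ≤ k1 so that coordinate i of the codeword can swap item i with item k1 + i) builds the base matrix M(0) and the permuted matrix M(z):

```python
import numpy as np

def base_matrix(k1, k2, delta):
    """Base matrix M(0): indifference sets S = {0,...,k1-1} and S^c = {k1,...,n-1}."""
    n = k1 + k2
    M = np.full((n, n), 0.5)
    M[:k1, k1:] = 0.5 + delta  # i in S, j in S^c
    M[k1:, :k1] = 0.5 - delta  # i in S^c, j in S
    return M

def packing_matrix(z, k1, delta):
    """M(z): starting from M(0), swap row/column i with row/column k1+i iff z[i] = 1."""
    k2 = len(z)
    M0 = base_matrix(k1, k2, delta)
    perm = np.arange(k1 + k2)
    for i, zi in enumerate(z):
        if zi:
            perm[i], perm[k1 + i] = perm[k1 + i], perm[i]
    # permute rows and columns simultaneously
    return M0[np.ix_(perm, perm)]
```

Each pair of matrices built from codewords at Hamming distance h differs on order hn entries by order δ, which is the source of the Frobenius separation used below.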

We have thus constructed a set of T matrices that are contained within the set CSST(2, k1, k2). We now establish certain properties of these matrices which will allow us to prove the claimed lower bound. Consider any two matrices M1 and M2 in this set. Since any two codewords in our binary code have Hamming distance at least c0 k2, the construction above gives

c1 k2 n δ² ≤ |||M1 − M2|||_F² ≤ 2 k2 n δ²,

for a constant c1 ∈ (0, 1). Let P_M1 and P_M2 denote the distributions of the random matrix Y based on Bernoulli sampling (1) from the


matrices M1 and M2, respectively. Since δ ∈ (0, 1/3], all entries of the matrices M1 and M2 lie in the interval [1/3, 2/3]. Under this boundedness condition, the KL divergence may be sandwiched by the Frobenius norm up to constant factors. Applying this result in the current setting yields

c2 k2 n δ² ≤ D_KL(P_M1 ‖ P_M2) ≤ c3 k2 n δ²,

again for positive universal constants c2 and c3. An application of Fano's inequality to this set gives that the error incurred by any estimator M̂ is lower bounded as

sup_{M∗ ∈ CSST(2,(k1,k2))} E[|||M̂ − M∗|||_F²] ≥ (c1 k2 n δ²/2) (1 − (c3 k2 n δ² + log 2)/(c k2)). (27)

From this point, we split the remainder of the analysis into two cases.

a) Case 1: First suppose that k2 is larger than some suitably large (but still universal) constant. In this case, we may set δ² = c″/n for a small enough universal constant c″, and the Fano bound (27) then implies that

sup_{M∗ ∈ CSST(2,(k1,k2))} E[|||M̂ − M∗|||_F²] ≥ c′ k2,

for some universal constant c′ > 0. Since k2 = n − k1, this completes the proof of the claimed lower bound (26) in this case.
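Spelling out the substitution δ² = c″/n in the Fano bound (27):

```latex
\frac{c_1 k_2 n \delta^2}{2}
\left(1-\frac{c_3 k_2 n \delta^2+\log 2}{c\,k_2}\right)
\;=\;
\frac{c_1 c'' k_2}{2}
\left(1-\frac{c_3 c'' k_2+\log 2}{c\,k_2}\right)
\;\ge\; c' k_2,
```

where the final inequality holds once c″ is chosen small enough that c3 c″/c ≤ 1/4, and k2 is large enough that (log 2)/(c k2) ≤ 1/4.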

b) Case 2: Otherwise, the parameter k2 is smaller than the universal constant in the above part of the proof. In this case, the claimed lower bound (26) on E[|||M̂ − M∗|||_F²] is just a constant, and we can handle this case with a different argument. In particular, suppose that the estimator is provided the partition forming the two indifference sets, and only needs to estimate the parameter δ. For this purpose, the sufficient statistics of the observation matrix Y are those entries of the observation matrix that correspond to matches between two items of different indifference sets; note that there are k1 k2 such entries in total. From standard bounds on estimation of a single Bernoulli probability, any estimator δ̂ of δ must have mean-squared error lower bounded as E[(δ̂ − δ)²] ≥ c/(k1 k2).

Finally, observe that the error in estimating the matrix M∗ in the squared Frobenius norm is at least 2 k1 k2 times the (squared) error in estimating the parameter δ. We have thus established the claimed lower bound of a constant.

C. Proof of Theorem 1

We now prove the upper bound (9a) for the regularized least squares estimator (8). Note that this estimator has the equivalent representation

M̂_REG ∈ arg min_{M ∈ CSST} { |||Y − M|||_F² + (n − kmax(M) + 1)(log n)³ }. (28)
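As an illustration, the penalized objective appearing in (28) can be evaluated for a candidate matrix as follows. This is a sketch under our own conventions: it computes kmax(M) as the size of the largest group of identical rows (items in a common indifference set have identical rows), and it does not carry out the minimization over the SST class itself.

```python
import numpy as np
from collections import Counter

def k_max(M, decimals=9):
    """Size of the largest indifference set: the largest group of identical rows of M."""
    groups = Counter(tuple(np.round(row, decimals)) for row in M)
    return max(groups.values())

def regularized_objective(Y, M):
    """Objective of the regularized least squares estimator (28):
    squared Frobenius error plus the indifference-set penalty."""
    n = M.shape[0]
    penalty = (n - k_max(M) + 1) * np.log(n) ** 3
    return np.sum((Y - M) ** 2) + penalty
```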

In this least squares optimization program, as well as in those to follow, we assume that ties are broken via some fixed rule, for instance, by lexicographic ordering. Defining k∗ := kmax(M∗), it is also convenient to consider the family of estimators

M̃_k ∈ arg min_{M ∈ CSST(k) ∪ CSST(k∗)} { |||Y − M|||_F² + (n − kmax(M) + 1)(log n)³ }, (29)

where k ranges over [n]. Note that this family of estimators cannot be computed in practice (since the value of k∗ is unknown), but it is convenient for our analysis, in particular because M̂_REG = M̃_k for some value k ∈ [n]. Now observe that CSST(k1) ⊆ CSST(k2) for every 1 ≤ k2 ≤ k1 ≤ n, and consequently also CSST(k) ∪ CSST(k∗) = CSST(min{k, k∗}). Also recall that any ties in the minimization problems (28) and (29) are broken in some fixed manner, such as by lexicographic ordering. Then from the definitions of M̂_REG and M̃_k, we have the deterministic inequality

|||M̂_REG − M∗|||_F² ≤ max_{k ∈ [n]} ( |||M̃_k − M∗|||_F² 1{kmax(M̃_k) = k} ). (30)

In what follows shortly, we show that there exists a universal constant c0 > 0 such that

P[ |||M̃_k − M∗|||_F² 1{kmax(M̃_k) = k} > c0 (n − k∗ + 1)(log n)³ ] ≤ e^{−(log n)²} (31)

for each fixed k ∈ [n]. Given this inequality, applying the union bound over all k ∈ [n] using (30) yields

P[ |||M̂_REG − M∗|||_F² > c0 (n − k∗ + 1)(log n)³ ] ≤ n e^{−(log n)²} ≤ e^{−(1/2)(log n)²}.

We have thus established the claimed tail bound (9a). In order to prove the bound (9b) on the adaptivity index, we first integrate the tail bound (9a). Since all entries of M∗ and M̂_REG lie in [0, 1], we have |||M∗ − M̂_REG|||_F² ≤ n², and so this integration step yields an analogous bound on the expected error:

E[|||M∗ − M̂_REG|||_F²] ≤ cu (n − kmax(M∗) + 1)(log n)³.

Coupled with the lower bound on the risk of the oracle estimator established in Proposition 1, we obtain the claimed bound (9b) on the adaptivity index of M̂_REG.

It remains to prove the tail bound (31). We proceed via a two-step argument: first, we use the general upper bound given by Lemma 1 to derive a weaker version of the required bound; second, we refine this weaker bound so as to obtain the bound (31).

Establishing a weaker bound: To begin the first step, we apply Lemma 1 with the choices

C = CSST(k) ∪ CSST(k∗) and λ(M) = (n − kmax(M) + 1)(log n)³.


With these choices, the set CDIFF(M∗, t, C) in the statement of Lemma 1 takes the form

CDIFF(M∗, t, C) = { α(M − M∗) | α ∈ [0, 1], M ∈ CSST(k) ∪ CSST(k∗), |||α(M − M∗)|||F ≤ t }.

Lemma 3 implies that

log N(ε, CDIFF(M∗, t, C), |||·|||F) ≤ c (n − min{k, k∗} + 1)² (log n)²/ε² + c (n − min{k, k∗} + 1) log n,

for all ε ≥ n⁻⁸. Applying Lemma 1 with u = 1 then yields

P( |||M̃_k − M∗|||_F² > c (n − min{k, k∗} + 1)(log n)² ) ≤ e^{−c(log n)²}. (32)

Note that the bound (32) is weaker than the desired bound (31), since min{k, k∗} ≤ k∗. Thus, our next step is to refine it.

a) Refining the bound (32): If k ≥ k∗, then the bound (32) directly implies the bound (31). So in what follows, we restrict attention to k < k∗. Observe that the matrix M̃_k is optimal for the optimization problem (29), and the matrix M∗ lies in the feasible set. Consequently, we have the basic inequality

|||Y − M̃_k|||_F² + (n − kmax(M̃_k) + 1)(log n)³ ≤ |||Y − M∗|||_F² + (n − k∗ + 1)(log n)³.

Using the linearized form of the observation model (16a), some simple algebraic manipulations give

(1/2)|||M̃_k − M∗|||_F² + (1/2)(n − kmax(M̃_k) + 1)(log n)³ ≤ ⟨⟨M̃_k − M∗, W⟩⟩ + (1/2)(n − k∗ + 1)(log n)³, (33)

where W is the noise matrix (16b) in the linearized form of the model. The following lemma helps bound the first term on the right-hand side of inequality (33). Consistent with the notation elsewhere in the paper, for any value of t > 0, let us define the set of matrices

CDIFF(M∗, t, CSST(k)) := { α(M − M∗) | M ∈ CSST(k), α ∈ [0, 1], |||α(M − M∗)|||F ≤ t }.

With this notation, we then have the following result:

Lemma 5. For any M∗ ∈ CSST, any fixed k ∈ [n], and any t > 0, we have

sup_{D ∈ CDIFF(M∗, t, CSST(k))} ⟨⟨D, W⟩⟩ ≤ c t √((n − min{k, k∗} + 1) log n) + c (n − min{k, k∗} + 1)(log n)², (34)

with probability at least 1 − e^{−(log n)²}.

See Section IV-C1 for the proof of this lemma. From the weaker guarantee (32) established earlier, we know that |||M̃_k − M∗|||_F ≤ c′ √((n − min{k, k∗} + 1)(log n)²) with high probability. Consequently, the term ⟨⟨M̃_k − M∗, W⟩⟩ is upper bounded by the quantity (34) for some value of t ≤ c′ √((n − min{k, k∗} + 1)(log n)²), and hence

⟨⟨M̃_k − M∗, W⟩⟩ ≤ c″ (n − min{k, k∗} + 1)(log n)² ≤ c″ (n − k + 1)(log n)²,

with probability at least 1 − e^{−(log n)²}. Applying this bound to the inequality (33), we obtain

P( (1/2)|||M̃_k − M∗|||_F² + (1/2)(n − kmax(M̃_k) + 1)(log n)³ ≤ c″ (n − k + 1)(log n)² + (1/2)(n − k∗ + 1)(log n)³ ) ≥ 1 − e^{−(log n)²}.

Some algebraic manipulations then yield

P( |||M̃_k − M∗|||_F² 1{kmax(M̃_k) = k} ≤ c‴ (n − k∗ + 1)(log n)³ ) ≥ 1 − e^{−(log n)²},

thereby establishing the claimed result (31).

1) Proof of Lemma 5: Consider the function ζ : [0, n] → R₊ given by ζ(t) := sup_{D ∈ CDIFF(M∗, t, CSST(k))} ⟨⟨D, W⟩⟩. In order to control the behavior of this function, we first bound the metric entropy of the set CDIFF(M∗, t, CSST(k)). Note that Lemma 3 ensures that

log N(ε, CDIFF(M∗, t, CSST(k)), |||·|||F) ≤ c (n − min{k, k∗} + 1)² (log(n/ε))²/ε² + c (n − min{k, k∗} + 1) log n.

Based on this metric entropy bound, the truncated version of Dudley's entropy integral then guarantees that

E[ζ(t)] ≤ c (n − min{k, k∗} + 1)(log n)² + c t √((n − min{k, k∗} + 1) log n).

It can be verified that for any value t > 0, the function ζ(t) : R^{n×n} → R (viewed as a function of W) is t-Lipschitz. Moreover, the random matrix W has entries (16b) that are independent on and above the diagonal, bounded by 1 in absolute value, and satisfy skew-symmetry. Consequently, from known concentration bounds (e.g., [62, Theorem 5.9], [63]) for convex Lipschitz functions, we have

P[ζ(t) ≥ E[ζ(t)] + tv] ≤ e^{−v²} for all v ≥ 0.

Combining the pieces, we find that

P[ζ(t) ≥ c (n − min{k, k∗} + 1)(log n)² + c t √((n − min{k, k∗} + 1) log n) + tv] ≤ e^{−v²},

valid for all v ≥ 0. Setting v = √((n − min{k, k∗} + 1) log n) yields the claimed result.


D. Proof of Theorem 2

We now prove the upper bound for the CRL estimator, as stated in Theorem 2. In order to simplify the presentation, we assume without loss of generality that the true permutation of the n items is the identity permutation id. Let π_CRL = (π1, . . . , πn) denote the permutation obtained at the end of the second step of the CRL estimator. The following lemma establishes two useful properties of the outcomes of the first two steps.

Lemma 6. With probability at least 1 − n⁻²⁰, the permutation π_CRL obtained at the end of the second step of the estimator satisfies the following two properties:
(a) max_{i ∈ [n]} Σ_{ℓ=1}^n |M∗_{iℓ} − M∗_{π_CRL(i)ℓ}| ≤ 2√n (log n)², and
(b) the group of similar items obtained in the first step is of size at least k∗ = kmax(M∗).

See Section IV-D1 for the proof of this claim. Given Lemma 6, let us complete the proof of the theorem. Let Π denote the set of all permutations on n items which satisfy the two conditions in the statement of Lemma 6. Given that every entry of M∗ lies in the interval [0, 1], any permutation π ∈ Π satisfies

|||M∗ − π(M∗)|||_F² = Σ_{i ∈ [n]} Σ_{ℓ ∈ [n]} (M∗_{iℓ} − M∗_{π(i)π(ℓ)})²
 ≤ Σ_{i ∈ [n]} Σ_{ℓ ∈ [n]} |M∗_{iℓ} − M∗_{π(i)π(ℓ)}|
 ≤ Σ_{i ∈ [n]} Σ_{ℓ ∈ [n]} |M∗_{iℓ} − M∗_{π(i)ℓ}| + Σ_{i ∈ [n]} Σ_{ℓ ∈ [n]} |M∗_{π(i)ℓ} − M∗_{π(i)π(ℓ)}|,

where the final expression results from the triangle inequality. Since M∗ satisfies shifted skew-symmetry, we obtain

|||M∗ − π(M∗)|||_F² ≤ 2 Σ_{i ∈ [n]} Σ_{ℓ ∈ [n]} |M∗_{iℓ} − M∗_{π(i)ℓ}|. (35)

Now consider any item i ∈ [n]. Incorrectly estimating item i as lying in position π(i) contributes a non-zero error only if either item i or item π(i) lies in the (n − k∗)-sized set of items outside the largest indifference set. Consequently, there are at most 2(n − k∗) values of i in the sum (35) that make a non-zero contribution. Moreover, from property (a) of Lemma 6, each such item contributes at most 2√n (log n)² to the error. As a consequence, we have the upper bound

|||M∗ − π(M∗)|||_F² ≤ 8(n − k∗)√n (log n)². (36)

Let us now analyze the third step of the CRL estimator. The problem of bivariate isotonic regression refers to estimation of the matrix M∗ ∈ CSST when the true underlying permutation of the items is known a priori. In our case, the permutation is known only approximately, so we also need to track the associated approximation error. In order to derive a tail bound on the error of bivariate isotonic regression, we call upon the general upper bound proved earlier in Lemma 1 with the choices C = CSST(id) and λ = 0. Now let

CDIFF(M∗, t, CSST(id)) := { α(M − M∗) | M ∈ CSST(id), α ∈ [0, 1], |||α(M − M∗)|||F ≤ t }.

The following lemma uses a result from the paper [31] to derive an upper bound on the metric entropy of CDIFF(M∗, t, CSST(id)). For any matrix M∗ ∈ CSST, let s(M∗) denote the number of indifference sets in M∗.

Lemma 7. For every ε > n⁻⁸ and t ∈ (0, n], we have the metric entropy bound

log N(ε, CDIFF(M∗, t, CSST(id)), |||·|||F) ≤ c1 t² (s(M∗))² (log n)⁶/ε² + c1 (s(M∗))² log n,

where c1 > 0 is a universal constant.

With this bound on the metric entropy, we apply Lemma 1 with

u = c2 (n − k∗ + 1)²/(s(M∗))²,  g(M∗) = √c1 s(M∗)(log n)³,  h(M∗) = √c1 s(M∗)√(log n),  and b = 1,

where c2 > 0 is a large enough constant. Lemma 1 then yields that for each fixed matrix M∗ ∈ CSST(id), the error incurred by the least squares estimator M̂_id ∈ arg min_{M ∈ CSST(id)} |||M − Y|||_F² is upper bounded as

|||M̂_id − M∗|||_F² ≤ c (n − k∗ + 1)² (log n)⁸, (37)

with probability at least 1 − e^{−(n−k∗+1)²(log n)⁸}. Note that this application of Lemma 1 is valid since s(M∗) ≤ n − k∗ + 1 and hence u ≥ 1. Furthermore, it follows as a corollary of Theorem 1 in the paper [12] that

|||M̂_id − M∗|||_F² ≤ c n (log n)⁸,

with probability at least 1 − e^{−n(log n)⁶}. (Although the statement of [12, Theorem 1] considers the expected loss, the entire proof actually provides the relevant tail bound; see [12, Equation 8b] and the discussion following the statement of [12, Theorem 1].) Combining these upper bounds yields

|||M̂_id − M∗|||_F² ≤ c min{(n − k∗ + 1)², n} (log n)⁸ ≤ c (n − k∗ + 1)√n (log n)⁸, (38)

with probability at least 1 − e^{−min{(n−k∗+1)², n}(log n)³}, where c is a positive universal constant. The second inequality makes use of the bound min{u², v²} ≤ uv for any two non-negative numbers u and v.

Let us now put together the analysis of the approximation error (36) in the permutation obtained in the first two steps and the error (38) in estimating the matrix in the third step. To this end, consider any (fixed) permutation π ∈ Π. For clarity, we use M̂_L(Y, π) to represent the least squares estimator under the permutation π for the observation matrix Y, that is,

M̂_L(Y, π) := arg min_{M ∈ CSST(π)} |||M − Y|||_F². (39)

With this definition, we have the relation M̂_CRL = M̂_L(Y, π_CRL). We cannot bound the error of this estimate directly since the


permutation π_CRL is not fixed, but depends on the observed data Y. In order to derive the desired result, we first bound the error of the estimator M̂_L(Y, π) when the permutation π is fixed.

Consider any matrix M∗ ∈ CSST(id) under the identity permutation. We can then write

|||M̂_L(M∗ + W, π) − M∗|||_F²
 = |||M̂_L(M∗ + W, π) − M̂_L(π(M∗) + W, π) + M̂_L(π(M∗) + W, π) − M∗|||_F²
 ≤ 2|||M̂_L(M∗ + W, π) − M̂_L(π(M∗) + W, π)|||_F² + 2|||M̂_L(π(M∗) + W, π) − M∗|||_F². (40)

We separately bound the two terms on the right-hand side of expression (40). First observe that the least squares step of the estimator M̂_L (for a given permutation π in its second argument) is a projection onto the convex set CSST(π), and hence we have the deterministic bound

|||M̂_L(M∗ + W, π) − M̂_L(π(M∗) + W, π)|||_F² ≤ |||M∗ − π(M∗)|||_F². (41a)

In addition, we have

|||M̂_L(π(M∗) + W, π) − M∗|||_F² ≤ 2|||M̂_L(π(M∗) + W, π) − π(M∗)|||_F² + 2|||π(M∗) − M∗|||_F². (41b)

From our earlier bound (38), we have that for any matrix M∗ ∈ CSST(id), the least squares estimate satisfies

|||M̂_L(M∗ + W, id) − M∗|||_F² ≤ cu (n − k∗ + 1)√n (log n)⁸,

with probability at least 1 − e^{−c min{(n−k∗+1)², n}(log n)³}. This bound is based directly on Lemma 1 and on Theorem 1 of our previous work [12]. Three properties of the noise matrix W are required for the proofs of Lemma 1 and [12, Theorem 1]: (a) E[W] = 0; (b) |Wij| ≤ 1 for every pair i, j ∈ [n]; and (c) the entries above the diagonal of W are independent (and those below are governed by skew-symmetry). For any fixed permutation π, the matrix π⁻¹(W) also satisfies each of these properties. As a result, the same bound applies when the noise matrix is π⁻¹(W) instead of W:

P( |||M̂_L(M∗ + π⁻¹(W), id) − M∗|||_F² ≤ cu (n − k∗ + 1)√n (log n)⁸ ) ≥ 1 − e^{−c min{(n−k∗+1)², n}(log n)³}.

Applying the permutation π to each of the matrices in the above inequality then yields the bound

P( |||M̂_L(π(M∗) + W, π) − π(M∗)|||_F² ≤ cu (n − k∗ + 1)√n (log n)⁸ ) ≥ 1 − e^{−c min{(n−k∗+1)², n}(log n)³}. (42)

In conjunction, the bounds (36), (40), (41a), (41b) and (42) imply that for any fixed π ∈ Π,

P( |||M̂_L(M∗ + W, π) − M∗|||_F² ≤ cu (n − k∗ + 1)√n (log n)⁸ ) ≥ 1 − e^{−c min{(n−k∗+1)², n}(log n)³}. (43)

Although we are guaranteed that π_CRL ∈ Π, we cannot apply the bound (43) directly to it, since π_CRL is a data-dependent quantity. In order to circumvent this issue, we need a uniform version of the bound (43), which we obtain by applying the union bound over the data-dependent component of π_CRL.

In more detail, let us view Steps 1 and 2 of the CRL algorithm as first obtaining a total ordering of the n items via a count of the number of pairwise victories, then converting it to a partial order by placing all items in the subset identified by Step 2 into an equivalence class, and finally obtaining a total ordering by permuting the items in the equivalence class in a data-independent manner. Lemma 6 ensures that the size of this equivalence class is at least k∗. Consequently, the number of possible (data-dependent) partial orders obtained is at most n!/k∗! ≤ e^{(n−k∗) log n}. Taking a union bound over each of these e^{(n−k∗) log n} cases, and recalling that M̂_CRL = M̂_L(M∗ + W, π_CRL), we obtain the result

P[ |||M̂_CRL − M∗|||_F² ≤ cu (n − k∗ + 1)√n (log n)⁸ | π_CRL ∈ Π ] ≥ 1 − e^{−(log n)³}.

Note that we do not need to take a union bound over the elements of the same equivalence class, since picking a high-probability event uniformly at random retains the probability bound for the resulting event. Finally, recalling that Lemma 6 ensures that P[π_CRL ∈ Π] ≥ 1 − n⁻²⁰, we have established the claim.
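The counting and grouping logic of Steps 1 and 2, as described above, can be sketched as follows. This is a simplified illustration under our own assumptions: the grouping threshold √n log n is borrowed from the row-sum concentration in Lemma 6, and the exact randomization rule of the CRL estimator is specified earlier in the paper.

```python
import numpy as np

def count_step_order(Y, rng):
    """Step 1 (Count) sketch: rank items by row sums of the observation matrix Y,
    breaking ties uniformly at random."""
    n = Y.shape[0]
    scores = Y.sum(axis=1)
    jitter = rng.random(n)  # randomized tie-breaking
    order = sorted(range(n), key=lambda i: (scores[i], jitter[i]), reverse=True)
    return order, scores

def largest_similar_group(scores, thresh):
    """Step 2 sketch: size of the largest set of items whose row sums all lie
    within `thresh` of one another (sliding window over sorted scores)."""
    s = np.sort(scores)[::-1]  # descending
    if len(s) == 0:
        return 0
    best, j = 1, 0
    for i in range(len(s)):
        while s[j] - s[i] > thresh:
            j += 1
        best = max(best, i - j + 1)
    return best
```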

It remains to prove the two auxiliary lemmas stated above.

1) Proof of Lemma 6: We first prove that for any fixed item i ∈ [n], the inequality of part (a) holds with probability at least 1 − n⁻²². The claimed result then follows via a union bound over all items.

Consider any item j > i such that

Σ_{ℓ=1}^n M∗_{iℓ} − Σ_{ℓ=1}^n M∗_{jℓ} > 2√n (log n)². (44)

An application of Bernstein's inequality then gives (see the proof of Theorem 1 in the paper [7] for details) that

P( Σ_{ℓ=1}^n Y_{jℓ} ≥ Σ_{ℓ=1}^n Y_{iℓ} − √n log n ) ≤ 1/n²³. (45)

Likewise, for any item j < i such that Σ_{ℓ=1}^n M∗_{jℓ} − Σ_{ℓ=1}^n M∗_{iℓ} > 2√n (log n)², we have P( Σ_{ℓ=1}^n Y_{iℓ} ≥ Σ_{ℓ=1}^n Y_{jℓ} − √n log n ) ≤ 1/n²³.

Now consider any j ≥ i. In order for item i to be located in position j in the total order given by the count and randomize steps of the CRL estimator, there must be at least (j − i) items in the set {i+1, . . . , n} whose row sums are at least (Σ_{ℓ=1}^n Y_{iℓ} − √n log n). In particular, there must be at least one item in the set {j, . . . , n} whose row sum is at least (Σ_{ℓ=1}^n Y_{iℓ} − √n log n). It follows from our results above that under the condition (44), this event occurs with probability no more than 1/n²¹. An analogous argument applies when j ≤ i, thereby proving the claim.


We now move to the condition of part (b). Observe that for any two items i and j in the same indifference set, we have M∗_{iℓ} = M∗_{jℓ} for every ℓ ∈ [n]. An application of Bernstein's inequality now gives that

P( Σ_{ℓ=1}^n Y_{jℓ} − Σ_{ℓ=1}^n Y_{iℓ} ≥ √n log n ) ≤ 1/n²³.

A union bound over all pairs of items in the largest indifference set gives that all k∗ items in the largest indifference set have their row sums differing from each other by at most √n log n.

Consequently, the group must be of at least this size.

2) Proof of Lemma 7: For the proof, it will be convenient to define a class CSST(t; [−1, 1]) that is similar to the class CSST(id), but contains matrices with entries in [−1, 1]:

CSST(t; [−1, 1]) := { M ∈ [−1, 1]^{n×n} | |||M|||F ≤ t and M_{kℓ} ≥ M_{ij} whenever k ≤ i and ℓ ≥ j }.

We now call upon Theorem 3.3 in the paper [31], which provides the following upper bound on the metric entropy of bivariate isotonic matrices within a Frobenius ball:

log N(ε, CSST(t; [−1, 1]), |||·|||F) ≤ c0 t² (log n)⁴/ε² (log(t log n/ε))². (46)

We now use this result to derive an upper bound on the metric entropy of the set CDIFF(M∗, t, CSST(id)). Consider the following partition of the entries of any (n × n) matrix into (s(M∗))² submatrices. Submatrix (i, j) ∈ [s(M∗)] × [s(M∗)] in this partition is the (ki × kj) submatrix corresponding to the pairwise comparison probabilities between every item in the ith indifference set and every item in the jth indifference set of M∗. Such a partition ensures that each partitioned submatrix of M∗ is a constant matrix. Consequently, for any M ∈ CDIFF(M∗, t, CSST(id)), each partitioned submatrix belongs to the set of matrices CSST(t; [−1, 1]) (where we slightly abuse notation to ignore the sizes of the submatrices, as long as neither dimension exceeds n). The metric entropy of the set CDIFF(M∗, t, CSST(id)) can now be upper bounded by the metric entropies of the individual sets of submatrices, via an application of Lemma B.2 from Flammarion et al. [43]:

log N(ε, CDIFF(M∗, t, CSST(id)), |||·|||F)
 ≤ (s(M∗))² log(c1 t/ε) + (s(M∗))² log N(ε/3, CSST(t; [−1, 1]), |||·|||F)
 ≤ (s(M∗))² log(c1 t/ε) + c0 t² (s(M∗))² (log n)⁴/ε² (log(t log n/ε))²,

where the final inequality follows from the earlier bound (46). Simplifying this bound by substituting ε ≥ n⁻⁸ and t ≤ n yields the claimed result.

E. Proof of Corollary 1

We now prove the upper bound on the minimax risk of the CRL estimator under the SST model. The proof of this result closely follows the proof of Theorem 2. As in that proof, we assume without loss of generality that the true permutation of the n items is the identity permutation id. Let π_CRL = (π1, . . . , πn) denote the permutation obtained at the end of the second step of the CRL estimator. For this argument, we call upon Lemma 6 and parts of the proof of Theorem 2. We let Π denote the set of all permutations on n items which satisfy the two conditions in the statement of Lemma 6.

Consider any fixed permutation π ∈ Π, and given the observed data matrix Y, consider the estimator M̂_L(Y, π) as defined in equation (39) in the proof of Theorem 2. From the bound (36) in the proof of Theorem 2, we first obtain the deterministic inequality

|||M∗ − π(M∗)|||_F² ≤ 8n√n (log n)². (47)

Further, based on the inequalities (40) and (41) in the proof of Theorem 2, we have the inequality

|||M̂_L(M∗ + W, π) − M∗|||_F² ≤ 4|||M̂_L(π(M∗) + W, π) − π(M∗)|||_F² + 6|||π(M∗) − M∗|||_F². (48)

As a corollary of [12, Theorem 1], we also obtain the bound

P( |||M̂_L(M∗ + π⁻¹(W), id) − M∗|||_F² ≤ c n (log n)³ ) ≥ 1 − e^{−3n log n}.

(Note that although the statement of [12, Theorem 1] considers the expected loss, the entire proof actually provides the relevant tail bound; see [12, Equation 8b] and the discussion following the statement of [12, Theorem 1].) It follows that

P( |||M̂_L(π(M∗) + W, π) − π(M∗)|||_F² ≤ cu n (log n)³ ) ≥ 1 − e^{−3n log n}. (49)

In conjunction, the bounds (47), (48) and (49) imply that for any fixed permutation π ∈ Π,

P( |||M̂_L(M∗ + W, π) − M∗|||_F² ≤ cu n√n log n ) ≥ 1 − e^{−3n log n}.

Applying the union bound across all possible n! permutations of the n items, we obtain

P( max_{π ∈ Π} |||M̂_L(M∗ + W, π) − M∗|||_F² ≤ cu n√n log n ) ≥ 1 − e^{−2n log n}. (50)

Finally, recalling that Lemma 6 ensures that P[π_CRL ∈ Π] ≥ 1 − n⁻²⁰, we have established the claim.

F. Proof of Theorem 3

We now turn to the proof of the lower bound for polynomial-time computable estimators, as stated in Theorem 3. We proceed via a reduction argument. Consider any estimator that has Frobenius norm error upper bounded as

sup_{M∗ ∈ CSST} E[|||M̂ − M∗|||_F²] ≤ cu √n (n − kmax(M∗) + 1)(log n)⁻¹. (51)

We show that any such estimator defines a method that, with probability at least 1 − 1/√(log n), is able to identify the


presence or absence of a planted clique with √n/(log log n) vertices. This result, coupled with the upper bound on the risk of the oracle estimator established in Proposition 1, proves the claim of Theorem 3.

Our reduction from the bound (51) proceeds by identifying a subclass of CSST, and showing that any estimator satisfying the bound (51) on this subclass can be used to identify a planted clique in an Erdős-Rényi random graph. Naturally, in order to leverage the planted clique conjecture, we need the planted clique to be of size o(√n).

Our construction involves a partition with s = 3 components: a maximum indifference set of size k1 = n − 2k, with the remaining two indifference sets of sizes k2 = k3 = k. We choose the parameter k := √n/(log log n), so that any constant multiple of it lies within the hardness regime of planted clique (for sufficiently large values of n). Now let M∗0 be a matrix with all ones in the (k × k) sub-matrix in its top-right corner, zeros in the corresponding (k × k) sub-matrix in its bottom-left corner, and all other entries equal to 1/2. By construction, the matrix M∗0 has indifference sets of sizes (n − 2k, k, k).
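A small sketch (our own illustrative code) of the matrix M∗0, together with a check of its indifference-set sizes read off as groups of identical rows:

```python
import numpy as np
from collections import Counter

def planted_base_matrix(n, k):
    """M*_0: ones in the top-right (k x k) block, zeros in the bottom-left
    (k x k) block, and 1/2 everywhere else."""
    M = np.full((n, n), 0.5)
    M[:k, n - k:] = 1.0  # top-right block
    M[n - k:, :k] = 0.0  # bottom-left block
    return M

def indifference_set_sizes(M):
    """Sizes of the indifference sets: groups of identical rows of M."""
    return sorted(Counter(tuple(row) for row in M).values())
```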

For any permutation π on n/2 items and any (n × n) matrix M, define another (n × n) matrix Pπ(M) by applying the permutation π to:
• the first n/2 rows of M, and the last n/2 rows of M;
• the first n/2 columns of M, and the last n/2 columns of M.

We then define the set C̃SST := { Pπ(M∗0) | π a permutation on [n/2] }. By construction, every matrix in the set C̃SST has indifference sets of sizes (n − 2k, k, k).

For any estimator M̂ that satisfies the bound (51), we have

sup_{M* ∈ C_SST ∪ {(1/2)11ᵀ}} E[|||M̂ − M*|||_F^2] ≤ (c k √n)/(log n).

On the other hand, Markov's inequality implies that

E[|||M̂ − M*|||_F^2] ≥ (c k √n/√(log n)) · P[|||M̂ − M*|||_F^2 ≥ c k √n/√(log n)].

Combining the two bounds, we find that for every M* ∈ C_SST ∪ {(1/2)11ᵀ}, it must be that

P[|||M̂ − M*|||_F^2 < c k √n/√(log n)] ≥ 1 − 1/√(log n).   (52)

Now consider the set of (n/2 × n/2) matrices comprising the top-right (n/2 × n/2) sub-matrix of every matrix in the set C_SST ∪ {(1/2)11ᵀ}. We claim that this set is identical to the set of all possible matrices in the planted clique problem with n/2 vertices and a planted clique of size k. Indeed, the set contains the all-half matrix corresponding to the absence of a planted clique, as well as all symmetric matrices that have all entries equal to one half except for a (k × k) all-ones submatrix corresponding to the planted clique.

Consider the problem of testing the hypotheses of whether M* is equal to the all-half matrix ("no planted clique") or whether it lies in C_SST ("planted clique"). Let us consider a decision rule that declares the absence of a planted clique if |||M̂ − (1/2)11ᵀ|||_F^2 ≤ (1/16)k², and the presence of a planted clique otherwise.
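As a sanity check, the decision rule can be exercised on the two extreme cases. This is a hypothetical sketch in which the estimate M̂ is replaced by the true matrix itself; the name `detect_clique` is ours:

```python
import numpy as np

def detect_clique(M_hat, k):
    # Declare a planted clique iff the squared Frobenius distance of
    # M_hat from the all-half matrix exceeds k^2 / 16.
    dist_sq = np.sum((M_hat - 0.5) ** 2)
    return dist_sq > k ** 2 / 16

n, k = 12, 3
# Null case: the estimate equals the all-half matrix, at distance 0.
assert not detect_clique(np.full((n, n), 0.5), k)
# Planted case: M* has two k x k blocks whose entries differ from 1/2
# by 1/2 each, so its squared distance is 2 k^2 / 4 = k^2 / 2 > k^2 / 16.
M_star = np.full((n, n), 0.5)
M_star[:k, n - k:] = 1.0
M_star[n - k:, :k] = 0.0
assert detect_clique(M_star, k)
```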

a) Null case: On one hand, if there is no planted clique, meaning that M* = (1/2)11ᵀ, then the bound (52) guarantees that

|||M̂ − (1/2)11ᵀ|||_F^2 < c k √n/√(log n)   (53)

with probability at least 1 − 1/√(log n). Recalling that k = √n/(log log n), we have c k √n/√(log n) ≤ (1/16)k² for all sufficiently large n, and so our decision rule detects the absence of the planted clique with probability at least 1 − 1/√(log n).

b) Case of planted clique: On the other hand, if there is a planted clique (M* ∈ C_SST), then we have

|||M̂ − (1/2)11ᵀ|||_F^2 ≥ (1/2)|||M* − (1/2)11ᵀ|||_F^2 − |||M̂ − M*|||_F^2 = (1/4)k² − |||M̂ − M*|||_F^2.

Thus, in this case, the bound (52) guarantees that

|||M̂ − (1/2)11ᵀ|||_F^2 ≥ (1/4)k² − c k √n/√(log n),

with probability at least 1 − 1/√(log n). Since k = √n/(log log n), the right-hand side exceeds the threshold (1/16)k² for all sufficiently large n, and our decision rule successfully detects the presence of a planted clique with probability at least 1 − 1/√(log n).

In summary, under the planted clique conjecture, our decision rule cannot be computed in polynomial time. Since the decision rule can be computed in polynomial time given the estimator M̂, it must also be the case that M̂ cannot be computed in polynomial time, as claimed.

G. Proof of Theorem 4

We now prove lower bounds on the standard least-squares estimator. A central piece in our proof is the following lemma, which characterizes an interesting structural property of the least-squares estimator.

Lemma 8. Let M* = (1/2)11ᵀ and consider any matrix Y ∈ {0, 1}^{n×n} satisfying the shifted-skew-symmetry condition. Then the least squares estimator M̂_LS from equation (14) must satisfy the Pythagorean relation

|||Y − M*|||_F^2 = |||Y − M̂_LS|||_F^2 + |||M* − M̂_LS|||_F^2.

See Section IV-G1 for the proof of this claim.

Let us now complete the proof of Theorem 4 using Lemma 8. Our strategy is as follows: we first construct a "bad" matrix M̃ ∈ C_SST that is far from M* but close to Y. We then use Lemma 8 to show that the least squares estimate M̂_LS must also be far from M*.

In the matrix Y, let item ℓ be an item that has won the maximum number of pairwise comparisons, that is, ℓ ∈ arg max_{i∈[n]} Σ_{j=1}^{n} Y_ij. Let S denote the set of all items that are beaten by item ℓ, that is, S := {j ∈ [n]\{ℓ} | Y_ℓj = 1}. Note that card(S) ≥ (n − 1)/2, since the n(n − 1)/2 pairwise wins in Y are shared among n items and item ℓ attains the maximum. Now define a matrix M̃ ∈ C_SST with entries M̃_ℓj = 1 = 1 − M̃_jℓ for every j ∈ S, and all


remaining entries equal to 1/2. Some simple calculations then give

|||Y − M*|||_F^2 = |||Y − M̃|||_F^2 + |||M* − M̃|||_F^2, and   (54a)

|||M̃ − M*|||_F^2 ≥ (n − 1)/4.   (54b)

Next we exploit the structural property of the least squares solution guaranteed by Lemma 8. Together with the conditions (54) and the fact that |||Y − M̂_LS|||_F^2 ≤ |||Y − M̃|||_F^2, some simple algebraic manipulations yield the lower bound

|||M* − M̂_LS|||_F^2 ≥ (n − 1)/4.

This result holds for any arbitrary observation matrix Y ∈ {0, 1}^{n×n}, and consequently holds with probability 1 when the observation matrix Y is drawn at random.

In order to complete the proof, we must address one technical condition. The definition of the adaptivity index (see equation (6b)) excludes the all-half matrix. In order to circumvent this minor issue, we now set the matrix M* ∈ C_SST as

M*_ij = 1 if i < n and j = n; M*_ij = 0 if i = n and j < n; and M*_ij = 1/2 otherwise.

In words, item n is beaten by every other item with probability 1, whereas the remainder of the matrix, corresponding to comparisons between items 1 through (n − 1), is an all-half matrix. One can verify that the arguments above for the all-half matrix continue to hold in this setting, and yield a lower bound of the form

|||M* − M̂_LS|||_F^2 ≥ (n − 2)/4.   (55)

For ‖k‖∞ = n − 1, Proposition 1 yields an upper bound of c(log n)² on the oracle risk. Combining this upper bound with the lower bound (55) yields the claimed lower bound on the adaptivity index of the least squares estimator.

1) Proof of Lemma 8: From our earlier construction of M̃ in Section IV-G, we know that |||Y − M̂_LS|||_F ≤ |||Y − M̃|||_F < |||Y − M*|||_F, which guarantees that M̂_LS ≠ M*. Now consider the line

L(M*, M̂_LS) := {θM* + (1 − θ)M̂_LS | θ ∈ R}

that passes through the two points M* and M̂_LS. Given this line, consider the auxiliary estimator

M̂_1 ∈ arg min_{M ∈ L(M*, M̂_LS)} |||Y − M|||_F^2.   (56)

Since M̂_1 is the Euclidean projection of Y onto this line, it must satisfy the Pythagorean relation

|||Y − M*|||_F^2 = |||Y − M̂_1|||_F^2 + |||M* − M̂_1|||_F^2.   (57)

Let Π_[0,1] : R^{n×n} → [0, 1]^{n×n} denote the Euclidean projection of any (n × n) matrix onto the hypercube [0, 1]^{n×n}. This projection has a simple closed-form expression: it simply clips every entry of the matrix to lie in the unit interval [0, 1]. Since projection onto the convex set [0, 1]^{n×n} is non-expansive, we must have

|||Y − M̂_1|||_F^2 ≥ |||Π_[0,1](Y) − Π_[0,1](M̂_1)|||_F^2 = |||Y − Π_[0,1](M̂_1)|||_F^2.   (58)

Here the final equality follows since Y ∈ [0, 1]^{n×n}, and hence Π_[0,1](Y) = Y.

Furthermore, we claim that Π_[0,1](M̂_1) ∈ C_SST. In order to prove this claim, first recall that the matrix M̂_1 lies on the line L(M*, M̂_LS) and hence can be written as M̂_1 = (1 − θ)(M̂_LS − (1/2)11ᵀ) + (1/2)11ᵀ for some θ ∈ R. Consequently, if θ ≤ 1, then the rows/columns of the projected matrix Π_[0,1](M̂_1) obey the same monotonicity conditions as those of M̂_LS; conversely, if θ > 1, the rows/columns obey an inverted set of monotonicity conditions, again specified by the rows/columns of M̂_1. Moreover, since the two matrices M̂_LS and (1/2)11ᵀ satisfy shifted-skew-symmetry, so does the matrix M̂_1. One can further verify that any two real numbers a ≥ b must satisfy the inequalities

min(a, 1) ≥ min(b, 1), and max(a, 0) ≥ max(b, 0).

If, in addition, the pair (a, b) satisfies the constraint a + b = 1, then we have max(min(a, 1), 0) + max(min(b, 1), 0) = 1. Using these elementary facts, it can be verified that the monotonicity and shifted-skew-symmetry conditions of any matrix are retained by the projection Π_[0,1].
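These elementary facts are easy to verify numerically; the following sketch (our own illustration, not part of the paper) checks them on random pairs:

```python
import numpy as np

def clip01(x):
    # Entrywise projection onto [0, 1]: min(max(x, 0), 1).
    return min(max(x, 0.0), 1.0)

rng = np.random.default_rng(1)
for _ in range(1000):
    b = rng.uniform(-2.0, 3.0)
    a = b + rng.uniform(0.0, 2.0)   # enforce a >= b
    # Monotonicity is preserved by both truncations.
    assert min(a, 1) >= min(b, 1)
    assert max(a, 0) >= max(b, 0)
    # Shifted-skew-symmetric pairs remain so after clipping.
    a2 = rng.uniform(-2.0, 3.0)
    b2 = 1.0 - a2                   # enforce a2 + b2 = 1
    assert abs(clip01(a2) + clip01(b2) - 1.0) < 1e-12
```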

The arguments above imply that Π_[0,1](M̂_1) ∈ C_SST, and hence the matrix Π_[0,1](M̂_1) is feasible for the optimization problem (14). By the optimality of M̂_LS, we must have |||Y − Π_[0,1](M̂_1)|||_F^2 ≥ |||Y − M̂_LS|||_F^2. Coupled with the inequality (58), we find that

|||Y − M̂_1|||_F^2 ≥ |||Y − M̂_LS|||_F^2.

On the other hand, since M̂_LS is feasible for the optimization problem (56) and M̂_1 is its optimal solution, we must actually have

|||Y − M̂_1|||_F^2 = |||Y − M̂_LS|||_F^2,

so that M̂_LS is also optimal for the optimization problem (56). However, since the optimization problem (56) amounts to Euclidean projection onto a line, it has a unique minimizer, which implies that M̂_LS = M̂_1. Substituting this condition into the Pythagorean relation (57) yields the claimed result.
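The key geometric fact, that the Euclidean projection of Y onto a line through a point satisfies the Pythagorean relation (57), can be checked numerically on random matrices. This is an illustrative sketch with our own variable names (A plays the role of M*, B the role of M̂_LS):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
Y = rng.random((n, n))
A = rng.random((n, n))   # plays the role of M*
B = rng.random((n, n))   # plays the role of the least squares solution

# Euclidean projection of Y onto the line {theta*A + (1 - theta)*B}.
d = A - B
theta = np.sum((Y - B) * d) / np.sum(d * d)
M1 = theta * A + (1 - theta) * B

# Pythagorean relation: A lies on the line, and the residual Y - M1 is
# orthogonal to the line's direction d.
lhs = np.sum((Y - A) ** 2)
rhs = np.sum((Y - M1) ** 2) + np.sum((A - M1) ** 2)
assert abs(lhs - rhs) < 1e-9
```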

V. CONCLUSIONS

We proposed the notion of an adaptivity index to measure the ability of any estimator to automatically adapt to the intrinsic complexity of the problem. This notion yields a more nuanced evaluation of an estimator, one that is more informative than the classical notion of worst-case error. We provided sharp characterizations of the optimal adaptivity that can be achieved in a statistical (information-theoretic) sense, and of the adaptivity that can be achieved by computationally efficient estimators.

The logarithmic factors in our results arise from corresponding logarithmic factors in the metric entropy results of


Gao and Wellner [66], and understanding their necessity is an open question. In statistical practice, we often desire estimators that perform well in a variety of different senses. We believe that estimating SST matrices at the minimax-optimal rate in Frobenius norm, as studied in more detail in the paper [12], may be computationally difficult; this fundamental problem remains open. In this work, for simplicity, we also considered a setting where one has access to exactly one comparison per pair of items. It is of interest to extend the results to more general observation models, including those where only a subset of pairs are observed and/or multiple observations are available per pair. Finally, developing a broader understanding of the fundamental limits imposed by computational considerations in statistical problems is an important avenue for continued investigation.

VI. ACKNOWLEDGMENTS

This work was partially supported by NSF grants CIF 31712-23800, CIF 1763734 and CIF 175565, DMS-1713003, AFOSR grant FA9550-14-1-0016, and DOD Advanced Research Projects Agency grant W911NF-16-1-0552. The work of NBS was also supported in part by a Microsoft Research PhD fellowship.

The authors would like to thank the AE Emmanuel Abbe for the prompt and efficient handling of the paper, and the anonymous reviewers for their very valuable comments.

APPENDIX A
CRL IS MINIMAX-OPTIMAL FOR PARAMETRIC MODELS

In this section, we study the performance of the CRL estimator when applied to data drawn from parametric models. More precisely, in a parametric model, each item i ∈ [n] is associated with an unknown parameter w*_i ∈ R that represents the item's intrinsic value or quality. Given some (known) function F : R → [0, 1], pairwise comparison probabilities are then assumed to be generated via the equation

M*_ij = F(w*_i − w*_j) for each pair i, j ∈ [n].   (59)

The function F is assumed to be strictly increasing, and the weights are uniformly bounded as max_{i∈[n]} |w*_i| ≤ 1. Also let

c_{F,min} := min_{x∈[−2,2]} ∇F(x) and c_{F,max} := max_{x∈[−2,2]} ∇F(x).   (60)

We assume that c_{F,min} and c_{F,max} are positive constants that are strictly bounded away from zero.
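For concreteness, here is a minimal sketch of the model (59) with the logistic choice F(x) = 1/(1 + e^{−x}), our choice for illustration (the paper only assumes F strictly increasing). It verifies shifted-skew-symmetry and the SST row-domination structure of the resulting matrix:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parametric_matrix(w, F=sigmoid):
    # Pairwise-probability matrix M*_ij = F(w_i - w_j), as in (59).
    w = np.asarray(w, dtype=float)
    return F(w[:, None] - w[None, :])

w = np.array([0.9, 0.4, 0.1, -0.3, -0.8])   # weights sorted, |w_i| <= 1
M = parametric_matrix(w)

# Shifted skew symmetry: F(x) + F(-x) = 1 for the sigmoid.
assert np.allclose(M + M.T, 1.0)
# SST structure: with items sorted by weight, each row dominates the next.
assert np.all(M[:-1, :] >= M[1:, :] - 1e-12)
```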

In our past work [12], we showed that the set of all parametric models is a strict, and significantly smaller, subset of the set of SST matrices. We also proved that, under some technical assumptions on the function F, the minimax optimal rate (in squared Frobenius norm) for estimating pairwise-probability matrices over parametric models scales as Θ(1/n), matching the minimax rate of estimation over the SST class up to logarithmic factors. On the other hand, Theorem 2 and Corollary 1 in the present paper show that the CRL estimator achieves a risk upper bounded by order O(1/√n) over the entire SST model class. In the following proposition, we show that if the underlying data actually follows a parametric model, then the CRL estimator can achieve risk scaling as Θ(1/n), which is optimal up to logarithmic factors.

Proposition 2. For all matrices M* in any parametric model (59), the risk of the CRL estimator M̂_CRL is upper bounded as

(1/n²) E[|||M̂_CRL − M*|||_F^2] ≤ c (log² n)/n, where c := 4c²_{F,max}/c²_{F,min}.

An implication of this proposition is that for parametric models, up to logarithmic factors in n, the CRL estimator is minimax optimal and matches the statistical performance of the estimators based on parametric maximum likelihood analyzed in [12, Theorem 4]. The proof of this proposition follows from our analysis of the CRL estimator in the proof of Theorem 2 in the present paper. Independently, Chatterjee and Mukherjee [27] also investigate the adaptivity of a similar estimator to parametric models, and show that it attains an error of order O(1/n). The proof techniques employed in the paper [27] are, however, markedly different from ours.

In comparison to parametric estimators, there are two key benefits offered by the CRL estimator. First, unlike the parametric estimators, the CRL estimator does not need to know the function F. Second, the CRL estimator is more robust to model misspecification, with an error of at most O(1/√n) over the richer SST model, as proved in Corollary 1 in the present paper. This guarantee is significantly superior to the Ω(1) error incurred by estimators that fit any parametric model; this lower bound is proved in [12, Proposition 1].

The remainder of this section is devoted to a proof of Proposition 2. Our argument relies heavily on the proof of Theorem 2 presented earlier in the paper.

Proof of Proposition 2: Without loss of generality, we may assume that the permutation of the items in the matrix M* is the identity permutation. Let π_CRL denote the permutation obtained at the end of the randomization step in the CRL estimator. The first step in this analysis is to control the error in the output of the count and randomize steps.

Consider any pair of items i, j ∈ [n] such that i < j. This condition means that item i is preferred to item j in M*, and in particular that w*_i ≥ w*_j and M*_iℓ ≥ M*_jℓ for every ℓ ∈ [n]. Now suppose that the associated rows of the matrix M* satisfy the relation

Σ_{ℓ=1}^{n} (M*_iℓ − M*_jℓ) ≥ 2√n log n.   (61)

From the proof of Lemma 6 presented earlier, specifically from the argument of the bound (45), we obtain that for any pair of items i < j satisfying (61), the probability that the CRL estimator errs in their relative ordering is bounded as

P(π_CRL(i) > π_CRL(j)) ≤ n^{−23}.

A union bound over all pairs of items yields that this condition holds with probability at least 1 − n^{−21} simultaneously for all pairs satisfying the inequality (61). Conditioned on this event,


for every pair of items i < j, we have that either π_CRL correctly recovers their ordering, or

2√n log n > Σ_{ℓ=1}^{n} (M*_iℓ − M*_jℓ) = Σ_{ℓ=1}^{n} (F(w*_i − w*_ℓ) − F(w*_j − w*_ℓ)) ≥ c_{F,min} n (w*_i − w*_j).   (62)

For any such pair of items i, j, we then have

Σ_{ℓ=1}^{n} (M*_iℓ − M*_jℓ)² = Σ_{ℓ=1}^{n} (F(w*_i − w*_ℓ) − F(w*_j − w*_ℓ))² ≤ c²_{F,max} n (w*_i − w*_j)² ≤ 4c²_{F,max} log² n / c²_{F,min},

where the final inequality relies on the bound (62). Here, c_{F,min} and c_{F,max} are the F-dependent constants defined in equation (60). Aggregating this result across all pairs of items, we find that

P(|||M* − π_CRL(M*)|||_F^2 > c n log² n) ≤ n^{−21},   (63)

where c := 4c²_{F,max}/c²_{F,min}. Given the bound (63), the remainder of the proof proceeds in a manner identical to the analysis of the least squares step in the proof of Theorem 2.
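The mean-value sandwich underlying (62), and the squared bound that follows it, can be checked numerically for the logistic F. This is an illustrative sketch: the logistic choice is ours, and the constants from (60) are approximated on a grid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# F-dependent constants from (60), computed on a fine grid over [-2, 2];
# the endpoints are included, where the logistic derivative is smallest.
grid = np.linspace(-2.0, 2.0, 10001)
c_min, c_max = dsigmoid(grid).min(), dsigmoid(grid).max()

rng = np.random.default_rng(2)
n = 50
w = np.sort(rng.uniform(-1.0, 1.0, n))[::-1]   # weights in decreasing order
i, j = 3, 17                                    # any pair with w[i] >= w[j]
diffs = sigmoid(w[i] - w) - sigmoid(w[j] - w)   # M*_{i,l} - M*_{j,l}
gap = w[i] - w[j]

# Each difference equals F'(xi) * gap for some xi in [-2, 2], so the sum
# is at least c_min * n * gap and the sum of squares is at most
# c_max^2 * n * gap^2, exactly the bounds used around (62).
assert diffs.sum() >= c_min * n * gap - 1e-9
assert (diffs ** 2).sum() <= c_max ** 2 * n * gap ** 2 + 1e-9
```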

APPENDIX B
ROBUST GUARANTEES AND AN ORACLE INEQUALITY

Thus far, we have focused on analyzing the error and adaptivity index of various estimators in settings where the matrix underlying the pairwise comparisons is in the SST class, and have shown that we can obtain improved guarantees when the matrix has a large indifference set. In this section, we explore the robustness of our guarantees for the CRL algorithm to model misspecification. In particular, we show that even if the true matrix underlying the pairwise comparisons is not in the SST class or does not have a large indifference set, we can still obtain strong guarantees provided the true matrix is close, in the Frobenius norm, to a structured matrix in the SST class.

To simplify the analysis of the CRL estimator, we assume that we receive two independent copies of the matrix Y according to the model in (1). We denote these copies by Y^(1) and Y^(2). We compute an ordering through the Count step using Y^(1), and then perform the Least Squares step on the second copy Y^(2) using the ordering obtained from the first step. We no longer require the Randomize step (since we have two samples), and we instead define π_CRL to be the ordering obtained after the Count step.
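A minimal sketch of the Count step (the helper name `count_step` is ours): items are ordered by their number of wins, and in the noiseless, fully transitive case the true order is recovered exactly:

```python
import numpy as np

def count_step(Y):
    # Order items by their number of wins (row sums of Y), best first;
    # a stable sort breaks ties arbitrarily but deterministically.
    wins = Y.sum(axis=1)
    return np.argsort(-wins, kind="stable")

# Noiseless sanity check: a fully transitive tournament in which item i
# beats every item j > i is recovered exactly by counting wins.
n = 6
Y = np.triu(np.ones((n, n)), k=1)   # Y[i, j] = 1 iff i beats j (i < j)
assert np.array_equal(count_step(Y), np.arange(n))
```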

We note in passing that the analysis we present generalizes in a straightforward way to the setting where each independent copy Y^(1), Y^(2) has entries that are censored with probability 1/2 (see, for instance, [26], [12] for details); this only changes the constant factors in the results. In this case we obtain, in expectation, exactly n(n − 1)/2 independent comparisons, as in the single-observation setting studied in this paper. We show the following generalization of Theorem 2.

Theorem 5. There are universal constants c_u and c′_u such that for every M*, the CRL estimator M̂_CRL, with probability at least 1 − n^{−20}, has squared Frobenius norm error at most

(1/n²)|||M̂_CRL − M*|||_F^2 ≤ c_u inf_{M ∈ C_SST} [(1/n²)|||M − M*|||_F^2 + ((n − k_max(M) + 1)/n^{3/2})(log n)^8].   (64)

This result implies a smoothness property of the error guarantees of the CRL estimator: if two matrices M_1 and M_2 are close (say, in the Frobenius norm distance between them), then the errors |||M̂_CRL − M_1|||_F^2 and |||M̂_CRL − M_2|||_F^2 incurred by the CRL estimator when M* = M_1 and M* = M_2, respectively, are also close to each other.

Theorem 5 also implies, in a straightforward way, a minimax oracle inequality that generalizes Corollary 1.

Corollary 2. There is a positive universal constant c_u such that for every M*, the CRL estimator has squared Frobenius error upper bounded as

(1/n²)|||M̂_CRL − M*|||_F^2 ≤ c_u inf_{M ∈ C_SST} [(1/n²)|||M − M*|||_F^2 + (log n)^8/√n],   (65)

with probability at least 1 − 2n^{−20}.

For both Theorem 5 and Corollary 2, we emphasize that the matrix M* is no longer assumed to have any structure (beyond the restriction that it be a shifted-skew-symmetric matrix of probabilities). The CRL estimator is adaptive: it is able to take advantage of the SST and indifference-set structure when it exists, and its performance degrades gracefully otherwise. If the matrix M* is near any well-structured SST matrix M, then the CRL estimator retains strong guarantees.

We devote the remainder of this section to the proof of Theorem 5.

Proof of Theorem 5: Throughout this proof, we let M denote an arbitrary matrix in the SST class C_SST, and without loss of generality we assume that the true permutation corresponding to M is the identity permutation id. We begin with our analysis of the Count step of the estimator. Analogous to Lemma 6, we have the following result.

Lemma 9. With probability at least 1 − n^{−20}, the permutation π_CRL obtained at the end of the Count step of the estimator satisfies the property

max_{i∈[n]} Σ_{ℓ=1}^{n} |M*_iℓ − M*_{π_CRL(i)ℓ}| ≤ 2√n (log n)².   (66)

The proof of this result follows exactly the proof of Lemma 6 and is hence omitted. We now condition on the first sample Y^(1), fixing the permutation π_CRL, and apply the least squares procedure to Y^(2). Consider the linearized form (16a):

Y^(2) = M* + W^(2).


Recall the notation introduced earlier in (39) for the least squares estimator applied with a fixed permutation π_CRL:

M̂_L(Y^(2), π_CRL) := arg min_{M′ ∈ C_SST(π_CRL)} |||M′ − Y^(2)|||_F^2.

Our goal is to bound the error

|||M̂_CRL − M*|||_F^2 = |||M̂_L(Y^(2), π_CRL) − M*|||_F^2
= |||M̂_L(M* + W^(2), π_CRL) − M̂_L(π_CRL(M) + W^(2), π_CRL) + M̂_L(π_CRL(M) + W^(2), π_CRL) − π_CRL(M) + π_CRL(M) − M*|||_F^2
≤ 3|||M̂_L(M* + W^(2), π_CRL) − M̂_L(π_CRL(M) + W^(2), π_CRL)|||_F^2 + 3|||M̂_L(π_CRL(M) + W^(2), π_CRL) − π_CRL(M)|||_F^2 + 3|||π_CRL(M) − M*|||_F^2,

and we analyze each of these terms separately. Noting that the least squares estimator for a fixed permutation is a projection onto a convex set, we obtain

|||M̂_L(M* + W^(2), π_CRL) − M̂_L(π_CRL(M) + W^(2), π_CRL)|||_F^2 ≤ |||M* − π_CRL(M)|||_F^2.

For the second term, we note that it corresponds to an analysis of the least squares estimator in a well-specified case; using the argument leading to (38), we obtain

|||M̂_L(π_CRL(M) + W^(2), π_CRL) − π_CRL(M)|||_F^2 ≤ c(n − k_max(M) + 1)√n (log n)^8,   (67)

with probability at least 1 − e^{−min{(n−k*+1)², n(log n)³}}, where c is a positive universal constant. We thus obtain that, with probability at least 1 − e^{−min{(n−k_max(M)+1)², n(log n)³}},

|||M̂_CRL − M*|||_F^2 ≤ 6|||π_CRL(M) − M*|||_F^2 + 3c(n − k_max(M) + 1)√n(log n)^8
≤ 12|||M − M*|||_F^2 + 12|||M − π_CRL(M)|||_F^2 + 3c(n − k_max(M) + 1)√n(log n)^8.   (68)

The following result bounds the second term with high probability.

Lemma 10. When the permutation π_CRL satisfies (66), we have the following deterministic guarantee for any matrix M:

|||M − π_CRL(M)|||_F^2 ≤ 6|||M* − M|||_F^2 + 24√n (n − k_max(M)) log² n.

We prove this lemma below. Putting the result of this lemma together with the inequality (68), and using the union bound, establishes Theorem 5.

Proof of Lemma 10: Since M has an indifference set (i.e., a large set of identical rows) of size k_max(M), re-ordering its rows contributes to the Frobenius norm only when either the original row or its re-ordered counterpart falls outside this indifference set. Let us denote by I the set of such indices, and note that |I| ≤ 2(n − k_max(M)). Observe that

|||M − π_CRL(M)|||_F^2 ≤ Σ_{i∈I} Σ_{j∈[n]} (M_ij − M_{π_CRL(i)j})²
≤ 3 Σ_{i∈I} Σ_{j∈[n]} [(M_ij − M*_ij)² + (M_{π_CRL(i)j} − M*_{π_CRL(i)j})² + (M*_ij − M*_{π_CRL(i)j})²]
≤ 6|||M* − M|||_F^2 + 3 Σ_{i∈I} Σ_{j∈[n]} (M*_ij − M*_{π_CRL(i)j})².

Now, applying the condition (66), we obtain

|||M − π_CRL(M)|||_F^2 ≤ 6|||M* − M|||_F^2 + 24(n − k_max(M))√n log² n,

as desired.
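The counting bound |I| ≤ 2(n − k_max(M)) used at the start of the proof can be checked directly; this is an illustrative sketch with our own variable names:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 7
# A matrix whose last k rows are identical: an indifference set of size k.
M = np.vstack([rng.random((n - k, n)), np.tile(rng.random(n), (k, 1))])

pi = rng.permutation(n)
# Rows that actually change under the re-ordering i -> pi(i); a row can
# only change if it, or its replacement, lies outside the indifference
# set, so at most 2(n - k) indices contribute to the Frobenius norm.
I = [i for i in range(n) if not np.array_equal(M[i], M[pi[i]])]
assert len(I) <= 2 * (n - k)
```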

REFERENCES

[1] N. B. Shah, S. Balakrishnan, and M. J. Wainwright, "Feeling the Bern: Adaptive estimators for Bernoulli probabilities of pairwise comparisons," in IEEE International Symposium on Information Theory, 2016, pp. 1153–1157.

[2] F. Radlinski, M. Kurup, and T. Joachims, "How does clickthrough data reflect retrieval quality?" in ACM Conference on Information and Knowledge Management, 2008.

[3] R. Herbrich, T. Minka, and T. Graepel, "TrueSkill: A Bayesian skill rating system," in Advances in Neural Information Processing Systems, 2007.

[4] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright, "Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2049–2095, 2016.

[5] M. Braverman and E. Mossel, "Noisy sorting without resampling," in Proc. ACM-SIAM Symposium on Discrete Algorithms, 2008, pp. 268–276.

[6] A. Rajkumar, S. Ghoshal, L.-H. Lim, and S. Agarwal, "Ranking from stochastic pairwise preferences: Recovering Condorcet winners and tournament solution sets at the top," in International Conference on Machine Learning, 2015.

[7] N. B. Shah and M. J. Wainwright, "Simple, robust and optimal ranking from pairwise comparisons," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 7246–7283, 2017.

[8] R. Heckel, N. B. Shah, K. Ramchandran, and M. J. Wainwright, "Active ranking from pairwise comparisons and when parametric assumptions don't help," The Annals of Statistics (to appear), 2019.

[9] S. Negahban, S. Oh, and D. Shah, "Iterative ranking from pair-wise comparisons," in Advances in Neural Information Processing Systems, 2012, pp. 2474–2482.

[10] B. Hajek, S. Oh, and J. Xu, "Minimax-optimal inference from partial rankings," in Advances in Neural Information Processing Systems, 2014.

[11] S. Chatterjee, "Matrix estimation by universal singular value thresholding," The Annals of Statistics, vol. 43, no. 1, pp. 177–214, 2014.

[12] N. B. Shah, S. Balakrishnan, A. Guntuboyina, and M. J. Wainwright, "Stochastically transitive models for pairwise comparisons: Statistical and computational issues," IEEE Transactions on Information Theory, vol. 63, no. 2, pp. 934–959, 2017.

[13] D. H. McLaughlin and R. D. Luce, "Stochastic transitivity and cancellation of preferences between bitter-sweet solutions," Psychonomic Science, vol. 2, no. 1-12, pp. 89–90, 1965.

[14] D. Davidson and J. Marschak, "Experimental tests of a stochastic decision theory," Measurement: Definitions and Theories, pp. 233–269, 1959.

[15] T. P. Ballinger and N. T. Wilcox, "Decisions, error and heterogeneity," The Economic Journal, vol. 107, no. 443, pp. 1090–1105, 1997.

[16] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, pp. 324–345, 1952.


[17] R. D. Luce, Individual choice behavior: A theoretical analysis. NewYork: Wiley, 1959.

[18] L. L. Thurstone, “A law of comparative judgment,” PsychologicalReview, vol. 34, no. 4, p. 273, 1927.

[19] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard,“Wavelet shrinkage: asymptopia?” Journal of the Royal StatisticalSociety. Series B (Methodological), pp. 301–369, 1995.

[20] E. J. Candes, “Modern statistical estimation via oracle inequalities,” Actanumerica, vol. 15, pp. 257–325, 2006.

[21] V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization andSparse Recovery Problems. Springer Science & Business Media, 2011.

[22] T. T. Cai and M. G. Low, “A framework for estimation of convexfunctions,” Statistica Sinica, pp. 423–456, 2015.

[23] Q. Berthet and P. Rigollet, “Complexity theoretic lower bounds forsparse principal component detection,” in COLT, 2013.

[24] Z. Ma and Y. Wu, “Computational barriers in minimax submatrixdetection,” The Annals of Statistics, vol. 43, no. 3, pp. 1089–1116, 2015.

[25] M. J. Wainwright, “Constrained forms of statistical minimax: Compu-tation, communication and privacy,” in Proceedings of the InternationalCongress of Mathematicians, Seoul, Korea, 2014.

[26] N. B. Shah, S. Balakrishnan, and M. J. Wainwright, “A permutation-based model for crowd labeling: Optimal estimation and robustness,”arXiv preprint arXiv:1606.09632, 2016.

[27] S. Chatterjee and S. Mukherjee, “On estimation in tourna-ments and graphs under monotonicity constraints,” arXiv preprintarXiv:1603.04556, 2016.

[28] N. B. Shah, J. K. Bradley, A. Parekh, M. Wainwright, and K. Ramchan-dran, “A case for ordinal peer-evaluation in MOOCs,” in NIPS Workshopon Data Driven Education, Dec. 2013.

[29] J. Cambre, S. Klemmer, and C. Kulkarni, “Juxtapeer: Comparative peerreview yields higher quality feedback and promotes deeper reflection,”in Proceedings of the 2018 CHI Conference on Human Factors inComputing Systems. ACM, 2018, p. 294.

[30] A. Nguyen, C. Piech, J. Huang, and L. Guibas, “Codewebs: scalablehomework search for massive open online programming courses,” inProceedings of the 23rd international conference on World wide web.ACM, 2014, pp. 491–502.

[31] S. Chatterjee, A. Guntuboyina, and B. Sen, “On matrix estimation undermonotonicity constraints,” arXiv:1506.03430, 2015.

[32] ——, “On risk bounds in isotonic and other shape restricted regressionproblems,” arXiv preprint arXiv:1311.3765, 2013.

[33] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal componentanalysis,” Journal of computational and graphical statistics, vol. 15,no. 2, pp. 265–286, 2006.

[34] A. d’Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet, “Adirect formulation for sparse pca using semidefinite programming,” inAdvances in neural information processing systems, 2005, pp. 41–48.

[35] I. M. Johnstone and A. Y. Lu, “On consistency and sparsity for principalcomponents analysis in high dimensions,” Journal of the AmericanStatistical Association, vol. 104, no. 486, pp. 682–693, 2009.

[36] Q. Berthet, P. Rigollet et al., “Optimal detection of sparse principalcomponents in high dimension,” The Annals of Statistics, vol. 41, no. 4,pp. 1780–1815, 2013.

[37] E. Abbe and C. Sandon, “Recovering communities in the generalstochastic block model without knowing the parameters,” in Advancesin Neural Information Processing Systems, 2015, pp. 676–684.

[38] E. Abbe, “Community detection and stochastic block models: recentdevelopments,” arXiv preprint arXiv:1703.10146, 2017.

[39] Y. Chen and J. Xu, “Statistical-computational tradeoffs in plantedproblems and submatrix localization with a growing number of clustersand submatrices,” The Journal of Machine Learning Research, vol. 17,no. 1, pp. 882–938, 2016.

[40] L. Addario-Berry, N. Broutin, L. Devroye, G. Lugosi et al., “Oncombinatorial testing problems,” The Annals of Statistics, vol. 38, no. 5,pp. 3063–3092, 2010.

[41] X. Sun and A. B. Nobel, “On the maximal size of large-averageand anova-fit submatrices in a gaussian random matrix,” Bernoulli:official journal of the Bernoulli Society for Mathematical Statistics andProbability, vol. 19, no. 1, p. 275, 2013.

[42] M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh, “Minimax local-ization of structural information in large noisy matrices,” in Advancesin Neural Information Processing Systems, 2011, pp. 909–917.

[43] N. Flammarion, C. Mao, and P. Rigollet, “Optimal rates of statisticalseriation,” arXiv preprint arXiv:1607.02435, 2016.

[44] G. Bril, R. Dykstra, C. Pillers, and T. Robertson, “Algorithm AS 206: Isotonic regression in two independent variables,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 33, no. 3, pp. 352–357, 1984.

[45] T. Robertson, F. Wright, and R. L. Dykstra, Order Restricted Statistical Inference. New York: Wiley, 1988, vol. 229.

[46] R. Kyng, A. Rao, and S. Sachdeva, “Fast, provable algorithms for isotonic regression in all ℓp-norms,” in Advances in Neural Information Processing Systems, 2015, pp. 2701–2709.

[47] M. Jerrum, “Large cliques elude the Metropolis process,” Random Structures & Algorithms, vol. 3, no. 4, pp. 347–359, 1992.

[48] L. Kucera, “Expected complexity of graph partitioning problems,” Discrete Applied Mathematics, vol. 57, no. 2, pp. 193–212, 1995.

[49] A. Juels and M. Peinado, “Hiding cliques for cryptographic security,” Designs, Codes and Cryptography, vol. 20, no. 3, pp. 269–280, 2000.

[50] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie, “Testing k-wise and almost k-wise independence,” in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing. ACM, 2007, pp. 496–505.

[51] S. Dughmi, “On the hardness of signaling,” in IEEE Foundations of Computer Science (FOCS), 2014, pp. 354–363.

[52] U. Feige and R. Krauthgamer, “The probable value of the Lovász–Schrijver relaxations for maximum independent set,” SIAM Journal on Computing, vol. 32, no. 2, pp. 345–370, 2003.

[53] R. Meka, A. Potechin, and A. Wigderson, “Sum-of-squares lower bounds for planted clique,” arXiv preprint arXiv:1503.06447, 2015.

[54] Y. Deshpande and A. Montanari, “Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems,” arXiv preprint arXiv:1502.06590, 2015.

[55] E. Cator et al., “Adaptivity and optimality of the monotone least-squares estimator,” Bernoulli, vol. 17, no. 2, pp. 714–735, 2011.

[56] S. Chatterjee and J. Lafferty, “Adaptive risk bounds in unimodal regression,” arXiv preprint arXiv:1512.02956, 2015.

[57] P. C. Bellec, “Adaptive confidence sets in shape restricted regression,” arXiv preprint arXiv:1601.05766, 2016.

[58] R. J. Tibshirani, H. Hoefling, and R. Tibshirani, “Nearly-isotonic regression,” Technometrics, vol. 53, no. 1, pp. 54–61, 2011.

[59] S. van de Geer, Empirical Processes in M-estimation. Cambridge University Press, 2000, vol. 6.

[60] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local Rademacher complexities,” Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.

[61] V. Koltchinskii, “Local Rademacher complexities and oracle inequalities in risk minimization,” Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006.

[62] M. Ledoux, The Concentration of Measure Phenomenon, ser. Mathematical Surveys and Monographs. Providence, RI: American Mathematical Society, 2001.

[63] P.-M. Samson, “Concentration of measure inequalities for Markov chains and φ-mixing processes,” Annals of Probability, pp. 416–461, 2000.

[64] E. N. Gilbert, “A comparison of signalling alphabets,” Bell System Technical Journal, vol. 31, no. 3, pp. 504–522, 1952.

[65] R. Varshamov, “Estimate of the number of signals in error correcting codes,” in Dokl. Akad. Nauk SSSR, 1957.

[66] F. Gao and J. A. Wellner, “Entropy estimate for high-dimensional monotonic functions,” Journal of Multivariate Analysis, vol. 98, no. 9, pp. 1751–1764, 2007.

Nihar B. Shah is an Assistant Professor in the Machine Learning and Computer Science departments at CMU. He received his PhD from the EECS department at UC Berkeley in 2017. He is a recipient of the 2017 David J. Sakrison memorial prize from EECS Berkeley for a “truly outstanding and innovative PhD thesis”, the Microsoft Research PhD Fellowship 2014–16, the Berkeley Fellowship 2011–13, the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012, and the SVC Aiya Medal 2010. His research interests include machine learning, information theory, statistics, and game theory, with a current focus on applications to learning from people.


Sivaraman Balakrishnan is an Assistant Professor in the Department of Statistics and Data Science at Carnegie Mellon University. Prior to this he received his Ph.D. from the School of Computer Science at Carnegie Mellon University and was a postdoctoral researcher in the Department of Statistics at UC Berkeley. His Ph.D. work was supported by several fellowships including the Richard King Mellon Fellowship and a grant from the Gates Foundation. He is broadly interested in problems that lie at the interface between computer science and statistics. Some particular areas that have provided motivation for his past and current research include the applications of statistical methods in ranking problems, computational biology, clustering, topological data analysis, nonparametric statistics, robust statistics, and non-convex optimization.

Martin J. Wainwright is currently a Chancellor’s Professor at the University of California at Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences (EECS). He received a Bachelor’s degree in Mathematics from the University of Waterloo, Canada, and a Ph.D. degree in EECS from the Massachusetts Institute of Technology (MIT). His research interests include high-dimensional statistics, information theory, statistical machine learning, and optimization theory. He has been awarded an Alfred P. Sloan Foundation Fellowship (2005); Best Paper Awards from the IEEE Signal Processing Society (2008) and the IEEE Communications Society (2010); the Joint Paper Prize (2012) from the IEEE Information Theory and Communication Societies; a Medallion Lectureship (2013) from the Institute of Mathematical Statistics; a Section Lecture (2014) at the International Congress of Mathematicians; and the COPSS Presidents’ Award (2014) from the Joint Statistical Societies.