
ONLINE ACTIVE LEARNING WITH LINEAR MODELS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF MATHEMATICAL AND

COMPUTATIONAL ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Carlos Riquelme

May 2017


http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/rp382fv8012

© 2017 by Carlos Riquelme Ruiz. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ramesh Johari, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

John Duchi

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Johan Ugander

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

In this thesis we address online decision making problems where an agent needs to collect optimal

training data to fit statistical models. Decisions on whether to request the response of incoming

stochastic observations or not must be made on the spot, and, when selected, the outcomes are

immediately revealed. In particular, in the first part, we study scenarios where a single linear model

is estimated given a limited budget: only k out of the n incoming observations can be labelled. In the

second part, we focus on active learning in the presence of several linear models with heterogeneous

unknown noise levels. The goal is to estimate all the models equally well. We design algorithms to

efficiently solve the problems, extend them to sparse high-dimensional settings, and derive statistical

guarantees in all cases. In addition, we validate our algorithms with synthetic and real-world data,

where most of our technical assumptions are violated. Finally, we briefly explore active learning

in pure exploration settings for reinforcement learning. In this case, an unknown Markov Decision

Process needs to be learned under a fixed budget of episodes.


Uncertainty is a daisy whose petals you never finish plucking.

Mario Vargas Llosa


Preface

Most chefs would agree that cooking without the right ingredients is hard. Similarly, when learning

from data, observations of high informational value can enormously help. In scenarios where the

algorithm designer has some control over the data collection process, a careful training data selection

—known as active learning— may remarkably reduce the sample size required to achieve the desired

level of accuracy. In other words, we can learn faster; or better, if the sample size is fixed.

In Machine Learning, the raw cooking material is data, which we process to obtain knowledge,

information, actionable outputs, and —more generally— insights. In a future world where countless

algorithmic autonomous agents interact with the environment, learn, and make decisions all the time,

efficient and principled data acquisition leading to good data will be a substantially better alternative

to simply consuming big data. However, engineering the data collection process presents a set of

challenges. In particular, first, we need to understand what we do not know. In order to assess the

quality of future data (say, according to a given model), accurate uncertainty quantification seems

essential. Unfortunately, computing the relevant confidence regions is currently intractable for many

popular complex models. Second, even if we assume valid uncertainty measurements are available,

efficient exploration procedures are still needed to collect the most useful data in interactive systems.

In some practical cases, exploration requires tenacious adaptive planning.

In this thesis, our goal is, in fact, far more modest. We study a specific family of active learning

regression problems. There are two main aspects that characterize the problems: their online nature

—i.e., data points are sequentially offered to the agent, and a decision must be made on the spot—

and the underlying models are assumed to be linear. Under those assumptions, we design and

analyze efficient algorithms that exploit the structure of the data to minimize the final error of the

fitted model or models (in the case where we try to simultaneously learn several functions).


Acknowledgments

The act of sitting and thinking for a while about who-to-thank-for-what is, at least, surprisingly

insightful. Life goes by fast, and it is sometimes hard to connect the points. I feel extremely lucky.

I’m surrounded by amazing people who contributed to enriching my own perspectives in so many ways.

Ramesh Johari has been a fantastic mentor. He was welcoming, supportive, and friendly since

the very first time we met. During all these years, he encouraged me to freely explore as many topics

and research directions as I wanted, and also helped me to identify, understand, and overcome many

mistakes students and amateur researchers usually make. Most importantly, Ramesh gave me the

most useful present any student can be given: his honest feedback. Thanks so much, Ramesh!

The support and love of my family have been unconditional. Beautiful California is far from

what I call home, but I could still feel them very close. Thanks Mum, for showing me the world

when I was just a kid. Thanks Dad, for fighting hard for my education and values. Thanks Sofia,

for being an example of effort, perseverance, and transparency.

Tambor, your cheerfulness was always a source of inspiration. Your love, dedication, constant

smile, and patience made this work possible. Thanks.

Sven, we spent so many hours in front of a whiteboard with no clue of how to solve a problem,

or trying to find real-world applications to mathematically appealing settings. Ramon, we dreamed

together about so many projects, companies, and new worlds. Thanks to both for making me believe

we can and should work towards impactful goals.

My department at Stanford, the Institute for Computational and Mathematical Engineering,

gave me an opportunity that —for sure— changed my life. I would like to thank Margot Gerritsen,

Indira Choudhury, and Michael Saunders, for making it possible, and for their key daily work at

ICME. Thanks to the rest of the ICME family, including students and staff; a great community.


The committee for my Ph.D. defense and current thesis gave me quite useful comments and

extensive feedback. Thanks to John Duchi, Mohammad Ghavamzadeh, Mykel Kochenderfer, and

Johan Ugander for their time and support. In addition, during the Ph.D. I was so lucky to be

able to collaborate with some extraordinary researchers, like Alessandro Lazaric, Baosen Zhang, or

Siddhartha Banerjee. I did learn a lot from you, and really appreciate your time and patience.

My internship experiences definitely shaped my understanding of what matters in practice, and

helped me to improve my modeling skills. I owe a big thank-you to Eunyee Koh, my mentor at

Adobe Research, to Don Van der Drift, at Quora, also to Dominic DiPalantino and Reid Andersen,

my mentors at Twitter, and to Eytan Bakshy, my mentor at Facebook Research.

Life does not make sense without friends. My very dear friends from the AT, your affection and

encouragement guided me through the Ph.D. My UAM and GoB friends offered me the balance

I needed over the years. I’m really lucky to have friends like Marcos Lletget, Ariadna Sanz, Luis

García-Mon, and so many others. Thanks to each one of you!

Finally, I would not like to forget my mentors from before I landed at Stanford. Brian Stewart,

Dragan Vukotic, Jose Dorronsoro, and Miguel de Guzman, among many others, I deeply appreciate

your fundamental help at the very beginning of this exciting trip.

To all, again, thanks.


Contents

Abstract

Preface

Acknowledgments

1 Introduction
  1.1 Learning with Data; the Two Processes
  1.2 Active Learning
  1.3 Learning Linear Models
  1.4 Thesis Organization and Contributions

2 Learning One Model
  2.1 Introduction
  2.2 Problem Definition
  2.3 Algorithm and Main Results
    2.3.1 Thresholding Algorithm
    2.3.2 Unknown Distribution
    2.3.3 Adaptive Acceptance Region
    2.3.4 Main Theorem
    2.3.5 Sparsity and Regularization
    2.3.6 Proof of Theorem 1
  2.4 Lower Bound
  2.5 Misspecified Models
  2.6 Simulations
  2.7 Conclusion

3 Learning Several Models
  3.1 Introduction
  3.2 Problem Definition
  3.3 The Trace-UCB Algorithm
  3.4 High Dimensional Setting
  3.5 Simulations
  3.6 Conclusions

4 Directions for Future Work
  4.1 Motivation
  4.2 Bandits
  4.3 Problem Definition
  4.4 The Information Supermarket
  4.5 Information Maximization
  4.6 Adaptive Submodularity
  4.7 Conclusions

Concluding Remarks

A Statistical Learning Theory
  A.1 Main Generalization Bounds

B Proofs Chapter 2
  B.1 Whitening
  B.2 Proof of Theorem 1
  B.3 Proof of Tr(X⁻¹) ≥ Tr(Diag(X)⁻¹)
  B.4 Proof of Corollary 2
  B.5 CLT Approximation
  B.6 Proof of Theorem 3
  B.7 Proof of Theorem 4
  B.8 Proof of Corollary 5
  B.9 Proof of CLT Lower Bound
  B.10 Proofs for Misspecified Models
  B.11 Ridge Regression
  B.12 Simulations
    B.12.1 Linear Models
    B.12.2 Synthetic Non-Linear Data
    B.12.3 Regularization
    B.12.4 Real World Datasets

C Proofs Chapter 3
  C.1 Optimal Static Allocation
    C.1.1 Proof of Proposition 13
  C.2 Loss of an OLS-based Learning Algorithm (Proof of Lemma 14)
  C.3 Concentration Inequalities (Proofs of Propositions 15 and 16)
    C.3.1 Concentration Inequality for the Variance (Proof of Proposition 15)
    C.3.2 Concentration Inequality for the Trace (Proof of Proposition 16)
    C.3.3 Concentration Inequality for β̂ Estimates
    C.3.4 Bounded Norm Lemma
  C.4 Performance Guarantees for Trace-UCB
    C.4.1 Lower Bound on Number of Samples (Proof of Theorem 17)
    C.4.2 Regret Bound (Proof of Theorem 18)
    C.4.3 High Probability Bound for Trace-UCB Loss (Proof of Theorem 19)
  C.5 Loss of a RLS-based Learning Algorithm
    C.5.1 Distribution of RLS Estimates
    C.5.2 Loss Function of a RLS-based Algorithm
  C.6 Sparse Trace-UCB Algorithm
    C.6.1 Summary
    C.6.2 A Note on the Static Allocation
    C.6.3 Simultaneous Support Recovery
    C.6.4 High-Dimensional Trace-UCB Guarantees

D Proofs Chapter 4

Bibliography


List of Tables


List of Figures

  1.1 Interactive Learning.
  2.1 D has two independent components: white Gaussian and Uniform(−√3, √3).
  2.2 D has two independent components: white Gaussian and Laplace(0, 1/√2).
  2.3 Graphical description of Sparse Thresholding Algorithm (Algorithm 2).
  2.4 Sparse Linear Regression (700 iters). We fix the effective dimension to s = 7, and increase the ambient dimension from d = 100 to d = 500. The budget scales as k = Cs log d for C ≈ 3.4, while n = 4d. We set k1 = 2k/3 and k2 = k/3.
  2.5 MSE of β_OLS. The (0.05, 0.95) quantile conf. int. displayed. Solid median; dashed mean.
  3.1 White Gaussian synthetic data with m = 7. In Figures (a,b), we set n = 350. In Figures (c,d,e,f), we set d = 10.
  3.2 Real World Data. Median over 1000 simulations.
  4.1 Linear Bandits Simulations with d = 25, |X| = 100 arms, and σ² = 2.
  B.1 MSE of β_OLS; white Gaussian obs, (0.25, 0.75) quantile confidence intervals displayed in (a), (c).
  B.2 In (a), (b), MSE of β_OLS; N(0, Σ) data, (0.05, 0.95) conf. intervals.
  B.3 Model is y = Σ_i β_i x_i + Σ_i x_i².
  B.4 MSE of regularized estimators, λ = 0.01; white Gaussian obs. The (0.05, 0.95) confidence intervals in (a), and (0.25, 0.75) in (b).
  B.5 Combined Cycle Power (150 iters).
  B.6 Scatter Plots of Real World Datasets.


Chapter 1

Introduction

1.1 Learning with Data; the Two Processes.

Learning with data involves two main processes: data collection and model fitting.

The most commonly studied problem in Machine Learning consists in recovering an unknown

function f given a few (and possibly noisy) evaluations at different points D = {(Xi, f(Xi))}i. A

set of assumptions are usually made, so the problem is not hopeless. For example, we may assume

that f has certain smoothness properties, or that the points Xi were chosen in some particular

way. While these assumptions lead to convenient mathematical properties that let us derive nice

guarantees about specific algorithms, or even derive fundamental limits on what is possible and what

is not, it is essential that they hold —at least approximately— in those scenarios frequently found

in practice. Otherwise, our results and conclusions may not apply or be relevant at all.

There are countless success stories. When the input X represents an image, computer vision

systems have been trained to detect faces [39], pedestrians [92], license plates [102], cancer tumors

[99, 49], signatures [103], and many other objects. Financial institutions tried to learn whether a

given client will pay back a loan [59], or whether other institutions will go bankrupt [101, 65]. E-commerce

companies spent millions of dollars trying to improve click prediction models in the last two decades

[73]. Similar examples can be found in physics [10], education [61], politics [33], supply chain

management [17], marketing [21], sports [80], or healthcare [53].

Tasks with this flavor are referred to as Supervised Learning. The key aspect is that the data

D is given and fixed from the beginning. In addition, the algorithm designer usually commits to

a family of potential candidates F upfront —like linear models, neural networks, or trees of some

kind— and then picks the instance that seems the best fit given the observed data. In


some sense, then, supervised learning is a search problem. The problem of model fitting.

Conceptually, the job can be split into three different tasks. First, the data scientist needs to

define the family of models F where he thinks good candidates for f fall. The larger and more

expressive F is, the more likely it contains functions very similar to the true f . Obviously, there is

a trade-off, as search is harder for complex and huge families of models. Second, the data scientist

has to define goodness of fit. How do we compare two candidates f1, f2 ∈ F? In general, this is done via a score function s computable for any f and D, s(f, D). For example, s may take into account

how well f predicts points in D and, maybe, penalize complexity in f . Finally, the data scientist

must design and implement a computationally efficient search procedure to find the best (or, say, a

good) model in F according to s. All the decisions should be made jointly, as the third step may

become dramatically more or less expensive depending on F , s, and their interaction.

The main premise in supervised learning is that the data, D, is provided as an input. While

standard in practice, this is somewhat limiting: like a cook who wants to come up with new fine

recipes but is provided with only a few ingredients chosen by somebody else. Not only is more data always better (something most people acknowledge); the quality of data can also vary, and it plays

a fundamental role in final performance. Some data is better than other data in general, and some data is better than other data for fitting a specific family of models. A central question in the field of Machine Learning is how much data we need to achieve a certain accuracy under given circumstances, for example, with reasonable computational power. The study of sample complexity uses formal tools

from statistical learning theory and computational complexity theory. It turns out that, sometimes,

if we know in advance the family of models that we plan to fit, we can gather data specifically tailored

to those models —and, say, previous data that we have already seen— to significantly reduce the

required sample complexity. In other words, we can learn faster by collecting the right data.

As in supervised learning, we can cast the latter as a constrained optimization problem.

Formally, there is a set X of potential data points that we can query to obtain their possibly noisy

outcome in some space Y, and a cost function C : X ⊆ X → R+ that measures the cost of querying

any subset X of X . In addition, we have access to a score function s that provides, for each X, the

expected quality of the final model that could be fitted using X. Then, the problem is to find the

subset X* maximizing s(X) given a maximum budget B, i.e.,

X^* = \arg\max_{X \subseteq \mathcal{X}} s(X), \quad \text{s.t. } C(X) \le B. \qquad (1.1)
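To make (1.1) concrete, the following is a minimal sketch, not taken from this thesis, of a greedy heuristic for the static pool-based version of the problem under an additive cost; the names score_fn and cost_fn, and the greedy rule itself, are illustrative placeholders rather than the procedures studied in later chapters.

```python
from typing import Callable, List, Sequence

def greedy_budgeted_selection(
    pool: Sequence,                       # candidate points in the pool
    score_fn: Callable[[List], float],    # s(X): expected quality of a model fit on X
    cost_fn: Callable[[object], float],   # per-point cost; total cost is additive here
    budget: float,                        # maximum budget B
) -> List:
    """Greedily add the point with the largest marginal score gain per unit cost."""
    selected: List = []
    remaining = set(range(len(pool)))
    spent = 0.0
    while remaining:
        base = score_fn(selected)
        best_idx, best_gain = None, 0.0
        for i in remaining:
            cost_i = cost_fn(pool[i])
            if spent + cost_i > budget:
                continue  # this point is no longer affordable
            gain = (score_fn(selected + [pool[i]]) - base) / cost_i
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:
            break  # nothing affordable improves the score
        selected.append(pool[best_idx])
        spent += cost_fn(pool[best_idx])
        remaining.remove(best_idx)
    return selected
```

Greedy selection is only one possible heuristic for (1.1); submodularity-style arguments of the kind revisited in Chapter 4 are what typically justify it.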

Real-world applications oftentimes impose new constraints, or require a different —but neighboring—

setting. Let us briefly discuss two examples that fall in the previous problem definition.


Imagine there is a video streaming company whose business model consists of different types

of memberships where users pay a monthly fee to watch TV-shows and movies. After careful

consideration, the company decides to launch a new type of account. The account is free, so the user

does not pay, but every thirty minutes several kinds of audio and pop-up advertisements show up

interrupting the video being currently played. The platform has never shown ads before, and they

do not know how different users will interact with them. Ideally, the company would like to identify

those users that —under the new membership— would potentially generate a lot of revenue through

clicks. Then, the plan is to display a small pop-up only to the promising users, encouraging them to

switch account types. In order to train a model that estimates the revenue from user covariates, some

initial training data is required. The most natural way to obtain the data is by offering the service

with ads to a few initial users (that login or register for the first time, say). From their outcomes,

the platform will generalize the model predictions to the whole user database. The question is: how

should the company choose the beta testers? Is uniformly at random good enough? Or can they do

significantly better by following a principled approach?

Clinical trials provide a different motivating example. Suppose a research center has developed

a new treatment for cancer leading to promising results in mice. At some point, the next step is

to test the treatment with humans. One can imagine a setup where patients arrive sequentially

over time, volunteering to test the drug. For several reasons, the (ethical and economical) cost of

treating a single patient is high, and only a very small subset of the whole population can take part

in the experimental process. Each patient i can be represented as a vector Xi ∈ R^d encoding her characteristics, and Yi ∈ R may denote the health status of the patient three months after the start

of the treatment. The goal is to select patients in a way that, by the end of the clinical trial, the

predicted effect of the treatment in a new patient is as accurate as possible. In other words, to learn

as much as we can about the drug. This may certainly lead to situations where the patient’s and

the scientist’s interests are not aligned. If the whole process takes one or two years to complete,

a reasonable approach consists in spending the experimental budget somewhat evenly over time,

so that we can incorporate previous outcomes to our current beliefs. We can then take advantage

of partial models based on this data to help us quantify the uncertainty in different parts of the input space, and select new patients accordingly. Sometimes several different treatments are actually

tested together, and we need to adaptively allocate patients to the drugs. These sequential settings

do not directly fit into (1.1), although, conceptually, the underlying problem and goal are very similar.

Budget constraints can be stated in several different ways. In many cases, there is a total number

of data points that can be labeled, and it is assumed that they are equally costly. In terms of (1.1),

we have that C(X) = |X|. This is equivalent to bounding the number of experimental units. For

example, when we are able to label a number of images that may or may not contain a dog, or treat a


Figure 1.1: Interactive Learning.

maximum number of patients. Another standard way to define the budget is in terms of money. The

cost function then computes the total cost in an additive fashion, as C(X) = Σ_i p(x_i), where p(x_i) specifies the price of labeling x_i. Instead of setting a number of images to label, it may be easier to

allocate $1000 to spend on Mechanical Turk workers. Finally, when experiments are run in computational

simulators, the bottleneck for querying points may be the amount of time simulations require. For

example, if X denotes a complete configuration and architecture for a large neural network, then Y

could be its test error, and its computation may require a few hours as the network needs to be trained

first. In those cases, we define a maximum time budget B, and make sure that C(X) = Σ_i t(x_i) ≤ B, where t(x_i) denotes the time required to label x_i. Note that t(x_i) may be random.

Active Learning studies how to dynamically collect the data that is best for learning. Reasoning

with data involves three main components: data collection, model fitting, and decision making.

Instead of defining two different and independent processes for data collection and model fitting

respectively, active learning relies on the idea of iterating short batches where data is collected, a

model is fitted or a posterior distribution over models is computed, and then some decision making

rule is applied to guide the next data collection process (and, maybe, produce some external output

or decision along the way), see Figure 1.1. By means of adaptive data collection, we may be able to

refine the models much faster, as we can allocate our efforts to those parts of the space where new

information is needed the most.

In this thesis, we focus on online active learning with linear models. We first provide a brief

introduction to the field of Active Learning, together with a basic literature review. The main

differences with Passive or Supervised Learning are explained, and the fundamental questions and

answers-to-date are discussed. In addition, we describe and provide examples of online scenarios.

Finally, we address the use of linear models, and justify their importance despite the obvious abundance of non-linear data.

1.2 Active Learning

The Problem. Active Learning faces settings where access to input examples X ∈ X is cheap, but obtaining their outcome variable Y ∈ Y is quite expensive. Therefore, not all the input data points

can be labeled or queried, and the algorithm is given the freedom to (sequentially) choose its own


training data. Most algorithms try to identify points of high informational value.

Active Learning problems can be partitioned according to two different aspects of the problem.

The first one relates to how the input data is presented and accessed. There are two main settings.

In the pool-based case, the whole pool of candidate data points X is provided to the algorithm

in advance. A static setting forces the algorithm to choose a subset of X , and then returns all

their outcomes {YX : X ∈ X}. The more interesting dynamic setting lets the algorithm choose one point Xt ∈ X at time t from the known pool X, to immediately reveal its outcome Yt. After

observing Yt, the algorithm updates its beliefs regarding plausible models, and selects the next

point by maximizing some score or utility function (or, say, sampling accordingly if the algorithm is

randomized). Pool-based scenarios are the most commonly studied in the literature.

A di↵erent setting, known as stream-based or online active learning, assumes that the algorithm

sequentially receives a stream of input data points X1, . . . , Xn. At time t, the algorithm needs to

make a decision: whether to label Xt or not. If the algorithm decides to label Xt, then its outcome

will be returned right away. On the other hand, if the algorithm decides not to label Xt, then it

will not be able to request its label later. Clinical trials offer a good motivation for online settings. In pool-based scenarios, we may decide to label Xt only after the treatment has been applied and results observed for X1, . . . , Xt−1. One can imagine that terminal cancer patients can’t wait that

long. The general strategy for these algorithms consists in defining a region of acceptance Rt at

time t. The point is labeled if and only if Xt ∈ Rt, in which case the algorithm's beliefs and models are updated. Randomized algorithms may, instead, define a probability of acceptance p : X → [0, 1]

and sample the decision accordingly. Note that the online setting is harder than the pool-based one,

as there is an additional source of uncertainty: we do not know which points are going to arrive in

the future. In this thesis, we focus on online settings.
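The stream-based protocol just described can be summarized in code. The following is an illustrative skeleton, not an algorithm from this thesis; the fixed norm threshold standing in for the acceptance region Rt, and the stream interface, are assumptions of the sketch.

```python
import numpy as np

def stream_active_learning(stream, budget, threshold):
    """Stream-based (online) active learning loop with a norm-based acceptance region.

    stream: iterable of (x, query_label) pairs; calling query_label() reveals the outcome y.
    budget: maximum number of labels k.
    threshold: points whose norm exceeds this value are labeled (placeholder rule).
    """
    X_labeled, y_labeled = [], []
    beta_hat = None
    for x, query_label in stream:
        if len(X_labeled) >= budget:
            break                                # budget exhausted; remaining points pass by
        if np.linalg.norm(x) >= threshold:       # acceptance region R_t
            y = query_label()                    # outcome revealed only if the point is selected
            X_labeled.append(x)
            y_labeled.append(y)
            A = np.array(X_labeled)
            if A.shape[0] >= A.shape[1]:
                # refit the linear model on the labeled data collected so far (OLS)
                beta_hat, *_ = np.linalg.lstsq(A, np.array(y_labeled), rcond=None)
    return beta_hat
```

An adaptive acceptance region would instead update the threshold as observations arrive, which is the flavor of rule analyzed in Chapter 2.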

The second key aspect is whether we are solving a classification or regression problem. Not every

algorithm is simultaneously applicable to both output spaces, like the boundary-based methods. We

will focus on regression algorithms, while some ideas may be transferable to classification scenarios.

Finally, another distinction that is relevant when showing fundamental limits relates to whether our

family of models contains the true model (the realizable case) or not (agnostic learning). We mainly study

noisy realizable cases, while for the first part of the work the agnostic setting is also analyzed.

Fundamental Questions. One of the central questions in statistical learning theory is: how

much data do we need to learn accurately? In order to be able to answer the question, we have to

specify a few key aspects of the problem instance. In particular, we need to define what we mean by

“accurately”. As this is a subtle and extensive field, we added Appendix A where the main concepts


and definitions of statistical learning theory are presented and explained.

One of the most commonly used concepts to define when a problem is learnable is the PAC

framework, or Probably Approximately Correct. Assume we fix a loss function L that defines the

risk R of any hypothesis, concept or, more generally, model in our candidate class H. An algorithm

A learns class H in the PAC sense if for any distribution p* over X × Y ⊂ R^d × R, and any ε > 0, δ > 0, A takes n training examples and returns h ∈ H such that with probability at least 1 − δ, R(h) − R(h*) ≤ ε, where n = poly(d, 1/ε, 1/δ), and h* is the model with lowest risk in H (zero in

the realizable case, non-zero in the agnostic one). The class is learnable if there exists any such

algorithm. The sample complexity is given by n, and the n samples presented to the algorithm

are assumed to be i.i.d. examples from p*. The fundamental theorems of statistical learning theory

upper and lower bound n in terms of complexity measures of H.

A fundamental question for Active Learning is, can we decrease the data requirements (that is,

the value of n) when we can sample the data points adaptively and from a different distribution (i.e., not necessarily from p*)? By how much? And, under what circumstances?

Fundamental Answers. Several bounds have been established to tackle the general sample

complexity problem in supervised (or passive) learning, [90, 89]. Fundamental bounds on the rate

of convergence on learning processes have been derived in terms of a complexity measure of the

hypotheses class H, its VC dimension. When the VC dimension is infinite, then the hypothesis class

is not PAC learnable. Otherwise, when the VC dimension is finite and equal to d, and as long as

the number of i.i.d. data points n that we are given satisfies

n = O\left( \frac{1}{\epsilon^2} \left( d + \log\frac{1}{\delta} \right) \right), \qquad (1.2)

then we can guarantee that the excess risk is bounded with high probability, R(h) − R(h*) ≤ ε.

Note that this roughly corresponds to an error decreasing at rate O(√(d/n)). The previous sample complexity is for agnostic learning, while in realizable settings the 1/ε² factor can be relaxed to log(1/ε)/ε. Note that in the case of linear models in R^d, the VC dimension is d + 1, leading to

standard results in linear regression.

There is no equivalent unified answer for active learning settings yet. Analogously to the VC

dimension, a variety of complexity measures based on the data joint distribution, the hypothesis

class, and the true underlying function f , have been proposed. If the VC dimension relates to the

richness and expressivity of the function class H —formally, the maximum number of points it can

shatter—, then the corresponding active learning concepts quantify how quickly and easily the set of


plausible hypotheses can be shrunk. For instance, [23] defines the splitting index ρ, and [37] presents the disagreement coefficient to derive guarantees for the A2 algorithm [6].

In some simple cases in the realizable setting, it has been shown that we can obtain an ex-

ponential improvement with respect to the sample complexity VC bounds under passive learning,

given by (1.2). For example, [20] shows that in order to learn threshold classifiers on the real line, fa(x) = 1(x ≥ a), passive learning requires O((1/ε) log(1/ε)) data points while an active binary search algorithm only needs O(log(1/ε)) samples. Similarly, [29] considered the problem of learning a homogeneous linear separator for data uniformly sampled from the surface of the unit sphere in R^d in the online setting (with no noise). They propose a query-by-committee active learning algorithm that, after receiving O((d/ε) log(1/ε)) data points and labeling O(d log(1/ε)) points, achieves the same generalization error as a passive algorithm with O(d/ε) examples.¹
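To see where the exponential gap in this one-dimensional example comes from, here is a minimal sketch, not taken from [20], of noise-free active threshold learning by binary search: each queried label halves the interval that can contain a, so about log(1/ε) labels suffice, whereas a passive learner needs on the order of 1/ε random labeled points to locate a to the same precision.

```python
def active_threshold_search(label_of, eps, lo=0.0, hi=1.0):
    """Estimate the threshold a of f_a(x) = 1(x >= a) on [lo, hi] by binary search.

    label_of(x) returns the noise-free label of x; each call counts as one query.
    Returns an estimate within eps of a after about log2((hi - lo) / eps) queries.
    """
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        queries += 1
        if label_of(mid) == 1:   # mid is at or above the threshold, so a <= mid
            hi = mid
        else:                    # mid is below the threshold, so a > mid
            lo = mid
    return (lo + hi) / 2.0, queries

# Example: true threshold a = 0.37, target precision eps = 1e-3 (about 10 queries).
a_true = 0.37
estimate, n_queries = active_threshold_search(lambda x: int(x >= a_true), eps=1e-3)
```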

Unfortunately, [47] shows that in the non-realizable case², in general, no such exponential improvements should be expected. In particular, it is shown that at least Ω(1/ε²) samples are required by active learning algorithms, matching the upper bound for passive learning given in (1.2).

¹The O(·) notation hides multiplicative log d, log log 1/ε, and log 1/δ terms.
²Even in noise-free cases, where the true underlying function does not belong to the hypothesis class.

It is essential to remark on two important aspects of the previous results. First, most of these statements are worst-case results, that is, they hold for any joint distribution on X × Y. Therefore,

for data distributions commonly found in practice, the prospects may be better. Second, even

though exponential gains are certainly desirable, constant sample complexity savings can still have

a dramatic impact: making possible the use in practice of some technologies that under passive

learning would remain intractable. Despite negative theoretical results, there is room for optimism.

In particular, [47] highlights the fact that it has been empirically observed that active learning

techniques tend to be of great help and provide important speed-ups at the beginning of the data

collection process, but to offer diminishing returns as the sample size grows.

Algorithm Design Principles. We briefly describe the main algorithmic design ideas behind

a large fraction of the active learning algorithms. For an extensive introduction see [81].

1. Disagreement-Based. These methods maintain a set of plausible models at time t, say Ht,

i.e., the models statistically consistent with the data observed so far. Broadly speaking, the

algorithms then try to choose points that lead to a decrease in some measure of size of Ht

or its confidence ellipsoid. For example, points x ∈ X where models in Ht disagree, or those

minimizing future expected entropy of the hypothesis set (in a Bayesian framework).

Unfortunately, in order to provide theoretical guarantees, these algorithms tend to be extremely



conservative (i.e., in classification settings, by labeling points as soon as two plausible models

output different classes). Examples of disagreement-based algorithms are [7, 25, 54, 12]; a minimal code illustration of this principle is sketched right after this list.

2. Margin-Based. In classification scenarios, these types of methods compute a candidate model

at time t, say mt, which best summarizes its current beliefs. The idea is to try to find and

query points that are close to the decision boundary of mt, as those are the most uncertain

points. A potential issue with margin-based algorithms is that, by focusing on one decision

boundary, they are unable to find other decision boundaries. Moreover, sampling bias may

lead some of these methods to lack consistency. This can be somewhat alleviated by labeling

additional random samples with small but positive probability.

Margin-based methods are presented in [8, 9, 5, 100].

3. Cluster-Based. A natural idea for active learning and semi-supervised learning is to exploit

the unlabeled data structure to partition the space, and transfer the outcomes within a neigh-

borhood or cluster. In [24], the algorithm is provided a hierarchical clustering of the data as

input. Heterogeneous clusters are further partitioned and outcomes queried, whereas reason-

ably homogeneous clusters are completed by majority vote. Consistency is shown.

This is a promising area of research; auto-encoders have been successfully used to find a new

representation of the data on which applying active learning methods is, hopefully, easier [36].
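As a rough illustration of the first (disagreement-based) principle above, the following sketch, which is not from this thesis, scores each unlabeled candidate by the disagreement among a committee of plausible models, here bootstrap-fit linear models; the committee size and the use of the prediction spread as the disagreement measure are assumptions of the sketch.

```python
import numpy as np

def disagreement_scores(candidates, X_labeled, y_labeled, n_models=10, seed=0):
    """Score each candidate point by the prediction spread of a bootstrap committee.

    candidates: array of shape (m, d) of unlabeled points.
    X_labeled, y_labeled: data observed so far (arrays of shape (n, d) and (n,)).
    Returns an array of m scores; a larger score means the committee disagrees more.
    """
    rng = np.random.default_rng(seed)
    n = X_labeled.shape[0]
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)           # bootstrap resample of the labeled data
        beta, *_ = np.linalg.lstsq(X_labeled[idx], y_labeled[idx], rcond=None)
        predictions.append(candidates @ beta)
    return np.std(np.array(predictions), axis=0)   # spread across committee members

# The next point to query is the candidate with the largest score, e.g. candidates[np.argmax(scores)].
```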

1.3 Learning Linear Models

In many statistical learning problems we have access to observations X, and their outcomes, Y . The

most common goal is then to establish the relationship between X and Y . A natural way to model

the link is by assuming that there exists a function f such that

E[Y |X] = f(X).

Equivalently, under mean-zero homoscedastic additive noise, we can write that

Y = f(X) + ε.

As mentioned before, given n i.i.d. samples from (X, Y), the objective is to recover an approximate version of f, denoted by f̂. In this work, we constrain the family of functions for both f and f̂ to that of linear models. We assume there exists an unknown vector β ∈ R^d such that

Y = X^T β + ε. (1.3)


Therefore, the output of our algorithms will be an approximate version β̂, and hopefully ‖β̂ − β‖ will be small in an appropriate norm. Linear models have a long history in Statistics, and their

properties are well-understood [40]. While most people would agree that real-world data tends to

show non-linearities, linear models are still extremely common in practice. People fit and use linear

models every day. There are several reasons that help explain this:

1. Flexibility. The representational power of linear models can be easily enhanced by adding more

components to the vector of covariates X. The model in (1.3) imposes a linear relationship

in terms of the variables in X, while the noise can be seen to include the contribution of

those variables that were not modeled or observed. A natural way to explicitly account for

non-linearities in the original variables is by adding (polynomial) interactions. For example,

we may extend our model from X = (X1, X2) to X = (X1, X2, X1X2). Unfortunately, if no

prior knowledge helps prune the number of feasible interactions, their number grows quite fast. The number of possible interactions of order q using d variables is

\binom{d + q - 1}{q} \approx \frac{(q+1)^{d-1}}{(d-1)!}.

In addition, the flexibility of linear models may incentivize the use of variables that actually

have no impact on Y . These facts motivate the study of high-dimensional linear settings. We

study these scenarios in the present work, under different kinds of sparsity assumptions.

Other extensions are popular, like mixed effect models [97] and generalized linear models [64].

2. Interpretability. In general, there are more sophisticated families of models whose prediction

accuracy usually outperforms that of linear ones. Consequently, if the very only goal of the

algorithm designer is to train the best possible prediction engine, then linear models may not

be the right choice. However, often times, there is value in understanding the fitted model. For

most black-box approaches, this can actually be quite a hard task. On the other hand, linear

models o↵er a simple and intuitive explanation of how changes in X lead to changes in Y . It

is reasonably easy in general to give an interpretation to the coe�cients and signs assigned to

each variable. Moreover, under standard assumptions, there are statistically sound procedures

to construct confidence intervals and run hypothesis tests on those coe�cients. Obtaining

similar uncertainty measurements and guarantees may be di�cult for other families of models.

3. Simplicity. Due to their mathematical tractability, linear models have been extensively studied.

In parallel to theoretical understanding, many software packages have been developed and

polished to make the task of fitting linear models incredibly easy in practice. The number of

tuning parameters for algorithms fitting linear models is very small compared to some of its

non-linear competitors. In addition, linear models are computationally fast to fit, letting the

designer iterate quickly and test a variety of models in a short period of time.


4. Small Data Requirements. While powerful models may lead to better predictions, this usually

does not come for free: those models tend to require lots of data to be able to learn a large

number of parameters. In many practical scenarios, there is simply not so much data. Linear

models, however, do not require big data. Also, a remarkably successful line of research has

developed linear estimators like the LASSO [87] that incorporate variable selection, allowing rich

inference for high-dimensional linear models with very little data.

In the following chapters, we study two specific online active learning scenarios, where a stream

of observations is sequentially presented to us, and the underlying functions to be learnt are linear.

The analytical tractability of linear models allows us to derive closed-form expressions to measure

the variance of their unbiased estimators, and, more generally, to estimate the error of our approxi-

mations at each step. Then, we propose greedy algorithms driven by the approximations that are in

most cases near-optimal under our assumptions. Extensions to non-linear cases may be hard mainly

due to two reasons. First, obtaining uncertainty and error estimates may be far from trivial, and may

require relying on cross-validated quantities for intermediate fitted models. Computing the latter is

usually computationally expensive, and theoretical guarantees hard to derive. Second, while local

knowledge of a linear function straightforwardly generalizes globally, this is —in general— no longer

the case with non-linear functions. Therefore, purely greedy approaches may have to be modified to

take into account the value and likelihood of the current observation.
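To make the closed-form error estimates mentioned above concrete, here is a small sketch, an illustration rather than the algorithm of the following chapters: under model (1.3) with noise variance σ², the covariance of the OLS estimate is σ²(XᵀX)⁻¹, so quantities such as Tr((XᵀX)⁻¹) can be computed exactly and used to score how much labeling a candidate point would reduce the error.

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate beta_hat and the matrix (X^T X)^{-1} (its covariance up to sigma^2)."""
    gram_inv = np.linalg.inv(X.T @ X)
    return gram_inv @ X.T @ y, gram_inv

def trace_reduction(gram_inv, x):
    """Decrease in Tr((X^T X)^{-1}) if the point x were added to the design.

    Uses the Sherman-Morrison rank-one update, so no refit is required.
    """
    v = gram_inv @ x
    return (v @ v) / (1.0 + x @ v)

# Tiny usage example with synthetic data drawn from the linear model (1.3).
rng = np.random.default_rng(0)
d, n = 5, 50
beta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)
beta_hat, gram_inv = ols_fit(X, y)
candidates = rng.normal(size=(100, d))
scores = np.array([trace_reduction(gram_inv, x) for x in candidates])
best_candidate = candidates[np.argmax(scores)]   # labeling it most reduces the trace
```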

1.4 Thesis Organization and Contributions

As explained in the previous section, in the most general worst-case scenario Active Learning may

not provide rate improvements for sample complexity. Accordingly, a large fraction of the research

has been oriented towards finding specific settings and models where substantial benefits can be

achieved, and, also, towards understanding the common aspects of these settings.

In this thesis, we focus on two canonical and simple scenarios. First, we study the process of

collecting data for a single learning problem, where n data points sequentially arrive, and we can

query at most k of them. Second, we consider the scenario where we need to simultaneously solve

several learning problems. In this case, n data points sequentially arrive too, and each point must

be allocated to one of the problems. As discussed in the previous section, given their tractability

and flexibility we constrain our models to be linear. We also provide hints on how to proceed if the

underlying model is non-linear, a setting where a non-parametric approach could be appropriate,

for example, fitting Gaussian Processes.


In Chapter 2, we consider a decision maker with a limited experimentation budget who must

efficiently learn an underlying linear population model. Our main contribution is a novel threshold-based algorithm for selection of the most informative observations; we characterize its performance and

fundamental lower bounds. We extend the algorithm and its guarantees to sparse linear regression

in high-dimensional settings. Simulations suggest the algorithm is remarkably robust: it provides

significant benefits over passive random sampling in real-world datasets that exhibit high nonlinearity

and high dimensionality — significantly reducing both the mean and variance of the squared error.

Most of this chapter was published in AAAI 2017, in [74].

In Chapter 3, we explore the sequential decision making problem where the goal is to estimate

uniformly well a number of linear models, given a shared budget of random contexts independently

sampled from a known distribution. The decision maker must query one of the linear models for

each incoming context, and receives an observation corrupted by noise levels that are unknown,

and depend on the model instance. We present Trace-UCB, an adaptive allocation algorithm that

learns the noise levels while balancing contexts accordingly across the different linear functions, and

derive guarantees for simple regret both in expectation and with high probability. Finally, we extend the

algorithm and its guarantees to high dimensional settings, where the number of linear models times

the dimension of the contextual space is higher than the total budget of samples. Simulations with

real data suggest that Trace-UCB is remarkably robust, outperforming a number of baselines even

when its assumptions are violated.

Most of this chapter was published in ICML 2017, in [75].

In Chapter 4, we discuss future work. We would like to actively learn some more complicated

structures. In particular, we formulate a sequential decision making problem where an agent interacts

with an unknown MDP for a finite number of episodes T , fixed in advance. The agent’s goal is to

maximize the value of the final policy ⇡T under the true underlying MDP. First, we highlight

the di↵erences with the standard reinforcement learning cumulative-regret minimization goal, and

provide a family of MDPs where optimizing for one goal leads to arbitrary sub-optimal performance

for the other. Then, under a Bayesian framework, we briefly introduce algorithms that are centered

in information maximization. We also justify the promise of informational-greedy algorithms by

deriving guarantees for a toy-problem using adaptive submodularity.

Finally, we summarize the work, conclude, and discuss related active areas of current research.


Chapter 2

Learning One Model

2.1 Introduction

This chapter studies online active learning for estimation of linear models. As explained in the

previous chapter, active learning is motivated by the premise that in many sequential data collection

scenarios, labeling or obtaining output from observations is costly. Thus ongoing decisions must be

made about whether to collect data on a particular unit of observation.

As a motivating example, suppose that an online marketing organization plans to send display

advertising promotions to a new target market. Their goal is to estimate the revenue that can be

expected for an individual with a given covariate vector. Unfortunately, providing the promotion and

collecting data on each individual is costly. Thus the goal of the marketing organization is to acquire

first the most “informative” observations. They must do this in an online fashion: opportunities to

display the promotion to individuals arrive sequentially over time. In online active learning, this

is achieved by selecting those observational units (target individuals in this case) that provide the

most information to the model fitting procedure.

Linear models are ubiquitous in both theory and practice—often used even in settings where

the data may exhibit strong nonlinearity—in large part because of their interpretability, flexibility,

and simplicity. As a consequence, in practice, people tend to add a large number of features and

interactions to the model, hoping to capture the right signal at the expense of introducing some

noise. Moreover, the input space can be updated and extended iteratively after data collection if

the decision maker feels predictions on a held-out set are not good enough. As a consequence, oftentimes the number of covariates becomes higher than the number of available observations. In those

cases, selecting the subsequent most informative data is even more critical. Accordingly, our focus

is on actively choosing observations for optimal prediction of the resulting high-dimensional linear


models.

Our main contributions are as follows. We initially focus on standard linear models, and build the

theory that we later extend to high dimensional settings. First, we develop an algorithm that sequen-

tially selects observations if they have sufficiently large norm, in an appropriate space (dependent

on the data-generating distribution). Second, we provide a comprehensive theoretical analysis of

our algorithm, including upper and lower bounds. We focus on minimizing mean squared prediction

error (MSE), and show a high probability upper bound on the MSE of our approach (cf. Theorem

1). In addition, we provide a lower bound on the best possible achievable performance in high prob-

ability and expectation (cf. Section 2.4). In some distributional settings of interest we show that

this lower bound structurally matches our upper bound, suggesting our algorithm is near-optimal.

The results above show that the improvement of active learning progressively weakens as the

dimension of the data grows, and a new approach is needed. To tackle our original goal and address

this degradation, under standard sparsity assumptions, we design an adaptive extension of the

thresholding algorithm that initially devotes some budget to learn the sparsity pattern of the model,

in order to subsequently apply active learning to the relevant lower dimensional subspace. We find

that in this setting, the active learning algorithm provides significant benefit over passive random

sampling. Theoretical guarantees are given in Theorem 3.

Finally, we empirically evaluate our algorithm’s performance. Our tests on real world data show

our approach is remarkably robust: the gain of active learning remains significant even in settings

that fall outside our theory. Our results suggest that the threshold-based rule may be a valuable

tool to leverage in observation-limited environments, even when the assumptions of our theory may

not exactly hold.

Active learning has mainly been studied for classification; see [6, 25, 8, 96, 24]. For regression,

there is also extensive work, see [57, 84, 15] and the references within. A closely related work to

our setting is [79]: they study online or stream-based active learning for linear regression, with

random design. They propose a theoretical algorithm that partitions the space by stratification

based on Monte-Carlo methods, where a recently proposed algorithm for linear regression [44] is

used as a black box. It converges to the globally optimal oracle risk under possibly misspecified

models (with suitable assumptions). Due to the relatively weak model assumptions, they achieve a

constant gain over passive learning. As we adopt stronger assumptions (well-specified model), we

are able to achieve larger than constant gains, with a computationally simpler algorithm. Suppose

covariate vectors are Gaussian with dimension d; the total number of observations is n; and the

algorithm is allowed to label at most k of them. Then, we beat the standard �2d/k MSE to

Page 27: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 14

obtain �2d2/[kd+2(�� 1)k log k] when n = k�, so active learning truly improves performance when

k = ⌦(exp(d)) or � = ⌦(d). While [79] does not tackle high-dimensional settings, we overcome the

exponential data requirements via l1

-regularization.

The remainder of the chapter is organized as follows. We define our setting in Section 2.2. In

Section 2.3, we introduce the algorithm and provide analysis of a corresponding upper bound. Lower

bounds are given in Section 2.4. Simulations are presented in Section 2.6, and Section 2.7 concludes.

2.2 Problem Definition

The online active learning problem for regression is defined as follows. We sequentially observe n

covariate vectors in a d-dimensional space Xi 2 Rd, which are i.i.d. When presented with the i-th

observation, we must choose whether we want to label it or not, i.e., choose to observe the outcome.

If we decide to label the observation, then we obtain Y i 2 R. Otherwise, we do not see its label,

and the outcome remains unknown. We can label at most k out of the n observations.

We assume covariates are distributed according to some known distribution D, with zero mean

EX = 0, and covariance matrix ⌃ = EXXT . We relax this assumption later. In addition, we assume

that Y follows a linear model: Y = XT�⇤ + ✏, where �⇤ 2 Rd and ✏ ⇠ N (0,�2) i.i.d. We denote

observations by X,Xi 2 Rd, components by Xj 2 R, and sets in boldface: X 2 Rk⇥d,Y 2 Rk.

After selecting k observations, (X,Y), we output an estimate �k 2 Rd, with no intercept.1 Our

goal is to minimize the expected MSE of �k in ⌃ norm, i.e. Ek�k � �⇤k2⌃

, under random design;

that is, when the Xi’s are random and the algorithm may be randomized. This is related to the

A-optimality criterion, [71]. We use the experimentation budget to minimize the variance of �k by

sampling X from a di↵erent thresholded distribution. Under the OLS estimator, minimizing the

expected MSE is equivalent to minimizing the expected trace of the normalized inverse of the Fisher

information matrix XTX/�2, as

E[(Y �XT �k)2] = E[k�k � �⇤k2

] + �2

= �2 E⇥

Tr(⌃(XTX)�1)⇤

+ �2

where expectations are over all sources of randomness. In this setting, the OLS estimator is the best

linear unbiased estimator by the Gauss–Markov Theorem. For any set X,Y of k i.i.d. observations,

�k := �OLSk has sampling distribution �k | X ⇠ N (�⇤,�2(XTX)�1), [41]. In Section 2.3.5, we tackle

high-dimensionality, where k d, via Lasso estimators within a two-stage algorithm.

1We assume covariates and outcome are centered.

Page 28: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 15

2.3 Algorithm and Main Results

In this section we motivate the algorithm, state the main result quantifying its performance for

general distributions, and provide a high-level overview of the proof. A corollary for the Gaussian

distribution is presented, and we also extend the algorithm by making the threshold adaptive.

Finally, we show how to generalize the results to sparse linear regression. In Appendix B.5, we

derive a CLT approximation with guarantees, useful in complex or unknown distributional settings.

Without loss of generality, we assume that each observation is white, that is, E[XXT ] is the

identity matrix. For correlated observations X 0, we apply X := D�1/2UTX 0 to whiten them,

⌃ = UDUT (see Appendix B.1). Note that Tr(⌃(X0TX0)�1) = Tr((XTX)�1).

We bound the whitened trace as

d

�max

(XTX) Tr((XTX)�1) =

dX

i=1

1

�i(XTX) d

�min

(XTX). (2.1)

To minimize the expected MSE, we need to maximize the minimum eigenvalue of XTX with high

probability. The thresholding procedure in Algorithm 1 maximizes the minimum eigenvalue of XTX

through two observations. First, since the sum of eigenvalues ofXTX is the trace ofXTX, which is in

turn the sum of the norm of the observations, the algorithm chooses observations of large (weighted)

norm. Second, the eigenvalues of XTX should be balanced, that is, have similar magnitudes. This

is achieved by selecting the appropriate weights for the norm.

Let ⇠ 2 Rd+

be a vector of weights defining the norm kXk2⇠ =Pd

j=1

⇠jX2

j . Let � > 0 be a

threshold. Algorithm 1 simply selects the observations with ⇠-weighted norm larger than �. In

other words, it defines an acceptance region At = {X : kXk⇠ � �}, and labels the observation Xt

if and only if Xt 2 At. The selected observations can be thought as i.i.d. samples from an induced

distribution D: the original distribution conditional on kXk⇠ � �. Suppose k observations are

chosen using (⇠,�), and denoted by X 2 Rk⇥d. Then EXTX =Pk

i=1

EXiXiT =Pk

i=1

Hi = kH,

where H is the covariance matrix with respect to D. This covariance matrix is diagonal under

density symmetry assumptions, as thresholding preserves uncorrelation; its diagonal terms are

Hjj = E¯DX2

j = ED[X2

j | kXk⇠ � �] =: �j . (2.2)

A rich family of distributions that satisfies the density symmetry assumptions are the elliptical

distributions [52, 16], which includes multivariate normal and multivariate t-distributions.

Page 29: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 16

When a elliptical distribution has a density function f , its form is given by

f(x) = k · g �(x� µ)T⌃�1(x� µ)�

, (2.3)

where k is a scale parameter, µ is the mean vector, and ⌃ is positive definite and proportional to

the covariance matrix under suitable assumptions. In particular, note that f(x) = f(�x) when

µ = 0, and the probability of a point x only depends on its Mahalanobis distance to µ,⌃, defining

iso-density ellipsoids. Uncorrelation is preserved under thresholding for elliptical distributions as

E [Xi | X�i, kXk⇠ � �] =1

C 0

Z

x2i

>Cxi f(xi, X�i) dxi = 0, (2.4)

where C = (�2 �Pj 6=i ⇠jX2

j )/⇠i, and f is the whitened density. Note that f(z) = k · g(zT z), sof(xi, X�i) = f(�xi, X�i). Then, the statement follows by applying the law of total expectation.

As a consequence, �min

(EXTX) = kminj �j , and �max

(EXTX) = kmaxj �j . The main technical

result in Theorem 1 is to link the eigenvalues of the random matrix XTX to its deterministic counter

part EXTX. From the above calculations, the goal is to find (⇠,�) such that minj �j ⇡ maxj �j ,

and both are as large as possible. The first objective is achieved when there exists some � such that

ED[X2

j | kXk⇠ � �] = �j = �, for all j. (2.5)

When X has independent components with the same marginal distribution after whitening, then it

su�ces to choose ⇠j = 1 for all j. In particular, this corresponds to normalizing every component,

i.e., imagine X ⇠ D(0,⌃) where ⌃ = Diag(�2

i ) is the diagonal covariance matrix, and suppose

components are independent. Then, whitening X = ⌃�1/2X is equivalent to setting Xi = Xi/�i for

each component i = 1, . . . , d, so the Xi’s are iid. The joint Gaussian distribution is a special case,

as whitening removes existing dependencies, and we simply set ⇠j = 1. It is necessary to choose

unequal weights when the marginal distributions of the components are di↵erent, e.g., some are

Gaussian and some are uniform, or components are dependent. We discuss examples below.

2.3.1 Thresholding Algorithm

The algorithm is simple, see Algorithm 1 below. For each incoming observation Xi we compute its

weighted norm kXik⇠ (possibly after whitening if necessary). If the norm is above the threshold �,

then we select the observation, otherwise we ignore it. We stop when we have collected k observations.

Note that passive learning or random sampling is equivalent to setting � = 0.

Page 30: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 17

(a) Optimal ⇠ for Gaussian component. (b) Optimal �2.

Figure 2.1: D has two independent components: white Gaussian and Uniform(�p3,p3).

(a) Optimal ⇠ for Gaussian component. (b) Optimal �2.

Figure 2.2: D has two independent components: white Gaussian and Laplace(0, 1/p2).

We want to catch the k largest observations given our budget, therefore we require that � satisfies

PD (kXk⇠ � �) = k/n. (2.6)

Page 31: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 18

If we apply this rule to n independent observations coming from D, on average we select k of them:

the ⇠�largest. If (⇠,�) is a solution to (2.5) and (2.6), then (c ⇠,pc �) is also a solution for any

c > 0. So we requireP

i ⇠i = d. Algorithm 1 can be seen as a regularizing process similar to ridge

Algorithm 1 Thresholding Algorithm.

1: Set (⇠,�) 2 Rd+1 satisfying (2.5) and (2.6).2: Set S = ;.3: for observation 1 i n do4: Observe Xi.5: Compute Xi = D�1/2UTXi.6: if kXik⇠ > � or k � |S| = n� i+ 1 then7: Choose Xi: S = S [Xi.8: if |S| = k then9: break.

10: end if11: end if12: end for13: Return OLS estimate � based on observations in S.

regression, where the amount of regularization depends on the distribution D and the budget ratio

k/n; it improves the conditioning of the problem.

2.3.2 Unknown Distribution

In practice, D may not be completely known, and equations like (2.5) require the ability to compute

conditional expectations. In this section we briefly discuss what to do in those cases.

Recovering ⌃ in a online fashion should be easy (as n is large). Guarantees when ⌃ is unknown

can be derived as follows: we allocate an initial sequence of points to estimation of the inverse of the

covariance matrix, and the remainder to labeling (where we no longer update our estimate). In this

manner observations remain independent. Note that O(d) observations are required for accurate

recovery when D is subgaussian, while O(d log d) are required if the distribution is subexponential,

[91]. Errors by using the estimate to whiten and make decisions are bounded, small with high

probability (via Cauchy–Schwarz), and the result is equivalent to using a slightly worse threshold.

When not only ⌃ but the density of D is unknown too, it is not obvious how to find the weights

⇠ and the threshold �. As if all we had to estimate was ⌃, we could also devote some budget T of

observations, X1, . . . , XT , to approximate D and solve the system of equations. For example, we can

define and solve the following relaxed integer convex program, where the data (constants) is given

by Xij for i = 1, . . . , T , j = 1, . . . , d, and also k and n. Fix some user-defined weights c

1

, c2

, c3

> 0,

Page 32: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 19

then one possible program is as follows

minimize c1

dX

j=1

(�� �j)2 + c

2

1

T

TX

i=1

si � k

n

!

2

+ c3

TX

i=1

z2i (2.7)

�j =1

(k/n)T

TX

i=1

(Xij)

2 · si for j = 1, . . . , d (2.8)

ti =dX

j=1

⇠j(Xij)

2 � �2 for i = 1, . . . , T (2.9)

zi = ti/C � (si � 1/2) for i = 1, . . . , T (2.10)

dX

j=1

⇠j = d (2.11)

�,�1

, . . . ,�d, ⇠1, . . . , ⇠d,�2 � 0 (2.12)

s1

, . . . , sT 2 [0, 1], C = maxi2[T ]

kXik2. (2.13)

Let us explain the convex program step by step. Assume ⇠ 2 Rd and �2 > 0 are fixed. We need

to approximate the value of (2.5), the expected squared value of each component. In order to do

so, we define the si variables. Ideally, si denotes whether —under ⇠,�2— the i-th observation Xi

would be chosen. Thus, we would like si 2 {0, 1}, but we relax the integer constraint to 0 si 1.

If si states whether the i-th observation is chosen, then (2.8) defines the value of �j as the mean

squared value of the j-th component among the selected observations, i.e., under the approximated

D distribution. But, instead of normalizing by 1/P

i si, we assume the right amount of observations

have been selected, that is, (k/n)T of them. We define the objective function as a weighted sum of

di↵erent goals we would like to impose. First, we want (⇠,�2) to lead to even values of �j for all

j’s. We use a quadratic loss, and assign weight c1

to it. Second, we would like to satisfy (to some

extent) equation (2.6). In our terms, that impliesP

i si/T , the fraction of selected observations,

being close to k/n, our goal —as we implicitly assume in (2.8). The second term in (2.7) tries to

force that, with weight c2

. Finally, we need to make sure si truly represents whether an observation

was selected or not. Accordingly, we define ti, to be the gap between the weighted norm of Xi and

the threshold squared. We would like to have si = 0 if ti 0, and si = 1 when ti > 0. As we relaxed

the definition of si, at least we would like si to be close to zero and one respectively. If Xi was not

selected (ti < 0), si�1/2 should be negative. Otherwise, it should be positive. We would like, then,

ti and si � 1/2 to share the same sign. We define zi to be its di↵erence (after scaling ti), and add

a third term in the objective function. The term is the sum of squares, z2i . All the constraints are

linear, and the objective is convex.

Page 33: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 20

2.3.3 Adaptive Acceptance Region

Algorithm 1 keeps the threshold fixed from the beginning, leading to a mathematically convenient

analysis, as it generates i.i.d. observations. However, Algorithm 1b, which is adaptive and updates

its parameters after each observation, produces slightly better results, as we empirically show in

Appendix B.12. Before making a decision on Xi, Algorithm 1b finds (⇠i,�i) satisfying (2.5) and

PD

�kXik⇠i

� �i

=k � |Si�1

|n� i+ 1

, (2.14)

where |Si�1

| is the number of observations already labeled. The idea is identical: set the threshold

to capture, on average, the number of observations still to be labeled, that is k � |Si�1

|, out of thenumber still to be observed, n� i+ 1.

Algorithm 1 b Adaptive Thresholding Algorithm.

1: Set S = ;.2: for observation 1 i n do3: Observe Xi, estimate b⌃i = bUi

bDibUTi .

4: Compute Xi = bD�1/2i

bUTi Xi.

5: Let (⇠i,�i) satisfy (2.5) and (2.14).6: if kXik⇠

i

> �i or k � |S|=n� i+ 1 then7: Choose Xi: S = S [Xi.8: if |S| = k then9: break.

10: end if11: end if12: end for13: Return OLS estimate � based on observations in S.

Importantly, active learning not only decreases the expected MSE, but also its variance. Since

the variance of the MSE for fixed X depends onP

j 1/�j(XTX)2, see [41], it is also minimized by

selecting observations that lead to large eigenvalues of XTX. Our real-world data simulations clearly

show this phenomenon.

2.3.4 Main Theorem

Theorem 1 states that by sampling k observations from D where (⇠,�) satisfy (2.5), the estimation

performance is significantly improved, compared to randomly sampling k observations from the

original distribution. Section 2.4 shows the gain in Theorem 1 essentially cannot be improved and,

therefore, Algorithm 1 is near-optimal. A sketch of the proof is provided at the end of this section

(see Appendix B.2).

Page 34: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 21

Theorem 1. Let n > k > d. Assume observations X 2 Rd are distributed according to subgaussian

distribution D with covariance matrix ⌃ 2 Rd⇥d. Also, assume marginal densities are symmetric

around zero after whitening. Let X be a k⇥d matrix with k observations sampled from the distribution

induced by the thresholding rule with parameters (⇠,�) 2 Rd+1

+

satisfying (2.5). Let ↵ > 0, so that

t = ↵pk � C

pd > 0, then, with probability at least 1� 2 exp(�ct2)

Tr(⌃(XTX)�1) d

(1� ↵)2 �k, (2.15)

where constants c, C depend on the subgaussian norm of D.

While Theorem 1 is stated in fairly general terms, we apply the result to specific settings.

Obviously, the magnitude of the improvement depends on the value that � � 1 can achieve. We first

present the Gaussian case where white components are independent. The proof is in Appendix B.4.

Corollary 2. If the observations in Theorem 1 are jointly Gaussian with covariance matrix given by

⌃ 2 Rd⇥d, and we set ⇠j = 1 for all j = 1, . . . , d, and � = Cp

d+ 2 log(n/k), for some appropriate

constant C � 1, then with probability at least 1� 2 exp(�ct2) we have that

Tr(⌃(XTX)�1) d

(1� ↵)2⇣

1 + 2 log(n/k)d

k. (2.16)

By the inverse Wishart distribution, we know that the expected MSE of random sampling for

white Gaussian data is proportional to d/(k � d � 1). Thus, active learning provides a gain factor

of order 1/(1 + 2 log(n/k)/d) with high probability (a very similar 1 � ↵ term shows up for high

probability guarantees in the case of random sampling). Note that our algorithm may select fewer

than k observations. Then, when the number of observations yet to be seen equals the remaining

labeling budget, we should select all of them (equivalent to random sampling). The number of

observations with kXk⇠ > � has binomial distribution, is highly concentrated around its mean k,

with variance k(1� k/n). By the Cherno↵ Bounds, the probability that the algorithm selects fewer

than k � C 0pk decreases exponentially fast in C 0. Thus, these deviations are dominated in the

bound of Theorem 1 by the leading term. In practice, one may set the threshold in (2.6) by choosing

k(1 + ✏) observations for some small ✏ > 0, or use the adaptive threshold in Algorithm 1b.

2.3.5 Sparsity and Regularization

The gain provided by active learning in our setting su↵ers from the curse of dimensionality, as

it diminishes very fast when d increases, and Section 2.4 shows the gain cannot be improved in

Page 35: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 22

general. For high dimensional settings (where k d) we assume s-sparsity in �, that is, we assume

the support of � contains at most s non-zero components, for some s ⌧ d. In Appendix B.11, we

also provide related results for Ridge regression.

We state the two-stage Sparse Thresholding Algorithm (see Algorithm 2) and show this algorithm

e↵ectively overcomes the curse of dimensionality. For simplicity, we assume the data is Gaussian,

D = N (0,⌃). Based, for example, on the results of [88] and Theorem 1 in [46], we could extend

our results to subgaussian data via the Orthogonal Matching Pursuit algorithm for recovery. The

two-stage algorithm works as follows. First, we focus on recovering the true support, S = S(�), by

selecting the very first k1

observations (without thresholding), and computing the Lasso estimator

�1

. Second, we assign the weights ⇠: for i 2 S(�1

), we set ⇠i = 1, otherwise we set ⇠i = 0. For general

(non-Gaussian) distributions, the weights for the dimensions in the support need not be equal. Then,

we apply the thresholding rule to select the remaining k2

= k� k1

observations. While observations

are collected in all dimensions, our final estimate �2

is the OLS estimator computed only including

the observations selected in the second stage, and exclusively in the dimensions contained in S(�1

).

Note that, in general, the points that end up being selected by our algorithm are informational

outliers, while not necessarily geometric outliers in the original space. After applying the whitening

transformation, ignoring some dimensions based on the Lasso results, and then thresholding based

on a weighted norm possibly learnt from data (say, if components are not independent, and we

recover the covariance matrix in a online fashion), the algorithm is able to identify good points for

the underlying data distribution and �.

Theorem 3 summarizes the performance of Algorithm 2; it requires the standard assumptions on

⌃,�,�2, and mini |�i| for support recovery (see Theorem 3 in [94]).

Theorem 3. Let D = N (0,⌃). Assume ⌃,�,�2, and mini |�i| satisfy the standard conditions

given in Theorem 3 of [94]. Assume we run the Sparse Thresholding algorithm with k1

= C 0s log d

observations to recover the support of �, for an appropriate C 0 � 0. Let X2

be k2

= k � k1

observations sampled via thresholding on S(�1

). It follows that for ↵ > 0 such that t = ↵pk2

�Cps >

0, there exist some universal constants c1

, c2

, and c, C that depend on the subgaussian norm of

D | S(�1

), such that with probability at least

1� 2e�min(c2 min(s,log(d�s))�log(c1),ct2�log(2))

the true support is recovered, S(�1

) = S(�), and it holds that

Tr(⌃SS(XT2

X2

)�1) s

(1� ↵)2⇣

1 + 2 log(n2/k2)

s

k2

.

Page 36: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 23

Figure 2.3: Graphical description of Sparse Thresholding Algorithm (Algorithm 2).

The expected MSE (over training data) for random sampling with the Lasso estimator isO(s log d/k).

A regime of interest is s ⌧ d, k = C1

s log d, and n = C2

d, for large enough C1

, and some

C2

> 0. In that case, Algorithm 2 leads to a bound of order smaller than 1/ log(d) (note that

1/ log(d) < 1/ log(k)), as opposed to a weaker constant guarantee for random sampling. The gain

is at least a log d factor with high probability. The proof is in Appendix B.6. In practice, the

performance of the algorithm is improved by using all the k observations to fit the final estimate

�2

, as shown in simulations. However, in that case, observations are no longer iid. Also, using

thresholding to select the initial k1

observations decreases the probability of making a mistake in

support recovery. In Section 2.6 we provide simulations comparing di↵erent methods.

2.3.6 Proof of Theorem 1

The complete proof of Theorem 1 is in Appendix B.2. We only provide a sketch here. The proof

is a direct application of spectral results in [91], which are derived via a covering argument using

a discrete net N on the unit Euclidean sphere Sd�1, together with a Bernstein-type concentration

inequality that controls deviations of kXwk2

for each element w 2 N in the net. Finally, a union

bound is taken over the net. Importantly, the proof shows that if our algorithm uses (⇠,�) which are

approximate solutions to (2.5), then (2.15) still holds with minj E ¯DX2

j = minj �j in the denominator

of the RHS, instead of �. This fact can be quite useful in practice, when F is unknown. We can

devote some initial budget X1

, . . . , XT to recover F, and then find (⇠,�) approximately solving (2.5)

and (2.6) under F, as explained in previous sections. Note that no labeling is required.

Page 37: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 24

Algorithm 2 Sparse Thresholding Algorithm.

1: Set S1

= ;, S2

= ;. Let k = k1

+ k2

, n = k1

+ n2

.2: for observation 1 i k

1

do3: Observe Xi. Choose Xi: S

1

= S1

[Xi.4: end for5: Set � = 3/4,� =

p

4�2 log(d)/�2k1

.

6: Compute Lasso estimate �1

on S1

, regularization �.7: Set weights: ⇠i = 1 if i 2 S(�

1

), ⇠i = 0 otherwise.8: Set � = C

p

s+ 2 log(n2

/k2

).9: Factorize ⌃S(

ˆ�1)S(

ˆ�1)= UDUT .

10: for observation k1

+ 1 i n do11: Observe Xi 2 Rd. Restrict to Xi

S := XiS(

ˆ�1)2 Rs.

12: Compute XiS = D�1/2UTXi

S .13: if kXi

Sk⇠ > � or k2

� |S2

| = n� i+ 1 then14: Choose Xi

S : S2

= S2

[XiS .

15: if |S2

| = k2

then16: break.17: end if18: end if19: end for20: Return OLS estimate �

2

based on observations in S2

.

Also, the result can be extended to subexponential distributions. In this case, the probabilistic

bound will be weaker (including a d term in front of the exponential). More generally, our probabilis-

tic bounds are strongest when k � Cd log d for some constant C � 0, a common situation in active

learning [79], where super-linear requirements in d seem unavoidable in noisy settings. A simple

bound for the parameter � can be calculated as follows. Assume there exists (⇠,�) such that �j = �,

and consider the weighted squared norm Z⇠ =Pd

j=1

⇠jX2

j . Then E¯D [Z⇠] =

Pdj=1

⇠jE ¯D

X2

j

=Pd

j=1

⇠j�j = d�, and � = ED

Z⇠ | Z⇠ � �2

/d � �2/d = F�1

Z⇠

(1� k/n)/d, which implies that

1/�min

(EXTX) = 1/k� d/k�2. For specific distributions, �2/d can be easily computed. The last

inequality is close to equality in cases where the conditional density decays extremely fast for values

ofPd

j=1

⇠jX2

j above �2. However, heavy-tailed distributions allocate mass to significantly higher

values, and � could be much larger than �2/d.

2.4 Lower Bound

In this section we derive a lower bound for the k > d setting. Suppose all the data are given. Again

choose the k observations with largest norms, denoted by X0. To minimize the prediction error,

the best possible X0TX0 is diagonal, with identical entries, and trace equal to the largest sum of k

norms. No selection algorithm, online or o✏ine, can do better. Algorithm 1 tries to achieve this by

selecting observations with large norms and uncorrelated entries (through whitening if necessary).

Page 38: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 25

(a) Zooming out. (b) Zooming in.

Figure 2.4: Sparse Linear Regression (700 iters). We fix the e↵ective dimension to s = 7, andincrease the ambient dimension from d = 100 to d = 500. The budget scales as k = Cs log d forC ⇡ 3.4, while n = 4d. We set k

1

= 2k/3 and k2

= k/3.

Theorem 4 captures this intuition.

Theorem 4. Let A be an algorithm for the problem we described in Section 2.2. Then,

EA Tr(⌃(XTX)�1) � d2

Eh

Pki=1

||X(i)||2

i (2.17)

� d

k E⇥

1

d maxi2[n] ||Xi||2⇤ ,

where X(i) is the white observation with the i-th largest norm. Moreover, fix ↵ 2 (0, 1). Let F be

the cdf of maxi2[n] ||Xi||2. Then, Tr(⌃(XTX)�1) � d2/k F�1(1� ↵) with probability at least 1�↵.

The proof is in Appendix B.7. The upper bound in Theorem 1 has a similar structure, with

denominator equal to k�. By Theorem 1, � = ED[X2

j | kXk2⇠ � �2] for every component j.

Hence, summing over all components: k� = kE¯D

⇥kXk2/d⇤. The latter expectation is taken with

respect to D, which only captures the k expected ⇠-largest observations out of n, as opposed to

k ED[(1/k)Pk

i=1

||X(i)||2/d] in (2.17). The weights ⇠ simply account for the fact that, in reality, we

cannot make all components have equal norm, something we implicitly assumed in our lower bound.

We specialize the lower bound to the Gaussian setting, for which we computed the upper bound

of Theorem 1. The proofs are based on the Fisher-Tippett Theorem and the Gumbel distribution;

the proof details are provided in Appendix B.8.

Page 39: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 26

Corollary 5. For Gaussian observations Xi ⇠ N (0,⌃) and large n, for any algorithm A

EA Tr(⌃(XTX)�1) � d

k⇣

2 lognd + log log n

⌘ .

Moreover, let ↵ 2 (0, 1). Then, for any A with probability at least 1� ↵ and C = 2 log�(d/2)/d,

Tr(⌃(XTX)�1) � d/k2 logn

d + log log n� 1

d log log1

1�↵ � C

The results from Corollary 2 have the same structure as the lower bound; hence in this setting our

algorithm is near optimal. Similar results and conclusions are derived for the CLT approximation

in Appendices B.5 and B.9.

2.5 Misspecified Models

In this section, we briefly explore the more general case where the underlying function E[Y |X] is

not linear in X. In those cases, our algorithm is in general no longer optimal, and we measure the

error in terms of the non-linearity and its associated bias.

Let X ⇠ D, and define the best linear response as

�⇤ = �⇤(D) = arg min�2Rd

E[(XT� � Y )2].

The expectation is computed over X ⇠ D⌃

, and the noise randomness.

Importantly, note that if we change D, then in general �⇤ will change too.

As we assume that the linear model may be misspecified, we have that:

Y = XT�⇤ + Bias(X) + ✏(X),

where ✏ denotes the response noise, and Bias(X) the approximation error at X. Given X, Bias(X)

is deterministic, while ✏(X) is random and has zero mean.

We use the following decomposition of the error, where the data D is fixed. We define the set of

indices i 2 S, such that |S| = k, corresponding to observations Xi 2 D.

Page 40: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 27

Proposition 6. Error decomposition under misspecification if ⌃ is positive definite:

k�OLS � �⇤k2⌃

2

⌃1/2⌃�1⌃1/2

2

1

k

X

i2S

⌃�1/2Xi Bias(Xi)

2

(2.18)

+ 2

⌃1/2⌃�1⌃1/2

1

k

X

i2S

⌃�1/2Xi ✏(Xi)

2

. (2.19)

The proof is simple and given in Lemma 2 of [43]. Recall that ⌃ = XTX/k.

As mentioned above, we assume that ✏(X) does not depend on X. In particular, the noise is

iid, and �2-subgaussian for some �2 � 0. The second term in the right-hand side of Proposition 6

does not depend on the bias or the approximation error of the collected observations. As in the

well-specified case, we expect thresholded observations to help decrease the k⌃1/2⌃�1⌃1/2k term

faster than random observations from D, while the kPi2S ⌃�1/2Xi ✏(Xi)/kk term will behave

quite similarly in both cases. However, the behavior of the bias term fundamentally depends on the

distribution D and the true underlying function E[Y | X].

As in previous sections, we restrict D to the elliptical family of distributions, and assume D is

centered and has covariance matrix ⌃. According to our assumptions, random observations X ⇠ D

are such that E[⌃�1/2X] = 0 and Cov[⌃�1/2X] = Id. Thresholded observations X ⇠ D are such

that E[⌃�1/2X] = 0 and Cov[⌃�1/2X] = � Id for � > 1, as defined in (2.5). In other words, if

X ⇠ D, then we can write X = ⌃1/2W , so that Cov[W ] = � Id.

We assume the approximation error or bias cannot be arbitrarily large, see [43].

Assumption 1. Bounded approximation error.

There exists a constant BBias > 0 such that for all X 2 Rd

k⌃�1/2X Bias(X)k BBias

pd. (2.20)

We start by showing that X Bias(X) has mean zero under the original distribution D. Recall

that the definition of Bias(X) actually depends on D. Let {vj}dj=1

be an orthonormal basis of

eigenvectors of ⌃. It is easy to show that the optimal linear model is given by

�⇤ =X

j

�⇤j vj where �⇤

j =EX⇠D[(vTj X)Y ]

EX⇠D[(vTj X)2], (2.21)

by taking derivatives with respect to the squared prediction error.

Page 41: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 28

The following known result will be useful:

Proposition 7. For X ⇠ D, E[X Bias(X)] = 0.

Proof. The proof is based on the fact that, when X ⇠ D, for all w 2 Rd

E[(wTX)Y ] = E[(wTX)(�⇤TX)], (2.22)

which implies that

E[(wTX)(Y � �⇤TX)] = EX [(wTX)(E[Y |X]� �⇤TX)] = 0.

Then, we use that

E[X Bias(X)] = EX [X(E[Y |X]� �⇤TX)],

and take w = ei for each i.

Unfortunately, in general, if X ⇠ D 6= D, the previous expression is not zero (as the bias is

defined with respect to �⇤ = �⇤(D), the optimal solution for D). Therefore, in this case, we derive

a new result given in the following proposition. The proof is provided in Appendix B.10.

Proposition 8. For X ⇠ D where D = D⇠⇤,�⇤ is the thresholding distribution:

EX⇠ ¯D[X Bias(X)] = EX⇠ ¯D[X E[Y |X]]� � EX⇠D[X E[Y |X]]. (2.23)

We define ⌫ := EX⇠ ¯D[X Bias(X)] 2 Rd as given in Proposition (8).

We will use the following result, which is Lemma 15 in [43].

Lemma 9. Let X1, . . . , Xk be a martingale di↵erence vector sequence (i.e., E[Xi | X1, . . . , Xi�1] =

0 for all i), such that

kX

i=1

E[kXik2 | X1, . . . , Xi�1] v, and kXik b (2.24)

for all i = 1, . . . , k almost surely. Then, for all � 2 (0, 1),

P

kX

i=1

Xi

>pv⇣

1 +p

8 log(1/�)⌘

+ (4/3) b log(1/�)

!

�. (2.25)

Page 42: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 29

We can now derive a lemma that controls the error contribution that comes from the approxi-

mation error. Note that this is the main di↵erence —a notable one— with the well-specified case.

The proof, based on the previous lemma, is also in Appendix B.10.

Lemma 10. Suppose Assumption 1 holds. Fix � 2 (0, 1). Then, the term�

1

k

Pki=1

⌃�1/2Xi Bias(Xi)�

2

is upper bounded with probability at least 1� � by

k⌃�1/2⌫k2 + 2 EX⇠ ¯D[k⌃�1/2X Bias(X)k2]k

1 +

r

8 log1

!

2

+64

9k2

B2

Biasd+ k⌃�1/2⌫k2⌘

log21

�,

where ⌫ = EX⇠ ¯D[X E[Y |X]]� � EX⇠D[X E[Y |X]].

Importantly, Lemma 10 says that if ⌫ 6= 0, then k�OLS � �⇤k2⌃

will not go to zero regardless of

how much data we collect by thresholding, and the algorithm will not be consistent.

We also need to bound�

�⌃1/2⌃�1⌃1/2�

�, where thresholding will help.

Lemma 11. For � 2 (0, 1), after collecting k observations X ⇠ D, with probability at least 1� �

⌃1/2⌃�1⌃1/2

1

1

(1� ↵)2, (2.26)

where ↵ = Cp

d/k +p

log(2/�)/kc for universal constants c, C > 0, and � is defined in (2.5).

Proof. The proof is quite similar to that of Theorem 1, and is provided in Appendix B.10.

Finally, we need to bound�

1

k

P

i ⌃�1/2Xi ✏(Xi)

2

. The rate at which our bound for the latter

goes to zero will not be a↵ected by whether we use D or D, but by the noise level �2 and the number

of observations k. We directly use Lemma 5 (based on Lemma 14) in [43].

Lemma 12. Assume the noise is �2-subgaussian. Fix � 2 (0, 1). Conditioned on ⌃ being positive

definite, we have that with probability at least 1� �.

1

k

X

i

⌃�1/2Xi ✏(Xi)

2

�2

d+ 2p

d log(1/�) + 2 log(1/�)⌘

k. (2.27)

Page 43: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 30

(a) Protein Structure; 150 iters. (b) Bike Sharing; 300 iters. (c) YearPredictionMSD; 150 iters.

Figure 2.5: MSE of �OLS . The (0.05, 0.95) quantile conf. int. displayed. Solid median; Dashed mean.

Therefore, the error decomposition in Lemma 6, together with the bounds for its terms given in

Lemma 10, Lemma 11, and Lemma 12, o↵er guarantees on the estimation error in ⌃ norm with re-

spect to the best linear approximation, k�OLS��⇤k2⌃

, under the assumption that the approximation

error is upper bounded by some constant.

In addition, by means of (2.22), one can show that

E[(XT �OLS

� Y )2] = E[(XT�⇤ � Y )2] + k�OLS � �⇤k2⌃

. (2.28)

2.6 Simulations

We conducted experiments in various settings: regularized estimators in high-dimensions, and the

basic thresholding approach in real-world data to explore its performance on strongly non-linear

environments.

Regularized Estimators. We compare the performance in high-dimensional settings of ran-

dom sampling and Algorithm 1 —both with an appropriately adjusted Lasso estimator— against

Algorithm 2, which takes into account the structure of the problem (s ⌧ d). For completeness,

we also show the performance of Algorithm 2 when all observations are included in the final OLS

estimate —a sensible thing to do in practice—, and that of random sampling (RS) and Algorithm

1 (Thr) when the true support S is known in advance, and the OLS computed on S. Note that

the last two benchmarks heavily rely on information that is not available in practice. In Figure 2.4

(a), we see that Algorithm 2 dramatically reduces the MSE, while in Figure 2.4 (b) we zoom-in to

see that, quite remarkably, Algorithm 2 using all observations for the final estimate outperforms

random sampling that knows the sparsity pattern in hindsight. We used k1

= (2/3)k for recovery.

More experiments are provided in Appendix B.12.

Page 44: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 31

Real-World Data. We show the results of Algorithm 1b (including online ⌃ estimation) with

the simplest distributional assumption (Gaussian threshold, ⇠j = 1) versus random sampling on

publicly available real-world datasets (UCI, [60]), measuring test squared prediction error. We fix a

sequence of values of n, together with k =pn, and for each pair (n, k) we run a number of iterations.

In each one, we randomly split the dataset in training (n observations, random order), and test (rest

of them). Finally, �OLS

is computed on selected observations, and the prediction error estimated

on the test set. All datasets are initially centered to have zero means (covariates and response).

Confidence intervals are provided.

We first analyze the Physicochemical Properties of Protein Tertiary Structure dataset (45730

observations), where we predict the size of the residue, based on d = 9 variables, including the total

surface area of the protein and its molecular mass. Figure 2.5 (a) shows the results; Algorithm 1b

outperforms random sampling for all values of (n, k). The reduction in variance is substantial. In

the Bike Sharing dataset [28] we predict the number of hourly users of the service, given weather

conditions, including temperature, wind speed, humidity, and temporal covariates. There are 17379

observations, and we use d = 12 covariates. Our estimator has lower mean, median and variance

MSE than random sampling; as shown in Figure 2.5 (b). Finally, for the YearPredictionMSD dataset

[11], we predict the year a song was released based on d = 90 covariates, mainly metadata and audio

features. There are 99799 observations. The MSE and variance did strongly improve; Figure 2.5 (c).

In the examples we see that, while active learning leads to strong improvements in MSE and

variance reduction for moderate values of k with respect to d, the gain vanishes when k grows large.

This was expected; the reason might be that by sampling so many outliers, we end up learning

about parts of the space where heavy non-linearities arise, which may not be important to the test

distribution. However, the motivation of active learning are situations of limited labeling budget,

and hybrid approaches combining random sampling and thresholding could be easily implemented

if needed.

2.7 Conclusion

This chapter provides a comprehensive analysis of thresholding algorithms for online active learning

of linear regression models, which are shown to perform well both theoretically and empirically.

Several natural open directions suggest themselves. Additional robustness could be guaranteed

in other settings by combining our algorithm as a “black box” with other approaches: for example,

some addition of random sampling or stratified sampling could be used to determine if significant

Page 45: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 2. LEARNING ONE MODEL 32

nonlinearity is present, and to determine the fraction of observations that are collected via thresh-

olding. A di↵erent approach to deal with non-linearity consists in fitting a non-parametric model,

like a Gaussian Process. The idea would be to take into account the current uncertainty at the

region of the space where the point falls, and if the uncertainty is above some threshold �t, then the

point is selected, labeled, and the model updated.

Page 46: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

Chapter 3

Learning Several Models

3.1 Introduction

In this chapter, we study the problem faced by a decision-maker that wants to estimate a number

of regression problems equally well (i.e., have a small prediction error in each of them), and it

has to adaptively allocate a limited budget of samples to the problems to gather information and

improve its estimates. Two aspects of the problem formulation are key, and drive the algorithm

design. First, the observations collected from each regression problem depend on side information

(i.e., contexts) and we model the relationship between contexts X 2 Rd and observations Y as a

linear function with unknown parameters �i 2 Rd, which are specific to each problem i. Second,

the “hardness” of learning each parameter �i is unknown in advance, and it may vary across the

problems. In particular, we assume the response observations are corrupted by noise levels that are

problem-dependent, and must be learned too.

This scenario may arise in a number of di↵erent domains where a fixed experimentation budget

(i.e., an amount of samples) can be collected over di↵erent problems. For instance, consider a MOOC

platform that plans to o↵er a new degree in Computer Science. Students from all around the world

sign up, and, as their backgrounds are quite diverse, the platform would like to be able to predict the

performance of each student across di↵erent subjects in order to plan in advance which students need

to take refresher courses before starting. In this case, a student is a context X whose performance Y

depends on the specific subject i through a linear function with parameter �i (i.e., Y ⇡ XT�i). The

platform may decide to start an experimentation phase where n students are assigned automatically

a test in one of a few di↵erent subjects with the objective of estimating the parameters �i accurately

so that, at the end of the experiment, the expected performance of each new student X can be

predicted reliably for any subject. Since the parameters �i and the noise levels are unknown in

advance, this requires deploying an adaptive method that allocates the n students smartly so as to

33

Page 47: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 3. LEARNING SEVERAL MODELS 34

have accurate estimates for all subjects by the end of the experiment. Notice that while in general n

may be relatively small since collecting samples may be expensive, the distribution of the contexts

X (i.e., the type of students subscribing to the MOOC) can be easily estimated in advance.

While in the previous example di↵erent treatments are not directly comparable, another broad

family of examples arises in situations where a company wants to predict how promising a given

product is for a given user. The company tests a set of prototypes in order to decide which ones

should eventually make it to production, in such a way that the preferences of a wide range of

customers are covered. Importantly, the final costs associated to each product may not be known

at prototyping time (for example, manufacturing costs may depend on the size of the population

of potential customers for the product), and, therefore, the company would like to rely on accurate

estimates for all of them, so they can make informed decisions after the exploration stage. In these

cases, companies usually allocate a fixed budget to produce a first version of the alternatives and

conduct interviews with beta-testers.

This setting is clearly related to the problem of pure exploration and active learning in multi-

armed bandits [3], where the learner wants to estimate the mean of a finite set of arms by allocating

a finite budget of n pulls. [3] first introduced this setting where the objective is to minimize the

largest mean square error (MSE) in estimating the value of each arm. While the optimal solution

is trivially to allocate the pulls proportionally to the variance of each arm, when the variances are

not known an exploration-exploitation dilemma arises, where variances must be estimated at the

same time as the value of the arms to allocate pulls wherever they are more needed (i.e., arms

with high variance). [3] proposed a forcing algorithm where all arms are pulled at leastpn times

before allocating pulls proportionally to the estimated variances. They derive bounds on the regret

measuring the di↵erence between the MSE of the learning algorithm and the MSE of an optimal

allocation showing that the regret decreases as O(n�3/2). A similar result is obtained by [18] that

proposed two algorithms that use upper confidence bounds on the variance to estimate the MSE of

each arm and select the arm with the largest MSE at each step. When the arms are embedded in Rd

and their mean is a linear combination with an unknown parameter, then the problem becomes an

optimal experimental design problem [70], where the objective is to estimate the linear parameter

and minimize the prediction error over all arms (see e.g., [22, 78]). In this chapter, we consider an

orthogonal extension to the original problem where a finite number of linear regression problems

is available (i.e., the arms) and random contexts are observed at each time step. Similarly to the

setting by [3], we assume each problem is characterized by a noise level with di↵erent variance and

the objective is to return regularized least-squares estimates (RLS) with small prediction error (i.e.,

MSE). While we leverage on the solution proposed by [18] to deal with the unknown variances, in our

setting the presence of random contexts make the estimation problem considerably more di�cult. In

Page 48: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 3. LEARNING SEVERAL MODELS 35

fact, the MSE in one specific regression problem is not only determined by the variance of the noise

and the number of samples used to compute the RLS estimate but also on the contexts observed

over time.

Our main contributions are as follows. First, we propose Trace-UCB, an algorithm that simul-

taneously learns the “hardness” of each problem, allocates observations proportionally to these esti-

mates, and balances contexts across problems. We derive performance guarantees for Trace-UCB

in expectation and in high-probability, and compare the algorithm to several baselines. Trace-

UCB performs remarkably well in scenarios where the dimension of the contexts or the number of

instances is large compared to the total budget, motivating the study of high-dimensional settings,

whose analysis and guarantees are presented in Appendix C.6. Finally, we provide simulations with

synthetic data that support our theoretical results, and with real data, demonstrating the robustness

of our approach even when some of the assumptions do not hold.

3.2 Problem Definition

The Setting. We consider m linear regression problems, where each instance i 2 [m] = {1, . . . ,m}is characterized by a parameter �i 2 Rd such that for any context X 2 Rd, a random observation

Y 2 R is obtained as

Y = XT�i + ✏i, (3.1)

where ✏i is an i.i.d. realization of a Gaussian distribution N (0,�2

i ). We denote by �2

max

= maxi �2

i

and by �2 = 1/mP

i �2

i , the largest and the average variance, respectively.

We define a sequential decision-making problem over n rounds, where at each step t 2 [n] a

learning algorithm A receives a context Xt drawn i.i.d. from N (0,⌃), selects an instance It, and

observes a random sample YIt

,t obtained as in (3.1). At the end of the experiment, a training

set Dn = {Xt, It, YIt

,t}t2[n] is generated and all the m linear regression problems are solved, each

problem i 2 [m] with its own training set Di,n (i.e., a subset of Dn that contains all the samples

with It = i), and estimates of the parameters {�i,n}i2[m]

are returned. For each �i,n, we measure

its accuracy by the mean-squared error (MSE)

Li,n(�i,n)=EX

(XT�i�XT�i,n)2

=k�i��i,nk2⌃

. (3.2)

The overall accuracy of the estimates returned by the algorithm A is evaluated as

Ln(A) = maxi2[m]

EDn

Li,n(�i,n)⇤

, (3.3)

Page 49: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 3. LEARNING SEVERAL MODELS 36

where the expectation is w.r.t. the randomness of the contexts Xt and observations Yi,t used to

compute �i,n (and, possibly, the algorithm’s randomness if any). The objective is to design an

algorithm A that minimizes the loss (3.3). This requires defining an allocation rule to select the

instance It at each step t and the algorithm to compute the final estimates �i,n, e.g., by means of

ordinary least-squares (OLS), regularized least-squares (RLS), or Lasso.

In designing a learning algorithm, we rely on the following assumption.

Assumption 2. The covariance matrix ⌃ of the Gaussian distribution generating the contexts

{Xt}nt=1

is known.

This is a standard assumption in active learning, since in this setting the learner has access to the

input distribution and the main question is for which context she should ask for a label [78, 74]. Often

times, organizations like the MOOC considered in the introduction own prior data that provides an

accurate idea of the distribution of their customers.

While in the rest of the chapter we mostly focus on Ln(A), other objectives can be considered too,

such as replacing the maximum in (3.3) with average across all instances, i.e., 1/mPm

i=1

EDn

Li,n(�i,n)⇤

,

or using weighted errors, maxi wi EDn

Li,n(�i,n)⇤

, and our algorithms naturally extend to those cases

(by updating the score to focus on the estimated standard deviation, or by including the weights in

the score, respectively). In addition, later in the chapter, we replace the expectation in (3.3) with a

high-probability error, see (3.19).

Optimal static allocation with OLS estimates. While the distribution of the contexts is

fixed and does not depend on the instance i, the errors Li,n(�i,n) directly depend on the variances

�2

i of the noise ✏i. We define an optimal baseline obtained when the noise variances {�2

i }mi=1

are

known. In particular, we focus on a static allocation algorithm Astat

that selects each instance i

exactly ki,n times, independently of the context,1 and returns an estimate �i,n computed by OLS as

b�i,n =�

XTi,nXi,n

��1

XTi,nYi,n, (3.4)

where Xi,n 2 Rki,n

⇥d is the matrix of (random) samples obtained at the end of the experiment for

instance i, and Yi,n 2 Rki,n is the corresponding vector of observations. It is simple to show that

the global error corresponding to Astat

is

Ln(Astat

) = maxi2[m]

�2

i

ki,nTr⇣

⌃EDn

b⌃�1

i,n

, (3.5)

1This strategy is obtained by simply selecting the first instance k1,n times, the second one k2,n times, and so on.

Page 50: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

CHAPTER 3. LEARNING SEVERAL MODELS 37

where b⌃i,n = XTi,nXi,n/ki,n 2 Rd⇥d is the empirical covariance matrix of the contexts assigned

to instance i. Since the algorithm does not change the allocation depending on the contexts and

recalling that Xt ⇠ N (0,⌃), b⌃�1

i,n is distributed as an inverse Wishart and we can write (3.5) as

Ln(Astat

) = maxi2[m]

d�2

i

ki,n � d� 1. (3.6)

Thus, we may write the following proposition for the optimal static allocation algorithm A⇤stat

.

Proposition 13. Given m linear regression problems, each characterized by a parameter �i, Gaus-

sian noise with variance �2

i , and Gaussian contexts with covariance ⌃, let n > m(d + 1), then the

optimal OLS static allocation algorithm A⇤stat selects each instance

k⇤i,n =�2

iP

j �2

j

n+ (d+ 1)

1� �2

i

�2

, (3.7)

times (up to rounding e↵ects), and incurs the global error

L⇤n = Ln(A⇤

stat) = �2

md

n+O

�2

md

n

2

!

. (3.8)

Proof. See Appendix C.1.1.

Proposition 13 divides the problems into two types: those for which �2

i � �2 (wild instances) and

those for which �2

i < �2 (mild instances). We see that for the first type, the second term in (3.7) is

negative and the instance should be selected less frequently than in the context-free case (where the

optimal allocation is given just by the first term). On the other hand, instances whose variance is

below the mean variance should be pulled more often. In any case, we see that the correction to the

context-free allocation (i.e., the second term) is constant, as it does not depend on n. Nonetheless,

it does depend on d and this suggests that in high-dimensional problems, it may significantly skew

the optimal allocation.

While $\mathcal{A}^*_{\mathrm{stat}}$ effectively minimizes the prediction loss $L_n$, it cannot be implemented in practice, since the optimal allocation $k^*_{i,n}$ requires the variances $\sigma_i^2$ to be known at the beginning of the experiment. As a result, we need to devise a learning algorithm $\mathcal{A}$ whose performance approaches $L^*_n$ as $n$ increases.


More formally, we define the regret of $\mathcal{A}$ as

$$R_n(\mathcal{A}) = L_n(\mathcal{A}) - L_n(\mathcal{A}^*_{\mathrm{stat}}) = L_n(\mathcal{A}) - L^*_n, \qquad (3.9)$$

and we expect $R_n(\mathcal{A}) = o(1/n)$. In fact, any allocation strategy that selects each instance a linear number of times (e.g., uniform sampling) achieves a loss $L_n = O(1/n)$, and thus a regret of order $O(1/n)$. However, we expect that the loss of an effective learning algorithm decreases not just at the same rate as $L^*_n$ but also with the very same constant, thus implying a regret that decreases faster than $O(1/n)$. For completeness, we note that uniform sampling leads to

$$L_n(\mathcal{A}_{\mathrm{Unif}}) = \frac{\sigma^2_{\max}\, m d}{n} + O\Big(\frac{1}{n^2}\Big). \qquad (3.10)$$

The coefficient of the leading $d/n$ term thus goes from the optimal $\sum_{i=1}^m \sigma_i^2$ in (3.8) to $m\,\sigma^2_{\max}$ for uniform sampling. Thus, when the values of the $\sigma_i^2$ are not very similar, we expect uniform sampling to show quite a poor performance.

3.3 The Trace-UCB Algorithm

In this section, we present and analyze Trace-UCB, our proposed algorithm for the problem discussed in Section 3.2; its pseudocode is given in Algorithm 3.

Algorithm 3 Trace-UCB Algorithm
1: for i = 1, ..., m do
2:   Select problem instance i exactly d + 1 times
3:   Compute its OLS estimates $\hat{\beta}_{i,m(d+1)}$ and $\hat{\sigma}^2_{i,m(d+1)}$
4: end for
5: for steps t = m(d + 1) + 1, ..., n do
6:   for problem instance $1 \le i \le m$ do
7:     Compute score ($\Delta_{i,t-1}$ is defined in Eq. (3.12)): $s_{i,t-1} = \dfrac{\hat{\sigma}^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}\, \mathrm{Tr}\big(\Sigma\, \hat{\Sigma}^{-1}_{i,t-1}\big)$
8:   end for
9:   Select problem instance $I_t = \arg\max_{i \in [m]} s_{i,t-1}$
10:  Observe $X_t$ and $Y_{I_t,t}$
11:  Update its OLS estimators $\hat{\beta}_{I_t,t}$ and $\hat{\sigma}^2_{I_t,t}$
12: end for
13: Return RLS estimates $\{\hat{\beta}^\lambda_{i,n}\}_{i=1}^m$ with regularization $\lambda$

The regularization parameter $\lambda = O(1/n)$ is provided to the algorithm as input, while in practice


one could set $\lambda$ independently for each arm using cross-validation.

Intuition. Equation (3.6) suggests that while the parameters of the context distribution, particularly its covariance $\Sigma$, do not impact the prediction error, the noise variances play the most important role in the loss of each problem instance. This is in fact confirmed by the optimal allocation $k^*_{i,n}$ in (3.7), where only the variances $\sigma_i^2$ appear. This evidence suggests that an algorithm similar to GAFS-MAX [3] or CH-AS [18], which were designed for the context-free case (i.e., each instance $i$ is associated with an expected value and not a linear function), would be effective in this setting as well. Nonetheless, (3.6) holds only for static allocation algorithms that completely ignore the contexts (and their history) when deciding which instance $I_t$ to choose at time $t$. On the other hand, adaptive learning algorithms create a strong correlation between the dataset $D_{t-1}$ collected so far, the current context $X_t$, and the decision $I_t$. As a result, the sample matrix $X_{i,t}$ is no longer a random variable independent of $\mathcal{A}$, and using (3.6) to design a learning algorithm is not convenient, since the impact of the contexts on the error is completely overlooked. Unfortunately, in general, it is very difficult to study the potential correlation between the contexts $X_{i,t}$, the intermediate estimates $\hat{\beta}_{i,t}$, and the most suitable choice $I_t$. However, in the next lemma, we show that if at each step $t$ we select $I_t$ as a function of $D_{t-1}$, and not $X_t$, we may still recover an expression for the final loss that we can use as a basis for the construction of an effective learning algorithm.

Lemma 14. Let $\mathcal{A}$ be a learning algorithm that selects instances $I_t$ as a function of the previous history, that is, $D_{t-1} = \{X_1, I_1, Y_{I_1,1}, \dots, X_{t-1}, I_{t-1}, Y_{I_{t-1},t-1}\}$, and computes estimates $\hat{\beta}_{i,n}$ using OLS. Then its loss after $n$ steps can be expressed as

$$L_n(\mathcal{A}) = \max_{i \in [m]} \mathbb{E}_{D_n}\bigg[\frac{\sigma_i^2}{k_{i,n}}\, \mathrm{Tr}\Big(\Sigma\, \hat{\Sigma}^{-1}_{i,n}\Big)\bigg], \qquad (3.11)$$

where $k_{i,n} = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$ and $\hat{\Sigma}_{i,n} = X_{i,n}^\top X_{i,n}/k_{i,n}$.

Proof. See Appendix C.2.

Remark 1. We assume noise and contexts are Gaussian. The noise Gaussianity is crucial for the estimates of the parameter $\hat{\beta}_{i,t}$ and variance $\hat{\sigma}^2_{i,t}$ to be independent of each other, for each instance $i$ and time $t$ (we actually need and derive a stronger result in Lemma 40, see Appendix C.2). This is key in proving Lemma 14, as it allows us to derive a closed-form expression for the loss function which holds under our algorithm, and is written in terms of the number of pulls and the trace of the inverse empirical covariance matrix (note that $\hat{\beta}_{i,t}$ drives our loss, while $\hat{\sigma}^2_{i,t}$ drives our decisions). One way to remove this assumption is by defining and directly optimizing a surrogate loss equal


to (3.11) instead of (3.3). On the other hand, the Gaussianity of contexts leads to the whitened inverse covariance estimate $\Sigma\,\hat{\Sigma}^{-1}_{i,n}$ being distributed as an inverse Wishart. As there is a convenient closed formula for its mean, we can find the exact optimal static allocation $k^*_{i,n}$ in Proposition 13, see (3.7). In general, for sub-Gaussian contexts, no such closed formula for the trace is available. However, as long as the optimal allocation $k^*_{i,n}$ has no second-order $n^\alpha$ terms for $1/2 \le \alpha < 1$, the same regret rate results hold.

Expression (3.11) makes it explicit that the prediction error comes from two different sources. The first one is the noise in the measurements $Y$, whose strength is controlled by the unknown variances $\sigma_i^2$. Clearly, the larger $\sigma_i^2$ is, the more observations are required to achieve the desired accuracy. At the same time, the diversity of contexts across instances also impacts the overall prediction error. This is very intuitive, since it would be a terrible idea for the MOOC platform discussed in the introduction of the chapter to estimate the parameters of a course by performing a hundred exams on the very same student (context). We say contexts are balanced when $\hat{\Sigma}_{i,n}$ is well conditioned. Therefore, a good algorithm should take care of both aspects.

There are two extreme scenarios regarding the contributions of the two sources of error. 1) If the number of contexts $n$ is relatively large, since the context distribution is fixed, one can expect that the contexts allocated to each instance eventually become balanced. In this case, the differences in the $\sigma_i^2$ will mostly determine the number of times each instance should be selected. 2) When the dimension $d$ or the number of arms $m$ is large w.r.t. $n$, balancing contexts becomes critical and can play an important role in the final prediction error, whereas the $\sigma_i^2$ are less relevant in this scenario. While a learning algorithm cannot deliberately choose a specific context (i.e., $X_t$ is a random variable), we may need to favor instances in which the contexts are poorly balanced and the prediction error is large, despite the fact that they might have small noise variances.

Algorithm. Trace-UCB is designed as a combination of the upper-confidence-bound strategy used in CH-AS [18] and the loss in (3.11), so as to obtain a learning algorithm capable of allocating according to the estimated variances while at the same time balancing the error generated by context mismatch. We recall that all the quantities computed at every step of the algorithm are indexed at the beginning and end of a step $t$ by $i,t-1$ (e.g., $\hat{\sigma}^2_{i,t-1}$) and $i,t$ (e.g., $\hat{\beta}_{i,t}$), respectively. At the end of each step $t$, Trace-UCB first computes an OLS estimate $\hat{\beta}_{i,t}$, and then uses it to estimate the variance $\hat{\sigma}^2_{i,t}$ as

$$\hat{\sigma}^2_{i,t} = \frac{1}{k_{i,t} - d}\, \big\| Y_{i,t} - X_{i,t}\hat{\beta}_{i,t} \big\|^2,$$

which is the average squared deviation of the predictions based on $\hat{\beta}_{i,t}$. We rely on the following


concentration inequality for the variance estimate of linear regression with Gaussian noise, whose

proof is reported in Appendix C.3.1.

Proposition 15. Let the number of pulls $k_{i,t} \ge d + 1$ and $R \ge \max_i \sigma_i^2$. If $\delta \in (0, 3/4)$, then for any instance $i$ and step $t > m(d+1)$, with probability at least $1 - \delta/2$, we have

$$\big|\hat{\sigma}^2_{i,t} - \sigma_i^2\big| \;\le\; \Delta_{i,t} \,\triangleq\, R\sqrt{\frac{64}{k_{i,t} - d}\, \log\frac{2mn}{\delta}}. \qquad (3.12)$$

Given (3.12), we can construct an upper bound on the prediction error of any instance $i$ at time step $t$ as

$$s_{i,t-1} = \frac{\hat{\sigma}^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}\, \mathrm{Tr}\Big(\Sigma\, \hat{\Sigma}^{-1}_{i,t-1}\Big), \qquad (3.13)$$

and then simply select the instance which maximizes this score, i.e., $I_t = \arg\max_i s_{i,t-1}$. Intuitively, Trace-UCB favors problems where the prediction error is potentially large, either because of a large noise variance or because of a significant unbalance in the observed contexts w.r.t. the target distribution with covariance $\Sigma$. A subtle but critical aspect of Trace-UCB is that, by ignoring the current context $X_t$ (but using all the past samples $X_{t-1}$) when choosing $I_t$, the distribution of the contexts allocated to each instance stays untouched, and the second term in the score $s_{i,t-1}$, i.e., $\mathrm{Tr}(\Sigma\,\hat{\Sigma}^{-1}_{i,t-1})$, naturally tends to $d$ as more and more (random) contexts are allocated to instance $i$. This is shown by Proposition 16, whose proof is in Appendix C.3.2.

Proposition 16. Force the number of samples $k_{i,t} \ge d + 1$. If $\delta \in (0, 1)$, for any $i \in [m]$ and step $t > m(d+1)$, with probability at least $1 - \delta/2$, we have

$$\Big(1 - C_{\mathrm{Tr}}\sqrt{\tfrac{d}{n}}\Big)^2 \;\le\; \frac{\mathrm{Tr}\big(\Sigma\, \hat{\Sigma}^{-1}_{i,t}\big)}{d} \;\le\; \Big(1 + 2\, C_{\mathrm{Tr}}\sqrt{\tfrac{d}{n}}\Big)^2,$$

with $C_{\mathrm{Tr}} = 1 + \sqrt{2\log(4nm/\delta)/d}$.

While Proposition 16 shows that the error term due to context mismatch tends to the constant $d$ independently of the specific instance $i$, when $t$ is small w.r.t. $d$ and $m$, correcting for the context mismatch may significantly improve the accuracy of the estimates $\hat{\beta}_{i,n}$ returned by the algorithm. Finally, note that while Trace-UCB uses OLS to compute the estimates $\hat{\beta}_{i,t}$, it calculates its returned parameters $\hat{\beta}^\lambda_{i,n}$ by ridge regression (RLS) with regularization parameter $\lambda$ as

$$\hat{\beta}^\lambda_{i,n} = \big(X_{i,n}^\top X_{i,n} + \lambda I\big)^{-1} X_{i,n}^\top Y_{i,n}. \qquad (3.14)$$


As we will discuss later, using RLS makes the algorithm more robust and is crucial in obtaining our

regret bounds both in expectation and high probability.
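To make the procedure concrete, the following Python sketch mirrors Algorithm 3: an initial round of $d+1$ pulls per instance, the score of Eq. (3.13) with the confidence width of Eq. (3.12), and a final ridge estimate as in Eq. (3.14). It is a minimal illustrative reconstruction rather than the code used in the experiments; the context stream `X_stream`, the labeling oracle `sample_reward(i, x)`, and the default `delta` are hypothetical interfaces introduced only for the sketch.

```python
import numpy as np

def trace_ucb(X_stream, sample_reward, m, Sigma, R, n, delta=0.1, lam=None):
    """Minimal sketch of Trace-UCB (Algorithm 3).

    X_stream: iterator over contexts X_t in R^d.
    sample_reward(i, x): returns a noisy observation of instance i at context x.
    Sigma: context covariance used in the trace term of the score.
    R: known upper bound on the noise variances sigma_i^2.
    """
    d = Sigma.shape[0]
    lam = 1.0 / n if lam is None else lam
    X = [[] for _ in range(m)]   # contexts allocated to each instance
    Y = [[] for _ in range(m)]   # corresponding observations

    def ols_and_var(i):
        Xi, Yi = np.array(X[i]), np.array(Y[i])
        beta = np.linalg.lstsq(Xi, Yi, rcond=None)[0]
        k = len(Yi)
        var = np.sum((Yi - Xi @ beta) ** 2) / (k - d)   # hat sigma_i^2
        return beta, var, Xi

    # Initialization: d + 1 samples per instance.
    for i in range(m):
        for _ in range(d + 1):
            x = next(X_stream)
            X[i].append(x); Y[i].append(sample_reward(i, x))

    for t in range(m * (d + 1), n):
        scores = np.empty(m)
        for i in range(m):
            _, var, Xi = ols_and_var(i)
            k = len(Y[i])
            width = R * np.sqrt(64.0 / (k - d) * np.log(2 * m * n / delta))  # Eq. (3.12)
            emp_cov_inv = np.linalg.inv(Xi.T @ Xi / k)
            scores[i] = (var + width) / k * np.trace(Sigma @ emp_cov_inv)    # Eq. (3.13)
        i_t = int(np.argmax(scores))        # instance chosen before seeing X_t
        x = next(X_stream)                  # context revealed after the choice
        X[i_t].append(x); Y[i_t].append(sample_reward(i_t, x))

    # Final ridge (RLS) estimates, Eq. (3.14).
    betas = []
    for i in range(m):
        Xi, Yi = np.array(X[i]), np.array(Y[i])
        betas.append(np.linalg.solve(Xi.T @ Xi + lam * np.eye(d), Xi.T @ Yi))
    return betas
```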

Performance Analysis. Before proving a regret bound for our Trace-UCB algorithm, we

report an intermediate result, whose proof is in Appendix C.4.1, that shows Trace-UCB behaves

similarly to the optimal static allocation.

Theorem 17. Let $\delta > 0$. With probability at least $1 - \delta$, the total number of contexts that Trace-UCB allocates to each problem instance $i$ after $n$ rounds satisfies

$$k_{i,n} \;\ge\; k^*_{i,n} \;-\; \frac{C_\delta + 8\, C_{\mathrm{Tr}}}{\sigma^2_{\min}}\sqrt{\frac{nd}{\lambda_{\min}}} \;-\; \Omega(n^{1/4}), \qquad (3.15)$$

where $R \ge \sigma^2_{\max}$ is known by the algorithm, and we defined $C_\delta = 16 R \log(2mn/\delta)$, $C_{\mathrm{Tr}} = 1 + \sqrt{2\log(4nm/\delta)/d}$, and $\lambda_{\min} = \sigma^2_{\min}/\sum_j \sigma_j^2$.

We now report our regret bound for the Trace-UCB algorithm. The proof of Theorem 18 is in Appendix C.4.2.

Theorem 18. The regret of the Trace-UCB algorithm, i.e., the difference between its loss and the loss of the optimal static allocation (see Eq. 3.8), is upper-bounded by

$$L_n(\mathcal{A}) - L^*_n \;\le\; O\bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}\, n}\Big)^{3/2}\bigg), \qquad (3.16)$$

where $\lambda_{\min} = \sigma^2_{\min}/\sum_j \sigma_j^2$.

The expression in (3.16) shows that the regret decreases as $O(n^{-3/2})$, as expected. This is also consistent with the results in the context-free case [3, 18], where the regret decreases as $n^{-3/2}$, a rate that is conjectured to be optimal. However, it is important to note that in the contextual case, the numerator also includes the dimensionality $d$. In order to improve on the $n^{-3/2}$ rate, it would be necessary to guarantee deviations in (3.15) of order smaller than $\sqrt{n}$, which in turn would require confidence intervals for the $\sigma_i^2$ and the empirical traces that decrease faster than $1/\sqrt{k_{i,t}}$.

Note that when $n \gg dm$ the regret will be small, and it will be larger when $n \approx dm$. More formally, an equivalent expression for the right-hand side of (3.16) is given by

$$L_n(\mathcal{A}) - L^*_n \;\le\; O\Bigg(\frac{1}{\sigma^5_{\min}}\bigg(\frac{\sum_i \sigma_i^2\, d}{n}\bigg)^{3/2}\Bigg) \;=\; O\Bigg(\frac{\bar{\sigma}^3}{\sigma^5_{\min}}\Big(\frac{dm}{n}\Big)^{3/2}\Bigg). \qquad (3.17)$$
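Indeed, substituting $\lambda_{\min} = \sigma^2_{\min}/\sum_j \sigma_j^2$ into the right-hand side of (3.16) gives

$$\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}\, n}\Big)^{3/2} = \frac{1}{\sigma^2_{\min}}\Big(\frac{d\sum_j \sigma_j^2}{\sigma^2_{\min}\, n}\Big)^{3/2} = \frac{1}{\sigma^5_{\min}}\Big(\frac{d\sum_j \sigma_j^2}{n}\Big)^{3/2} = \frac{\bar{\sigma}^3}{\sigma^5_{\min}}\Big(\frac{dm}{n}\Big)^{3/2},$$

where the last equality uses $\sum_j \sigma_j^2 = m\,\bar{\sigma}^2$.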


This motivates giving particular attention to the high-dimensional setting, which is the subject of Appendix C.6. Note that (3.16) also indicates that the regret depends on a problem-dependent constant $1/\lambda_{\min}$, which measures the complexity of the problem. When $\sigma^2_{\max} \approx \sigma^2_{\min}$, we have $1/\lambda_{\min} \approx m$, but $1/\lambda_{\min}$ could be much larger when $\sigma^2_{\max} \gg \sigma^2_{\min}$.

Remark 2. We briefly discuss a baseline motivated by the context-free problem. Let Var-UCB be the algorithm that, at round $t$, selects the instance that maximizes the score²

$$s'_{i,t-1} = \frac{\hat{\sigma}^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}. \qquad (3.18)$$

The only difference with the score used by Trace-UCB is the lack of the trace term in (3.13). Note that contexts still play a role in computing the variance estimate $\hat{\sigma}^2_{i,t-1}$. Moreover, the regret of this algorithm has a similar rate in terms of $n$ and $d$ as that of Trace-UCB reported in Theorem 18. However, the simulations of Section 3.5 show that the regret of Var-UCB is actually much higher than that of Trace-UCB, especially when $dm$ is close to $n$. Intuitively, when $n$ is close to $dm$, balancing contexts becomes critical, and Var-UCB suffers because its score does not explicitly take them into account.

Sketch of the proof of Theorem 18. The proof is divided into three parts. 1) We show that the behavior of the ridge loss of Trace-UCB is similar to that reported in Lemma 14 for algorithms that rely on OLS; see Lemma 50 in Appendix C.5. The independence of the $\hat{\beta}_{i,t}$ and $\hat{\sigma}^2_{i,t}$ estimates is again essential (see Remark 1). Although the loss of Trace-UCB depends on the ridge estimates of the parameters $\hat{\beta}^\lambda_{i,n}$, the decisions made by the algorithm at each round only depend on the variance estimates $\hat{\sigma}^2_{i,t}$ and the observed contexts. 2) We follow the ideas in [18] to lower-bound the total number of pulls $k_{i,n}$ for each $i \in [m]$ under a good event (see Theorem 17 and its proof in Appendix C.4.1). 3) We finally use the ridge regularization to bound the impact of those cases outside the good event, and combine everything in Appendix C.4.2.

The regret bound of Theorem 18 shows that the largest expected loss across the problem instances incurred by Trace-UCB quickly approaches the loss of the optimal static allocation algorithm (which knows the true noise variances). While $L_n(\mathcal{A})$ measures the worst expected loss, at any specific realization of the algorithm there may be one instance which is very poorly estimated. As a result, it would be desirable to obtain guarantees also for the (random) maximum loss

$$\tilde{L}_n(\mathcal{A}) = \max_{i \in [m]} \big\| \hat{\beta}_{i,n} - \beta_i \big\|^2_{\Sigma}. \qquad (3.19)$$

2Note that Var-UCB is similar to both the CH-AS and B-AS algorithms in [18].


In particular, we prove the following high-probability bound on $\tilde{L}_n(\mathcal{A})$ for Trace-UCB.

Theorem 19. Let $\delta > 0$, and assume $\|\beta_i\|_2 \le Z$ for all $i$, for some $Z > 0$. With probability at least $1 - \delta$,

$$\tilde{L}_n(\mathcal{A}) \;\le\; \frac{\sum_{j=1}^m \sigma_j^2}{n}\Big(d + 2\log\frac{3m}{\delta}\Big) + O\Bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{n\,\lambda_{\min}}\Big)^{3/2}\Bigg), \qquad (3.20)$$

where $\lambda_{\min} = \sigma^2_{\min}/\sum_j \sigma_j^2$.

Note that the first term in (3.20) corresponds to the first term of the loss for the optimal static allocation, and the second term is, again, an $n^{-3/2}$ deviation. However, in this case, the guarantees hold simultaneously for all the instances.

Sketch of the proof of Theorem 19. In the proof we slightly modify the confidence ellipsoids for the $\hat{\beta}_{i,t}$, based on self-normalized martingales and derived in [1]; see Theorem 44 in Appendix C.3. By means of the confidence ellipsoids we control the loss in (3.19). Their radii depend on the number of samples per instance, and we rely on a set of good events that hold with high probability, leading to a lower bound on the number of samples. In addition, we need to make sure the mean norm of the contexts is not too large (see Corollary 46 in Appendix C.3). Finally, we combine the lower bound on $k_{i,n}$ with the confidence ellipsoids to conclude the desired high-probability guarantees of Theorem 19.

3.4 High Dimensional Setting

High-dimensional linear regression models are remarkably common in practice. Companies tend to record a large number of features of their customers and feed them to their prediction models. There are also cases in which the number of problem instances under consideration $m$ is large, e.g., too many courses in the MOOC example described in the introduction. Unless the horizon $n$ is still proportionally large w.r.t. $md$, these scenarios require special attention, as the algorithms discussed so far may break down. Interestingly, algorithms like Trace-UCB that adaptively use contexts in their allocation strategy become more robust than their context-free counterparts.

A natural assumption in such scenarios is sparsity, i.e., only a small subset of features are relevant to the prediction problem at hand (have non-zero coefficients). In our setting of $m$ problem instances, it is often reasonable to assume that these instances are related to each other, and thus it makes sense to extend the concept of sparsity to joint sparsity, i.e., a sparsity pattern across the instances.


Formally, we assume that there exists some $s \ll d$ such that

$$|S| \,\triangleq\, \Big| \bigcup_{i \in [m]} \mathrm{supp}(\beta_i) \Big| = s, \qquad (3.21)$$

where $\mathrm{supp}(\beta_i) = \{j \in [d] : \beta_i^{(j)} \ne 0\}$ denotes the support of the $i$-th problem instance. A special case of joint sparsity is when $|\mathrm{supp}(\beta_i)| \approx s$ for all $i$, i.e., most of the relevant features are shared across the instances.

In this section, we focus on the scenario where $dm > n$. When we can only allocate a small (relative to $d$) number of contexts to each problem instance, proper balancing of contexts becomes extremely important, and thus the algorithms that do not take contexts into account in their allocation are destined to fail. Although Trace-UCB has the advantage of using contexts in its allocation strategy, it still needs to quickly discover the relevant features (those in the support) and only use those in its allocation strategy.

This motivates a two-stage algorithm, which we call Sparse-Trace-UCB, whose pseudocode is in Algorithm 7 in Appendix C.6. In the first stage, the algorithm allocates contexts uniformly to all the instances, $L$ contexts per instance, and then recovers the support. In the second stage, it relies on the discovered support $\hat{S}$ and applies the standard Trace-UCB to all the instances, but only takes into account the features in $\hat{S}$. Note that $L$ should be large enough that, with high probability, the support is exactly discovered, i.e., $\hat{S} = S$.

There exists a large literature on how to perform simultaneous support discovery in jointly sparse linear regression problems [67, 68, 95]. Most of these algorithms minimize the regularized empirical loss

$$\min_{M \in \mathbb{R}^{d \times m}} \; \frac{1}{k}\sum_{i=1}^m \big\| Y_i - X_i M[:, i] \big\|^2 + \lambda \|M\|,$$

where $k$ is the number of samples per problem, $M$ is the matrix whose $i$-th column is $M[:, i] = \beta_i$, $X_i \in \mathbb{R}^{k \times d}$, and $Y_i = X_i \beta_i + \epsilon_i$. In particular, they use an $l_a/l_b$ block regularization norm, i.e., $\|M\|_{l_a/l_b} = \|v\|_{l_a}$, where $v_i = \|M[i, :]\|_{l_b}$ and $M[i, :]$ is the $i$-th row of $M$.
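As an illustration of the support-discovery step, the sketch below minimizes the $l_1/l_2$ block-regularized objective above by proximal gradient descent (row-wise group soft-thresholding) and returns the rows of $M$ with non-negligible norm. It is a minimal sketch under simple assumptions (fixed step size, fixed iteration budget, a hypothetical support threshold), not the exact estimator analyzed in [95].

```python
import numpy as np

def joint_support_l1l2(Xs, Ys, lam, n_iters=500, tol=1e-6):
    """Recover the joint support by minimizing
        (1/k) * sum_i ||Y_i - X_i M[:, i]||^2 + lam * sum_rows ||M[row, :]||_2
    with proximal gradient descent (row-wise group soft-thresholding)."""
    m = len(Xs)
    k, d = Xs[0].shape
    M = np.zeros((d, m))
    # Step size from the Lipschitz constant of the smooth part.
    L = (2.0 / k) * max(np.linalg.norm(X, 2) ** 2 for X in Xs)
    eta = 1.0 / L
    for _ in range(n_iters):
        G = np.zeros_like(M)
        for i in range(m):
            G[:, i] = (2.0 / k) * Xs[i].T @ (Xs[i] @ M[:, i] - Ys[i])
        M_new = M - eta * G
        # Proximal step: shrink each row towards zero as a group.
        row_norms = np.linalg.norm(M_new, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - eta * lam / np.maximum(row_norms, 1e-12))
        M_new = shrink * M_new
        if np.linalg.norm(M_new - M) < tol:
            M = M_new
            break
        M = M_new
    support = np.where(np.linalg.norm(M, axis=1) > 1e-3)[0]  # hypothetical threshold
    return M, support
```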

The Sparse-Trace-UCB algorithm uses the $l_1/l_2$ block regularization Lasso algorithm [95], which is an extension of the algorithm in [68], for its support discovery stage. We can now extend the guarantees of Theorem 19 to the high-dimensional case with joint sparsity, assuming $s$ is known.

Theorem 20. Let $\delta_1 > 0$. Assume $\|\beta_i\|_2 \le Z$ for all $i$, for some $Z > 0$, and assume the parameters $(n, d, s, \beta_i, \Sigma, \sigma^2_{\max})$ satisfy conditions C1 to C5 in [95].


Let the sparsity overlap function be defined as in [68]. If $L > 2(1+v)\log(d-s)\,\rho_u\big(\Sigma^{(1:m)}_{S^C S^C \mid S}\big)/\gamma^2$ for some constant $v > 0$, and $n - Lm \ge (s+1)m$, then, with probability at least $1 - \delta_1 - \delta_2$, $\tilde{L}_n(\mathcal{A})$ is at most

$$\frac{\sum_j \sigma_j^2}{n - Lm}\Big(s + 2\log\frac{3m}{\delta_1}\Big) \;+\; \frac{2c}{\sqrt{\sigma^2_{\min}}}\bigg(\frac{s\sum_j \sigma_j^2}{n - Lm}\bigg)^{3/2} + o(z), \qquad (3.22)$$

where $c \le 2\big(1 + \sqrt{2\log(12mn/\delta_1)/s}\big)$, and we defined $\delta_2 = m\exp(-c_0 \log s) + \exp(-c_1 \log(d-s))$ for positive constants $c_0, c_1 > 0$, and $z = (s/(n - Lm))^{3/2}$.

The exact technical assumptions and the proof are given and discussed in Appendix C.6. We simply combine the high-probability results of Theorem 19 and the high-probability support recovery of Theorem 2 in [95]. Corollary 51 studies the regime of interest where the support overlap is complete, $n = C_1 m s \log d \ll md$ for $C_1 > 0$, and $L = C_2 s \log d$, for $C_1 \ge C_2 > 0$ (see Appendix C.6).

3.5 Simulations

In this section, we provide empirical evidence to support our theoretical results. We consider both synthetic and real-world problems, and compare the performance (in terms of normalized MSE) of Trace-UCB to uniform sampling, the optimal static allocation (which requires knowledge of the noise variances), and the context-free algorithm Var-UCB (see Remark 2). In [3], GFSP-MAX performs worse than GAFS-MAX. In [18], CH-AS and B-AS are shown to outperform GAFS-MAX, and thus we chose them as non-contextual baselines. In particular, Var-UCB is indeed the same algorithm as CH-AS, except for the fact that we need to use the concentration inequality in Proposition 15, since we are estimating the variance from a regression problem using OLS (unlike in [18]).

First, we use synthetic data to ensure that all the assumptions of our model are satisfied, namely we deal with linear regression models with Gaussian contexts and noise. We set the number of problem instances to $m = 7$ and consider two scenarios: one in which all the noise variances are equal to 1, and one where they are not equal and $\sigma^2 = (0.01, 0.02, 0.75, 1, 2, 2, 3)$. In the latter case, $\sigma^2_{\max}/\sigma^2_{\min} = 300$. We study the impact of (independently) increasing the dimension $d$ and the horizon $n$ on the performance, while keeping all other parameters fixed. Second, we consider real-world datasets in which the underlying model is non-linear and the contexts are not Gaussian, to observe how Trace-UCB behaves (relative to the baselines) in settings where its main underlying assumptions are violated.
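For concreteness, a minimal sketch of the synthetic protocol is given below: draw Gaussian contexts and noise, allocate samples according to a fixed allocation, fit OLS per instance, and record the maximum error $\|\hat{\beta}_i - \beta_i\|^2_\Sigma$ across instances (the figures report medians of this quantity over many runs). The helper names and the number of runs are illustrative, not the exact experiment code.

```python
import numpy as np

def run_once(allocation, betas, sigma2, Sigma, rng):
    """Allocate allocation[i] Gaussian contexts to instance i, fit OLS, and
    return the maximum weighted error max_i ||beta_hat_i - beta_i||_Sigma^2."""
    d = Sigma.shape[0]
    errors = []
    for i, k in enumerate(allocation):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=k)
        y = X @ betas[i] + rng.normal(0.0, np.sqrt(sigma2[i]), size=k)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        diff = beta_hat - betas[i]
        errors.append(diff @ Sigma @ diff)
    return max(errors)

rng = np.random.default_rng(0)
m, d, n = 7, 10, 350
sigma2 = np.array([0.01, 0.02, 0.75, 1.0, 2.0, 2.0, 3.0])
Sigma = np.eye(d)                       # white Gaussian contexts, as in Figure 3.1
betas = [rng.normal(size=d) for _ in range(m)]

uniform = np.full(m, n // m)
optimal = np.round(sigma2 / sigma2.sum() * n
                   + (d + 1) * (1 - sigma2 / sigma2.mean())).astype(int)  # Eq. (3.7)

runs = [(run_once(uniform, betas, sigma2, Sigma, rng),
         run_once(optimal, betas, sigma2, Sigma, rng)) for _ in range(200)]
print("median max-error, uniform :", np.median([u for u, _ in runs]))
print("median max-error, optimal :", np.median([o for _, o in runs]))
```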

Figure 3.1: White Gaussian synthetic data with $m = 7$. Panels (a,c,e): $\sigma^2 = (1, 1, 1, 1, 1, 1, 1)$; panels (b,d,f): $\sigma^2 = (0.01, 0.02, 0.75, 1, 2, 2, 3)$. In panels (a,b) we set $n = 350$; in panels (c,d,e,f) we set $d = 10$.

Synthetic Data. In Figures 3.1(a,b), we display the results for fixed horizon $n = 350$ and increasing dimension $d$. For each value of $d$, we run 10,000 simulations and report the median

of the maximum error across the instances for each simulation. In Fig. 3.1(a), where the $\sigma_i^2$ are equal, uniform sampling and optimal static allocation execute the same allocation, since there is no difference in the expected losses of the different instances. Nonetheless, we notice that Var-UCB suffers from poor estimation as soon as $d$ increases, while Trace-UCB is competitive with the optimal performance. This difference in performance can be explained by the fact that Var-UCB does not control for contextual balance, which becomes a dominant factor in the loss of a learning strategy for problems of high dimensionality. In Fig. 3.1(b), in which the $\sigma_i^2$ are different, uniform sampling is no longer optimal, but even in this case Var-UCB performs better than uniform sampling only for small $d < 23$, where it is more important to control for the $\sigma_i^2$. For larger dimensions, balancing the contexts uniformly eventually becomes a better strategy, and uniform sampling outperforms Var-UCB. In this case too, Trace-UCB is competitive with the optimal static allocation even for large $d$, successfully balancing both the noise variance and the contextual error.

Next, we study the performance of the algorithms w.r.t. $n$. We report two different losses, one


Figure 3.2: Real-world data; median over 1000 simulations. (a) Jester dataset: $d = 40$, $m = 10$. (b) MovieLens dataset: $d = 25$, $m = 5$.

in expectation (3.3) and one in high probability (3.19), corresponding to the results we proved in Theorems 18 and 19, respectively. In order to approximate the loss in (3.3) (Figures 3.1(c,d)), we run 30,000 simulations, compute the average prediction error for each instance $i \in [m]$, and finally report the maximum mean error across the instances. On the other hand, we estimate the loss in (3.19) (Figures 3.1(e,f)) by running 30,000 simulations, taking the maximum prediction error across the instances for each simulation, and finally reporting their median.

In Figures 3.1(c,d), we display the loss for fixed dimension $d = 10$ and horizons from $n = 115$ to 360. In Figure 3.1(c), Trace-UCB performs similarly to the optimal static allocation, whereas Var-UCB performs significantly worse, ranging from 25% to 50% higher errors than Trace-UCB, due to some catastrophic errors arising from unlucky contextual realizations for an instance. In Fig. 3.1(d), as the number of contexts grows, uniform sampling's simple context balancing approach is enough to perform as well as Var-UCB, which again heavily suffers from large mistakes. In both figures, Trace-UCB smoothly learns the $\sigma_i^2$ and outperforms uniform sampling and Var-UCB. Its performance is comparable to that of the optimal static allocation, especially in the case of equal variances in Fig. 3.1(c).

In Figure 3.1(e), Trace-UCB learns and properly balances observations extremely fast and obtains an almost optimal performance. Similarly to Figures 3.1(a,c), Var-UCB struggles when the variances $\sigma_i^2$ are almost equal, mainly because it gets confused by random deviations in the variance estimates $\hat{\sigma}^2_i$, while overlooking potential and harmful context imbalances. Note that even when $n = 360$ (rightmost point), its median error is still 25% higher than Trace-UCB's. In Fig. 3.1(f),


as expected, uniform sampling performs poorly due to the mismatch in variances, and only outperforms Var-UCB for small horizons, in which uniform allocation pays off. On the other hand, Trace-UCB is able to successfully handle the tradeoff between learning and allocating according to the variance estimates $\hat{\sigma}^2_i$, while accounting for the contextual trace $\hat{\Sigma}_i$, even for very low $n$. We observe that for large $n$, Var-UCB eventually reaches the performance of the optimal static allocation and Trace-UCB.

Note that in practice the loss in (3.19) (Figures 3.1(e,f)) is often more relevant than the loss in (3.3), since it holds in high probability rather than in expectation, and, fortunately, Trace-UCB shows excellent performance and robustness in terms of this loss, regardless of the underlying variances $\sigma_i^2$.

Real Data. Trace-UCB is based on assumptions such as linearity, and Gaussianity of noise and contexts, that may not hold in practice, where data may show complex dependencies. Therefore, it is important to evaluate the algorithm with real-world data to see its robustness to the violation of its assumptions. We consider two collaborative filtering datasets in which users provide ratings for items. We choose a dense subset of $k$ users and $p$ items, where every user has rated every item. Thus, each user can be considered as a $p$-dimensional vector of ratings. We represent the context of a user by $d$ out of her $p$ ratings, and learn to predict her remaining $m = p - d$ ratings (each one can be considered as a problem instance). All item ratings are first centered, so each item's mean is zero. In each simulation, $n$ out of the $k$ users are selected at random to be fed to the algorithm (in random order). Algorithms can select any instance, as the dataset contains the ratings of each instance for all the users. At the end of each simulation, we compute the prediction error for each instance using the $k - n$ users that did not participate in training for that simulation. Finally, we report the median of the maximum error across simulations.
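A minimal sketch of one simulation of this protocol is shown below; it assumes a dense ratings matrix `ratings` of shape (k, p) is already loaded, and the `allocate` routine is a placeholder for whichever allocation strategy (uniform, Var-UCB, Trace-UCB, ...) is being evaluated.

```python
import numpy as np

def evaluate_allocation(ratings, d, n, allocate, rng):
    """One simulation of the real-data protocol: center ratings, feed n random
    users to the allocation strategy, and score each instance on held-out users."""
    k, p = ratings.shape
    ratings = ratings - ratings.mean(axis=0)             # center each item
    train = rng.choice(k, size=n, replace=False)
    test = np.setdiff1d(np.arange(k), train)
    X_tr, X_te = ratings[train, :d], ratings[test, :d]   # first d items as context
    Y_tr, Y_te = ratings[train, d:], ratings[test, d:]   # remaining m = p - d targets

    # `allocate` decides which instance's label to request for each training user
    # and returns one estimate per instance, stacked as a (d, m) matrix.
    betas = allocate(X_tr, Y_tr)

    errors = np.mean((X_te @ betas - Y_te) ** 2, axis=0)  # per-instance test MSE
    return errors.max()
```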

Fig. 3.2(a) reports the results using the Jester dataset [34], which consists of joke ratings on a continuous scale from $-10$ to $10$. We take $d = 40$ joke ratings as context and learn the ratings for another 9 jokes. In addition, we add another function that counts the total number of jokes originally rated by the user. The latter is also centered, bounded to the same scale, and has higher variance (without conditioning on $X$). The total number of users is $k = 3811$, and $m = 10$. When the number of observations is limited, the advantage of Trace-UCB is quite significant (the improvement w.r.t. uniform allocation goes from 45% to almost 20% for large $n$, while w.r.t. Var-UCB it goes from almost 30% to roughly 5%), even though the model and context distribution are far from linear and Gaussian, respectively.

Fig. 3.2(b) shows the results for the MovieLens dataset [63], which consists of movie ratings on a discrete scale from 0 to 5, with 0.5 increments. We select 30 popular movies that were simultaneously


rated by $k = 1363$ users, and randomly choose $m = 5$ of them to learn (so $d = 25$). In this case, it turns out that all problems have similar variances ($\sigma^2_{\max}/\sigma^2_{\min} \approx 1.3$), so uniform allocation seems appropriate. Both Trace-UCB and Var-UCB modestly improve on uniform allocation, and their performance is similar.

In real-world problems, the difficulty and underlying assumptions of learning tasks are most often unknown in advance. Thus, it is crucial to rely on algorithms that smoothly account for this uncertainty without any further assumptions. Our experiments show that Trace-UCB is robust and agnostic to the underlying data-generating process.

3.6 Conclusions

In this chapter, we studied the problem of adaptively allocating $n$ contextual samples of dimension $d$ to estimate $m$ linear functions equally well, under heterogeneous noise levels $\sigma_i^2$ that depend on the linear instance and are unknown to the decision-maker. We proposed Trace-UCB, an optimistic algorithm that successfully solves the exploration-exploitation dilemma by simultaneously learning the $\sigma_i^2$, allocating samples according to their estimates, and balancing the contextual information across the instances. We also provided strong theoretical guarantees for two losses of interest: in expectation and with high probability.

Simulations were conducted in several settings, with both synthetic and real data. The favorable results suggest that Trace-UCB is reliable and remarkably robust even in settings that fall outside its assumptions, and is thus a useful and simple tool to implement in practice.


Chapter 4

Directions for Future Work

In this chapter, we briefly introduce how more complicated structures can be actively learned. In

particular, we focus on Markov Decision Processes, the basic mathematical object that models

sequential decision making in the presence of states. Previous actions have delayed consequences,

and planning to learn becomes necessary, unlike in stateless scenarios like bandits. This is still

ongoing and future work, so most results and ideas are subject to change.

4.1 Motivation

In the last few years, there has been a remarkable increase in interest in reinforcement learning techniques. In particular, a few success cases received wide media coverage, such as solving Atari games, or AlphaGo [82]. The algorithms behind them were able to adequately combine both the latest theoretical and applied research, and enormous computational resources.

The motivation for the present chapter is two-fold.

Sample Complexity. The algorithms mentioned above had access to extremely high-performing computing systems (hundreds of GPUs, efficient database infrastructure, etc.). In addition, the system was able to generate its own training data on the fly. In other words, the training was simulator-based. When those two conditions are simultaneously satisfied, i.e., you can generate your own data and you can do it fast, then the number of episodes or trajectories involved in training can be in the billions, and in those circumstances current algorithms lead to finding good policies. However, we believe reinforcement learning should be implemented to solve common, small, and simple tasks where some limited experimentation budget is available. Unfortunately, in most cases, either super-computers are not available and billions of data points cannot be handled, or simulators cannot be used to sample unlimited data from the true environment, as observations


come, for example, from human input (recommendation systems), or costly real-world measurements

(robotics).

Two-Stage Learning. An elegant conceptual aspect of reinforcement learning is that agents keep refining their knowledge and policies over time, while exploiting what they already know. In some settings, however, after consuming a given experimentation budget (in terms of total time or observations), a final policy is fixed and implemented. Only the performance of this policy matters. A natural example is simulator-based systems. The goal of the AlphaGo designers was to beat the world champion on a specific date and game, while losing training games was irrelevant from a cost standpoint. Also, deploying reinforcement learning architectures that constantly update their policies is difficult in real-world production systems. Due to engineering constraints, companies often prefer to spend some initial effort in learning how to behave, to subsequently deploy a static policy.

As a consequence, we propose a reinforcement learning scenario where

1. The sample complexity $T$ (i.e., the number of episodes) is fixed and known in advance.

2. The goal is to maximize the discounted cumulative reward of the final policy $\pi_T$.

In this chapter we mostly discuss Bayesian algorithms, as they are a good fit for the applications

that motivate the work. In those applications, the task is simple and known enough so that tabula

rasa methods can be directly applied, and useful information can be encoded in priors for a model-

based approach. Extensions to model-free methods that use function approximation for the value

function or the policy remain exciting open directions of future work.

Importantly, our goal is not to minimize overall cumulative regret. Near-optimal approaches for the cumulative-regret setting tend to only play actions that are still plausibly the best action, for example, under optimism or under the current posterior distribution over policies. In pure exploration settings like ours, on the other hand, actions known to be suboptimal should sometimes be played, as they provide the algorithm with useful information to discriminate among the other actions.

4.2 Bandits

Before discussing the more general reinforcement learning problem, we briefly study the bandit

setting, where there is a rich literature in both cumulative regret and pure exploration scenarios.

For the simple case of Bernoulli arms, it was shown in [14] that algorithms that lead to logarithmic regret in the cumulative-regret scenario necessarily incur simple regret that decreases at most at a


polynomial rate. On the other hand, there are pure-exploration algorithms, like Successive Rejects [4], whose simple regret decays exponentially fast. The latter algorithm keeps pulling at least two different arms until the final round; its cumulative regret is therefore of linear order. In the authors' words [14]: "the smaller the cumulative regret, the larger the simple regret". In addition, [14] provides both distribution-dependent and distribution-free simple regret bounds. The standard bounds for optimal algorithms minimizing cumulative regret after $n$ pulls are of order $\log(n)$ (distribution-dependent) and $\sqrt{n}$ (distribution-free). For simple regret, the situation is similar: while the distribution-dependent optimal bound is of exponential order, i.e., $e^{-Cn}$, the distribution-free bound behaves as $1/\sqrt{n}$.

For a variety of reasons, in practice it is often the case that organizations claim they want to minimize cumulative regret, but restrict the family of algorithms to explore-then-commit policies. These policies operate in two stages: first, they explore all arms and pick a winner; in the second stage, the algorithm only pulls the winner, i.e., it commits. Explore-then-commit algorithms have a strong pure-exploration flavor and, in fact, [32] recently showed that the cumulative regret of these policies is asymptotically strictly dominated by that of fully sequential algorithms.

Best-arm identification is the most studied problem within the pure-exploration bandit literature. The goal is to maximize the expected value of the arm selected by the algorithm after a number of rounds. There are several generalizations, like $m$-best arm identification for $m > 1$ [13]. In [4], the authors characterize the hardness of the problem by $H = \sum_i 1/\Delta_i^2$, where $i^*$ is the best arm and $\Delta_i = \mu_{i^*} - \mu_i$ is the optimality gap of arm $i$. For $i^*$, we take the gap $\Delta_{i^*}$ with respect to the second-best arm. The sample complexity required to identify the best arm with reasonable probability is determined by $H$. In general, one expects the sample complexity to depend on three aspects of the problem: the number of arms, the relative mean quality of each arm, and the noise distributions. There are two main settings: fixed confidence and fixed budget.

Fixed Confidence. Let $i^*$ and $\hat{i}$ be the best arm and the arm selected by the algorithm, respectively. In this setting, given $\delta > 0$, the goal is to pull arms in a way such that the algorithm is able to return its final guess as quickly as possible while satisfying $\mathbb{P}(\hat{i} = i^*) > 1 - \delta$, a fixed confidence requirement. In [62], an elimination algorithm called Hoeffding Races is proposed, where arms are discarded when their confidence intervals are dominated. Later, [66] improved the algorithm by means of empirical Bernstein bounds that incorporate variance information. An asymptotically optimal algorithm, called Track-and-Stop, was recently proposed in [31].

Fixed Budget. Let n be the total number of pulls the algorithm is allowed to make. The goal


here is to output a recommended arm with the highest possible mean reward after the $n$ pulls. Two different algorithms are proposed in [4]: UCB-E and Successive Rejects. The simple regret of both algorithms decreases at a rate roughly equal to $\exp(-Cn/H)$. UCB-E requires tuning a parameter (whose optimal value depends on the unknown $H$), while Successive Rejects needs no tuning. Most UCB approaches pull the arm maximizing a score of the following form: $s_{i,t} = \hat{\mu}_{i,t-1} + \sqrt{a/T_i(t-1)}$, where $T_i(t)$ is the number of times arm $i$ has been pulled in the first $t$ rounds, and $\hat{\mu}_{i,t}$ is the empirical mean of those pulls. In particular, to minimize cumulative regret the amount of exploration required is $a = O(\log(n))$. However, [4] shows that, in order to minimize simple regret, the exploration needs to be far more aggressive: $a = O(n/H)$ (which is precisely what UCB-E does, so it requires knowing the value of $H$). On the other hand, Successive Rejects divides the $n$ pulls into rounds of increasing length, and at the end of each round it discards the arm with the lowest empirical mean. During any round, all surviving arms are uniformly allocated. A pure exploration problem is, at the end of the day, a search problem, a problem of (adaptively) collecting the right information or data, so that an informed decision can be made once the agent has spent all of his budget. In those cases, a natural benchmark to beat is non-adaptive random data collection. If one lets $a \to \infty$ above, then the algorithm becomes uniform allocation: every arm is pulled equally often. Interestingly, simple regret under uniform allocation decreases at rate $\exp(-Cn\Delta^2_{i^*}/K)$, where $K$ is the number of arms and $\Delta_{i^*}$ is the minimal gap (between the best and second-best arms). By definition $\Delta^2_{i^*}/K \le 1/H$, but in practice $\Delta^2_{i^*}/K$ may actually be much smaller than $1/H$. Therefore, UCB-E and Successive Rejects may dramatically outperform uniform allocation. The results mentioned above hold for any $n$, i.e., they are not asymptotic results.
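A minimal sketch of Successive Rejects, following the description above (with the phase lengths of [4] and a hypothetical `pull(i)` reward oracle), is:

```python
import numpy as np

def successive_rejects(pull, K, n):
    """Successive Rejects [4]: fixed-budget best-arm identification.
    pull(i) returns one stochastic reward of arm i; n is the total budget."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    n_k = lambda k: int(np.ceil((n - K) / (log_bar * (K + 1 - k))))
    surviving = list(range(K))
    sums = np.zeros(K)
    counts = np.zeros(K, dtype=int)
    prev = 0
    for k in range(1, K):                      # K - 1 elimination phases
        pulls_this_phase = n_k(k) - prev
        prev = n_k(k)
        for i in surviving:
            for _ in range(pulls_this_phase):
                sums[i] += pull(i)
                counts[i] += 1
        means = {i: sums[i] / counts[i] for i in surviving}
        worst = min(surviving, key=lambda i: means[i])
        surviving.remove(worst)                # reject the empirically worst arm
    return surviving[0]                        # recommended arm

# Usage sketch with Bernoulli arms (the means are hypothetical).
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.45, 0.2]
best = successive_rejects(lambda i: rng.binomial(1, means[i]), K=len(means), n=2000)
print("recommended arm:", best)
```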

These algorithms provide frequentist guarantees. Pure exploration has also been widely studied in the Bayesian literature. Thompson sampling is one of the most popular Bayesian bandit algorithms [86]. The algorithm is simple to implement, leads to strong performance in practice [19], and theoretical guarantees for cumulative regret have been shown [2, 50]. At each time $t$, the algorithm samples a parameter from the posterior, $\theta_t \sim p_t(\theta)$, assumes $\theta_t$ is the true underlying mean reward, and acts accordingly: it pulls $a_t = \arg\max_i \theta_{t,i}$. While optimal for independent arms, Thompson sampling (and other UCB algorithms) only pulls arms, by definition, whose optimality is statistically plausible. As pointed out in [76], this can be problematic in settings where there are complex informational dependencies across arms, in particular, in settings where arms known to be suboptimal provide useful information to discriminate between potentially optimal arms. In those cases, to minimize cumulative regret, one needs to carefully trade off regret and some informational metric. The information ratio, presented in [76], addresses these issues by selecting arms according to a randomized policy that myopically minimizes the regret cost-per-bit of information. Formally, the information ratio for a single action $a$ is given by the square of its expected regret (with respect to the current posterior) over the information gain that pulling that action will produce with respect to the


posterior distribution of the optimal arm. Actions with high regret that lead to great information

gains can still be competitive.

One can imagine that, when the goal is to minimize some notion of simple regret, the issues regarding Thompson sampling or UCB described in [76] become more serious. Intuitively, one now wants to capture observations of high informational value so as to, only eventually, be able to identify actions leading to low regret. Recently, [77] proposed three simple Bayesian algorithms: top-two probability sampling, top-two value sampling, and top-two Thompson sampling. The latter is a modified version of Thompson sampling where, with probability $\beta$ (a tuning parameter), the best arm $i'_t$ of $\theta_t \sim p_t(\theta)$ is pulled but, with probability $1 - \beta$, we keep sampling from $p_t(\theta)$ until we find a different best arm $i''_t \ne i'_t$, which is then pulled instead. The paper provides frequentist asymptotic guarantees: the convergence of the posterior distribution to the true best action happens at an exponential rate, with a near-optimal exponent that depends on $\beta$.
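A minimal sketch of top-two Thompson sampling for independent Bernoulli arms with Beta posteriors (a standard conjugate setup; the prior, the value of $\beta$, and the arm means in the usage example are illustrative) is:

```python
import numpy as np

def top_two_thompson(pull, K, n, beta=0.5, rng=None):
    """Top-two Thompson sampling [77] with Beta(1, 1) priors on Bernoulli arms.
    pull(i) returns a {0, 1} reward for arm i; returns the posterior parameters."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.ones(K)   # posterior alpha (successes + 1)
    b = np.ones(K)   # posterior beta  (failures + 1)
    for _ in range(n):
        champ = int(np.argmax(rng.beta(a, b)))
        arm = champ
        if rng.random() > beta:
            # Resample until a different arm is the best under the posterior sample.
            challenger = champ
            while challenger == champ:
                challenger = int(np.argmax(rng.beta(a, b)))
            arm = challenger
        r = pull(arm)
        a[arm] += r
        b[arm] += 1 - r
    return a, b

# Usage sketch (the arm means are hypothetical).
rng = np.random.default_rng(1)
means = [0.3, 0.5, 0.45, 0.2]
a, b = top_two_thompson(lambda i: rng.binomial(1, means[i]), K=4, n=3000, rng=rng)
print("posterior means:", np.round(a / (a + b), 3))
```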

Linear Bandits

Linear bandits are a powerful generalization of stochastic multi-armed bandits. A parameter vector $\beta^* \in \mathbb{R}^d$ is fixed and unknown, and the agent needs to choose an arm from a given set $\mathcal{X}_t \subset \mathbb{R}^d$ at every round $t$. After pulling arm $x \in \mathcal{X}_t$, the agent obtains reward $y = x^\top \beta^* + \epsilon$, for some sub-Gaussian mean-zero noise $\epsilon$. Note that, in general, by pulling action $x$, we can learn about the expected outcome of all other actions. Therefore, the informational dependencies between arms are explicitly modeled in linear bandits.
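Most of the Bayesian algorithms compared below maintain a Gaussian posterior over $\beta$. A minimal sketch of the conjugate update, together with a Thompson-sampling arm choice, is given below; the prior scale and noise level are illustrative assumptions.

```python
import numpy as np

class GaussianLinearPosterior:
    """Conjugate posterior for y = x^T beta + eps with known noise variance sigma2
    and prior beta ~ N(0, prior_var * I)."""
    def __init__(self, d, sigma2=1.0, prior_var=1.0):
        self.sigma2 = sigma2
        self.precision = np.eye(d) / prior_var    # posterior precision matrix
        self.b = np.zeros(d)                      # precision-weighted mean term

    def update(self, x, y):
        self.precision += np.outer(x, x) / self.sigma2
        self.b += y * x / self.sigma2

    def mean_cov(self):
        cov = np.linalg.inv(self.precision)
        return cov @ self.b, cov

    def thompson_arm(self, arms, rng):
        mean, cov = self.mean_cov()
        beta_sample = rng.multivariate_normal(mean, cov)
        return int(np.argmax(arms @ beta_sample))

# Usage sketch with random arms (dimension, arm count, and noise are illustrative).
rng = np.random.default_rng(0)
d, n_arms = 5, 30
arms = rng.normal(size=(n_arms, d))
beta_star = rng.normal(size=d)
post = GaussianLinearPosterior(d, sigma2=2.0)
for t in range(500):
    i = post.thompson_arm(arms, rng)
    y = arms[i] @ beta_star + rng.normal(0, np.sqrt(2.0))
    post.update(arms[i], y)
```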

An optimistic algorithm for the cumulative regret of linear bandits is proposed in [1]. The algorithm constructs confidence sets $\mathcal{C}_t$ for $\beta^*$ by means of a self-normalized martingale argument (that we also used in the previous chapter), and at time $t$ it pulls $X_t$ satisfying $(X_t, \tilde{\beta}_t) = \arg\max_{(x,\beta) \in \mathcal{X}_t \times \mathcal{C}_{t-1}} x^\top \beta$. The cumulative regret is at most of order $O\big(d\log(n)\sqrt{n} + \sqrt{dn\log(n/\delta)}\big)$.

The problem of pure exploration has also been studied for linear bandits, where the goal is to find $\arg\max_{x \in \mathcal{X}} x^\top \beta^*$, again mainly in the fixed-confidence and fixed-budget settings.

1. Fixed Budget. In [42], BayesGap is proposed, a Bayesian algorithm that extends the UGapE approach presented in [30] for the context-free case. The number of arms is assumed finite, i.e., $|\mathcal{X}| = n$, and $y \mid x, \beta^* \sim \mathcal{N}(x^\top \beta^*, \sigma^2)$ with $\sigma^2$ known. For each $x$, upper and lower bounds on $x^\top \beta^*$ are computed. Using those bounds, the algorithm upper-bounds the potential simple regret incurred by each arm. The algorithm considers two candidates: the arm $x_i$ minimizing the previous upper bound on simple regret, and the arm $x_j \ne x_i$ with the highest upper bound on


$x_j^\top \beta^*$. It finally pulls the arm among $x_i$ and $x_j$ with the highest gap between its upper and lower bounds for $x^\top \beta^*$.

2. Fixed Confidence. In [83], several algorithms are proposed that guarantee the right answer with probability at least $1 - \delta$, while trying to minimize the sample complexity. The main idea is to choose the arms that will shrink the current confidence set the most in the relevant directions: those directions $y = x - x'$ between arms $x, x'$ that are still plausible candidates for being the best arm. In their XY-adaptive algorithm, arms are sequentially discarded from the latter set. In addition, a problem-dependent complexity measure is proposed, which extends and corresponds to $H$ in the standard stochastic MAB.

To illustrate the difference between cumulative and simple regret, we performed some simple simulations. In Figure 4.1, we compare the cumulative regret with the simple regret for increasing (but small) values of $T$, when $d = 25$, the number of arms is $|\mathcal{X}| = 100$, and the noise level is $\sigma^2 = 2$. The algorithms included in the simulations are listed and explained below. Except for uniform allocation, all of them are Bayesian algorithms that keep a posterior distribution $p_t$ over $\beta \in \mathbb{R}^d$. We measure simple regret by computing the true value of the best arm according to the mode of the posterior distribution of each algorithm after time $T$.

1. InfoMax. The algorithm chooses the arm that minimizes the expected entropy of the posterior after observing the new observation (which is equivalent to maximizing the information gain). In our simulations, the expectation was computed using a Monte Carlo approximation.

2. Information Directed Sampling (IDS). Introduced in [76], it pulls the arm minimizing a score that equals the ratio between the squared expected regret of the arm and the mutual information between the optimal arm and the observation obtained after pulling the arm.

3. Thompson Sampling. It samples $\beta_t \sim p_t$ and plays the best arm w.r.t. $\beta_t$, i.e., $x_t = \arg\max_{x \in \mathcal{X}} x^\top \beta_t$.

4. Two-Sample Thompson Sampling (TSTS). It samples $\beta_t \sim p_t$ first. Let $x_t$ be its best arm. With probability $\alpha$ it plays $x_t$, and with probability $1 - \alpha$ it keeps sampling $\beta'_t \sim p_t$ from the posterior distribution until the best arm for the newly sampled parameter, say $x'_t$, is different from $x_t$. Then, the algorithm plays $x'_t$. The algorithm was proposed in [77].

5. BayesGap. As explained above in more detail, proposed in [42], BayesGap considers two candidates: the arm minimizing an optimistic estimate of simple regret, and the (different) arm with the highest upper confidence bound on the reward $x^\top \beta$; it pulls the one with the largest uncertainty.

6. Uniform Allocation. At time $t$, it selects the arm to pull uniformly at random.


Figure 4.1: Linear bandit simulations with $d = 25$, $|\mathcal{X}| = 100$ arms, and $\sigma^2 = 2$. (a) Simple reward. (b) Cumulative reward.

The performance of BayesGap is remarkable, especially in terms of simple reward. IDS is a state-of-the-art algorithm for cumulative regret, and we see that it is by far the best one in that comparison. However, even uniform sampling collects data that is more informative for the final policy decision on which we measure simple reward. As expected, InfoMax performs quite well in terms of simple reward, while its cumulative reward (especially when $T$ grows) suffers as a consequence of the high degree of exploration, which is not necessarily oriented towards promising actions. The performance of TS and TSTS is fairly similar.

4.3 Problem Definition

The goal of this chapter is to define a pure exploration setting for reinforcement learning problems. Let $\mathcal{M}$ be a family of MDPs parameterized by $\theta \in \Theta$, i.e., $\mathcal{M} = \{M(\theta) : \theta \in \Theta\}$, where $M(\theta)$ is a specific MDP with state space $S$, action space $A$, horizon $H$, and transition and reward functions $P_\theta, R_\theta$, respectively. We assume $\theta^*$ is the true underlying parameter, which is unknown to the algorithm. Bayesian algorithms may assume it is sampled from a prior distribution $p_0(\theta)$. The input to the algorithm is $T > 0$, a fixed budget of episodes. After interacting with $M(\theta^*)$ for $T$ episodes, the algorithm needs to output a final policy $\pi_T$.


The objective is to maximize the expected value of the final policy, that is,

$$\mathbb{E}_{\theta^* \sim p_0}\big[V(\theta^*, \pi_T)\big], \qquad (4.1)$$

where the expectation includes all the randomness of the process and the algorithm. We believe that in many practical scenarios this is the true goal of the algorithm designer. However, and surprisingly, this pure exploration objective has been overlooked in the literature.

Next, we summarize some types of guarantees that are most similar to (4.1).

MDP-PAC. A concept studied before in reinforcement learning is that of Probably Approximately Correct (PAC) guarantees [48]. Let $\pi_t$ be the policy followed by algorithm $\mathcal{A}$ at time $t$, and let $\delta, \epsilon > 0$. We say that $\mathcal{A}$ is MDP-PAC if, with probability at least $1 - \delta$, at time step $t$ we have that $V_{\pi_t}(s_t) \ge V^*(s_t) - \epsilon$ for all but a number of steps $m$ that is polynomial in $|S|, |A|, H, 1/\epsilon$, and $1/\delta$. Unfortunately, the MDP-PAC guarantee does not specify when those $m$ steps happen, so it is hard to know, for a fixed $t = T$, whether $\pi_T$ is $\epsilon$-good. MDP-PAC algorithms tend to be very conservative.

Bayes Optimal. Bayesian algorithms for reinforcement learning provide an elegant framework to appropriately tune and solve the trade-off between exploration and exploitation. In fact, one can construct an augmented MDP, $M'$, whose states encode both the states from the original space $S$ and the belief state $b$, which summarizes the observed outcomes up to that point and, therefore, the current uncertainty about the underlying model. $M'$ is known as the Bayes-Adaptive MDP [26], and it is fully observed, so, in principle, it could be solved using dynamic programming to find $\pi^*_{M'}$. Note that this policy solves the exploration-exploitation dilemma in an optimal fashion. However, the size of $M'$ is usually enormous, and planning on it becomes intractable. Several algorithms have been proposed that rely on different kinds of approximations, usually performing partial lookahead search that incorporates the value of information, and most of them are computationally very expensive. In [55], the Bayesian Exploration Bonus algorithm (BEB) is introduced. The algorithm adds a novelty bonus to each $(s,a)$ of the form $\beta/(1 + n(s,a))$, where $n(s,a)$ is the number of times the state-action pair has been visited. The authors show that, for all but a polynomial number of steps, BEB's policy is $\epsilon$-close to $\pi^*_{M'}$. Remarkably, they also show the algorithm is not MDP-PAC, as the novelty bonus of MDP-PAC algorithms cannot wash out faster than $\beta/\sqrt{n}$, i.e., they explore far more aggressively. Finally, note that when the final goal is given by (4.1), the Bayes-Adaptive MDP is different, as all the reward is at the leaves, but it is still known, and similar approximation algorithms could in principle be derived.


4.4 The Information Supermarket

In this section, we describe a family of MDPs where algorithms that minimize cumulative regret can be forced to be arbitrarily suboptimal for simple regret minimization.

Let $M = M(\theta)$ be an arbitrary sparse MDP, for $\theta \in \Theta \subset \mathbb{R}^d$. By sparse, we mean that all the reward is obtained at the final states or leaves $s_1, \dots, s_m$. Assume the reward at each leaf $s$ is sampled from $r_s \sim \mathcal{N}(\beta_s, \sigma^2)$ for some known $\sigma^2 > 1$, where the mean-reward parameters $\beta \in \mathbb{R}^m$ are part of $\theta$. The information supermarket MDP describes a scenario where the agent can pay some known price $c > 0$ to obtain accurate measurements of the reward, or instead play for free in the noisier version of the world.

At the beginning of the episode, the agent can take two actions: $a_1$ and $a_2$. If the agent plays $a_1$, then he enters the information supermarket. The deterministic reward for this action is $-c$. In this case, the agent will interact with an exact copy of $M(\theta)$, say $M'(\theta)$, but the reward at the leaves will be sampled from $r_s \sim \mathcal{N}(\beta_s, 1)$ instead. On the other hand, if the agent pulls action $a_2$, he will not enter the information supermarket; he will deal with the real world. The agent interacts with a copy of $M(\theta)$ with the original (noisier) rewards $r_s \sim \mathcal{N}(\beta_s, \sigma^2)$. First, note that the optimal policy for $M(\theta)$ and $M'(\theta)$ is identical: the value function at the leaves is equal, as their mean rewards are the same. However, solving $M(\theta)$ is harder and requires more observations, as estimating $\beta_s$ is more difficult. Given iid observations $X_1, \dots, X_n \sim \mathcal{N}(0, \sigma^2)$, the sample mean has squared prediction error $\mathbb{E}[\bar{X}_n^2] = \sigma^2/n$. Equivalently, in order to confidently detect the difference in means between $X \sim \mathcal{N}(0, \sigma^2)$ and $Y \sim \mathcal{N}(\epsilon, \sigma^2)$ after $n$ samples from each, we need $n$ to be of order $\sigma^2/\epsilon^2$. Thus, we expect the cumulative regret of algorithm $\mathcal{A}$ in $M(\theta)$ to be of order $\sigma^2$ times that in $M'(\theta)$.

Our example is parameterized by (c,�2), the price and quality of information respectively. Sup-

pose the agent can play the MDP T times (or episodes). For c large enough, an optimal algorithm for

minimizing cumulative regret will not enter the supermarket at all; it is too expensive. By tuning �2,

we can make the pure-exploration performance of such an algorithm arbitrarily bad (actually, equal

to that of uniform allocation!) as the quality of the data it observes becomes worse and worse, and

so does its final policy. On the other hand, an optimal algorithm for pure exploration is completely

immune to both (c,�2). It will enter the supermarket in all of the T rounds as it is virtually free,

and will collect data with no �2-noise, to learn the optimal policy for M(✓). Note its output policy

will not enter the supermarket (no reason to pay c in test time), and its expected cumulative reward

value will not depend on �2.
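To make the $\sigma^2/\epsilon^2$ scaling concrete, the following minimal sketch (a hypothetical instance with two leaves whose means differ by $\epsilon$) estimates how often the empirical best leaf is identified correctly with the same number of samples inside and outside the supermarket.

```python
import numpy as np

# Minimal sketch (hypothetical instance): two leaves with mean rewards 0 and eps.
# Inside the supermarket the reward noise has variance 1; outside it is sigma2.
# We estimate P(empirical best leaf is the true best) after n samples per leaf.
rng = np.random.default_rng(0)
eps, sigma2, n, trials = 0.2, 25.0, 200, 2000

def prob_correct(noise_var):
    hits = 0
    for _ in range(trials):
        mean0 = rng.normal(0.0, np.sqrt(noise_var), n).mean()
        mean1 = rng.normal(eps, np.sqrt(noise_var), n).mean()
        hits += mean1 > mean0
    return hits / trials

print("clean copy M'(theta):", prob_correct(1.0))      # close to 1
print("noisy copy M(theta): ", prob_correct(sigma2))   # needs roughly sigma2 times more samples
```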

A few remarks are needed. As the optimal policy for the (whole) Information Supermarket MDP


does not enter the supermarket, proper implementations of Thompson/Posterior sampling will never enter the information supermarket regardless of the actual value of $c > 0$: the probability of $a_1$ being optimal is zero, and those policies are never sampled. Other approaches, like Information Directed Sampling, may or may not enter the supermarket, depending on whether the information is rightly priced.

Another useful way to look at the MDP is by setting the reward/cost of $a_1$ to zero, but charging the information fee at the end: $r_s \sim \mathcal{N}(\beta_s - c, 1)$ for all leaves $s$. More broadly, this illustrates our point. In practice, examples could be constructed where $M'(\theta)$ and $M(\theta)$ are not equal, and paths in $M'(\theta)$ require some extra cost $c$ while providing clean information to better solve $M(\theta)$. These scenarios apply in real-world situations where the agent is given the freedom to ignore intermediate costs, and should focus only on collecting the best data.

4.5 Information Maximization

In this section we provide some formal intuition about the problem, and propose a simple heuristic.

Recall that at the end of the learning process, the decision maker must output a final policy $\pi_T$. The loss associated with $\pi_T$ is given by $\mathbb{E}_{\theta^*\sim p_0}[V(\theta^*, \pi_T)]$. After collecting all the data, and before choosing a final policy, the knowledge acquired by the algorithm is completely summarized in the posterior distribution over MDPs given by $p_T$. Therefore, any algorithm has two distinct components: the data collector, which makes decisions during the learning stage leading to the final posterior $p_T$ (that is, it sets the policy $\pi_t$ to use at each time $t$), and the policy maker, which takes as input the latest posterior $p_T$ and returns the final policy $\pi_T$.

There are two natural choices for the policy maker. We want to maximize the expected value of the final policy. When the true environment is actually sampled from the initial prior, its distribution after observing the data is precisely the posterior $p_T$. Thus, the best response is defined as
$$\hat{\pi}_T := \operatorname*{argmax}_{\pi \in \Pi} \; \mathbb{E}_{\theta^* \sim p_T}\left[V(\theta^*, \pi)\right]. \qquad (4.2)$$
In this case, the policy maker is deterministic, but $\hat{\pi}_T$ can be hard to compute in practice. On the other hand, a different approach consists in returning a policy sampled from the posterior. Formally, the algorithm samples $\theta \sim p_T$, and then computes its optimal policy $\pi^*_\theta$. Note that many different $\theta$'s may lead to the very same policy, or to policies with identical values under the true environment. This approach is much easier to compute, while suboptimal in general. We focus on algorithms that sample the final answer from the posterior, which we call cheerful algorithms.
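The two policy makers can be sketched as follows (a minimal illustration; `posterior_samples`, `optimal_policy`, `value`, and `candidate_policies` are hypothetical helpers standing in for a posterior sampler and an MDP solver/evaluator).

```python
import numpy as np

# Sketch of the two policy makers described above (hypothetical helper functions:
# `posterior_samples` draws theta ~ p_T, `optimal_policy` and `value` are assumed
# to solve and evaluate a known MDP theta).

def cheerful_policy(posterior_samples, optimal_policy):
    """Sample one theta from the posterior and return its optimal policy."""
    theta = posterior_samples(1)[0]
    return optimal_policy(theta)

def best_response_policy(posterior_samples, candidate_policies, value, m=100):
    """Approximate (4.2): pick the candidate maximizing the posterior-mean value."""
    thetas = posterior_samples(m)
    scores = [np.mean([value(theta, pi) for theta in thetas]) for pi in candidate_policies]
    return candidate_policies[int(np.argmax(scores))]
```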


To simplify notation, we define the value function $V(\theta_1, \theta_2) := V(\theta_1, \pi^*_{\theta_2})$; in words, the expected value of playing the optimal policy for $\theta_2$ under $\theta_1$. Note that for fixed $\theta_1, \theta_2$, the function $V(\theta_1, \theta_2)$ is deterministic ($\Theta \times \Theta \to \mathbb{R}$), but usually complex. In particular, $V^*_\theta = V(\theta, \theta)$.

The loss $L(\mathcal{A})$ of any cheerful algorithm $\mathcal{A}$ is then
$$L(\mathcal{A}) := \mathbb{E}_{\theta^*\sim p_0}[V(\theta^*, \pi_T)] = \mathbb{E}_{p_T}\left[V(\theta_1, \theta_2)\right], \qquad \theta_1, \theta_2 \sim p_T(\cdot), \qquad (4.3)$$
where $\theta_1$ and $\theta_2$ are sampled independently. Further, we denote by $\mathcal{F}_t$ the distribution of $V(\theta_1, \theta_2)$, where $\theta_1, \theta_2 \sim p_t(\cdot)$. Thus, if a cheerful algorithm had to stop at time $t$, then $\mathcal{F}_t$ is the distribution of its performance. The goal of the data collector is to sequentially choose policies that lead to desirable $\mathcal{F}_t$ distributions, i.e., with large mean and small dispersion (variance, entropy, etc.).

Formally, let $f$ encode our definition of desirable. For example, $f(\mathcal{F}) = \mathbb{E}[\mathcal{F}]$, $f(\mathcal{F}) = -H[\mathcal{F}]$, or $f_\alpha(\mathcal{F}) = \mathbb{E}[\mathcal{F}] + \alpha\,\mathrm{std}(\mathcal{F})$. We can define a family of greedy algorithms that, at time $t$, choose the policy that maximizes the expected value of $f(\mathcal{F}_{t+1})$, as illustrated in Algorithm 4. As solving equations like (4.2) may be unrealistic in practice, we also define Algorithm 5, an approximate version of Algorithm 4 that relies on Monte Carlo samples.

Algorithm 4 f-Greedy Algorithm
1: Input: $f : \mathcal{F} \to \mathbb{R}$.
2: for episode $t = 1, \dots, T$ do
3:   Find policy $\pi_t = \operatorname{argmax}_{\pi \in \Pi} \mathbb{E}_{\theta^*}\left[f(\mathcal{F}_t) \mid D_{1:t-1} \cup D(\theta^*, \pi)\right]$.
4:   Collect data $D_t$ using policy $\pi_t$.
5:   Update posterior distributions $p_t(\cdot), \pi_t(\cdot)$ with $D_t$.
6: end for
7: Return $\pi \sim \pi_T(\cdot)$.

In Algorithm 4, the expected value of $f(\mathcal{F}_t)$ is computed by taking the expectation over $\theta$'s according to the current posterior $p_{t-1}$, and, also, the expectation with respect to the datasets $D(\theta^*, \pi)$ that can be collected using $\pi$ under $\theta^* \sim p_{t-1}$, which lead to the fantasized updated posterior $\mathcal{F}_t$. Unfortunately, even for very small problems, computing the argmax in the third step of Algorithm 4 is usually not feasible. Consequently, we propose Algorithm 5, where —in steps 3 and 4— a subset of policies and MDPs is sampled to approximate the argmax. Sampling the MDPs from the posterior $p_t$ yields an unbiased estimate of the expectation. However, how to sample the policy set is not clear. We denote the corresponding mechanism or distribution by $\mathcal{P}$. Letting $\mathcal{P}$ be the posterior distribution over policies at time $t-1$ is not necessarily a good idea, especially for large $t$: in some sense, it makes innovation harder, as it tends to gather data from policies already known to be good. In our experiments, the best results corresponded to sampling random policies with


probability $w_t$ and posterior-optimal policies with probability $1 - w_t$, while letting $w_t \to 1$ as $t$ grows. Nonetheless, this approach seems destined to fail in large and complex MDPs where random policies are extremely poor. How to efficiently sample or search the space of policies to perform step 3 of Algorithm 5 remains an exciting open research question. Finally, note that computing $f(\mathcal{F}^{i,j}_t)$ in step 9 of Algorithm 5 can be quite expensive, and may require further approximations.

Algorithm 5 Monte Carlo f-Greedy Algorithm
1: Input: $q, k \in \mathbb{N}$, $f : \mathcal{F} \to \mathbb{R}$.
2: for episode $t = 1, \dots, T$ do
3:   Sample policies $\pi_1, \dots, \pi_q \sim \mathcal{P}$.
4:   Sample MDPs $\theta_1, \dots, \theta_k \sim p_{t-1}$.
5:   for policy $\pi_i$, $i = 1, \dots, q$ do
6:     Initialize policy score: $s_i = 0$.
7:     for MDP $\theta_j$, $j = 1, \dots, k$ do
8:       Generate data $D(\theta_j, \pi_i)$, and compute the updated posterior $\mathcal{F}^{i,j}_t$.
9:       Update score: $s_i = s_i + f(\mathcal{F}^{i,j}_t)$.
10:     end for
11:   end for
12:   Define policy $\pi_t = \operatorname{argmax}_i s_i$.
13:   Collect data $D_t$ using policy $\pi_t$.
14:   Update posterior distributions $p_t(\cdot), \pi_t(\cdot)$ with $D_t$.
15: end for
16: Return $\pi \sim \pi_T(\cdot)$.
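A minimal sketch of one step of Algorithm 5 in code (the helpers `sample_policies`, `sample_mdps`, `simulate`, `update_posterior`, and the scoring functional `f` are assumptions standing in for the problem-specific components):

```python
import numpy as np

# Minimal sketch of the Monte Carlo f-greedy step (Algorithm 5), under assumed
# helper functions: `sample_policies`, `sample_mdps`, `simulate` (returns fantasized
# data D(theta_j, pi_i)), `update_posterior`, and a scoring functional `f` acting on
# the fantasized posterior.

def monte_carlo_f_greedy_step(posterior, sample_policies, sample_mdps,
                              simulate, update_posterior, f, q=10, k=20):
    policies = sample_policies(q)
    thetas = sample_mdps(posterior, k)
    scores = np.zeros(q)
    for i, pi in enumerate(policies):
        for theta in thetas:
            fantasy_data = simulate(theta, pi)                 # D(theta_j, pi_i)
            fantasy_post = update_posterior(posterior, fantasy_data)
            scores[i] += f(fantasy_post)                       # f(F_t^{i,j})
    return policies[int(np.argmax(scores))]                    # pi_t
```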

4.6 Adaptive Submodularity

In this section, we argue that in some simple settings a greedy approach like the one discussed in

the previous section can be close to optimal. In order to do so, we go back to linear models, and

focus on an entropy minimization goal.

We consider the following problem. We would like to estimate $\theta$ so that $Y = X^T\theta + \epsilon$, where the noise is independent and Gaussian, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with $\sigma^2$ known. In order to collect data, at each step we can choose one of $m$ policies. Let $\pi_i$ be a randomized data-collection policy, for $i = 1, \dots, m$. In particular, when we play policy $\pi_i$, the observation $X$ is sampled from $F_i$ with mean $\mu_i$ and covariance matrix $\Sigma_i$. If we denote by $\pi_t$ the policy chosen at time $t$, then
$$Y_t = X_t^T\theta + \epsilon, \qquad X_t \sim F_t(\mu_t, \Sigma_t), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2). \qquad (4.4)$$
Our beliefs about $\theta$ are initially given by the prior $p_0 = \mathcal{N}(\mu_0, \lambda_0 I_d)$, and after observing $(X_t, Y_t)$ for


$t = 1, \dots$, we update the posterior to $p_t = \mathcal{N}(\mu_t, \Sigma_t)$ with
$$\Sigma_t = \left(\frac{1}{\lambda_0} I_d + \frac{1}{\sigma^2}\mathbf{X}_t^T\mathbf{X}_t\right)^{-1}, \qquad (4.5)$$
$$\mu_t = \Sigma_t\left(\frac{\mu_0}{\lambda_0} + \frac{1}{\sigma^2}\mathbf{X}_t^T\mathbf{Y}_t\right), \qquad (4.6)$$
where $\mathbf{X}_t$ and $\mathbf{Y}_t$ are the first $t$ observations. Following the D-optimality criterion, we seek to minimize the entropy of $p_t$ over time. Equivalently, if we can sample $n$ observations, we would like to minimize
$$H(p_n) = C + \frac{1}{2}\log\det\Sigma_n = C - \frac{1}{2}\log\det\left(\frac{1}{\lambda_0} I_d + \frac{1}{\sigma^2}\mathbf{X}_n^T\mathbf{X}_n\right). \qquad (4.7)$$
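As a small illustration of (4.5)-(4.7), the following sketch performs the conjugate posterior update and evaluates the D-optimality objective via a log-determinant (assuming the stated Gaussian model with known $\sigma^2$):

```python
import numpy as np

# Sketch of the conjugate update (4.5)-(4.6) and the D-optimality objective (4.7),
# under the stated model Y = X^T theta + eps with known noise variance sigma2 and
# prior N(mu0, lambda0 * I).

def posterior(X, Y, mu0, lambda0, sigma2):
    d = X.shape[1]
    precision = np.eye(d) / lambda0 + X.T @ X / sigma2     # inverse of (4.5)
    Sigma_t = np.linalg.inv(precision)
    mu_t = Sigma_t @ (mu0 / lambda0 + X.T @ Y / sigma2)     # (4.6)
    return mu_t, Sigma_t

def entropy_up_to_constant(Sigma_t):
    # H(p_t) = C + 0.5 * log det Sigma_t  (4.7); the constant C is omitted.
    sign, logdet = np.linalg.slogdet(Sigma_t)
    return 0.5 * logdet
```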

We show that a greedy algorithm that, at each step $t$, chooses the policy $\pi$ minimizing the expected value of the entropy after collecting the observation, i.e. $\mathbb{E}_\pi\,H(p_{t+1})$, is near-optimal. To do so, we use submodularity. The previous function is supermodular, as we show in Lemma 54 in Appendix D. It is also monotonically decreasing, see Lemma 55 in Appendix D. However, in our setting $X_i$ is a random variable, not a deterministic point in the space. Therefore, we need a stronger concept of submodularity: adaptive submodularity [35].

In order to describe our problem in a way that fits the adaptive submodularity framework, consider the following definitions. The set of possible actions is given by $V = \{\pi_i \mid i \in [m]\}$. The same policy can be played as many times as desired; one way to think about this is by assuming there are $n$ different copies of $\pi_i$ in $V$. After playing an action, we receive a random observation $(X, Y) \in \mathbb{R}^d \times \mathbb{R} =: O$. We will use the observation to update our beliefs about the world, in particular, about $\theta$. The function that we want to optimize, $f : V' \times O' \subseteq V \times O \to \mathbb{R}$, is as follows:
$$f(\Pi_t, \mathbf{X}_t, \mathbf{Y}_t) = C' - H[p_t] \qquad (4.8)$$
$$= C'' + \frac{1}{2}\log\det\left(\frac{\sigma^2}{\lambda_0} I_d + \mathbf{X}_t^T\mathbf{X}_t\right), \qquad (4.9)$$
where $\Pi_t = \{\pi_1, \dots, \pi_t\}$ denotes the history of played policies up to time $t$. We simply add a constant to make sure the function is always positive, $f \geq 0$. In particular, as the $\log\det$ term is increasing in the observations, we only assume that $C'' > -(n/2)\log(\sigma^2/\lambda_0)$.

We need to define three additional concepts that naturally extend submodularity to settings where decisions are made adaptively and based on partial feedback. The conditional expected marginal benefit of action $\pi_i$ after having observed $(\Pi_t, \mathbf{X}_t, \mathbf{Y}_t)$, usually denoted by $\Delta(\pi_i \mid \mathbf{X}_t, \mathbf{Y}_t)$, is
$$\Delta(\pi_i \mid \mathbf{X}_t, \mathbf{Y}_t) = \mathbb{E}\left[f(\{\pi_i, X_i, Y_i\} \cup (\Pi_t, \mathbf{X}_t, \mathbf{Y}_t), \theta) - f((\Pi_t, \mathbf{X}_t, \mathbf{Y}_t), \theta) \mid \mathbf{X}_t, \mathbf{Y}_t\right]. \qquad (4.10)$$


The previous expectation is with respect to $p_t$, the posterior distribution of $\theta$ given $(\mathbf{X}_t, \mathbf{Y}_t)$, and the random noise $\epsilon$ in $Y_i$. In other words, $\Delta$ is the expected information gain
$$\Delta(\pi_i \mid \mathbf{X}_t, \mathbf{Y}_t) = H[p_t \mid \mathbf{X}_t, \mathbf{Y}_t] - \mathbb{E}\left[H[p_{t+1}] \mid \mathbf{X}_t, \mathbf{Y}_t, X \sim \pi_i\right]. \qquad (4.11)$$

We say that $f$ is adaptive submodular with respect to the prior over $\theta$ if for all $\psi = (\Pi_1, \mathbf{X}_1, \mathbf{Y}_1)$ and $\psi' = (\Pi_2, \mathbf{X}_2, \mathbf{Y}_2)$ such that $\psi$ is a subrealization of $\psi'$, i.e., every triple $(\pi_j, X_j, Y_j)$ from $\psi$ is also contained in $\psi'$, and for all $\pi \notin \psi'$, we have
$$\Delta(\pi \mid \mathbf{X}_1, \mathbf{Y}_1) \geq \Delta(\pi \mid \mathbf{X}_2, \mathbf{Y}_2). \qquad (4.12)$$

Assume $\mathbf{X}_2 = \mathbf{X}_1 \cup X$. We define $\delta := \Delta(\pi \mid \mathbf{X}_1, \mathbf{Y}_1) - \Delta(\pi \mid \mathbf{X}_2, \mathbf{Y}_2)$. Then, we see that
$$\begin{aligned}
\delta &= \int \left(H[p \mid \psi] - H[p \mid \psi, X, Y]\right) g_\pi(X, \epsilon)\, dX\, d\epsilon - \int \left(H[p \mid \psi'] - H[p \mid \psi', X, Y]\right) g_\pi(X, \epsilon)\, dX\, d\epsilon \\
&= \int \left(H[p \mid \psi] - H[p \mid \psi, X]\right) g_\pi(X)\, dX - \int \left(H[p \mid \psi'] - H[p \mid \psi', X]\right) g_\pi(X)\, dX \\
&= \int \left[\left(H[p \mid \mathbf{X}_1, \mathbf{Y}_1] - H[p \mid \mathbf{X}_1, \mathbf{Y}_1, X]\right) - \left(H[p \mid \mathbf{X}_2, \mathbf{Y}_2] - H[p \mid \mathbf{X}_2, \mathbf{Y}_2, X]\right)\right] g_\pi(X)\, dX \\
&= \int \left[\left(H[p \mid \mathbf{X}_1] - H[p \mid \mathbf{X}_1, X]\right) - \left(H[p \mid \mathbf{X}_2] - H[p \mid \mathbf{X}_2, X]\right)\right] g_\pi(X)\, dX,
\end{aligned}$$
where $g_\pi(X, \epsilon)$ is the joint density of $X, \epsilon$ when playing policy $\pi$, and $p$ denotes the posterior over $\theta$ given the data. The value of the previous expression does not depend on $Y$ or $\epsilon$. It follows that $\delta \geq 0$ as, by Lemma 54, for every $X \in \mathbb{R}^d$,
$$\log\det(\lambda_0^{-1} I_d + \mathbf{X}_1^T\mathbf{X}_1 + XX^T) - \log\det(\lambda_0^{-1} I_d + \mathbf{X}_1^T\mathbf{X}_1) \qquad (4.13)$$
$$\geq \log\det(\lambda_0^{-1} I_d + \mathbf{X}_1^T\mathbf{X}_1 + \mathbf{X}^T\mathbf{X} + XX^T) - \log\det(\lambda_0^{-1} I_d + \mathbf{X}_1^T\mathbf{X}_1 + \mathbf{X}^T\mathbf{X}), \qquad (4.14)$$

so the integrand is positive. Next, we need to check that $f$ is adaptive monotonic, which requires (4.11) to be always non-negative. The latter is always true, as we can simply condition on $X \sim \pi$ and apply Lemma 55. We now apply Theorem 1.14 in [56]:

Theorem 21. Let $G$ be the greedy algorithm that selects $\pi^*_t \in \operatorname{argmax}_\pi \Delta(\pi \mid \mathbf{X}_t, \mathbf{Y}_t)$ at time $t$. Assume $G$ is run for $l \leq n$ time steps, selecting $l$ policies, and let $\mathcal{A}$ be any algorithm selecting at most $n$ policies. Then,
$$\mathbb{E}[f(G)] \geq \left(1 - e^{-l/n}\right)\mathbb{E}[f(\mathcal{A})], \qquad (4.15)$$
where $\mathbb{E}[f(\mathcal{A})]$ denotes the expected value of $f$ under data collected by algorithm $\mathcal{A}$.

As $f(\mathcal{A}) \geq 0$, we have that $-f(\mathcal{A}) \leq 0$. Therefore, we conclude that


Theorem 22. Let $G$ be the greedy algorithm that, at time $t$, selects the policy maximizing the one-step information gain. Let $\mathcal{A}$ be any algorithm sequentially selecting at most $n$ policies. Then,
$$\mathbb{E}[H_n(G)] - C' \leq \left(1 - e^{-1}\right)\left(\mathbb{E}[H_n(\mathcal{A})] - C'\right), \qquad (4.16)$$
where $\mathbb{E}[H_n(\mathcal{A})]$ denotes the expected entropy at time $n$ when the data was collected by algorithm $\mathcal{A}$.
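A minimal sketch of the greedy rule behind Theorem 22: estimate, by Monte Carlo, the expected one-step log-det (information) gain of each candidate policy and pick the best one (`policy_samplers` is an assumed list of functions that draw $X$ from each $F_i$):

```python
import numpy as np

# Sketch of the greedy information-gain rule: pick the policy whose sampled
# observations most increase log det of the posterior precision, i.e., most
# reduce the posterior entropy. `policy_samplers` is an assumed list of callables.

def greedy_info_gain(precision, policy_samplers, sigma2, n_mc=500, rng=None):
    rng = rng or np.random.default_rng(0)
    sign, base = np.linalg.slogdet(precision)
    gains = []
    for sample_x in policy_samplers:
        total = 0.0
        for _ in range(n_mc):
            x = sample_x(rng)
            sign, new = np.linalg.slogdet(precision + np.outer(x, x) / sigma2)
            total += new - base          # one-step information gain for this draw
        gains.append(total / n_mc)
    return int(np.argmax(gains))
```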

Moreover, we can extend the previous result to a more general reinforcement learning problem. Suppose $\mathcal{S}$ is the set of states, and $|\mathcal{S}| = S$. Each state $s \in \mathcal{S}$ has an underlying latent linear parameter $\theta_s \in \mathbb{R}^d$. We can play one of $m_s$ actions at state $s$, say $\pi_k$, leading to reward $x^T\theta_s + \epsilon$, where $x \sim \pi_k = F(\mu_{k,s}, \Sigma_{k,s})$ and $\epsilon \sim \mathcal{N}(0, \sigma_s^2)$. After playing action $k$ at state $s_i$, we transition to state $s_j$ with probability $p_{ijk}$. If we assume the transition probabilities are known, then at time $t$ the posterior over $(\theta_1, \dots, \theta_S)$ factorizes as
$$p_t = p_t(\theta_1, \dots, \theta_S) = \prod_{s\in\mathcal{S}} p_{s,t}(\theta_s) = \prod_{s\in\mathcal{S}} \mathcal{N}(\theta_s; \mu_{s,t}, \Sigma_{s,t}). \qquad (4.17)$$
The joint entropy is the sum of the individual entropies, $H(p_t) = \sum_{s\in\mathcal{S}} H(p_{s,t})$, and the sum of submodular functions is submodular, so
$$H(p_t) = C''' - \frac{1}{2}\sum_{s\in\mathcal{S}}\log\det\left(\frac{1}{\lambda_0} I_d + \frac{1}{\sigma_s^2}\mathbf{X}_{s,t}^T\mathbf{X}_{s,t}\right) \qquad (4.18)$$

is submodular, where $\mathbf{X}_{s,t}$ denotes the $X$ data points collected at state $s$ up to time $t$.

At each episode $t$, we must choose a policy
$$w_t = (\pi_{s,t} \mid s \in \mathcal{S}). \qquad (4.19)$$
We can directly extend Theorem 22 to the algorithm that chooses $w_t$ greedily with respect to the global information gain. Unfortunately, now the number of potential policies is exponential in the number of states, namely $\prod_s m_s$, and we have to rely on approximations of the argmax, as in Algorithm 5.

4.7 Conclusions

Efficient pure-exploration algorithms for reinforcement learning could have an important impact in practice. In words, those algorithms collect the data that is best for sequentially learning an unknown Markov Decision Process. Unfortunately, in general cases, scaling the algorithms to large MDPs is computationally very expensive, and becomes intractable fairly quickly. There are several


interesting future directions of research. First, it is not clear how to sample informative policies given a posterior distribution over the unknown parameters in a Bayesian setting, something central for a successful greedy algorithm. Second, Algorithms 4 and 5 require being able to solve a known MDP (basically, to approximate $f(\mathcal{F})$, as we need the optimal policy for $\theta_2$ to approximate (4.3)). However, in complex MDPs that are modeled via function approximation using —for example— neural networks, this may not be cheap or even feasible. Finally, the pure-exploration nature of the problem implies that terrible actions may be selected due to their high informational value. While this is usually perfectly fine in simulated environments, some kind of safety constraint may apply in real-world experiments. Therefore, it would be valuable to study the same problem augmented with minimum instantaneous regret constraints. A natural way to impose the constraint is by defining a control policy, $\pi_{\mathrm{con}}$, and requiring the algorithm to deliver at least an $\epsilon$ fraction of its value at all times $t > T_0$ with high probability, [98, 51].


Concluding Remarks

A few key innovations changed the world in the second half of the 18th century. James Hargreaves's Spinning Jenny, the Watt steam engine, and the telegraph, among other inventions, drove the first industrial revolution, leading to unprecedented increases in productivity.

Some decades later, before World War I, we did it again. Several developments in energy, transportation, and communication heavily transformed structural aspects of our societies: from labor, commerce, and economics, to politics and science.

On a smaller scale, we are also witnessing some fundamental changes in our own times. Alan Turing published his seminal On Computable Numbers in 1936. A decade later, in 1947, the bipolar transistor was invented. Packet switching, the TCP/IP protocol, and the ARPANET network were born in the late 1950s and late 1960s, finally leading to the development of the Internet around the early 1990s. Today, it is estimated that almost half the world's population has access to the Internet. There are also more than 2 billion smartphones, and even more Internet of Things devices. Collecting and storing data. Lots of data.

The world's capacity to store data has been roughly doubling every three years since the 1980s. Big Data keeps improving in volume, velocity, and variety. Recently, both governments and privately funded institutions have been investing vast amounts of money in Data Science, Machine Learning, and Artificial Intelligence research, education, infrastructure, and human capital.

We are probably just starting to scratch the surface of what we can do with data; and we need to

evaluate what the current bottlenecks are. Remarkable progress has been made in scaling systems:

more, bigger, larger. Certainly, collecting more data almost always helps, but at this point it may

be worth investing effort in improving its quality.

The next step is better data. We need better data. Data tailored to the learning tasks at hand.

Data that justifies in informational and actionable terms its collection, storage, and treatment costs.


Appendix A

Statistical Learning Theory

Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$, $Y \in \mathcal{Y} \subseteq \mathbb{R}$ be random variables with joint distribution $p^*$. We assume the distribution is unknown, and we are given a dataset $D = \{Z_i = (X_i, Y_i) : i = 1, \dots, m\}$ where the $Z_i \sim p^*$ are iid.

Let us introduce some definitions of key concepts:

Definition 1 (Hypothesis Class). A set of functions $f : \mathcal{X} \to \mathcal{Y}$. We want to find the one which best resembles distribution $p^*$. In the case of quadratic loss, we would like to approximate $f(X) = \mathbb{E}_{p^*}[Y \mid X]$. As an example, consider the set of linear functions $\mathcal{H} = \{f(x) = x^T\theta + \theta_0 : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R}\}$.

Definition 2 (Loss Function). A function $L : \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}_+$ which measures the penalty for using $h \in \mathcal{H}$ to predict $Y$ given $X$. Examples include the 0-1 loss for classification, $L(X, Y, h) = \mathbb{I}(Y \neq h(X))$, and the squared error loss for regression, $L(X, Y, h) = (Y - h(X))^2$. Convexity is an important feature of loss functions for optimization.

Definition 3 (Generalization Error or Risk). For $h \in \mathcal{H}$, we define its risk with respect to loss $L$ as
$$R(h) = \mathbb{E}_{Z\sim p^*}[L(Z, h)] = \int L(z, h)\, p^*(z)\, dz.$$

Definition 4 (Risk Minimizer).
$$h^* = \operatorname*{argmin}_{h\in\mathcal{H}} R(h).$$

Definition 5 (Empirical Risk). Given $D$, the empirical risk of $h \in \mathcal{H}$ with respect to loss $L$ is
$$\hat{R}(h) = \frac{1}{m}\sum_{i=1}^m L(Z_i, h).$$


Definition 6 (Empirical Risk Minimizer). Given dataset $D$,
$$\hat{h} = \hat{h}_D = \operatorname*{argmin}_{h\in\mathcal{H}} \hat{R}_D(h).$$
We may also write $\hat{R}_m(h)$ to make the dependence on $m = |D|$ explicit. The main question in Statistical Learning is: how close are $R(\hat{h})$ and $\hat{R}(\hat{h})$? Importantly, for any $m > 0$ and $h \in \mathcal{H}$, $\mathbb{E}[\hat{R}_m(h)] = R(h)$, i.e., $\hat{R}(h)$ is unbiased.
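As a minimal concrete illustration of Definitions 5 and 6 (a synthetic example, not from the text): empirical risk minimization over a small finite class of threshold classifiers under the 0-1 loss.

```python
import numpy as np

# Minimal ERM sketch: finite hypothesis class of threshold classifiers h_c(x) = 1{x >= c},
# 0-1 loss, synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = (X + 0.3 * rng.normal(size=200) >= 0).astype(int)    # noisy labels

thresholds = np.linspace(-2, 2, 41)                       # the finite class H
emp_risks = [np.mean((X >= c).astype(int) != Y) for c in thresholds]
c_hat = thresholds[int(np.argmin(emp_risks))]             # empirical risk minimizer
print("ERM threshold:", c_hat, "empirical risk:", min(emp_risks))
```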

Definition 7 (Excess Risk). The difference between the risks of the empirical and global risk minimizers:
$$R(\hat{h}) - R(h^*).$$
The goal is, as $|D| = m \to \infty$, to let the excess risk go to 0 as fast as possible. Note that $h^*$ is deterministic, but $\hat{h}$ and the excess risk are random variables. We are using a frequentist paradigm here; $h^*$ is fixed and we consider the worst case. For each hypothesis $h$, $R(h)$ is deterministic and $\hat{R}_m(h)$ is a random variable.

Definition 8 (Realizable Setting). $R(h^*) = 0$; that is, there is a perfect hypothesis in $\mathcal{H}$.

Definition 9 (Agnostic Setting). $R(h^*) > 0$; there is no perfect hypothesis in $\mathcal{H}$.

Definition 10 (M-estimator). Statistical estimators which are obtained as the minima of sums of

functions of the data. More formally, they are usually defined to be a zero of an estimating function

– for example, the derivative of the likelihood, for MLE.

When we observe a dataset, we will assume that we can compute $\hat{h}$, the empirical risk minimizer. Therefore, our main concern is to derive guarantees for $R(\hat{h})$, the generalization error of the hypothesis that we will eventually choose. However, computing $\hat{h}$ is not always easy, for example, if the loss function is not convex.

Definition 11 (Uniform Convergence). If we fix a single hypothesis $h \in \mathcal{H}$, we can show that $\hat{R}(h) - R(h) \to 0$ as $m \to \infty$ by using Hoeffding's inequality or the CLT. However, as $\hat{h}$ is random, plain convergence is not sufficient. We need uniform convergence over all $h \in \mathcal{H}$, which implies the convergence $\hat{R}_m(\hat{h}) \to R(\hat{h})$.

What is the impact of the complexity of $\mathcal{H}$ on the rate of convergence of $\hat{R}_m(\hat{h})$? We would like to prove bounds of the form: for $\epsilon, \delta > 0$, if $m \geq g(\epsilon, \delta, d)$, then
$$\mathbb{P}\left(R(\hat{h}_m) - R(h^*) \geq \epsilon\right) \leq \delta.$$
We are interested in finding out how $g$ depends on $\epsilon$ and $d$.


Definition 12 (PAC Framework). Probably Approximately Correct. An algorithm $\mathcal{A}$ learns class $\mathcal{H}$ in the PAC sense if for any distribution $p^*$ over $\mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$, and any $\epsilon > 0$, $\delta > 0$, $\mathcal{A}$ takes $n$ training examples and returns $\hat{h} \in \mathcal{H}$ such that, with probability at least $1 - \delta$, $R(\hat{h}) - R(h^*) \leq \epsilon$, where $n = \mathrm{poly}(d, 1/\epsilon, 1/\delta)$.

Bound 23 (Realizable Finite Hypothesis Class). Let $L$ be the 0-1 loss. Then, with probability $1 - \delta$, the generalization error satisfies
$$R(\hat{h}_m) \leq \frac{\log|\mathcal{H}| + \log(1/\delta)}{m}.$$
Equivalently, the sample complexity needs to be
$$m \geq \frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon} \;\Longrightarrow\; R(\hat{h}_m) \leq \epsilon.$$
Note that the excess risk behaves as $O(1/m)$ (as $R(h^*) = 0$). This is a fast rate.

Definition 13 (Distribution-Free Results). The previous result does not depend on $p^*$. Uniform convergence leads to results that are valid under all distributions $p^*$. Therefore, these generalization bounds consider the worst-case scenario and may be quite loose in practical cases.

Let us study the agnostic setting. Consider the following decomposition:
$$R(\hat{h}) - R(h^*) = [R(\hat{h}) - \hat{R}(\hat{h})] + [\hat{R}(\hat{h}) - \hat{R}(h^*)] + [\hat{R}(h^*) - R(h^*)].$$
The second term is negative by definition of $\hat{h}$. We can apply the CLT and concentration to the third one. The first term is more subtle: we cannot apply the CLT directly and need uniform convergence (because of the dependence of $\hat{h}$ on the dataset).

We can write the following probabilistic statement:
$$\mathbb{P}\left(R(\hat{h}) - R(h^*) \geq \epsilon\right) \leq \mathbb{P}\left(\sup_{h\in\mathcal{H}} |\hat{R}(h) - R(h)| \geq \epsilon/2\right).$$

Theorem 24 (Glivenko–Cantelli). Let $X_1, X_2, \dots$ be iid random variables in $\mathbb{R}$ such that $X_i \sim F$. We define the empirical distribution function as
$$F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(X_i \leq x).$$
Not only does $F_n$ converge pointwise to $F$ for each $x \in \mathbb{R}$, but also uniformly:
$$\sup_{x\in\mathbb{R}} |F_n(x) - F(x)| \to 0, \quad \text{almost surely}.$$


Definition 14 (Empirical Processes). See EPs: Theory & Applications, by Pollard.

Definition 15 (Consistency). Fix $h \in \mathcal{H}$; then $\hat{R}(h) - R(h) \to 0$ in probability.

Definition 16 (Asymptotic Normality). If $\sigma^2 = \mathrm{Var}(L(Z, h))$, where $Z \sim p^*$, then (in distribution)
$$\sqrt{m}\,(\hat{R}_m(h) - R(h)) \to \mathcal{N}(0, \sigma^2).$$

Definition 17 (Moment Generating Function). $M_X(t) = \mathbb{E}[e^{tX}]$, useful for tail bounds (via Markov's inequality).

Definition 18 (Sub-Gaussian variable). A mean-zero random variable $X$ is sub-Gaussian with parameter $\sigma^2$ if its moment generating function is bounded by that of a Gaussian $Y \sim \mathcal{N}(0, \sigma^2)$. In other words,
$$M_X(t) \leq \exp\left(\frac{\sigma^2 t^2}{2}\right) = M_Y(t).$$
In both cases, $\mathbb{P}[X \geq \epsilon] \leq \exp\left(-\epsilon^2/2\sigma^2\right)$: double exponential decay in $\epsilon$. Exponential and Gamma variables are not sub-Gaussian; their tails are too fat and the decay is only exponential.

Definition 19 (Concentration and Mathematical Tools). Hoeffding's Inequality, McDiarmid's Inequality, Doob Martingales, Glivenko–Cantelli, Massart's Finite Lemma.

Bound 25 (Agnostic Finite Hypothesis Class). Let $L$ be the 0-1 loss. Then, with probability $1 - \delta$, the generalization error satisfies
$$R(\hat{h}_m) - R(h^*) \leq \sqrt{\frac{2(\log|\mathcal{H}| + \log(2/\delta))}{m}}.$$
Equivalently, the sample complexity needs to be
$$m \geq \frac{2(\log|\mathcal{H}| + \log(2/\delta))}{\epsilon^2} \;\Longrightarrow\; R(\hat{h}_m) - R(h^*) \leq \epsilon.$$
Note the important difference between the realizable (Bound 23) and agnostic (Bound 25) cases. In the first case, the excess risk decays as $O(1/m)$, while in the second the decay is much slower, $O(1/\sqrt{m})$.
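For a quick numerical sense of the gap, the following snippet evaluates the two sample-complexity expressions for illustrative values of $|\mathcal{H}|$, $\epsilon$, and $\delta$:

```python
import numpy as np

# Quick numeric comparison of the sample-complexity expressions in Bounds 23 and 25
# for a finite class (illustrative values; not results from the text).
H_size, eps, delta = 1000, 0.05, 0.01
m_realizable = (np.log(H_size) + np.log(1 / delta)) / eps          # O(1/eps)
m_agnostic = 2 * (np.log(H_size) + np.log(2 / delta)) / eps**2     # O(1/eps^2)
print(int(np.ceil(m_realizable)), int(np.ceil(m_agnostic)))
```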

What happens when $|\mathcal{H}| = \infty$? Union bounds no longer work; we need a complexity measure that reflects the size of the hypothesis class with respect to the loss function. Rademacher complexity, VC dimension, and covering numbers are examples of this.

Definition 20 (Rademacher Complexity). Let $\mathcal{F}$ be a class of functions $f : \mathcal{Z} \to \mathbb{R}$.


We define the Rademacher Complexity of $\mathcal{F}$ to be
$$R_m(\mathcal{F}) = \mathbb{E}\left[\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(Z_i)\right],$$
where $\sigma_1, \dots, \sigma_m$ are the Rademacher variables, iid uniform in $\{-1, +1\}$, and $Z_1, Z_2, \dots$ are drawn iid from $p^*$.

The intuition behind the Rademacher Complexity is that it measures to what extent the family of functions $\mathcal{F}$ is able to fit pure noise (represented by the random variables $\sigma_1, \dots, \sigma_m$). We would like $R_m(\mathcal{F})$ to go to zero as $m$ increases. Note that the expectation is over both the $Z_i$'s and the $\sigma_i$'s. Let us fix the data:

Definition 21 (Empirical Rademacher Complexity).
$$\hat{R}_m(\mathcal{F}) = \mathbb{E}\left[\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(Z_i) \,\Bigg|\, Z_1, \dots, Z_m\right].$$
Clearly, we have that $\mathbb{E}_{Z_1,\dots,Z_m}[\hat{R}_m(\mathcal{F})] = R_m(\mathcal{F})$.
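The empirical Rademacher complexity can be estimated by Monte Carlo; the sketch below does so for a small finite class of threshold classifiers on fixed synthetic data (illustrative only):

```python
import numpy as np

# Monte Carlo sketch of the empirical Rademacher complexity for a finite function
# class: threshold classifiers mapped to {-1, +1}, evaluated on fixed data points.
rng = np.random.default_rng(0)
Z = rng.normal(size=50)                                              # fixed data points
F = np.stack([np.where(Z >= c, 1.0, -1.0) for c in np.linspace(-2, 2, 21)])  # f(Z_i) per f

def empirical_rademacher(F, n_mc=5000, rng=rng):
    m = F.shape[1]
    sups = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=m)          # Rademacher variables
        sups.append(np.max(F @ sigma) / m)               # sup over f of (1/m) sum sigma_i f(Z_i)
    return float(np.mean(sups))

print(empirical_rademacher(F))                           # shrinks as the number of points grows
```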

Bound 26 (Agnostic Hypothesis Class, Rademacher). Let $\mathcal{A} = \{z \mapsto L(z, h) : h \in \mathcal{H}\}$ be the loss class. Then, with probability $1 - \delta$, the generalization error satisfies
$$R(\hat{h}_m) - R(h^*) \leq 4 R_m(\mathcal{A}) + \sqrt{\frac{2\log(2/\delta)}{m}}.$$

Interesting properties of Rademacher complexity:
1. $R_m(\text{convex hull of }\mathcal{F}) = R_m(\mathcal{F})$. This is useful as it allows us to reduce polytopes (infinite sets) to computations on finite sets (their vertices).
2. Lipschitz composition: $R_m(\phi \circ \mathcal{F}) \leq c_\phi R_m(\mathcal{F})$, where $c_\phi$ is the Lipschitz constant of $\phi$.

Lemma 27 (Massart's Finite Lemma). Fix $Z_1, \dots, Z_m$. Let $\mathcal{F}$ be a finite class of functions, and let $M$ be a constant satisfying
$$\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^m f(Z_i)^2 \leq M^2.$$
Then, we have that
$$\hat{R}_m(\mathcal{F}) \leq \sqrt{\frac{2M^2\log|\mathcal{F}|}{m}}.$$


Key Observation: The empirical Rademacher complexity depends on $m$ points. Suppose we had two function classes $\mathcal{F}$ and $\mathcal{F}'$, such that $\mathcal{F}$ is finite and $\mathcal{F}'$ is infinite. Further, assume both classes have the same image on those $m$ points. Then they have the same empirical Rademacher complexity. Formally, if
$$\{[f(Z_1), \dots, f(Z_m)] : f \in \mathcal{F}\} = \{[f'(Z_1), \dots, f'(Z_m)] : f' \in \mathcal{F}'\},$$
then $\hat{R}_m(\mathcal{F}) = \hat{R}_m(\mathcal{F}')$. In other words, with respect to empirical Rademacher complexity, all that matters is the behavior of a function class on the $m$ data points. When the loss takes a finite set of values, for example the 0-1 loss, the number of behaviors is finite. We now relate this to the VC dimension.

Definition 22 (Shattering Coefficient). For class $\mathcal{F}$ (also known as the growth function):
$$s(\mathcal{F}, m) = \max_{z_1,\dots,z_m\in\mathcal{X}} \left|\{[f(z_1), \dots, f(z_m)] : f \in \mathcal{F}\}\right|.$$

We can now apply Massart's Finite Lemma (27) to upper bound the Rademacher complexity of function classes of infinite size but finite shattering coefficient. If $\mathcal{F}$ is infinite but $s(\mathcal{F}, m)$ is finite, we can take $\mathcal{F}'$ to be the smallest subset of $\mathcal{F}$ with identical shattering coefficient, and apply Massart's Lemma to $\mathcal{F}'$. Note that in this case $|\mathcal{F}'| = s(\mathcal{F}, m)$. By taking expectations on both sides (the RHS is constant), it then follows that
$$R_m(\mathcal{F}) \leq \sqrt{\frac{2\log s(\mathcal{F}, m)}{m}}.$$
We need $s(\mathcal{F}, m)$ to grow sub-exponentially in $m$ so that $R_m(\mathcal{F}) \to 0$. For the 0-1 loss and binary classifiers, we can show that $s(\mathcal{A}, m) = s(\mathcal{H}, m)$.

Definition 23 (VC Dimension). The VC dimension of a family $\mathcal{H}$ of functions $f : \mathbb{R} \to \{-1, +1\}$ is defined as
$$VC(\mathcal{H}) = \sup\{m : s(\mathcal{H}, m) = 2^m\}.$$
A one-parameter family with infinite VC dimension is $\{f(x) = \mathbb{I}\{\sin(\theta x) \geq 0\} : \theta \in \mathbb{R}\}$. It is not the number of parameters that matters, but the dimension of the space.

Theorem 28 (VC dimension of finite-dimensional function classes). Let $\mathcal{F}$ be a function class containing functions $f : \mathcal{X} \to \mathbb{R}$. The dimension of $\mathcal{F}$ is given by the


number of elements in a basis of $\mathcal{F}$. Consider
$$\mathcal{H} = \{\{x : f(x) \geq 0\} : f \in \mathcal{F}\}.$$
Then, we have that
$$VC(\mathcal{H}) \leq \dim(\mathcal{F}).$$

Lemma 29 (Sauer's Lemma). Let $VC(\mathcal{H}) = d$. Then, for any $m$,
$$s(\mathcal{H}, m) \leq \sum_{i=0}^d \binom{m}{i} \leq \begin{cases} 2^m & \text{if } m \leq d, \\ \left(\frac{em}{d}\right)^d & \text{if } m > d. \end{cases}$$

It follows that for $m > d = VC(\mathcal{H})$,
$$R_m(\mathcal{F}) \leq \sqrt{\frac{2\log s(\mathcal{F}, m)}{m}} \leq \sqrt{\frac{2d(\log(m/d) + 1)}{m}} = \sqrt{\frac{2(\log(m/d) + 1)}{m/d}}.$$
Hence, by Bound 26, we conclude that
$$R(\hat{h}_m) - R(h^*) \leq 4 R_m(\mathcal{A}) + \sqrt{\frac{2\log(2/\delta)}{m}} \leq \sqrt{\frac{32(\log(m/d) + 1)}{m/d}} + \sqrt{\frac{2\log(2/\delta)}{m}}.$$

Norm-constrained Hypothesis Classes:

Theorem 30 (Rademacher complexity for linear functions, $L_2$ constraints). Let $\mathcal{F} = \{z \mapsto w \cdot z : \|w\|_2 \leq B_2\}$. Assume $\mathbb{E}_{Z\sim p^*}[\|Z\|_2^2] \leq C_2^2$. Then,
$$R_m(\mathcal{F}) \leq \frac{B_2 C_2}{\sqrt{m}}.$$

Theorem 31 (Rademacher complexity for linear functions, $L_1$ constraints). Assume the coordinates are bounded with probability 1 for all data points $i$: $\|Z_i\|_\infty \leq C_\infty$. Let $\mathcal{F} = \{z \mapsto w \cdot z : \|w\|_1 \leq B_1\}$. Then,
$$R_m(\mathcal{F}) \leq \frac{B_1 C_\infty \sqrt{2\log(2d)}}{\sqrt{m}}.$$

Idea: suppose we believe most features are irrelevant and the true $w$ has $s < d$ non-zero entries. Further, assume that $\|w\|_\infty \leq 1$ and $\|z\|_\infty \leq 1$. So we consider the hypothesis class $\|w\|_1 \leq B_1 = s$. The Rademacher complexity is bounded by $O(s\sqrt{\log d / m})$, and $s$ controls the complexity instead


of d which can be exponentially larger.

So far, we have mainly considered classification problems. In other words, due to its combinatorial nature, the shattering coefficient works with functions whose image includes a finite number of values. How can we handle hypothesis classes for regression? With covering numbers.

Definition 24 (Metric). A metric $\rho : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ must be non-negative, symmetric, satisfy the triangle inequality, and evaluate to 0 if and only if its arguments are equal.

Definition 25 (Pseudo-metric). Similar to a metric, but we may have $\rho(f, f') = 0$ for $f \neq f'$.

Definition 26 ($L_2(P_m)$ pseudo-metric). It is the $L_2$ distance with respect to the empirical distribution over $m$ data points. Given $z_1, \dots, z_m$ and $f, f' \in \mathcal{F}$, it is defined as
$$\rho(f, f') = \left(\frac{1}{m}\sum_{i=1}^m\left(f(z_i) - f'(z_i)\right)^2\right)^{1/2}.$$
We can see this as the Euclidean distance on $m$-dimensional vectors (which represent the functions) scaled down by a factor $\sqrt{m}$.

Definition 27 (Ball). Let $(\mathcal{F}, \rho)$ be a metric space. The ball with radius $\epsilon > 0$ centered at $f \in \mathcal{F}$ is
$$B_\epsilon(f) = \{f' \in \mathcal{F} : \rho(f, f') \leq \epsilon\}.$$

Definition 28 ($\epsilon$-cover). An $\epsilon$-cover of a set $\mathcal{F}$ with respect to metric $\rho$ is a finite subset $C = \{f_1, \dots, f_n\} \subseteq \mathcal{F}$ such that every $f \in \mathcal{F}$ belongs to $B_\epsilon(f_i)$ for some $f_i \in C$.

Definition 29 ($\epsilon$-covering number). The $\epsilon$-covering number of a set $\mathcal{F}$ with respect to metric $\rho$ is the size of the smallest $\epsilon$-cover:
$$N(\epsilon, \mathcal{F}, \rho) = \min\{m : \exists\, \{f_1, \dots, f_m\} \subseteq \mathcal{F},\; \mathcal{F} \subseteq \cup_{j=1}^m B_\epsilon(f_j)\}.$$

Definition 30 (Metric Entropy). The metric entropy of a set $\mathcal{F}$ with respect to metric $\rho$ is $\log N(\epsilon, \mathcal{F}, \rho)$.


A.1 Main Generalization Bounds

I summarize here a few theorems that provide the best generalization bounds based on the VC dimension for classification problems and on Rademacher complexity. I follow a few books and papers for this.

Theorem 32 (The Fundamental Theorem of Statistical Learning - I). Let H be a hypothesis class

of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Then, the following

are equivalent:

1. H has the uniform convergence property.

2. Any Empirical Risk Minimization rule is a successful PAC learner for H.

3. Any ERM rule is a successful agnostic PAC learner for H.

4. H is agnostic PAC learnable.

5. H is PAC learnable.

6. H has a finite VC-dimension.

Theorem 33 (The Fundamental Theorem of Statistical Learning - II). Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0, 1\}$ and let the loss function be the 0-1 loss. Assume that $VC\dim(\mathcal{H}) = d < \infty$.

Then, there are absolute constants $C_1, C_2$ such that $\mathcal{H}$ is (realizable) PAC learnable with sample complexity
$$C_1\,\frac{d + \log(1/\delta)}{\epsilon} \leq m(\epsilon, \delta, d) \leq C_2\,\frac{d\log(1/\epsilon) + \log(1/\delta)}{\epsilon}.$$
Also, there are absolute constants $C_1, C_2$ such that $\mathcal{H}$ is agnostic PAC learnable with sample complexity
$$C_1\,\frac{d + \log(1/\delta)}{\epsilon^2} \leq m(\epsilon, \delta, d) \leq C_2\,\frac{d + \log(1/\delta)}{\epsilon^2}.$$


Appendix B

Proofs Chapter 2

B.1 Whitening

Before thresholding the norm of incoming observations, it is useful to decorrelate and standardize their components, i.e., to whiten the data. Then, we apply the algorithm to uncorrelated covariates with zero mean and unit variance (not necessarily independent). The covariance matrix $\Sigma$ can be decomposed as $\Sigma = UDU^T$, where $U$ is orthogonal and $D$ is diagonal with $d_{ii} = \lambda_i(\Sigma)$. We whiten each observation to $\bar{X} = D^{-1/2}U^TX \in \mathbb{R}^{d\times 1}$ (while for $\mathbf{X} \in \mathbb{R}^{k\times d}$, $\bar{\mathbf{X}} = \mathbf{X}UD^{-1/2}$), so that $\mathbb{E}\,\bar{X}\bar{X}^T = I_d$. We denote whitened observations by $\bar{X}$ and $\bar{\mathbf{X}}$ in the appendix. After some algebra we see that
$$\frac{d}{\lambda_{\max}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \leq \mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}) \leq \frac{d}{\lambda_{\min}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})}. \qquad (B.1)$$
We focus on algorithms that maximize the minimum eigenvalue of $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ with high probability, or, in general, that lead to large and even eigenvalues of $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$.
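A minimal sketch of the whitening map just described (assuming $\Sigma$ is known), checking that the whitened sample covariance is close to the identity:

```python
import numpy as np

# Sketch of the whitening step: decompose Sigma = U D U^T and map X -> X U D^{-1/2}
# so the whitened covariates have identity covariance.

def whiten(X, Sigma):
    """X: (k, d) raw observations, Sigma: (d, d) covariance. Returns X U D^{-1/2}."""
    eigvals, U = np.linalg.eigh(Sigma)           # Sigma = U diag(eigvals) U^T
    return X @ U @ np.diag(1.0 / np.sqrt(eigvals))

# Example: whitened sample covariance is close to the identity.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=5000)
Xw = whiten(X, Sigma)
print(np.round(Xw.T @ Xw / len(Xw), 2))          # approximately I_2
```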

B.2 Proof of Theorem 1

Theorem. Let $n > k > d$. Assume observations $X \in \mathbb{R}^d$ are distributed according to a subgaussian $D$ with covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$. Also, assume the marginal densities are symmetric around zero after whitening. Let $\mathbf{X}$ be a $k\times d$ matrix with $k$ observations sampled from the distribution induced by the thresholding rule with parameters $(\xi, \Gamma) \in \mathbb{R}^{d+1}_+$ satisfying (2.5). Let $\alpha > 0$, so that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$. Then, with probability at least $1 - 2\exp(-ct^2)$,
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \leq \frac{d}{(1-\alpha)^2\,\phi\, k}, \qquad (B.2)$$
where the constants $c, C$ depend on the subgaussian norm of $D$.


Proof. We would like to choose $k$ out of $n$ observations $X_1, \dots, X_n \sim F$ iid. Assume our sampling induces a new distribution $\bar{F}$. The loss we want to minimize for our OLS estimate $\hat{\beta} = \hat{\beta}(\mathbf{X}, \mathbf{Y})$ is
$$\mathbb{E}_{\mathbf{X},\mathbf{Y}\sim\bar{F},\,X\sim F}\left[\left(X^T\hat{\beta} - X^T\beta\right)^2\right] = \sigma^2\,\mathbb{E}_{\mathbf{X},\mathbf{Y}\sim\bar{F}}\left[\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right)\right], \qquad (B.3)$$
where we assumed Gaussian noise with variance $\sigma^2$.

Let us see how we construct $\bar{F}$. We sample $X \sim F$, we whiten the observation $Z = \Sigma^{-1/2}X$, and then we select it or not according to a fixed thresholding rule. If $\|Z\|_\xi \geq \Gamma$, then we keep $X = \Sigma^{1/2}Z$.

We choose $\xi$ and $\Gamma$ so that there exists $\phi > 0$ such that, for all $i = 1, \dots, d$,
$$\mathbb{E}_{W\sim\bar{F}}[W_{(i)}^2] = \phi, \qquad (B.4)$$
where $W_{(i)}$ denotes the $i$-th component of $W \sim \bar{F}$. Note that $\bar{F} = \bar{F}(\xi, \Gamma)$. $Z$ is a linear transformation of $X$; $W$ is not a linear transformation of $Z$. Moreover, the covariance matrix of $\bar{F}$ is $\Sigma_{\bar{F}} = \phi\, I_d$. If $F$ were a general subgaussian distribution, thresholding could move the mean away from zero.

Assume that, after running our algorithm, we end up with $\mathbf{X} \in \mathbb{R}^{k\times d}$. We denote by $\mathbf{W}$ the observations after whitening; note that by design every $w \in \mathbf{W}$ passed our test: $\|w\|_\xi \geq \Gamma$. In other words, $w \sim \bar{F}$. We see that $\mathbf{W} = \mathbf{X}\Sigma^{-1/2}$ or, alternatively, $\mathbf{X} = \mathbf{W}\Sigma^{1/2}$.

Now, we can derive
$$\begin{aligned}
\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right) &= \mathrm{Tr}\left(\Sigma\left(\Sigma^{1/2}\mathbf{W}^T\mathbf{W}\Sigma^{1/2}\right)^{-1}\right) & (B.5)\\
&= \mathrm{Tr}\left(\left(\Sigma^{-1/2}\Sigma^{1/2}\mathbf{W}^T\mathbf{W}\Sigma^{1/2}\Sigma^{-1/2}\right)^{-1}\right) & (B.6)\\
&= \mathrm{Tr}\left(\left(\mathbf{W}^T\mathbf{W}\right)^{-1}\right) & (B.7)\\
&= \mathrm{Tr}\left(\left(\Sigma_{\bar{F}}^{1/2}\,\bar{\mathbf{W}}^T\bar{\mathbf{W}}\,\Sigma_{\bar{F}}^{1/2}\right)^{-1}\right) & (B.8)\\
&= \mathrm{Tr}\left(\Sigma_{\bar{F}}^{-1}\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) & (B.9)\\
&= \mathrm{Tr}\left(\frac{1}{\phi} I_d \left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) & (B.10)\\
&= \frac{1}{\phi}\,\mathrm{Tr}\left(\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) & (B.11)\\
&\leq \frac{d}{\phi\,\lambda_{\min}\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)} = \frac{d}{\phi\, k}\cdot\frac{1}{\lambda_{\min}\left(\tfrac{1}{k}\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)}, & (B.12)
\end{aligned}$$
where $\bar{\mathbf{W}} = \mathbf{W}\Sigma_{\bar{F}}^{-1/2}$ is actually white data under $\bar{F}$. Thus, note that $\bar{\mathbf{W}}^T\bar{\mathbf{W}}/k \to I_d$ as $k \to \infty$.


Assume that $F$ is subgaussian such that, if $k > d$, then $\mathbf{X}^T\mathbf{X}$ has full rank with probability one. Thresholding does not change the shape of the tails of the distribution, so $\bar{F}$ will also be subgaussian. At this point, we need to measure how fast $\lambda_{\min}(\bar{\mathbf{W}}^T\bar{\mathbf{W}})/k$ goes to 1. We can use Theorem 5.39 in [91], which guarantees that, for $\alpha > 0$ such that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$, with probability at least $1 - 2\exp(-ct^2)$ we have
$$\lambda_{\min}\left(\frac{1}{k}\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right) \geq (1-\alpha)^2, \qquad (B.13)$$
as $\bar{\mathbf{W}}$ is white subgaussian. It follows that for $\alpha > C\sqrt{d}/\sqrt{k}$, with probability at least $1 - 2\exp(-ct^2)$,
$$\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right) \leq \frac{d}{(1-\alpha)^2\,\phi\, k}. \qquad (B.14)$$
Note that $1/(1-\alpha) \approx 1 + O(\sqrt{d/k})$.

B.3 Proof of $\mathrm{Tr}(X^{-1}) \geq \mathrm{Tr}(\mathrm{Diag}(X)^{-1})$

In order to justify that we want $S = \bar{\mathbf{X}}^T\bar{\mathbf{X}}$ to be as close as possible to diagonal, we show the following lemma. Under our assumptions, $S$ is symmetric positive definite with probability 1.

Lemma 34. Let $X$ be an $n\times n$ symmetric positive definite matrix. Then,
$$\mathrm{Tr}(X^{-1}) \geq \mathrm{Tr}(\mathrm{Diag}(X)^{-1}), \qquad (B.15)$$
where $\mathrm{Diag}(\cdot)$ returns a diagonal matrix with the same diagonal as its argument.

In other words, we show that among all positive definite matrices with the same diagonal elements, the diagonal matrix (the matrix with all off-diagonal elements equal to 0) has the least trace after inversion.

Proof. We proceed by induction. Consider a $2\times 2$ matrix
$$X = \begin{bmatrix} a & b \\ b & c \end{bmatrix} \qquad (B.16)$$
and
$$\mathrm{Tr}(X^{-1}) = \frac{1}{ac - b^2}(a + c). \qquad (B.17)$$
Since $ac - b^2 > 0$ ($X$ is positive definite), the above expression is minimized when $b^2 = 0$, that is, when $X$ is diagonal.


Assume the statement is true for all $n\times n$ matrices. Let $X$ be an $(n+1)\times(n+1)$ positive definite matrix. Decompose it as
$$X = \begin{bmatrix} A & b \\ b^T & c \end{bmatrix}. \qquad (B.18)$$
By the block inverse formula (see for example [69]),
$$\mathrm{Tr}(X^{-1}) = \mathrm{Tr}(A^{-1}) + \frac{1}{k} + \frac{1}{k}\mathrm{Tr}(A^{-1}bb^TA^{-1}), \qquad (B.19)$$
where $k = c - b^TA^{-1}b$. Note that $k > 0$ by Schur's complement for positive definite matrices. Using the induction hypothesis, $\mathrm{Tr}(A^{-1}) \geq \mathrm{Tr}(\mathrm{Diag}(A)^{-1})$. By the positive definiteness of $A$, $b^TA^{-1}b \geq 0$, therefore $\frac{1}{k} \geq \frac{1}{c}$. Also, $\mathrm{Tr}(A^{-1}bb^TA^{-1}) \geq 0$. Thus,
$$\mathrm{Tr}(X^{-1}) \geq \mathrm{Tr}(A^{-1}) + \frac{1}{c} \geq \mathrm{Tr}(\mathrm{Diag}(A)^{-1}) + \frac{1}{c} = \mathrm{Tr}(\mathrm{Diag}(X)^{-1}), \qquad (B.20)$$
and the result follows.

B.4 Proof of Corollary 2

Corollary. If the observations in Theorem 1 are jointly Gaussian with covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$, $\xi_j = 1$ for all $j = 1, \dots, d$, and $\Gamma = C\sqrt{d + 2\log(n/k)}$ for some constant $C \geq 1$, then with probability at least $1 - 2\exp(-ct^2)$ we have that
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \leq \frac{d}{(1-\alpha)^2\left(1 + \frac{2\log(n/k)}{d}\right) k}. \qquad (B.21)$$

Proof. We have to show that $\xi_j = 1$ for all $j$ and $\Gamma = C\sqrt{d + 2\log(n/k)}$ satisfy the equations
$$\mathbb{P}_D\left(\|X\|_\xi \geq \Gamma\right) = \alpha = \frac{k}{n}, \qquad (B.22)$$
$$\mathbb{E}_D[X_j^2 \mid \|X\|_\xi^2 \geq \Gamma^2] = \phi, \quad \text{for all } j, \qquad (B.23)$$
and $\phi \geq 1 + 2\log(n/k)/d$. The components of $X$ are independent, as the observations are jointly


Gaussian. It immediately follows that $\xi_j = 1$ for all $1 \leq j \leq d$. Thus,
$$Z_\xi = \sum_{j=1}^d X_j^2 \sim \chi^2_d, \qquad \Gamma^2 = F^{-1}_{\chi^2_d}\left(1 - \frac{k}{n}\right). \qquad (B.24)$$
The value of $Z_\xi$ is strongly concentrated around its mean, $\mathbb{E} Z_\xi = d$. We now use two tail approximations to obtain our desired result.

By [58], we have that
$$\mathbb{P}\left(Z_\xi - d \geq 2\sqrt{dx} + 2x\right) \leq \exp(-x). \qquad (B.25)$$
If we take $\exp(-x) = \alpha$, then $x = \log(n/k)$. In this case, we conclude that
$$\mathbb{P}\left(Z_\xi \geq d + 2\log\left(\frac{n}{k}\right) + 2\sqrt{d\log\left(\frac{n}{k}\right)}\right) \leq \alpha = \frac{k}{n}. \qquad (B.26)$$
Note that $\mathbb{P}(\|X\|_\xi > \Gamma) = \mathbb{P}(Z_\xi > \Gamma^2) = \alpha$. Therefore, by definition,
$$\Gamma \leq \sqrt{d + 2\log\left(\frac{n}{k}\right) + 2\sqrt{d\log\left(\frac{n}{k}\right)}}. \qquad (B.27)$$

On the other hand, we would like to show that
$$\mathbb{P}\left(Z_\xi \geq d + 2\log\left(\frac{n}{k}\right)\right) \geq \alpha, \qquad (B.28)$$
as that would directly imply that $\Gamma \geq \sqrt{d + 2\log(n/k)}$.

We can use Proposition 3.1 of [45]. For $d > 2$ and $x > d - 2$,
$$\mathbb{P}(Z_\xi \geq x) \geq \frac{1 - e^{-2}}{2}\,\frac{x}{x - d + 2\sqrt{d}}\,\exp\left(-\frac{1}{2}\left(x - d - (d-2)\log\left(\frac{x}{d}\right) + \log d\right)\right).$$
Take $x = d + 2\kappa$, where $\kappa = \log(n/k)$. It follows that
$$\begin{aligned}
\mathbb{P}(Z_\xi \geq d + 2\kappa) &\geq \frac{1 - e^{-2}}{2}\,\frac{d + 2\kappa}{2\sqrt{d} + 2\kappa}\,\exp\left(-\frac{1}{2}\left(2\kappa - (d-2)\log\left(1 + \frac{2\kappa}{d}\right) + \log d\right)\right) \\
&= \frac{1 - e^{-2}}{2}\,\frac{d + 2\kappa}{2\sqrt{d} + 2\kappa}\,\exp\left(\frac{d-2}{2}\log\left(1 + \frac{2\kappa}{d}\right) - \frac{1}{2}\log d\right)\exp\{-\kappa\} \\
&\geq \exp\{-\kappa\}, \qquad (B.29)
\end{aligned}$$
where we assumed, for example, $d \geq 9$ and $n/k > 17$ (as in Proposition 5.1 of [45]). In any case, in those rare cases (in our context) where $d < 9$ and $n/k$ is very small, the previous bound still holds if we subtract a small constant $C \in [0, 5/2]$ from the LHS: $\mathbb{P}(Z_\xi \geq d + 2\kappa - C)$.


Equivalently, from (B.29),
$$\mathbb{P}(Z_\xi \geq d + 2\log(n/k)) \geq k/n = \alpha. \qquad (B.30)$$
We conclude that
$$\sqrt{d + 2\log\left(\frac{n}{k}\right)} \leq \Gamma \leq \sqrt{d + 2\log\left(\frac{n}{k}\right) + 2\sqrt{d\log\left(\frac{n}{k}\right)}}. \qquad (B.31)$$
Finally, we have that
$$\phi \geq \frac{\Gamma^2}{d} \geq 1 + \frac{2\log(n/k)}{d}. \qquad (B.32)$$
By Theorem 1, the corollary follows.

B.5 CLT Approximation

As we explain in the main text, it is sometimes difficult to directly compute the distribution of the $\xi$-norm of a white observation, given by $Z_\xi$. Recall that $\Gamma^2 = F^{-1}_{Z_\xi}(1 - k/n)$. Fortunately, $Z_\xi$ is the sum of $d$ random variables, and, in high-dimensional spaces, a CLT approximation can help us choose a good threshold. In this section we derive some theoretical guarantees.

The CLT is a good idea for bounded variables (as the square is still bounded, and therefore subgaussian), but if the underlying components $X_j$ are unbounded subgaussian, $Z_\xi$ will be at least subexponential —as the square of a subgaussian random variable is subexponential, [91]— and a higher threshold —like that coming from the chi-squared distribution— is more appropriate. In addition, in the context of heavy tails, catastrophic effects are expected, as $\mathbb{P}(\max_j X_j > t) \sim \mathbb{P}(\sum_j X_j > t)$, leading to observations dominated by single dimensions.

Assume the components $X_j$ are independent (while not necessarily identically distributed). By Lyapunov's CLT, one can show that¹
$$Z_\xi = \sum_{j=1}^d \xi_j X_j^2 \approx \mathcal{N}\left(d,\; \sum_{j=1}^d \xi_j^2\left(\mathbb{E}[X_j^4] - 1\right)\right).$$
¹Some mild additional moment/regularity conditions on each $X_j$ are required to satisfy Lyapunov's condition.


It follows that $\Gamma$ satisfies $\mathbb{P}_D(\|X_i\|_\xi \geq \Gamma) = k/n$ if
$$\Gamma^2 \approx d + \Phi^{-1}\left(1 - \frac{k}{n}\right)\sqrt{\sum_{i=1}^d \xi_i^2\left(\mathbb{E}[X_i^4] - 1\right)}.$$
In the sequel, assume $d$ is large enough that the approximation error is negligible. Define $\nu := \sqrt{\sum_{i=1}^d \xi_i^2\left(\mathbb{E}[X_i^4] - 1\right)}$.

Corollary 35. Assume $Z_\xi = \mathcal{N}(d, \nu^2)$ and $\Gamma^2 = d + \nu\,\Phi^{-1}(1 - k/n)$, with $\xi_j$ satisfying (2.5). Let $\alpha > 0$, so that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$. Then, with probability at least $1 - 2\exp(-ct^2)$, we have that
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \leq \frac{d}{(1-\alpha)^2\, k\left(1 + \frac{\nu\sqrt{2\log(n/k)}}{d} - O\left(\frac{\nu\log\log(n/k)}{d\sqrt{\log(n/k)}}\right)\right)}. \qquad (B.33)$$

Proof. Note that, by definition, $\|X\|_\xi^2 \sim Z_\xi$ and $\Gamma$ jointly solve the equations required by Theorem B.2. In order to apply the theorem, all we need to do is estimate the magnitude of
$$\phi = \mathbb{E}_D[X_j^2 \mid \|X\|_\xi^2 \geq \Gamma^2] \geq \frac{\Gamma^2}{d} = 1 + \frac{\nu}{d}\,\Phi^{-1}(1 - k/n). \qquad (B.34)$$
Therefore, we want bounds on the tail probabilities of the normal distribution. By Theorem 2.1 of [45], we have that for small $k/n$,
$$\sqrt{2\log(n/k)} - \frac{\log(4\log(n/k)) + 2}{2\sqrt{2\log(n/k)}} \leq \Phi^{-1}\left(1 - \frac{k}{n}\right) \qquad (B.35)$$
$$\leq \sqrt{2\log(n/k)} - \frac{\log(2\log(n/k)) + 3/2}{2\sqrt{2\log(n/k)}}, \qquad (B.36)$$
and the result follows.

We can also apply the previous result to independent uniform distributions centered around zero. In that case, the fourth moment is $\mathbb{E}[X_j^4] = 9/5$, so $\nu = \sqrt{\frac{4}{5}d}$, leading to a gain factor
$$\phi = \left(1 + \sqrt{\frac{8\log(n/k)}{5d}} - o\left(\frac{\log\log(n/k)}{\sqrt{d\log(n/k)}}\right)\right).$$
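The following sketch compares the exact chi-squared threshold from (B.24) with the CLT approximation above, for standard Gaussian components where $\xi_j = 1$ and $\mathbb{E}[X_j^4] = 3$ (illustrative parameters):

```python
import numpy as np
from scipy import stats

# Compare the exact chi-squared threshold (B.24) with the CLT-based approximation
# for standard Gaussian components (xi_j = 1, E[X_j^4] = 3). Illustrative values.
d, n, k = 50, 10000, 500
alpha = k / n

gamma2_chi2 = stats.chi2.ppf(1 - alpha, df=d)            # Gamma^2 = F^{-1}_{chi2_d}(1 - k/n)
nu = np.sqrt(d * (3 - 1))                                # sqrt(sum (E[X^4] - 1)) for Gaussians
gamma2_clt = d + stats.norm.ppf(1 - alpha) * nu          # CLT approximation

print("chi2 threshold^2:", round(gamma2_chi2, 2))
print("CLT  threshold^2:", round(gamma2_clt, 2))
```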

B.6 Proof of Theorem 3

Recall the Sparse Thresholding Algorithm below. We show the following theorem.

Theorem. Let $D = \mathcal{N}(0, \Sigma)$. Assume $\Sigma$, $\beta$, and $\min_i|\beta_i|$ satisfy the standard conditions given in Theorem 3 of [94]. Assume we run the Sparse Thresholding algorithm with $k_1 = C's\log d$ observations to recover the support of $\beta$, for an appropriate $C' \geq 0$. Let $\mathbf{X}_2$ be the $k_2 = k - k_1$ observations


sampled via thresholding on $S(\hat{\beta}_1)$. It follows that for $\alpha > 0$ such that $t = \alpha\sqrt{k_2} - C\sqrt{s} > 0$, there exist some universal constants $c_1, c_2$, and $c, C$ that depend on the subgaussian norm of $D \mid S(\hat{\beta}_1)$, such that with probability at least
$$1 - 2e^{-\min\left(c_2\min(s,\,\log(d-s)) - \log(c_1),\; ct^2 - \log 2\right)}$$
it holds that
$$\mathrm{Tr}\left(\Sigma_{SS}(\mathbf{X}_2^T\mathbf{X}_2)^{-1}\right) \leq \frac{s}{(1-\alpha)^2\left(1 + \frac{2\log(n_2/k_2)}{s}\right) k_2}.$$

Algorithm 6 Sparse Thresholding Algorithm
1: Set $S_1 = \emptyset$, $S_2 = \emptyset$. Let $k = k_1 + k_2$, $n = k_1 + n_2$.
2: for observation $1 \leq i \leq k_1$ do
3:   Observe $X_i$. Choose $X_i$: $S_1 = S_1 \cup X_i$.
4: end for
5: Set $\gamma = 1/2$, $\lambda = \sqrt{4\sigma^2\log(d)/\gamma^2 k_1}$.
6: Compute the Lasso estimate $\hat{\beta}_1$ based on $S_1$, with regularization $\lambda$.
7: Set weights: $\xi_i = 1$ if $i \in S(\hat{\beta}_1)$, $\xi_i = 0$ otherwise.
8: Set $\Gamma = C\sqrt{s + 2\log(n_2/k_2)}$. Factorize $\Sigma_{S(\hat{\beta}_1)S(\hat{\beta}_1)} = UDU^T$.
9: for observation $k_1 + 1 \leq i \leq n$ do
10:   Observe $X^i \in \mathbb{R}^d$. Restrict to $X^i_S := X^i_{S(\hat{\beta}_1)} \in \mathbb{R}^s$.
11:   Compute $\bar{X}^i_S = D^{-1/2}U^T X^i_S$.
12:   if $\|\bar{X}^i_S\|_\xi > \Gamma$ or $k_2 - |S_2| = n - i + 1$ then
13:     Choose $\bar{X}^i_S$: $S_2 = S_2 \cup \bar{X}^i_S$.
14:     if $|S_2| = k_2$ then
15:       break.
16:     end if
17:   end if
18: end for
19: Return the OLS estimate $\hat{\beta}_2$ with the observations in $S_2$.
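A minimal sketch of the two-stage procedure (illustrative, with simplified constants; the online thresholding of step 12 is replaced here by an offline top-$k_2$ selection on the recovered support):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative two-stage sketch: stage 1 recovers the support of beta with the Lasso;
# stage 2 keeps large-norm observations restricted to that support and runs OLS.
rng = np.random.default_rng(0)
d, s, sigma = 100, 5, 1.0
beta = np.zeros(d); beta[:s] = 3.0

def observe(m):
    X = rng.normal(size=(m, d))
    return X, X @ beta + sigma * rng.normal(size=m)

# Stage 1: support recovery from k1 observations (simplified regularization choice).
k1 = 200
X1, Y1 = observe(k1)
lam = np.sqrt(4 * sigma**2 * np.log(d) / k1)
support = np.flatnonzero(Lasso(alpha=lam).fit(X1, Y1).coef_ != 0)

# Stage 2: select the k2 observations with the largest norm on the recovered support
# (offline stand-in for the online threshold Gamma of step 8), then OLS on them.
n2, k2 = 2000, 100
X2, Y2 = observe(n2)
scores = np.linalg.norm(X2[:, support], axis=1)
chosen = np.argsort(scores)[-k2:]
beta_hat = np.linalg.lstsq(X2[np.ix_(chosen, support)], Y2[chosen], rcond=None)[0]
print("recovered support:", support)
print("max abs error on recovered support:", round(np.abs(beta_hat - beta[support]).max(), 3))
```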

For support recovery, we use Theorem 3 from [94]:

Theorem 36. Consider the linear model with random Gaussian design
$$\mathbf{Y} = \mathbf{X}\beta^* + \epsilon, \quad \text{with } k \text{ i.i.d. rows } x_i \sim \mathcal{N}(0, \Sigma) \in \mathbb{R}^d, \qquad (B.37)$$
and noise $\epsilon \sim \mathcal{N}(0, \sigma^2\,\mathrm{Id}_{k\times k})$. Assume the covariance matrix $\Sigma$ satisfies
$$\|\Sigma_{S^C S}(\Sigma_{SS})^{-1}\|_\infty \leq (1 - \gamma), \quad \text{for some } \gamma \in (0, 1], \qquad (B.38)$$
$$\lambda_{\min}(\Sigma_{SS}) \geq C_{\min} > 0. \qquad (B.39)$$


Let $|S| = s$. Consider the family of regularization parameters, for $\phi_d \geq 2$,
$$\lambda_k(\phi_d) = \sqrt{\frac{\phi_d\,\rho_u(\Sigma_{S^C S})}{\gamma^2}\,\frac{2\sigma^2\log(d)}{k}}. \qquad (B.40)$$
If for some fixed $\delta > 0$, the sequence $(k, d, s)$ and regularization sequence $\{\lambda_k\}$ satisfy
$$\frac{k}{2s\log(d-s)} > (1 + \delta)\,\theta_u(\Sigma)\left(1 + \frac{\sigma^2 C_{\min}}{\lambda_k^2 s}\right), \qquad (B.41)$$
then the following holds with probability at least $1 - c_1\exp(-c_2\min\{s, \log(d-s)\})$:
1. The Lasso has a unique solution $\hat{\beta}$ with support contained in $S$ (i.e., $S(\hat{\beta}) \subseteq S(\beta^*)$).
2. Define the gap
$$g(\lambda_k) := c_3\lambda_k\|\Sigma_{SS}^{-1/2}\|_\infty^2 + 20\sqrt{\frac{\sigma^2\log s}{C_{\min}\, k}}. \qquad (B.42)$$
Then, if $\beta_{\min} := \min_{i\in S}|\beta^*_i| > g(\lambda_k)$, the signed support $S_\pm(\hat{\beta})$ is identical to $S_\pm(\beta^*)$, and moreover $\|\hat{\beta}_S - \beta^*_S\|_\infty \leq g(\lambda_k)$.

The required definitions to apply the previous theorem are
$$\rho_l(\Sigma) = \frac{1}{2}\min_{i\neq j}\left(\Sigma_{ii} + \Sigma_{jj} - 2\Sigma_{ij}\right), \qquad \rho_u(\Sigma) = \max_i \Sigma_{ii}, \qquad (B.43)$$
$$\theta_l(\Sigma) = \frac{\rho_l(\Sigma_{S^C\mid S})}{C_{\max}(2 - \gamma(\Sigma))^2}, \qquad \theta_u(\Sigma) = \frac{\rho_u(\Sigma_{S^C\mid S})}{C_{\min}\,\gamma^2(\Sigma)}. \qquad (B.44)$$

Proof (Theorem 3). Let $X \sim \mathcal{N}(0, \Sigma)$ with $\Sigma$ satisfying (B.38) and (B.39). Let $\lambda_k(\phi_d)$ be as in (B.40), for some $\phi_d > 2$. Assume we choose the number of observations $k_1$ in the first stage to be at least
$$k_1 \geq 2(1+\delta)\,\theta_u(\Sigma)\left(1 + \frac{\sigma^2 C_{\min}}{\lambda_k^2 s}\right) s\log(d-s) \qquad (B.45)$$
$$= C(\Sigma, d, s)\, s\log(d-s), \qquad (B.46)$$
and that $\beta_{\min}$ is greater than (B.42). Then, with probability at least
$$1 - c_1\exp(-c_2\min\{s, \log(d-s)\}),$$
we recover the correct support, $S(\beta^*) = S(\hat{\beta})$, in the first stage of the algorithm.


Conditional on this event, we apply our algorithm to the remaining observations. In the second stage, we only look at the dimensions in $S(\hat{\beta})$, by setting weights $\xi_{S(\hat{\beta})} = 1$ and zero otherwise. Finally, we run OLS along the dimensions in the recovered support, using the observations collected during the second stage. Importantly, note that the new observations are $\mathcal{N}(0, \Sigma_{SS})$.

We can now apply our original results. Denote by $\mathbf{X}_2 \in \mathbb{R}^{k_2\times s}$ the set of observations collected in the second stage of the algorithm. In particular, by Corollary 2, we conclude that for $\alpha > 0$ such that $t = \alpha\sqrt{k_2} - C\sqrt{s} > 0$, the following holds with probability at least $1 - 2\exp(-ct^2)$:
$$\mathrm{Tr}\left(\Sigma_{SS}(\mathbf{X}_2^T\mathbf{X}_2)^{-1}\right) \leq \frac{s}{(1-\alpha)^2\left(1 + \frac{2\log(n_2/k_2)}{s}\right) k_2}. \qquad (B.47)$$
Under the event that the recovery is correct, the contribution to the MSE of the components of $\beta$ that are not in its support is zero. In other words,
$$\|\beta - \hat{\beta}_2\|_\Sigma^2 = (\beta - \hat{\beta}_2)^T\Sigma(\beta - \hat{\beta}_2) \qquad (B.48)$$
$$= (\beta_S - \hat{\beta}_{2S})^T\Sigma_{SS}(\beta_S - \hat{\beta}_{2S}) = \|\beta_S - \hat{\beta}_{2S}\|_{\Sigma_{SS}}^2. \qquad (B.49)$$
As the events that the first and second stages succeed are independent, we conclude that (B.47) holds with probability at least
$$1 - c_1 e^{-c_2\min\{s,\log(d-s)\}} - 2e^{-ct^2} \geq \qquad (B.50)$$
$$1 - 2e^{-\min\left(c_2\min(s,\log(d-s)) - \log(c_1),\; ct^2 - \log 2\right)}. \qquad (B.51)$$

B.7 Proof of Theorem 4

Theorem. Let $\mathcal{A}$ be an algorithm for the problem we described in Section 2.2. Then,
$$\mathbb{E}_\mathcal{A}\,\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \geq \frac{d^2}{\mathbb{E}\left[\sum_{i=1}^k \|\bar{X}_{(i)}\|^2\right]} \geq \frac{d}{k\,\mathbb{E}\left[\frac{1}{d}\max_{i\in[n]}\|\bar{X}_i\|^2\right]}, \qquad (B.52)$$
where $\bar{X}_{(i)}$ is the white observation with the $i$-th largest norm. Moreover, fix $\alpha \in (0, 1)$. Let $F$ be the cdf of $\max_{i\in[n]}\|\bar{X}_i\|^2$. Then, with probability at least $1 - \alpha$,
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \geq \frac{d^2}{k\,F^{-1}(1-\alpha)}. \qquad (B.53)$$


Proof. We want to minimize $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1})$. Let us define $S = \bar{\mathbf{X}}^T\bar{\mathbf{X}}$. One can prove that $H \mapsto \mathrm{Tr}(H^{-1})$ is convex for symmetric positive definite matrices $H$. It then follows by Jensen's inequality (assuming $k > d$, so $S$ is symmetric positive definite with high probability) that
$$\mathbb{E}\,\mathrm{Tr}(S^{-1}) \geq \mathrm{Tr}((\mathbb{E} S)^{-1}) = \sum_{j=1}^d \frac{1}{\lambda_j(\mathbb{E} S)}. \qquad (B.54)$$
Let $\mathbb{E} S$ be the expected value of $S$ for an arbitrary algorithm $\mathcal{A}$ that selects its observations sequentially. We want to understand the minimum possible value the RHS of (B.54) can take. The sum of eigenvalues is upper bounded by
$$\sum_{j=1}^d \lambda_j(\mathbb{E} S) = \mathrm{Tr}(\mathbb{E} S) = \sum_{j=1}^d \mathbb{E}(S_{jj}) = \sum_{j=1}^d\sum_{i=1}^k \mathbb{E}[\bar{X}_{ij}^2] = \sum_{i=1}^k \mathbb{E}[\|\bar{X}_i\|^2] \leq \mathbb{E}\left[\sum_{i=1}^k \|\bar{X}_{(i)}\|^2\right] \leq k\,\mathbb{E}\left[\max_{i\in[n]}\|\bar{X}_i\|^2\right],$$

where $\bar{X}_{(i)}$ denotes the observation with the $i$-th largest norm. Because $\mathbb{E} S$ is symmetric positive definite, its eigenvalues are real and non-negative, so that
$$0 < \lambda_{\min}(\mathbb{E} S) \leq \frac{\mathrm{Tr}(\mathbb{E} S)}{d} \leq \frac{\mathbb{E}\left[\sum_{i=1}^k\|\bar{X}_{(i)}\|^2\right]}{d} \leq \frac{k\,\mathbb{E}\left[\max_{i\in[n]}\|\bar{X}_i\|^2\right]}{d}.$$
We conclude that the solution to the minimization problem of (B.54) —that is, when all eigenvalues are equal— is lower bounded by
$$\mathbb{E}\,\mathrm{Tr}(S^{-1}) \geq \sum_{j=1}^d \frac{1}{\lambda_j(\mathbb{E} S)} \geq \frac{d^2}{\mathbb{E}\left[\sum_{i=1}^k\|\bar{X}_{(i)}\|^2\right]} \geq \frac{d^2}{k\,\mathbb{E}\left[\max_{i\in[n]}\|\bar{X}_i\|^2\right]},$$
which proves (B.52).


In order to prove the high-probability statement (B.53), note that, by the AM–HM inequality,
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}) = \sum_{i=1}^d \frac{1}{\lambda_i(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \geq \frac{d^2}{\sum_{j=1}^k\|\bar{X}_j\|^2} \geq \frac{d^2}{k\max_{i\in[n]}\|\bar{X}_i\|^2}. \qquad (B.55)$$
We directly conclude that, with probability at least $1 - \alpha$,
$$\max_{i\in[n]}\|\bar{X}_i\|^2 \leq F^{-1}(1-\alpha), \qquad (B.56)$$
as $F$ is the cdf of $\max_{i\in[n]}\|\bar{X}_i\|^2$. It follows that, with probability at least $1 - \alpha$,
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \geq \frac{d^2}{k\,F^{-1}(1-\alpha)}. \qquad (B.57)$$

B.8 Proof of Corollary 5

Corollary. For Gaussian observations $X_i \sim \mathcal{N}(0, \Sigma)$ and large $n$, for any algorithm $\mathcal{A}$,
$$\mathbb{E}_\mathcal{A}\,\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \geq \frac{d}{k\left(\frac{2\log n}{d} + \log\log n\right)}. \qquad (B.58)$$
Moreover, let $\alpha \in (0, 1)$. Then, for any $\mathcal{A}$, with probability at least $1 - \alpha$ and $C = 2\log\Gamma(d/2)/d$,
$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \geq \frac{d}{k\left(\frac{2\log n}{d} + \log\log n - \frac{1}{d}\log\log\frac{1}{1-\alpha} - C\right)}. \qquad (B.59)$$

Proof. In order to apply Theorem 4, we need to upper bound $\mathbb{E}\left[\max_{i\in[n]}\|\bar{X}_i\|^2\right]$, where $\bar{X}_i$ is a $d$-dimensional Gaussian random variable with identity covariance matrix. In other words, we need to upper bound the expected maximum of $n$ chi-squared random variables with $d$ degrees of freedom. Let us start by proving (B.58). We can use extreme value theory to find the limiting distribution of the maximum of $n$ random variables. Firstly, note that the chi-squared distribution is a particular case of the Gamma distribution. More specifically, $\chi^2_d \sim \Gamma(d/2, 2)$. If we parameterize the $\Gamma$ distribution by $\alpha$ (shape) and $\beta$ (rate), then $\alpha = d/2$ and $\beta = 1/2$.


By the Fisher–Tippett Theorem we know that there are only three limiting distributions for $\lim_{n\to\infty} X_{(n)} = \lim_{n\to\infty}\max_{i\le n} X_i$, where the $X_i$ are iid random variables, namely, the Fréchet, Weibull, and Gumbel distributions. It is known that the Gamma distribution is in the max-domain of attraction of the Gumbel distribution. Further, the normalizing constants are known (see Chapter 3 of [27]). In particular, we know that if $X_{(n)} := \max_{i\in[n]}\|\bar X_i\|^2$, then

$$\lim_{n\to\infty}\mathbb{P}\Big( X_{(n)} \le 2x + 2\ln n + 2(d/2-1)\ln\ln n - 2\ln\Gamma(d/2) \Big) = \Lambda(x) = e^{-e^{-x}}. \tag{B.60}$$

We can assume that the asymptotic limit holds, as $n$ is in practice very large, and compute the mean value of $X_{(n)}$. As $X_{(n)}$ is a positive random variable,

$$\mathbb{E}[X_{(n)}] = \int_0^{\infty}\mathbb{P}\big(X_{(n)} \ge t\big)\,dt \tag{B.61}$$
$$\qquad = \int_0^{\infty}\big(1 - \mathbb{P}\big(X_{(n)} \le t\big)\big)\,dt. \tag{B.62}$$

We make the change of variables $t = 2x + C$, where $C = 2\ln n + (d-2)\ln\ln n - 2\ln\Gamma(d/2)$. Then,

$$\mathbb{E}[X_{(n)}] = \int_0^{\infty}\mathbb{P}\big(X_{(n)}\ge t\big)\,dt \tag{B.63}$$
$$= \int_{-C/2}^{\infty} 2\big(1-\mathbb{P}\big(X_{(n)} \le 2x + C\big)\big)\,dx \tag{B.64}$$
$$\approx \int_{-C/2}^{\infty} 2\big(1-e^{-e^{-x}}\big)\,dx \tag{B.65}$$
$$= \int_{-C/2}^{0} 2\big(1-e^{-e^{-x}}\big)\,dx + \int_{0}^{\infty} 2\big(1-e^{-e^{-x}}\big)\,dx \tag{B.66}$$
$$\le \int_{-C/2}^{0} 2\,dx + 2\gamma = C + 2\gamma, \tag{B.67}$$

where $\gamma$ is the Euler–Mascheroni constant. We conclude that

$$\mathbb{E}[X_{(n)}] \;\le\; C + 2\gamma \;\le\; 2\ln n + (d-2)\ln\ln n. \tag{B.68}$$

If we take the largest $k$ observations, and assume we could split the weight equally among all dimensions (which is desirable), we see that the best we can do in expectation is upper bounded by

$$\frac{k}{d}\,\mathbb{E}[X_{(n)}] \;\le\; k\Big(\frac{2\ln n}{d} + \ln\ln n\Big). \tag{B.69}$$

Now, let us prove (B.59). The following inequalities simplify our task to finding a high-probability upper bound on $\max_{i\in[n]}\|\bar X_i\|^2$. We have that

$$\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) = \mathrm{Tr}\big((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}\big) = \sum_{i=1}^{d}\frac{1}{\lambda_i(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \;\ge\; \sum_{i=1}^{d}\frac{1}{\sum_{j=1}^{k}\|\bar X_j\|^2/d} \;\ge\; \frac{d^2}{\sum_{j=1}^{k}\|\bar X_{(j)}\|^2} \;\ge\; \frac{d^2}{k\max_{i\in[n]}\|\bar X_i\|^2}. \tag{B.70}$$

Fix $\alpha\in[0,1]$. We need to find a constant $Q$ such that with probability at least $1-\alpha$, $Q \ge \max_{i\in[n]}\|\bar X_i\|^2$, so that we may conclude that $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge d^2/(Qk)$ with high probability. By (B.60) we know that

$$\lim_{n\to\infty}\mathbb{P}\Big(X_{(n)} \le 2x + 2\ln n + 2(d/2-1)\ln\ln n - 2\ln\Gamma(d/2)\Big) = \Lambda(x) = e^{-e^{-x}}. \tag{B.71}$$

For large $n$, we assume the previous upper bound for $X_{(n)}$ is exact. We want to find $Q>0$ such that $\mathbb{P}(X_{(n)} \le Q) = 1-\alpha$. Note that if $1-\alpha = e^{-e^{-x}}$, then

$$x = -\log\log\frac{1}{1-\alpha}. \tag{B.72}$$

It follows that $Q = 2\ln n + 2(d/2-1)\ln\ln n - \log\log(1-\alpha)^{-1} - 2\ln\Gamma(d/2)$. Finally, (B.59) follows as

$$\frac{Q}{d} \;\le\; \frac{2\log n}{d} + \log\log n - \frac{1}{d}\log\log\frac{1}{1-\alpha} - \frac{2\log\Gamma(d/2)}{d}. \tag{B.73}$$
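The Gumbel approximation used above is easy to examine numerically. The sketch below (illustrative only, with arbitrary parameter values) compares a Monte Carlo estimate of $\mathbb{E}\big[\max_{i\in[n]}\|\bar X_i\|^2\big]$ with the asymptotic expression $C + 2\gamma$ from (B.67)–(B.68); for moderate $n$ the two need not coincide exactly, since the bound is asymptotic in $n$.

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(1)
n, d, trials = 10_000, 10, 300

# Monte Carlo estimate of E[max_i ||X_i||^2] for white Gaussian X_i in R^d
mc = np.mean([rng.chisquare(d, size=n).max() for _ in range(trials)])

# Asymptotic Gumbel expression C + 2*gamma from (B.67)-(B.68)
euler_gamma = 0.5772156649015329
C = 2 * log(n) + (d - 2) * log(log(n)) - 2 * lgamma(d / 2)
print(f"Monte Carlo E[max ||X||^2]: {mc:.1f}")
print(f"Gumbel expression C + 2*gamma: {C + 2 * euler_gamma:.1f}")
```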

B.9 Proof of CLT Lower Bound

Corollary 37. Assume the norm of white observations is distributed according to $Z \sim \mathcal{N}(d, \bar\sigma^2)$. Then, for any algorithm $\mathcal{A}$,

$$\mathbb{E}_{\mathcal{A}}\,\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \;\ge\; \frac{d}{\big(1 + \frac{\bar\sigma}{d}\sqrt{2\log n}\big)\,k}. \tag{B.74}$$

Proof. By Theorem 4, we need to compute $\mathbb{E}\big[\max_{i\in[n]}\|\bar X_i\|^2\big]$. By assumption, $\|\bar X_i\|^2 \sim \mathcal{N}(d,\bar\sigma^2)$ for each $i$, which implies

$$\mathbb{E}\Big[\max_{i\in[n]}\|\bar X_i\|^2\Big] = \mathbb{E}\Big[d + \max_{i\in[n]}\big(\|\bar X_i\|^2 - d\big)\Big] \tag{B.75}$$
$$\le d + \bar\sigma\,\mathbb{E}\Big[\max_{i\in[n]}\mathcal{N}(0,1)\Big] \tag{B.76}$$
$$\le d + \bar\sigma\sqrt{2\log n}, \tag{B.77}$$

and the result follows.

B.10 Proofs for Misspecified Models

We start by deriving Proposition 8. Recall that under $D$, $\mathbb{E}_{X\sim D}[X\,\mathrm{Bias}(X)] = 0$.

Proposition. For $X \sim \bar D$, where $\bar D$ is the thresholding distribution:

$$\mathbb{E}_{X\sim\bar D}[X\,\mathrm{Bias}(X)] = \mathbb{E}_{X\sim\bar D}\big[X\,\mathbb{E}[Y\mid X]\big] - \gamma\,\mathbb{E}_{X\sim D}\big[X\,\mathbb{E}[Y\mid X]\big]. \tag{B.78}$$

Proof. Assume that $X \sim \bar D$, and take $w = v_j$. Then,

$$\mathbb{E}_{X\sim\bar D}\big[(w^T X)(\beta^{*T} X)\big] = \mathbb{E}_{X\sim\bar D}\Big[(v_j^T X)\sum_i \beta^*_i v_i^T X\Big] = \beta^*_j\,\mathbb{E}_{X\sim\bar D}\big[(v_j^T X)^2\big] = \mathbb{E}_{X\sim D}\big[(v_j^T X) Y\big]\,\frac{\mathbb{E}_{X\sim\bar D}\big[(v_j^T X)^2\big]}{\mathbb{E}_{X\sim D}\big[(v_j^T X)^2\big]} =: \mathbb{E}_{X\sim D}\big[(v_j^T X) Y\big]\,\gamma_j,$$

where we implicitly defined $\gamma_j$. In the second equality above, we used that for $i \ne j$,

$$\mathbb{E}_{X\sim\bar D}\big[v_j^T X X^T v_i\big] = \mathbb{E}\big[v_j^T\Sigma^{1/2}WW^T\Sigma^{1/2} v_i\big] = \gamma\,\mathbb{E}\big[v_j^T\Sigma^{1/2}\Sigma^{1/2} v_i\big] = \gamma\,\lambda_i\,\mathbb{E}\big[v_j^T v_i\big] = 0.$$

While $\gamma_j = 1$ when $X$ is sampled from $D$, in general, for a different $\bar D$, $\gamma_j \ne 1$. For an arbitrary $w = \sum_j w_j v_j$, we have that

$$\mathbb{E}_{X\sim\bar D}\big[(w^T X)(\beta^{*T} X)\big] = \sum_j w_j\,\mathbb{E}_{X\sim\bar D}\Big[(v_j^T X)\sum_i\beta^*_i v_i^T X\Big] = \sum_j w_j\,\gamma_j\,\mathbb{E}_{X\sim D}\big[(v_j^T X) Y\big]. \tag{B.79}$$

We next show that, for our thresholded distribution $\bar D$, $\gamma_j$ has a special form:

$$\gamma_j = \frac{\mathbb{E}_{X\sim\bar D}\big[(v_j^T X)^2\big]}{\mathbb{E}_{X\sim D}\big[(v_j^T X)^2\big]} = \frac{\mathbb{E}_{X\sim\bar D}\big[(v_j^T\Sigma^{1/2}\Sigma^{-1/2} X)^2\big]}{\mathbb{E}_{X\sim D}\big[(v_j^T\Sigma^{1/2}\Sigma^{-1/2} X)^2\big]} = \frac{\mathbb{E}_{X\sim\bar D}\big[((\Sigma^{1/2}v_j)^T\bar X)^2\big]}{\mathbb{E}_{X\sim D}\big[((\Sigma^{1/2}v_j)^T\bar X)^2\big]}.$$

We can write $\Sigma^{1/2}v_j = \sum_i\delta_{ij}e_i$ for some coefficients $\delta_{ij}$. Thus, it follows that

$$\gamma_j = \frac{\mathbb{E}_{X\sim\bar D}\big[((\sum_i\delta_{ij}e_i)^T\bar X)^2\big]}{\mathbb{E}_{X\sim D}\big[((\sum_i\delta_{ij}e_i)^T\bar X)^2\big]} = \frac{\sum_i\delta_{ij}^2\,\mathbb{E}_{X\sim\bar D}\big[(e_i^T\bar X)^2\big]}{\sum_i\delta_{ij}^2\,\mathbb{E}_{X\sim D}\big[(e_i^T\bar X)^2\big]} = \frac{\sum_i\delta_{ij}^2\,\mathbb{E}_{X\sim\bar D}\big[\bar X_i^2\big]}{\sum_i\delta_{ij}^2\,\mathbb{E}_{X\sim D}\big[\bar X_i^2\big]} = \frac{\gamma\sum_i\delta_{ij}^2}{\sum_i\delta_{ij}^2} = \gamma,$$

where we used that $\mathbb{E}[\bar X_i\bar X_k] = 0$ for $i\ne k$ under both $D$ and $\bar D$. We conclude by (B.79) that

$$\mathbb{E}_{X\sim\bar D}\big[(w^T X)(\beta^{*T} X)\big] = \gamma\,\mathbb{E}_{X\sim D}\big[(w^T X) Y\big]. \tag{B.80}$$

If we go back to the expected value of $X\,\mathrm{Bias}(X)$, taking $w = e_i$, we see that

$$\mathbb{E}_{X\sim\bar D}[X\,\mathrm{Bias}(X)] = \mathbb{E}_{X\sim\bar D}\big[X(\mathbb{E}[Y\mid X] - \beta^{*T}X)\big] = \mathbb{E}_{X\sim\bar D}\big[X\,\mathbb{E}[Y\mid X]\big] - \mathbb{E}_{X\sim\bar D}\big[X(\beta^{*T}X)\big] = \mathbb{E}_{X\sim\bar D}\big[X\,\mathbb{E}[Y\mid X]\big] - \gamma\,\mathbb{E}_{X\sim D}\big[X\,\mathbb{E}[Y\mid X]\big]. \tag{B.81}$$

Next, we prove Lemma 10, providing an explicit bound on the impact of the approximation error.

Lemma. Suppose Assumption 1 holds. Fix $\delta\in(0,1)$. Then, the term $\big\|\frac{1}{k}\sum_{i=1}^{k}\Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i)\big\|^2$ is upper bounded with probability at least $1-\delta$ by

$$\|\Sigma^{-1/2}\nu\|^2 + \frac{2\,\mathbb{E}_{X\sim\bar D}\big[\|\Sigma^{-1/2}X\,\mathrm{Bias}(X)\|^2\big]}{k}\Bigg(1 + \sqrt{8\log\frac{1}{\delta}}\Bigg)^2 + \frac{64}{9k^2}\Big(B^2_{\mathrm{Bias}}\,d + \|\Sigma^{-1/2}\nu\|^2\Big)\log^2\frac{1}{\delta},$$

where $\nu = \mathbb{E}_{X\sim\bar D}\big[X\,\mathbb{E}[Y\mid X]\big] - \gamma\,\mathbb{E}_{X\sim D}\big[X\,\mathbb{E}[Y\mid X]\big]$.

Proof. If $X \sim D$, then we can use Lemma 9 by taking $Z_i := \Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i)$, as $\mathbb{E}_{D}\,Z_i = 0$. However, as we care about the case $X \sim \bar D$, we need to subtract the mean, and take

$$Z_i = \Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i) - \Sigma^{-1/2}\nu.$$

Note that in this case $\mathbb{E}_{\bar D}\,Z_i = 0$. By Assumption 1,

$$\|Z_i\| \le \|\Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i)\| + \|\Sigma^{-1/2}\nu\| \le B_{\mathrm{Bias}}\sqrt{d} + \|\Sigma^{-1/2}\nu\| =: b.$$

As observations are independent, we take

$$v = k\,\mathbb{E}_{X\sim\bar D}\big[\|\Sigma^{-1/2}X\,\mathrm{Bias}(X)\|^2\big] \le k d\,B^2_{\mathrm{Bias}}.$$

Applying Lemma 9, we conclude that with probability at least $1-\delta$,

$$\Bigg\|\frac{1}{k}\sum_{i=1}^{k}\Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i) - \Sigma^{-1/2}\nu\Bigg\| \le \sqrt{\frac{\mathbb{E}_{X\sim\bar D}\big[\|\Sigma^{-1/2}X\,\mathrm{Bias}(X)\|^2\big]}{k}}\Bigg(1+\sqrt{8\log\frac{1}{\delta}}\Bigg) + \frac{4b}{3k}\log\frac{1}{\delta}.$$

As a consequence, with probability at least $1-\delta$,

$$\Bigg\|\frac{1}{k}\sum_{i=1}^{k}\Sigma^{-1/2}X_i\,\mathrm{Bias}(X_i)\Bigg\|^2 \le \|\Sigma^{-1/2}\nu\|^2 + \frac{2\,\mathbb{E}_{X\sim\bar D}\big[\|\Sigma^{-1/2}X\,\mathrm{Bias}(X)\|^2\big]}{k}\Bigg(1+\sqrt{8\log\frac{1}{\delta}}\Bigg)^2 + \frac{32b^2}{9k^2}\log^2\frac{1}{\delta}$$
$$\le \|\Sigma^{-1/2}\nu\|^2 + \frac{2\,\mathbb{E}_{X\sim\bar D}\big[\|\Sigma^{-1/2}X\,\mathrm{Bias}(X)\|^2\big]}{k}\Bigg(1+\sqrt{8\log\frac{1}{\delta}}\Bigg)^2 + \frac{64}{9k^2}\Big(B^2_{\mathrm{Bias}}\,d + \|\Sigma^{-1/2}\nu\|^2\Big)\log^2\frac{1}{\delta}. \tag{B.82}$$

Finally, we show Lemma 11.

Lemma. For $\delta\in(0,1)$, after collecting $k$ observations $X\sim\bar D$, with probability at least $1-\delta$,

$$\Big\|\Sigma^{1/2}\widehat\Sigma^{-1}\Sigma^{1/2}\Big\| \le \frac{1}{\gamma}\,\frac{1}{(1-\alpha)^2}, \tag{B.83}$$

where $\alpha = C\sqrt{d/k} + \sqrt{\log(2/\delta)/(ck)}$ for universal constants $c, C > 0$, and $\gamma$ is defined in (2.5).

Proof. Let $W\sim\bar D$ be a whitened observation, so that $\mathbb{E}[WW^T] = \gamma\,I_d$. Note that the observation that we actually collect is given by $X = \Sigma^{1/2}W$; in terms of matrices, $\mathbf{X} = \mathbf{W}\Sigma^{1/2}$. Note that

$$\Big\|\Sigma^{1/2}\widehat\Sigma^{-1}\Sigma^{1/2}\Big\| = \Bigg\|\Sigma^{1/2}\Big(\frac{1}{k}\mathbf{X}^T\mathbf{X}\Big)^{-1}\Sigma^{1/2}\Bigg\| = \Bigg\|\Big(\frac{1}{k}\mathbf{W}^T\mathbf{W}\Big)^{-1}\Bigg\| = \frac{1}{\lambda_{\min}\big(\frac{1}{k}\mathbf{W}^T\mathbf{W}\big)}.$$

As in the main theorem of the work, we use Theorem 5.39 in [91], which guarantees that if $D$ and $\bar D$ are subgaussian, then for $\alpha > 0$ such that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$, with probability at least $1 - 2\exp(-ct^2)$ we have

$$\lambda_{\min}\Big(\frac{1}{k}\mathbf{W}^T\mathbf{W}\Big) \ge \gamma\,(1-\alpha)^2,$$

as $W/\sqrt{\gamma}$ is white subgaussian. Equivalently, for $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\Big\|\Sigma^{1/2}\widehat\Sigma^{-1}\Sigma^{1/2}\Big\| \le \frac{1}{\gamma}\,\frac{1}{(1-\alpha)^2}, \tag{B.84}$$

where $\alpha = C\sqrt{d/k} + \sqrt{\log(2/\delta)/(ck)}$ for universal constants $c, C > 0$.

B.11 Ridge Regression

Regularized linear estimators also benefit from large and balanced observations. We show that, under mild assumptions, the performance of ridge regression is directly aligned with that of previous sections.

The ridge estimator is $\hat\beta_\lambda = \big(\mathbf{X}^T\mathbf{X} + \lambda I\big)^{-1}\mathbf{X}^T\mathbf{Y}$, given $(\mathbf{X},\mathbf{Y})$ and $\lambda > 0$. The following result shows how large values of $\lambda_{\min}(\mathbf{X}^T\mathbf{X})$ help to control the MSE of $\hat\beta_\lambda$. As the optimal penalty parameter $\lambda^*$ is unknown until the end of the data collection process, we assume it is uniformly random in a small interval.

Theorem 38. Let $R > 0$. Assume the penalty parameter for ridge regression is chosen uniformly at random, $\lambda^* \sim U[0,R]$. Then, the MSE of $\hat\beta_{\lambda^*}$ is upper bounded by

$$\mathbb{E}_{\lambda^*,\mathbf{X}}\,\|\hat\beta_{\lambda^*} - \beta^*\|^2 \le \mathbb{E}_{\mathbf{X}}\, f\big(\lambda_{\min}(\mathbf{X}^T\mathbf{X})\big), \tag{B.85}$$

where $f$ is the following decreasing function of $\lambda_{\min}$:

$$f(\lambda_{\min}) = \frac{\sigma^2 d}{\lambda_{\min} + R} + \|\beta^*\|_2^2\Bigg(1 - \frac{2\lambda_{\min}}{R}\log\Big(1 + \frac{R}{\lambda_{\min}}\Big) + \frac{\lambda_{\min}}{\lambda_{\min}+R}\Bigg). \tag{B.86}$$

Proof. The SVD decomposition $\mathbf{X} = USV^T$ implies that $\mathbf{X}^T\mathbf{X} = VSU^TUSV^T = VS^2V^T$, where $U$ and $V$ are orthogonal matrices. We define $W = \big(\mathbf{X}^T\mathbf{X} + \lambda I\big)^{-1}$, and see that

$$W = \big(V(S^2+\lambda I)V^T\big)^{-1} = V\,\mathrm{Diag}\Bigg(\frac{1}{s_{jj}^2 + \lambda}\Bigg)_{j=1}^{d}V^T.$$

In this case, the MSE of $\hat\beta_\lambda$ has two sources: the squared bias and the trace of the covariance matrix. The covariance matrix of $\hat\beta_\lambda$ is $\mathrm{Cov}(\hat\beta_\lambda) = \sigma^2\, W\mathbf{X}^T\mathbf{X}W$, while its bias is given by $-\lambda W\beta^*$ (see [41]). Thus,

$$\mathrm{Cov}(\hat\beta_\lambda) = \sigma^2\, V\,\mathrm{Diag}\Bigg(\frac{s_{jj}^2}{(s_{jj}^2+\lambda)^2}\Bigg)_{j=1}^{d}V^T. \tag{B.87}$$

Note that $s_{jj}^2 = \lambda_j$, where the $s_{jj}$'s are the singular values of $\mathbf{X}$ and the $\lambda_j$'s the eigenvalues of $\mathbf{X}^T\mathbf{X}$. As $V$ is orthogonal, $\mathrm{Tr}\big[\mathrm{Cov}(\hat\beta_\lambda)\big] = \sigma^2\sum_{j=1}^d \lambda_j/(\lambda_j+\lambda)^2$.

Unfortunately, in practice, the value of $\lambda$ is unknown before collecting the data. A common technique consists in using an additional validation set to choose the optimal regularization parameter $\lambda^*$. Generally, in supervised learning, the validation set comes from the same distribution as the test set, while in active learning it does not. As in the unregularized case, we want to train on unlikely data, but we want to test on likely data. We achieve robustness against this fact as follows. We fix some fairly large $R > 0$ such that we assume $\lambda^* \in (0,R)$. We treat $\lambda^*$ as a random variable, and we impose a uniform prior $D_\lambda$ over $(0,R)$. Then, we see that

$$\mathbb{E}_{\lambda^*\sim D_\lambda}\Big[\mathrm{Tr}\big[\mathrm{Cov}(\hat\beta_{\lambda^*})\big]\Big] = \sigma^2\sum_{j=1}^{d}\lambda_j\int_0^R\frac{1}{(\lambda_j+\lambda)^2}\,\frac{1}{R}\,d\lambda = \sigma^2\sum_{j=1}^{d}\frac{1}{\lambda_j+R} \le \frac{\sigma^2 d}{\lambda_{\min}+R}. \tag{B.88}$$

The squared bias can be upper bounded by

$$\lambda^2\,\beta^{*T}W^TW\beta^* = \beta^{*T}V\,\mathrm{Diag}\Bigg(\frac{\lambda^2}{(\lambda_j+\lambda)^2}\Bigg)_j V^T\beta^* \le \|\beta^*\|_2^2\,\max_j\Bigg(\frac{\lambda}{\lambda_j+\lambda}\Bigg)^2 = \|\beta^*\|_2^2\Bigg(\frac{\lambda}{\lambda_{\min}+\lambda}\Bigg)^2 \tag{B.89}$$

for every $\lambda > 0$, as $\lambda_j \ge 0$ for all $j$. Taking expectations on both sides of (B.89) with respect to $\lambda^*\sim D_\lambda$, and after some algebra,

$$\mathbb{E}_{D_\lambda}\Big[\mathrm{Bias}^2(\hat\beta_{\lambda^*})\Big] \le \|\beta^*\|_2^2\Bigg(1 - \frac{2\lambda_{\min}}{R}\log\Big(1+\frac{R}{\lambda_{\min}}\Big) + \frac{\lambda_{\min}}{\lambda_{\min}+R}\Bigg), \tag{B.90}$$

where the RHS is a decreasing function of $\lambda_{\min}$ that tends to zero as $\lambda_{\min}$ grows. It follows that $\mathbb{E}\,\|\hat\beta_{\lambda^*}-\beta^*\|^2$ can be controlled by maximizing $\lambda_{\min}(\mathbf{X}^T\mathbf{X})$, and we can focus on maximizing $\lambda_{\min}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})$ by the equivalence shown in the Problem Definition of Chapter 2.
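The bound (B.86) is straightforward to evaluate. The short sketch below (illustrative only, with arbitrary values for $\sigma^2$, $\|\beta^*\|^2$, and $R$) prints $f(\lambda_{\min})$ for growing $\lambda_{\min}$ and shows that it decreases, which is the property exploited above.

```python
import numpy as np

def ridge_mse_bound(lam_min, R, sigma2, beta_norm2, d):
    """Upper bound f(lambda_min) on the ridge MSE from (B.86)."""
    var_term = sigma2 * d / (lam_min + R)
    bias_term = beta_norm2 * (1 - 2 * lam_min / R * np.log(1 + R / lam_min)
                              + lam_min / (lam_min + R))
    return var_term + bias_term

for lam_min in [1.0, 10.0, 100.0, 1000.0]:
    print(lam_min, ridge_mse_bound(lam_min, R=5.0, sigma2=1.0, beta_norm2=25.0, d=10))
```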

B.12 Simulations

We conducted several experiments in various settings. We present here some experiments that complement those shown in the main paper. In particular, we show experiments for linear models, synthetic non-linear data, and additional regularized and real-world datasets.

B.12.1 Linear Models

We first empirically show the results proved in Theorem 1. For a sequence of values of $n$, we choose $k = \sqrt{n}$ observations in $\mathbb{R}^d$, with fixed $d = 10$. The observations are generated according to $\mathcal{N}(0, I_d)$, and $y$ follows a linear model with $\beta_i \sim U(-5,5)$. For each tuple $(n,k)$ we repeat the experiment 200 times, and compute the squared error ($\beta^*$ is known). The results in Figure B.1 (a) show that the average MSE of Algorithm 1 is significantly lower than that of random sampling. We also see a strong variance reduction. Figure B.1 (b) restricts the comparison to the fixed and adaptive threshold algorithms; while the latter outperforms the former, the difference is small. In Figure B.1 (c) we keep $n$ and $d$ fixed, and vary $k$. Finally, in Figure B.2 (a) we show the case where $\Sigma \ne I_d$. A minimal simulation in this spirit is sketched below.
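For illustration, here is a simple sketch of this kind of comparison; it is not the thesis code. It contrasts labelling a random subset of size $k$ with labelling the $k$ largest-norm observations, which is a simplified offline stand-in for the online thresholding algorithms, and reports the OLS squared error. Names and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, reps = 3000, 10, 200
k = int(np.sqrt(n))
beta = rng.uniform(-5, 5, size=d)

def ols_error(X, y):
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((b_hat - beta) ** 2)

err_rand, err_top = [], []
for _ in range(reps):
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)
    idx_rand = rng.choice(n, size=k, replace=False)    # random sampling baseline
    idx_top = np.argsort((X ** 2).sum(axis=1))[-k:]    # keep the k largest-norm points
    err_rand.append(ols_error(X[idx_rand], y[idx_rand]))
    err_top.append(ols_error(X[idx_top], y[idx_top]))

print(f"random sampling MSE:        {np.mean(err_rand):.4f}")
print(f"largest-norm selection MSE: {np.mean(err_top):.4f}")
```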

Figure B.1: MSE of $\hat\beta_{\mathrm{OLS}}$ with white Gaussian observations; panels (a) and (b) use $k=\sqrt{n}$, $d=10$, and panel (c) uses $n=3000$, $d=10$; (0.25, 0.75) quantile confidence intervals are displayed in (a) and (c).

For completeness, we repeated the simulation with observations generated according to a joint Gaussian distribution with a random covariance matrix that had $\mathrm{Tr}(\Sigma) = 21.59$, $\lambda_{\min} = 0.65$, and $\lambda_{\max} = 3.97$. Figure B.2 (a) shows that thresholding algorithms outperform random sampling in a similar way as in the white case presented in the paper. Also, Figure B.2 (b) shows how the adaptive threshold slightly beats the fixed one.

Finally, in Figure B.2 (c), we show the results of simulations when observations are sampled from Laplace correlated marginals (through a Gaussian copula). We compare random sampling to two versions of the thresholding algorithm. The simplest one, denoted by Unif-Weig Algorithm, assigns uniform weights (i.e., $\xi_i = 1$ for all $i$). On the other hand, we denote by Opt-Weig Algorithm the algorithm that uses the optimal weights (previously pre-computed; in this case $\max_i\xi_i/\min_i\xi_i \approx 7$, and independent variables tend to require higher weights). As one would expect, the latter does better than the former. However, it is remarkable that the difference between random sampling and thresholding is far more substantial than the difference between optimal and approximate thresholding, an observation that can be very useful in practice.

Figure B.2: In (a) and (b), MSE of $\hat\beta_{\mathrm{OLS}}$ with $\mathcal{N}(0,\Sigma)$ data and (0.05, 0.95) confidence intervals; (c) Laplace correlated data via Gaussian copula, uniform vs. optimal weights.


B.12.2 Synthetic Non-Linear Data

The theory and algorithms presented in this paper are based on the linearity of the model. To understand the impact of this assumption, we perform an experiment where the response model was $y = x^T\beta + \kappa\, x^Tx$ for various values of $\kappa$, and $\beta_i \sim U(-5,5)$. Note that higher-order terms and transformations can easily be included in the design matrix (not done in this case). As expected, the results in Figure B.3 show an intersection point. The active learning algorithms are robust to some level of non-linearity but, at some point, random sampling becomes more effective.

Figure B.3: Model is $y = \sum_i\beta_i x_i + \kappa\sum_i x_i^2$.

B.12.3 Regularization

An appealing property of the proposed algorithms is that their gain is preserved under regularized estimators such as ridge and lasso. This is especially relevant as it allows for higher dimensional models where transformations and interactions of the original variables are added to better capture non-linearities in the data, and regularization is used to avoid overfitting. In fact, our algorithm can be thought of as a type of regularizing process.

We repeated the first experiment from the linear model simulations, using the ridge estimator with $\lambda = 0.01$. Figure B.4 (a) shows that the average MSEs of Algorithms 1 and 1b strongly outperform the results of random sampling. Their variance is less than 30% of that of random sampling in all cases.

We performed two experiments with Lasso estimators to investigate the behavior of our algorithms in the presence of sparse models. We do not test Algorithm 2 here, but only simple thresholding approaches. First, we fixed $n = 5000$, $k = 150$, $d = 70$ and white Gaussian data. The dimension of the latent subspace, or effective dimension of the model, ranges from $d_{\mathrm{eff}} = 5$ to $d_{\mathrm{eff}} = 70$. Results are shown in Figure B.4 (b). Algorithm 1 and Algorithm 1b strongly improve the performance of random sampling, while their variance is at most half that of random sampling. In the second experiment, we fixed $d_{\mathrm{eff}} = 7$, and progressively increased the dimension of the space $d$ from $d = 70$ to $d = 450$. Also, we kept $n = 1000$ and $k = 100$ fixed. Results are shown in Figure B.4 (c).

Figure B.4: MSE of regularized estimators, $\lambda = 0.01$, with white Gaussian observations; (a) Ridge, $d = 10$, $k = \sqrt{n}$; (b) Lasso, $n = 5000$, $k = 150$, $d = 70$; (c) Lasso, $n = 1000$, $k = 100$, $d_{\mathrm{eff}} = 7$. The (0.05, 0.95) confidence intervals are shown in (a), and (0.25, 0.75) in (b).

Thresholding algorithms consistently decrease the MSE of the lasso estimator with respect to random sampling, even though we are adding a large number of purely noisy dimensions. The reason is simple: while these algorithms do not actively try to find the latent subspace (Algorithm 2 does), their observations will be, on average, larger in those dimensions too. There may be ways to leverage this fact, like batched approaches where the weights $\xi$ are updated by giving more importance to promising dimensions.

B.12.4 Real World Datasets

The Combined Cycle Power dataset has 9568 observations. The outcome is the net hourly electrical energy output of the plant, and it has $d = 4$ covariates: temperature, pressure, humidity, and exhaust vacuum. In Figure B.5, we see the phenomenon explained in the main paper (for large $k$, the gain vanishes). In this case, and after adding all second order interactions, active learning solves the problem. Random sampling with interactions is not shown as its error was much larger.

Figure B.5: Combined Cycle Power (150 iterations).


In addition, in Figure B.6 we show the scatterplots of the datasets used in the paper (we omitted the YearPredictionMSD dataset, as $d = 90$).

Figure B.6: Scatter plots of real-world datasets: (a) Protein Structure, (b) Combined Cycle Power Plant, (c) Bike Sharing.


Appendix C

Proofs Chapter 3

C.1 Optimal Static Allocation

C.1.1 Proof of Proposition 13

Proposition. Given $m$ linear regression problems, each characterized by a parameter $\beta_i$, Gaussian noise with variance $\sigma_i^2$, and Gaussian contexts with covariance $\Sigma$, let $n > m(d+1)$. Then the optimal OLS static allocation algorithm $\mathcal{A}^*_{\mathrm{stat}}$ selects each instance

$$k^*_{i,n} = \frac{\sigma_i^2}{\sum_j\sigma_j^2}\,n + (d+1)\Bigg(1 - \frac{\sigma_i^2}{\bar\sigma^2}\Bigg) \tag{C.1}$$

times (up to rounding effects), and incurs the global error

$$L^*_n = L_n(\mathcal{A}^*_{\mathrm{stat}}) = \bar\sigma^2\,\frac{md}{n} + O\Bigg(\Big(\bar\sigma^2\,\frac{md}{n}\Big)^2\Bigg). \tag{C.2}$$

Proof. For the sake of readability, in the following we drop the dependency on $n$. We first derive the equality in Eq. 3.2:

$$L_i(\hat\beta_i) = \mathbb{E}_X\big[(X^T\hat\beta_i - X^T\beta_i)^2\big] = \mathbb{E}_X\big[(\hat\beta_i-\beta_i)^T XX^T(\hat\beta_i-\beta_i)\big] = (\hat\beta_i-\beta_i)^T\,\mathbb{E}[XX^T]\,(\hat\beta_i-\beta_i) = (\hat\beta_i-\beta_i)^T\Sigma(\hat\beta_i-\beta_i) = \|\hat\beta_i-\beta_i\|_\Sigma^2.$$


As a result, we can write the global error as

$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_{i\in[m]}\ \mathbb{E}_{D_{i,n}}\big[\|\hat\beta_i-\beta_i\|_\Sigma^2\big] = \max_{i\in[m]}\ \mathbb{E}_{D_{i,n}}\,\mathrm{Tr}\big((\hat\beta_i-\beta_i)^T\Sigma(\hat\beta_i-\beta_i)\big) = \max_{i\in[m]}\ \mathbb{E}_{D_{i,n}}\,\mathrm{Tr}\big(\Sigma(\hat\beta_i-\beta_i)(\hat\beta_i-\beta_i)^T\big) = \max_{i\in[m]}\ \mathrm{Tr}\Big(\mathbb{E}_{D_{i,n}}\big[\Sigma(\hat\beta_i-\beta_i)(\hat\beta_i-\beta_i)^T\big]\Big),$$

where $D_{i,n}$ is the training set extracted from $D_n$ containing the samples for instance $i$. Since contexts and noise are independent random variables, we can decompose $D_{i,n}$ into the randomness related to the context matrix $\mathbf{X}_i\in\mathbb{R}^{k_i\times d}$ and the noise vector $\epsilon_i\in\mathbb{R}^{k_i}$. We recall that for any fixed realization of $\mathbf{X}_i\in\mathbb{R}^{k_i\times d}$, the OLS estimate $\hat\beta_i$ is distributed as

$$\hat\beta_i \mid \mathbf{X}_i \sim \mathcal{N}\big(\beta_i,\ \sigma_i^2(\mathbf{X}_i^T\mathbf{X}_i)^{-1}\big), \tag{C.3}$$

which means that $\hat\beta_i$ conditioned on $\mathbf{X}_i$ is unbiased with covariance matrix $\sigma_i^2(\mathbf{X}_i^T\mathbf{X}_i)^{-1}$. Thus, we can further develop $L_n(\mathcal{A}_{\mathrm{stat}})$ as

$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_{i\in[m]}\ \mathrm{Tr}\Big(\mathbb{E}_{\mathbf{X}_i}\Big[\mathbb{E}_{\epsilon_i}\big[\Sigma(\hat\beta_i-\beta_i)(\hat\beta_i-\beta_i)^T \mid \mathbf{X}_i\big]\Big]\Big) = \max_{i\in[m]}\ \sigma_i^2\,\mathrm{Tr}\Big(\Sigma\,\mathbb{E}_{\mathbf{X}_i}\big[(\mathbf{X}_i^T\mathbf{X}_i)^{-1}\big]\Big) = \max_{i\in[m]}\ \sigma_i^2\,\mathrm{Tr}\Big(\mathbb{E}_{\bar{\mathbf{X}}_i}\big[(\bar{\mathbf{X}}_i^T\bar{\mathbf{X}}_i)^{-1}\big]\Big), \tag{C.4}$$

where $\bar X = \Sigma^{-1/2}X$ is a whitened context and $\bar{\mathbf{X}}_i$ is its corresponding whitened matrix. Since whitened contexts $\bar X$ are distributed as $\mathcal{N}(0,I)$, we know that $(\bar{\mathbf{X}}_i^T\bar{\mathbf{X}}_i)^{-1}$ is distributed as an inverse Wishart $\mathcal{W}^{-1}(I_d, k_i)$, whose expectation is $I_d/(k_i-d-1)$, and thus,

$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_{i\in[m]}\ \sigma_i^2\,\mathrm{Tr}\Big(\frac{1}{k_i-d-1}\,I_d\Big) = \max_{i\in[m]}\ \frac{\sigma_i^2\,d}{k_i-d-1}. \tag{C.5}$$

Note that this final expression requires that $k_i > d+1$, since it is not possible to compute an OLS estimate with less than $d+1$ samples. Therefore, we proceed by minimizing Eq. C.5, subject to $k_i > d+1$. We write $k_i = k'_i + d + 1$ for some $k'_i > 0$. Thus, equivalently, we minimize

$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_i\ \frac{\sigma_i^2\,d}{k'_i}. \tag{C.6}$$


Since $\sum_i k'_i = n - m(d+1)$, we may conclude that the optimal $k'_i$ is given by

$$k'_i = \frac{\sigma_i^2}{\sum_j\sigma_j^2}\big(n - m(d+1)\big),$$

so that all the terms in the RHS of Eq. C.6 are equal. This gives us the optimal static allocation

$$k^*_i = \frac{\sigma_i^2}{\sum_j\sigma_j^2}\big(n-m(d+1)\big) + d + 1 = \frac{\sigma_i^2}{\sum_j\sigma_j^2}\,n + (d+1)\Bigg(1-\frac{\sigma_i^2}{\bar\sigma^2}\Bigg), \tag{C.7}$$

where $\bar\sigma^2 = (1/m)\sum_i\sigma_i^2$ is the mean variance across the $m$ problem instances. Thus, for the optimal static allocation, the expected loss is given by

$$L^*_n = L_n(\mathcal{A}^*_{\mathrm{stat}}) = d\,\max_i\ \frac{\sigma_i^2}{\frac{\sigma_i^2}{\sum_j\sigma_j^2}\,n - (d+1)\frac{\sigma_i^2}{\bar\sigma^2}} = \frac{\big(\sum_j\sigma_j^2\big)\,d}{n - m(d+1)} = \frac{\big(\sum_j\sigma_j^2\big)\,d}{n} + \frac{\big(\sum_j\sigma_j^2\big)\,md(d+1)}{n\big(n-m(d+1)\big)} = \frac{\big(\sum_j\sigma_j^2\big)\,d}{n} + O\Bigg(\frac{\big(\sum_j\sigma_j^2\big)\,md^2}{n^2}\Bigg),$$

which concludes the proof. Furthermore, the following bounds trivially hold for any $n \ge 2m(d+1)$:

$$\frac{md\,\bar\sigma^2}{n} \;\le\; L^*_n \;\le\; \frac{2md\,\bar\sigma^2}{n}.$$
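As a quick illustration (not part of the proof), the allocation (C.7) and the per-instance loss (C.5) can be computed directly from the noise levels. The sketch below uses arbitrary values of $\sigma_i^2$, $n$, and $d$, and verifies that all instances attain the same loss under the optimal static allocation.

```python
import numpy as np

sigma2 = np.array([1.0, 4.0, 9.0])   # per-instance noise variances (arbitrary)
m, d, n = len(sigma2), 5, 600

k_opt = sigma2 / sigma2.sum() * n + (d + 1) * (1 - sigma2 / sigma2.mean())   # Eq. (C.7)
loss_per_instance = sigma2 * d / (k_opt - d - 1)                             # Eq. (C.5)

print("optimal allocation k_i*:", np.round(k_opt, 1))
print("loss per instance (all equal):", np.round(loss_per_instance, 4))
print("L_n* = d * sum(sigma_j^2) / (n - m(d+1)):",
      round(d * sigma2.sum() / (n - m * (d + 1)), 4))
```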

C.2 Loss of an OLS-based Learning Algorithm (Proof of Lemma 14)

Unlike in the proof of Proposition 13, when the number of pulls is random and depends on the value of the previous observations (through $D_n$), then in general the OLS estimates $\hat\beta_{i,n}$ are no longer distributed as in Eq. C.3 and the derivation for $\mathcal{A}_{\mathrm{stat}}$ no longer holds. In fact, for a learning algorithm, the value $k_{i,t}$ itself provides some information about the observations that have been obtained up until time $t$ and were used by the algorithm to determine $k_{i,t}$. In the following, we show that by ignoring the current context $X_t$ when choosing instance $I_t$, we are still able to analyze the loss of Trace-UCB and obtain a result very similar to the static case.


We first need two auxiliary lemmas (Lemmas 39 and 40): one on the computation of an empirical estimate of the variance of the noise, and an independence result between the variance estimate and the linear regression estimate.

Lemma 39. In any linear regression problem with noise $\epsilon\sim\mathcal{N}(0,\sigma^2)$, after $t \ge d+1$ samples, given an OLS estimator $\hat\beta_t$, the noise variance estimator can be computed in a recurrent form as

$$\hat\sigma^2_{t+1} = \frac{t-d}{t-d+1}\,\hat\sigma^2_t + \frac{1}{t-d+1}\,\frac{\big(X_{t+1}^T\hat\beta_t - Y_{t+1}\big)^2}{1 + X_{t+1}^T(\mathbf{X}_t^T\mathbf{X}_t)^{-1}X_{t+1}}, \tag{C.8}$$

where $\mathbf{X}_t\in\mathbb{R}^{t\times d}$ is the sample matrix.

where Xt 2 Rt⇥d is the sample matrix.

Proof. We first recall the “batch” definition of the variance estimator

b�2

t =1

t� d

tX

s=1

(Ys �XTsb�t)

2 =1

t� dkYt �XT

tb�tk2

Since Yt = Xt� + ✏t and b�t = � + (XTt Xt)�1XT

t ✏t, we have

b�2

t =1

t� dk(XT

t Xt)�1XT

t ✏t � ✏tk2 =1

t� d

✏Tt ✏t � ✏Tt Xt(XTt Xt)

�1XTt ✏t⌘

=1

t� d(Et+1

� Vt+1

).

We now devise a recursive formulation for the two terms in the previous expression. We have

Et+1

= ✏Tt+1

✏t+1

= ✏Tt ✏t + ✏2t+1

= Et + ✏2t+1

.

In order to analyze the second term we first introduce the design matrix St = XTt Xt, which has the

simple update rule St+1

= St +Xt+1

XTt+1

. Then we have

Vt+1

= ✏Tt+1

Xt+1

(XTt+1

Xt+1

)�1XTt+1

✏t+1

=�

✏Tt Xt + ✏t+1

XTt+1

��

St +Xt+1

XTt+1

��1

✏Tt Xt + ✏t+1

XTt+1

�T

=�

✏Tt Xt + ✏t+1

XTt+1

S�1

t � S�1

t Xt+1

XTt+1

S�1

t

1 +XTt+1

S�1

t Xt+1

✏Tt Xt + ✏t+1

XTt+1

�T,

where we used the Sherman-Morrison formula in the last equality. We further develop the previous

expression as

Vt+1

= Vt + ✏t+1

XTt+1

S�1

t Xt+1

✏t+1

+ 2✏t+1

XTt+1

S�1

t XTt ✏t

� ✏Tt XtS�1

t Xt+1

XTt+1

S�1

t

1 +XTt+1

S�1

t Xt+1

XTt ✏t � ✏t+1

XTt+1

S�1

t Xt+1

XTt+1

S�1

t

1 +XTt+1

S�1

t Xt+1

Xt+1

✏t+1

� 2✏Tt XtS�1

t Xt+1

XTt+1

S�1

t

1 +XTt+1

S�1

t Xt+1

Xt+1

✏t+1

.

Page 120: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

APPENDIX C. PROOFS CHAPTER 3 107

We define ↵t+1

= XTt+1

S�1

t XTt ✏t and t+1

= XTt+1

S�1

t Xt+1

, and then obtain

Vt+1

= Vt + ✏2t+1

t+1

+ 2↵t+1

✏t+1

� ↵2

t+1

1 + t+1

� ✏2t+1

2

t+1

1� t+1

� 2✏t+1

↵t+1

t+1

1 + t+1

= Vt + ✏2t+1

t+1

+ 2

t+1

1 + t+1

+ 2✏t+1

↵t+1

1 + t+1

� ↵2

t+1

1 + t+1

.

Bringing everything together we obtain

Et+1

� Vt+1

= Et � Vt + ✏2t+1

1� t+1

+ 2

t+1

1 + t+1

� 2✏t+1

↵t+1

1 + t+1

+↵2

t+1

1 + t+1

= Et � Vt +1

1 + t+1

✏2t+1

� 2✏t+1

↵t+1

+ ↵t+1

= Et � Vt +

✏t+1

� ↵t+1

2

1 + t+1

.

Since ✏t+1

= Yt+1

�XTt+1

�, we may write

Et+1

� Vt+1

= Et � Vt +

Yt+1

�XTt+1

(� + S�1

t XTt ✏t)

2

1 + t+1

= Et � Vt +

Yt+1

�XTt+1

b�t�

2

1 + t+1

.

Recalling the definition of the variance estimate, we finally obtain

�2

t+1

=1

t� d+ 1(Et+1

� Vt+1

) =1

t� d+ 1(Et � Vt) +

1

t� d+ 1

Yt+1

�XTt+1

b�t�

2

1 +XTt+1

S�1

t Xt+1

=t� d

t� d+ 1�2

t +1

t� d+ 1

Yt+1

�XTt+1

b�t�

2

1 +XTt+1

S�1

t Xt+1

,

which concludes the proof.
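The recursion (C.8) is easy to check numerically. The sketch below (illustrative only; parameter values are arbitrary) initializes the estimator with the batch formula on the first few samples, applies the recursive update on the remaining ones, and compares against the batch estimate on the full data; the two should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma = 5, 200, 1.5
beta = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)

def batch_sigma2(t):
    b, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
    return np.sum((y[:t] - X[:t] @ b) ** 2) / (t - d)

t0 = d + 1
s2 = batch_sigma2(t0)                     # initialize with the batch estimate
for t in range(t0, n):
    Xt, x_new, y_new = X[:t], X[t], y[t]
    S_inv = np.linalg.inv(Xt.T @ Xt)
    b_t = S_inv @ Xt.T @ y[:t]            # OLS estimate on the first t samples
    resid2 = (x_new @ b_t - y_new) ** 2 / (1 + x_new @ S_inv @ x_new)
    s2 = (t - d) / (t - d + 1) * s2 + resid2 / (t - d + 1)   # recursion (C.8)

print(f"recursive: {s2:.6f}   batch: {batch_sigma2(n):.6f}")  # should match
```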

Lemma 40. Let $\mathcal{F}_j$ be the $\sigma$-algebra generated by $X_1,\ldots,X_n$ and $\hat\sigma^2_1,\ldots,\hat\sigma^2_j$. Then, for any $j\ge d$,

$$\hat\beta_j \mid \mathcal{F}_j \sim \mathcal{N}\big(\beta,\ \sigma^2\big(\mathbf{X}_{1:j}^T\mathbf{X}_{1:j}\big)^{-1}\big). \tag{C.9}$$

Proof. We prove the lemma by induction. The statement is true for $t = d$. We want to prove the inductive step, that is, if $\hat\beta_t\mid\mathcal{F}_t\sim\mathcal{N}(\beta,\sigma^2(\mathbf{X}_t^T\mathbf{X}_t)^{-1})$, then

$$\hat\beta_{t+1}\mid\mathcal{F}_{t+1}\sim\mathcal{N}\big(\beta,\ \sigma^2\big(\mathbf{X}_{t+1}^T\mathbf{X}_{t+1}\big)^{-1}\big). \tag{C.10}$$

Let us first derive a recursive expression for $\hat\beta_{t+1}$. Let $S_t = \mathbf{X}_t^T\mathbf{X}_t$; then

$$\hat\beta_{t+1} = \beta + S_{t+1}^{-1}\mathbf{X}_{t+1}^T\epsilon_{t+1} = \beta + \big(S_t + X_{t+1}X_{t+1}^T\big)^{-1}\big(\mathbf{X}_t^T\epsilon_t + \epsilon_{t+1}X_{t+1}\big) = \beta + \Bigg(S_t^{-1} - \frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}}\Bigg)\big(\mathbf{X}_t^T\epsilon_t + \epsilon_{t+1}X_{t+1}\big),$$

where we used the Sherman–Morrison formula. By developing the previous expression we obtain

$$\hat\beta_{t+1} = \big(\beta + S_t^{-1}\mathbf{X}_t^T\epsilon_t\big) + \epsilon_{t+1}S_t^{-1}X_{t+1}\Bigg(1 - \frac{X_{t+1}^TS_t^{-1}X_{t+1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}}\Bigg) - \frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}\mathbf{X}_t^T\epsilon_t}{1+X_{t+1}^TS_t^{-1}X_{t+1}} = \hat\beta_t + \frac{\epsilon_{t+1}S_t^{-1}X_{t+1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}} - \frac{S_t^{-1}X_{t+1}X_{t+1}^T\big(\hat\beta_t-\beta\big)}{1+X_{t+1}^TS_t^{-1}X_{t+1}}.$$

We can conveniently rewrite the previous expression as

$$\hat\beta_{t+1} - \beta = \Bigg(I - \frac{S_t^{-1}X_{t+1}X_{t+1}^T}{1+X_{t+1}^TS_t^{-1}X_{t+1}}\Bigg)\big(\hat\beta_t-\beta\big) + \frac{\epsilon_{t+1}S_t^{-1}X_{t+1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}} = (I-\alpha_t)\big(\hat\beta_t-\beta\big) + \phi_t\,\epsilon_{t+1}, \tag{C.11}$$

where $\alpha_t\in\mathbb{R}^{d\times d}$ and $\phi_t\in\mathbb{R}^d$ are defined implicitly. By Lemma 39, the sequence of empirical variances in $\mathcal{F}_t$ is equivalent to the sequence of squared deviations up to $t$. In order to make this equivalence more apparent, we define the filtration

$$\mathcal{G}_t = \Big\{\{X_s\}_{s=1}^{n}\ \cup\ \hat\sigma^2_2\ \cup\ \big\{\big(X_{s+1}^T(\hat\beta_s-\beta) - \epsilon_{s+1}\big)^2\big\}_{s=2}^{t-1}\Big\},$$

so that $\hat\beta_{t+1}\mid\mathcal{F}_{t+1}\sim\hat\beta_{t+1}\mid\mathcal{G}_{t+1}$. We introduce two auxiliary random variables conditioned on $\mathcal{G}_t$,

$$U = \epsilon_{t+1} - X_{t+1}^T\big(\hat\beta_t-\beta\big)\ \big|\ \mathcal{G}_t, \qquad V = \hat\beta_{t+1} - \beta\ \big|\ \mathcal{G}_t.$$

We want to show that the random variables $U\in\mathbb{R}$ and $V\in\mathbb{R}^d$ are independent. We first recall that the noise $\epsilon_{t+1}\mid\mathcal{G}_t\sim\mathcal{N}(0,\sigma^2)$, and it is independent of $\epsilon_1,\ldots,\epsilon_t$ and $\hat\beta_t$ under $\mathcal{G}_t$. Furthermore, by the induction assumption, $\hat\beta_t\mid\mathcal{G}_t$ is also Gaussian, so $(\hat\beta_t,\epsilon_{t+1})$ are jointly Gaussian given $\mathcal{G}_t$. Then we can conveniently rewrite $U$ as

$$U = (-X_{t+1},\,1)^T\big(\hat\beta_t,\,\epsilon_{t+1}\big) + X_{t+1}^T\beta,$$

which shows that it is Gaussian. Using the recursive formulation in Eq. C.11 we can also rewrite $V$ as

$$V = (I_d-\alpha_t)\big(\hat\beta_t-\beta\big) + \phi_t\,\epsilon_{t+1} = \big[\,I-\alpha_t\ \ \phi_t\,\big]\begin{bmatrix}\hat\beta_t-\beta\\ \epsilon_{t+1}\end{bmatrix},$$

which is also Gaussian. Furthermore, we notice that under the induction assumption $\mathbb{E}_{\mathcal{G}_t}[U]=0$ and $\mathbb{E}_{\mathcal{G}_t}[V]=0$, and thus we need to show that $\mathbb{E}[UV\mid\mathcal{G}_t]=0$ to prove that $U$ and $V$ are uncorrelated:

$$\mathbb{E}[UV\mid\mathcal{G}_t] = \mathbb{E}_{\mathcal{G}_t}\Big[\big(\epsilon_{t+1} - X_{t+1}^T(\hat\beta_t-\beta)\big)\big((I_d-\alpha_t)(\hat\beta_t-\beta) + \phi_t\,\epsilon_{t+1}\big)\Big]$$
$$= \phi_t\,\mathbb{E}_{\mathcal{G}_t}\big[\epsilon_{t+1}^2\big] - \mathbb{E}_{\mathcal{G}_t}\Big[(I-\alpha_t)(\hat\beta_t-\beta)(\hat\beta_t-\beta)^TX_{t+1}\Big] = \sigma^2\phi_t - \sigma^2(I-\alpha_t)\big(\mathbf{X}_t^T\mathbf{X}_t\big)^{-1}X_{t+1}$$
$$= \sigma^2\,\frac{S_t^{-1}X_{t+1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}} - \sigma^2\Bigg(I - \frac{S_t^{-1}X_{t+1}X_{t+1}^T}{1+X_{t+1}^TS_t^{-1}X_{t+1}}\Bigg)S_t^{-1}X_{t+1}$$
$$= \sigma^2\,\frac{S_t^{-1}X_{t+1} - \big(1+X_{t+1}^TS_t^{-1}X_{t+1}\big)S_t^{-1}X_{t+1} + S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}X_{t+1}}{1+X_{t+1}^TS_t^{-1}X_{t+1}} = 0.$$

It thus follows that, as $U$ and $V$ are jointly Gaussian and uncorrelated, they are also independent. Combining the definition of $\mathcal{G}_t$, $U$, and its independence w.r.t. $V$, we have

$$V\mid\mathcal{G}_{t+1} = V\mid U,\mathcal{G}_t = \big[\,I-\alpha_t\ \ \phi_t\,\big]\begin{bmatrix}\hat\beta_t-\beta\\ \epsilon_{t+1}\end{bmatrix}\ \Bigg|\ \mathcal{G}_t.$$

By the induction hypothesis, the vector in the previous expression is distributed as

$$\begin{bmatrix}\hat\beta_t-\beta\\ \epsilon_{t+1}\end{bmatrix} \sim \mathcal{N}\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\ \sigma^2\begin{bmatrix}S_t^{-1} & 0\\ 0 & 1\end{bmatrix}\Bigg).$$

Therefore, we conclude that

$$V\mid\mathcal{G}_{t+1} \sim \mathcal{N}\Bigg(0,\ \sigma^2\big[\,I-\alpha_t\ \ \phi_t\,\big]\begin{bmatrix}S_t^{-1} & 0\\ 0 & 1\end{bmatrix}\big[\,I-\alpha_t\ \ \phi_t\,\big]^T\Bigg) = \mathcal{N}\big(0,\ \sigma^2\,\Sigma'\big),$$

where the covariance matrix $\Sigma'$ can be written as

$$\Sigma' = \big[\,I-\alpha_t\ \ \phi_t\,\big]\begin{bmatrix}S_t^{-1}(I-\alpha_t)^T\\ \phi_t^T\end{bmatrix} = (I-\alpha_t)S_t^{-1}(I-\alpha_t)^T + \phi_t\phi_t^T.$$

Recalling the definitions of $\alpha_t$ and $\phi_t$, and defining $\kappa_{t+1} = X_{t+1}^TS_t^{-1}X_{t+1}$,

$$\Sigma' = \Bigg(I - \frac{S_t^{-1}X_{t+1}X_{t+1}^T}{1+\kappa_{t+1}}\Bigg)S_t^{-1}\Bigg(I - \frac{S_t^{-1}X_{t+1}X_{t+1}^T}{1+\kappa_{t+1}}\Bigg)^T + \Bigg(\frac{S_t^{-1}X_{t+1}}{1+\kappa_{t+1}}\Bigg)\Bigg(\frac{S_t^{-1}X_{t+1}}{1+\kappa_{t+1}}\Bigg)^T$$
$$= S_t^{-1} - 2\,\frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}}{1+\kappa_{t+1}} + \kappa_{t+1}\,\frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}}{(1+\kappa_{t+1})^2} + \frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}}{(1+\kappa_{t+1})^2}$$
$$= S_t^{-1} - \frac{S_t^{-1}X_{t+1}X_{t+1}^TS_t^{-1}}{1+\kappa_{t+1}} = S_{t+1}^{-1} = \big(\mathbf{X}_{t+1}^T\mathbf{X}_{t+1}\big)^{-1},$$

where we applied the Woodbury matrix identity in the last step. Finally, it follows that

$$\hat\beta_{t+1}\mid\mathcal{F}_{t+1}\sim\mathcal{N}\big(\beta,\ \sigma^2\big(\mathbf{X}_{t+1}^T\mathbf{X}_{t+1}\big)^{-1}\big),$$

and the induction is complete.

Now we can prove Lemma 14:

Lemma. Let $\mathcal{A}$ be a learning algorithm that selects instances $I_t$ as a function of the previous history, that is, $D_{t-1} = \{X_1, I_1, Y_{I_1,1},\ldots,X_{t-1},I_{t-1},Y_{I_{t-1},t-1}\}$, and computes estimates $\hat\beta_{i,n}$ using OLS. Then, its loss after $n$ steps can be expressed as

$$L_n(\mathcal{A}) = \max_{i\in[m]}\ \mathbb{E}_{D_n}\Bigg[\frac{\sigma_i^2}{k_{i,n}}\,\mathrm{Tr}\Big(\Sigma\,\widehat\Sigma^{-1}_{i,n}\Big)\Bigg], \tag{C.12}$$

where $k_{i,n} = \sum_{t=1}^n\mathbb{I}\{I_t = i\}$ and $\widehat\Sigma_{i,n} = \mathbf{X}_{i,n}^T\mathbf{X}_{i,n}/k_{i,n}$.

Proof. For any instance $i$, we can assume that the following random variables are sampled before Trace-UCB starts collecting observations (we omit the $i$ index): for each $t = 1,\ldots,n$, the contexts $X_1,\ldots,X_n$, the noise values $\epsilon_1,\ldots,\epsilon_n$, and the corresponding estimates $\hat\beta_1,\ldots,\hat\beta_n$ and $\hat\sigma^2_1,\ldots,\hat\sigma^2_n$.

As a result, we can interpret Trace-UCB as controlling the stopping time $t_i = k_{i,n}$ for each problem $i$, that is, the total number of samples $k_{i,n}$, leading to the final estimates $\hat\beta_{t_i}$ and $\hat\sigma^2_{t_i}$. In the following we introduce the notation $\mathbf{X}_{1:j}$ for the sample matrix constructed from exactly $j$ samples, unlike $\mathbf{X}_{i,n}$, which is the sample matrix obtained with $k_{i,n}$ samples; so we have $\mathbf{X}_{1:k_{i,n}} = \mathbf{X}_{i,n}$. Crucially, when the errors $\epsilon_j$ are Gaussian, then $\hat\beta_j\mid\mathbf{X}_{1:j}$ and $\hat\sigma^2_j\mid\mathbf{X}_{1:j}$ are independent for any fixed $j$ (note that these random variables have nothing to do with the algorithm's decisions).

Let $\mathcal{F}_j$ be the $\sigma$-algebra generated by $X_1,\ldots,X_n$ and $\hat\sigma^2_1,\ldots,\hat\sigma^2_j$. We recall that from Lemma 40,

$$\hat\beta_j\mid\mathbf{X}_{1:j} = \hat\beta_j\mid\mathcal{F}_j \sim \mathcal{N}\big(\beta,\ \sigma^2\big(\mathbf{X}_{1:j}^T\mathbf{X}_{1:j}\big)^{-1}\big). \tag{C.13}$$

Intuitively, this result says that, given the data $\mathbf{X}_{1:n}$, if we are additionally given all the estimates for the variance $\{\hat\sigma^2_s\}_{s=2}^{j}$ (which obviously depend on $\epsilon_1,\ldots,\epsilon_j$), then the updated distribution for $\hat\beta_j$ does not change at all. This is a crucial property, since Trace-UCB ignores the current context $X_t$ and makes decisions only based on previous contexts and the variance estimates $\{\hat\sigma^2_s\}_{s=2}^{j}$, thus allowing us to proceed and do inference on $\hat\beta_j$ as in the fixed allocation case.

We now need to take into consideration the filtration $\mathcal{F}_{i,j}$ for a specific instance $i$ and the environment filtration $\mathcal{E}_{-i}$ containing all the contexts $X$ and noise $\epsilon$ from all other instances (different from $i$). Since the environment filtration $\mathcal{E}_{-i}$ is independent from the samples from instance $i$, we can still apply Lemma 40 and obtain

$$\hat\beta_{i,j}\mid\mathcal{F}_{i,j},\mathcal{E}_{-i}\ \sim\ \hat\beta_{i,j}\mid\mathcal{F}_{i,j}. \tag{C.14}$$

Now we can finally study the expected prediction error:

$$L_{i,n}(\hat\beta_{i,n}) = \mathbb{E}\big[(\hat\beta_i-\beta_i)(\hat\beta_i-\beta_i)^T\big] = \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Big[\mathbb{E}\big[(\hat\beta_i-\beta_i)(\hat\beta_i-\beta_i)^T\mid\mathbf{X}_{1:n},\varepsilon_{-i}\big]\Big]$$
$$= \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Bigg[\sum_{j=1}^{n}\mathbb{E}\big[(\hat\beta_{k_i}-\beta_i)(\hat\beta_{k_i}-\beta_i)^T\mid\mathbf{X}_{1:n},\varepsilon_{-i},k_i=j\big]\,\mathbb{P}(k_i=j)\Bigg]$$
$$= \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Bigg[\sum_{j=1}^{n}\mathbb{E}\Big[\mathbb{E}_{\mathcal{F}_j}\big[(\hat\beta_j-\beta_i)(\hat\beta_j-\beta_i)^T\mid\mathcal{F}_j,\mathbf{X}_{1:n},\varepsilon_{-i},k_i=j\big]\ \Big|\ \mathbf{X}_{1:n},\varepsilon_{-i},k_i=j\Big]\,\mathbb{P}(k_i=j)\Bigg]$$
$$= \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Bigg[\sum_{j=1}^{n}\mathbb{E}\Big[\mathbb{E}_{\mathcal{F}_j}\big[(\hat\beta_j-\beta_i)(\hat\beta_j-\beta_i)^T\mid\mathcal{F}_j,\mathbf{X}_{1:n}\big]\ \Big|\ \mathbf{X}_{1:n},\varepsilon_{-i},k_i=j\Big]\,\mathbb{P}(k_i=j)\Bigg] \tag{C.15}$$
$$= \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Bigg[\sum_{j=1}^{n}\mathbb{E}\big[\sigma_i^2\big(\mathbf{X}_{1:j}^T\mathbf{X}_{1:j}\big)^{-1}\mid\mathbf{X}_{1:n},k_i=j\big]\,\mathbb{P}(k_i=j)\Bigg] = \mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Bigg[\sum_{j=1}^{n}\sigma_i^2\big(\mathbf{X}_{1:j}^T\mathbf{X}_{1:j}\big)^{-1}\,\mathbb{P}(k_i=j)\Bigg]$$
$$= \sigma_i^2\,\mathbb{E}_{\mathbf{X}_{1:n},\varepsilon_{-i}}\Big[\mathbb{E}_{k_i}\big[\big(\mathbf{X}_{1:k_i}^T\mathbf{X}_{1:k_i}\big)^{-1}\big]\Big] = \sigma_i^2\,\mathbb{E}\Big[\big(\mathbf{X}_{1:k_i}^T\mathbf{X}_{1:k_i}\big)^{-1}\Big],$$

where in Eq. C.15 we applied Lemma 40. Hence, going back to the definition of loss (see, e.g., Eq. C.4), we obtain an expression for the loss which applies under Trace-UCB (while not in general for other algorithms):

$$L_n(\mathcal{A}) = \max_i\ \mathbb{E}\Big[\sigma_i^2\,\mathrm{Tr}\big(\Sigma(\mathbf{X}_{i,n}^T\mathbf{X}_{i,n})^{-1}\big)\Big] = \max_i\ \mathbb{E}\Bigg[\frac{\sigma_i^2}{k_{i,n}}\,\mathrm{Tr}\Big(\Sigma\,\widehat\Sigma^{-1}_{i,n}\Big)\Bigg].$$

C.3 Concentration Inequalities (Proofs of Propositions 15 and 16)

In the next two subsections, we prove Propositions 15 and 16, respectively. In addition, we also show a confidence ellipsoid result for the $\hat\beta$ estimates, and a concentration inequality for the norm of the observations $X_t$.


C.3.1 Concentration Inequality for the Variance (Proof of Proposition 15)

We use the following concentration inequality for sub-exponential random variables.

Proposition 41. Let $X$ be a mean-zero $(\tau^2,b)$-subexponential random variable. Then, for all $\eta > 0$,

$$\mathbb{P}(|X|\ge\eta)\ \le\ \exp\Bigg(-\frac{1}{2}\min\Bigg(\frac{\eta^2}{\tau^2},\ \frac{\eta}{b}\Bigg)\Bigg). \tag{C.16}$$

Proof. See Proposition 2.2 in [93].

We first prove the concentration inequality for a single instance.

Proposition 42. Let $t > d$, let $\mathbf{X}_t\in\mathbb{R}^{t\times d}$ be a random matrix whose entries are independent standard normal random variables, let $\mathbf{Y}_t = \mathbf{X}_t\beta + \epsilon_t$, where the noise $\epsilon_t\sim\mathcal{N}(0,\sigma^2 I)$ is independent of $\mathbf{X}_t$, and let $\delta\in(0,3/4]$. Then, with probability at least $1-\delta$, we have

$$\big|\hat\sigma^2_t - \sigma^2\big| \le \sigma^2\sqrt{\frac{64}{t-d}\log^2\frac{1}{\delta}}, \tag{C.17}$$

where $\hat\sigma^2_t$ is the unbiased estimate $\hat\sigma^2_t = \frac{1}{t-d}\|\mathbf{Y}_t - \mathbf{X}_t\hat\beta_t\|^2$ and $\hat\beta_t$ is the OLS estimator of $\beta$, given $\mathbf{X}_t$ and $\mathbf{Y}_t$.

Proof. First note that the distribution of $\hat\sigma^2_t$ conditioned on $\mathbf{X}_t$ follows a scaled chi-squared distribution, i.e.,

$$\hat\sigma^2_t\mid\mathbf{X}_t\ \sim\ \frac{\sigma^2}{t-d}\,\chi^2_{t-d}.$$

Also note that the distribution of the estimate does not depend on $\mathbf{X}_t$, and we can integrate out the randomness in $\mathbf{X}_t$. In order to show concentration around the mean, we directly use the sub-exponential properties of $\hat\sigma^2_t$. The $\chi^2_k$ distribution is sub-exponential with parameters $(4k, 4)$.¹ Furthermore, we know that for any constant $C>0$, the random variable $C\chi^2_k$ is $(4C^2k, 4C)$-sub-exponential. As a result, we have that $\hat\sigma^2_t$ is subexponential with parameters

$$(\tau^2, b) = \Bigg(\frac{4\sigma^4}{t-d},\ \frac{4\sigma^2}{t-d}\Bigg).$$

Now we use Proposition 41 as our concentration bound. In our case, $\eta^2/\tau^2 < \eta/b$ when $\eta < \sigma^2$. In such a case, if we denote the RHS of (C.16) by $\delta$, we conclude that

$$\eta = \sigma^2\sqrt{\frac{8}{t-d}\log\frac{1}{\delta}}.$$

Then, $\eta < \sigma^2$ holds when $t \ge d + 8\log(1/\delta)$. Otherwise, if $\eta^2/\tau^2 > \eta/b$, by Eq. C.16 we have

$$\eta = \frac{8\sigma^2}{t-d}\log\frac{1}{\delta}.$$

In this case, when $t < d + 8\log(1/\delta)$, we have that

$$\big|\hat\sigma^2_t-\sigma^2\big| \le \sigma^2\,\frac{8}{t-d}\log\frac{1}{\delta}.$$

We would like to derive a bound that is valid in both cases. Let $x = 8\log(1/\delta)/(t-d)$; then we have

$$\mathbb{P}\Big(\big|\hat\sigma^2_t-\sigma^2\big| \ge \sigma^2\max\big(x,\sqrt{x}\big)\Big) \le \delta. \tag{C.18}$$

Suppose $x \ge \sqrt{x}$, so $t < d + 8\log(1/\delta)$. Then, we would like to find $C$ such that $x \le C\sqrt{x}$. As $t \ge d+1$, we see that

$$\sqrt{x} = \sqrt{\frac{8\log(1/\delta)}{t-d}} \le \sqrt{8\log(1/\delta)} =: C.$$

If $C > 1$, it does follow that $\max(x,\sqrt{x}) \le \max(C\sqrt{x},\sqrt{x}) \le \sqrt{8\log(1/\delta)}\,\sqrt{x}$, which corresponds to $\delta < 0.88$. By (C.18), we now conclude that

$$\mathbb{P}\Bigg(\big|\hat\sigma^2_t-\sigma^2\big| \ge \sigma^2\sqrt{\frac{64}{t-d}\log^2\frac{1}{\delta}}\Bigg) \le \delta,$$

and the proof is complete.

¹See Example 2.5 in [93].

In order to prove Proposition 15, we are just left to apply a union bound over steps $t\in\{1,\ldots,n\}$ and instances $i\in\{1,\ldots,m\}$. In order to avoid confusion, let $\hat\sigma_{i,t}$ be the estimate obtained by the algorithm after $t$ steps and $\hat\sigma_i(j)$ the estimate obtained using $j$ samples. Let $j > d$; then

$$E_i(j) = \Bigg\{\big|\hat\sigma^2_i(j) - \sigma^2_i\big| \le \sigma^2_i\sqrt{\frac{64}{j-d}\log^2\frac{1}{\delta}}\Bigg\}$$

is the high-probability event introduced in Proposition 42, which holds with probability at least $1-\delta$. Then we have that the event

$$E = \bigcap_{i=1}^{m}\bigcap_{j=1}^{n} E_i(j)$$

holds with probability at least $1-\delta'$, with $\delta' = mn\delta$. We complete the proof of Proposition 15 by properly tuning $\delta$ and taking $R \ge \max_i\sigma^2_i$. Recall that Proposition 15 is as follows.

Proposition. Let the number of pulls $k_{i,t} \ge d+1$ and $R \ge \max_i\sigma^2_i$. If $\delta\in(0,3/4)$, then for any instance $i$ and step $t > m(d+1)$, with probability at least $1-\delta/2$, we have

$$\big|\hat\sigma^2_{i,t} - \sigma^2_i\big| \le \Delta_{i,t} = R\sqrt{\frac{64}{k_{i,t}-d}\log^2\frac{2mn}{\delta}}. \tag{C.19}$$

C.3.2 Concentration Inequality for the Trace (Proof of Proposition 16)

We first recall some basic definitions. For any matrix $A\in\mathbb{R}^{n\times d}$, the $i$-th singular value $s_i(A)$ satisfies $s_i(A)^2 = \lambda_i(A^TA)$, where $\lambda_i$ is the $i$-th eigenvalue. The smallest and largest singular values $s_{\min}$ and $s_{\max}$ satisfy

$$s_{\min}\,\|x\|_2 \le \|Ax\|_2 \le s_{\max}\,\|x\|_2 \quad\text{for all } x\in\mathbb{R}^d.$$

The extreme singular values measure the maximum and minimum distortion of points and their distances when going from $\mathbb{R}^d$ to $\mathbb{R}^n$ via the linear operator $A$. We also recall that the spectral norm of $A$ is given by

$$\|A\| = \sup_{x\in\mathbb{R}^d\setminus\{0\}}\frac{\|Ax\|_2}{\|x\|_2} = \sup_{x\in\mathcal{S}^{d-1}}\|Ax\|_2,$$

and thus $s_{\max}(A) = \|A\|$ and $s_{\min}(A) = 1/\|A^{-1}\|$ if $A$ is invertible.

We report the following concentration inequality for the eigenvalues of random Gaussian matrices.

Proposition 43. Let $n \ge d$, let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be a random matrix whose entries are independent standard normal random variables, and let $\widehat\Sigma = \mathbf{X}^T\mathbf{X}/n$ be the corresponding empirical covariance matrix. Let $\alpha > 0$; then with probability at least $1-2\exp(-\alpha^2 d/2)$, we have

$$\mathrm{Tr}\big(\widehat\Sigma^{-1}\big) \ge d\Bigg(1 - \frac{2(1+\alpha)\sqrt{d} + (1+\alpha)^2 d/\sqrt{n}}{\sqrt{n} + 2(1+\alpha)\sqrt{d} + (1+\alpha)^2 d/\sqrt{n}}\Bigg),$$

and

$$\mathrm{Tr}\big(\widehat\Sigma^{-1}\big) \le d\Bigg(1 + \frac{2(1+\alpha)\sqrt{d} - (1+\alpha)^2 d/\sqrt{n}}{\sqrt{n} - 2(1+\alpha)\sqrt{d} + (1+\alpha)^2 d/\sqrt{n}}\Bigg).$$

In particular, we have

$$d\Bigg(1 - (1+\alpha)\sqrt{\frac{d}{n}}\Bigg)^2 \;\le\; \mathrm{Tr}\big(\widehat\Sigma^{-1}\big) \;\le\; d\Bigg(1 + 2(1+\alpha)\sqrt{\frac{d}{n}}\Bigg)^2.$$

matrix and then we invert it to obtain the guarantee for the inverse matrix. From Corollary 5.35

Page 129: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

APPENDIX C. PROOFS CHAPTER 3 116

in [91], we have that for any t > 0

⇣pn�

pd� t

2

�min

(XTX) = s

min

(X)2 smax

(X)2 = �max

(XTX)

⇣pn+

pd+ t

2

,

(C.20)

with probability at least 1 � 2 exp(�t2/2). Let ↵ > 0 and take t = ↵pd, then with probability at

least 1� 2 exp(�↵2d/2), we obtain the desired statement

1� (1 + ↵)

r

d

n

!

2

�min

⌃� �

max

⌃�

1 + (1 + ↵)

r

d

n

!

2

.

We now proceed by studying the eigenvalues of the inverse of the empirical covariance matrix

�min

(⌃�1

) = 1/�max

(⌃) and �max

(⌃�1

) = 1/�min

(⌃). Combined with Eq. C.20 we have

�min

⌃�1

� 1✓

1 + (1 + ↵)q

dn

2

=1

1 + 2(1 + ↵)q

dn + (1 + ↵)2 d

n

= 1�2(1 + ↵)

q

dn + (1 + ↵)2 d

n

1 + 2(1 + ↵)q

dn + (1 + ↵)2 d

n

.

Similarly, we have that

�max

⌃�1

1✓

1� (1 + ↵)q

dn

2

=1

1� 2(1 + ↵)q

dn + (1 + ↵)2 d

n

= 1 +2(1 + ↵)

q

dn � (1 + ↵)2 d

n

1� 2(1 + ↵)q

dn + (1 + ↵)2 d

n

.

Using the fact that for any matrix A 2 Rd⇥d, we may write d �min

(A) Tr(A) d �max

(A), we

obtain the final statement on the trace of ⌃�1

. The first of the two bounds can be further simplified

by using 1/(1 + x) � 1� x for any x � 0, thus obtaining

�min

⌃�1

� �⇣

1� (1 + ↵)

r

d

n

2

.

While under the assumption that n � 4(1+↵)2d we can use 1/(1�x) 1+2x (for any 0 x 1/2)

Page 130: ONLINE ACTIVE LEARNING WITH LINEAR MODELS ADISSERTATION - Stanford …rp382fv8012/Carlo... ·  · 2017-06-07My department at Stanford, the Institute for Computational and Mathematical

APPENDIX C. PROOFS CHAPTER 3 117

and obtain

�max

⌃�1

� ⇣

1 + 2(1 + ↵)

r

d

n

2

.

The statement of Proposition 16 (below) is obtained by recalling that $\Sigma\widehat\Sigma^{-1}_{i,n}$ is the empirical covariance matrix of the whitened sample matrix $\bar{\mathbf{X}}_{i,n}$, and by a union bound over the number of samples $k_{i,n}$ and the number of instances $i$.

Proposition. Force the number of samples $k_{i,t}\ge d+1$. If $\delta\in(0,1)$, for any $i\in[m]$ and step $t>m(d+1)$, with probability at least $1-\delta/2$, we have

$$d\Bigg(1 - C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{i,t}}}\Bigg)^2 \;\le\; \mathrm{Tr}\Big(\Sigma\,\widehat\Sigma^{-1}_{i,t}\Big) \;\le\; d\Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{i,t}}}\Bigg)^2,$$

with $C_{\mathrm{Tr}} = 1 + \sqrt{2\log(4nm/\delta)/d}$.
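As a sanity check (illustrative only), the two-sided trace bound of Proposition 43 can be compared against simulation. The sketch below uses $\alpha = 1$ and a few hundred draws of the empirical covariance of white Gaussian data; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, alpha, trials = 500, 10, 1.0, 300

traces = []
for _ in range(trials):
    X = rng.standard_normal((n, d))
    traces.append(np.trace(np.linalg.inv(X.T @ X / n)))

lo = d * (1 - (1 + alpha) * np.sqrt(d / n)) ** 2
hi = d * (1 + 2 * (1 + alpha) * np.sqrt(d / n)) ** 2
inside = np.mean([(lo <= t <= hi) for t in traces])
print(f"empirical coverage of Tr within [{lo:.2f}, {hi:.2f}]: {inside:.3f}")
# Proposition 43 guarantees coverage at least 1 - 2*exp(-alpha^2 * d / 2), about 0.987 here
```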

C.3.3 Concentration Inequality for the $\hat\beta$ Estimates

We slightly modify Theorem 2 from [1] to obtain a confidence ellipsoid over the $\hat\beta_i$'s.

Theorem 44. Let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration. Let $\{\eta_t\}_{t=1}^{\infty}$ be a real-valued stochastic process such that $\eta_t$ is $\mathcal{F}_t$-measurable and $\eta_t$ is conditionally $R$-subgaussian for some $R\ge0$, i.e.,

$$\forall\lambda\in\mathbb{R}\qquad \mathbb{E}\big[e^{\lambda\eta_t}\mid\mathcal{F}_{t-1}\big] \le \exp\Bigg(\frac{\lambda^2R^2}{2}\Bigg). \tag{C.21}$$

Let $\{X_t\}_{t=1}^{\infty}$ be an $\mathbb{R}^d$-valued stochastic process such that $X_t$ is $\mathcal{F}_{t-1}$-measurable. Assume that $V$ is a $d\times d$ positive definite matrix. For any $t\ge0$, define

$$\bar V_t = V + \sum_{s=1}^{t}X_sX_s^T, \qquad S_t = \sum_{s=1}^{t}\eta_sX_s. \tag{C.22}$$

Let $V = \lambda I_d$, $\lambda>0$, and define $Y_t = X_t^T\beta^* + \eta_t$. Assume that $\|\beta^*\|_2\le S$. Also, let $\hat\beta_t = \bar V_t^{-1}\mathbf{X}_t^T\mathbf{Y}_t$ be the ridge estimate for $\beta$ after $t$ observations $\mathbf{X}_t,\mathbf{Y}_t$. Then, for any $\delta>0$, with probability at least $1-\delta$, for all $t\ge0$, $\beta^*$ lies in

$$C_t = \Bigg\{\beta\in\mathbb{R}^d : \big\|\hat\beta_t-\beta\big\|_{\bar V_t/t} \le \frac{R}{\sqrt{t}}\sqrt{2\log\Bigg(\frac{\det\big(\bar V_t\big)^{1/2}\det(\lambda I)^{-1/2}}{\delta}\Bigg)} + \sqrt{\frac{\lambda}{t}}\,S\Bigg\}. \tag{C.23}$$

Proof. Take $x = \frac{\bar V_t}{t}\big(\hat\beta_t-\beta^*\big)$ in equation 5 in the proof of Theorem 2 in [1].


We use the previous theorem by lower bounding the $\bar V_t/t$ norm by the $\Sigma$ norm.

C.3.4 Bounded Norm Lemma

Lemma 45. Let $X_1,\ldots,X_t\in\mathbb{R}^d$ be iid subgaussian random variables. If $\|X_1\|^2$ is subexponential with parameters $(a^2,b)$, then, for $\alpha>0$,

$$\mathbb{P}\Bigg(\frac{1}{t}\sum_{j=1}^{t}\|X_j\|^2 \le \mathbb{E}\big[\|X_1\|^2\big] + \frac{\alpha}{t}\Bigg) \ge \begin{cases} 1 - \exp\Big(-\dfrac{\alpha^2}{2ta^2}\Big) & \text{if } 0\le\alpha\le ta^2/b,\\[2mm] 1 - \exp\Big(-\dfrac{\alpha}{2b}\Big) & \text{if } \alpha> ta^2/b.\end{cases} \tag{C.24}$$

Proof. The proof directly follows from Proposition 41, by defining the zero-mean subexponential random variable $Z$ with parameters $(a^2/t, b/t)$,

$$Z = \frac{1}{t}\sum_{j=1}^{t}\|X_j\|^2 - \mathbb{E}\Bigg[\frac{1}{t}\sum_{j=1}^{t}\|X_j\|^2\Bigg]. \tag{C.25}$$

Corollary 46. Let $X_1,\ldots,X_t\in\mathbb{R}^d$ be iid Gaussian random variables, $X\sim\mathcal{N}(0,I_d)$. Assume $t\ge d+1$, and let $\delta>0$. Then, with probability at least $1-\delta$,

$$\frac{1}{t}\sum_{j=1}^{t}\|X_j\|^2 \le d + 8\log\Big(\frac{1}{\delta}\Big)\sqrt{\frac{d}{t}}. \tag{C.26}$$

Proof. For standard Gaussian $X\sim\mathcal{N}(0,I_d)$, $\|X\|^2\sim\chi^2_d$, and $a^2=4d$ and $b=4$. Note that $\mathbb{E}[\|X_j\|^2]=d$. By the proof of Lemma 45 and (C.25),

$$\mathbb{P}\Bigg(|Z| \ge a\sqrt{\frac{2}{t}\log\frac{1}{\delta}}\Bigg) \le \delta \quad\text{when } t \ge 2\Big(\frac{b}{a}\Big)^2\log\frac{1}{\delta}, \tag{C.27}$$
$$\mathbb{P}\Bigg(|Z| \ge \frac{2b}{t}\log\frac{1}{\delta}\Bigg) \le \delta \quad\text{when } t < 2\Big(\frac{b}{a}\Big)^2\log\frac{1}{\delta}. \tag{C.28}$$

Substituting $a=2\sqrt{d}$ and $b=4$ leads to

$$\mathbb{P}\Bigg(|Z| \ge \sqrt{\frac{8d}{t}\log\frac{1}{\delta}}\Bigg) \le \delta \quad\text{when } t \ge \frac{8}{d}\log\frac{1}{\delta}, \tag{C.29}$$
$$\mathbb{P}\Bigg(|Z| \ge \frac{8}{t}\log\frac{1}{\delta}\Bigg) \le \delta \quad\text{when } t < \frac{8}{d}\log\frac{1}{\delta}. \tag{C.30}$$

We would like to upper bound $8\log(1/\delta)/t$ in (C.30). As $t>d$, we see

$$\frac{8}{t}\log\frac{1}{\delta} \le \frac{8}{\sqrt{dt}}\log\frac{1}{\delta}. \tag{C.31}$$

As a consequence,

$$\mathbb{P}\Bigg(|Z| \ge \frac{8}{\sqrt{dt}}\log\frac{1}{\delta}\Bigg) \le \delta \quad\text{when } t < \frac{8}{d}\log\frac{1}{\delta}. \tag{C.32}$$

It follows that for all $t>d$,

$$\mathbb{P}\Bigg(|Z| \ge \max\Bigg(\frac{8}{\sqrt{dt}}\log\frac{1}{\delta},\ \sqrt{\frac{8d}{t}\log\frac{1}{\delta}}\Bigg)\Bigg) \le \delta. \tag{C.33}$$

As $\delta<1$, we finally conclude that

$$\mathbb{P}\Bigg(|Z| \ge 8\sqrt{\frac{d}{t}}\log\frac{1}{\delta}\Bigg) \le \delta. \tag{C.34}$$

Therefore, with probability at least $1-\delta$,

$$\frac{1}{t}\sum_{j=1}^{t}\|X_j\|^2 \le d + 8\log\Big(\frac{1}{\delta}\Big)\sqrt{\frac{d}{t}}, \tag{C.35}$$

as stated in the corollary.

C.4 Performance Guarantees for Trace-UCB

C.4.1 Lower Bound on Number of Samples (Proof of Theorem 17)

We derive the high-probability guarantee on the number of times each instance is selected.

Theorem. Let $\delta>0$. With probability at least $1-\delta$, the total number of contexts that Trace-UCB allocates to each problem instance $i$ after $n$ rounds satisfies

$$k_{i,n} \;\ge\; k^*_{i,n} - \frac{C_\delta + 8C_{\mathrm{Tr}}}{\sigma^2_{\min}}\sqrt{\frac{nd}{\lambda_{\min}}} - \Omega(n^{1/4}), \tag{C.36}$$

where $R\ge\sigma^2_{\max}$ is known by the algorithm, and we defined $C_\delta = 16R\log(2mn/\delta)$, $C_{\mathrm{Tr}} = 1+\sqrt{2\log(4nm/\delta)/d}$, and $\lambda_{\min} = \sigma^2_{\min}/\sum_j\sigma^2_j$.

Proof. We denote by $E_\delta$ the joint event on which Proposition 15 and Proposition 16 hold at the same time, with overall probability at least $1-\delta$. This immediately gives upper and lower confidence bounds on the score $s_{i,t}$ used in Trace-UCB:

$$\Bigg(1 - C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{i,t}}}\Bigg)^2\frac{\sigma^2_i}{k_{i,t}} \;\le\; \frac{s_{i,t}}{d} \;\le\; \Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{i,t}}}\Bigg)^2\frac{\sigma^2_i + 2\Delta_{i,t}}{k_{i,t}}.$$

Recalling the definition of $\Delta_{i,t}$, we can rewrite the last term as

$$\frac{\sigma^2_i + 2\Delta_{i,t}}{k_{i,t}} = \Bigg(1 + \frac{16R\log(2mn/\delta)}{\sigma^2_i\sqrt{k_{i,t}-d}}\Bigg)\frac{\sigma^2_i}{k_{i,t}} = \Bigg(1 + \frac{C_\delta}{\sigma^2_i\sqrt{k_{i,t}-d}}\Bigg)\frac{\sigma^2_i}{k_{i,t}},$$

where $C_\delta = 16R\log(2mn/\delta)$. We consider a step $t+1\le n$ at which $I_{t+1}=q$. By algorithmic construction we have that $s_{p,t}\le s_{q,t}$ for every arm $p\in[m]$. Using the inequalities above, we obtain

$$\Bigg(1 - C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{p,t}}}\Bigg)^2\frac{\sigma^2_p}{k_{p,t}} \;\le\; \frac{s_{p,t}}{d} \;\le\; \frac{s_{q,t}}{d} \;\le\; \Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{q,t}}}\Bigg)^2\frac{\sigma^2_q + 2\Delta_{q,t}}{k_{q,t}}.$$

If $t+1$ is the last time step at which arm $q$ is pulled, then $k_{q,t} = k_{q,t+1}-1 = k_{q,n}-1$ and $k_{p,n}\ge k_{p,t}$. Then we can rewrite the previous inequality as

$$\Bigg(1 - C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{p,n}}}\Bigg)^2\frac{\sigma^2_p}{k_{p,n}} =: A_{p,n} \;\le\; B_{q,n} := \Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{q,n}-1}}\Bigg)^2\Bigg(1 + \frac{C_\delta}{\sigma^2_q\sqrt{k_{q,n}-d-1}}\Bigg)\frac{\sigma^2_q}{k_{q,n}-1}. \tag{C.37}$$

If every arm is pulled exactly the optimal number of times, then for any $i\in[m]$, $k_{i,n}=k^*_{i,n}$ and the statement of the theorem trivially holds. Otherwise, there exists at least one arm that is pulled more than $k^*_{i,n}$ times. Let $q$ be this arm, so $k_{q,n}>k^*_{q,n}$. We recall that $L^*_n = d\sigma^2_q/(k^*_{q,n}-d-1)$, and we rewrite the RHS of Eq. C.37 as

$$B_{q,n} \le \Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k^*_{q,n}-d-1}}\Bigg)^2\Bigg(1 + \frac{C_\delta}{\sigma^2_q\sqrt{k^*_{q,n}-d-1}}\Bigg)\frac{\sigma^2_q}{k^*_{q,n}-d-1} \le \Bigg(1 + 2C_{\mathrm{Tr}}\sqrt{\frac{L^*_n}{\sigma^2_q}}\Bigg)^2\Bigg(1 + C_\delta\sqrt{\frac{L^*_n}{d\sigma^6_q}}\Bigg)\frac{L^*_n}{d}.$$

We also simplify the LHS of Eq. C.37 as

$$A_{p,n} = \Bigg(1 - 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{p,n}}} + C^2_{\mathrm{Tr}}\frac{d}{k_{p,n}}\Bigg)\frac{\sigma^2_p}{k_{p,n}} \ge \Bigg(1 - 2C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{p,n}}}\Bigg)\frac{\sigma^2_p}{k_{p,n}}.$$

At this point we can solve Eq. C.37 for $k_{p,n}$ and obtain a lower bound on it. We study the inequality $1/A_{p,n}\ge1/B_{q,n}$. We first notice that

$$\frac{1}{A_{p,n}} \le \frac{k_{p,n}}{\sigma^2_p}\Bigg(1 + 4C_{\mathrm{Tr}}\sqrt{\frac{d}{k_{p,n}}}\Bigg) \le \frac{1}{\sigma^2_p}\Big(\sqrt{k_{p,n}} + 2C_{\mathrm{Tr}}\sqrt{d}\Big)^2,$$

where we used $1/(1-x)\le1+2x$ for $x\le1/2$, and we added a suitable positive term to obtain the final quadratic form. Similarly, we have

$$\frac{1}{B_{q,n}} \ge \Bigg(1 - 2C_{\mathrm{Tr}}\sqrt{\frac{L^*_n}{\sigma^2_q}}\Bigg)^2\Bigg(1 - C_\delta\sqrt{\frac{L^*_n}{d\sigma^6_q}}\Bigg)\frac{d}{L^*_n} = \Bigg(1 - 2C_{\mathrm{Tr}}\sqrt{\frac{L^*_n}{\sigma^2_q}}\Bigg)^2\Bigg(\frac{d}{L^*_n} - C_\delta\sqrt{\frac{d}{L^*_n\sigma^6_q}}\Bigg),$$

where we used $1/(1+x)\ge1-x$ for any $x\ge0$. In order to ease the derivation of an explicit lower bound on $k_{p,n}$, we further simplify the previous expression by replacing higher order terms with big-$\Omega$ notation. We first recall that $L^*_n = \widetilde\Theta(md\bar\sigma^2/n)$; then the terms of order $1/L^*_n$ and $1/\sqrt{L^*_n}$ clearly dominate the expression, while all other terms are asymptotically constant or decreasing in $n$, and thus we can rewrite the previous bound as

$$\frac{1}{B_{q,n}} \ge \frac{d}{L^*_n} - \big(C_\delta + 4C_{\mathrm{Tr}}\sqrt{d}\big)\sqrt{\frac{d}{L^*_n\sigma^6_q}} - \Omega(1).$$

By setting $C = C_\delta + 4C_{\mathrm{Tr}}\sqrt{d}$, we can finally use the upper bound on $1/A_{p,n}$ and the lower bound on $1/B_{q,n}$ to obtain

$$\frac{1}{\sigma^2_p}\Big(\sqrt{k_{p,n}} + 2C_{\mathrm{Tr}}\sqrt{d}\Big)^2 \;\ge\; \frac{d}{L^*_n} - C\sqrt{\frac{d}{L^*_n\sigma^6_q}} - \Omega(1).$$

We proceed with solving the previous inequality for $k_{p,n}$ and obtain

$$k_{p,n} \ge \sigma^2_p\Bigg(\Bigg(\frac{d}{L^*_n} - C\sqrt{\frac{d}{L^*_n\sigma^6_q}} - \Omega(1)\Bigg)^{1/2} - 2C_{\mathrm{Tr}}\sqrt{d}\Bigg)^2.$$

Taking the square on the RHS and adding and subtracting $d+1$, we have

$$k_{p,n} \ge d + 1 + \sigma^2_p\Bigg(\frac{d}{L^*_n} - C\sqrt{\frac{d}{L^*_n\sigma^6_q}} - 4C_{\mathrm{Tr}}\sqrt{d}\Bigg(\frac{d}{L^*_n} - C\sqrt{\frac{d}{L^*_n\sigma^6_q}} - \Omega(1)\Bigg)^{1/2} + 4C^2_{\mathrm{Tr}}d\Bigg) - d - 1 - \Omega(1).$$

We clearly notice that the first three terms in the RHS are dominant (they are higher order functions of $n$ through $L^*_n$), and thus we can isolate them and replace all other terms by their asymptotic lower bound as

$$k_{p,n} \ge d + 1 + \frac{d\sigma^2_p}{L^*_n} - \sqrt{\frac{1}{L^*_n}}\Bigg(C\sqrt{\frac{d\sigma^4_p}{\sigma^6_q}} + 4C_{\mathrm{Tr}}d\Bigg) - \Omega(n^{1/4}),$$

where we used the fact that $L^*_n = \widetilde\Theta(md\bar\sigma^2/n)$ to bound the higher order terms. Furthermore, we recall that $k^*_{p,n} = d\sigma^2_p/L^*_n + d + 1$, and thus we can finally write the previous bound as

$$k_{p,n} \ge k^*_{p,n} - \sqrt{\frac{1}{L^*_n}}\Bigg(C\sqrt{\frac{d\sigma^4_p}{\sigma^6_q}} + 4C_{\mathrm{Tr}}d\Bigg) - \Omega(n^{1/4}).$$

The final bound is obtained by using $\sigma^2_p/\sum_j\sigma^2_j = \lambda_p \ge \lambda_{\min}$ and $\sigma^2_q\ge\sigma^2_{\min}$, with the final expression

$$k_{p,n} \ge k^*_{p,n} - \sqrt{n}\Bigg(\frac{C}{\sigma^2_{\min}}\sqrt{\frac{1}{\lambda_{\min}}} + 4C_{\mathrm{Tr}}\sqrt{d}\Bigg) - \Omega(n^{1/4}).$$

A quite loose bound based on the definition of $C$ gives the final, more readable result

$$k_{p,n} \ge k^*_{p,n} - \frac{C_\delta + 8C_{\mathrm{Tr}}}{\sigma^2_{\min}}\sqrt{\frac{nd}{\lambda_{\min}}} - \Omega(n^{1/4}).$$
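For intuition, here is a minimal sketch of the kind of optimistic allocation loop analyzed above. It is not the thesis implementation: the per-instance score is taken to be an inflated variance estimate times the trace of the inverse empirical covariance divided by the number of pulls, contexts are assumed white, and the confidence width and all constants are arbitrary simplifications.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d, n = 3, 4, 400
sigma = np.array([0.5, 1.0, 2.0])                 # unknown noise std per instance
betas = rng.standard_normal((m, d))

X = [np.empty((0, d)) for _ in range(m)]
y = [np.empty(0) for _ in range(m)]

def pull(i):
    x = rng.standard_normal(d)                    # white context drawn after choosing i
    X[i] = np.vstack([X[i], x])
    y[i] = np.append(y[i], x @ betas[i] + sigma[i] * rng.standard_normal())

for i in range(m):                                # initialization: d+1 pulls per instance
    for _ in range(d + 1):
        pull(i)

for t in range(m * (d + 1), n):
    scores = []
    for i in range(m):
        k = len(y[i])
        b, *_ = np.linalg.lstsq(X[i], y[i], rcond=None)
        s2 = np.sum((y[i] - X[i] @ b) ** 2) / (k - d)         # noise variance estimate
        bonus = np.sqrt(64.0 / (k - d)) * np.log(2 * m * n)   # simple confidence width
        trace = np.trace(np.linalg.inv(X[i].T @ X[i] / k))    # Tr of inverse empirical cov
        scores.append((s2 + bonus) * trace / k)               # optimistic per-instance score
    pull(int(np.argmax(scores)))                              # sample the neediest instance

print("final allocation:", [len(y[i]) for i in range(m)])
print("sigma_i^2 / sum * n:", np.round(sigma ** 2 / (sigma ** 2).sum() * n, 1))
```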

C.4.2 Regret Bound (Proof of Theorem 18)

Theorem. The regret of the Trace-UCB algorithm, i.e., the difference between its loss and the loss of the optimal static allocation (see Eq. 3.8), is upper bounded by

$$L_n(\mathcal{A}) - L^*_n \;\le\; O\Bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}\,n}\Big)^{3/2}\Bigg), \tag{C.38}$$

where $\lambda_{\min} = \sigma^2_{\min}/\sum_j\sigma^2_j$.

Proof. We first simplify the expression of the loss for Trace-UCB in Lemma 50. Exchanging the trace operator and the expectation, we have

$$L_{i,n}(\hat\beta^\lambda_i) = \mathbb{E}\Big[\mathrm{Tr}\big(\Sigma W_{i,n}\big(\sigma^2_i\,\mathbf{X}_{i,n}^T\mathbf{X}_{i,n} + \lambda^2\beta_i\beta_i^T\big)W_{i,n}^T\big)\Big].$$

We notice that $W_{i,n} = \big(\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}+\lambda I\big)^{-1} \preceq \big(\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}\big)^{-1}$, where $\preceq$ is the Loewner ordering between positive-definite matrices. We focus on the two additive terms inside the trace separately. We have

$$\mathrm{Tr}\big(\Sigma W_{i,n}\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}W_{i,n}^T\big) = \mathrm{Tr}\big(W_{i,n}\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}W_{i,n}^T\Sigma\big) \le \mathrm{Tr}\big((\mathbf{X}_{i,n}^T\mathbf{X}_{i,n})^{-1}\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}W_{i,n}^T\Sigma\big) = \mathrm{Tr}\big(\Sigma W_{i,n}^T\big) \tag{C.39}$$
$$\le \mathrm{Tr}\big(\Sigma(\mathbf{X}_{i,n}^T\mathbf{X}_{i,n})^{-1}\big) = \frac{1}{k_{i,n}}\,\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big),$$

where we used the facts that $\mathrm{Tr}(AB)=\mathrm{Tr}(BA)$, that $\mathrm{Tr}(AB)\le\mathrm{Tr}(CB)$ if $A\preceq C$, and the definition of $\widehat\Sigma_{i,n}$. Similarly, we have

$$\mathrm{Tr}\big(\Sigma W_{i,n}\beta_i\beta_i^TW_{i,n}^T\big) \le \|\beta_i\|^2\,\mathrm{Tr}\big(\Sigma W_{i,n}W_{i,n}^T\big) \le \|\beta_i\|^2\,\mathrm{Tr}\big((\mathbf{X}_{i,n}^T\mathbf{X}_{i,n})^{-1}\Sigma W_{i,n}\big) \le \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\,\mathrm{Tr}\big(\Sigma W_{i,n}\big)$$
$$\le \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\,\mathrm{Tr}\big(\Sigma(\mathbf{X}_{i,n}^T\mathbf{X}_{i,n})^{-1}\big) = \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k^2_{i,n}}\,\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big).$$

Going back to the loss expression, we have

$$L_{i,n}(\hat\beta^\lambda_i) \le \mathbb{E}\Bigg[\frac{\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\,\sigma^2_i + \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\Bigg].$$

We decompose the loss into two terms depending on the high-probability event $E_\delta$ under which the concentration inequalities of Proposition 15 and Proposition 16 hold at the same time:

$$L_{i,n}(\hat\beta^\lambda_i) \le \mathbb{E}\Bigg[\frac{\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\,\sigma^2_i + \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\ \Bigg|\ E_\delta\Bigg] + \delta\,\mathbb{E}\Big[\mathrm{Tr}\big(\Sigma W_{i,n}\big(\sigma^2_i\,\mathbf{X}_{i,n}^T\mathbf{X}_{i,n}+\lambda^2\beta_i\beta_i^T\big)W_{i,n}^T\big)\ \Big|\ E^c_\delta\Big],$$

where we used $\mathbb{P}(E^c_\delta)\le\delta$. If we denote the second expectation in the previous expression by $L^c_{i,n}(\hat\beta^\lambda_i)$, then we can use Eq. C.39 and obtain

$$L^c_{i,n}(\hat\beta^\lambda_i) \le \sigma^2_i\,\mathbb{E}\big[\mathrm{Tr}\big(\Sigma W_{i,n}^T\big)\mid E^c_\delta\big] + \|\beta_i\|^2\lambda^2\,\mathbb{E}\big[\mathrm{Tr}\big(\Sigma W_{i,n}W_{i,n}^T\big)\mid E^c_\delta\big].$$

Using the fact that $\mathrm{Tr}(AB)\le\lambda_{\max}(A)\,\mathrm{Tr}(B)$, we can upper bound the previous equation as

$$L^c_{i,n}(\hat\beta^\lambda_i) \le \sigma^2_i\,\mathrm{Tr}(\Sigma)\,\mathbb{E}\big[\lambda_{\max}(W_{i,n})\mid E^c_\delta\big] + \|\beta_i\|^2\,\mathrm{Tr}(\Sigma)\,\lambda^2\,\mathbb{E}\big[\lambda_{\max}(W_{i,n})^2\mid E^c_\delta\big].$$

Recalling that, thanks to the regularization, $\lambda_{\max}(W_{i,n})\le1/\lambda$, we finally obtain

$$L^c_{i,n}(\hat\beta^\lambda_i) \le \mathrm{Tr}(\Sigma)\Bigg(\frac{\sigma^2_i}{\lambda} + \|\beta_i\|^2\Bigg). \tag{C.40}$$

The analysis of the high-probability part of the bound relies on the concentration inequalities for the trace and $\lambda_{\max}$, and on the lower bound on the number of samples $k_{i,n}$ from Theorem 17. We recall the three main inequalities we are going to use to bound the loss:

$$k_{i,n} \ge k^*_{i,n} - C\sqrt{nd} - \Omega(n^{1/4}),$$
$$\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big) \le d\Bigg(1 + 2(1+\alpha)\sqrt{\frac{d}{n}}\Bigg)^2,$$
$$\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big) \le \frac{1}{\lambda_{\min}(\Sigma)}\Bigg(1 + 2(1+\alpha)\sqrt{\frac{d}{n}}\Bigg)^2,$$

where $C = \frac{C_\delta + 8C_{\mathrm{Tr}}}{\sigma^2_{\min}\sqrt{\lambda_{\min}}}$, and the last inequality is obtained by multiplying by $\Sigma^{-1}\Sigma$ to whiten $\widehat\Sigma_{i,n}$, using Proposition 43, $\lambda_{\max}(AB)\le\lambda_{\max}(A)\lambda_{\max}(B)$, and finally $\lambda_{\max}(\Sigma^{-1})=1/\lambda_{\min}(\Sigma)$. We can invert the first inequality as

$$\frac{1}{k_{i,n}} \le \frac{1}{k^*_{i,n} - C\sqrt{nd} - \Omega(n^{1/4})} \le \frac{1}{k^*_{i,n}} + O\Bigg(\frac{2C}{k^*_{i,n}}\sqrt{\frac{d}{n}}\Bigg) \le \frac{1}{k^*_{i,n}} + O\Bigg(\frac{\sqrt{d}}{\sigma^2_{\min}\,(\lambda_{\min}n)^{3/2}}\Bigg), \tag{C.41}$$

where the last inequality is obtained by recalling that $k^*_{i,n} = \Theta(\lambda_i n)$ and using the definition of $C$ (where we ignore $C_\delta$ and $C_{\mathrm{Tr}}$). We can then rewrite the high-probability loss as

$$\mathbb{E}\Bigg[\frac{\mathrm{Tr}\big(\Sigma\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\,\sigma^2_i + \|\beta_i\|^2\,\frac{\lambda_{\max}\big(\widehat\Sigma^{-1}_{i,n}\big)}{k_{i,n}}\ \Bigg|\ E_\delta\Bigg] \le \frac{d\sigma^2_i}{k^*_{i,n}} + O\Bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}n}\Big)^{3/2}\Bigg) \le L^*_n + O\Bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}n}\Big)^{3/2}\Bigg).$$

By recalling the regret $R_n = \max_i L_{i,n}(\hat\beta^\lambda_{i,n}) - L^*_n$, bringing the bounds above together, and setting $\delta = O(n^{-3/2-\epsilon})$ for any $\epsilon>0$ and a suitable multiplicative constant, we obtain the final regret bound

$$R_n \le O\Bigg(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}n}\Big)^{3/2}\Bigg).$$

C.4.3 High Probability Bound for Trace-UCB Loss (Proof of Theorem 19)

In this section, we start by defining a new loss function for algorithm $\mathcal{A}$:
$$\tilde L_n(\mathcal{A}) = \max_{i\in[m]} \|\beta_i - \hat\beta_{i,n}\|^2_\Sigma. \tag{C.42}$$
Note that $\tilde L_n(\mathcal{A})$ is a random variable, as $\hat\beta_{i,n}$ is random, and the expectation is only taken with respect to the test point $X \sim F$ (leading to the $\Sigma$-norm). We expect results of the following flavor: let $\delta > 0$, then with probability at least $1-\delta$,
$$\tilde L_n(\mathcal{A}) - \tilde L^*_n \le O\left(\left(\frac{\sum_j \sigma^2_j\, d}{n}\right)^{3/2}\right), \tag{C.43}$$
when $\mathcal{A}$ corresponds to Trace-UCB, and $\tilde L^*_n$ to the optimal static allocation under ordinary least squares. We start by focusing on $\tilde L_n(\mathcal{A})$, and proving Theorem 19:

Theorem. Let $\delta > 0$, and assume $\|\beta_i\|_2 \le Z$ for all $i$, for some $Z > 0$. With probability at least $1-\delta$,
$$\tilde L_n(\mathcal{A}) \le \frac{\sum_{j=1}^m \sigma^2_j}{n}\left(d + 2\log\frac{3m}{\delta}\right) + O\left(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{n\lambda_{\min}}\Big)^{3/2}\right), \tag{C.44}$$
where $\lambda_{\min} = \sigma^2_{\min}/\sum_j \sigma^2_j$.

Proof. We define a set of events that help us control the loss, and then we show that these events simultaneously hold with high probability. In particular, we need the following events:

1. $\mathcal{E}_G \equiv$ the good event holds (for all arms $i$, and all times $t$), which includes a confidence interval for $\hat\sigma^2_{i,t}$ and the trace of the empirical covariance matrix. Holds with probability $1-\delta_G$. This event is described and controlled in Proposition 15 and Proposition 16.

2. $\mathcal{E}_{M,i} \equiv$ the confidence intervals $C_{i,t}$ created for arm $i$ at time $t$ contain the true $\beta_i$ at all times $t$, based on the vector-valued martingale in [1]. Holds with probability $1-\delta_{M,i}$. This event is described and controlled in Theorem 44.

3. $\mathcal{E}_{C,i,t} \equiv$ the empirical covariance $\hat\Sigma_{i,t}$ for arm $i$ at time $t$ is close to $\Sigma$. This event is a direct consequence of event $\mathcal{E}_G$.

4. $\mathcal{E}_{B,i,t} \equiv$ the first $t$ observations pulled at arm $i$ have norm reasonably bounded; the empirical average norm is not too far from its mean. Holds with probability $1-\delta_{B,i,t}$. This event is described and controlled in Corollary 46.

Let $\mathcal{H}$ be the set of all the previous events. Then, by the union bound,
$$\mathbb{P}\Big(\bigcap_{\epsilon\in\mathcal{H}}\epsilon\Big) \ge 1 - \sum_{\epsilon\in\mathcal{H}}\delta_\epsilon. \tag{C.45}$$
Our goal is to show that if $\bigcap_{\epsilon\in\mathcal{H}}\epsilon$ holds, then the loss $\tilde L_n(\mathcal{A}) = \max_{i\in[m]}\|\beta_i-\hat\beta_{i,n}\|^2_\Sigma$ is upper bounded by a quantity that resembles the expected loss of the algorithm that knows the $\sigma^2_i$'s in advance.


Fix $\delta > 0$. We want $\delta = \sum_{\epsilon\in\mathcal{H}}\delta_\epsilon$, and we would like to assign equal weight to all the sets of events. First, $\delta_G = \delta/3$. Also, $\sum_i \delta_{M,i} = \delta/3$, implying $\delta_{M,i} = \delta/3m$ for every arm $i\in[m]$. Finally, to bound observation norms, we set $\sum_i\sum_t \delta_{B,i,t} = \delta/3$. It follows that we can take $\delta_{B,i,t} = \delta/3mn$, even though $t$ really ranges from $d$ to $n$.

Assume that $\mathcal{E}_G$, $\mathcal{E}_{M,i}$ and $\mathcal{E}_{B,i,t}$ hold for all arms $i$ and times $t$. Then, by Theorem 17, the final number of pulls for arm $i$ can be lower bounded by
$$k_i \ge \frac{\sigma^2_i}{\sum_j\sigma^2_j}\,n - c\left(\sqrt{\frac{\sigma^2_i}{\sigma^2_{\min}}} + 1\right)\sqrt{\frac{\sigma^2_i}{\sum_j\sigma^2_j}\,dn} + o\big(\sqrt{dn}\big), \tag{C.46}$$
where $c = 2\big(1 + \sqrt{2\log(12mn/\delta)/d}\big)$.

For notational simplicity, we denote by $\hat\beta_{i,t}$ the estimate after $t$ pulls. Thus, with respect to our previous notation where $\hat\beta_{i,n}$ referred to our final estimate, we have that $\hat\beta_{i,k_{i,n}} = \hat\beta_{i,n}$, as $k_{i,n}$ is the total number of pulls for arm $i$.

If the $\mathcal{E}_{M,i}$ events hold, then we know that our $\hat\beta_{i,t}$ estimates are not very far from the true values $\beta_i$ when $t$ is large. In particular, we know that the error is controlled by the radius $R_{i,t}$ of the confidence ellipsoids. We expect these radiuses to decrease with the number of observations per arm, $t$. As we have a lower bound on the total number of pulls for arm $i$, $k_{i,n}$, if the confidence ellipsoids apply, then we can directly obtain an upper bound on the radius $R_{i,t}$ at the end of the process.

We need to do a bit of work to properly bound $\|\hat\beta_{i,k_{i,n}} - \beta_i\|^2_\Sigma$.

Fix arm $i$, and assume $\mathcal{E}_{M,i}$ holds. In addition, assume $\|\beta_i\|_2 \le Z$ for all $i$. Let $\bar V_{i,t} = \lambda I + X_{i,t}^T X_{i,t}$, where $X_{i,t}$ contains the first $t$ observations pulled by arm $i$. We modify the proof of Theorem 2 in [1] by taking $x = (\bar V_t/t)(\hat\beta_t - \beta^*)$ in their equation 5 (we are using their notation in the latter expression). Assume the algorithm pulls arm $i$ a total of $t$ times ($k_{i,n}$ is a stopping time with respect to the $\sigma$-algebra that includes the environment, i.e., the other arms); then, by Theorem 44,
$$\|\hat\beta_{i,t} - \beta_i\|_{\bar V_{i,t}/t} \le \frac{\sigma_i}{\sqrt{t}}\sqrt{2\log\left(\frac{\det(\bar V_{i,t})^{1/2}\det(\lambda I)^{-1/2}}{\delta_{M,i}}\right)} + \sqrt{\frac{\lambda}{t}}\,Z. \tag{C.47}$$
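As a quick numerical sanity check of (C.47), the following sketch (illustrative only: the dimensions, noise level, regularization and confidence level are placeholders, and standard Gaussian contexts are assumed so that $\Sigma = I$) draws many independent datasets, forms the ridge estimate, and verifies that the $\bar V_{i,t}/t$-norm of the error exceeds the right-hand side of (C.47) in at most (roughly) a $\delta_{M,i}$ fraction of the runs.

```python
import numpy as np

# Minimal simulation of the confidence bound (C.47); all constants are illustrative.
rng = np.random.default_rng(0)
d, t, sigma, lam, delta, n_trials = 5, 200, 1.0, 1.0, 0.05, 2000
Z = 2.0                                    # assumed upper bound on ||beta||_2
beta = rng.normal(size=d)
beta *= Z / (2 * np.linalg.norm(beta))     # make sure ||beta||_2 <= Z

failures = 0
for _ in range(n_trials):
    X = rng.normal(size=(t, d))            # whitened contexts, Sigma = I
    y = X @ beta + sigma * rng.normal(size=t)
    V = lam * np.eye(d) + X.T @ X
    beta_hat = np.linalg.solve(V, X.T @ y)           # ridge estimate
    err = beta_hat - beta
    lhs = np.sqrt(err @ (V / t) @ err)               # || . ||_{V/t} norm of the error
    logdet_ratio = 0.5 * (np.linalg.slogdet(V)[1] - d * np.log(lam))
    radius = (sigma / np.sqrt(t)) * np.sqrt(2 * (logdet_ratio + np.log(1 / delta))) \
             + np.sqrt(lam / t) * Z
    failures += lhs > radius

print(f"empirical failure rate: {failures / n_trials:.4f} (target <= {delta})")
```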

We would like to upper bound $\|\hat\beta_{i,k_{i,n}} - \beta_i\|_\Sigma$ by means of $\|\hat\beta_{i,k_{i,n}} - \beta_i\|_{\bar V_{i,k_{i,n}}/k_{i,n}}$. Note that when $t$ grows, $\bar V_{i,t}/t \to \Sigma$ as the regularization is washed out. The distance between $\hat\Sigma_{i,t} = \bar V_{i,t}/t - (\lambda/t)I$ and $\Sigma$ is captured by event $\mathcal{E}_{C,i,t}$.

Formally, as $\mathcal{E}_G$ holds, we know that the difference between $\Sigma$ and $\hat\Sigma_{i,t}$ is bounded in operator norm, for any $i$ and $t$, by
$$\|\Sigma - \hat\Sigma_{i,t}\| \le 2\left(1 + \sqrt{\frac{2}{d}\log\frac{2}{\delta_G}}\right)\sqrt{\frac{d}{t}}\,\|\Sigma\| = c\sqrt{\frac{d}{t}}\,\lambda_{\max}(\Sigma). \tag{C.48}$$


Then, as a consequence, for all $x \in \mathbb{R}^d$,
$$x^T(\Sigma - \hat\Sigma_{i,t})x \le c\,\lambda_{\max}(\Sigma)\sqrt{\frac{d}{t}}\,\|x\|^2_2. \tag{C.49}$$
In particular, by taking $x = \hat\beta_{i,t} - \beta_i$,
$$c\,\lambda_{\max}(\Sigma)\sqrt{\frac{d}{t}}\,\|\hat\beta_{i,t} - \beta_i\|^2_2 \ge (\hat\beta_{i,t} - \beta_i)^T(\Sigma - \hat\Sigma_{i,t})(\hat\beta_{i,t} - \beta_i) \tag{C.50}$$
$$= \|\hat\beta_{i,t} - \beta_i\|^2_\Sigma - \|\hat\beta_{i,t} - \beta_i\|^2_{\hat\Sigma_{i,t}}. \tag{C.51}$$
In addition, note that $\|x\|^2_{\hat\Sigma_{i,t}} = \|x\|^2_{\bar V_{i,t}/t} - (\lambda/t)\|x\|^2_2$. We conclude that
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \le \|\hat\beta_{i,t} - \beta_i\|^2_{\hat\Sigma_{i,t}} + c\,\lambda_{\max}(\Sigma)\sqrt{\frac{d}{t}}\,\|\hat\beta_{i,t} - \beta_i\|^2_2 \tag{C.52}$$
$$= \|\hat\beta_{i,t} - \beta_i\|^2_{\bar V_{i,t}/t} + \left(c\,\lambda_{\max}(\Sigma)\sqrt{\frac{d}{t}} - \frac{\lambda}{t}\right)\|\hat\beta_{i,t} - \beta_i\|^2_2. \tag{C.53}$$
On the other hand, we know that $\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \ge \lambda_{\min}(\Sigma)\|\hat\beta_{i,t} - \beta_i\|^2_2$. Therefore, by (C.47),
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \le \frac{1}{1 - \frac{1}{\lambda_{\min}(\Sigma)}\left(c\,\lambda_{\max}(\Sigma)\sqrt{\frac{d}{t}} - \frac{\lambda}{t}\right)}\,\|\hat\beta_{i,t} - \beta_i\|^2_{\bar V_{i,t}/t} \tag{C.54}$$
$$\le \frac{1}{1-\gamma_t}\left[\frac{\sigma_i}{\sqrt{t}}\sqrt{2\log\left(\frac{\det(\bar V_{i,t})^{1/2}\det(\lambda I)^{-1/2}}{\delta_{M,i}}\right)} + \frac{\sqrt{\lambda}\,Z}{\sqrt{t}}\right]^2 \tag{C.55}$$
$$\le \frac{1}{1-\gamma_t}\,\frac{1}{t}\left[\sigma_i\sqrt{2\left(\frac{1}{2}\log\frac{\det(\bar V_{i,t})}{\det(\lambda I)} + \log\frac{1}{\delta_{M,i}}\right)} + \sqrt{\lambda}\,Z\right]^2 \tag{C.56}$$
$$\le \frac{1}{1-\gamma_t}\,\frac{1}{t}\left[\sigma_i\sqrt{2\left(\frac{1}{2}\sum_{j=1}^t\|X_j\|^2_{\bar V^{-1}_{i,t}} + \log\frac{1}{\delta_{M,i}}\right)} + \sqrt{\lambda}\,Z\right]^2, \tag{C.57}$$
where we defined $\gamma_t = \big(c\,\lambda_{\max}(\Sigma)\sqrt{d/t} - \lambda/t\big)/\lambda_{\min}(\Sigma)$, and we used Lemma 11 in [1], which shows that
$$\log\frac{\det(\bar V_{i,t})}{\det(\lambda I)} \le \sum_{j=1}^t\|X_j\|^2_{\bar V^{-1}_{i,t}}. \tag{C.58}$$
We would like to approximate the $\bar V^{-1}_{i,t}$ norm by means of the inverse covariance norm, $\Sigma^{-1}$. The whitened equation that is equivalent to (C.49) (see Lemma 43) is given by $\|I - \tilde\Sigma_{i,t}\| \le \epsilon$, with $\epsilon = c\sqrt{d/t}$, where $\tilde\Sigma_{i,t} = \Sigma^{-1/2}\hat\Sigma_{i,t}\Sigma^{-1/2}$ denotes the whitened empirical covariance matrix. It implies that, for any $j = 1,\dots,d$,
$$1 - c\sqrt{\frac{d}{t}} - O\Big(\frac{d}{t}\Big) \le \lambda_j(\tilde\Sigma_{i,t}) \le 1 + c\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big). \tag{C.59}$$
The $\bar V^{-1}_{i,t}$ norm can be bounded as follows:
$$\|x\|^2_{\bar V^{-1}_{i,t}} = x^T\bar V^{-1}_{i,t}x = x^T\big(\lambda I + X_{i,t}^T X_{i,t}\big)^{-1}x \tag{C.60}$$
$$= x^T\Sigma^{-1/2}\Sigma^{1/2}\big(\lambda I + X_{i,t}^T X_{i,t}\big)^{-1}\Sigma^{1/2}\Sigma^{-1/2}x \tag{C.61}$$
$$= \tilde x^T\big(\lambda\Sigma^{-1} + \tilde X_{i,t}^T\tilde X_{i,t}\big)^{-1}\tilde x \tag{C.62}$$
$$= \frac{1}{t}\,\tilde x^T\left(\frac{\lambda}{t}\Sigma^{-1} + \tilde\Sigma_{i,t}\right)^{-1}\tilde x, \tag{C.63}$$
where $\tilde x = \Sigma^{-1/2}x$ denotes the whitened version of $x$, and $\tilde X_{i,t} = X_{i,t}\Sigma^{-1/2}$ the whitened observations. We can now apply the matrix inversion lemma to see that
$$\|x\|^2_{\bar V^{-1}_{i,t}} = \frac{1}{t}\,\tilde x^T\left(\frac{\lambda}{t}\Sigma^{-1} + \tilde\Sigma_{i,t}\right)^{-1}\tilde x \tag{C.64}$$
$$= \frac{1}{t}\,\tilde x^T\left(\tilde\Sigma^{-1}_{i,t} - \tilde\Sigma^{-1}_{i,t}\Sigma^{-1/2}\left(\frac{t}{\lambda}I + \Sigma^{-1/2}\tilde\Sigma^{-1}_{i,t}\Sigma^{-1/2}\right)^{-1}\Sigma^{-1/2}\tilde\Sigma^{-1}_{i,t}\right)\tilde x \tag{C.65}$$
$$= \frac{1}{t}\,\tilde x^T\Big(\tilde\Sigma^{-1}_{i,t} - \tilde\Sigma^{-1}_{i,t}\Sigma^{-1/2}R^{-1}\Sigma^{-1/2}\tilde\Sigma^{-1}_{i,t}\Big)\tilde x, \tag{C.66}$$
where we implicitly defined $R = (t/\lambda)I + \Sigma^{-1/2}\tilde\Sigma^{-1}_{i,t}\Sigma^{-1/2}$, a positive definite matrix. We upper bound the previous expression to conclude that
$$\|x\|^2_{\bar V^{-1}_{i,t}} = \frac{1}{t}\,\tilde x^T\Big(\tilde\Sigma^{-1}_{i,t} - \tilde\Sigma^{-1}_{i,t}\Sigma^{-1/2}R^{-1}\Sigma^{-1/2}\tilde\Sigma^{-1}_{i,t}\Big)\tilde x \tag{C.67}$$
$$\le \frac{1}{t}\,\tilde x^T\tilde\Sigma^{-1}_{i,t}\tilde x \le \frac{\lambda_{\max}(\tilde\Sigma^{-1}_{i,t})}{t}\,\|\tilde x\|^2_2 \le \frac{1 + c\sqrt{d/t} + O(d/t)}{t}\,\|\tilde x\|^2_2. \tag{C.68}$$
If we now go back to (C.58), using the previous results, we see that
$$\sum_{j=1}^t\|X_j\|^2_{\bar V^{-1}_{i,t}} \le \left(1 + c\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big)\right)\left(\frac{1}{t}\sum_{j=1}^t\|\tilde X_j\|^2_2\right). \tag{C.69}$$


Substituting the upper bound in (C.57):
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \le \frac{1}{1-\gamma_t}\,\frac{1}{t}\left[\sigma_i\sqrt{2\left(\frac{1}{2}\sum_{j=1}^t\|X_j\|^2_{\bar V^{-1}_{i,t}} + \log\frac{1}{\delta_{M,i}}\right)} + \sqrt{\lambda}\,Z\right]^2 \tag{C.70}$$
$$\le \frac{1}{1-\gamma_t}\,\frac{1}{t}\left[\sigma_i\sqrt{\left(1 + c\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big)\right)\left(\frac{1}{t}\sum_{j=1}^t\|\tilde X_j\|^2_2\right) + 2\log\frac{1}{\delta_{M,i}}} + \sqrt{\lambda}\,Z\right]^2.$$
By Corollary 46, with probability $1-\delta_{B,i,t}$, the empirical average norm of the whitened Gaussian observations is controlled by
$$\frac{1}{t}\sum_{j=1}^t\|\tilde X_j\|^2 \le d + 8\log\frac{1}{\delta_{B,i,t}}\sqrt{\frac{d}{t}}. \tag{C.71}$$
As $\delta_{B,i,t} = \delta/3mn$ and $\delta_{M,i} = \delta/3m$, we conclude that
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \le \frac{1}{1-\gamma_t}\,\frac{1}{t}\left[\sigma_i\sqrt{\left(1 + c\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big)\right)\left(d + 8\log\frac{3mn}{\delta}\sqrt{\frac{d}{t}}\right) + 2\log\frac{3m}{\delta}} + \sqrt{\lambda}\,Z\right]^2$$
$$\le \frac{1}{1 - \big(c\,\lambda_{\max}(\Sigma)\sqrt{d/t} - \lambda/t\big)/\lambda_{\min}(\Sigma)}\,\frac{1}{t}\left[\sigma_i\sqrt{d + \left(c + 8\log\frac{3mn}{\delta}\right)\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big) + 2\log\frac{3m}{\delta}} + \sqrt{\lambda}\,Z\right]^2. \tag{C.72}$$
At this point, recall that under our events
$$k_{i,n} \ge k^*_{i,n} - C\sqrt{nd} - \Omega(n^{1/4}), \tag{C.73}$$
where $C = \frac{C_\lambda + 8C_{\mathrm{Tr}}}{\sigma^2_{\min}\sqrt{\lambda_{\min}}}$. As (C.72) decreases in $t$, we will bound the error $\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma$ by taking the number of pulls $t = (\sigma^2_i/\sum_j\sigma^2_j)\,n + O(\sqrt{dn})$ (in particular, the RHS of (C.73)).


If we take $\lambda = 1/n$, we have that
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \tag{C.74}$$
$$\le \frac{1}{1 - \big(c\,\lambda_{\max}(\Sigma)\sqrt{d/t} - \lambda/t\big)/\lambda_{\min}(\Sigma)}\,\frac{1}{t}\left[\sigma_i\sqrt{d + \left(c + 8\log\frac{3mn}{\delta}\right)\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big) + 2\log\frac{3m}{\delta}} + \sqrt{\lambda}\,Z\right]^2$$
$$\le \left(1 + c\,\frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big)\right)\frac{1}{t}\left[\sigma_i\sqrt{d + \left(c + 8\log\frac{3mn}{\delta}\right)\sqrt{\frac{d}{t}} + O\Big(\frac{d}{t}\Big) + 2\log\frac{3m}{\delta}} + \sqrt{\lambda}\,Z\right]^2$$
$$\le \left(1 + O\left(\sqrt{\frac{d}{t}}\right)\right)\frac{1}{t}\left[\sigma^2_i\left(d + 2\log\frac{3m}{\delta} + \left(c + 8\log\frac{3mn}{\delta}\right)\sqrt{\frac{d}{t}}\right) + \frac{Z^2}{n} + 2Z\sigma_i\sqrt{\frac{d + 2\log\frac{3m}{\delta}}{n}} + o\left(\sqrt{\frac{d}{n}}\right)\right].$$
Now, by (C.73) and (C.41), and using the $\lambda_i = \sigma^2_i/\sum_j\sigma^2_j$ notation,
$$\|\hat\beta_{i,t} - \beta_i\|^2_\Sigma \tag{C.75}$$
$$\le \left(1 + O\left(\sqrt{\frac{d}{n}}\right)\right)\frac{\sigma^2_i\left(d + 2\log\frac{3m}{\delta}\right) + \sigma^2_i\left(c + 8\log\frac{3mn}{\delta}\right)\sqrt{\frac{d}{t}} + 2Z\sigma_i\sqrt{\frac{d}{n}} + o\left(\sqrt{\frac{d}{n}}\right)}{k^*_{i,n} - C\sqrt{nd} - \Omega(n^{1/4})}$$
$$= \left(1 + O\left(\sqrt{\frac{d}{n}}\right)\right)\frac{\sigma^2_i\left(d + 2\log\frac{3m}{\delta}\right) + \left(\sigma^2_i\left(c + 8\log\frac{3mn}{\delta}\right) + 2Z\sigma_i\right)\sqrt{\frac{d}{t}} + o\left(\sqrt{\frac{d}{n}}\right)}{k^*_{i,n} - C\sqrt{nd} - \Omega(n^{1/4})}$$
$$= \left(1 + O\left(\sqrt{\frac{d}{n}}\right)\right)\left(\frac{1}{k^*_{i,n}} + O\left(\frac{\sqrt{d}}{\sigma^2_{\min}(\lambda_{\min}n)^{3/2}}\right)\right)\left[\sigma^2_i\left(d + 2\log\frac{3m}{\delta}\right) + O\left(\sqrt{\frac{d}{n}}\right)\right]$$
$$= \frac{\sigma^2_i}{k^*_{i,n}}\left(d + 2\log\frac{3m}{\delta}\right) + O\left(\frac{1}{\sigma^2_{\min}}\Big(\frac{d}{\lambda_{\min}n}\Big)^{3/2}\right). \tag{C.76}$$

C.5 Loss of a RLS-based Learning Algorithm

C.5.1 Distribution of RLS estimates

Proposition 47. Given a linear regression problem with observations $Y = X^T\beta + \epsilon$ with Gaussian noise with variance $\sigma^2$, after $n$ contexts $X$ and the corresponding observations $Y$, the ridge estimate of the parameter $\beta$ is obtained as
$$\hat\beta^\lambda = (X^TX + \lambda I)^{-1}X^TY = WX^TY,$$
with $W = (X^TX + \lambda I)^{-1}$, and its distribution conditioned on $X$ is
$$\hat\beta^\lambda \mid X \sim \mathcal{N}\big(\beta - \lambda W\beta,\ \sigma^2\,W(X^TX)W^T\big). \tag{C.77}$$
Proof. Recalling the definition of the OLS estimator $\hat\beta$ (assuming it exists), we can easily rewrite the RLS estimator as
$$\hat\beta^\lambda = (X^TX + \lambda I)^{-1}(X^TX)(X^TX)^{-1}X^TY = (X^TX + \lambda I)^{-1}(X^TX)\hat\beta.$$
This immediately gives that the conditional distribution of $\hat\beta^\lambda$ is Gaussian, as for $\hat\beta$. We just need to compute the corresponding mean vector and covariance matrix. We first notice that the RLS estimator is biased, as
$$\mathbb{E}[\hat\beta^\lambda \mid X] = (X^TX + \lambda I)^{-1}(X^TX)\beta.$$
Let $S = X^TX$; then we can further rewrite the bias as
$$\mathbb{E}[\hat\beta^\lambda \mid X] = (S + \lambda SS^{-1})^{-1}S\beta = \big(S(I + \lambda S^{-1})\big)^{-1}S\beta = (I + \lambda S^{-1})^{-1}\beta = \big(I - \lambda(S + \lambda I)^{-1}\big)\beta = \beta - \lambda(S + \lambda I)^{-1}\beta = \beta - \lambda W\beta,$$
where we used the matrix inversion lemma. Recalling that the covariance of $\hat\beta$ is $\sigma^2(X^TX)^{-1}$, the covariance of $\hat\beta^\lambda$ is then
$$\operatorname{Cov}\big[\hat\beta^\lambda \mid X\big] = W(X^TX)\operatorname{Cov}\big[\hat\beta \mid X\big](X^TX)W^T = \sigma^2\,W(X^TX)W^T.$$
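The conditional mean and covariance in (C.77) are straightforward to confirm by simulation. The sketch below (illustrative sizes; the design $X$ is held fixed across noise draws, as in the conditioning of the proposition) compares the empirical moments of the ridge estimate against $\beta - \lambda W\beta$ and $\sigma^2 W(X^TX)W^T$.

```python
import numpy as np

# Monte Carlo check of Proposition 47 (illustrative sizes; X is fixed across trials).
rng = np.random.default_rng(1)
n, d, sigma, lam = 300, 4, 0.5, 5.0
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)

W = np.linalg.inv(X.T @ X + lam * np.eye(d))
mean_theory = beta - lam * W @ beta                  # E[beta_ridge | X]
cov_theory = sigma**2 * W @ (X.T @ X) @ W.T          # Cov[beta_ridge | X]

estimates = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    estimates.append(W @ X.T @ y)                    # ridge estimate for this noise draw
estimates = np.array(estimates)

print("max |empirical mean - theory|:", np.abs(estimates.mean(axis=0) - mean_theory).max())
print("max |empirical cov  - theory|:", np.abs(np.cov(estimates.T) - cov_theory).max())
```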

C.5.2 Loss Function of a RLS-based Algorithm

We start by proving the loss function in the case of a static algorithm.

Lemma 48. Let $\mathcal{A}$ be a learning algorithm that selects instance $i$ for $k_{i,n}$ times, where $k_{i,n}$ is a fixed quantity chosen in advance, and that returns estimates $\hat\beta^\lambda_i$ obtained by RLS with regularization $\lambda$. Then its loss after $n$ steps can be expressed as
$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_{i\in[m]}\operatorname{Tr}\Big(\Sigma\,\mathbb{E}\big[W_{i,n}\big(\sigma^2_i X_{i,n}^T X_{i,n} + \lambda^2\beta_i\beta_i^T\big)W_{i,n}^T\big]\Big), \tag{C.78}$$
where $W_{i,n} = (X_{i,n}^T X_{i,n} + \lambda I)^{-1}$, and $X_{i,n}$ is the matrix with the $k_{i,n}$ contexts from instance $i$.


Proof. The proof follows the same steps as in App. C.1 up to Eq. C.4, where we have
$$L_n(\mathcal{A}_{\mathrm{stat}}) = \max_{i\in[m]}\operatorname{Tr}\left(\mathbb{E}_{X_i}\Big[\mathbb{E}_{\epsilon_i}\big[\Sigma(\hat\beta^\lambda_i - \beta_i)(\hat\beta^\lambda_i - \beta_i)^T \,\big|\, X_i\big]\Big]\right).$$
Following Proposition 47, we can refine the inner expectation as
$$\mathbb{E}\big[(\hat\beta^\lambda - \beta)(\hat\beta^\lambda - \beta)^T \mid X\big] = \mathbb{E}\big[(\hat\beta^\lambda - \beta + \lambda W\beta - \lambda W\beta)(\hat\beta^\lambda - \beta + \lambda W\beta - \lambda W\beta)^T \mid X\big]$$
$$= \mathbb{E}\big[(\hat\beta^\lambda - \mathbb{E}[\hat\beta^\lambda \mid X] - \lambda W\beta)(\hat\beta^\lambda - \mathbb{E}[\hat\beta^\lambda \mid X] - \lambda W\beta)^T \mid X\big]$$
$$= \mathbb{E}\big[(\hat\beta^\lambda - \mathbb{E}[\hat\beta^\lambda \mid X])(\hat\beta^\lambda - \mathbb{E}[\hat\beta^\lambda \mid X])^T \mid X\big] + \lambda^2 W\beta\beta^T W^T$$
$$= \sigma^2\,W(X^TX)W^T + \lambda^2 W\beta\beta^T W^T = W\big[\sigma^2 X^TX + \lambda^2\beta\beta^T\big]W^T.$$
Plugging the final expression back into $L_n(\mathcal{A}_{\mathrm{static}})$ we obtain the desired expression.

We notice that a result similar to Lemma 40 holds for RLS estimates as well.

Proposition 49. Assume the noise $\epsilon$ is Gaussian. Let $\hat\sigma^2$ be the estimate for $\sigma^2$ computed by using the residuals of the OLS solution $\hat\beta$. Then, $\hat\beta^\lambda$ and $\hat\sigma^2$ are independent random variables conditionally on $X$.

Proof. As shown in the proof of Proposition 47, we have $\hat\beta^\lambda = (X^TX + \lambda I)^{-1}(X^TX)\hat\beta$, and we know that functions of independent random variables are themselves independent. Since the matrix mapping $\hat\beta$ to $\hat\beta^\lambda$ is fixed given $X$, and $\hat\beta$ and $\hat\sigma^2$ are conditionally independent from Lemma 40, the statement follows.

We can now combine Proposition 49 and Lemma 48 to conclude that a similar expression to Eq. C.79 holds for the ridge estimators also when a non-static algorithm such as Trace-UCB is run.

Lemma 50. Let $\mathcal{A}$ be a learning algorithm such that $I_t$ is chosen as a function of the previous history, i.e., $\mathcal{D}_{t-1} = \{X_1, I_1, Y_{I_1,1}, \dots, X_{t-1}, I_{t-1}, Y_{I_{t-1},t-1}\}$, and that it returns estimates $\hat\beta^\lambda_i$ obtained by RLS with regularization $\lambda$. Then its loss after $n$ steps can be expressed as
$$L_n(\mathcal{A}) = \max_{i\in[m]}\operatorname{Tr}\Big(\Sigma\,\mathbb{E}\big[W_{i,n}\big(\sigma^2_i X_{i,n}^T X_{i,n} + \lambda^2\beta_i\beta_i^T\big)W_{i,n}^T\big]\Big), \tag{C.79}$$
where $W_{i,n} = (X_{i,n}^T X_{i,n} + \lambda I)^{-1}$, and $X_{i,n}$ is the matrix with the $k_{i,n}$ contexts from instance $i$.


Proof. The proof follows immediately by extending Lemma 40 to $\hat\beta^\lambda$ as, by Proposition 49, $\hat\beta^\lambda$ and $\hat\sigma^2_{\mathrm{OLS}}$ are independent. Then, we proceed in a way similar to that in the proof of Lemma 14 to perform the required conditioning.

C.6 Sparse Trace-UCB Algorithm

C.6.1 Summary

High-dimensional linear regression models are remarkably common in practice. Companies tend to record a large number of features of their customers, and feed them to their prediction models. There are also cases in which the number of problem instances under consideration $m$ is large, e.g., too many courses in the MOOC example described in the introduction. Unless the horizon $n$ is still proportionally large w.r.t. $md$, these scenarios require special attention. In particular, algorithms like Trace-UCB that adaptively use contexts in their allocation strategy become more robust than their context-free counterparts.

A natural assumption in such scenarios is sparsity, i.e., only a small subset of features are relevant to the prediction problem at hand (have a non-zero coefficient). In our setting of $m$ problem instances, it is often reasonable to assume that these instances are related to each other, and thus, it makes sense to extend the concept of sparsity to joint sparsity, i.e., a sparsity pattern across the instances. Formally, we assume that there exists an $s \ll d$ such that
$$|S| := \Big|\bigcup_{i\in[m]}\operatorname{supp}(\beta_i)\Big| = s, \tag{C.80}$$
where $\operatorname{supp}(\beta_i) = \{j \in [d] : \beta_i^{(j)} \neq 0\}$ denotes the support of the $i$'th problem instance. A special case of joint sparsity is when $|\operatorname{supp}(\beta_i)| \approx s$ for all $i$, i.e., most of the relevant features are shared across the instances.

In this section, we focus on the scenario where $dm > n$. When we can only allocate a small (relative to $d$) number of contexts to each problem instance, proper balancing of contexts becomes extremely important, and thus, the algorithms that do not take context into account in their allocation are destined to fail. Although Trace-UCB has the advantage of using context in its allocation strategy, it still needs to quickly discover the relevant features (those in the support) and only use those in its allocation strategy.

This motivates a two-stage algorithm, which we call Sparse-Trace-UCB, whose pseudocode is in Algorithm 7. In the first stage, the algorithm allocates contexts uniformly to all the instances, $L$ contexts per instance, and then recovers the support. In the second stage, it relies on the discovered support $\hat S$, and applies the standard Trace-UCB to all the instances, but only takes into account the features in $\hat S$. Note that $L$ should be large enough that, with high probability, the support is exactly recovered, i.e., $\hat S = S$.


There exists a large literature on how to perform simultaneous support discovery in jointly sparse linear regression problems [67, 68, 95], which we discuss in detail below. Most of these algorithms minimize the regularized empirical loss
$$\min_{\mathbf{M}\in\mathbb{R}^{d\times m}}\ \frac{1}{k}\sum_{i=1}^m\|Y_i - X_i\,\mathbf{M}[:,i]\|^2 + \lambda\,\|\mathbf{M}\|,$$
where $k$ is the number of samples per problem, $\mathbf{M}$ is the matrix whose $i$'th column is $\mathbf{M}[:,i] = \beta_i$, $X_i \in \mathbb{R}^{k\times d}$, and $Y_i = X_i\beta_i + \epsilon_i$. In particular, they use an $l_a/l_b$ block regularization norm, i.e., $\|\mathbf{M}\|_{l_a/l_b} = \|v\|_{l_a}$, where $v_i = \|\mathbf{M}[i,:]\|_{l_b}$ and $\mathbf{M}[i,:]$ is the $i$'th row of $\mathbf{M}$. In short, the Sparse-Trace-UCB algorithm uses the $l_1/l_2$ block regularization Lasso algorithm [95], an extension of the algorithm in [68], for its support discovery stage.

We extend the guarantees of Theorem 19 to the high-dimensional case with joint sparsity, assuming $s$ is known. The following is Theorem 20 in Chapter 3.

Theorem. Let $\delta_1 > 0$. Assume $\|\beta_i\|_2 \le Z$ for all $i$, for some $Z > 0$, and assume the parameters $(n, d, s, \beta_i, \Sigma)$ satisfy conditions C1 to C5 in [95]. Let $\psi$ be the sparsity overlap function defined in [68]. If $L > 2(1+v)\,\psi\log(d-s)\,\rho_u\big(\Sigma^{(1:m)}_{S^C S^C\mid S}\big)/\gamma^2$ for some constant $v > 0$, and $n - Lm \ge (s+1)m$, then, with probability at least $1-\delta_1-\delta_2$,
$$\tilde L_n(\mathcal{A}) \le \frac{\sum_j\sigma^2_j}{n - Lm}\left(s + 2\log\frac{3m}{\delta_1}\right) + \frac{2c}{\sqrt{\sigma^2_{\min}}}\left(\frac{s\sum_j\sigma^2_j}{n - Lm}\right)^{3/2} + o(z), \tag{C.81}$$
where $c \le 2\big(1 + \sqrt{2\log(12mn/\delta_1)/s}\big)$ and we defined $\delta_2 = m\exp(-c_0\log s) + \exp(-c_1\log(d-s))$ for positive constants $c_0, c_1 > 0$, and $z = (s/(n-Lm))^{3/2}$.

The exact technical assumptions and the proof are given and discussed below. We simply combine the high-probability results of Theorem 19 and the high-probability support recovery of Theorem 2 in [95]. In addition, we provide Corollary 51, where we study the regime of interest where the support overlap is complete (for simplicity), $n = C_1 ms\log d \ll md$ for $C_1 > 0$, and $L = C_2 s\log d$, for $C_1 > C_2 > 0$.

Corollary 51. Under the assumptions of Theorem 20, let $\delta_1 > 0$, assume $n = C_1 ms\log d$, the supports of all arms are equal, and set $L = C_2 s\log d$, for $C := C_1 - C_2 > 0$. Then, with probability at least $1-\delta_1-\delta_2$,
$$\tilde L_n(\mathcal{A}) \le \frac{\sum_j\sigma^2_j}{Cms\log d}\left(s + 2\log\frac{3m}{\delta_1}\right) + \frac{2c}{\sqrt{\sigma^2_{\min}}}\left(\frac{\sum_j\sigma^2_j}{Cm\log d}\right)^{3/2} + o(z), \tag{C.82}$$
where $c \le 2\big(1 + \sqrt{2\log(12mn/\delta_1)/s}\big)$ and we defined $\delta_2 = m\exp(-c_0\log s) + \exp(-c_1\log(d-s))$ for constants $c_0, c_1 > 0$, and $z = (Cm\log d)^{-3/2}$.

Algorithm 7 contains the pseudocode of our Sparse-Trace-UCB algorithm.

Algorithm 7 Sparse-Trace-UCB Algorithm.
1: for $i = 1, \dots, m$ do
2:   Select problem instance $i$ exactly $L$ times
3: end for
4: Run $l_1/l_2$ Lasso to recover support $\hat S = \bigcup_i \operatorname{supp}(\hat\beta_{i,L})$
5: for $i = 1, \dots, m$ do
6:   Select problem instance $i$ exactly $s+1$ times
7:   Compute its OLS estimates $\hat\beta_{i,m(L+s+1)}$ and $\hat\sigma^2_{i,m(L+s+1)}$ with respect to the dimensions in $\hat S$
8: end for
9: for steps $t = m(L+s+1)+1, \dots, n$ do
10:   for problem instance $1 \le i \le m$ do
11:     Compute score based on $\hat S$ dimensions only:
        $$s_{i,t-1} = \frac{\hat\sigma^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}\operatorname{Tr}\big(\Sigma\hat\Sigma^{-1}_{i,t-1}\big)$$
12:   end for
13:   Select problem instance $I_t = \arg\max_{i\in[m]} s_{i,t-1}$
14:   Observe $X_t$ and $Y_{I_t,t}$
15:   Update OLS estimators $\hat\beta_{I_t,t}$ and $\hat\sigma^2_{I_t,t}$ based on $\hat S$
16: end for
17: Return RLS estimates $\{\hat\beta^\lambda_i\}_{i=1}^m$, with $\hat\beta^\lambda_{ij} = 0$ if $j \notin \hat S$

Given our pure exploration perspective, it is obviously more efficient to learn the true supports as soon as possible. That way we can adjust our behavior by collecting the right data based on our initial findings. Note that this is not always the case; for example, if the total number of pulls is unknown, then it is not clear what is the right amount of budget to invest upfront to recover the supports (see tracking algorithms and the doubling trick).

We briefly describe Algorithm 7 in words. First, in the recovery stage we sequentially pull all arms a number of times, say $L$ times. We do not take into account the context, and just apply a round-robin technique to pull each arm exactly $L$ times. In total, there are exactly $s$ components that are non-zero for at least one arm (out of $d$). After the $Lm$ pulls, we use a block-regularized Lasso algorithm to recover the joint sparsity pattern. We discuss some of the alternatives later. The outcome of this stage is a common support $\hat S := \bigcup_i \operatorname{supp}(\hat\beta_i)$. With high probability we recover the true support, $\hat S = S$. In the second stage, or pure exploration stage, the original Trace-UCB algorithm is applied. The Trace-UCB algorithm works by computing an estimate $\hat\sigma^2_{i,t}$ at each step $t$ for each arm $i$. Then, it pulls the arm maximizing the score
$$s_{i,t-1} = \frac{\hat\sigma^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}\operatorname{Tr}\big(\Sigma\hat\Sigma^{-1}_{i,t-1}\big).$$


The key observation is that in the second stage we only consider the components of each context that are in $\hat S$. In particular, we start by pulling each arm $s+1$ times so that we can compute the initial OLS estimates $\hat\beta^{\mathrm{OLS}}_i$ and $\hat\sigma^2_i$. We keep updating those estimates when an arm is pulled, and the trace is computed with respect to the components in $\hat S$ only. Finally, we return the ridge estimates based only on the data collected in the second stage.
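A compact sketch of this two-stage flow is given below. It is only an illustration of the structure just described, not the implementation evaluated in the thesis: the support-recovery step is a simple OLS-based ranking stand-in for the $l_1/l_2$ block-regularized Lasso, $\Sigma$ is assumed to be the identity on the recovered coordinates, the confidence width $\Delta_{i,t}$ is a placeholder, and the function names are hypothetical.

```python
import numpy as np

def recover_support(X_list, Y_list, s):
    """Stage 1 stand-in: rank features by their squared OLS weight summed across arms.
    A real implementation would solve the l1/l2 block-regularized Lasso of (C.85) instead."""
    d = X_list[0].shape[1]
    rows = np.zeros(d)
    for X, Y in zip(X_list, Y_list):
        b = np.linalg.lstsq(X, Y, rcond=None)[0]
        rows += b ** 2
    return np.sort(np.argsort(rows)[-s:])          # indices of the s heaviest rows

def trace_ucb_stage(arms, S, n_rounds, delta=0.05):
    """Stage 2: Trace-UCB restricted to the coordinates in S (Sigma = identity assumed)."""
    m, s = len(arms), len(S)
    data = [([], []) for _ in range(m)]
    for i, sample in enumerate(arms):              # s + 1 initialization pulls per arm
        for _ in range(s + 1):
            x, y = sample()
            data[i][0].append(x[S]); data[i][1].append(y)
    for _ in range(n_rounds - m * (s + 1)):
        scores = []
        for X_rows, Y_rows in data:
            X, Y = np.array(X_rows), np.array(Y_rows)
            k = len(Y)
            resid = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
            var_hat = resid @ resid / max(k - s, 1)
            delta_it = var_hat * np.sqrt(2 * np.log(1 / delta) / k)   # stand-in CI width
            scores.append((var_hat + delta_it) / k *
                          np.trace(np.linalg.inv(X.T @ X / k)))       # Tr(Sigma Sigma_i^-1), Sigma = I
        j = int(np.argmax(scores))
        x, y = arms[j]()
        data[j][0].append(x[S]); data[j][1].append(y)
    return [len(Y_rows) for _, Y_rows in data]     # final number of pulls per arm
```

Run on synthetic arms, the returned allocation should concentrate pulls on the noisier instances, in line with the optimal static allocation $k_i \propto \sigma^2_i$.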

C.6.2 A note on the Static Allocation

What is the optimal static performance in this setting if the $\sigma^2$'s are known? For simplicity, suppose we pull arm $i$ exactly $(\sigma^2_i/\sum_j\sigma^2_j)\,n$ times. We are interested in Lasso guarantees for $\|X^T(\hat\beta_i - \beta_i)\|^2_2$. Note that in this case we can actually set $\lambda_i$ as a function of $\sigma^2_i$, as required in most Lasso analyses, because $\sigma^2_i$ is known.

A common guarantee is as follows (see [38, 72]). With high probability,
$$\|\hat\beta_i - \beta_i\|^2_2 \le \frac{c^2\sigma^2_i}{\gamma^2}\,\frac{\tau s\log d}{k},$$
where $k$ is the number of observations, $d$ the ambient dimension, $s$ the efficient dimension, $\gamma$ the restricted eigenvalue constant for $\Sigma$, $\tau > 2$ the parameter that tunes the probability bound, and $c$ a universal constant.

Thus, if we set $k = (\sigma^2_i/\sum_j\sigma^2_j)\,n$, then we obtain that, with high probability,
$$\|\hat\beta_i - \beta_i\|^2_2 \le \frac{c^2\tau}{\gamma^2}\left(\sum_{j=1}^m\sigma^2_j\right)\frac{s\log d}{n}. \tag{C.83}$$
Note that the latter event is independent across different $i\in[m]$, so all of them simultaneously hold with high probability. The term $\gamma^{-2}$ was expected, as depending on the correlation levels in $\Sigma$ the problem can be easier or harder. In addition, note that as $\|\hat\beta_i - \beta_i\|^2_\Sigma = \operatorname{Tr}\big(\Sigma(\hat\beta_i - \beta_i)(\hat\beta_i - \beta_i)^T\big)$, we have that
$$\lambda_{\min}(\Sigma)\,\|\hat\beta_i - \beta_i\|^2_2 \le \|\hat\beta_i - \beta_i\|^2_\Sigma \le \lambda_{\max}(\Sigma)\,\|\hat\beta_i - \beta_i\|^2_2. \tag{C.84}$$
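For concreteness, the short sketch below evaluates this known-variance static allocation and the common right-hand side of (C.83); all numeric values (noise levels, budget, sparsity, constants) are placeholders.

```python
import numpy as np

# Static allocation when the noise levels are known: k_i proportional to sigma_i^2.
# All numbers below are illustrative placeholders.
sigma2 = np.array([0.5, 1.0, 2.5])       # per-arm noise variances sigma_i^2
n, s, d = 6000, 10, 500                   # budget, sparsity, ambient dimension
c, tau, gamma = 1.0, 2.5, 0.5             # constants appearing in (C.83)

k = np.round(n * sigma2 / sigma2.sum()).astype(int)                  # pulls per arm
rate = (c**2 * tau / gamma**2) * sigma2.sum() * s * np.log(d) / n    # RHS of (C.83)

print("allocation k_i:", k)               # noisier arms receive more samples
print("common error bound (C.83):", rate)
```

By construction, the right-hand side of (C.83) is identical across arms, which is exactly the error-balancing behaviour the static allocation is designed to achieve.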

C.6.3 Simultaneous Support Recovery

There has been a large amount of research on how to perform simultaneous support recovery in sparse settings for multiple regressions. Let $\mathbf{M}$ be the matrix whose $i$-th column is $\mathbf{M}^{(i)} = \beta_i$. A common objective function after $k$ observations per problem is
$$\min_{\bar{\mathbf{M}}\in\mathbb{R}^{d\times m}}\ \frac{1}{k}\sum_{j=1}^m\|Y_j - X_j\bar{\mathbf{M}}^{(j)}\|^2 + \lambda\,\|\bar{\mathbf{M}}\|, \tag{C.85}$$
where we assumed $Y_j = X_j\beta_j + \epsilon_j$, with $X_j\in\mathbb{R}^{k\times d}$, $Y_j, \epsilon_j\in\mathbb{R}^k$ and $\beta_j\in\mathbb{R}^d$. The $l_a/l_b$ block regularization norm is
$$\|\bar{\mathbf{M}}\|_{l_a/l_b} = \|v\|_a, \quad \text{where } v_j = \|\bar{\mathbf{M}}_j\|_b \text{ and } \bar{\mathbf{M}}_j \text{ is the } j\text{-th row of } \bar{\mathbf{M}}. \tag{C.86}$$
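As a small illustration of (C.86), the helper below (a hypothetical `block_norm` utility) computes $\|\bar{\mathbf{M}}\|_{l_a/l_b}$ for a coefficient matrix whose rows index features and whose columns index the $m$ regression problems; with $a = 1$, $b = 2$ it is the multivariate group-Lasso penalty used in the support-recovery stage, while $b = \infty$ gives the $l_1/l_\infty$ variant of [67].

```python
import numpy as np

def block_norm(M, a=1, b=2):
    """l_a / l_b block norm of M: take the l_b norm of every row, then the l_a norm
    of the resulting vector. Rows index features, columns index regression problems."""
    row_norms = np.linalg.norm(M, ord=b, axis=1)
    return np.linalg.norm(row_norms, ord=a)

# Jointly 2-sparse example: only rows 0 and 3 are active across both problems.
M = np.zeros((6, 2))
M[0] = [1.0, -0.5]
M[3] = [0.0, 2.0]
print(block_norm(M, a=1, b=2))        # l1/l2 penalty: sum of the two active row norms
print(block_norm(M, a=1, b=np.inf))   # l1/linf variant
```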

There are a few differences among the most popular pieces of work.

Negahban and Wainwright [67] consider random Gaussian designs $X_j \sim \mathcal{N}(0,\Sigma_j)$ with random Gaussian noise (and common variance). The regularization norm is $l_1/l_\infty$. In words, they take the sum of the absolute values of the maximum element per row in $\bar{\mathbf{M}}$. This forces sparsity (via the $l_1$ norm), but once a row is selected there is no penalty in increasing the $\beta$ components up to the current maximum of the row. They tune $\lambda$ as in the standard analysis of the Lasso, that is, proportionally to $\sigma^2$, which is unknown in our case. Results are non-asymptotic, and recovery happens with high probability when the number of observations is $k > Cs(m + \log d)$. They show that if the overlap is not large enough ($2/3$ of the support, for $m = 2$ regression problems), then running independent Lasso estimates has higher statistical efficiency. We can actually directly use the results in [67] if we assume an upper bound $\sigma^2_{\max} \le R$ is known.

Obozinski, Wainwright and Jordan [68] use $l_1/l_2$ block regularization (a.k.a. multivariate group Lasso). Their design is random Gaussian, but it is fixed across regressions: $X_j = X$. They provide asymptotic guarantees under the scaling $k, d, s \to \infty$, $d - s \to \infty$, and standard assumptions like a bounded $\Sigma$-eigenspectrum, the irrepresentable condition, and self-incoherence. The first condition is not only required for support recovery, but also for $l_2$ consistency. The last two conditions are not required for risk consistency, while essential for support recovery. To capture the amount of non-zero pattern overlap among regressions, they define the sparsity overlap function $\psi$, and their sample requirements are a function of $\psi$. In particular, one needs $k > C\psi\log(d-s)$, where the constant $C$ depends on quantities related to the covariance matrix of the design matrices, and $\psi$ can be equal to $s/m$ if all the patterns overlap, and at most $s$ if they are disjoint.

Their theorems use a sequence of regularization parameters
$$\lambda_k = \sqrt{\frac{f(d)\log d}{k}}, \quad \text{where } f(d)\to\infty \text{ as } d\to\infty,$$
in such a way that $\lambda_k \to 0$ as $k, d \to \infty$. Finally, $k > 2s$ is also required. They also provide a two-stage algorithm for efficient estimation of individual supports for each regression problem. All these optimization problems are convex, and can be efficiently solved in general.

To overcome the issue of common designs (we do not pull each context several times), we use the results by Wang, Liang, and Xing in [95]. They extend the guarantees in [68] to the case where the design matrices are independently sampled for each regression problem. In order to formally present their result, we describe some assumptions. Let $\Sigma^{(i)}$ be the covariance matrix for the design observations of the $i$-th regression (in our case, they are all equal to $\Sigma$), and $S$ the union of the sparse supports across regressions.

• C1: There exists $\gamma\in(0,1]$ such that $\||A\||_\infty \le 1-\gamma$, where
$$A_{js} = \max_{1\le i\le m}\Big(\Sigma^{(i)}_{S^C S}\big(\Sigma^{(i)}_{SS}\big)^{-1}\Big)_{js}, \tag{C.87}$$
for $j\in S^C$ and $s\in S$.

• C2: There are constants $0 < C_{\min} \le C_{\max} < \infty$ such that the eigenvalues of all matrices $\Sigma^{(i)}$ are in $[C_{\min}, C_{\max}]$.

• C3: There exists a constant $D_{\max} < \infty$ such that
$$\max_{1\le i\le m}\Big\|\big|\big(\Sigma^{(i)}_{SS}\big)^{-1}\big|\Big\|_\infty \le D_{\max}. \tag{C.88}$$

• C4: Define the regularization parameter
$$\lambda_k = \sqrt{\frac{f(d)\log d}{k}}, \quad \text{where } f(d)\to\infty \text{ as } d\to\infty, \tag{C.89}$$
such that $\lambda_k\to 0$ as $k\to\infty$.

• C5: Define $\rho(k, s, \lambda_k)$ as
$$\rho(k, s, \lambda_k) := \sqrt{\frac{8\sigma^2_{\max}\,s\log s}{k\,C_{\min}}} + \lambda_k\left(D_{\max} + \frac{12s}{C_{\min}\sqrt{k}}\right), \tag{C.90}$$
and assume $\rho(k, s, \lambda_k)/b^*_{\min} = o(1)$, where $b^*_{\min} = \min_{j\in S}\|\mathbf{M}_j\|_2$.

We state the main theorem in [95]; $k$ is the number of observations per regression.

Theorem 52. Assume the parameters $(k, d, s, \mathbf{M}, \Sigma^{(1:m)})$ satisfy conditions C1 to C5. If, for some small constant $v > 0$,
$$k > 2(1+v)\,\psi\,\log(d-s)\,\frac{\rho_u\big(\Sigma^{(1:m)}_{S^C S^C\mid S}\big)}{\gamma^2}, \tag{C.91}$$
then the $l_1/l_2$ regularized problem given in (C.85) has a unique solution $\hat{\mathbf{M}}$, the support union $\operatorname{supp}(\hat{\mathbf{M}})$ equals the true support $S$, and $\|\hat{\mathbf{M}} - \mathbf{M}\|_{l_\infty/l_2} = o(b^*_{\min})$, with probability greater than
$$1 - m\exp(-c_0\log s) - \exp(-c_1\log(d-s)), \tag{C.92}$$
where $c_0$ and $c_1$ are constants.

The following proposition is also derived in [95] (Proposition 1):

Proposition 53. Assume $\Sigma^{(1:m)}$ satisfies C2; then $\psi$ is bounded by
$$\frac{s}{m\,C_{\min}} \le \psi\big(\mathbf{M}, \Sigma^{(1:m)}\big) \le \frac{s}{C_{\min}}. \tag{C.93}$$

For our purposes, there is a single $\Sigma$, which implies that we can remove the max expressions in C1 and C3. Corollary 2 in [95] establishes that when the supports are equal for all arms, the number of samples required per arm is reduced by a factor of $m$.

C.6.4 High-Dimensional Trace-UCB Guarantees

If the support overlap is complete, we can reduce the sampling complexity of the first stage by a factor of $m$; we only need
$$Lm > 2(1+v)\,\frac{s\log(d-s)\,\rho_u\big(\Sigma^{(1:m)}_{S^C S^C\mid S}\big)}{C_{\min}\,\gamma^2} \tag{C.94}$$
observations in total, for some small constant $v > 0$.

Now we show our main theorem for high-dimensional Trace-UCB, Theorem 20.

Proof. We start by assuming the recovered support $\hat S$ is equal to the true support $S$. This event, say $\mathcal{E}_S$, holds with probability at least $1-\delta_2$ by Theorem 52 when $L$ satisfies (C.94).

Then, we fix $\delta_1 > 0$, and run the second stage applying the Trace-UCB algorithm in the $s$-dimensional space given by the components in $S$. By Theorem 19, if $n - Lm \ge (s+1)m$, then, with probability at least $1-\delta_1$, the following holds:
$$\tilde L_n(\mathcal{A})_S \le \frac{\sum_j\sigma^2_j}{n-Lm}\left(s + 2\log\frac{3m}{\delta_1}\right) + \frac{2c}{\sqrt{\sigma^2_{\min}}}\left(\frac{s\sum_j\sigma^2_j}{n-Lm}\right)^{3/2} + o\left(\left(\frac{s}{n-Lm}\right)^{3/2}\right), \tag{C.95}$$
where $\tilde L_n(\mathcal{A})_S$ denotes the loss restricted to the components of $\beta$ that are in $S$ (and to $\Sigma_S$). However, under event $\mathcal{E}_S$, we recovered the true support, and our final estimates of $\beta_{ij}$ for each $j\notin S$ and arm $i$ will be equal to zero, which corresponds to their true value. Hence $\tilde L_n(\mathcal{A}) = \tilde L_n(\mathcal{A})_S$. We conclude that (C.95) holds with probability at least $1 - \delta_1 - \delta_2$.

One regime of interest is when $n = C_1 ms\log d \ll md$. In addition, let us assume complete support overlap across arms, so $\psi = s/(C_{\min}m)$. Then, we set the number of initial pulls per arm to be $L = C_2 s\log d$, with $C_1 > C_2$. In this case, we have that Corollary 51 holds.


Appendix D

Proofs Chapter 4

Lemma 54. Let $\mathcal{X} = \{X_i\in\mathbb{R}^d \mid i = 1, 2, \dots, m\}$ be a set of points in $\mathbb{R}^d$, and let $S\subseteq[m]$. Let $\lambda > 0$. The set function $f(S) = \log\det\big(\lambda I_d + X_S^T X_S\big) = \log\det\big(\lambda I_d + \sum_{i\in S}X_iX_i^T\big)$ is submodular.

Proof. Straightforwardly adapted from [85]. Fix $i\in[m]$ and, for $S\subseteq[m]$ such that $i\notin S$, define
$$f_i(S) = \log\det\Big(\lambda I_d + X_{S\cup\{i\}}^T X_{S\cup\{i\}}\Big) - \log\det\big(\lambda I_d + X_S^T X_S\big). \tag{D.1}$$
For any $i\in[m]$, let $S_1\subseteq S_2\subseteq[m]\setminus\{i\}$. We define $S(\gamma) = X_{S_1}^T X_{S_1} + \gamma\big(X_{S_2}^T X_{S_2} - X_{S_1}^T X_{S_1}\big)$ for $\gamma\in[0,1]$. Note that $S(0) = X_{S_1}^T X_{S_1}$ and $S(1) = X_{S_2}^T X_{S_2}$. Furthermore, we also define
$$f_i(S(\gamma)) = \log\det\big(\lambda I_d + S(\gamma) + X_iX_i^T\big) - \log\det\big(\lambda I_d + S(\gamma)\big). \tag{D.2}$$
We use the following equality: $(d/d\gamma)\log\det X(\gamma) = \operatorname{Tr}\big[X(\gamma)^{-1}(d/d\gamma)X(\gamma)\big]$, to compute
$$\frac{d}{d\gamma}f_i(S(\gamma)) = \frac{d}{d\gamma}\Big[\log\det\big(\lambda I_d + S(\gamma) + X_iX_i^T\big) - \log\det\big(\lambda I_d + S(\gamma)\big)\Big]$$
$$= \operatorname{Tr}\Big[\big(\lambda I_d + S(\gamma) + X_iX_i^T\big)^{-1}\big(X_{S_2}^T X_{S_2} - X_{S_1}^T X_{S_1}\big)\Big] - \operatorname{Tr}\Big[\big(\lambda I_d + S(\gamma)\big)^{-1}\big(X_{S_2}^T X_{S_2} - X_{S_1}^T X_{S_1}\big)\Big]$$
$$= \operatorname{Tr}\Big[\Big(\big(\lambda I_d + S(\gamma) + X_iX_i^T\big)^{-1} - \big(\lambda I_d + S(\gamma)\big)^{-1}\Big)\big(X_{S_2}^T X_{S_2} - X_{S_1}^T X_{S_1}\big)\Big] \le 0.$$
The last inequality is justified as follows. The second term in the trace is positive semidefinite, as
$$X_{S_2}^T X_{S_2} - X_{S_1}^T X_{S_1} = \sum_{j\in S_2,\, j\notin S_1} X_jX_j^T. \tag{D.3}$$
The first one is negative semidefinite, by the Sherman-Morrison formula:
$$\big(A + vv^T\big)^{-1} = A^{-1} - \frac{A^{-1}vv^TA^{-1}}{1 + v^TA^{-1}v}, \tag{D.4}$$
where we take $v = X_i$ and $A = \lambda I_d + S(\gamma)$. We conclude that the trace is non-positive, and the submodularity follows immediately as $f_i(S(\gamma))$ is a continuous function of $\gamma$ for all $i$.

Lemma 55. Let $\mathcal{X} = \{X_i\in\mathbb{R}^d \mid i = 1, 2, \dots, m\}$ be a set of points in $\mathbb{R}^d$, $\lambda > 0$, and $S\subseteq[m]$. The set function $f(S) = \log\det\big(\lambda I_d + X_S^T X_S\big) = \log\det\big(\lambda I_d + \sum_{i\in S}X_iX_i^T\big)$ is monotonically increasing.

Proof. Let $j\notin S$, and define $S' = S\cup\{j\}$. We have that $f(S') = \log\det\big(\lambda I_d + \sum_{i\in S}X_iX_i^T + X_jX_j^T\big)$. For positive semi-definite matrices $A, B$, we have that $\det(A+B)\ge\det(A)$. The result follows by taking $A = \lambda I_d + \sum_{i\in S}X_iX_i^T$ and $B = X_jX_j^T$, as the logarithm is monotone.
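Both properties are easy to spot-check numerically. The sketch below (random points and a small $\lambda$, both illustrative) verifies, over random chains $S_1\subseteq S_2$ and elements $i\notin S_2$, that the marginal gain of adding $i$ is non-negative (Lemma 55) and no larger at $S_2$ than at $S_1$ (Lemma 54).

```python
import numpy as np

def f(indices, X, lam):
    """log det(lam * I + X_S^T X_S) for the rows of X indexed by `indices`."""
    d = X.shape[1]
    XS = X[list(indices)]
    return np.linalg.slogdet(lam * np.eye(d) + XS.T @ XS)[1]

rng = np.random.default_rng(0)
m, d, lam = 12, 4, 0.1
X = rng.normal(size=(m, d))

for _ in range(1000):
    perm = rng.permutation(m)
    i = perm[0]
    S2 = set(perm[1:rng.integers(2, m)])           # S2 never contains i
    S1 = set(rng.choice(sorted(S2), size=rng.integers(0, len(S2) + 1), replace=False))
    gain1 = f(S1 | {i}, X, lam) - f(S1, X, lam)    # marginal gain at the smaller set
    gain2 = f(S2 | {i}, X, lam) - f(S2, X, lam)    # marginal gain at the larger set
    assert gain2 >= -1e-9                          # monotonicity (Lemma 55)
    assert gain1 >= gain2 - 1e-9                   # diminishing returns (Lemma 54)

print("submodularity and monotonicity checks passed")
```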


Bibliography

[1] Y. Abbasi-Yadkori, D. Pal, and Cs. Szepesvari. Improved algorithms for linear stochastic

bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit

problem. In Journal of Machine Learning Research: Workshop and Conference Proceedings

Vol 23, 2012.

[3] A. Antos, V. Grover, and Cs. Szepesvari. Active learning in multi-armed bandits. In Interna-

tional Conference on Algorithmic Learning Theory, pages 287–302, 2008.

[4] Jean-Yves Audibert and Sebastien Bubeck. Best arm identification in multi-armed bandits.

In COLT-23th Conference on Learning Theory-2010, pages 13–p, 2010.

[5] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.

[6] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In

Proceedings of the 23rd international conference on Machine learning, pages 65–72. ACM,

2006.

[7] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Jour-

nal of Computer and System Sciences, 75(1):78–89, 2009.

[8] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In

Learning Theory, pages 35–50. Springer, 2007.

[9] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under

log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.

[10] Nicholas M Ball and Robert J Brunner. Data mining and machine learning in astronomy.

International Journal of Modern Physics D, 19(07):1049–1106, 2010.


[11] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million

song dataset. Proceedings of the 12th International Conference on Music Information Retrieval

(ISMIR 2011), 2011.

[12] Alina Beygelzimer, Daniel J Hsu, John Langford, and Tong Zhang. Agnostic active learning

without constraints. In Advances in Neural Information Processing Systems, pages 199–207,

2010.

[13] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pages 258–265, 2013.

[14] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412:1832–1852, 2011.

[15] Wenbin Cai, Ya Zhang, and Jun Zhou. Maximizing expected model change for active learning

in regression. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages

51–60. IEEE, 2013.

[16] Stamatis Cambanis, Steel Huang, and Gordon Simons. On the theory of elliptically contoured

distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.

[17] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning

techniques for supply chain demand forecasting. European Journal of Operational Research,

184(3):1140–1154, 2008.

[18] A. Carpentier, A. Lazaric, M. Ghavamzadeh, R. Munos, and P. Auer. Upper-confidence-bound

algorithms for active learning in multi-armed bandits. In Algorithmic Learning Theory, pages

189–203. Springer, 2011.

[19] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances

in neural information processing systems, pages 2249–2257, 2011.

[20] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning.

Machine learning, 15(2):201–221, 1994.

[21] Dapeng Cui and David Curry. Prediction in marketing using the support vector machine.

Marketing Science, 24(4):595–615, 2005.

[22] D. Wiens and P. Li. V-optimal designs for heteroscedastic regression. Journal of Statistical Planning and Inference, 145:125–138, 2014.

[23] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Proceedings of

the 18th International Conference on Neural Information Processing Systems, pages 235–242.

MIT Press, 2005.


[24] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of

the 25th international conference on Machine learning, pages 208–215. ACM, 2008.

[25] Sanjoy Dasgupta, Claire Monteleoni, and Daniel J Hsu. A general agnostic active learning

algorithm. In Advances in neural information processing systems, pages 353–360, 2007.

[26] Michael O'Gordon Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002.

[27] Paul Embrechts, Claudia Kluppelberg, and Thomas Mikosch. Modelling extremal events, vol-

ume 33. Springer Science & Business Media, 1997.

[28] Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background

knowledge. Progress in Artificial Intelligence, pages 1–15, 2013.

[29] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using

the query by committee algorithm. Machine learning, 28(2):133–168, 1997.

[30] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification:

A unified approach to fixed budget and fixed confidence. In Advances in Neural Information

Processing Systems, pages 3212–3220, 2012.

[31] Aurelien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence.

In 29th Annual Conference on Learning Theory, pages 998–1027, 2016.

[32] Aurelien Garivier, Tor Lattimore, and Emilie Kaufmann. On explore-then-commit strategies.

In Advances in Neural Information Processing Systems, pages 784–792, 2016.

[33] Daniel Gayo Avello, Panagiotis T Metaxas, and Eni Mustafaraj. Limits of electoral predictions

using twitter. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social

Media. Association for the Advancement of Artificial Intelligence, 2011.

[34] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative

filtering algorithm. Information Retrieval, 4(2):133–151, 2001.

[35] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in

active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–

486, 2011.

[36] Rafael Gomez-Bombarelli, David Duvenaud, Jose Miguel Hernandez-Lobato, Jorge Aguilera-

Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alan Aspuru-Guzik. Automatic

chemical design using a data-driven continuous representation of molecules. arXiv preprint

arXiv:1610.02415, 2016.


[37] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings

of the 24th international conference on Machine learning, pages 353–360. ACM, 2007.

[38] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and

generalizations. CRC Press, 2015.

[39] Erik Hjelmas and Boon Kee Low. Face detection: A survey. Computer vision and image

understanding, 83(3):236–274, 2001.

[40] Ronald R Hocking. Methods and applications of linear models: regression and the analysis of

variance. John Wiley & Sons, 2013.

[41] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal

problems. Technometrics, 12(1):55–67, 1970.

[42] Matthew Hoffman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, pages 365–374, 2014.

[43] Daniel Hsu, Sham M Kakade, and Tong Zhang. An analysis of random design linear regression.

In arXiv:1106.2363 Preprint. Citeseer, 2011.

[44] Daniel Hsu and Sivan Sabato. Heavy-tailed regression with a generalized median-of-means.

In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages

37–45, 2014.

[45] Tadeusz Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Math-

ematical Statistics, 30(2):339–351, 2010.

[46] Antony Joseph. Variable selection in high-dimension with random designs and orthogonal

matching pursuit. Journal of Machine Learning Research, 14(1):1771–1800, 2013.

[47] Matti Kaariainen. Active learning in the non-realizable case. In International Conference on

Algorithmic Learning Theory, pages 63–77. Springer, 2006.

[48] Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD

thesis, University College London, 2003.

[49] Stavros A Karkanis, Dimitrios K Iakovidis, Dimitrios E Maroulis, Dimitris A. Karras, and

M Tzivras. Computer-aided tumor detection in endoscopic video using color wavelet features.

IEEE transactions on information technology in biomedicine, 7(3):141–152, 2003.

[50] Emilie Kaufmann, Nathaniel Korda, and Remi Munos. Thompson sampling: An asymp-

totically optimal finite-time analysis. In International Conference on Algorithmic Learning

Theory, pages 199–213. Springer, 2012.


[51] Abbas Kazerouni, Mohammad Ghavamzadeh, and Benjamin Van Roy. Conservative contextual

linear bandits. arXiv preprint arXiv:1611.06426, 2016.

[52] Douglas Kelker. Distribution theory of spherical distributions and a location-scale parameter

generalization. Sankhya: The Indian Journal of Statistics, Series A, pages 419–430, 1970.

[53] Hian Chye Koh, Gerald Tan, et al. Data mining applications in healthcare. Journal of

healthcare information management, 19(2):65, 2011.

[54] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learn-

ing. The Journal of Machine Learning Research, 11:2457–2485, 2010.

[55] J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial time. In Proceedings

of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.

[56] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Prac-

tical Approaches to Hard Problems, 3(19):8, 2012.

[57] Andreas Krause and Carlos Guestrin. Nonmyopic active learning of gaussian processes: an

exploration-exploitation approach. In Proceedings of the 24th international conference on

Machine learning, pages 449–456. ACM, 2007.

[58] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model

selection. Annals of Statistics, pages 1302–1338, 2000.

[59] Tian-Shyug Lee, Chih-Chou Chiu, Chi-Jie Lu, and I-Fei Chen. Credit scoring using the hybrid

neural discriminant technique. Expert Systems with applications, 23(3):245–254, 2002.

[60] M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.

[61] Ioanna Lykourentzou, Ioannis Giannoukos, Vassilis Nikolopoulos, George Mpardis, and Vassili

Loumos. Dropout prediction in e-learning courses through the combination of machine learning

techniques. Computers & Education, 53(3):950–965, 2009.

[62] Oded Maron and Andrew W Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.

[63] F. Maxwell Harper and J. Konstan. The movielens datasets: History and context. ACM

Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[64] Peter McCullagh. Generalized linear models. European Journal of Operational Research,

16(3):285–292, 1984.


[65] Jae H Min and Young-Chan Lee. Bankruptcy prediction using support vector machine with

optimal choice of kernel function parameters. Expert systems with applications, 28(4):603–614,

2005.

[66] Volodymyr Mnih, Csaba Szepesvari, and Jean-Yves Audibert. Empirical bernstein stopping. In

Proceedings of the 25th International Conference on Machine learning, pages 672–679. ACM,

2008.

[67] S. Negahban and M. Wainwright. Simultaneous support recovery in high dimensions: Benefits

and perils of block-regularization. IEEE Transactions on Information Theory, 57(6):3841–

3863, 2011.

[68] G. Obozinski, M. Wainwright, and M. Jordan. Support union recovery in high-dimensional

multivariate regression. The Annals of Statistics, pages 1–47, 2011.

[69] Kaare Brandt Petersen and IMM ISP. The matrix cookbook. Technical University of Denmark

7 (2008): 15., 2008.

[70] F. Pukelsheim. Optimal Design of Experiments. Classics in Applied Mathematics. Society for

Industrial and Applied Mathematics, 2006.

[71] Friedrich Pukelsheim. Optimal design of experiments, volume 50. siam, 1993.

[72] G. Raskutti, M. J Wainwright, and B. Yu. Restricted eigenvalue properties for correlated

gaussian designs. Journal of Machine Learning Research, 11(8):2241–2259, 2010.

[73] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the

click-through rate for new ads. In Proceedings of the 16th international conference on World

Wide Web, pages 521–530. ACM, 2007.

[74] C. Riquelme, R. Johari, and B. Zhang. Online active linear regression via thresholding.

arXiv:1602.02845, 2016.

[75] Carlos Riquelme, Mohammad Ghavamzadeh, and Alessandro Lazaric. Active learning for

accurate estimation of linear models. arXiv preprint arXiv:1703.00579, 2017.

[76] Dan Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling.

In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.

[77] Daniel Russo. Simple bayesian algorithms for best arm identification. arXiv preprint

arXiv:1602.08448, 2016.

[78] S. Sabato and R. Munos. Active regression by stratification. In Advances in Neural Information

Processing Systems, pages 469–477, 2014.


[79] Sivan Sabato and Remi Munos. Active regression by stratification. In Advances in Neural

Information Processing Systems, pages 469–477, 2014.

[80] Robert P Schumaker, Osama K Solieman, and Hsinchun Chen. Predictive modeling for sports

and gaming. In Sports Data Mining, pages 55–63. Springer, 2010.

[81] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-

66):11, 2010.

[82] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van

Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanc-

tot, et al. Mastering the game of go with deep neural networks and tree search. Nature,

529(7587):484–489, 2016.

[83] Marta Soare, Alessandro Lazaric, and Remi Munos. Best-arm identification in linear bandits.

In Advances in Neural Information Processing Systems, pages 828–836, 2014.

[84] Masashi Sugiyama and Shinichi Nakajima. Pool-based active learning in approximate linear

regression. Machine Learning, 75(3):249–274, 2009.

[85] Tyler H Summers, Fabrizio L Cortesi, and John Lygeros. On submodularity and controllability

in complex dynamical networks. IEEE Transactions on Control of Network Systems, 3(1):91–

101, 2016.

[86] William R Thompson. On the likelihood that one unknown probability exceeds another in

view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[87] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society. Series B (Methodological), pages 267–288, 1996.

[88] J Tropp and Anna C Gilbert. Signal recovery from partial information via orthogonal matching

pursuit, 2005.

[89] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media,

2013.

[90] Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural

networks, 10(5):988–999, 1999.

[91] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv

preprint arXiv:1011.3027, 2010.

[92] Paul Viola, Michael J Jones, and Daniel Snow. Detecting pedestrians using patterns of motion

and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.


[93] M. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Draft, 2015.

[94] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.

[95] W. Wang, Y. Liang, and E. Xing. Block regularized lasso for multivariate multi-response linear

regression. In AISTATS, 2013.

[96] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds

under tsybakov noise condition. arXiv preprint arXiv:1406.5383, 2014.

[97] Brady T West, Andrzej T Galecki, and Kathleen B Welch. Linear mixed models. CRC Press,

2014.

[98] Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvari. Conservative bandits. stat, 1050:13, 2016.

[99] Qing-Hai Ye, Lun-Xiu Qin, Marshonna Forgues, Ping He, Jin Woo Kim, Amy C Peng, Richard

Simon, Yan Li, Ana I Robles, Yidong Chen, et al. Predicting hepatitis b virus–positive

metastatic hepatocellular carcinomas using gene expression profiling and supervised machine

learning. Nature medicine, 9(4):416–423, 2003.

[100] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learn-

ing. In Advances in Neural Information Processing Systems, pages 442–450, 2014.

[101] Guoqiang Zhang, Michael Y Hu, B Eddy Patuwo, and Daniel C Indro. Artificial neural

networks in bankruptcy prediction: General framework and cross-validation analysis. European

journal of operational research, 116(1):16–32, 1999.

[102] Huaifeng Zhang, Wenjing Jia, Xiangjian He, and Qiang Wu. Learning-based license plate

detection using global and local features. In Pattern Recognition, 2006. ICPR 2006. 18th

International Conference on, volume 2, pages 1102–1105. IEEE, 2006.

[103] Guangyu Zhu, Yefeng Zheng, David Doermann, and Stefan Jaeger. Multi-scale structural

saliency for signature detection. In Computer Vision and Pattern Recognition, 2007. CVPR’07.

IEEE Conference on, pages 1–8. IEEE, 2007.