
SCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED

MODELS WITH CROSSED RANDOM EFFECTS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF STATISTICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Katelyn Gao

June 2017


Abstract

With modern electronic activity, large crossed data sets are increasingly common, with factors such as users and items. It is often appropriate to model them with crossed random effects, since specific levels are temporary. Their size poses challenges for statistical analysis. For such large data sets, the computational costs of estimation and inference (time, space, and communication) should grow at most linearly with the sample size, and the algorithms should be parallelizable.

Both traditional maximum likelihood estimation and numerous Markov chain Monte Carlo Bayesian algorithms take superlinear time to obtain good parameter estimates in the simple two-factor crossed random effects model. We propose moment-based, parallelizable algorithms that, with at most linear cost, estimate variance components and measure the uncertainties of those estimates. These estimates are consistent and, as proven using a martingale central limit theorem decomposition of U-statistics, asymptotically Gaussian. When run on simulated normally distributed data, our algorithm performs competitively with maximum likelihood methods.

Maximum likelihood and Bayesian approaches run into similar challenges for the linear mixed model with crossed random effects. We propose a method of moments based algorithm that scales linearly and can be easily parallelized, at the expense of some loss of statistical efficiency. This algorithm is proven to give estimates that are consistent and asymptotically Gaussian. We apply the algorithm to data from Stitch Fix where the crossed random effects correspond to clients and items. The random effects analysis is able to account for the increased variance due to intra-client and intra-item correlations in the data. Ignoring that correlation structure can lead to standard errors underestimated by more than 10-fold for that data. As in the crossed random effects model, when run on simulated normally distributed data, our algorithm performs nearly as well as maximum likelihood.


Preface and Acknowledgements

Since the beginning of my PhD, I have been interested in statistical methods for massive data sets. I was excited when Brad Klingenberg from Stitch Fix, an online personal styling service, suggested this project to my advisor Art Owen and me. He said that the data analytics team at Stitch Fix wanted to fit mixed models to their data sets, but existing methodology and software could not scale to data sets with millions of observations. Because such large data sets are common in e-commerce, he wanted to see whether there was an efficient and scalable way to fit mixed models and perform inference for the resulting estimates.

My deepest gratitude goes to my advisor, Prof. Art Owen. He introduced this project to me and provided invaluable guidance and support over the past four years. Moreover, he taught me how to think about statistical problems, and being part of his research group exposed me to many fascinating problems and areas of study. He was unfailingly kind, encouraging, and generous with his time.

Many thanks go to the members of my reading committee, Prof. Robert Tibshirani and Prof. Lester Mackey, and the other members of my oral committee, Prof. Jerome Friedman and Prof. Sharad Goel, who have provided valuable suggestions for the completion of this thesis. I would like to thank Prof. Norm Matloff from UC Davis and Prof. Trevor Hastie for fruitful discussions during the course of this project. Lastly, I would also like to thank Brad Klingenberg and Stitch Fix for providing me with the motivation and data for this project.

I would like to express my appreciation to the faculty, staff, and other students of the Stanford statistics department, who have made the past five years very enjoyable. I am especially grateful that I have had the opportunity to learn from some of the best statisticians in the world. Finally, this work was supported under National Science Foundation grants DGE-114747 and DMS-1407397.

I dedicate this thesis to my parents, who have made many sacrifices so that I could have the best education possible.

Contents

Abstract
Preface and Acknowledgements

1 Introduction
  1.1 Motivation and Problem Overview
  1.2 Data Model
    1.2.1 Goals of Analysis
  1.3 Outline

2 Crossed Random Effects Model
  2.1 Model Description and Our Contributions
    2.1.1 Notation and Asymptotic Regime
  2.2 Previous Work
    2.2.1 Moment-based Algorithms
  2.3 Bayesian Analysis via MCMC
    2.3.1 Implications from Probability Theory
    2.3.2 Simulation Results
  2.4 Method of Moments Estimation
    2.4.1 U-statistics for variance components
    2.4.2 Estimator Variances
    2.4.3 Asymptotic Normality of Estimates
  2.5 Experimental Results
    2.5.1 Simulations
    2.5.2 Examples from Social Media

3 Linear Mixed Model with Crossed Random Effects
  3.1 Additional Notation
  3.2 Previous Work
    3.2.1 Moment-based Approaches
    3.2.2 Likelihood-based Approaches
    3.2.3 Challenges of Likelihood-based Approaches
  3.3 Alternating Algorithm
    3.3.1 Estimating Variance Components
    3.3.2 Scalable and More Efficient Regression Coefficient Estimators
  3.4 Theoretical Properties of Estimates
    3.4.1 Consistency
    3.4.2 Asymptotic Normality
  3.5 Experimental Results
    3.5.1 Simulations
    3.5.2 An Example from E-commerce

4 Conclusions
  4.1 Practical Considerations
  4.2 Summary
  4.3 Future Work
    4.3.1 Heteroscedasticity
    4.3.2 Factor Interactions
    4.3.3 Generalized Linear Models

A Crossed Random Effects Model
  A.1 Bayesian Analysis via MCMC
    A.1.1 Proof of Theorem 2.3.1
    A.1.2 Simulation results
  A.2 Expectation and Variance of Moment Estimates
    A.2.1 Weighted U-statistics
    A.2.2 Expected Values
    A.2.3 Variances
    A.2.4 Variance of U_a
    A.2.5 Variance of U_e
    A.2.6 Covariance of U_a and U_b
    A.2.7 Covariance of U_a and U_e
    A.2.8 Estimating Kurtoses
  A.3 Asymptotic Normality of Moment Estimates
    A.3.1 Proof of Theorem 2.4.3
    A.3.2 Asymptotic approximation: proof of Theorem 2.4.4
    A.3.3 Proof of Lemma 2.4.2
    A.3.4 Proof of Theorem 2.4.6

B Linear Mixed Model with Crossed Random Effects
  B.1 Estimator Efficiencies
  B.2 A Useful Lemma
  B.3 Consistency of Estimates
    B.3.1 Proof of Theorem 3.4.1
    B.3.2 Proof of Theorem 3.4.2
    B.3.3 Proof of Theorem 3.4.3
  B.4 Asymptotic Normality of Estimates
    B.4.1 Proof of Theorem 3.4.4
    B.4.2 Proof of Lemma 3.4.1
    B.4.3 Proof of Theorem 3.4.5

List of Tables

2.1 Summary of simulation results for cases with R = C = 1000. The first row gives CPU time in seconds. The next four rows give median estimates of the four parameters. The last four rows give the number of lags required to get an autocorrelation below 0.5.
3.1 Stitch Fix Regression Results
A.1 Median CPU time in seconds.
A.2 Median estimates of µ and the number of lags after which ACF(µ) ≤ 0.5.
A.3 Median estimates of σ_A² and the number of lags after which ACF(σ_A²) ≤ 0.5.
A.4 Median estimates of σ_B² and the number of lags after which ACF(σ_B²) ≤ 0.5.
A.5 Median estimates of σ_E² and the number of lags after which ACF(σ_E²) ≤ 0.5.

List of Figures

2.1 Schematic of our algorithm. The expressions in the smallest boxes are the values computed at each step.
2.2 Simulation results: log-log plots of the five recorded measurements against the number of observations.
3.1 For p = 5, log-log plot of the computation time per iteration for maximum likelihood.
3.2 For each value of p, log-log plots of the computation times of the two algorithms against N.
3.3 For each value of p, log-log plots of the (scaled by p) mean squared error of β against N.
3.4 For each value of p, log-log plots of the mean squared error of σ_A² against N.
3.5 For each value of p, log-log plots of the mean squared error of σ_B² against N.
3.6 For each value of p, log-log plots of the mean squared error of σ_E² against N.
3.7 For each subset of data, estimates and confidence intervals for Bohemian-related variables of interest.


Chapter 1

Introduction

Driven by the growth of the Internet over the last twenty years, the amount of data available in the world has grown exponentially. According to a BBC article from 2014 [48], 2.5 billion gigabytes of data were being generated every day in 2012; that number can only have increased since then. Numerous hardware platforms, software frameworks, and algorithms have been devised to deal with these massive data sets. Graphics processing units (GPUs) have been very successful at processing image data in parallel and have become the default for many deep learning applications. Frameworks such as Hadoop [2] and Spark [3] facilitate large-scale computing over clusters of machines. For data analysis, streaming algorithms, which require only a few passes over the data set at hand, have been invented.

In the field of statistics, we are primarily concerned with the algorithmic (analytic) aspect. That is, how can we construct statistical procedures that scale to such massive data sets and still have theoretical guarantees, enabling us to efficiently carry out the traditional statistical tasks of estimation, inference, and prediction?

Because the amount of available data often outstrips the available computing power, as was the case over a hundred years ago when most computations had to be done by hand, many old statistical techniques have become popular again. In this thesis, we apply two such techniques, the method of moments and least squares, to the analysis of massive data sets from e-commerce. Details of our motivation and statistical problem are given next.

1.1 Motivation and Problem Overview

The Internet has enabled transactions that once required face-to-face interaction to be made in front of a computer screen. People can now purchase anything, including food, with a mouse click, and have access to all kinds of information. Every time they enter into an online transaction or visit a website, data about them is collected, such as their IP address, queries, pages visited, items purchased, and total time spent.


The data sets are massive, easily having millions of observations. More importantly for us, they have an unbalanced crossed factor structure. The factors are users, queries, URLs, items, etc. At the intersection of levels for each factor, data of interest and auxiliary data are collected. Yet these electronic data sets are sparse in the sense that we only have observations at a small fraction of the intersections of levels for each factor.

The inspiration for our project originated from e-commerce, in particular Stitch Fix [1], a personal styling service for clothing and accessories primarily geared towards women. Clients answer questions about their style preferences, stylists pick out items to send to the clients, and the clients choose to keep or return the items. Stitch Fix collects data on clients' opinions of the items they received, as well as information about the clients and items themselves. The factors are clients and items; an example of data of interest is the client's rating of an item, and examples of auxiliary data are item material and the client's style preferences. The analysis goal is to understand which characteristics of items or users influence customer satisfaction. This would help to inform future selections by stylists, as they would only send items that they believe the client has a fairly high probability of keeping. For instance, if certain materials are significantly negatively associated with the rating, stylists should be more selective about sending clothes made with those materials.

As in the above example, in this thesis we focus on understanding how the auxiliary data affect the data of interest. For convenience, we also assume that the data of interest are continuous. We choose to model the factors as random and not fixed, even though fixed factors are in some sense easier to deal with, for two reasons. First, the goal is to discover behavior patterns of the entire population and not of a specific person. Second, in these electronic data sets the factor levels are evanescent, making random factors more realistic: clothing items rise and fade in popularity, clients turn over, and URLs are created and destroyed.

Therefore, we describe the relationship between the data of interest and the auxiliary data using a linear mixed model, where each random effect corresponds to a factor.

1.2 Data Model

For simplicity, we assume, as in the Stitch Fix example, that there are only two factors. We model the relationship between the data of interest and the auxiliary data as follows:

Model 1. Two-factor linear mixed model

$$
\begin{aligned}
&Y_{ij} = x_{ij}^{\mathsf T}\beta + a_i + b_j + e_{ij},\quad x_{ij}\in\mathbb{R}^p,\ i,j\in\mathbb{N},\quad\text{with}\\
&a_i \overset{\mathrm{iid}}{\sim} (0,\sigma_A^2),\quad b_j \overset{\mathrm{iid}}{\sim} (0,\sigma_B^2),\quad e_{ij} \overset{\mathrm{iid}}{\sim} (0,\sigma_E^2)\ \text{(independently)},\quad\text{and}\\
&\mathbb{E}(a_i^4)<\infty,\quad \mathbb{E}(b_j^4)<\infty,\quad \mathbb{E}(e_{ij}^4)<\infty,
\end{aligned}\tag{1.1}
$$


where i indicates the level of the first factor and j the level of the second factor.

For convenience, let us call the i's rows and the j's columns. Y_ij is the data of interest, and x_ij is the auxiliary data, which contains p predictors. Note that we can have at most one observation in each cell (i, j). The random effect at row i is a_i, the random effect at column j is b_j, and the noise is e_ij. For example, for Stitch Fix, a_i could be the effect of a particular client and b_j the effect of a specific item.

The variance of the random effect of the first factor is σ_A², that of the random effect of the second factor is σ_B², and that of the noise is σ_E². Because the random effects have finite fourth moments, we can also define the excess kurtoses of the random effects and noise as κ_A, κ_B, and κ_E. Most importantly, we do not make any parametric assumptions about the random effects or noise. This is relevant for Internet applications, in which non-normal data distributions often appear.
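For concreteness, here is a minimal Python sketch that simulates data from model (1.1); the dimensions, the 5% sampling rate, and the centered-exponential noise are illustrative assumptions, chosen to emphasize that only zero means and finite fourth moments are required, not normality.

```python
import numpy as np

rng = np.random.default_rng(0)
R, C, p = 300, 200, 5                     # rows (e.g. clients), columns (e.g. items), predictors
beta = rng.normal(size=p)                 # regression coefficients

a = rng.normal(0.0, np.sqrt(2.0), R)      # row effects,    sigma_A^2 = 2
b = rng.normal(0.0, np.sqrt(0.5), C)      # column effects, sigma_B^2 = 0.5

# observe each cell (i, j) independently with probability 0.05 (missing at random)
Zi, Zj = np.nonzero(rng.random((R, C)) < 0.05)
N = Zi.size

x = rng.normal(size=(N, p))               # auxiliary data x_ij
e = rng.exponential(1.0, N) - 1.0         # non-Gaussian noise: mean 0, sigma_E^2 = 1
Y = x @ beta + a[Zi] + b[Zj] + e          # model (1.1)
print(N, "observations on", np.unique(Zi).size, "rows and", np.unique(Zj).size, "columns")
```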

In practice, we only observe Y_ij and x_ij in N cells (i, j), where 1 ≤ N < ∞. However, as we expect future data to bring hitherto unseen levels, a countable index set is appropriate. We consider the regime where p is much smaller than N and does not increase with it. Let R denote the number of distinct i for which we observe data and C the number of distinct j for which we observe data. The unseen (Y_ij, x_ij) are taken to be missing at random, and our results are conditional on the observed responses Y_ij and predictors x_ij, that is, those for which Z_ij = 1 (the observation indicator defined in Section 2.1.1).

In this thesis, we assume that the missingness pattern is not informative. But in many applications the observed values are likely to differ in some way from the missing values. For instance, in movie ratings data people may be more likely to watch and rate movies they believe they will like, and so missing ratings could be lower on average than observed ones. In general, the observed ratings may have both high and low values oversampled relative to middling values.

From observed values alone we cannot tell how different the missing values would be. To do so requires making untestable assumptions about the missingness mechanism. Even in cases where follow-up sampling can be done, e.g., giving some users incentives to make additional ratings, there will still be difficulties such as users refusing to make those ratings or, if forced, making inaccurate ratings. Methods to adjust for missingness have to be designed on a case-by-case basis, using whatever additional data and assumptions can be brought to bear.

We end with a brief discussion of how such massive data sets with crossed factor structure are handled. There are various possible data storage models. We consider the log-file model, with a collection of (i, j, Y_ij, x_ij) records, which for the purposes of this thesis we assume are stored at the same location. A pass over the data proceeds via an iteration over all (i, j, Y_ij, x_ij) records in the data set. Such a pass may generate intermediate values that we assume can be retained for further computations.


1.2.1 Goals of Analysis

As hinted in Section 1.1, in this thesis we focus on the problems of parameter estimation and inference. Because we are dealing with massive data sets (at least on the order of a million observations), we would like to design algorithms that are computationally efficient and scale to big data.

More concretely, with reference to model (1), we would like to estimate β, σ_A², σ_B², and σ_E² and construct confidence intervals for those estimates, using easily parallelizable algorithms that require at most O(N) computation and O(N) space. Ideally, these algorithms should require only a few passes over the data and should also be simple to analyze.

To guarantee good statistical behavior, the estimates should be consistent. To construct confidence intervals, we need to find the variances of the estimates and prove that the estimates are asymptotically normal. Unfortunately, the exact variances of our estimates cannot be computed in O(N) time. Therefore, we may either compute the asymptotic variances of the estimates or a somewhat conservative approximation to them. Note that because of the crossed factor structure, the usual definition of asymptotics as N → ∞ is not enough; we also need to control how the observed data are spread over the levels of each factor. A detailed explanation of the asymptotic regime used in this thesis is given in Section 2.4.3.

1.3 Outline

The remainder of this thesis is organized as follows. We consider two models of increasing complexity.

In Chapter 2, we study the simplest linear model with crossed random effects, the crossed random effects model with two random factors; in this case, there is only an intercept. We concentrate on the tasks of estimation and inference for the variance components. First, we review some classical estimators. Next, we show through theory and simulations that a Bayesian approach is not scalable. By modifying one of the classical estimators, we obtain our proposed method of moments estimator, which can be computed using linear computation time and space and whose exact variance can be found. Indeed, in linear computation time and space, we can also find a conservative approximation to the variance of the estimates. In addition, we prove that the estimates are asymptotically normal. Lastly, we illustrate the behavior of the estimates on simulated data and several large data sets from social media.

In Chapter 3, we extend the crossed random effects model to allow for arbitrary predictors, obtaining model (1). We start by reviewing previous literature, which has primarily focused on likelihood-based methods that are expensive for massive linear mixed models with crossed random effects. Then, we propose an algorithm that alternates between estimating the regression coefficients and the variance components, using only linear computation time and space. We show that the resulting estimates are consistent and asymptotically normal. Moreover, the variance of the estimated regression coefficients can be found using linear computation time and space, enabling us to efficiently perform inference. Using simulated data, we compare the quality and computational cost of our estimates to those obtained by maximum likelihood. Then, we apply the alternating algorithm to a large data set from e-commerce, containing ratings of clothing by customers, along with information about the particular customer or piece of clothing.

Finally, in Chapter 4, we discuss some practical recommendations for applying our methods to real-world data. We then summarize the thesis and briefly discuss several possible directions for future work. All proofs are included in the appendices.

The bulk of Chapters 2 and 3 and their accompanying proofs are taken from two papers [13, 14] written by me and my advisor Art Owen. We collaborated on the algorithm design and analysis, and I performed the experiments. In this thesis, I have included some additional literature review and analysis.


Chapter 2

Crossed Random Effects Model

In this chapter, we start with the simplest version of model (1), in which there is only an intercept instead of arbitrary auxiliary data. It is still able to provide us with challenges and insights into the more general model.

2.1 Model Description and Our Contributions

We consider the following model for the data of interest:

Model 2. Two-factor crossed random effects model

$$
\begin{aligned}
&Y_{ij} = \mu + a_i + b_j + e_{ij},\quad i,j\in\mathbb{N},\quad\text{with}\\
&a_i \overset{\mathrm{iid}}{\sim} (0,\sigma_A^2),\quad b_j \overset{\mathrm{iid}}{\sim} (0,\sigma_B^2),\quad e_{ij} \overset{\mathrm{iid}}{\sim} (0,\sigma_E^2)\ \text{(independently)},\quad\text{and}\\
&\kappa_A<\infty,\quad \kappa_B<\infty,\quad \kappa_E<\infty,
\end{aligned}\tag{2.1}
$$

where µ is the intercept (global mean) and the other quantities are defined as in Section 1.2. The assumptions about the patterns of observation discussed in that section also hold.

For this model, the parameters of interest are µ, σ_A², σ_B², and σ_E². However, in this chapter we restrict our attention to the variance components σ_A², σ_B², and σ_E², as estimating µ has been well studied in the literature. For instance, a simple estimator is the sample mean, whose value and exact variance can be computed in O(N) time [37]. In addition, when we return to model (1) in Chapter 3, our discussion and results about the estimation of β will also apply to the estimation of µ in this model. Before presenting an algorithm that has the properties desired in Section 1.2.1, we argue that a Bayesian approach, in which we assume that the random effects have some known distribution and sample from the posterior distribution of the variance components, does not scale to big data. To do so, we show both simulation results from Markov chain Monte Carlo (MCMC) algorithms and theory describing the convergence rates of those algorithms.

Let θ = (σ_A², σ_B², σ_E²)^T be the vector of variance components. Following Section 1.2.1, we first obtain an unbiased estimate θ̂ of θ using a parallelizable algorithm requiring one pass over the data, with computational cost O(N) and O(R + C) additional space. Second, we find the variance of θ̂, Var(θ̂ | θ, κ), which depends on both θ and the kurtoses κ = (κ_A, κ_B, κ_E)^T. The exact variance is too expensive to compute, but we develop (mildly) overestimating approximations V(θ, κ) that can be computed in a second pass over the data, requiring O(N) time and O(R + C) space, once given values for θ and κ. By estimating κ in a manner similar to θ, we then estimate the variances of the estimates of θ by V(θ̂, κ̂). Third, we show that θ̂ is normally distributed under a suitable asymptotic regime. Combining these three results, we have the tools we need to find estimates and confidence intervals for the variance components efficiently and scalably.

For large data sets we might suppose that Var(θ̂) is necessarily very small and that getting exact values is not important. While this may be true, it is wise to check. The effective sample size (as defined in Lavrakas (2008) [30]) in model (2) might be as small as R or C if the row or column effects dominate. Moreover, if the sampling frequencies of rows or columns are very unequal, then the effective sample size can be much smaller than R or C. For example, the Netflix data set [8] has N ≈ 10^8, but there are only about 18,000 movies, and so for statistics dominated by the movie effect the effective sample size might be closer to 18,000. That the movies do not appear equally often would further reduce the effective sample size. Indeed, Owen (2007) [36] shows that for some linear statistics the variance could be as much as 50,000 times larger than a formula based on IID sampling would yield. That factor is perhaps extreme, but it would translate a nominal sample size of 10^8 into an effective sample size closer to 2,000.

2.1.1 Notation and Asymptotic Regime

In this section, we review some additional notation we need for this chapter and the next.

To index rows, we use i′, r, and r′ in addition to i. To index columns, we use j′, s, and s′ in addition to j. Let Z_ij = 1 if Y_ij is observed, and Z_ij = 0 otherwise.

The number of observations in row i is N_i• = ∑_j Z_ij and the number of observations in column j is N_•j = ∑_i Z_ij. In the following, all of our sums over rows are only over rows i with N_i• > 0, and similarly for sums over columns. We state this because there are a small number of expressions whose values change if rows without data are omitted. This convention corresponds to what happens when one makes a pass through the whole data set.

Let Z be the R × C matrix containing the Z_ij. Of interest are (ZZ^T)_{ii′} = ∑_j Z_ij Z_i′j, the number of columns for which we have data in both rows i and i′, and (Z^T Z)_{jj′}. Note that (ZZ^T)_{ii′} ≤ N_i• and furthermore

$$
\sum_{i,r}(ZZ^{\mathsf T})_{ir}=\sum_{j,i,r}Z_{ij}Z_{rj}=\sum_j N_{\bullet j}^2,
\quad\text{and}\quad
\sum_{j,s}(Z^{\mathsf T}Z)_{js}=\sum_i N_{i\bullet}^2.
$$

Two other useful idioms are

$$
T_{i\bullet}=\sum_j Z_{ij}N_{\bullet j}\quad\text{and}\quad T_{\bullet j}=\sum_i Z_{ij}N_{i\bullet}. \tag{2.2}
$$

Here T_i• is the total number of observations in all of the columns j that are represented in row i.
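All of these counts can be accumulated in one pass over the observed (i, j) pairs. A small numpy sketch, reusing the index arrays Zi and Zj from the simulation sketch in Chapter 1:

```python
import numpy as np

Ni = np.bincount(Zi)                      # N_i. : observations per row
Nj = np.bincount(Zj)                      # N_.j : observations per column
Ti = np.bincount(Zi, weights=Nj[Zj])      # T_i. = sum_j Z_ij N_.j
Tj = np.bincount(Zj, weights=Ni[Zi])      # T_.j = sum_i Z_ij N_i.

# check: sum_i T_i. = sum_{i,r} (Z Z^T)_{ir} = sum_j N_.j^2
assert np.isclose(Ti.sum(), (Nj.astype(float) ** 2).sum())
```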

Our notation allows for an arbitrary pattern of observations. Some special cases are as follows. A balanced crossed design can be described via Z_ij = 1{i ≤ R} 1{j ≤ C}. If max_i N_i• = 1 but max_j N_•j > 1, then the data have a nested structure with rows nested in columns. If max_i N_i• = max_j N_•j = 1, then the observed Y_ij are IID. Some patterns are difficult to handle. For example, if all the observations are in the same row or column, some of the variance components are not identifiable. We assume that we do not meet such worst cases.

We also need to introduce some notation useful for asymptotic analysis. The quantities

$$
\varepsilon_R=\max_i N_{i\bullet}/N\quad\text{and}\quad\varepsilon_C=\max_j N_{\bullet j}/N \tag{2.3}
$$

measure the extent to which a single row or column dominates the data set. For the remainder of this section, we discuss the asymptotic regime used in this thesis; simply letting N → ∞ is not enough.

First, we expect that no row or column contains a large portion of the observed data. Thus, as N → ∞, we assume that

$$
\max(\varepsilon_R,\varepsilon_C)\to 0. \tag{2.4}
$$

It is also often reasonable to suppose that max_i T_i•/N and max_j T_•j/N are both small.

In many data sets, the average row and column sizes are large, but much smaller than N. One way to measure the average row size is N/R. Another way is to randomly choose an observation and inspect its row size, obtaining an expected value of (1/N)∑_i N_i•². Similar formulas hold for the average column size. Therefore, we assume that as N → ∞,

$$
\max(R/N,\,C/N)\to 0 \tag{2.5}
$$

and

$$
\min\Bigl(\frac{1}{N}\sum_i N_{i\bullet}^2,\ \frac{1}{N}\sum_j N_{\bullet j}^2\Bigr)\to\infty
\quad\text{and}\quad
\max\Bigl(\frac{1}{N^2}\sum_i N_{i\bullet}^2,\ \frac{1}{N^2}\sum_j N_{\bullet j}^2\Bigr)\to 0. \tag{2.6}
$$

Notice that

$$
\frac{1}{N^2}\sum_i N_{i\bullet}^2 \le \frac{1}{N^2}\sum_i N_{i\bullet}(\varepsilon_R N) \le \varepsilon_R
\quad\text{and}\quad
\frac{1}{N^2}\sum_j N_{\bullet j}^2 \le \varepsilon_C, \tag{2.7}
$$

and so the second part of (2.6) merely follows from (2.3) and (2.4).

While the average row count may be large, many of the rows, corresponding to newly seen entities, can have N_i• = 1. In our algorithms and analysis, it is not necessary to assume that all of the rows and columns contain at least some minimum number of observations. Thus, we avoid the information loss incurred by the practice of iteratively removing all rows and columns with few observations.

To illustrate the appropriateness of our assumptions for data sets of practical interest, we consider the Netflix data [8], which has N = 100,480,507 ratings on R = 17,770 movies by C = 480,189 customers. Therefore R/N ≈ 0.00018 and C/N ≈ 0.0047, and the observation pattern is sparse, with N/(RC) ≈ 0.012. It is not dominated by a single row or column because ε_R ≈ 0.0023 and ε_C ≈ 0.00018, even though one customer has rated an astonishing 17,653 movies. Similarly,

$$
\frac{N}{\sum_i N_{i\bullet}^2}\approx 1.78\times10^{-5},\quad
\frac{\sum_i N_{i\bullet}^2}{N^2}\approx 0.00056,\quad
\frac{N}{\sum_j N_{\bullet j}^2}\approx 0.0015,\quad\text{and}\quad
\frac{\sum_j N_{\bullet j}^2}{N^2}\approx 6.43\times10^{-6},
$$

so that the average row or column size is ≫ 1 and ≪ N.
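All of the diagnostics in this check cost O(N); a sketch, using the row and column counts Ni and Nj from the earlier snippet:

```python
import numpy as np

N = Ni.sum()
eps_R, eps_C = Ni.max() / N, Nj.max() / N                          # (2.3); both should be small
print("eps_R =", eps_R, " eps_C =", eps_C)
print("R/N =", (Ni > 0).sum() / N, " C/N =", (Nj > 0).sum() / N)   # (2.5)
for name, Nk in (("row", Ni), ("column", Nj)):
    s2 = (Nk.astype(float) ** 2).sum()
    # average size should be >> 1 while its relative version goes to 0, as in (2.6)
    print(name, "avg size =", s2 / N, " relative =", s2 / N ** 2)
```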

2.2 Previous Work

Before presenting our results, we review algorithms from the literature for estimating the variance components in model (2).

They fall into two main categories. The first, which our proposal falls under, uses moments of the data. These methods have the advantage that they do not require any parametric assumptions, and moments can be computed easily by passing over the data, which also enables parallelization. The second assumes some distribution for the random effects and noise and maximizes the data likelihood (or some version thereof). These methods have the advantage that if the data actually follow the chosen distribution, the resulting estimates often have smaller mean squared error, i.e., are more efficient [44].

Here, we discuss the first category of algorithms. We defer the discussion of the second category to Section 3.2.2, since such algorithms for crossed random effects models do not differ much from those for linear mixed models. We note that likelihood-based methods require computation time superlinear in the number of observations, which makes them not scalable; see Section 3.2.2 and also Raudenbush (1993) [40].


Before doing so, we refer to Clayton & Rasbash (1999) [11], who propose an alternating imputation posterior (AIP) algorithm for crossed random effects models with missing data. They show that it has good performance on fairly large data sets. It may be termed a 'pseudo-MCMC' method, since it alternates between sampling the missing data from their distribution given the parameter estimates and sampling the parameters from a distribution centered on the maximum likelihood estimates. Because of this last step, we do not consider AIP to be scalable to Internet-size problems.

2.2.1 Moment-based Algorithms

Note that model (2) can be written as

$$
\begin{aligned}
&Y = \mu 1_N + Z_a a + Z_b b + e,\quad\text{with}\\
&a \sim (0,\sigma_A^2 I_R),\quad b \sim (0,\sigma_B^2 I_C),\quad e \sim (0,\sigma_E^2 I_N)\ \text{(independently)},
\end{aligned}\tag{2.8}
$$

where Y is the vector of Y_ij, 1_N is the N-dimensional vector of ones, Z_a is an N × R binary matrix with entry (k, i) equal to 1 if the kth observation is from level i of the first factor, a is the vector of a_i, Z_b is an N × C binary matrix with entry (k, j) equal to 1 if the kth observation is from level j of the second factor, b is the vector of b_j, and I_N is the identity matrix of dimension N.

Suppose that we choose a set of three matrices A_1, A_2, and A_3 for which 1_N^T A_1 1_N = 1_N^T A_2 1_N = 1_N^T A_3 1_N = 0. Then applying the method of moments to S_i = Y^T A_i Y for i = 1, 2, 3, where

$$
\mathbb{E}(S_i) = \sigma_A^2\,\mathrm{Tr}(Z_a^{\mathsf T} A_i Z_a) + \sigma_B^2\,\mathrm{Tr}(Z_b^{\mathsf T} A_i Z_b) + \sigma_E^2\,\mathrm{Tr}(A_i),
$$

gives unbiased estimates of the variance components [44]. When the random effects are normally distributed, the matrices A_i can be chosen optimally, yielding minimum variance unbiased estimates [33]. We now describe two special cases of this technique.

Henderson I. The most common choice, called Henderson I [23], has estimating equations

$$
\begin{aligned}
\mathrm{SSA} &= \sum_{i=1}^R N_{i\bullet}(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2,\\
\mathrm{SSB} &= \sum_{j=1}^C N_{\bullet j}(\bar Y_{\bullet j}-\bar Y_{\bullet\bullet})^2,\quad\text{and}\\
\mathrm{SSE} &= \sum_{ij} Z_{ij}(Y_{ij}-\bar Y_{\bullet\bullet})^2-\mathrm{SSA}-\mathrm{SSB},
\end{aligned}
$$

which are based on the ANOVA sums of squares for balanced data. We propose an alternative to Henderson I based on U-statistics that is more amenable to analysis, and we develop a method to approximate its sampling variance in linear time. A disadvantage of Henderson I is that it only applies to random effects models. Henderson II extends Henderson I to linear mixed models; we will discuss it in Section 3.2.1.
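As a reference point for later comparisons, the Henderson I statistics need only the row means, the column means, and the grand mean; a sketch in the running notation (Zi, Zj, Y, Ni, Nj from the earlier snippets):

```python
import numpy as np

rows, cols = Ni > 0, Nj > 0                            # sums run over observed levels only
Ybar = Y.mean()                                        # grand mean Y_..
Ybar_i = np.bincount(Zi, weights=Y)[rows] / Ni[rows]   # row means Y_i.
Ybar_j = np.bincount(Zj, weights=Y)[cols] / Nj[cols]   # column means Y_.j

SSA = (Ni[rows] * (Ybar_i - Ybar) ** 2).sum()
SSB = (Nj[cols] * (Ybar_j - Ybar) ** 2).sum()
SSE = ((Y - Ybar) ** 2).sum() - SSA - SSB
```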


Henderson III. Henderson III [44] comes from a regression perspective rather than an ANOVA perspective. To construct estimating equations, it pretends that the random effects are fixed effects. Let R(a, b, µ) be the reduction in sum of squares due to fitting the model with µ, a, and b as fixed effects. Similarly, let R(a, µ) be the same for fitting the model with only µ and a, and let R(µ) be the same for fitting the model with only µ. Then we apply the method of moments to the estimating equations

$$
\begin{aligned}
R(a,b\mid\mu) &= R(a,b,\mu)-R(\mu),\\
R(b\mid a,\mu) &= R(a,b,\mu)-R(a,\mu),\quad\text{and}\\
\mathrm{SSE} &= Y^{\mathsf T}Y-R(a,b,\mu),
\end{aligned}
$$

with expectations under model (2.8)

$$
\begin{aligned}
\mathbb{E}(R(a,b\mid\mu)) &= \Bigl(N-\frac{\sum_i N_{i\bullet}^2}{N}\Bigr)\sigma_A^2+\Bigl(N-\frac{\sum_j N_{\bullet j}^2}{N}\Bigr)\sigma_B^2+(R+C-2)\sigma_E^2,\\
\mathbb{E}(R(b\mid a,\mu)) &= (N-R)\sigma_B^2+(C-1)\sigma_E^2,\quad\text{and}\\
\mathbb{E}(\mathrm{SSE}) &= (N-R-C+1)\sigma_E^2.
\end{aligned}
$$
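The reductions in sums of squares are fitted sums of squares from least squares on dummy-coded designs. A small-data sketch for illustration only (the dense design matrices make it far from scalable; lstsq returns a minimum-norm solution, so the intercept/dummy collinearity is harmless):

```python
import numpy as np

def reduction(X, Y):
    """R(model): reduction in sum of squares from fitting X, i.e. Y' P_X Y."""
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y @ (X @ coef)

one = np.ones((Y.size, 1))
A = (Zi[:, None] == np.unique(Zi)).astype(float)       # row dummies (random effects as fixed)
B = (Zj[:, None] == np.unique(Zj)).astype(float)       # column dummies

R_abm = reduction(np.hstack([one, A, B]), Y)           # R(a, b, mu)
R_am = reduction(np.hstack([one, A]), Y)               # R(a, mu)
R_m = reduction(one, Y)                                # R(mu)

R_ab_given_m = R_abm - R_m                             # R(a, b | mu)
R_b_given_am = R_abm - R_am                            # R(b | a, mu)
SSE = Y @ Y - R_abm
```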

2.3 Bayesian Analysis via MCMC

In addition to the moment- and likelihood-based algorithms discussed in the previous section, we may also take a Bayesian approach. As in likelihood-based algorithms, we assume that the random effects and noise follow some known distribution; for this section only, we consider the Gaussian distribution. To estimate σ_A², σ_B², and σ_E², we then utilize MCMC algorithms to sample from the posterior distribution given the data, π = p(µ, a, b, σ_A², σ_B², σ_E² | Y), where a is the vector of a_i, b is the vector of b_j, and Y is the vector of Y_ij. Let

$$
S^{(t)} = \bigl(\mu^{(t)},\ a^{(t)\mathsf T},\ b^{(t)\mathsf T},\ \sigma_A^{2(t)},\ \sigma_B^{2(t)},\ \sigma_E^{2(t)}\bigr)^{\mathsf T},\quad t\ge 1,
$$

denote the resulting chain.

We show that numerous MCMC methods scale badly for crossed random effects models, even though they have been shown to be effective for hierarchical random effects models [15, 46, 51]. Indeed, in limits where R, C → ∞, the dimension of our chain S^(t) approaches infinity. The convergence rates of many MCMC methods slow down as the dimension of the chain increases, making them ineffective for high-dimensional parameter spaces.

In our arguments, we make two restrictions. First, balanced data form a fully sampled R × C matrix with Y_ij for rows i = 1, ..., R and columns j = 1, ..., C. We present some analyses for balanced data, with interspersed remarks on how the general unbalanced case behaves. The balanced case allows sharp formulas that we find useful, and it is the case we simulate; in particular, we can obtain convergence rates for some MCMC algorithms. Second, the MCMC algorithms we consider go over the entire data set at each iteration. There are alternative samplers that save computation time by looking at only a subset of the data at each iteration; however, so far those approaches have been developed only for IID data.

2.3.1 Implications from Probability Theory

We start by exploring the theoretical properties of some common MCMC algorithms for our problem.

2.3.1.1 Gibbs Sampling

In each iteration of Gibbs sampling [16], we draw from the conditional posteriors of µ, a, b, σ_A², σ_B², and σ_E² in turn. To analyze the behavior of this sampler, let us consider the problem of Gibbs sampling from the 'smaller' distribution φ = p(a, b | µ, σ_A², σ_B², σ_E², Y). At iteration t + 1, we sample a^(t+1) ∼ p(a | b^(t), µ, σ_A², σ_B², σ_E², Y) and then b^(t+1) ∼ p(b | a^(t+1), µ, σ_A², σ_B², σ_E², Y), which are normal distributions with diagonal covariance matrices. Let X^(t) be the resulting chain.

Roberts & Sahu (1997) [42] give the following definition.

Definition 2.3.1. Let θ^(t), for integer t ≥ 0, be a Markov chain with stationary distribution h. Its convergence rate is the minimum number ρ such that

$$
\lim_{t\to\infty}\ \mathbb{E}_h\Bigl[\bigl(\mathbb{E}_h[f(\theta^{(t)})\mid\theta^{(0)}]-\mathbb{E}_h[f(\theta)]\bigr)^2\Bigr]\,r^{-t}=0
$$

holds for all measurable functions f such that E_h(f(θ)²) < ∞ and all r > ρ.

Theorem 2.3.1. Let ρ be the convergence rate of X^(t) to φ, as in Definition 2.3.1. Then

$$
\rho=\frac{\sigma_B^2}{\sigma_B^2+\sigma_E^2/R}\times\frac{\sigma_A^2}{\sigma_A^2+\sigma_E^2/C}.
$$

Proof. See Section A.1.1.

We see that ρ → 1 as R, C → ∞, outside of trivial cases with σ_A² or σ_B² equal to zero. If R and C grow proportionately, then ρ = 1 − α/√N + O(1/N) for some α > 0. We can therefore expect the Gibbs sampler to require at least some constant multiple of √N iterations to approximate the target distribution sufficiently well. When the data are not perfectly balanced, numerical computation of ρ shows that Gibbs still mixes increasingly slowly as N → ∞, while the sampler requires O(N) computation per iteration. In sum, Gibbs takes O(N^{3/2}) work to sample from φ, which is not scalable.

Because sampling from φ can be viewed as a subproblem of sampling from π, we believe that the Gibbs sampler that draws from π, which also requires O(N) time per iteration, will exhibit the same slow convergence and hence require superlinear computation time. Our simulations, described in Section 2.3.2, confirm this.
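To see what Theorem 2.3.1 implies at practical sizes, a quick numeric sketch using the parameter values from the simulations of Section 2.3.2; the last column is the number of iterations for the geometric rate ρ^t to fall below 0.5:

```python
import numpy as np

def rho(R, C, s2A=2.0, s2B=0.5, s2E=1.0):
    """Gibbs convergence rate from Theorem 2.3.1 for balanced R x C data."""
    return (s2B / (s2B + s2E / R)) * (s2A / (s2A + s2E / C))

for n in (100, 1000, 10000):
    r = rho(n, n)
    print(n, r, int(np.ceil(np.log(0.5) / np.log(r))))   # rho -> 1 and the lag count grows
```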

2.3.1.2 Other MCMC algorithms

The Gibbs sampler is widely used for problems like this, where it is tractable to sample from the full conditional distributions. But there are other MCMC algorithms that one could use. We now consider random walk Metropolis (RWM), Langevin diffusion, and the Metropolis-adjusted Langevin algorithm (MALA). They also have difficulties scaling to large data sets.

At iteration t + 1 of RWM, a Gaussian random walk proposal S^(t+1) ∼ N(S^(t), σ²I), for σ² > 0, is made, and the step is taken with the Metropolis-Hastings acceptance probability. If the target distribution is a product distribution of dimension d, the chain formed by every dth state of S^(t) converges to a diffusion whose stationary distribution is the target distribution. We may interpret this as a convergence time for the algorithm that grows as O(d) [41].

For our problem, evaluating the acceptance probability requires time at least O(N), so the overall algorithm takes O(N(R + C)) time. This is at best O(N^{3/2}), as we found for Gibbs sampling, and could be worse for sparse data where N ≪ RC. Our target distribution is not of product form, but we have no reason to expect that RWM mixes orders of magnitude faster here than for a distribution of product form. Indeed, it seems more likely that mixing would be faster for product distributions than for distributions with more complicated dependence patterns such as ours.

At iteration t + 1, Langevin diffusion steps S^(t+1) ∼ N(S^(t) + (h/2)∇ log π(S^(t)), hI) for h > 0. As h → 0, the stationary distribution of this process converges to π, as shown for general target distributions in Liu (2004) [32]. Because h ≠ 0 in practice, the Langevin algorithm is biased. To correct this, the MALA algorithm uses the Metropolis-Hastings algorithm with the Langevin proposal S^(t+1). When the target distribution is a product distribution of dimension d, the chain formed by every d^{1/3}th state of S^(t) converges to a diffusion with stationary distribution π; the convergence time grows as O(d^{1/3}) [41]. Computing the gradients and the acceptance probability requires O(N) time, so by similar reasoning as for RWM, the computation time of MALA is O(N(R + C)^{1/3}), which is at best O(N^{1+1/6}).

2.3.2 Simulation Results

We carried out simulations of the four algorithms described above, as well as five others: the block Gibbs sampler ('Block'), the reparameterized Gibbs sampler ('Reparam.'), the independence sampler ('Indp.'), RWM with subsampling ('RWM Sub.'), and the pCN algorithm of Hairer et al. (2014) [20]. Descriptions of these five algorithms are given below, with discussions of their simulation results. Every algorithm was implemented in MATLAB and run on a cluster using 4GB of memory.

For each algorithm and a range of values of R and C, we generated balanced data from model (2) with µ = 1, σ_A² = 2, σ_B² = 0.5, and σ_E² = 1. We ran 20,000 iterations of the algorithm, retaining the last 10,000 for analysis. We record the CPU time required; the median values of µ, σ_A², σ_B², and σ_E²; and the number of lags needed for their sample autocorrelation functions (ACF) to go below 0.5. The entire process is repeated in 10 independent runs. Table 2.1 presents median values of the recorded statistics over the 10 runs for the case R = C = 1000. Section A.1.2 includes corresponding results at a range of (R, C) sizes.

Table 2.1: Summary of simulation results for cases with R = C = 1000. The first row gives CPU time in seconds. The next four rows give median estimates of the four parameters. The last four rows give the number of lags required to get an autocorrelation below 0.5.

Method       Gibbs   Block   Reparam.  Lang.   MALA   Indp.   RWM    RWM Sub.  pCN
CPU sec.      3432   15046       4099   2302    4760   2513    2141      2635  1966
med µ         0.97    1.02       1.04   0.99    0.96   2.39    1.55      1.07  1.53
med σ_A²      1.96    1.99       2.02   1.90    1.95   1.78    2.01      1.96  1.99
med σ_B²      0.51    0.50       0.50   0.40    0.50   2.94    0.51      0.50  0.49
med σ_E²      1.00    1.00       1.00  65.22    2.66   0.15    0         0.93  0
ACF(µ)         801     790        694      1    2501  5000+    1133      1656  1008
ACF(σ_A²)        1       1          1    122    2656  5000+    1133       989   912
ACF(σ_B²)        1       1          1    477    2514  5000+    1133       855   556
ACF(σ_E²)        1       1          1    385    3062  5000+    1518      1724   621
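For reference, a sketch of the balanced-data generator behind these runs, with Gaussian effects at the stated parameter values (the MCMC samplers themselves were implemented in MATLAB and are not reproduced here):

```python
import numpy as np

def balanced_data(R, C, mu=1.0, s2A=2.0, s2B=0.5, s2E=1.0, seed=0):
    """Fully observed R x C matrix of Y_ij from model (2) with Gaussian effects."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, np.sqrt(s2A), (R, 1))            # row effects
    b = rng.normal(0.0, np.sqrt(s2B), (1, C))            # column effects
    return mu + a + b + rng.normal(0.0, np.sqrt(s2E), (R, C))

Y = balanced_data(1000, 1000)                            # the R = C = 1000 case of Table 2.1
```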

Block Gibbs, which updates a and b together to try to improve mixing, has computation time per iteration superlinear in the number of observations. To improve mixing, reparameterized Gibbs scales the random effects to have equal variances; this gives an algorithm equivalent to the conditional augmentation of Van Dyk & Meng (2001) [47]. For all three Gibbs-type algorithms, the parameter estimates are good, but µ mixes more slowly as R and C increase, while the variance components do not exhibit this behavior.

The computation times of Langevin diffusion ('Lang.') and MALA are approximately linear in the number of observations. However, σ_E² tends to explode for large data sets in Langevin diffusion, while the chain does not mix well in MALA.

The independence sampler is a Metropolis-Hastings algorithm in which the proposal distribution is fixed. We propose µ ∼ N(1, 1), a ∼ N(0, I_R), b ∼ N(0, I_C), and σ_A², σ_B², σ_E² ∼ InvGamma(1, 1). The computation time grows linearly with the data size. The parameters do not mix well, and their estimates are not good. It is possible that better results would be obtained from a different proposal distribution, but it is not clear how best to choose one in practice.

RWM and RWM with subsampling, the latter of which updates a subset of the parameters at each iteration, both have computation time linear in the number of observations. Neither algorithm mixed well, and for RWM σ_E² tended to go to zero on large data sets.

The pCN algorithm is a Metropolis-Hastings algorithm whose proposals are Gaussian random walk steps shrunk towards zero: S^(t+1) ∼ N(√(1 − σ²) S^(t), σ²I), for σ² ≤ 1. Hairer et al. (2014) [20] show that under certain conditions on the target distribution, the convergence rate of this algorithm does not slow with the dimension of the distribution. We include it here even though our π does not satisfy those conditions. The computation time grows linearly with the data size. However, the estimates of µ and σ_E² are not good, and those of σ_E² even get worse as the data size increases. None of the parameters seem to mix well.

In summary, for large data sets each algorithm mixes increasingly slowly or returns flawed estimates of µ and the variance components. We have also simulated some unbalanced data sets, and slow mixing is once again the norm, with worse performance as R and C grow.

2.4 Method of Moments Estimation

We propose a method of moments estimate θ̂ of θ = (σ_A², σ_B², σ_E²)^T that requires one pass over the data. We find an expression for Var(θ̂ | θ, κ) and describe how to obtain an approximation to it after a second pass over the data. We also show that the estimates are asymptotically Gaussian under the regime described in Section 2.1.1, which enables the efficient construction of reliable confidence intervals.

2.4.1 U-statistics for variance components

The usual unbiased sample variance estimate can be formulated as a U-statistic, which is more convenient to analyze. Thus, we use the following U-statistics as our method of moments estimators:

$$
\begin{aligned}
U_a &= \tfrac12\sum_{i,j,j'} N_{i\bullet}^{-1} Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2,\\
U_b &= \tfrac12\sum_{j,i,i'} N_{\bullet j}^{-1} Z_{ij}Z_{i'j}(Y_{ij}-Y_{i'j})^2,\ \text{and}\\
U_e &= \tfrac12\sum_{i,j,i',j'} Z_{ij}Z_{i'j'}(Y_{ij}-Y_{i'j'})^2.
\end{aligned}\tag{2.9}
$$

We explain our choice of these estimators in Section A.2.1.

To understand U_a, note that for each row i, the quantities Y_ij − µ − a_i are IID with variance σ_B² + σ_E². Thus, U_a is a weighted sum of within-row unbiased estimates of σ_B² + σ_E². The explanation for U_b is similar, while U_e is proportional to the sample variance estimate based on all N observations.
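Using the identity U_a = ∑_i S_i• developed below (and its column analogue), together with the fact that U_e is N times the centered sum of squares of all observations, (2.9) can be evaluated in O(N); a sketch follows, noting that the 'sum of squares minus squared sum' shortcut here can lose precision, which is why the text later introduces a numerically stable streaming update:

```python
import numpy as np

def u_stats(Zi, Zj, Y):
    """U_a, U_b, U_e from (2.9) via within-group sums of squares, in O(N) time."""
    Ni, Nj = np.bincount(Zi), np.bincount(Zj)
    rows, cols = Ni > 0, Nj > 0
    # S_i. = sum_j Z_ij (Y_ij - Ybar_i.)^2, and similarly S_.j for columns
    Si = np.bincount(Zi, weights=Y ** 2)[rows] - np.bincount(Zi, weights=Y)[rows] ** 2 / Ni[rows]
    Sj = np.bincount(Zj, weights=Y ** 2)[cols] - np.bincount(Zj, weights=Y)[cols] ** 2 / Nj[cols]
    Ue = Y.size * ((Y - Y.mean()) ** 2).sum()
    return Si.sum(), Sj.sum(), Ue
```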

Lemma 2.4.1. Let Y_ij follow the two-factor crossed random effects model (2) with the observation pattern Z_ij as described in Section 2.1.1. Then the U-statistics defined at (2.9) satisfy

$$
\begin{aligned}
\mathbb{E}(U_a) &= (\sigma_B^2+\sigma_E^2)(N-R),\\
\mathbb{E}(U_b) &= (\sigma_A^2+\sigma_E^2)(N-C),\ \text{and}\\
\mathbb{E}(U_e) &= \sigma_A^2\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+\sigma_B^2\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)+\sigma_E^2(N^2-N).
\end{aligned}
$$

Proof. See Section A.2.2.

To obtain unbiased estimates σ̂_A², σ̂_B², and σ̂_E² given values of the U-statistics, we solve the 3 × 3 system of equations

$$
M\begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix}
=\begin{pmatrix}U_a\\ U_b\\ U_e\end{pmatrix}
\quad\text{for}\quad
M=\begin{pmatrix}
0 & N-R & N-R\\
N-C & 0 & N-C\\
N^2-\sum_i N_{i\bullet}^2 & N^2-\sum_j N_{\bullet j}^2 & N^2-N
\end{pmatrix}. \tag{2.10}
$$

For our method to return unique and meaningful estimates, the determinant of M,

$$
\det M = (N-R)(N-C)\Bigl(N^2-\sum_i N_{i\bullet}^2-\sum_j N_{\bullet j}^2+N\Bigr)
\ \ge\ (N-R)(N-C)\bigl(N^2(1-\varepsilon_R-\varepsilon_C)+N\bigr),
$$

must be nonzero. This holds when no row or column has more than half of the data and at least one row and at least one column have more than one observation.

To compute the U-statistics, notice that U_a = ∑_i S_i•, where S_i• = ∑_j Z_ij (Y_ij − Ȳ_i•)² and Ȳ_i• = (1/N_i•) ∑_j Z_ij Y_ij. In one pass over the data and O(N) time, we compute N_i•, Ȳ_i•, and S_i• for all R observed levels of i using the incremental algorithm described in the next paragraph. We can also compute N, R, and C in such a pass if they are not known beforehand.

Chan et al. (1983) [10] show how to compute both the running total Ỹ_i• = N_i• Ȳ_i• and S_i• in a numerically stable one-pass algorithm. At the initial appearance of an observation in row i, with corresponding column j = j(1), set N_i• = 1, Ỹ_i• = Y_ij, and S_i• = 0. After that, at the kth appearance of an observation in row i, with corresponding column j(k) and value Y_ij(k),

$$
N_{i\bullet} \leftarrow N_{i\bullet} + 1,\qquad
\tilde Y_{i\bullet} \leftarrow \tilde Y_{i\bullet} + Y_{ij(k)},\qquad\text{and}\qquad
S_{i\bullet} \leftarrow S_{i\bullet} + \frac{(N_{i\bullet}\, Y_{ij(k)} - \tilde Y_{i\bullet})^2}{N_{i\bullet}(N_{i\bullet}-1)}. \tag{2.11}
$$

Note that here k = N_i•. Chan et al. (1983) [10] give a detailed analysis of roundoff error for update (2.11), as well as generalizations that update higher moments from groups of data values.
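A sketch of that pass, implementing update (2.11) for the rows; the columns are handled identically inside the same loop:

```python
from collections import defaultdict

cnt = defaultdict(int)       # N_i.
tot = defaultdict(float)     # running total Ytilde_i. = N_i. * Ybar_i.
ss = defaultdict(float)      # running S_i.

def update_row(i, y):
    """One step of the numerically stable update (2.11)."""
    cnt[i] += 1
    tot[i] += y
    n = cnt[i]
    if n > 1:
        ss[i] += (n * y - tot[i]) ** 2 / (n * (n - 1.0))

for i, j, y in zip(Zi, Zj, Y):   # the single pass over (i, j, Y_ij) records
    update_row(i, y)

Ua = sum(ss.values())            # U_a = sum_i S_i.
```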

In that same pass over the data, U_e and the analogous quantities needed to compute U_b (S_•j, Ỹ_•j, N_•j) are also computed using the incremental algorithm. Finally, in additional O(R + C) time, we calculate ∑_i S_i•, ∑_j S_•j, ∑_i N_i•², and ∑_j N_•j². Now we have U_a, U_b, U_e, and all the entries of M.

Given U_a, U_b, U_e, and M, we can calculate σ̂_A², σ̂_B², and σ̂_E² in constant time. Therefore, finding our method of moments estimates takes O(N) time overall. Note that in practice we may get negative estimates of the variance components, since the method of moments does not take the nonnegativity constraints into account. However, there is no consensus on how to deal with this situation. In this thesis, we automatically set any negative variance component estimates to zero.

Page 25: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 17

2.4.2 Estimator Variances

Now we show how to estimate the covariance matrix of θ = (σ2A, σ

2B , σ

2E)T.

2.4.2.1 True variance of θ

This section discusses the finite sample covariance matrix of θ. Theorem 2.4.1 below gives the exact

variances and covariances of our U -statistics.

Theorem 2.4.1. Let Yij follow the random effects model (2) with the observation pattern Zij as

described in Section 2.1.1. Then the U -statistics defined at (2.9) have variances

Var(Ua) = σ4B(κB + 2)

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• )

+ 2σ4B

∑ir

N−1i• N−1r• (ZZT)ir[(ZZ

T)ir − 1] + 4σ2Bσ

2E(N −R)

+ σ4E(κE + 2)

∑i

Ni•(1−N−1i• )2 + 2σ4E

∑i

(1−N−1i• ),

(2.12)

Var(Ub) = σ4A(κA + 2)

∑js

(ZTZ)js(1−N−1•j )(1−N−1•s )

+ 2σ4A

∑js

N−1•j N−1•s (ZTZ)js[(Z

TZ)js − 1] + 4σ2Aσ

2E(N − C)

+ σ4E(κE + 2)

∑j

N•j(1−N−1•j )2 + 2σ4E

∑j

(1−N−1•j ),

(2.13)

and

Var(Ue) = 2σ4A

[(∑iN

2i•

)2−∑iN

4i•

]+ σ4

A(κA + 2)(N2∑i

N2i• − 2N

∑i

N3i• +

∑i

N4i•

)+ 2σ4

B

[(∑jN

2•j

)2−∑j

N4•j

]+ 4σ2

Aσ2B

(N3 − 2N

∑ij

ZijNi•N•j +∑ij

N2i•N

2•j

)+ σ4

B(κB + 2)(N2∑j

N2•j − 2N

∑j

N3•j +

∑j

N4•j

)+ σ4

E(κE + 2)N(N − 1)2

+ 2σ4EN(N − 1) + 4σ2

Aσ2E

(N3 −N

∑i

N2i•

)+ 4σ2

Bσ2E

(N3 −N

∑j

N2•j

).

(2.14)

Their covariances are

Cov(Ua, Ub) = σ4E(κE + 2)

∑ij

Zij(1−N−1i• )(1−N−1•j ), (2.15)

Cov(Ua, Ue) = 2σ4B

(∑i

N−1i• T2i• −

∑ij

ZijN−1i• N

2•j

)

Page 26: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 18

+ σ4B(κB + 2)

∑ij

Zij(N −N•j)N•j(1−N−1i• ) + 2σ4E(N −R) (2.16)

+ σ4E(κE + 2)(N −R)(N − 1) + 4σ2

Bσ2EN(N −R), and

Cov(Ub, Ue) = 2σ4A

(∑j

N−1•j T2•j −

∑ij

ZijN−1•j N

2i•

)+ σ4

A(κA + 2)∑ij

Zij(N −Ni•)Ni•(1−N−1•j ) + 2σ4E(N − C) (2.17)

+ σ4E(κE + 2)(N − C)(N − 1) + 4σ2

Aσ2EN(N − C).

Proof. Equation (2.12) is proved in Section A.2.4 and then equation (2.13) follows by exchanging

indices. Equation (2.14) is proved in Section A.2.5. Equation (2.15) is proved in Section A.2.6.

Equation (2.16) is proved in Section A.2.7 and then equation (2.17) follows by exchanging indices.

Now we consider Var(θ). From (2.10)

Var(θ) = M−1Var

Ua

Ub

Ue

(M−1)T. (2.18)

We show in Section 2.4.2.2 that while Var(Ue) and the covariances of the U -statistics may be

exactly computed in time O(N), Var(Ua) and Var(Ub) cannot. Therefore, we approximate Var(Ua)

and Var(Ub) such that when we apply formula (2.18) we get conservative estimates of Var(σ2A),

Var(σ2B), and Var(σ2

E) (the values of primary interest).

For intuition on what sort of approximation is needed, we give a linear expansion of Var(θ) in

terms of the variances and covariances of the U -statistics. Letting ε = max(εR, εC , R/N,C/N) we

have as ε→ 0

M =

N

N

N2

0 1 1

1 0 1

1 1 1

(1 +O(ε))

and so

M−1 =

−1 0 1

0 −1 1

1 1 −1

N−1

N−1

N−2

(1 +O(ε)).

It follows that

σ2A = (Ue/N

2 − Ua/N)(1 +O(ε)),

σ2B = (Ue/N

2 − Ub/N)(1 +O(ε)), and

σ2E = (Ua/N + Ub/N − Ue/N2)(1 +O(ε)).

(2.19)

Page 27: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 19

Disregarding the O(ε) terms,

Var(σ2A)

.= Var(Ue)/N

4 + Var(Ua)/N2 − 2Cov(Ua, Ue)/N3,

Var(σ2B)

.= Var(Ue)/N

4 + Var(Ub)/N2 − 2Cov(Ub, Ue)/N

3, and

Var(σ2E)

.= Var(Ua)/N2 + Var(Ub)/N

2 + Var(Ue)/N4

− 2Cov(Ua, Ue)/N3 − 2Cov(Ub, Ue)/N

3 + 2Cov(Ua, Ub)/N2.

(2.20)

In light of equation (2.20), to find computationally attractive but conservative approximations

of Var(θ) in finite samples, we use (slight) over-estimates of Var(Ua) and Var(Ub). We discuss how

to do so in Section 2.4.2.2.

In practice, when obtaining Var(θ), we plug in σ2A, σ2

B , σ2E , and estimates of the kurtoses into

the covariance matrix of the U -statistics where Var(Ua) and Var(Ub) have been replaced by their

over-estimates. Then we apply equation (2.18). We discuss estimating the kurtoses in Section 2.4.2.3.

2.4.2.2 Computable approximations of Var(U)

First, we show how to obtain over-estimates of Var(Ua) in time O(N); the case of Var(Ub) is similar.

In addition to N −R, Var(Ua) contains the quantities

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• ),∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1),∑i

Ni•(1−N−1i• )2, and∑i

(1−N−1i• ).

The third and fourth quantities above can be computed in O(R) work after the first pass over the

data.

The first quantity is a sum over i and r, and cannot be simplified any further. Computing it

takes more than O(N) work. Since its coefficient σ4B(κB + 2) is nonnegative, we must use an upper

bound to obtain an over-estimate of Var(Ua). We have the bound

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• ) ≤∑ij

∑r

ZijZrj(1−N−1i• ) =∑j

N2•j −

∑ij

ZijN•jN−1i• ,

which can be computed in O(N) work in a second pass over the data. Other weaker bounds may be

obtained without the second pass. An example is

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• ) ≤∑ir

(ZZT)ir =∑j

N2•j

which can be computed in O(C) work.

For our motivating problems this over-estimation of variance is negligible. The true term is a

Page 28: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 20

weighted sum of (ZZT)ir and we use a weight of 1 instead of 1 − N−1r• and a typical Nr• will be

large. Consider first a small row r, with Nr• = 1. Let j(r) be the unique column with Zrj(r) = 1.

Our bound replaces that row’s contribution of 0 by

∑ij

ZijZrj(1−N−1i• ) =∑i

ZijZrj(r)(1−N−1i• ) = Zij(r)(1−N−1i• ) ≤ 1

thereby adding at most 1 to the sum. The total from such terms is then at most the number of

singleton rows which is in turn below R � N �∑j N

2•j . The latter quantity dominates that

coefficient. When Nr• ≥ 2 our approximation at most doubles the contribution. A near doubling of

one term under the extreme setting where most rows have only two observations, is acceptable.

For the same reason as the first quantity, the second quantity cannot be computed in time O(N)

and we upper bound it via (ZZT)ir ≤ Nr•, getting

∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1) ≤∑ir

N−1i• N−1r• (ZZT)ir(Nr• − 1)

=∑ij

ZijN−1i• N•j −

∑ir

N−1i• N−1r• (ZZT)ir

≤∑ij

ZijN−1i• N•j

which can be computed in O(N) work on a second pass. Even this upper bound is a small part of

the variance; it is a sum of column sizes divided by row sizes. Another term in the variance is of

order∑ij ZijN•j . When typical row sizes are large then

∑ij ZijN•jN

−1i• �

∑ij ZijN•j .

All but one expression in Var(Ue) (see (2.14)) can be computed in O(R+C) time after the first

pass over the data. That one expression is

∑ij

ZijNi•N•j . (2.21)

Equation (2.21) requires a second pass over the data in time O(N), because it is the sum over i and

j of a polynomial in Zij , Ni•, and N•j . Hence computing Var(Ue) takes O(N) time total.

With the same reasoning as for (2.21), we see that Cov(Ua, Ub) can be computed in a second

pass over the data in time O(N). This reasoning also shows that we can compute nearly every term

in Cov(Ua, Ue) in a second pass over the data; the exception is

∑i

N−1i• T2i•. (2.22)

We compute Ti• for each i in a second pass over the data. But we must use additional time O(R) to

get (2.22). Nevertheless, the total computation time is still O(N). Symmetrically Cov(Ub, Ue) can

Page 29: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 21

be computed in time O(N) as well.

2.4.2.3 Estimating kurtoses

Under a Gaussian assumption, κA = κB = κE = 0. If however the data have heavier tails than

this, a Gaussian assumption will lead to underestimates of Var(θ). Therefore, we will estimate the

kurtoses using the method of moments with U -statistics as well.

Let µA,4 = E(a4i ) = (κA + 3)σ4A, µB,4 = E(b4i ) = (κB + 3)σ4

B , and µE,4 = E(e4ij) = (κE + 3)σ4E .

The fourth moment U -statistics we use are

Wa =1

2

∑ijj′

N−1i• ZijZij′(Yij − Yij′)4,

Wb =1

2

∑iji′

N−1•j ZijZi′j(Yij − Yi′j)4, and

We =1

2

∑iji′j′

ZijZi′j′(Yij − Yi′j′)4.

(2.23)

Theorem 2.4.2. Let Yij follow the random effects model (2) with the observation pattern Zij as

described in Section 2.1.1. Then the statistics defined at (2.23) have means

E(Wa) = (µB,4 + 3σ4B + 12σ2

Bσ2E + µE,4 + 3σ4

E)(N −R),

E(Wb) = (µA,4 + 3σ4A + 12σ2

Aσ2E + µE,4 + 3σ4

E)(N − C), and

E(We) = (µA,4 + 3σ4A + 12σ2

Aσ2E)(N2 −

∑i

N2i•)

+ (µB,4 + 3σ4B + 12σ2

Bσ2E)(N2 −

∑j

N2•j)

+ (µE,4 + 3σ4E)(N2 −N) + 12σ2

Aσ2B

(N2 −

∑i

N2i• −

∑j

N2•j +N

).

Proof. See Section A.2.8.

Using Theorem 2.4.2, we compute estimates µA,4, µB,4, and µE,4, by solving the 3× 3 system of

equations

M

µA,4

µB,4

µE,4

=

Wa −ma

Wb −mb

We −me

, (2.24)

where M is the same matrix that we used for the U -statistics in equation (2.10), with

ma = (3σ4B + 12σ2

Bσ2E + 3σ4

E)(N −R),

Page 30: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 22

mb = (3σ4A + 12σ2

Aσ2E + 3σ4

E)(N − C), and

me = (3σ4A + 12σ2

Aσ2E)(N2 −

∑i

N2i•) + (3σ4

B + 12σ2Bσ

2E)(N2 −

∑j

N2•j)

+ 3σ4E(N2 −N) + 12σ2

Aσ2B

(N2 −

∑i

N2i• −

∑j

N2•j +N

).

Then, κA = µA,4/σ4A − 3, κB = µB,4/σ

4B − 3, and κE = µE,4/σ

4E − 3.

We compute the statistics (2.23) via

Wa =∑i

(∑j

Zij(Yij − Yi•)4 + 3N−1i• S2i•

),

Wb =∑j

(∑i

Zij(Yij − Y•j)4 + 3N−1•j S2•j

), and

We = N∑ij

Zij(Yij − Y••)4 + 3S2••,

(2.25)

where Y•• = N−1∑ij ZijYij and S•• =

∑ij Zij(Yij − Y••)2.

Therefore, the kurtosis estimates κ requires R+ C + 1 new quantities

∑j

Zij(Yij − Yi•)4,∑i

Zij(Yij − Y•j)4, and∑ij

Zij(Yij − Y••)4 (2.26)

beyond those used to compute θ. These can be computed in a second pass over the data after Yi•,

Y•j and Y•• have been computed in the first pass. They can also be computed in the first pass using

update formulas analogous to the second moment formulas (2.11). Such formulas are given by Pebay

(2008) [38], citing an unpublished paper by Terriberry.

Because the kurtosis estimates are used in formulas for Var(θ) and those formulas already require

a second pass over the data, it is more convenient to compute (2.26) and the sample fourth moments

in a second pass. By a similar argument as in Section 2.4.1, obtaining κA, κB , and κE has space

complexity O(R + C) and time complexity O(N), and is therefore scalable. As with the variance

component estimates, in practice sometimes we get kurtosis estimates less than −2, outside the

parameter space. In this thesis, we simply threshold the kurtoses at −2, in line with the common

practice of raising variance estimates to zero.

2.4.2.4 Algorithm summary

For clarity of exposition, here we gather all of the steps in our algorithm for estimating σ2A, σ2

B , and

σ2E and the variances of those estimators. An outline is shown in Figure 2.1. We assume that all

of the computations below can be done with large enough variable storage that overflow does not

occur. This may require an extended precision representation beyond 64 bit floating point, such as

that in the python package mpmath [25].

Page 31: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 23

Figure 2.1: Schematic of our algorithm. The expressions in the smallest boxes are the valuescomputed at each step.

The first task is to compute θ. In a first pass over the data compute counts N , R, C, row values

Ni•, Yi•, Si• for all unique rows i in the data set, and column values N•j , Y•j , S•j for all unique

columns j in the data set as well as Y•• and S••. Incremental updates are used as described in (2.11).

Then compute

Ua =∑i

Si•, Ub =∑j

S•j , Ue = NS••,

the matrix M from (2.10), and θ = (σ2A, σ

2B , σ

2E)T = M−1(Ua, Ub, Ue)

T in time O(R+ C).

The second task is to compute approximately the variance of θ. First, we estimate the kurtoses.

A second pass over the data computes the centered fourth moments in (2.26). Then one calculates

the fourth order U -statistics of equation (2.25), solves (2.24) for the centered fourth moments, and

converts them to kurtosis estimates, all in time O(R+ C).

In that second pass over the data, we also compute

ZNp,q ≡∑ij

ZijNpi•N

q•j (2.27)

Page 32: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 24

for (p

q

)∈{(−1

−1

),

(1

−1

),

(−1

1

),

(1

1

),

(−1

2

),

(2

−1

)}

as well as Ti• and T•j of equation (2.2) for all i and j in the data.

Then, we estimate the variances of the U -statistics. Some of these next computations require

even more bits per variable than are needed to avoid overflow, because they involve subtraction in a

way that could lose precision. To estimate the variances of Ua and Ub, we apply the upper bounds

discussed in Section 2.4.2.2 to (2.12) and (2.13) and plug in σ2A, σ2

B , σ2E , κA, κB , and κE , calculating

using time and space O(R+ C)

Var(Ua) = σ4B(κB + 2)

(∑j

N2•j − ZN−1,1

)+ 2σ4

B

(ZN−1,1 −R

∑i

N−1i•

)+ 4σ2

Bσ2E(N −R) + σ4

E(κE + 2)∑i

Ni•(1−N−1i• )2 + 2σ4E

∑i

(1−N−1i• )

and

Var(Ub) = σ4A(κA + 2)

(∑i

N2i• − ZN1,−1

)+ 2σ4

A

(ZN1,−1 − C

∑j

N−1•j

)+ 4σ2

Aσ2E(N − C) + σ4

E(κE + 2)∑j

N•j(1−N−1•j )2 + 2σ4E

∑j

(1−N−1•j ).

To estimate Var(Ue) and the covariances of the U -statistics, we again plug in the variance com-

ponent and kurtosis estimates into Theorem 2.4.1 without approximation. We get Var(Ue) from

(2.14), using ZN1,1 from the second pass over the data. We get Cov(Ua, Ue) from (2.16) using

ZN−1,1, ZN−1,2, and Ti•, and Cov(Ub, Ue) from (2.17) using ZN1,−1, ZN2,−1, and T•j . We get

Cov(Ua, Ub) from (2.15) using ZN−1,−1. It can be easily verified that these calculations also take

time and space O(R+ C).

The final plug-in estimator of variance is

Var

σ2A

σ2B

σ2E

= M−1

Var(Ua) Cov(Ua, Ub) Cov(Ua, Ue)

Cov(Ub, Ua) Var(Ub) Cov(Ub, Ue)

Cov(Ue, Ua) Cov(Ue, Ub) Var(Ue)

(M−1)T (2.28)

where M is the matrix in (2.10).

Aggregating the computation times and counting the number of intermediate values we must

calculate, we see that our algorithm takes time O(N) and space O(R+ C).

Page 33: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 25

2.4.3 Asymptotic Normality of Estimates

We can bound the bias and variance of our estimates, only requiring that R/N and C/N be bounded

away from one.

Theorem 2.4.3. Suppose that max(R/N,C/N) ≤ θ for some θ < 1 and let ε = max(εR, εC). Then

E(σ2A) = (σ2

A + Υ)(1 +O(ε)),

E(σ2B) = (σ2

B + Υ)(1 +O(ε)), and

E(σ2E) = (σ2

E + Υ)(1 +O(ε)),

where

Υ ≡ σ2A

∑iN

2i•

N2+ σ2

B

∑j N

2•j

N2+σ2E

N= O(ε).

Furthermore

max(

Var(σ2A),Var(σ2

B),Var(σ2E))

= O(∑

iN2i•

N2+

∑j N

2•j

N2

)= O(ε).

Proof. See Section A.3.1.

Theorem 2.4.3 has the same variance rate for all variance components. Both bias and variance

are O(ε) and so a (conservative) effective sample size is then O(1/ε). The quantity Υ appearing

in Theorem 2.4.3 is Var(Y••) where Y•• = (1/N)∑ij ZijYij . The variances of the variance compo-

nents contain similar quantities to Υ although kurtoses and other quantities appear in their implied

constants.

Furthermore, under asymptotic conditions, we may obtain simple expressions for the covariance

matrix of our method of moments estimators.

Theorem 2.4.4. As described in Section 2.1.1, suppose that

Ni• ≤ εN, N•j ≤ εN, R ≤ εN, C ≤ εN, N ≤ ε∑i

N2i•, and N ≤ ε

∑j

N2•j

hold for the same small ε > 0 and that

0 < κA + 2, κB + 2, κE + 2, σ4A, σ

4B , σ

4E <∞.

Suppose additionally that

∑ij

ZijN−1i• N•j ≤ ε

∑i

N2i• and

∑ij

ZijNi•N−1•j ≤ ε

∑j

N2•j (2.29)

Page 34: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 26

hold. Then

Var(Ua) = σ4B(κB + 2)

∑j

N2•j(1 +O(ε)),

Var(Ub) = σ4A(κA + 2)

∑i

N2i•(1 +O(ε)), and

Var(Ue) =(σ4A(κA + 2)N2

∑i

N2i• + σ4

B(κB + 2)N2∑j

N2•j

)(1 +O(ε)).

Similarly

Cov(Ua, Ub) = σ4E(κE + 2)N(1 +O(ε)),

Cov(Ua, Ue) = σ4B(κB + 2)N

∑j

N2•j(1 +O(ε)), and

Cov(Ub, Ue) = σ4A(κA + 2)N

∑i

N2i•(1 +O(ε)).

Finally σ2A, σ2

B and σ2E are asymptotically uncorrelated as ε→ 0 with

Var(σ2A) = σ4

A(κA + 2)1

N2

∑i

N2i•(1 +O(ε)),

Var(σ2B) = σ4

B(κB + 2)1

N2

∑j

N2•j(1 +O(ε)), and

Var(σ2E) = σ4

E(κE + 2)1

N(1 +O(ε)).

Proof. See Section A.3.2.

Equation (2.29) is a new condition. The first part can be written∑ij ZijN

−1i• N•j ≤ ε

∑ij ZijNi•.

It means that the sum (over entries) of row sizes is much larger than the corresponding sum of

column sizes divided by row sizes. These two conditions impose mild ‘squareness’ on the observation

pattern Z.

In an asymptotic setting with ε → 0 the three variance estimates become uncorrelated. Also

each of them has the same variance it would have had if the other variance components had truly

been zero.

In addition, under the asymptotic regime in Section 2.1.1, the estimates are normally distributed,

which can be proven via a martingale central limit theorem [21].

From (2.19), as N →∞,σ2A

σ2B

σ2E

=

−N−1 0 N−2

0 −N−1 N−2

N−1 N−1 −N−2

Ua

Ub

Ue

. (2.30)

Page 35: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 27

Therefore, to show that σ2A, σ2

B , and σ2E are asymptotically normal, we show that Ua, Ub,

and Ue are asymptotically jointly normal. More precisely, we show that Ua = Ua/√∑

j N2•j ,

Ub = Ub/√∑

iN2i•, and Ue = Ue/(N

√∑iN

2i• +

∑j N

2•j) are asymptotically jointly normal. We

prove that their joint distribution is Gaussian, with a martingale construction found by combining

ideas from Serfling (2009) [45] and Gu (2008) [18].

Our martingales are constructed as follows. Let a = (a1, a2, . . . , aR) and b = (b1, b2, . . . , bC).

Next define the increasing sequence of σ-algebras

FN,k,0 = σ(bj , 1 ≤ j ≤ k), k = 1, . . . , C,

FN,q,1 = σ(b, ai, 1 ≤ i ≤ q), q = 1, . . . , R, and

FN,`,m,2 = σ(a, b, eij , 1 ≤ i ≤ `− 1, 1 ≤ j ≤ C, e`t, 1 ≤ t ≤ m), ` = 1, . . . , R, m = 1, . . . , C.

Then define random variables

SAN,k,0 =∑i

N−1i• Zik

(∑j:j 6=k

Zij(b2k − σ2

B)− 2∑j:j<k

Zijbjbk

),

SAN,q,1 = 0,

SAN,`,m,2 = N−1`• Z`m

( ∑j:j 6=m

Z`j(2(bm − bj)e`m + e2`m − σ2E)− 2

∑j:j<m

Z`je`je`m

),

SBN,k,0 = 0,

SBN,q,1 =∑j

N−1•j Zqj

(∑i:i 6=q

Zij(a2q − σ2

A)− 2∑i:i<q

Zijaiaq

),

SBN,`,m,2 = N−1•mZ`m

(∑i:i 6=`

Zim(2(a` − ai)e`m + e2`m − σ2E)− 2

∑i:i<`

Zimeime`m

),

SEN,k,0 =∑ii′

Zik

(∑j:j 6=k

Zi′j(b2k − σ2

B)− 2∑j:j<k

Zi′jbjbk

),

SEN,q,1 =∑jj′

Zqj

(∑i:i 6=q

Zij′(a2q − σ2

A + 2(bj − bj′)aq)− 2∑i:i<q

Zij′aiaq

), and

SEN,`,m,2 = Z`m

( ∑ij:ij 6=`m

Zij(e2`m − σ2

E + 2(a` − ai)e`m + 2(bm − bj)e`m)− 2∑

ij:i<` ori=`,j<m

Zijeije`m

).

Lemma 2.4.2. With the definitions above {SAN,k,0, SAN,q,1, SAN,`,m,2}, {SBN,k,0, SBN,q,1, SBN,`,m,2}, and

{SEN,k,0, SEN,q,1, SEN,`,m,2} constitute martingale difference sequences with

Ua − E(Ua) =∑k

SAN,k,0 +∑q

SAN,q,1 +∑`m

SAN,`,m,2,

Ub − E(Ub) =∑k

SBN,k,0 +∑q

SBN,q,1 +∑`m

SBN,`,m,2, and

Page 36: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 28

Ue − E(Ue) =∑k

SEN,k,0 +∑q

SEN,q,1 +∑`m

SEN,`,m,2.

Proof. See Section A.3.3.

We use the following martingale central limit theorem from Hall & Heyde (1980) [21] to obtain

Theorem 2.4.6 below.

Theorem 2.4.5. For each n, let Mns be a martingale difference sequence with filtration Fns. Define

σ2ns = E(M2

ns | Fn,s−1) and suppose that∑s σ

2ns

p→ σ2 for some constant σ2 > 0. If for each ε > 0,

limn→∞

∑s

E(M2nsI{|Mns| ≥ ε} | Fn,s−1) = 0,

then∑sMns converges in distribution to σN (0, 1) as n→∞.

We will need a growth condition on Ni• and N•j that is mild but stricter than we had from

Section 2.1.1. Specifically we require

maxiNi• + maxj N•j

min(√∑

iN2i•,√∑

j N2•j

) → 0. (2.31)

Under (2.31) any row or column size squared is ultimately small compared to sums of squared row

or column sizes.

Theorem 2.4.6. Suppose that ai, bj, and eij are bounded and the assumptions in Theorem 2.4.4

hold. In addition, suppose that∑iN

2i•∑

j N2•j

→ c,∑ir

(ZZT)irN−1i• N

−1r• ≤ ε

∑j

N2•j , and

∑js

(ZTZ)jsN−1•j N

−1•s ≤ ε

∑i

N2i•

for some constant c > 0. Then, if (2.31) holds, Ua, Ub, and Ue are asymptotically normal with mean

(σ2B + σ2

E)(N −R)√∑j N

2•j

(σ2A + σ2

E)(N − C)√∑iN

2i•

σ2A(N2 −

∑iN

2i•) + σ2

B(N2 −∑j N

2•j) + σ2

E(N2 −N)

N√∑

iN2i• +

∑j N

2•j

Page 37: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 29

and covariance

σ4B(κB + 2)

σ4E(κE + 2)N√∑iN

2i•

∑j N

2•j

σ4B(κB + 2)

∑j N

2•j√∑

j N2•j(∑iN

2i• +

∑j N

2•j)

∗ σ4A(κA + 2)

σ4A(κA + 2)

∑iN

2i•√∑

iN2i•(∑iN

2i• +

∑j N

2•j)

∗ ∗σ4A(κA + 2)

∑iN

2i• + σ4

B(κB + 2)∑j N

2•j + σ4

E(κE + 2)N∑iN

2i• +

∑j N

2•j

.

Proof. See Section A.3.4.

From Theorem 2.4.6 and (2.30), it is clear that the estimated variance components are also

asymptotically normal. Then, we can construct asymptotic confidence intervals for the estimated

variance components using Theorem 2.4.4 if so desired. Of course, the confidence intervals should

not include negative numbers. This should not happen for the data sizes of our interest, but in

practice if necessary we could truncate the confidence intervals at zero. However, we have found

that in the literature authors rarely give confidence intervals for estimated variance components.

2.5 Experimental Results

2.5.1 Simulations

First, we compare the performance of our method of moments algorithm (MoM), described in Sec-

tion 2.4.2.4 and implemented in Python, to maximum likelihood estimation as implemented in the

commonly used R package for mixed models, lme4 [7]. The package maximizes the profiled log-

likelihood with respect to the variance components and then estimates the regression coefficients via

least squares [5]. Computing the profiled log-likelihood requires finding the Cholesky decomposition

of a square matrix with dimension R+ C, which uses O((R+ C)3), or O(N3/2), time.

We assume that the random effects are normally distributed. We do so because it is the as-

sumption implicitly made by lme4, and so we expect that these are the conditions under which lme4

performs best. These experiments were carried out on a Windows machine with 8 GB memory and

Intel i5 processor.

For our algorithm, we consider a range of data sizes, with R = C ranging from 10 to 500. At

each fixed value of R = C, for 100 iterations, we generate data according to model (2) with σ2A = 2,

σ2B = 0.5, and σ2

E = 1. Exactly 25 percent of the cells were randomly chosen to be observed. We

measure the CPU time needed to obtain the variance component estimates σ2A, σ2

B , and σ2E (labeled

short) and the CPU time need to obtain the variance component estimates as well as conservative

estimates of the variances of those estimates (labeled long). In addition, we measure the mean

Page 38: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 30

squared errors of the variance component estimates. At the end, those five measurements were

averaged over the 100 iterations.

With regard to lme4, our simulation steps are nearly the same, with the following differences.

Due to the slowness of lme4, we only consider data sizes with R = C up to 300. In addition, because

lme4 finds the maximum likelihood variance component estimates, the variances of those estimates

were computed asymptotically using the inverse expected Fisher information matrix. The simulation

results are shown in Figure 2.2.

(a) CPU (b) MSE of σ2A

(c) MSE of σ2B (d) MSE of σ2

E

Figure 2.2: Simulation results: log-log plots of the five recorded measurements against the numberof observations.

Note that lme4 always takes more time than our algorithm. From Figure 2.2(a), we see that

our method of moments algorithm takes time at most linear in the data size to compute both the

variance component estimates and conservative estimates of the variances of those estimates. For

lme4 the computation time is always superlinear in the data size, for data sets large enough that

the startup cost of the package is no longer dominant.

The MSEs of σ2A for our algorithm and lme4 are comparable. Moreover, both decrease sublinearly

Page 39: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 31

with the data size. The same is true for the MSEs of σ2B . However, the MSE of σ2

E in lme4 is

noticeably smaller than that of our algorithm; this appears to be the price we pay for the decreased

computation time. In both cases, though, the MSE of σ2E decreases approximately linearly with

the data size. In sum, our algorithm provides estimates that are nearly as efficient as MLE and is

scalable to huge data sets, unlike MLE.

We also considered data with other aspect ratios, e.g. R = 2C and C = 2R. The CPU time

required and the MSEs of the variance component estimates behave similarly to the case R = C.

For the sake of brevity, we do not explicitly show the results of those simulations here.

Remark: In our simulations we investigated the conservativeness of the estimates of the variances

of σ2A, σ2

B , and σ2E due to the over-estimates described in Section 2.4.2.2. When R = C, it appears

that they are conservative by a factor of at most four. The factors corresponding to σ2A and σ2

B

approached 1 as the data size increased. Therefore, the upper bounds appear to be a reasonable

approximation of the true variances of the variance components.

2.5.2 Examples from Social Media

We illustrate our algorithm on three real world data sets that are too large for lme4 to handle in a

timely manner.

The first, from Yahoo Webscope [49], contains a random sample of ratings of movies by users,

which are grades from A+ to F converted into a numeric scale. There are 211,231 ratings by 7,642

users on 11,916 movies, filtered with the condition that each user rates at least ten movies. Only

0.23 percent of the user-movie matrix is observed.

The estimated variances of the user random effect, the movie random effect, and the error are

2.57, 2.86, and 7.68. The estimated kurtoses are −2, −2, and 6.56. Conservative estimates of the

variances of the estimated variance components are 0.0030, 0.0018, and 0.0060, with corresponding

standard errors 0.055, 0.42, and 0.077. Therefore, most of the variation in the movie rating comes

from error or the interaction between movies and users; this is not surprising, since different people

have different tastes.

The second data set, from Yahoo Webscope as well [50], contains ratings of 1000 songs by 15,400

users, on a scale of 1 to 5. The first group of 10,000 users were randomly selected on the condition

that they had rated at least 10 of the 1000 songs. The rest of the users were randomly selected

from responders on a survey that asked them to rate a random subset of 10 of the 1000 songs. The

songs were selected to have at least 500 ratings. Here, about 2 percent of the user-song pairs were

observed.

The estimated variances of the user random effect, the song random effect, and the error are 0.97,

0.24, and 1.30. The estimated kurtoses are −2, −2, and 3.31. Conservative estimates of the variances

of the estimated variance components are 4.5 × 10−5, 10−5, and 5.8 × 10−5, yielding conservative

Page 40: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 32

standard errors of about 0.007, 0.003 and 0.008. In this case there is negligible sampling variability

in the variance component estimates. For determining the rating, the user effect is dominant over the

song effect, but as in the previous example, the greatest variation comes from error or interactions

between song and user.

The third data set from Last.fm [29] contains the numbers of times artists’ songs are played by

about 360,000 users. Only the counts for the top k (for some k) artists for each user is recorded. The

users are randomly selected. This data set is extremely sparse; only about 0.03 percent of user-artist

pairs are observed.

The estimated variances of the user random effect, the artist random effect, and the error are

1.65, 0.22, and 0.27. The estimated kurtoses are 0.019, −2, and 23.14. Conservative estimates of

the variances of the estimated variance components are 1.68× 10−5, 4.06× 10−7, and 1.37× 10−6,

yielding small standard errors of about 0.004, 0.0006 and 0.001. The biggest source of variation in

the number of plays is the user. The kurtosis of the row effect is nearly zero, indicating possible

normality.

In all three data sets, the standard errors of the estimated variance components are at least one

order of magnitude smaller than the estimates themselves. This lends credence to the argument that

we do not need to truncate confidence intervals at zero in practice.

Moreover, in the data sets at least one of the estimated kurtoses was −2, which would be

unexpected if the model is correctly specified. However, if model (2) does not fit the data well, such

behavior may occur. We suspect that more realistic models incorporating fixed effects and SVD-like

interactions would reduce the prevalence of such kurtosis estimates.

Page 41: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

Chapter 3

Linear Mixed Model with Crossed

Random Effects

In this chapter we return to the linear mixed model with crossed random effects, model (1).

Our first task is to obtain consistent estimates of β and the variance components σ2A, σ2

B , and σ2E

with computational cost O(N) and space cost O(R + C). Our second and more challenging task is

to find a reliable estimate of the variance of those estimates, also in O(N) time and O(R+C) space,

in order to perform inference. We could compute finite-sample variances of the estimates, but they

are complex and not efficiently computable. Hence, we consider asymptotic variances in the regime

described in Section 2.1.1, which are much simpler. These constraints would allow our algorithm

to scale to data sets with millions of observations. Our view is that algorithms with costs growing

faster than N are, in this setting, almost like asking for an oracle. In the case of Stitch Fix, our

algorithms will give confidence intervals on quantities such as the effects of certain styles on ratings.

To obtain a scalable algorithm and avoid parametric assumptions, we impose a major simplifica-

tion. In forming our estimate β, we forget about the correlations arising from one of the two factors.

This allows us to easily invert the corresponding covariance matrix of the data. Our estimate β sac-

rifices some efficiency. When computing Var(β) we do take account of both sources of correlation.

As a result, we get an inefficient estimator compared to an oracle that is not constrained to O(N)

cost, but we get a reliable variance estimate for that estimator.

A second simplification is to use the method of moments to estimate the variance components,

as in Chapter 2. The method of moments has many advantages in the big data world. It is easily

parallelizable, makes no parametric assumptions, and has no tuning parameters to select. The

method of moments does come with a drawback. It can lead to estimators that do not obey known

constraints. The best known of these are negative variance estimates. It can also lead to kurtosis

estimates below −2.

33

Page 42: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 34

3.1 Additional Notation

The same notation and asymptotic regime described in Section 2.1.1 apply here, with two additional

definitions. Let v1,i be the N -dimensional vector with ones in entries∑i−1r=1Nr• + 1 to

∑ir=1Nr•

and zeros elsewhere, and let wik be a set of Ni• − 1 unit vectors orthogonal to v1,i with zeros in the

same entries. Similarly, let v2,j be the N -dimensional vector with ones in entries∑j−1s=1N•s + 1 to∑j

s=1N•s and zeros elsewhere and wjk be defined like wik.

Moreover, when there are covariates xij we need to rule out degenerate settings where the sample

variance of xij does not grow with N or where it is dominated by a handful of observations. We

add some such conditions when we discuss central limit theorems in Section 3.4.2.

3.2 Previous Work

Much of the previous literature for parameter estimation and inference for linear mixed models

is similar to and builds upon the literature for crossed random effects models. Again, there are

two categories of algorithms, moment-based and likelihood-based, with the same advantages and

disadvantages as described in Section 2.2. The moment-based algorithms are mostly generalizations

of the moment-based algorithms for crossed random effects models, described in Section 2.2.1. The

likelihood-based algorithms are the same for both crossed random effects models and linear mixed

models with crossed random effects.

Note that model (1) can be written as

Y = Xβ + Zaa+ Zbb+ e, with

a ∼ (0, σ2AIR), b ∼ (0, σ2

BIC), e ∼ (0, σ2EIN ) (independently),

(3.1)

where Y is the vector of Yij , X is the N × p matrix of predictors, 1N is the N -dimensional vector

of ones, Za is a N ×R binary matrix with entry (k, i) equal to 1 if the kth observation is from level

i of the first factor, a is the vector of ai, Zb is a N ×C binary matrix with entry (k, j) equal to 1 if

the kth observation is from level j of the second factor, b is the vector of bj , and IN is the identity

matrix of dimension N . We utilize this reformulation for the rest of this section.

3.2.1 Moment-based Approaches

Henderson II Henderson I, described in Section 2.2.1, can be generalized to linear mixed models.

The generalization is called Henderson II, and we give a description of it following Searle et al.

(2006) [44]. The intuition of the method is to somehow transform the linear mixed model with

crossed random effects to a simple crossed random effects model, and then apply a modified version

of Henderson I on the resulting data.

Page 43: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 35

First, estimate β via β = LY for some matrix p×N matrix L; the choice of L is discussed next.

Then, the residuals are

Y −Xβ = (I −XL)Y = (I −XL)Xβ + (I −XL)Zaa+ (I −XL)Zbb+ (I −XL)e.

From here, there are two directions to go with the choice of L. First, we can choose L so that

β does not appear in the residuals, i.e. such that X = XLX [43]. That requirement is satisfied

by the OLS projection L = (XTX)−1XT. Then, Y −Xβ = (I −H)Zaa + (I −H)Zbb + (I −H)e

for H = X(XTX)−1XT. A modified version of Henderson I accounting for the fact that there is

a projection matrix multiplying Zaa and Zbb is used to estimate the variance components. The

regression coefficient β is estimated at the end using some sort of least squares estimator.

The disadvantage of this approach is that the expectation of the estimating equations become

much more complex to compute due to the presence of the projection matrix H. Moreover, no

analysis of the variances of the estimates is available. Our proposed algorithm, described in Section

3.3, is similar in that it also alternates between estimating β and the variance components and starts

by estimating β via OLS. However, we ignore the presence of the projection matrix H and directly

estimate the variance components using our alternative to Henderson I described in Section 2.4.1.

We prove that in the asymptotic setting, ignoring H doesn’t matter and we still get consistent and

asymptotically Gaussian estimates (with known variances) of β and the variance components. In

addition, unlike Henderson II, after estimating β, we adjust our estimates of the variance components

as well.

Because a more difficult version of Henderson I must be used with the above choice of L, Searle

et al. (2006) [44] describe another way to choose L with the following criterion:

• XLZa = XLZb = 0,

• XL1 = δ1 for some scalar δ, and

• X −XLX = 1τT for some vector τ .

Then, Y −Xβ = (τTβ − δ)1 +Zaa+Zbb+ (I −XL)e, on which the original version of Henderson I

can be applied to estimate the variance components.

The main disadvantage of this approach is that the construction of such a matrix L requires

inverting a square matrix with dimension approximately the total number of levels of the two factors,

which requires time O((R+ C)3) = O(N3/2). Moreover, there is no analysis of the variances of the

final estimates.

Henderson III Henderson III can also be applied to the linear mixed model with crossed random

effects. The estimating equations are now, with R(·) defined as in Section 2.2.1,

R(a, b | β) = R(β, a, b)−R(β)

Page 44: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 36

R(b | β, a) = R(β, a, b)−R(β, a)

SSE = Y TY −R(β, a, b)

where E(R(a, b | β)) equals

σ2Atr(Z

Ta (IN −H)Za) + σ2

Btr(ZTb (IN −H)Zb) + σ2

E(rank([X Za Zb])− rank(X))

and H = X(XTX)−1XT. The disadvantage of this method is that it requires inverting a square

matrix with dimension the total number of levels of the two factors, which requires time O((R +

C)3) = O(N3/2). Moreover, the variances of the resulting estimates are very complex to analyze.

Sufficient statistics We could also use an algorithm developed for generalized linear mixed models

[24]. Suppose that the random effects are normally distributed. Then, a set of estimating equations

based on the sufficient statistics of the data distribution are

∑ij

ZijxijYij = XTY,

∑ijj′

ZijZij′YijYij′ ,∑ii′j

ZijZi′jYijYi′j , and

∑iji′j′

ZijZi′j′YijYi′j′ ,

with expectations

XTE(Y ) = XTXβ,

E

∑ijj′

ZijZij′YijYij′

= βT

∑ijj′

ZijZij′xijxTij′

β + σ2A

∑i

N2i• + σ2

BN + σ2EN,

E

∑ii′j

ZijZi′jYijYi′j

= βT

∑ii′j

ZijZi′jxijxTi′j

β + σ2B

∑j

N2•j + σ2

AN + σ2EN, and

E

∑iji′j′

ZijZi′j′YijYi′j′

= βT

∑iji′j′

ZijZi′j′xijxTi′j′

+ σ2A

∑i

N2i• + σ2

B

∑j

N2•j + σ2

EN.

To solve for β, σ2A, σ2

B , and σ2E , we would need to use a nonlinear root finder such as the Newton-

Raphson method. This seems unnecessarily complex for the linear mixed effects model, and no

analytic formulas for the variances of the estimates are given.

Page 45: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 37

3.2.2 Likelihood-based Approaches

Likelihood-based approaches assume some distribution for the random effects and noise and then

estimate the parameters β and the variance components by maximizing the data likelihood or some

version thereof. The most common approach is to assume that ai, bj , and eij are normally dis-

tributed, which we believe is too stringent for many e-commerce applications.

Using the notation from model (3.1), under that assumption Y ∼ N (Xβ, V ), where

V = σ2AZaZ

Ta + σ2

BZbZTb + σ2

EIN . Maximum likelihood maximizes the log likelihood of the data Y

with respect to β, σ2A, σ2

B , and σ2E . Hartley & Rao (1967) [22] utilize a different parameterization,

letting V = Hσ2E and maximizing the log likelihood with respect to β, σ2

E , σ2A/σ

2E , and σ2

B/σ2E .

The asymptotic variances of the estimates are found by inverting the expected Fisher information

matrix.

Analytic formulas for the maximum likelihood estimates exist for balanced data and other special

observation patterns, but in general we need nonlinear optimization procedures to obtain the (biased)

estimates. Moreover, as shown in Bates (2014) [5] and Section 3.2.3, even computing the value of

the log likelihood at a given set of values for the parameters has O(N3/2) computation cost. We

describe some of these nonlinear optimization algorithms later.

To adjust for the bias of the maximum likelihood estimates, restricted maximum likelihood,

better known as REML, has been developed. Indeed, for a balanced crossed random effects model,

REML turns out to be equivalent to Henderson I [44]. An intuitive explanation for the approach

is to estimate variance components using maximum likelihood based on residuals from fitting only

the fixed effects β by least squares. More specifically, let K be a matrix such that KTX = 0.

That requirement is satisfied by the residual of any generalized least squares projection matrix, e.g.

I − X(XTX)XT. Then, KTY ∼ N (0,KTV K), and we apply the maximum likelihood procedure

on the log likelihood of KTY to estimate the variance components. The regression coefficients can

be estimated using generalized least squares and the estimated variance components. Again, the

asymptotic variance of the REML estimates can be found using the expected Fisher information

matrix.

There have been a number of optimization algorithms used in the literature for linear mixed

models. Graser et al. (1987) [17] utilizes a derivative-free line search algorithm when there is only

one random effect, and the MixedModels package in Julia [6] implements a derivative-free quadratic

approximation algorithm, BOBYQA [39]. The Expectation-Maximization algorithm [12] has also

been used, with the random effects as the hidden data. Raudenbush (1993) [40] applies it to a linear

mixed model with crossed random effects and Laird et al. (1987) [28] to a linear mixed model with

nested (and possibly crossed within nested) random effects.

Gradient-based optimization algorithms have also been popular, because the formulas for the

updates are usually simple. However, computing the gradients of the log likelihoods requires su-

perlinear time, like for computing the log likelihoods themselves. The Gauss-Newton algorithm is

Page 46: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 38

implemented in lme4 [5, 7] and Lindstrom & Bates (1988) [31] uses the Newton-Raphson algorithm

for linear mixed models with nested random effects. Other possible algorithms are the scoring al-

gorithm, which is a variation of the Newton-Raphson algorithm that replaces the negative of the

Hessian of the log-likelihood with the Fisher information matrix, and quasi-Newton algorithms,

which approximate the Hessian with outer products of computed gradients.

3.2.3 Challenges of Likelihood-based Approaches

As in the previous subsection, let us assume that the random effects and noise are normally dis-

tributed. Then, when the observations are ordered by rows, the negative log likelihood is proportional

to

`(β, σ2A, σ

2B , σ

2E) = (Y −Xβ)T(σ2

AAR + σ2BBR + σ2

EIN )−1(Y −Xβ) + ln |σ2AAR + σ2

BBR + σ2EIN |

where AR and BR are defined in Section 3.3.2. We argue that it requires O(N3/2) time to compute

the negative log likelihood at one set of values of the parameters.

Let σ2BBR = PTUDUTP , where P is a permutation matrix and UDUT is the eigendecomposition

of BC . The columns of U are v2,j divided by√N•j . D is a diagonal matrix with entries σ2

BN•j .

Then by the Woodbury formula,

(σ2AAR + σ2

BBR + σ2EIN )−1

= (σ2AAR + σ2

EIN )−1

− (σ2AAR + σ2

EIN )−1PTU(D−1 + UTP (σ2AAR + σ2

EIN )−1PTU)−1UTP (σ2AAR + σ2

EIN )−1.

Similarly,

|σ2AAR + σ2

BBR + σ2EIN | = |σ2

AAR + σ2EIN ||D||D−1 + UTP (σ2

AAR + σ2EIN )−1PTU |

of which the first two terms can be computed analytically in O(R) time.

We can find a sparse representation of PTU by storing the columns seen for each row. Note that

(σ2AAR + σ2

EIN )−1 =

R∑i=1

[v1,iv

T1,i

Ni•(σ2E + σ2

ANi•)+

Ni•−1∑k=1

wikwTik

σ2E

].

Fixing i, we see that we can compute vT1,iPTU in O(Ni•) time and similarly for wik. Thus, we

can compute UTP (σ2AAR + σ2

EIN )−1PTU in O(∑iN

2i•) time. Therefore, we conclude that

• Computing the determinant term in the log likelihood and the inverse of D−1 +UTP (σ2AAR +

σ2EIN )−1PTU both require O(R3 +

∑iN

2i•) time.

Page 47: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 39

• Computing (Y −Xβ)T(σ2AAR + σ2

EIN )−1(Y −Xβ) requires O(∑iN

2i•) time.

• Computing (Y −Xβ)T(σ2AAR + σ2

EIN )−1PTU requires O(∑iN

2i•) time.

• Computing the rest of the first term of the log likelihood requires O(R2) time.

In sum, we have shown explicitly that computing the log likelihood requires O(R3+∑iN

2i•) time.

Note that we could have taken a symmetric approach to computing the log likelihood, factoring out

the row random effects instead of the column random effects in the Woodbury formula. Doing that

requires time O(C3 +∑j N

2•j). Thus, we can choose whichever approach requires less time.

Nevertheless, we see that

∑i

N2i• = R(

∑i

N2i•/R) ≥ R(

∑i

Ni•/R)2 = N2/R.

Write R = Nα for 0 < α < 1. Then, O(R3 +∑iN

2i•) is equivalent to O(N3α + N2−α). This is

minimized when α = 1/2, in which we get that the order is at least O(N3/2). The same argument

applies for O(C3 +∑j N

2•j), so either approach takes at least O(N3/2) time.

It is clear that the lack of scalability arises from the fact that the likelihood involves the inversion

of a large matrix, the covariance matrix. Therefore, if the distribution of the random effects were

chosen so that the likelihood does not involve such an inversion (or multi-dimensional integrals), we

may be able to scale maximum likelihood procedures to massive data. However, it is unclear what

distribution besides the Gaussian would be appropriate, as it must realistically model the data.

3.3 Alternating Algorithm

We summarize our parameter estimation procedure for model (1) in Algorithm 1. We alternate twice

between finding β and the variance component estimates σ2A, σ2

B , and σ2E . Further details of steps

1-4, including the reasoning for which generalized least squares (GLS) estimator to use in step 3, are

given in the next two subsections. Step 5 is described in Section 3.4.2.1 where we derive Var(βRLS)

and Var(βCLS).

We only briefly discuss step 1, which computes the OLS estimate of β. Let X ∈ RN×p have rows

xij in some order and let Y ∈ RN be elements Yij in the same order. Then,

βOLS = (XTX)−1XTY =

(∑ij

ZijxijxTij

)−1∑ij

ZijxijYij . (3.2)

In one pass over the data, we can compute XTX and XTY incrementally and then solve for β.

Solving the normal equations this way is easy to parallelize but more prone to roundoff error than

the usual alternative based on computing the SVD of X. One can compensate by computing in

extended precision. It costs O(p3) to compute βOLS and so the cost of step 1 is O(Np2 + p3).

Page 48: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 40

Algorithm 1: Alternating Algorithm

1 Estimate β via ordinary least squares (OLS): β = βOLS.2 Let σ2

A, σ2B , and σ2

E be the method of moments estimates from Section 2.4.1 defined on the

data (i, j, ηij), where ηij = Yij − xTij βOLS.

3 Compute a more efficient β using σ2A, σ2

B , and σ2E . If σ2

A maxiNi• ≥ σ2B maxj N•j , estimate β

via GLS accounting for row correlations: β = βRLS. Otherwise, estimate it via GLSaccounting for column correlations: β = βCLS.

4 Repeat step 2 using ηij = Yij − xTij β with β from step 3.

5 Compute an estimate Var(β) for β from step 3 using σ2A, σ2

B and σ2E from step 4.

3.3.1 Estimating Variance Components

In this subsection, we discuss steps 2 and 4 of Algorithm 1 in detail. The errors Yij − xTijβ follow a

two-factor crossed random effects model, model (2). If β is a good estimate of β, then the residuals

ηij = Yij − xTij β should approximately follow a two-factor crossed random effects model with µ = 0

and variance components σ2A, σ2

B , and σ2E .

We estimate σ2A, σ2

B , and σ2E using the algorithm described in Section 2.4.1 with data (i, j, ηij).

That algorithm gives unbiased estimates of the variance components in a two-factor crossed random

effects model by applying the method of moments on three U -statistics; the weighted sum of within-

row variances, the weighted sum of within-column variances, and the overall variance.

The (modified) versions of those statistics used in Algorithm 1 are:

Ua(β) =1

2

∑ijj′

N−1i• ZijZij′(ηij − ηij′)2,

Ub(β) =1

2

∑jii′

N−1•j ZijZi′j(ηij − ηi′j)2, and

Ue(β) =1

2

∑iji′j′

ZijZi′j′(ηij − ηi′j′)2.

(3.3)

Then, the variance component estimates are obtained by solving the system

M

σ2A

σ2B

σ2E

=

Ua(β)

Ub(β)

Ue(β)

, M =

0 N −R N −R

N − C 0 N − CN2 −

∑iN

2i• N2 −

∑j N

2•j N2 −N

. (3.4)

As described in Section 2.4.1, the U -statistics can be computed in one pass over the data. The

procedure differs slightly in this case; in the pass over the data, at each point, compute ηij and

use that instead of Yij to compute the U -statistics. Therefore, the computational cost is still O(N)

(actually O(Np)) and the space cost is still O(R + C). Solving (3.4) takes constant time. Thus,

steps 2 and 4 each have computational complexity O(N) and space complexity O(R+ C).

Page 49: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 41

3.3.2 Scalable and More Efficient Regression Coefficient Estimators

Two estimators: We first define the GLS estimators of β accounting for row or column correla-

tions. We consider two orderings of the data. In one, the data are ordered by rows and the other

is an ordering by columns. Our algorithms do not require us to sort the data into either of these

orders; they are merely utilized to describe our estimators clearly. Let P be the permutation matrix

corresponding to the transformation of the column ordering to the row ordering. Let AR be the

block diagonal matrix with blocks 1Ni•1TNi•

and BC the block diagonal matrix with blocks 1N•j1TN•j .

In the row ordering

Cov(Y ) = VR ≡ σ2EIN + σ2

AAR + σ2BBR, BR = PBCP

T, (3.5)

while in the column ordering

Cov(Y ) = VC ≡ σ2EIN + σ2

AAC + σ2BBC , AC = PTARP. (3.6)

Solving a system of equations defined by VR or VC generally takes time that grows likeN3/2 by the

argument in Section 3.2.3, so the usual GLS estimator is not feasible. Our estimate β takes account

of either the row or column correlations but not both. To account for the row correlations, assuming

that the data are ordered by rows, we use the working covariance matrix VA = σ2EIN + σ2

AAR.

Similarly, to account for the column correlations, assuming that the data are ordered by columns,

we use the covariance matrix VB = σ2EIN + σ2

BBC .

We define the GLS estimator of β accounting for the row correlations as

βRLS = (XTV −1A X)−1XTV −1A Y, VA ≡ σ2EIN + σ2

AAR, (3.7)

where the rows of X and Y are xij and Yij ordered by i. We use estimates of σ2E and σ2

A. Similarly,

the GLS estimator of β accounting for the column correlations is

βCLS = (XTV −1B X)−1XTV −1B Y, VB ≡ σ2EIN + σ2

BBC , (3.8)

where now the rows of X and Y are xij and Yij ordered by j.

GLS Computations: Here we show how to compute βRLS in O(N) time and using O(R) space.

By symmetry, similar results hold for βCLS.

From the Woodbury formula [19] and defining D as the N ×R matrix with ith column v1,i (from

Section 3.1), we have

XTV −1A X = XT(σ2EIN + σ2

ADDT)−1X

Page 50: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 42

=XTX

σ2E

− σ2A

σ2E

XTD diag( 1

σ2E + σ2

ANi•

)DTX

=1

σ2E

∑ij

ZijxijxTij −

σ2A

σ2E

∑i

1

σ2E + σ2

ANi•

(∑j

Zijxij

)(∑j

Zijxij

)T

.

Likewise, XTV −1A Y equals

1

σ2E

∑ij

ZijxijYij −σ2A

σ2E

∑i

1

σ2E + σ2

ANi•

(∑j

Zijxij

)(∑j

ZijYij

).

One pass over the data allows us to compute∑ij Zijxijx

Tij and

∑ij ZijxijYij , as well as Ni•,∑

j Zijxij and∑j ZijYij for i = 1, . . . , R. This has computational cost O(Np2) and space O(Rp).

Using these quantities, we calculate XTV −1A X and XTV −1A Y in time O(Rp2). Then, βRLS is com-

puted in O(p3). Hence, βRLS can be found with O(Rp) space and computational cost O(Np2 + p3),

which is linear in N .

Efficiencies: In step 3 of Algorithm 1, we choose between βRLS and βCLS. The choice could be

made dependent on X but in many applications one considers numerous different X matrices and we

prefer to have a single choice for all regressions. Accordingly, we find a lower bound on the efficiency

of RLS when X is a single nonzero vector x ∈ RN×1. We choose RLS if that lower bound is higher

than the corresponding bound for CLS, in this p = 1 setting. That rule translates to choosing RLS

if the variance component associated with rows is dominant and CLS otherwise, as shown below.

The full GLS estimator is

βGLS = (XTV −1R X)−1XTV −1R Y (3.9)

when the data are ordered by rows and (XTV −1C X)−1XTV −1C Y when the data are ordered by

columns. For data ordered by rows, the efficiency of βRLS is

effRLS =Var(βGLS)

Var(βRLS)=

(xTV −1A x)2

(xTV −1A VRV−1A x)(xTV −1R x)

, (3.10)

in this p = 1 setting. For data ordered by columns, the corresponding efficiency of βCLS is

effCLS =Var(βGLS)

Var(βCLS)=

(xTV −1B x)2

(xTV −1B VCV−1B x)(xTV −1C x)

. (3.11)

The next two theorems establish lower bounds on these efficiencies.

Theorem 3.3.1. Let $A$ be a positive definite Hermitian matrix and $u$ a unit vector. If the eigenvalues of $A$ are bounded below by $m > 0$ and above by $M < \infty$, then
$$(u^TAu)(u^TA^{-1}u) \le \frac{(m+M)^2}{4mM}.$$
Equality may hold, for example when $u^TAu = (M+m)/2$ and the only eigenvalues of $A$ are $m$ and $M$.

Proof. This is Kantorovich's inequality [34].

By two applications of Theorem 3.3.1 to (3.10) and (3.11) we prove:

Theorem 3.3.2. When $p = 1$ and $\mathrm{eff}_{RLS}$ and $\mathrm{eff}_{CLS}$ are defined as in (3.10) and (3.11),
$$\mathrm{eff}_{RLS} \ge \frac{4\sigma_E^2(\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j})}{(2\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j})^2} \qquad\text{and}\qquad \mathrm{eff}_{CLS} \ge \frac{4\sigma_E^2(\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet})}{(2\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet})^2},$$
where each inequality is tight.

Proof. See Section B.1.

After some algebra, we see that the worst case efficiency of $\hat\beta_{RLS}$ is higher than that of $\hat\beta_{CLS}$ when $\sigma_A^2\max_i N_{i\bullet} > \sigma_B^2\max_j N_{\bullet j}$. We set $\hat\beta$ to be $\hat\beta_{RLS}$ when $\hat\sigma_A^2\max_i N_{i\bullet} \ge \hat\sigma_B^2\max_j N_{\bullet j}$, and $\hat\beta_{CLS}$ otherwise. As shown previously, either estimate requires $O(N)$ computation time and $O(R+C)$ space. Our importance measure for the row random effects is $\sigma_A^2\max_i N_{i\bullet}$. We believe that other reasonable measures could be derived from other ways of quantifying efficiency.
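In code, the selection rule is a one-line comparison; the small Python sketch below (ours) assumes that the estimated variance components and the counts $N_{i\bullet}$ and $N_{\bullet j}$ are available as dictionaries:

    def choose_weighting(s2A_hat, s2B_hat, Ni, Nj):
        """Pick RLS when the row importance measure dominates."""
        if s2A_hat * max(Ni.values()) >= s2B_hat * max(Nj.values()):
            return "RLS"
        return "CLS"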

Remark: We also considered GLS estimators of $\beta$ with covariance matrices of the form $V_A = (\sigma_E^2 + \lambda\sigma_B^2)I_N + \sigma_A^2 A_R$, in order to partially account for the variance induced by the column random effects. The value of $\lambda$ was chosen to maximize the worst case efficiency of the estimator. However, doing so did not appear to decrease the MSE of the estimate of $\beta$ in our simulations, so we decided to remain with the simpler estimators.

Alternatively, we could use an iterative procedure such as backfitting [9] or iterative linear algebra solvers to compute $\hat\beta_{GLS}$; we are guaranteed to approach the true solution as the number of iterations goes to infinity. The hope is that we get a good approximation in only $O(1)$ iterations. The disadvantage of this approach for our tasks is that the variance of the resulting estimate after a finite number of iterations is difficult to compute analytically. We could use the bootstrap, but doing so is expensive for such large, correlated data sets. Indeed, it would appear that no bootstrap procedure can be exact [35].


3.4 Theoretical Properties of Estimates

Here we show that the parameter estimates $\hat\beta$, $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ obtained from Algorithm 1 are consistent. We also give a central limit theorem for $\hat\beta$ and show how to compute its asymptotic variance in $O(N)$ time and $O(R+C)$ space. A central limit theorem for the variance component estimates also appears, based on the corresponding martingale CLT for the crossed random effects model, Theorem 2.4.6. We use the asymptotic assumptions described in Section 3.1 and some additional regularity conditions on the $x_{ij}$.

As in ordinary IID error regression problems, our central limit theorem requires the information in the observed $x_{ij}$ to grow quickly in every projection while also imposing a limit on the largest $x_{ij}$. For each $i$ with $N_{i\bullet} > 0$, let $\bar x_{i\bullet}$ be the average of those $x_{ij}$ with $Z_{ij} = 1$, and similarly define column averages $\bar x_{\bullet j}$.

For a symmetric positive semi-definite matrix $V$, let $I(V)$ be the smallest eigenvalue of $V$. We will need lower bounds on $I(V)$ for various $V$ to rule out singular or nearly singular designs. Some of those $V$ involve centered variables. In most applications $x_{ij}$ will include an intercept term, and so we assume that the first component of every $x_{ij}$ equals 1. That term raises some technical difficulties, because centering that component always yields zero. We will treat that term specially in some of our proofs. For a symmetric matrix $V \in \mathbb{R}^{p\times p}$, we let
$$I_0(V) = I\big((V_{ij})_{2\le i,j\le p}\big)$$
be the smallest eigenvalue of the lower $(p-1)\times(p-1)$ submatrix of $V$.

In our motivating applications, it is reasonable to assume that the $\|x_{ij}\|$ are uniformly bounded. We let
$$M_\infty \equiv \max_{ij} Z_{ij}\|x_{ij}\|^2 \qquad (3.12)$$
quantify the largest $x_{ij}$ in the data so far.

3.4.1 Consistency

First, we give conditions under which $\hat\beta_{OLS}$ from step 1 is consistent.

Theorem 3.4.1. Let $\max(\epsilon_R,\epsilon_C) \to 0$ and $I(X^TX) \ge cN$ for some $c > 0$, as $N\to\infty$. Then
$$E(\|\hat\beta_{OLS} - \beta\|^2) = O\big((\epsilon_R+\epsilon_C)/I(X^TX/N)\big) \to 0.$$

Proof. See Section B.3.1.

Second, we show that the variance component estimates computed in step 2 are consistent. Recall that we compute the U-statistics (3.3) on the data $(i, j, \hat\eta_{ij} = Y_{ij} - x_{ij}^T\hat\beta)$ and use them to obtain estimates $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ via (3.4).


Theorem 3.4.2. Suppose that, as $N\to\infty$, $\max(\epsilon_R,\epsilon_C)\to 0$, $\max(R,C)/N \le \theta \in (0,1)$, $\hat\beta \xrightarrow{p} \beta$, and $M_\infty$ is bounded. Then $\hat\sigma_A^2 \xrightarrow{p} \sigma_A^2$, $\hat\sigma_B^2 \xrightarrow{p} \sigma_B^2$, and $\hat\sigma_E^2 \xrightarrow{p} \sigma_E^2$.

Proof. See Section B.3.2.

From Theorem 3.4.1, the estimate of $\beta$ obtained in step 1 of Algorithm 1 is consistent. Therefore, from Theorem 3.4.2, the variance component estimates obtained in step 2 are consistent, given the combined assumptions of those two theorems. The proof of Theorem 3.4.1 shows that the estimated variance components differ by $O(\|\hat\beta-\beta\|^2 + \epsilon\|\hat\beta-\beta\|)$, with $\epsilon = \epsilon_R + \epsilon_C$, from what we would get replacing $\hat\beta$ by an oracle value $\beta$ and computing variance components of $Y_{ij} - x_{ij}^T\beta$. Such an estimate would have mean squared error $O(\epsilon)$ by Theorem 2.4.3. As a result the mean squared error for all parameters of interest is $O(\epsilon)$.

Our third result shows that the estimate of $\beta$ obtained in step 3 is consistent. We do so by showing that the estimators $\hat\beta_{RLS}$ and $\hat\beta_{CLS}$ are consistent when constructed using consistent variance component estimates. We give the version for $\hat\beta_{RLS}$.

Theorem 3.4.3. Let $\hat\beta_{RLS}$ be computed with $\hat\sigma_A^2 \xrightarrow{p} \sigma_A^2$ and $\hat\sigma_E^2 \xrightarrow{p} \sigma_E^2$ as $N\to\infty$, where $\sigma_E^2 > 0$. If $\max(\epsilon_R,\epsilon_C)\to 0$,
$$I_0\Big(\sum_{ij} Z_{ij}(x_{ij}-\bar x_{i\bullet})(x_{ij}-\bar x_{i\bullet})^T/N\Big) \ge c > 0, \qquad (3.13)$$
and
$$\frac{1}{R^2}\sum_{ir}(ZZ^T)_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1} \to 0, \qquad (3.14)$$
then $\hat\beta_{RLS} \xrightarrow{p} \beta$.

Proof. See Section B.3.3.

From Theorem 3.4.2, the variance component estimates obtained in step 4 are consistent. There-

fore by Theorem 3.4.3, the final estimates returned by Algorithm 1 are consistent.

The most complicated part of the proof of Theorem 3.4.3 involves handling the contribution of $b_j$ to $\hat\beta_{RLS}$. In row weighted GLS it is quite standard to have random errors $a_i$ and $e_{ij}$, but here we must also contend with errors $b_j$ that do not appear in the model for which $\hat\beta_{RLS}$ is the MLE. Condition (3.14) is used to control the variance contribution of the column random effects to the intercept in $\hat\beta_{RLS}$. For balanced data it reduces to $1/C \to 0$, and so it has an effective number of columns interpretation. Recalling that $(ZZ^T)_{ir}$ is the number of columns sampled in both rows $i$ and $r$, we have $(ZZ^T)_{ir} \le N_{r\bullet}$, and so a sufficient condition for (3.14) is that $(1/R)\sum_i N_{i\bullet}^{-1} \to 0$. For sparsely observed data we expect $(ZZ^T)_{ir} \ll \max(N_{i\bullet}, N_{r\bullet})$ to be typical, and then these bounds are conservative.


Any realistic setting will have $\sigma_E^2 > 0$, and we need $\sigma_E^2 > 0$ for $\hat\beta_{RLS}$ to be well defined. So that condition in Theorem 3.4.3 is not restrictive.

It remains to show that the variance component estimates from step 4 are consistent. We can just apply Theorem 3.4.2 again. Therefore the final estimates returned by Algorithm 1 are consistent given only weak conditions on the behavior of $Z_{ij}$ and $x_{ij}$.

3.4.2 Asymptotic Normality

Here we show that the estimator $\hat\beta_{RLS}$ constructed using consistent estimates of $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$ is asymptotically Gaussian. We need stronger conditions than we needed for consistency.

These conditions are expressed in terms of some weighted means of the predictors. First, let
$$\tilde x_{\bullet j} = \frac{1}{N_{\bullet j}}\sum_i Z_{ij}\,\frac{\sigma_A^2}{\sigma_A^2 + \sigma_E^2/N_{i\bullet}}\,\bar x_{i\bullet}. \qquad (3.15)$$
This is a 'second order' average of $x$ for column $j$: it is the average, over the rows $i$ that intersect column $j$, of the row averages $\bar x_{i\bullet}$ shrunken towards zero. For a balanced design with $Z_{ij} = 1_{i\le R}1_{j\le C}$ we would have $\tilde x_{\bullet j} = \bar x_{\bullet\bullet}\sigma_A^2/(\sigma_A^2 + \sigma_E^2/C)$, so then the second order means would all be very close to $\bar x_{\bullet\bullet}$ for large $C$. Apart from the shrinkage, we can think of $\tilde x_{\bullet j}$ as a local version of $\bar x_{\bullet\bullet}$ appropriate to column $j$.

Next let
$$k = \sum_j N_{\bullet j}^2(\bar x_{\bullet j} - \tilde x_{\bullet j})\Big/\sum_j N_{\bullet j}^2 \in \mathbb{R}^p. \qquad (3.16)$$
This is a weighted sum of adjusted column means, weighted by the squared column sizes. The intercept component of this $k$ will not be used.

Theorem 3.4.4. Let $\hat\beta_{RLS}$ be computed with $\hat\sigma_A^2 \xrightarrow{p} \sigma_A^2$, $\hat\sigma_B^2 \xrightarrow{p} \sigma_B^2$, and $\hat\sigma_E^2 \xrightarrow{p} \sigma_E^2 > 0$ as $N\to\infty$. Suppose that
$$I\Big(\sum_i \bar x_{i\bullet}\bar x_{i\bullet}^T\Big),\qquad I_0\Big(\sum_{ij} Z_{ij}(x_{ij}-\bar x_{i\bullet})(x_{ij}-\bar x_{i\bullet})^T\Big),\qquad\text{and}\qquad I_0\Big(\sum_j N_{\bullet j}^2(\bar x_{\bullet j}-\tilde x_{\bullet j}-k)(\bar x_{\bullet j}-\tilde x_{\bullet j}-k)^T\Big)\Big/\max_j N_{\bullet j}^2$$
all tend to infinity, where $\tilde x_{\bullet j}$ is given by (3.15) and $k$ is given by (3.16). For $c_j = \sum_i Z_{ij}\sigma_E^2/(\sigma_E^2 + \sigma_A^2N_{i\bullet})$ and $c_{ij} = Z_{ij}\sigma_E^2/(\sigma_E^2 + \sigma_A^2N_{i\bullet})$, assume that both $\max_j c_j^2/\sum_j c_j^2$ and $\max_{ij} c_{ij}^2/\sum_{ij} c_{ij}^2$ converge to zero. Then $\hat\beta_{RLS}$ is asymptotically distributed as
$$\mathcal{N}\big(\beta,\ (X^TV_A^{-1}X)^{-1}X^TV_A^{-1}V_RV_A^{-1}X(X^TV_A^{-1}X)^{-1}\big). \qquad (3.17)$$


Proof. See Section B.4.1.

The statement that $\hat\beta_{RLS}$ has asymptotic distribution $\mathcal{N}(\beta, V)$ is shorthand for $V^{-1/2}(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, I_p)$. Theorem 3.4.4 imposes three information criteria. First, the $R$ rows $i$ with $N_{i\bullet} > 0$ must have sample average vectors $\bar x_{i\bullet}$ with information tending to infinity. It would be reasonable to expect that information to be proportional to $R$, and also reasonable to require $R\to\infty$ for a CLT. Next, the sum of within-row sums of squares and cross products of the row-centered $x_{ij}$ must have growing information, apart from the intercept term. Finally, thinking of $\bar x_{\bullet j} - \tilde x_{\bullet j}$ as a locally centered mean for column $j$, those quantities centered on the vector $k$ must have a weighted sum of squares that is not dominated by any single column when weights proportional to $N_{\bullet j}^2$ are applied.

The conditions on $c_j$ and $c_{ij}$ are used to show that the CLT will apply to the intercept in the regression. The condition on $\max_j c_j^2/\sum_j c_j^2$ will fail if, for example, column $j = 1$ has half of the $N$ observations, all in rows of size $N_{i\bullet} = 1$. In the case of an $R\times C$ grid, $\max_j c_j^2/\sum_j c_j^2 = 1/C$, and so we can interpret this condition as requiring a large enough effective number $\sum_j c_j^2/\max_j c_j^2$ of columns in the data.

The condition on $\max_{ij} c_{ij}^2/\sum_{ij} c_{ij}^2$ will fail if, for example, the data contain a full $R\times C$ grid of values plus a single observation with $i = R+1$ and $j = C+1$. The problem is that in a row based regression, a single small row can get outsized leverage. It can be controlled by dropping relatively small rows. This pruning of rows is only used for the CLT to apply to the intercept term. It is not needed for other components of $\hat\beta$, nor is it needed for consistency. We do not know whether it is necessary for the CLT.

Next we show that the variance component estimates are asymptotically normal as well, building off of the corresponding result from Section 2.4.3. Define $U_a$, $U_b$, and $U_e$ to be the same as $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$, but with $\eta_{ij} = a_i + b_j + e_{ij}$ in place of $\hat\eta_{ij}$.

As stated in Section 3.4.1, in step 3 of Algorithm 1 we have obtained a consistent estimate $\hat\beta$ of $\beta$. From (2.19), the variance component estimates obtained in step 4 are, as $N\to\infty$,
$$\begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix} = \begin{pmatrix}-N^{-1} & 0 & N^{-2}\\ 0 & -N^{-1} & N^{-2}\\ N^{-1} & N^{-1} & -N^{-2}\end{pmatrix}\begin{pmatrix}U_a(\hat\beta)\\ U_b(\hat\beta)\\ U_e(\hat\beta)\end{pmatrix}. \qquad (3.18)$$

Therefore, to show that $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ are asymptotically normal, we show that $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$ are asymptotically jointly normal. More precisely, we show that
$$\tilde U_a(\hat\beta) = U_a(\hat\beta)\Big/\sqrt{\sum_j N_{\bullet j}^2},\qquad \tilde U_b(\hat\beta) = U_b(\hat\beta)\Big/\sqrt{\sum_i N_{i\bullet}^2},\qquad\text{and}\qquad \tilde U_e(\hat\beta) = U_e(\hat\beta)\Big/\Big(N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}\Big)$$
are asymptotically jointly normal.

Here is an outline of our proof. The asymptotic joint distribution of $\tilde U_a = U_a/\sqrt{\sum_j N_{\bullet j}^2}$, $\tilde U_b = U_b/\sqrt{\sum_i N_{i\bullet}^2}$, and $\tilde U_e = U_e/(N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2})$ is Gaussian, from Theorem 2.4.6. We show that $\tilde U_a(\hat\beta) - \tilde U_a$, $\tilde U_b(\hat\beta) - \tilde U_b$, and $\tilde U_e(\hat\beta) - \tilde U_e$ are asymptotically zero and then cite Slutsky's Theorem.

Lemma 3.4.1. Under the assumptions of Section 3.1, if $E(\|\hat\beta-\beta\|^2) = o(\epsilon_R+\epsilon_C)$, as well as (2.31) and boundedness of $M_\infty$, then $\tilde U_a(\hat\beta) - \tilde U_a$, $\tilde U_b(\hat\beta) - \tilde U_b$, and $\tilde U_e(\hat\beta) - \tilde U_e$ all converge to 0 in probability.

Proof. See Section B.4.2.

The condition on $\hat\beta$ in Lemma 3.4.1 holds for $\hat\beta_{OLS}$ by Theorem 3.4.1. Because $\hat\beta_{RLS}$ and $\hat\beta_{CLS}$ use a better model for $\mathrm{Cov}(Y)$, they should ordinarily satisfy it too.

Theorem 3.4.5. Under the conditions of Theorem 2.4.6 and Lemma 3.4.1, the variance component estimates output by Algorithm 1 are asymptotically distributed as
$$\mathcal{N}\left(\begin{pmatrix}\sigma_A^2\\ \sigma_B^2\\ \sigma_E^2\end{pmatrix},\ \begin{pmatrix}\sigma_A^4(\kappa_A+2)\dfrac{\sum_i N_{i\bullet}^2}{N^2} & \sigma_E^4(\kappa_E+2)\dfrac{1}{N} & -\sigma_E^4(\kappa_E+2)\dfrac{1}{N}\\[1mm] * & \sigma_B^4(\kappa_B+2)\dfrac{\sum_j N_{\bullet j}^2}{N^2} & -\sigma_E^4(\kappa_E+2)\dfrac{1}{N}\\[1mm] * & * & \sigma_E^4(\kappa_E+2)\dfrac{1}{N}\end{pmatrix}\right),$$
where each $*$ is filled in by symmetry.

Proof. See Section B.4.3.

As with the crossed random effects model, the variance component estimates are asymptotically uncorrelated. However, they could be correlated with $\hat\beta$. Indeed, we expect that they are correlated, since we do not assume Gaussian or even symmetric distributions for $a_i$, $b_j$, and $e_{ij}$.

In practice, to compute the variances of $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$, we need to estimate $\kappa_A$, $\kappa_B$, and $\kappa_E$ in step 4 of Algorithm 1 using the method of moments technique described in Section 2.4.2.3. Then, we plug $\hat\sigma_A^2$, $\hat\sigma_B^2$, $\hat\sigma_E^2$, $\hat\kappa_A$, $\hat\kappa_B$, and $\hat\kappa_E$ into the result of Theorem 3.4.5.

3.4.2.1 Computing the Variance of β

Here we show how to compute the estimate of the asymptotic variance of $\hat\beta$ from Theorem 3.4.4. Expanding its expression, we see that
$$(X^TV_A^{-1}X)^{-1}X^TV_A^{-1}V_RV_A^{-1}X(X^TV_A^{-1}X)^{-1}
= (X^TV_A^{-1}X)^{-1}X^TV_A^{-1}(V_A + \sigma_B^2B_R)V_A^{-1}X(X^TV_A^{-1}X)^{-1}
= (X^TV_A^{-1}X)^{-1} + (X^TV_A^{-1}X)^{-1}X^TV_A^{-1}\sigma_B^2B_RV_A^{-1}X(X^TV_A^{-1}X)^{-1}
= (X^TV_A^{-1}X)^{-1} + (X^TV_A^{-1}X)^{-1}\mathrm{Var}(X^TV_A^{-1}b)(X^TV_A^{-1}X)^{-1}, \qquad (3.19)$$
where $b$ is the length-$N$ vector of column random effects for each observation.


Using the Woodbury formula [19], we have that $\mathrm{Var}(X^TV_A^{-1}b)$ equals
$$\frac{\sigma_B^2}{\sigma_E^4}\sum_j\Big(X_{\bullet j} - \sigma_A^2\sum_i\frac{Z_{ij}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2N_{i\bullet}}\Big)\Big(X_{\bullet j} - \sigma_A^2\sum_i\frac{Z_{ij}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2N_{i\bullet}}\Big)^T. \qquad (3.20)$$

In practice, we plug $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ in for $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$ in (3.19) and (3.20). We already have $(X^TV_A^{-1}X)^{-1}$ as well as $N_{i\bullet}$ and $X_{i\bullet}$ for $i = 1,\dots,R$ available from computing $\hat\beta_{RLS}$. In a new pass over the data, we compute $X_{\bullet j}$ and $\sum_i Z_{ij}X_{i\bullet}/(\sigma_E^2 + \sigma_A^2N_{i\bullet})$ for $j = 1,\dots,C$, incurring $O(Np)$ computational and $O(Cp)$ storage cost. Then, (3.20) can be found in $O(Cp^2)$ time; a final step finds (3.19) in $O(p^3)$ time. Overall, estimating the variance of $\hat\beta$ requires $O(Np + Cp^2 + p^3)$ additional computation time and $O(Cp)$ additional space.
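The following Python sketch of this second pass continues the beta_rls sketch from Section 3.3 (again the names are ours; XtVAX_inv, n, and sx are the quantities already in hand from computing $\hat\beta_{RLS}$, and the variance components are the plugged-in estimates):

    import numpy as np
    from collections import defaultdict

    def var_beta_rls(stream, p, s2A, s2B, s2E, XtVAX_inv, n, sx):
        col_x = defaultdict(lambda: np.zeros(p))  # X_.j
        col_w = defaultdict(lambda: np.zeros(p))  # sum_i Z_ij X_i./(s2E + s2A N_i.)
        for i, j, y, x in stream:                 # one extra pass: O(Np) time, O(Cp) space
            col_x[j] += x
            col_w[j] += sx[i] / (s2E + s2A * n[i])
        V = np.zeros((p, p))                      # Var(X^T V_A^{-1} b), as in (3.20)
        for j in col_x:
            d = col_x[j] - s2A * col_w[j]
            V += np.outer(d, d)
        V *= s2B / s2E ** 2
        return XtVAX_inv + XtVAX_inv @ V @ XtVAX_inv  # as in (3.19)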

3.5 Experimental Results

Here we simulate Gaussian data to compare both the computational and statistical efficiency of our algorithm with a state of the art code for linear mixed models, MixedModels [6]. In this section, all our computations were carried out in Julia. MixedModels maximizes the log-likelihood of the data under the assumption that $a_i$, $b_j$, and $e_{ij}$ are normally distributed. It uses the derivative-free optimization algorithm BOBYQA [39], iteratively computing and maximizing an approximation to the objective. More specifically, at each iteration, it computes the objective at various values of $\beta$, $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$. Each computation of the objective can be shown to require $O(N^{3/2})$ computation (see Section 3.2.3), but the number of iterations needed for such a non-concave function does not seem to be known.

3.5.1 Simulations

We used a range of data sizes: p = 1, 5, 10, 20 and N = 100, 400, 1600, 6400, 25,600, 102,400, 409,600, 1,638,400. We let $\beta$ be the vector of ones, and each entry of $x_{ij}$ was sampled from $\mathcal{N}(0,1)$. For each choice of $R = C = 2\sqrt{N}$ and p, we sampled 100 observation patterns, each time randomly selecting exactly 25 percent of the (i, j) pairs to observe. With the $x_{ij}$ and observation pattern fixed, we sampled $Y_{ij}$ according to model (1) for the observed (i, j), with $\sigma_A^2 = 2$, $\sigma_B^2 = 0.5$, and $\sigma_E^2 = 1$. Then we ran our algorithm (including calculating the variance of $\hat\beta$) and maximum likelihood, recording their computation times and the mean squared errors of their parameter estimates. At the end, for each pair of values of R = C and p, we recorded the average computation times and the four mean squared errors of both algorithms. For comparison purposes, we divide the mean squared errors of $\hat\beta$ by p.

We also carried out some auxiliary simulations. When p = 1, we also considered N = 6,553,600

and 14,745,600, but with 10 observation patterns instead of 100. For our algorithm, we additionally

used N = 26,214,400 and 58,982,400; the maximum likelihood algorithm threw an error for more than 21 million observations. When p = 5, we also considered N = 6,553,600 and 10,240,000, but with

10 observation patterns instead of 100. For our algorithm, we additionally used N = 14,745,600,

26,214,400, and 36,966,400; the maximum likelihood algorithm threw an error for more than 13

million observations. When p = 10, we also considered N = 6,553,600, but with 10 observation

patterns instead of 100. For our algorithm, we additionally used N = 14,745,600 and 23,040,000;

the maximum likelihood algorithm threw an error for more than 9 million observations. When

p = 20, we also considered N = 3,686,400, but with 10 observation patterns instead of 100. For our

algorithm, we additionally used N = 6,553,600 and 14,745,600; the maximum likelihood algorithm

threw an error for more than 5 million observations.

3.5.1.1 Computational efficiency

Figure 3.2 shows log-log plots of the average computation time required in seconds versus $N$ for each value of p. The black dotted lines correspond to a computational complexity of $O(N)$ and the green dotted lines to a computational complexity of $O(N^{3/2})$. We see that for large $N$, the computational cost of maximum likelihood is indeed $O(N^{3/2})$, while the computational cost of our algorithm is $O(N)$ throughout.

Notice that the computation time savings of our algorithm compared to maximum likelihood are larger for smaller p. This is because, as seen in Section 3.2.3, as p increases for constant $N$, the fraction of the computation time taken to compute $X\hat\beta$ increases.

Because the maximum likelihood cost is superlinear, we expect that it should become infeasible at some $N$ for which the method of moments can still work. Indeed, this is described in the previous section. Because our algorithm only computes simple summary statistics on each pass over the data, it is easily parallelizable, and we do not have to store any large matrices in memory on one machine.

For p = 5, we also investigated the computation time per iteration for maximum likelihood. We repeated the same simulation as above, but with 10 runs instead of 100 and with N = 25 and N = 6,553,600 as the smallest and largest data sizes, respectively. Figure 3.1 shows a log-log scatter plot of the computation time per iteration versus $N$. The black line goes through the mean computation time per iteration for each data size. For larger values of $N$, the computation time per iteration indeed scales as $O(N^{3/2})$, as shown by the dotted line.

Figure 3.1: For p = 5, log-log plot of the computation time per iteration for maximum likelihood.

3.5.1.2 Statistical efficiency

For both MLE and our algorithm, the scaled MSE of $\hat\beta$ appears to decrease linearly with the number of observations. From the Gauss-Markov theorem we know that $\hat\beta_{RLS}$ or $\hat\beta_{CLS}$ cannot be statistically efficient. In our simulated example, the MSE of the estimate output by our algorithm is larger by at most a factor of two.

For $\sigma_A^2$ and $\sigma_B^2$, the MSE of the estimates output by both algorithms decreases sublinearly with the number of observations. Initially, the MSE of our algorithm may be larger, but as $N$ increases the two MSEs are nearly identical. Occasionally, the MSE of our algorithm is smaller than that of MLE. This may be due to the MLE search being caught in local maxima of the likelihood. For $\sigma_E^2$, the MSE of the estimates output by both algorithms decreases approximately linearly with the number of observations. However, the MSE of our estimate is usually larger by a factor of two or three.

In this example, we take a modest loss in statistical efficiency for $\hat\beta$ and $\hat\sigma_E^2$, while attaining a better scaling of the computational cost. For $\sigma_A^2$ and $\sigma_B^2$ the two algorithms have practically equivalent accuracy. These comparisons were run on data simulated from the Gaussian model that the MLE assumes, but our moments method does not require that assumption. Likelihood based variance estimates for variance components, e.g., $\widehat{\mathrm{Var}}(\hat\sigma_A^2)$, can fail to be even asymptotically correct when the underlying parametric model is violated.

3.5.2 An Example from E-commerce

Stitch Fix has provided us with some of their client ratings data. The data are fully anonymized and void of any personally identifying information. Further, the data provided by Stitch Fix are a sample of their data, and consequently do not reflect their actual numbers of clients, items, or their ratios, for example. Nonetheless this is an interesting data set on which to fit a linear mixed effects model.

We received data on clients' ratings of items they received, as well as the following information about the clients and items. For each client and item pair, we have a composite rating $Y_{ij}$ on a scale from 1 to 10, the item material, whether the item style is edgy, whether the client likes edgy styles, and a match score, which is the probability that the client keeps the item, predicted before it is actually sent. The match score is a prediction from a baseline model and is not representative of all algorithms used at Stitch Fix. There is some circularity in the data, because whether items are labeled as edgy/Bohemian is based on whether clients who like edgy/Bohemian styles like them. We ignore such complications for the sake of simplicity. Note that we treat the rating as a continuous variable, even though it is actually ordinal.

Figure 3.2: For each value of p (panels (a)-(d): p = 1, 5, 10, 20), log-log plots of the computation times of the two algorithms against N.

We first investigated the observation pattern of the data. We received N = 5,000,000 ratings on C = 6,318 items by R = 762,752 clients. Thus, in the sample of data provided by Stitch Fix we have $C/N \doteq 0.00126$ and $R/N \doteq 0.153$. The latter ratio indicates that only a relatively small number of ratings from each client are included in the data (their full shipment history is not included in the sampled data). The data are not dominated by a single row or column because $\epsilon_R \doteq 9\times10^{-6}$ and $\epsilon_C \doteq 0.0143$. Similarly
$$\frac{N}{\sum_i N_{i\bullet}^2} \doteq 0.103,\qquad \frac{\sum_i N_{i\bullet}^2}{N^2} \doteq 1.95\times10^{-6},\qquad \frac{N}{\sum_j N_{\bullet j}^2} \doteq 1.22\times10^{-4},\qquad\text{and}\qquad \frac{\sum_j N_{\bullet j}^2}{N^2} \doteq 0.00164,$$
so that the average row or column has size $\gg 1$ and $\ll N$.
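These diagnostics are cheap to compute from the (i, j) pairs alone. A small Python sketch follows (our code; we take $\epsilon_R$ and $\epsilon_C$ to be the largest row and column shares of the data, which is our reading of the definitions in Section 3.1):

    import numpy as np

    def pattern_diagnostics(i_idx, j_idx):
        N = len(i_idx)
        Ni = np.bincount(i_idx)              # N_i. per client
        Nj = np.bincount(j_idx)              # N_.j per item
        return {"eps_R": Ni.max() / N,
                "eps_C": Nj.max() / N,
                "N / sum Ni^2": N / (Ni ** 2).sum(),
                "sum Ni^2 / N^2": (Ni ** 2).sum() / N ** 2,
                "N / sum Nj^2": N / (Nj ** 2).sum(),
                "sum Nj^2 / N^2": (Nj ** 2).sum() / N ** 2}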

Figure 3.3: For each value of p (panels (a)-(d): p = 1, 5, 10, 20), log-log plots of the mean squared error of $\hat\beta$ (scaled by p) against N.

We built a linear mixed effects model for the ratings as a function of the other information about the client and item, with client and item random effects, as follows:

Model 3. For client $i$ and item $j$,
$$\begin{aligned}\mathrm{rating}_{ij} = {}& \beta_0 + \beta_1\,\mathrm{match}_{ij} + \beta_2 I\{\text{client edgy}\}_i + \beta_3 I\{\text{item edgy}\}_j\\ &+ \beta_4 I\{\text{client edgy}\}_i \cdot I\{\text{item edgy}\}_j + \beta_5 I\{\text{client boho}\}_i + \beta_6 I\{\text{item boho}\}_j\\ &+ \beta_7 I\{\text{client boho}\}_i \cdot I\{\text{item boho}\}_j + \beta_8\,\mathrm{material}_{ij} + a_i + b_j + e_{ij},\end{aligned}$$
where $I\{\text{client edgy}\}_i \in \{0,1\}$ indicates whether client $i$ likes edgy styles and $I\{\text{item edgy}\}_j$ indicates whether item $j$ is edgy. Similar notation holds for the Bohemian variables. Note that since $\mathrm{material}_{ij}$ is a categorical variable, in practice it is replaced by indicator variables for each type of material. We chose 'Polyester', the most common material, as the baseline.

Suppose that we erroneously believed that the data were independent and identically distributed, ignoring both client and item random effects. We might then run OLS. The resulting reported regression coefficients and standard errors are shown in the first two columns of Table 3.1. Estimated coefficients are starred if they are reported as being significant at the 0.05 level.

Figure 3.4: For each value of p (panels (a)-(d): p = 1, 5, 10, 20), log-log plots of the mean squared error of $\hat\sigma_A^2$ against N.

The third column contains more realistic standard errors of the OLS regression coefficients, accounting for both the client and item random effects. These standard errors were computed using the variance component estimates from our algorithm. As expected, they can be much larger, often by a factor of ten, than the OLS reported standard errors. Ten of the variables ('Acrylic', 'Cashmere', 'Cupro', 'Fur', 'Linen', 'PVC', 'Rayon', 'Silk', 'Viscose', and 'Wool') that appear significantly different from Polyester under OLS are not really significant when one accounts for client and item correlations.

The final two columns contain the regression coefficients estimated by our algorithm and their standard errors as described in Section 3.4.2.1. Again, estimated coefficients are starred if they are significant at the 0.05 level. The estimated variance components are $\hat\sigma_A^2 = 1.133$, $\hat\sigma_B^2 = 0.1463$, and $\hat\sigma_E^2 = 4.474$. Their standard errors are approximately 0.0046, 0.00089, and 0.0050 respectively, so these components are well determined. The error variance component is largest, and the client effect dominates the item effect by almost a factor of eight.

Figure 3.5: For each value of p (panels (a)-(d): p = 1, 5, 10, 20), log-log plots of the mean squared error of $\hat\sigma_B^2$ against N.

The 'Match' variable is significantly positively associated with rating, indicating that the baseline prediction provided by Stitch Fix is a useful predictor in this data set. However, the random effects model reduces its coefficient from about 5 to about 3.5. We believe that this is because some clients tend to give higher ratings on average than others; it can be verified that there are pairs of clients (each with at least 40 ratings) for which the averages of the 'Match' variable are nearly equal but which have significantly different average ratings. OLS does not model this client effect and tries to overfit using 'Match'. The random effects model allows for a client effect, and thus the magnitude of the estimated coefficient for 'Match' diminishes.

Shipping an edgy item to a client who does not like edgy styles is associated with a rating decrease of about a third of a point, but shipping such an item to a client who does like edgy styles is associated with a small increase in rating. An interesting pattern emerges for Bohemian styles. For both clients who like Bohemian styles and clients who do not, shipping an item that is Bohemian is associated with a decrease in rating. This suggests that it is very difficult to choose a Bohemian clothing item that pleases a client, and so Stitch Fix may want to send a Bohemian item only when they are sure that the client will like it based on a strong signal from some other variables.

Figure 3.6: For each value of p (panels (a)-(d): p = 1, 5, 10, 20), log-log plots of the mean squared error of $\hat\sigma_E^2$ against N.

To investigate this possibility, we consider subsets of the data based on the value of 'Match'. For t = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, we consider only the observations for which 'Match' $\ge t$, and fit Model 3 using our algorithm. We compute the estimated $\beta_5$, $\beta_6$ (the effect of sending a Bohemian item to a non-Bohemian client), $\beta_6 + \beta_7$ (the effect of sending a Bohemian item to a Bohemian client), and $\beta_5 + \beta_6 + \beta_7$, and their standard errors. Finally, we plot those four estimates and their 95 percent confidence intervals as a function of t in Figure 3.7.

As we see from Figure 3.7(b), for clients who do not like Bohemian styles, the optimal choice is not to send them a Bohemian item, even if 'Match' is as high as 0.7. Similarly, from Figure 3.7(c), even when 'Match' is as high as 0.6, for clients who like Bohemian styles, the optimal choice is still not to send them a Bohemian item. The Boho flag is not as useful as the Edgy one.

Figure 3.7: For each subset of data, estimates and confidence intervals for Bohemian-related variables of interest (panels: (a) $\beta_5$, (b) $\beta_6$, (c) $\beta_6+\beta_7$, (d) $\beta_5+\beta_6+\beta_7$).

Of the materials, 'Cotton', 'Faux Fur', 'Leather', 'Modal', 'Pleather', 'PU', 'PVC', 'Silk', 'Spandex', and 'Tencel' are significantly different from the baseline, 'Polyester', in our crossed random effects model. 'PU' and 'PVC' are associated with an increase in rating of at least half a point. Those materials are often used to make shoes and specialty clothing, which may explain their association with high ratings.

The computations in this section were done in Python.


Table 3.1: Stitch Fix Regression Results

Variable | $\hat\beta_{OLS}$ | $se_{OLS}(\hat\beta_{OLS})$ | $se(\hat\beta_{OLS})$ | $\hat\beta$ | $se(\hat\beta)$
Intercept | 4.635* | 0.005397 | 0.05808 | 5.110* | 0.01250
Match | 5.048* | 0.01174 | 0.1464 | 3.529* | 0.02153
I{client edgy} | 0.001020 | 0.002443 | 0.004593 | 0.001860 | 0.003831
I{item edgy} | −0.3358* | 0.004253 | 0.03730 | −0.3328* | 0.01542
I{client edgy}*I{item edgy} | 0.3925* | 0.006229 | 0.01352 | 0.3864* | 0.006432
I{client boho} | 0.1386* | 0.002264 | 0.004354 | 0.1334* | 0.003622
I{item boho} | −0.5499* | 0.005981 | 0.03049 | −0.6261* | 0.01661
I{client boho}*I{item boho} | 0.3822* | 0.007566 | 0.01057 | 0.3837* | 0.007697
Acrylic | −0.06482* | 0.003778 | 0.03804 | −0.01627 | 0.02149
Angora | −0.01262 | 0.007848 | 0.09631 | 0.07271 | 0.05837
Bamboo | −0.04593 | 0.06215 | 0.2437 | 0.05420 | 0.1716
Cashmere | −0.1955* | 0.02484 | 0.1593 | 0.01354 | 0.1176
Cotton | 0.1752* | 0.003172 | 0.04766 | 0.09743* | 0.01811
Cupro | 0.5979* | 0.3016 | 0.4857 | 0.5603 | 0.4852
Faux Fur | 0.2759* | 0.02008 | 0.08631 | 0.3649* | 0.07524
Fur | −0.2021* | 0.03121 | 0.1560 | −0.03478 | 0.1331
Leather | 0.2677* | 0.02482 | 0.08671 | 0.2798* | 0.07335
Linen | −0.3844* | 0.05632 | 0.2729 | 0.006269 | 0.1660
Modal | 0.002587 | 0.009775 | 0.2052 | 0.1417* | 0.06498
Nylon | 0.03349* | 0.01552 | 0.1000 | 0.1186 | 0.06436
Patent Leather | −0.2359 | 0.1800 | 0.4235 | −0.2473 | 0.4222
Pleather | 0.4163* | 0.008916 | 0.09905 | 0.3344* | 0.05023
PU | 0.4160* | 0.008225 | 0.09019 | 0.4951* | 0.04196
PVC | 0.6574* | 0.06545 | 0.3898 | 0.8713* | 0.3883
Rayon | −0.01109* | 0.002951 | 0.04602 | 0.01029 | 0.01493
Silk | −0.1422* | 0.01317 | 0.1004 | −0.1656* | 0.05471
Spandex | 0.3916* | 0.01729 | 0.1549 | 0.3631* | 0.1284
Tencel | 0.4966* | 0.009313 | 0.1935 | 0.1548* | 0.06718
Viscose | 0.04066* | 0.006953 | 0.09620 | −0.01389 | 0.03527
Wool | −0.06021* | 0.006611 | 0.08141 | −0.006051 | 0.03737


Chapter 4

Conclusions

We conclude by discussing some practical lessons we have gathered, summarizing, and outlining

some potential directions for future work.

4.1 Practical Considerations

In this section, we briefly describe some issues to keep in mind when working with large data sets

with crossed random effects structure.

Data Preprocessing In the real-world examples we considered, the responses $Y_{ij}$ were not actually continuous, but ordinal. For e-commerce, in many cases the responses are ordinal or counts. We did not transform the responses in any way, but there are several options for doing so. First, we can add random perturbations to the responses and account for this additional variance in our estimate of $\sigma_E^2$. Second, we can standardize the responses and adjust the estimated regression coefficients and variance components accordingly. Third, we can apply some variance stabilizing transform, such as taking logarithms. However, this approach has the disadvantage of not leading to interpretable results when the transformation is reversed. Alternatively, we could use ordinal or Poisson regression, as described in Section 4.3.3.

Our algorithms have assumed that the observations are stored in log format, i.e., as tuples $(i, j, Y_{ij}, x_{ij})$. We believe that this is the most efficient way to store the data, since we expect that the majority of the (i, j) pairs are not observed. For the Stitch Fix data set used in Section 3.5.2, approximately 0.1 percent of the client-item pairs were observed. If the data are not originally in this sparse format, it is simple to convert them, as in the sketch below.
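For instance, a dense R x C ratings array with missing entries marked NaN converts to the log format in a few lines of Python (a sketch under that storage assumption):

    import numpy as np

    def to_log_format(Y):          # Y: R x C array, NaN where unobserved
        i, j = np.nonzero(~np.isnan(Y))
        return list(zip(i.tolist(), j.tolist(), Y[i, j].tolist()))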

Computational Requirements Throughout this thesis, we have assumed that the N observations and O(R + C) intermediate quantities can fit on one machine. But this may not always be possible.

In that case, we believe that the best setup is to spread the observations across different machines and store the intermediate quantities on one machine, which we call Machine 0. Then, as we pass over the entire data set, the intermediate quantities that are sums over the observations, e.g. $N_{i\bullet}$ and $X_{i\bullet}$, are computed and then communicated to Machine 0. At the end, Machine 0 collects all the intermediate quantities (possibly summing over all the machines, since observations corresponding to the same level of a factor may be on separate machines) and then finishes computing the estimates.

To obtain standard errors, we need to do a second pass over the data as stated previously. In that pass, we would communicate some intermediate quantities back to the individual machines, and the individual machines would then communicate O(1) new intermediate quantities back to Machine 0. Finally, Machine 0, as before, collects all the intermediate quantities and computes the standard errors. In all, this procedure would incur O(R + C) communication cost as well. A minimal sketch of this pattern follows.
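Here is that sketch in Python (our illustration; the partial sums shown are representative, and a real implementation would carry all of the intermediate quantities used by our algorithms):

    from collections import Counter

    def map_shard(shard):              # runs independently on each machine
        Ni, Nj, Sy = Counter(), Counter(), Counter()
        for i, j, y in shard:
            Ni[i] += 1                 # partial N_i.
            Nj[j] += 1                 # partial N_.j
            Sy[i] += y                 # partial row sums of Y
        return Ni, Nj, Sy

    def reduce_machine0(partials):     # Machine 0 adds the partial sums, since
        Ni, Nj, Sy = Counter(), Counter(), Counter()  # one factor level may be
        for ni, nj, sy in partials:                   # split across shards
            Ni.update(ni)
            Nj.update(nj)
            Sy.update(sy)
        return Ni, Nj, Sy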

4.2 Summary

In this thesis, we studied large linear mixed models with two crossed random effects with unknown

densities, which appear in many modern e-commerce applications. The inspiration for our project

came from Stitch Fix, an e-commerce company that provides personal styling services for women

and men. Their goal was to determine, in general, what characteristics of clients and clothes lead to

greater client satisfaction. Mixed models are ideal for such tasks because clients and clothes should

be treated as random effects in order to obtain conclusions about the population.

We started with a simplification of the linear mixed model with two crossed random effects,

the crossed random effects model; the regression coefficients are replaced by a global mean. This

model still illustrates the difficulties of estimation and inference for large linear mixed models with

crossed random effects, since they are primarily caused by the correlations among observations. After

showing that Bayesian analysis does not scale to large crossed random effects models, we propose a

new method of moments estimator of the variance components based on U-statistics. The estimator

requires linear computation time and sublinear memory, which makes it scalable to the data sets of

our interest. It admits easier analysis than previous estimators in the literature, and thus we are

able to compute its variance exactly and prove a central limit theorem. The variance can also be

approximated in linear computation time and sublinear memory.

Going back to the linear mixed model with two crossed random effects, we propose a scalable

algorithm that alternates between estimating the regression coefficients and the variance components.

To estimate the variance components, we use the estimators proposed for the crossed random effects

model, which requires linear computation time and sublinear memory. To estimate the regression

coefficients, we use the generalized least squares estimator ignoring one of the random effects, which


can be computed using linear time and sublinear space. The final estimates are shown to be consistent

and asymptotically normal.

For both the crossed random effects model and the linear mixed model with crossed random effects, we carried out simulations and experiments on real-world data. The simulations compared the computation times and mean squared errors of our proposed estimators and the maximum likelihood estimators, when data are generated using Gaussian random effects. We found that the mean squared errors of our estimates of $\sigma_A^2$ and $\sigma_B^2$ were comparable to those of the maximum likelihood estimates, while the mean squared errors of our estimates of $\beta$ and $\sigma_E^2$ were at most a few factors larger than those of the maximum likelihood estimates. However, the computation time of our estimates is much smaller, by as much as a factor of 10.

4.3 Future Work

There are numerous avenues for future work. Here we will discuss several that are based on gener-

alizations of model (1) and on which there has been some previous work in the literature.

4.3.1 Heteroscedasticity

In many real-world situations, the variances of the random effects at different levels of the same factor

may be different. For example, suppose that we have a data set of how many times a user bought

particular items during some time period. For common household items, each brand is expected to

have a fairly equitable and sizable market share, since many have been around for a long time; thus,

we expect that the random effects for those items have a small variance. For consumer discretionary

and luxury items, some brands are merely much more popular than others, e.g. some movies only

have limited release, while others are released in multiple countries; then, we expect that the random

effects for these items have a large variance.

There are several possible ways to implement this generalization in our model (1). We could

assign each level its own variance, such as in Owen (2007) [36] and Owen & Eckles (2012) [37].

However, this model may be too expressive to be practical. Alternatively, we can assume that

for each random effect, the variances are generated from a mixture model with a set number of

parametric distributions.

4.3.2 Factor Interactions

In many real-world data sets, we would also expect there to be interactions between the levels of

different factors. In the Stitch Fix data set, due to differences in personal preferences, certain clients

may tend to rate certain items higher. For instance, younger clients might like edgier styles while

older clients might prefer more expensive materials.


Therefore, in model (1), we could add an interaction term $(ab)_{ij}$ to account for such interactions. However, some restrictions must be placed on its distribution in order to make it identifiable from the noise term $e_{ij}$. A popular technique from computer science is to assume that its covariance matrix has some low rank structure.

4.3.3 Generalized Linear Models

In this thesis, we have assumed that the responses Yij are continuous. However, in practice in many

cases the responses are discrete. In the data set from Last.fm [29], the responses are actually counts,

while Stitch Fix not only collects data on the ratings of items by clients but also whether the clients

keep the items.

Therefore, a natural next step is to extend this work to link functions other than the identity,

e.g. logistic and Poisson regression. The extension of maximum likelihood procedures to generalized

linear mixed models has been well-studied and implemented in software including lme4 and Mixed-

Models. As shown in Bates (2014) [5], such procedures, in addition to requiring strong parametric

assumptions, requires computation time O(N3/2) to compute the likelihood at a given set of values

for the parameters.

Because we have utilized the method of moments to effect for linear mixed models, we can also

consider using it for generalized linear mixed models. Jiang (1998) [24] has done so, using sufficient

statistics as the estimating equations. Due to the nonlinearity of generalized linear mixed models,

the Newton-Raphson procedure is used to solve the system of equations.


Appendix A

Crossed Random Effects Model

A.1 Bayesian Analysis via MCMC

A.1.1 Proof of Theorem 2.3.1

We may assume that $i \in \{1, 2, \dots, R\}$ and $j \in \{1, 2, \dots, C\}$. The posterior distribution of the parameters is given by
$$p(\mu, a, b, \sigma_A^2, \sigma_B^2, \sigma_E^2 \mid Y) \propto \prod_{i=1}^R \frac{1}{\sqrt{2\pi\sigma_A^2}}\exp\Big(-\frac{a_i^2}{2\sigma_A^2}\Big)\prod_{j=1}^C \frac{1}{\sqrt{2\pi\sigma_B^2}}\exp\Big(-\frac{b_j^2}{2\sigma_B^2}\Big) \times \prod_{i=1}^R\prod_{j=1}^C \frac{1}{\sqrt{2\pi\sigma_E^2}}\exp\Big(-\frac{(Y_{ij}-\mu-a_i-b_j)^2}{2\sigma_E^2}\Big)$$
$$\propto \sigma_A^{-R}\sigma_B^{-C}\sigma_E^{-RC}\exp\Big(-\frac{\sum_i a_i^2}{2\sigma_A^2} - \frac{\sum_j b_j^2}{2\sigma_B^2} - \frac{\sum_{ij}(Y_{ij}-\mu-a_i-b_j)^2}{2\sigma_E^2}\Big).$$

Then, $\phi \equiv p(a, b \mid \mu, \sigma_A^2, \sigma_B^2, \sigma_E^2, Y)$ is proportional to
$$\exp\Big(-\sum_i \frac{a_i^2}{2}\Big(\frac{1}{\sigma_A^2} + \frac{C}{\sigma_E^2}\Big) - \sum_j \frac{b_j^2}{2}\Big(\frac{1}{\sigma_B^2} + \frac{R}{\sigma_E^2}\Big) - \sum_{ij}\frac{a_ib_j}{\sigma_E^2}\Big).$$

Therefore, the posterior distribution of $a$ and $b$ is a joint normal with precision matrix
$$Q = \begin{pmatrix}\dfrac{\sigma_E^2 + C\sigma_A^2}{\sigma_A^2\sigma_E^2}I_R & \dfrac{1}{\sigma_E^2}\mathbf{1}_R\mathbf{1}_C^T\\[2mm] \dfrac{1}{\sigma_E^2}\mathbf{1}_C\mathbf{1}_R^T & \dfrac{\sigma_E^2 + R\sigma_B^2}{\sigma_B^2\sigma_E^2}I_C\end{pmatrix}.$$

From Theorem 1 of [42], for the Gibbs sampler described in Section 2.3.1.1, we have the following result. Let $A = I - \mathrm{diag}(Q_{11}^{-1}, Q_{22}^{-1})Q$, where $Q_{11}$ denotes the upper left block of $Q$ and $Q_{22}$ denotes the lower right block. Let $L$ be the block lower triangular part of $A$, and $U = A - L$. Then, the convergence rate $\rho$ is given by the spectral radius of the matrix $B = (I-L)^{-1}U$. Now, we compute $\rho$. First

$$A = I - \begin{pmatrix}\dfrac{\sigma_A^2\sigma_E^2}{\sigma_E^2 + C\sigma_A^2}I_R & 0\\ 0 & \dfrac{\sigma_B^2\sigma_E^2}{\sigma_E^2 + R\sigma_B^2}I_C\end{pmatrix}Q = \begin{pmatrix}0 & -\dfrac{\sigma_A^2}{\sigma_E^2 + C\sigma_A^2}\mathbf{1}_R\mathbf{1}_C^T\\ -\dfrac{\sigma_B^2}{\sigma_E^2 + R\sigma_B^2}\mathbf{1}_C\mathbf{1}_R^T & 0\end{pmatrix}.$$

Next
$$L = \begin{pmatrix}0 & 0\\ -\dfrac{\sigma_B^2}{\sigma_E^2 + R\sigma_B^2}\mathbf{1}_C\mathbf{1}_R^T & 0\end{pmatrix}\qquad\text{and}\qquad U = \begin{pmatrix}0 & -\dfrac{\sigma_A^2}{\sigma_E^2 + C\sigma_A^2}\mathbf{1}_R\mathbf{1}_C^T\\ 0 & 0\end{pmatrix},$$
from which
$$B = \begin{pmatrix}I_R & 0\\ \dfrac{\sigma_B^2}{\sigma_E^2 + R\sigma_B^2}\mathbf{1}_C\mathbf{1}_R^T & I_C\end{pmatrix}^{-1}U = \begin{pmatrix}I_R & 0\\ -\dfrac{\sigma_B^2}{\sigma_E^2 + R\sigma_B^2}\mathbf{1}_C\mathbf{1}_R^T & I_C\end{pmatrix}U = \begin{pmatrix}0 & -\dfrac{\sigma_A^2}{\sigma_E^2 + C\sigma_A^2}\mathbf{1}_R\mathbf{1}_C^T\\ 0 & \dfrac{R\sigma_A^2\sigma_B^2}{(\sigma_E^2 + C\sigma_A^2)(\sigma_E^2 + R\sigma_B^2)}\mathbf{1}_C\mathbf{1}_C^T\end{pmatrix}.$$

Clearly, $B$ has rank one. Then, its spectral radius must be equal to its nonzero eigenvalue, which is also the trace of $B$. Hence,
$$\rho = \frac{RC\sigma_A^2\sigma_B^2}{(\sigma_E^2 + C\sigma_A^2)(\sigma_E^2 + R\sigma_B^2)}.$$
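The closed form is easy to check numerically. The Python sketch below (ours) builds $Q$ explicitly, forms $B$ as above, and compares its spectral radius to the formula:

    import numpy as np

    def gibbs_rate(R, C, s2A, s2B, s2E):
        Q11 = ((s2E + C * s2A) / (s2A * s2E)) * np.eye(R)
        Q22 = ((s2E + R * s2B) / (s2B * s2E)) * np.eye(C)
        off = np.ones((R, C)) / s2E
        Q = np.block([[Q11, off], [off.T, Q22]])
        D = np.block([[np.linalg.inv(Q11), np.zeros((R, C))],
                      [np.zeros((C, R)), np.linalg.inv(Q22)]])
        A = np.eye(R + C) - D @ Q            # diagonal blocks of A are zero
        L, U = np.tril(A, -1), np.triu(A, 1)
        B = np.linalg.solve(np.eye(R + C) - L, U)
        rho = np.abs(np.linalg.eigvals(B)).max()
        closed = R * C * s2A * s2B / ((s2E + C * s2A) * (s2E + R * s2B))
        return rho, closed                   # agree up to rounding error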

A.1.2 Simulation results

The results of our simulations described in Section 2.3.2 are presented here in Tables A.1 through A.5.

A.2 Expectation and Variance of Moment Estimates

In this section we provide the details and proofs for Sections 2.4.1 and 2.4.2. Before diving in, we note an additional identity:
$$\sum_{ir} N_{i\bullet}^{-1}(ZZ^T)_{ir} = \sum_{ijr} N_{i\bullet}^{-1}Z_{ij}Z_{rj} = \sum_{ij} Z_{ij}N_{i\bullet}^{-1}N_{\bullet j}. \qquad (A.1)$$
We also need some notation for equality among index sets. The notation $1_{ij=rs}$ means $1_{i=r}1_{j=s}$. It is different from $1_{\{i,j\}=\{r,s\}}$, which we use as well. Additionally, $1_{ij\neq rs}$ means $1 - 1_{ij=rs}$.

Table A.1: Median CPU time in seconds.

Method          Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM   RWM Sub.  pCN
R=10, C=10      20     9      23        20     27     21     19    21        21
R=20, C=20      33     10     37        35     45     34     32    33        33
R=50, C=50      71     17     80        79     101    71     68    75        70
R=100, C=100    143    361    159       156    199    139    133   141       136
R=200, C=200    326    984    351       323    462    300    279   303       280
R=500, C=500    1157   2356   1205      955    1786   952    851   1019      817
R=1000, C=1000  3432   15046  4099      2302   4760   2513   2141  2635      1966
R=2000, C=2000  10348  88756  11434     6991   15836  7815   5712  9274      6006
R=50, C=100     105    287    121       112    151    103    101   107       102
R=10, C=200     138    316    167       139    200    138    137   142       138
R=100, C=1000   898    5148   964       807    1179   795    748   822       760

Table A.2: Median estimates of µ and number of lags after which ACF(µ) ≤ 0.5. For each configuration, the R= line gives the median estimates and the C= line the lag counts.

Method  Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM   RWM Sub.  pCN
R=10    0.72   0.94   1.27      1.07   1.18   2.40   0.76  0.74      1.51
C=10    26     29     24        178    689    1604   1252  1522      1392
R=20    0.81   1.02   1.01      1.07   0.94   2.89   1.69  1.08      1.47
C=20    34     43     26        75     841    1019   1674  1720      1765
R=50    1.09   0.91   0.98      0.98   1.04   2.97   1.66  1.70      1.58
C=50    83     84     75        8      610    5000+  1158  1681      1104
R=100   0.98   1.02   1.13      0.99   0.85   2.73   1.57  1.61      1.49
C=100   123    185    144       2      398    5000+  1145  1713      1522
R=200   1.01   1.02   1.03      1.01   0.95   3.22   1.60  1.31      1.52
C=200   257    346    272       1      1      1278   1508  1692      807
R=500   0.99   1.01   0.99      0.99   1.00   2.26   1.58  1.15      1.55
C=500   536    617    576       9      4      1572   924   1687      1613
R=1000  0.97   1.02   1.04      0.99   0.96   2.39   1.55  1.07      1.53
C=1000  801    790    694       1      2501   5000+  1133  1656      1008
R=2000  0.98   1.01   1.00      1.01   1.00   2.57   1.55  1.03      1.55
C=2000  672    721    771       1      5000+  1086   1176  1716      799
R=50    0.89   1.03   0.95      1.01   1.06   2.70   1.57  1.61      1.45
C=100   144    155    118       7      1095   5000+  1219  1725      1371
R=10    0.86   1.08   0.84      0.94   0.80   2.40   1.41  1.36      1.23
C=200   329    244    299       120    944    3339   1518  1657      1437
R=100   1.06   1.06   1.02      1.01   1.03   2.73   1.57  1.11      1.55
C=1000  573    536    672       1      1      3330   1161  1681      3333

Table A.3: Median estimates of $\sigma_A^2$ and number of lags after which ACF($\hat\sigma_A^2$) ≤ 0.5. For each configuration, the R= line gives the median estimates and the C= line the lag counts.

Method  Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM   RWM Sub.  pCN
R=10    2.76   2.49   2.05      2.07   2.45   2.39   1.88  2.05      1.38
C=10    1      1      1         898    768    1604   759   606       1232
R=20    2.00   2.06   1.65      1.89   2.32   1.48   1.96  1.76      2.00
C=20    1      1      1         930    829    850    873   822       1083
R=50    1.94   1.96   2.17      1.77   2.21   1.44   2.06  2.03      1.95
C=50    1      1      1         797    720    5000+  1035  1032      1079
R=100   2.21   2.14   2.23      1.88   1.87   1.11   2.19  1.92      1.95
C=100   1      1      1         649    398    5000+  994   917       1522
R=200   2.09   2.09   2.10      2.08   1.99   1.16   2.02  2.12      2.01
C=200   1      1      1         410    437    1281   1598  673       1135
R=500   1.97   2.12   1.99      1.64   1.96   1.07   2.02  2.01      1.97
C=500   1      1      1         407    197    1572   895   826       1599
R=1000  1.96   1.99   2.02      1.90   1.95   1.78   2.01  1.96      1.99
C=1000  1      1      1         122    2656   5000+  1133  989       912
R=2000  1.97   2.00   2.03      1.94   1.99   1.04   2.01  2.00      1.99
C=2000  1      1      1         69     5000+  1086   1181  1262      1161
R=50    2.22   2.29   2.05      2.24   1.98   1.10   2.00  1.96      2.09
C=100   1      1      1         948    672    5000+  1103  787       1005
R=10    2.34   1.74   3.05      2.70   2.72   0.88   1.89  1.43      1.16
C=200   1      1      1         891    1023   3309   1492  724       988
R=100   2.04   2.03   2.14      1.98   1.98   1.46   1.90  1.87      2.05
C=1000  1      1      1         512    450    3329   985   1086      3333

Table A.4: Median estimates of $\sigma_B^2$ and number of lags after which ACF($\hat\sigma_B^2$) ≤ 0.5. For each configuration, the R= line gives the median estimates and the C= line the lag counts.

Method  Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM   RWM Sub.  pCN
R=10    0.66   0.81   0.88      0.46   0.89   1.47   0.45  0.43      0.45
C=10    1      1      1         382    638    1604   1214  956       1297
R=20    0.54   0.45   0.44      0.43   0.44   1.55   0.49  0.46      0.57
C=20    1      1      1         261    410    978    937   1217      704
R=50    0.49   0.49   0.49      0.49   0.53   1.35   0.49  0.43      0.48
C=50    1      1      1         123    138    5000+  1308  786       1463
R=100   0.51   0.54   0.49      0.46   0.48   0.84   0.52  0.47      0.49
C=100   1      1      1         65     66     5000+  691   1169      1522
R=200   0.49   0.51   0.51      0.47   0.50   1.67   0.51  0.49      0.50
C=200   1      1      1         36     37     1266   1497  1241      831
R=500   0.51   0.49   0.50      0.28   0.47   1.56   0.50  0.48      0.47
C=500   1      1      1         770    16     1572   696   993       1619
R=1000  0.51   0.50   0.50      0.40   0.50   2.94   0.51  0.50      0.49
C=1000  1      1      1         477    2514   5000+  1133  855       556
R=2000  0.50   0.50   0.49      0.39   0.50   1.65   0.48  0.49      0.50
C=2000  1      1      1         224    5000+  1086   1220  830       1253
R=50    0.50   0.51   0.53      0.48   0.54   1.93   0.53  0.49      0.49
C=100   1      1      1         69     85     5000+  1378  910       1419
R=10    0.47   0.51   0.51      0.40   0.52   1.65   0.61  0.59      0.55
C=200   1      1      1         23     52     3332   1289  1004      1408
R=100   0.50   0.49   0.50      0.47   0.49   2.95   0.50  0.49      0.50
C=1000  1      1      1         6      8      3328   1345  962       3333

Table A.5: Median estimates of $\sigma_E^2$ and number of lags after which ACF($\hat\sigma_E^2$) ≤ 0.5. For each configuration, the R= line gives the median estimates and the C= line the lag counts.

Method  Gibbs  Block  Reparam.  Lang.   MALA   Indp.  RWM   RWM Sub.  pCN
R=10    1.02   0.99   0.96      0.91    1.17   0.17   0.76  0.80      0.75
C=10    1      1      1         196     334    1604   1354  1329      1504
R=20    0.97   0.98   1.00      0.91    1.00   0.17   0.48  0.45      0.37
C=20    1      1      1         61      75     1218   1649  1614      1827
R=50    1.00   1.01   0.98      0.96    0.99   0.17   0     0.01      0
C=50    1      1      1         10      12     5000+  1107  1616      1466
R=100   1.00   1.00   1.00      0.98    1.00   0.16   0     0.38      0
C=100   1      1      1         3       3      5000+  1199  1714      1532
R=200   1.00   1.00   1.00      1.01    1.01   0.21   0     0.66      0
C=200   1      1      1         1       1      1266   1626  1691      636
R=500   1.00   1.00   1.00      118.45  52.70  0.14   0     0.87      0
C=500   1      1      1         545     138    1572   834   1702      1616
R=1000  1.00   1.00   1.00      65.22   2.66   0.15   0     0.93      0
C=1000  1      1      1         385     3062   5000+  1518  1724      621
R=2000  1.00   1.00   1.00      115.59  1.05   0.18   0     0.97      0
C=2000  1      1      1         10      5000+  1021   1194  1702      1014
R=50    1.01   0.99   1.00      0.98    1.01   0.15   0     0.19      0
C=100   1      1      1         5       6      5000+  1676  1774      1442
R=10    0.99   0.99   1.01      0.92    0.99   0.17   0     0.55      0
C=200   1      1      1         12      15     3309   1570  1678      1279
R=100   1.00   1.00   1.00      3.50    3.46   0.19   0     0.87      0
C=1000  1      1      1         3       3      3330   1454  1699      3333


A.2.1 Weighted U-statistics

We will work with weighted U-statistics
$$U_a = \frac12\sum_{ijj'} u_iZ_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2,\qquad U_b = \frac12\sum_{iji'} v_jZ_{ij}Z_{i'j}(Y_{ij}-Y_{i'j})^2,\qquad\text{and}\qquad U_e = \frac12\sum_{iji'j'} w_{ij}Z_{ij}Z_{i'j'}(Y_{ij}-Y_{i'j'})^2,$$
for weights $u_i$, $v_j$ and $w_{ij}$ chosen below.

We can write $U_a = \sum_i u_iN_{i\bullet}(N_{i\bullet}-1)s_{i\bullet}^2$ where $s_{i\bullet}^2$ is an unbiased estimate of $\sigma_B^2 + \sigma_E^2$ from within any row $i$ with $N_{i\bullet} \ge 2$. Under our model the values in row $i$ are IID with mean $\mu + a_i$ and variance $\sigma_B^2 + \sigma_E^2$, and so
$$\mathrm{Var}(s_{i\bullet}^2) = (\sigma_B^2 + \sigma_E^2)^2\Big(\frac{2}{N_{i\bullet}-1} + \frac{\kappa(b_j + e_{ij})}{N_{i\bullet}}\Big)$$
where $\kappa(b_j + e_{ij}) = (\kappa_B\sigma_B^4 + \kappa_E\sigma_E^4)/(\sigma_B^2 + \sigma_E^2)^2$ is the kurtosis of $Y_{ij}$ for the given $i$ and any $j$. Thus
$$\mathrm{Var}(s_{i\bullet}^2) = \frac{2(\sigma_B^2 + \sigma_E^2)^2}{N_{i\bullet}-1} + \frac{\kappa_B\sigma_B^4}{N_{i\bullet}} + \frac{\kappa_E\sigma_E^4}{N_{i\bullet}}. \qquad (A.2)$$

Inverse variance weighting then suggests that we weight $s_{i\bullet}^2$ proportionally to a value between $N_{i\bullet}$ and $N_{i\bullet}-1$. Weighting proportionally to $N_{i\bullet}-1$ has the advantage of zeroing out rows with $N_{i\bullet} = 1$. This consideration motivates us to take $u_i = 1/N_{i\bullet}$, and similarly $v_j = 1/N_{\bullet j}$.

If $U_e$ is dominated by contributions from $e_{ij}$, then the observations enter symmetrically and there is no reason not to take $w_{ij} = 1$. Even if the $e_{ij}$ do not dominate, the statistic $U_e$ compares more data pairs than the others. It is unlikely to be the information limiting statistic. So $w_{ij} = 1$ is a reasonable default.

If the data are IID then only $U_e$ above is nonzero. This is appropriate, as only the sum $\sigma_A^2 + \sigma_B^2 + \sigma_E^2$ can be identified in that case. For data that are nested but not IID, only two of the U-statistics above are nonzero, and in that case only one of $\sigma_A^2$ and $\sigma_B^2$ can be identified separately from $\sigma_E^2$.


The U-statistics we use are then
$$U_a = \frac12\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2,\qquad U_b = \frac12\sum_{iji'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(Y_{ij}-Y_{i'j})^2,\qquad\text{and}\qquad U_e = \frac12\sum_{iji'j'} Z_{ij}Z_{i'j'}(Y_{ij}-Y_{i'j'})^2. \qquad (A.3)$$
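Although these statistics range over $O(N^2)$ or more pairs, each collapses to sums of within-row, within-column, or overall first and second moments; for example $\frac12\sum_{jj'}Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2 = N_{i\bullet}\sum_j Z_{ij}Y_{ij}^2 - \big(\sum_j Z_{ij}Y_{ij}\big)^2$. A one-pass Python sketch of the computation (ours, for illustration):

    from collections import Counter

    def u_statistics(stream):      # stream of (i, j, y) tuples
        Ni, Si, Qi = Counter(), Counter(), Counter()  # N_i., row sums of Y, Y^2
        Nj, Sj, Qj = Counter(), Counter(), Counter()  # the same by column
        S = Q = 0.0
        N = 0
        for i, j, y in stream:
            Ni[i] += 1; Si[i] += y; Qi[i] += y * y
            Nj[j] += 1; Sj[j] += y; Qj[j] += y * y
            S += y; Q += y * y; N += 1
        Ua = sum(Qi[i] - Si[i] ** 2 / Ni[i] for i in Ni)  # weights u_i = 1/N_i.
        Ub = sum(Qj[j] - Sj[j] ** 2 / Nj[j] for j in Nj)  # weights v_j = 1/N_.j
        Ue = N * Q - S ** 2                               # weights w_ij = 1
        return Ua, Ub, Ue, N, Ni, Nj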

A.2.2 Expected Values

Here we find the expected values for our three U-statistics.

Lemma A.2.1. Under the random effects model (2), the U-statistics in (A.3) satisfy
$$\begin{pmatrix}E(U_a)\\ E(U_b)\\ E(U_e)\end{pmatrix} = \begin{pmatrix}0 & N-R & N-R\\ N-C & 0 & N-C\\ N^2-\sum_i N_{i\bullet}^2 & N^2-\sum_j N_{\bullet j}^2 & N^2-N\end{pmatrix}\begin{pmatrix}\sigma_A^2\\ \sigma_B^2\\ \sigma_E^2\end{pmatrix}. \qquad (A.4)$$

Proof. First we note that
$$E((a_i-a_{i'})^2) = 2\sigma_A^2(1-1_{i=i'}),\qquad E((b_j-b_{j'})^2) = 2\sigma_B^2(1-1_{j=j'}),\qquad\text{and}\qquad E((e_{ij}-e_{i'j'})^2) = 2\sigma_E^2(1-1_{i=i'}1_{j=j'}).$$
Now $Y_{ij}-Y_{ij'} = b_j-b_{j'}+e_{ij}-e_{ij'}$, and so
$$E(U_a) = \frac12\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}\big(2\sigma_B^2(1-1_{j=j'}) + 2\sigma_E^2(1-1_{j=j'})\big)
= (\sigma_B^2+\sigma_E^2)\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(1-1_{j=j'})
= (\sigma_B^2+\sigma_E^2)\sum_i(N_{i\bullet}-1)
= (\sigma_B^2+\sigma_E^2)(N-R).$$
The same argument gives $E(U_b) = (\sigma_A^2+\sigma_E^2)(N-C)$.


The matrix in (A.4) is
$$M \equiv \begin{pmatrix}0 & N-R & N-R\\ N-C & 0 & N-C\\ N^2-\sum_i N_{i\bullet}^2 & N^2-\sum_j N_{\bullet j}^2 & N^2-N\end{pmatrix}. \qquad (A.5)$$
Our moment based estimates are
$$\begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix} = M^{-1}\begin{pmatrix}U_a\\ U_b\\ U_e\end{pmatrix}. \qquad (A.6)$$
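Continuing the one-pass sketch above, the estimates follow by building M from (A.5) and solving the 3 x 3 system (A.6):

    import numpy as np

    def moment_estimates(Ua, Ub, Ue, N, Ni, Nj):
        R, C = len(Ni), len(Nj)
        SA = sum(v * v for v in Ni.values())   # sum_i N_i.^2
        SB = sum(v * v for v in Nj.values())   # sum_j N_.j^2
        M = np.array([[0.0, N - R, N - R],
                      [N - C, 0.0, N - C],
                      [N ** 2 - SA, N ** 2 - SB, N ** 2 - N]])
        return np.linalg.solve(M, np.array([Ua, Ub, Ue]))  # sigma2_A, sigma2_B, sigma2_E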

They are only well defined when $M$ is nonsingular. The determinant of $M$ is
$$(N-R)\Big[(N-C)\Big(N^2-\sum_j N_{\bullet j}^2\Big)\Big] - (N-R)\Big[(N-C)(N^2-N) - (N-C)\Big(N^2-\sum_i N_{i\bullet}^2\Big)\Big]
= (N-R)(N-C)\Big[N^2 - \sum_i N_{i\bullet}^2 - \sum_j N_{\bullet j}^2 + N\Big].$$
The first factor is positive so long as $\max_i N_{i\bullet} > 1$, and the second factor requires $\max_j N_{\bullet j} > 1$. We already knew that we needed these conditions in order to have all three U-statistics depend on the $Y_{ij}$. It is still of interest to know when the third factor is positive. It is sufficient that no row or column has over half of the data.

A.2.3 Variances

From equation (A.6) we get
$$\mathrm{Var}\begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix} = M^{-1}\,\mathrm{Var}\begin{pmatrix}U_a\\ U_b\\ U_e\end{pmatrix}(M^{-1})^T,$$
where $M$ is given at (A.5). So we need the variances and covariances of the three U-statistics.

To find variances, we will work out $E(U^2)$ for our U-statistics. Those involve
$$E\big((Y_{ij}-Y_{i'j'})^2(Y_{rs}-Y_{r's'})^2\big) = E\big((a_i-a_{i'}+b_j-b_{j'}+e_{ij}-e_{i'j'})^2(a_r-a_{r'}+b_s-b_{s'}+e_{rs}-e_{r's'})^2\big),$$
which equals
$$E\Big[\big((a_i-a_{i'})^2+(b_j-b_{j'})^2+(e_{ij}-e_{i'j'})^2+2(a_i-a_{i'})(b_j-b_{j'})+2(a_i-a_{i'})(e_{ij}-e_{i'j'})+2(b_j-b_{j'})(e_{ij}-e_{i'j'})\big) \times \big((a_r-a_{r'})^2+(b_s-b_{s'})^2+(e_{rs}-e_{r's'})^2+2(a_r-a_{r'})(b_s-b_{s'})+2(a_r-a_{r'})(e_{rs}-e_{r's'})+2(b_s-b_{s'})(e_{rs}-e_{r's'})\big)\Big].$$

This expression involves 8 indices and it has 36 terms. Some of those terms simplify due to

independence and some vanish due to zero means. To shorten some expressions we use

BA,ii′,rr′ ≡ E((ai − ai′)(ar − ar′)),

DA,ii′ ≡ E((ai − ai′)2), and

QA,ii′,rr′ ≡ E((ai − ai′)2(ar − ar′)2)

with mnemonics bilinear, diagonal and quartic. There are similarly defined terms for component B.

For the error term we have

BE,iji′j′,rsr′s′ ≡ E((eij − ei′j′)(ers − er′s′)),

DE,ij,i′j′ ≡ E((eij − ei′j′)2), and

QE,iji′j′,rsr′s′ ≡ E((eij − ei′j′)2(ers − er′s′)2).

The generic contribution E((Yij−Yi′j′)2(Yrs−Yr′s′)2) to the mean square of a U -statistic equals

QA,ii′,rr′ + QB,jj′,ss′ + QE,iji′j′,rsr′s′ + DA,ii′DB,ss′ + DA,ii′DE,rs,r′s′

+ DB,jj′DA,rr′ + DB,jj′DE,rs,r′s′ + DE,ij,i′j′DA,rr′ + DE,ij,i′j′DB,ss′

+ 4BA,ii′,rr′BB,jj′,ss′ + 4BA,ii′,rr′BE,iji′j′,rsr′s′ + 4BB,jj′,ss′BE,iji′j′,rsr′s′ .

(A.7)

The other 24 terms are zero.

Next we collect expressions for the quantities appearing in the generic term of our squared U -

statistics.

Lemma A.2.2. In the random effects model (2),

BA,ii′,rr′ = σ2A

(1i=r − 1i=r′ − 1i′=r + 1i′=r′

),

BB,jj′,ss′ = σ2B

(1j=s − 1j=s′ − 1j′=s + 1j′=s′

), and

BE,iji′j′,rsr′s′ = σ2E

(1ij=rs − 1ij=r′s′ − 1i′j′=rs + 1i′j′=r′s′

).

Proof. The first one follows by expanding and using E(aiar) = σ2A1i=r, et cetera. The other two use

the same argument.

Lemma A.2.3. In the random effects model (2),

DA,ii′ = 2σ2A(1− 1i=i′),

Page 82: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 74

DB,jj′ = 2σ2B(1− 1j=j′), and

DE,ij,i′j′ = 2σ2E(1− 1ij=i′j′).

Proof. Take i = r and i′ = r′ in Lemma A.2.2.

Lemma A.2.4. In the random effects model (2),

QA,ii′,rr′ = 1i 6=i′1r 6=r′σ4A

(4 + (κA + 2)(1i∈{r,r′} + 1i′∈{r,r′}) + 4× 1{i,i′}={r,r′}

),

QB,jj′,ss′ = 1j 6=j′1s6=s′σ4B

(4 + (κB + 2)(1j∈{s,s′} + 1j′∈{s,s′}) + 4× 1{j,j′}={s,s′}

), and

QE,iji′j′,rsr′s′ = 1ij 6=i′j′1rs6=r′s′σ4E

(4 + (κE + 2)(1ij∈{rs,r′s′} + 1i′j′∈{rs,r′s′})

+ 4× 1{ij,i′j′}={rs,r′s′}

).

Proof. We prove the first one; the others are similar. This quantity is 0 if i = i′ or r = r′. When

i 6= i′ and r 6= r′, there are 3 cases to consider: |{i, i′} ∩ {r, r′}| = 0, |{i, i′} ∩ {r, r′}| = 1 and

|{i, i′} ∩ {r, r′}| = 2. The kurtosis is defined via κA = E(a4)/σ4A − 3, so E(a4) = (κA + 3)σ4

A.

For no overlap, we find

E((a1 − a2)2(a3 − a4)2) = E((a1 − a2)2)2 = 4σ4A.

For a single overlap,

E((a1 − a2)2(a1 − a3)2) = E((a21 − 2a1a2 + a22)(a21 − 2a1a3 + a23))

= E(a41) + 3σ4A = σ4

A(κA + 6).

For a double overlap,

E((a1 − a2)4) = E(a41 − 4a1a32 + 6a21a

22 − 4a31a2 + a42)

= 2E(a41) + 6σ4A = σ4

A(2κA + 12).

As a result,

E((ai − ai′)2(ar − ar′)2) =

4σ4

A, |{i, i′} ∩ {r, r′}| = 0,

σ4A(κA + 6), |{i, i′} ∩ {r, r′}| = 1,

σ4A(2κA + 12), |{i, i′} ∩ {r, r′}| = 2,

and so E((ai − ai′)2(ar − ar′)2) equals

1i 6=i′1r 6=r′σ4A

(4 + (κA + 2)(1i∈{r,r′} + 1i′∈{r,r′}) + 4× 1{i,i′}={r,r′}

).

Page 83: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 75

A.2.4 Variance of Ua

We will work out E(U2a ) and then subtract E(Ua)2. First we write

U2a =

1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′(Yij − Yij′)2(Yrs − Yrs′)2.

For E(U2a ) we use the special case i = i′ and r = r′ of (A.7), and

E(U2a ) =

1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′

[QA,ii,rr + QB,jj′,ss′ + QE,ijij′,rsrs′ + DA,iiDB,ss′

+ DA,iiDE,rs,rs′ + DB,jj′DA,rr + DB,jj′DE,rs,rs′ + DE,ij,ij′DA,rr + DE,ij,ij′DB,ss′

+ 4BA,ii,rrBB,jj′,ss′ + 4BA,ii,rrBE,ijij′,rsrs′ + 4BB,jj′,ss′BE,ijij′,rsrs′]

=1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′

[QB,jj′,ss′︸ ︷︷ ︸

Term 1

+QE,ijij′,rsrs′︸ ︷︷ ︸Term 2

+ DB,jj′DE,rs,rs′︸ ︷︷ ︸Term 3

+DE,ij,ij′DB,ss′︸ ︷︷ ︸Term 4

+ 4BB,jj′,ss′BE,ijij′,rsrs′︸ ︷︷ ︸Term 5

]

after eliminating terms that are always 0. We handle these five sums in the next paragraphs.

Term 1

1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′QB,jj′,ss′

=σ4B

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′(1− 1j=j′)(1− 1s=s′)

×(

4 + (κB + 2)(1j∈{s,s′} + 1j′∈{s,s′}) + 4× 1{j,j′}={s,s′}

)= σ4

B

((N −R)2 + (κB + 2)

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• )

+ 2∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1)).

Term 2

1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′QE,ijij′,rsrs′

=σ4E

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′1j 6=j′1s6=s′

Page 84: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 76

×(

4 + (κE + 2)1i=r(1j∈{s,s′} + 1j′∈{s,s′}) + 41i=r1{j,j′}={s,s′}

)= σ4

E

((N −R)2 + (κE + 2)

∑i

Ni•(1−N−1i• )2 + 2∑i

(1−N−1i• )).

Terms 3 and 4 These terms are equal by symmetry. We evaluate term 3.

1

4

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′DB,jj′DE,rs,rs′

=1

4

(∑ijj′

N−1i• ZijZij′DB,jj′)(∑

rss′

N−1r• ZrsZrs′DE,rs,rs′).

Now

∑ijj′

N−1i• ZijZij′DB,jj′ = 2σ2B

∑ijj′

N−1i• ZijZij′(1− 1j=j′) = 2σ2B(N −R)

and

∑rss′

N−1r• ZrsZrs′DE,rs,rs′ = 2σ2E

∑rss′

N−1r• ZrsZrs′(1− 1s=s′) = 2σ2E(N −R)

by the same steps. Therefore term 3 of E(U2a ) equals σ2

Bσ2E(N −R)2 and the sum of terms 3 and 4

is 2σ2Bσ

2E(N −R)2.

Term 5 The term equals

∑ijj′

∑rss′

N−1i• N−1r• ZijZij′ZrsZrs′BB,jj′,ss′BE,ijij′,rsrs′

=∑ijj′

∑ss′

N−1i• ZijZij′BB,jj′,ss′∑r

N−1r• ZrsZrs′BE,ijij′,rsrs′ .

Now

∑r

N−1r• ZrsZrs′BE,ijij′,rsrs′ = σ2EN−1i• ZisZis′

(1j=s − 1j=s′ − 1j′=s + 1j′=s′

).

Term 5 is then

σ2E

∑ijj′

∑ss′

N−2i• ZijZij′ZisZis′(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)BB,jj′,ss′

= σ2Eσ

2B

∑ijj′

∑ss′

N−2i• ZijZij′ZisZis′1j=s(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)

Page 85: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 77

− σ2Eσ

2B

∑ijj′

∑ss′

N−2i• ZijZij′ZisZis′1j=s′(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)− σ2

Eσ2B

∑ijj′

∑ss′

N−2i• ZijZij′ZisZis′1j′=s(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)+ σ2

Eσ2B

∑ijj′

∑ss′

N−2i• ZijZij′ZisZis′1j′=s′(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)= 4σ2

Bσ2E(N −R).

Combination Combining the results of the previous sections, we have

E(U2a ) = σ4

B

((N −R)2 + (κB + 2)

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• )

+ 2∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1))

+ 2σ2Bσ

2E(N −R)2 + 4σ2

Bσ2E(N −R)

+ σ4E

((N −R)2 + (κE + 2)

∑i

Ni•(1−N−1i• )2 + 2∑i

(1−N−1i• )).

Subtracting E(Ua)2 = (N −R)2(σ2B + σ2

E)2 we find that Var(Ua) equals

4σ2Bσ

2E(N −R) + σ4

E

((κE + 2)

∑i

Ni•(1−N−1i• )2 + 2∑i

(1−N−1i• ))

+ σ4B

((κB + 2)

∑ir

(ZZT)ir(1−N−1i• )(1−N−1r• ) + 2∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1)).

(A.8)

Variance of Ub This case is exactly symmetric to the one above with Var(Ua) given by (A.8).

Therefore Var(Ub) equals

4σ2Aσ

2E(N − C) + σ4

E

((κE + 2)

∑j

N•j(1−N−1•j )2 + 2∑j

(1−N−1•j ))

+ σ4B

((κA + 2)

∑js

(ZTZ)js(1−N−1•j )(1−N−1•s ) + 2∑js

N−1•j N−1•s (ZTZ)js((Z

TZ)js − 1)).

(A.9)

A.2.5 Variance of Ue

As before, we find E(U2e ) and then subtract E(Ue)

2. Now

U2e =

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′(Yij − Yi′j′)2(Yrs − Yr′s′)2.

From (A.7),

E(U2e ) =

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′[QA,ii′,rr′︸ ︷︷ ︸Term 1

+QB,jj′,ss′︸ ︷︷ ︸Term 2

+QE,iji′j′,rsr′s′︸ ︷︷ ︸Term 3

+DA,ii′DB,ss′︸ ︷︷ ︸Term 4

Page 86: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 78

+ DA,ii′DE,rs,r′s′︸ ︷︷ ︸Term 5

+DB,jj′DA,rr′︸ ︷︷ ︸Term 6

+DB,jj′DE,rs,r′s′︸ ︷︷ ︸Term 7

+DE,ij,i′j′DA,rr′︸ ︷︷ ︸Term 8

+DE,ij,i′j′DB,ss′︸ ︷︷ ︸Term 9

+ 4BA,ii′,rr′BB,jj′,ss′︸ ︷︷ ︸Term 10

+ 4BA,ii′,rr′BE,iji′j′,rsr′s′︸ ︷︷ ︸Term 11

+ 4BB,jj′,ss′BE,iji′j′,rsr′s′︸ ︷︷ ︸Term 12

].

We handle the twelve sums in the next paragraphs.

Terms 1 and 2 Term 1 is

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′QA,ii′,rr′

=1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′1i6=i′1r 6=r′σ4A

(4 + (κA + 2)(1i∈{r,r′} + 1i′∈{r,r′}) + 4× 1{i,i′}={r,r′}

)=σ4A

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′(1− 1i=i′)(1− 1r=r′)(4 + (κA + 2)(1i∈{r,r′} + 1i′∈{r,r′}) + 4× 1{i,i′}={r,r′}

)= σ4

A

(N4 − 2N2

∑i

N2i• + 3

(∑i

N2i•

)2− 2

∑i

N4i•

)+ σ4

A(κA + 2)(N2∑i

N2i• − 2N

∑i

N3i• +

∑i

N4i•

).

We can use the symmetry of the roles of A and B and their indices. Therefore, term 2 is equal

to

σ4B

(N4 − 2N2

∑j

N2•j + 3

(∑j

N2•j

)2− 2

∑j

N4•j

)+ σ4

B(κB + 2)(N2∑j

N2•j − 2N

∑j

N3•j +

∑j

N4•j

).

Term 3

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′QE,iji′j′,rsr′s′

=σ4E

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′(1− 1ij=i′j′)(1− 1rs=r′s′)(4 + (κE + 2)(1ij∈{rs,r′s′} + 1i′j′∈{rs,r′s′}) + 4× 1{ij,i′j′}={rs,r′s′}

)= σ4

EN(N − 1)[N(N − 1) + 2] + σ4E(κE + 2)N(N − 1)2.

Page 87: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 79

Term 4

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′DA,ii′DB,ss′ =1

4

(∑ii′jj′

ZijZi′j′DA,ii′)( ∑

rr′ss′

ZrsZr′s′DB,ss′).

The first factor is

∑ii′jj′

ZijZi′j′DA,ii′ = 2σ2A

∑ii′jj′

ZijZi′j′(1− 1i=i′) = 2σ2A(N2 −

∑i

N2i•).

By the same argument, the second factor is

∑rr′ss′

ZrsZr′s′DB,ss′ = 2σ2B(N2 −

∑s

N2•s),

and so term 4 is

σ2Aσ

2B(N2 −

∑i

N2i•)(N

2 −∑j

N2•j).

Term 5

1

4

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′DA,ii′DE,rs,r′s′ =1

4

(∑ii′jj′

ZijZi′j′DA,ii′)( ∑

rr′ss′

ZrsZr′s′DE,rs,r′s′).

The first factor is computed in the previous section. The second factor is

∑rr′ss′

ZrsZr′s′DE,rs,r′s′ = 2σ2E

∑rr′ss′

ZrsZr′s′(1− 1rs=r′s′) = 2σ2EN(N − 1).

Thus, term 5 is

σ2Aσ

2EN(N − 1)(N2 −

∑i

N2i•).

Terms 6-9 By symmetry of indices, term 6 is the same as term 4:

σ2Aσ

2B(N2 −

∑i

N2i•)(N

2 −∑j

N2•j).

Term 7 is like term 5 with factors A and B interchanged. Thus, term 7 is equal to

σ2Bσ

2EN(N − 1)(N2 −

∑j

N2•j).

Page 88: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 80

By symmetry of indices, term 8 is the same as term 5:

σ2Aσ

2EN(N − 1)(N2 −

∑i

N2i•).

By symmetry of indices, term 9 is the same as term 7:

σ2Bσ

2EN(N − 1)(N2 −

∑j

N2•j).

Term 10

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′BA,ii′,rr′BB,jj′,ss′

= σ2Aσ

2B

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′(1i=r − 1i=r′ − 1i′=r + 1i′=r′

)(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)= 4σ2

Aσ2B

(N3 − 2N

∑ij

ZijNi•N•j +∑ij

N2i•N

2•j

).

Terms 11 and 12

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′BA,ii′,rr′BE,iji′j′,rsr′s′

= σ2Aσ

2E

∑ii′jj′

∑rr′ss′

ZijZi′j′ZrsZr′s′(1i=r − 1i=r′ − 1i′=r + 1i′=r′

)(1ij=rs − 1ij=r′s′ − 1i′j′=rs + 1i′j′=r′s′

)= σ2

Aσ2E

(4N3 − 4N

∑i

N2i•

).

For term 12, we can use the symmetry with term 11, interchanging rows columns. Thus, term

12 is

σ2Bσ

2E

(4N3 − 4N

∑j

N2•j

).

Combination Summing up the results of the previous twelve sections, E(U2e ) equals

σ4AN

4 − 2σ4AN

2∑i

N2i• + 3σ4

A

(∑i

N2i•

)2− 2σ4

A

∑i

N4i• + σ4

BN4 − 2σ4

B

∑j

N4•j

+ σ4A(κA + 2)

(N2∑i

N2i• − 2N

∑i

N3i• +

∑i

N4i•

)− 2σ4

BN2∑j

N2•j + 3σ4

B

(∑j

N2•j

)2+ σ4

B(κB + 2)(N2∑j

N2•j − 2N

∑j

N3•j +

∑j

N4•j

)+ σ4

E

(N4 − 2N3 + 3N2 − 2N

)

Page 89: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 81

+ σ4E(κE + 2)N(N − 1)2 + σ2

Aσ2B(N2 −

∑i

N2i•)(N

2 −∑j

N2•j) + σ2

Aσ2EN(N − 1)(N2 −

∑i

N2i•)

+ σ2Aσ

2B(N2 −

∑i

N2i•)(N

2 −∑j

N2•j) + σ2

Bσ2EN(N − 1)(N2 −

∑j

N2•j)

+ σ2Aσ

2EN(N − 1)(N2 −

∑i

N2i•) + σ2

Bσ2EN(N − 1)(N2 −

∑j

N2•j)

+ 4σ2Aσ

2B

(N3 − 2N

∑ij

ZijNi•N•j +∑ij

N2i•N

2•j

)+ 4σ2

Aσ2E

(N3 −N

∑i

N2i•

)+ σ2

Bσ2E

(4N3 − 4N

∑j

N2•j

).

Then, we have, applying some simplifications,

Var(Ue) = E(U2e )− E(Ue)

2

= 2σ4A

((∑i

N2i•

)2−∑i

N4i•

)+ 2σ4

B

((∑j

N2•j

)2−∑j

N4•j

)+ 2σ4

EN(N − 1)

+ (κA + 2)σ4A

∑i

N2i•(N −Ni•)2 + (κB + 2)σ4

B

∑j

N2•j(N −N•j)2

+ (κE + 2)σ4EN(N − 1)2 + 4σ2

Aσ2B

∑ij

(Ni•N•j −NZij)2

+ 4σ2Aσ

2EN

(N2 −

∑i

N2i•

)+ 4σ2

Bσ2EN

(N2 −

∑j

N2•j

).

(A.10)

A.2.6 Covariance of Ua and Ub

We use the formula Cov(Ua, Ub) = E(UaUb) − E(Ua)E(Ub), so we just need to compute E(UaUb).

Using our preferred normalization,

UaUb =1

4

(∑ijj′

N−1i• ZijZij′(Yij − Yij′)2)(∑rr′s

N−1•s ZrsZr′s(Yrs − Yr′s)2)

=1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s(Yij − Yij′)2(Yrs − Yr′s)2.

Then,

E(UaUb) =1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s

(QE,ijij′,rsr′s︸ ︷︷ ︸

Term 1

+ DB,jj′DA,rr′︸ ︷︷ ︸Term 2

+DB,jj′DE,rs,r′s︸ ︷︷ ︸Term 3

+DE,ij,ij′DA,rr′︸ ︷︷ ︸Term 4

).

We consider each term separately.

Page 90: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 82

Term 1

1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′sQE,ijij′,rsr′s

=σ4E

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s1j 6=j′1r 6=r′(

4 + (κE + 2)(1ij∈{rs,r′s} + 1ij′∈{rs,r′s}) + 4× 1{ij,ij′}={rs,r′s}

)= σ4

E

(N −R

)(N − C

)+ σ4

E(κE + 2)∑ij

Zij(1−N−1i• )(1−N−1•j ).

Term 2

1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′sDB,jj′DA,rr′

=1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s2σ

2B(1− 1j=j′)2σ

2A(1− 1r=r′)

= σ2Aσ

2B

(N −R

)(N − C

).

Term 3

1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′sDB,jj′DE,rs,r′s

=1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s2σ

2B(1− 1j=j′)2σ

2E(1− 1r=r′)

= σ2Bσ

2E

(N −R

)(N − C

).

Term 4

1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′sDE,ij,ij′DA,rr′

=1

4

∑ijj′

∑rr′s

N−1i• N−1•s ZijZij′ZrsZr′s2σ

2E(1− 1j=j′)2σ

2A(1− 1r=r′)

= σ2Aσ

2E

(N −R

)(N − C

).

Combination Adding up the four terms, we have

E(UaUb) = σ4E

(N −R

)(N − C

)+ σ4

E(κE + 2)∑ij

Zij(1−N−1i• )(1−N−1•j )

+ σ2Aσ

2B

(N −R

)(N − C

)+ σ2

Bσ2E

(N −R

)(N − C

)+ σ2

Aσ2E

(N −R

)(N − C

),

Page 91: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 83

and so

Cov(Ua, Ub) = E(UaUb)− E(Ua)E(Ub)

= E(UaUb)− (σ2B + σ2

E)(σ2A + σ2

E)(N −R)(N − C)

= σ4E(κE + 2)

∑ij

Zij(1−N−1i• )(1−N−1•j ).

Notice that Cov(Ua, Ub) = 0 when σ2E = 0. This can be verified by noting that when σ2

E = 0 then

Ua is a function only of ai while Ub is a function only of bj . Therefore Ua and Ub are independent

when σ2E = 0.

A.2.7 Covariance of Ua and Ue

We use the formula Cov(Ua, Ue) = E(UaUe) − E(Ua)E(Ue), so we just need to compute E(UaUe).

First,

UaUe =1

4

(∑ijj′

N−1i• ZijZij′(Yij − Yij′)2)( ∑

rr′ss′

ZrsZr′s′(Yrs − Yr′s′)2)

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′(Yij − Yij′)2(Yrs − Yr′s′)2.

Then,

E(UaUe) =1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′(QB,jj′,ss′︸ ︷︷ ︸

Term 1

+QE,ijij′,rsr′s′︸ ︷︷ ︸Term 2

+DB,jj′DA,rr′︸ ︷︷ ︸Term 3

+ DB,jj′DE,rs,r′s′︸ ︷︷ ︸Term 4

+DE,ij,ij′DA,rr′︸ ︷︷ ︸Term 5

+DE,ij,ij′DB,ss′︸ ︷︷ ︸Term 6

+ 4BB,jj′,ss′BE,ijij′,rsr′s′︸ ︷︷ ︸Term 7

).

We consider each term separately.

Term 1

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′QB,jj′,ss′

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′1j 6=j′1s 6=s′σ4B(

4 + (κB + 2)(1j∈{s,s′} + 1j′∈{s,s′}) + 4× 1{j,j′}={s,s′}

)= 2σ4

B

(∑i

N−1i•

(∑j

ZijN•j

)2−∑ij

N−1i• ZijN2•j

)+ σ4

B

(N −R

)(N2 −

∑j

N2•j

)+ σ4

B(κB + 2)∑ij

Zij(N −N•j)N•j(1−N−1i• ).

Page 92: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 84

Term 2

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′QE,ijij′,rsr′s′

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′1j 6=j′1rs6=r′s′σ4E(

4 + (κE + 2)(1ij∈{rs,r′s′} + 1ij′∈{rs,r′s′}) + 4× 1{ij,ij′}={rs,r′s′}

)= σ4

EN(N − 1)(N −R) + 2σ4E(N −R) + σ4

E(κE + 2)(N −R)(N − 1).

Term 3

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′DB,jj′DA,rr′

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′2σ2B(1− 1j=j′)2σ

2A(1− 1r=r′)

= σ2Aσ

2B

(N −R

)(N2 −

∑r

N2r•

).

Term 4

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′DB,jj′DE,rs,r′s′

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′2σ2B(1− 1j=j′)2σ

2E(1− 1r=r′1s=s′)

= σ2Bσ

2E

(N −R

)(N2 −N).

Term 5

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′DE,ij,ij′DA,rr′

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′2σ2E(1− 1j=j′)2σ

2A(1− 1r=r′)

= σ2Aσ

2E

(N −R

)(N2 −

∑r

N2r•

)using the result for term 3.

Term 6

1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′DE,ij,ij′DB,ss′

Page 93: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 85

=1

4

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′2σ2E(1− 1j=j′)2σ

2B(1− 1s=s′)

= σ2Bσ

2E

(N −R

)(N2 −

∑s

N2•s

).

Term 7

∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′BB,jj′,ss′BE,ijij′,rsr′s′

=∑ijj′

∑rr′ss′

N−1i• ZijZij′ZrsZr′s′σ2B

(1j=s − 1j=s′ − 1j′=s + 1j′=s′

)× σ2

E

(1ij=rs − 1ij=r′s′ − 1ij′=rs + 1ij′=r′s′

)= 4σ2

Bσ2EN(N −R).

Combination We add up the seven terms, replacing some Nr• and N•s expressions by equivalents

using Ni• and N•j , getting

E(UaUe) = σ4B

(N −R

)(N2 −

∑j

N2•j

)+ 2σ4

B

(∑i

N−1i•

(∑j

ZijN•j

)2−∑ij

N−1i• ZijN2•j

)+ σ4

B(κB + 2)∑ij

Zij(N −N•j)N•j(1−N−1i• ) + 2σ4E(N −R) + σ4

EN(N − 1)(N −R)

+ σ4E(κE + 2)(N −R)(N − 1) + σ2

Aσ2B

(N −R

)(N2 −

∑i

N2i•

)+ σ2

Bσ2E

(N −R

)(N2 −N) + σ2

Aσ2E

(N −R

)(N2 −

∑i

N2i•

)+ σ2

Bσ2E

(N −R

)(N2 −

∑j

N2•j

)+ 4σ2

Bσ2EN(N −R).

Now E(Ua)E(Ue) equals

(N −R)(σ2B + σ2

E)(σ2A(N2 −

∑i

N2i•) + σ2

B(N2 −∑j

N2•j) + σ2

E(N2 −N))

which contains terms equaling several of those in E(UaUe) above. Subtracting those term from

E(UaUe) yields

Cov(Ua, Ue) = 2σ4B

(∑i

N−1i•

(∑j

ZijN•j

)2−∑ij

N−1i• ZijN2•j

)+ σ4

B(κB + 2)∑ij

Zij(N −N•j)N•j(1−N−1i• ) + 2σ4E(N −R)

+ σ4E(κE + 2)(N −R)(N − 1) + 4σ2

Bσ2EN(N −R).

Page 94: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 86

Covariance of Ub and Ue By interchanging the roles of the rows and columns in Cov(Ua, Ue),

we find that

Cov(Ub, Ue) = 2σ4A

(∑j

N−1•j

(∑i

ZijNi•

)2−∑ij

N−1•j ZijN2i•

)+ σ4

A(κA + 2)∑ij

Zij(N −Ni•)Ni•(1−N−1•j ) + 2σ4E(N − C)

+ σ4E(κE + 2)(N − C)(N − 1) + 4σ2

Aσ2EN(N − C).

A.2.8 Estimating Kurtoses

To estimate the kurtoses κA, κB and κE in our variance formulas, it suffices to estimate fourth

central moments such as µA,4 = σ4A(κA+3) and similarly defined µB,4 and µE,4. Given σ2

A, σ2B , and

σ2E , we can do this via method of moments. Consider the following estimating equations and their

expectations,

Wa =1

2

∑ijj′

1

Ni•ZijZij′(Yij − Yij′)4,

Wb =1

2

∑ii′j

1

N•jZijZi′j(Yij − Yi′j)4, and

We =1

2

∑ii′jj′

ZijZi′j′(Yij − Yi′j′)4.

Using previous results,

E(Wa) =1

2

∑ijj′

1

Ni•ZijZij′E

((bj − bj′ + eij − eij′)4

)=

1

2

∑ijj′

ZijZij′

Ni•E((bj − bj′)4 + 6(bj − bj′)2(eij − eij′)2 + (eij − eij′)4

)= (N −R)

(µB,4 + 3σ4

B + 12σ2Bσ

2E + µE,4 + 3σ4

E

).

By symmetry,

E(Wb) = (N − C)(µA,4 + 3σ4

A + 12σ2Aσ

2E + µE,4 + 3σ4

E

).

Next

E(We) =1

2

∑ii′jj′

ZijZi′j′E((Yij − Yi′j′)4

)=

1

2

∑ii′jj′

ZijZi′j′E((ai − ai′ + bj − bj′ + eij − ei′j′)4

)

Page 95: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 87

=1

2

∑ii′jj′

ZijZi′j′E((ai − ai′)4 + 6(ai − ai′)2(bj − bj′)2 + (bj − bj′)4

+ 6(ai − ai′)2(eij − ei′j′)2 + 6(bj − bj′)2(eij − ei′j′)2 + (eij − ei′j′)4)

+ (µE,4 + 3σ4E)N(N − 1) + 12σ2

Aσ2B(N2 −

∑i

N2i• −

∑j

N2•j +N).

These expectations are all linear in the fourth moments. Therefore, given estimates of σ2A, σ2

B ,

and σ2E , we can solve another three-by-three system of equations to get estimates of the fourth

moments.

Letting M be the matrix in equation (A.5) we find thatE(Wa)

E(Wb)

E(We)

= M

µA,4

µB,4

µE,4

+

3(N −R)σ4

B + 12(N −R)σ2Bσ

2E + 3(N −R)σ4

E

3(N − C)σ4A + 12(N − C)σ2

Aσ2E + 3(N − C)σ4

E

H

where

H = (3σ4A + 12σ2

Aσ2E)(N2 −

∑i

N2i•) + (3σ4

B + 12σ2Bσ

2E)(N2 −

∑j

N2•j)

+ 3σ4EN(N − 1) + 12σ2

Aσ2B(N2 −

∑i

N2i• −

∑j

N2•j +N).

For plug-in method of moment estimators we replace expected W -statistics by their sample

quantities, replace the variance components by their estimates, and solve the matrix equation getting

µA,4 et cetera. Then κA = µA,4/σ4A − 3 and so on.

A.3 Asymptotic Normality of Moment Estimates

In this section we provide the proofs for Section 2.4.3.

A.3.1 Proof of Theorem 2.4.3

Letting ε = max(εR, εC), we have

M =

N

N

N2

0 1−R/N 1−R/N1− C/N 0 1− C/N

1 1 1

(1 +O(ε))

Page 96: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 88

and so if max(R,C)/N ≤ θ for some θ < 1, then

M−1 =

− NN−R 0 1

0 − NN−C 1

NN−R

NN−C −1

N−1

N−1

N−2

(1 +O(ε)).

It follows that

σ2A =

( UeN2− UaN −R

)(1 +O(ε)),

σ2B =

( UeN2− UbN − C

)(1 +O(ε)), and

σ2E =

( UaN −R

+Ub

N − C− UeN2

)(1 +O(ε)).

(A.11)

From Lemma 2.4.1, E(Ua) = (σ2B + σ2

E)(N −R), E(Ub) = (σ2A + σ2

E)(N − C), and

E(Ue) = σ2A

(N2 −

∑i

N2i•

)+ σ2

B

(N2 −

∑j

N2•j

)+ σ2

E(N2 −N), so

E(Ue)

N2= σ2

A + σ2B + σ2

E −Υ, where

Υ =(σ2A

∑iN

2i•

N2+ σ2

B

∑j N

2•j

N2+σ2E

N

)= O(ε).

By substitution in (A.11) we find that all of the variance component biases are Υ×(1+O(ε)) = O(ε).

Turning now to variances,

Var(σ2A) = O

(Var(Ue)

N4+

Var(Ua)

N2

),

Var(σ2B) = O

(Var(Ue)

N4+

Var(Ub)

N2

), and

Var(σ2E) = O

(Var(Ua)

N2+

Var(Ub)

N2+

Var(Ue)

N4

).

(A.12)

We work from the exact finite sample formulas in Theorem 2.4.1.

Using (ZZT)ir ≤ Nr• and∑ir(ZZ

T)ir =∑j N

2•j , we find from (2.12) that

Var(Ua) ≤ σ4B(κB + 2)

∑ir

(ZZT)ir + 2σ4B

∑ir

N−1i• (ZZT)ir + 4σ2Bσ

2EN

+ σ4E(κE + 2)

∑i

Ni• + 2σ4E

∑i

1

≤ σ4B(κB + 4)

∑j

N2•j +

(4σ2

Bσ2E + σ4

E(κE + 2))N + 2Rσ4

E

= O(∑

j

N2•j

). (A.13)

Page 97: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 89

The same logic yields Var(Ub) = O(∑iN

2i•). The second term in Var(Ua), which was lumped in with

the first, might ordinarily be much smaller than the first, and then a lead coefficient of σ4B(κB + 2)

would be more accurate than σ4B(κB + 4).

For Var(Ue) the nonnegative terms in (2.14) have magnitudes proportional to

(∑i

N2i•

)2,(∑

j

N2•j

)2, N2

∑i

N2i•, N

2∑j

N2•j ,∑i

N4i•,∑j

N4•j ,∑i

N2i•

∑j

N2•j , N

3

or smaller. These are all O(N2(∑iN

2i• +

∑j N

2•j)), and so

Var(Ue) = O(N2(∑

i

N2i• +

∑j

N2•j

)). (A.14)

Combining (A.13) and (A.14) into (A.12) yields

Var(σ2A) = O

(∑iN

2i•

N2+

∑j N

2•j

N2

)(1 +O(ε)),

and the same follows for σ2B by symmetry. Precisely the same terms appear in Var(σ2

E) so it also

has that rate.

A.3.2 Asymptotic approximation: proof of Theorem 2.4.4

We suppose that the following inequalities all hold

Ni• ≤ εN, N•j ≤ εN, R ≤ εN, C ≤ εN,

N ≤ ε∑i

N2i•, N ≤ ε

∑j

N2•j ,

∑i

N2i• ≤ εN2, and

∑j

N2•j ≤ εN2

for the same small ε > 0. The first six inequalities are assumed in the theorem statement. The last

two follow from the first two. We also assume that

0 < κA + 2, κB + 2, κE + 2, σ4A, σ

4B , σ

4E <∞.

Note that we can bound σ2Aσ

2B , σ2

Aσ2E , and σ2

Aσ2B away from 0 and ∞ uniformly with those other

quantities.

We also suppose that

∑ij

ZijN−1i• N•j ≤ ε

∑i

N2i• and

∑ij

ZijNi•N−1•j ≤ ε

∑j

N2•j . (A.15)

The bounds in (A.15) seem reasonable but it appears that they cannot be derived from the first

eight bounds above.

Page 98: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 90

We begin with the coefficient of σ4B(κB + 2) in Var(Ua) from equation (2.12). It is

∑ir

(ZZT)ir(1−N−1i• −N−1r• +N−1i• N

−1r• ) =

∑j

N2•j − 2

∑ij

ZijN−1i• N•j +

∑jir

ZijN−1i• ZrjN

−1r•

=∑j

N2•j(1 +O(ε)).

The third, fourth and fifth terms in Var(Ua) are all∑j N

2•jO(ε). The second term contains

∑ir

N−1i• N−1r• (ZZT)ir((ZZ

T)ir − 1) ≤∑ir

N−1i• (ZZT)ir =∑irj

N−1i• ZijZrj

=∑ij

ZijN−1i• N•j =

∑j

N2•jO(ε).

It follows that Var(Ua) = σ4B(κB + 2)

∑j N

2•j(1 +O(ε)). Similarly Var(Ub) = σ4

A(κA+ 2)∑iN

2i•(1 +

O(ε)).

The expression for Var(Ue) contains terms σ4A(κA + 2)N2

∑j N

2•j + σ4

B(κB + 2)N2∑iN

2i•. All

other terms are O(ε) times these two, mostly through N �∑iN

2i•,∑j N

2•j � N2. The coefficient

of σ2Aσ

2B contains

N∑ij

ZijNi•N•j ≤ εN2∑ij

ZijNi• = εN2∑i

N2i•

so it is of smaller order than the lead term, as well as

∑i

N2i•

∑j

N2•j ≤ εN2

∑i

N2i•.

As a result

Var(Ue) =(σ4A(κA + 2)N2

∑j

N2•j + σ4

B(κB + 2)N2∑i

N2i•

)(1 +O(ε)).

Turning to the covariances

Cov(Ua, Ub) = σ4E(κE + 2)

∑ij

Zij(1−N−1i• −N−1•j +N−1i• N

−1•j )

= σ4E(κE + 2)(N −R− C +O(R)) = σ4

E(κE + 2)N(1 +O(ε)).

Next Cov(Ua, Ue) contains the term σ4B(κB + 2)N

∑ij ZijN•j = σ4

B(κB + 2)N∑j N

2•j . The terms

appearing after that one are O(N2) = O(εN∑j N

2•j). The largest term preceding it is dominated

by ∑i

N−1i•

(∑j

ZijN•j

)2≤ εN

∑i

N−1i•

(∑j

ZijN•j

)(∑j

Zij

)= εN

∑j

N2•j .

Page 99: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 91

It follows that Cov(Ua, Ue) = σ4B(κB + 2)N

∑j N

2•j(1 +O(ε)) and similarly, Cov(Ub, Ue) = σ4

A(κA +

2)N∑iN

2i•(1 +O(ε)).

Next, using (2.19)

Var(σ2A) =

(Var(Ue)

N4+

Var(Ua)

N2− 2

Cov(Ua, Ue)

N3

)(1 +O(ε))

= σ4A(κA + 2)

1

N2

∑i

N2i•(1 +O(ε)), and similarly

Var(σ2B) = σ4

B(κB + 2)1

N2

∑j

N2•j(1 +O(ε)).

The last variance is

Var(σ2E) =

(Var(Ua)

N2+

Var(Ub)

N2+

Var(Ue)

N4− 2

N3Cov(Ua, Ue)

− 2

N3Cov(Ub, Ue) +

2

N2Cov(Ua, Ub)

)(1 +O(ε))

= σ4E(κE + 2)

1

N(1 +O(ε)).

Next we verify that these variance estimates are asymptotically uncorrelated. Ignoring the 1 +

O(ε) factors we have

Cov(σ2A, σ

2B)

.=

Var(Ue)

N4− Cov(Ub, Ue)

N3− Cov(Ua, Ue)

N3+

Cov(Ua, Ub)

N2

.=

1

N2

(σ4A(κA + 2)

∑i

N2i• + σ4

B(κB + 2)∑j

N2•j))

− 1

N2σ4A(κA + 2)

∑i

N2i• −

1

N2σ4B(κB + 2)

∑j

N2•j +

1

Nσ4E(κE + 2)

=1

Nσ4E(κE + 2)

which is O(ε) times Var(σ2A) and Var(σ2

B). Likewise

Cov(σ2A, σ

2E)

.=

1

N3Cov(Ua, Ue) +

1

N3Cov(Ub, Ue)−

1

N4Var(Ue)−

1

N2Var(Ua)

− 1

N2Cov(Ua, Ub) +

1

N3Cov(Ua, Ue)

.= σ4

B(κB + 2)2

N2

∑j

N2•j + σ4

A(κA + 2)1

N2

∑i

N2i•

−(σ4A(κA + 2)

∑i

N2i• + σ4

B(κB + 2)∑j

N2•j

) 1

N2

− σ4B(κB + 2)

1

N2

∑j

N2•j − σ4

E(κE + 2)1

N

Page 100: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 92

= −σ4E(κE + 2)

1

N

which is much smaller than Var(σ2A). Similarly Cov(σ2

B , σ2E)

.= −σ4

E(κE + 2)/N , is much smaller

than Var(σ2B).

A.3.3 Proof of Lemma 2.4.2

We first consider the relevant results for Ua−E(Ua). It is clear that SAN,k,0 is measurable with respect

to FN,k,0, SAN,q,1 is measurable with respect to FN,q,1, and SAN,`,m,2 is measurable with respect to

FN,`,m,2.

First,

E(SAN,k,0 | FN,k−1,0) =∑i

N−1i• Zik

(∑j:j 6=k

ZijE(b2k − σ2B)− 2

∑j:j<k

ZijE(bj)E(bk))

= 0,

E(SAN,q,1 | FN,q−1,0) = 0, and

E(SAN,`,m,2 | FN,`,m−1,0) = N−1`• Z`m

( ∑j:j 6=m

Z`j(2(bm − bj)E(e`m) + E(e2`m − σ2E))

− 2∑j:j<m

Z`jE(e`j)E(e`m))

= 0.

Next we sum the martingale differences,

∑k

SAN,k,0 +∑`m

SAN,`,m,2

=∑k

∑i

N−1i• Zik

(∑j:j 6=k

Zij(b2k − σ2

B)− 2∑j:j<k

Zijbjbk

)+∑`m

N−1`• Z`m

( ∑j:j 6=m

Z`j(2(bm − bj)e`m + e2`m − σ2E)− 2

∑j:j<m

Z`je`je`m

)=∑i

N−1i•

(∑j 6=k

ZikZij(b2k − σ2

B)− 2∑j<k

ZikZijbjbk

)+∑`

N−1`•

(∑j 6=m

Z`jZ`m(e2`m − σ2E + 2(bm − bj)e`m)− 2

∑j<m

Z`jZ`me`je`m

)=∑i

N−1i•

(∑j 6=k

ZikZijb2k − σ2

BNi•(Ni• − 1)−∑j 6=k

ZikZijbjbk

)+∑`

N−1`•(∑j 6=m

Z`jZ`me2`m − σ2

EN`•(N`• − 1)−∑j 6=m

Z`jZ`me`je`m + 2∑j 6=m

Z`jZ`m(bm − bj)e`m)

=1

2

∑ijk

N−1i• ZijZik(bj − bk)2 − σ2B(N −R) +

1

2

∑`jm

N−1`• Z`jZ`m(e`j − e`m)2 − σ2E(N −R)

Page 101: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 93

+∑`jm

N−1`• Z`jZ`m(bj − bm)(e`j − e`m)

=1

2

∑ijk

N−1i• ZijZik(bj + e`j − bk − e`k)2 − (σ2B + σ2

E)(N −R)

= Ua − E(Ua).

The results for Ub − E(Ub) follow by symmetry.

Now, we consider the results for Ue − E(Ue). It is clear that SEN,k,0 is measurable with respect

to FN,k,0, SEN,q,1 is measurable with respect to FN,q,1, and SEN,`,m,2 is measurable with respect to

FN,`,m,2. Similarly the martingale terms all have expectation zero,

E(SEN,k,0 | FN,k−1,0) =∑ii′

Zik

(∑j:j 6=k

Zi′jE(b2k − σ2B)− 2

∑j:j<k

Zi′jE(bj)E(bk))

= 0,

E(SEN,q,1 | FN,q−1,1)

=∑jj′

Zqj

(∑i:i6=q

Zij′(E(a2q − σ2A) + 2(bj − bj′)E(aq))− 2

∑i:i<q

Zij′E(ai)E(aq))

= 0, and

E(SEN,`,m,2 | FN,`,m−1,2)

= Z`m

( ∑ij:ij 6=`m

Zij(E(e2`m − σ2E) + 2(a` − ai)E(e`m) + 2(bm − bj)E(e`m))

− 2∑

ij:i<` ori=`,j<m

ZijE(eij)E(e`m))

= 0,

and their sum is

∑k

SEN,k,0 +∑q

SEN,q,1 +∑`m

SEN,`,m,2

=∑k

∑ii′

Zik

(∑j:j 6=k

Zi′j(b2k − σ2

B)− 2∑j:j<k

Zi′jbjbk

)+∑q

∑jj′

Zqj

(∑i:i 6=q

Zij′(a2q − σ2

A + 2(bj − bj′)aq)− 2∑i:i<q

Zij′aiaq

)+∑`m

Z`m

( ∑ij:ij 6=`m

Zij(e2`m − σ2

E + 2(a` − ai)e`m + 2(bm − bj)e`m)− 2∑

ij:i<` ori=`,j<m

Zijeije`m

)

=∑ii′

(∑j 6=k

ZikZi′j(b2k − σ2

B)− 2∑j<k

ZikZi′jbjbk

)+∑jj′

(∑i 6=q

ZqjZij′(a2q − σ2

A + 2(bj − bj′)aq)− 2∑i<q

ZqjZij′aiaq

)

Page 102: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 94

+∑ij 6=`m

ZijZ`m(e2`m − σ2E + 2(a` − ai)e`m + 2(bm − bj)e`m)− 2

∑i<` ori=`,j<m

ZijZ`meije`m

=∑ii′

(∑j 6=k

ZikZi′jb2k − σ2

B(Ni•Ni′• − (ZZT)ii′)−∑j 6=k

ZikZi′jbjbk

)+∑jj′

(∑i 6=q

ZqjZij′a2q − σ2

A(N•jN•j′ − (ZTZ)jj′) + 2∑i 6=q

(bj − bj′)aq −∑i 6=q

ZqjZij′aiaq

)+∑ij 6=`m

ZijZ`me2`m − σ2

E(N2 −N) + 2∑ij 6=`m

ZijZ`m(a` − ai + bm − bj)e`m −∑ij 6=`m

ZijZ`meije`m

=1

2

∑ii′jk

ZikZi′j(bj − bk)2 − σ2B(N2 −

∑j

N2•j)

+1

2

∑iqjj′

ZqjZij′(aq − ai)2 − σ2A(N2 −

∑i

N2i•) +

∑iqjj′

ZqjZij′(aq − ai)(bj − bj′)

+1

2

∑i`jm

ZijZ`m(e`m − eij)2 − σ2E(N2 −N) +

∑i`jm

ZijZ`m(a` − ai + bm − bj)(e`m − eij)

= Ue − E(Ue).

A.3.4 Proof of Theorem 2.4.6

We utilize the Cramer-Wold Theorem and show that any linear combination of the scaled martingale

difference sequences are asymptotically Gaussian:

w1SAN√∑j N

2•j

+ w2SBN√∑iN

2i•

+ w3SEN

N√∑

iN2i• +

∑j N

2•j

. (A.16)

Without loss of generality, assume that w21 + w2

2 + w23 = 1.

Consider the filtration indexed by k, FN,k,0. Then, (A.16) becomes the martingale difference

ηk(b2k − σ2B)− 2λkbk

where

ηk =w1√∑j N

2•j

(N•k −∑i

N−1i• Zik) +w3

N√∑

iN2i• +

∑j N

2•j

(NN•k −N2•k),

λk =w1√∑j N

2•j

∑i

N−1i• Zik∑j:j<k

Zijbj +w3

N√∑

iN2i• +

∑j N

2•j

∑j:j<k

N•kN•jbj ,

and

E(ηk(b2k − σ2B)− 2λkbk | FN,k−1,0) = η2kσ

4B(κB + 2) + 4λkσ

2B(λk − ηkγBσB)

where γB is the skewness of the column random effect.

Page 103: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 95

Next, for the filtration indexed by q, FN,q,1, the MG difference is

ηq(a2q − σ2

A)− 2λqaq

where

ηq =w2√∑iN

2i•

(Nq• −∑j

N−1•j Zqj) +w3

N√∑

iN2i• +

∑j N

2•j

(NNq• −N2q•),

λq =w2√∑iN

2i•

∑j

N−1•j Zqj∑i:i<q

Zijai +w3

N√∑

iN2i• +

∑j N

2•j

∑i:i<q

Nq•Ni•ai

− w3

N√∑

iN2i• +

∑j N

2•j

(N∑j

Zqjbj −Nq•∑j′

N•j′bj′),

and

E(ηq(a2q − σ2

A)− 2λqaq | FN,q−1,1) = η2qσ4A(κA + 2) + 4λqσ

2A(λq − ηqγAσA).

Finally, for the filtration indexed by ` and m, FN,`,m,2, the MG difference is

Z`m(η`m(e2`m − σ2E)− 2λ`me`m)

where

η`m =w1√∑j N

2•j

(1−N−1`• ) +w2√∑iN

2i•

(1−N−1•m) +w3

N√∑

iN2i• +

∑j N

2•j

(N − Z`m),

λ`m =w1√∑j N

2•j

N−1`•

∑j:j<m

Z`je`j −w1√∑j N

2•j

N−1`•

∑j:j 6=m

Z`j(bm − bj)

+w2√∑iN

2i•

N−1•m∑i:i<`

Zimeim −w2√∑iN

2i•

N−1•m∑i:i 6=`

Zim(a` − ai)

+w3

N√∑

iN2i• +

∑j N

2•j

∑ij:i<` ori=`,j<m

Zijeij

− w3

N√∑

iN2i• +

∑j N

2•j

∑ij:ij 6=`m

Zij(a` − ai + bm − bj),

and

E(Z`m(η`m(e2`m−σ2E)− 2λ`me`m) | FN,`,m−1,2) = Z`m(η2`mσ

4E(κE + 2) + 4λ`mσ

2E(λ`m− η`mγEσE)).

For the proof, we merely need to check two conditions. First, the Lindeberg condition, which

Page 104: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 96

requires that for all δ > 0 the following has limit 0:

∑k

E((ηk(b2k − σ2B)− 2λkbk)2I{|ηk(b2k − σ2

B)− 2λkbk| ≥ δ} | FN,k−1,0)

+∑q

E((ηq(a2q − σ2

A)− 2λaq)2I{|ηq(a2q − σ2

A)− 2λqaq| ≥ δ} | FN,q−1,1)

+∑`m

Z`mE((η`m(e2`m − σ2E)− 2λ`me`m)2I{|η`m(e2`m − σ2

E)− 2λ`me`m| ≥ δ} | FN,`,m−1,2).

Second, the condition that the sum of the conditional variances of the martingale differences, shown

below, converges in probability to a constant:

σ4B(κB + 2)

∑k

η2k + 4σ2B

∑k

λ2k − 4γBσ3B

∑k

λkηk

+ σ4A(κA + 2)

∑q

η2q + 4σ2A

∑q

λ2q − 4γAσ3A

∑q

λqηq

+ σ4E(κE + 2)

∑`m

Z`mη2`m + 4σ2

E

∑`m

Z`mλ2`m − 4γEσ

3E

∑`m

Z`mλ`mη`m.

A.3.4.1 Lindeberg condition

Here, we assume that |ai| ≤ cA, |bj | ≤ cB , and |eij | ≤ cE . We show that |ηk(b2k − σ2B) − 2λkbk|,

|ηq(a2q − σ2A) − 2λqaq|, and |η`m(e2`m − σ2

E) − 2λ`me`m| are each bounded above by some quantity

that converges to 0. Then, for any δ > 0, for large enough N , I{|ηk(b2k − σ2B) − 2λkbk| ≥ δ},

I{|ηq(a2q − σ2A) − 2λqaq| ≥ δ}, and I{|η`m(e2`m − σ2

E) − 2λ`me`m| ≥ δ} are always zero. Therefore,

the Lindeberg condition is satisfied.

LetmaxiNi• + maxj N•j

min(√∑

iN2i•,√∑

j N2•j

) = ξ → 0.

Then,

|ηk| ≤|w1|√∑j N

2•j

N•k +|w3|√∑

iN2i• +

∑j N

2•j

N•k ≤ ξ(|w1|+ |w3|),

|λk| ≤|w1|√∑j N

2•j

∑i

N−1i• Zik∑j:j<k

ZijcB +|w3|

N√∑

iN2i• +

∑j N

2•j

∑j:j<k

N•kN•jcB

≤ |w1|√∑j N

2•j

cBN•k +|w3|√∑

iN2i• +

∑j N

2•j

cBN•k ≤ ξcB(|w1|+ |w3|), and

|ηk(b2k − σ2B)− 2λkbk| ≤ |ηk||b2k − σ2

B |+ 2|λk||bk| ≤ |ηk|(σ2B + c2B) + 2cB |λk|

≤ ξ(|w1|+ |w3|)(σ2B + 3c2B)→ 0.

Page 105: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 97

By symmetry, |ηq(a2q − σ2A)− 2λqaq| ≤ ξ(|w2|+ |w3|)(σ2

A + 3c2A)→ 0.

Similarly,

|η`m| ≤|w1|√∑j N

2•j

+|w2|√∑iN

2i•

+|w3|√∑

iN2i• +

∑j N

2•j

≤ ξ(|w1|+ |w2|+ |w3|),

|λ`m| ≤|w1|√∑j N

2•j

N−1`• cE∑j:j,m

Z`j +|w1|√∑j N

2•j

N−1`• 2cB∑j:j 6=m

Z`j

+|w2|√∑iN

2i•

N−1•mcE∑i:i<`

Zim +|w2|√∑iN

2i•

N−1•m2cA∑i:i6=`

Zim

+|w3|

N√∑

iN2i• +

∑j N

2•j

cE∑

ij:i<` ori=`,j<m

Zij

+|w3|

N√∑

iN2i• +

∑j N

2•j

2(cA + cB)∑

ij:ij 6=`m

Zij

≤ |w1|√∑j N

2•j

cE +|w1|√∑j N

2•j

2cB +|w2|√∑iN

2i•

cE +|w2|√∑iN

2i•

2cA

+|w3|√∑

iN2i• +

∑j N

2•j

cE +|w3|√∑

iN2i• +

∑j N

2•j

2(cA + cB)

≤ ξcE(|w1|+ |w2|+ |w3|) + 2ξ(|w1|cB + |w2|cA + |w3|(cA + cB)), and

|η`m(e2`m − σ2E)− 2λ`me`m| ≤ |η`m|(σ2

E + c2E) + 2cE |λ`m|

≤ ξ(|w1|+ |w2|+ |w3|)(σ2E + 3c2E)

+ 4ξcE(|w1|cB + |w2|cA + |w3|(cA + cB))

→ 0.

We now have the desired results.

A.3.4.2 Sum of conditional variances

To show that the sum of the conditional variances of the martingale differences converges in proba-

bility to a constant, we take the following steps.

1. Show that∑k η

2k,∑q η

2q , and

∑`m Z`mη

2`m converge in probability to constants.

2. Show that∑k λ

2k,∑q λ

2q, and

∑`m Z`mλ

2`m converge in probability to 0.

3. Show that∑k ηkλk,

∑q ηqλq, and

∑`m η`mλ`m converge in probability to 0.

Page 106: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 98

Notice that step 3 actually follows from the first two steps, by the Cauchy-Schwarz inequality.

We explain this explicitly for∑k ηkλk. Letting

∑k η

2k

p→ c,

|∑k

ηkλk| ≤ (∑k

η2k)1/2(∑k

λ2k)1/2p→√c ∗ 0 = 0.

Then,∑k ηkλk must converge in probability to 0.

Therefore, we need to show steps 1 and 2.

∑k

η2k =w2

1∑j N

2•j

∑k

(N•k −∑i

N−1i• Zik)2 +w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑k

(NN•k −N2•k)2

+ 2w1√∑j N

2•j

w3

N√∑

iN2i• +

∑j N

2•j

∑k

(N•k −∑i

N−1i• Zik)(NN•k −N2•k)

=w2

1∑j N

2•j

(∑k

N2•k − 2

∑ik

N•kN−1i• Zik +

∑kir

ZikZrkN−1i• N

−1r• )

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

(N2∑k

N2•k − 2N

∑k

N3•k +

∑k

N4•k)

+ 2w1√∑j N

2•j

w3

N√∑

iN2i• +

∑j N

2•j

(N∑k

N2•k −

∑k

N3•k −N

∑ik

ZikN−1i• N•k

+∑ik

ZikN−1i• N

2•k)

= w21(1− 2O(ε) +O(ε)) + w2

3

∑j N

2•j∑

iN2i• +

∑j N

2•j

(1− 2O(ε) +O(ε))

+ 2w1w3

√ ∑j N

2•j∑

iN2i• +

∑j N

2•j

(1−O(ε)−O(ε) +O(ε))

→ w21 + w2

3

∑j N

2•j∑

iN2i• +

∑j N

2•j

+ 2w1w3

√ ∑j N

2•j∑

iN2i• +

∑j N

2•j

.

The fact that∑q η

2q converges in probability to a constant follows by symmetry from the result

for∑k η

2k.

∑`m

Z`mη2`m =

w21∑

j N2•j

∑`m

Z`m(1−N−1`• )2 +w2

2∑iN

2i•

∑`m

Z`m(1−N−1•m)2

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑`m

Z`m(N − Z`m)2

+2w1w2√∑iN

2i•

∑j N

2•j

∑`m

Z`m(1−N−1`• )(1−N−1•m)

Page 107: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 99

+2w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑`m

Z`m(1−N−1`• )(N − Z`m)

+2w2w3

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

∑`m

Z`m(1−N−1•m)(N − Z`m)

≤ w21O(ε) + w2

2O(ε) + w23O(ε) + |w1w2|O(ε2) + |w1w3|O(ε2) + |w2w3|O(ε2)→ 0.

Step 1 is concluded.

Now consider∑k λ

2k. It is a quadratic form in the bj . The coefficient of bj in λk is

Mjk = I{k > j}

w1√∑j N

2•j

∑i

N−1i• ZikZij +w3

N√∑

iN2i• +

∑j N

2•j

N•kN•j

and so we can write

λk =[M1k . . . MCk

]b1...

bC

.

Then,∑k λ

2k is a quadratic form in terms of

b1...

bC

, with matrix in the middle M having entry

(j, s)∑kMjkMsk. Now, we apply formulas about the expectation and variance of quadratic forms,

given in Khoshnevisan (2014) [27].

E(∑k

λ2k) = σ2B Tr(M) = σ2

B

∑jk

M2jk and

M2jk = I{k > j}

[w2

1∑j N

2•j

∑ir

N−1i• N−1r• ZikZrkZijZrj +

w23

N2(∑iN

2i• +

∑j N

2•j)N2•kN

2•j

+w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

N•j∑i

ZikN−1i• N•kZij

].

Thus,∑jkM

2jk has three terms.

Term 1:

w21∑

j N2•j

∑jk

I{k > j}∑ir

N−1i• N−1r• ZikZrkZijZrj ≤

w21∑

j N2•j

∑j

∑kir

ZikZrkN−1i• N

−1r• ZijZrj

≤ w21∑

j N2•j

∑kir

ZikZrkN−1i•

Page 108: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 100

=w2

1∑j N

2•j

∑ik

ZikN−1i• N•k = w2

1O(ε)→ 0.

Term 2:

w23

N2(∑iN

2i• +

∑j N

2•j)

∑jk

I{k > j}N2•kN

2•j ≤ w2

3

∑j N

2•j

N2

∑j N

2•j∑

iN2i• +

∑j N

2•j

= w23O(ε)→ 0.

Term 3: Taking absolute values,

|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑jk

I{k > j}N•jN•k∑i

ZikN−1i• Zij

≤ |w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑ik

ZikN−1i• N•k

∑j

N•j

≤ |w1w3|O(ε)→ 0.

Thus, E(∑k λ

2k)→ 0.

Var(∑k

λ2k) = (µ4,B − 3σ4B)∑j

(∑k

M2jk)2 + (σ4

B − 1)(∑jk

M2jk)2 + 2σ4

B

∑js

(∑k

MjkMsk)2.

We have proved that∑jkM

2jk = O(ε), so the second term in the above expression is O(ε2)→ 0.

Since∑j(∑kM

2jk)2 ≤ (

∑jkM

2jk)2 = O(ε2) → 0, the first term also converges to 0. It remains to

consider the third term, which we again upper bound.

By ignoring the indicator variables, we have

∑k

|MjkMsk| ≤w2

3

N2(∑iN

2i• +

∑j N

2•j)N•jN•s

∑k

N2•k

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

N•j∑ik

ZikN−1i• N•kZis

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

N•s∑ik

ZikN−1i• N•kZij

+w2

1∑j N

2•j

∑kir

ZikZrkN−1i• N

−1r• ZijZrs.

By squaring this and summing over j and s, we get an upper bound on∑js(∑kMjkMsk)2.

There are 10 terms in that sum, which are as follows.

Page 109: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 101

Term 1:

w43

N4(∑iN

2i• +

∑j N

2•j)

2

∑js

N2•jN

2•s(∑k

N2•k)2 = w4

3(

∑j N

2•j

N2)2

(∑j N

2•j)

2

(∑iN

2i• +

∑j N

2•j)

2= w4

3O(ε2)→ 0.

Term 2:

w21w

23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

∑js

N2•j(∑ik

ZikN−1i• N•kZis)

2

=w2

1w23

N2(∑iN

2i• +

∑j N

2•j)

∑s

∑iki′k′

ZikZi′k′N−1i• N

−1i′• N•kN•k′ZisZi′s

≤ w21w

23

N2(∑iN

2i• +

∑j N

2•j)

∑iki′k′

ZikZi′k′N−1i• N•kN•k′

= w21w

23

∑kN

2•k

N2

∑ik ZikN

−1i• N•k∑

iN2i• +

∑j N

2•j

= w21w

23O(ε)→ 0.

Term 3 is the same as term 2 by symmetry.

Term 4:

w41

(∑j N

2•j)

2

∑js

∑kk′ii′rr′

ZikZrkN−1i• N

−1r• ZijZrsZi′k′Zr′k′N

−1i′• N

−1r′•Zi′jZr′s

≤ w41

(∑j N

2•j)

2

∑kk′ii′rr′

ZikZrkN−1i• N

−1r• Zi′k′Zr′k′ =

w41

(∑j N

2•j)

2

∑ii′rr′

(ZZT)ir(ZZT)i′r′N

−1i• N

−1r•

=w4

1∑j N

2•j

∑ir

(ZZT)irN−1i• N

−1r• = w4

1O(ε)→ 0.

Term 5: Taking absolute values,

|w1w33|

N3(∑iN

2i• +

∑j N

2•j)√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑js

N2•jN•s

∑k

N2•k

∑ik

ZikN−1i• N•kZis

≤ |w1w33|∑j N

2•j

N2

∑kN

2•k∑

iN2i• +

∑j N

2•j

∑sN•sN

∑ik ZikN

−1i• N•k√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

= |w1w33|O(ε2)→ 0.

Term 6 is the same as term 5 by symmetry.

Term 7:

w21w

23

N2(∑iN

2i• +

∑j N

2•j)

∑js

N•jN•s∑kir

ZikZrkN−1i• N

−1r• ZijZrs

Page 110: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 102

≤ w21w

23

N2(∑iN

2i• +

∑j N

2•j)

∑js

N•jN•s∑ir

N−1i• ZijZrs

= w21w

23

∑j N

2•j∑

iN2i• +

∑j N

2•j

∑ij ZijN

−1i• N•j

N2= w2

1w23O(ε2)→ 0.

Term 8:

w21w

23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

∑js

N•jN•s∑ii′kk′

ZikZi′k′N−1i• N•kN

−1i′• N•k′ZisZi′j

≤ w21w

23∑

j N2•j(∑iN

2i• +

∑j N

2•j)

∑ii′kk′

ZikZi′k′N−1i• N•kN

−1i′• N•k′ = w2

1w23O(ε2)→ 0.

Term 9: Taking absolute values,

|w31w3|

N∑j N

2•j

√∑j N

2•j

√∑iN

2i• +

∑j N

2•j

∑js

∑ii′kk′r

N•jZikN−1i• N•kZisZi′k′Zrk′N

−1i′• N

−1r• Zi′jZrs

≤ |w31w3|

N∑j N

2•j

√∑j N

2•j

√∑iN

2i• +

∑j N

2•j

∑j

∑ii′kk′r

N•jZikN−1i• N•kZi′k′Zrk′N

−1i′• Zi′j

≤ |w31w3|∑

j N2•j

√∑j N

2•j

√∑iN

2i• +

∑j N

2•j

∑ii′kk′

ZikN−1i• N•kZi′k′N

−1i′• N•k′ = |w3

1w3|O(ε2)→ 0.

Term 10 is the same as term 9 by symmetry.

Adding up the 10 terms, we see that the variance of∑k λ

2k approaches 0, and

∑k λ

2k converges

in probability to 0.

Consider∑q λ

2q. Note that λq is a linear combination of the ai and bj . Therefore,

∑q λ

2q is the

sum of a quadratic form in a, a quadratic form in b, and a weighted inner product of a and b.

The quadratic form in b and the weighted inner product of a and b are

∑q

α2q =

w23

N2(∑iN

2i• +

∑j N

2•j)

∑q

(∑j

(NZqj −Nq•N•j)bj)2 and

∑q

αqνq =w2w3

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

∑q

(∑j

(Nq•N•j −NZqj)bj)(∑j

∑i:i<q

ZijZqjN−1•j ai)

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑q

(∑j

(Nq•N•j −NZqj)bj)(Nq•∑i:i<q

Ni•ai)

and∑q ν

2q is the quadratic form in a that converges in probability to 0. The quadratic form in a

converges in probability to 0 by the same argument that∑k λ

2k converges in probability to 0.

Page 111: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 103

More explicitly, αq is a linear combination of the bj , with coefficients

Mqj =w3

N√∑

iN2i• +

∑j N

2•j

(NZqj −Nq•N•j).

Then,∑q α

2q is a quadratic form in the bj , with inner matrix entry (j, s)

∑qMqjMqs.

E(∑q

α2q) = σ2

B

∑jq

M2qj and

Var(∑q

α2q) = (µB,4 − 3σ4

B)∑j

(∑q

M2qj)

2 + (σ4B − 1)(

∑jq

M2qj)

2 + 2σ4B

∑js

(∑q

MqjMqs)2.

Then,

∑qj

M2qj =

w23

N2(∑iN

2i• +

∑j N

2•j)

(N2∑qj

Zqj +∑qj

N2q•N

2•j − 2N

∑qj

ZqjNq•N•j)

≤ w23O(ε)→ 0.

Thus, the first two terms of Var(∑q α

2q) also converge to 0, using the same reasoning as before.

∑q

MqjMqs =w2

3

N2(∑iN

2i• +

∑j N

2•j)

[N2(ZTZ)js +N•jN•s∑q

N2q• −NN•s

∑q

ZqjNq• −NN•j∑q

ZqsNq•].

We square this and sum over j and s. There are 10 terms in the sum.

Term 1:

w43

(∑iN

2i• +

∑j N

2•j)

2

∑js

(ZTZ)2js ≤w4

3

(∑iN

2i• +

∑j N

2•j)

2

∑j

N2•jC = w4

3O(ε)→ 0.

Term 2:

w43

N4(∑iN

2i• +

∑j N

2•j)

2(∑q

N2q•)

2∑j

N2•j

∑s

N2•s = w4

3O(ε)→ 0.

Term 3:

w43

N2(∑iN

2i• +

∑j N

2•j)

2

∑s

N2•s

∑j

(∑q

ZqjN2q•)

2 = w43O(ε)→ 0.

Page 112: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 104

Term 4 is the same as term 3 by symmetry.

Term 5:

w43

N2(∑iN

2i• +

∑j N

2•j)

2

∑q

N2q•

∑js

N•jN•s(ZTZ)js = w4

3O(ε)→ 0.

Term 6: Taking absolute values, we have

w43

N(∑iN

2i• +

∑j N

2•j)

2

∑js

(ZTZ)jsN•s∑q

ZqjNq• = w43O(ε)→ 0.

Term 7 is the same as term 6 by symmetry.

Term 8: Taking absolute values, we have

w43

N3(∑iN

2i• +

∑j N

2•j)

2

∑q

N2q•

∑s

N2•s

∑jq

ZqjNq•N•j = w43O(ε)→ 0.

Term 9 is the same as term 8 by symmetry.

Term 10:

w43

N2(∑iN

2i• +

∑j N

2•j)

2

∑jq

ZqjNq•N•j∑sq

ZqsNq•N•s = w43O(ε)→ 0.

Thus,∑js(∑qMqjMqs)

2 is bounded above by something that converges to 0, and Var(∑q α

2q)

converges to 0. Then,∑q α

2q converges in probability to 0.

Because∑q α

2q and

∑q ν

2q converge in probability to 0, again by the Cauchy-Schwarz inequality∑

q αqνq converges in probability to 0. Therefore,∑q λ

2q is the sum of three sequences that converge

in probability to 0 and thus converges in probability to 0.

Finally, we consider∑`m Z`mλ

2`m. Notice that λ`m is a linear combination of the ai, bj , and eij .

Therefore,∑`m Z`mλ

2`m is the sum of quadratic forms in a, b, and e, and weighted inner products

of them. Using the same reasoning as for∑q λ

2q, it suffices to show that the three quadratic forms

converge in probability to 0.

The coefficient of eij in λ`m is

Mij,`m =w1√∑j N

2•j

N−1i• ZijI{i = `, j < m}+w2√∑iN

2i•

N−1•j ZijI{i < `, j = m}

+w3

N√∑

iN2i• +

∑j N

2•j

ZijI{i < ` or i = `, j < m}.

Page 113: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 105

Then, let QF (e) be the quadratic form in e in∑`m Z`mλ

2`m, with inner matrix having entry

(ij, rs)∑`m Z`mMij,`mMrs,`m.

E(QF (e)) = σ2E

∑`m,ij

Z`mM2ij,`m and

Var(QF (e)) = (µE,4 − 3σ4E)∑ij

(∑`m

Z`mM2ij,`m)2 + (σ4

E − 1)(∑ij,`m

Z`mM2ij,`m)2

+ 2σ4E

∑ij,rs

(∑`m

Z`mMij,`mMrs,`m)2.

Then,

∑`m,ij

Z`mM2ij,`m ≤

w21∑

j N2•j

∑ijm

ZimZijN−2i• +

w2∑iN

2i•

∑ij`

Z`jZijN−2•j

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑ij`m

ZijZ`m +2|w1w3|√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

+2|w2w3|√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

→ 0.

Thus, E(QF (e)) and the first two terms of Var(QF (e)) converge to 0. To bound the third term

of Var(QF (e)),

∑`m

Z`m|Mij,`mMrs,`m| ≤w2

1∑j N

2•j

N−1r• ZijZrsI{i = r}+w2

2∑iN

2i•

N−1•s ZijZrsI{j = s}

+|w1w2|√∑iN

2i•

∑j N

2•j

ZisZijZrsN−1i• N

−1•s + ZijZrs[

w23

N(∑iN

2i• +

∑j N

2•j)

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

+|w2w3|

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

].

We square the above and sum over ij and rs. There are 10 terms in the sum.

Term 1:

w41

(∑j N

2•j)

2

∑ij,rs

N−2r• ZijZrsI{i = r} =w4

1

(∑j N

2•j)

2

∑ijs

N−2i• ZijZis = w41O(ε)→ 0.

Term 2 is the same as term 1 by symmetry.

Page 114: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 106

Term 3:

w21w

22∑

iN2i•

∑j N

2•j

∑ij,rs

ZisZijZrsN−1i• N

−1•s =

w21w

22∑

iN2i•

∑j N

2•j

∑is

Zis = w21w

22O(ε)→ 0.

Term 4:

[w2

3

N(∑iN

2i• +

∑j N

2•j)

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

+|w2w3|

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

]2N2

= O(ε)→ 0.

Term 5:

w21w

22∑

iN2i•

∑j N

2•j

∑ij,rs

N−1r• N−1•s ZijZrsI{i = r, j = s} =

w21w

22∑

iN2i•

∑j N

2•j

∑ij

N−1i• N−1•j Zij

= w21w

22O(ε)→ 0.

Term 6:

|w31w2|∑

j N2•j

√∑iN

2i•

∑j N

2•j

∑ijs

N−2i• ZijZisN−1•s =

|w31w2|∑

j N2•j

√∑iN

2i•

∑j N

2•j

∑is

ZisN−1i• N

−1•s

= |w31w2|O(ε)→ 0.

Term 8 is the same as term 6 by symmetry.

Term 7:

[w2

3

N(∑iN

2i• +

∑j N

2•j)

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

+|w2w3|

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

]

w21∑

j N2•j

∑ijs

N−1i• ZijZis = O(ε)→ 0.

Term 9 is the same as term 7 by symmetry.

Term 10:

[w2

3

N(∑iN

2i• +

∑j N

2•j)

+|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

+|w2w3|

N√∑

iN2i•

√∑iN

2i• +

∑j N

2•j

]

|w1w2|√∑iN

2i•

∑j N

2•j

∑ij,rs

ZijZrsZisN−1i• N

−1•s = O(ε)→ 0.

Page 115: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 107

Thus, QF (e) converges in probability to 0. The coefficient of bj in λ`m is

Mj,`m =w1√∑j N

2•j

(N−1`• Z`j − I{j = m}) +w3

N√∑

iN2i• +

∑j N

2•j

(N•j −NI{j = m}).

Then, let QF (b) be the quadratic form in the b in∑`m Z`mλ

2`m, with inner matrix having entry

(j, s)∑`m Z`mMj,`mMs,`m.

E(QF (b)) = σ2E

∑`m,j

Z`mM2j,`m and

Var(QF (b)) = (µE,4 − 3σ4E)∑j

(∑`m

Z`mM2j,`m)2 + (σ4

E − 1)(∑j,`m

Z`mM2j,`m)2

+ 2σ4E

∑j,s

(∑`m

Z`mMj,`mMs,`m)2.

Then,

∑`m,j

Z`mM2j,`m =

w21∑

j N2•j

∑`m,j

Z`m(N−2`• Z`,j + I{j = m} − 2I{j = m}N−1`• Z`j)

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑`m,j

Z`m(N2•j +N2I{j = m} − 2NN•jI{j = m}))

+2|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑`m,j

(N−1`• N•jZ`j −NN−1`• Z`jI{j = m}

−N•jI{j = m}+NI{j = m})

≤ w21∑

j N2•j

(3∑`m

Z`mN−1`• +

∑`m

Z`m) +w2

3

N2(∑iN

2i• +

∑j N

2•j)

(3N∑j

N2•j +N3)

+2|w1w3|

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

(2∑j

N2•j +NR+N2)

= w21O(ε) + w2

3O(ε) + 2|w1w3|O(ε)→ 0.

Thus, the expectation and first two terms of the variance of Var(QF (b)) converge to 0. To bound

the third term of Var(QF (b)),

∑`m

Z`mMj,`mMs,`m

=w2

1∑j N

2•j

∑`m

Z`m(N−2`• Z`jZ`s −N−1`• Z`sI{j = m} −N−1`• Z`jI{s = m}+ I{j = s = m})

Page 116: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 108

+w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑`m

Z`m(N−1`• N•sZ`j −NN−1`• Z`jI{s = m} −N•sI{j = m}

+NI{j = s = m})

+w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

∑`m

Z`m(N−1`• N•jZ`s −NN−1`• Z`sI{j = m} −N•jI{s = m}

+NI{j = s = m})

+w2

3

N2(∑iN

2i• +

∑j N

2•j)

∑`m

Z`m(N•jN•s −NN•jI{s = m} −NN•sI{j = m}

+N2I{j = s = m})

=w2

1∑j N

2•j

(N•jI{j = s} −∑`

N−1`• Z`jZ`s) +w2

3

N2(∑iN

2i• +

∑j N

2•j)

(N2N•jI{j = s} −NN•jN•s)

+w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

(NN•jI{j = s} −N∑`

N−1`• Z`jZ`s)

+w1w3

N√∑

j N2•j

√∑iN

2i• +

∑j N

2•j

(NN•sI{j = s} −N∑`

N−1`• Z`jZ`s).

We square the above and sum over j and s; there are 10 terms in the sum.

Term 1:

w41

(∑j N

2•j)

2

∑js

(N2•jI{j = s} − 2N•jI{j = s}

∑`

N−1`• Z`jZ`s +∑i`

N−1i• N−1`• ZijZisZ`jZ`s)

≤ w41

(∑j N

2•j)

2(∑j

N2•j − 2

∑`j

Z`jN−1`• N•j +R2) = w4

1O(ε)→ 0.

Term 2:

w43

N2(∑iN

2i• +

∑j N

2•j)

2

∑js

(N2N2•jI{j = s}+N2

•jN2•s − 2NN2

•jN•sI{j = s})

=w4

3

N2(∑iN

2i• +

∑j N

2•j)

2(N2

∑j

N2•j +

∑j

N2•j

∑s

N2•s − 2N

∑j

N3•j) = w4

3O(ε)→ 0.

Term 3:

w21w

23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

∑js

(N2N2•jI{j = s} − 2N2N•jI{j = s}

∑`

N−1`• Z`jZ`s

+N2∑i`

N−1i• N−1`• ZijZisZ`jZ`s)

=4w2

1w23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

(N2∑j

N2•j − 2N2

∑j

N•j∑`

N−1`• Z`j +N2R2)

Page 117: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 109

= w21w

23O(ε)→ 0.

Term 4 is the same as term 3 by symmetry.

Term 5:

w21w

23

N∑j N

2•j(∑iN

2i• +

∑j N

2•j)

∑js

(N2•jI{j = s} −N2

•jN•sI{j = s} −NN•jI{j = s}∑`

N−1`• Z`jZ`s

+N•jN•s∑`

N−1`• Z`jZ`s)

=w2

1w23

N∑j N

2•j(∑iN

2i• +

∑j N

2•j)

(∑j

N2•j −

∑j

N3•j −N

∑`j

N−1`• N•jZ`j +∑`js

Z`jN−1`• N•jZ`sN•s)

= w21w

23O(ε)→ 0.

Term 6:

w31w3∑

j N2•j

√∑j N

2•j

√∑iN

2i• +

∑j N

2•j

∑js

(N2•jI{j = s} − 2N•jI{j = s}

∑`

N−1`• Z`jZ`s

+∑i`

N−1i• N−1`• ZijZisZ`jZ`s)

≤ |w31w3|∑

j N2•j

√∑j N

2•j

√∑iN

2i• +

∑j N

2•j

(∑j

N2•j − 2

∑`j

N•jN−1`• Z`j +R2) = |w3

1w3|O(ε)→ 0.

Term 7 is the same as term 6 by symmetry.

Term 8:

w1w33√∑

j N2•j

√∑iN

2i• +

∑j N

2•jN(

∑iN

2i• +

∑j N

2•j)

∑js

(NN2•jI{j = s} −N2

•jN•sI{j = s}

−NN•jI{j = s}∑`

N−1`• Z`jZ`s +N•jN•s∑`

N−1`• Z`jZ`s)

=|w1w

33|√∑

j N2•j

√∑iN

2i• +

∑j N

2•jN(

∑iN

2i• +

∑j N

2•j)

(N∑j

N2•j −

∑j

N3•j −N

∑`j

N−1`• N•jZ`j

+∑`js

Z`jN−1`• N•jZ`sN•s)

= |w1w33|O(ε)→ 0.

Term 9 is the same as term 8 by symmetry.

Page 118: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

APPENDIX A. CROSSED RANDOM EFFECTS MODEL 110

Term 10:

w21w

23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

∑js

(N•jN•sI{j = s} −N2N•jI{j = s}∑`

N−1`• Z`jZ`s

−N2N•sI{j = s}∑`

N−1`• Z`jZ`s +N2∑i`

N−1i• N−1`• ZijZisZ`jZ`s)

≤ w21w

23

N2∑j N

2•j(∑iN

2i• +

∑j N

2•j)

(∑j

N2•j −N2

∑`j

Z`jN−1`• N•j +N2R2) = w2

1w23O(ε)→ 0.

Thus, Var(QF (b)) converges to 0, and QF (b) converges in probability to 0. By symmetry, the

same result holds for the quadratic form in a in∑`m Z`mλ

2`m. Thus,

∑`m Z`mλ

2`m converges in

probability to 0. The results follow.

Page 119: SCALABLE ESTIMATION AND INFERENCE FOR …statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdfSCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED MODELS WITH CROSSED RANDOM

Appendix B

Linear Mixed Model with Crossed

Random Effects

B.1 Estimator Efficiencies

To compute a lower bound for effRLS, we first transform x into z = V−1/2A x. Then, from (3.10),

effRLS =(zTz)2

(zTV−1/2A VRV

−1/2A z)(zTV

1/2A V −1R V

1/2A z)

.

Scaling $z$ by a nonzero constant does not change $\mathrm{eff}_{\mathrm{RLS}}$. Hence, it suffices to consider unit vectors $u = z/\|z\|$, and rewrite
\[
1/\mathrm{eff}_{\mathrm{RLS}} = \bigl(u^\mathsf{T} V_A^{-1/2} V_R V_A^{-1/2} u\bigr)\bigl(u^\mathsf{T} V_A^{1/2} V_R^{-1} V_A^{1/2} u\bigr) \equiv \bigl(u^\mathsf{T} A u\bigr)\bigl(u^\mathsf{T} A^{-1} u\bigr)
\]
for $A = V_A^{-1/2} V_R V_A^{-1/2}$. We get an upper bound for $1/\mathrm{eff}_{\mathrm{RLS}}$ from the Kantorovich inequality after getting upper and lower bounds on the eigenvalues of
\[
A = V_A^{-1/2}\bigl(V_A + \sigma_B^2 B_R\bigr)V_A^{-1/2} = I_N + \sigma_B^2 V_A^{-1/2} B_R V_A^{-1/2}.
\]
The eigenvalues of $A$ are the eigenvalues of $\sigma_B^2 V_A^{-1/2} B_R V_A^{-1/2}$ plus one.

The matrix $B_C$, and by extension $B_R$, is singular and positive semidefinite, with nonzero eigenvalues $N_{\bullet j}$ for $j = 1, \dots, C$. Also, $V_A$ is symmetric and nonsingular, with eigenvalues $\sigma_E^2$ and $\sigma_E^2 + \sigma_A^2 N_{i\bullet}$ for $i = 1, \dots, R$. Then $V_A^{-1/2}$ is symmetric and nonsingular, with eigenvalues $1/\sqrt{\sigma_E^2}$ and $1/\sqrt{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}$ for $i = 1, \dots, R$.

Therefore, $\sigma_B^2 V_A^{-1/2} B_R V_A^{-1/2}$ is singular and positive semidefinite. Its smallest eigenvalue is zero, and its largest eigenvalue is bounded above by
\[
\bigl\|\sigma_B^2 V_A^{-1/2} B_R V_A^{-1/2}\bigr\|_2 \le \sigma_B^2 \bigl\|V_A^{-1/2}\bigr\|_2^2\,\bigl\|B_R\bigr\|_2 = \frac{\sigma_B^2}{\sigma_E^2}\max_j N_{\bullet j}.
\]

Hence, the smallest eigenvalue of $A$ is 1, and the largest eigenvalue is bounded above by $1 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2$. By the Kantorovich inequality (Theorem 3.3.1), we have
\[
1/\mathrm{eff}_{\mathrm{RLS}} = \bigl(u^\mathsf{T} V_A^{-1/2} V_R V_A^{-1/2} u\bigr)\bigl(u^\mathsf{T} V_A^{1/2} V_R^{-1} V_A^{1/2} u\bigr)
\le \frac{\bigl(2 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2\bigr)^2}{4\bigl(1 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2\bigr)} = \frac{\bigl(2\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j}\bigr)^2}{4\sigma_E^2\bigl(\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j}\bigr)}.
\]
Taking reciprocals gives the desired result. The result for $\mathrm{eff}_{\mathrm{CLS}}$ follows by symmetry.
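As an informal numerical check (not part of the derivation above), the Kantorovich bound can be verified on a small simulated crossed design. The sketch below, in Python with NumPy, uses arbitrary placeholder values for $R$, $C$, and the variance components; it builds $V_A$, $V_R$, and $A = V_A^{-1/2}V_RV_A^{-1/2}$ directly and confirms that $\mathrm{eff}_{\mathrm{RLS}}$, evaluated over a sample of unit vectors, never falls below the lower bound $4\sigma_E^2(\sigma_E^2+\sigma_B^2\max_j N_{\bullet j})/(2\sigma_E^2+\sigma_B^2\max_j N_{\bullet j})^2$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
R, C = 8, 6                          # rows and columns (placeholders)
Z = rng.random((R, C)) < 0.7         # observation pattern Z_ij
rows, cols = np.nonzero(Z)
N = len(rows)
sA2, sB2, sE2 = 1.0, 0.5, 2.0        # placeholder variance components

AR = (rows[:, None] == rows[None, :]).astype(float)  # same-row indicator
BR = (cols[:, None] == cols[None, :]).astype(float)  # same-column indicator
VA = sA2 * AR + sE2 * np.eye(N)
VR = VA + sB2 * BR

# A = VA^{-1/2} VR VA^{-1/2} via the eigendecomposition of VA
w, U = np.linalg.eigh(VA)
VAmh = U @ np.diag(w ** -0.5) @ U.T
Amat = VAmh @ VR @ VAmh
Ainv = np.linalg.inv(Amat)

# eff_RLS over a sample of orthonormal directions u
us = np.linalg.qr(rng.standard_normal((N, N)))[0].T
eff = min(1.0 / ((u @ Amat @ u) * (u @ Ainv @ u)) for u in us)

maxNj = Z.sum(axis=0).max()
bound = 4 * sE2 * (sE2 + sB2 * maxNj) / (2 * sE2 + sB2 * maxNj) ** 2
print(eff >= bound - 1e-12, eff, bound)   # the bound should always hold
\end{verbatim}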

B.2 A Useful Lemma

Several of the proofs for Section 3.4 use the following lemma, which is not given in the main text for brevity. The lemma rewrites $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$ in a useful form.

Lemma B.2.1.
\[
\begin{aligned}
U_a(\hat\beta) &= U_a + \frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta-\beta)\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j-b_{j'})\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij}-e_{ij'}),\\
U_b(\hat\beta) &= U_b + \frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(x_{ij}-x_{i'j})^\mathsf{T}\Bigr)(\hat\beta-\beta)\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(a_i-a_{i'})\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(e_{ij}-e_{i'j}),\quad\text{and}\\
U_e(\hat\beta) &= U_e + \frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(x_{ij}-x_{i'j'})^\mathsf{T}\Bigr)(\hat\beta-\beta)\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(a_i-a_{i'}) + (\hat\beta-\beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(b_j-b_{j'})\\
&\quad + (\hat\beta-\beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(e_{ij}-e_{i'j'}),
\end{aligned}
\]
where for $\eta_{ij} = a_i + b_j + e_{ij}$,
\[
U_a = \sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(\eta_{ij}-\eta_{ij'})^2,\qquad
U_b = \sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(\eta_{ij}-\eta_{i'j})^2,\qquad\text{and}\qquad
U_e = \sum_{iji'j'} Z_{ij}Z_{i'j'}(\eta_{ij}-\eta_{i'j'})^2.
\]

Proof. Straightforward algebra.

Note that the $\eta_{ij}$ exactly follow a two-factor crossed random effects model. Thus, Lemma B.2.1 shows that we can leverage results about $U_a$, $U_b$, and $U_e$ from Section 2.4 to analyze $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$.
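The following short Python sketch illustrates the lemma numerically on simulated data (all sizes and parameters are arbitrary placeholders). It checks that the residuals at the true $\beta$ reproduce the $\eta_{ij}$ exactly, so that $U_a(\beta) = U_a$, and that perturbing $\beta$ changes $U_a$ by an amount that shrinks with the size of the perturbation, as the expansion predicts.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
R, C, p = 6, 5, 3
Z = (rng.random((R, C)) < 0.8).astype(float)
x = rng.standard_normal((R, C, p))           # covariates x_ij
beta = rng.standard_normal(p)
a, b = rng.standard_normal(R), rng.standard_normal(C)
e = rng.standard_normal((R, C))
eta = a[:, None] + b[None, :] + e            # eta_ij = a_i + b_j + e_ij
y = x @ beta + eta

def Ua(resid):
    # U_a = sum_{i,j,j'} N_i.^{-1} Z_ij Z_ij' (r_ij - r_ij')^2
    total = 0.0
    for i in range(R):
        Ni = Z[i].sum()
        if Ni == 0:
            continue
        for j in range(C):
            for jp in range(C):
                total += Z[i, j] * Z[i, jp] \
                         * (resid[i, j] - resid[i, jp]) ** 2 / Ni
    return total

print(np.isclose(Ua(y - x @ beta), Ua(eta)))  # U_a(beta) = U_a exactly
for delta in (0.1, 0.01):                     # error shrinks with perturbation
    print(delta, abs(Ua(y - x @ (beta + delta)) - Ua(eta)))
\end{verbatim}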

B.3 Consistency of Estimates

B.3.1 Proof of Theorem 3.4.1

Let the data be ordered by row and write $Y = X\beta + \varepsilon$, where $\varepsilon$ has mean zero and variance $\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N$. Then $\hat\beta_{\mathrm{OLS}} = \beta + (X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon$. Clearly $\mathbb{E}\bigl((X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon\bigr) = 0$. Now let $w \in \mathbb{R}^d$ be any unit vector. Then
\[
\begin{aligned}
\mathrm{Var}\bigl(w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon\bigr) &= w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}\bigl(\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr)X(X^\mathsf{T}X)^{-1}w\\
&\le \Bigl(\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet} + \sigma_B^2\max_j N_{\bullet j}\Bigr)\, w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}X(X^\mathsf{T}X)^{-1}w\\
&= \frac{1}{N}\Bigl(\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet} + \sigma_B^2\max_j N_{\bullet j}\Bigr)\, w^\mathsf{T}\Bigl(\frac{1}{N}X^\mathsf{T}X\Bigr)^{-1}w\\
&\le \Bigl(\frac{\sigma_E^2}{N} + \epsilon_R\sigma_A^2 + \epsilon_C\sigma_B^2\Bigr)\Big/ I\Bigl(\frac{X^\mathsf{T}X}{N}\Bigr) \to 0.
\end{aligned}
\]
The first inequality follows from the facts that the maximum eigenvalue of $\sigma_E^2 I_N$ is $\sigma_E^2$, the maximum eigenvalue of $\sigma_A^2 A_R$ is $\sigma_A^2\max_i N_{i\bullet}$, and the maximum eigenvalue of $\sigma_B^2 B_R$ is $\sigma_B^2\max_j N_{\bullet j}$. The conclusion now follows from our asymptotic assumptions.
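The eigenvalue bound used in the first inequality is easy to check numerically. A minimal sketch, with placeholder design sizes and variance components:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
R, C, p = 10, 8, 3
Z = rng.random((R, C)) < 0.6
rows, cols = np.nonzero(Z)
N = len(rows)
X = rng.standard_normal((N, p))
sA2, sB2, sE2 = 1.0, 0.5, 2.0

AR = (rows[:, None] == rows[None, :]).astype(float)
BR = (cols[:, None] == cols[None, :]).astype(float)
V = sA2 * AR + sB2 * BR + sE2 * np.eye(N)     # Var(epsilon)

G = np.linalg.inv(X.T @ X)
sandwich = G @ X.T @ V @ X @ G                # Var(beta_OLS - beta)
w = np.ones(p) / np.sqrt(p)                   # any unit vector
lhs = w @ sandwich @ w
lam = sE2 + sA2 * Z.sum(axis=1).max() + sB2 * Z.sum(axis=0).max()
rhs = lam * (w @ G @ w)                       # bound from the proof
print(lhs <= rhs, lhs, rhs)
\end{verbatim}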

B.3.2 Proof of Theorem 3.4.2

In light of Theorem 2.4.3 it suffices to show that $(U_a(\hat\beta)-U_a(\beta))/(N-R)$, $(U_b(\hat\beta)-U_b(\beta))/(N-C)$, and $(U_e(\hat\beta)-U_e(\beta))/N^2$ all converge to zero in probability. Because $\max(R,C)/N < \theta \in (0,1)$ we can replace the denominators $N-R$ and $N-C$ by $N$. Using the expansion in Lemma B.2.1,
\[
\begin{aligned}
\frac{U_a(\hat\beta)-U_a(\beta)}{N} &= \underbrace{\frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\frac{1}{N}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta-\beta)}_{A_1}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{N}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j-b_{j'})}_{A_2}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{N}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij}-e_{ij'})}_{A_3}.
\end{aligned}
\]

We consider Terms A1 through A3 in turn.

A1: The middle factor in Term A1 is no larger than $(4M_\infty/N)\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'} = 4M_\infty = O(1)$. Therefore Term A1 is $O(\|\hat\beta-\beta\|^2)$.

A2: The coefficient of $\hat\beta-\beta$ has mean zero and second moment
\[
\begin{aligned}
&\mathbb{E}\Bigl(\frac{1}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}(b_j-b_{j'})(b_s-b_{s'})\Bigr)\\
&= \frac{1}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}\,\mathbb{E}\bigl((b_j-b_{j'})(b_s-b_{s'})\bigr)\\
&= \frac{\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}\bigl(1_{j=s}-1_{j=s'}-1_{j'=s}+1_{j'=s'}\bigr)\\
&= \frac{4\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rs'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rj}Z_{rs'}(x_{ij}-x_{ij'})(x_{rj}-x_{rs'})^\mathsf{T}.
\end{aligned}
\]
No component in this matrix is larger than
\[
\frac{16M_\infty\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rs'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rj}Z_{rs'} = \frac{16M_\infty\sigma_B^2}{N^2}\sum_{ij}\sum_r Z_{ij}Z_{rj} = \frac{16M_\infty\sigma_B^2}{N^2}\sum_j N_{\bullet j}^2 = O(\varepsilon),
\]
and so Term A2 is $O(\|\hat\beta-\beta\|\varepsilon)$.

A3: As in A2 we find that the coefficient of $\hat\beta-\beta$ has mean zero and second moment
\[
\frac{4\sigma_E^2}{N^2}\sum_{ijj'}\sum_{s'} N_{i\bullet}^{-2}Z_{ij}Z_{ij'}Z_{is'}(x_{ij}-x_{ij'})(x_{ij}-x_{is'})^\mathsf{T}
\]
in which no component is larger than
\[
\frac{16\sigma_E^2 M_\infty}{N^2}\sum_{ijj'}\sum_{s'} N_{i\bullet}^{-2}Z_{ij}Z_{ij'}Z_{is'} = \frac{16\sigma_E^2 M_\infty}{N^2}\sum_{ij}Z_{ij} = \frac{16\sigma_E^2 M_\infty}{N}.
\]
Therefore Term A3 is $O(\|\hat\beta-\beta\|/N)$.

Combining these results, $(U_a(\hat\beta)-U_a(\beta))/N = O(\|\hat\beta-\beta\|(\varepsilon+\|\hat\beta-\beta\|))$. The same argument applies to $(U_b(\hat\beta)-U_b(\beta))/N$. Now we turn to $(U_e(\hat\beta)-U_e(\beta))/N^2$. Using the expansion in Lemma B.2.1,
\[
\begin{aligned}
\frac{U_e(\hat\beta)-U_e(\beta)}{N^2} &= \underbrace{\frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\frac{1}{N^2}\sum_{iji'j'}Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(x_{ij}-x_{i'j'})^\mathsf{T}\Bigr)(\hat\beta-\beta)}_{E_1}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'}Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(a_i-a_{i'})}_{E_2}
+ \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'}Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(b_j-b_{j'})}_{E_3}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'}Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(e_{ij}-e_{i'j'})}_{E_4}.
\end{aligned}
\]

E1: By arguments like the one for A1, we find that E1 is also $O(\|\hat\beta-\beta\|^2)$.

E2: Similarly to A2, the coefficient of $\hat\beta-\beta$ has mean zero and second moment
\[
\mathbb{E}\Bigl(\frac{1}{N^4}\sum_{iji'j'}\sum_{rsr's'} Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}(x_{ij}-x_{i'j'})(x_{rs}-x_{r's'})^\mathsf{T}(a_i-a_{i'})(a_r-a_{r'})\Bigr)
= \frac{4\sigma_A^2}{N^4}\sum_{iji'j'}\sum_{sr's'} Z_{ij}Z_{i'j'}Z_{is}Z_{r's'}(x_{ij}-x_{i'j'})(x_{is}-x_{r's'})^\mathsf{T}
\]
with components no larger than
\[
\frac{16M_\infty\sigma_A^2}{N^4}\sum_{iji'j'}\sum_{sr's'} Z_{ij}Z_{i'j'}Z_{is}Z_{r's'} = \frac{16M_\infty\sigma_A^2}{N^2}\sum_{ij}\sum_s Z_{ij}Z_{is} = \frac{16M_\infty\sigma_A^2}{N^2}\sum_i N_{i\bullet}^2.
\]
Therefore Term E2 is $O_p(\|\hat\beta-\beta\|\varepsilon)$.

E3: Term E3 is also $O_p(\|\hat\beta-\beta\|\varepsilon)$, by the argument used for Term E2.

E4: Following arguments similar to the preceding ones, the coefficient of $\hat\beta-\beta$ has mean zero and second moment
\[
\frac{4\sigma_E^2}{N^4}\sum_{iji'j'}\sum_{r's'} Z_{ij}Z_{i'j'}Z_{r's'}(x_{ij}-x_{i'j'})(x_{ij}-x_{r's'})^\mathsf{T} = O\Bigl(\frac{16\sigma_E^2 M_\infty}{N}\Bigr)
\]
and so Term E4 is $O(\|\hat\beta-\beta\|/N)$. Combining these results we have consistency for the variance components. The error in replacing $\beta$ by $\hat\beta$ changes the variance component estimates by $O(\|\hat\beta-\beta\|(\|\hat\beta-\beta\|+\varepsilon))$.

B.3.3 Proof of Theorem 3.4.3

Suppose that the data are ordered by rows. Then we may write $Y = X\beta + \eta$, where $\eta$ has mean zero and variance $V_R = \sigma_E^2 I_N + \sigma_A^2 A_R + \sigma_B^2 B_R$. Now
\[
\hat\beta_{\mathrm{RLS}} = \beta + (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta
\]
where $V_A = \sigma_A^2 A_R + \sigma_E^2 I_N$. The matrix $X$ is not random, and both $\hat\sigma_A^2 \xrightarrow{p} \sigma_A^2$ and $\hat\sigma_E^2 \xrightarrow{p} \sigma_E^2$, so it suffices to show that $\hat\varepsilon \equiv (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta \xrightarrow{p} 0$. Write $\eta = a + b + e$ where these are the random effects in the row order. We can easily handle the effect of $a + e$, via
\[
\mathrm{Var}\bigl((X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}(a+e)\bigr) = (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}X(X^\mathsf{T}V_A^{-1}X)^{-1} = (X^\mathsf{T}V_A^{-1}X)^{-1}.
\]
The largest eigenvalue of $V_A$ is $O(\max_i N_{i\bullet})$, and so this quantity is $O(\epsilon_R) \to 0$.

We will need a sharper analysis of $(X^\mathsf{T}V_A^{-1}X)^{-1}$ to control the contribution of the column random effects $b$ to the row-weighted GLS estimate $\hat\beta_{\mathrm{RLS}}$. Furthermore, their contribution to the intercept term in $\beta$ motivates centering the $x_{ij}$. For a nonrandom invertible matrix $K \in \mathbb{R}^{p\times p}$, we may replace $X$ by $X^* = XK$ and $\beta$ by $\beta^* = K^{-1}\beta$. Now $\hat\beta_{\mathrm{RLS}} = K\hat\beta^*_{\mathrm{RLS}}$ and so $\mathrm{Var}(\hat\beta_{\mathrm{RLS}}) = K\,\mathrm{Var}(\hat\beta^*_{\mathrm{RLS}})K^\mathsf{T}$. Our matrix $K$ will be uniformly bounded as $N \to \infty$ and independent of $\eta$. Then $\mathrm{Var}(\hat\beta^*_{\mathrm{RLS}}) \to 0$ implies $\mathrm{Var}(\hat\beta_{\mathrm{RLS}}) \to 0$. The matrix we choose is
\[
K = \begin{pmatrix} 1 & -k_2 & \cdots & -k_p\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{pmatrix}
\]
with values $k_t$ for $t = 2,\dots,p$ given below. We have $x^*_{ij} = (1,\, x_{ij,2}-k_2,\, x_{ij,3}-k_3,\, \dots,\, x_{ij,p}-k_p)$.

We begin by noting that in the row ordering,
\[
V_A^{-1} = \frac{1}{\sigma_E^2}\,\mathrm{diag}\Bigl(I_{N_{i\bullet}} - N_{i\bullet}^{-1}\gamma_i 1_{N_{i\bullet}}1_{N_{i\bullet}}^\mathsf{T}\Bigr)
\]
where there are $R$ diagonal blocks of size $N_{i\bullet}\times N_{i\bullet}$ and $\gamma_i = N_{i\bullet}\sigma_A^2/(\sigma_E^2 + N_{i\bullet}\sigma_A^2)$. Then
\[
\sigma_E^2 X^\mathsf{T}V_A^{-1}X = X^\mathsf{T}\Bigl(X - \mathrm{col}\bigl(\gamma_i 1_{N_{i\bullet}}\bar{x}_{i\bullet}^\mathsf{T}\bigr)\Bigr)
\]
where $\mathrm{col}(\cdot) \in \mathbb{R}^{N\times p}$ is a column of $R$ blocks of sizes $N_{i\bullet}\times p$. Continuing,
\[
\begin{aligned}
\sigma_E^2 X^\mathsf{T}V_A^{-1}X &= \sum_i\sum_j Z_{ij}\bigl(x_{ij}x_{ij}^\mathsf{T} - x_{ij}\bar{x}_{i\bullet}^\mathsf{T}\gamma_i\bigr)\\
&= X^\mathsf{T}X - \sum_i\gamma_i\sum_j Z_{ij}x_{ij}\bar{x}_{i\bullet}^\mathsf{T}\\
&= X^\mathsf{T}X - \sum_i N_{i\bullet}\gamma_i\bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}\\
&= \sum_{ij}Z_{ij}(x_{ij}-\bar{x}_{i\bullet})(x_{ij}-\bar{x}_{i\bullet})^\mathsf{T} + \sum_i N_{i\bullet}(1-\gamma_i)\bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}. \qquad\text{(B.1)}
\end{aligned}
\]
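Both the block form of $V_A^{-1}$ and identity (B.1) are convenient to verify numerically. The sketch below uses placeholder row counts and variance components:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
R, p = 5, 3
Ni = rng.integers(2, 6, size=R)              # row counts N_i.
sA2, sE2 = 1.0, 2.0
rows = np.repeat(np.arange(R), Ni)
N = len(rows)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, p - 1))])
VA = sA2 * (rows[:, None] == rows[None, :]) + sE2 * np.eye(N)

# Block form: VA^{-1} = (1/sE2) diag(I - N_i.^{-1} gamma_i 1 1^T)
gamma = Ni * sA2 / (sE2 + Ni * sA2)
VAinv = np.zeros((N, N))
start = 0
for n, g in zip(Ni, gamma):
    VAinv[start:start + n, start:start + n] = \
        (np.eye(n) - g / n * np.ones((n, n))) / sE2
    start += n
print(np.allclose(VAinv, np.linalg.inv(VA)))

# (B.1): sE2 X^T VA^{-1} X = within-row scatter + shrunken row means
xbar = np.array([X[rows == i].mean(axis=0) for i in range(R)])
within = sum((X[rows == i] - xbar[i]).T @ (X[rows == i] - xbar[i])
             for i in range(R))
means = sum(n * (1 - g) * np.outer(m, m)
            for n, g, m in zip(Ni, gamma, xbar))
print(np.allclose(sE2 * X.T @ VAinv @ X, within + means))
\end{verbatim}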

The lower right $(p-1)\times(p-1)$ submatrix of the first term in (B.1) grows proportionally to $N$. We will see that the upper left $1\times 1$ submatrix of the second term grows at least as fast as $R$. We choose our matrix $K$ to zero out all of the top row and leftmost column of $X^{*\mathsf{T}}V_A^{-1}X^*$ except the upper left entry. To this end, define
\[
k_t = \frac{\sum_i N_{i\bullet}(1-\gamma_i)\bar{x}_{i\bullet,t}}{\sum_i N_{i\bullet}(1-\gamma_i)} = \frac{\sum_i \bar{x}_{i\bullet,t}\, N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2+\sigma_E^2)}{\sum_i N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2+\sigma_E^2)},\qquad t = 2,\dots,p.
\]

Now from (B.1),
\[
\sigma_E^2 X^{*\mathsf{T}}V_A^{-1}X^* = \begin{pmatrix}\sum_i N_{i\bullet}(1-\gamma_i) & 0_{p-1}^\mathsf{T}\\ 0_{p-1} & V\end{pmatrix}
\]
where $V$ is the lower right $(p-1)\times(p-1)$ submatrix of $\sum_{ij}Z_{ij}(x_{ij}-\bar{x}_{i\bullet})(x_{ij}-\bar{x}_{i\bullet})^\mathsf{T}$ plus a positive semidefinite matrix. Therefore
\[
\bigl(X^{*\mathsf{T}}V_A^{-1}X^*\bigr)^{-1} = \sigma_E^2\begin{pmatrix}1\big/\sum_i N_{i\bullet}(1-\gamma_i) & 0_{p-1}^\mathsf{T}\\ 0_{p-1} & V^{-1}\end{pmatrix}.
\]

Continuing the derivation,
\[
\mathrm{Var}\bigl((X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) = (X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}\bigl(X^{*\mathsf{T}}V_A^{-1}B_RV_A^{-1}X^*\bigr)(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}\sigma_B^2.
\]
The eigenvalues of $V_A^{-1}$ are all smaller than 1, so in the ordering of positive semidefinite matrices,
\[
X^{*\mathsf{T}}V_A^{-1}B_RV_A^{-1}X^* \le X^{*\mathsf{T}}B_RX^* = \sum_j N_{\bullet j}^2\,\bar{x}^*_{\bullet j}\bar{x}^{*\mathsf{T}}_{\bullet j}.
\]
Now for a unit vector $w \in \mathbb{R}^p$ with $w_1 = 0$ we have $\|(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}w\| \le cN^{-1}\sigma_E^2$ because the sample covariance of the non-intercept $x$'s grows (at least) proportionally to $N$. The $x_{ij}$ are bounded, and so therefore the $\bar{x}^*_{\bullet j}$ are also bounded. So now
\[
\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) = O\Bigl(N^{-2}\sum_j N_{\bullet j}^2\Bigr) \to 0.
\]

Next we consider $w = (1,0,\dots,0)$. For this $w$, the only nonzero entry of $(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}w$ is
\[
\frac{\sigma_E^2}{\sum_i N_{i\bullet}(1-\gamma_i)} = \frac{\sigma_E^2}{\sum_i N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2+\sigma_E^2)} \le \frac{1}{R\sigma_A^2}.
\]

The matrix $B_R$ has $C$ blocks of the form $1_{N_{\bullet j}}1_{N_{\bullet j}}^\mathsf{T}$ permuted into the row ordering. We may write $B_R = Z_bZ_b^\mathsf{T}$ where $Z_b \in \{0,1\}^{N\times C}$. The row of $Z_b$ corresponding to observation $ij$ has only one 1 in it, at position $j$. Now
\[
V_A^{-1}X^* = \begin{pmatrix}(1-\gamma_1)1_{N_{1\bullet}} & 0 & 0 & \cdots & 0\\ (1-\gamma_2)1_{N_{2\bullet}} & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ (1-\gamma_R)1_{N_{R\bullet}} & 0 & 0 & \cdots & 0\end{pmatrix} \in \mathbb{R}^{N\times p}
\]
and the $j$th row of $Z_b^\mathsf{T}V_A^{-1}X^* \in \mathbb{R}^{C\times p}$ is $\bigl(\sum_i Z_{ij}(1-\gamma_i),\,0,\,\dots,\,0\bigr) \in \mathbb{R}^p$. Then the only nonzero element of $X^{*\mathsf{T}}V_A^{-1}B_RV_A^{-1}X^*$ is the upper left one, and it equals $\sum_{ijr} Z_{ij}Z_{rj}(1-\gamma_i)(1-\gamma_r)$.

Therefore, for $w = (1,0,\dots,0)$,
\[
\begin{aligned}
\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) &\le \frac{1}{R^2\sigma_A^4}\sum_{ijr} Z_{ij}Z_{rj}(1-\gamma_i)(1-\gamma_r)\\
&= \frac{1}{R^2\sigma_A^4}\sum_{ir}(ZZ^\mathsf{T})_{ir}\,\frac{\sigma_E^2}{\sigma_E^2+N_{i\bullet}\sigma_A^2}\cdot\frac{\sigma_E^2}{\sigma_E^2+N_{r\bullet}\sigma_A^2}\\
&\le \frac{\sigma_E^4}{R^2\sigma_A^8}\sum_{ir}(ZZ^\mathsf{T})_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1}
\end{aligned}
\]
which vanishes by condition (3.14). A general unit vector $w$ can be written as a linear combination of unit vectors with $w_1 = 0$ and $w_1 = 1$, and so $\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) \to 0$. Because $K$ is bounded, $\mathrm{Var}\bigl(w^\mathsf{T}(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}b\bigr) \to 0$ as well. This completes the proof.

B.4 Asymptotic Normality of Estimates

B.4.1 Proof of Theorem 3.4.4

We will use the following central limit theorem for a triangular array of weighted sums of IID random variables.

Theorem B.4.1. For integers $i$ and $n$ with $1 \le i \le n$, let $\varepsilon_{n,i}$ be a triangular array of random variables that are IID within each row, with mean $\mu_n$ and variance $\sigma_n^2 \in (0,\infty)$. Let $c_{n,i}$ be a triangular array of finite constants, not all zero within each row. Define
\[
T_n = \frac{1}{B_n}\sum_{i=1}^n c_{n,i}(\varepsilon_{n,i}-\mu_n),\qquad\text{where } B_n^2 = \sigma_n^2\sum_{i=1}^n c_{n,i}^2.
\]
If $\max_{1\le i\le n} c_{n,i}^2\big/\sum_{i=1}^n c_{n,i}^2 \to 0$ as $n \to \infty$, then $T_n \xrightarrow{d} \mathcal{N}(0,1)$.

Proof. This is from Theorem 2.2 of [4].

Our use case is for $\mu_n = 0$ and $\sigma_n$ constant in $n$. That case was also handled by Theorem 1 of [26], who gives a converse.
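A brief Monte Carlo illustration of Theorem B.4.1 (with arbitrary placeholder weights and a non-Gaussian noise distribution): once no single weight dominates, the standardized weighted sum is close to $\mathcal{N}(0,1)$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 5000
c = 1.0 + rng.random(n)                  # weights; max c_i^2 / sum c_i^2 -> 0
eps = rng.exponential(1.0, size=(reps, n)) - 1.0   # IID, mean 0, variance 1
Tn = (eps @ c) / np.sqrt(np.sum(c ** 2))  # T_n, one value per replicate
print(Tn.mean(), Tn.var())                # approximately 0 and 1
print(np.mean(Tn < -1.96))                # approximately Phi(-1.96) = 0.025
\end{verbatim}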

From Section B.3.3, $\hat\beta_{\mathrm{RLS}} - \beta = (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta$, where $\eta_{ij} = a_i + b_j + e_{ij}$. We will make use of the sums $\eta_{i\bullet} = \sum_j Z_{ij}\eta_{ij}$ and $X_{i\bullet} = \sum_j Z_{ij}x_{ij} \in \mathbb{R}^p$, as well as the corresponding column sums. The matrix $(X^\mathsf{T}V_A^{-1}X)^{-1}$ is not random. We will establish a central limit theorem for $X^\mathsf{T}V_A^{-1}\eta$.

Consider $w^\mathsf{T}X^\mathsf{T}V_A^{-1}\eta$ for a unit vector $w \in \mathbb{R}^p$. By the Woodbury formula,
\[
\begin{aligned}
w^\mathsf{T}X^\mathsf{T}V_A^{-1}\eta &= \frac{w^\mathsf{T}X^\mathsf{T}\eta}{\sigma_E^2} - \frac{\sigma_A^2}{\sigma_E^2}\sum_i\frac{w^\mathsf{T}X_{i\bullet}\eta_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}\\
&= \frac{1}{\sigma_E^2}\Bigl[\sum_i a_iw^\mathsf{T}X_{i\bullet} + \sum_j b_jw^\mathsf{T}X_{\bullet j} + \sum_{ij}Z_{ij}e_{ij}w^\mathsf{T}x_{ij}\Bigr] - \frac{\sigma_A^2}{\sigma_E^2}\sum_i\frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}\Bigl(N_{i\bullet}a_i + \sum_j Z_{ij}b_j + \sum_j Z_{ij}e_{ij}\Bigr)\\
&= \underbrace{\sum_i\frac{a_iw^\mathsf{T}X_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}}_{\text{Term R1}}
+ \underbrace{\sum_j\frac{b_j}{\sigma_E^2}\Bigl(w^\mathsf{T}X_{\bullet j} - \sigma_A^2\sum_i\frac{Z_{ij}w^\mathsf{T}X_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}\Bigr)}_{\text{Term R2}}
+ \underbrace{\sum_{ij}\frac{Z_{ij}e_{ij}}{\sigma_E^2}\Bigl(w^\mathsf{T}x_{ij} - \sigma_A^2\frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}\Bigr)}_{\text{Term R3}}.
\end{aligned}
\]

Terms R1, R2 and R3 are independent. We will show CLTs for each of them individually.
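The first line of the decomposition, the Woodbury step, can be checked directly. A minimal sketch with placeholder inputs:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
R, C, p = 7, 6, 3
Z = rng.random((R, C)) < 0.7
rows, cols = np.nonzero(Z)
N = len(rows)
X = rng.standard_normal((N, p))
sA2, sE2 = 1.3, 0.8
VA = sA2 * (rows[:, None] == rows[None, :]) + sE2 * np.eye(N)
eta = rng.standard_normal(N)
w = rng.standard_normal(p)
w /= np.linalg.norm(w)

lhs = w @ X.T @ np.linalg.solve(VA, eta)      # w^T X^T VA^{-1} eta

Ni = Z.sum(axis=1)                            # N_i.
Xi = np.array([X[rows == i].sum(axis=0) for i in range(R)])   # X_i.
etai = np.array([eta[rows == i].sum() for i in range(R)])     # eta_i.
rhs = (w @ X.T @ eta) / sE2 \
      - (sA2 / sE2) * np.sum((Xi @ w) * etai / (sE2 + sA2 * Ni))
print(np.isclose(lhs, rhs))
\end{verbatim}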

R1: We use Theorem B.4.1 with random variables $a_i$ and weights $c_i = w^\mathsf{T}X_{i\bullet}/(\sigma_E^2+\sigma_A^2 N_{i\bullet})$. Now $\max_i c_i^2 \le M_\infty^2$ and
\[
\sum_i c_i^2 = \sum_i\Bigl(\frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2+\sigma_A^2 N_{i\bullet}}\Bigr)^2 \ge \sum_i\Bigl(\frac{w^\mathsf{T}\bar{x}_{i\bullet}}{\sigma_E^2+\sigma_A^2}\Bigr)^2 \ge \bigl(\sigma_E^2+\sigma_A^2\bigr)^{-2}\, I\Bigl(\sum_i\bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}\Bigr) \to \infty.
\]
Therefore Term R1 is asymptotically normally distributed.


R2: This term is a weighted sum of independent random variables $b_j/\sigma_E^2$ with weights $c_j = w^\mathsf{T}\sum_i Z_{ij}(x_{ij}-\gamma_i\bar{x}_{i\bullet})$, where $\gamma_i = \sigma_A^2/(\sigma_A^2+\sigma_E^2/N_{i\bullet})$. Therefore $c_j = N_{\bullet j}w^\mathsf{T}(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})$ for the second order averages $\tilde{x}_{\bullet j}$ given by (3.15).

As in the proof of Theorem 3.4.3 from Section B.3.3, we employ a bounded invertible centering matrix $K = \bigl(\begin{smallmatrix}1 & -k^\mathsf{T}\\ 0 & I_{p-1}\end{smallmatrix}\bigr)$, not necessarily the same one as there. We will show that $K\sum_j N_{\bullet j}b_j(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})$ is asymptotically Gaussian, and then so is $\sum_j N_{\bullet j}b_j(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})$. Let $c^*_j = w^\mathsf{T}K N_{\bullet j}(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})$. Then
\[
\sum_j c_j^{*2} = w^\mathsf{T}\sum_j N_{\bullet j}^2 K(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j})^\mathsf{T}K^\mathsf{T}w.
\]
For $2 \le t \le p$ let
\[
k_t = \sum_j N_{\bullet j}^2(\bar{x}_{\bullet j,t}-\tilde{x}_{\bullet j,t})\Big/\sum_j N_{\bullet j}^2
\]
and define $k^* = (0,k_2,\dots,k_p)^\mathsf{T}$. Then $\sum_j N_{\bullet j}^2 K(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j}-k^*)(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j}-k^*)^\mathsf{T}K^\mathsf{T}$ is block diagonal with an upper $1\times 1$ block and a lower $(p-1)\times(p-1)$ block.

First suppose that $w = (w_1,w_2,\dots,w_p)$ is a unit vector with $|w_1| \ne 1$, and let $\|w\|_{-1}^2$ be the squared norm of $w$ excluding $w_1$. Then,
\[
\sum_j c_j^{*2} \ge \|w\|_{-1}^2\, I_0\Bigl(\sum_j N_{\bullet j}^2(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j}-k^*)(\bar{x}_{\bullet j}-\tilde{x}_{\bullet j}-k^*)^\mathsf{T}\Bigr)
\]
which diverges faster than $\max_j N_{\bullet j}^2$ by hypothesis. It remains to consider $w^\mathsf{T} = (\pm 1,0,\dots,0)$. For this vector $c^*_j = c_j = \sum_i Z_{ij}(1-\gamma_i) = \sum_i Z_{ij}\sigma_E^2/(\sigma_E^2+N_{i\bullet}\sigma_A^2)$ and $\max_j c_j^2\big/\sum_j c_j^2 \to 0$ by hypothesis. Therefore Term R2 is asymptotically normally distributed.

R3: This term is a weighted sum of IID random variables $e_{ij}/\sigma_E^2$ with weights $c_{ij} = Z_{ij}w^\mathsf{T}(x_{ij}-\gamma_i\bar{x}_{i\bullet})$. As in paragraph R2, we employ a bounded invertible centering matrix. Then for a unit vector $w \ne (\pm 1,0,\dots,0)^\mathsf{T}$,
\[
\sum_{ij} c_{ij}^{*2} \ge \|w\|_{-1}^2\, I_0\Bigl(\sum_{ij}Z_{ij}(x_{ij}-\gamma_i\bar{x}_{i\bullet})(x_{ij}-\gamma_i\bar{x}_{i\bullet})^\mathsf{T}\Bigr) \ge \|w\|_{-1}^2\, I_0\Bigl(\sum_{ij}Z_{ij}(x_{ij}-\bar{x}_{i\bullet})(x_{ij}-\bar{x}_{i\bullet})^\mathsf{T}\Bigr)
\]
which, by hypothesis, diverges to infinity, while $\max_{ij} c_{ij}^{*2} = O(1)$. The case $w = (\pm 1,0,\dots,0)^\mathsf{T}$ is handled by one of the assumptions in the theorem.

All three terms have asymptotically normal distributions with mean zero, and they are independent. Therefore, $(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta$ is asymptotically Gaussian with mean zero and variance $(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}V_RV_A^{-1}X(X^\mathsf{T}V_A^{-1}X)^{-1}$.
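In practice this sandwich variance is computed with plug-in variance component estimates. A minimal sketch of the computation, with placeholder inputs throughout:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
R, C, p = 8, 7, 3
Z = rng.random((R, C)) < 0.7
rows, cols = np.nonzero(Z)
N = len(rows)
X = rng.standard_normal((N, p))
sA2, sB2, sE2 = 1.0, 0.6, 1.5

AR = (rows[:, None] == rows[None, :]).astype(float)
BR = (cols[:, None] == cols[None, :]).astype(float)
VA = sA2 * AR + sE2 * np.eye(N)
VR = VA + sB2 * BR

H = np.linalg.inv(X.T @ np.linalg.solve(VA, X))   # (X^T VA^{-1} X)^{-1}
M = X.T @ np.linalg.solve(VA, VR @ np.linalg.solve(VA, X))
cov_rls = H @ M @ H                               # sandwich covariance
print(np.sqrt(np.diag(cov_rls)))                  # standard errors
\end{verbatim}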


B.4.2 Proof of Lemma 3.4.1

We mimic the argument of Section B.3.2. First, we use Lemma B.2.1:
\[
\begin{aligned}
\frac{U_a(\hat\beta)-U_a}{\sqrt{\sum_j N_{\bullet j}^2}} &= \underbrace{\frac{1}{2}(\hat\beta-\beta)^\mathsf{T}\Bigl(\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta-\beta)}_{A_2}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j-b_{j'})}_{A_3}\\
&\quad + \underbrace{(\hat\beta-\beta)^\mathsf{T}\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij}-e_{ij'})}_{A_4}.
\end{aligned}
\]

These terms are the same ones we considered in Section B.3.2, except that they are now normalized by $\sqrt{\sum_j N_{\bullet j}^2}$ instead of by $N$. The A2 term was of the largest magnitude there. It is also the largest here, and so we show that Term A2 is $o_p(1)$.

The middle factor in Term A2 has entries no larger than
\[
\frac{4M_\infty}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'} = \frac{4M_\infty N}{\sqrt{\sum_j N_{\bullet j}^2}}.
\]
Also $\|\hat\beta-\beta\|^2 = O(\epsilon_R+\epsilon_C)$, so this combined with the growth condition (2.31) makes this term $o_p(1)$.

It follows that $(U_a(\hat\beta)-U_a)/\sqrt{\sum_j N_{\bullet j}^2} \xrightarrow{p} 0$, and similarly $(U_b(\hat\beta)-U_b)/\sqrt{\sum_i N_{i\bullet}^2} \xrightarrow{p} 0$. Now we expand
\[
\frac{U_e(\hat\beta)-U_e}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}
\]
into terms just like those in Section B.3.2, but with a new denominator. As before, Term E2 is of larger order than all the others. The magnitude of this term with its new normalization is
\[
\frac{M_\infty\|\hat\beta-\beta\|^2 N}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}} = \frac{O_p\bigl(\max_i N_{i\bullet}+\max_j N_{\bullet j}\bigr)}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}} = o_p(1).
\]
Therefore $(U_e(\hat\beta)-U_e)\big/\bigl[N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}\bigr] \xrightarrow{p} 0$.


B.4.3 Proof of Theorem 3.4.5

By Slutsky's Theorem, Theorem 2.4.6, and Lemma 3.4.1, $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$ are asymptotically normally distributed with mean
\[
\begin{pmatrix}
(\sigma_B^2+\sigma_E^2)(N-R)\big/\sqrt{\sum_j N_{\bullet j}^2}\\[4pt]
(\sigma_A^2+\sigma_E^2)(N-C)\big/\sqrt{\sum_i N_{i\bullet}^2}\\[4pt]
\Bigl(\sigma_A^2\bigl(N^2-\sum_i N_{i\bullet}^2\bigr)+\sigma_B^2\bigl(N^2-\sum_j N_{\bullet j}^2\bigr)+\sigma_E^2\bigl(N^2-N\bigr)\Bigr)\Big/\Bigl(N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}\Bigr)
\end{pmatrix}
\]
and approximate covariance
\[
\begin{pmatrix}
\sigma_B^4(\kappa_B+2) & \dfrac{\sigma_E^4(\kappa_E+2)N}{\sqrt{\sum_i N_{i\bullet}^2\sum_j N_{\bullet j}^2}} & \dfrac{\sigma_B^4(\kappa_B+2)\sum_j N_{\bullet j}^2}{\sqrt{\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}}\\[10pt]
* & \sigma_A^4(\kappa_A+2) & \dfrac{\sigma_A^4(\kappa_A+2)\sum_i N_{i\bullet}^2}{\sqrt{\sum_i N_{i\bullet}^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}}\\[10pt]
* & * & \dfrac{\sigma_A^4(\kappa_A+2)\sum_i N_{i\bullet}^2+\sigma_B^4(\kappa_B+2)\sum_j N_{\bullet j}^2+\sigma_E^4(\kappa_E+2)N}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}
\end{pmatrix}.
\]

Using (3.18), we see that $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ are asymptotically jointly normal. With some algebra, we see that the mean and approximate covariance are
\[
\begin{pmatrix}\sigma_A^2\\ \sigma_B^2\\ \sigma_E^2\end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix}
\sigma_A^4(\kappa_A+2)\dfrac{\sum_i N_{i\bullet}^2}{N^2} & \sigma_E^4(\kappa_E+2)\dfrac{1}{N} & -\sigma_E^4(\kappa_E+2)\dfrac{1}{N}\\[8pt]
* & \sigma_B^4(\kappa_B+2)\dfrac{\sum_j N_{\bullet j}^2}{N^2} & -\sigma_E^4(\kappa_E+2)\dfrac{1}{N}\\[8pt]
* & * & \sigma_E^4(\kappa_E+2)\dfrac{1}{N}
\end{pmatrix}.
\]
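This covariance is straightforward to use as a plug-in formula for standard errors of the variance component estimates. A minimal sketch (the counts, estimates, and kurtoses below are hypothetical placeholders):

\begin{verbatim}
import numpy as np

def vc_covariance(Ni, Nj, sA2, sB2, sE2, kA, kB, kE):
    """Approximate covariance of (sigma_A^2, sigma_B^2, sigma_E^2) estimates."""
    N = Ni.sum()                     # equals Nj.sum(), the total sample size
    cEE = sE2 ** 2 * (kE + 2) / N
    return np.array([
        [sA2 ** 2 * (kA + 2) * (Ni ** 2).sum() / N ** 2,  cEE, -cEE],
        [cEE,  sB2 ** 2 * (kB + 2) * (Nj ** 2).sum() / N ** 2, -cEE],
        [-cEE, -cEE,  cEE],
    ])

Ni = np.array([50, 40, 30, 80])      # hypothetical row counts N_i.
Nj = np.array([60, 70, 30, 40])      # hypothetical column counts N_.j
cov = vc_covariance(Ni, Nj, 1.0, 0.5, 2.0, kA=0.0, kB=0.0, kE=0.0)
print(np.sqrt(np.diag(cov)))         # approximate standard errors
\end{verbatim}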


Bibliography

[1] Stitch Fix, Inc. https://www.stitchfix.com/. Accessed: 2017-04-08.

[2] Apache Software Foundation. Hadoop. https://hadoop.apache.org, March 2017. Version 2.8.0.

[3] Apache Software Foundation. Apache Spark. https://spark.apache.org/, May 2017. Version 2.1.1.

[4] Subhash C. Bagui, Dulal K. Bhaumik, and K. L. Mehra. A few counter examples useful in teaching central limit theorems. The American Statistician, 67(1):49–56, 2013.

[5] Douglas Bates. Computational methods for mixed models. Technical report, Department of Statistics, University of Wisconsin–Madison, November 2014. https://cran.r-project.org/web/packages/lme4/vignettes/Theory.pdf.

[6] Douglas Bates. Linear mixed-effects models in Julia. https://github.com/dmbates/MixedModels.jl, 2016.

[7] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.

[8] James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop 2007, 2007.

[9] Andreas Buja, Trevor Hastie, and Robert Tibshirani. Linear smoothers and additive models. The Annals of Statistics, 17(2):453–510, 1989.

[10] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3):242–247, 1983.

[11] D. Clayton and J. Rasbash. Estimation in large crossed random-effect models by data augmentation. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(3):425–436, 1999.


[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[13] Katelyn Gao and Art Owen. Efficient moment calculations for variance components in large unbalanced crossed random effects models. Electronic Journal of Statistics, 11(1):1235–1296, 2017.

[14] Katelyn Gao and Art B. Owen. Estimation and inference for very large linear mixed effects models. arXiv e-prints, 2016. http://arxiv.org/abs/1610.08088v2.

[15] Andrew Gelman, David A. Van Dyk, Zaiying Huang, and John W. Boscardin. Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics, 17(1), 2012.

[16] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[17] H.-U. Graser, S. P. Smith, and B. Tier. A derivative-free approach for estimating variance components in animal models by restricted maximum likelihood. Journal of Animal Science, 64(5):1362–1370, 1987.

[18] Zhonghua Gu. Model diagnostics for generalized linear mixed models. PhD thesis, University of California, Davis, 2008.

[19] William W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989.

[20] Martin Hairer, Andrew M. Stuart, and Sebastian J. Vollmer. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490, 2014.

[21] Peter Hall and Christopher C. Heyde. Martingale limit theory and its application. Academic Press, 1980.

[22] Herman O. Hartley and Jon N. K. Rao. Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54(1-2):93–108, 1967.

[23] Charles R. Henderson. Estimation of variance and covariance components. Biometrics, 9(2):226–252, 1953.

[24] Jiming Jiang. Consistent estimators in generalized linear mixed models. Journal of the American Statistical Association, 93(442):720–729, 1998.


[25] Fredrik Johansson. mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.14), 2015. http://mpmath.org.

[26] Peter Kevei. A note on asymptotics of linear combinations of iid random variables. Periodica Mathematica Hungarica, 60(1):25–36, 2010.

[27] Davar Khoshnevisan. Quadratic forms, lecture notes for Math 6010-1. http://www.math.utah.edu/~davar/math6010/2014/QuadraticForms.pdf, 2014. Accessed: 2017-05-30.

[28] Nan Laird, Nicholas Lange, and Daniel Stram. Maximum likelihood computations with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association, 82(397):97–105, 1987.

[29] Last.fm. Last.fm dataset – 360K users. http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html, 2010. http://www.last.fm/.

[30] Paul J. Lavrakas. Encyclopedia of Survey Research Methods: A-M, volume 1. Sage, 2008.

[31] Mary J. Lindstrom and Douglas M. Bates. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014–1022, 1988.

[32] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer New York, 2004.

[33] James D. Malley. Optimal unbiased estimation of variance components, volume 39 of Lecture Notes in Statistics. Springer Verlag, Berlin, 1986.

[34] Albert W. Marshall and Ingram Olkin. Matrix versions of the Cauchy and Kantorovich inequalities. Aequationes Mathematicae, 40(1):89–93, 1990.

[35] Peter McCullagh. Resampling and exchangeable arrays. Bernoulli, 6(2):285–301, 2000.

[36] Art B. Owen. The pigeonhole bootstrap. The Annals of Applied Statistics, 1(2):386–411, 2007.

[37] Art B. Owen and Dean Eckles. Bootstrapping data arrays of arbitrary order. The Annals of Applied Statistics, 6(3):895–927, 2012.

[38] Philippe Pebay. Formulas for robust, one-pass parallel computation of covariances and arbitrary-order statistical moments. Technical Report SAND2008-6212, Sandia National Laboratories, 2008.

[39] Michael J. D. Powell. The BOBYQA algorithm for bound constrained optimization without derivatives. Technical Report NA06, DAMTP, University of Cambridge, 2009.


[40] Stephen W. Raudenbush. A crossed random effects model for unbalanced data with applications in cross-sectional and longitudinal research. Journal of Educational and Behavioral Statistics, 18(4):321–349, 1993.

[41] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.

[42] Gareth O. Roberts and Sujit K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society. Series B (Methodological), 59(2):291–317, 1997.

[43] Shayle R. Searle. Another look at Henderson's methods of estimating variance components. Biometrics, 24(4):749–787, 1968.

[44] Shayle R. Searle, George Casella, and Charles E. McCulloch. Variance components. John Wiley & Sons, 2006.

[45] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Wiley, 2009.

[46] Tom A. B. Snijders. Multilevel analysis. In Miodrag Lovric, editor, International Encyclopedia of Statistical Science, pages 879–882. Springer Berlin Heidelberg, 2011.

[47] David A. Van Dyk and Xiao-Li Meng. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):1–50, 2001.

[48] Matthew Wall. Big Data: Are you ready for blast-off? http://www.bbc.com/news/business-26383058, March 2014. Accessed: 2017-04-08.

[49] Yahoo!-Webscope. Dataset ydata-ymovies-user-movie-ratings-train-v1_0, 2015. http://research.yahoo.com/Academic_Relations.

[50] Yahoo!-Webscope. Dataset ydata-ymusic-rating-study-v1_0-train, 2015. http://research.yahoo.com/Academic_Relations.

[51] Yaming Yu and Xiao-Li Meng. To center or not to center: That is not the question – an ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.