


© 2018 Royal Statistical Society 1369–7412/19/81163

J. R. Statist. Soc. B (2019) 81, Part 1, pp. 163–183

Deterministic parallel analysis: an improved method for selecting factors and principal components

Edgar Dobriban

University of Pennsylvania, Philadelphia, USA

and Art B. Owen

Stanford University, USA

[Received November 2017. Final revision October 2018]

Summary. Factor analysis and principal component analysis are used in many application areas. The first step, choosing the number of components, remains a serious challenge. Our work proposes improved methods for this important problem. One of the most popular state of the art methods is parallel analysis (PA), which compares the observed factor strengths with simulated strengths under a noise-only model. The paper proposes improvements to PA. We first derandomize it, proposing deterministic PA, which is faster and more reproducible than PA. Both PA and deterministic PA are prone to a shadowing phenomenon in which a strong factor makes it difficult to detect smaller but more interesting factors. We propose deflation to counter shadowing. We also propose to raise the decision threshold to improve estimation accuracy. We prove several consistency results for our methods, and test them in simulations. We also illustrate our methods on data from the human genome diversity project, where they significantly improve the accuracy.

Keywords: Factor analysis; Parallel analysis; Permutation methods; Principal component analysis; Random matrix theory

1. Introduction

Factor analysis is widely used in many application areas (see, for example, Malinowski (2002), Bai and Ng (2008) and Brown (2014)). Given data describing p measurements made on n objects or n subjects, factor analysis summarizes it in terms of some number $k \ll \min(n, p)$ of latent factors, plus noise. Finding and interpreting those factors may be very useful. However, it is difficult to choose the number k of factors to include in the model.

There have been many proposed methods to choose the number k of factors. Classical methods include using an a priori fixed singular value threshold (Kaiser, 1960) and the scree test, which is a subjective search for a gap separating small from large singular values (Cattell, 1966). The recent surge in data sets with p comparable with n or even larger than n has been met with new methods in econometrics (Bai and Li, 2012; Ahn and Horenstein, 2013) and random matrix theory (Nadakuditi and Edelman, 2008) among many others. In this paper we focus exclusively on the method of parallel analysis (PA), which was first introduced by Horn (1965), and variations of it.

Address for correspondence: Edgar Dobriban, Department of Statistics, Wharton School, University of Pennsylvania, 465 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, PA 19104, USA. E-mail: [email protected]


PA, especially the version by Buja and Eyuboglu (1992), is one of the most popular ways to pick k in factor analysis. It is used, for example, as a step in bioinformatics papers such as Leek and Storey (2008). It seems that the popularity of PA is because users find that it 'works well in practice'. That property is difficult to quantify on data with a true unknown parameter (the ground truth). We believe that PA 'works well' because it selects factors above the largest noise singular value of the data, as shown by Dobriban (2017). Our goal in this paper is to provide an improved method that estimates this threshold faster and more reproducibly than PA does.

We shall describe and survey PA in detail below. Here we explain how it is based on sequential testing, assuming some familiarity with factor analysis. If there were no factors underlying the data, then the p variables would be independent. We can then compare the size of the apparent largest factor in the data matrix (the largest singular value) with what we would see in data with n independent and identically distributed (IID) observations, each of which consists of p independent but not identically distributed variables. We can simulate such data sets many times, either by generating new random variables or by permuting each column of the original data independently. If the first singular value is above some quantile of the simulated first singular values, then we conclude that $k \geq 1$. If we decide that $k \geq 1$, then we check whether the second factor (the second singular value) has significantly large size compared with simulated second factors, and so on. If the $(k+1)$st factor is not large compared with simulated factors then we stop with k factors.

Our first improvement on PA is to derandomize it. For large n and p, the size of the largest singular value can be predicted very well by using tools from random matrix theory. The prediction depends very little on the distribution of the data values: a fact called universality. For example, random matrix theory developed for Gaussian data (Johnstone, 2001) has been seen to work very well on single-nucleotide polymorphism (SNP) data that take values in {0, 1, 2} (Patterson et al., 2006). The recently developed 'spectrode' tool (Dobriban, 2015) computes the canonical distribution of noise singular values—the so-called Silverstein–Marchenko–Pastur (MP) distribution—without any simulations. We propose to use the upper edge of that distribution as a factor selection threshold.

We call the resulting method deterministic PA (DPA). A deterministic threshold has two immediate advantages over PA. It is reproducible, requiring no randomization. It is also much faster because we do not need to repeat the factor analysis on multiple randomly generated data sets.

DPA also frees the user from choosing an ad hoc level of significance for singular values. If there are no factors in the data, then PA will mistakenly choose $k \geq 1$ with a probability corresponding to the quantile that is used in simulation, because the top sample singular value has the same distribution as simulated values. As a result, we need a huge number of simulations to obtain sufficiently extreme quantiles that we rarely make mistakes. By contrast, DPA can be adjusted so that the probability of false positive results vanishes. For instance the user can insist that a potential factor must be 1.05 times larger than what we would see in uncorrelated noise.

Zhou et al. (2017) have recently developed a related method to derandomize PA, also by employing Dobriban's (2015) spectrode algorithm to approximate the level of noise. However, there are several crucial differences between our approaches: first, we use the variances of the columns as the population spectral distribution to be passed to the spectrode, whereas Zhou et al. (2017) fitted a five-point mixture to the empirical eigenvalue spectrum and used the resulting upper edge. Second, we provide theoretical consistency results, whereas Zhou et al. (2017) evaluated the methods empirically on some genetic data.

In addition to randomness, PA has a problem with shadowing. If the first factor is very large then the simulations behind PA lump that factor into the noise, raising the threshold for the next factors. Then PA can miss some of the next factors. This was seen empirically by Owen and Wang (2016) and explained theoretically by Dobriban (2017), who coined the term 'shadowing'. It is fairly common for real data sets to contain one 'giant factor' explaining a large proportion of the total variance. In finance, many stock returns move in parallel. In educational testing, the 'overall ability' factor of students is commonly large. Such giant factors are usually well known a priori and can shadow smaller factors that might otherwise have yielded interesting discoveries.

We counter shadowing by deflation. If the first factor is sufficiently large to keep, then we estimate its contribution and subtract it from the data. We then look for additional factors within the residual. The resulting deflated DPA (DDPA) is less affected by shadowing.

Deflating the kth factor counters shadowing by lowering the threshold for the $(k+1)$st. The result can be a cascade where threshold reductions by DDPA trigger increased k and vice versa. Recent work has shown that the threshold for detecting a factor is lower than the threshold for obtaining beneficial estimates of the corresponding eigenvectors (Perry, 2009; Gavish and Donoho, 2014). Because deflation involves subtracting an outer product of weighted eigenvectors, we consider raising the deflation threshold to keep only factors that improve estimation accuracy. We call this method 'DDPA+'.

Vitale et al. (2017) proposed a method for selecting the number of components by using permutations as in PA, including deflation, using a standardized eigenvalue test statistic, and adjusting the permuted residual by a certain projection method. Their first two steps are essentially deflated (but not derandomized) PA. Similarly to DDPA, this suffers from overselection, and their remaining two steps deal with that problem. Their approach differs from ours in several important aspects, including that our final method is deterministic, and that we prove rigorous theoretical results supporting it.

An outline of this paper is as follows. Section 2 gives some context on factor analysis, random matrices and PA. Section 3 presents our DPA. We show that, similarly to PA, DPA selects perceptible factors and omits imperceptible factors. Section 4 presents our deflated version of DPA, called DDPA, that counters the shadowing phenomenon by subtracting estimated factors. We raise the threshold in DDPA to improve estimation accuracy. We call this final version DDPA+. Section 5 presents numerical simulations where we know the ground truth. In our simulations, DPA is more stable than PA and much faster as well. DDPA is better able to counter shadowing than is DPA but the number of factors that it chooses has more variability than with DPA. Raising the decision threshold in DDPA+ appears to reduce its variance. The results are reproducible with software that is available from github.com/dobriban/DPA. In Section 6 we compare the various methods on the human genome diversity project (HGDP) data set (Li et al., 2008). We believe that PA, DPA and DDPA choose too many factors in those examples. The choice for DDPA+ seems to be more accurate but conservative. Section 7 gives a faster way to compute the DPA threshold that we need by avoiding the full spectrode computation as well as a way to compute the DDPA+ threshold. We close with a discussion of future work in Section 8. The proofs are presented in Appendix A.

2. Background

In this section we introduce our notation and quote some results from random matrix theory. We also describe goals of factor analysis and survey some of the literature on PA. For brevity we do not give a full account of factor analysis.

2.1. Notation
The data are the $n \times p$ matrix $X$ and contain p features of n samples. The rows of $X$ are IID p-dimensional observations $x_i$, for $i = 1, \ldots, n$. We shall work with the standard factor model, where observations have the form $x_i = \Lambda \eta_i + \varepsilon_i$. Here $\Lambda$ is a non-random $p \times r$ matrix of factor loadings, $\eta_i$ is the r-dimensional vector of factor scores of the ith observation and $\varepsilon_i$ the p-dimensional noise vector of the ith observation.

In matrix form, this can also be written as

$$X = \eta \Lambda^{\mathrm{T}} + Z \Phi^{1/2}. \qquad (1)$$

Here, $\eta$ is an $n \times r$ random matrix containing the factor values, so $\eta_i$ is the ith row of $\eta$. Also $E = Z\Phi^{1/2}$ is the $n \times p$ matrix containing the noise, so $\varepsilon_i$ is the ith row of $E$.

Thus, the noise is modelled as $\varepsilon_i = \Phi^{1/2} z_i$, where $z_i$ is the ith row of $Z$. The $n \times p$ matrix $Z$ contains independent random variables with mean 0 and variance 1, and $\Phi$ is a $p \times p$ diagonal matrix of idiosyncratic variances $\Phi_j$, $j = 1, \ldots, p$. The covariance matrix of $x_i$ is $\Sigma = \Lambda \Psi \Lambda^{\mathrm{T}} + \Phi$, where $\Psi \in \mathbb{R}^{r \times r}$ is the covariance of $\eta_i$. We define the scaled factor loading matrix as $\Lambda \Psi^{1/2} \in \mathbb{R}^{p \times r}$. Note that we could reparameterize the problem by this matrix. The factor model has known identifiability problems; see for instance Bai and Li (2012) for a set of possible resolutions. However, in our case, we want to estimate only the number of large factors, which is asymptotically identifiable.

Model (1) contains r factors, and we generally do not know r. We use k factors and k is not necessarily the same as r.

We shall need several matrix and vector norms. For a vector $v \in \mathbb{R}^m$ we use the usual $l_1$-, $l_2$- and $l_\infty$-norms $\|v\|_1$, $\|v\|_2$ and $\|v\|_\infty$. An unsubscripted vector norm $\|v\|$ denotes $\|v\|_2$. For a matrix $A \in \mathbb{R}^{n \times r}$ we use the induced norms
$$\|A\|_p = \sup_{v \neq 0} \|Av\|_p / \|v\|_p.$$
The spectral—or operator—norm is $\|A\|_2$, whereas $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^r |A_{ij}|$ and $\|A\|_1 = \max_{1 \le j \le r} \sum_{i=1}^n |A_{ij}|$. The Frobenius norm is given by $\|A\|_F^2 = \sum_i \sum_j A_{ij}^2$.

We consider $\eta\Lambda^{\mathrm{T}}$ to be the signal matrix and $E = Z\Phi^{1/2}$ to be the noise. It is convenient to normalize the data, via

$$n^{-1/2} X = S + N, \qquad S = n^{-1/2}\eta\Lambda^{\mathrm{T}}, \quad N = n^{-1/2}E. \qquad (2)$$

The aspect ratio of $X$ is $\gamma_p = p/n$. We consider limits with $p \to \infty$ and $\gamma_p \to \gamma \in (0, \infty)$. In our models, under some precise assumptions that are stated later, $\|N\|_2 \to b > 0$ almost surely, for some $b \in (0, \infty)$. We call $b$ the size of the noise.

For a positive semidefinite matrix $\Sigma \in \mathbb{R}^{p \times p}$, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, its empirical spectral distribution is
$$\mathrm{ESD}(\Sigma) = \frac{1}{p} \sum_{j=1}^p \delta_{\lambda_j},$$
where $\delta_\lambda$ denotes a point mass distribution at $\lambda$. In our asymptotic setting, $\mathrm{ESD}(\Sigma)$ converges weakly to a distribution $H$, which is denoted $\mathrm{ESD}(\Sigma) \Rightarrow H$. For a general bounded probability distribution $H$, its upper edge is the essential supremum
$$U(H) \equiv \operatorname{ess\,sup}(H) = \inf\{M \in \mathbb{R} \mid H((-\infty, M]) = 1\}.$$

The sample covariance matrix is $\hat{\Sigma} = (1/n) X^{\mathrm{T}} X$, and we let $D = \mathrm{diag}(\hat{\Sigma})$. If there are no factors, then $D$ is a much better estimate of $\Sigma$ than $\hat{\Sigma}$ is. We call $\mathrm{ESD}(D)$ the variance distribution. The singular values of a matrix $X \in \mathbb{R}^{n \times p}$ are denoted by $\sigma_1(X) \ge \sigma_2(X) \ge \cdots \ge \sigma_{\min(n,p)}(X) \ge 0$. The eigenvalues of $\hat{\Sigma}$ are $\lambda_j = \sigma_j^2(n^{-1/2}X)$.


2.2. Random matrix theory
Here we outline some basic results that are needed from random matrix theory; see Bai and Silverstein (2010), Paul and Aue (2014) and Yao et al. (2015) for references. We focus on the MP distribution and the Tracy–Widom distribution.

For the spherically symmetric case $\mathrm{cov}(x_i) = I_p$ we have $\mathrm{ESD}(I_p) = \delta_1$, but the eigenvalues of the sample covariance matrix $\hat{\Sigma} = (1/n) X^{\mathrm{T}} X$ do not all converge to 1 in our limit when the sample size is only a constant times larger than the dimension. Instead the empirical spectral distribution of $\hat{\Sigma}$ converges to the MP distribution, $\mathrm{ESD}(\hat{\Sigma}) \Rightarrow F_\gamma$ (Marchenko and Pastur, 1967). If $n$ and $p \to \infty$ with $p/n \to \gamma \in (0, 1]$ then $F_\gamma$ has probability density function
$$f_\gamma(\lambda) = \begin{cases} \dfrac{\sqrt{\{(\lambda - a_\gamma)(b_\gamma - \lambda)\}}}{2\pi\lambda\gamma}, & a_\gamma \le \lambda \le b_\gamma, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$
where $a_\gamma = (1 - \sqrt{\gamma})^2$ and $b_\gamma = (1 + \sqrt{\gamma})^2$.

If $\gamma > 1$, then $F_\gamma$ is a mixture placing probability $1/\gamma$ on the above density and probability $1 - 1/\gamma$ on a point mass at 0. In either case, it is also known that the largest eigenvalue of the sample covariance matrix converges to the upper edge of the MP distribution, $U\{\mathrm{ESD}(\hat{\Sigma})\} \to \{1 + \sqrt{(p/n)}\}^2 > U\{\mathrm{ESD}(\Sigma)\} = 1$. If the components of $\varepsilon_i$ are IID with kurtosis $\kappa \in (-2, \infty)$, then $U\{\mathrm{ESD}(D)\}$ is the largest of $p$ IID random variables with mean 1 and variance $(\kappa + 2)/n$. Then $U\{\mathrm{ESD}(D)\}$ will be much closer to 1 than $U\{\mathrm{ESD}(\hat{\Sigma})\}$ is.
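As a quick numerical illustration of this convergence (ours, not part of the paper), the following sketch draws pure Gaussian noise and compares the largest sample covariance eigenvalue with the white-noise MP upper edge $(1 + \sqrt{\gamma})^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1200                       # aspect ratio gamma = p / n = 0.6
X = rng.standard_normal((n, p))         # pure noise, cov(x_i) = I_p
evals = np.linalg.eigvalsh(X.T @ X / n)
gamma = p / n
print(evals.max())                      # largest sample eigenvalue
print((1 + np.sqrt(gamma)) ** 2)        # MP upper edge, approximately 3.15
```

The two printed values should be close; their gap is of the order of the Tracy–Widom scale discussed below.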

For a general distribution $H$ of bounded non-negative eigenvalues of $\Sigma$, there is a generalized MP distribution $F_{\gamma,H}$ (Bai and Silverstein, 2010; Yao et al., 2015). If $\gamma_p \to \gamma$ and $\mathrm{ESD}(\Sigma) \Rightarrow H$, then under some moment conditions $\mathrm{ESD}(\hat{\Sigma}) \Rightarrow F_{\gamma,H}$.

We can use the spectrode (Dobriban, 2015) to compute $F_{\gamma,H}$ and to obtain $U(F_{\gamma,H})$ for any $H$. Typically we shall use $H = \mathrm{ESD}(\Sigma)$ or $\mathrm{ESD}(D)$. It is convenient to use $F_{\gamma,\Sigma}$ as a shorthand for $F_{\gamma,\mathrm{ESD}(\Sigma)}$. If the data are noise without any factors, we can estimate $\Sigma$ by the diagonal matrix of variances $D = \mathrm{diag}(\hat{\Sigma})$ and expect that the largest eigenvalue of the empirical noise covariance matrix $\hat{\Sigma}$ will be close to $U(F_{\gamma,D})$.

The largest eigenvalue of the sample covariance matrix is distributed around the upper edge of $F_{\gamma,H}$. The difference, appropriately scaled, has an asymptotic Tracy–Widom distribution. See Hachem et al. (2016) and Lee and Schnelli (2016) for recent results. In Gaussian white noise, Johnstone (2001) showed that the scale parameter is

$$\tau_p = n^{-1/2}\{1 + (p/n)^{1/2}\}\{n^{-1/2} + p^{-1/2}\}^{1/3}, \qquad (4)$$

so fluctuations in the largest eigenvalue quickly become negligible.

2.3. Factor analysis
A common approach to factor analysis begins with a principal component analysis of the data and selects factors based on the size of the singular values of $n^{-1/2}X$. Let $n^{-1/2}X$ have singular values $\sigma_j$ for $j = 1, \ldots, \min(n, p)$. Now choosing $k$ is equivalent to choosing a threshold $\sigma_*$ and retaining all factors with $\sigma_l(n^{-1/2}X) > \sigma_*$. There are several goals for factor analysis and different numbers $k$ may be best for each of them. Here we list some goals.


In some cases the goal is to select interpretable factors. In many applications, there are known factors that we expect to see in the data. These are the giant factors that were mentioned above, such as in finance, where many stock returns move in parallel. In such cases, interest centres on new interpretable factors beyond the already known factors. For instance, we may look for previously unknown segments of the market. The value of the threshold is what matters here; however, users will typically also inspect and interpret the factor loadings.

Another goal is estimation of the signal matrix $\eta\Lambda^{\mathrm{T}}$, without necessarily interpreting all the factors, as a way of denoising X. Then we can choose a threshold by calculating the effect of retained factors on the mean-squared error. Methods for estimating the unobserved signal have been developed under various assumptions by Perry (2009), Gavish and Donoho (2014), Nadakuditi (2014) and Owen and Wang (2016). The optimal k for estimation can be much smaller than the true value r.

In bioinformatics, factor models are useful for decorrelating genomic data so that hypothesis test statistics are more nearly independent. See, for example, Leek and Storey (2007), Sun et al. (2012), Gagnon-Bartsch et al. (2013) and Gerard and Stephens (2017). Test statistics that are more nearly independent are better suited to multiplicity adjustments such as the Benjamini–Hochberg procedure.

2.4. Parallel analysis
PA was introduced by Horn (1965). The idea is to simulate independent data with the same variances as the original data but without any of the correlations that factors would induce. If the observed top eigenvalue of the sample covariance is significantly large, say above the simulated 95th percentile, then we include that factor or eigenvector in the model. If the first $k \ge 1$ factors have been accepted, then we accept the $(k+1)$st if it is above the 95th percentile of its simulated counterparts. Horn's original approach used the mean instead of the 95th percentile. Horn was aware of the bias in the largest eigenvalue of sample covariance matrices even before Marchenko and Pastur (1967) characterized it theoretically.

There is plenty of empirical evidence in favour of PA. For instance, Zwick and Velicer (1986) compared five methods in some small simulated data sets. Their criterion was to choose a correct k, not signal recovery. They concluded that 'The PA method was consistently accurate'. The errors of PA were typically to overestimate k (about 65% of the errors). In addition, Glorfeld (1995) wrote that

'Numerous studies have consistently shown that Horn's parallel analysis is the most nearly accurate methodology for determining the number of factors to retain in an exploratory factor analysis'.

He also reported that the errors are mostly of overestimation.

Table 1. Algorithm 1: PA

1. input: data $X \in \mathbb{R}^{n \times p}$, centred, containing p features of n samples; number of permutations $n_p$ (default, 19); percentile $\alpha$ (default, 100)
2. generate $n_p$ independent random matrices $X_\pi^i$, $i = 1, \ldots, n_p$, in the following way: permute the entries in each column of X (each feature) separately
3. initialize: $k \leftarrow 0$
4. $H_k \leftarrow$ distribution of the kth largest singular values of $X_\pi^i$, $i = 1, \ldots, n_p$
5. if $\sigma_k(X)$ is greater than the $\alpha$th percentile of $H_k$ then
6. $k \leftarrow k + 1$
7. return to step 4
8. return: selected number of factors k
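As an illustration of the permutation scheme in algorithm 1, here is a minimal Python sketch (not the authors' implementation; the function name and defaults are placeholders). It tests the (k+1)st observed singular value against the $\alpha$th percentile of its permuted counterparts, which with $\alpha = 100$ means exceeding all of them:

```python
import numpy as np

def parallel_analysis(X, n_perm=19, alpha=100, rng=None):
    """Sketch of PA: permute each column independently and compare
    observed singular values with their permuted counterparts."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)          # observed singular values
    perm_sv = np.empty((n_perm, min(n, p)))
    for i in range(n_perm):
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_sv[i] = np.linalg.svd(Xp, compute_uv=False)
    k = 0
    while k < min(n, p):
        # threshold for the (k+1)st factor: alpha-th percentile of the
        # (k+1)st largest permuted singular values (zero-based index k)
        thresh = np.percentile(perm_sv[:, k], alpha)
        if sv[k] > thresh:
            k += 1
        else:
            break
    return k
```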


The version of PA by Buja and Eyuboglu (1992) replaces Gaussian simulations by independent random permutations within the columns of the data matrix (see algorithm 1 in Table 1). Permutations have the advantage of removing the parametric assumption. They are widely used for decorrelation in genomics to identify latent effects such as batch variation.

PA was developed as a purely empirical method, without theoretical justification. Recently, Dobriban (2017) has developed a theoretical understanding: PA selects certain perceptible factors in high dimensional factor models, as described below. Dobriban (2017) clarified the limitations of PA, including the shadowing phenomenon that was mentioned above.

3. Derandomization

If there are no factors, so $r = 0$ in model (1), then $X = Z\Phi^{1/2}$ is a matrix of uncorrelated noise. We can estimate $\Phi$ by $D = \mathrm{diag}(X^{\mathrm{T}}X/n)$ and compute the upper edge of the MP distribution that is induced by $D$ with aspect ratio $p/n$, i.e. $U(F_{p/n,D})$. This is a deterministic approximation of the location of the largest noise eigenvalue of $n^{-1}X^{\mathrm{T}}X$.

Our new method of derandomized PA chooses $k$ factors where $k$ satisfies
$$\sigma_k^2(n^{-1/2}X) > U(F_{p/n,D}) \ge \sigma_{k+1}^2(n^{-1/2}X).$$

The spectrode method (Dobriban, 2015) computes $F_{p/n,D}$ from which we can obtain this upper edge. We give an even faster algorithm in Section 7.1 that computes $U(F_{p/n,D})$ directly.
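Given a routine for the upper edge $U(F_{p/n,D})$, such as the one described in Section 7.1, DPA itself reduces to a few lines. The sketch below is illustrative only; `mp_upper_edge` stands for an upper-edge routine of the kind sketched in Section 7.1, and the interface is our own:

```python
import numpy as np

def dpa(X, mp_upper_edge, eps=0.0):
    """Sketch of DPA: keep factors whose squared singular values of
    n^{-1/2} X exceed (1 + eps)^2 times the MP upper edge induced by
    the diagonal of the sample covariance."""
    n, p = X.shape
    evals = np.linalg.svd(X / np.sqrt(n), compute_uv=False) ** 2
    D = np.sum(X ** 2, axis=0) / n            # column variances, diag(X'X / n)
    edge = mp_upper_edge(D, gamma=p / n)      # U(F_{p/n, D})
    return int(np.sum(evals > (1.0 + eps) ** 2 * edge))
```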

The threshold that DPA estimates is a little different from the PA threshold. For the kth singular value, PA is estimating something like the $1 - k/p$ quantile of the generalized MP distribution whereas DPA uses the upper edge. This difference is minor for small k.

We say that factor $k \in \{1, \ldots, r\}$ is perceptible if $\sigma_k(n^{-1/2}X) > b + \epsilon$ almost surely for some $\epsilon > 0$, where $b$ is the almost sure limit of $n^{-1/2}\|E\|_2$ that was defined in Section 2. It is imperceptible if $\sigma_k(n^{-1/2}X) < b - \epsilon$ almost surely holds.

Perceptible factors are defined in terms of X and N, and not in terms of underlying factor signal size. This may seem rather circular, or even tautological. However, there are two reasons for adopting this definition. First, as Dobriban (2017) has argued, in certain spiked covariance models we can show that a large underlying signal size leads to separated factors. Second, this is the 'right' definition mathematically, leading to elegant proofs. Moreover, the definition is related to the Baik–Ben Arous–Peche phase transition (Baik et al., 2005), and it reduces to the Baik–Ben Arous–Peche phase transition for the usual spiked models. However, our definition also makes sense in other models, where $p/n \to \infty$ and the number of spikes diverges to $\infty$. These are not included in usual spiked models. See Dobriban (2017), section 5, for a detailed explanation.

Theorem 1 below shows that DPA selects all perceptible factors and no imperceptible factors, just like PA does. Although the result is similar to those on PA from Dobriban (2017), the proof is different. Let $\Psi^{1/2}$ be the symmetric square root of $\Psi = \mathrm{cov}(\eta_i)$, and recall that we defined the scaled factor loading matrix $\Lambda\Psi^{1/2}$.

Theorem 1 (DPA selects perceptible factors). Let the centred data matrix X follow factor model (1) with IID $x_i = \Lambda\eta_i + \varepsilon_i$, $i = 1, \ldots, n$. Assume the following conditions (for some positive constants $\epsilon$ and $\delta$).

Condition 1 (factors). The factors $\eta_i$ have the form $\eta_i = \Psi^{1/2}U_i$, where $U_i$ has r independent entries with mean 0, variance 1 and bounded moments of order $4 + \delta$.

Condition 2 (idiosyncratic terms). The idiosyncratic terms are $\varepsilon_i = \Phi^{1/2}Z_i$, where $\Phi^{1/2}$ is a diagonal matrix, and $Z_i$ have p independent entries with mean 0, variance 1 and bounded moments of order $8 + \epsilon$.


Condition 3 (asymptotics). $n, p \to \infty$, such that $p/n \to \gamma > 0$, and there is a bounded distribution $H$ with $\mathrm{ESD}(\Phi) \Rightarrow H$ and $\max_{1 \le j \le p} \Phi_j \to U(H)$.

Condition 4 (number of factors). The number r of factors is fixed or grows such that $r/n^{1+\delta/4}$ is summable.

Condition 5 (factor loadings). The scaled loadings are delocalized, in the sense that $\|\Lambda\Psi^{1/2}\|_\infty \to 0$.

Then with probability tending to 1, DPA selects all perceptible factors, and no imperceptible factors.

For a proof of theorem 1, see Appendix A.1.

This shows that the behaviour of DPA is similar to that of PA (Dobriban, 2017). However, there are some important differences. In contrast with the result on PA, theorem 1 allows both a growing number of factors, as well as growing factor strength. Specifically, the number of factors can grow, as long as $r/n^{1+\delta/4}$ is summable. For instance, if we assume that $\delta = 4$, which translates into the condition that the eighth moments of the entries of $U_i$ are uniformly bounded, then $r$ can grow as fast as $n^{1-\epsilon}$ for any $\epsilon > 0$.

The factor strengths can grow as follows. Let the scaled factor loading matrix $\Lambda\Psi^{1/2}$ have columns $d_l$ for $l = 1, \ldots, r$. These columns can have a diverging Euclidean norm $\|d_l\|_2 \to \infty$ so long as the sum of the max-norms tends to 0, $\sum_{l=1}^r \|d_l\|_\infty \to 0$. As a result, the Frobenius norm of the scaled loading matrix can grow, subject to $\|\Lambda\Psi^{1/2}\|_F = o(\{n/r\}^{1/2})$. There is a trade-off between the number of factors and their sizes. For instance, if $r \sim n^{1-\epsilon}$, then the scaled factors can grow, subject to their Euclidean norm being $\|d_l\| = o(n^\epsilon)$.

DPA may yield false positive and false negative results stemming from factors that are not perceptible. In particular, there is a positive probability, even asymptotically, for DPA to pick $k \ge 1$ from data with no factors. This may be why Glorfeld (1995) found that PA's errors were errors of overestimation.

We can remove false positive results asymptotically by taking factors only above some multiple $1 + \epsilon_p$ of the estimated upper edge, i.e. we select a factor if $\sigma_k > (1 + \epsilon_p) U(F_{p/n,D})^{1/2}$. The choice of $\epsilon_p$ is problem dependent; we think that 0.05 is sufficiently large to trim out noise in the bioinformatics problems that we have in mind. A factor that is only 1.05 times as large as we would expect in pure noise is not likely to be worth interpreting.

For estimation of the number of factors, we might want to take $\epsilon_p \to 0$ in a principled way by using the Tracy–Widom distribution. We could take $\epsilon_p = c_{n,p}\tau_p^{1/2}$, where $c_{n,p}$ is a deterministic sequence and $\tau_p$ is given by equation (4). The constant $c_{n,p}$ can be chosen as some fixed percentile of the Tracy–Widom distribution. However, this leads to only a tiny increase in the threshold, so we decided that it is not worth making the method more complicated with this additional parameter. Therefore, in the following DDPA, DDPA+ algorithms and simulations, $\epsilon_p = 0$.

For estimation of the signal matrix $\eta\Lambda^{\mathrm{T}}$, we shall discuss a different choice in Section 4.1.

4. Deflation

To address the shadowing of mid-sized factors, we shall deflate $X$. Let the singular value decomposition (SVD) of $X \in \mathbb{R}^{n \times p}$ be $X = \sum_{i=1}^{\min(n,p)} \sigma_i u_i v_i^{\mathrm{T}}$, with non-increasing $\sigma_i$. If we decide to keep the first factor, then we subtract $\sigma_1 u_1 v_1^{\mathrm{T}}$ from $X$, recompute the upper edge of the MP distribution and repeat the process. This ensures that the very strong factors are removed, and thus they do not shadow mid-sized factors. We call this DDPA.


After $k$ deflations, the current residual matrix is $X = \sum_{i=k+1}^{\min(n,p)} \sigma_i u_i v_i^{\mathrm{T}}$. Thus, this algorithm requires only one SVD computation.

Deflation is a myopic algorithm. At each step we inspect the strongest apparent factor. If it is large compared with a null model, we include it. Otherwise we stop. There is no lookahead. There is a potential downside to myopic selection. If some large singular values are close to each other, but well separated from the bulk, the algorithm might stop without taking any of them, even though it could be better to take all of them. We think that this sort of shadowing from below is unlikely in practice.

Deflation must be analysed carefully, as the errors in estimating singular values and vectors could propagate through the algorithm. To develop a theoretical understanding of deflation, our first step is to show that it works in a setting which parallels theorem 1. Recall that the scaled factor loading matrix is $\Lambda\Psi^{1/2} = (d_1, \ldots, d_r)$. The statement of this theorem involves the Stieltjes transform $m_{\gamma,H} = E_H\, 1/(X - z)$, which is defined in more detail in the on-line supplement.

Theorem 2 (DDPA selects the perceptible factors). Consider factor models under the conditions 1–3 of theorem 1, with modified conditions 4 and 5 as follows.

Condition 6 (number of factors). The number r of factors is fixed.

Condition 7 (factor loadings). The vectors of scaled loadings $d_l$ are delocalized in the sense that $\|d_l\|_\infty \to 0$ for $l = 1, \ldots, r$. They are also delocalized with respect to $\Phi$, in the sense that
$$\frac{x^{\mathrm{T}}(\Phi - zI_p)^{-1}d_l - m_{\gamma,H}(z)\, x^{\mathrm{T}}d_l}{\|d_l\|} \to 0 \qquad (5)$$
uniformly for $\|x\| \le 1$, $l = 1, \ldots, r$, and $z \in \mathbb{C}$ with $\mathrm{Im}(z) > 0$ fixed.

Then, with probability tending to 1, DDPA with $\epsilon_p = 0$ selects all perceptible factors, and no imperceptible factors.

For a proof of theorem 2, see the on-line supplement.

The proof requires one new key step, controlling the $l_\infty$-norm of the empirical spike eigenvectors in certain spiked models. The delocalization condition (5) is somewhat technical and requires that $d_l$ are 'random like' with respect to $\Phi$. This condition holds if $d_l$ are random vectors with IID co-ordinates, albeit only in the special case where the entries of $\Phi$ have a vanishing variability, so that we are in a near-homoscedastic situation. For other $\Phi$-matrices, there can of course be other vectors $d_l$ that make this condition hold.

In practice, the advantage of deflation is that it works much better than plain DPA in the setting with growing factors. We see this clearly in the simulations of Section 5. Our theoretical results for deflation are comparable with usual DPA, so this empirical advantage is not reflected in theorem 2. Analysing the strong factor model theoretically seems to be beyond what the current proof techniques can accomplish.

As with DPA, we might want to increase the threshold to trim factors that are not perceptible, or too small to be interesting or useful.

Proposition 1 (slightly increasing the threshold retains the properties of DDPA). Consider the DDPA algorithm in Table 2 where in step 4 of algorithm 2 we select whether $\sigma_1(n^{-1/2}X) > (1 + \epsilon_p) U(F_{\gamma_p,H_p})^{1/2}$, and $\epsilon_p \to 0$. Then the resulting algorithm satisfies theorem 2.

The proof of proposition 1 is immediate, by checking that the steps of the proof of theorem 2 go through.


Table 2. Algorithm 2: DDPA

1. input: data $X \in \mathbb{R}^{n \times p}$, centred, containing p features of n samples
2. initialize: $k \leftarrow 0$
3. compute variance distribution: $H_p \leftarrow \mathrm{ESD}\{\mathrm{diag}(n^{-1}X^{\mathrm{T}}X)\}$
4. if $\sigma_1(n^{-1/2}X) > (1 + \epsilon_p) U(F_{\gamma_p,H_p})^{1/2}$ (by default $\epsilon_p = 0$), then
5. $k \leftarrow k + 1$
6. $X \leftarrow X - \sigma_1 u_1 v_1^{\mathrm{T}}$ (from the SVD of X)
7. return to step 3
8. return: selected number of factors k
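A minimal sketch of the DDPA loop in algorithm 2 (illustrative, not the released code); it reuses the hypothetical `mp_upper_edge` routine from the DPA sketch above, performs a single SVD, and recomputes the variance distribution from the residual after each deflation:

```python
import numpy as np

def ddpa(X, mp_upper_edge, eps=0.0):
    """Sketch of DDPA: deflate the leading factor while its singular
    value exceeds the MP upper edge recomputed from the residual."""
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)     # one SVD suffices
    resid = X.copy()
    k = 0
    while k < min(n, p):
        D = np.sum(resid ** 2, axis=0) / n               # variances of the residual
        edge = mp_upper_edge(D, gamma=p / n)             # U(F_{p/n, H_p})
        if (s[k] / np.sqrt(n)) ** 2 > (1.0 + eps) ** 2 * edge:
            resid -= s[k] * np.outer(U[:, k], Vt[k])     # deflate the kth factor
            k += 1
        else:
            break
    return k
```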

Table 3. Algorithm 3: DDPA+

1. input: data $X \in \mathbb{R}^{n \times p}$, centred, containing p features of n samples
2. initialize: $k \leftarrow 0$
3. compute singular values $\sigma_i(X)$, and eigenvalues $\lambda_i = \sigma_i^2$ sorted in decreasing order; let $\lambda = \sigma_1^2$, $r = \min(n, p)$
4. compute spectral functionals: Stieltjes transforms, $m = (r-1)^{-1}\sum_{i=2}^r (\lambda_i - \lambda)^{-1}$, $v = \gamma m - (1-\gamma)/\lambda$; D-transform, $D = \lambda m v$; population spike estimate, $l = 1/D$; derivatives of Stieltjes transforms, $m' = (r-1)^{-1}\sum_{i=2}^r (\lambda_i - \lambda)^{-2}$, $v' = \gamma m' + (1-\gamma)/\lambda^2$; and derivative of D-transform, $D' = mv + \lambda(mv' + m'v)$
5. compute estimators of the squared cosines: $c_r^2 = m/(D'l)$, $c_l^2 = v/(D'l)$
6. if $\sigma_1^2 < 4l^2 c_r^2 c_l^2$, then
7. $k \leftarrow k + 1$
8. $X \leftarrow X - \sigma_1 u_1 v_1^{\mathrm{T}}$ (from the SVD of X)
9. return to step 3
10. return: selected number of factors k

4.1. DDPA+: increasing the threshold to account for eigenvector estimation
Every time that deflation removes a factor, the threshold for the next factor is decreased. For some data, DDPA will remove many factors one by one. It is worth considering a criterion that is more stringent than perceptibility.

Deflation involves subtracting $\sigma_k u_k v_k^{\mathrm{T}}$ from $X$. This $u_k v_k^{\mathrm{T}}$ matrix is a noisy estimate of the kth singular space of the signal matrix $S$. A higher standard than perceptibility is to require that $u_k v_k^{\mathrm{T}}$ be an accurate estimate of the corresponding quantity of the matrix $S$. For this we use a higher threshold for $\sigma_k$ that was obtained by Perry (2009) and also by Gavish and Donoho (2014). This Perry–Gavish–Donoho threshold was derived for a white noise model. In Section 7.2 we extend it to a more general factor analysis set-up by using ideas from the OptShrink algorithm (Nadakuditi, 2014). We call the resulting method DDPA+. We note that the Tracy–Widom scale factor $\tau_p$ of equation (4) is too small to use in DDPA+. We provide the detailed algorithm in algorithm 3 (Table 3).


5. Numerical experiments

We perform numerical simulations to understand and compare the behaviour of our proposed methods. We follow the simulation set-up of Dobriban (2017).

5.1. Deterministic parallel analysis versus parallel analysis
First we compare DPA with PA. For PA, we use the most classical version, generating 19 permutations, and selecting the kth factor if $\sigma_k(X)$ is larger than all the permuted singular values.

We simulate from the factor model $x_i = \Lambda\eta_i + \varepsilon_i$. We generate the noise $\varepsilon_i \sim \mathcal{N}(0, T_p)$, where $T_p$ is a diagonal matrix of noise variances uniformly distributed on a grid between 1 and 2. The factor loadings are generated as $\Lambda = \theta \bar{Z}$, where $\theta > 0$ is a scalar corresponding to factor strength, and $\bar{Z}$ is obtained by normalizing the columns of a random matrix $Z \in \mathbb{R}^{p \times r}$ with IID $\mathcal{N}(0, 1)$ entries. Also, $\eta_i$ have IID standard normal entries.

We use a one-factor model, so $r = 1$, and work with sample size $n = 500$ and dimension $p = 300$. We report additional experiments with larger p and non-Gaussian data in the on-line supplement. We take $\theta = \gamma^{1/2}s$, for s on a linear grid of 10 points between 0.2 and 6 inclusively. This is the same simulation set-up as used in Dobriban (2017). We repeat the simulation 100 times.
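For concreteness, here is a small sketch of this data-generating process (ours, with our own variable names; the released simulation code is the reference):

```python
import numpy as np

def simulate_factor_data(n=500, p=300, r=1, s=3.0, rng=None):
    """Sketch of the simulation set-up: heteroscedastic noise with
    variances on a grid in [1, 2], normalized Gaussian loadings scaled
    by theta = sqrt(gamma) * s, and standard normal factor scores."""
    rng = np.random.default_rng(rng)
    gamma = p / n
    theta = np.sqrt(gamma) * s
    Z = rng.standard_normal((p, r))
    loadings = theta * Z / np.linalg.norm(Z, axis=0)      # Lambda = theta * Z_bar
    eta = rng.standard_normal((n, r))                     # factor scores
    noise_var = np.linspace(1.0, 2.0, p)                  # diagonal of T_p
    eps = rng.standard_normal((n, p)) * np.sqrt(noise_var)
    return eta @ loadings.T + eps
```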

The results in Fig. 1(a) show that PA and DPA both select the right number of factors as soon as the signal strength s is larger than about 2. This agrees with our theoretical predictions, since both algorithms select the perceptible factors. The close match between PA and DPA confirms our view that PA is estimating the upper edge of the spectrum.

A key advantage of DPA is its speed. PA and DPA both perform a first SVD of X to compute the singular values $\sigma_k(X)$. This takes $O\{np\min(n, p)\}$ flops. Afterwards, PA generates $n_{\mathrm{perm}}$ independent permutations $X_\pi$ and computes their SVD. This takes $O\{n_{\mathrm{perm}} np \min(n, p)\}$ flops. It would be reasonable to run both PA and DPA with a truncated SVD algorithm computing only the first k singular values where k is a problem-specific a priori upper bound. DPA would still have a cost advantage as that truncated SVD would need to be computed only once versus $n_{\mathrm{perm}} + 1$ times for PA.

After computing the SVD, DPA needs to compute only $U(F_{\gamma_p, H_p})$, the upper edge of the MP distribution. The fast method that is described in Section 7.1 has an approximate cost of $O(p)$ per iteration. We do not know whether the number of iterations that are required to achieve the desired accuracy depends strongly on p.

Fig. 1. (a) Mean and ±1 standard deviation of the number of factors selected by PA and DPA as a function of signal strength (all standard deviations were 0; p/n = 0.60; a small amount of jitter has been added for visual clarity) and (b) running time (in log-seconds) of PA and DPA


In conclusion, the cost of DPA should be lower by a factor of the order of $n_{\mathrm{perm}}$ than that of PA. In empirical applications, the number of SVD computations is typically about 20; thus this represents a meaningful speed-up.

DPA is also much faster empirically. In Fig. 1(b), we report the results of a simulation where we increase the sample size n and dimension p, keeping $p/n = \gamma = 0.6$, from $n = 500$ to $n = 3500$ in steps of 500. We use $n_{\mathrm{perm}} = 20$ permutations in PA. We obtain only one Monte Carlo sample for each parameter setting. We set the signal strength $\theta = 6\gamma^{1/2}$, so that the problem is 'easy' statistically. Both methods can select the correct number of factors, regardless of the sample size (for brevity the data are not shown). Fig. 1(b) shows that DPA is much faster than PA. On a log-scale, we see that the improvement is between 10 and 15 times. The expected improvement without the upper edge computation would be 20 times, but the time that is needed for that computation reduces the improvement. However, DPA is still much faster than PA.

5.2. Deflated deterministic parallel analysis versus deterministic parallel analysis
Here we show that DPA is affected by shadowing and that DDPA counters it. We consider a two-factor and a three-factor model, with the same parameters as above, including heteroscedastic noise. For the two-factor model, we set the smaller signal strength to $\theta_1 = 6\gamma^{1/2}$, which is a perceptible factor. We vary the larger factor strength $\theta_2 = c_2\gamma^{1/2}$ on a 20-point grid from $c_2 = 6$ to $c_2 = 70$. This is the same simulation set-up as used in Dobriban (2017). The simulation was repeated 100 times at each level. The results in Fig. 2(a) show that DPA correctly finds both factors for small, but not for large, $\theta_2$. DPA begins to suffer from shadowing at $\theta_2 \approx 30$ and shadowing is solidly in place by $\theta_2 = 40$. In contrast, DDPA on average selects a constant number of factors, which is only slightly more than the true number, even when the larger factor is very strong.

We see more scatter in the choices of k by DDPA than by DPA. Both DPA and DDPA are deterministic functions of X, but DDPA subtracts a function of estimated singular vectors. As we mentioned in Section 4.1 those vectors are not always well estimated, which could contribute to the variance of DDPA. The standard deviation for DPA is 0 when all 100 simulations picked the same k.

For the three-factor model, we set $\theta_1 = 6\gamma^{1/2}$ and $\theta_2 = 10\gamma^{1/2}$ and vary the largest factor $\theta_3 = c_3\gamma^{1/2}$ on the same 20-point grid between $c_3 = 10$ and $c_3 = 70$. In Fig. 2(b), we see the same behaviour as for the two-factor model. For $\theta_3 < 30$ there is very little shadowing, by $\theta_3 = 40$ one factor is being shadowed and by $\theta_3 = 50$ we start to see a second factor becoming shadowed.

Fig. 2. Mean and ±1 standard deviation of the number of factors selected by DPA and DDPA as a function of the stronger factor value, with (a) two and (b) three factors


Fig. 3. Mean and ±1 standard deviation of the number of factors selected by DDPA as a function of stronger factor value, with (a) one and (b) three factors, with an increased threshold (DDPA+) and without (DDPA)

DDPA counters the shadowing, once again with more randomness than DPA has. In both cases 1 standard deviation is approximately 0.5 for DDPA.

5.3. DDPA+ versus deflated deterministic parallel analysis
Here we examine the behaviour of DDPA+. We consider the same multifactor setting as in the previous section, now with a one-factor and a three-factor model.

The results in Fig. 3(a) show that the increased threshold leads to some weak factors not being selected by DDPA+. The threshold at which a factor is included is increased compared with DDPA, as expected. For the three-factor model in Fig. 3(b), increasing the threshold counters the variability that we saw for DDPA in Fig. 2(b). In conclusion, increasing the threshold has the expected behaviour.

Moreover, we see an increased variability for DDPA+ in the leftmost setting on the right-hand side, when $\theta_3 = \theta_2$. The problem is that the two population singular values are equal, and this is a problematic setting for choosing the number of factors. The problem disappears almost immediately when $\theta_3$ is just barely larger than $\theta_2$ so then there is only a small range of signal strengths where this happens. We expect a very asymmetric distribution of k there, so perhaps ±1 standard deviation does not depict it well. This is a type of 'shadowing from below'.

6. Data analysis

We consider the HGDP data set (e.g. Cann et al. (2002) and Li et al. (2008)). The purpose of collecting this data set was to evaluate the diversity in the patterns of genetic variation across the globe. We use the Centre d'Etude du Polymorphisme Humain panel, in which SNP data were collected for 1043 samples representing 51 populations from Africa, Europe, Asia, Oceania and the Americas. We obtained the data from http://www.hagsc.org/hgdp/data/hgdp.zip. We provide the data and processing pipeline at github.com/dobriban/DPA.

The data have n = 1043 samples, and we focus on the p = 9730 SNPs on chromosome 22. Thus we have an $n \times p$ data matrix X, where $X_{ij} \in \{0, 1, 2\}$ is the number of copies of the minor allele of SNP j in the genome of individual i. We standardize the data SNP-wise, centring each SNP by its mean, and dividing by its standard error. For this step, we ignore missing values. Then, we impute the missing values as 0s, which are also equal to the mean of each SNP.
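A sketch of this standardization and imputation step (ours; the released pipeline at github.com/dobriban/DPA is the reference, and we use the per-SNP standard deviation as the scale):

```python
import numpy as np

def standardize_snps(X):
    """Sketch of the SNP-wise standardization: centre each SNP by its
    mean, divide by its scale (both computed ignoring missing values),
    then impute missing entries as 0."""
    X = np.asarray(X, dtype=float)            # missing values coded as np.nan
    mean = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd[sd == 0] = 1.0                         # guard against constant SNPs
    Xs = (X - mean) / sd
    return np.where(np.isnan(Xs), 0.0, Xs)    # 0 equals the centred SNP mean
```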


Fig. 4. Singular value histogram of the HGDP data (overlaid are the thresholds where selection of factors stops for the four methods: PA, DPA, DDPA and DDPA+)

In Fig. 4, we show the singular values of the HGDP data, and thresholds for selection of our methods: PA, DPA, DDPA and DDPA+. For PA, we use the sequential version, where the observed singular value must exceed all 20 permuted counterparts. PA selects 212 factors, DPA 122, DDPA 1042 and DDPA+ 4.

Because we have scaled the data, the generalized MP distribution here reduces to the usual distribution. The bulk edge of that distribution is exceeded by all 122 factors that are selected by DPA. PA selects even more factors. The PA threshold is a noisy estimate of the $1 - 212/9730 \approx 0.978$ quantile of the MP distribution. Though their thresholds are not very far apart, the density of sample singular values is high in that region, so PA and DPA pick quite different k.

Every time that DDPA increases k by 1, the threshold that it uses decreases and the cascade terminated with $k = n - 1$ factors. This is the rank of our centred data matrix, and so the residual is 0. 51 populations are represented in the data and it is plausible that a very large number of factors are present in the genome. They cannot all be relatively important and we certainly cannot estimate all those singular vectors accurately. Restricting to well-estimated singular vectors as DDPA+ does leads to k = 4 factors. The threshold is well separated from the threshold that would lead to k = 5.

Since the ground truth is not known, we cannot be sure which k is best. We can, however, make a graphical examination. Real factors commonly show some interesting structure, whereas we expect spurious factors or poorly determined factors to appear Gaussian. Of course, this is entirely heuristic. However, practitioners often attempt to interpret the principal component estimators of the factors in exploratory factor analysis, so this is a reasonable check. Perry and Owen (2010) even based a test for factor structure on the non-Gaussianity of the singular vectors, using random rotation matrices.

Figs 5 and 6 show the left singular vectors of the data corresponding to the top singular values. We see a clear clustering structure in at least the first eight principal components.


Fig. 5. Scatter plots of the left singular vectors (a) 1–4, (b) 5–8 and (c) 9–12 of the HGDP data (each point represents a sample)


Fig. 6. Scatter plots of the left singular vectors (a) 13–16, (b) 17–20 and (c) 21–24 of the HGDP data


DDPA+ selects only four factors, which we interpret as being conservative about estimating the singular vectors beyond the fourth. Although it is visually apparent that there is some structure in eigenvectors 5–8, DDPA+ stopped because that structure could not be confidently estimated. There is much less structure in the principal components beyond the eighth. Thus, we believe that PA and DPA select many more factors than can be well estimated from these data even if that many factors exist. DDPA failed completely on these data, unlike in our simulations, where it was merely variable. Here, the histogram of the bottom singular values never looked like a generalized MP distribution. We think that DDPA+ is more robust.

To be clear, we think that the main reason why DDPA does not work is that there is a large amount of clustering in the data. There are samples from several different populations from Europe, Asia and Africa. Within each continent, there are samples from different regions, and within those there are samples from different countries. The assumption of IID data is not accurate. Instead, it would be more accurate to use a hierarchical mixture model. Simulations in the on-line appendix show how DDPA fails under such a model. However, developing methods for such models is beyond our current scope.

7. Computing the thresholds

In this section we describe how our thresholds are computed. First we consider a fast method to compute the DPA threshold. Then we describe how to compute the DDPA+ threshold.

7.1. Fast computation of the upper edge
We describe a fast method to compute the upper edge $U(F_{\gamma,H})$ of the MP distribution, given the population spectrum $H$ and the aspect ratio $\gamma$. Our method described here is, implicitly, one of the steps of the spectrode method, which computes the density of the whole MP distribution $F_{\gamma,H}$ (Dobriban, 2015). However, this subproblem was not considered separately in Dobriban (2015), and thus calling the spectrode method can be inefficient when we need only the upper edge.

We are given a probability distribution $H = \sum_{j=1}^p w_j \delta_{\Phi_j}$ which is a mixture of point masses at $\Phi_j \ge 0$, with $w_j \ge 0$ and $\sum_{j=1}^p w_j = 1$. We are also given the aspect ratio $\gamma > 0$. As a consequence of the results in Silverstein and Choi (1995) (see, for example, Bai and Silverstein (2010), lemma 6.2), the upper edge $U(F_{\gamma,H})$ has the following characterization. Let $B = -1/\max_j \Phi_j$, and assume that $H \neq \delta_0$, so that $B < 0$ is well defined. Consider the map
$$z(v) = -\frac{1}{v} + \gamma \sum_{j=1}^p \frac{w_j \Phi_j}{1 + \Phi_j v}.$$
On the interval $(B, 0)$, $z$ has a unique minimum $v^*$. Then, $U(F_{\gamma,H}) = z(v^*)$. Moreover, $z$ is strictly convex on that interval, and asymptotes to $\infty$ at both end points. This enables finding $v^*$ numerically by solving $z'(v) = 0$ by using any one-dimensional root finding algorithm, such as bisection or Brent's method (Brent, 1973).
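A minimal numerical sketch of this computation (ours, not the released implementation); because $z$ is strictly convex on $(B, 0)$ and blows up at both end points, minimizing $z$ directly with a bounded scalar optimizer is equivalent to root finding on $z'$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mp_upper_edge(variances, gamma, weights=None):
    """Compute U(F_{gamma,H}) for H a mixture of point masses at the given
    variances, by minimizing z(v) = -1/v + gamma * sum_j w_j Phi_j / (1 + Phi_j v)
    over v in (B, 0), where B = -1 / max_j Phi_j."""
    phi = np.asarray(variances, dtype=float)
    w = np.full(phi.size, 1.0 / phi.size) if weights is None else np.asarray(weights)
    B = -1.0 / phi.max()

    def z(v):
        return -1.0 / v + gamma * np.sum(w * phi / (1.0 + phi * v))

    # z is strictly convex on (B, 0) and asymptotes to infinity at both ends
    eps = 1e-10 * abs(B)
    res = minimize_scalar(z, bounds=(B + eps, -eps), method="bounded")
    return z(res.x)
```

For white noise, i.e. unit variances, this should return approximately $(1 + \sqrt{\gamma})^2$; for example, `mp_upper_edge(np.ones(300), 0.6)` is close to 3.15.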

7.2. The DDPA+ threshold
For simplicity we consider discerning whether our X from which $k - 1$ factors have been subtracted contains an additional kth well-estimated factor or not. Let $n^{-1/2}X = \theta a b^{\mathrm{T}} + N$, where $S = \theta a b^{\mathrm{T}}$ is the signal, with unit vectors $a$ and $b$, and $N$ is the noise. Let $X = \sum_{i=1}^r \sigma_i u_i v_i^{\mathrm{T}}$, where $r = \min(n, p)$, be the SVD of $X$.

We compare two estimators of $S$: the first empirical singular matrix $\hat{S} = \sigma_1 u_1 v_1^{\mathrm{T}}$, and the $n \times p$ zero matrix $0_{n \times p}$, which corresponds to a hard thresholding estimator on the first singular value.


Now $\hat{S}$ is more accurate than 0 in mean-squared error, when
$$\|\sigma_1 u_1 v_1^{\mathrm{T}} - \theta a b^{\mathrm{T}}\|_F^2 < \|\theta a b^{\mathrm{T}}\|_F^2.$$

By expanding the square, isolating $\sigma_1$ and squaring, the criterion becomes
$$\sigma_1^2 < 4\theta^2 (u_1^{\mathrm{T}} a)^2 (v_1^{\mathrm{T}} b)^2. \qquad (6)$$

The theory of spiked covariance models (e.g. Benaych-Georges and Nadakuditi (2012)) shows that as $n, p \to \infty$, under the conditions of the theorems in our paper and for fixed factor strengths, the empirical spike singular value $\sigma_1$ and the empirical squared cosines $(u_1^{\mathrm{T}}a)^2$, $(v_1^{\mathrm{T}}b)^2$ have almost sure limits. Moreover, the OptShrink algorithm (Nadakuditi, 2014) provides consistent estimators of the unknown population spike $\theta^2$, and the squared cosines $(u_1^{\mathrm{T}}a)^2$, $(v_1^{\mathrm{T}}b)^2$. OptShrink relies on the empirical singular values $\sigma_i$, $i = 1, \ldots, r$. To estimate the unknown population spike $\theta^2$, it uses the singular values $\sigma_i$, $i = m, \ldots, r$, for some prespecified m. Here we take m = 2.

The specific formulas can be found in Benaych-Georges and Nadakuditi (2012). We provide them here for completeness and they are also in the code at github.com/dobriban/DPA. The full algorithm is stated separately in Benaych-Georges and Nadakuditi (2012). Let $\lambda = \sigma_1^2$, and $\lambda_i = \sigma_i^2$, $i = 2, \ldots, r$, where $r = \min(n, p)$. Then, let $m = (r-1)^{-1}\sum_i (\lambda_i - \lambda)^{-1}$, $v = \gamma m - (1-\gamma)/\lambda$, $D = \lambda m v$, $l = 1/D$, $m' = (r-1)^{-1}\sum_i (\lambda_i - \lambda)^{-2}$, $v' = \gamma m' + (1-\gamma)/\lambda^2$ and $D' = mv + \lambda(mv' + m'v)$. The estimators of the squared cosines between the true and empirical left and right singular vectors are respectively $c_r^2 = m/(D'l)$ and $c_l^2 = v/(D'l)$.
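A numerical sketch of this threshold check (ours, not the released code); it computes the spectral functionals above from the empirical singular values and plugs them into criterion (6), reading $l = 1/D$ as the estimate of $\theta^2$:

```python
import numpy as np

def keep_top_factor(X, gamma):
    """Sketch of the DDPA+ check: estimate the population spike and the
    squared cosines from the empirical singular values of X, then test
    criterion (6), sigma_1^2 < 4 * theta^2 * cos_r^2 * cos_l^2."""
    s = np.linalg.svd(X, compute_uv=False)
    lam, lam_rest = s[0] ** 2, s[1:] ** 2
    r = s.size
    m = np.sum(1.0 / (lam_rest - lam)) / (r - 1)          # Stieltjes transform
    v = gamma * m - (1.0 - gamma) / lam
    D = lam * m * v                                        # D-transform
    ell = 1.0 / D                                          # spike estimate (theta^2)
    m1 = np.sum(1.0 / (lam_rest - lam) ** 2) / (r - 1)     # m'
    v1 = gamma * m1 + (1.0 - gamma) / lam ** 2             # v'
    D1 = m * v + lam * (m * v1 + m1 * v)                   # D'
    c_r2 = m / (D1 * ell)                                  # squared cosine, left
    c_l2 = v / (D1 * ell)                                  # squared cosine, right
    return lam < 4.0 * ell * c_r2 * c_l2                   # our plug-in of criterion (6)
```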

8. Discussion

In this paper we have developed a deterministic counterpart, DPA, to the popular PA method for selecting the number of factors. DPA is completely deterministic and hence reproducible. It is also much faster than PA. Both of these traits make it a suitable replacement for PA when factor analysis is used as one step in a lengthy data analysis workflow. Under standard factor analysis assumptions, DPA inherits PA's vanishing probability of selecting false positive results that was proved in Dobriban (2017).

PA and DPA have a shadowing problem that we can remove by deflation. The resulting DDPA algorithm counters shadowing but chooses a more random number of factors because it subtracts poorly estimated singular vectors. We counter this in our DDPA+ algorithm by raising the threshold to include only factors whose singular vectors are well determined.

Factor analysis is used in data analysis pipelines for multiple purposes, including factor interpretation, data denoising and data decorrelation. The best choice of k is probably different for each of these uses. Denoising has been well studied but we think that more work is needed for the other goals.

Acknowledgements

This work was supported by the US National Science Foundation under grants DMS-1521145 and DMS-1407397. The authors are grateful to Jeha Yang for pointing out an error in an earlier version of the manuscript.

Appendix A: Proof outlines of theorem 1

In this section we provide some proof outlines. The remaining proofs and more simulation results are in the on-line supplementary material.


A.1. General theory
Recall that $n^{-1/2}X = S + N$ for signal and noise matrices $S$ and $N$, where $\|N\|_2 \to b > 0$ almost surely. We begin with a technical lemma that provides a simple condition for DPA to select the perceptible factors.

Lemma 1 (DPA selects the perceptible factors). If $U(F_{\gamma_p,H_p})^{1/2} \to b \in (0, \infty)$ almost surely, then DPA selects all perceptible factors and no imperceptible factors, almost surely.

The proof is immediate. Here, we shall provide concrete conditions when $U(F_{\gamma_p,H_p})^{1/2} \to b$ holds. Recall that the noise has the form $N = n^{-1/2}Z\Phi^{1/2}$, where the entries of $Z$ are independent random variables of mean 0 and variance 1. Then the true (population) variances are the entries of the diagonal matrix $\Phi$, and we let $H_p = \mathrm{ESD}(\Phi)$.

Proposition 2 (noise operator norm). Let $N = n^{-1/2}Z\Phi^{1/2}$, where the entries of $Z$ are independent standardized random variables with bounded $(6+\epsilon)$th moment for some $\epsilon > 0$, and $\Phi \in \mathbb{R}^{p \times p}$ is a diagonal positive semidefinite matrix with $\max_j \Phi_j \le C < \infty$. Let $p \to \infty$ with $p/n \to \gamma > 0$, whereas $\mathrm{ESD}(\Phi) \Rightarrow H$ and $\max_i \Phi_i \to U(H)$ where the limiting distribution has $U(H) < \infty$. Then $\|N\|_2 \to U(F_{\gamma,H})^{1/2}$ almost surely.

The proof of proposition 2 essentially follows from corollary 6.6 of Bai and Silverstein (2010). A small modification is needed to deal with the non-IID-ness, as explained in Dobriban et al. (2017).

Using proposition 2 we can ensure that $U(F_{\gamma_p,H_p})^{1/2} \to b$, as required in lemma 1, by showing that $U(F_{\gamma_p,H_p}) \to U(F_{\gamma,H})$. We now turn to this.

A.2. Upper edge of Marchenko–Pastur law
In this section, we shall provide conditions under which $U(F_{\gamma_p,H_p}) \to U(F_{\gamma,H})$. This statement requires a delicate new analysis of the MP law, going back to the first principles of the MP–Silverstein equation.

Theorem 3 (convergence of upper edge). Suppose that, as $n, p \to \infty$, $\gamma_p = p/n \to \gamma \in (0, \infty)$, $H_p \Rightarrow H$ and $U(H_p) \to U(H) < \infty$. Then $U(F_{\gamma_p,H_p}) \to U(F_{\gamma,H})$.

For a proof of theorem 3 see the on-line supplement.

Theorem 3 provides sufficient conditions guaranteeing the convergence of the MP upper edge. For normalized data $n^{-1/2}X = S + N$ with $N = n^{-1/2}Z\Phi^{1/2}$, as in the setting of proposition 1, these conditions are satisfied provided that the signal columns of $S$ have vanishing norms.

Proposition 3. Let the scaled data be $n^{-1/2}X = S + N$ with noise $N = n^{-1/2}Z\Phi^{1/2}$. Let $\Phi$ and $Z$ obey the conditions of proposition 2 except that the moment condition on $Z$ is increased from $6 + \epsilon$ to $8 + \epsilon$ bounded moments for some $\epsilon > 0$. Let $S$ have columns $s_1, \ldots, s_p$, and suppose that $\max_j \|s_j\|_2 \to 0$. Then $H_p \Rightarrow H$ and $U(H_p) \to U(H)$.

For a proof of proposition 3, see the on-line supplement.
Theorem 3 and proposition 3 provide a concrete set of assumptions that can be used to prove our main result for factor models.

A.3. Finishing the proof of theorem 1
It remains to check that the conditions of propositions 2 and 3 hold. In matrix form, the factor model reads X = UΨ^{1/2}Λ^T + ZΦ^{1/2}. We first normalize it to have operator norm of unit order: n^{-1/2}X = n^{-1/2}UΨ^{1/2}Λ^T + n^{-1/2}ZΦ^{1/2}.
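For concreteness, the matrix form of the model can be simulated as follows (an illustrative sketch with assumed dimension conventions: U is n × r, Ψ is the r × r diagonal matrix of factor strengths, Λ is p × r and Φ is the p × p diagonal matrix of idiosyncratic variances; the specific values are arbitrary).

import numpy as np

rng = np.random.default_rng(1)
n, p, r = 500, 300, 3
U = rng.standard_normal((n, r))                   # factor scores with IID standardized entries
Psi = np.diag([4.0, 2.0, 1.0])                    # factor strengths (illustrative)
Lam = rng.standard_normal((p, r)) / np.sqrt(p)    # small loadings, so signal columns have small norm
Phi = np.diag(rng.uniform(0.5, 1.5, size=p))      # idiosyncratic variances
Z = rng.standard_normal((n, p))

X = U @ np.sqrt(Psi) @ Lam.T + Z @ np.sqrt(Phi)   # X = U Psi^{1/2} Lam^T + Z Phi^{1/2}
print(np.linalg.norm(X / np.sqrt(n), 2))          # n^{-1/2} X has operator norm of unit order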

We need to verify conditions on signal and noise. The assumption about the idiosyncratic terms in theorem 1 satisfies the conditions of proposition 3, which are stricter than those of proposition 2. So the noise conditions are satisfied by assumption.

Now we turn to the signal S. It has columns s_1, …, s_p. For proposition 3, we need to show that M = max_j ‖s_j‖ → 0. Let ΛΨ^{1/2} = (d_1, …, d_r). Then

S = n^{-1/2}UΨ^{1/2}Λ^T = n^{-1/2} Σ_{l=1}^r u_l d_l^T.


Now s_j = n^{-1/2} Σ_{l=1}^r u_l d_l(j), where d_l(j) denotes the jth entry of d_l. Thus

‖s_j‖ ≤ n^{-1/2} Σ_{l=1}^r ‖u_l‖ |d_l(j)| ≤ n^{-1/2} max_{1≤l≤r} ‖u_l‖ max_{1≤j≤p} Σ_{l=1}^r |d_l(j)|.

Next, by definition max_j Σ_{l=1}^r |d_l(j)| = ‖ΛΨ^{1/2}‖_∞, which approaches 0 by condition 5 on factor loadings in theorem 1. It now suffices to show that max_l ‖u_l‖ = O(n^{1/2}).
Each u_l has IID entries with mean 0 and variance 1, so n^{-1/2}‖u_l‖ → 1 almost surely by the law of large numbers. The problem is to guarantee that the maximum of r such random variables also converges to 1. If r is fixed, then this is clear. A more careful analysis allows r to grow subject to the limits in assumption 4 of theorem 1, and uses the bounded moments of order ν = 4 + δ for the entries of U given by assumption 1. Specifically, with M_u(p) = M_u = max_{1≤l≤r} n^{-1/2}‖u_l‖ we can write

Pr(M_u > 2) ≤ r max_{1≤l≤r} Pr(‖u_l‖^2 − n > 3n)
            ≤ r max_{1≤l≤r} E{(‖u_l‖^2 − n)^{ν/2}}/(3n)^{ν/2}
            ≤ C r n^{−ν/2},

for some C < ∞. Thus M_u(p) is almost surely bounded if the sequence r n^{−ν/2} is summable. Summability holds by the assumption that the sequence r/n^{1+δ/4} is summable, since ν = 4 + δ gives r n^{−ν/2} = r n^{−2−δ/2} ≤ r/n^{1+δ/4} for n ≥ 1; the Borel–Cantelli lemma then implies that M_u(p) > 2 occurs only finitely often, so max_l ‖u_l‖ ≤ 2n^{1/2} for all sufficiently large n, almost surely. This finishes the proof.

References

Ahn, S. C. and Horenstein, A. R. (2013) Eigenvalue ratio test for the number of factors. Econometrica, 81, 1203–1227.
Bai, J. and Li, K. (2012) Statistical analysis of factor models of high dimension. Ann. Statist., 40, 436–465.
Bai, J. and Ng, S. (2008) Large Dimensional Factor Analysis. Boston: Now.
Bai, Z. and Silverstein, J. W. (2010) Spectral Analysis of Large Dimensional Random Matrices, 2nd edn. New York: Springer.
Baik, J., Ben Arous, G. and Peche, S. (2005) Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33, 1643–1697.
Benaych-Georges, F. and Nadakuditi, R. R. (2012) The singular values and vectors of low rank perturbations of large rectangular random matrices. J. Multiv. Anal., 111, 120–135.
Brent, R. P. (1973) Algorithms for Minimization without Derivatives. Englewood Cliffs: Prentice Hall.
Brown, T. A. (2014) Confirmatory Factor Analysis for Applied Research, 2nd edn. New York: Guilford.
Buja, A. and Eyuboglu, N. (1992) Remarks on parallel analysis. Multiv. Behav. Res., 27, 509–540.
Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A., Chen, Z., Chu, J., Carcassi, C., Contu, L., Du, R., Excoffier, L., Ferrara, G. B., Friedlaender, J. S., Groot, H., Gurwitz, D., Jenkins, T., Herrara, R. J., Huang, K., Kidd, J., Kidd, K. K., Langaney, A., Lin, A. A., Mehdi, S. Q., Parham, P., Piazza, A., Pistillo, M. P., Qian, Y., Shu, Q., Xu, J., Zhu, S., Weber, J. L., Greely, H. T., Feldman, M. W., Thomas, G., Dausset, J. and Cavalli-Sforza, L. L. (2002) A human genome diversity cell line panel. Science, 296, 261–262.
Cattell, R. B. (1966) The scree test for the number of factors. Multiv. Behav. Res., 1, 245–276.
Dobriban, E. (2015) Efficient computation of limit spectra of sample covariance matrices. Rand. Matr. Theory Appl., 4, no. 4, article 1550019.
Dobriban, E. (2017) Factor selection by permutation. Preprint arXiv:1710.00479. University of Pennsylvania, Philadelphia.
Dobriban, E., Leeb, W. and Singer, A. (2017) Optimal prediction in the linearly transformed spiked model. Preprint arXiv:1709.03393. University of Pennsylvania, Philadelphia.
Gagnon-Bartsch, J. A., Jacob, L. and Speed, T. P. (2013) Removing unwanted variation from high dimensional data with negative controls. Technical Report. University of California at Berkeley, Berkeley.
Gavish, M. and Donoho, D. L. (2014) The optimal hard threshold for singular values is 4/sqrt(3). IEEE Trans. Inform. Theory, 60, 5040–5053.
Gerard, D. and Stephens, M. (2017) Unifying and generalizing methods for removing unwanted variation based on negative controls. Preprint arXiv:1705.08393.
Glorfeld, L. W. (1995) An improvement on Horn's parallel analysis methodology for selecting the correct number of factors to retain. Educ. Psychol. Measmnt, 55, 377–393.
Hachem, W., Hardy, A. and Najim, J. (2016) Large complex correlated Wishart matrices: fluctuations and asymptotic independence at the edges. Ann. Probab., 44, 2264–2348.
Horn, J. L. (1965) A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.


Johnstone, I. M. (2001) On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29, 295–327.
Kaiser, H. F. (1960) The application of electronic computers to factor analysis. Educ. Psychol. Measmnt, 20, 141–151.
Lee, J. O. and Schnelli, K. (2016) Tracy–Widom distribution for the largest eigenvalue of real sample covariance matrices with general population. Ann. Appl. Probab., 26, 3786–3839.
Leek, J. T. and Storey, J. D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLOS Genet., 3, no. 9, article e161.
Leek, J. T. and Storey, J. D. (2008) A general framework for multiple testing dependence. Proc. Natn. Acad. Sci. USA, 105, 18718–18723.
Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., Cann, H. M., Barsh, G. S., Feldman, M., Cavalli-Sforza, L. L. and Myers, R. M. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319, 1100–1104.
Malinowski, E. R. (2002) Factor Analysis in Chemistry, 3rd edn. New York: Wiley.
Marchenko, V. A. and Pastur, L. A. (1967) Distribution of eigenvalues for some sets of random matrices. Mat. Sb., 114, 507–536.
Nadakuditi, R. R. (2014) Optshrink: an algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage. IEEE Trans. Inform. Theory, 60, 3002–3018.
Nadakuditi, R. R. and Edelman, A. (2008) Sample eigenvalue based detection of high-dimensional signals in white noise using relatively few samples. IEEE Trans. Signl Process., 56, 2625–2638.
Owen, A. B. and Wang, J. (2016) Bi-cross-validation for factor analysis. Statist. Sci., 31, 119–139.
Patterson, N., Price, A. and Reich, D. (2006) Population structure and eigenanalysis. PLOS Genet., 2, no. 2, article e190.
Paul, D. and Aue, A. (2014) Random matrix theory in statistics: a review. J. Statist. Planng Inf., 150, 1–29.
Perry, P. O. (2009) Cross-validation for unsupervised learning. PhD Thesis. Stanford University, Stanford.
Perry, P. O. and Owen, A. B. (2010) A rotation test to verify latent structure. J. Mach. Learn. Res., 11, 603–624.
Silverstein, J. W. and Choi, S.-I. (1995) Analysis of the limiting spectral distribution of large dimensional random matrices. J. Multiv. Anal., 54, 295–309.
Sun, Y., Zhang, N. R. and Owen, A. B. (2012) Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Statist., 6, 1664–1688.
Vitale, R., Westerhuis, J. A., Næs, T., Smilde, A. K., de Noord, O. E. and Ferrer, A. (2017) Selecting the number of factors in principal component analysis by permutation testing—numerical and practical aspects. J. Chemometr., 31, no. 12, article e2937.
Yao, J., Bai, Z. and Zheng, S. (2015) Large Sample Covariance Matrices and High-dimensional Data Analysis. New York: Cambridge University Press.
Zhou, Y.-H., Marron, J. and Wright, F. A. (2017) Eigenvalue significance testing for genetic association. Biometrics, 74, 439–447.
Zwick, W. R. and Velicer, W. F. (1986) Comparison of five rules for determining the number of components to retain. Psychol. Bull., 99, no. 3, 432–442.

Supporting information
Additional 'supporting information' may be found in the on-line version of this article.