Models for Millions
Bob Stine
Department of Statistics
The Wharton School, University of Pennsylvania
stat.wharton.upenn.edu/~stine
34th NJ ASA Spring Symposium
June 7, 2013


Wharton Department of Statistics


Statistics in the News
Hot topics
  Big Data
  Business Analytics
  Data Science
Are the authors talking about statistics? Or about …
  information systems?
  database technology?
  visualization, eye candy?



Even Farming...


Big Data
Recent modeling projects
Credit scoring
  75,000 cases
  15,000+ possible explanatory variables
Spatial time series
  3,000 locations
  100 time points
  20+ features at each location and time
Text
  Real estate listings
  6,000 prices, millions of possible descriptions
Tagging
  1.2 million words, 60,000+ ‘explanatory variables’
Notation
  n = # rows of X
  p = # columns of X

Is Big Data Really So Big?
Not always so large as they may seem
  Repeated measurement ≠ more degrees of freedom
  What is the relevant source of variation?
Transfer learning problem
  Machine learning
  Build model for structure of text on a corpus such as the New York Times
  What transfers from that model to
    Washington Post?
    Richmond Times-Dispatch?
Implications for estimates of standard error


Example of Dependence
Predict returns on mutual funds

Do funds that do well in one year anticipate doing well (or poorly) the next year?







What’s happening?

Does Big Data Imply Big Models?
Perhaps all one needs is a very simple analysis
Google
  Massive hardware
  Extensive data
Text modeling
  Hard problem: predict next word in sentence
    I took a walk ____
  Tabulation of all 5-grams (5-word sequences)
  Replace modeling with frequency table
Web page design
  Continuous experimentation
  Randomized, two-sample t-test
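
The "replace modeling with frequency table" idea for text can be sketched in a few lines. This is only an illustration on a toy sentence, not the system described in the talk; the function names `build_ngram_table` and `predict_next` and the toy corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n=5):
    """Tabulate, for every (n-1)-word context, how often each next word follows."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict_next(table, context):
    """Return the most frequent continuation of the context, or None if unseen."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Toy corpus (hypothetical); a real table would be built from a very large corpus.
corpus = ("i took a walk today and i took a walk today again "
          "i took a walk home").split()
table = build_ngram_table(corpus, n=5)
guess = predict_next(table, ["i", "took", "a", "walk"])
print(guess)
```

No model is fit here: prediction is a lookup, which is why the approach depends on extensive data rather than on modeling.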


Simple Models Can Be Better
Association rules
  Low tech…
  Build tables
  Identify association
  Low-tech ≠ low impact… grab low-hanging fruit
Predictive modeling via support vector machine
  High tech…
  Locate separating hyperplanes in kernel space
  Identify predictive features
  High-tech ≠ high impact… complexity vs communication


Simple might be right!
Recent WSJ story on reproducibility and proliferation of research...


Attractive Misconceptions*
Thinking the true predictor is in my data rather than running an experiment
  Reject inference and white cars
  Training: we give students the data
Outliers don’t matter with millions of cases
  Central limit theorem
  Corollary: estimators are normally distributed.
Methods are black boxes
  Lasso is popular, so it’s best for my application.
Cross-validation keeps me out of trouble
  As long as the model validates well out-of-sample, the predictions are reliable.
* i.e., lessons I have learned the hard way.

Plan
Familiar context
  Fit LS regression of continuous Y to large collection of possible explanatory variables
Two themes
  Reducing dimensions
    Columns: Random projections
    Rows: Subsampling
  Streaming
    Sequential from rows
    Sequential from columns
  Mixtures of the two (VIF regression)
Comments
  Regularization (shrinkage) can be added
  Where are the Bayesian models?


Reducing Columns
Context
  PCA, common column scales
  Huge p >> n
Random projection
  Methods based on random projection have revived interest in PCA
Idea
  Use random projections to reduce the data matrix to a size amenable to calculation.
  Explanatory variables in n × p matrix X
  Pick d << p
  Multiply X by a p × d matrix of random numbers Ω so that the resulting dimension is n × d.
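
A minimal sketch of the projection step, assuming Gaussian entries for Ω (the slide says only "random numbers") and a hypothetical helper name `random_projection`:

```python
import numpy as np

def random_projection(X, d, rng):
    """Multiply the n x p matrix X by a p x d random matrix Omega."""
    p = X.shape[1]
    Omega = rng.standard_normal((p, d))   # Gaussian entries are an assumption
    return X @ Omega                      # result is n x d

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10_000))    # stand-in data with n=200, p=10,000
P = random_projection(X, d=100, rng=rng)
print(P.shape)
```

The cost is one matrix product; nothing of size p × p is ever formed.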


Arcene Example
Automation
  Automated data collection produces extensive measurements, here p=10,000 features
  Only n=200 cases
Arcene example from UCI
  Mass spectrometer measurements
  Origin: Separate normal cells from cancerous cells
  Make into a regression problem
  Use continuous response, not the 0/1 indicator in the repository
Complications galore…
  Collinear: sampling smooth function
  Too many ‘perfect’ solutions
  Hard to test out-of-sample because so few cases
UCI = Univ of California, Irvine ML databases, http://archive.ics.uci.edu

Marginal Analysis
Marginal correlations $(X_i, Y)$ show signal
  Deviate from distribution of random noise (red)
But: weakly spread over many coordinates
  Multiple regression finds weak effects
  $R^2 = 0.19$ is larger than might expect
[Figure: distribution of marginal correlations, horizontal axis from -0.4 to 0.3; annotation $R^2 = 0.19$]
Null: Expect $R^2 \approx p/n = 10/200 = 0.05$
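
The null benchmark can be checked by simulation. The sketch below assumes the slide's p = 10 is the number of predictors in the regression and uses pure-noise data, so it says nothing about Arcene itself; `r_squared` is a helper defined only for this example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

rng = np.random.default_rng(1)
n, p, reps = 200, 10, 500
# Response and predictors are independent noise, so any fit is spurious.
null_r2 = [r_squared(rng.standard_normal((n, p)), rng.standard_normal(n))
           for _ in range(reps)]
mean_null_r2 = float(np.mean(null_r2))
print(round(mean_null_r2, 3))
```

The average lands near p/n, which is the yardstick against which the observed 0.19 is being judged.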

PCA Analysis
Compute singular value decomposition
    $X = U D V'$
  Columns of U, V are orthonormal
  D is a diagonal matrix of singular values (spectrum of X)
Doable in R if X is 200×10,000 matrix
Regression finds clear, strong effect in $U_5$
  $R^2 = 0.27$
[Figure: Spectrum of X, singular values for components 0 to 50]

Random Projection
Project down to smaller size
  Example with d=100
  Compare random projections to exact from R
Procedure
  $P_0 = X\Omega$, where Ω is a 10,000×d random matrix
  $P_1 = XX'P_0$ is one step of the power method
  Take first few columns of U from SVD of $P_j$
Compare to fit with exact SVD
[Figure panels: Random Projection (one iteration); Exact]
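
The procedure can be sketched as below. The data are a synthetic low-rank matrix plus noise (an assumption, standing in for Arcene), Ω is taken to be Gaussian, and `randomized_left_vectors` is a hypothetical name. Agreement with the exact SVD is measured by the cosines of the principal angles between the two subspaces, since the coordinates themselves need not match.

```python
import numpy as np

def randomized_left_vectors(X, d, k, power_steps=1, rng=None):
    """Approximate the top-k left singular vectors of X by random projection."""
    rng = np.random.default_rng() if rng is None else rng
    P = X @ rng.standard_normal((X.shape[1], d))   # P0 = X Omega
    for _ in range(power_steps):
        P = X @ (X.T @ P)                          # Pj = X X' P(j-1), power step
    U, _, _ = np.linalg.svd(P, full_matrices=False)
    return U[:, :k]

# Synthetic stand-in data: rank-5 signal plus small noise.
rng = np.random.default_rng(2)
n, p, d, k = 100, 500, 20, 5
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
X += 0.01 * rng.standard_normal((n, p))

U_rand = randomized_left_vectors(X, d, k, power_steps=1, rng=rng)
U_exact = np.linalg.svd(X, full_matrices=False)[0][:, :k]

# Cosines near 1 mean the two k-dimensional subspaces nearly coincide.
cosines = np.linalg.svd(U_exact.T @ U_rand, compute_uv=False)
print(cosines.min())
```

Note that X X' is never formed; the power step is two thin matrix products.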

Comparison of Fits
Reconstruction
  Random projection preserves subspace holding range of matrix, but not necessarily in the same coordinates.
  E.g., different components appear in regression
Comparison of fits shows same subspace
[Figure: two scatterplots against the SVD Regression Fit, axes from -3 to 3, with correlations r = 0.94 and r = 0.999]



A really big X matrix?
Arcene example is ‘small’: we can do the exact SVD quickly in R.
Suppose X had more columns, say
    $10{,}000^2 = 100{,}000{,}000$ (okay, half that)
such as from the interaction space of X.
Linear models often approximate non-linear structure…
[Figure: first 10 PCs of X]

Random Projection
Random projection with 50,000,000 explanatory variables $(X_j X_k)$
  Cannot compare to the exact solution for this one
  Runs ‘quickly’: about 5 minutes on laptop!
Fitted model on 5 elements of the random projection of the quadratic X’s
[Figure labels: $R^2 = 0.23 \to R^2 = 0.46 \to R^2 = 0.57$; One power iteration]

Postscript...
What’s the response in that regression?
  What Y variable lives in the quadratic space?
Short answer: Kernel trick
  Compute the quadratic kernel of the data
  Find the SVD
  Let Y be one of the singular vectors
Story for another day ...


Reducing Rows
Context
  Very large n >> moderate p
  Again, less interested in selecting specific Xs
Common sense
  Don’t need to fit a model more precisely than needed for statistical precision/selection.
  However… more data reveals a more interesting model, one with subtle effects
Speed of OLS
    $b = (X'X)^{-1}X'Y$
  Slow part if n >> p is computing X’X



Case Sampling
Exploit familiar property of regression
  Precision of slope is maximized by finding cases with large variation in Xs
  Task becomes finding cases with high leverage
Machine learning has developed methods to seek high-leverage points
  Hard to find sequentially
Simple improvement
  Sample m << n cases to estimate X’X
  Use all n cases to estimate X’Y
Leverage points however may not be your friends in modeling large data sets...
Not sampling on the response!
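
The "simple improvement" can be sketched as follows. The sketch samples rows uniformly at random (an assumption; it does not attempt leverage-based sampling) and rescales the subsampled cross-product by n/m so that it estimates the full X’X. The name `subsampled_ols` and the simulated data are illustrative only.

```python
import numpy as np

def subsampled_ols(X, y, m, rng):
    """Estimate X'X from m sampled rows; use all n rows for X'Y."""
    n = X.shape[0]
    rows = rng.choice(n, size=m, replace=False)   # chosen without looking at y
    XtX_hat = (n / m) * (X[rows].T @ X[rows])     # rescaled estimate of X'X
    Xty = X.T @ y                                 # exact, one pass over all cases
    return np.linalg.solve(XtX_hat, Xty)

rng = np.random.default_rng(3)
n, p, m = 50_000, 5, 10_000
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.standard_normal(n)

b_sub = subsampled_ols(X, y, m, rng)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(b_sub - b_ols)))
```

Because the rows are picked before the response is consulted, the sampling does not select on Y; the price is the extra error from estimating X’X on m rather than n cases.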

Outliers in Big Data
Sparse data
  n=10,000
  X ≈ 0: Y=0 for 9,990, Y=1 for 10
  X ≈ 1: Y=1 for one case
What’s the appropriate p-value?
Classical OLS
  Use residual after fit slope, as if right model
  t ≈ 10, pick your level of significance!
Common sense
  p = 1/1000 more sensible p-value





Streaming Methods


Streaming Cases
Context
  Huge number of cases, more than memory holds
Idea
  Compute estimates as the data are read in, so all of the data need not be stored
  Calculations can be split over network
Different take on OLS
  OLS estimate for n-1 cases
    $b_{n-1} = (X'X)^{-1}X'Y$
  The estimate for n cases is
    $b_n = b_{n-1} + (X'X)^{-1} x_n (y_n - x_n' b_{n-1})/(1 + h_n)$
    $\;\;\;\; = b_{n-1} + [(1 + h_n)(X'X)]^{-1} x_n e_n$
  where $e_n = y_n - x_n' b_{n-1}$, the leverage is $h_n = x_n'(X'X)^{-1}x_n$, and X’X is computed from the first n-1 cases.
[Callout on the update formula: slow step]
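
A sketch of the one-case-at-a-time update, checked against a batch fit. To keep $(X'X)^{-1}$ current without re-inverting, the sketch also applies the Sherman-Morrison identity, which is a standard companion to this formula but is not shown on the slide; `rls_update` and the simulated data are assumptions.

```python
import numpy as np

def rls_update(b, M, x, y):
    """One streaming OLS step; M is (X'X)^{-1} for the cases seen so far."""
    Mx = M @ x
    h = x @ Mx                                 # h_n = x_n' (X'X)^{-1} x_n
    e = y - x @ b                              # error of the previous estimate on the new case
    b_new = b + Mx * (e / (1.0 + h))           # b_n = b_{n-1} + (X'X)^{-1} x_n e_n / (1 + h_n)
    M_new = M - np.outer(Mx, Mx) / (1.0 + h)   # Sherman-Morrison update of the inverse
    return b_new, M_new

rng = np.random.default_rng(4)
n, p, n0 = 500, 4, 20
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.standard_normal(n)

# Start from a small batch fit, then stream the remaining cases.
M = np.linalg.inv(X[:n0].T @ X[:n0])
b = M @ (X[:n0].T @ y[:n0])
for x_i, y_i in zip(X[n0:], y[n0:]):
    b, M = rls_update(b, M, x_i, y_i)

b_batch = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(b - b_batch)))
```

Each update costs on the order of p² operations for the products with M, which becomes expensive when p is large.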

Stochastic Gradient
Build up normal equations and solutions by randomly sampling cases
Stochastic gradient
  Robbins & Monro
  To minimize $(y_i - x_i'b)^2$ w.r.t. b, step in the direction of the negative gradient,
    $x_i (y_i - x_i'b) = x_i e_i$
Full least squares solution uses X’X
    $b_n = b_{n-1} + [(1 + h_n)(X'X)]^{-1} x_n e_n$
Pretend X’X is diagonal, and life moves faster
    $b^*_n = b^*_{n-1} + \delta_n D^{-1} x_n e^*_n$
  with D = diagonal(X’X) and $\delta_n$ a learning rate.
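
A one-pass sketch of the diagonal update. The slide does not say how D or $\delta_n$ are chosen in practice; here D is accumulated as a running sum of squared features, started at a constant `d0` to damp the first few steps, and $\delta_n$ is held at 1. Those choices, the name `sgd_diag`, and the simulated nearly uncorrelated features are all assumptions.

```python
import numpy as np

def sgd_diag(X, y, delta=1.0, d0=10.0):
    """Single pass of b <- b + delta * D^{-1} x e with a running diagonal D."""
    n, p = X.shape
    b = np.zeros(p)
    D = np.full(p, d0)              # running diagonal of X'X (d0 is a damping assumption)
    for x_i, y_i in zip(X, y):
        D += x_i ** 2               # update the diagonal with the new case
        e = y_i - x_i @ b           # prediction error e*_n
        b += delta * x_i * e / D    # elementwise division plays the role of D^{-1}
    return b

rng = np.random.default_rng(6)
n, p = 20_000, 5
X = rng.standard_normal((n, p))     # little collinearity by construction
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + 0.5 * rng.standard_normal(n)

b_sg = sgd_diag(X, y)
print(np.max(np.abs(b_sg - beta)))
```

Each case costs on the order of p operations, with no p × p matrix stored. With strongly collinear features the diagonal approximation would be much less accurate.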


How fast is it?
Goal in stochastic gradient is to run as fast as you can read data!

n         p        OLS      SG
2,500     500      <0.1     <0.1
5,000     1,000    0.7      0.2
10,000    2,000    9.5      0.9
20,000    4,000    84       3.7
40,000    8,000    675      25
80,000    16,000   5394     276
100,000   20,000   10480    312

n = 5p
[Figure: horizontal axis Number of Features, 5,000 to 20,000]



How good are estimates?
Graph plots estimated coefficients from one pass of stochastic gradient versus exact OLS
  Deviation from OLS below standard error
  Small error relative to variation in estimates
At least when there is not much collinearity!


Statistical Significance?
Don’t have X’X so don’t have usual SE
  How to evaluate modeling?
Cross-validation
  Less sensitive to modeling assumptions
  Split data
    Training data: Fit model on part of the data
    Test data: Reserved data
  Compare fit in two datasets
Three-way split becoming necessary
  Training data
  Tuning data… Set tuning parameters, such as level of shrinkage
  Testing data

Population Drift
Cross-validation is an optimistic assessment
  One of the few places where we have a random sample
Credit scoring
  Predict performance of applicants
  Cross-validation shows model spot on
Data collection is a long process
  Gather data over 1-2 years
  Takes 1-2 more years to find the response
The world changed!
  Booming economy during data collection
  Collapsing recession when implemented
  No way CV could see this problem
More issues ...
  Variation?
  How to allocate?

Streaming Variables
Context
  Huge number of variables
  Want to preserve scales
Idea
  Stepwise search pays a large cost for searching
    Bonferroni p-value threshold 0.05/millions
  Streaming: Examine features one at a time
  Resembles forward stepwise, but without sorting/ordering based on p-values
Exploit context
  “Scientist” orders variables, defines search strategy
  Adaptive: Build interactions as features added


Feature Auction
Collection of experts bid for the opportunity to recommend a feature
Auction collects winning bid $\alpha_2$
Expert supplies values of recommended feature $X_w$
Expert receives payoff ω if $p_w \le \alpha_2$
Experts only learn if the bid was accepted, not the value of b or the p-value.
[Diagram labels: Source Experts; Expert2 ...; bids $\alpha_1$, $\alpha_2$; $X_{accepted}$, $X_{rejected}$]







Experts
Expert
  Strategy for creating list of features. Experts embody domain knowledge, science of application.
Source experts
  A collection of measurements (e.g., synonyms, clusters)
  Components of a subspace basis (PCA, RKHS)
  Lags of a time series
Scavenger experts
  Interactions
    - among features accepted into model
    - among features rejected by model
    - between those accepted with those rejected
  Transformations
    - segmenting, as in scatterplot smoothing
    - polynomial transformations


Winning Experts
Expert is rewarded if correct
  Experts have alpha-wealth
  If recommended feature is accepted in the model, expert earns ω additional wealth
  If recommended feature is refused, expert loses bid
As auction proceeds, it...
  Rewards experts that offer useful features.
  Eliminates experts whose features are not accepted.
  Taxes fund scavenger experts
  Ensures that the auction continues to control overall FDR
Critical
  Adjust for multiplicity
  p-values determine useful features


Robust Standard Errors
p-values are critical, but...
  Error structure often heteroscedastic
  Observations frequently dependent
Dependence
  “Observations”
  Spatial time series at multiple locations
  Documents from various news feeds
  Transfer learning problem
Examples
  Use sandwich-type estimate of standard error
  heteroscedasticity
    $\mathrm{var}(b) = (X'X)^{-1} X' E(ee') X (X'X)^{-1} = (X'X)^{-1} X' D^2 X (X'X)^{-1}$
  dependence
    $\mathrm{var}(b) = (X'X)^{-1} X' E(ee') X (X'X)^{-1} = \sigma^2 (X'X)^{-1} X' B X (X'X)^{-1}$

Flashback...
Heteroscedastic error
  Estimate standard error with outlier
  Sandwich estimator allowing heteroscedastic error variances gives a t-stat ≈ 1, not 10.
Dependent error
  Even more important need for accurate SE
  Netflix example
    Bonferroni (or hard thresholding) overfits due to dependence in responses.
  Spatial modeling
    Everything seems significant unless dependence is incorporated into the calculation of the SE
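
The outlier example from the earlier slide can be reproduced in a few lines. The sketch takes X to be exactly 0 or 1 with the counts given there, and it computes the sandwich variance from residuals of the intercept-only (null) fit; the slides do not say which residuals are used, so that choice is an assumption. The classical t in this exact configuration is far above conventional cutoffs; its precise value depends on details the slides do not give.

```python
import numpy as np

# Counts as listed on the outlier slide: 9,990 + 10 cases at X = 0, one case at X = 1.
x = np.zeros(10_001)
x[-1] = 1.0
y = np.zeros(10_001)
y[:10] = 1.0          # ten cases with Y = 1 at X = 0
y[-1] = 1.0           # the single case at X = 1 also has Y = 1
n = len(y)

xc = x - x.mean()
sxx = xc @ xc
slope = (xc @ y) / sxx
intercept = y.mean() - slope * x.mean()

# Classical OLS: residuals after fitting the slope, constant error variance.
resid = y - intercept - slope * x
s2 = (resid @ resid) / (n - 2)
t_classical = slope / np.sqrt(s2 / sxx)

# Sandwich form (X'X)^{-1} X' D^2 X (X'X)^{-1} for the slope, with D^2 taken
# from squared residuals of the intercept-only fit (an assumption).
e0 = y - y.mean()
var_sandwich = np.sum(xc ** 2 * e0 ** 2) / sxx ** 2
t_sandwich = slope / np.sqrt(var_sandwich)

print(t_classical, t_sandwich)
```

The single high-leverage case is fit perfectly once the slope is in the model, so its residual vanishes and the classical standard error ignores it; using residuals from the null fit keeps that case's full deviation in the variance estimate.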


Control for Over Fitting
Alpha investing
  Test possibly infinite sequence of m hypotheses
    $H_1, H_2, H_3, \ldots, H_m, \ldots$
  obtaining the p-values $p_1, p_2, \ldots$
Procedure
  Start with an initial alpha wealth $W_0$
  Invest wealth $0 \le \alpha_j \le W_j$ in the test of $H_j$
  Change in wealth depends on test outcome
    If reject, wealth goes up by payout $\omega - \alpha_j$
    If don’t reject, wealth goes down by $\alpha_j$
Properties
  Controls expected false discovery rate
  Can reproduce Bonferroni or FDR methods
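
The wealth bookkeeping can be sketched as below, using the update rule exactly as stated on this slide. How much to invest in each test is left open by the slide; bidding a fixed fraction of current wealth is an assumption, as are the default values and the name `alpha_investing`. The published procedure's payoff rule may differ in detail from this simplified version.

```python
def alpha_investing(p_values, w0=0.05, omega=0.05, bid_fraction=0.1):
    """Test hypotheses in order, tracking alpha-wealth as on the slide."""
    wealth = w0
    rejected = []
    for j, p in enumerate(p_values):
        alpha_j = bid_fraction * wealth   # invest 0 <= alpha_j <= W_j (bidding rule is an assumption)
        if p <= alpha_j:
            rejected.append(j)
            wealth += omega - alpha_j     # reject: wealth goes up by omega - alpha_j
        else:
            wealth -= alpha_j             # do not reject: wealth goes down by alpha_j
    return rejected, wealth

# Hypothetical stream of p-values, in the order the features arrive.
rejected, wealth = alpha_investing([1e-6, 0.5, 0.8, 1e-5, 0.9])
print(rejected, wealth)
```

Because each bid is a fraction of the current wealth, wealth never reaches zero here, and each rejection replenishes the budget for later tests; that is what allows the sequence of hypotheses to be open-ended.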


Auction Run
First 4,000 rounds of auction modeling.
[Figure: horizontal axis Auction Round, 0 to 4,000]












Streaming Cases & Variables
Background
  A variance inflation factor (VIF) is a diagnostic for collinearity in regression
VIF compares variances of slope estimates
  Variance of $b_k$ were it uncorrelated with others
    $\mathrm{var}(b_k) = s^2/(x_k'x_k)$
  Actual variance is larger due to collinearity
    $\mathrm{var}(b_k) \approx \mathrm{VIF}_k\, s^2/(x_k'x_k)$, where $1 \le \mathrm{VIF}_k = 1/(1 - R^2_{k|\mathrm{rest}})$
Handy interpretation
  Is $x_k$ not significant because
    It is not useful?
    Redundant?
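
The definition $\mathrm{VIF}_k = 1/(1 - R^2_{k|\mathrm{rest}})$ translates directly into code: regress each column on the others and compare total to residual sum of squares. The helper name `vif` and the simulated data, in which the third column nearly duplicates the first, are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF_k = 1 / (1 - R^2) from regressing column k on the rest plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for k in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
        resid = X[:, k] - others @ coef
        tss = np.sum((X[:, k] - X[:, k].mean()) ** 2)
        out[k] = tss / (resid @ resid)    # equals 1 / (1 - R^2)
    return out

rng = np.random.default_rng(7)
n = 500
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + 0.1 * rng.standard_normal(n)    # nearly redundant with x1
v = vif(np.column_stack([x1, x2, x3]))
print(v)
```

A large VIF flags a variable whose slope is imprecise because it is redundant, which is the distinction drawn in the interpretation above.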


VIF Regression
Idea
  Speed up the slow step in forward stepwise
Usual selection
  Has variables X and residual
    $e = (I - X(X'X)^{-1}X')\,y = (I - H)\,y$
  Partial t-statistic for testing another variable z with partial regression $z^* = (I - H)z$
    $t^2 = (z^{*\prime}e)^2 / (s^2\, z^{*\prime}z^*)$
Re-express t-statistic using VIF
    $t^2 = (z'e)^2\, \mathrm{VIF}_z / (s^2\, z'z)$
  since $z^{*\prime}e = z'e$ and, for centered z, $z^{*\prime}z^* = z'z/\mathrm{VIF}_z$ with $\mathrm{VIF}_z = 1/(1 - R^2_{z|X})$
Conservatively estimate $\mathrm{VIF}_z$ from subsample
[Callout: $O(np^2)$ given $(X'X)^{-1}$]
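
A numerical check of the re-expression, under the definition $\mathrm{VIF} = 1/(1 - R^2)$ from the background slide. With z centered and an intercept in X, $z^{*\prime}z^* = z'z/\mathrm{VIF}_z$ and $z^{*\prime}e = z'e$, so the correction enters as a multiplication by the VIF in this sketch. The subsample estimate below is a plain plug-in, not the conservative estimate the slide mentions; the data and the helper `residualize` are assumptions.

```python
import numpy as np

def residualize(A, v):
    """Residual of v after least-squares projection onto the columns of A."""
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

rng = np.random.default_rng(5)
n, p = 2_000, 5
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
z = 0.8 * X[:, 1] + rng.standard_normal(n)     # candidate correlated with X
z = z - z.mean()
y = X @ rng.standard_normal(p + 1) + 0.5 * z + rng.standard_normal(n)

e = residualize(X, y)                          # e = (I - H) y
s2 = (e @ e) / (n - X.shape[1])

# Exact partial t^2 requires sweeping X out of z (the slow step).
z_star = residualize(X, z)
t2_exact = (z_star @ e) ** 2 / (s2 * (z_star @ z_star))

# Same quantity written with the VIF computed on all rows.
vif_full = (z @ z) / (z_star @ z_star)
t2_full = (z @ e) ** 2 * vif_full / (s2 * (z @ z))

# Fast version: estimate the VIF on a subsample of rows only.
rows = rng.choice(n, size=500, replace=False)
z_sub = z[rows] - z[rows].mean()
z_sub_star = residualize(X[rows], z[rows])
vif_sub = (z_sub @ z_sub) / (z_sub_star @ z_sub_star)
t2_fast = (z @ e) ** 2 * vif_sub / (s2 * (z @ z))

print(t2_exact, t2_full, t2_fast)
```

The full-data version matches the exact statistic to rounding error; the subsample version trades a small error in the VIF for not having to sweep X out of every candidate z on all n rows.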

Performance
Faster than rivals
  Plus smaller out-of-sample error
[Figure: run time in seconds, n=1000, p=500]

Comment on L1
Success of lasso depends on nature of underlying model
Risk comparison
  Compare the risk of the model identified by subset selection to the model identified by lasso (L1).
  Grey region in plot represents possible model datasets
Take-away
  In models for which lasso identifies high penalty, L0 has better performance.
  Why? Lasso shrinks them all.




Wrap-Up
Dimension reduction
  Random projection
  Subsampling
Streaming
  VIF regression
  Alpha investing, auction models
Issues
  Importance of substantive insight
  Prediction/association vs causation
  Dependence, population drift


References
Stochastic gradient
  Papers of John Langford, Microsoft Research
Random projection
  Halko, Martinsson, and Tropp, SIAM Review, 2011
VIF regression
  “VIF Regression: A Fast Regression Algorithm for Large Data”, JASA, 2011, Lin, Foster and Ungar
Alpha investing
  “α-investing: a procedure for sequential control of expected false discoveries”, JRSSB, 2006
Improved stepwise regression
  “Variable selection in data mining: Building a predictive model for bankruptcy”, JASA, 2004
Streaming feature selection
  “Streamwise feature selection”, JMLR, 2006, with Foster, Ungar, and Zhou.
