A REVIEW OF CONCEPTUAL APPROACHES AND EMPIRICAL EVIDENCE ON PROBABILITY AND NONPROBABILITY SAMPLE SURVEY RESEARCH
CARINA CORNESSE*
ANNELIES G. BLOM
DAVID DUTWIN
JON A. KROSNICK
EDITH D. DE LEEUW
STÉPHANE LEGLEYE
JOSH PASEK
DARREN PENNAY
BENJAMIN PHILLIPS
JOSEPH W. SAKSHAUG
BELLA STRUMINSKAYA
ALEXANDER WENZ
There is an ongoing debate in the survey research literature about whether and when probability and nonprobability sample surveys produce accurate estimates of a larger population. Statistical theory provides a justification for confidence in probability sampling as a function of the survey design, whereas inferences based on nonprobability sampling are entirely dependent on models for validity. This article reviews the current debate about probability and nonprobability sample surveys. We describe the conditions under which nonprobability sample surveys may provide accurate results in theory and discuss empirical evidence on which types of samples produce the highest accuracy in practice. From these theoretical and empirical considerations, we derive best-practice recommendations and outline paths for future research.
CARINA CORNESSE is a postdoctoral researcher; she is with the Collaborative Research Center SFB 884 "Political Economy of Reforms" at the University of Mannheim, 68131 Mannheim, Germany. ANNELIES G. BLOM is Professor for Data Science at the Department of Political Science, School of Social Sciences, and Principal Investigator of the German Internet Panel (GIP) at the Collaborative Research Center (SFB 884) "Political Economy of Reforms" at the University of Mannheim, 68131 Mannheim, Germany. DAVID DUTWIN is Senior Vice President; he is with NORC at the University of Chicago, 55 East Monroe Street, 30th Floor, Chicago, IL 60603, USA. JON A. KROSNICK is Frederic O. Glover Professor in Humanities and Social Sciences, Professor of Communication, and Professor of Political Science at Stanford University, 432 McClatchy Hall, 450 Serra Mall, Stanford, CA 94305, USA. EDITH D. DE LEEUW is Professor Emerita of Survey Methodology at the Department of Methodology and Statistics at Utrecht University, Sjoerd Groenmangebouw, Padualaan 14, 3584 CH Utrecht, Netherlands. STÉPHANE LEGLEYE is Head of the Team "Drug Use in the General Population" at CESP, Inserm, faculté de médecine UVSQ, faculté de médecine, Université Paris-Saclay, Université Paris-Sud, and Head of the "Household Living Conditions" Unit at Institut National de la Statistique et des Études Économiques, 88, avenue Verdier, CS 70058, 92541 Montrouge cedex, France. JOSH PASEK is Associate Professor, Department of Communication and Media, and Faculty Associate, Center for Political Studies, Institute for Social Research at the University of Michigan, 105 S. State Street, 5413 North Quad, Ann Arbor, MI 48109, USA. DARREN PENNAY is Founder, Board Member, and Executive Director Research, Methods & Strategy at the Social Research Centre, Campus Visitor at the Australian National University Centre for Social Research and Methods, Level 9, 277 William St, Melbourne, Victoria, Australia 3000, and Adjunct Professor with the Institute for Social Science Research (ISSR), University of Queensland. BENJAMIN PHILLIPS is Senior Research Director, Survey Methodology, at the Social Research Centre, Level 9, 277 William St, Melbourne, Victoria, Australia 3000, and Campus Visitor at the Centre for Social Research & Methods, Australian National University. JOSEPH W. SAKSHAUG is Distinguished Researcher, Head of the Data Collection and Data Integration Unit, and Acting Head of the Statistical Methods Research Department at the Institute for Employment Research (IAB), Regensburger Str. 104, 90478 Nuremberg, Germany, University Professor of Statistics in the Department of Statistics at the Ludwig Maximilian University of Munich, 80539 Munich, Germany, and Honorary Full Professor in the School of Social Sciences at the University of Mannheim, 68131 Mannheim, Germany. BELLA STRUMINSKAYA is Assistant Professor with the Department of Methodology and Statistics at the University of Utrecht, Sjoerd Groenmangebouw, Padualaan 14, 3584 CH Utrecht, Netherlands. ALEXANDER WENZ is a postdoctoral researcher with the Collaborative Research Center SFB 884 "Political Economy of Reforms" at the University of Mannheim, 68131 Mannheim, Germany, and the Institute for Social and Economic Research at the University of Essex. The authors would like to thank the Collaborative Research Center (SFB) 884 "Political Economy of Reforms" (projects A8 and Z1), funded by the German Research Foundation (DFG), for organizing the Workshop on Probability-based and Nonprobability Survey Research in July 2018 at the University of Mannheim. In addition, the authors would like to thank Melvin John, Sabrina Seidl, and Nourhan Elsayed for their assistance with manuscript preparation. The authors are grateful to Andrew Mercer for valuable input and feedback on earlier versions of this manuscript. The order of authors is alphabetical after the lead author.

*Address correspondence to Carina Cornesse, Collaborative Research Center SFB 884 "Political Economy of Reforms," University of Mannheim, 68131 Mannheim, Germany, E-mail: [email protected]
doi: 10.1093/jssam/smz041. Advance access publication 9 January 2020.
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Journal of Survey Statistics and Methodology (2020) 8, 4–36
KEYWORDS: Accuracy; Nonprobability sampling; Probability sampling; Survey data quality; Survey inference; Weighting adjustments.
1. INTRODUCTION
In recent years, several cases of mispredicted election outcomes have made the news across the world. Prominent examples include the 2015 Israeli parliamentary election where the majority of polls predicted that Benjamin Netanyahu
would lose his presidency,1 the 2016 Brexit referendum where the majority of polls predicted that Britain would vote to remain in the European Union,2 and the 2016 US presidential election where the majority of polls predicted that Hillary Clinton would defeat Donald Trump.3 When investigating potential reasons for these and other polling failures, researchers have pointed toward the fact that election polls based on probability samples usually reached more accurate predictions than election polls based on nonprobability samples (Sohlberg, Gilljam, and Martinsson 2017; Sturgis, Kuha, Baker, Callegaro, and Fisher 2018).
The finding that election polls based on probability samples usually reach better predictions than election polls based on nonprobability samples is not new. Already in the 1920s and 1930s, the scientific community debated which sampling design was better: probability sampling, as initially introduced by Arthur L. Bowley in 1906, or nonprobability sampling, as initially introduced by Anders N. Kiaer in 1895 (Lessler and Kalsbeek 1992; Bethlehem 2009). After several dramatic cases of mispredicted election outcomes in the United States (Literary Digest in 1936, see Crossley 1937; Gallup, Crossley, and Roper in the polling debacle of 1948, see Converse 1987), nonprobability sampling was identified as a principal cause of prediction inaccuracy and was replaced by probability sampling in most high-quality social research.
With the rise of the internet in the late 20th century, however, nonprobability sampling rose to popularity again as a fast and cheap method for recruiting online panels (Göritz, Reinhold, and Batinic 2000). Yet nonprobability online panels face a number of challenges, such as noncoverage of people without internet access and selection bias due to the reliance on convenience samples of volunteers who might participate in multiple online panels (Bethlehem 2017). Despite these challenges, a vast number of opinion polls today are conducted using nonprobability online panels (Callegaro, Villar, Yeager, and Krosnick 2014a; Callegaro, Baker, Bethlehem, Göritz, and Krosnick 2014b). In addition, nonpanel-based recruitment of online respondents, for example through river sampling (American Association for Public Opinion Research 2013), has been on the rise. As a result, the majority of survey data collected online around the world today rely on nonprobability samples (Callegaro et al. 2014a, 2014b).
1. https://edition.cnn.com/2015/03/18/middleeast/israel-election-polls/index.html, accessed on November 30, 2019.

2. https://www.theguardian.com/politics/2016/jun/24/how-eu-referendum-pollsters-wrong-opinion-predict-close, accessed on September 30, 2019.

3. https://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#babf86437959, accessed on September 30, 2019.
Regardless of whether they are designed to predict election outcomes or to measure public opinion and regardless of whether they are conducted online or offline, when they are used for research and polling purposes, probability and nonprobability sample surveys often share a common goal: to efficiently estimate the characteristics of a large population based on measurements of a small subset of the population. Therefore, both probability and nonprobability sample surveys require that (i) the sampled units are exchangeable with nonsampled units that share the same measured characteristics, (ii) no parts of the population are systematically excluded entirely from the sample, and (iii) the composition of the sampled units with respect to observed characteristics either matches or can be adjusted to match the composition of the larger population (Mercer, Kreuter, Keeter, and Stuart 2017).
Despite their shared objective of providing accurate insights into a population of interest, probability and nonprobability sample surveys differ in a critical aspect. The key difference between probability and nonprobability sample surveys lies in the type and strength of the justification for why each approach should achieve accuracy. In the case of probability sample surveys, the justification is probability sampling theory, which is based on a set of established mathematical principles (Fisher 1925; Neyman 1934; Kish 1965). This sound theoretical basis makes it possible to compute the accuracy of estimates (e.g., in the form of confidence intervals or margins of error) and gives a universal validity to the estimation method. Furthermore, because providers of probability sample surveys routinely describe the details of the data-generating process, researchers are able to make adjustments that account for potential coverage, sampling, and nonresponse biases (e.g., Harter, Battaglia, Buskirk, Dillman, and English 2016; Blumberg and Luke 2017).
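To illustrate this point with a standard textbook result (stated here for a simple random sample without replacement of size n from a population of size N; the notation is ours and not drawn from the studies cited above), the design-based variance estimator of a sample mean and the corresponding approximate 95 percent confidence interval follow directly from the sampling design:

\[ \widehat{\mathrm{Var}}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^{2}}{n}, \qquad \bar{y} \pm 1.96\,\sqrt{\widehat{\mathrm{Var}}(\bar{y})}, \]

where \(\bar{y}\) is the sample mean and \(s^{2}\) the sample variance. No analogous design-based formula is available for a nonprobability sample, because the inclusion probabilities on which such derivations rest are unknown.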
For nonprobability sample surveys, the justification for expecting accurate measurements rests on untested modeling assumptions that are based on a researcher's beliefs about the characteristics that make the sample different from the rest of the population and how those characteristics relate to the research topic. These assumptions can take different forms, such as quasi-randomization or superpopulation modeling (Deville 1991; Elliott and Valliant 2017). However, there is no general statistical theory of nonprobability sampling that justifies when and why accurate inferences can be expected: the validity is topic and survey dependent. Furthermore, online nonprobability sample providers often consider their data collection procedures to be proprietary, thus making it difficult or impossible to know what factors to include in any model aimed at correcting for selection bias in key estimates (Mercer et al. 2017).
This article is intended to move the debate about probability and nonprobability sample surveys forward: We first describe the assumptions that must be made in order to expect nonprobability samples to yield accurate results (section 2). We then summarize the empirical evidence on the accuracy of probability and nonprobability sample surveys to date (section 3). Finally, we conclude our review with practical recommendations and paths for future research (section 4).
2. CONCEPTUALIZATION OF NONPROBABILITY SAMPLING APPROACHES
The Total Survey Error (TSE; Groves and Lyberg 2010) framework that forms the bedrock of quality assessments for probability samples does not cleanly translate to the world of nonprobability sample data collection. Nonprobability samples do not involve a series of well-controlled and well-understood departures from a perfect sampling frame. Instead, most such samples rely on a collection of convenience samples that are aggregated and/or adjusted, with the goal of reducing the final difference between sample and population. Nonprobability samples cannot be evaluated by quantifying and summing the errors that occur at each stage of the sampling process. Instead, nonprobability samples can only be evaluated by assessing how closely the final modeled sample compares to the population in terms of various characteristics.
Although little has been proposed by way of formal statistical theory justifying the use of nonprobability samples for population inferences, the methods adopted for engaging with these samples suggest that a few combinations of assumptions could justify such an approach. In general, justifications can stem from four basic types of claims: (i) that any sample examining a particular question will yield the same inferences, (ii) that the specific design of the sample, as related to the questions at hand, will produce conclusions that mirror the population of interest, (iii) that a series of analytical steps will account for any differences between the sample and the population, and (iv) that the particular combination of sample and/or analytic approaches will produce accurate population estimates. Hence, the suggestion that any particular method, when used for a specific research question, is appropriate depends on underlying claims about the question of interest, the sample, and any adjustment procedures used.
2.1 Design Ignorability Due to the Question of Interest
In some cases, the method of sampling may be unrelated to the phenomena of interest. For instance, researchers may be trying to understand some process that occurs for all individuals, as in most physiological and some psychological studies. Under these circumstances, it may be reasonable to presume that any given group of individuals would behave like any other group of individuals unless there are known confounding factors. Researchers may claim that nonprobability sampling is irrelevant when answering particular questions for a few different reasons. They may believe that the process they are investigating is universal and, thus, that all people would behave similarly. On a more limited scope, they might contend that the particular phenomenon they are studying is appropriately distributed in any broad population sample and that the specific composition of that sample is unlikely to influence their conclusions.
The suggestion that some particular inference is unrelated to the sample being drawn could derive either from theoretical expectations of orthogonality or
prior empirical evidence of orthogonality. Of particular note, there are some theoretical and empirical reasons to believe that certain classes of inference may be more or less susceptible to sample imbalance. Some scholars have argued that trends over time in attitudes and behaviors should be less strongly dependent on sample composition than estimates of the distributions of those attitudes and behaviors (Page and Shapiro 1992). Similarly, a few empirical studies have found that relations between variables were more similar across probability and nonprobability samples than other types of estimates (Berrens, Bohara, Jenkins-Smith, Silva, and Weimer 2003; Pasek 2016). The claim that some kinds of inferences may be made equivalently well regardless of sampling strategy may sometimes be correct. The challenge, however, is determining when this might be the case.
2.2 Fit for Purpose Designs
Researchers do not need to establish that the question they are studying is impervious to sampling strategies to make the case that a nonprobability sample is appropriate for their inferences. Instead, they can assert that the design employed mitigates whatever biases might have emerged in the sampling process. The classic example of this type of argument stems from the use of quota samples. Quota samples are typically designed to ensure that the set of respondents matches the population on certain key demographic parameters. The idea underlying this approach is that the demographic parameters that form the basis for the quotas capture the sources of bias for researchers' inferences. To the extent that this is true, inferences made from quota samples will be accurate because all potential confounds are neutralized by the design. That is, the remaining error induced from the sampling process would be orthogonal to the questions of interest.
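As a minimal illustration of the quota logic described above (the cell definitions, target counts, and field names are invented for this sketch and are not taken from any of the studies cited), recruitment simply stops admitting volunteers whose demographic cell is already full:

import itertools

# Quota-filling sketch: volunteers are accepted only while their
# demographic cell (here gender x age group) is below its target count.
# Targets are illustrative, not real population figures.
targets = {("female", "18-34"): 120, ("female", "35+"): 180,
           ("male", "18-34"): 110, ("male", "35+"): 190}
counts = {cell: 0 for cell in targets}

def accept(volunteer):
    """Return True and count the volunteer if their quota cell is still open."""
    cell = (volunteer["gender"], volunteer["age_group"])
    if cell in targets and counts[cell] < targets[cell]:
        counts[cell] += 1
        return True
    return False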
Notably, demographic quotas are not the only way to select respondents such that they reflect the population across key confounds. Scholars have proposed techniques ranging from matching individuals in nonprobability samples with individuals from probability samples as a means to recruit respondents to surveys (Rivers 2007) to blending together samples drawn from sources that have known opposing biases (Comer 2019). For any of these processes, if a researcher can be confident that the sample selection strategy eliminates all potential confounds for their particular question of interest, then the use of that sampling strategy is not only justifiable but will yield accurate inferences.
The challenge with these sorts of approaches is that the accuracy of critical assumptions can only really be established empirically. It is also unclear what to make of evidence that a particular conclusion is robust to a particular sampling decision. It may be the case that the nature of the question and/or type of inference renders that conclusion accurate for any similar question on a similarly derived sample or it might be that the particularities of a single analysis
happen to have yielded identical conclusions by mere chance. There is no obvious way to establish which of these is the case, though claims of empirical robustness are strengthened by a clear theoretical rationale.
2.3 Global Adjustment Approaches
In the next two sections, we describe modeling approaches that have been used to improve the accuracy of nonprobability sample data. Researchers have long known that even probability samples are sometimes inaccurate, either by chance or due to variations in the likelihood that certain subgroups of the population will participate in a survey. For this reason, many statistical adjustment procedures typically used to adjust for systematic biases in probability samples have been adopted to adjust for selection biases in nonprobability samples. These approaches can be divided into two types: global adjustments and outcome-specific adjustments. Global adjustments refer to approaches that use a model to create a single adjustment that can be applied in any subsequent analysis, regardless of the outcome of interest. Outcome-specific adjustments tailor the adjustment model to a specific outcome of interest.
Regarding global adjustments, one commonly used approach is calibration weighting (Deville and Särndal 1992; Roshwalb, Lewis, and Petrin 2016; Santoso, Stein, and Stevenson 2016). Calibration weighting involves weighting the respondent pool such that the weighted sample totals of a certain characteristic correspond to known population totals of that same characteristic. The known population totals might come from census data, official statistics, or other data sources assumed to be of high quality. The procedure produces a global adjustment weight that can be applied to the analysis of any outcome variable. Using such weights amounts to making the assumption that once the known sources of deviation are accounted for in the adjustment procedure, the remaining errors will be unrelated to the likelihood that a particular unit in the population participated in the survey. This strategy presumes that the sampled units within specified population subgroups will be roughly equivalent to the nonsampled units within those subgroups with respect to any inferences that will be made with the data.
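To make the mechanics concrete, the sketch below rakes a respondent file to known population margins via iterative proportional fitting, one common way to implement calibration weighting in practice. The function name, column names, and totals are illustrative assumptions; production code would add weight trimming, convergence diagnostics, and handling of categories that are missing from the sample.

import pandas as pd

def rake(df, margins, max_iter=100, tol=1e-6):
    """Raking (iterative proportional fitting): return weights such that the
    weighted sample margins of df match known population margins.
    `margins` maps a column name to {category: population total}."""
    weights = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_shift = 0.0
        for col, targets in margins.items():
            weighted_totals = weights.groupby(df[col]).sum()
            for category, target in targets.items():
                factor = target / weighted_totals[category]
                weights[df[col] == category] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:  # all margins (approximately) reproduced
            break
    return weights

# Illustrative call, assuming population totals from official statistics:
# w = rake(respondents, {"gender": {"f": 5100, "m": 4900},
#                        "agegroup": {"18-34": 2800, "35-64": 4700, "65+": 2500}})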
Although calibration weighting only requires access to population-level benchmark data, alternative global adjustment procedures make use of unit-level reference data to improve the accuracy of nonprobability sample estimates. One approach, known as sample matching, attempts to compose a balanced nonprobability sample by selecting units from a very large frame, such as a list of registered members of an opt-in panel, based on an array of auxiliary characteristics (often demographic) that closely match the characteristics of units from a reference probability sample (Rivers 2007; Vavreck and Rivers 2008; Bethlehem 2016). The matching procedure, which may be performed before any units are invited to the nonprobability survey, relies on a
distance metric (e.g., Euclidean distance) to identify the closest match between pairs of units based on the set of common auxiliary characteristics.
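A greedy nearest-neighbor version of this matching step might look as follows; the function and variable names are ours, the auxiliary variables are assumed to be standardized numeric columns, and real implementations often use more refined distance metrics and matching algorithms.

import numpy as np
from scipy.spatial.distance import cdist

def match_panel_to_reference(reference_X, panel_X):
    """For each unit in the probability reference sample, pick the closest
    not-yet-used unit from the large opt-in panel frame (Euclidean distance
    on common auxiliary variables); returns indices into panel_X."""
    dist = cdist(np.asarray(reference_X, float), np.asarray(panel_X, float))
    selected, used = [], set()
    for i in range(dist.shape[0]):
        for j in np.argsort(dist[i]):  # candidate panel units, closest first
            if j not in used:
                selected.append(int(j))
                used.add(j)
                break
    return selected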
Another approach that uses unit-level reference data is propensity score weighting. This approach is performed after the survey data have been collected from units in a nonprobability sample. The basic procedure is to vertically concatenate the nonprobability sample survey data with a reference dataset, typically a large probability sample survey. Then a model (e.g., logit or probit) is fitted using variables measured in both datasets to predict the probability that a particular unit belongs to the nonprobability sample (Rosenbaum and Rubin 1983, 1984; Lee 2006; Dever, Rafferty, and Valliant 2008; Valliant and Dever 2011). A weight is then constructed based on the inverse of this estimated inclusion probability and used in any subsequent analysis of the nonprobability survey data. Like calibration weighting, propensity score weighting only works if two conditions are satisfied: (i) the weighting variables and the propensity of response in the sample are correlated; and (ii) the weighting variables are correlated with the outcome variables of interest. A related approach is to use the concatenated dataset to fit a prediction model using variables measured in both datasets, which is then used to impute the values of variables for units in the nonprobability sample that were only observed for units in the reference probability sample (Raghunathan 2015).
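The following sketch shows the basic propensity score weighting recipe described above, using a logistic model; the data frames and column names are placeholders, the reference sample's own design weights are ignored for brevity, and the simple inverse-probability weight is only one of several variants used in practice.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_weights(nonprob, reference, covariates):
    """Stack the nonprobability sample on a reference probability sample,
    model membership in the nonprobability sample, and return the inverse
    of the estimated inclusion probability for each nonprobability case."""
    X = pd.concat([nonprob[covariates], reference[covariates]], ignore_index=True)
    y = np.r_[np.ones(len(nonprob)), np.zeros(len(reference))]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_hat = model.predict_proba(nonprob[covariates])[:, 1]
    return 1.0 / p_hat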
None of the previously described adjustment approaches requires the reference data beyond the weighting, matching, or imputation steps; the reference data are discarded during the analysis of the nonprobability survey data. Alternative approaches combine both data sources and analyze them jointly. One such approach is pseudo design-based estimation (Elliott 2009; Elliott and Valliant 2017), where pseudo-inclusion probabilities are estimated for the nonprobability sample units based on a set of variables common to both nonprobability and probability samples. Different techniques may be used to estimate the inclusion probabilities. For example, one could concatenate the nonprobability and probability datasets and predict the probability of participating in the nonprobability survey, similar to the aforementioned propensity score weighting procedure. Alternatively, one could employ sample matching with both surveys and donate a probability sample unit's inclusion probability to the closest recipient match in the nonprobability sample. Once the pseudo-inclusion probabilities have been assigned to all nonprobability sample units, these units can be treated as if they were selected using the same underlying sampling mechanism as the probability sample units. The datasets may then be combined and analyzed jointly by using the actual and pseudo weights. For variance estimation, Elliott and Valliant (2017) recommend the use of design-based resampling approaches, such as the bootstrap or jackknife, to account for variability in both the pseudo weights and the target quantity of interest. For nonprobability samples that have an underlying cluster structure (e.g., different types of persons recruited from different websites), cluster resampling approaches should be used.
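Sketched in code, the joint analysis step amounts to a weighted estimate over the stacked samples, with a resampling scheme for the variance. This is a deliberately naive illustration (the names and the simple-random bootstrap are our assumptions), whereas Elliott and Valliant (2017) recommend resampling that respects the design and re-derives the pseudo-weights in every replicate.

import numpy as np

def combined_mean(y_prob, w_prob, y_nonprob, w_pseudo):
    """Weighted mean over the combined samples, treating the pseudo-weights
    of the nonprobability cases like design weights."""
    y = np.concatenate([y_prob, y_nonprob])
    w = np.concatenate([w_prob, w_pseudo])
    return np.sum(w * y) / np.sum(w)

def bootstrap_se(y_prob, w_prob, y_nonprob, w_pseudo, reps=1000, seed=42):
    """Naive bootstrap standard error of the combined mean (no strata,
    no clusters, pseudo-weights held fixed across replicates)."""
    rng = np.random.default_rng(seed)
    y_prob, w_prob = np.asarray(y_prob, float), np.asarray(w_prob, float)
    y_nonprob, w_pseudo = np.asarray(y_nonprob, float), np.asarray(w_pseudo, float)
    estimates = []
    for _ in range(reps):
        i = rng.integers(0, len(y_prob), len(y_prob))
        j = rng.integers(0, len(y_nonprob), len(y_nonprob))
        estimates.append(combined_mean(y_prob[i], w_prob[i], y_nonprob[j], w_pseudo[j]))
    return float(np.std(estimates, ddof=1))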
Another approach to combining and analyzing probability and nonprobability samples jointly is blended calibration (DiSogra, Cobb, Chan, and Dennis 2011; Fahimi, Barlas, Thomas, and Buttermore 2015). Blended calibration is a form of calibration weighting that combines a weighted probability sample with an unweighted nonprobability sample and calibrates the combined sample to benchmark values measured on units from the weighted probability sample survey. The combined sample is then analyzed using the calibrated weights.
In summary, each of the previously described approaches produces a single global adjustment that can be applied to any analysis regardless of the outcome variable of interest. These methods entail the assumption that the selection mechanism of the nonprobability sample is ignorable conditional on the variables used in the adjustment method. For example, selection bias is assumed to be negligible or nonexistent within subgroups used in the calibration and propensity scoring procedures. It is further assumed that the adjustment variables in the reference data source (census or probability sample survey) are measured without error, are highly correlated with the target analysis variables, and are correlated with the probability to participate in the nonprobability sample survey. A very large and diverse reference data source with an extensive set of common variables is therefore needed to maximize the validity of these strong assumptions. One should always keep in mind that global adjustment procedures may improve the situation for some estimates but not others, and there is no guarantee that biases in the nonprobability data will be removed completely. In practice, one never knows with certainty that ignorability assumptions hold, although in some cases it may be possible to place bounds on the potential magnitude of bias through sensitivity analysis (Little, West, Boonstra, and Hu 2019).
2.4 Outcome-Specific Adjustment Approaches
Turning now to outcome-specific adjustment approaches, these approaches utilize adjustment models that are tailored to a specific outcome variable. That is, they adjust for the selection mechanism into a nonprobability sample with respect to a given outcome variable, Y, which is of interest to the researcher. Such approaches attempt to control for variables that govern the selection process and are correlated with the target outcome variable. One example of such a framework is the notion that probability and nonprobability samples both constitute draws from a hypothetical infinite "superpopulation" (Deville 1991). The goal of the analyst is then to model the data-generating process of the underlying superpopulation by accounting for all relevant variables in the analysis model. In practice, this means that a researcher may fit a prediction model for some analysis variable Y based on the sample at hand, which is then used to predict the Y's for the nonsampled units. The sampled and nonsampled units are then combined to estimate the quantity of interest (e.g., mean, total,
regression coefficient) for Y in the total population (Elliott and Valliant 2017). The key assumptions of this approach are that the analysis variable, Y, is explained through a common model for the sampled and nonsampled units and that all parameters governing the superpopulation model are controlled for in the analysis model. The approach also requires the availability of auxiliary data on the population to make predictions for the nonsampled units. Variance estimation for the predictions can be implemented using a variety of frequentist methods, including jackknife and bootstrap replication estimators as described in Valliant, Dorfman, and Royall (2000).
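In code, the prediction step described above reduces to fitting a model on the observed sample and combining observed with predicted values for the nonsampled units; the linear model, the function name, and the assumption that auxiliary X values are available for every nonsampled unit are illustrative choices, not a prescription from the literature cited.

import numpy as np
from sklearn.linear_model import LinearRegression

def prediction_estimate_of_mean(y_sample, X_sample, X_nonsampled):
    """Model-based (superpopulation-style) estimate of a population mean:
    observed Y values for sampled units plus model predictions for the
    rest of the population, divided by the population size."""
    model = LinearRegression().fit(X_sample, y_sample)
    y_pred = model.predict(X_nonsampled)
    population_total = float(np.sum(y_sample) + np.sum(y_pred))
    population_size = len(y_sample) + len(y_pred)
    return population_total / population_size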
Another model-based approach, which can be applied in a superpopulation framework, is model-assisted calibration. This approach involves constructing calibrated weights by using a model to predict the values for a given analysis variable (Wu and Sitter 2001). The calibrated weights are generated based on constraints placed on the population size and the population total of the predicted values. Various model selection approaches (e.g., LASSO) have been proposed for parsimonious modeling of the analysis variable (Chen, Valliant, and Elliott 2018). A key assumption of the method is that the model is correctly specified and capable of making reliable predictions across different samples of the population. It is also assumed that all relevant superpopulation parameters are included in the model if the method is implemented in a superpopulation framework.
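In generic notation (our own, with s denoting the sample, U the population of size N, and \(\hat{y}\) the model predictions), the constraints described above can be written as:

\[ \sum_{i \in s} w_i = N \qquad \text{and} \qquad \sum_{i \in s} w_i\,\hat{y}_i = \sum_{k \in U} \hat{y}_k , \]

so that the calibrated weights \(w_i\) reproduce both the population size and the population total of the predicted values.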
Multilevel regression and poststratification is another approach used to estimate a specific outcome of interest from a nonprobability sample (Wang, Rothschild, Goel, and Gelman 2015; Downes, Gurrin, English, Pirkis, and Currier 2018). The basic idea is to fit a multilevel regression model predicting an outcome given a set of covariates. The use of a multilevel model makes it possible to incorporate a large number of covariates or high-order interactions into the prediction model (Ghitza and Gelman 2013). The model is then used to estimate the mean value for a large number of poststratification cells defined by the cross-classification of all variables used in the regression model. It is necessary that the relative size of each cell in the population is known or can be reliably estimated from external data sources such as a census or population registry. Population quantities are estimated by aggregating these predicted cell means with each cell weighted proportionally to its share of the population. The multilevel regression model allows for cell-level estimates to be generated even when few units exist within the sample cells. The method can also be implemented in a Bayesian hierarchical modeling framework (Park, Gelman, and Bafumi 2004). Like all of the previously mentioned model-based approaches, the model is assumed to control for all variables that affect the probability of inclusion in the nonprobability sample. The method also requires good model fit, and large cell sizes are preferable to generate robust cell-level estimates. For additional hierarchical and Bayesian modeling approaches that have been proposed to estimate outcomes from nonprobability samples, we
refer to Ganesh, Pineau, Chakraborty, and Dennis (2017), Pfeffermann (2017), and Pfeffermann, Eltinge, and Brown (2015).
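The poststratification step at the heart of this approach can be summarized in a single formula (written here in generic notation of our own): with \(\hat{\theta}_j\) the model-based estimate for cell j and \(N_j\) the known or estimated population count of that cell, the population quantity is estimated as

\[ \hat{\theta}^{\,\mathrm{MRP}} = \frac{\sum_{j=1}^{J} N_j\,\hat{\theta}_j}{\sum_{j=1}^{J} N_j} . \]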
Collectively, these approaches to using nonprobability sample data to reach population-level conclusions depend on combinations of assumptions about ignorable errors and the availability of information about the sources of nonrandomness in respondent selection that can be used to adjust for any errors that are not ignorable. Although these dependencies are not fundamentally different from the assumptions underlying the use of probability samples, they are more difficult to rely on, as we know little about the factors that lead individuals to become members of nonprobability samples.
3. THE ACCURACY OF PROBABILITY AND NONPROBABILITY SAMPLE SURVEYS
Several studies have empirically assessed the accuracy of probability and nonprobability sample surveys by comparing survey outcomes to external population benchmarks (see table 1). The vast majority of these studies concluded that probability sample surveys have a significantly higher accuracy than nonprobability sample surveys. Only a few studies have found that probability sample surveys do not generally have a significantly higher accuracy than nonprobability sample surveys.
Table 1 provides a list of the studies that are included in our overview. A key inclusion requirement is that the studies contain comparisons of probability and nonprobability sample surveys with external population benchmarks. We do not include any studies published in languages other than English, or that contain sample accuracy assessments of probability and nonprobability sample surveys only as a minor byproduct, or that compare probability sample surveys with nonsurvey convenience samples, such as Amazon MTurk (e.g., Coppock and McClellan 2019).
3.1 Initial Accuracy Comparisons for Probability and Nonprobability Sample Surveys
Table 1. Studies on the Accuracy of Probability (PS) and Nonprobability (NPS) Sample Surveys

Study | Country | Benchmark | PS modes studied | PS more accurate than NPS? (a) | Did (re)weighting sufficiently reduce NPS bias? (b)
Blom et al. (2018) | Germany | Census data; election data | F2F, web | Yes (univariate) | No (raking)
MacInnis et al. (2018) | USA | High-quality PS | Phone, web | Yes (univariate) | No (poststratification)
Dassonneville et al. (2018) | Belgium | Census data; election outcome | F2F | Yes (univariate); No (multivariate) | No (unspecified approach) – univariate; N/A – multivariate
Legleye et al. (2018) | France | Census data | Phone | Yes (univariate) | N/A
Pennay et al. (2018) | Australia | High-quality PS | Phone | Yes (univariate) | No (poststratification)
Sturgis et al. (2018) | UK | Election outcome | F2F | Yes (univariate) | No (raking, propensity weighting, matching)
Dutwin and Buskirk (2017) | USA | High-quality PS | F2F, phone | Yes (univariate) | No (propensity weighting, matching, raking)
Sohlberg et al. (2017) | Sweden | Election outcome | Phone | Yes (univariate) | N/A
Brüggen et al. (2016) | Netherlands | Population register | F2F, web | Yes (univariate); Yes (bivariate) | No (Generalized Regression Estimation) – univariate, bivariate
Kennedy et al. (2016) | USA | High-quality PS | Web, mail | No (univariate); No (multivariate) | N/A – univariate; N/A – multivariate
Pasek (2016) | USA | High-quality PS | Phone | Yes (univariate); No (bivariate); Yes (longitudinal) | No (raking, propensity weighting) – univariate, bivariate, longitudinal
Gittelman et al. (2015) | USA | High-quality PS | Phone | No (univariate) | No (poststratification)
Ansolabehere and Schaffner (2014) | USA | High-quality PS; election outcome | Phone, mail | No (univariate); No (multivariate) | N/A
Erens et al. (2014) | UK | High-quality PS | CASI | Yes (univariate) | N/A
Steinmetz et al. (2014) | Netherlands | High-quality PS | Web | No (univariate); No (bivariate) | Yes (propensity weighting) – univariate, bivariate
Ansolabehere and Rivers (2013) | USA | High-quality PS; election outcome | F2F | No (univariate) | N/A
Szolnoki and Hoffmann (2013) | Germany | High-quality PS | F2F, phone | Yes (univariate) | N/A
Chan and Ambrose (2011) | Canada | Unspecified | Web | No (univariate) | N/A
Scherpenzeel and Bethlehem (2011) | Netherlands | Election outcome; unspecified | F2F, web | Yes (univariate) | N/A
Yeager et al. (2011) | USA | High-quality PS | Phone, web | Yes (univariate) | No (poststratification)
Chang and Krosnick (2009) | USA | High-quality PS | Phone, web | Yes (univariate) | No (raking)
Walker, Pettit, and Rubinson (2009) (c) | USA | Unspecified | Phone, mail | Yes (univariate) | N/A
Loosveldt and Sonck (2008) | Belgium | Census data | F2F | No (univariate) | No (poststratification, propensity weighting)
Malhotra and Krosnick (2007) | USA | High-quality PS | F2F | Yes (univariate) | No (unspecified)
Berrens et al. (2003) | USA | High-quality PS | Web, phone | No (univariate); No (multivariate) | Yes (raking, propensity weighting) – univariate, multivariate

(a) The study results reported here are based on initial comparisons of accuracy as reported by the authors. Some studies use raw (unweighted) data in their initial comparisons, while others use data that are already weighted, using either weights that the authors calculated themselves or the weights that a survey vendor delivered with the data.
(b) We report whether (re)weighting has sufficiently reduced bias based on the authors' own judgments as reported in their conclusions. This column includes studies that compare raw (unweighted) data in their initial comparison of accuracy with the data after a weighting procedure was performed, as well as studies that used weighted data in their initial comparison of accuracy and reweighted the data in subsequent comparisons. Studies that only reported raw data or only weighted data, without using any reweighting approaches, are labelled "not applicable" (N/A) in this column.
(c) As reported by Callegaro et al. (2014a, 2014b).
As table 1 shows, a number of studies have demonstrated that probability sample surveys have a higher accuracy than nonprobability sample surveys. The higher accuracy of probability sample surveys has been demonstrated across various topics, such as voting behavior (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Sturgis et al. 2018), health behavior (Yeager, Krosnick, Chang, Javitz, and Levendusky 2011), consumption behavior (Szolnoki and Hoffmann 2013), sexual behavior and attitudes (Erens, Burkill, Couper, Conrad, and Clifton 2014; Legleye, Charrance, Razafindratsima, Bajos, and Bohet 2018), and socio-demographics (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Yeager et al. 2011; Szolnoki and Hoffmann 2013; Erens et al. 2014; Dutwin and Buskirk 2017; MacInnis, Krosnick, Ho, and Cho 2018). In addition, the higher accuracy of probability sample surveys has been found across a number of countries, such as Australia (Pennay, Neiger, Lavrakas, and Borg 2018), France (Legleye et al. 2018), Germany (Szolnoki and Hoffmann 2013; Blom, Ackermann-Piek, Helmschrott, Cornesse, Bruch, and Sakshaug 2018), the Netherlands (Scherpenzeel and Bethlehem 2011; Brüggen, van den Brakel, and Krosnick 2016), Sweden (Sohlberg et al. 2017), the United Kingdom (Sturgis et al. 2018), and the United States (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Yeager et al. 2011; Dutwin and Buskirk 2017; MacInnis et al. 2018). Furthermore, the higher accuracy of probability sample surveys has been shown over time, from the first study demonstrating higher accuracy of probability sample surveys in 2007 (Malhotra and Krosnick 2007) to the most recent ones in 2018 (Blom et al. 2018; Legleye et al. 2018; MacInnis et al. 2018; Sturgis et al. 2018). All of these studies from different times and countries that focused on different topics reached the conclusion that probability sample surveys led to more accurate estimates than nonprobability samples.
A prominent study from this line of research is Yeager et al. (2011). In a series of analyses of surveys conducted between 2004 and 2008 in the United States, the authors found that probability sample surveys were consistently more accurate than nonprobability sample surveys across many benchmark variables (primary demographics such as age, gender, and education; secondary demographics such as marital status and homeownership; and nondemographics such as health ratings and possession of a driver's license), even after poststratification weighting.
Another influential study that is based on a particularly rich database was conducted with four probability sample face-to-face surveys, one probability sample online survey, and eighteen nonprobability sample online surveys (the so-called NOPVO [National Dutch Online Panel Comparison Study] project) in 2006 to 2008 in the Netherlands (Brüggen et al. 2016). In line with the findings from Yeager et al. (2011), the authors found that the probability face-to-face and internet surveys were consistently more accurate than the nonprobability internet surveys across a variety of sociodemographic variables and variables on health and life satisfaction.
A recent example that focuses on the current debate about why polls mispredict election outcomes was published by Sturgis et al. (2018). Investigating why British polls mispredicted the outcome of the 2015 UK general election, Sturgis et al. (2018) examined data from twelve British pre-election polls and assessed a number of potential reasons for the polling debacle, such as whether voters changed their minds at the last minute (i.e., "late swing"), whether the mode in which the surveys were conducted (telephone or online) played a role, or whether polls failed to correct for the common problem of overestimating voter turnout (i.e., proper turnout weighting). The authors found that sample
inaccuracy due to nonprobability sample selection likely had the biggest impact on the mispredictions.
Although most studies have demonstrated that accuracy is higher in probability sample surveys than in nonprobability sample surveys, two studies have yielded mixed findings depending on the type of estimate examined (Pasek 2016; Dassonneville, Blais, Hooghe, and Deschouwer 2018). Both studies show that accuracy is higher in probability sample surveys for univariate estimates. Pasek (2016) also reports higher accuracy of probability sample surveys for longitudinal analyses. However, these studies found no difference in accuracy regarding bivariate (Pasek 2016) and multivariate (Dassonneville et al. 2018) estimates.
Several studies have found no consistent superiority in accuracy of probability or nonprobability sample surveys over one another. These studies generally yielded mixed findings: a probability sample survey was found to be more accurate than some but not all nonprobability sample surveys examined (Kennedy, Mercer, Keeter, Hatley, McGeeney, and Gimenez 2016); or probability sample surveys were shown to be more accurate than nonprobability sample surveys on some variables while nonprobability sample surveys were more accurate than probability sample surveys on other variables (Loosveldt and Sonck 2008; Chan and Ambrose 2011; Steinmetz, Bianchi, Tijdens, and Biffignandi 2014). In some of the studies, the authors speculated that it might be survey mode rather than the sampling design that led to comparable accuracy (Berrens et al. 2003; Ansolabehere and Schaffner 2014; Gittelman, Thomas, Lavrakas, and Lange 2015).
In general, it should be noted that sample accuracy assessments face some challenges. One common challenge of such studies is to disentangle mode effects (i.e., measurement bias) from sampling effects (i.e., selection bias). This challenge occurs because probability sample surveys are usually conducted offline (e.g., via face-to-face or telephone interviews), whereas nonprobability sample surveys are usually conducted online via nonprobability online panels. However, several studies show that it is possible to disentangle the mode effect from the sampling effect by comparing offline probability sample surveys with online probability sample surveys (mode effect) and comparing online probability sample surveys to online nonprobability sample surveys (sampling effect). The majority of these studies conclude that both offline and online probability sample surveys are more accurate than nonprobability online sample surveys (Chang and Krosnick 2009; Scherpenzeel and Bethlehem 2011; Yeager et al. 2011; Brüggen et al. 2016; Dutwin and Buskirk 2017; Blom et al. 2018; MacInnis et al. 2018).
A related challenge that sample accuracy assessments of probability and nonprobability sample surveys face is the question of how to measure accuracy in a way that accounts for both sampling variability and systematic bias. This challenge occurs because often there is only one probability sample survey and one nonprobability sample survey available for sample accuracy assessment.
However, several large-scale studies that compare a larger number of probability sample surveys with a larger number of nonprobability sample surveys have found that probability sample surveys are consistently more accurate than nonprobability sample surveys (Yeager et al. 2011; Brüggen et al. 2016; Blom et al. 2018; MacInnis et al. 2018; Sturgis et al. 2018). This suggests that although surveys vary on a number of design factors other than their sampling design (e.g., incentive schemes, contact frequency) and might sometimes be more or less accurate by chance, samples are generally more likely to have higher accuracy if they are based on probability sampling procedures rather than nonprobability sampling procedures.
Another common challenge that sample accuracy assessments face is the availability of appropriate gold standard benchmarks. In the current literature, the most commonly used benchmarks are large-scale, high-quality probability sample surveys. A typical example of such a benchmark is the US American Community Survey (Yeager et al. 2011; MacInnis et al. 2018). Other benchmarks used in the literature are population census data (Legleye et al. 2018), election outcomes (Sohlberg et al. 2017), and population register data (Brüggen et al. 2016).
All of these benchmarks have advantages and disadvantages. A key advantage of using a large-scale, high-quality probability sample survey is that the set of variables available for comparisons usually includes not only sociodemographic variables but also substantive variables on attitudes and behavior. A disadvantage is that large-scale, high-quality probability sample surveys are surveys themselves and might therefore contain typical survey errors, such as coverage, sampling, and nonresponse errors (Groves and Lyberg 2010).
Census data and population register data have the advantage of not suffering from survey errors. However, such data are often not available for the current year and might therefore be outdated at the time of the study. Population register data might also be outdated, for example, if immigration, emigration, births, and deaths are not captured in a timely manner. In addition, census data and population register data are typically limited to a small set of sociodemographic characteristics. With regard to election outcomes, an advantage is that they are key variables of substantive interest to many social scientists. However, if survey data fail to accurately predict election outcomes, there are many potential explanations for this besides the sampling approach; see Sturgis et al. (2018) for a list of reasonable explanations tested after the British polling disaster of 2015.
3.2 Weighting Approaches to Reduce Bias in Nonprobability Sample Surveys
Many studies examining the accuracy of probability and nonprobability sample surveys have attempted to eliminate biases in nonprobability sample surveys.
The majority of these studies found that weighting did not sufficiently reduce bias in nonprobability sample surveys (see table 1). Generally speaking, probability sample surveys were found to be more accurate than nonprobability sample surveys even after (re)weighting.
Although the majority of studies found that weighting did not reduce the bias in nonprobability sample surveys sufficiently (table 1), some studies showed that weighting did reduce the bias somewhat. However, whether researchers considered the bias to be sufficiently reduced by weighting varied from study to study. For example, Berrens et al. (2003) considered the bias in a nonprobability sample survey sufficiently reduced even though an estimate of mean household income deviated from the benchmark by between 4.8 percentage points (after propensity weighting) and 11.9 percentage points (after raking). A study concluding that weighting approaches sufficiently reduced bias in nonprobability sample surveys also reported that weighting increased the variance of the estimates significantly (Steinmetz et al. 2014).
Most of the studies listed in table 1 focus on the success of weighting procedures to reduce bias in nonprobability sample surveys. Only a few studies listed in table 1 have also assessed the success of weighting procedures in reducing bias in probability sample surveys. For example, MacInnis et al. (2018) reported that weighting reliably eliminated the small biases present in unweighted probability sample survey data. This is in line with research by Gittelman et al. (2015), who showed that poststratification weighting successfully reduced biases in a probability sample survey but was less successful across a number of nonprobability survey samples and even increased bias in one instance.
The studies listed in table 1 used one or more common weighting procedures, such as raking, poststratification, and propensity weighting, to improve the accuracy of nonprobability sample survey measurements. Several other studies in the literature have investigated the effectiveness of various weighting procedures in reducing bias in nonprobability sample surveys without examining the accuracy of probability sample surveys. Table 2 provides an overview of these studies. A key inclusion requirement is that studies assess whether weighting nonprobability sample surveys reduced bias as compared with unweighted estimates and, if so, to what extent. We again exclude all studies published in languages other than English and studies that contain assessments of weighting procedures for nonprobability sample surveys only as a minor byproduct or that examine nonsurvey data.
In general, the majority of the studies that investigated the effectiveness of various weighting procedures in reducing bias in nonprobability sample surveys (table 2) reached the same conclusion as the studies that assessed weighting approaches in both probability and nonprobability sample surveys (table 1): weighting does not sufficiently reduce bias in nonprobability sample surveys. Only a few studies found that weighting could sufficiently reduce bias in nonprobability sample surveys.
As table 2 shows, two of the studies that documented a sufficient reduction in the bias of nonprobability sample surveys applied multilevel regression and poststratification weights (Wang et al. 2015; Gelman, Goel, Rothschild, and Wang 2017), one study applied model-based poststratification (Goel, Obeng, and Rothschild 2015), and one study applied propensity weighting and calibration (Lee and Valliant 2009). Two of these studies used weighting to adjust nonprobability sample survey data to accurately predict election outcomes after the actual election outcomes were already known, which reduces confidence in the conclusions (Wang et al. 2015; Gelman et al. 2017).
In sum, the majority of the research on weighting and accuracy finds that the inaccuracy of nonprobability samples cannot be reliably remedied by weighting procedures. Some authors conducting such studies also offer explanations as to why their attempts to achieve accurate estimates from weighted nonprobability samples were not successful. Mercer, Lau, and Kennedy (2018), for instance, show that complex weighting procedures outperform basic weighting procedures. Furthermore, the authors show that to get accurate estimates from nonprobability sample surveys by weighting, the availability of variables that predict the outcome of interest is more important than which statistical method is used.
4. CLOSING REMARKS
In this article, we have reviewed conceptual approaches and empirical evidence on probability and nonprobability sample surveys. Probability sampling theory is well established and based on sound mathematical principles, whereas nonprobability sampling is not. Although there are potential justifications for drawing inferences from nonprobability samples, the rationale for many studies remains unarticulated, and the inferences from nonprobability sample surveys generally require stronger modeling assumptions than are necessary for probability samples. The basic problem with these modeling assumptions remains that they cannot be tested. We have therefore proposed a conceptual framework for nonprobability sample surveys to explicate these modeling assumptions, including practical suggestions about when it might be justified to make such assumptions (section 2).
In addition, we have summarized the empirical evidence on the accuracy of probability and nonprobability sample surveys (section 3). Our literature overview shows that, even in the age of declining response rates, accuracy in probability sample surveys is generally higher than in nonprobability sample surveys. There is no empirical support for the claim that switching to nonprobability sample surveys is advisable because the steadily declining response rates across the globe compromise probability sample survey data quality. Based on the accumulated empirical evidence, our key recommendation is to continue to rely on probability sample surveys.
In the case that only nonprobability sample survey data are available, we recommend carefully choosing among the various modeling approaches based on their underlying assumptions. In order for researchers to be able to justify the modeling approaches used, we recommend obtaining as much information as possible about the data-generating process (see the transparency recommendations in the appendix).
Apart from these key recommendations, this report also shows that there aregaps in the existing literature. To be able to evaluate if and when nonprobabil-ity sample surveys can be used as an alternative to probability sample surveys,
Table 2. Studies Exclusively Investigating Weighting Procedures to Reduce Bias in Nonprobability Sample Surveys

Study | Benchmark | Does weighting sufficiently reduce bias in NPS?(a)
Mercer et al. (2018) | High-quality PS | No (raking, propensity weighting, matching)
Smyk, Tyrowicz, and Van der Velde (2018) | High-quality PS | No (propensity weighting)
Gelman et al. (2017) | High-quality PS | No (raking)
Gelman et al. (2017) | Election outcome | Yes (multilevel regression and poststratification)
Goel et al. (2015) | High-quality PS | No (raking); Yes (model-based poststratification)
Wang et al. (2015) | Election outcome | Yes (multilevel regression and poststratification)
Lee and Valliant (2009) | High-quality PS | Yes (propensity weighting, calibration)
Schonlau, van Soest, Kapteyn, and Couper (2009) | High-quality PS | No (propensity weighting, matching)
Schonlau, van Soest, and Kapteyn (2007) | PS | No (propensity weighting)
Lee (2006) | High-quality PS | No (propensity weighting)
Duffy, Smith, Terhanian, and Bremer (2005) | High-quality PS | No (raking, propensity weighting)
Schonlau, Zapert, Simon, Sanstad, and Marcus (2004) | PS | No (poststratification, propensity weighting)
Taylor (2000) | High-quality PS + Election outcome | Yes (raking, propensity weighting)

(a) We report whether weighting sufficiently reduced bias based on the authors' own judgments as reported in their conclusions.
To be able to evaluate if and when nonprobability sample surveys can be used as an alternative to probability sample surveys, we need more insights into the success of nonprobability sample surveys in producing accurate estimates in bivariate and multivariate analyses, longitudinal analyses, and experimental research settings. In addition, we need more research into variance estimation and advanced weighting techniques, in particular with regard to collecting and utilizing weighting variables that are correlated with key survey variables and the data-generating process.
Finally, this report shows that there is great variability across nonprobability sample surveys. Therefore, we would like to end this report with a general call for more transparency in the survey business. For users, researchers, and clients, it can be difficult to decide which vendors to trust with data collection. As long as many vendors are unwilling to disclose necessary information about the data collection and processing, researchers will remain unable to make informed decisions about vendors and will lack the information necessary to understand the limitations of their collected data. The availability of research reports that outline the methodology used for data collection and manipulation is therefore of utmost importance (Bethlehem 2017, Chapter 11).
As clients, we can reward vendors who are willing to provide more methodological information to us. As a practical matter, this requires being explicit about our needs prior to contracting and reaching out to a broad set of nonprobability sample providers. Another form of action we can take as clients is to ensure that, when contracting vendors who belong to an organization with relevant standards, such as ESOMAR, or that hold ISO 20252 or 26362 certification, these vendors disclose information as required by the respective code or certification (see the appendix and Bethlehem [2017, Chapter 12] for more information on relevant transparency guidelines and standards).
Appendix: Transparency Guidelines
Various codes of ethics and guidelines address the disclosure of methodological information for online panels (e.g., Arbeitskreis Deutscher Markt- und Sozialforschungsinstitute eV 2001; Interactive Marketing Research Organization [IMRO] 2015; International Organization for Standardization [ISO] 2009, 2012; ESOMAR 2012, 2014, 2015a, 2015b; American Association for Public Opinion Research 2015). The lack of information available from some online panel vendors can unfortunately make it impossible for researchers to comply with their own codes or certifications (e.g., ISO 2012, §4.5.1.4; American Association for Public Opinion Research 2015, §III.A.5–8, 12, 14). The unwillingness of some vendors to disclose necessary information is unfortunate for all concerned. Consumers of research are denied information needed to form an opinion as to the quality of the research. Researchers are unable to make informed decisions about vendors and lack information needed to understand the limitations of their data. Further, vendors themselves are unable to benefit from the methodological advances that would follow from greater availability of information on panel operations.
What can be done? Given the abundance of guidelines, several of which are particularly helpful (IMRO 2015; ESOMAR 2015a), there is little need for another extensive set of recommendations. Therefore, we summarize common reporting recommendations in table A1 and, for the most part, refer the reader to existing guidelines addressing these points (particularly helpful sources are italicized). At times, we make additional recommendations that may go beyond reporting requirements; these contain the words "We recommend."
In table A2, we also list relevant case-level data that researchers may wish to ensure that the vendor will be able to provide. Such data will likely be most useful to the researcher when they request all data from all cases invited to participate in the survey, not only those that completed the survey or were not removed for data quality reasons.
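To illustrate what such a case-level delivery might look like, the following sketch shows a single hypothetical record covering the constructs listed in table A2; all field names and values are illustrative assumptions, not a prescribed format.

case_record = {
    "case_id": "A-00123",                # vendor-assigned identifier
    "invited": True,                      # all invited cases should be included, not only completes
    "disposition": "break-off",          # e.g., complete, break-off, removed for data quality
    "sample_source": "email list",       # recruitment source (see table A2)
    "survey_started": "2019-03-02T10:15:00",
    "survey_completed": None,             # missing for break-offs
    "duration_seconds": None,             # duration between start and completion, if completed
    "user_agent": "Mozilla/5.0 (...)",   # device information per session
}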
Table A1. Reporting Recommendations

Construct: Universe
Description: What universe does the panel represent? How well does the panel cover that universe?
References: American Association for Public Opinion Research (2015, §III.A.3); ESOMAR (2014, §5.2); ISO (2009, §4.4.1, §4.7); ISO (2012, §4.5.1.4, §7.2.e, §7.2.f.2)

Construct: Sampling frame
Description: What method or methods are used to recruit panel members? We recommend that the percentage of the sample recruited from different sources be included (e.g., online advertisement, email list, aggregator website).
References: American Association for Public Opinion Research (2015, §III.A.5-8); ESOMAR (2012, §2, §5); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.7); ESOMAR (2015b, §5.2, §6, §6.1); IMRO (2015, §4); ISO (2012, §4.5.1.4, §7.2.f)

Construct: Panel size
Description: How large is the panel? What is the percentage of active members? How is "active" defined?
References: IMRO (2015, §14, §17); ISO (2009, §4.4.2)

Construct: Replenishment and retirement
Description: How frequently are new panel members added? Are panel members retired from the panel by the vendor? If so, what criteria are used to determine retirement? What is the panel's annual "churn rate" (the percentage of all panelists who are voluntarily or involuntarily retired each year)?
References: IMRO (2015, §18, §19); ISO (2009, §4.5.1)

Construct: Aggregator sites
Description: Does the panel draw on aggregator sites, which are sites that allow respondents to select from multiple surveys for which they may qualify (see IMRO 2015)?
References: IMRO (2015, §6)

Construct: Sample from other panels
Description: Was sample from other providers used in this study?
References: ESOMAR (2012, §4)

Construct: Blending
Description: If there is more than one sample source, how are these blended together?
References: ESOMAR (2012, §3); ESOMAR (2015b, §6.1)

Construct: Sample drawn from frame
Description: How was the sample selected from the panel? Any quotas or other specifications used in selection of the sample should be specified. How many units were drawn?
References: ESOMAR (2012, §7); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.2); IMRO (2015, §23); ISO (2009, §4.6.1); ISO (2012, §4.5.1.4)

Construct: Routers
Description: Was a router used? If so, how did the vendor determine which surveys are considered for a participant? What is the priority basis for allocating participants to surveys? How is potential bias from the use of a router addressed?
References: American Association for Public Opinion Research (2015, §III.A.14); ESOMAR (2012, §8-§11); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.2)

Construct: Identity validation
Description: How does the vendor confirm participants' identities? How does the vendor detect fraudulent participants?
References: American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §22); ESOMAR (2015a, §3.8); ESOMAR (2015b, §6.1); IMRO (2015, §24); ISO (2009, §4.3.4.2, §4.3.4.3)

Construct: Within-panel deduplication
Description: What procedures does the panel employ to ensure that no research participant can complete the survey more than once?
References: American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2015a, §3.2, §3.8); IMRO (2015, §25)

Construct: Cross-panel deduplication
Description: If multiple sample sources are used, how does the panel ensure that no research participant can complete the survey more than once?
References: American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §3); ESOMAR (2015a, §3.8); IMRO (2015, §29)

Construct: Category exclusion
Description: How are panel members who have recently completed a survey on the same topic treated? If they are excluded, for how long or using what rules?
References: IMRO (2015, §15); ISO (2009, §4.6.2)

Construct: Participation limits
Description: How often can a panel member be contacted to take part in a survey in a specified period?
References: ESOMAR (2012, §19, §20); ESOMAR (2015a, §3.8); IMRO (2015, §16); ISO (2009, §4.5.1, §4.6.2)

Construct: Other data quality checks
Description: What checks are made for satisficing and fraud? These might include checks for speeding and patterned responses, trap questions, counts of missing or nonsubstantive responses (e.g., don't know), device fingerprinting, or logic checks.
References: American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §18); ESOMAR (2015a, §3.3, §3.8); ESOMAR (2015b, §6, §6.1, §6.4); IMRO (2015, §30); ISO (2009, §4.6.6, §4.7); ISO (2012, §7.2.k)

Construct: Panel profile
Description: How often are profiles updated?
References: IMRO (2015, §21)

Construct: Fieldwork dates
Description: When was the survey fielded? If there were multiple sample sources, did the field period vary between sources?
References: American Association for Public Opinion Research (2015, §III.A.4); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.8); ESOMAR (2015b, §6.3); ISO (2009, §4.7); ISO (2012, §4.8.3, §7.2.g)

Construct: Invitations and reminders
Description: What text is used for invitations and any reminders? We recommend disclosing the maximum number of reminders sent.
References: American Association for Public Opinion Research (2015, §III.A.16); ESOMAR (2012, §13); ISO (2009, §4.7)

Construct: Incentives
Description: What incentives, if any, are offered to panel members?
References: American Association for Public Opinion Research (2015, §III.A.16); ESOMAR (2012, §14); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.1-2); IMRO (2015, §20); ISO (2009, §4.5.2); ISO (2012, §7.2.i)

Construct: Questionnaire
Description: In the event that the researcher did not create the questionnaire, the questionnaire should be available, including any screening questions. As web surveys are an intensely visual medium (e.g., Couper, Tourangeau, and Kenyon 2004 and Couper, Tourangeau, Conrad, and Crawford 2004), we recommend that screenshots be provided as part of reporting.
References: American Association for Public Opinion Research (2015, §III.A.2, §III.A.14-15); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.8); ESOMAR (2015b, §5.2, §6.3); ISO (2009, §4.7); ISO (2012, §7.2.l)

Construct: Final dispositions and outcome rates
Description: How many panel members were invited to complete the survey? What is the break-off rate? How are these calculated? We recommend that the definitions of outcome rates developed by Callegaro and DiSogra (2008) and DiSogra and Callegaro (2016) be used; these are largely incorporated in American Association for Public Opinion Research (2015).
References: American Association for Public Opinion Research (2015, §III.A.18); ESOMAR (2015a, §3.7-8); ESOMAR (2015b, §6.1); IMRO (2015, §26); ISO (2009, §4.7); ISO (2012, §7.2.f.4)

Construct: Handling of mobile devices/smartphones
Description: We recommend documenting whether the questionnaire layout is adapted/optimized for smartphone completion and, if so, how it is adapted or optimized. There is an extensive literature on methods for formatting web surveys on mobile devices and implications for data quality (e.g., Peytchev and Hill 2010; Chrzan and Saunders 2012; de Bruijne and Wijnant 2013a, 2013b; Link, Murphy, Schober, Buskirk, and Hunter Childs 2014; Mavletova and Couper 2014, 2016; Arn, Klug, and Kolodziejski 2015; Struminskaya, Weyandt, and Bosnjak 2015; Lugtig and Toepoel 2016; Revilla, Toninelli, and Ochoa 2016; Couper, Antoun, and Mavletova 2017; Peterson, Griffin, LaFrance, and Li 2017; Antoun, Katz, Argueta, and Wang 2018). Although not incorporated into other guidelines, it is important that clients are sufficiently informed, as handling mobile devices may have consequences for break-off and measurement.
References: Not referenced in any guideline
Table A2. Case-Level Data Recommendations
Construct: Sample source
Notes: Sample source (e.g., list, web advertisement)

Construct: Within-survey paradata
Notes:
- Date-time survey started and completed
- Duration between survey started and completed
- Frequency of invitations to surveys: each week, etc.
- Absence of invitations to concurrent surveys during a certain time (e.g., 2 consecutive weeks after the first invitation to participate in the survey)
- Device information provided for each session for surveys completed in multiple sessions: User agent string or equivalent information (e.g., operating system and