
Review

Approximating Bayesian Inference through Model Simulation

Brandon M. Turner1,* and Trisha Van Zandt1

Highlights

Likelihood functions quantify the likelihood of a specific pattern of data given a particular cognitive model. However, as cognitive models become more complex, their likelihood functions become increasingly difficult to calculate.

Cognitive models are becoming so complex that their likelihood functions cannot be directly calculated. Therefore, new tools to approximate the likelihood functions of these complex models are needed.

Likelihood-free techniques, such as approximate Bayesian computation, allow scientists to simulate datasets that can be compared to observed datasets so as to estimate a likelihood function for the model.

Likelihood-free techniques permit researchers to use neurally plausible architectures to understand individual differences in cognitive performance.

1Department of Psychology, Ohio State University, Columbus, OH 43210, USA

*Correspondence: [email protected] (B.M. Turner).

The ultimate test of the validity of a cognitive theory is its ability to predict patterns of empirical data. Cognitive models formalize this test by making specific processing assumptions that yield mathematical predictions, and the mathematics allow the models to be fitted to data. As the field of cognitive science has grown to address increasingly complex problems, so too has the complexity of models increased. Some models have become so complex that the mathematics detailing their predictions are intractable, meaning that the model can only be simulated. Recently, new Bayesian techniques have made it possible to fit these simulation-based models to data. These techniques have even allowed simulation-based models to transition into neuroscience, where tests of cognitive theories can be biologically substantiated.

The Increasing Granularity of Cognitive Science

As the quality of everyday computational resources has improved, so too has our ability as scientists to pursue more nuanced explanations of the mind. In some lines of work, our pursuit has led to the development of neurally plausible architectures of the brain [1–4], while in others it has led to advanced modeling of individual- or even trial-level decision dynamics [5–7]. Central to both of these endeavors is the development of computational models whose mechanisms mimic our conceptualization of the mind. Because our understanding of the mind is still in its infancy, there are many theoretical mechanisms that remain plausible representations of the computations performed by the mind, and cognitive scientists are working to provide evidence for specific mechanisms.

To test the plausibility of a mechanism, we must determine whether it can produce patterns of data that might be observed when studying real-world behavior (e.g., in an experiment). Like knobs on a machine, parameters control the mechanisms within a computational model, where different settings of the parameters produce different predictions for data. The range of predictions a model can make is then limited by the range of values the parameters can take. Hence, the validity of our computational model is determined by whether the knobs can be adjusted in such a way as to produce data that closely match what we observe in the real world.

Mechanisms provide the bridge that connects the data that we observe in the real world with our theoretical explanation of how those data came to be. In statistics, this bridge is embodied in the likelihood function (see Glossary) because it quantifies how likely the data are under the model. Although mechanisms with likelihood functions have served cognitive scientists well in recent years, in part by facilitating our understanding of individual differences through hierarchical Bayesian analyses [5,8–10], the mechanisms supporting neurally plausible computations are typically more complex. They are so complex that their likelihood functions are unusable: either the functions are too poorly behaved to be computed efficiently or they have no analytic form and cannot be computed at all. As a consequence, until now it has been computationally

826 Trends in Cognitive Sciences, September 2018, Vol. 22, No. 9 https://doi.org/10.1016/j.tics.2018.06.003

© 2018 Elsevier Ltd. All rights reserved.


Glossary

Approximate Bayesian computation (ABC): a class of computational methods rooted in Bayesian statistics that bypass direct evaluation of the likelihood function through model simulation. ABC methods rely on a comparison of summary statistics from simulated and observed data.

Kernel function: a kernel function K(x) is defined as a symmetric, non-negative function that integrates to one. That is, K(x) ≥ 0 for all x ∈ (−∞, ∞); K(−x) = K(x); and ∫_{−∞}^{∞} K(x) dx = 1.

Likelihood-free techniques: any computational method that estimates parameters without explicit calculation of the likelihood function.

Likelihood function: the likelihood f(y|θ) is a function that calculates the compatibility of the evidence with the given hypothesis. The likelihood is a function defined over the data space.

Metropolis–Hastings probability: this is best defined in the context of an algorithm, such as that shown in Figure 1C. If X* represents the simulated data associated with the proposal θ*, and X_t the simulated data associated with the state of the chain θ_t on the t-th iteration, we can write the likelihood-approximated Metropolis–Hastings probability as:

α = min{1, [π(θ*) ψ(ρ(X*, Y)|δ) q(θ_t|θ*)] / [π(θ_t) ψ(ρ(X_t, Y)|δ) q(θ*|θ_t)]}.

The function q(a|b) is a transition kernel because it defines the probability of transitioning a parameter value b to a new parameter value a. For example, in Figure 1C, the transition kernel is Gaussian, and therefore q takes the form of a Gaussian density function. The function ψ(ρ(X, Y)|δ) is the similarity kernel function that weights values of θ according to the distance ρ(X, Y). For the example presented in Figure 1, ψ(·|δ) was also a Gaussian density function with standard deviation equal to δ.

Prior: the prior π(θ) is the formal specification of a hypothesis (e.g., about a parameter θ) before the data, which provide the current evidence, are observed. The prior is a function defined over the hypothesis space (e.g., a function of the parameter θ).

infeasible to use neurally plausible architectures to understand individual- or trial-level decision dynamics. Hence, to understand how the mind emerges from the brain, new tools are needed for approximating the likelihood function for complex models such that neural mechanisms may be evaluated properly.

In the sections that follow we provide an overview of new techniques designed to approximate the likelihood function for complex stochastic models in a variety of cognitive domains. First, we provide a conceptual overview of the statistical concepts behind likelihood-free techniques. We then discuss several recent applications of likelihood-free techniques in three cognitive domains, specifically episodic memory and learning, perceptual decision-making, and preferential choice. Finally, we discuss how likelihood-free techniques create new possibilities for relating cognitive models to the brain.

Simulation-based Likelihood Approximation

We are among the many cognitive scientists who advocate Bayesian statistics as a way to perform inference with cognitive models, and we therefore describe likelihood-free methods in the context of performing Bayesian inference. Bayesian statistics provide several important advantages that have been described elsewhere [10,11]. Most important for our purposes is that they allow highly structured and multilevel models to be easily fit to complex data, such that individual differences can be examined more closely than they could in alternative statistical frameworks.

To understand the central role played by the likelihood function, consider how Bayesian inference is performed. In the Bayesian framework, parameters are considered to be random variables in the same way as experimental measurements are. The variability of a parameter may reflect true variability (different values that the parameter might take on) or simply our uncertainty about the true value of the parameter. The prior distribution (or density) π(θ) is a quantification of our beliefs about a parameter θ before we observe the data, and it is defined over all the possible values that θ can take. After the data Y = y are obtained, we use the likelihood function of the model, f(y|θ), and Bayes' rule to compute the posterior distribution π(θ|y) of θ. The posterior distribution, expressed as

π(θ|y) ∝ f(y|θ)π(θ), [1]

quantifies our updated beliefs about the parameter θ: beliefs that include both the prior information contained in π(θ) and the new information in the data y.
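As a concrete instance of Equation 1, consider a binomial model of correct/incorrect responses like the one used in Figure 1. With a uniform prior, the posterior can be computed numerically by normalizing the product in Equation 1; the sketch below is our illustration (function names are ours, not from the article):

```python
from math import comb

def likelihood(y, n, theta):
    """Binomial likelihood f(y|theta): y correct responses out of n trials."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def posterior(theta, y, n, step=1e-4):
    """Posterior density pi(theta|y) under a uniform prior pi(theta) = 1,
    obtained by normalizing f(y|theta)*pi(theta) over (0, 1) numerically."""
    grid = [i * step for i in range(1, int(1 / step))]
    z = sum(likelihood(y, n, t) for t in grid) * step  # normalizing constant
    return likelihood(y, n, theta) / z

# With 5 correct answers out of 10, the posterior peaks at theta = 0.5.
density_at_half = posterior(0.5, 5, 10)
```

Under the uniform prior this numerical posterior matches the analytic Beta(6, 6) density, which is the kind of closed-form "true posterior" drawn as the black line in Figure 1A.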

Bayesian statistics has enjoyed overwhelming enthusiasm in recent years owing to easy-to-use software that implements powerful algorithms for sampling from the posterior distribution. When the likelihood f(y|θ) is known, and because the prior is specified by the researcher, sampling from the posterior can be done simply by specifying the form of the likelihood function and the prior within software such as Just Another Gibbs Sampler [12] or Stan [13]. However, what do we do when we do not know the functional form of our model? That is, what do we do when the likelihood f(y|θ) in Equation 1 has no analytic form?

Many exciting new models in cognitive science are constructed from the 'bottom up': relatively well-understood neural mechanisms are instantiated and used as the foundation for more complex structures that can produce patterns of data that might be observed in the real world. As we discuss in this Review, many of these models are used in learning and memory, perceptual decision-making, and preferential choice. Before the application of likelihood-free methods, these models were tested by constructing massive sets of hypothetical data through repeated simulation using constrained (e.g., fixed to a particular setting) parameter values derived from the neuroscientific literature [14–18].


Posterior: the posterior π(θ|y) is the distribution of the parameter θ given the observed data y. It effectively aggregates the information from the prior and the likelihood, and it updates our understanding about the parameter θ as a function of the observed data.

Stochastic models: models that are used to estimate probability distributions of potential outcomes (e.g., decision variables) by allowing for random variation in one or more inputs over time.

Sufficiency: the property of sufficiency [22,109] stipulates that a sufficient statistic S(·) of some data Y provides as much information about the data when estimating a model's parameters θ as does the whole dataset itself. For the example presented in Figure 1, X̄ is a sufficient statistic for θ*.

Several authors have discussed the limitations of the aforementioned repeated simulation approach [19–23]. Although it is not conceptually difficult, it is orders of magnitude more computationally demanding than the methods that have been harnessed to bring forth the Bayesian revolution in cognitive science. Perhaps the most crucial drawback of the repeated simulation approach is in how estimates of parameters are quantified. Specifically, we have no good way of assessing our uncertainty in the estimates (sampling variability), examining their correlation structure, or determining the appropriate tests of our research hypotheses. Because we cannot quantify the uncertainty in the parameter estimates, it becomes difficult to combine parameter estimates across individuals to build hierarchical models, and thus, given the typically small number of data points collected per person in psychological experiments, individual differences are difficult to explore in a principled fashion [10]. Finally, methods for model comparison are extremely limited because proper penalty terms for model flexibility based on the likelihood function cannot be applied [20,24]. To solve all these issues, what is needed is a statistical framework for obtaining accurate parameter estimates while simultaneously providing principled quantifications of uncertainty.

Approximate Bayesian computation (ABC) was designed to address exactly this type of problem. Originally derived to solve problems in genetics (Box 1), ABC techniques are a special type of likelihood-free method for performing Bayesian inference by bypassing the evaluation of the likelihood of a model with a simulation of that model. The simulation step produces a sample of simulated data X that are evaluated relative to the observed data Y. This evaluation is made on the basis of a distance ρ(X, Y) between X and Y, and distance can be defined in several ways. However, ABC does not use distance minimization to generate point estimates of parameters, but rather to estimate the posterior distributions of the model parameters.

Suppose our goal is to estimate the posterior distribution of a model parameter θ given some observed data Y. The logic of ABC is that, if a proposed parameter value θ* is able to generate a

Box 1. History of Likelihood-free Methods

Given a sample of DNA sequences taken from several different species, how does a geneticist estimate the time elapsed since the most recent common ancestor of those species lived? The solution to this problem requires a model of the locations and types of genetic mutations that have occurred over generations. The model must include the number of ancestors of each species over time and some idea of the number of offspring any individual has. There is no formula giving the probability of any sequence (the likelihood of the data) at each elapsed time because this probability depends on factors that explode exponentially as the number of sequences increases and as the number of mutation sites within each sequence increases.

Some 'rejection' algorithms have been developed to permit a posterior distribution for elapsed time to be estimated by simulating the mutation process and comparing the results of these simulations to the observed DNA sequences [29]. A similar algorithm was used that contrasted simulated summary statistics to the summary statistics of the observed data [99]. If the simulated and observed summary statistics were the same, then the proposed parameters were retained as samples from the posterior distributions of those parameters. The original rejection algorithms in [29] were then modified such that the proposed parameters were retained if the distance between the summary statistics was less than some small value [30].

A comprehensive survey of this pioneering work was presented in [32] in the context of modeling population genetics. A novel algorithm was also proposed that used a regression reweighting scheme that greatly improved the efficiency of the algorithm. Following [32], the use of ABC methods in population genetics exploded, and these methods were then adopted by engineers, biologists, archaeologists, epidemiologists, and statisticians. ABC was introduced to cognitive scientists in a tutorial paper [22]. ABC methods were then applied to modeling recognition memory, and the authors showed how ABC could provide estimates of computational model parameters and improve interpretation of changes in those parameters over experimental conditions [39]. These techniques were then adapted for hierarchical Bayesian models, and new generalized methods were developed that relaxed in part the need for sufficient summary statistics [38,25]. This literature is summarized in a recent volume, Likelihood-Free Methods for Cognitive Science [23].


simulated dataset X that is 'close' to the observed data Y, then θ* must have come from the desired posterior distribution with a probability that is proportional to the discrepancy between X and Y. There are therefore three components of the process that we must establish: (i) the simulation of the data X, (ii) how to measure the distance between X and Y, and (iii) how to compute the probability that θ* came from the desired posterior distribution.

To make this logic concrete, consider Figure 1. The observed data are simulated responses to 10 true/false questions, Y = {Y1, Y2, . . . , Y10}, where the probability of a correct response is θ = 0.50 (shown as the × symbol in Figure 1B). We will use ABC to estimate the posterior distribution of θ given the data Y. The ABC algorithm that we used is shown as pseudocode in Figure 1C. Corresponding to the first component of the process, we randomly selected many different values for θ* ranging from 0.30 to 0.70 (line 4). For each value of θ* we simulated a corresponding dataset X = {X1, X2, . . . , X10} by generating 10 random correct and incorrect responses with probability θ* (line 5).
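The two simulation steps just described (lines 4 and 5 of the pseudocode) amount to very little code. A sketch of ours, with uniform draws standing in for the chain-based perturbation of line 4:

```python
import random

def simulate_dataset(theta_star, n=10, rng=random):
    """Line 5: generate n true/false responses, each correct (1) with
    probability theta_star and incorrect (0) otherwise."""
    return [1 if rng.random() < theta_star else 0 for _ in range(n)]

random.seed(1)
# Line 4, simplified: draw many candidate values theta* between 0.30 and 0.70.
proposals = [random.uniform(0.30, 0.70) for _ in range(1000)]
datasets = [simulate_dataset(t) for t in proposals]
```

Each element of `datasets` is one simulated X = {X1, . . . , X10} to be compared against the observed Y in the next step.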

[Figure 1 appears here. Panel (A): histogram of the posterior estimate for θ over a density axis; panel (B): joint distribution of θ and the discrepancy X̄ − Ȳ; panel (C): the pseudocode reproduced below.]

1. Given data and model Y ~ Model(θ), error term δ, and prior distribution π(θ).
2. Initialize θ1 = θ0.
3. for 2 ≤ i ≤ N do
4.   Sample θ* by perturbing it from the chain's location: θ* ~ N(θi−1, σ)
5.   Generate data X from θ*: X ~ Model(θ*)
6.   Calculate discrepancy ρ(X, Y)
7.   Sample p* ~ U(0, 1), and calculate MHP.
8.   if p* < MHP then
9.     Store θi ← θ*
10.  else
11.    Store θi ← θi−1
12.  end if
13. end for

Figure 1. Approximate Bayesian Computation Example. Panel (A) shows the posterior estimate (i.e., histogram) for θ alongside the true posterior distribution (i.e., black line). The red vertical line indicates the true value of the parameter (i.e., θ = 0.50). Panel (B) shows the joint distribution of the parameter of interest θ (x axis) along with the discrepancy between the simulated data X produced by θ and the observed data Y (y axis). The true value of the parameter is illustrated with a × symbol (i.e., θ = 0.50), and the horizontal broken line is drawn where the discrepancy between X and Y is zero. The joint distribution is color-coded such that values of θ producing simulated data X far from the observed data Y have low weights (i.e., cooler colors), whereas data that are close have high weights (i.e., warmer colors). Panel (C) shows pseudocode for implementing the sampling procedure obtained in (A) and (B). Here, MHP stands for the likelihood-approximated Metropolis–Hastings probability, which is the computation that selects weighted values of θ for the posterior distribution [23,33,34].


Corresponding to the second component of the process, we need to define a function ρ(X, Y) that quantifies the discrepancy between a simulated dataset X and the observed data Y. There are many ways that we could do this: for the purposes of this example we defined ρ(X, Y) = X̄ − Ȳ, the difference between the means of the two datasets. Figure 1B shows each sampled θ* on the x axis, and the difference (X̄ − Ȳ) corresponding to each X on the y axis. The broken horizontal line in Figure 1B is located at X̄ − Ȳ = 0, where the values of θ* produced data that perfectly matched the observed data. If we were to compute the posterior distribution of θ given the observed data Y using the likelihood, these values of θ would be samples from the true posterior distribution.

Values of θ* that produce data X that do not exactly match Y may also arise from the true posterior with some probability. This probability is proportional to the distance ρ(X, Y) between X and Y. Corresponding to the third and final component of the process, we must choose a similarity kernel function ψ(ρ(X, Y)|δ) that produces a weight for each value of θ*; the highest weights are given to those values of θ* that generated datasets X that were closest to Y. The parameter δ determines how quickly the weights decrease as the distance between X and Y increases. For this example we used a Gaussian (normal) similarity kernel. The influence of the similarity kernel on the posterior weights is illustrated by the color gradient in Figure 1B, where warmer (red) colors indicate higher weights than cooler (blue) colors.

With the three components of the ABC process in place, we can implement the algorithm in Figure 1C and obtain the estimated posterior density for θ shown as the histogram in Figure 1A. Because this is a toy problem, we know the true posterior distribution for θ, which is shown as the unbroken black line in Figure 1A. The similarity of the histogram to the true distribution indicates that the posterior distribution of θ has been approximated reasonably well.
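The pseudocode in Figure 1C translates almost line for line into runnable code. The sketch below is our rendering, not the authors' implementation: the proposal scale σ and kernel width δ are illustrative choices, and the symmetric Gaussian proposal makes the transition-kernel terms q(·|·) cancel from the Metropolis–Hastings probability.

```python
import math
import random

def simulate(theta, n=10, rng=random):
    """Y ~ Model(theta): n Bernoulli responses, correct with probability theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]

def rho(x, y):
    """Discrepancy rho(X, Y): difference of the two sample means."""
    return sum(x) / len(x) - sum(y) / len(y)

def gaussian_kernel(r, delta):
    """Similarity kernel psi(rho|delta): Gaussian with standard deviation delta."""
    return math.exp(-r * r / (2 * delta * delta))

def abc_mcmc(y_obs, n_iter=5000, delta=0.05, sigma=0.05, rng=random):
    """Kernel-based ABC with Metropolis-Hastings updates (after Figure 1C).
    Uniform prior on (0, 1); symmetric Gaussian proposals."""
    theta = 0.5                               # line 2: initialize the chain
    w = gaussian_kernel(rho(simulate(theta, rng=rng), y_obs), delta)
    chain = []
    for _ in range(n_iter):                   # line 3
        prop = rng.gauss(theta, sigma)        # line 4: perturb current state
        if 0 < prop < 1:                      # prior support; else reject
            x = simulate(prop, rng=rng)       # line 5: simulate data
            w_prop = gaussian_kernel(rho(x, y_obs), delta)   # line 6
            mhp = min(1.0, w_prop / w)        # line 7: acceptance probability
            if rng.random() < mhp:            # lines 7-8: accept/reject
                theta, w = prop, w_prop       # line 9
        chain.append(theta)                   # lines 10-11: else keep old state
    return chain

random.seed(7)
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]            # 5 of 10 correct
samples = abc_mcmc(y)
# The mean of `samples` approximates the posterior mean, near theta = 0.5.
```

A histogram of `samples` plays the role of the histogram in Figure 1A; shrinking δ sharpens the approximation at the cost of lower acceptance rates.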

Figure 1 illustrates one particular likelihood-free technique (kernel-based ABC). For this technique to be successful we must correctly choose the distance function and similarity kernel function (see [22,23] for tutorials). In particular, the best choices of ρ(X, Y) and ψ(ρ(X, Y)|δ) will depend on the model used to describe the data. For our simple example, these choices were not difficult: the distance ρ(X, Y) = X̄ − Ȳ is determined by what are called sufficient statistics for the datasets X and Y (Box 2), and the Gaussian similarity kernel is convenient and works well in many situations. In more realistic situations, these choices may be more difficult [25–28].

As the sections below illustrate, different behavioral measures, different levels of analysis (e.g., hierarchical or not), and models with different statistical properties all dictate the choice of algorithm that will most efficiently yield the desired results. To provide an overview of these considerations, Box 2 discusses a few important decisions that any would-be user should consider when selecting among algorithms. In addition, there are several problem-specific details that can guide the choice of algorithm. For example, if the number of parameters to be estimated is low, a good choice is a rejection-based sampler similar to the kernel-based algorithm illustrated in Figure 1, except that the similarity kernel is all-or-none [22,29–31]. Other decisions focus on more statistically nuanced distinctions, such as error terms in the approximation [32–34], or the type of approximation that is most appropriate for the application [25,35,36]. Finally, overarching statistical techniques can be used to extend most of the core likelihood-approximation algorithms when fitting hierarchical models to data [37–40].

Models of Episodic Memory

In cognitive science, likelihood-free techniques were first applied to models of episodic recognition memory. In recognition memory tasks, people are asked to commit a list of items


Box 2. Distinguishing among Inference Algorithms

Many algorithms can approximate the likelihood function, and they vary along several dimensions, which makes organizing them difficult. At the most basic level, most likelihood-free algorithms can be described roughly in five essential steps:

1. Generate a candidate parameter value θ*.
2. Generate a set of simulated data X using θ* and a model.
3. Summarize the properties of X.
4. Compare the summary of X to a summary of Y.
5. Apply a weight to θ* that reflects how close X is to Y.

Across the literature, most likelihood-free algorithms have focused on developing innovative techniques at steps 1, 3, or 5. Regarding step 2, most likelihood-free algorithms agree that simply generating a large set of simulated data X should be good enough provided that the summaries in step 3 are relatively stable. Although we do not consider step 2 to be an important step for classification, it can affect the choices at later steps. For example, determining the size of X will depend on how quickly one can simulate data from the model, as well as on how large the observed data Y are. In step 3, the typical approach is to summarize the data X by a set of summary statistics S(X). Summarizing the data can sometimes lead to poor accuracy, but has the benefit of being computationally efficient. On the other hand, some algorithms require larger sets of simulated data to ensure that a quality likelihood approximation can be formed. These algorithms are considered 'brute force' because they bypass step 3 and instead compare the full distribution of X to Y rather than using statistics S(X). This process is computationally costly, but the benefit is that no information is lost in the summarization.

Regarding step 4, most likelihood-free algorithms specify that a Euclidean distance can be used to compare the statistics of the simulated data S(X) to the statistics of the observed data S(Y). In other words, we can simply calculate S(X) − S(Y) to evaluate how similar X is to Y. However, again we find that the steps listed above interact with one another: because step 4 depends on the way in which the data are summarized via S(·), step 4 depends intimately on step 3. The brute-force algorithms mentioned above summarize the full distribution of X through kernel density estimation [25,100] or even a histogram method [36]. In this case, because a proper probability density function is used as a surrogate for the predictions of a model, we can evaluate the probability of observing the data Y under the distribution of X.
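The five steps above map directly onto the simplest rejection algorithm, which uses an all-or-none weight at step 5. The sketch below is our illustration only; the binomial model, the sample-mean summary, and the threshold ε are hypothetical choices:

```python
import random

def rejection_abc(y_obs, simulate, summary, prior_sample,
                  epsilon, n_draws=20000, rng=random):
    """Rejection ABC: retain theta* whenever |S(X) - S(Y)| <= epsilon.
    Implements steps 1-5, with an all-or-none weight in step 5."""
    s_obs = summary(y_obs)                   # summarize the observed data once
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)            # step 1: candidate parameter
        x = simulate(theta, rng)             # step 2: simulated data
        s_sim = summary(x)                   # step 3: summary statistic
        if abs(s_sim - s_obs) <= epsilon:    # step 4: compare summaries
            accepted.append(theta)           # step 5: all-or-none weight
    return accepted

# Hypothetical use with the binomial example of Figure 1.
random.seed(3)
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
post = rejection_abc(
    y,
    simulate=lambda t, rng: [1 if rng.random() < t else 0 for _ in range(10)],
    summary=lambda d: sum(d) / len(d),       # the sample mean is sufficient here
    prior_sample=lambda rng: rng.random(),   # uniform prior on (0, 1)
    epsilon=0.05,
)
```

The retained values in `post` are draws from an approximate posterior; shrinking ε tightens the approximation while lowering the acceptance rate.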

to memory during a study phase, and are then asked to discriminate between studied and new items during a test phase. People make different types of errors depending on whether they believe an item to have been on the previously studied list (and it was not) or they believe an item to be new (and it is not), and the frequencies of these errors tend to vary with experimental manipulations such as the baseline frequency of the item in the environment, repeated presentations of some items during study, and the length of the study list.

Different models explain why the patterns of response probabilities change as a function of these different experimental manipulations. One model, BCDMEM (bind, cue, decide model of episodic memory; Figure 2C) [15], postulates that, when an item is presented at test, the contexts in which that item was previously experienced are retrieved and matched against a representation of the context of interest. The overlap between the context of the study phase and all other contexts in which the item was previously experienced ultimately dictates whether an item is correctly recognized. Another model, the REM (retrieving effectively from memory [14]) model, assumes that at test people perform a global matching process where the test item is compared to all other memory traces produced during the study period. These memory traces are incomplete representations of the items presented at study, and thus the probability of a correct decision primarily depends on the ability of a person to encode the information during the study phase.

Both models can explain a range of experimental effects, but they make different assumptions about the role played by interference in episodic memory: BCDMEM posits that interference from the different contexts in which a test item appeared makes recognition difficult, whereas REM posits that interference from incomplete representations of the other items in the study context makes recognition difficult. These two different ideas about interference lead the two models to make different predictions about performance under a range of experimental conditions. In particular, the two models make different predictions for how response proportions change as

Trends in Cognitive Sciences, September 2018, Vol. 22, No. 9 831



Figure 2. Four Models Benefiting from Likelihood-free Techniques. Panel (A) shows a connectionist network, such as LEABRA (local, error-driven and associative, biologically realistic algorithm), that maps featural inputs from a visual display into a semantic label (e.g., ‘dog’). Panel (B) shows a dynamic learning model in which stimulus inputs xt at time t activate representational nodes to produce a category response yt. If information about the true category structure is supplied, it is associated with the features xt. Over time, the representations evolve to a highly structured category representation in the feature space. Panel (C) shows how BCDMEM (bind, cue, decide model of episodic memory) retrieves and reinstates previous contexts when confronted with a probe word (input layer). The pattern of overlap between the reinstated and retrieved context vectors ultimately determines the recognition memory decision. Panel (D) shows a path diagram of the multi-attribute leaky competing accumulator model. At the attribute selection stage, features of the stimulus are sampled according to a Bernoulli process. Once a feature is selected, it is used as input to a series of processing steps where context is constructed by way of transformations on pairwise differences. Finally, the processed attribute values are used as input into a leaky, competitive accumulation process where a choice is made corresponding to the ‘winning’ accumulator.

a function of the length of the study list: the list-length effect. While REM predicts that performance should decrease with increases in the length of the study list, BCDMEM predicts that performance should be invariant with the size of the study list.

In a typical model-comparison situation we might fit both models to data, examine the estimated parameter values and goodness-of-fit statistics, and come to a conclusion about


which model explained the data better. However, both BCDMEM and REM are simulation-based models with likelihoods that are very difficult to derive [41,42]. The lack of likelihood functions made this model-comparison exercise anything but typical, and thus their relative goodness of fit to data had never been examined in a rigorous way. Using a mixture of likelihood-free techniques, hierarchical versions of BCDMEM and REM were fit to four different datasets [39]. The most important test of the models was centered on the number of people in the experiments who exhibited a list-length effect. If most people do, then the models must be able to capture the magnitude of the effect. For these data, BCDMEM provided the best explanation for list-length data across the four experiments. The authors argued that, because REM always predicts a strong list-length effect, people who did not exhibit the list-length pattern strongly supported BCDMEM, and therefore BCDMEM better explained individual variability in the hierarchical model.

Dynamic Models of Learning
Although all models of learning incorporate a memory component in some form, they include an additional layer of complexity that complicates the inference process. Namely, when studying learning dynamics, one must describe how a person’s representation of the stimulus and task environment evolves through time. Importantly, models of learning must specify why a person makes a particular choice on trial t, given the interaction of the person with stimuli experienced from trials 1 through (t − 1), and perceptual processes that are unique to that person.

Modeling learning dynamics presents a unique challenge because the commonly used assumption of independent and identically distributed observations in cognitive experiments is violated. In situations where the measurements arise from an evolving data-generating mechanism, it can be difficult to specify a likelihood function, and simulation-based methods are therefore particularly advantageous. To fit learning models to data, one must ‘train’ a model on the same sequence of stimuli that the person performing the task experienced, at the same time making predictions about the observed measurements on each trial. Because the predictions evolve with experience, the likelihood function must be approximated on each trial for each person.
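The trial-by-trial approximation described above can be sketched generically: at each trial, replay the participant's history through the model many times and estimate the probability of the observed choice from the proportion of matching simulated choices. The toy learner, its single learning-rate parameter, and the synthetic data below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_choice(theta, history, stimulus):
    # Hypothetical learner: its choice probability drifts toward the
    # feedback it has received so far, a stand-in for any model whose
    # trial-wise predictions can only be simulated.
    p = 0.5
    for s, feedback in history:
        p += theta * (feedback - p)   # theta plays the role of a learning rate
    return rng.random() < p

def approx_log_likelihood(theta, stimuli, choices, feedbacks, n_sims=200):
    """Approximate the likelihood trial by trial: on trial t, replay the
    same history the participant experienced and estimate P(choice_t)
    as the proportion of simulated choices that match the data."""
    log_like = 0.0
    history = []
    for s, c, f in zip(stimuli, choices, feedbacks):
        sims = [simulate_choice(theta, history, s) for _ in range(n_sims)]
        p_hat = np.mean(sims) if c else 1 - np.mean(sims)
        log_like += np.log(max(p_hat, 1 / n_sims))  # guard against log(0)
        history.append((s, f))                      # history grows with experience
    return log_like

# Hypothetical participant data: 20 trials with positive feedback.
stimuli = [0] * 20
feedbacks = [1] * 20
choices = [1] * 15 + [0] * 5
ll = approx_log_likelihood(0.3, stimuli, choices, feedbacks)
```

Because `history` grows each trial, the simulations must be re-run through the entire sequence, which is exactly why these approximations are computationally demanding.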

Confronting this problem, likelihood-free techniques have been used to fit error-correcting criterion models [43] to data from a signal-detection task [25]. This model specifies how the signal-detection theory criterion parameter [44–46] should be adjusted in response to feedback on each trial. Although the error-correcting model is mathematically tractable, it served as a case study for the validity of likelihood-free methods. In one example, the error-correcting model was fit to the same data using both likelihood-free and likelihood-informed methods [25]. By showing that the posterior estimates resulting from the two inference methods were similar, the authors were able to affirm the feasibility and accuracy of the likelihood-free technique they applied.
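The general flavor of an error-correcting criterion model can be sketched as a signal-detection simulation in which the criterion shifts after each error. The update rule, step size, and trial design below are illustrative assumptions, not the exact parameterization of the published model [43]:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_sdt_session(d_prime, step, n_trials=100, c0=0.0):
    """Signal-detection model with an error-correcting criterion:
    feedback nudges the criterion in the direction that would have
    avoided the error on that trial."""
    c = c0
    responses, criteria = [], []
    for t in range(n_trials):
        is_signal = t % 2 == 0                       # toy alternating design
        x = rng.normal(d_prime if is_signal else 0.0, 1.0)
        say_signal = x > c
        if say_signal and not is_signal:
            c += step                                # false alarm: raise criterion
        elif not say_signal and is_signal:
            c -= step                                # miss: lower criterion
        responses.append(say_signal)
        criteria.append(c)
    return np.array(responses), np.array(criteria)

responses, criteria = simulate_sdt_session(d_prime=1.0, step=0.1)
```

Because the criterion on trial t depends on the whole response history, the model's predictions are sequence dependent, which is what makes it a useful validation case for trial-wise likelihood approximation.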

Other results have established that likelihood-free techniques are viable methodologies for neurally plausible learning models [38], such as the dynamic, stimulus-driven model of signal detection (Figure 2B) [47]. The model they investigated is an associative and recursive learning model, where the weights associating stimuli to their appropriate responses are adjusted on each trial using a similarity kernel [48]. Because the representations depend on the stimulus and corrective feedback presented on each trial, unlike the error-correcting criterion model [43], the predictions of the model are intractable and must be simulated through the entire learning sequence. As such, a new method, Gibbs ABC, was devised for performing likelihood-free inference on a hierarchical version of the model, and this was used to provide detailed insight into the learning dynamics of individual participants [38].


Stochastic Models of Perceptual and Preferential Choice
In the field of perceptual decision-making, classic models of decision dynamics are derived from sequential sampling theory – which assumes that, on the presentation of a stimulus, evidence toward alternative responses is sampled from the stimulus array and accumulated over time. The accumulation dynamics are often described with stochastic differential equations, which make the models mathematically elegant, but often complex.

Continuing discoveries in neurophysiology have compelled researchers to improve on these classic computational models by augmenting them with neurally plausible computational mechanisms [49–55]. These mechanisms include leakage, gated and temporally varying accumulation, and lateral and feedforward inhibition. Although these new mechanisms have stimulated exciting new directions of study, they have also rendered nearly all modern models of perceptual decision-making mathematically intractable, which further emphasizes the need for likelihood-free algorithms. Without such algorithms there is no way to justify the additional complexity of models that cannot be fit to data [56]. There is neither a way to estimate their parameters nor a way to validate the models by observing appropriate changes in those parameters over experimental conditions.

One investigation of several neurally inspired mathematical models of perceptual choice using likelihood-free techniques was performed in [57] (see also [25,58]). They fit their models to data from a random-dot motion task, in which people were instructed to decide in which direction most of the dots in an array were moving. This task also asked people to respond as quickly as possible, as accurately as possible, or at their own pace. Two different neurally inspired models were evaluated and contrasted with a baseline model that did not include neurophysiological mechanisms. One model had feedforward inhibition, where a constant level of inhibition was applied throughout the accumulation process. The other model had leakage – the passive loss of information through time – and lateral inhibition that was proportional to the amount of evidence accumulated at each point in time. They found that both neurally inspired models fit the data better than the baseline model, even after penalties for model complexity were applied. These results suggest that adding neurally plausible mechanisms can sometimes provide better fits to data, even though adding them complicates the mathematics detailing the models.
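Mechanisms such as leakage and lateral inhibition can be illustrated with a minimal Euler–Maruyama simulation of a leaky competing accumulator. The parameter values, time step, and threshold are arbitrary illustrations, not fitted values from the studies cited:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_lca(drift, kappa, beta, threshold=1.0, dt=0.01,
                 noise_sd=0.1, max_time=5.0):
    """Leaky competing accumulator: each accumulator gains evidence at
    its drift rate, leaks at rate kappa, and receives lateral inhibition
    beta from its rivals. Returns (choice index, response time), or
    (None, max_time) if no accumulator reaches threshold."""
    x = np.zeros(len(drift))
    t = 0.0
    while t < max_time:
        inhibition = beta * (x.sum() - x)            # input from rival accumulators
        dx = (drift - kappa * x - inhibition) * dt
        dx += noise_sd * np.sqrt(dt) * rng.standard_normal(len(x))
        x = np.maximum(x + dx, 0.0)                  # evidence cannot go negative
        t += dt
        if x.max() >= threshold:
            return int(x.argmax()), t
    return None, max_time

choice, rt = simulate_lca(drift=np.array([1.2, 0.8]), kappa=0.2, beta=0.2)
```

Repeating this simulation many times per parameter setting yields the joint choice and response-time distributions that likelihood-free methods then compare against data; the nonlinearity introduced by the floor at zero is one reason these models lack analytic likelihoods.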

The human ability to integrate changing stimulus information through time has provoked the development of experimental paradigms using dynamic stimuli with temporally dependent features. In these experiments the strength of evidence may (for example) strongly favor one alternative early within the trial, but the strength may decrease at later points such that another alternative should be preferred [26,50,59]. Models that account for choice performance in such tasks are nonstationary, in that the parameters describing the rate at which information accumulates change over time [60,61]. Nonstationary parameters greatly increase the difficulty of fitting the models to data. For example, the multichannel diffusion model [60] has no analytic likelihood. Instead, approximate expressions for the mean and variance were derived for the behavioral measure it was trying to predict (response times), and the parameters of the model were estimated using the method of least squares.

In the experiments presented in [60] the stimuli varied over time in a deterministic way. In the dot-motion task the stimuli vary according to some random process, which means that the parameters of the accumulation process not only vary over time but are also random processes themselves. The likelihood function can therefore become intractable. To address this problem, a pseudo-likelihood approximation was developed based on a mixture of analytic [62] and


brute-force [25,36] methods to derive probability distributions for the behavioral metrics of interest (i.e., choice response times [26]). Using this methodology, seven nested hierarchical models were tested with varying assumptions about which psychological mechanisms were most likely responsible for delays in decision times attributable to the acquisition of different stimulus information [26]. Ultimately, the authors concluded that the delayed information integration was best explained by changes in evidence accumulation rates rather than stimulus-independent biases such as starting points. These techniques were later extended to other stochastic accumulator models, such as the piecewise diffusion decision model [27,110].
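The brute-force half of such a pseudo-likelihood can be sketched as probability density approximation: simulate a large synthetic sample, smooth it with a kernel density estimate, and evaluate the observed measurements under the smoothed density. The shifted-lognormal simulator and its parameters below are invented stand-ins for any intractable response-time model:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_rts(theta, n=2000):
    # Hypothetical simulator: shifted lognormal response times stand in
    # for any model whose RT distribution we can only sample.
    shift, mu = theta
    return shift + rng.lognormal(mu, 0.3, size=n)

def pda_log_likelihood(theta, observed_rts, n_sims=2000):
    """Probability density approximation: a Gaussian kernel density
    estimate (Silverman's bandwidth) of the simulated sample serves as
    a surrogate likelihood for the observed response times."""
    sims = simulate_rts(theta, n_sims)
    h = 1.06 * sims.std() * n_sims ** (-1 / 5)       # Silverman's rule of thumb
    diffs = (observed_rts[:, None] - sims[None, :]) / h
    dens = np.exp(-0.5 * diffs ** 2).sum(axis=1) / (n_sims * h * np.sqrt(2 * np.pi))
    return np.log(np.maximum(dens, 1e-12)).sum()

observed = simulate_rts((0.3, 0.0), n=200)           # pretend these are data
ll_good = pda_log_likelihood((0.3, 0.0), observed)   # true parameters
ll_bad = pda_log_likelihood((0.3, 0.8), observed)    # mis-specified parameters
```

The approximated log likelihood is higher at the data-generating parameters than at mis-specified ones, which is what lets the surrogate stand in for the analytic likelihood during sampling.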

There is growing interest in applying perceptual decision-making models to tasks involving hedonic stimuli [63–67]. Without modification, perceptual models are incapable of producing many of the interesting behaviors that people exhibit when making decisions among hedonic stimuli, such as context-preference reversals [68,69], common-difference effects [70], and the magnitude effect [71]. The modifications made to explain these effects involve several additional valuation mechanisms that process the stimuli before the final deliberation process, which typically resembles the classic architectures assumed by models of perceptual decisions (Figure 2D). These additional valuation processes complicate the mathematics describing the predictions of the model such that the likelihood function cannot be derived.

In two examples, an assortment of likelihood-free techniques has been used to examine the relative fidelity of mechanisms previously proposed to explain context effects [40,72]. In addition to fitting hierarchical versions of the four most prominent models of context effects [63,66,67,73], the authors also developed a ‘switchboard’ analysis intended to integrate over the many decisions that a potential model-maker could make [40]. To do this, each model was treated as a particular sequence of binary switches in the process of theory development. By examining every possible configuration of switches on the switchboard, the authors not only subsumed the extant models in the literature but also proposed and evaluated 428 new models of context effects. In the end, they used likelihood-free techniques and model averaging to identify the best and worst combinations of model mechanisms for capturing context effects.
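The switchboard idea amounts to enumerating every combination of binary modeling choices. A minimal sketch, with switch names that are purely illustrative and not the set used in [40]:

```python
from itertools import product

# Hypothetical switchboard: each theoretical commitment is a binary
# switch, and every configuration of switches defines one candidate model.
switches = {
    "lateral_inhibition": (False, True),
    "leakage": (False, True),
    "loss_aversion": (False, True),
    "pairwise_comparison": (False, True),
}

def enumerate_models(switches):
    """Yield every model specification implied by the switchboard."""
    names = list(switches)
    for settings in product(*(switches[n] for n in names)):
        yield dict(zip(names, settings))

models = list(enumerate_models(switches))
print(len(models))  # 2 ** 4 = 16 candidate model specifications
```

With k binary switches the switchboard yields 2^k candidate models, which is why the full analysis in [40] produced hundreds of models to fit and average over.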

Extensions to Neuroscience
The likelihood-free techniques we have presented so far have given us a better understanding of cognitive models applied to behavioral data. These techniques also offer exciting new possibilities for making connections between theoretical models of cognition and cognitive neuroscience. Within the emerging field of model-based cognitive neuroscience [74,75], the parameters of cognitive models are linked to statistical models of neural activation in an effort to ascribe mechanistic interpretations to brain function. The field is fortunate to have many options for establishing these links [76], such as complete cognitive architectures [1,2,77–79], explicit models of covariation [7,80–84], or even the humble general linear model [85–88].

For example, likelihood-free techniques have been used to examine the relationship between lateral inhibition and the engagement of prefrontal cortex in the intertemporal choice task [28]. In this task, people are asked to choose between a lower-valued immediate reward and a higher-valued reward at some point in the future. Similar to the adjustments made in perceptual models for preferential choice, mechanisms such as lateral inhibition and leakage [49,50] were adapted to examine a broad range of possible theoretical explanations of how self-control processes emerge when making intertemporal choice decisions [28]. After fitting their models hierarchically to data from many individuals, they determined that the best explanation for the decision processes of these individuals was a dynamic, oscillatory feature-selection process [49,64] combined with active suppression (i.e., through lateral inhibition [50,63]) of tempting,


but inferior, choice options. After the best model had been selected, an empirical ABC procedure was used to estimate single-trial parameters describing lateral inhibition, which were then linked to neural measures obtained with fMRI through a general linear model. Ultimately, they found that the frontoparietal areas exhibiting trial-by-trial correlation with the degree of lateral inhibition were consistent with extant theories of cognitive control [89,90], suggesting the possibility of a unifying neural basis for control across perceptual and hedonic stimulus domains.
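The linking step above, regressing a neural measure on trial-by-trial parameter estimates, can be sketched with an ordinary least-squares general linear model. All quantities here are synthetic, and the effect size is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical single-trial quantities: ABC-estimated lateral inhibition
# on each trial, and the BOLD response of one region on the same trials.
n_trials = 120
inhibition = rng.normal(0.0, 1.0, size=n_trials)
bold = 0.4 * inhibition + rng.normal(0.0, 0.5, size=n_trials)

# General linear model: regress the neural measure on an intercept
# plus the trial-by-trial inhibition estimates.
X = np.column_stack([np.ones(n_trials), inhibition])
beta, *_ = np.linalg.lstsq(X, bold, rcond=None)
# beta[1] is the slope relating inhibition to BOLD, recovered near 0.4
```

In practice such regressions are run voxel-wise or region-wise, and the resulting slope maps are what identify brain areas whose activity tracks the model parameter.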

Concluding Remarks and Future Perspectives
Likelihood-free algorithms are extremely important for the field of cognitive science, where explanations of the mind (i.e., realized through behavioral experiments or neuroscientific data) often take complex forms. In this Review we have discussed several applications of likelihood-free techniques to important cognitive problems, problems that are addressed by evaluating the ability of a model to fit behavioral data. We are now able to establish connections between the mechanisms assumed by cognitive models and neural activity through various statistical linking procedures [7,28,55,91,92]. There are many exciting new developments that use likelihood-free techniques to understand neural function.

For example, we now have complete neural architectures that make simultaneous predictions for neural activity and decision dynamics [3,4,53,93–95]. As Figure 3 illustrates, likelihood-free algorithms such as probability density approximation [25,36] or synthetic likelihood [35] allow us to approximate even the most complex likelihood functions for model evaluation purposes. The plausibility of these models can now be assessed by fitting them hierarchically across people such that important individual differences can be appreciated. Whereas in the past fitting these complex neural architectures would have been unfathomable, the accessibility of powerful computers combined with generalized simulation-based fitting routines makes these endeavors entirely possible.

Figure 3. Pipeline for Evaluating Neural Architectures. During an experiment, neural measures can be collected (i.e., observed data in red; top panel) as signatures of mental activity. Our goal is then to determine the settings of the parameters of our neural architecture such that simulated data (i.e., simulated data in blue; bottom panel) closely match the data that were observed. To do this, we compare the simulated data to the observed data [i.e., ρ(X, Y)] and weigh the distance with a kernel function [i.e., ψ(ρ(X, Y)|δ)]. If chosen appropriately, these functions allow us to accept a good parameter value (e.g., simulation 1) and reject poor parameter settings (e.g., simulation 2). Abbreviation: BOLD, blood oxygen level-dependent.

Likelihood-free methods are essential for testing whether our cognitive theories are accurate, predicting patterns of human data (e.g., [111]), characterizing aberrant behavior (e.g., computational psychiatry [96]), and understanding important individual differences in cognitive performance. Because individual differences are key to understanding complex relationships between brain activity and behavior, likelihood-free algorithms permit researchers to investigate cognition at a finer granularity than has ever been accomplished before. Whereas in the past scientists made simplifying assumptions to facilitate model evaluation that might have compromised the predictive or explanatory accuracy of the model, likelihood-free algorithms permit scientists to forego such simplifying assumptions (see Outstanding Questions). Although likelihood-free techniques are clearly not without their own intrinsic faults (Box 3), waves of new methodological advances continue to solve these problems. With the caveat that candidate models are falsifiable and parameters are identifiable [20,97,98], likelihood-free techniques make a scientist’s imagination the upper bound on new theories of cognition.

Box 3. Limitations
No discussion of a new methodology would be complete without a brief mention of its limitations. Three such limitations are discussed here: computational cost, inflated parameter uncertainty, and the lack of good methods for model comparison.

Computational Cost
One major current limitation is the number of model simulations that are sometimes needed when using likelihood-free techniques, and this number is especially high in methods such as probability density approximation [25,36] and synthetic likelihood [35]. These methods require a stable distribution of data such that some form of kernel can be applied to connect the model simulation to the observed data. Unfortunately, this process can be computationally demanding. However, this bottleneck could be circumvented by better computational resources (e.g., graphics processing units [25]), better data-smoothing techniques [36], or sophisticated methods for automatically selecting summary statistics [101,102].

Inflated Parameter Uncertainty
A persistent issue in likelihood-free techniques is the additional uncertainty introduced into posterior estimates by virtue of the model simulation process. This occurs when the summary statistics are not sufficient for the model parameters [22,103]. When sufficiency cannot be guaranteed, the hope is that a large set of summary statistics can be used to produce joint sufficiency. Otherwise, one must incur a large computational cost through brute-force approximations [25,36,57,58].

Model Comparison
An issue related to parameter uncertainty is the comparison between models [104]. When parameter uncertainty is inflated, statistics based on the approximated likelihood [105–107] or the predictive distribution [108] are imprecise. For example, some have claimed that likelihood-free methods can increase the error of the approximated log likelihood by 1–2% [36], and this can distort the validity of conclusions drawn in a relative comparison. Hence, accurate parameter estimates are vital for accurate model comparison in the likelihood-free context.

Outstanding Questions
What can be done to ameliorate the issue of computational complexity? To fully resolve the statistical issues inherent to many likelihood-free algorithms, brute-force algorithms have become a popular method for forming the approximation of the likelihood function. However, many neural architectures are inefficient to simulate, making the application of likelihood-free techniques time-consuming. How can the simulation of the model be expedited in a way that facilitates likelihood approximation?

Can cloud computing facilitate the use of likelihood-free techniques? Given the common interest in particular sets of cognitive models, are there collective methods that could be used such that simulated data could be stored, downloaded, and used during the inference process, rather than simulating entirely new datasets?

Are there better statistical methods for reducing the number of model simulations in some cognitive tasks? For example, choices that depend on sequences of dependencies, such as those experienced in free recall tasks, create a combinatorial explosion of possibilities for data sequences that could have been observed. Hence, when using likelihood-free techniques, one must simulate the model many more times to effectively approximate the probability of each recall sequence.

How will the development and accessibility of efficient methods of likelihood-free inference influence the development of new cognitive theories? In years past it was considered beneficial to propose new theories of cognition with elegant mathematical descriptions for the predictions of a model. However, to arrive at analytic expressions for these predictions, simplifying assumptions were often necessary. With the advent of new likelihood-free algorithms, it can sometimes be faster to approximate the likelihood than to calculate its corresponding analytic expression.

Acknowledgments
We would like to thank Fiona Molloy for her assistance with the artwork, and Andrew Heathcote and William Holmes for comments that improved an earlier version of this article.

Supplemental Information
Supplemental information associated with this article can be found, in the online version, at https://doi.org/10.1016/j.tics.2018.06.003.

References

1. Anderson, J.R. et al. (2008) A central circuit of the mind. Trends

Cogn. Sci. 12, 136–143

2. Anderson, J.R. et al. (2010) Neural imaging to track mentalstates. Proc. Natl. Acad. Sci. U. S. A. 107, 7018–7023

3. Eliasmith, C. et al. (2012) A large-scale model of the functioningbrain. Science 338, 1202–1205

4. Eliasmith, C. (2013) How to Build a Brain: A Neural Architecturefor Biological Cognition, Oxford University Press

5. Lee, M.D. (2008) Three case studies in the Bayesian analysis ofcognitive models. Psychon. Bull. Rev. 15, 1–15

6. Frank, M. et al. (2015) fMRI and EEG predictors of dynamicdecision parameters during human reinforcement learning. J.Neurosci. 35, 485–494

7. Turner, B.M. et al. (2015) Combining cognitive abstractions withneurophysiology: the neural drift diffusion model. Psychol. Rev.122, 312–336

8. Rouder, J.N. and Lu, J. (2005) An introduction to Bayesianhierarchical models with an application in the theory of signaldetection. Psychon. Bull. Rev. 12, 573–604

9. Rouder, J.N. et al. (2005) A hierarchical model for estimatingresponse time distributions. Psychon. Bull. Rev. 12, 195–223

10. Shiffrin, R.M. et al. (2008) A survey of model evaluationapproaches with a tutorial on hierarchical Bayesian methods.Cogn. Sci. 32, 1248–1284

11. Lee, M.D. and Wagenmakers, E.-J. (2013) Bayesian Modelingfor Cognitive Science: A Practical Course, Cambridge UniversityPress

12. Plummer, M. (2003) JAGS: a program for analysis of Bayesiangraphical models using Gibbs sampling. In Proceedings of the3rd International Workshop on Distributed Statistical Computing(Hornik, K. et al. eds)

13. Carpenter, B. et al. (2016) Stan: a probabilistic programminglanguage. J. Stat. Softw. 76, 1–37

14. Shiffrin, R.M. and Steyvers, M. (1997) A model for recognitionmemory: REM – retrieving effectively from memory. Psychon.Bull. Rev. 4, 145–166

15. Dennis, S. and Humphreys, M.S. (2001) A context noise modelof episodic word recognition. Psychol. Rev. 108, 452–478

16. O’Reilly, R.C. (2006) Biologically based computational models ofcortical cognition. Science 314, 91–94

17. O’Reilly, R.C. and Frank, M. (2006) Making working memorywork: a computational model of learning in the prefrontal cortexand basal ganglia. Neural Comput. 18, 283–328

18. van Ravenzwaaij, D. et al. (2012) Optimal decision making inneural inhibition models. Psychol. Rev. 119, 201–215

19. Myung, I.J. (2000) The importance of complexity in model selec-tion. J. Math. Psychol. 44, 190–204

20. Roberts, S. and Pashler, H. (2000) How persuasive is a good fit?Psychol. Rev. 107, 358–367

21. Van Zandt, T. (2000) How to fit a response time distribution.Psychon. Bull. Rev. 7, 424–465

838 Trends in Cognitive Sciences, September 2018, Vol. 22, No.

22. Turner, B.M. and Van Zandt, T. (2012) A tutorial on approximateBayesian computation. J. Math. Psychol. 56, 69–85

23. Palestro, J.J. et al. (2018) Likelihood-Free Methods for CognitiveScience, Springer International

24. Pitt, M.A. et al. (2002) Toward a method of selecting amongcomputational models of cognition. Psychol. Rev. 109, 472–491

25. Turner, B.M. and Sederberg, P.B. (2014) A generalized, likeli-hood-free method for parameter estimation. Psychon. Bull. Rev.21, 227–250

26. Holmes, W.R. et al. (2016) A new framework for modelingdecisions about changing information: the piecewise linear bal-listic accumulator model. Cogn. Psychol. 85, 1–29

27. Holmes, W.R. and Trueblood, J.S. (2018) Bayesian analysis ofthe piecewise diffusion decision model. Behav. Res. Methods50, 730–743

28. Turner, B.M. et al. (2018) On the neural and mechanistic basesof self-control. Cereb. Cortex 1–19. Published online January24, 2018. http://dx.doi.org/10.1093/cercor/bhx355

29. Tavaré, S. et al. (1997) Inferring coalescence times from DNAsequence data. Genetics 145, 505–518

30. Pritchard, J.K. et al. (1999) Population growth of human Ychromosomes: a study of Y chromosome microsatellites.Mol. Biol. Evol. 16, 1791–1798

31. Marjoram, P. et al. (2003) Markov chain Monte Carlo withoutlikelihoods. Proc. Natl. Acad. Sci. U. S. A. 100, 324–328

32. Beaumont, M.A. et al. (2002) Approximate Bayesian computa-tion in population genetics. Genetics 162, 2025–2035

33. Wilkinson, R.D. (2008) Approximate Bayesian computation(ABC) gives exact results under the assumption of model error.Biometrika 96, 983–990

34. Turner, B.M. and Sederberg, P.B. (2012) Approximate Bayesiancomputation with differential evolution. J. Math. Psychol. 56,375–385

35. Wood, S. (2010) Statistical inference for noise nonlinear eco-logical dynamic systems. Nature 466, 1102–1107

36. Holmes, W.R. (2015) A practical guide to the probability densityapproximation (PDA) with improved implementation and errorcharacterization. J. Math. Psychol. 68, 13–24

37. Bazin, E. et al. (2010) Likelihood-free inference of populationstructure and local adaptation in a Bayesian hierarchical model.Genetics 185, 587–602

38. Turner, B.M. and Van Zandt, T. (2014) Hierarchical approximateBayesian computation. Psychometrika 79, 185–209

39. Turner, B.M. et al. (2013) Bayesian analysis of memory models.Psychol. Rev. 120, 667–678

40. Turner, B.M. et al. (2018) Comparing models of multi-alternative,multi-attribute choice. Psychol. Rev. 125, 329–362

41. Myung, J.I. et al. (2007) Analytic expressions for the BCDMEMmodel of recognition memory. J. Math. Psychol. 51, 198–204

42. Montenegro, M. et al. (2014) REM integral expressions. J. Math.Psychol. 60, 23–28

9

Page 14: Approximating Bayesian Inference through Model Simulation · 2018. 8. 14. · Metropolis–Hastings inference probability: this is best defined in the context of an algorithm, such

43. Treisman, M. and Williams, T. (1984) A theory of criterion setting with an application to sequential dependencies. Psychol. Rev. 91, 68–111

44. Green, D.M. and Swets, J.A. (1966) Signal Detection Theory and Psychophysics, Wiley Press

45. Macmillan, N.A. (2002) Signal detection theory. In Stevens' Handbook of Experimental Psychology: Methodology in Experimental Psychology (3rd edn) (Pashler, H. and Wixted, J., eds), pp. 43–90, John Wiley and Sons

46. Macmillan, N.A. and Creelman, C.D. (2005) Detection Theory: A User's Guide, Lawrence Erlbaum Associates

47. Turner, B.M. et al. (2011) A dynamic, stimulus-driven model of signal detection. Psychol. Rev. 118, 583–613

48. Nosofsky, R.M. (1986) Attention, similarity, and the identification-categorization relationship. J. Exp. Psychol. Gen. 115, 39–57

49. Busemeyer, J. and Townsend, J. (1993) Decision field theory: a dynamic-cognitive approach to decision making in an uncertain environment. Psychol. Rev. 100, 432–459

50. Usher, M. and McClelland, J.L. (2001) On the time course of perceptual choice: the leaky competing accumulator model. Psychol. Rev. 108, 550–592

51. Mazurek, M.E. et al. (2003) A role for neural integrators in perceptual decision making. Cereb. Cortex 13, 1257–1269

52. Shadlen, M.N. and Newsome, W.T. (2001) Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. J. Neurophysiol. 86, 1916–1936

53. Wong, K.-F. and Wang, X.-J. (2006) A recurrent network mechanism of time integration in perceptual decisions. J. Neurosci. 26, 1314–1328

54. Cisek, P. et al. (2009) Decisions in changing conditions: the urgency-gating model. J. Neurosci. 29, 11560–11571

55. Purcell, B. et al. (2010) Neurally-constrained modeling of perceptual decision making. Psychol. Rev. 117, 1113–1143

56. Myung, I.J. and Pitt, M.A. (1997) Applying Occam's razor in modeling cognition: a Bayesian approach. Psychon. Bull. Rev. 4, 79–95

57. Turner, B.M. et al. (2016) Bayesian analysis of simulation-based models. J. Math. Psychol. 72, 191–199

58. Miletic, S. et al. (2017) Parameter recovery for the leaky competing accumulator model. J. Math. Psychol. 76, 25–50

59. Tsetsos, K. et al. (2012) Using time-varying evidence to test models of decision dynamics: bounded diffusion vs. the leaky competing accumulator model. Front. Neurosci. 6, 1–17

60. Diederich, A. (1995) Intersensory facilitation of reaction time: evaluation of counter and diffusion coactivation models. J. Math. Psychol. 39, 197–215

61. Smith, P.L. and Van Zandt, T. (2000) Time-dependent Poisson counter models of response latency in simple judgment. Br. J. Math. Stat. Psychol. 53

62. Brown, S.D. and Heathcote, A. (2008) The simplest complete model of choice response time: linear ballistic accumulation. Cogn. Psychol. 57, 153–178

63. Usher, M. and McClelland, J.L. (2004) Loss aversion and inhibition in dynamical models of multialternative choice. Psychol. Rev. 111, 757–769

64. Dai, J. and Busemeyer, J.R. (2014) A probabilistic, dynamic, and attribute-wise model of intertemporal choice. J. Exp. Psychol. Gen. 143, 1489–1514

65. Rodriguez, C.A. et al. (2014) Intertemporal choice as discounted value accumulation. PLoS One 9, e90138

66. Hotaling, J.M. et al. (2010) Theoretical developments in decision field theory: comment on Tsetsos, Usher, and Chater. Psychol. Rev. 117, 1294–1298

67. Bhatia, S. (2013) Associations and the accumulation of preference. Psychol. Rev. 120, 522–543

68. Trueblood, J.S. et al. (2013) Not just for consumers: context effects are fundamental to decision making. Psychol. Sci. 24, 901–908

69. Tsetsos, K. et al. (2010) Preference reversal in multiattribute choice. Psychol. Rev. 117, 1275–1293

70. Loewenstein, G. and Prelec, D. (1992) Anomalies in intertemporal choice: evidence and an interpretation. Q. J. Econ. 107, 573–597

71. Kirby, K.N. and Marakovic, N.N. (1996) Delay-discounting probabilistic rewards: rates decrease as amounts increase. Psychon. Bull. Rev. 3, 100–104

72. Molloy, M.F. et al. What's in a response time? On the importance of response time measures in constraining models of context effects. Decision. Published online July 16, 2018. https://doi.org/10.1037/dec0000097

73. Trueblood, J.S. et al. (2014) The multiattribute linear ballistic accumulator model of context effects in multialternative choice. Psychol. Rev. 121, 179–205

74. Forstmann, B.U. et al. (2011) Reciprocal relations between cognitive neuroscience and formal cognitive models: opposites attract? Trends Cogn. Sci. 15, 272–279

75. Forstmann, B.U. and Wagenmakers, E.-J. (2015) An Introduction to Model-Based Cognitive Neuroscience, Springer

76. Turner, B.M. et al. (2017) Approaches to analysis in model-based cognitive neuroscience. J. Math. Psychol. 76, 65–79

77. Anderson, J.R. (2007) How Can the Human Mind Occur in the Physical Universe? Oxford University Press

78. Anderson, J.R. (2012) Tracking problem solving by multivariate pattern analysis and hidden Markov model algorithms. Neuropsychologia 50, 487–498

79. Anderson, J.R. et al. (2012) Using brain imaging to track problem solving in a complex state space. Neuroimage 60, 633–643

80. Turner, B.M. et al. (2013) A Bayesian framework for simultaneously modeling neural and behavioral data. Neuroimage 72, 193–206

81. Turner, B.M. et al. (2016) Why more is better: a method for simultaneously modeling EEG, fMRI, and behavior. Neuroimage 128, 96–115

82. Turner, B.M. (2017) Factor analysis linking functions for simultaneously modeling neural and behavioral data. Neuroimage 153, 28–48

83. Turner, B.M. et al. (2018) Joint Models of Neural and Behavioral Data, Springer

84. Palestro, J.J. et al. (2018) A tutorial on joint models of neural and behavioral measures of cognition. J. Math. Psychol. 84, 20–48

85. Forstmann, B.U. et al. (2008) Striatum and pre-SMA facilitate decision-making under time pressure. Proc. Natl. Acad. Sci. 105, 17538–17542

86. Forstmann, B.U. et al. (2010) Cortico-striatal connections predict control over speed and accuracy in perceptual decision making. Proc. Natl. Acad. Sci. 107, 15916–15920

87. Forstmann, B.U. et al. (2011) The speed–accuracy tradeoff in the elderly brain: a structural model-based approach. J. Neurosci. 31, 17242–17249

88. Ratcliff, R. and Frank, M.J. (2012) Reinforcement-based decision making in corticostriatal circuits: mutual constraints by neurocomputational and diffusion models. Neural Comput. 24, 1186–1229

89. Botvinick, M.M. et al. (2001) Conflict monitoring and cognitive control. Psychol. Rev. 108, 624–652

90. Botvinick, M.M. et al. (2004) Conflict monitoring and anterior cingulate cortex: an update. Trends Cogn. Sci. 8, 539–546

91. van Maanen, L. et al. (2011) Neural correlates of trial-to-trial fluctuations in response caution. J. Neurosci. 31, 17488–17495

92. Mack, M.L. et al. (2013) Decoding the brain's algorithm for categorization from its neural implementation. Curr. Biol. 23, 2023–2027

93. Eliasmith, C. and Anderson, C.H. (2004) Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems, MIT Press

Trends in Cognitive Sciences, September 2018, Vol. 22, No. 9 839


94. Eliasmith, C. et al. (2016) BioSpaun: a large-scale behaving brain model with complex neurons. arXiv https://arxiv.org/abs/1602.05220

95. Sharma, S. et al. (2016) Large-scale cognitive model design using the NENGO neural simulator. Biol. Inspired Cogn. Archit. 17, 86–100

96. Wang, X.-J. and Krystal, J.H. (2014) Computational psychiatry. Neuron 84, 638–654

97. Heathcote, A. et al. (2015) An introduction to good practices in cognitive modeling. In An Introduction to Model-Based Cognitive Neuroscience (Forstmann, B.U. and Wagenmakers, E.-J., eds), pp. 25–48, Springer

98. Jones, M. and Dzhafarov, E.N. (2014) Unfalsifiability and mutual translatability of major modeling schemes for choice reaction time. Psychol. Rev. 121, 1–32

99. Fu, Y.X. and Li, W.H. (1997) Estimating the age of the common ancestor of a sample of DNA sequences. Mol. Biol. Evol. 14, 195–199

100. Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis, Chapman & Hall

101. Sousa, V.C. et al. (2009) Approximate Bayesian computation without summary statistics: the case of admixture. Genetics 181, 1507–1519

102. Fearnhead, P. and Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. B 74, 419–474


103. Beaumont, M.A. (2010) Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406

104. Robert, C.P. et al. (2011) Lack of confidence in approximate Bayesian computation model choice. Proc. Natl. Acad. Sci. U. S. A. 108, 15112–15117

105. Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Csáki, F. and Petrov, B.N., eds), pp. 267–281, Akademiai Kiado

106. Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat. 6, 461–464

107. Celeux, G. et al. (2006) Deviance information criteria for missing data models. Bayesian Anal. 1, 651–673

108. Tomohiro, A. (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika 94, 443–458

109. Rice, J.A. (2007) Mathematical Statistics and Data Analysis, Duxbury Press

110. Heathcote, A. et al. (2018) Dynamic models of choice. Behav. Res. Methods. Published online 29 June 2018. http://dx.doi.org/10.3758/s13428-018-1067-y

111. Trueblood, J.S. et al. (2018) The impact of speed and bias on the cognitive processes of experts and novices in medical image decision-making. Cogn. Res. Princ. Implic. 3, 28
