

Models of spoken-word recognition

Andrea Weber and Odette Scharenborg

All words of the languages we know are stored in the mental lexicon. Psycholinguistic models describe in which format lexical knowledge is stored and how it is accessed when needed for language use. The present article summarizes key findings in spoken-word recognition by humans and describes how models of spoken-word recognition account for them. Although current models of spoken-word recognition differ considerably in the details of implementation, there is general consensus among them on at least three aspects: multiple word candidates are activated in parallel as a word is being heard, activation of word candidates varies with the degree of match between the speech signal and stored lexical representations, and activated candidate words compete for recognition.

No consensus has been reached on other aspects, such as the flow of information between different processing levels and the format of stored prelexical and lexical representations. © 2012 John Wiley & Sons, Ltd.

How to cite this article:
WIREs Cogn Sci 2012, 3:387–401. doi: 10.1002/wcs.1178

    INTRODUCTION

In order to understand the utterance The sun began to rise, listeners must recognize the individual words in that utterance. This decoding of the message must be achieved by mapping the auditory information in the speech input onto stored representations of words in the mental lexicon. Although the mapping task is usually perceived as effortless by listeners, the underlying decoding process is in fact very complex. In particular, three aspects of spoken language make the mapping difficult. First, words resemble each other. As languages build large vocabularies from a limited set of phonemes, words are necessarily alike (e.g., sun, sum, suck, and such only differ in their final consonant), and short words are often embedded within longer ones (e.g., rye and eye in rise). Second, speech is highly variable. The acoustic realization of sounds and words is different for each speaker; speaking style, speaking rate, and phonological context additionally cause variability in the signal (e.g., sun is usually pronounced as sum when followed by a bilabial stop consonant as in began). Third, speech is transitory and continuous. Not only is spoken language distributed in time and fades quickly from the perceptual field, the acoustic speech signal is also continuous, with no clear boundaries for individual words. As shown in the spectrogram in Figure 1, breaks in the speech signal do not necessarily correspond to word boundaries (e.g., there is no break between the and sun, but there is one in began). This also implies that embedded words can span word boundaries (e.g., ant is embedded in began to).

Correspondence to: [email protected]
Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

Psycholinguistic research investigates how listeners master the decoding of speech. In the following, we selectively describe some of the most important findings in spoken-word recognition. Probably the most central finding is that the comprehension process is incremental. That is, listeners do not wait until the end of a word or an utterance before they interpret the input. Rather, they simultaneously consider multiple word candidates that are consistent with the incoming speech. Parallel activation has been shown repeatedly in priming studies, in which a word onset that is consistent with two words (e.g., /kæp/ can start capital and captain) facilitates the recognition of semantically related words for both possible continuations (money and ship1). Parallel activation has also been found for embedded words (e.g., bone in trombone facilitates recognition of the semantically related rib2,3) and for words that span a word boundary.4


FIGURE 1 | Waveform and spectrogram of the utterance The sun began to rise (duration 2.262 s). The horizontal axis represents time; the vertical axis represents amplitude for the waveform and frequency (Hz) for the spectrogram, with greater energy being represented by darker shading. White spaces in the spectrogram correspond to breaks in the speech signal. The vertical dotted lines are aligned as closely as possible with word boundaries.

More recently, the eye-tracking paradigm has been used to demonstrate this core process of spoken-word recognition: when presented with a display of objects, listeners take longer to look at a target object mentioned in an utterance when the display includes objects with similar names (e.g., later looks to candle when the display also shows candy5). This suggests that listeners temporarily consider the two objects with similar names as possible targets.

Lexical activation is furthermore modulated by acoustic–phonetic detail. Specifically, the goodness of match between the speech signal and stored representations co-determines how strongly word candidates are activated. It has been shown, for instance, that soup is recognized more slowly when the formant transitions following the /s/ are manipulated to be typical of a different fricative than when they are typical of /s/.6 Connine et al.7,8 found in priming studies that words that mismatch with the signal on multiple articulatory features are less strongly activated than words that mismatch on one feature. Phonological context also influences the effect of mismatch: [gri:m] can be recognized as green in the context of green bench, where the bilabial stop onset in bench licenses the place assimilation of the preceding nasal /n/.9–11 Many more studies using different paradigms have confirmed that the degree of lexical activation varies in response to fine-grained acoustic–phonetic information.12–14

Word candidates are not only activated in parallel, they also compete for recognition. That is, a candidate's activation is not independent of the activation of other candidates, and the more candidates are active, the more they inhibit each other. Competition has been shown with a variety of behavioral paradigms, and it is a generally assumed component of spoken-word recognition. Listeners in a word-spotting task, for instance, find it more difficult to spot short embedded words in the onsets of longer words than in the onsets of nonwords: sack is harder to spot in /sækrɪf/, the beginning of sacrifice, than in [sækrɪk], which has no possible word continuation.15 This presumably reflects the competition between sack and sacrifice. Furthermore, it has repeatedly been found that lexical decision times are influenced by the number of similar-sounding words in the lexicon,16,17 as well as by preceding words that are phonologically related.18 Although competition alone can correctly parse an utterance into a sequence of individual words, listeners use a variety of cues to likely word boundaries to further help the segmentation process. These cues include phonotactic constraints and probabilities,19–22 metrical cues,23–25 and fine-grained acoustic information.26–28

MODELS OF SPOKEN-WORD RECOGNITION

Early models of word recognition were developed on the basis of data obtained in reading tasks,29,30 but were often assumed to account for spoken-language processing as well. Morton30 introduced in his logogen model the powerful metaphor of activation, which conveys that multiple words in our mental lexicon are responsive to the speech signal.


The metaphor of activation still features in many subsequent models, as it captures the notion of parallel availability that behavioral studies have shown to be at the heart of spoken-word recognition. Only later was it realized that the temporal nature of the speech signal has far-reaching consequences for the comprehension process, and that models of spoken-word recognition must account for the transitory nature of speech, acoustic–phonetic and phonologically conditioned variation, as well as the continuity of the speech signal. Since the 1980s, a number of models have been developed specifically for spoken-word recognition.

From a theoretical point of view, current models of spoken-word recognition differ particularly in two aspects. First, they vary in their assumptions about the abstractness of the representations that make contact with the lexicon, as well as the nature of the lexical representations themselves. Second, the models differ with respect to information flow between levels of the processing system. Different levels are responsible for different processing stages and are ordered from relatively low-level acoustic–phonetic processing to higher stages involving the lexicon. Interactive models not only allow information to flow from lower to higher levels but also allow top-down information flow, whereas autonomous models assume that the flow of information is unidirectional, from the bottom up.

Models not only vary in their theoretical assumptions but also in type, and different terms have been suggested in the literature for the varying types of models.31 In the present review, we use the term verbal for models that explain the stages and mechanisms of spoken-word recognition descriptively; the term mathematical is used for models that capture the processes of spoken-word recognition in a mathematical form; and simulation models are models that aim to account for the cognitive processes in speech comprehension. All mathematical and simulation models are computationally implemented as computer programs. Most current models of spoken-word recognition are computational models. An advantage of computational models is that they can be used to simulate the conditions of behavioral research and to compare a model's predictions with behavioral results obtained from human listeners. A criticism of computational models is that, in order to build a functioning model, theoretical and implementational assumptions need to be made that are possibly unspecified in the behavioral research.32,33 An example of this is the different assumptions models make about the form of prelexical representations (e.g., multidimensional features in TRACE and phonemes in Shortlist; see also below). Thus, behavioral findings are not necessarily a direct validation of how aspects of spoken-word recognition are incorporated in a computational model (see the demand for a linking hypothesis in Ref 34).

    The Cohort Model

The Cohort model35,36 was the first psycholinguistic model of word recognition specifically developed for spoken language. Central to this verbal model is the temporal aspect of spoken language, that is, the availability of acoustic–phonetic information over time. The Cohort model provided many predictions about the time-course of recognition, and it motivated substantial research that paved the way for the further development of models.

In the Cohort model, spoken-word recognition takes place in three stages: access, selection, and integration. During access, acoustic–phonetic elements in the speech signal are mapped onto words in the lexicon. Words that match the input are activated simultaneously and make up the cohort. This simultaneous consideration of multiple candidate words is central to all subsequently developed models. In the Cohort model, however, only words that are aligned with the onset of the input are activated. For example, the Cohort model assumes that after the initial 150–200 ms (roughly consistent with the first two phonemes of a word), all words beginning with those phonemes will be activated. During selection, candidate words that mismatch the incoming speech signal by more than a single feature are removed from the cohort. For example, on hearing /fɛ/, all words beginning with /fɛ/ are activated; when the subsequent sound is /b/, words that do not begin with /fɛb/ drop out of the cohort. This process repeats until (ideally) the cohort is reduced to one member. The focus on onset overlap implies that words can be recognized before their offset. February, for instance, can be recognized by its third segment, because no other English word begins with /fɛb/. During integration, the syntactic and semantic properties of activated words are retrieved and checked for integrability with higher levels. A mismatch with contextual constraints, for instance, can result in removal from the cohort. Sentential context can thus affect the selection stage in the original Cohort model. The candidate words in the cohort do not actively compete with one another; it is just the presence of other candidate words that shapes the recognition process. Segmentation of the speech stream follows implicitly from the recognition of individual words: a word's offset signals the start of a new word.
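
The selection stage can be illustrated with a minimal sketch; the toy lexicon and orthographic input are illustrative assumptions, as the actual model operates on acoustic–phonetic input and adds an integration stage:

```python
# Minimal sketch of cohort-style selection over a toy lexicon of
# segment strings; illustrative only, not the Cohort model itself.
LEXICON = {"february", "festival", "fence", "fender", "sun", "sunday"}

def cohort_after(prefix: str) -> set[str]:
    """Words still in the cohort after hearing the segments in `prefix`."""
    return {word for word in LEXICON if word.startswith(prefix)}

input_word = "february"
for i in range(1, len(input_word) + 1):
    cohort = cohort_after(input_word[:i])
    print(input_word[:i], sorted(cohort))
    if len(cohort) == 1:   # uniqueness point: recognition can occur
        break              # before the word's acoustic offset
```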

A number of behavioral findings challenged the Cohort model. It had been found, for example, that listeners can recognize words that mismatch acoustically or contextually,37 but the removal of mismatching words from the cohort entails that the model cannot recover from mismatches.


Also, listeners recognize frequent words more easily than infrequent ones,38 but the Cohort model cannot capture word-frequency effects.

The successor version, Cohort II,39–41 adjusted its architecture to account for these findings. In contrast to the original version, Cohort II is a fully bottom-up model. Words that (minimally) mismatch the input can now enter the cohort and can therefore be recognized; one part of the solution for handling mismatches is the introduction of word activation, with selection and activation being dependent on the goodness of fit with the input. In addition, the input to the model is now more fine-grained. To account for word-frequency effects, candidate words were assigned resting activation values in Cohort II, with higher values for frequent than for infrequent words, which makes frequent words reach the threshold for recognition faster.

The main challenge for the Cohort model, however, proved to be analyses of on-line dictionaries, which showed that relatively few words can be uniquely identified before word offset,42 and that listeners do not recognize the majority of words correctly until after word offset.43 It was therefore a logical consequence that ensuing models should no longer consider only word candidates that match the speech input in onset, but also allow later parts of a word to be relevant. Allowing the activation of candidates that match with later parts is also a prerequisite for being able to handle the segmentation of continuous speech.

TRACE

TRACE44,a was the first computationally implemented model of spoken-word recognition. It is a localist (i.e., one node represents one representational unit) connectionist interactive-activation framework45 with three layers of nodes: a feature, a phoneme, and a word layer (see Figure 2). The input to TRACE consists of multidimensional features, and words are represented as phonemic strings. TRACE was the first model that instantiated the activation of multiple word candidates that match any part of the speech input. That is, nodes are activated in proportion to their degree of fit to the input, with activation spreading through the layers (e.g., activated feature nodes spread activation to matching phoneme nodes and on to word nodes), so that on hearing the word sun, overlapping words like under and run are also considered in parallel. Moreover, this mechanism ensures that TRACE is able to handle ambiguous or distorted speech. Activated nodes on the phoneme and word layers receive active inhibition from other nodes compatible with the same portion of input. The word with the highest activation will inhibit candidate words with lower activation during competition, and finally the candidate word best matching the input will be recognized. Inhibitory connections on the word level help to resolve the activation of multiple word candidates (i.e., the fewer the word candidates that actively compete with each other, the easier recognition is). There is no inhibition between layers in TRACE, and word activation does not decrease in the presence of mismatching input. The temporal aspect of speech is handled by TRACE by duplicating all phoneme and word nodes across time (e.g., the phoneme node /s/ is duplicated for all time slices of the word sun, but it is activated most strongly when the feature nodes representative of /s/ are aligned in time). Feedback connections from the word layer to the phoneme layer make TRACE an interactive model. Through these connections, lexical knowledge can affect perception.
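
The interactive-activation dynamics at the word layer can be sketched as follows; this is a simplified fragment with invented bottom-up inputs and parameter values, not McClelland and Elman's parameter set, and it omits the feature and phoneme layers, the time-slice duplication, and the feedback connections:

```python
import numpy as np

# Simplified lateral-inhibition update for a handful of word nodes.
words = ["sun", "sunday", "under", "run"]
bottom_up = np.array([0.20, 0.15, 0.08, 0.05])  # invented match with the input
act = np.zeros(len(words))
decay, inhibition = 0.1, 0.3

for _ in range(40):
    # each node is excited by its bottom-up match and suppressed
    # by the summed activation of the competing word nodes
    others = act.sum() - act
    act = np.clip(act + bottom_up - inhibition * others - decay * act, 0, 1)

for w, a in zip(words, act):
    print(f"{w:7s} {a:.3f}")  # "sun" ends up with the highest activation
```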

Word-frequency effects were not accounted for in the original TRACE model. However, they were later implemented by Dahan, Magnuson, and Tanenhaus,46 who proposed three possible ways of incorporating frequency in TRACE: by adjusting resting-activation levels, by adjusting connection strengths, or as a post-activation decision bias.

TRACE successfully simulated a wide range of behavioral findings, including the Ganong effect47 and the finding that lexical information is not used for phoneme monitoring.48 For simulations, TRACE relies on a large number of parameters that have to be set correctly. A strength of TRACE is that the parameter settings as determined by McClelland and Elman have been used for all simulations in the original paper and were only changed slightly for later simulations. Thus, TRACE's parameters do not have to be tweaked to fit individual data.

Continuous mapping of speech input to lexical representations as in TRACE predicts activation of word candidates that overlap with the speech input in onset earlier than those that overlap in rhyme. Such a difference in the time course of activation was indeed found in a seminal eye-tracking study by Allopenna et al.49 In this study, listeners looked earlier and more often at onset competitors than at rhyme competitors, with the pattern of eye fixations closely matching the pattern predicted by TRACE. The results convincingly showed that continuous mapping models can generate quantitative predictions about the word recognition process over time.

The two most controversial components of TRACE are the implausible duplication of the network44,50 and the existence of the lexical feedback loop.51 In order to recognize words over time, the entire lexical network in TRACE needs to be duplicated many times.


FIGURE 2 | Recognition process of the word sun by TRACE. The model has three layers: a feature layer (whose input is phoneme strings converted to multidimensional feature vectors that approximate acoustic spectra), a phoneme layer, and a word layer, with inhibitory connections between nodes that receive the same bottom-up information. For every time slice the entire network is copied; for better visualisation this duplication is not depicted in the figure. Activation in the lower layers flows upward to all nodes in the higher layers that incorporate the lower-layer node; activation from the word layer also flows back to the phoneme layer.

Consequently, TRACE can only handle unrealistically small lexicons. Simulations typically involve lexicons of just a few hundred words, and use only a limited subset of English phonemes. Lexical feedback, on the other hand, has been argued to be unnecessary since it cannot speed processing or improve accuracy,52 and it can furthermore prevent recovery from mispronunciations.53 Proponents of interactive models have pointed out that lexical feedback is in line with research showing that lexical knowledge allows listeners to quickly adapt to speakers with unfamiliar pronunciation,54 but proponents of feed-forward models have countered that feedback for perceptual learning is different from online feedback as implemented in TRACE.55,56

    Shortlist

Shortlist50 was developed in response to the criticism of the duplication and the lexical feedback in TRACE, and combines aspects of feed-forward models, such as the phoneme decision model Race57 and Cohort II, with the competition mechanism of TRACE. The duplication of the entire network for each input feature in TRACE is avoided by implementing Shortlist as a two-stage model in which the generation of lexical candidates and the competition process are separated (see Figure 3). The first stage consists of an exhaustive serial lexical search (although it is assumed that the search in humans occurs in parallel), which results in a shortlist of maximally 30 candidate words that match the input processed so far. Subsequently, these word candidates are wired into a small interactive-activation network (the second, competition stage) in which the words that receive support from the same section of the input are connected via inhibitory links and compete with one another.

FIGURE 3 | Recognition process of the word sun by Shortlist. The input is a phoneme string; shortlists of word candidates are created for each phoneme in the input and aligned with it. For every time slice, a new shortlist is created, which is subsequently wired into the competition stage; for better visualisation this repetition is not depicted. Candidate words that overlap with each other at any position compete with one another; in this example, all candidate words would inhibit one another, but not all inhibitory connections are shown.

Activation of candidate words is determined by their degree of fit with the input, where word activations decrease with mismatching information. The word with the highest activation will inhibit candidate words with lower activation during competition, and finally the candidate word best matching the input will be recognized. The interactive-activation network is equivalent to the word layer of TRACE. The entire process is repeated with each new phoneme that becomes available, so that there is a separate shortlist and word layer for each input phoneme. Shortlist is a feed-forward-only model.

The two-stage set-up makes it possible for Shortlist to use a more realistically sized lexicon of over 26,000 words. As in TRACE, words in the lexicon are represented as phonemic strings, and word candidates can be activated at any moment in time; word beginnings and endings are not explicitly marked. Word-frequency effects are not accounted for in Shortlist.
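
A minimal sketch of the two-stage idea follows; the toy lexicon (with invented transcriptions) and the crude match score are assumptions for illustration, standing in for the model's exhaustive lexical search:

```python
# Sketch of Shortlist's two-stage architecture: stage 1 scores the whole
# lexicon against the input processed so far and keeps at most 30
# candidates; stage 2 (not shown) wires these into a small
# interactive-activation network with inhibitory links.
LEXICON = {"sun": "sVn", "sunday": "sVnde", "summary": "sVm@ri",
           "uncle": "VNk@l", "nice": "naIs", "not": "nQt"}

def match_score(word_form: str, heard: str) -> int:
    """Crude stand-in for the lexical search: count aligned segments."""
    return sum(a == b for a, b in zip(word_form, heard))

def shortlist(heard: str, max_candidates: int = 30) -> list[str]:
    ranked = sorted(LEXICON, key=lambda w: match_score(LEXICON[w], heard),
                    reverse=True)
    return ranked[:max_candidates]

print(shortlist("sVn"))  # the shortlist is recomputed after every phoneme
```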

Shortlist has two unique features: lexical stress can constrain word activation (as has been found for speakers of stress-timed languages such as English and Dutch, who use the rhythmic distinction between strong and weak syllables for segmentation15,23,25,58–60), and, through the implementation of the possible-word constraint, activation of candidate words is decreased when they leave adjacent input that cannot constitute a viable word (e.g., since a single consonant cannot be a word in English, activation of apple in fapple is reduced61). Much like TRACE, Shortlist can make detailed predictions of word activation over time. Shortlist successfully simulated various behavioral findings, such as the right-context problem43,58,62 as well as results from cross-modal priming studies regarding the time course of multiple word activation, competition, and selection.15,59

Shortlist B is a newer version of the original Shortlist model (since then called Shortlist A, 'A' for activation53), which argues that human listeners are 'optimal Bayesian recognizers' (p. 357). The theoretical assumptions underlying Shortlist B are identical to those of Shortlist A, but the implementation of the model is fundamentally different. First, Shortlist B is based on Bayesian principles; word candidates no longer have word activations, but word probabilities, computed using techniques from the field of automatic speech recognition.33,63 Second, the input no longer consists of handcrafted phoneme strings; instead it is a sequence of phoneme probabilities over three time slices per segment, derived from a large-scale gating study.64 Shortlist B incorporates word frequencies as prior probabilities, and is able to handle mismatches in the input through the computation of likelihoods. Shortlist B successfully simulates various behavioral findings, including data on the segmentation of continuous speech43 and word-frequency effects.46,65 Shortlist B can be used to make detailed predictions on the optimality of the word recognition process53 (p. 391).
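
The Bayesian core of such a model can be written down directly: the posterior probability of a word is its likelihood given the input, weighted by its frequency-based prior. A minimal sketch with invented numbers (the real model derives its phoneme likelihoods from gating data):

```python
# Sketch of the Bayesian computation in a Shortlist B-style model.
likelihood = {"sun": 0.60, "sunday": 0.25, "summary": 0.10}  # P(input | word), invented
frequency = {"sun": 5000, "sunday": 3000, "summary": 400}    # invented corpus counts

total = sum(frequency.values())
prior = {w: f / total for w, f in frequency.items()}         # P(word)

evidence = sum(likelihood[w] * prior[w] for w in likelihood)
posterior = {w: likelihood[w] * prior[w] / evidence for w in likelihood}

for w, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{w:8s} P(word | input) = {p:.3f}")
```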

    Fine-Tracker

Fine-Tracker33 was specifically developed to account for the accumulating evidence that fine phonetic detail, as provided in durational and prosodic information, is important in word recognition.12,66,67


FIGURE 4 | Recognition process of the word sun by Fine-Tracker. The acoustic signal is transformed into a sequence of articulatory feature vectors over time by a set of artificial neural networks. At the word layer, words are represented as feature vectors; for better visualisation they are depicted as phonemes in the figure. Fine-Tracker's lexicon is implemented as a lexical tree with word-initial cohorts, with B as the beginning of the tree; not all possible paths in the tree are shown, and each node can be followed by multiple other nodes (indicated with dotted arrows). The input feature vectors and the lexical feature vectors are mapped onto one another using a probabilistic word search.

The role of subtle phonetic information is problematic for computational models that assume a discrete, abstract level between input and lexicon, because the abstract representations are too coarse to capture phonetic details. Unlike humans, these models cannot use durational information to avoid activation of (the slightly longer) ham in hamster.67

Fine-Tracker is based on the theory underlying Shortlist, and, like its predecessor SpeM,63 takes the actual acoustic signal as input. It consists of two modules (see Figure 4). The first module is an artificial neural network (ANN) consisting of an input, hidden, and output layer, which converts the acoustic signal into articulatory feature vectors, created over small time steps. The value for each of the articulatory features can be regarded as the likelihood of that articulatory feature. The feature vectors are then the input to the word recognition module. In the Fine-Tracker lexicon, words are represented in terms of articulatory feature vectors. Because these vectors can take any value between 0 and 1 (which are the canonical values for lexical vectors), contextual phenomena like assimilation and nasalization of vowels can be encoded through feature spreading. Fine-Tracker's word recognition module uses a probabilistic word search (dynamic time warping, a standard technique in automatic speech recognition) to match the prelexical feature vectors onto the candidate words in the lexicon in order to find the most likely sequence of words; multiple prelexical vectors (one for every 5 ms of speech) are sequentially mapped onto a single lexical feature vector.


For each of the prelexical vectors, the degree of fit with the lexical vector is calculated and affects the likelihood of a word. The number of feature vectors can be set for each phoneme or word separately in the lexical vector, thus ensuring that the model can deal with durational information. In this way, lexical representations can vary in duration (e.g., the duration of ham in ham and hamster). Fine-Tracker incorporates word and word co-occurrence frequencies. Similar to Shortlist B, multiple activation, competition, and selection are thus implemented as a probabilistic word search. Words can start and end at any time, and there is no explicit segmentation process. Unlike in TRACE or Shortlist, candidate words do not actively suppress or inhibit each other. The output of the word recognition module consists of an ordered list of the best-matching hypothesized parses.
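
Dynamic time warping itself is a standard algorithm; the sketch below aligns a sequence of input frames with a shorter lexical template using toy vectors (Fine-Tracker's actual articulatory features and scoring are not reproduced here):

```python
import numpy as np

def dtw_cost(frames: np.ndarray, template: np.ndarray) -> float:
    """Cumulative cost of the best monotonic alignment that maps each
    input frame onto one template vector (template positions may repeat,
    which is how differing durations are absorbed)."""
    n, m = len(frames), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(frames[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stay on this lexical vector
                                 cost[i - 1, j - 1])  # advance to the next one
    return cost[n, m]

rng = np.random.default_rng(0)
frames = rng.random((8, 4))   # eight 5-ms input feature vectors
template = frames[[0, 3, 6]]  # a three-vector lexical entry resembling them
print(dtw_cost(frames, template))
```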

A strength of Fine-Tracker is that it can be tested with real speech rather than the abstract form of input representation used by other models of word recognition. Moreover, the activation flow of candidate words over time in Fine-Tracker has been successfully linked to word activation in eye-tracking studies33 that examined the use of durational cues in word recognition.67,68 A shortcoming of real-speech models is that, due to limitations of the speech conversion module, i.e., the imperfect conversion of the speech signal to prelexical representations, such models are currently only able to use a small subset of a language's vocabulary.63,69 Obviously, if the speech conversion module fails, everything downstream will as well. Better speech conversion modules are therefore of paramount importance in the development of better real-speech models.

    NAM/PARSYN

The neighborhood activation model (NAM65) is a mathematical model of spoken-word recognition. It was developed to examine the effects of the number of similar words and their word frequencies on spoken-word recognition. In NAM, the input is assumed to activate a set of words (stored as acoustic–phonetic patterns) that differ maximally by one phoneme from the input. The difference can be a deletion, addition, or substitution. Activation is determined by degree of fit with the input; that is, NAM computes a frequency-weighted neighborhood probability for each word. The acoustic–phonetic patterns then activate word decision units. Activation of word decision units is determined by the activation of the acoustic–phonetic patterns, by higher-level lexical information (i.e., word frequency, which is calculated by weighting each neighbor in the metric by its log frequency), and by the overall level of activity in the entire system of word decision units. Decision values are computed on the basis of a frequency-biased, activation-based version of R. D. Luce's choice rule.70 The choice rule in NAM approximates the competition process. A word is recognized if its decision value is above a certain threshold. NAM makes several predictions about the effects of the number of similar words and their word frequency on spoken-word recognition, for which there is now considerable evidence from behavioral studies.42,65,71,72 As such, NAM has had a large impact on theories of spoken-word recognition and research on spoken-word recognition in general, as studies on spoken-word recognition now often control for neighborhood density.73
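
In rough form, the frequency-weighted choice rule can be sketched as follows; the activation values, frequencies, and the exact weighting function are illustrative assumptions, and NAM's actual similarity metric operates on acoustic–phonetic patterns:

```python
import math

# Sketch of a frequency-weighted Luce-style decision rule for a target
# word and its one-phoneme neighbors, with invented values.
target_act, target_freq = 0.80, 5000                   # e.g., "sun"
neighbors = {"sum": (0.60, 1200), "son": (0.55, 2500),
             "fun": (0.40, 1800), "sung": (0.35, 300)}

def weight(act: float, freq: int) -> float:
    return act * math.log(freq)   # log-frequency weighting, as in the text

target_w = weight(target_act, target_freq)
neighbor_w = sum(weight(a, f) for a, f in neighbors.values())

decision_value = target_w / (target_w + neighbor_w)
print(f"decision value: {decision_value:.3f}")
# A denser, higher-frequency neighborhood raises neighbor_w, lowers the
# decision value, and so predicts slower, less accurate recognition.
```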

PARSYN18 is the connectionist instantiation of NAM. It consists of three levels: an input level of position-specific allophones, a level of allophones, and a word level. As in the previously discussed connectionist models, activation spreads bottom-up through the levels. Competition is implemented as inhibitory connections between the words on the word level. Word boundaries are explicitly marked in the input. Unlike TRACE, Shortlist, and Fine-Tracker, NAM and PARSYN are only able to recognize words in isolation, not in continuous speech. PARSYN successfully replicated the findings NAM was able to simulate and extended on them, e.g., with the simulation of findings from priming studies which showed that phonetic priming does not depend on target degradation, but that it affects processing times.18

    Minerva2

Minerva2 is an episodic (or exemplar) model of memory.74 Whereas all previously described models assume abstract prelexical and lexical representations, an episodic theory of spoken-word recognition considers acoustic variability, due to speaking rate or voice characteristics, for instance, an integral part of the theory and keeps this information in memory. Goldinger75 used Minerva2 to investigate an episodic view of spoken-word recognition, motivated by the fact that the speech signal is highly variable (i.e., the lack of invariance), and that listeners' good memory for surface forms of words is well attested.76 Minerva2 simulates episodic memory by storing numerous, independent memory traces for every word.


When a new word is presented at the model's input (the probe, in the form of a vector of numeric elements), it is compared to all traces in memory. Activation of the traces is then dependent on the degree of fit with the probe. Subsequently, an echo is retrieved, which constitutes essentially a weighted composite of all activated traces, and which may contain information not present in the probe, such as its word class. The intensity of an echo corresponds to word activation in abstract models. In Minerva2, words are represented by vectors of numeric elements. Note that although Minerva2 is a purely episodic model, it can mimic abstract behavior due to the blending of probes and stored traces, forming experience. Repeated presentation of multiple tokens of a word will thus result in an echo that mainly captures the common aspects of traces (thereby eliminating the idiosyncratic characteristics stored in individual traces).
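
Hintzman's retrieval equations are simple enough to state directly: a trace's activation is the cube of its similarity to the probe, and the echo is the activation-weighted sum of all traces. A minimal sketch with random toy traces (the word dimensions and degradation pattern are invented for illustration):

```python
import numpy as np

# Minerva2-style retrieval: traces and probe are vectors of -1, 0, +1.
rng = np.random.default_rng(1)
traces = rng.choice([-1, 0, 1], size=(100, 20))   # episodic memory
probe = traces[0].copy()                          # probe a stored word...
probe[rng.choice(20, size=4, replace=False)] = 0  # ...with degraded features

def similarity(p: np.ndarray, t: np.ndarray) -> float:
    relevant = (p != 0) | (t != 0)                # features defined in either
    return (p * t).sum() / max(relevant.sum(), 1)

# Cubing preserves the sign of the similarity but sharpens the
# contribution of the best-matching traces.
activation = np.array([similarity(probe, t) for t in traces]) ** 3

echo_intensity = activation.sum()    # corresponds to word activation
echo_content = activation @ traces   # may restore features the probe lacked
print(round(echo_intensity, 2), np.sign(echo_content[:5]))
```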

The issue of feedback from the lexical to the prelexical level does not arise, because episodic models like Minerva2 do not have an intermediate level between input and lexicon. Abstract intermediate representations have been argued to render word recognition more efficient by avoiding redundancy at the lexical level: when acoustic knowledge about a sound is stored prelexically, it need not be stored separately for every word containing that sound at the lexical level.56,77,78 However, recoding the speech signal into abstract representations is very difficult due to the high variability and complexity of the speech signal.

Because of its nature, Minerva2 incorporates fine-grained speaker-specific information and uses it for word recognition. Minerva2 correctly predicts, for instance, the tendency of participants in a shadowing task to imitate the acoustic pattern of the word they have to repeat,75 and the sensitivity of listeners to words spoken in the same voice and in different voices.79

The model currently offers no solution for recognizing continuous speech; episodes are always single words, and it is not clear how multiple words in an utterance could be identified. Furthermore, no mechanism has been suggested for how the similarity mapping between the speech signal and stored memory traces could be achieved (without reducing the surface variability in some form).

    Distributed Cohort Model

The Distributed Cohort Model (DCM80) works from the key assumption of connectionist theory that information is represented in a distributed manner,81,82 and as such deviates from all previously discussed models in that it combines recognition of form and meaning. DCM is a connectionist model, but unlike TRACE, Shortlist, and PARSYN, information is represented in a distributed manner; that is, there is no one-to-one mapping of word and node in the model. Importantly, nodes in DCM stand for phonological and semantic features of words. The model has an input layer, which takes binary phonetic features as input, and a hidden layer, which is connected to two sets of output units, one for the phonological features of a word and one for semantic features.

Because DCM is a distributed model, explicit intermediate levels of representation are not needed; instead, DCM regards the speech recognition process as a direct mapping from phonetic features onto distributed abstract representations of both form and meaning simultaneously. As in the previous models, the mapping process is based on similarity. The goal of the model is not to explicitly recognize the phonological form of words, but rather to retrieve phonological and semantic information from the speech input. Immediate access to semantic information in continuous speech can help, for example, to reduce the activation of semantically implausible candidate words.83

Since all words are represented with the same set of nodes in DCM, there is no explicit activation of a candidate word and no direct competition between candidates. Instead, activation and competition are implicit in the blend formed by the patterns of the candidate words. Word activation is inversely related to the distance between the model's output and the target word's representation. Competition in DCM is mediated by the number of candidate words in the set; the higher the number of candidate words, the lower their activation. Semantic information starts out as a blend of the semantic vectors of all candidate words. As the number of candidate words is reduced with more input becoming available, the blend consists of the semantic vectors of fewer candidates and eventually results in the semantic vector of the single remaining candidate word.
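
The blend idea can be sketched with toy semantic vectors; the real model learns this mapping from phonetic features with a trained network, so the vectors and cohort membership below are assumptions for illustration:

```python
import numpy as np

# Semantic output in a DCM-style model as a blend of the semantic
# vectors of all currently matching candidate words.
semantics = {"captain": np.array([1.0, 0.0, 0.0]),
             "captive": np.array([0.0, 1.0, 0.0]),
             "capital": np.array([0.0, 0.0, 1.0])}

def blend(cohort: list[str]) -> np.ndarray:
    return np.mean([semantics[w] for w in cohort], axis=0)

print(blend(["captain", "captive", "capital"]))  # early input: diffuse blend
print(blend(["captain"]))                        # late input: one clear meaning
```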

Word beginnings and endings are not explicitly marked in the input. The binary input features are chosen such that fine-grained information regarding the representation of vowel transitions can be captured, which makes DCM able to simulate the effect of mismatching vowel transitions.14 Word frequency can be taken into account through repeated presentation of the word during the training phase of the model.

One prediction of DCM is that word beginnings with few completion possibilities (e.g., /gɑːm/ can only be completed as garment) should exhibit stronger semantic activation than those with many possibilities (e.g., /kæpt/ can start captive and captain), since for the latter the semantic information is still a blend of words.


This is exactly what Gaskell and Marslen-Wilson found in a priming study.11 It has been argued, however, that breaking the comprehension process into separate stages is cognitively more economical than a combined mapping of form and meaning as put forward in the DCM.52,84 Additionally, evidence from priming studies supports the assumption that phonological and conceptual representations are possibly separate and, to a certain extent, independent components of word recognition.85

    SUMMARY AND CONCLUSION

In the previous section, we described the basic architecture of a number of influential models of spoken-word recognition, and we tried to point out for each model where its strengths and weaknesses lie. Table 1 summarizes the main aspects of the models, providing a quick overview of the commonalities and differences between them. The list of models is, however, not complete. The focus of our overview is on models of lexical processing; we therefore omitted models with an emphasis on speech sound perception, such as the LAFS model,87 the Laff model,88 ARTWORD,89 and FLMP.90 Although lexical aspects can play a part in these models, the accounts usually give no explicit description of word recognition.

Furthermore, there are two relevant issues that we have not explicitly discussed in the model overview: semantic and morphological processing. With the exception of DCM,80 the models in Table 1 are concerned with the recognition of word form and not of meaning. On the other side of the spectrum, numerous models exist that are mainly concerned with meaning and not with phonological form. In general, these models explain how meaning is organized in the mental lexicon, and less which mechanisms are used to access meaning. Classical examples of semantic models are the hierarchical network model,91 the semantic feature model,92 the spreading activation model,93 and the ACT model.94 The question of whether phonological representations of words are tantamount to semantic representations is also a matter of debate in the field of spoken-word recognition. A typical empirical approach to this question is to compare form priming with semantic priming.85 Based on these studies, it has been argued that phonological forms are separate from conceptual representations, and that during word recognition phonological representations are activated first, but that activation cascades through to conceptual representations as soon as possible (but see, e.g., Ref 95).

The main question with respect to morphological structure in lexical activation is whether morphologically complex words are stored as whole forms that do not reflect their morphological complexity (full listing96), as multiple morphemes with separate access representations (full parsing97), or whether storage depends on the regularity of the morphological forms (dual-route98,99). Although most of the research on morphology has been done with reading, a considerable amount of research has by now been conducted in the auditory domain (see Refs 100 and 101 for reviews); form priming102 and word reconstruction103 are typical tasks used to investigate morphological processing in the auditory domain. Baayen, McQueen, Dijkstra, and Schreuder104 proposed a model in which phonological representations of full forms, as well as of stems and affixes, are all activated in parallel; such an account is in line with the competition-based models of spoken-word recognition described above.

Having summarized how standard models of spoken-word recognition relate to models of semantic and morphological processing, we now turn to the question of where the field goes from here. Obviously, the remaining disagreement on the flow of information (feed-forward versus top-down) and the form of stored representations (abstract versus episodic) must be settled. With respect to the flow of information, empirical evidence is needed that shows whether lexical knowledge can directly influence pre-decisional prelexical processing or not; researchers on both sides have acknowledged that it is difficult to develop studies that can convincingly make this point (for both sides, see Refs 105 and 106). With respect to the form of representations, it has become obvious that both purely abstract models and purely episodic models are incomplete, and the challenge for the future is to develop a hybrid approach that combines both abstract and episodic representations107,108; an example of such a complementary-systems account can be found in Norman and O'Reilly,109 and see also the account of Connine and colleagues in which abstract lexical representations encode phonological variants based on variant frequency.110

Models of spoken-word recognition have often been developed with a focus on particular aspects of lexical processing: the size of the phonological neighborhood in NAM,65 for example, or lexical segmentation in Shortlist.50 Other parts of the models are frequently underspecified. This makes the models difficult to assess: not only is it hard to determine how well they can simulate specific empirical findings, judging whether the theoretical assumptions in a model are consistent with an effective, complete recognition system is nearly impossible.32,33


TABLE 1 | Models of Spoken-Word Recognition

| Model | Primary References | Input Representation | Prelexical Representations | Word-form Representation | Online Lexical–Prelexical Feedback | Competition Process | Handling Fine-Grained Information | Type of Model |
|---|---|---|---|---|---|---|---|---|
| Cohort model | Marslen-Wilson & Welsh,36 Marslen-Wilson & Tyler35 | Not specified | Features | Underspecified phonological structures | No | Decision-level process, no inter-word competition | No | Verbal |
| TRACE | McClelland & Elman44 | Multidimensional features, which are converted into phonemes | Features and phonemes | Logogens | Yes | Interactive-activation network, with active, direct inhibition of phoneme and lexical nodes | Partly | Simulation |
| Shortlist | Norris50 | Phoneme strings | Phonemes | Phoneme strings | No | Interactive-activation network, with mismatch parameter and direct competition between words | No | Simulation |
| Shortlist B | Norris & McQueen53 | Sequence of phoneme probabilities over three time slices per segment | Phoneme probabilities | Phoneme strings | No | Beam search | Partly | Simulation |
| Fine-Tracker | Scharenborg33 | Acoustic signal | Articulatory-acoustic feature vectors | Feature vector strings | No | Beam search | Yes | Simulation |
| Neighborhood Activation Model (NAM) | Luce42 | Acoustic–phonetic patterns | Acoustic–phonetic patterns | Logogens | No | Decision-level process, no inter-word competition | Partly | Mathematical |
| PARSYN | Luce et al.18 | Context-sensitive allophones | Allophones | Logogens | Yes | Interactive-activation network, direct competition between words | Partly (due to allophones) | Simulation |
| Minerva2 | Hintzman,74 Goldinger75 | Numeric vectors of −1, 0, +1 | N/A | Episodic traces | N/A | Decision-level process, no inter-word competition | Yes | Simulation |
| Distributed Cohort Model | Gaskell & Marslen-Wilson80 | Multidimensional features | Phonetic features | Distributed vectors | No | No direct competition between words; competition inversely related to the size of the cohort | Partly | Simulation |


For example, many models make the simplifying assumption that the word recognition process receives a sequence of abstract units (typically phonemes or features) as input rather than actual spontaneous speech. If this simplifying assumption is abandoned, it could have serious consequences for the way other components of the model work. What is therefore needed is a unifying theory that accounts for all aspects of spoken-word recognition by human listeners.

    NOTE

a The actual name of the model is TRACE II. TRACE I111 focused on the conversion of digitized speech into a set of phonetic features, and was never connected to TRACE II. However, TRACE is commonly used to refer to the model of spoken-word recognition.

    REFERENCES

1. Zwitserlood P. The locus of the effects of sentential-semantic context on spoken-word processing. Cognition 1989, 32:25–64.
2. Shillcock RC. Lexical hypotheses in continuous speech. In: Altmann G, ed. Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives. Cambridge: MIT Press; 1990, 24–29.
3. Luce PA, Cluff MS. Delayed commitment in spoken word recognition: evidence from cross-modal priming. Percept Psychophys 1998, 60:484–490.
4. Tabossi P, Collina S, Mazzetti M, Zoppello M. Syllables in the processing of spoken Italian. J Exp Psychol: Hum Percept Perform 2000, 26:758–775.
5. Tanenhaus MK, Spivey-Knowlton MJ, Eberhard KM, Sedivy JC. Integration of visual and linguistic information in spoken language comprehension. Science 1995, 268:1632–1634.
6. Whalen D. Subcategorical phonetic mismatches and lexical access. Percept Psychophys 1991, 50:351–360.
7. Connine CM, Blasko DG, Titone DG. Do the beginnings of spoken words have a special status in auditory word recognition? J Mem Lang 1993, 32:193–210.
8. Connine CM, Titone D, Deelman T, Blasko DG. Similarity mapping in spoken word recognition. J Mem Lang 1997, 37:463–480.
9. Gaskell MG, Marslen-Wilson WD. Phonological variation and inference in lexical access. J Exp Psychol: Hum Percept Perform 1996, 22:144–158.
10. Gaskell MG, Marslen-Wilson WD. Mechanisms of phonological inference. J Exp Psychol: Hum Percept Perform 1998, 24:380–396.
11. Gaskell MG, Marslen-Wilson WD. Representation and competition in the perception of spoken words. Cogn Psychol 2002, 45:220–266.
12. Andruski JE, Blumstein SE, Burton M. The effect of subphonemic differences on lexical access. Cognition 1994, 52:163–187.
13. Dahan D, Magnuson JS, Tanenhaus MK, Hogan EM. Tracking the time course of subcategorical mismatches: evidence for lexical competition. Lang Cogn Process 2001, 16:507–534.
14. Marslen-Wilson WD, Warren P. Levels of perceptual representation and process in lexical access: words, phonemes and features. Psychol Rev 1994, 101:653–675.
15. McQueen JM, Norris D, Cutler A. Competition in spoken word recognition: spotting words in other words. J Exp Psychol Learn Mem Cogn 1994, 20:621–638.
16. Vitevitch MS, Luce PA. When words compete: levels of processing in spoken-word recognition. Psychol Sci 1998, 9:325–329.
17. Vitevitch MS, Luce PA. Probabilistic phonotactics and neighborhood activation in spoken-word recognition. J Mem Lang 1999, 40:374–408.
18. Luce PA, Goldinger SD, Auer ET Jr, Vitevitch MS. Phonetic priming, neighborhood activation, and PARSYN. Percept Psychophys 2000, 62:615–625.
19. Dumay N, Frauenfelder UH, Content A. The role of the syllable in lexical segmentation in French: word-spotting data. Brain Lang 2002, 81:144–161.
20. McQueen JM. Segmentation of continuous speech using phonotactics. J Mem Lang 1998, 39:21–46.
21. van der Lugt A. The use of sequential probabilities in the segmentation of speech. Percept Psychophys 2001, 63:811–823.
22. Weber A, Cutler A. First-language phonotactics in second-language listening. J Acoust Soc Am 2006, 119:597–607.
23. Cutler A, Butterfield S. Rhythmic cues to speech segmentation: evidence from juncture misperception. J Mem Lang 1992, 31:218–236.
24. Pallier C, Sebastian-Galles N, Felguera T, Christophe A, Mehler J. Attentional allocation within the syllable structure of spoken words. J Mem Lang 1993, 32:373–389.
25. Vroomen J, van Zon M, de Gelder B. Cues to speech segmentation: evidence from juncture misperceptions and word spotting. Mem Cogn 1996, 24:744–755.
26. Quené H. Durational cues for word segmentation in Dutch. J Phonet 1992, 20:331–350.
27. Quené H. Segment durations and accent as cues to word segmentation in Dutch. J Acoust Soc Am 1993, 94:2027–2035.
28. Turk AE, Shattuck-Hufnagel S. Word-boundary related duration patterns in English. J Phonet 2000, 28:397–440.


    29. Forster KI. Accessing the mental lexicon. In: Wales RJ,

    Walker EW, eds. New Approaches to Language Mech-

    anisms.Amsterdam: North-Holland; 1976.

    30. Morton J. The integration of information in word

    recognition.Psychol Rev1969, 76:165178.

    31. Marr D.VisionSan Francisco: W. H.: Freeman; 1982.

    32. Norris D. How do computational models help us

    build better theories? In: Cutler AM, ed. Twenty-

    First Century Psycholinguistics: Four Cornerstones.

    NJ: Lawrence, Erlbaum; 2005.

    33. Scharenborg O, Boves L. Computational modelling

    of spoken-word recognition processes: design choices

    and evaluation.Pragmat Cogn2010, 18:136164.

    34. Tanenhaus MK, Magnuson J, Dahan D, Cham-

    bers C. Eye movements and lexical access in

    spoken-language comprehension: evaluating the link-

    ing hypothesis between fixations and linguistic pro-

    cessing.J Psycholing Res2000, 29:557580.

    35. Marslen-Wilson WD, Tyler LK. The temporal struc-

    ture of spoken language understanding. Cognition

    1980, 8:171.

    36. Marslen-Wilson WD, Welsh A. Processing interac-

    tions and lexical access during word recognition in

    continuous speech.Cogn Psychol1978, 10:2963.

    37. Cole RA. Listening for mispronunciations: a measure

    of what we hear during speech. Percept Psychophys

    1973, 1:153156.

    38. Taft M, Hambly G. Exploring the cohort model of spo-

    ken word recognition.Cognition1986, 22:259282.

    39. Marslen-Wilson WD. Functional parallelism in spokenword-recognition.Cognition1987, 25:71102.

    40. Marslen-Wilson WD. Activation, competition and

    frequency in lexical access. In: Altman GTM, ed.

    Cognitive Models of Speech Processing: Psycholin-

    guistic and Computational Perspectives. Cambridge,

    MA: MIT Press; 1990, 148172.

    41. Marslen-Wilson WD, Brown CM, Tyler LK. Lexical

    representations in spoken language comprehension.

    Lang Cogn Process1988, 3:116.

    42. Luce PA. A computational analysis of uniqueness

    points in auditory word recognition. Percept Psy-

    chophys1986, 39:155158.

    43. Bard EG, Shillcock RC, Altmann GE. The recognition

    of words after their acoustic offsets in spontaneous

    speech: evidence of subsequent context. Percept Psy-

    chophys1988, 44:395408.

    44. McClelland JL, Elman JL. The TRACE model of

    speech perception.Cogn Psychol1986, 18:186.

    45. McClelland JL, Rumelhart DE. An interactive activa-

    tion model of context effects in letter perception, Part

    1: an account of basic findings. Psychol Rev 1981,

    88:375405.

    46. Dahan D, Magnuson J, Tanenhaus M. Time courseof frequency effects in spoken-word recognition: evi-dence from eye movements. Cogn Psychol 2001,42:361367.

    47. Ganong WF. Phonetic categorization in auditory wordperception. J Exp Psychol: Hum Percept Perform

    1980, 6:110125.48. Foss DJ, Blank MA. Identifying the speech codes. Cogn

    Psychol1980, 12:131.

    49. Allopenna PD, Magnuson JS, Tanenhaus MK. Track-ing the time course of spoken word recognition usingeye movements: evidence for continuous mappingmodels.J Mem Lang1998, 38:419439.

    50. Norris D. Shortlist: a connectionist model of continu-ous speech recognition.Cognition1994, 52:189 234.

    51. McQueen JM, Jesse A, Norris D. No lexical-prelexicalfeedback during speech perception or: is it time to stopplaying those Christmas tapes? J Mem Lang 2009,61:118.

    52. Norris D, McQueen JM, Cutler A. Merging informa-tion in speech recognition: feedback is never necessary.Behav Brain Sci2000, 23:299370.

    53. Norris D, McQueen JM. Shortlist B: a Bayesian modelof continuous speech recognition. Psychol Rev2008,115:357395.

54. Magnuson JS, McMurray B, Tanenhaus MK, Aslin RN. Lexical effects on compensation for coarticulation: the ghost of Christmas past. Cogn Sci 2003, 27:285–298.

55. McQueen JM. The ghost of Christmas future: didn't Scrooge learn to be good? Commentary on Magnuson, McMurray, Tanenhaus, and Aslin (2003). Cogn Sci 2003, 27:795–799.

56. Norris D, McQueen JM, Cutler A. Perceptual learning in speech. Cogn Psychol 2003, 47:204–238.

57. Cutler A, Norris D. Monitoring sentence comprehension. In: Cooper WE, Walker ECT, eds. Sentence Processing: Psycholinguistic Studies Presented to Merrill Garrett. Hillsdale, NJ: Erlbaum; 1979.

58. Cutler A, Norris D. The role of strong syllables in segmentation for lexical access. J Exp Psychol Hum Percept Perform 1988, 14:113–121.

59. Norris D, McQueen JM, Cutler A. Competition and segmentation in spoken-word recognition. J Exp Psychol Learn Mem Cogn 1995, 21:1209–1228.

60. Vroomen J, de Gelder B. Metrical segmentation and lexical inhibition in spoken word recognition. J Exp Psychol Hum Percept Perform 1995, 21:98–108.

61. Norris D, McQueen JM, Cutler A, Butterfield S. The possible-word constraint in the segmentation of continuous speech. Cogn Psychol 1997, 34:191–243.

62. Grosjean F. The recognition of words after their acoustic offsets: evidence and implications. Percept Psychophys 1985, 38:299–310.


63. Scharenborg O, Norris D, ten Bosch L, McQueen JM. How should a speech recognizer work? Cogn Sci 2005, 29:867–918.

64. Smits R, Warner N, McQueen JM, Cutler A. Unfolding of phonetic information over time: a database of Dutch diphone perception. J Acoust Soc Am 2003, 113:563–574.

65. Luce PA, Pisoni DB. Recognizing spoken words: the neighborhood activation model. Ear Hear 1998, 19:1–36.

66. Marslen-Wilson WD, Gaskell MG. Leading up the lexical garden-path: segmentation and ambiguity in spoken word recognition. J Exp Psychol Hum Percept Perform 2002, 28:218–244.

67. Salverda AP, Dahan D, McQueen JM. The role of prosodic boundaries in the resolution of lexical embedding in speech comprehension. Cognition 2003, 90:51–89.

68. Shatzman KB, McQueen JM. Segment duration as a cue to word boundaries in spoken-word recognition. Percept Psychophys 2006, 68:1–16.

69. Cutler A. Native Listening: Language Experience and the Recognition of Spoken Words. Cambridge, MA: MIT Press; 2012, in press.

70. Luce RD. Individual Choice Behavior. Oxford: John Wiley; 1959.

71. Cluff MS, Luce PA. Similarity neighborhoods of spoken two-syllable words: retroactive effects on multiple activation. J Exp Psychol Hum Percept Perform 1990, 16:551–563.

72. Goldinger SD, Luce PA, Pisoni DB. Priming lexical neighbors of spoken words: effects of competition and inhibition. J Mem Lang 1989, 28:501–518.

73. Magnuson JS, Mirman D, Harris HD. Computational models of spoken word recognition. In: Spivey M, McRae K, Joanisse M, eds. The Cambridge Handbook of Psycholinguistics. Cambridge: Cambridge University Press; in press.

74. Hintzman DL. Schema abstraction in a multiple-trace memory model. Psychol Rev 1986, 93:411–428.

75. Goldinger SD. Echoes of echoes? An episodic theory of lexical access. Psychol Rev 1998, 105:251–279.

76. Hintzman DL, Block R, Inskeep N. Memory for mode of input. J Verb Learn Verb Behav 1972, 11:741–749.

77. McQueen JM, Mitterer H. Lexically-driven perceptual adjustments of vowel categories. Poster presented at the ISCA Workshop on Plasticity in Speech Perception. London; 2005.

78. Pitt MA, McQueen JM. Is compensation for coarticulation mediated by the lexicon? J Mem Lang 1998, 39:347–370.

79. Goldinger SD. Words and voices: episodic traces in spoken word identification and recognition memory. J Exp Psychol Learn Mem Cogn 1996, 22:1166–1183.

80. Gaskell MG, Marslen-Wilson WD. Integrating form and meaning: a distributed model of speech perception. Lang Cogn Process 1997, 12:613–656 (special issue: Cognitive models of speech processing: psycholinguistic and computational perspectives on the lexicon).

81. Hinton GE, McClelland JL, Rumelhart DE. Distributed representations. In: Rumelhart DE, McClelland JL, eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge, MA: MIT Press; 1986.

82. Smolensky P. On the proper treatment of connectionism. Behav Brain Sci 1988, 11:1–74.

83. Weber A, Crocker MW. On the nature of semantic constraints on lexical access. J Psycholing Res 2011, advance online publication. doi:10.1007/s10936-011-9184-0.

84. Fodor JA. Modularity of Mind: An Essay on Faculty Psychology. Cambridge, MA: MIT Press; 1983.

85. Norris D, Cutler A, McQueen JM, Butterfield S. Phonological and conceptual activation in speech comprehension. Cogn Psychol 2006, 53:146–193.

86. McQueen JM. Speech perception. In: Lamberts K, Goldstone R, eds. The Handbook of Cognition. London: Sage Publications; 2005, 255–275.

87. Klatt DH. Speech perception: a model of acoustic-phonetic analysis and lexical access. J Phonet 1979, 7:279–312.

88. Stevens KN. Toward a model for lexical access based on acoustic landmarks and distinctive features. J Acoust Soc Am 2002, 111:1872–1891.

89. Grossberg S, Myers CW. The resonant dynamics of speech perception: interword integration and duration-dependent backward effects. Psychol Rev 2000, 107:735–767.

90. Massaro DW. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, MA: MIT Press; 1997.

91. Collins AM, Quillian MR. Retrieval time from semantic memory. J Verb Learn Verb Behav 1969, 8:240–247.

92. Smith EE, Shoben EJ, Rips LJ. Structure and process in semantic memory: a featural model for semantic decisions. Psychol Rev 1974, 81:214–241.

93. Collins AM, Loftus EF. A spreading-activation theory of semantic processing. Psychol Rev 1975, 82:407–428.

94. Anderson JR. ACT: a simple theory of complex cognition. Am Psychol 1996, 51:355–365.

95. Bölte J, Coenen E. Is phonological information mapped onto semantic information in a one-to-one manner? Brain Lang 2002, 81:384–397.

96. Butterworth B. Lexical representation. In: Butterworth B, ed. Language Production, vol. 2. London: Academic Press; 1983, 257–332.


97. Taft M, Forster KI. Lexical storage and retrieval of prefixed words. J Verb Learn Verb Behav 1975, 14:630–647.

98. Clahsen H. Lexical entries and rules of language: a multidisciplinary study of German inflection. Behav Brain Sci 1999, 22:991–1060.

99. Schreuder R, Baayen H. Modeling morphological processing. In: Feldman LB, ed. Morphological Aspects of Language Processing. Hillsdale, NJ: Erlbaum; 1995, 131–154.

100. Marslen-Wilson WD. Access to lexical representations: cross-linguistic issues. Lang Cogn Process 2001, 16:699–708.

101. Marslen-Wilson WD. Morphology and language. In: Brown K, ed. Encyclopedia of Language and Linguistics. Oxford: Elsevier; 2006.

102. Marslen-Wilson WD, Tyler LK, Waksler R, Older L. Morphology and meaning in the English mental lexicon. Psychol Rev 1994, 101:3–33.

103. Ernestus M, Baayen H. Paradigmatic effects in auditory word recognition: the case of alternating voice in Dutch. Lang Cogn Process 2007, 22:1–24.

104. Baayen H, McQueen JM, Dijkstra T, Schreuder R. Frequency effects in regular inflectional morphology: revisiting Dutch plurals. In: Baayen H, Schreuder R, eds. Morphological Structure in Language Processing. Berlin: Mouton de Gruyter; 2003, 355–390.

105. McClelland JL, Mirman D, Holt LL. Are there interactive processes in speech perception? Trends Cogn Sci 2006, 10:363–369.

106. McQueen JM, Norris D, Cutler A. Are there really interactive processes in speech perception? Trends Cogn Sci 2006, 10:533.

107. Goldinger SD. A complementary-systems approach to abstract and episodic speech perception. In: Proceedings of the 16th International Congress of Phonetic Sciences. Dudweiler: Pirrot; 2007, 49–54.

108. Cutler A, Weber A. Listening experience and phonetic-to-lexical mapping in L2. In: Proceedings of the 16th International Congress of Phonetic Sciences. Dudweiler: Pirrot; 2007, 43–48.

109. Norman K, O'Reilly R. Modeling hippocampal and neocortical contributions to recognition memory: a complementary learning systems approach. Psychol Rev 2003, 110:611–646.

110. Connine CM, Ranbom L, Patterson DJ. On the representation of phonological variant frequency in spoken word recognition. Percept Psychophys 2008, 70:403–411.

111. Elman JL, McClelland JL. Exploiting lawful variability in the speech wave. In: Perkell JS, Klatt DH, eds. Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence Erlbaum Associates; 1986.

    FURTHER READING

McQueen JM. Eight questions about spoken-word recognition. In: Gaskell MG, ed. The Oxford Handbook of Psycholinguistics. Oxford: Oxford University Press; 2007, 37–53.

McQueen JM, Cutler A. Spoken word access processes: an introduction. Lang Cogn Process 2001, 16:469–490.
