

The cocktail-party problem revisited: early processing and selection of multi-talker speech

Adelbert W. Bronkhorst

Published online: 1 April 2015. © The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract How do we recognize what one person is saying when others are speaking at the same time? This review summarizes widespread research in psychoacoustics, auditory scene analysis, and attention, all dealing with early processing and selection of speech, which has been stimulated by this question. Important effects occurring at the peripheral and brainstem levels are mutual masking of sounds and "unmasking" resulting from binaural listening. Psychoacoustic models have been developed that can predict these effects accurately, albeit using computational approaches rather than approximations of neural processing. Grouping—the segregation and streaming of sounds—represents a subsequent processing stage that interacts closely with attention. Sounds can be easily grouped—and subsequently selected—using primitive features such as spatial location and fundamental frequency. More complex processing is required when lexical, syntactic, or semantic information is used. Whereas it is now clear that such processing can take place preattentively, there also is evidence that the processing depth depends on the task-relevancy of the sound. This is consistent with the presence of a feedback loop in attentional control, triggering enhancement of to-be-selected input. Despite recent progress, there are still many unresolved issues: there is a need for integrative models that are neurophysiologically plausible, for research into grouping based on other than spatial or voice-related cues, for studies explicitly addressing endogenous and exogenous attention, for an explanation of the remarkable sluggishness of attention focused on dynamically changing sounds, and for research elucidating the distinction between binaural speech perception and sound localization.

Keywords Attention . Auditory scene analysis . Cocktail-party problem . Informational masking . Speech perception

Speech communication is so all-pervasive and natural that it is easy to underestimate the formidable difficulties our auditory system has to overcome to be able to extract meaningful information from the complex auditory signals entering our ears. In particular in environments where we try to understand one talker among multiple persons speaking at the same time, the capacities of the auditory system are stretched to the limit. To most of us blessed with normal hearing, it seems as if this task is achieved without any effort, but the fragility of speech perception is clearly revealed when there is background noise or when a hearing impairment affects the peripheral encoding of the incoming signals. The difficulties associated with understanding speech in multiple-talker situations are often denoted by the term "cocktail-party problem" (or "cocktail-party effect"), coined by Colin Cherry in his 1953 paper. While the widespread use of this term might suggest the existence of a single, coherent field of research, scientific work has actually for many years proceeded along different lines that showed little or no overlap. Cherry himself was mainly interested in the ability of listeners to select target speech while ignoring other sounds in conditions where signals were either mixed or presented to separate ears. This work acted as the starting point for a line of research into selective attention, which generated influential early "filter" models (Broadbent, 1958; Treisman, 1964; Deutsch & Deutsch, 1963).

A. W. Bronkhorst, TNO Human Factors, POB 23, 3769 ZG Soesterberg, The Netherlands. E-mail: [email protected]

A. W. Bronkhorst (*), Department of Cognitive Psychology, Vrije Universiteit, van den Boechorststraat 1, 1081 BT Amsterdam, The Netherlands. E-mail: [email protected]

Atten Percept Psychophys (2015) 77:1465–1487. DOI 10.3758/s13414-015-0882-9


Later work on attention has predominantly focused on the visual modality, and it was not until relatively recently that further progress was made in understanding how auditory attention affects speech perception (Cowan & Wood, 1997; Pulvermüller & Shtyrov, 2006; Parmentier, 2013).

A line of research with an even longer history has studied how simultaneous sounds interfere with each other already at the peripheral level. It originated at Bell Labs in the beginning of the previous century (Allen, 1994) and has culminated in the development of powerful models that can predict effects of various interfering sounds on speech intelligibility (French & Steinberg, 1947; Jørgensen, Ewert, & Dau, 2013). An important finding, which is incorporated in more recent models and which is relevant for "cocktail-party" conditions, is that the auditory system benefits considerably from the fact that we have two ears. The head provides an acoustic "shadow," which can favor one ear, depending on the location of the talkers. In addition, the differences between the signals entering the two ears enable us to partially "unmask" interfering sounds, effectively providing an increase of the signal-to-noise ratio (SNR: the ratio of levels of target and interfering sounds) of up to 4 dB (Bronkhorst & Plomp, 1988).

While it is evident that speech must be audible and needs to be selected in order to be understood, there is actually a third crucial stage in the early processing of speech, which is addressed in the third line of research reviewed here. In this stage, individual speech elements are grouped together into streams. Past research into selection and audibility did not take this into account because it used stimuli that can be easily grouped, e.g., sounds presented to the two ears or speech mixed with interfering noise. An early review of research on grouping was written by Bregman (1990), who then had to rely mainly on results from experiments conducted with non-speech stimuli. Fortunately, grouping of speech sounds was addressed in many later studies, in particular those investigating "informational masking": interference that cannot be explained by reduced audibility (Brungart, 2001; Arbogast, Mason, & Kidd, 2002). Bregman's (1990) review revived interest in the "cocktail party" effect and introduced a novel term for the research area—auditory scene analysis—that has been widely adopted. However, because this term also refers to attentional effects, it does not fit into the distinction between research lines made in this review. Thus, the term "grouping" is used instead.

This review is intended to supplement an earlier one (Bronkhorst, 2000), which mainly considered the second of the three research lines. Its purpose is to discuss all lines and integrate the results in a single framework. This task is facilitated by the increasing number of studies that cross the "boundaries" of the research lines. The work on informational masking provides a good example, because it looks at effects of attention and/or grouping while controlling for audibility (Gallun, Mason, & Kidd, 2005; Freyman, Balakrishnan, & Helfer, 2001). The review restricts itself to the three research lines and to early processing of speech by normal-hearing listeners, which means that it does not include animal research or work on psycholinguistics, memory, or hearing impairment. It focuses on studies that use speech stimuli, but incidentally, for example when there is a lack of data, results for non-speech stimuli are considered as well. The organization of the review is as follows. After a short section that considers speech itself, there are sections addressing the three research lines. In the sixth section, a conceptual, integrative model of auditory processing of multi-talker speech is presented. The review ends with suggestions for future research. It is important to acknowledge that this review has been inspired by earlier overviews published by others, in particular the work of Bregman (1990) and more recent reviews written by Darwin (1997, 2008), Assmann and Summerfield (2004), Shinn-Cunningham (2008), McDermott (2009), Näätänen, Kujala, and Winkler (2011), and Moore and Gockel (2012).

How "special" is speech?

If engineers were to design an acoustic signal for communication among humans, resistant to all kinds of acoustic interference, they would probably not come up with something resembling natural speech. With its voiced phonemes that show large variations in fundamental frequency (F0) across talkers, rapidly alternating with unvoiced phonemes that range from noise-like sounds to stops, it seems an unlikely candidate. Speech, however, appears to be remarkably well suited for its purpose. Acoustic analyses and vocal tract modeling show that phonemes (a) are relatively invariant to vocal tract differences, (b) make good use of the available "perceptual space" (a simple space for vowels can be defined by the frequencies of the first two resonance peaks, or formants), and, by concentrating energy in limited spectral regions, (c) are resistant to masking by background noise (Diehl, 2008). Furthermore, the information contained in speech is coded with such redundancy that (d) missing parts often can be "reconstructed."

The latter two properties are particularly relevant in "cocktail-party" conditions. The fact that speech energy is concentrated in discrete spectrotemporal regions has as a consequence that, when speech is mixed with interfering speech or other sounds, it will, on average, still deliver the dominant contribution to many regions. This is enhanced by the fact, noted by Darwin (2008), that due to the logarithmic intensity transformation performed by the auditory system, the "winner takes all" when a stronger signal is added to a weaker one. An interesting application of this effect is the use of binary masks in automatic speech segregation that attempt to identify and remove all spectrotemporal regions that are not dominated by the target speech (Roman, Wang, & Brown, 2003; Hu & Wang, 2004; Cooke, 2006).
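To make the binary-mask idea concrete, here is a minimal sketch that computes an "ideal" binary mask from separately known target and interferer signals and resynthesizes only the target-dominated time-frequency cells. The STFT settings and the 0 dB local-SNR criterion are illustrative assumptions, not the specific choices of Roman, Wang, and Brown (2003), Hu and Wang (2004), or Cooke (2006).

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask_resynthesis(target, interferer, fs, threshold_db=0.0):
    """Keep only the time-frequency cells of the mixture that are dominated
    by the target (local SNR above `threshold_db`) and resynthesize them.
    The STFT settings and the 0 dB criterion are illustrative choices."""
    _, _, T = stft(target, fs, nperseg=512)
    _, _, I = stft(interferer, fs, nperseg=512)
    local_snr_db = 20 * np.log10((np.abs(T) + 1e-12) / (np.abs(I) + 1e-12))
    mask = local_snr_db > threshold_db            # the binary mask
    _, resynth = istft((T + I) * mask, fs, nperseg=512)
    return mask, resynth

# Stand-ins for target speech and an interferer (1 s of audio at 16 kHz)
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 300 * t)
interferer = 0.5 * np.random.randn(fs)
mask, cleaned = ideal_binary_mask_resynthesis(target, interferer, fs)
```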


The redundancy of speech is apparent from its resistance against various kinds of distortions, such as bandwidth reduction (French & Steinberg, 1947), peak clipping (Pollack & Pickett, 1959), temporal smearing (Drullman et al., 1993), and even changes in the carrier to which spectrotemporal modulations are applied (Remez, Rubin, Pisoni, & Carell, 1981). There are many types of redundancies, ranging from acoustic effects at the phonetic level (e.g., coarticulation between phonemes) to contextual information at the sentence level (Kalikow, Stevens, & Elliot, 1977; Boothroyd & Nittrouer, 1988). Given that redundancy acts as a kind of "safety net" that allows missing information to be recovered, its effects can be quantified by determining how well listeners can fill in missing speech parts. Whereas these "cloze probabilities" can be measured directly with written text (Taylor, 1953; Block & Baldwin, 2010), an indirect method has to be used for speech because removing phonemes or words will almost always disrupt coarticulatory cues and/or introduce false cues. Bronkhorst, Bosman, and Smoorenburg (1993) and Bronkhorst, Brand, and Wegener (2002) developed such a method, which is based on probabilities of occurrence of correct and partially correct responses. They estimated that, for speech masked by noise, the probability of recovering a missing phoneme is approximately 50 % for meaningful words and 20 % for nonsense words. Words missing from short everyday sentences can be filled in more easily, with 70–90 % accuracy, depending on sentence length. For meaningless sentences with a known syntactic structure, the probability is still approximately 50 %.1 These data not only show that there is more contextual information on the word than on the phoneme level (as expected), but also that there is a major contribution of nonsemantic information.

Thus, speech is certainly "special," not only because it satisfies several acoustic criteria, but also because we are finely tuned to its properties. Studies of speech intelligibility in noise for nonnative listeners demonstrate that this tuning requires exposure early in life. Mayo, Florentine, and Buus (1997), for example, compared performance of native listeners with that of early bilinguals, who learned a second language before age 6 years, and late bilinguals, learning it after age 14 years. They found similar SRTs (Speech Reception Thresholds: SNRs required for 50 % sentence intelligibility) for the first two groups, but 4–5 dB higher SRTs for the late bilinguals. This is a considerable penalty: expressed in average speech levels in "cocktail-party" conditions, it is equivalent to a threefold increase of the number of interfering talkers.
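As a back-of-the-envelope check of the threefold figure (assuming equal-level, incoherently adding interferers, which is not stated in the original study): N such talkers raise the overall interference level by 10 log10(N) dB, and 10 log10(3) ≈ 4.8 dB, which matches the reported 4–5 dB SRT penalty.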

The fact that tuning occurs does not necessarily imply that the processing of speech by the auditory system is entirely different from that of other sounds. On the one hand, EEG and neuroimaging studies indicate that there are specialized brain regions for speech processing (Patterson & Johnsrude, 2008) and that the infant brain already responds differently to native speech than to other speech or to nonspeech sounds (Kuhl & Rivera-Gaxiola, 2008). On the other hand, there are remarkable similarities in neural processing of speech by animals and humans (Steinschneider, Nourski, & Fishman, 2013) and it also seems unlikely that low-level preattentive processing uses separate mechanisms for speech and other sounds (Darwin, 2008).

Masking and Unmasking of Speech Sounds

Normally, sounds originating from different sources will always reach both ears, which means that there will be interference between incoming sounds already at a peripheral level. This peripheral interference is called masking or, to distinguish it from informational masking, energetic masking. Our hearing system also benefits from the fact that interfering signals reaching the two ears have similar components, because it can suppress such components and thus achieve an effective reduction of masking, referred to as unmasking. When studying masking, a crucial variable is the SNR. As shown by Plomp (1977), SNRs in a typical cocktail party are in theory around 0 dB, which would mean that sentences can be easily understood, whereas isolated words are somewhat less intelligible. In practice, however, interfering sound levels often will be higher (e.g., because of background noise) and listeners will perform less than optimally (e.g., because of a hearing impairment). This means that the gain afforded by binaural listening can be crucial to reach sufficient intelligibility levels.

Many factors that influence (un)masking of speech have been discussed by Bronkhorst (2000); the most relevant are the type of target speech, spectral differences between target and interfering sounds, the spatial configuration of the sound sources, fluctuations in level (modulations) of the interfering sounds, the acoustics of the environment, and hearing impairment of the listener. The effects of these factors often are quantified as shifts of the SRT. For example (see also Table 1 in Bronkhorst, 2000), spatial separation of sound sources and level fluctuations of interfering sounds can each cause positive effects (SRT reductions) of up to 10 dB. Spectral differences will have smaller positive effects, of up to 5 dB. Reverberation and (moderate) hearing impairment can result in large negative effects of up to 10 dB. In view of the number of factors, and given that some of them interact with each other, it is clear that one needs models encompassing as many factors as possible when trying to make sense of the experimental data. I will, therefore, concentrate on the evolution of the most important speech perception models, summarizing their main properties and indicating their potential and limitations.

1 Note that these estimations differ from "cloze" probabilities of written text. As shown in Bronkhorst et al. (2002), probabilities of recovering words in sentences measured using orthographic presentation are significantly lower than estimated probabilities for the same material presented auditorily.


An overview of the models that are discussed and their main features is given in Table 1.

Quantifying speech information and effects of interfering noise

A fundamental property of any model predicting speech intelligibility is the way in which speech information is quantified. The predictor with the longest history is the Speech Intelligibility Index (SII; ANSI, 1997), originally called the Articulation Index (AI; French & Steinberg, 1947; Kryter, 1962). Basically, the SII is determined by calculating SNRs in nonoverlapping frequency bands, truncating these to the range −15 to +15 dB, mapping them linearly to values between 0 and the value of the "importance function" for that band, and finally summing them across bands.2 The SII is widely used and has been extensively validated. Advantages are that frequency-domain effects, in particular differences in long-term average frequency spectra of target and interfering sounds, are modeled quite accurately and that, in contrast to percent correct scores that are difficult to compare across experiments, the index represents a generic, uniform measure of speech intelligibility. An important disadvantage is, however, that differences in speech material affect the model at two stages: the band importance function used to calculate the index, and the psychometric function used to map index values to percent correct values. This means that it is actually not easy to adapt the SII to different types of speech. Other shortcomings are that effects of reverberation and of interferer modulations are not modeled. As shown by Rhebergen and Versfeld (2005), the latter factor can actually be approximated relatively easily by calculating the SII in short time frames (varying from 9.4 to 35 ms, depending on frequency) and then averaging it over time. The accuracy of the predictions can be improved by taking forward masking into account; in that case, a constant time frame of 4 ms can be used (Rhebergen, Versfeld, & Dreschler, 2006).
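A minimal sketch of the band-SNR logic just described, with placeholder band levels and flat importance weights; the corrections of the full ANSI (1997) procedure (hearing loss, high speech levels, upward spread of masking, see Footnote 2) are omitted.

```python
import numpy as np

def sii_like_index(speech_levels_db, noise_levels_db, band_importance):
    """Toy SII-style index: per-band SNRs are truncated to [-15, +15] dB,
    mapped linearly onto [0, band importance], and summed across bands."""
    snr = np.asarray(speech_levels_db, float) - np.asarray(noise_levels_db, float)
    snr = np.clip(snr, -15.0, 15.0)
    audibility = (snr + 15.0) / 30.0            # 0 at -15 dB, 1 at +15 dB
    return float(np.sum(audibility * np.asarray(band_importance, float)))

# Illustrative band levels (dB) and flat importance weights summing to 1
speech = [60, 62, 58, 55, 50, 45]
noise  = [55, 60, 60, 52, 40, 48]
importance = np.full(6, 1 / 6)
print(sii_like_index(speech, noise, importance))   # index between 0 and 1
```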

Another widely used predictor is the Speech Transmission Index (STI; IEC, 2003; Steeneken & Houtgast, 1980). This index borrows the SNR-based approach from the SII/AI to model frequency-domain masking effects but uses preservation of speech modulations as a measure for quantifying time-domain distortions, such as reverberation and peak clipping. The STI, in effect, quantifies how speech quality deteriorates when it is transmitted through an electric and/or acoustic channel.

Table 1  Overview of monaural and binaural speech perception models

Model for quantifying speech information            | References | Aspects that are modeled | Binaural model                      | References (binaural version)
Speech Intelligibility Index (SII)                  | a          | I, B, F (refs. f, i)     | Equalization Cancellation           | g, h, i, j
Speech Transmission Index (STI)                     | b          | I, B, R, T               | Binaural version of STI             | k
Speech-based Envelope Power Spectrum Model (sEPSM)  | c, d       | I, B, F, R, T, P         |                                     |
Speech Recognition Sensitivity Model (SRS)          | e          | I, B, P                  |                                     |
(No model)                                          |            |                          | Descriptive model of binaural gain  | l, m

References: a ANSI (1997); b IEC (2003); c, d Jørgensen and Dau (2011), Jørgensen et al. (2013); e Müsch and Buus (2001); f Rhebergen et al. (2006); g Durlach (1972); h, i, j Wan et al. (2010), Beutelmann et al. (2010), Lavandier et al. (2012); k Van Wijngaarden and Drullman (2008); l, m Bronkhorst (2000), Jones and Litovsky (2011)

Aspects that are modeled: I = long-term average frequency spectra of speech and interference; B = bandwidth reduction of speech and/or interference; F = envelope fluctuations of the interference; R = effect of reverberation on target speech; T = time-domain distortions of target speech (e.g., peak clipping); P = implicit modeling of the psychometric function

2 The actual calculation is more complex, because corrections are applied for hearing loss, high speech levels, and self-masking (upward spread of masking).


The crucial parameter is the modulation transfer function (MTF)—the quotient of the modulation depths at the output and input of the channel. The MTF is determined in frequency bands, converted to equivalent SNRs using the equation SNR = 10 log10(MTF / (1 − MTF)), and then weighted with a band importance function in a similar way as is done in the SII calculation. Recently, this modulation-based approach has been generalized to a "speech-based envelope power spectrum model" (sEPSM; Jørgensen & Dau, 2011). Instead of using SNRs based on frequency spectra, this model uses SNRs in the envelope power domain, derived from the power of the envelopes of the noise and speech + noise signals. This model adds complexity to the STI, because it determines these SNRs in modulation bands as well as frequency bands. However, because it converts SNRs to d′ values and then uses an ideal observer to calculate percent correct values, it not only implicitly models the psychometric function, but also includes the effect of response set size (i.e., the performance increase related to reduction of the number of response alternatives). Another crucial difference with the STI and SII approaches is that no band importance function is used to weigh contributions of frequency bands. Differences in speech material are accounted for by adjusting four parameters: two used in the conversion of SNRs to d′ values, and two (one is the response set size) entering the calculation of percentage correct scores. Jørgensen et al. (2013) recently developed a multiresolution version of this model to be able to predict effects of interferer modulations as well. The method used is roughly similar to that of Rhebergen and Versfeld (2005), discussed above. The averaging across time frames is, however, done with envelope power SNRs, and time frames have durations depending on the modulation band.
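The MTF-to-SNR conversion is compact enough to state directly; the sketch below implements the equation given above, with clipping added only to avoid division by zero (an implementation convenience, not part of the standard).

```python
import numpy as np

def mtf_to_equivalent_snr_db(m):
    """STI-style conversion of a modulation transfer factor m (0 < m < 1)
    to an equivalent SNR in dB: SNR = 10 * log10(m / (1 - m))."""
    m = np.clip(np.asarray(m, float), 1e-6, 1 - 1e-6)   # avoid division by zero
    return 10.0 * np.log10(m / (1.0 - m))

# m close to 1 (modulations preserved) -> high SNR; m = 0.5 -> 0 dB
print(mtf_to_equivalent_snr_db([0.9, 0.5, 0.1]))   # approx. [ 9.5  0.0 -9.5]
```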

A third approach to predicting speech perception was introduced by Müsch and Buus (2001). Their Speech Recognition Sensitivity (SRS) model is in essence a model of speech transmission, just as the STI and the sEPSM are, because it quantifies speech degradation. Different types of degradation are modeled as independent sources of variance that affect the match between an ideal speech signal and the template of that signal used by the listener to identify it. The degradations are imperfections in speech production, interfering sounds, and "cognitive noise," representing speech entropy determined by various contextual cues (Van Rooij & Plomp, 1991). The model generates a d′ value just as the sEPSM model does. A specific feature is that intelligibility of speech presented in spectrally disjoint frequency bands can be predicted by taking into account synergetic effects, which are not modeled in the SII and STI approaches. The model, however, requires many parameters and is, as yet, not able to predict effects of reverberation and interferer modulations.

It should be noted that there are constraints limiting how accurately speech information can be quantified, because speech intelligibility depends on many properties of the speech material—in particular its linguistic, syntactic, and semantic information—and on possible interactions with interfering sounds. All models presented above can be adapted to some degree to such variations, but this is normally done in a relatively coarse way, for example based on corpora consisting of certain types of sentences or words, in combination with specific types of interfering sounds (e.g., IEC, 2003, Fig. 1). However, there may be considerable differences in intelligibility between items in a corpus. Van Rooij and Plomp (1991), for example, tested the sentence set developed by Plomp and Mimpen (1979), designed to be relatively homogeneous, and found differences of up to 4 dB in the SRT in noise between individual sentences. That speech material and type of interference can interact with each other was recently demonstrated by Uslar, Carroll, Hanke, Hamann, Ruigendijk, et al. (2013), who found that variations in the linguistic complexity of speech material affected intelligibility differently for steady-state than for fluctuating interfering noise.

Binaural speech perception

The models described above can account for effects of interfering sounds and reverberation but do not predict the gain resulting from binaural listening. Three types of cues should be taken into account to achieve this: interaural time differences (ITDs, differences in arrival time between the ears), interaural level differences (ILDs), and interaural decorrelation (reduced coherence). The latter factor, which occurs in any environment where there is reverberation, results from differences between the reflections arriving at the two ears (Hartmann, Rakerd, & Koller, 2005). The binaural cues depend on many acoustic factors: the spatial configuration and directivity of the sound sources, the room geometry and reverberation, and the shape of the head and ears of the listener. As a result, they cannot be calculated easily, and it often is necessary to measure them using either an artificial head (Burkhard & Sachs, 1975) or miniature microphones inserted into the ear canals of a human subject (Wightman & Kistler, 2005).

Due to the combined acoustic effects of head and ears, ILDs show a complex dependency on frequency. They are around zero for sound sources in the median plane of the head and increase when sources are moved to one side; they also increase as a function of frequency, reaching values up to 20 dB at 4 kHz or higher (e.g., Fig. 2 in Bronkhorst & Plomp, 1988, which presents artificial-head data for a single source in an anechoic environment). When target and interfering sounds originate from different locations, their ILDs will normally be different, resulting in an SNR that is, on average, higher at one ear than at the other. The simplest way to predict the effects of ILDs on speech intelligibility is to equate binaural performance with that for the ear with the highest average SNR.


A somewhat better method is to determine the ear with the highest SNR per frequency band and to combine these "best bands" in the calculation of speech intelligibility. The most sophisticated method, not used in current models, would be to perform a spectrotemporal analysis of the SNRs at both ears and calculate binary masks, indicating which ear is "better" in each time-frequency cell. Brungart and Iyer (2012) showed that a monotic signal created by applying such masks to the left and right signals and summing the results is as intelligible as the original binaural stimulus, which suggests that the auditory system is indeed integrating "glimpses" of speech information fluctuating rapidly across ears and across frequency.
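A minimal sketch of the "best band" selection just described, with illustrative band SNRs; in a complete model the resulting band SNRs would be passed on to a monaural intelligibility model such as the SII-style sketch shown earlier.

```python
import numpy as np

def best_band_snrs(snr_left_db, snr_right_db):
    """Per frequency band, keep the SNR of whichever ear is 'better'."""
    return np.maximum(np.asarray(snr_left_db, float),
                      np.asarray(snr_right_db, float))

# Illustrative band SNRs (dB): the head shadow favors different ears per band
left  = [-3.0,  1.0,  6.0, 10.0]
right = [ 2.0, -1.0,  4.0, 12.0]
print(best_band_snrs(left, right))   # [ 2.  1.  6. 12.]
```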

While the dependence of ITDs and interaural decorrelation on acoustics and on frequency is actually less complex than that of ILDs, predicting their effects is not as straightforward. Fortunately, quantitative models of binaural signal detection have been developed that also can be applied to speech perception (Durlach, 1972; Colburn, 1973). The Equalization-Cancellation (EC) model developed by Durlach (1972) is currently most widely used. It assumes that the auditory system optimizes SNRs in separate frequency bands by combining left- and right-ear signals in such a way that the energy of interfering sounds is minimized. This optimization takes place in three steps: the levels of the two monaural signals are equated, their phases are shifted, and one signal is subtracted from the other. The cross-correlation function of the left and right interferer signals provides important input for the EC model, because the position of the maximum determines the phase shift that should be applied, and the maximum value (the interaural coherence) indicates how effective the cancellation will be. The model also assumes that auditory signal processing is hampered by internal noise, so that perfect cancellation will never happen. The internal noise is modeled by applying time and amplitude jitters to the left- and right-ear signals before the EC operation.
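A toy per-band sketch of the three EC operations named above (level equalization, delay alignment, subtraction), applied here to a single pair of band signals; the 1 ms delay search range and the use of a simple cross-correlation maximum are assumptions, and the internal time/amplitude jitters of the full model are left out.

```python
import numpy as np

def ec_cancel_band(left, right, fs, max_delay_ms=1.0):
    """Toy equalization-cancellation for one frequency band:
    (1) equalize RMS levels, (2) find the interaural delay that best
    aligns the two signals via cross-correlation, (3) subtract."""
    left = np.asarray(left, float)
    right = np.asarray(right, float)
    gain = np.sqrt(np.mean(left ** 2) / (np.mean(right ** 2) + 1e-12))
    right_eq = right * gain                       # equalization
    max_lag = int(fs * max_delay_ms / 1000)
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.sum(left * np.roll(right_eq, lag)) for lag in lags]
    best_lag = lags[int(np.argmax(xcorr))]        # delay with maximum correlation
    return left - np.roll(right_eq, best_lag)     # cancellation

# Example: a noise interferer arriving with an interaural delay is largely cancelled
fs = 16000
noise = np.random.randn(fs)
residual = ec_cancel_band(noise, np.roll(noise, 8), fs)
print(np.var(residual) / np.var(noise))   # much smaller than 1
```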

The models of binaural speech perception developed by Wan, Durlach, and Colburn (2010), Beutelmann, Brand, and Kollmeier (2010), and Lavandier et al. (2012) all combine the EC model with a "best band" prediction of effects of ILDs. They therefore yield comparable predictions while using somewhat different implementations. The approach of Lavandier et al. (2012) is interesting, because it uses an analytical expression to calculate the binaural unmasking in dB directly, which makes it computationally efficient. The implementation of Beutelmann et al. (2010) is more complex but has the advantage that it performs calculations in short time frames, which means that effects of interferer modulations can be predicted as well.

Van Wijngaarden and Drullman (2008) developed a binaural version of the STI model, discussed above, that does not use the EC model but quantifies how modulations of the input signal are preserved in the interaural cross-correlation function. Because this function depends on interaural delay as well as on time, the delay is chosen at which modulations are optimally preserved, i.e., at which the MTF is maximal. These binaural MTFs are calculated within nonoverlapping frequency bands3 and compared to the left- and right-ear monaural MTFs. Per band, only the largest of the three MTFs is used for the final STI calculation. This method is attractive, because it uses a relatively simple way to calculate unmasking, while remaining consistent with the existing STI standard (IEC, 2003). It, furthermore, is the only binaural model that predicts how the intelligibility of target speech deteriorates as a result of reverberation. However, it is not able to model effects of interferer modulations.

Another approach is taken in the descriptive model first proposed by Bronkhorst (2000) and later refined by Jones and Litovsky (2011). This model considers conditions where all sound sources have the same average level and where the target speech always comes from the front. It predicts the decrease of the SRT that occurs when one or more interferers are moved from the front to positions around the listener. It consists of two additive terms: one related to how close or spatially separated the interferers are, and the other to the symmetry of their configuration. Jones and Litovsky (2011) applied the model to data from five studies with up to three interferers and found very high correlations between measurements and predictions (ρ ≥ 0.93). This model, therefore, seems very useful when one wants quick estimates of the unmasking occurring in a variety of spatial configurations.

Summary of Research into Masking and Unmasking

In actual "cocktail-party" conditions, peripheral masking and binaural unmasking inevitably affect speech perception. Quantifying how much (un)masking occurs is, however, not easy, because it depends on a multitude of factors related to the speech signals, the environment, and the listener. Fortunately, powerful psychoacoustic models have been developed in the past decades that can deal with all of these factors and are able to generate sufficiently accurate predictions, using only a limited number of free parameters. Crucial properties of the models are (1) how speech information and its sensitivity to interfering sound and reverberation are quantified, and (2) how the increase in speech information during binaural listening is modeled.

The sEPSM model developed by Jørgensen et al. (2013) currently seems the most powerful approach to quantifying speech information, because it can deal with many factors, including reverberation, and requires only a few parameters. Binaural listening is associated with three types of interaural differences: ILDs, ITDs, and interaural decorrelation.

3 Binaural MTFs are only calculated for octave bands with center frequencies of 500, 1000, and 2000 Hz. The monaural MTFs are used for the other frequency bands.


Given that ILDs cause differences in SNR between the ears, their effects can actually be predicted relatively easily using "monaural" speech perception models. Unmasking resulting from ITDs and interaural decorrelation can be adequately predicted by Durlach's (1972) EC model or Van Wijngaarden and Drullman's (2008) binaural STI. Although no single binaural model is available that addresses all relevant factors related to source, environment, and listener, such a model actually can be developed relatively easily, because it is already known how any missing factor can best be quantified.

Grouping of speech sounds

When extracting target speech from a multi-talker mixture, two different tasks need to be performed. One is to separate, at any time, target elements from other speech (segregation). The other is to connect elements across time (streaming). Bregman (1990) refers to these as simultaneous and sequential organization, respectively. That there can be substantial differences between these tasks is illustrated by results obtained with the Coordinate Response Measure (CRM) task, a speech intelligibility task used extensively in studies of informational masking (Bolia, Nelson, Ericson, & Simpson, 2000). This task uses phrases of the form "Ready <call sign> go to <color> <number> now." There are 8 call signs (e.g., "Baron"), 4 colors, and 8 numbers, resulting in 256 possible phrases, spoken by 4 male and 4 female talkers. Listeners are asked to attend only to a phrase containing a specific call sign and to report both the color and the number of that phrase. Multi-talker conditions are created by presenting phrases with different call signs, numbers, and colors at the same time. When same-sex talkers are used in such conditions, scores can be relatively poor (approximately 60 % when target and interfering speech have the same level), but almost all errors are colors and/or numbers of the nontarget sentence (Brungart, 2001). This means that listeners have little trouble segregating the two phrases, but they find it difficult to group the words in the correct stream.
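To make the combinatorics of the phrase set concrete, the sketch below enumerates 8 call signs × 4 colors × 8 numbers = 256 CRM-style phrases; apart from "Baron," the call-sign and color lists are placeholders rather than the actual corpus items.

```python
from itertools import product

# Counts from the text: 8 call signs x 4 colors x 8 numbers = 256 phrases.
# "Baron" is the only call sign named in the text; the rest are placeholders.
call_signs = ["Baron"] + [f"CallSign{i}" for i in range(2, 9)]
colors = ["blue", "red", "green", "white"]        # illustrative 4-color set
numbers = range(1, 9)

phrases = [f"Ready {cs} go to {color} {number} now"
           for cs, color, number in product(call_signs, colors, numbers)]
print(len(phrases))   # 256
print(phrases[0])     # Ready Baron go to blue 1 now
```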

Another distinction introduced by Bregman (1990) is that between "primitive" and "schema-based" grouping. Primitive grouping is supposed to take place preattentively, acting in a "symmetric" way. It attempts to disentangle all superimposed sounds so that they are accessible for further processing, without "favoring" or selecting one specific sound. Schema-based grouping, on the other hand, is thought to rely on learned and/or effortful processes that make use of specific stored sound patterns. It also is thought to create a "figure-ground" distinction, which implies that it selects target information from other input, just as attention does. Bregman (1990), however, does not link attention directly to schema-based grouping. He indicates that such grouping could also take place preattentively, as long as it is based on learned schemata. Although the concept of primitive grouping seems useful, because it is linked to basic acoustic features that have been studied extensively, schema-based grouping is a more problematic notion, given that it is not easy to separate attentive from preattentive processing, and, especially in the case of speech, it is challenging to differentiate learned from innate schemata (e.g., Goldstone & Hendrickson, 2009). Furthermore, it is difficult to study because of the multitude of possible schemata and types of learned behavior that can be involved.

Given that several reviews of research on auditory grouping are available (Bregman, 1990; Darwin & Carlyon, 1995; Darwin, 1997; Darwin, 2008; Moore & Gockel, 2012), this overview focuses on research with speech that includes conditions in which target and interfering stimuli are presented simultaneously. First, two types of ("primitive") grouping cues will be considered that dominate the recent literature: those based on voice characteristics and those related to spatial separation of target and interfering sources. Other cues, e.g., those based on language, are discussed in the third subsection.

Grouping Based on Voice Characteristics

As already noted by Bronkhorst (2000), speech intelligibility is generally better when target and interfering speech are uttered by different-sex instead of same-sex talkers. Brungart (2001), for example, found that performance for the CRM task at negative SNRs differs by approximately 20 percentage points between these talker combinations. An even larger difference—approximately 40 percentage points—occurs when the same voice is used as target and interferer. Festen and Plomp (1990) also compared different-sex with same-talker interference using a sentence intelligibility task and observed an SRT difference of no less than 6–10 dB.

Darwin, Brungart, and Simpson (2003) have looked more closely at the voice characteristics associated with differences in talker gender. They considered how fundamental frequency (F0) and vocal tract length affect CRM task performance. These parameters can adequately model the difference between male and female speech in synthesized speech (Atal & Hanauer, 1971). Maximum F0 changes and vocal tract length ratios used in the study were 1 octave and 1.34, respectively, which cover the differences between natural male and female speech (Peterson & Barney, 1952). It appears that CRM scores increase monotonically as a function of both parameters, but that the increase is somewhat higher for F0 changes than for vocal tract length changes. Interestingly, the effect of using an actual different-sex talker is larger than the sum of the individual effects of F0 and vocal tract length, but around the same as their combined effect. In other words, changes of F0 and vocal tract length have, together, a superadditive influence on speech intelligibility. Note that Darwin et al. (2003) used natural fluctuations of the pitch contours of target and interfering speech, which were kept intact when the speech was resynthesized.


Somewhat different effects of F0—a larger increase of scores for small differences and a dip occurring at one octave—were found in earlier studies that used completely monotonous speech (Brokx & Nooteboom, 1982; Bird & Darwin, 1998), probably due to partial fusion of target and interfering sounds.

Given that F0 is such a strong grouping cue, it is somewhat surprising that differences in F0 contour have much less effect. Binns and Culling (2007) used target and interfering speech with normal, flat (monotonous), and inverted F0 contours and found an increase of the SRT of up to 4 dB when the F0 contour of the target speech was manipulated but no significant effect of manipulations of the interfering speech. Thus, whereas F0 contour appears to be important for intelligibility (as shown previously by Laures & Weismer, 1999), differences in F0 contour between target and interfering speech do not seem to improve segregation of concurrent speech signals.

The results of Brungart (2001) mentioned above already indicated that voice characteristics that allow simultaneous grouping do not necessarily provide enough information for sequential grouping. In line with this, Brungart, Simpson, Ericson, and Scott (2001) showed that providing a priori information about the target talker by using the same talker in a block of trials helped to prevent different-sex, but not same-sex, confusions (errors where the reported color and number in the CRM task were uttered by a different-sex or same-sex interferer, respectively). Apparently, such a priori information only cues the sex of the target talker.4 Furthermore, the modality used to present the information does not seem to matter. As shown by Helfer and Freyman (2009), who used a speech perception task similar to the CRM task, results did not depend on whether the target talker could be identified using a key word presented visually, or using an auditory "preview" of the target talker's voice. Interestingly, Johnsrude et al. (2013) recently showed that listeners are much better at suppressing same-sex confusions when the target or interfering talker is highly familiar (the listener's spouse).

Grouping based on spatial cues

In studying spatial grouping cues, the most direct approach is to simply present target and interfering speech stimuli from different spatial locations. Such experiments show that a relatively small spatial separation can already lead to efficient segregation. Brungart and Simpson (2007), for example, found that a separation of only 10° between two voices is already sufficient to maximize performance on the CRM task. However, because spatial separation is normally accompanied by changes in audibility, the true contribution of grouping cannot be determined in this way. One solution for this problem is to minimize masking by making sure that the frequency spectra of target and interfering sounds hardly overlap. Arbogast et al. (2002) realized this with sine-wave speech, created by filtering CRM sentences in multiple frequency bands, using the envelopes to modulate pure tones at the center frequencies of these bands, and subsequently summing together separate subsets of these tones to generate target and interfering signals. Such speech is perfectly intelligible after some training. Using a similar procedure, unintelligible sine-wave "noise" was generated with a frequency spectrum identical to that of the sine-wave speech. When target and interfering sounds were both presented from the front, the "noise" had much less effect on intelligibility than the speech interference did. This reflects the difficulty of grouping CRM phrases demonstrated earlier by Brungart (2001). The interesting finding was that this difference was reduced drastically (from more than 20 dB to approximately 7 dB in terms of SRTs) when a spatial separation of 90° was introduced. Thus, spatial separation can strongly reduce the influence of interfering speech even when audibility hardly changes.

Another way to separate audibility changes from effects of grouping was devised by Freyman and colleagues. They compared a baseline condition in which target and interfering speech came from the front (labeled F-F) with a special condition in which the interfering sound was presented both from the right side, at an angle of 60°, and, slightly delayed, from the front (the F-RF condition). Thus, they made use of the precedence effect (the dominance of the first-arriving sound in localization) to create a perceived spatial separation without reducing energetic masking. Freyman et al. (2001) showed that this change of perceived location caused a gain of approximately 8 dB for interfering speech (at 60 % intelligibility) but of only 1 dB for interfering noise. Interestingly, Freyman, Helfer, McCall, and Clifton (1999) also used an F-FR condition in which the two interfering signals were reversed; i.e., the delayed copy was presented from the spatially separated source. This generated a much smaller shift in perceived location but about the same gain with respect to the F-F condition. Spatial separation, thus, appears to be very effective in improving segregation of speech sounds.

Because spatial segregation introduces both ILD and ITD, it is of interest to look at the individual contributions of these cues. The general approach to studying this has been to present signals with one of these cues through headphones and then measure how segregation diminishes when other grouping cues are added. The "shadowing" studies conducted by Cherry (1953) and others are, in fact, early examples because they combined "infinite" ILD with various manipulations of target and/or interfering speech.5

4 As discussed below, this also can be interpreted as an attentional effect, namely that listeners find it difficult to focus sustained attention on voice characteristics.

5 In practice, the ILD cannot exceed the bone conduction limit, which is approximately 50 dB (Hood, 1957).


Treisman (1960), for example, asked listeners to shadow speech presented to one ear while sudden switches of target and nontarget speech were introduced. Listeners generally maintained their shadowing performance and only occasionally reproduced words from the nontarget ear after a switch. This demonstrates that grouping based on ILD is dominant but nevertheless can be superseded by grouping based on speech properties (i.e., voice characteristics as well as contextual cues). Cutting (1976) used a different paradigm in which 2-formant synthetic syllables were presented dichotically, and performance was determined for complementary signals (e.g., different formants) or conflicting signals (e.g., pairs of formants differing in their initial parts). Note that the common F0 and the temporal alignment of the left and right signals acted as strong grouping cues in these conditions. His results show perfect integration for complementary stimuli and 50–70 % integration for conflicting stimuli. Although this indicates that grouping is affected, it does not mean that the two sounds are always fused. In the study of Broadbent and Ladefoged (1957), who used similar (complementary) stimuli, only about half of their listeners indicated that they heard just one sound. Darwin and Hukin (2004) later showed that fusion, in fact, depends on the presence of overlapping spectral information at the left and right ears. When sharp filters without overlap are applied, listeners always report hearing two sounds.

Interestingly, ILD-based grouping can be much less robust when there are multiple interfering sounds. Brungart and Simpson (2002) found that when one target and one interfering talker are presented to one ear, performance decreases considerably (by up to 40 percentage points) when a second interfering talker is presented to the contralateral ear. Even more surprising is their finding that this decrease stays almost the same when the contralateral interfering speech is strongly attenuated (by up to 15 dB). Further research by Iyer, Brungart, and Simpson (2010) revealed that this "multi-talker penalty" only occurs under specific conditions: at least one of the interfering signals should be similar to the target signal (so that they are easily confused) and the overall SNR should be below 0 dB.

While it appears to be difficult to disrupt segregation based on (large) ILDs, several studies indicate that ITD-based segregation is less robust. Culling and Summerfield (1995) showed that artificial vowels consisting of two narrow noise bands can be easily segregated when they differ in ILD but not when they differ in ITD. Using a different paradigm that looked at the degree to which a single harmonic was integrated in the percept of a vowel, Hukin and Darwin (1995) also found that ITD causes much weaker segregation than ILD. These findings are remarkable, because ITD is known to be a potent spatial cue for sound localization (Wightman & Kistler, 1992) and speech perception in noise (Bronkhorst & Plomp, 1988). Results of other studies are, however, much less clear-cut.

Drennan, Gatehouse, and Lever (2003), for example, replicated the Culling and Summerfield experiments using natural ITDs and ILDs, derived from artificial-head recordings. They not only found that the majority of their listeners could now segregate the noise stimuli using ITD, but also that performance for ILD was only slightly better than that for ITD. An important factor appeared to be the inclusion of onset ITDs. When these were absent, so that only ongoing ITDs remained, performance decreased. Another study comparing ITD- and ILD-based segregation was conducted by Gallun et al. (2005). They used the sine-wave speech stimuli devised by Arbogast et al. (2002) and looked at performance differences occurring between a monotic baseline condition, in which target and interferer were presented to one ear, and dichotic conditions where the ILD or ITD of the interferer was varied. They, in essence, found that any interaural difference that generated a perceived lateralization shift also allowed segregation of target and interfering speech.

Further work on ITD-based grouping was conducted by Darwin and Hukin (1999), who studied segregation of words embedded in sentences. In their paradigm, listeners focused on one of two sentences presented simultaneously but with different ITDs, and their task was to select a target word within that sentence that coincided with a competing word in the other sentence. Care was taken to ensure that these words could not be identified using other (semantic, prosodic, or coarticulatory) cues. Performance not only was high (>90 % for ITDs larger than 90 μs) but also did not change when F0 difference was pitted against ITD (i.e., sentences had different ITDs and F0s but the target word had the same F0 as the competing sentence). A follow-up study by Darwin and Hukin (2000) that used synthesized sentences with natural prosody showed that grouping based on ITD will only break down when prosody, F0 difference, and a (large) vocal-tract difference are all working against it. ITD, thus, appears to be a quite strong cue for segregation, provided that streaming can build up and natural onset cues are included in the stimuli.

In real-life conditions, cues are mostly working together and not against each other, and it is of interest to determine how effectively cues are combined. Culling, Hawley, and Litovsky (2004) measured intelligibility for target speech presented from the front and three-talker interference presented from various combinations of source positions, and processed their stimuli such that they contained only ITD or ILD, or both ITD and ILD. They found that recognition performance was always better for the combination than for the individual cues, but that the improvement was sub-additive. A different cue combination—frequency and spatial location—was studied by Du et al. (2011). They used synthetic vowels with either the same F0 or a small F0 difference that were presented from the front or from locations at ±45°. Additive effects of the two cues were found both in the behavioral scores and in MEG responses measured simultaneously.


The latter results are consistent with earlier electrophysiological studies showing additivity of responses to combinations of frequency and location (Schröger, 1995) and frequency and intensity (Paavilainen et al., 2001). Such additivity not only indicates that the auditory system can integrate these cues effectively but also suggests that they are processed in separate brain regions (McLachlan & Wilson, 2010).

Grouping based on other cues

Several studies mentioned above have used low-level grouping cues, such as harmonicity and temporal alignment of speech items, to counteract segregation introduced by spatial cues. Although these cues are clearly potent, I will not consider them further, because they are less representative of real-life listening conditions and because in-depth reviews of this work are already available (Darwin, 2008; Moore & Gockel, 2012). Many other cues are present in speech that probably also affect grouping. Examples are speaking style (e.g., timing, stress pattern), timbre, linguistic variability (e.g., native vs. nonnative speech), and various types of contextual (e.g., semantic, syntactic, and coarticulatory) information. While a lot is known about their effects on intelligibility, as discussed in the section "How special is speech?", we know surprisingly little about how they affect grouping. This is probably because specific measures were taken in most studies to remove or control these cues in order to make the effects of other cues more salient. Freyman and colleagues, for example, used meaningless sentences with target words at fixed locations, which provide no semantic or syntactic information. Also in the popular CRM task, listeners cannot benefit from such information, nor from differences in timing or stress pattern, because all phrases have the same structure and target words are drawn randomly from fixed sets. Nevertheless, some relevant results are available that were obtained by manipulating these cues in the interfering instead of the target speech.

Freyman et al. (2001) performed experiments where the interfering speech was spoken by native or nonnative talkers, time-reversed, or spoken in a foreign language, unknown to the listeners. The precedence-effect-based paradigm discussed above was used; all talkers were female. For all types of interference, a clear difference was found between results for the F-RF and F-F conditions, demonstrating that grouping was always facilitated by the introduction of a (perceived) spatial separation (Fig. 1a). The difference was, however, much larger for the forward native speech than for the other types of speech. This indicates that grouping of multiple speech sounds is affected by any type of interfering speech, irrespective of its intelligibility, but suffers in particular from normal speech uttered by native talkers. Iyer et al. (2010) performed somewhat similar experiments with the CRM task, using not only interfering CRM phrases but also other native, foreign, or time-reversed interfering speech. Their results are summarized in Fig. 1b. As expected, performance was worst for the CRM interferers, which have the same structure and timing as the target phrases and thus maximize confusions. Performance was intermediate for interfering normal or time-reversed English speech and relatively good for foreign speech. Although these results are largely consistent with those of Freyman et al. (2001), it is surprising that no difference occurred between normal and time-reversed speech. Perhaps this is an artifact caused by the use of CRM sentences, which could induce listeners to just focus on target words and dismiss semantic information altogether.

Fig. 1 Measures of "informational masking" derived from data collected by Freyman et al. (2001; panel a) and Iyer et al. (2010; panel b) in conditions where target speech was presented together with different types of two-talker interference. Shown are differences between speech perception scores, averaged over SNRs of −8, −4, and 0 dB, for reference and test conditions. a Scores for test conditions using F-RF presentation from which scores for reference conditions, using F-F presentation of the same sounds, were subtracted. b Data obtained by subtracting scores for test conditions with speech interference from a reference condition with two modulated noise signals.

Summary of Research into Grouping

The difficult task of a listener to disentangle target from interfering speech is facilitated by the presence of different types of grouping cues. Research shows that voice characteristics contain effective cues and that in particular female and male voices are easily segregated. It can be surprisingly difficult to track voice characteristics over time, which means that perfect segregation is not always accompanied by perfect streaming. Study of individual cues reveals that F0 and vocal tract differences both contribute to grouping and that they have a superadditive effect when combined. Differences in intonation (F0 contour), however, do not seem to act as a grouping cue. Other important cues are the ILDs and ITDs occurring when different talkers occupy different positions in space. A large ILD acts as an extremely strong grouping cue, which can only be made ineffective by pitting a combination of other grouping cues against it. While ITD was found to be ineffective for segregation of certain artificial stimuli, it is around as effective as ILD when more natural stimuli are used. Given that the combination of ILD and ITD has an even stronger effect than each of the individual cues, it is not surprising that optimal segregation can already occur for a small spatial separation. When spatial separation is combined with other cues such as F0, additive effects on performance are found, suggesting independent processing of these cues by the auditory system.

Remarkably little is known about other grouping cues, related to, e.g., language, pronunciation, and redundancy. Studies in which the interfering speech was manipulated show that segregation is easier for nonnative or foreign than for native interfering speech. Segregation is poorest when target and interfering speech have the same syntactic structure. This indicates that linguistic and syntactic differences facilitate grouping. Comparisons of the effects of normal and reversed interfering speech yield equivocal results, which means that it is not yet clear whether semantic differences support grouping as well.

Role of attention in selecting speech

Attention is currently mostly defined from an information-processing point of view, stressing the need for selection of information by a system that is capacity limited and largely serial in its central processing stages and in response selection and execution (Pashler, 1998). Attention is steered both by bottom-up sensory information (e.g., attentional capture by a novel stimulus) and by top-down processes (e.g., endogenous focusing on a specific spatial location). This implies that it is difficult to control attention experimentally, not only because one never knows whether subjects follow instructions telling them to focus their attention in a certain way but also because it is hard to prevent bottom-up capture. Shifts of attention also may be difficult to detect, because they can occur quite rapidly: within 100–200 ms (Spence & Driver, 1997).

Cherry’s (1953) interest in the cocktail-party problem resulted in a novel paradigm, in which listeners are asked to shadow speech presented to one ear while ignoring signals presented to the other ear. This paradigm is important, because it provided most, if not all, material on which the early theories of attention were based. These theories all state that attention operates as a filter that selects part of the incoming auditory information, but they differ in their predictions of which unattended information is processed. While the "early selection" theory states that only low-level signal features are processed preattentively (Broadbent, 1958), the "late selection" theory claims that all input is processed up to a semantic level (Deutsch & Deutsch, 1963). An intermediate view is provided by the "attenuation" theory of Treisman (1964), which proposes that unattended information is indeed processed up to a high (semantic) level, but with reduced resources and thus more slowly. Although some clever experiments were performed in the attempts to prove one or the other theory (Moray, 1959; Treisman, 1960), it appears that this research, and the modeling based on it, suffers from two basic flaws. One flaw is that possible shifts in attention to the nontarget ear were at best poorly controlled, and their possible effects were not monitored (Holender, 1986). The other flaw is that, while the use of the term "filter" implies that sensitivity is measured as a function of a certain independent variable, this variable is not made explicit, nor manipulated.

Is Attention Required for Speech Processing?

In order to shed more light on the results of the classic shadowing studies, Cowan, Wood, and colleagues replicated several experiments, analyzing in detail the shadowing output and its timing (Cowan & Wood, 1997). When they repeated Moray’s (1959) classic experiment, in which the listener’s own name was unexpectedly presented to the nontarget ear, Wood and Cowan (1995a) reproduced the finding that about one third of the listeners recalled hearing their name afterwards, but also showed that the shadowing errors and/or response lags of these listeners increased significantly in the seconds following the occurrence of their name. No decrease in performance was found for listeners who did not notice their name or were presented with other names. Similar results were observed for listeners presented with a fragment of time-reversed speech embedded in normal speech (Wood & Cowan, 1995b). While this indicates that attention switches do occur and that attentional resources are required for consolidation of information in long-term memory (LTM), it also suggests that some preattentive semantic processing of speech must be going on to trigger the switches. Evidence for such preattentive processing also was provided by Rivenez, Darwin, and Guillaume (2006), who studied dichotic priming with a paradigm that combines high presentation rates (2 words/s) with the addition of a secondary task to discourage attention switches to the nontarget ear. Despite these measures, they found a clear priming effect: response times to a target word were lowered when the same word was presented directly before it to the nontarget ear.

Even more compelling evidence for preattentive processing emerges from recent mismatch-negativity (MMN) studies. The MMN is an event-related potential (ERP) component that occurs when infrequent "deviant" stimuli are inserted in a series of "standard" stimuli (Näätänen, Gaillard, & Mäntysalo, 1978). It is resistant to influences of attention and is therefore thought to reflect a preattentive process that compares the current auditory input with a memory trace that encodes the regularity of previous input. In their review of MMN work on language processing, Pulvermüller and Shtyrov (2006) conclude that the MMN is sensitive to different types of manipulations, representing separate levels of language processing. MMNs are for example found in response to (1) pseudowords versus words (lexical level), (2) action words versus abstract words (semantic level), and (3) words in grammatically correct sentences versus words in ungrammatical strings (syntactic level). Because MMN data also depend on physical stimulus differences, Pulvermüller and Shtyrov (2006) only consider studies where such effects are prevented. However, as these authors note themselves, another critical aspect of most MMN work involving speech stimuli is that attention is only loosely controlled, so that it is not clear to what degree results are modulated by attention. An exception is a study by Pulvermüller, Shtyrov, Hasting, and Carlyon (2008), who studied MMN responses to syntactic violations in speech presented to one ear, while subjects had to detect oddball stimuli among standard tones presented to the other ear (and watched a silent video as well). In the control condition, they were presented with the same stimuli but did not perform the (attentionally demanding) task. The results show that the early part of the MMN (<150 ms) is immune to the manipulation of attention, which indicates that there is indeed preattentive processing of speech up to a syntactic level.
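As a concrete illustration of the oddball structure underlying the MMN, the Python sketch below generates a pseudo-random trial sequence in which infrequent deviants are embedded among standards. The 10 % deviant probability and the constraint that two deviants never occur in direct succession are illustrative assumptions, not parameters of the studies discussed here.

import random

def oddball_sequence(n_trials=500, p_deviant=0.1, seed=1):
    # Label each trial 'standard' or 'deviant', avoiding two deviants in a row,
    # so that every deviant is preceded by at least one standard.
    rng = random.Random(seed)
    seq = []
    for _ in range(n_trials):
        if seq and seq[-1] == "deviant":
            seq.append("standard")
        else:
            seq.append("deviant" if rng.random() < p_deviant else "standard")
    return seq

seq = oddball_sequence()
print(seq[:15])
print("deviant proportion:", seq.count("deviant") / len(seq))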

Attending to a target voice

What are the sound features that help us attend to a target voice? This question was already answered partly in the previous sections, because any stimulus that can be recognized in behavioral speech segregation experiments first must have been selected. This means that voice differences and spatial cues also are effective cues for selection. However, we can only learn more about selection itself when paradigms are used in which attention is manipulated. Implicit manipulation took place in a number of studies that were already discussed. Brungart et al. (2001) included conditions where the target voice was either varied across trials or kept constant, so that listeners either had to refocus attention on (the voice of) each individual call sign or could maintain attention on the same voice all the time. The manipulation just suppressed different-sex confusions, which indicates that sustained attention can only be focused on relatively coarse voice characteristics. In another study based on the CRM task, Kidd, Arbogast, Mason, and Gallun (2005) manipulated information about the target sentence and the target location. They used a paradigm in which three phrases (spoken by male talkers) were presented simultaneously from equidistant sources on a 120°-arc in front of the listeners. In a block, the call sign was cued either before or after presentation of the phrases, and the probability of the target location had a constant value between 0.33 (chance) and 1.0. It appeared that knowing the location yielded maximal performance, also when the call sign was provided afterwards, but that knowing the call sign while being uncertain of the location caused a reduction of 10–40 percentage points. These results confirm the difficulty listeners have in keeping track of a target voice, and also demonstrate that spatial location provides relatively strong cues.

Several studies on spatial auditory attention have used nonspeech (and nonsimultaneous) stimuli but are nevertheless interesting, because they combine behavioral and electrophysiological measures. Teder-Sälejärvi and Hillyard (1998), for example, used a spatial attention task in which noise bursts were presented from 7 regularly spaced loudspeakers spanning a 54°-arc in front of the listeners. Standard stimuli as well as (infrequent) target stimuli with a different frequency bandwidth were generated by all loudspeakers, but the listeners were required to respond only to targets delivered by one specific loudspeaker. Behavioral error rates were low, indicating that attention had a relatively narrow spatial focus (within ±9°, which is consistent with the spatial resolution found in the abovementioned study of Brungart and Simpson, 2007). Interestingly, ERPs recorded simultaneously revealed a similar spatial tuning at latencies around 300 ms poststimulus but a broader tuning for earlier latencies. This suggests that the spatial tuning of attention operates in different stages with increasing sharpness. Similar results were obtained by Teder-Sälejärvi, Hillyard, Röder, and Neville (1999), who also found that spatial tuning for a source at a horizontal angle of 90° was around twice as broad as for a location in front. This is consistent with the fact that sound localization accuracy decreases for more lateral source positions (Middlebrooks & Green, 1991).

Hink and Hillyard (1976) developed a paradigm that enabled them to measure ERPs using speech stimuli. They asked listeners to attend to one of two stories presented simultaneously to the left and right ears and used synthesized phonemes, mixed with the stories, as probe stimuli. Although ERPs were smaller than when the probes were presented alone, a clear effect of attention was observed: N1 responses were significantly higher for probes coming from the target side. Recently, Lambrecht, Spring, and Münte (2011) used the same paradigm for stimuli presented in a (simulated) free field. Listeners had to attend to one of two concurrent stories emanating from sources at horizontal angles of ±15°, and ERPs were elicited by probe syllables cut out from the stories. The probes were presented either from the same locations as the stories or from locations at more lateral angles (±30° or ±45°). An increased negativity was found for probes coming from the target location compared with those coming from the mirrored location, but latencies were higher (>300 ms) than those observed by Hink and Hillyard (1976). The results are consistent with those of other studies using similar paradigms, which also revealed late attention-related responses for dichotic (Power, Foxe, Forde, Reilly, & Lalor, 2012) and free-field (Nager, Dethlefsen, & Münte, 2008) speech presentation. The deviant results of Hink and Hillyard (1976) might be caused by their use of synthetic probe stimuli that do not require speech-specific processing and can draw attention based on simple acoustic differences. Interestingly, Lambrecht et al. (2011) found that probes coming from more lateral positions yielded (strongly) increased negativity, but only for the 45°-angle at the target side. The authors attributed this to a reorienting response, indicating that attention was relocated to the original task after momentary distraction by the probe.

Exogenous attention to speech

The results of Lambrecht et al. (2011) and also the classic finding that occurrence of one’s own name in nontarget sounds affects performance on a primary task (Moray, 1959; Wood & Cowan, 1995a) are examples of findings that can be explained by exogenous shifts of attention. Another example is the irrelevant sound effect—the finding that performance on (visual) memory tasks is impaired when task-irrelevant sounds are presented during encoding or retention of to-be-remembered items (Colle & Welsh, 1976; Neath, 2000). It has been proposed that this deficit is primarily due to involuntary shifts of attention away from the primary task (Cowan, 1999; Bell, Röer, Dentale, & Buchner, 2012), although there also are alternative explanations, suggesting a direct interference with memory processes (Jones, 1993). Given that the effect mainly concerns the functioning of (visual) memory, this research will not be discussed further in this review. Other research into exogenous auditory attention has borrowed tasks from visual research but has used nonspeech stimuli. Spence and Driver (1994), for example, developed an auditory version of the classic spatial cueing task (Posner & Cohen, 1984) and showed that presenting an auditory cue just before a target sound reduced response times when the cue came from the target, instead of the mirrored side. Dalton and Lavie (2004) developed a paradigm based on Theeuwes’s (1992) attentional capture task, in which a target tone with deviating frequency (or level) had to be detected in a tone sequence that also could contain an irrelevant tone with deviating level (or frequency). They found that the irrelevant singletons impaired performance, indicating that they indeed captured attention. In a recent study by Reich et al. (2013), both nonspeech and speech stimuli were used in a paradigm where participants responded to the second syllable of a spondee or the duration of a tone, while irrelevant deviations were introduced by changing the first syllable of the spondee or its pitch, or the frequency of the tone. All deviations increased behavioral error rates and response times, and elicited P3a components in the ERPs, consistent with the occurrence of involuntary shifts of attention.

More insight into the relationship between speech processing and exogenous attention is provided by a series of experiments conducted by Parmentier and colleagues (Parmentier, 2008, 2013; Parmentier, Elford, Escera, Andrés, & SanMiguel, 2008; Parmentier, Turner, & Perez, 2014). These researchers used different cross-modal tasks to measure effects of irrelevant auditory stimuli on visual identification tasks. The paradigms are based on earlier studies employing nonspeech stimuli (Escera, Alho, Winkler, & Näätänen, 1998). Participants were asked to classify digits or identify the direction of an arrow. These visual stimuli were preceded by standard auditory stimuli (mostly tones) or by infrequent deviant stimuli that could be noise bursts, environmental sounds, or utterances that were congruent or incongruent with the visual task (the words "left" or "right"). Stimulus-onset asynchronies (SOAs) varied between 100 and 350 ms. Several interesting results were found. The basic finding, replicated in all studies, is that deviant irrelevant stimuli reduce performance, as reflected by increased response times and decreased accuracy. As shown in the first study (Parmentier et al., 2008), this reduction is unaffected by visual task difficulty but it disappears when an irrelevant visual stimulus is presented just after the deviant auditory stimulus at the location of the subsequent target. This suggests that attention is indeed captured by the deviant stimulus, but can be refocused quickly on the target position. Follow-up experiments (Parmentier, 2008) not only revealed that incongruent deviants (e.g., the word "left" presented before an arrow pointing to the right) disrupt performance more than congruent deviants, but also that the difference between the two (semantic effect) is independent of whether standards are acoustically similar to or different from the deviants (novelty effect). As shown by Parmentier et al. (2014), this semantic effect decreases substantially when the visual target has varying SOA and 50 % probability of occurrence, indicating that the degree to which the deviants are semantically processed depends on how reliably occurrence of the target is predicted by the auditory stimulus. Taken together, the results indicate that deviant auditory stimuli incur exogenous shifts of attention but that further processing of these stimuli depends on their relevance for the task at hand, and, thus, on endogenous factors such as task goal and listening strategy.

Attending to Changes in Talker Voice and Location

In real-life listening we often switch our attention between sounds and/or source locations. Several studies have investigated this by introducing target uncertainty, thus discouraging listeners from maintaining focus on a single sound source. In the study by Kidd et al. (2005), discussed above, listeners performed a CRM task while uncertainty in target identity (knowledge of the call sign) and/or location was manipulated. Ericson, Brungart, and Simpson (2004) performed a similar study in which listeners always knew the call sign but where uncertainty in the target voice and/or location was introduced. These studies in particular demonstrate how the different cues affect performance: knowing only target identity or target voice yields lower scores than knowing the location, and knowing combinations of cues always improves scores. They, however, provide no information on actual switch costs, because it is unknown whether and how listeners refocus their attention in these conditions. A more direct study of the dynamics of attention was conducted by Koch, Lawo, Fels, and Vorländer (2011), who measured how switches of the target talker affected response times for an auditory classification task. They used dichotic speech stimuli, consisting of number words spoken by a female and a male talker, and provided listeners with a visual cue signaling the target talker’s gender. Using a paradigm that enabled them to separate visual from auditory switch costs, they found that the latter were around 100 ms, independent of cue SOA (which was either 100 or 1000 ms). However, because the target ear was randomized in the experiments, these costs actually represent a combination of switching between voices and (in 50 % of trials) switching between ears.

Best, Ozmeral, Kopčo, and Shinn-Cunningham (2008) and Best, Shinn-Cunningham, Ozmeral, and Kopčo (2010) conducted a series of experiments that provide the most detailed insight into effects of switches of target voice and location. They used 5 evenly spaced loudspeakers on a 60°-arc in front of the listeners. Target and interfering stimuli were sequences of four digits uttered by male talkers. The target, which was always presented together with four (different) interferers, could come from a fixed loudspeaker, or consecutively from different loudspeakers. Listeners had to repeat the 4 consecutive target digits in order. Surprisingly, the results showed that the reduction in scores resulting from location changes was resistant to all kinds of measures designed to aid refocusing of attention. These included cueing the location with lights, either at the moment of change or in advance, using only changes to the nearest loudspeakers, and using repeated sequences of locations (Fig. 2a). Another surprising finding was that the beneficial effect of keeping the target voice the same almost disappeared when the target location started shifting (Fig. 2b). Moving the target around thus disrupts the advantage of using voice identity as cue, which suggests that a position shift causes a sort of "reset" of the system.

Best et al. (2008, 2010) also provided insight into the time constants involved in tracking shifts in talker and location. They found that the cost of switching location decreased when the interstimulus delay was increased to 1000 ms, but not to zero. Furthermore, they observed that performance for a fixed target location improved over time, irrespective of any switches in talker identity. Even for the 1000-ms delay (sequence duration of approximately 5 s), the improvement was still considerable (approximately 10 percentage points). Brungart and Simpson (2007), who conducted an experiment with two to four talkers using sequences of CRM trials with varying spatial uncertainty, found a similar pattern of results that spanned an even longer time period. The improvement they found as a result of keeping the target location fixed continued up to the 30th trial (a sequence duration of approximately 160 s). Note, however, that there was uncertainty about the location of the target talker in this experiment, so that the slow improvement occurring when the target location remained the same for some time may also be due to the fact that the listeners gradually changed their listening strategy to one optimized to a nonmoving talker.

Summary of research into attention

Whereas research into auditory attention stagnated for some time after the early shadowing studies approximately 50 years ago, an increasing number of studies is now being conducted, using various behavioral paradigms as well as electrophysiological measures. It appears that attention is very versatile. It is not just able to suppress speech presented to one ear when we are listening to the other ear—it also can focus on a relatively narrow spatial region around a target voice (within ±10°) or on characteristics of a target voice mixed with interfering speech. These features are not equally effective: listeners find it easier to sustain focus on a location than to keep track of a voice, even if a relatively short message is uttered.

The most surprising trick that attention can perform is that it can base selection not just on basic features, such as location, but also on sophisticated semantic cues, which are processed preattentively. Indications for such processing already emerged from many behavioral studies, but it was not until recently that convincing proof was provided by MMN measures. Such processing is particularly useful for exogenous triggering of attention: it enables us, for example, to pay attention to someone suddenly mentioning our name. We would expect that capacity limitations must restrict how much unattended speech can be processed, and recent experiments by Parmentier et al. (2014), indeed, provide evidence for this. The depth of processing seems to be related to the (task-)relevance of the sound.

The weak spot of auditory attention seems to be that it is remarkably sluggish. That sudden changes in talker identity are penalized is perhaps not too surprising, given that this is unusual and the system has to attune to subtle cues in individual voices. A more unexpected finding is that changes in talker location seem to "reset" this tuning process: speech perception is no better when one can follow a single talker successively moving to different positions in space, than when different voices are presented from each of these locations, so that changes in location are always coupled with changes in talker identity. Furthermore, optimal tuning to location and voice takes a pretty long time. Perhaps this is the price that has to be paid for superb selectivity.

Conceptual Model of Early Speech Processing

This section describes a conceptual model of preattentive speech processing and auditory attention that aims to incorporate most results reviewed above. It builds upon earlier models, in particular on the model of conscious and unconscious auditory processing published recently by Näätänen et al. (2011) and, to a lesser degree, on the generic sensory-information-processing model of Cowan (1988). Because it is primarily intended as a high-level model integrating behavioral findings, it does not attempt to include elements of neurobiological models (such as that developed by McLachlan and Wilson, 2010). Also, because of the focus on early processing, it does not consider results of psycholinguistic research that addresses higher-level speech processing.

Fig. 2 Data from experiments of Best et al. (2008) and Best et al. (2010), in which one target and four interfering strings of digits were presented from different loudspeakers placed in an arc in front of the listener. a Increases in scores occurring when target digits are presented from a fixed location, instead of from "restricted" locations changing at most one loudspeaker position at a time, or from random locations. The changes could be cued with lights either at the time of change or in advance. The condition "predictable locations" employed the same sequence of locations throughout a block. b Increases are shown occurring when the string of target digits was spoken by a single voice, instead of different voices for each digit. The "simultaneous cue" condition in this case used random locations. All results are for an inter-digit delay of 250 ms, except those for the "predictable locations" condition, to which a correction factor was applied because they were only measured for a delay of 0 ms.


Structure and Operation of the Model

As shown in Fig. 3, the model contains processing stages, storages, and operations linked to attention, which have been given different shadings. The processing stages are (1) peripheral and binaural processing, (2) primitive grouping, (3) further grouping and speech-specific processing (such as processing of lexical information), (4) selection, and (5) attentional control. There are storages for (a) unmasked signals and spatial properties, (b) transients, (c) primitive features, (d) preattentive sensory information, and (e) short-term and long-term information, including thresholds, sets, and traces used for triggering attention. The model assumes that processing is largely feedforward but also is influenced by a feedback loop, governed by attention, inducing selective enhancement of signals being processed. This is indicated in Fig. 3 by the dotted triangle and by the arrow linking it to attentional control. The enhancement operates on signals that have triggered attention, but extends to stages before the one at which the trigger occurred. Support for this comes from EEG and imaging studies showing effects of attention on neural processing in the nonprimary (Ahveninen, Hämäläinen, Jääskeläinen, Ahlfors, Huang, et al., 2011) and primary (Woldorff et al., 1993) cortex, and from evidence of selective processing already taking place at the peripheral level. Scharf, Quigley, Peachey, and Reeves (1987; see also Scharf, 1998), for example, studied the effect of attention on detection of tones in noise and found that scores increased strongly when the tone frequency was close to or equal to an expected frequency. More recently, Allen, Alais, and Carlile (2009) showed that binaural unmasking also is influenced by attention: they found that the spatial release from masking of speech stimuli disappeared when the target speech was presented from an unexpected location. Given that attention controls enhancement of preattentive signals, the selection stage itself can be seen as a relatively simple stage that just passes the strongest of the competing input signals.6

According to the model, attention can be triggered at multiple levels: two early levels, enabling "fast" selection, and a late level, used for "slow" selection. In each case, the trigger comes from a comparison between incoming information and information present in memory. "Fast" bottom-up selection is based on basic signal properties such as sound level or fundamental frequency. Attention is drawn when a transient—for example a sudden loud sound—exceeds the corresponding threshold. "Fast" top-down selection is based on primitive features, such as sound level, interaural differences, F0, and spectral envelope. These are compared to an attentional set determined by the task and goals of the listener. When listening to a talker at a certain spatial location, this set would contain the corresponding ITDs and ILDs. Focusing on a female voice among male voices would require comparison with an F0 range and with templates of spectral envelopes. "Slow" bottom-up selection occurs when a more complex deviation from regularity than a transient is detected. This is represented in the model by the comparison between a stored memory trace and a novel trace in sensory memory. It also occurs in the classic case when a listener recognizes one’s own name in unattended speech. Such speech is not enhanced and is therefore processed more slowly and with more risk of decay. However, when the speech item reaches the preattentive sensory memory and is compared with a generic attentional set containing relevant speech items (such as one’s own name), attention can nevertheless be drawn. A "slow" top-down route is the comparison of complex speech items that have passed through grouping and speech-specific processing, with items in an attentional set. This route is, for example, followed when one is listening for the occurrence of a specific message in a mixture of voices.
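The multiple routes by which attention can be triggered in the model can be summarized schematically. The Python sketch below is only a structural paraphrase of this part of Fig. 3: the feature names, the transient threshold, and the matching rules are placeholder assumptions chosen for illustration, not quantities specified by the model.

from dataclasses import dataclass, field

@dataclass
class AttentionalSet:
    # Top-down set: a primitive feature range for the "fast" route and complex
    # items (e.g., one's own name) for the "slow" route.
    f0_range_hz: tuple = (180.0, 260.0)
    items: set = field(default_factory=set)

def attention_trigger(segment, aset, transient_threshold_db=15.0):
    # Return which route, if any, triggers attention for one grouped segment.
    # Fast bottom-up: a sufficiently large transient draws attention.
    if segment.get("transient_db", 0.0) >= transient_threshold_db:
        return "fast bottom-up (transient exceeds threshold)"
    # Fast top-down: a primitive feature (here F0) matches the attentional set.
    f0 = segment.get("f0_hz")
    if f0 is not None and aset.f0_range_hz[0] <= f0 <= aset.f0_range_hz[1]:
        return "fast top-down (primitive feature matches set)"
    # Slow route: a speech item reaching preattentive sensory memory matches the set.
    if segment.get("lexical") in aset.items:
        return "slow (complex item matches set in sensory memory)"
    return None  # no trigger; the segment is not selected

aset = AttentionalSet(items={"own name"})
print(attention_trigger({"transient_db": 20.0}, aset))
print(attention_trigger({"f0_hz": 210.0}, aset))
print(attention_trigger({"f0_hz": 120.0, "lexical": "own name"}, aset))

In the full model the triggering signal would, in addition, gate the selective enhancement that is fed back to earlier processing stages; that feedback loop is not represented in this sketch.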

The model assumes a relatively straightforward interaction between attention and grouping. It supposes that all grouping occurs at a preattentive level (i.e., before selection takes place), which means that it acts on all sensory input, without requiring conscious effort, and reuses Bregman’s (1990) concept of primitive grouping. In the primitive grouping stage, signals are organized along basic features that can subsequently be used for comparison with an attentional set. The following grouping stage can be based on more complex speech properties, because it is combined with higher-level (lexical, semantic, and syntactic) speech processing. Because the model supposes that selective enhancement is largest at stages following the one that triggered attention (as illustrated by the shape of the dotted triangle), there is more bias towards the enhanced signals in higher-level than in primitive grouping. This corresponds with Bregman’s (1990) proposition that primitive grouping operates similarly on all input, while later grouping differentiates "foreground" from "background." The model, however, also deviates from Bregman’s (1990) views because it does not include effortful/conscious grouping. This was done not only because of lack of experimental evidence but also because the range of properties that could underlie such grouping is almost limitless.

Fig. 3 Conceptual model of early speech processing. After peripheral and binaural processing, transients can already trigger attention. Primitive grouping (e.g., based on spatial location or F0) represents a subsequent stage, allowing efficient selection. More sophisticated features, such as syntactic and semantic information, are processed at a higher level and enable selection based on complex information. An important element of the model is a feedback loop, initiated by attentional control, inducing enhancement of to-be-selected input. See the text for more details.

6 Note that this is a unisensory model and that competition between nonauditory and auditory input is not included.

The model includes peripheral and binaural processing as initial stages, which may be seen as placeholders for quantitative models such as those presented above. There are, however, some issues that prevent such a simple merge. One is that it is not certain to which degree the psychoacoustic models represent auditory processing in a neurophysiologically plausible way, at least not beyond the cochlear level (for which most use commonly accepted representations of critical-band filtering). A solution for this could emerge from recent research into the relationship between speech perception and neuronal oscillations (Zion Golumbic et al., 2013). This research focuses on the role of envelope fluctuations and thus might provide a basis for envelope-based models such as that of Jørgensen et al. (2013). Another issue is that the models require knowledge of the target speech signal—its level, frequency spectrum, and perhaps also modulation spectrum—which presents a logical problem in the current model because the target speech signal only emerges in the later grouping stages. There are possible solutions for this problem—all potential target signals may be processed in parallel, or the target speech may be "highlighted" by the selective enhancement initiated by attention, but this is all rather speculative. Clearly, further research is required to shed more light on these issues.

Relationship with earlier models

As mentioned above, the model is inspired mainly by that developed by Näätänen et al. (2011), which is based on a large body of electrophysiological work, primarily on the MMN and the N1 (Näätänen & Picton, 1987). Many elements of that model were incorporated, but there are several significant discrepancies. An obvious one is that Näätänen et al. (2011) do not include stages for peripheral processing, binaural processing, and grouping. This is understandable, because their model is based on ERP data that are normally obtained with sequential stimuli, eliciting little or no (un)masking and requiring no simultaneous grouping. However, given that Näätänen et al. (2011) also discuss conditions with multiple concurrent auditory streams, this seems to be an omission. Although they also do not address sequential grouping explicitly, this must be implicitly included in the feature detectors and temporal feature recognizers that are part of their model. Another difference between the two models is that Näätänen et al. (2011) view conscious perception as a process acting on items already stored in sensory memory and not as the result of a sequential filtering operation. In that sense, their model resembles that of Cowan (1988), discussed below. This difference is, however, not as fundamental as it seems, because the selection process postulated in the current model also can be interpreted as a parallel operation acting on short-term storage (STS); i.e., the preattentive sensory memory can be seen as the part of STS outside the focus of attention. A final distinction between the current model and that of Näätänen et al. (2011) is that the latter directly links most interactions between building blocks with observable ERPs, which makes it more explicit in terms of timing and information flow in the brain. Although most of these links may well be applicable to the current model as well, given the similarity of the two models, they cannot be as specific, because conditions with multiple simultaneous speech signals have hardly been addressed in ERP studies.

Another relevant model is that of Cowan (1988). Although it at first sight seems very different, because it focuses on the role of the memory system while the current model highlights subsequent processing stages, there are several similarities. First, the distinction made by Cowan (1988) between a brief sensory store (up to several hundreds of ms) and a long-term store is reflected in the current model. The latter corresponds to STS; the former also is included but is divided into several preattentive stores with similar short durations. The sensory memory and memory trace involved in MMN generation have, for example, durations of up to 300 ms (Näätänen et al., 2011). Second, Cowan (1988) postulates multiple ways in which attention can be triggered. The occurrence of "gross physical changes in the repeated pattern" is one way, corresponding to bottom-up triggering of attention in the current model. "Voluntary attentional focus" is another way, corresponding to top-down attention modeled here. Third, Cowan (1988) assumes that "perceptual processing occurs independently of attentive processes," which is consistent with the occurrence of preattentive processing postulated in the current model. There are, however, also differences. Cowan (1988) deliberately bases his model on parallel information processing instead of using a linear time-sequence-based approach. In line with this, he does not interpret selection as a processing stage that either passes through or rejects input, but as a focus placed on items already present in STS. Also, STS itself is not seen as a separate store but as that part of LTM that is activated. As discussed above, this is probably not an essential difference with the current model. Other differences between the current model and that of Cowan (1988) are related to differences in scope and focus between both models and do not reflect fundamental discrepancies. Cowan’s (1988) model, as mentioned earlier, mainly addresses the role of the memory system while the current model focuses on preattentive processing and the operation of attentional control.

Recommendations for Future Research

Given that the main results of this review have already been summarized in subsections and brought together in a model, I will not present a further overview here but instead focus on recommendations for future research.

1. I think there is an evident need for quantitative models. This review only addresses modeling efforts at the level of peripheral and binaural processing. It appears that a comprehensive model for this level can be derived relatively easily from published models, but that the neurophysiological basis of these models is incomplete. In fact, the only part of most models that has a sound basis is the representation of peripheral critical-band filtering. Quantitative models for the grouping stages have emerged from the work on Computational Auditory Scene Analysis (CASA; see Wang & Brown, 2006). However, because this research mainly aims to optimize separation and analysis of multiple sounds, it is unclear to what degree it generates plausible models of the auditory system. Computational models of attention have mainly been developed for the visual modality (Navalpakkam & Itti, 2005), but some auditory models have been published as well, for example that of Kalinli and Narayanan (2009), which combines acoustic features with lexical and syntactic information to identify stressed syllables in a radio news corpus. Recently, a model covering both grouping and attention was developed by Lutfi, Gilbertson, Heo, Chang, and Stamas (2013). It uses a single factor, derived from statistical differences between sound features (such as F0 and angle of incidence), to predict effects of both target-masker similarity and uncertainty on word perception. Although it is a statistical model, as yet based on relatively few (speech) data, it is of interest because it suggests that the auditory system uses a generic strategy to deal with ambiguous input. While there are, as yet, no integrated models that can quantify early processing and selection of target speech in multi-talker conditions, the combination of several available models developed in different domains would already represent a valuable first step.

2. Research on auditory grouping of speech has up to now mainly focused on effects of voice properties and spatial cues, so that we actually know relatively little about the influence of other speech properties, such as speaking style, linguistic variability, and contextual information. This omission is understandable, because these properties in general affect the intelligibility of the target speech itself, and thus introduce a confound. As demonstrated by Freyman et al. (2001) and Iyer et al. (2010), a simple solution for this problem is to manipulate the properties only in the interfering speech. Another solution would be to vary combinations of properties in the target speech in such a way that the intelligibility does not change.

3. Despite the fact that effects of exogenous auditory attention have emerged in many studies and have been studied extensively in relation to LTM performance, relatively few studies have looked at preconditions for attentional shifts, at the interaction between exogenous and endogenous attention, or at how exogenous attention affects speech perception in multi-talker conditions. The research of Parmentier and his colleagues, summarized above, represents an exception and a welcome broadening of our knowledge that should be extended by further research.

4. In general, there has not been much research into speech perception where attention is explicitly manipulated while keeping all confounding factors under control. The recent interest in informational masking has generated a number of interesting studies, but the focus on (manipulation of) uncertainty in these studies leaves other important aspects unexplored. Disentangling effects of attention from those of grouping will represent a particular challenge, because these phenomena are easily confounded in behavioral studies. One approach could be to present various pairs of speech stimuli while making sure that each stimulus occurs as target as well as interferer (see the sketch following this list). This would make it possible to separate grouping effects, which depend on pairs, from other effects, depending on individual stimuli.

5. The dynamic properties of attention also deserve further study. Not only should the research into effects of changing voices or locations be extended, but changes in other speech properties should be investigated as well. It would, furthermore, be of great interest to uncover the processes underlying the remarkable sluggishness observed in responses to changes and in adaptation to steady-state conditions.

6. Early behavioral research, such as the study by Cutting (1976) showing that certain dichotic stimuli can at the same time be heard as two sounds and fused into one percept, already indicated that sound perception and sound localization are distinct processes. This is confirmed by more recent electrophysiological and neuroimaging studies revealing separate pathways for analysis of "what" and "where" features in the auditory cortex (Ahveninen, Jääskeläinen, Raij, Bonmassar, Devore, et al., 2006; Alain, Arnott, Hevenor, Graham, & Grady, 2001). However, there also are indications that the distinction is not clear-cut. The studies by Freyman and colleagues, for example, show that perceived spatial location facilitates speech segregation (Freyman et al., 1999). Furthermore, it is evident that binaural perception and sound localization partly rely on the same cues (Middlebrooks & Green, 1991), which indicates that there is an overlap in the processing at peripheral and brainstem levels. This means that it is still not clear to what degree binaural speech perception depends on one’s ability to localize target and/or interfering sound sources.
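To make the pairing approach of recommendation 4 more tangible, the Python sketch below builds a design in which every stimulus occurs both as target and as interferer, and then separates pair-specific (grouping-related) effects from effects tied to individual stimuli by means of a simple additive decomposition. The stimulus names, scores, and the decomposition itself are illustrative assumptions, not a prescribed analysis.

from itertools import permutations

stimuli = ["sentence_A", "sentence_B", "sentence_C", "sentence_D"]
# Every ordered pair: the first element serves as target, the second as interferer,
# so each stimulus occurs in both roles across the design.
pairs = list(permutations(stimuli, 2))

def pair_residuals(scores):
    # Additive decomposition: score(target, interferer) = grand mean + target effect
    # + interferer effect + residual; the residual is read as a pair-specific
    # (grouping-related) effect.
    grand = sum(scores.values()) / len(scores)
    n = len(stimuli) - 1
    t_eff = {s: sum(v for (t, _), v in scores.items() if t == s) / n - grand for s in stimuli}
    i_eff = {s: sum(v for (_, i), v in scores.items() if i == s) / n - grand for s in stimuli}
    return {(t, i): v - grand - t_eff[t] - i_eff[i] for (t, i), v in scores.items()}

# Hypothetical percent-correct scores, identical except one pair that groups poorly.
scores = {pair: 70.0 for pair in pairs}
scores[("sentence_A", "sentence_B")] = 55.0
print(round(pair_residuals(scores)[("sentence_A", "sentence_B")], 2))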

These open issues make it clear that it is unlikely that Cherry’s (1953) paper will soon be forgotten or that the cocktail-party problem will cease to inspire us in the foreseeable future.

Acknowledgments The author thanks three anonymous reviewers for their thoughtful and constructive comments, given to an earlier version of this paper.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

References

Ahveninen, J., Hämäläinen, M., Jääskeläinen, I. P., Ahlfors, S. P., Huang, S., Lin, F.-H., … Belliveau, J. W. (2011). Attention-driven auditory cortex short-term plasticity helps segregate relevant sounds from noise. Proceedings of the National Academy of Sciences, 108, 4182–4187. doi:10.1073/pnas.1016134108

Ahveninen, J., Jääskeläinen, I. P., Raij, T., Bonmassar, G., Devore, S., Hämäläinen, M., … Belliveau, J. W. (2006). Task-modulated "what" and "where" pathways in human auditory cortex. Proceedings of the National Academy of Sciences, 103, 14608–14613. doi:10.1073/pnas.0510480103

Alain, C., Arnott, S. R., Hevenor, S., Graham, S., & Grady, C. L. (2001). "What" and "where" in the human auditory system. Proceedings of the National Academy of Sciences, 98, 12301–12306. doi:10.1073/pnas.211209098

Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2, 567–577. doi:10.1109/89.326615

Allen, K., Alais, D., & Carlile, S. (2009). Speech intelligibility reduces over distance from an attended location: Evidence for an auditory spatial gradient of attention. Attention, Perception, & Psychophysics, 71, 164–173. doi:10.3758/APP.71.1.164

ANSI. (1997). ANSI S3.5-1997: Methods for calculation of the speech intelligibility index. New York: American National Standards Institute.

Arbogast, T., Mason, C., & Kidd, G. (2002). The effect of spatial separation on informational and energetic masking of speech. Journal of the Acoustical Society of America, 112, 2086–2098. doi:10.1121/1.1510141

Assmann, P. F., & Summerfield, Q. (2004). The perception of speech under adverse conditions. In S. Greenberg, W. A. Ainsworth, A. N. Popper, & R. R. Fay (Eds.), Speech processing in the auditory system (pp. 231–308). New York: Springer.

Atal, B. S., & Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction of the acoustic wave. Journal of the Acoustical Society of America, 50, 637–655. doi:10.1121/1.1912679

Bell, R., Röer, J. P., Dentale, S., & Buchner, A. (2012). Habituation of the irrelevant sound effect: Evidence for an attentional theory of short-term memory disruption. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 1542–1557. doi:10.1037/a0028459

Best, V., Ozmeral, E. J., Kopčo, N., & Shinn-Cunningham, B. G. (2008). Object continuity enhances selective auditory attention. Proceedings of the National Academy of Sciences, 105, 13174–13178. doi:10.1073/pnas.0803718105

Best, V., Shinn-Cunningham, B. G., Ozmeral, E. J., & Kopčo, N. (2010). Exploring the benefit of auditory spatial continuity. Journal of the Acoustical Society of America, 127, EL258. doi:10.1121/1.3431093

Beutelmann, R., Brand, T., & Kollmeier, B. (2010). Revision, extension, and evaluation of a binaural speech intelligibility model. Journal of the Acoustical Society of America, 127, 2479–2497. doi:10.1121/1.3295575

Binns, C., & Culling, J. F. (2007). The role of fundamental frequency contours in the perception of speech against interfering speech. Journal of the Acoustical Society of America, 122, 1765–1776. doi:10.1121/1.2751394

Bird, J., & Darwin, C. J. (1998). Effects of a difference in fundamental frequency in separating two sentences. In A. R. Palmer, A. Rees, A. Q. Summerfield, & R. Meddis (Eds.), Psychophysical and physiological advances in hearing (pp. 263–269). London: Whurr Publishers.

Block, C. K., & Baldwin, C. L. (2010). Cloze probability and completion norms for 498 sentences: Behavioral and neural validation using event-related potentials. Behavior Research Methods, 42, 665–670. doi:10.3758/BRM.42.3.665

Bolia, R., Nelson, W., Ericson, M., & Simpson, B. (2000). A speech corpus for multitalker communications research. Journal of the Acoustical Society of America, 107, 1065–1066. doi:10.1121/1.428288

Boothroyd, A., & Nittrouer, S. (1988). Mathematical treatment of context effects in phoneme and word recognition. Journal of the Acoustical Society of America, 84, 101–114. doi:10.1121/1.396976

Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge: MIT Press.

Broadbent, D. E. (1958). Perception and communication. London: Pergamon Press.

Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. Journal of the Acoustical Society of America, 29, 708–710. doi:10.1121/1.1909019

Brokx, J. P. L., & Nooteboom, S. G. (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10, 23–36.

Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86, 117–128.

Bronkhorst, A. W., Bosman, A. J., & Smoorenburg, G. F. (1993). A model for context effects in speech recognition. Journal of the Acoustical Society of America, 93, 499–509. doi:10.1121/1.406844

Bronkhorst, A. W., Brand, T., & Wagener, K. (2002). Evaluation of context effects in sentence recognition. Journal of the Acoustical Society of America, 111, 2874–2886. doi:10.1121/1.1458025

Bronkhorst, A. W., & Plomp, R. (1988). The effect of head-induced interaural time and level differences on speech intelligibility in noise. Journal of the Acoustical Society of America, 83, 1508–1516. doi:10.1121/1.395906

Brungart, D. S. (2001). Informational and energetic masking effects in the perception of two simultaneous talkers. Journal of the Acoustical Society of America, 109, 1101–1109. doi:10.1121/1.1345696


Brungart, D. S., & Iyer, N. (2012). Better-ear glimpsing efficiency with symmetrically-placed interfering talkers. Journal of the Acoustical Society of America, 132, 2545–2556. doi:10.1121/1.4747005

Brungart, D. S., & Simpson, B. D. (2002). Within-ear and across-ear interference in a cocktail-party listening task. Journal of the Acoustical Society of America, 112, 2985–2995. doi:10.1121/1.1512703

Brungart, D. S., & Simpson, B. D. (2007). Cocktail party listening in a dynamic multitalker environment. Perception & Psychophysics, 69, 79–91. doi:10.3758/BF03194455

Brungart, D. S., Simpson, B. D., Ericson, M., & Scott, K. (2001). Informational and energetic masking effects in the perception of multiple simultaneous talkers. Journal of the Acoustical Society of America, 110, 2527–2538. doi:10.1121/1.1408946

Burkhard, M. D., & Sachs, R. M. (1975). Anthropometric manikin for acoustic research. Journal of the Acoustical Society of America, 58, 214–222. doi:10.1121/1.380648

Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25, 975–979. doi:10.1121/1.1907229

Colburn, H. S. (1973). Theory of binaural detection based on auditory-nerve data. General strategy and preliminary results on interaural discrimination. Journal of the Acoustical Society of America, 54, 1458–1470. doi:10.1121/1.1914445

Colle, H. A., & Welsh, A. (1976). Acoustic masking in primary memory. Journal of Verbal Learning and Verbal Behavior, 15, 17–31. doi:10.1016/S0022-5371(76)90003-7

Cooke, M. (2006). A glimpsing model of speech perception in noise. Journal of the Acoustical Society of America, 119, 1562–1573. doi:10.1121/1.2166600

Cowan, N. (1988). Evolving conceptions of memory storage, selective attention, and their mutual constraints within the human information-processing system. Psychological Bulletin, 104, 163–191. doi:10.1037/0033-2909.104.2.163

Cowan, N. (1999). An embedded-processes model of working memory. In A. Miyake & P. Shah (Eds.), Models of working memory: Mechanisms of active maintenance and executive control (pp. 62–101). Cambridge: Cambridge University Press.

Cowan, N., & Wood, N. L. (1997). Constraints on awareness, attention, processing and memory: Some recent investigations with ignored speech. Consciousness and Cognition, 6, 182–203. doi:10.1006/ccog.1997.0300

Culling, J. F., Hawley, M. L., & Litovsky, R. Y. (2004). The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources. Journal of the Acoustical Society of America, 116, 1057–1065. doi:10.1121/1.1772396

Culling, J. F., & Summerfield, Q. (1995). Perceptual separation of concurrent speech sounds: Absence of across-frequency grouping by common interaural delay. Journal of the Acoustical Society of America, 98, 785–797. doi:10.1121/1.413571

Cutting, J. E. (1976). Auditory and linguistic processes in speech perception: Inferences from six fusions in dichotic listening. Psychological Review, 83, 114–140. doi:10.1037/0033-295X.83.2.114

Dalton, P., & Lavie, N. (2004). Auditory attentional capture: Effects ofsingleton distractor sounds. Journal of Experimental Psychology:Human Perception and Performance, 30, 180–193. doi:10.1037/0096-1523.30.1.180

Darwin, C. J. (1997). Auditory grouping. Trends in Cognitive Sciences, 1,327–333. doi:10.1016/S1364-6613(97)01097-8

Darwin, C. J. (2008). Listening to speech in the presence of other sounds.Philosophical Transactions of the Royal Society B, 363, 1011–1021.doi:10.1098/rstb.2007.2156

Darwin, C., Brungart, D., & Simpson, B. (2003). Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. Journal of the Acoustical Society of America, 114, 2913–2922. doi:10.1121/1.1616924

Darwin, C. J., & Carlyon, R. P. (1995). Auditory grouping. In B. C. J. Moore (Ed.), The handbook of perception and cognition (Hearing, Vol. 6, pp. 387–424). London, UK: Academic Press.

Darwin, C. J., & Hukin, R. W. (1999). Auditory objects of attention: the role of interaural time differences. Journal of Experimental Psychology: Human Perception and Performance, 25, 617–629. doi:10.1037/0096-1523.25.3.617

Darwin, C. J., & Hukin, R. W. (2000). Effectiveness of spatial cues, prosody and talker characteristics in selective attention. Journal of the Acoustical Society of America, 107, 970–977. doi:10.1121/1.428278

Darwin, C. J., & Hukin, R. W. (2004). Limits to the role of a common fundamental frequency in the fusion of two sounds with different spatial cues. Journal of the Acoustical Society of America, 116, 502–506. doi:10.1121/1.1760794

Deutsch, J. A., & Deutsch, D. (1963). Attention: Some theoretical considerations. Psychological Review, 70, 80–90. doi:10.1037/h0039515

Diehl, R. L. (2008). Acoustic and auditory phonetics: the adaptive design of speech sound systems. Philosophical Transactions of the Royal Society B, 363, 965–978. doi:10.1098/rstb.2007.2153

Drennan, W. R., Gatehouse, S., & Lever, C. (2003). Perceptual segregation of competing speech sounds: The role of spatial location. Journal of the Acoustical Society of America, 114, 2178–2189. doi:10.1121/1.1609994

Drullman, R., Festen, J. M., & Plomp, R. (1993). Effect of temporal envelope smearing on speech reception. Journal of the Acoustical Society of America, 95, 1053–1064. doi:10.1121/1.408467

Du, Y., He, Y., Ross, B., Bardouille, T., Wu, X., Li, L., & Alain, C. (2011). Human auditory cortex activity shows additive effects of spectral and spatial cues during speech segregation. Cerebral Cortex, 21, 698–707. doi:10.1093/cercor/bhq136

Durlach, N. I. (1972). Binaural signal detection: Equalization and cancellation theory. In J. V. Tobias (Ed.), Foundations of modern auditory theory (pp. 369–462). New York: Academic Press.

Ericson, M. A., Brungart, D. S., & Simpson, B. D. (2004). Factors that influence intelligibility in multitalker speech displays. The International Journal of Aviation Psychology, 14, 311–332. doi:10.1207/s15327108ijap1403_6

Escera, C., Alho, K., Winkler, I., & Näätänen, R. (1998). Neural mechanisms of involuntary attention to acoustic novelty and change. Journal of Cognitive Neuroscience, 10, 590–604. doi:10.1162/089892998562997

Festen, J. M., & Plomp, R. (1990). Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. Journal of the Acoustical Society of America, 88, 1725–1736. doi:10.1121/1.400247

French, N. R., & Steinberg, J. C. (1947). Factors governing the intelligibility of speech sounds. Journal of the Acoustical Society of America, 19, 90–119. doi:10.1121/1.1916407

Freyman, R. L., Balakrishnan, U., & Helfer, K. S. (2001). Spatial release from informational masking in speech recognition. Journal of the Acoustical Society of America, 109, 2112–2122. doi:10.1121/1.1354984

Freyman, R. L., Helfer, K. S., McCall, D. D., & Clifton, R. K. (1999). The role of perceived spatial separation in the unmasking of speech. Journal of the Acoustical Society of America, 106, 3578–3588. doi:10.1121/1.428211

Gallun, F. J., Mason, C. R., & Kidd, G., Jr. (2005). Binaural release from informational masking in a speech identification task. Journal of the Acoustical Society of America, 118, 1614–1625. doi:10.1121/1.1984876

Goldstone, R. L., & Hendrickson, A. T. (2009). Categorical perception. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 69–78. doi:10.1002/wcs.26

Hartmann, W. M., Rakerd, B., & Koller, A. (2005). Binaural coherence in rooms. Acta Acustica united with Acustica, 91, 451–462.

Helfer, K. S., & Freyman, R. L. (2009). Lexical and indexical cues in masking by competing speech. Journal of the Acoustical Society of America, 125, 447–456. doi:10.1121/1.3035837

Hink, R. F., & Hillyard, S. A. (1976). Auditory evoked potentials during selective listening to dichotic speech messages. Perception & Psychophysics, 20, 236–242. doi:10.3758/BF03199449

Holender, D. (1986). Semantic activation without conscious identification in dichotic listening, parafoveal vision, and visual masking: A survey and appraisal. Behavioral and Brain Sciences, 9, 1–66. doi:10.1017/S0140525X00021269

Hood, J. D. (1957). The principles and practice of bone conduction audiometry: A review of the present position. Proceedings of the Royal Society of Medicine, 50, 689–697.

Hu, G., & Wang, D. L. (2004). Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Transactions on Neural Networks, 15, 1135–1150. doi:10.1109/TNN.2004.832812

Hukin, R. W., & Darwin, C. J. (1995). Effects of contralateral presentation and of interaural time differences in segregating a harmonic from a vowel. Journal of the Acoustical Society of America, 98, 1380–1387. doi:10.1121/1.414348

IEC (2003). Sound system equipment. Part 16: Objective rating of speech intelligibility by speech transmission index. International Electrotechnical Commission, Standard 60268-16 (3rd edition).

Iyer, N., Brungart, D. S., & Simpson, B. D. (2010). Effects of target-masker contextual similarity on the multimasker penalty in a three-talker diotic listening task. Journal of the Acoustical Society of America, 128, 2998–3010. doi:10.1121/1.3479547

Johnsrude, I. S., Mackey, A., Hakyemez, H., Alexander, E., Trang, H. P., & Carlyon, R. P. (2013). Swinging at a cocktail party: Voice familiarity aids speech perception in the presence of a competing voice. Psychological Science, 24, 1995–2004. doi:10.1177/0956797613482467

Jones, D. (1993). Objects, streams, and threads of auditory attention. In A. Baddeley & L. Weiskrantz (Eds.), Attention: Selection, awareness, and control: A tribute to Donald Broadbent (pp. 87–104). Oxford: Oxford University Press.

Jones, G. L., & Litovsky, R. Y. (2011). A cocktail party model of spatial release from masking by both noise and speech interferers. Journal of the Acoustical Society of America, 130, 1463–1474. doi:10.1121/1.3613928

Jørgensen, S., & Dau, T. (2011). Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing. Journal of the Acoustical Society of America, 130, 1475–1487. doi:10.1121/1.3621502

Jørgensen, S., Ewert, S. D., & Dau, T. (2013). A multi-resolution envelope-power based model for speech intelligibility. Journal of the Acoustical Society of America, 134, 436–446. doi:10.1121/1.4807563

Kalikow, D. N., Stevens, K. N., & Elliot, L. L. (1977). Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 61, 1337–1351. doi:10.1121/1.381436

Kalinli, O., & Narayanan, S. (2009). Prominence detection using auditory attention cues and task-dependent high level information. IEEE Transactions on Audio, Speech and Language Processing, 17, 1009–1024. doi:10.1109/TASL.2009.2014795

Kidd, G., Jr., Arbogast, T. L., Mason, C. R., & Gallun, F. J. (2005). The advantage of knowing where to listen. Journal of the Acoustical Society of America, 118, 3804–3815. doi:10.1121/1.2109187

Koch, I., Lawo, V., Fels, J., & Vorländer, M. (2011). Switching in the cocktail party: Exploring intentional control of auditory selective attention. Journal of Experimental Psychology: Human Perception and Performance, 37, 1140–1147. doi:10.1037/a0022189

Kryter, K. D. (1962). Methods for the calculation and use of the articulation index. Journal of the Acoustical Society of America, 34, 1689–1697. doi:10.1121/1.1909094

Kuhl, P., & Rivera-Gaxiola, M. (2008). Neural substrates of language acquisition. Annual Review of Neuroscience, 31, 511–534. doi:10.1146/annurev.neuro.30.051606.094321

Lambrecht, J., Spring, D. K., & Münte, T. F. (2011). The focus of attention at the virtual cocktail party—Electrophysiological evidence. Neuroscience Letters, 489, 53–56. doi:10.1016/j.neulet.2010.11.066

Laures, J. S., & Weismer, G. (1999). The effects of a flattened fundamental frequency on intelligibility at the sentence level. Journal of Speech, Language, and Hearing Research, 42, 1148–1156. doi:10.1044/jslhr.4205.1148

Lavandier, M., Jelfs, S., Culling, J. F., Watkins, A. J., Raimond, A. P., & Makin, S. J. (2012). Binaural prediction of speech intelligibility in reverberant rooms with multiple noise sources. Journal of the Acoustical Society of America, 131, 218–231. doi:10.1121/1.3662075

Lutfi, R. A., Gilbertson, L., Heo, I., Chang, A., & Stamas, J. (2013). The information-divergence hypothesis of informational masking. Journal of the Acoustical Society of America, 134, 2160–2170. doi:10.1121/1.4817875

Mayo, L. H., Florentine, M., & Buus, S. (1997). Age of second-language acquisition and perception of speech in noise. Journal of Speech, Language, and Hearing Research, 40, 686–693. doi:10.1044/jslhr.4003.686

McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19, R1024–R1027.

McLachlan, N., & Wilson, S. (2010). The central role of recognition in auditory perception: A neurobiological model. Psychological Review, 117, 175–196. doi:10.1037/a0018063

Middlebrooks, J. C., & Green, D. M. (1991). Sound localization by human listeners. Annual Review of Psychology, 42, 135–159. doi:10.1146/annurev.ps.42.020191.001031

Moore, B. C. J., & Gockel, H. E. (2012). Properties of auditory stream formation. Philosophical Transactions of the Royal Society B, 367, 919–931. doi:10.1098/rstb.2011.0355

Moray, N. (1959). Attention in dichotic listening: Affective cues and the influence of instructions. Quarterly Journal of Experimental Psychology, 11, 56–60. doi:10.1080/17470215908416289

Müsch, H., & Buus, S. (2001). Using statistical decision theory to predict speech intelligibility. I. Model structure. Journal of the Acoustical Society of America, 109, 2896–2909. doi:10.1121/1.1371971

Näätänen, R., Gaillard, A. W. K., & Mäntysalo, S. (1978). Early selective-attention effect on evoked potential reinterpreted. Acta Psychologica, 42, 313–329. doi:10.1016/0001-6918(78)90006-9

Näätänen, R., Kujala, T., & Winkler, I. (2011). Auditory processing that leads to conscious perception: A unique window to central auditory processing opened by the mismatch negativity and related responses. Psychophysiology, 48, 4–22. doi:10.1111/j.1469-8986.2010.01114.x

Näätänen, R., & Picton, T. (1987). The N1 wave of the human electric and magnetic response to sound: A review and an analysis of the component structure. Psychophysiology, 24, 375–425. doi:10.1111/j.1469-8986.1987.tb00311.x

Nager, W., Dethlefsen, C., & Münte, T. F. (2008). Attention to human speakers in a virtual auditory environment: brain potential evidence. Brain Research, 1220, 164–170. doi:10.1016/j.brainres.2008.02.058

Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45, 205–231. doi:10.1016/j.visres.2004.07.042

Neath, I. (2000). Modeling the effects of irrelevant speech on memory. Psychonomic Bulletin and Review, 7, 403–423. doi:10.3758/BF03214356

Paavilainen, P., Valppu, S., & Näätänen, R. (2001). The additivity of the auditory feature analysis in the human brain as indexed by the mismatch negativity: 1 + 1 approximately 2 but 1 + 1 + 1 < 3. Neuroscience Letters, 301, 179–182. doi:10.1016/S0304-3940(01)01635-4

Parmentier, F. B. R. (2008). Towards a cognitive model of distraction by auditory novelty: The role of involuntary attention capture and semantic processing. Cognition, 109, 345–362. doi:10.1016/j.cognition.2008.09.005

Parmentier, F. B. R. (2013). The cognitive determinants of behavioral distraction by deviant auditory stimuli: a review. Psychological Research. doi:10.1007/s00426-013-0534-4

Parmentier, F. B. R., Elford, G., Escera, C., Andrés, P., & San Miguel, I. (2008). The cognitive locus of distraction by acoustic novelty in the cross-modal oddball task. Cognition, 106, 408–432. doi:10.1016/j.cognition.2007.03.008

Parmentier, F. B. R., Turner, J., & Perez, L. (2014). A dual contribution to the involuntary semantic processing of unexpected spoken words. Journal of Experimental Psychology: General, 143, 38–45. doi:10.1037/a0031550

Pashler, H. E. (1998). The psychology of attention. Cambridge: MIT Press.

Patterson, R. D., & Johnsrude, I. S. (2008). Functional imaging of the auditory processing applied to speech sounds. Philosophical Transactions of the Royal Society B, 363, 1023–1035. doi:10.1098/rstb.2007.2157

Peterson, G. H., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184. doi:10.1121/1.1906875

Plomp, R. (1977). Acoustical aspects of cocktail parties. Acustica, 38, 186–191.

Plomp, R., & Mimpen, A. M. (1979). Improving the reliability of testing the speech reception threshold for sentences. Audiology, 18, 43–52. doi:10.3109/00206097909072618

Pollack, I., & Pickett, J. M. (1959). Intelligibility of peak-clipped speech at high noise levels. Journal of the Acoustical Society of America, 31, 14–16. doi:10.1121/1.1907604

Posner, M. I., & Cohen, Y. (1984). Components of visual orienting. In H. Bouma & D. G. Bouwhuis (Eds.), Attention and performance: Vol. 10. Control of language processes (pp. 531–556). Erlbaum.

Power, A. J., Foxe, J. J., Forde, E.-J., Reilly, R. B., & Lalor, E. C. (2012). At what time is the cocktail party? A late locus of selective attention to natural speech. European Journal of Neuroscience, 35, 1497–1503. doi:10.1111/j.1460-9568.2012.08060.x

Pulvermüller, F., & Shtyrov, Y. (2006). Language outside the focus of attention: The mismatch negativity as a tool for studying higher cognitive processes. Progress in Neurobiology, 79, 49–71. doi:10.1016/j.pneurobio.2006.04.004

Pulvermüller, F., Shtyrov, Y., Hasting, A. S., & Carlyon, R. P. (2008). Syntax as a reflex: Neurophysiological evidence for early automaticity of grammatical processing. Brain and Language, 104, 244–253. doi:10.1016/j.bandl.2007.05.002

Reiche, M., Hartwigsen, G., Widmann, A., Saur, D., Schröger, E., & Bendixen, A. (2013). Brain Research, 1490, 153–160. doi:10.1016/j.brainres.2012.10.055

Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–950. doi:10.1126/science.7233191

Rhebergen, K. S., & Versfeld, N. J. (2005). A Speech Intelligibility Index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners. Journal of the Acoustical Society of America, 117, 2181–2192. doi:10.1121/1.1861713

Rhebergen, K. S., Versfeld, N. J., & Dreschler, W. A. (2006). Extended speech intelligibility index for the prediction of the speech reception threshold in fluctuating noise. Journal of the Acoustical Society of America, 120, 3988–3997. doi:10.1121/1.2358008

Rivenez, M., Darwin, C. J., & Guillaume, A. (2006). Processing unattended speech. Journal of the Acoustical Society of America, 119, 4027–4040. doi:10.1121/1.2190162

Roman, N., Wang, D. L., & Brown, G. J. (2003). Speech segregation based on sound localization. Journal of the Acoustical Society of America, 114, 2236–2252. doi:10.1121/1.1610463

Scharf, B. (1998). Auditory attention: the psychoacoustical approach. In H. Pashler (Ed.), Attention (pp. 75–117). Hove, UK: Psychology Press.

Scharf, B., Quigley, S., Peachey, A. N., & Reeves, A. (1987). Focused auditory attention and frequency selectivity. Perception & Psychophysics, 42, 215–223. doi:10.3758/BF03203073

Schröger, E. (1995). Processing of auditory deviants with changes in one versus two stimulus dimensions. Psychophysiology, 32, 55–65. doi:10.1111/j.1469-8986.1995.tb03406.x

Shinn-Cunningham, B. G. (2008). Object-based auditory and visual attention. Trends in Cognitive Sciences, 12, 182–186. doi:10.1016/j.tics.2008.02.003

Spence, C. J., & Driver, J. (1994). Covert spatial orienting in audition: Exogenous and endogenous mechanisms facilitate sound localization. Journal of Experimental Psychology: Human Perception and Performance, 20, 555–574. doi:10.1037/0096-1523.20.3.555

Spence, C., & Driver, J. (1997). Audiovisual links in exogenous covert spatial orienting. Perception & Psychophysics, 59, 1–22. doi:10.3758/BF03206843

Steeneken, H. J. M., & Houtgast, T. (1980). A physical method for measuring speech transmission quality. Journal of the Acoustical Society of America, 67, 318–326. doi:10.1121/1.384464

Steinschneider, M., Nourski, K. V., & Fishman, Y. I. (2013). Representation of speech in human auditory cortex: Is it special? Hearing Research, 305, 57–73. doi:10.1016/j.heares.2013.05.013

Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415–433.

Teder-Sälejärvi, W. A., & Hillyard, S. A. (1998). The gradient of spatial auditory attention in free field: An event-related potential study. Perception & Psychophysics, 60, 1228–1242. doi:10.3758/BF03206172

Teder-Sälejärvi, W. A., Hillyard, S. A., Röder, B., & Neville, H. J. (1999). Spatial attention to central and peripheral auditory stimuli as indexed by event-related potentials. Cognitive Brain Research, 8, 213–227. doi:10.1016/S0926-6410(99)00023-3

Theeuwes, J. (1992). Perceptual selectivity for color and form. Perception & Psychophysics, 51, 599–606. doi:10.3758/BF03211656

Treisman, A. (1960). Contextual cues in selective listening. Quarterly Journal of Experimental Psychology, 12, 242–248. doi:10.1080/17470216008416732

Treisman, A. (1964). Monitoring and storage of irrelevant messages in selective attention. Journal of Verbal Learning and Verbal Behavior, 3, 449–459. doi:10.1016/S0022-5371(64)80015-3

Uslar, V. N., Carroll, R., Hanke, M., Hamann, C., Ruigendijk, E., Brand, T., & Kollmeier, B. (2013). Development and evaluation of a linguistically and audiologically controlled sentence intelligibility test. Journal of the Acoustical Society of America, 134, 3039–3056. doi:10.1121/1.4818760

Van Rooij, J. C. G. M., & Plomp, R. (1991). The effect of linguistic entropy on speech perception in noise in young and elderly listeners. Journal of the Acoustical Society of America, 90, 2985–2991. doi:10.1121/1.401772

Van Wijngaarden, S. J., & Drullman, R. (2008). Binaural intelligibility prediction based on the speech transmission index. Journal of the Acoustical Society of America, 123, 4514–4523. doi:10.1121/1.2905245

Wan, R., Durlach, N. I., & Colburn, H. S. (2010). Application of an extended equalization-cancellation model to speech intelligibility with spatially distributed maskers. Journal of the Acoustical Society of America, 128, 3678–3690. doi:10.1121/1.3502458

Wang, D., & Brown, G. J. (2006). Computational auditory scene analysis. New York: Wiley-IEEE Press.

Wightman, F. L., & Kistler, D. J. (1992). The dominant role of low-frequency interaural time differences in sound localization. Journal of the Acoustical Society of America, 91, 1648–1661. doi:10.1121/1.402445

Wightman, F., & Kistler, D. (2005). Measurement and validation of human HRTFs for use in hearing research. Acta Acustica united with Acustica, 91, 429–439.

Woldorff, M. G., Gallen, C. C., Hampson, S. A., Hillyard, S. A., Pantev, C., Sobel, D., & Bloom, F. (1993). Modulation of early sensory processing in human auditory cortex during auditory selective attention. Proceedings of the National Academy of Sciences, 90, 8722–8726.

Wood, N. L., & Cowan, N. (1995a). The cocktail party phenomenon revisited: How frequent are attention shifts to one's name in an irrelevant auditory channel? Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 255–260. doi:10.1037/0278-7393.21.1.255

Wood, N., & Cowan, N. (1995b). The cocktail party phenomenon revisited: Attention and memory in the classic selective listening procedure of Cherry (1953). Journal of Experimental Psychology: General, 124, 243–262. doi:10.1037/0096-3445.124.3.243

Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., … Schroeder, C. E. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron, 77, 980–991. doi:10.1016/j.neuron.2012.12.037
