Auditory perceptual organization

Susan L. Denham 1, István Winkler 2,3

1 Cognition Institute, University of Plymouth, UK
2 Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, MTA, Hungary
3 University of Szeged, Szeged, Hungary

To appear in: Oxford Handbook of Perceptual Organization, Oxford University Press, edited by Johan Wagemans.

1. Introduction and background

1.1. The problem

According to the functionalist view of perception and cognition (Brunswik 1955), perceptual information processing serves to support the organism in reaching its fundamental goals: avoiding dangers and gaining access to resources. Both dangers and resources are provided by objects in our environment. Thus a large part of perceptual processing can be understood as answering the question, ‘What is out there?’. However, even correctly answering this question is not sufficient for deciding on a course of action, because our possible interactions with the environment necessarily lie in the future relative to the time from which the information originated. Therefore, the second question to be answered is: ‘What will these objects do in the future?’; that is, our perceptual systems must describe the flow of events in the environment and interpret them in terms of the behaviours of objects.

In this chapter, we consider how sound information is processed by the human brain to answer the above questions. Sounds are produced by the movements or actions of objects and by interactions between them. As a consequence, sounds primarily carry information about what happens in the environment, rather than about the surface features of objects. Together with the fact that most environments are largely transparent to travelling pressure waves (the physical sound), this makes sounds especially useful for conveying information about the behaviours of objects.

Sounds pose a number of specific challenges that need to be considered in any account of their interpretation. Sounds are ephemeral; we cannot go back to re-examine them. Sounds unfold in time and contain information at many scales of granularity; thus analysis over a number of different timescales is needed in order to extract their meaning (Nelken 2008). For example, a brief impulsive sound may tell the listener that two objects have collided, but a series of such sounds is needed for the listener to know that someone is clapping rather than walking. Many sound sources generate sounds intermittently, and information about their behaviour typically spans several discrete sound events. Correctly associating sounds across time requires the formation of mental representations that are temporally persistent and allow the formation of associations between sounds emitted by the same source (Winkler, Denham et al. 2009). Finally, the pressure waves arriving at our ears are formed as a composite of all concurrent sounds, which the auditory system has to disentangle. This process of partitioning acoustic features into meaningful groups is known as auditory perceptual organization or auditory scene analysis (Bregman 1990).

1.2. Chapter overview

How does the auditory system achieve the remarkable feat of (generally correctly) decomposing the sound mixture into perceptual objects under the constraints imposed by the need to behave in a timely manner? Based on our review we will argue for two key processing strategies: firstly, perceptual representations should be predictive (Friston 2005, Summerfield and Egner 2009), and secondly, perceptual decisions should be flexible (Winkler, Denham et al. 2012). In this chapter, we first consider the principles that guide the formation of links between sounds, and their separation from other sounds. Some of the key experimental paradigms that have been used to investigate auditory perceptual organization are then described, and the behavioural and neural correlates of perceptual organization summarised. We use this information to motivate our working definition of an auditory perceptual object (Kubovy and Van Valkenburg 2001, Griffiths and Warren 2004, Winkler, Denham et al. 2009), and demonstrate the utility of this concept for understanding auditory perceptual organization. For the purposes of this chapter we ignore the influences of other modalities, but see Spence (this volume) for the importance of cross-modal perceptual organization.

2. Grouping Principles, Events, Streams, and Perceptual Objects in the Auditory Modality

2.1. The inverse problem and the need for constraints

If the goal of perception is to characterize distal objects, then perceptual information processing must solve what physicists term the “inverse problem”: to find the causes (sources) of the physical disturbances reaching the sensors. The problem is that the information reaching the ears does not fully specify the sources (e.g., Stoffregen and Bardy 2001; however, see Gibson 1979). Therefore, in order to achieve veridical perception, solutions need to be constrained in some way, e.g. by knowledge regarding the nature of the sound sources likely to be found in the given environment (Bar 2007), and/or by expectations arising from the current and recent context (Winkler, Denham et al. 2012).

In his seminal book, Bregman (1990) argued that such constraints had already been discovered by the Gestalt school of psychology (Köhler 1947) during the first half of the 20th century. The core observation of Gestalt psychology was that discrete stimuli form larger perceptual units, which have properties not present in the separate components, and that the perception of the components is influenced by the overall perceptual structure. The Gestalt psychologists described principles that govern the grouping of sensory elements (for a detailed discussion of Gestalt theory, see Section I.1 in this book and the excellent review by Wagemans, Elder et al. (2012)). Because the original Gestalt ‘laws of perception’ were largely based on the study of vision, here we discuss them in terms of sounds.

Similarity between the perceptual attributes of successive events, such as pitch, timbre, loudness and location, provides a basis for linking them (Bregman 1990, Moore and Gockel 2002, Moore and Gockel 2012). However, it appears that it is not so much the raw difference that is important, but rather the rate of change: the slower the rate of change between successive sounds, the more similar they are judged (Winkler, Denham et al. 2012). This suggests that in the auditory modality, the law of similarity is not separate from what the Gestalt psychologists termed good continuation. Good continuation means that smooth continuous changes in perceptual attributes favour grouping, while abrupt discontinuities are perceived as the start of something new.
Good continuation can operate both within a single sound event (e.g., amplitude-modulating a noise with a relatively high frequency results in the separate perception of a sequence of loud sounds and a continuous softer sound (Bregman 1990)), and between events (e.g. glides can help bind successive events (Bregman and Dannenbring 1973)).
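Bregman’s amplitude-modulation example can be illustrated with a small numerical sketch (Python/NumPy). The sampling rate, modulation rate, and modulation depth below are illustrative assumptions of ours, not values taken from the cited studies:

```python
import numpy as np

def am_noise(duration_s=2.0, fs=16000, mod_rate_hz=20.0, depth=0.8, seed=0):
    """White noise whose amplitude is sinusoidally modulated.

    Listeners tend to hear such a stimulus not as one fluctuating noise
    but as a series of loud noise bursts plus a continuous softer noise,
    the good-continuation effect described in the text.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * fs)) / fs
    noise = rng.standard_normal(t.size)
    # Modulator oscillates between (1 - depth) and 1, so the noise is
    # never fully switched off.
    envelope = 1.0 - depth * 0.5 * (1.0 + np.sin(2 * np.pi * mod_rate_hz * t))
    return envelope * noise

stimulus = am_noise()
```

Writing `stimulus` to a sound file and listening to it is the quickest way to verify the two-component percept.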


The principle of common fate refers to correlated changes in features, e.g. components that start and/or stop at the same time. This principle has also been termed ‘temporal coherence’, specifically with regard to correlations over time windows that span longer periods than individual events (Shamma, Elhilali et al. 2011). However, while common onset is a very powerful grouping cue, common offset is far less influential (for a review see Darwin and Carlyon 1995), and evidence for the grouping effects of coherent correlations between some other features (e.g. frequency modulations (Darwin and Sandell 1995, Lyzenga and Moore 2005), or spatial trajectories (Bőhm, Shestopalova et al. 2012)) is lacking.

Disjoint allocation (or belongingness) refers to the principle that each element of the sensory input is assigned to only one perceptual object. In an auditory analogue of the exclusive border assignment in Rubin’s face-vase illusion, Winkler, van Zuijen et al. (2006) showed that a tone which could equally be assigned to two different groups was only ever part of one of them at any given point in time. However, while this principle often holds in auditory perception, there are some notable violations; e.g. in duplex perception, the same sound component can contribute to the perception of a complex sound as well as being heard separately (Rand 1974, Fowler and Rosenblum 1990).

Finally, the principle of closure refers to the tendency of objects to be perceived as continuing unless there is evidence for their stopping, e.g. a glide continuing through a masking noise (Miller and Licklider 1950, Riecke, Van Opstal et al. 2008). For example, in ‘temporal induction’ (or phonemic restoration), the replacement of part of a sound (e.g. speech) with noise results in the perception of the original, unmodified sound, as well as a noise that is heard separately (Samuel 1981, Warren, Wrightson et al. 1988).
However, temporal induction only works if the deleted sound is expected, as is found for over-learnt sounds such as speech; see also Seeba and Klump (2009).

2.2. Perception as inference

This raises an important point: namely, that the key idea of a ‘Gestalt’ as a pattern implicitly carries within it the notion of predictability; i.e., parts can evoke the representation of the whole pattern. Specifically in the case of sounds, this allows one to generate expectations about sound events that have not yet occurred. This notion goes beyond Gestalt theory, aligning it with the empiricist tradition of unconscious inference (Helmholtz 1885) and perception as hypothesis formation (Gregory 1980; Feldman, this volume). Indeed, whereas the Gestalt psychologists thought that grouping principles were rooted in the laws of physics, more recent thinking (Bregman 1990) regards them as heuristics acquired through evolution and learning. By detecting patterns (or feature regularities) in the sensory input, the brain can construct compressed representations that allow it to ‘explain away’ (Pearl 1988) future events and so radically reduce the amount of sensory data needed for adequately describing the environment (Summerfield and Egner 2009). The use of schemata (with the corresponding loss of some detail) has long been accepted as an explanation for the nature of long-term memory (Bartlett 1932) and seems also to be the basis for the formation of perceptual representations in general (Neisser 1967, Hochberg 1981, Bar 2007). In accordance with these ideas, Winkler and Cowan (2005) suggested that sound sequences are represented by feature regularities (i.e. relationships between features that define the detected pattern), with only a few items described in full detail for anchoring the representation.

2.3. Auditory perceptual objects as predictive representations
Based on the Gestalt principles and the ideas of perceptual inference outlined above, Winkler and colleagues (Winkler 2007, Winkler, Denham et al. 2009, Winkler 2010) proposed a definition of auditory perceptual objects as predictive representations, constructed on the basis of feature regularities extracted from the incoming sounds (see also Koenderink (this volume) for a more general treatment of ecological Gestalts). Object representations are persistent, and absorb expected sensory events. Object representations encode distributions over featural and temporal patterns and can generalise appropriately with regard to the current context. Thus, in accordance with the ideas of the Gestalt psychologists, it was suggested that individual sound events are processed within the context of the whole, and the consolidated representation refers to patterns of sound events. In accord with Griffiths and Warren (2004), Winkler, Denham et al. (2009) do not distinguish ‘concrete’ from ‘abstract’ auditory objects, where the former refers to the physical source and the latter to the pattern of emission (Wightman and Jenison 1995, Kubovy and Van Valkenburg 2001). Thus, the notion of an auditory perceptual object is compatible with the definition of an auditory stream as a coherent sequence of sounds separable from other concurrent or intermittent sounds (Bregman 1990). However, whereas the term ‘auditory stream’ refers to a phenomenological unit of sound organization, with separability as its primary property, the definition proposed by Winkler, Denham et al. (2009) concerns the extraction and representation of the unit as a pattern with predictable components (Winkler, Denham et al. 2012). This definition of an auditory perceptual object is compatible with the memory component assumed in hierarchical predictive coding theories of perception (Friston 2005, Hohwy 2007). These theories posit that the brain acts to minimize the discrepancy between its predictions and the actual sensory input (termed the error signal), and that this occurs at many different levels of processing (e.g. Friston and Kiebel 2009). Error signals propagate towards higher levels, which then attempt to suppress them through refinements to internal models. Auditory perceptual objects can be regarded as models working at intermediate levels of this predictive coding hierarchy (Winkler and Czigler 2012).

3. Behavioural Correlates of Perceptual Sound Organization

3.1. Extraction and binding of features
It is generally accepted that the spectral decomposition carried out by the cochlea results in a topographically organised array of signals, i.e. a representation of incoming sounds in terms of their frequency content. This sets up the tonotopic organization found throughout most of the auditory system, up to and including the primary auditory cortex (Zwicker and Fastl 1999), with other features, such as onsets, amplitude and frequency modulations, and binaural differences, extracted sub-cortically and largely independently within each frequency channel (Oertel, Fay et al. 2002). It is important to note that even isolated sounds can be rather complex. In general, natural sounds contain many different frequency components, and both the frequencies of the components and their amplitudes can vary within a single sound (Ciocca 2008). Thus the auditory system has to find some way of correctly associating the features which originate from the same sound source. The classical view suggests that acoustic features are bound together to form auditory events (Bertrand and Tallon-Baudry 2000, Zhuo and Yu 2011). By a sound event, or token (Shamma, Elhilali et al. 2011), we mean a sound that is localised in time and is perceived as originating from a single sound source; for example, a musical note or a syllable (Ciocca 2008). Events are subsequently grouped sequentially into patterns, streams, or objects. However, most studies and models of auditory feature extraction to date have been based on data obtained in experiments presenting isolated sounds to listeners, and many of the problems encountered in natural environments have not yet been fully explored due to their complexity. One consequence is that the commonly accepted feed-forward hierarchical grouping account, just described, is too simplistic; see also van Leeuwen (this volume).
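The idea of a topographic array of frequency channels evolving over time can be made concrete with a crude numerical stand-in. The sketch below (Python/NumPy) uses a short-time Fourier transform, which is only a rough approximation of the cochlea’s filterbank (the real one is nonlinear and roughly logarithmically spaced); the window and hop sizes are illustrative assumptions:

```python
import numpy as np

def stft_magnitude(signal, fs=16000, win_ms=25, hop_ms=10):
    """Crude time-frequency decomposition of a signal.

    Returns an array of shape (n_frames, n_channels): each row is a
    snapshot of activity across 'frequency channels', loosely analogous
    to a tonotopic representation unfolding over time.
    """
    win = int(win_ms / 1000 * fs)
    hop = int(hop_ms / 1000 * fs)
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A 440 Hz pure tone should concentrate energy in one channel.
t = np.arange(16000) / 16000
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

With a 25 ms window the channel spacing is 40 Hz, so the 440 Hz tone peaks in channel 11; feeding in a natural sound instead shows how many channels a single event can occupy.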
In order to determine the perceptual qualities of two or more overlapping sound events the brain must first bind their component features; i.e. it must decide which parts of the complex input belong to each event and group features according to the event to which they belong. But there is a problem: the number of concurrent auditory objects, and which features belong to each, is unknown a priori; this must be inferred incrementally from the on-going sensory input. Therefore, feature extraction, feature binding, and sequential grouping must proceed in an interactive manner. Unfortunately, as yet, little is known about the nature of these interactions beyond the fact that the ubiquitous presence of descending pathways throughout the auditory system could provide the substrate for contextual (top-down) influences (Schofield 2010). Therefore, despite being aware that grouping processes cannot be fully disconnected from feature extraction and binding, by necessity we will address grouping as a separate process.

3.2. Auditory Scene Analysis

In the currently most widely accepted framework describing perceptual sound organization, Auditory Scene Analysis, Bregman (1990) proposes two separable processing stages. The first stage is suggested to be concerned with partitioning sound events into possible streams (groups), based primarily on featural differences (e.g. spectral content, location, timbre). The second stage, within which prior knowledge, context and/or task demands exert their influence, is a competitive process between candidate organizations that ultimately determines which one is perceived. Three notable further assumptions are included in the framework: 1) initially, the brain assumes that all sounds belong to the same stream, and segregating them requires evidence attesting to the probability that they originate from different sources; 2) for sequences with repeating patterns, perception settles on a final ‘perceptual decision’ after the evidence-gathering stage is complete; 3) solutions that include the continuation of a previously established stream are preferred to alternatives (the “old+new” strategy).

3.3. The grouping stage

Most behavioural studies have targeted the first processing stage, assessing the effects of various cues on auditory group formation.
Bregman (1990) distinguishes two classes of grouping processes: grouping based on concurrent (spectral, instantaneous, or vertical) cues, and grouping based on sequential (temporal, contextual, or horizontal) cues. However, although these two classes seem intuitively distinct, it turns out that instantaneous cues, despite appearing independent of the history of stimulation, are susceptible to the influence of prior sequential grouping (Bendixen, Jones et al. 2010); e.g. a harmonic can be pulled out of a complex with which it would otherwise be grouped if there are prior examples of that tone (Darwin, Hukin et al. 1995). So what triggers the automatic grouping and segregation of individual sound events? There have been surprisingly few experiments addressing this question explicitly, but the gap transfer illusion (Nakajima, Sasaki et al. 2000) suggests that the auditory system tends to match onsets to offsets according to their temporal proximity, and that the result (which also depends on the extent to which features at the onset and offset match (Nakajima, Sasaki et al. 2004)) is a perceptual event, as defined above. Since listeners reliably reported the illusory event even though they were not trying to hear it out, these experiments provide evidence for obligatory grouping. Another typical example of this class of obligatory grouping is the mistuned partial phenomenon. When one partial of a complex harmonic tone is mistuned, listeners perceive two concurrent sounds: a complex tone and a pure tone, the latter corresponding to the mistuned partial (Moore, Glasberg et al. 1986). However, not all features trigger concurrent grouping; e.g. common interaural time differences between a subset of frequency components within a single sound event do not generate a similar segregation of component subsets (Culling and Summerfield 1995).
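The mistuned-partial stimulus is straightforward to synthesize. The sketch below (Python/NumPy) is illustrative: the fundamental, number of harmonics, and 8% mistuning are our own choices, not the exact parameters of Moore, Glasberg et al. (1986):

```python
import numpy as np

def complex_tone(f0=200.0, n_harmonics=10, mistuned=None, shift=0.08,
                 duration_s=0.5, fs=16000):
    """Harmonic complex with one optionally mistuned partial.

    `mistuned` is the 1-based harmonic number to shift upward by the
    fraction `shift` (0.08 = 8%). With sufficient mistuning, listeners
    report a complex tone plus a separate pure tone at the shifted
    partial's frequency.
    """
    t = np.arange(int(duration_s * fs)) / fs
    signal = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = k * f0
        if k == mistuned:
            f *= 1.0 + shift
        signal += np.sin(2 * np.pi * f * t)
    return signal / n_harmonics  # normalise to stay within [-1, 1]

in_tune = complex_tone()
shifted = complex_tone(mistuned=3)  # 3rd harmonic mistuned by 8%
```

Listening to `in_tune` and `shifted` back to back is an easy informal demonstration of the concurrent-grouping effect described above.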
In contrast to concurrent grouping, sequential grouping is necessarily based on some representation of the preceding sounds. Most studies of this class of grouping have used sequences of discrete sound events and asked two main questions: a) How do the various stimulus parameters affect sequential grouping of sound events? and b) What are the temporal dynamics of this grouping process? (For reviews, see Carlyon 2004, Haykin and Chen 2005, Snyder and Alain 2007, Ciocca 2008, Shamma, Elhilali et al. 2011.) In the most widely used stimulus paradigm (termed the auditory streaming paradigm), sequences of the structure ABA- (where A and B denote two sounds, typically tones, differing in some auditory feature(s), and '-' stands for a silent interval) are presented to listeners (van Noorden 1975). When the feature separation between A and B is small and/or they are delivered at a slow pace, listeners predominantly hear a single coherent stream with a galloping rhythm (termed the integrated percept). With a large separation between the two sounds and/or fast presentation rates, they most often experience the sequence in terms of two separate streams, one consisting only of the A tones and the other of the B tones, each stream having its own isochronous rhythm (termed the segregated percept). Throughout most of the feature-separation/presentation-rate space there is a trade-off between the two cues: smaller feature separation can be compensated for by a higher presentation rate, and vice versa (van Noorden 1975). Differences in various auditory features, including frequency, pitch, loudness, location, timbre, and amplitude modulation, have been shown to support auditory stream segregation (Vliegen and Oxenham 1999, Grimault, Bacon et al. 2002, Roberts, Glasberg et al. 2002). Thus it appears that sequential grouping is based on perceptual similarity, rather than on specific low-level auditory features (Moore and Gockel 2002, Moore and Gockel 2012). As for the timing of the sounds, it has been shown that the critical parameter is the silent interval between consecutive tones of the same set (the within-stream inter-stimulus interval) (Bregman, Ahad et al. 2000); however, see Bee and Klump (2005) for a counter view. Temporal structure has also been suggested as a key factor in segregating streams, either by guiding attentive grouping processes (Jones 1976, Jones, Kidd et al.
1981) or through temporal coherence between elements of the auditory input (Elhilali, Ma et al. 2009). Finally, contextual effects, such as the presence of additional sounds or attentional set, can bias the final perceptual outcome, suggesting that the second-stage processes of competition consider all possible alternative groupings (Bregman 1990, Winkler, Sussman et al. 2003). In summary, sequential grouping effects generally conform to the Gestalt principles of similarity/good continuation and common fate.

3.4. The competition/selection stage: Multi-stability in auditory streaming

Although the results of many experiments have painted a picture consistent with Bregman’s assumptions (e.g. Cusack, Deeks et al. 2004, Snyder, Alain et al. 2006), other results appear to be at odds with the notions that the auditory system a) always starts from the integrated organization, and b) eventually reaches a stable final perception. When listeners are presented with sequences a few minutes in duration and are asked to report their perception in a continuous manner, it has been found that perception fluctuates between alternative organizations in all listeners and with all of the combinations of stimulus parameters tested (Anstis and Saida 1985, Roberts, Glasberg et al. 2002, Denham and Winkler 2006, Pressnitzer and Hupe 2006, Schadwinkel and Gutschalk 2011, Denham, Gymesi et al. 2012). Thus the perception of ABA- sequences appears to be multi-stable (Schwartz, Grimault et al. 2012), similar to some other auditory (Wessel 1979) and visual stimulus configurations (e.g. Leopold and Logothetis 1999; Alais and Blake, this volume). Furthermore, segregated and integrated percepts are not the only ones that listeners experience in response to ABA- sequences (Bendixen, Denham et al. 2010, Bendixen, Bőhm et al. 2012, Bőhm, Shestopalova et al. 2012, Denham, Gymesi et al. 2012, Szalárdy, Bendixen et al.
2012), and, with stimulus parameters strongly promoting the segregated organization, participants often report segregation first (Deike, Heil et al. 2012, Denham, Gymesi et al. 2012). It has also been found that the first experienced perceptual organization is more strongly determined by stimulus parameters than those experienced later (Denham, Gymesi et al. 2012). Finally, higher-order cues, such as regularities embedded separately within the A and B streams, extend the duration of the phases (continuous intervals with the same percept) during which listeners experience the segregated percept, while they do not affect the duration of the phases of the integrated percept (Bendixen, Denham et al. 2010). This suggests that predictability (closure in terms of the Gestalt principles) also plays into the competition between alternative sound organizations, although differently from cues based on the rate of perceptual change (similarity/good continuation and common fate). Closure in auditory perceptual organization may therefore be seen to resonate with Koffka’s early intuition, acting not so much as a low-level grouping cue but rather as something that helps to determine the final perceptual form (Wagemans, Elder et al. 2012). Just as closure in vision allows the transformation of a 1D contour into a 2D shape (Elder and Zucker 1993), so the discovery of a predictable temporal pattern transforms a series of unrelated sounds into a distinctive motif.

In contrast to the laboratory findings of multistable perception, everyday experience tells us that we perceive the world in a stable, continuous manner. We may find that initially we are not able to distinguish individual sound sources when suddenly confronted with a new auditory scene, such as entering a noisy classroom or stepping out onto a busy street. But generally within a few seconds we are able to differentiate them, especially sounds that are relevant to our task. This experience is well captured by Bregman’s assumptions of initial integration and subsequent settling on a stable segregated organization. In support of these assumptions, when averaging over the reports of different listeners, it is generally found that within the initial 5-15 s of an ABA- sequence the probability of reporting segregation monotonically increases (termed the build-up of auditory streaming) (but see Deike, Heil et al. (2012)), and the occurrence of a break during this early period, or directing attention away from the sounds, causes a reset, i.e. a return to integration followed by a gradual increase in the likelihood of segregation (Cusack, Deeks et al. 2004).
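The ABA- stimulus of the auditory streaming paradigm described above is easy to construct. The sketch below (Python/NumPy) uses illustrative parameter values (tone duration, pacing, a 7-semitone separation) rather than those of any particular study, and treats the '-' slot simply as a silence the length of one tone:

```python
import numpy as np

def aba_sequence(f_a=440.0, df_semitones=7, tone_ms=50, gap_ms=75,
                 n_triplets=20, fs=16000):
    """ABA- triplet sequence from the auditory streaming paradigm.

    A and B are pure tones separated by `df_semitones`; the '-' slot is
    a silence the length of one tone. Small separations and slow pacing
    favour the integrated (galloping) percept; large separations and
    fast pacing favour segregation.
    """
    def tone(freq):
        t = np.arange(int(tone_ms / 1000 * fs)) / fs
        return np.sin(2 * np.pi * freq * t)

    f_b = f_a * 2 ** (df_semitones / 12)
    gap = np.zeros(int(gap_ms / 1000 * fs))
    silence = np.zeros_like(tone(f_a))  # the '-' slot
    triplet = np.concatenate([tone(f_a), gap, tone(f_b), gap,
                              tone(f_a), gap, silence, gap])
    return np.tile(triplet, n_triplets)

sequence = aba_sequence()
```

Varying `df_semitones` and `gap_ms` and listening to the result reproduces the trade-off between feature separation and presentation rate described above.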
So, should we disregard the perceptual multistability observed in the auditory streaming paradigm as simply a consequence of the artificial stimulation protocol used? We suggest not. Illusions and artificially constructed stimulus configurations have played an important role in the study of perception (e.g., as the main method of Gestalt psychology), because they provide insights into the machinery of perception. In the following, we provide a description of auditory perceptual organization based on insights gained from multistable phenomena. Winkler, Denham et al. (2012) suggested that one should consider sound organization in the brain in terms of the continuous discovery of proto-objects (alternative groupings) and on-going competition between them. Continuous discovery and competition are well suited to the everyday demands on auditory perceptual organization in a changing world. Proto-objects (Rensink 2000) are the candidate set of representations that have the potential to emerge as the perceptual objects of conscious awareness (Mill, Bőhm et al. 2013). Within this framework, proto-objects represent patterns which have been discovered embedded within the incoming sequence of sounds; they are constructed by linking sound events and recognising when a previously discovered sequence recurs and can thus be used to predict future events. In a new sound scene, the proto-object that is easiest to discover determines the initial percept. Since the time needed for discovering a proto-object depends largely on the stimulus parameters (i.e., to what extent successive sound events satisfy/violate the similarity/good continuation principle), the first percept strongly depends on stimulus parameters. However, the duration of the first perceptual phase is independent of the percept (Hupe and Pressnitzer 2012), since it depends on how long it takes for other proto-objects to be discovered (Winkler, Denham et al. 2012). 
Once alternative organizations have been discovered they start competing with each other. Competition between organizations is dynamic both because proto-objects are discovered on the fly, and may come and go, and because their strength, which determines which of them becomes dominant at a given time, is probably affected by dynamic factors, such as how often they successfully predict upcoming sound events (cf., predictive coding theories (Friston 2005) and Bregman’s “old+new” heuristic (Bregman 1990)), adaptation, and noise (Mill, Bőhm et al. 2013). The
latter two influences are also often assumed in computational models of bi-stable visual perceptual phenomena, e.g. (Shpiro, Moreno-Bote et al. 2009, van Ee 2009); adaptation ensures the observed inevitability of perceptual switching (the dominant percept cannot remain dominant forever), and noise accounts for the observed stochasticity in perceptual switching (successive phase durations are largely uncorrelated, and the distribution of phase durations resembles a gamma distribution) (Levelt 1968, Leopold and Logothetis 1999). Generalising the two-stage account of perceptual organization proposed by Bregman (1990) to two concurrent stages which operate continuously and in parallel, the first consisting of the discovery of predictive representations (proto-objects), and the second of competition for dominance between them, results in a theoretical and computational framework that explains a wide set of experimental findings (Winkler, Denham et al. 2012, Mill, Bőhm et al. 2013). For example, perceptual switching, first phase choice and duration, and differences between the first and subsequent perceptual phases can all be explained within this framework. It also accounts for the different influences of similarity and closure on perception; the rate of perceptual change (similarity/good continuation) determines how easy it is to form links between the events that make up a proto-object, while predictability (closure) does not affect the discovery of proto-objects, but can increase the competitiveness (salience) of a proto-object once it has been discovered (Bendixen, Denham et al. 2010).

3.5. Perceptual organization.

Up to this point we have used the term ‘sound organization’ in a general sense. Now we consider it in a narrower sense. The two sound organizations most commonly (but not exclusively) appearing in the ABA- paradigm are integration and segregation.
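The joint role of adaptation and noise in the competition described above can be illustrated with a minimal simulation. This is only a sketch in the spirit of models such as Mill, Bőhm et al. (2013), not the published model: the function name, the hysteresis rule, and all parameter values are illustrative assumptions.

```python
import random

def simulate_switching(steps=20000, tau_a=400.0, sigma=0.1, h=0.3, seed=1):
    """Two proto-objects compete; the dominant one slowly adapts while the
    suppressed one recovers, and noise makes switch times stochastic.
    Returns the list of dominance ("perceptual phase") durations."""
    rng = random.Random(seed)
    a = [0.0, 0.0]          # adaptation state of each proto-object
    dominant, start = 0, 0
    phases = []
    for t in range(steps):
        # effective strength: fixed input minus adaptation plus noise
        s = [1.0 - a[i] + rng.gauss(0.0, sigma) for i in (0, 1)]
        other = 1 - dominant
        # hysteresis h: the challenger must clearly exceed the dominant one
        if s[other] > s[dominant] + h:
            phases.append(t - start)
            dominant, start = other, t
        # the dominant proto-object adapts; the suppressed one recovers
        for i in (0, 1):
            target = 1.0 if i == dominant else 0.0
            a[i] += (target - a[i]) / tau_a
    return phases

durations = simulate_switching()
```

Running the sketch yields many dominance durations of irregular length: adaptation makes switching inevitable, while the noise term makes successive durations vary rather than settle on a fixed period, in line with the stochasticity of perceptual switching noted above.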
Whereas the integrated percept is fully specified, there are, in fact, two possible segregated percepts: one may hear the A sounds in the foreground and the B’s in the background, or vice versa. It is comparatively easy to switch between these two variants of the segregated percept (since we are aware of both of them at the same time), while it is more difficult to voluntarily switch between segregation and integration (as we are not simultaneously aware of both of these organizations; i.e., we don’t hear the integrated galloping rhythm while we experience the sequence in terms of two streams). Thus some sound organizations, each corresponding to a different set of possible perceptual experiences, are, in Bregman’s terms, compatible with each other, whereas others are perceptually mutually exclusive. What determines compatibility? Winkler, Denham et al. (2012) suggested that two (or more) proto-objects are compatible if they never predict the same sound event (i.e., they have no common element; cf. the Gestalt principle of disjoint allocation), and considered three possible ways in which competition may be implemented in order to account for perceptual experience. The first possibility they considered is that compatibility is explicitly extracted and organizations are formed during the first processing stage. This leads to the assumption of hierarchical competition: one between organizations, and another within each organization that includes multiple proto-objects. The second possibility is a foreground-background solution. In this case all proto-objects compete directly with each other and, once a dominant one emerges, all remaining sounds are grouped together into a background representation. Results showing no clear separation of sounds in the background are compatible with this solution (Brochard, Drake et al. 1999, Sussman, Bregman et al. 2005).
However, other studies suggest that the background is not always undifferentiated (Winkler, Teder-Salejarvi et al. 2003). A third possibility is that proto-objects only compete with each other when they predict the same sound event (collide). In this case organizations emerge because of the simultaneous dominance of proto-objects that never collide with each other, and their alternation with other compatible sets with which they do collide; i.e. when one proto-object becomes dominant in the on-going competition, others with which it doesn’t collide will also become strong, while all proto-objects with which this set does collide are suppressed. Noise and adaptation ensure that at some point a switch will occur to one of the suppressed proto-objects and the cycle will
continue. A computational model that demonstrates the viability of this solution for modelling perceptual experience in the ABA- paradigm has recently been developed (Mill, Bőhm et al. 2013). The assumption that the perceptual organization of sounds is based on continuous competition between predictive proto-objects leads to a system that is flexible, because alternative proto-objects are available all the time, ready to emerge into perceptual awareness when they prove to be the best predictors of the auditory input. The system is also stable and robust, because it does not need to reassess all of its representations with the arrival of a new sound source in the scene, or in the event of temporary disturbances (such as a short loss of input, or during attentional switching between objects).

4. Neural Correlates of Perceptual Organization

We turn now to consider what has been learnt from neurophysiological studies of auditory perceptual organization. Neural responses to individual sounds are profoundly influenced by the context in which they appear (Bar-Yosef, Rotman et al. 2002). The question is to what extent the contextual influences on neural responses reflect the current state of perceptual organization. This question has been addressed by a number of studies ranging in focus from the single neuron level to large scale brain responses, and the results provide important clues about the processing strategies adopted by the auditory system.

4.1. Stimulus specific adaptation and differential suppression.

Context dependent responses at the single neuron level have been probed using repetitive sequences of tones within which occasional deviant tones (with a different frequency) are inserted. Under these circumstances many neurons in cortex (Ulanovsky, Las et al. 2003), thalamus (Anderson, Christianson et al. 2009) and inferior colliculus (Malmierca, Cristaudo et al. 2009) show stimulus specific adaptation (SSA), i.e.
the response to a frequently recurring ‘standard’ tone diminishes, while the response to a ‘deviant’ tone is relatively enhanced. Furthermore, this preferential response is not solely a function of the low probability of the deviant sounds but also reflects their novelty; i.e., the extent to which they violate a previously established pattern (Taaseh, Yaron et al. 2011). This property of deviance detection is important in that it signals to the brain, by increased neural activity, that something new has occurred, such as the start of a new sound source. Thus SSA may indicate the presence of a primitive novelty detector in the brain. Single neuron responses to alternating tone sequences as used in the auditory streaming paradigm have also been investigated (Fishman, Arezzo et al. 2004, Bee and Klump 2005, Micheyl, Tian et al. 2005, Micheyl, Carlyon et al. 2007). It was found that even when the neuron initially responds to both tones at the start of the stimulus train, with time the response to one of the tones (typically the one corresponding to the best frequency of the cell) remains relatively strong, while the response to the other tone diminishes; an effect termed differential suppression. Although no behavioural tests were conducted in these experiments, it was claimed that differential suppression was a neural correlate of perceptual segregation (Fishman, Arezzo et al. 2004). This claim was supported by showing that neuronal sensitivity to frequency difference and presentation rate was consistent with the classical van Noorden (1975) parameter space, and that spike counts from neurons in primary auditory cortex could predict an integration/segregation decision closely matching the results of perceptual studies in humans (Micheyl, Tian et al. 2005, Bee, Micheyl et al. 2010).
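The adaptation component of SSA described above can be sketched as per-frequency fatigue that builds with repetition and recovers passively. This toy illustration is an assumption of ours, not a published model, and it captures only the adaptation part of SSA (not the genuine deviance detection reported by Taaseh, Yaron et al. 2011); the function name and all parameter values are hypothetical.

```python
def ssa_responses(tones, gain=0.4, depth=0.9, decay=0.98):
    """Toy per-frequency adaptation: responses to a repeated 'standard'
    diminish, while a rare 'deviant' frequency evokes a relatively
    enhanced response. All parameters are illustrative."""
    adapt = {}                                  # adaptation state per frequency
    responses = []
    for f in tones:
        responses.append(1.0 - adapt.get(f, 0.0))          # adapted response
        adapt = {k: v * decay for k, v in adapt.items()}   # passive recovery
        adapt[f] = adapt.get(f, 0.0) + gain * (depth - adapt.get(f, 0.0))
    return responses

# an oddball sequence: nine 'standard' A tones followed by a 'deviant' B
r = ssa_responses(["A"] * 9 + ["B"])
# the deviant evokes a larger response than the adapted standard
```

In this sketch the enhanced deviant response arises purely because the deviant channel has not adapted; distinguishing true novelty detection from such adaptation requires control sequences that equate stimulus probability, as in the study cited above.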
The differential suppression account of auditory streaming is based on the idea that by default everything is grouped together but with time some part of primary auditory cortex comes to respond to one of the tone streams, while some other part responds to the other tone stream, and the time taken for these clusters to form and the degree to which they can be separated corresponds to the time-varying, stimulus-dependent probability of segregation. However, this account is challenged by two findings. Firstly, it suggests a
fixed perceptual decision and offers no explanation for the multistability of streaming described in the previous section. Secondly, it has been shown that while a similar distinct clustering of neural responses can be found when the A and B tones are overlapping in time, in this case, listeners report hearing an integrated pattern (Elhilali, Ma et al. 2009). So, while differential suppression may be necessary, it is not a sufficient condition for segregation.

4.2. Event-related potential correlates of sound organization.

Auditory event-related brain potentials (AERPs) represent the synchronized activity of large neuronal populations, time-locked to some auditory event. Because they can be recorded noninvasively from the human scalp, one can use them to study the brain responses accompanying perceptual phenomena, such as auditory stream segregation. An AERP correlate of concurrent sound organization is found when a partial of a complex tone is mistuned, giving rise to the perception of two concurrent sounds (see The grouping stage section); a negative wave peaking at about 180 milliseconds after stimulus onset, whose amplitude increases with the degree of mistuning, is elicited (Alain, Arnott et al. 2001). This AERP component, termed the ‘object-related negativity’ (ORN), is proposed to signal the automatic segregation of concurrent auditory objects (Alain, Schuler et al. 2002). An AERP correlate of sequential sound organization was found in an experiment showing that the amplitudes of two early sensory AERP components, the auditory P1 and N1, vary depending on whether the same sounds are perceived as part of an integrated or segregated organization (Gutschalk, Micheyl et al. 2005, Szalárdy, Bőhm et al. 2013). Another electrophysiological measure that has been extensively used to probe sequential perceptual organisation is the Mismatch Negativity (MMN); for recent reviews see (Winkler 2007, Näätänen, Kujala et al. 2011).
MMN is elicited by sounds that violate some regular auditory feature of the preceding sound sequence; therefore, it can be used to probe what auditory regularities are encoded in the brain. By setting up stimulus configurations which result in different regularities depending on how the sounds are organized, MMN can be used as an indirect index of auditory stream segregation. The first studies using MMN in this way (Sussman, Ritter et al. 1999, Nager, Teder-Sälejärvi et al. 2003, Winkler, Sussman et al. 2003) showed that the elicitation of MMN can be made dependent on sound organization, and furthermore, that MMN is only elicited by violations of regularities characterising the stream to which a sound belongs, but not by violating the regularities of some other parallel sound stream (Ritter, Sussman et al. 2000, Winkler, van Zuijen et al. 2006). These observations allowed the study of some issues not easily accessible to behavioural methods. Here we highlight three important questions: interactions between concurrent and sequential perceptual organisation, evidence for the existence of two stages in sound organization, and the role of attention in forming and maintaining auditory stream segregation. In a study delivering sequences of harmonic complexes in which the probability of a mistuned component was manipulated, it was found that the ORN was reliably elicited by mistuning in all conditions, but its magnitude increased with decreasing probability of occurrence (Bendixen, Jones et al. 2010). This was interpreted as being a heightened response towards the onset of a possible new auditory object. The additional finding that a positive AERP component, the P3a, usually associated with involuntary attentional switching (Escera, Alho et al. 
2000), was elicited by mistuned sounds in the low mistuning probability condition but not by tuned sounds in the high mistuning probability condition, suggested that the auditory system is primarily interested in the onset of new sound sources rather than their disappearance (Dyson and Alain 2008, Bendixen, Jones et al. 2010); a view further supported by results obtained in a different behavioural paradigm (Cervantes Constantino, Pinggera et al. 2012). It has been shown that the early (<100 ms) AERP correlates of auditory stream segregation, the P1 and N1 components, are governed by the acoustic parameters (Winkler, Takegata et al. 2005,
Snyder, Alain et al. 2006), whereas later (>120 ms) responses (N2) correlate with perceptual experience (Winkler, Takegata et al. 2005, Szalárdy, Bőhm et al. 2013). Furthermore, the amplitude of the later AERP response correlates with the probability of reporting segregation (the build-up of streams) and it is augmented by attention (Snyder, Alain et al. 2006). These results suggest that the initial grouping, which precedes temporal integration between sound events (Yabe, Winkler et al. 2001, Sussman 2005), is mainly stimulus-driven, whereas later occurring perceptual decisions are susceptible to top-down modulation, a view compatible with Bregman’s theoretical framework. Whereas most accounts of auditory streaming assume that perceptual similarity affects grouping through automatic processes, Jones, Maser et al. (1978) suggested that segregation results from a failure to rapidly shift attention between perceptually dissimilar items in a sequence. The literature is divided on the role of attention in auditory stream segregation. Some electrophysiological studies suggest that auditory stream segregation can occur in the absence of focused attention (Winkler, Sussman et al. 2003, Winkler, Teder-Salejarvi et al. 2003, Sussman, Horváth et al. 2007). In contrast, results of some behavioural and AERP studies suggest that attention may be needed at least for the initial formation of streams (Cusack, Deeks et al. 2004, Snyder, Alain et al. 2006); however, see (Sussman, Horváth et al. 2007). How can attention affect sound organization? Snyder, Gregg et al. (2012) argue for an attentional ‘gain model’ in which the representation of attended sounds is enhanced, while unattended ones are suppressed. Due to the short latency of the observed gain modulation, they suggested that attention operates both on the group formation phase of segregation and on the later selection phase (Bregman 1990).
However, attention can also have other effects on sound organization: attention can retune and sharpen representations in order to improve the segregation of signals from noise (Ahveninen, Hamalainen et al. 2011); attention to a stream improves the phase locking of neural responses to the attended sounds (Elhilali, Xiang et al. 2009); attention allows the utilization of learned (non-primitive) grouping algorithms, thus providing additional processing capacities (Lavie, Hirst et al. 2004); and attention can bias the competition between alternative sound organizations (as found in the visual system (Desimone 1998)). Which of these are most relevant to auditory perceptual organization has yet to be established.

4.3. The neuroscience view of auditory objects.

“... in neuroscientific terms, the concepts of an object and of object analysis can be regarded as inseparable” (Griffiths and Warren, 2004, p. 887). Thus, neuroscientific descriptions of auditory perceptual objects focus on the processes involved in forming and maintaining object representations. The detection and representation of regularities by the brain, as indexed by the MMN, has been used to establish a functional definition of an auditory object (Winkler, Denham et al. 2009). Using evidence from a series of MMN studies, Winkler, Denham et al. (2009) proposed that an auditory object is a perceptual representation of a possible sound source, derived from regularities in the sensory input (Näätänen, Tervaniemi et al. 2001), that has temporal persistence (Winkler and Cowan 2005) and can link events separated in time (Näätänen and Winkler 1999). This representation forms a separable unit (Winkler, van Zuijen et al. 2006) that generalises across natural variations in the sounds (Winkler, Teder-Salejarvi et al. 2003) and generates expectations of parts of the object not yet available (Bendixen, Schröger et al. 2009).
Evidence for the representation of auditory objects in cortex, consistent with this definition, is found in fMRI (Schadwinkel and Gutschalk 2011), and in MEG and multi-electrode surface recording studies of people listening to two competing talkers (Ding and Simon 2012, Mesgarani and Chang 2012). By decoding MEG signals correlated with the amplitude fluctuations of each of the speech signals it was shown that the brain preferentially locks onto the temporal patterns of the attended talker, and that this representation adapts to the sound level of the attended talker and not the interfering one (Ding and Simon 2012). Multi-electrode recordings in non-primary auditory cortex similarly show
that the brain locks onto critical features in the attended speech stream, and furthermore that a simple classifier built from a set of linear filters can be used to decode both the attended speaker and the words being uttered (Mesgarani and Chang 2012). Experiments showing that context-dependent predictive activity in the hippocampus encoded temporal relationships between events and correlated with subsequent recall of episodes (Paz, Gelbard-Sagiv et al. 2010) suggest that the hippocampus may also be involved, although this work used multisensory cinematic material, so it is not clear whether the findings hold for sounds alone. While traditional psychological accounts implicitly or explicitly refer to representations of objects, there are models of auditory streaming, and of perception in general, which are not concerned with positing a representation that would directly correspond to the contents of conscious perception; we have already referred to two such theories. Although hierarchical predictive coding, e.g. (Friston and Kiebel 2009), includes predictive memory representations, which are in many ways compatible with the notion of auditory object representations (Winkler and Czigler 2012), no explicit connection with object representations is made. Indeed, whereas predictive coding models have been successful in matching the statistics of perceptual decisions (Lee and Mumford 2003, Aoyama, Endo et al. 2006, Yu 2007, Garrido, Kilner et al. 2009, Daunizeau, den Ouden et al. 2010), they are better suited to describing the neural responses observed during perception (Grill-Spector, Henson et al. 2006) than perceptual experience per se. Shamma and colleagues’ temporal coherence model of auditory stream segregation (Elhilali and Shamma 2008, Elhilali, Ma et al.
2011) provides another way to avoid the assumption that object representations are necessary for sound organization; instead, it is proposed that objects are essentially whatever occupies the perceptual foreground, and that they exist only insofar as they do occupy the foreground. Temporal coherence can be calculated using relatively short time windows without building a description of the past stimulation. Thus auditory streams can be separated in a single pass. It is also claimed that object formation (binding) occurs late, i.e., the composite multi-featured percept of conscious awareness is formed through selective attention to some feature, which causes all features correlated with the attended feature to emerge together into perceptual awareness (and thus form a perceptual object), while the background remains undifferentiated (Shamma, Elhilali et al. 2011). In summary, there is currently little consensus on the role of auditory object representations in perceptual organization, and the importance placed on object representations by the various models differs markedly.

5. Conclusions and Future Directions

The Gestalt principles and their application to auditory perception instantiated in Bregman’s two-stage auditory scene analysis framework have provided the impetus and initial basis for understanding auditory perceptual organization. Recent proposals have extended this framework in interesting ways. Specifically, a more precise definition of auditory objects (Winkler, Denham et al. 2009) and an explanation for how perceptual organization can emerge through parallel processes of construction and competition (Winkler, Denham et al. 2012, Mill, Bőhm et al. 2013) have been formed by integrating Gestalt ideas (Köhler 1947, Bregman 1990) with the notion of perception as a ratiomorphic (Brunswik 1955) inference process (Helmholtz 1885, Gregory 1980, Friston 2005).
One key idea has been to show that perceptual object representations form plausible candidates for the generative models assumed by predictive coding theories (Winkler and Czigler 2012). The construction of proto-objects on the basis of pattern detection (closure) is well supported by recent experiments showing that people can detect regularities very quickly (Teki, Chait et al. 2011). As discussed above, the general approach of predictive coding (Friston 2005) and predictive auditory object representation (Winkler, Denham et al. 2009) are compatible (Winkler and Czigler 2012) although they have somewhat different aims. However, as of yet, there have been few attempts to
face up to the complexity of real auditory scenes in which grouping and categorization cues are not immediately available; but see (Yildiz and Kiebel 2011). Progress may come from building bridges between competing theories. The instantiation of the principle of common fate in the form of temporal coherence (Shamma, Elhilali et al. 2011) suggests a basis for linking features, and possibly events, within a proto-object. Due to its generic nature, temporal coherence as a cue is not limited to discrete well-defined sound events and can thus help to generalise models that rely on such events. The suggestion of a hierarchical decomposition of the sound world into objects, some of which are differentiated by attention and task demands while others remain rather more amorphous (Cusack and Carlyon 2003), can also be accommodated within the framework of predictive object representations. The patterns or regularities encoded by proto-objects represent distributions over featural and temporal structures. Thus it is entirely feasible for some proto-objects to represent well-differentiated and separated patterns, such as the voice of the person to whom one is talking, while others may represent the undifferentiated combination of background sounds, such as the background babble at a cocktail party (Cherry 1953). Finally, methods for decomposing complex sounds and finding events in long continuous sounds (Coath and Denham 2007, Yildiz and Kiebel 2011) may feed into models concerned with grouping events into auditory object representations. We started out by highlighting the two questions that the auditory system needs to answer: ‘What is out there?’ and ‘What will it do next?’. In this chapter, we outlined the main approaches currently being pursued to provide insights into how the human auditory system answers these questions quickly and accurately under a variety of conditions, which can dramatically affect the cues that are available.
We suggest that, in order to deliver robust performance within a changing world, the human brain builds auditory object representations that are predictive of upcoming events, and uses these in the formation of perceptual organizations that represent its interpretation of the world. Flexible switching between candidate organizations ensures that the system can explore alternative interpretations, and revise its perceptual decisions in the light of further information. However, there is much that remains to be understood, and current models are far from matching the capabilities of human auditory perception. Perhaps, as outlined above, convergence between the alternative approaches will provide a more satisfactory account of the processes underlying auditory perceptual organization.

6. Acknowledgements

This work was supported in part by the Lendület project awarded to István Winkler by the Hungarian Academy of Sciences (contract number LP2012-36/2012).
7. References Ahveninen, J., M. Hamalainen, I. P. Jaaskelainen, S. P. Ahlfors, S. Huang, F. H. Lin, T. Raij, M. Sams, C. E. Vasios and J. W. Belliveau (2011). "Attention-driven auditory cortex short-term plasticity helps segregate relevant sounds from noise." Proc Natl Acad Sci U S A 108(10): 4182-4187. Alain, C., S. R. Arnott and T. W. Picton (2001). "Bottom-up and top-down influences on auditory scene analysis: evidence from event-related brain potentials." J Exp Psychol Hum Percept Perform 27(5): 1072-1089. Alain, C., B. M. Schuler and K. L. McDonald (2002). "Neural activity associated with distinguishing concurrent auditory objects." J Acoust Soc Am 111(2): 990-995. Alais, D. and R. Blake ((this volume)). Multistability and binocular rivalry The Oxford Handbook of Perceptual Organization. J. Wagemans. Oxford, Oxford University Press. Anderson, L. A., G. B. Christianson and J. F. Linden (2009). "Stimulus-specific adaptation occurs in the auditory thalamus." J Neurosci 29(22): 7359-7363. Anstis, S. and S. Saida (1985). "Adaptation to auditory streaming of frequency-modulated tones." Journal of Experimental Psychology: Human Perception and Performance 11: 257-271. Aoyama, A., H. Endo, S. Honda and T. Takeda (2006). "Modulation of early auditory processing by visually based sound prediction." Brain Res 1068(1): 194-204. Bar-Yosef, O., Y. Rotman and I. Nelken (2002). "Responses of neurons in cat primary auditory cortex to bird chirps: effects of temporal and spectral context." J Neurosci 22(19): 8619-8632. Bar, M. (2007). "The proactive brain: using analogies and associations to generate predictions." Trends Cogn Sci 11(7): 280-289. Bartlett, F. C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge, Cambridge University Press. Bee, M. A. and G. M. Klump (2005). "Auditory stream segregation in the songbird forebrain: effects of time intervals on responses to interleaved tone sequences." Brain Behav Evol 66(3): 197-214. Bee, M. A., C. 
Micheyl, A. J. Oxenham and G. M. Klump (2010). "Neural adaptation to tone sequences in the songbird forebrain: patterns, determinants, and relation to the build-up of auditory streaming." J Comp Physiol A Neuroethol Sens Neural Behav Physiol 196(8): 543-557. Bendixen, A., T. M. Bőhm, O. Szalárdy, R. Mill, S. L. Denham and I. Winkler (2012). "Different roles of similarity and predictability in auditory stream segregation." J. Learning & Perception in press. Bendixen, A., S. L. Denham, K. Gyimesi and I. Winkler (2010). "Regular patterns stabilize auditory streams." J Acoust Soc Am 128(6): 3658-3666. Bendixen, A., S. J. Jones, G. Klump and I. Winkler (2010). "Probability dependence and functional separation of the object-related and mismatch negativity event-related potential components." Neuroimage 50(1): 285-290. Bendixen, A., E. Schröger and I. Winkler (2009). "I heard that coming: event-related potential evidence for stimulus-driven prediction in the auditory system." J Neurosci 29(26): 8447-8451. Bertrand, O. and C. Tallon-Baudry (2000). "Oscillatory gamma activity in humans: a possible role for object representation." Int J Psychophysiol 38(3): 211-223. Bőhm, T. M., L. Shestopalova, A. Bendixen, A. G. Andreou, J. Georgiou, G. Garreau, P. Pouliquen, A. Cassidy, S. L. Denham and I. Winkler (2012). "Spatial location of sound sources biases auditory stream segregation but their motion does not." J. Learning & Perception in press Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA, MIT Press. Bregman, A. S., P. A. Ahad, P. A. Crum and J. O'Reilly (2000). "Effects of time intervals and tone durations on auditory stream segregation." Percept Psychophys 62(3): 626-636. Bregman, A. S. and G. Dannenbring (1973). "The effect of continuity on auditory stream segregation." Percept. Psychophys. 13: 308-312.
Brochard, R., C. Drake, M. C. Botte and S. McAdams (1999). "Perceptual organization of complex auditory sequences: effect of number of simultaneous subsequences and frequency separation." J Exp Psychol Hum Percept Perform 25(6): 1742-1759. Brunswik, E. (1955). "Representative design and probabilistic theory in a functional psychology." Psychological Review 62(3): 193-217. Carlyon, R. P. (2004). "How the brain separates sounds." Trends Cogn Sci 8(10): 465-471. Cervantes Constantino, F., L. Pinggera, S. Paranamana, M. Kashino and M. Chait (2012). "Detection of appearing and disappearing objects in complex acoustic scenes." PLoS One 7(9): e46167. Cherry, E. C. (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears." Journal of Acoustic Society of America 25 (5): 975–979. Ciocca, V. (2008). "The auditory organization of complex sounds." Front Biosci 13: 148-169. Coath, M. and S. L. Denham (2007). "The role of transients in auditory processing." Biosystems 89(1-3): 182-189. Culling, J. F. and Q. Summerfield (1995). "Perceptual separation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay." J Acoust Soc Am 98(2 Pt 1): 785-797. Cusack, R. and R. P. Carlyon (2003). "Perceptual asymmetries in audition." J Exp Psychol Hum Percept Perform 29(3): 713-725. Cusack, R., J. Deeks, G. Aikman and R. P. Carlyon (2004). "Effects of location, frequency region, and time course of selective attention on auditory scene analysis." J Exp Psychol Hum Percept Perform 30(4): 643-656. Darwin, C. J. and R. P. Carlyon (1995). Auditory grouping. The handbook of perception and cognition: Hearing. B. C. J. Moore. London, Academic Press. 6: 387-424. Darwin, C. J., R. W. Hukin and B. Y. al-Khatib (1995). "Grouping in pitch perception: evidence for sequential constraints." J Acoust Soc Am 98(2 Pt 1): 880-885. Darwin, C. J. and G. J. Sandell (1995). 
"Absence of effect of coherent frequency modulation on grouping a mistuned harmonic with a vowel." J Acoust Soc Am 97(5 Pt 1): 3135-3138. Daunizeau, J., H. E. den Ouden, M. Pessiglione, S. J. Kiebel, K. E. Stephan and K. J. Friston (2010). "Observing the observer (I): meta-Bayesian models of learning and decision-making." PLoS One 5(12): e15554. Deike, S., P. Heil, M. Böckmann-Barthel and A. Brechmann (2012). "The build-up of auditory stream segregation: a different perspective." Frontiers in Psychology 3. Denham, S. L., K. Gymesi, G. Stefanics and I. Winkler (2012). "Multistability in auditory stream segregation: the role of stimulus features in perceptual organisation." Journal of Learning & Perception in press. Denham, S. L. and I. Winkler (2006). "The role of predictive models in the formation of auditory streams." J Physiol Paris 100(1-3): 154-170. Desimone, R. (1998). "Visual attention mediated by biased competition in extrastriate visual cortex." Philos Trans R Soc Lond B Biol Sci 353(1373): 1245-1255. Ding, N. and J. Z. Simon (2012). "Emergence of neural encoding of auditory objects while listening to competing speakers." Proc Natl Acad Sci U S A 109(29): 11854-11859. Dyson, B. J. and C. Alain (2008). "Is a change as good with a rest? Task-dependent effects of inter-trial contingency on concurrent sound segregation." Brain Res 1189: 135-144. Elder, J. and S. Zucker (1993). "The effect of contour closure on the rapid discrimination of two-dimensional shapes." Vision Res 33(7): 981-991. Elhilali, M., L. Ma, C. Micheyl, A. J. Oxenham and S. A. Shamma (2009). "Temporal coherence in the perceptual organization and cortical representation of auditory scenes." Neuron 61(2): 317-329. Elhilali, M. and S. A. Shamma (2008). "A cocktail party with a cortical twist: how cortical mechanisms contribute to sound segregation." J Acoust Soc Am 124(6): 3751-3771.


Elhilali, M., J. Xiang, S. A. Shamma and J. Z. Simon (2009). "Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene." PLoS Biol 7(6): e1000129.
Escera, C., K. Alho, E. Schroger and I. Winkler (2000). "Involuntary attention and distractibility as evaluated with event-related brain potentials." Audiol Neurootol 5(3-4): 151-166.
Feldman, J. (this volume). The Oxford Handbook of Perceptual Organization. J. Wagemans. Oxford, UK, Oxford University Press. 56.
Fishman, Y. I., J. C. Arezzo and M. Steinschneider (2004). "Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration." J Acoust Soc Am 116(3): 1656-1670.
Fowler, C. A. and L. D. Rosenblum (1990). "Duplex perception: a comparison of monosyllables and slamming doors." J Exp Psychol Hum Percept Perform 16(4): 742-754.
Friston, K. (2005). "A theory of cortical responses." Philos Trans R Soc Lond B Biol Sci 360(1456): 815-836.
Friston, K. and S. Kiebel (2009). "Predictive coding under the free-energy principle." Philos Trans R Soc Lond B Biol Sci 364(1521): 1211-1221.
Garrido, M. I., J. M. Kilner, K. E. Stephan and K. J. Friston (2009). "The mismatch negativity: a review of underlying mechanisms." Clin Neurophysiol 120(3): 453-463.
Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Boston, Houghton Mifflin.
Gregory, R. L. (1980). "Perceptions as hypotheses." Philos Trans R Soc Lond B Biol Sci 290(1038): 181-197.
Griffiths, T. D. and J. D. Warren (2004). "What is an auditory object?" Nat Rev Neurosci 5(11): 887-892.
Grill-Spector, K., R. Henson and A. Martin (2006). "Repetition and the brain: neural models of stimulus-specific effects." Trends Cogn Sci 10(1): 14-23.
Grimault, N., S. P. Bacon and C. Micheyl (2002). "Auditory stream segregation on the basis of amplitude-modulation rate." J Acoust Soc Am 111(3): 1340-1348.
Gutschalk, A., C. Micheyl, J. R. Melcher, A. Rupp, M. Scherg and A. J. Oxenham (2005). "Neuromagnetic correlates of streaming in human auditory cortex." J Neurosci 25(22): 5382-5388.
Haykin, S. and Z. Chen (2005). "The cocktail party problem." Neural Comput 17(9): 1875-1902.
Helmholtz, H. v. (1885). On the Sensations of Tone as a Physiological Basis for the Theory of Music. London, Longmans, Green, and Co.
Hochberg, J. (1981). Levels of perceptual organization. Perceptual Organization. M. Kubovy and J. R. Pomerantz. Hillsdale, NJ, Erlbaum: 255-278.
Hohwy, J. (2007). "Functional integration and the mind." Synthese 159: 315-328.
Hupe, J. M. and D. Pressnitzer (2012). "The initial phase of auditory and visual scene analysis." Philos Trans R Soc Lond B Biol Sci 367(1591): 942-953.
Jones, M. R. (1976). "Time, our lost dimension: Toward a new theory of perception, attention, and memory." Psychological Review 83: 323-355.
Jones, M. R., G. Kidd and R. Wetzel (1981). "Evidence for rhythmic attention." Journal of Experimental Psychology: Human Perception & Performance 7: 1059-1073.
Jones, M. R., D. J. Maser and G. R. Kidd (1978). "Rate and structure in memory for auditory patterns." Memory & Cognition 6: 246-258.
Koenderink, J. (this volume). Gestalts as ecological templates. The Oxford Handbook of Perceptual Organization. J. Wagemans. Oxford, UK, Oxford University Press.
Köhler, W. (1947). Gestalt Psychology: An Introduction to New Concepts in Modern Psychology. New York, Liveright Publishing Corporation.
Kubovy, M. and D. Van Valkenburg (2001). "Auditory and visual objects." Cognition 80(1-2): 97-126.
Lavie, N., A. Hirst, J. W. de Fockert and E. Viding (2004). "Load theory of selective attention and cognitive control." J Exp Psychol Gen 133(3): 339-354.


Lee, T. S. and D. Mumford (2003). "Hierarchical Bayesian inference in the visual cortex." J Opt Soc Am A Opt Image Sci Vis 20(7): 1434-1448.
Leopold, D. A. and N. K. Logothetis (1999). "Multistable phenomena: changing views in perception." Trends Cogn Sci 3(7): 254-264.
Levelt, W. J. M. (1968). On Binocular Rivalry. Paris, Mouton.
Lyzenga, J. and B. C. Moore (2005). "Effect of frequency-modulation coherence for inharmonic stimuli: frequency-modulation phase discrimination and identification of artificial double vowels." J Acoust Soc Am 117(3 Pt 1): 1314-1325.
Malmierca, M. S., S. Cristaudo, D. Perez-Gonzalez and E. Covey (2009). "Stimulus-specific adaptation in the inferior colliculus of the anesthetized rat." J Neurosci 29(17): 5483-5493.
Mesgarani, N. and E. F. Chang (2012). "Selective cortical representation of attended speaker in multi-talker speech perception." Nature 485(7397): 233-236.
Micheyl, C., R. P. Carlyon, A. Gutschalk, J. R. Melcher, A. J. Oxenham, J. P. Rauschecker, B. Tian and E. Courtenay Wilson (2007). "The role of auditory cortex in the formation of auditory streams." Hear Res 229(1-2): 116-131.
Micheyl, C., B. Tian, R. P. Carlyon and J. P. Rauschecker (2005). "Perceptual organization of tone sequences in the auditory cortex of awake macaques." Neuron 48(1): 139-148.
Mill, R., T. Bőhm, A. Bendixen, I. Winkler and S. L. Denham (2013). "Competition and cooperation between fragmentary event predictors in a model of auditory scene analysis." PLoS Comput Biol, in press.
Miller, G. A. and J. C. R. Licklider (1950). "The intelligibility of interrupted speech." Journal of the Acoustical Society of America 22: 167-173.
Moore, B. C., B. R. Glasberg and R. W. Peters (1986). "Thresholds for hearing mistuned partials as separate tones in harmonic complexes." J Acoust Soc Am 80(2): 479-483.
Moore, B. C. and H. E. Gockel (2012). "Properties of auditory stream formation." Philos Trans R Soc Lond B Biol Sci 367(1591): 919-931.
Moore, B. C. J. and H. E. Gockel (2002). "Factors influencing sequential stream segregation." Acta Acust. 88: 320-333.
Näätänen, R., T. Kujala and I. Winkler (2011). "Auditory processing that leads to conscious perception: a unique window to central auditory processing opened by the mismatch negativity and related responses." Psychophysiology 48(1): 4-22.
Näätänen, R., M. Tervaniemi, E. Sussman, P. Paavilainen and I. Winkler (2001). ""Primitive intelligence" in the auditory cortex." Trends Neurosci 24(5): 283-288.
Näätänen, R. and I. Winkler (1999). "The concept of auditory stimulus representation in cognitive neuroscience." Psychol Bull 125(6): 826-859.
Nager, W., W. Teder-Sälejärvi, S. Kunze and T. F. Münte (2003). "Preattentive evaluation of multiple perceptual streams in human audition." Neuroreport 14(6): 871-874.
Nakajima, Y., T. Sasaki, K. Kanafuka, A. Miyamoto, G. Remijn and G. ten Hoopen (2000). "Illusory recouplings of onsets and terminations of glide tone components." Percept Psychophys 62(7): 1413-1425.
Nakajima, Y., T. Sasaki, G. B. Remijn and K. Ueda (2004). "Perceptual organization of onsets and offsets of sounds." J Physiol Anthropol Appl Human Sci 23(6): 345-349.
Neisser, U. (1967). Cognitive Psychology. New York, Appleton-Century-Crofts.
Nelken, I. (2008). "Processing of complex sounds in the auditory system." Curr Opin Neurobiol 18(4): 413-417.
Oertel, D., R. R. Fay and A. N. Popper (2002). Integrative Functions in the Mammalian Auditory Pathway. New York, Springer-Verlag.
Paz, R., H. Gelbard-Sagiv, R. Mukamel, M. Harel, R. Malach and I. Fried (2010). "A neural substrate in the human hippocampus for linking successive events." Proc Natl Acad Sci U S A 107(13): 6046-6051.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, Morgan Kaufmann Publishers.


Pressnitzer, D. and J. M. Hupe (2006). "Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization." Curr Biol 16(13): 1351-1357.
Rand, T. C. (1974). "Letter: Dichotic release from masking for speech." J Acoust Soc Am 55(3): 678-680.
Rensink, R. A. (2000). "Seeing, sensing, and scrutinizing." Vision Res 40(10-12): 1469-1487.
Riecke, L., A. J. Van Opstal and E. Formisano (2008). "The auditory continuity illusion: a parametric investigation and filter model." Percept Psychophys 70(1): 1-12.
Ritter, W., E. Sussman and S. Molholm (2000). "Evidence that the mismatch negativity system works on the basis of objects." Neuroreport 11(1): 61-63.
Roberts, B., B. R. Glasberg and B. C. Moore (2002). "Primitive stream segregation of tone sequences without differences in fundamental frequency or passband." J Acoust Soc Am 112(5 Pt 1): 2074-2085.
Samuel, A. G. (1981). "The role of bottom-up confirmation in the phonemic restoration illusion." J Exp Psychol Hum Percept Perform 7(5): 1124-1131.
Schadwinkel, S. and A. Gutschalk (2011). "Transient BOLD activity locked to perceptual reversals of auditory streaming in human auditory cortex and inferior colliculus." J Neurophysiol 105(5): 1977-1983.
Schofield, A. R. (2010). Structural organization of the descending auditory pathway. The Oxford Handbook of Auditory Science: The Auditory Brain. A. Rees and A. R. Palmer. Oxford, Oxford University Press. 2: 43-64.
Schwartz, J. L., N. Grimault, J. M. Hupe, B. C. Moore and D. Pressnitzer (2012). "Multistability in perception: binding sensory modalities, an overview." Philos Trans R Soc Lond B Biol Sci 367(1591): 896-905.
Seeba, F. and G. M. Klump (2009). "Stimulus familiarity affects perceptual restoration in the European starling (Sturnus vulgaris)." PLoS One 4(6): e5974.
Shamma, S. A., M. Elhilali and C. Micheyl (2011). "Temporal coherence and attention in auditory scene analysis." Trends Neurosci 34(3): 114-123.
Shpiro, A., R. Moreno-Bote, N. Rubin and J. Rinzel (2009). "Balance between noise and adaptation in competition models of perceptual bistability." J Comput Neurosci 27(1): 37-54.
Snyder, J. S. and C. Alain (2007). "Toward a neurophysiological theory of auditory stream segregation." Psychol Bull 133(5): 780-799.
Snyder, J. S., C. Alain and T. W. Picton (2006). "Effects of attention on neuroelectric correlates of auditory stream segregation." J Cogn Neurosci 18(1): 1-13.
Snyder, J. S., M. K. Gregg, D. M. Weintraub and C. Alain (2012). "Attention, awareness, and the perception of auditory scenes." Front Psychol 3: 15.
Spence, C. (this volume). Cross-modal perceptual organization. The Oxford Handbook of Perceptual Organization. J. Wagemans. Oxford, Oxford University Press.
Stoffregen, T. A. and B. G. Bardy (2001). "On specification and the senses." Behavioral and Brain Sciences 24: 195-222.
Summerfield, C. and T. Egner (2009). "Expectation (and attention) in visual cognition." Trends Cogn Sci 13(9): 403-409.
Sussman, E. S. (2005). "Integration and segregation in auditory scene analysis." J Acoust Soc Am 117(3 Pt 1): 1285-1298.
Sussman, E. S., A. S. Bregman, W. J. Wang and F. J. Khan (2005). "Attentional modulation of electrophysiological activity in auditory cortex for unattended sounds within multistream auditory environments." Cogn Affect Behav Neurosci 5(1): 93-110.
Sussman, E. S., J. Horváth, I. Winkler and M. Orr (2007). "The role of attention in the formation of auditory streams." Percept Psychophys 69(1): 136-152.
Sussman, E. S., W. Ritter and H. G. Vaughan, Jr. (1999). "An investigation of the auditory streaming effect using event-related brain potentials." Psychophysiology 36(1): 22-34.


Szalárdy, O., A. Bendixen, D. Tóth, S. L. Denham and I. Winkler (2012). "Modulation-frequency acts as a primary cue for auditory stream segregation." Journal of Learning & Perception, in press.
Szalárdy, O., T. Bőhm, A. Bendixen and I. Winkler (2013). "Perceptual organization affects the processing of incoming sounds: An ERP study." Biological Psychology, in press.
Taaseh, N., A. Yaron and I. Nelken (2011). "Stimulus-specific adaptation and deviance detection in the rat auditory cortex." PLoS One 6(8): e23369.
Teki, S., M. Chait, S. Kumar, K. von Kriegstein and T. D. Griffiths (2011). "Brain bases for auditory stimulus-driven figure-ground segregation." J Neurosci 31(1): 164-171.
Ulanovsky, N., L. Las and I. Nelken (2003). "Processing of low-probability sounds by cortical neurons." Nat Neurosci 6(4): 391-398.
van Ee, R. (2009). "Stochastic variations in sensory awareness are driven by noisy neuronal adaptation: evidence from serial correlations in perceptual bistability." J Opt Soc Am A Opt Image Sci Vis 26(12): 2612-2622.
van Leeuwen, C. (this volume). Continuous versus discrete stages, emergence versus microgenesis. The Oxford Handbook of Perceptual Organization. J. Wagemans. Oxford, Oxford University Press.
van Noorden, L. P. A. S. (1975). Temporal coherence in the perception of tone sequences. Doctoral dissertation, Technical University Eindhoven.
Vliegen, J. and A. J. Oxenham (1999). "Sequential stream segregation in the absence of spectral cues." J Acoust Soc Am 105(1): 339-346.
Wagemans, J., J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh and R. von der Heydt (2012). "A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure-ground organization." Psychol Bull 138(6): 1172-1217.
Warren, R. M., J. M. Wrightson and J. Puretz (1988). "Illusory continuity of tonal and infratonal periodic sounds." J Acoust Soc Am 84(4): 1338-1342.
Wessel, D. L. (1979). "Timbre space as a musical control structure." Computer Music Journal 3: 45-52.
Wightman, F. L. and R. Jenison (1995). Auditory spatial layout. Perception of Space and Motion. W. Epstein and S. J. Rogers. San Diego, CA, Academic Press: 365-400.
Winkler, I. (2007). "Interpreting the mismatch negativity." Journal of Psychophysiology 21: 147-163.
Winkler, I. (2010). In search for auditory object representations. Unconscious Memory Representations in Perception: Processes and Mechanisms in the Brain. I. Winkler and I. Czigler. Amsterdam, John Benjamins: 71-106.
Winkler, I. and N. Cowan (2005). "From sensory to long-term memory: evidence from auditory memory reactivation studies." Exp Psychol 52(1): 3-20.
Winkler, I. and I. Czigler (2012). "Evidence from auditory and visual event-related potential (ERP) studies of deviance detection (MMN and vMMN) linking predictive coding theories and perceptual object representations." Int J Psychophysiol 83(2): 132-143.
Winkler, I., S. Denham, R. Mill, T. M. Bohm and A. Bendixen (2012). "Multistability in auditory stream segregation: a predictive coding view." Philos Trans R Soc Lond B Biol Sci 367(1591): 1001-1012.
Winkler, I., S. L. Denham and I. Nelken (2009). "Modeling the auditory scene: predictive regularity representations and perceptual objects." Trends Cogn Sci 13(12): 532-540.
Winkler, I., E. Sussman, M. Tervaniemi, J. Horváth, W. Ritter and R. Näätänen (2003). "Preattentive auditory context effects." Cogn Affect Behav Neurosci 3(1): 57-77.
Winkler, I., R. Takegata and E. Sussman (2005). "Event-related brain potentials reveal multiple stages in the perceptual organization of sound." Brain Res Cogn Brain Res 25(1): 291-299.
Winkler, I., W. A. Teder-Salejarvi, J. Horváth, R. Näätänen and E. Sussman (2003). "Human auditory cortex tracks task-irrelevant sound sources." Neuroreport 14(16): 2053-2056.
Winkler, I., T. L. van Zuijen, E. Sussman, J. Horváth and R. Näätänen (2006). "Object representation in the human auditory system." Eur J Neurosci 24(2): 625-634.


Yabe, H., I. Winkler, I. Czigler, S. Koyama, R. Kakigi, T. Sutoh, T. Hiruma and S. Kaneko (2001). "Organizing sound sequences in the human brain: the interplay of auditory streaming and temporal integration." Brain Res 897(1-2): 222-227.
Yildiz, I. B. and S. J. Kiebel (2011). "A hierarchical neuronal model for generation and online recognition of birdsongs." PLoS Comput Biol 7(12): e1002303.
Yu, A. J. (2007). "Adaptive behavior: humans act as Bayesian learners." Curr Biol 17(22): R977-980.
Zhuo, G. and X. Yu (2011). Auditory feature binding and its hierarchical computational model. Third International Conference on Artificial Intelligence and Computational Intelligence. Springer-Verlag.
Zwicker, E. and H. Fastl (1999). Psychoacoustics: Facts and Models. Heidelberg, New York, Springer.