Enlighten – Research publications by members of the University of Glasgow http://eprints.gla.ac.uk
Smith, R., and Hawkins, S. (2012) Production and perception of speaker-specific phonetic detail at word boundaries. Journal of Phonetics, 40 (2). pp. 213-233. ISSN 0095-4470 (doi:10.1016/j.wocn.2011.11.003)
http://eprints.gla.ac.uk/45862/ Deposited on: 11th September 2012
Smith & Hawkins, Speaker-specific detail at word boundaries
Production and perception of speaker-specific phonetic detail at word boundaries
Suggested running title: Speaker-specific detail at word boundaries
Rachel Smith (a) and Sarah Hawkins (b)
(a) Glasgow University Laboratory of Phonetics, University of Glasgow, 12 University Gardens, Glasgow G12 8QQ, United Kingdom
(b) Centre for Music and Science, Faculty of Music, University of Cambridge, 11 West Road, Cambridge, CB3 9DP, United Kingdom
[email protected] [email protected]
Corresponding author: Dr. Rachel Smith, Glasgow University Laboratory of Phonetics, University of Glasgow, 12 University Gardens, Glasgow G12 8QQ, United Kingdom
[email protected]
Phone: +44 141 330 5533
http://www.gla.ac.uk/schools/critical/staff/rachelsmith/
Abstract
Experiments show that learning about familiar voices affects speech processing in many tasks. However,
most studies focus on isolated phonemes or words and do not explore which phonetic properties are
learned about or retained in memory. This work investigated inter-speaker phonetic variation involving
word boundaries, and its perceptual consequences. A production experiment found significant variation in
the extent to which speakers used a number of acoustic properties to distinguish junctural minimal pairs
e.g. So he diced them—So he'd iced them. A perception experiment then tested intelligibility in noise of
the junctural minimal pairs before and after familiarisation with a particular voice. Subjects who heard the
same voice during testing as during the familiarisation period showed significantly more improvement in
identification of words and syllable constituents around word boundaries than those who heard different
voices. These data support the view that perceptual learning about the particular pronunciations associated
with individual speakers helps listeners to identify syllabic structure and the location of word boundaries.
Keywords
perceptual learning, segmentation, word boundaries, phonetic detail, individual speech style, allophone
1. Introduction
Experiments show that individual speaker characteristics can affect speech processing in a variety
of tasks, including serial recall (Goldinger, Pisoni, & Logan, 1991), explicit recognition memory
(Palmeri, Goldinger, & Pisoni, 1993), word recognition in noise (Nygaard, Sommers, & Pisoni, 1994;
Nygaard & Pisoni, 1998) and shadowing (Goldinger, 1998; Shockley, Sabadini, & Fowler, 2004). The
speaker characteristics that have been investigated mostly involve the realisation of particular phonemes
on the one hand (e.g. Norris, McQueen & Cutler, 2003), and on the other, very global properties such as
mean f0 and aspects of intonation (Church & Schacter, 1994), and amplitude, vocal effort and rate of
speech (Bradlow, Nygaard, & Pisoni, 1999). Little is known, however, about the perceptual effects of
speaker characteristics related to higher-level linguistic structure, such as the way a particular speaker
realises boundaries between different types of constituents in the prosodic hierarchy, or the way he/she
gives prominence to prosodic constituents. The present study investigates inter-speaker variation in the
production of phonetic patterns at word boundaries, and tests whether listeners learn about such variation
and use it to recognise words in noise.
Theories differ widely about the specific nature of the learning or memory processes that underlie
speaker-specific effects on perception, especially with regard to the issue of units of representation. In the
1990s, so-called “non-analytic” approaches were developed, which departed from earlier “abstractionist”
accounts of perception by assuming that holistic episodes or exemplars of words or longer stretches of
speech are stored in memory (e.g. Goldinger, 1996, 1998; Johnson, 1997, 2006; Pisoni, 1997; Lachs,
McMichael, & Pisoni, 2003). According to these approaches, when a new speech signal is heard, it is
matched simultaneously against all stored exemplar traces in memory, which are activated in proportion
to the goodness of match, and the aggregate of these activations produces a response. There is no need for
the exemplars to be broken down into smaller units.
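The matching process just described can be illustrated with a minimal sketch. It stores exemplars as labelled feature vectors, activates every stored trace in proportion to its similarity to the incoming signal, and sums activation per label to produce a response. The vector representation, the exponential similarity function, the sensitivity parameter `c` and the stored values are all assumptions made for illustration; they are not taken from any of the models cited above.

```python
import math
from collections import defaultdict

def activations(probe, exemplars, c=1.0):
    """Return the summed activation per label for a probe vector.

    exemplars: list of (label, vector) pairs. Every stored trace is
    compared with the probe as a whole; none is broken into smaller units.
    """
    totals = defaultdict(float)
    for label, vec in exemplars:
        dist = math.dist(probe, vec)          # Euclidean distance to the trace
        totals[label] += math.exp(-c * dist)  # closer traces are activated more
    return dict(totals)

def respond(probe, exemplars, c=1.0):
    """Choose the label with the largest aggregate activation."""
    totals = activations(probe, exemplars, c)
    return max(totals, key=totals.get)

# Hypothetical two-dimensional traces (invented values, arbitrary units)
memory = [
    ("diced", (1.2, 0.8)), ("diced", (1.1, 0.9)),
    ("iced", (0.7, 1.3)), ("iced", (0.6, 1.2)),
]
```

Calling `respond((1.0, 0.9), memory)` selects the label whose traces lie closest to the probe overall, without ever segmenting the stored traces.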
Non-analytic approaches reflect scepticism about the relevance to memory or perception of
traditional linguistic units and categories such as features, phonemes, syllables, and so on. Instead of
assuming a single linguistic unit as the focus of perception, they are flexible about the units that might (or
might not) be involved, emphasising mainly the role of the individual perceiver's experience in linguistic
processing and change. Nonetheless, in order to make predictions and carry out simulations, a widespread
assumption has been that words are stored as exemplars. Johnson (2006: 492) justifies this choice as
follows: “I choose to treat ‘words’ as exemplars because words lie at the intersection of form and meaning
and thus generate coordinated patterns of activity in both sensory and higher level areas of cognition.”
Though a useful working assumption, this choice ignores the sublexical (phonological and
morphological) structure of words, and words’ participation in supralexical relations, as expressed in
prosodic and grammatical organisation.
Non-analytic approaches’ de-emphasis of the role of traditional phonological units has been
viewed by some researchers as problematic (e.g. Pierrehumbert 2006; Goldinger 2007). A particular issue
is that when word-sized exemplar storage is assumed, there is little attention to lexical items’ internal
phonological structure. This makes it difficult for non-analytic models to explain how people generalise
from pronunciation or perception of particular words to the 'same' (or similar) sounds in other words.
Cutler and colleagues, for example, have emphasised the need not to throw out the linguistic baby with
the bathwater of abstractionism. Norris, McQueen, & Cutler (2003) demonstrated that after hearing a list
with words that contained an ambiguous fricative in place of either /s/ or /f/, listeners shift their category
boundary for the /s/-/f/ distinction in an appropriate direction. In a related priming task, McQueen, Cutler
and Norris (2006) demonstrated that such perceptual learning about phonetic segments generalises to
untrained words. To account for these data, the authors proposed that perceptual learning operates over
abstract phonemic representations, and is probably speaker-specific (Eisner & McQueen, 2005; although
Kraljic & Samuel, 2006 found a degree of generalization across speakers). While the units over which
such learning operates were argued to be sub-lexical, Norris et al. (2003) proposed that knowledge of the
intended lexical meaning was the driving force behind such learning (cf. Davis et al., 2005, but see also
Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008, for indications that lexical access is not always
necessary).
The question investigated in this paper is whether or not phoneme categories are the only aspect
of sound patterns that might trigger speaker-specific perceptual learning. In principle, perceptual learning
could also occur for aspects of syllabic and prosodic structure, and the ways in which individual speakers
phonetically implement these: for example, the way a particular speaker realises boundaries between
constituents such as syllables, words, and phrases, or the way he/she gives prosodic prominence to
constituents. Such aspects of speech beyond phonological category structure are important for
understanding meaning, especially functional and interactional meaning, and might be learned in a
speaker-specific way.
1.1. Evidence regarding perceptual learning about allophonic detail
There is much evidence that the allophonic correlates of syllabic and prosodic structure are used
perceptually to help word segmentation and identification (e.g. Ogden, Hawkins, House, Huckvale, Local,
Carter, Dankovičová, & Heid, 2000; Davis, Marslen-Wilson, & Gaskell, 2002; Salverda, Dahan, &
McQueen, 2003; Cho, McQueen, & Cox, 2007; Baker, 2008). Rather less research has tested the role of
such properties in perceptual learning.
Some studies indicate that perceptual learning can under appropriate circumstances make
reference to sub-phonemic information, either position-specific allophones or phonological features.
Allen & Miller (2004) and Shockley, Sabadini & Fowler (2004) have shown that listeners can select the
VOT appropriate for /t/ in a voice they have been exposed to; this could be interpreted variously as
learning about a phonemic category, /t/, or a feature, [+ spread glottis], or a regional accent (e.g.
Yorkshire, Minnesota), or an individual speaker characteristic (‘breathy here and there’). Nielsen (2011)
investigated spontaneous imitation of VOT, and found that after exposure to word tokens with lengthened
VOTs, listeners extended VOT in their own production of words. Crucially, this finding generalised
across place of articulation, i.e. after hearing extended VOT in words with initial /p/, subjects lengthened
their own VOT not only in the exposed words with /p/, and novel words with /p/, but also—though to a
lesser extent—in new words with initial /k/. This finding is consistent with the idea that perceptual
learning about sublexical units makes reference to both features and phonemes, or articulatory or syllabic
patterns. Kraljic and Samuel (2006) also showed that perceptual learning about the voicing contrast can
be featural in nature. Kraljic and Samuel used a modified version of Norris et al.’s (2003) paradigm,
exposing listeners to words containing ambiguous stops in place of either /t/ or /d/. They observed a
subsequent boundary shift not only for a /t/-/d/ continuum, but also for a /p/-/b/ continuum where the
sounds tested shared a feature, but no phoneme, with the exposed words.
Crucially, in an experiment investigating generalisation of perceptual learning from noise-
vocoded speech, Dahan and Mead (2010) convincingly conclude that the primary linguistic level of
adaptation (learning) is allophonic and indeed coarticulatory: codas generalise better to codas than to
onsets, and onsets better to onsets than to codas, and it helps to have the same vowel context too. Dahan
and Mead “propose that the process by which adult listeners learn to interpret distorted speech is akin to
building phonological categories in one’s native language, a process where categories and structure
emerge from the words in the ambient language without completely abstracting from them.” (Dahan &
Mead 2010: Abstract).
In contrast, studies using long-term repetition priming have typically not found strong or reliable
evidence for the preservation of allophonic variation in memory for spoken language (McLennan, Luce,
& Charles-Luce, 2003; Sumner & Samuel, 2005). In this paradigm, pairs of tokens that either share, or do
not share, some aspect of phonetic form, are presented, separated by large numbers of intervening tokens.
If better priming is found for those tokens that do share the aspect of form relative to tokens that do not
share the aspect of form, it is assumed that the source of variation was retained in memory over at least
the time period investigated. We discuss the long-term repetition priming studies in some detail because
they raise methodological and interpretative issues that are important to the motivation for the current
experiment; readers who wish to skip the detail should proceed to the final paragraph of this section.
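The logic of this paradigm can be made concrete with a small sketch. It computes a priming effect as the speed-up of primed targets relative to unprimed controls, and a form-specificity effect as the extra priming obtained when prime and target share the phonetic form. The function names and the response times are invented for illustration and do not come from any of the studies discussed.

```python
from statistics import mean

def priming(control_rts, primed_rts):
    """Priming effect in ms: how much faster primed targets are than unprimed controls."""
    return mean(control_rts) - mean(primed_rts)

def specificity(control_rts, match_rts, mismatch_rts):
    """Extra priming (ms) when prime and target share the aspect of phonetic form."""
    return priming(control_rts, match_rts) - priming(control_rts, mismatch_rts)

# Invented lexical-decision times (ms), for illustration only
control = [620, 640, 630]    # targets with no earlier presentation
match = [560, 570, 580]      # earlier presentation had the same variant
mismatch = [590, 600, 610]   # earlier presentation had a different variant
effect = specificity(control, match, mismatch)
```

A positive `effect` would be read as evidence that the source of variation was retained in memory; a value near zero would be read as evidence that it was not.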
One instance of allophonic variation that has been investigated is flapping. McLennan et al.
(2003) conducted priming studies using tokens of words like atom and Adam, which contained either an
intervocalic flap, or non-flapped [t] or [d]. Flapped atom/Adam primed careful atom/Adam in most of the
repetition priming tasks used, suggesting that information about allophonic realisation was not being used
(cf. Luce, McLennan & Charles-Luce, 2003). In contrast, when the participants’ task was an easy lexical
decision task (with very un-wordlike nonwords), the degree of priming was found to be sensitive to the
allophonic realisation of the word. McLennan et al. offer an account of this finding in terms of adaptive
resonance theory (for a succinct description for phoneticians, see Grossberg, 2003) and suggest that
processing dynamics are such that allophonic detail is used only when the judgment task is easy or
captures performance at an early stage of processing.
McLennan et al.’s account of why allophonic detail should affect early more than late processing
is compelling for the particular stimuli and allophonic variants used. However, we speculate that
allophonic effects could arise in more difficult tasks or later in processing, if different manipulations of
allophonic detail were used. First, intelligibility-in-noise tests show that phonetic detail benefits listeners
in difficult listening conditions (Ogden et al., 2000; Heinrich, Flory & Hawkins, 2010), and as yet we
know little about whether this benefit affects early or late decisions. Second, and returning to McLennan
et al. (2003), in natural speech, there can sometimes be subtle differences in flapped tokens according to
the underlying voicing status of the consonant (Charles-Luce, 1997; Kwong & Stevens 1999). In
McLennan et al.’s experiments, the same flapped tokens were used to instantiate both /t/ and /d/ and
indeed were chosen as the most ambiguous from a set of flapped tokens. In consequence, the presence of
a flap was uninformative regarding voiced/voiceless status, and the reason it did not play a role in tasks
requiring a deeper level of processing may have been because it did not actually help listeners to do the
task.
Similar considerations apply to a related experiment by Sumner and Samuel (2005), which
investigated whether the extent of priming was affected by whether words were pronounced with different
variants of word-final /t/: plain (non-glottalised) [t], glottalised and unreleased [ʔt̚], and glottal stop with
no supralaryngeal articulation [ʔ]. In an immediate semantic priming experiment, Sumner and Samuel
found that priming was not affected by allophonic variant: for example, a token of flute with a glottalised
unreleased [ʔt̚] primed the related word music just as well as a token with a plain [t] did. However, in a
long-term repetition priming experiment with a lexical decision task, strong priming was only found for
plain [t]: for example, if the first presentation of flute had a plain [t], the lexical decision response to the
second presentation of flute was faster if the second presentation also had a plain [t]; but if the first
presentation had [ʔt̚] or [ʔ], the response to the second presentation of flute was no faster if the second
presentation matched the first presentation than if it did not. Sumner and Samuel conclude that
information about non-canonical allophonic variants (i.e., those other than plain [t]) is not retained in
long-term memory, at least for these tasks that are about processing words heard in isolation.
Again, the negative results of Sumner and Samuel (2005) may be due to the role that allophonic
detail played in their tasks. Allophonic detail can inform listeners about aspects of linguistic structure, and
therefore assist in understanding meaning, but it does not always do so. For example, in Sumner and
Samuel’s (2005) experiment, the choice of one or other word-final variant of /t/ does not change the
meaning of the word. The way in which plosives are released can be informative about, e.g., position in
word, speech style, position in conversational turn (Local, 2003), or socio-indexical characteristics of the
speaker, but these factors are not relevant in Sumner and Samuel’s design, which presented isolated
words; therefore, there was little reason for listeners to remember the details of plosive release for an
extended period. Instead, the key task-relevant information was the phonemic identity of the final
consonant, since filler items were nonsense words like floop, and this could explain why only plain [t]s
(which presumably had the strongest place of articulation cues) showed long-term priming.
What none of the above studies have tested is whether allophonic variation and, in particular,
speaker-specific allophonic variation, is learned when it does actually play a role in giving information
about linguistic structure, and contributing to decisions about meaning. As other studies have suggested,
linguistic relevance may be an important factor in the extent to which systematic variation is retained and
used (e.g., Sommers & Barcroft, 2006; Nygaard, Burt, & Queen, 2000). Allophonic variation at word
boundaries can indicate differences in meaning, as in the case of “junctural minimal pairs”, i.e. identical
phoneme strings that differ in the placement of word boundaries, e.g. grey train—great rain (Wyld, 1913;
Jones, 1931; Lehiste, 1960; Hoard, 1966; Gårding, 1967; Umeda & Coker, 1975; Rietveld, 1980; Quené,
1992, 1993; Cruttenden, 1994). If speakers differ in how they use allophonic detail at word boundaries,
this speaker-specific detail could potentially be useful for segmenting and identifying words, and might
therefore be learned as a listener becomes familiar with a speaker.
1.2. The phonetics of word boundaries
The phonetics of word boundaries are subject to many influences: the words' segmental
composition and syllabic structure, speech rate and degree of casualness, frequency (Cooper & Paccia-
Cooper 1977), and the prosodic and grammatical groupings in which the words participate. For example,
the higher the level of a prosodic boundary, the greater the ‘articulatory strength’ and duration of the
segments immediately after the boundary (see Fougeron, 2001 for a review). Moreover, the presence of a
boundary has greater effects on segment durations for pitch-accented than non-pitch-accented words
(Turk & White, 1999; Turk & Shattuck-Hufnagel, 2000). Similarly, patterns of assimilation—which can
help to indicate word boundaries—differ between function and content words (Local, 2003). Finally,
statistical distributions of word and phrase usage can affect pronunciation, including, of course, in
combination with all the above factors (Bybee, 2006; Jurafsky, Bell and Girand, 2002; Hay and Bresnan,
2006). The general point is that multiple factors influence how segments in a particular word or syllable
are pronounced. In the present case, we manipulate the pronunciation of two words by placing a given
phoneme (or short phoneme string) either at the end of one word or at the beginning of the next word. The
primary variable is thus word juncture—whether a word boundary falls before or after the critical
phoneme(s)—and this is marked phonetically by standard allophonic differences which reflect syllable
structure. Thus by varying the location of a word boundary in identical phonemic strings, we inevitably
and simultaneously vary syllabic structure and hence allophonic quality. Other influences on allophonic
quality such as sentence prosody are held constant.
Word-boundary contrasts involve a wide range of phonetic properties. Among them are certain
well-known allophonic patterns (e.g., for English, aspiration, glottalization, flapping, and clear and dark
variants of /l/), and durational regularities (e.g. longer word-initial than word-medial or -final allophones).
Less well-understood aspects include differences in intensity contour, spectral differences between word-
initial and non-initial vowels (Lehiste, 1960) and differences in voice quality and spectral balance of
consonants (Umeda & Coker, 1975). Word-boundary-related phonetic variation has been shown to
contribute to lexical identification, both in simple forced-choice identification tests (Lehiste, 1960;
O’Connor & Tooley, 1964; Hoard, 1966) and in on-line priming and gating tasks (Gow & Gordon, 1995;
Davis, Marslen-Wilson, & Gaskell, 2002). Therefore, several models of word segmentation and lexical
access incorporate it as a potentially important influence on segmentation, e.g. Shortlist (Norris,
McQueen, & Cutler, 1997), Good Start (Gow & Gordon, 1995), and the hierarchical model of Mattys,
White & Melhorn (2005).
Between-speaker variation in the phonetics of word boundaries has not been systematically
investigated. However—as is often the case with individual variation in both segmental and prosodic
parameters—it is mentioned in passing in many speech production papers. For example, Lehiste (1960)
noted a range of between-speaker differences in her study of English junctural minimal pairs. Quené
(1992) investigated the duration of Dutch vowels in CV1C # V2C vs. CV1 # CV2C phrases (where #
denotes a word boundary), and found that for some speakers, V2 was longer in word-initial than non-
initial position, while for others the opposite pattern was found. Fougeron & Keating (1997) examined
articulatory variation at the beginning of hierarchically-structured prosodic domains, and observed that
while all speakers tended to increase lengthening and strengthening of segments in successively higher
prosodic domains, speakers varied considerably in the particular pairs of domains that they systematically
distinguished.
The goal of this paper is to test whether non-phonemic phonetic detail at word boundaries is
learned about in a speaker-specific way. We first present a production study to test whether different
speakers of a single accent vary in the patterns of phonetic detail they use to achieve junctural contrasts,
and then a perception experiment which explores the consequences of the variation for intelligibility.
2. Production experiment
The main focus of the production experiment is on inter-speaker variation in durational
relationships around word boundaries. Realisations of word-initial vs. -final consonants and word-final
vs. non-final vowels are also briefly addressed. These properties do not exhaust the possible areas of
inter-speaker variation, but were chosen as representative of the range of production variables and
because of their likely perceptual importance. For example, durational relationships are basic to linguistic
coherence (e.g. Klatt, 1976) and to individuals’ idiosyncratic speech rhythms, and have been reliably
shown to underpin word-structure differences in a number of experiments in Dutch and English (Lehiste,
1960; Lehiste, 1972; Davis et al., 2002; Salverda et al., 2003; Kemps, Ernestus, Schreuder & Baayen,
2005; Kemps, Wurm, Ernestus, Schreuder & Baayen, 2005; Hay & Bresnan, 2006).
We expected to find the general patterns documented in the previous section (e.g., onset
consonants realised with longer duration and less lenition than coda consonants; syllable- and word-final
vowels realised more peripherally than non-final vowels). Crucially, we also expected that individual
speakers would vary in the extent to which they produced junctural distinctions and in the combinations
of cues used to achieve them.
2.1. Method
2.1.1. Materials
There were 24 pairs of experimental sentences, arranged in four groups of six pairs each.
They were phonemically identical but differed (usually grammatically) in a critical portion, underlined,
e.g. So he diced them vs. So he'd iced them. These critical portions are shown in Table 1, and the
complete sentences are in Appendix A. In each pair, between one and three phonemes had ambiguous
word affiliation. The members of each pair are denoted Early Boundary and Late Boundary (henceforth
EB, LB respectively) according to where in the phonemic sequence the first word boundary falls (e.g., for
/hiːdaɪst/, EB is he # diced, LB he'd # iced). Each experimental sentence was embedded in a meaningful
context (Appendix A), e.g.
EB: He wanted the carrots to cook fast. So he diced them.
LB: The top of the cakes had come out looking uneven. So he'd iced them.
Table 1 shows the four groups of six sentence pairs, designed as follows. In Group D, e.g. So he
diced them—So he’d iced them, /d/ was either in the syllable onset of word 2 (EB) or else was the reduced
form of the auxiliary verbs had or would, suffixed to word 1 (LB, e.g. he’d). In Group S, e.g. Those are
cat size—Those are cat’s eyes, /s/ was in the syllable onset of word 2 (EB) or was a suffix to word 1 (LB,
e.g. cat’s, likes). Phonologically, these Group D and S suffixes constitute an ‘appendix’ to the syllable
according to Ogden (1999). In Group A, e.g. That surprise—That’s a prize, a weak syllable, /sə/ or /və/,
occurred as the initial syllable of word 2 (EB); or the weak syllable's initial /s/ or /v/ formed the end of
word 1, while word 2 was a or are (both /ə/ in the non-rhotic accent used), and word 3 was a
monosyllable (LB). In Group T, e.g. I lay Steve’s costume out in the wings—I laced Eve’s costume out in
the wings, /st/ was the onset of word 2 (EB), or was at the end of word 1, where /t/ was a past tense
marker (LB).
The 8 tokens of each of the 48 experimental sentences (each with its context) were randomised,
and this list of 384 experimental sentences + contexts was interleaved randomly with 216 filler sentences,
with the restriction that tokens of the same sentence, or its pair, did not appear next to one another. The
complete sentence list thus had 600 sentences (8 × 48 + 216).
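One way to build a list that satisfies this adjacency restriction is sketched below. It starts from the shuffled fillers and inserts each experimental token at a randomly chosen gap whose neighbours do not belong to the same sentence pair. This insertion procedure, the tuple layout and the function name are assumptions for illustration; the text does not say how the original randomisation was carried out.

```python
import random

def build_list(n_pairs=24, tokens=8, n_fillers=216, seed=0):
    """Interleave experimental tokens and fillers with no adjacent same-pair items.

    Items are (pair_id, member, index) tuples; fillers have pair_id None.
    """
    rng = random.Random(seed)
    experimental = [(p, m, t) for p in range(n_pairs)
                    for m in ("EB", "LB") for t in range(tokens)]
    rng.shuffle(experimental)
    result = [(None, "filler", i) for i in range(n_fillers)]
    rng.shuffle(result)
    for item in experimental:
        # Gap g lies between result[g - 1] and result[g]. It is valid when
        # neither neighbour is a token of the same sentence or of its pair.
        valid = [g for g in range(len(result) + 1)
                 if (g == 0 or result[g - 1][0] != item[0])
                 and (g == len(result) or result[g][0] != item[0])]
        result.insert(rng.choice(valid), item)
    return result

sentences = build_list()
```

A valid gap always exists here, because each pair contributes at most 16 tokens and so blocks at most 30 of the more than 200 available gaps.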
Table 1 Critical portions (Early Boundary—Late Boundary) of the phonemically-identical sentence materials, and measurements of segmental durations made (see text for details, this section and section 2.2.1). The four groups are defined by the phonological and/or grammatical nature of the difference between the members of the pair.

Group D
Measurements: D:PrecSyl | D:d | D:FolSyl
  he diced—he’d iced                    hiː|d|aɪst
  she dyed—she’d eyed                   ʃiː|d|aɪd
  we dared—we’d aired                   wiː|d|ɛəd
  we dread—we’d read                    wiː|d|rɛd
  we drank—we’d rank                    wiː|d|rank
  I drove—I’d rove                      aɪ|d|rəʊv

Group S
Measurements: S:PrecSyl | S:s | S:FolSyl
  cat size—cat’s eyes                   kat|s|aɪz
  collect skulls—collects gulls         lɛkt|s|{k/g}ʌlz
  Pete stole—Pete’s dole                piːt|s|{t/d}əʊl
  eat sweet—eats wheat                  iːt|s|wiːt
  like psalms—likes arms                laɪk|s|ɑːmz
  Pat sawed—Pat’s awed                  pat|s|ɔːd

Group A
Measurements: A:PrecSyl | A:s | A:ə | A:C | A:FolSyl
  that surprise—that’s a prize          ðat|s|ə|p|raɪz
  Ralph surpasses—Ralph’s are passes    ralf|s|ə|p|ɑːsɪz
  Ruth sustained—Ruth’s are stained     ruːθ|s|ə|s|tɛɪnd
  that salute—that’s a lute             ðat|s|ə|l|uːt
  that surround—that’s a round          ðat|s|ə|r|aʊnd
  say veneer—save an ear                sɛɪ|v|ə|n|ɪə

Group T
Measurements: T:PrecSyl | T:s | T (T:cl, T:VOT) | T:FolSyl
  lay Steve’s—laced Eve’s               lɛɪ|s|t|iːvz
  whack Stan’s—waxed Ann’s              wak|s|t|anz
  Mick stability—mixed ability          mɪk|s|t|əbɪlətiː
  sly stroll—sliced roll                slaɪ|s|t|rəʊl
  eye strain—iced rain                  aɪ|s|t|rɛɪn
  play strangers—placed rangers         plɛɪ|s|t|rɛɪndʒəz
2.1.2. Speakers
The six speakers all had similar accents of Standard Southern British English. Table 2 shows their
demographic characteristics. The speakers did not know the purpose of the experiment, except for RS (the
first author) who was chosen because she has a fast, casual reading style suitable for these materials. The
decision to use speakers of different ages and genders was for purposes of another experiment (Smith,
2003, 2004) but in this experiment allowed for examination of variation in the realization of the forms
under investigation.
Table 2 Speakers in the production experiment (age in years) and degree of phonetic training.
          Young                          Mid                            Old
  Male    MJ (27) some training          JR (35) trained phonetician    PF (53) no training
  Female  RS (25) trained phonetician    SC (45) no training            AK (54) no training
2.1.3. Recordings
Recordings were made in a double-walled IAC booth using a Sennheiser MKH40 P48 condenser
microphone and a Sony 55ES High Density Linear A/D D/A recorder, and were digitised at 16 kHz to a
Silicon Graphics machine running the commercial version of ESPS and xwaves+ (ESPS is now available
with WaveSurfer from http://www.speech.kth.se/software/#esps). Speakers read as naturally and
informally as possible. Because the critical parts involve frequently-occurring grammatical morphemes,
they were easy to speak in a natural way. Each speaker was allowed to self-correct as often as necessary
during the recordings, and no token was accepted unless it sounded (a) natural, (b) fluent, (c) prosodically
appropriate and (d) acceptable for the intended meaning to the experimenter (the first author), who is a
trained phonetician. Difficult cases were listened to by the second author, also a trained phonetician. The
criterion for prosodic appropriateness was that there be no significant prosodic differences between the
two members of any pair, or between speakers, as determined by the authors’ auditory judgments, in that
the realisations would be transcribed using identical symbols for stress and intonation. Where necessary,
extra tokens were recorded to achieve a total of 8 tokens per sentence per speaker.
2.2. Analysis
2.2.1. Durational measurements
Durations were measured using xwaves. For all sentence groups, the durations of the syllable or
syllable fragment preceding and following the critical part were measured (henceforth PrecSyl, FolSyl).
For example, for So he diced them—So he’d iced them, the critical part was /d/, PrecSyl was /hiː/ and
FolSyl was /aɪst/. Thus PrecSyl either corresponded to an entire syllable and word (/hiː/ in he diced) or to
the onset and nucleus of a syllable (/hiː/ in he’d iced). FolSyl normally corresponded either to an entire
syllable and word (/aɪst/ in he’d iced) or to the rhyme of a syllable (/aɪst/ in he diced). In some cases, the
FolSyl measurement also included an onset consonant e.g. /rɛd/ in we dread / we’d read.
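As a rough sketch of how these three intervals relate to the segmentation labels, the example below derives PrecSyl, critical-part and FolSyl durations from four boundary times. It assumes that the three intervals are contiguous, so that four time points suffice; the field names and the time values are hypothetical and are not the labels or data used in the study.

```python
from dataclasses import dataclass

@dataclass
class Labels:
    """Hypothetical hand-placed boundary times (in seconds) for one token."""
    prec_start: float   # start of the preceding syllable, e.g. /hiː/
    crit_start: float   # start of the critical part, e.g. start of /d/ closure
    crit_end: float     # end of the critical part
    fol_end: float      # end of the following syllable, e.g. /aɪst/

def durations_ms(lab: Labels) -> dict:
    """Return PrecSyl, critical-part and FolSyl durations in milliseconds."""
    return {
        "PrecSyl": round((lab.crit_start - lab.prec_start) * 1000, 1),
        "Critical": round((lab.crit_end - lab.crit_start) * 1000, 1),
        "FolSyl": round((lab.fol_end - lab.crit_end) * 1000, 1),
    }

# Invented values for one token of "he diced". The same three intervals
# are measured for "he'd iced"; only the word boundary location differs.
token = Labels(prec_start=0.100, crit_start=0.250, crit_end=0.320, fol_end=0.650)
result = durations_ms(token)
```

Because the same intervals are measured in both members of a pair, the resulting durations can be compared directly between Early Boundary and Late Boundary tokens.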
Measurements of the critical part differed between sentence groups, and are listed in Table 1. For
Groups D (he diced—he’d iced) and S (cat size—cat’s eyes), the critical part was a /d/ or /s/ respectively,
and its duration was measured (/d/ from start of closure to start of periodic vocalic formant structure,
henceforth D:d, /s/ from onset to offset of frication, henceforth S:s). For the other sentence groups, the
critical part contained more than one segment. In Group A (that surprise—that’s a prize), the critical part
was a CVC sequence e.g. /səp/ for this example. Durations of the C, V and C were measured separately
(henceforth A:s, A:ə and A:C). In Group T (lay Steve’s costume—laced Eve’s costume), the critical part
was a /st/ cluster, and durations of the /s/, /t/ closure, and VOT were measured separately, henceforth T:s,
T:cl, T:VOT. Predictions were that the components of the critical part would be longer when word-initial
than word-final, to different extents for different speakers.
VOT was treated differently in Groups D and T, i.e. for /d/ and /st/ respectively. The measure for
/d/ included VOT, i.e. closure duration and VOT were not measured separately. VOT and closure
duration are likely to be subject to different influences, e.g. VOT is subject to aerodynamic influences
related to segmental composition, in addition to the syllabic/prosodic influences that are our main focus.
Nevertheless, we did not measure VOT separately for three reasons: 1) there are no phonemic differences
within each sentence pair; 2) we expected the syllabic/prosodic influences on VOT to be in the same
direction as those for the closure; 3) we (correctly) expected VOT to vary relatively little in /d/ within
most pairs, but to be perceptually important in /dr/ sequences, because it is long in /dr/ but short in the
other cases. In contrast, for /st/ clusters we measured VOT and closure duration separately, in case
orthography caused a bias towards shorter VOT in LB than EB items (/t/ corresponds to orthographic
<ed> in LB items, e.g. laced Eve’s costume, but to orthographic <t> in EB items, e.g. lay Steve’s
costume). The difference was on the cusp of significance (p = 0.050), but in the opposite direction from
that predicted; it turned out to be driven by only one speaker, with the other five speakers showing no
difference.
An index of speech rate was calculated: Critical Phrase Duration = PrecSyl + critical part +
FolSyl. Mean Critical Phrase Duration differed considerably among the speakers (RS: 553 ms; JR: 563
ms; MJ: 606 ms; AK: 658 ms; PF: 679 ms; SC: 735 ms). The speakers with phonetic training tended to
speak slightly faster than the non-trained speakers. Impressionistically, they were slightly more fluent
readers, presumably due to practice with similar reading tasks. To control for these rate differences,
statistical analyses on individual syllables and segments used relative durations, expressed as proportions
of Critical Phrase Duration. Statistical results are, however, very similar when absolute durations are
considered. That is, of 48 terms tested in the statistical models, 40 had the same significance status (and
the same direction of difference, if a difference was present), regardless of whether absolute or relative
durations were tested. Of the remaining eight terms, five were significant in the analyses of relative, but
not absolute durations; and three were significant in the analyses of absolute, but not relative duration.
Thus, the choice to use relative durations did not artificially inflate the significance of the results.
Moreover, only two of the eight divergences related to the critical part of the phrase itself; most related to
the preceding or following syllable.
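The rate normalisation described above can be illustrated with a minimal sketch. The function name and the example durations are hypothetical (they are not taken from the study's data); the only thing assumed from the text is that Critical Phrase Duration is the sum of PrecSyl, the critical part and FolSyl, and that each component is expressed as a proportion of that sum.

```python
def relative_durations(prec_syl_ms, critical_ms, fol_syl_ms):
    """Return Critical Phrase Duration and each component as a proportion of it."""
    total = prec_syl_ms + critical_ms + fol_syl_ms
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    return {
        "critical_phrase_ms": total,
        "prec_syl": prec_syl_ms / total,
        "critical": critical_ms / total,
        "fol_syl": fol_syl_ms / total,
    }


# Hypothetical token: the values are invented for illustration only.
token = relative_durations(prec_syl_ms=150.0, critical_ms=60.0, fol_syl_ms=390.0)
```

Because the three proportions sum to one for every token, a speaker's overall tempo cancels out of comparisons made on these values.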
2.2.2. Non-durational measurements
Group D was chosen for investigation of non-durational allophonic differences. Initial
observation indicated, amongst other things, that the critical /d/ had a range of realizations from canonical
stops to voiced fricatives, while the preceding vowel varied in quality.
For /d/, voicing and the presence of formant structure in the F2-F3 region during the constriction
portion were examined using wideband spectrograms, and categorised either as continuous through the
constriction portion, or not. Frication during the constriction portion was examined using the waveform
and spectrogram, and categorised as present or absent. No attempt was made to distinguish partial vs.
continuous frication, or frication that occurred at the beginning vs. end of the stop closure period.
Predictions were that /d/ would be shorter and more weakly articulated in word-final than word-initial
position, and therefore that voicing, formant structure and frication during closure would occur more
frequently in word-final than word-initial tokens. This pattern was also expected to be speaker-dependent.
For the five sentences where the vowel preceding /d/ was /iː/, the frequencies of F1 and F2 were
measured in the middle of the steadiest part of F2, using Praat's formant tracker (Boersma & Weenink,
2006), with 12-pole Burg LPC analysis below 5.5 kHz, and a 25-ms Gaussian window, step size 6.25 ms.
Inaccurate values were hand-measured using 25-ms FFT spectra in conjunction with the wideband
spectrogram. The difference between F1 and F2 frequencies (henceforth F2-F1) was calculated in Bark
(Traunmüller, 1990). F2-F1 was expected to be larger, suggesting a more peripheral vowel, when /iː/ was
word-final (e.g. he) than non-final (he'd).
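As an illustration of the F2-F1 measure, the sketch below converts formant frequencies from Hz to Bark with the main formula of Traunmüller (1990), z = 26.81f/(1960 + f) - 0.53, and takes the difference. It is a sketch under assumptions: the function names and formant values are hypothetical, and the low- and high-frequency corrections that Traunmüller proposes for the ends of the scale are omitted, since the text does not say whether they were applied.

```python
def hz_to_bark(f_hz):
    """Traunmüller (1990) Hz-to-Bark approximation (main formula only;
    the corrections for the extremes of the scale are not applied)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def f2_minus_f1_bark(f1_hz, f2_hz):
    """Difference between F2 and F1 after converting each to Bark."""
    return hz_to_bark(f2_hz) - hz_to_bark(f1_hz)


# Hypothetical formant values for an /iː/-like vowel (not from the study).
spread = f2_minus_f1_bark(f1_hz=300.0, f2_hz=2300.0)
```

A larger value indicates greater separation of F1 and F2 on an auditory scale, which is what the text interprets as a more peripheral vowel.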
2.2.3. Statistical analysis
The durational and formant measures were analysed in R using mixed-effects models (Baayen,
2008), which allow multiple random effects to be specified, e.g. for subjects and items. For each variable,
the model-fitting procedure was as follows: first a full model was fitted, with Sentence as random effect,
and Speaker (MJ, JR, PF, RS, SC, AK), Boundary Position (Early, Late) and their interaction as fixed
effects. Speaker was treated as a fixed effect because of the constraints that had gone into speaker
selection. Predictors that did not significantly contribute to the model were incrementally removed until
the simplest model had been found. Log-likelihood tests were used to check model fit as predictors were
removed. Sensitivity of the results to outliers was investigated by fitting models with and without the
removal of outliers with a standardised residual at a distance greater than 2.5 standard deviations from
zero (or 2 in the case of the two variables with the least normally distributed residuals, A:ə and
S:PrecSyl); results were consistent and those reported are with outliers removed. Outliers constituted no
more than 3% of the data (5% for A:ə).
The measures of voicing, formant structure and frication for intervocalic /d/ were analysed using
generalized linear mixed-effects modelling with logistic regression, following the same model fitting
procedure.
Pairwise comparisons, where reported, use a Bonferroni-Holm procedure (Holm, 1979). The
Bonferroni-Holm procedure is a stepwise procedure, based on ordered p-values. First, the comparison
with the most extreme (i.e. smallest) p-value is evaluated, using alpha level 0.05/n, where n = number of
comparisons. If the null hypothesis is rejected for this comparison, the next most extreme p-value is
evaluated, with alpha level 0.05/(n-1). This procedure continues until a comparison is reached for which the
null hypothesis cannot be rejected, after which no further comparisons are evaluated.
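The stepwise logic just described can be sketched as follows. This is a generic illustration of the Holm (1979) step-down rule, not the code used for the analyses; the function name and the p-values are hypothetical, and the use of "less than or equal to" at each threshold is an assumption.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    n = len(p_values)
    # Indices of the comparisons, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for rank, idx in enumerate(order):
        # Thresholds are alpha/n, alpha/(n-1), ..., alpha/1.
        if p_values[idx] <= alpha / (n - rank):
            rejected[idx] = True
        else:
            break  # stop at the first comparison that cannot be rejected
    return rejected


# Hypothetical p-values for four pairwise comparisons.
decisions = holm([0.030, 0.001, 0.020, 0.400])
```

In this example only the second comparison is rejected: 0.001 is below 0.05/4, but the next smallest value, 0.020, exceeds 0.05/3, so evaluation stops there.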
Tables detailing the statistical results are presented in Appendices; differences reported in the
main text as statistically significant always had p < 0.05 or better, and in most cases p < 0.001.
2.3. Results and Discussion
2.3.1. Durational results
Though the main focus of the experiment was the extent of speaker-specific variation in the
marking of word boundaries, we first describe results for general patterns due to word boundary
placement. As Figure 1 and the tables of statistical results in Appendix B show, these were consistent
with the literature. As expected, consonants at the crucial word boundary in the critical phrase were
always proportionately longer when word-initial than word-final (i.e. EB > LB for variables D:d, S:s, A:s,
T:s). The rhyme of the second word in the critical phrase was likewise almost always proportionately
longer when word-initial than when non-initial. For example, /aɪst/ was longer in he’d iced than in he
diced (FolSyl: LB > EB). The duration of the onset+nucleus of the first word in the critical phrase was
more variable, with only two sentence groups showing any systematicity across speakers, and even then,
the difference in these two groups was in opposite directions. In pairs like lay Steve’s costume vs. laced
Eve’s costume, /lɛɪ/ was longer in lay than laced (T:PrecSyl: EB > LB); in contrast, in pairs like he diced
vs. he’d iced, /hiː/ was longer in he’d than he (D:PrecSyl: LB > EB). The difference between the T and D
groups may be due to a number of factors: e.g. the presence of a voiceless complex coda in laced, and the
location of the morpheme boundary in the case of he’d, in conjunction with he being a metrically weak
(i.e. unstressed) function word.
Speaker-specific patterns, which are the main focus of our investigation, modulated the above
general trends to a considerable extent. Figure 1 shows the speaker-specific durational results; statistical
results are in Appendix B. For 13 out of the 16 durational variables investigated, significant interactions
were found between Speaker and Boundary Position, indicating significant inter-speaker variability in the
way that the word boundary contrasts were realised. (The exceptions were S:FolSyl; A:ə; T:PrecSyl.)
Specifically, for some variables, all speakers made word boundary distinctions in qualitatively the
same way, but differed in the magnitude of the distinction. All speakers had longer onset than coda
consonants, but to differing extents (Appendix B, Table B2: D:d, S:s, A:s and T:s). For example, /s/ was
53 ms (or 56%) longer when in onset than coda position for MJ, but only 23 ms (17%) longer for SC
(S:s). Onset /d/ was 38 ms (60%) longer than coda /d/ for PF in sentence group D, but only 21 ms (42%)
longer for JR (D:d). Similarly, in two sentence groups, the word rhyme of the second word was shorter
for all speakers when the critical consonant belonged in the onset of the second word (e.g. diced), than
when it did not (e.g. iced); but speakers differed in the magnitude of this difference. For example, /aɪst/
was 30 ms (13%) longer in iced than diced for MJ, but only 9 ms (3%) longer for SC (Appendix B, Table
B2: D:FolSyl, T:FolSyl).
Figure 1 Mean durations in the four sentence groups, expressed as percentages of total measured sequence. “Preceding syllable” refers to the syllable (or syllable fragment) before the part of the phrase whose word affiliation is ambiguous; “following syllable” refers to the syllable (or syllable fragment) after the critical part. For example, for So he diced them vs. So he’d iced them (Group D), “preceding syllable” refers to /hiː/ and “following syllable” to /aɪst/.
[Figure 1: four horizontal stacked-bar panels. Horizontal axis: Percentage of critical phrase (0 to 100). Vertical axis: Speaker and member of pair (MJ, JR, PF, RS, SC, AK; EB and LB for each speaker).]
Group D e.g. he diced (EB) vs he’d iced (LB). Segments: preceding syllable, [d], following syllable.
Group S e.g. cat size (EB) vs cat’s eyes (LB). Segments: preceding syllable, [s], following syllable.
Group A e.g. that surprise (EB) vs that’s a prize (LB). Segments: preceding syllable, [s] or [v], schwa, consonant, following syllable.
Group T e.g. lay Steve’s costume (EB) vs laced Eve’s costume (LB). Segments: preceding syllable, [s], [t] closure, [t] VOT, following syllable.
Other durational variables showed more extensive speaker-specific variation, in that not all
speakers made a significant durational distinction according to the word boundary placement. This was
the case for the onset+nucleus in pairs like he vs. he’d (D:PrecSyl), which half the speakers distinguished
durationally and half did not. Some of the critical consonant segments—especially those that were not in
absolute word-initial position—also showed idiosyncratic variation. The closure of /t/ (T:cl) was 16 ms
(57%) longer in lay Steve’s costume than laced Eve’s costume for speaker PF; and the VOT of /t/
(T:VOT) was 24 ms (80%) longer in laced Eve’s costume than lay Steve’s costume for speaker SC. PF’s
consonants were also significantly longer (17 ms or 21%) when they were word-initial (e.g. /p/ in prize)
than when syllable-initial but word-medial (/p/ in surprise; A:C), but no other speaker showed this
pattern.
Finally, for a third group of the most idiosyncratic variables, a significant word-boundary-related
difference went in one direction for some speakers, and the opposite direction for others. Word fragments
like /raɪz/ were longer in monosyllables (that’s a prize) than disyllables (that surprise) for most speakers,
but the other way round for speaker SC (A:FolSyl). For PF, fragments like /ðat/ and /kat/ were longer
when word-final (in that surprise and cat size) than when non-final (in that’s a prize and cat’s eyes), but
for speakers RS and MJ the opposite pattern was found, and the remaining speakers showed no difference
(A:PrecSyl; S:PrecSyl; Appendix B, Table B2).
Some speakers were more systematic than others in their use of duration to distinguish word
boundaries. PF was the speaker who had significant differences for the most variables (15 of the 16
tested, whereas the other speakers had between 10 and 12). PF also had significantly larger durational
differences than other speakers for five variables, compared to four for MJ, two for SC, RS and JR, and
only one for AK (Appendix B, Table B2). Interestingly, this systematicity does not relate
straightforwardly to age or speech rate. PF was not the slowest speaker, and indeed had a speech rate very
similar to that of AK, who showed least systematicity and is the same age as PF. SC, the slowest speaker,
was amongst the less systematic in marking her word boundaries: she had smaller EB-LB differences than
other speakers for five variables. It is possible, of course, that the mean speech rate of each speaker is not
the most informative predictor; instead the speech rate of the particular sentence token might predict the
use of durations to distinguish minimal pairs. When analyses were re-run to take account of this, i.e. by
including the total duration of each critical phrase as a predictor, the interactions between Boundary
Position and Speaker remained significant, supporting our conclusion that rate is not the only or main
contributor to the results.
In summary, the durational results show considerable speaker-specificity. The main patterns—
lengthening of word-initial relative to word-final consonants, and longer rhymes for syllables without a
consonantal onset than with one—are exhibited to some extent by all speakers, but implementation of
them is variable across speakers in terms of the magnitude of the distinctions. Other durational patterns
are used systematically by only some speakers. Certain speakers appear generally more systematic in their
use of durations than others, in a way that is not predictable from speech rate alone.
2.3.2. Consonant realisation
Figure 2 shows spectrograms illustrating the extremes of /d/ realization observed, and some
intermediate types. The descriptions refer to the period of maximum oral constriction in each case, i.e. the
closure in a canonical case and its equivalent in other cases: a) a “canonical /d/”, partially voiced, with no
formant structure or frication during its closure, b) /d/ with continuous voicing and partial formant
structure; c) /d/ with continuous voicing, partial formant structure and partial frication; d) a “fully lenited
/d/”, with continuous voicing, continuous formant structure, continuous frication and no burst.
As expected, coda /d/ (LB, e.g. he'd) was continuously voiced significantly more often than onset
/d/ (EB, e.g. diced; Boundary Position, χ2 (1) = 159.3, p < 0.0001). Table 3 shows that this pattern was
found for all speakers, but the speakers also differed significantly in the overall extent to which they used
continuous voicing (Speaker: χ2 (5) = 269.1, p < 0.0001). Table 3 shows that PF and JR had the highest
proportions of continuously voiced tokens (80% and 73%), RS, MJ and AK voiced an intermediate
number of tokens (42%, 28%, 19%), and SC voiced the least (4%). AK and SC only ever used continuous
voicing in coda /d/, and never in onset /d/, while JR and PF used continuous voicing not only in nearly all
their coda /d/s, but also in many of their onset /d/s. The inter-speaker differences cannot be a simple
function of the duration of the closure: PF had much longer closures than JR (83 ms vs 59 ms), but very
similar closure durations to AK (75 ms) and MJ (83 ms), who voiced significantly less during /d/ closure.
Figure 2 Spectrograms illustrating four variants of /d/ in He was pleased that we’d aired them (shown: /wiːdɛə/ and part of following /d/). The progressive weakening of articulation is shown from top to bottom: a) partial voicing, speaker AK; b) continuous voicing and partial formant structure, speaker RS; c) continuous voicing, partial formant structure and partial frication, speaker AK; d) continuous voicing, continuous formant structure, continuous frication, speaker PF.
Table 3 Allophonic variation in tokens of /d/ in pairs like he diced vs. he’d iced. Percentage of tokens that exhibited continuous voicing during closure, continuous formant structure during closure, and presence of frication during closure; and significant pairwise comparisons among speakers.

                                        MJ   JR   PF   RS   SC   AK
Continuous voicing            total %   28   73   80   42    4   19
                              EB         8   48   66    6    0    0
                              LB        48   98   94   77    8   38
  Inter-speaker differences: all speakers differ significantly from all other speakers (p < 0.05), except PF = JR and AK = MJ
Continuous formant structure  total %    6   22   19    7    0    0
                              EB         4    8    9    0    0    0
                              LB         8   36   29   14    0    0
  Inter-speaker differences (all p < 0.05): (SC = AK) < (RS = MJ) < (PF = JR)
Presence of frication         total %   19   17   15   10    2    6
                              EB         4    4    2    2    2    0
                              LB        33   30   27   19    2   12
  Inter-speaker differences: SC < MJ, SC < JR

Continuous formant structure during the constriction was also significantly more prevalent in
coda than onset /d/ (Boundary Position: χ2 (1) = 22.6, p < 0.0001). Significant individual differences were
found here too (Speaker: χ2 (5) = 51.2, p < 0.0001), e.g. Table 3 shows that female speakers SC and AK
exhibited continuous formant structure in none of their tokens; male speakers JR and PF did so in 22%
and 20% of their tokens. As above, these differences cannot result straightforwardly from closure
duration.
As expected, frication was very infrequent in onset /d/, yet occurred in about a fifth of coda /d/s
(2% vs 21% of tokens, Boundary Position: χ2 (1) = 47.5, p < 0.0001). Significant individual differences
were again found (Speaker: χ2 (5) = 21.6, p < 0.001). In general the male speakers fricated their /d/s more
frequently than the females, but the only significant inter-speaker pairwise comparisons were between SC
(2%) vs. JR (17%) and MJ (19%).
2.3.3. Spectral properties of /iː/
Figure 3 Bark frequencies of F1 vs. F2-F1 for /iː/, for each speaker separately (panels: MJ, JR, PF, AK, SC, RS). Early Boundary (EB, e.g. he): circles and solid-line ellipse. Late Boundary (LB, e.g. he’d): triangles and dashed-line ellipse. Ellipses contain the 70% of data points closest to the mean. Solid symbols indicate the mean of each distribution.
Figure 3 shows formant frequency measurements in Bark for /iː/ in he('d), she('d), we('d). As
expected, the F2-F1 difference in Bark was significantly greater in EB (he) than LB (he'd) for all speakers
(Appendix B, Tables B1 and B2), reflecting a more peripheral articulation in the former case i.e. in the
monomorphemic, open syllable. As Appendix B: Table B2 shows, RS and MJ have a significantly greater
EB-LB difference than any of the other speakers (mean 2.1 and 1.8 Bark respectively, as opposed to
between 0.4 and 0.6 Bark for the other speakers). RS and MJ’s LB vowels sound closer to /ɪ/ than /iː/.
They are two of the three speakers who did not distinguish the duration of pairs like he vs. he'd,
suggesting a possible trade-off between spectral and temporal marking of the word boundary. RS and MJ
are also the two youngest speakers; the two oldest speakers, AK and PF, have more or less the smallest
spectral differences.
2.4. Discussion
The production experiment examined proportional durations, consonant realisation and vowel
formant frequencies at word boundaries in natural, meaningful minimal pairs, with main focus on inter-
speaker variation. Although the results showed the trends that are expected from the literature on mean
data, what the experiment has also shown is that there are systematic patterns that distinguish speakers.
Quantitative, and in several cases qualitative, inter-speaker variation was the norm for the variables
investigated, though the extent and type of variation depended on the particular phonetic attribute: some
variables were more consistently used than others.
A number of the durational variables were used consistently by all speakers. Most involved well-
known syllable- and word-initial vs. -final consonantal contrasts (e.g. Lehiste, 1960) and support the view
that initial position is phonetically strong. Others involved syllable rhymes, which were proportionately
longer in onsetless syllables than after a consonantal onset, congruent with results from a corpus study by
van Santen (1992). The consistency in the way speakers make these durational distinctions may be
because they are important for word segmentation, and/or because they reflect basic aspects of the
articulatory organisation of the syllable (Krakow, 1999). Nevertheless, significant inter-speaker variation
in the magnitude of the contrasts was observed, and may be perceptually important. Allen, Miller and
DeSteno (2003) report a similar pattern of quantitative inter-speaker variation for VOT in word-initial
stops, which contributes to perception of speaker identity (Allen and Miller, 2004).
The spectral variable measured, the peripherality of /iː/ in open versus closed syllables, was also
used consistently by all speakers: all speakers had a larger F2-F1 difference when the vowel was in an
open syllable, e.g. he, than in a closed syllable, e.g. he’d (cf. Stevens & House, 1963). However, the
behaviour of the two speakers with the most extreme differences goes beyond the expected degree of
centralisation of /iː/ in the contexts used: their he’d vowels sounded closer to /ɪ/ than /iː/. This quality
difference probably relates to the morphemic status of the words, in that this open-closed syllabic
distinction is also a morphemic distinction: he is monomorphemic while he’d is bimorphemic. Ogden
(1999) analyses similar forms (e.g. (h)ɪz, ʃɪz, ðɪv for he’s, she’s, they’ve) as resulting from phonological
fusion of non-syllabic weak clitic forms of auxiliaries with weak forms of the pronouns that host them.
Because /iː/ and /ɪ/ do not contrast in these function words, unlike in comparable content word pairs such
as heed—hid, considerable variation in the vowel quality is possible. In the present data, systematic
variation that accompanies the presence or absence of the clitic also appears to be age-related. Only the
youngest speakers in the cohort had extreme vowel quality differences. The oldest speakers had the
smallest differences, and the mid-age speakers had intermediate differences.
The variables related to realisation of onset and coda /d/ were fairly consistent across speakers. In
coda as compared to onset stops, all speakers had a higher probability of voicing and formant structure,
and all but one had a higher probability of frication. The patterns for voicing and formant structure are
consistent with acoustic and laryngoscopic observations by Umeda & Coker (1975). The pattern for
frication is consistent with lenition in the form of an incomplete closure or a slowly made or released
closure. (It could also arise due to higher oral pressure, but oral pressure seems unlikely to be higher in a
short, word-final /d/ than a longer, word-initial /d/.) There were also significant speaker differences, and
some indication of gender differences, with voicing, formant structure and frication generally more
prevalent among male than female speakers. The greater prevalence of voicing and formant structure for
male speakers may be a consequence of males’ larger oral tracts, which can maintain a transglottal
pressure drop for a longer duration. There is no obvious anatomical explanation, however, for why males
had more frication than females during part or all of the constriction, and sociological causes may be
more plausible than physiological ones. More work is needed to fully understand this pattern.
Finally, for some variables, qualitative as well as quantitative variation was found among
speakers, i.e. not all speakers produced a significant contrast, or different speakers did so in opposite
directions. These variables typically concerned durational adjustment processes for which the evidence in
the literature is mixed, such as polysyllabic shortening in right-headed words (i.e. pairs like prize vs.
surprise; Turk & Shattuck-Hufnagel, 2000). Inconsistently used variables such as these have interesting
implications for perception: listeners may tune in to those which a speaker uses systematically, and learn
to ignore those that are not informative.
As well as differences between speakers, the results might also reflect differences in the speech
style (e.g. the degree of casualness) that the speakers either adopt habitually, or that they adopted for the
recording. Although an interpretation of the data in terms of speech style cannot be ruled out, it does not
correspond very well to our perceptual impression. While some pairs of speakers did differ in style—in
particular, SC sounded more careful than JR—other speakers, such as AK, MJ and PF, all sounded
similar in speech style, but differed in patterns of phonetic detail. Therefore, we consider the speaker
interpretation more likely, and adopt it in the remainder of the paper. However, we acknowledge that
whether the variation is about speaker or style or both is not critically important for the broad perceptual
aim of this research (i.e. to understand whether or not listeners can carry out perceptual learning about the
phonetic implementation of syllabic and prosodic structure, as well as about phonemic categories).
2.5. Summary
The present study demonstrated inter-speaker variation in durational and some non-durational
properties. However, it far from exhausts the scope for inter-speaker variation in properties affected by
word juncture. Other parameters include pitch, intensity, phonation quality, and additional spectral
properties. Ultimately, it will be important to assess whether the numerous variables involved in word
boundary distinctions can be captured by a 'speaker space' with lower dimensionality. For example,
Krakow (1999) observes two patterns for syllable-final consonants in casual speech, one in which the
primary consonant constriction is weakened or lost, and another in which it is strengthened, i.e. shows
articulatory organization more like that of an initial consonant. A speaker's preference for one or other of
these types of organization might give rise to particular clusters of differences.
What matters most for the present purpose, however, is that systematic inter-speaker differences
in phonetic detail at word boundaries exist. Speakers display subtly (and, in some cases, not-so-subtly)
different phonetic patterns around word junctures. If the set of variables investigated here is
representative, some speakers also mark boundaries more systematically than others. Similar fine-grained
differences in VOT have demonstrable perceptual relevance (Allen and Miller, 2004). It is plausible that
exposure to a voice will result in learning those cues that assist with word segmentation in that voice. The
aim of the perception experiment was to test this hypothesis by testing speech intelligibility in noise.
Embedding speech in noise has been shown to produce sufficient errors of word identification for gross
and subtle influences on processing to be assessed (Miller, Heise, & Lichten, 1951; Heinrich et al., 2010).
Speaker familiarity helps word identification in noise (Nygaard, Sommers, & Pisoni, 1994; Nygaard &
Pisoni, 1998) while correct phonetic detail improves intelligibility of synthetic speech in noise (Ogden et
al., 2000). Therefore, familiarity with a speaker's allophonic patterns at word boundaries is also expected
to improve segmentation in noise.
3. Perception experiment
3.1. Overview and Design
The perception experiment consisted of a pre-test, familiarization phase and post-test. In the pre-
and post-test, subjects heard the sentences from Experiment 1 in background noise without their
preceding contexts, and typed (in standard orthography) what they heard; in the familiarization phase they
heard the same sentences in context and without noise, and answered questions about their meaning.
Performance between pre-test and post-test was compared.
Each participant heard the same voice in the pre- and post-tests. Two crossed factors were
manipulated: Voice heard in tests (speaker PF or MJ) and Familiarization Voice (Same as or Different from the
test voice). Thus the critical question was whether word segmentation in the post-test was facilitated by
having had exposure to the same voice in the familiarization session.
3.2. Method
3.2.1. Materials
Materials were the 48 sentences (24 pairs) from the production experiment. Tokens spoken by
two of the male speakers, PF and MJ, were used. These two speakers were selected because they were the
most suitable of the six speakers, in that, of the four who spoke with a similar moderate degree of
casualness (RS, MJ, AK, and PF) they were the two who shared the same gender, had the most similar
speech rates, and were naïve as to the purpose of the experiment. Their realization of word boundaries
showed a mixture of similarities and differences (details in Appendix C). First, both speakers showed
some basic durational differences between onset and coda consonants, but PF also produced several
additional subtle durational differences that MJ did not. Durations are known to be important for
identifying morphological structure and word juncture (e.g. Lehiste, 1960; O’Connor & Tooley, 1964;
Hoard, 1966; Gow & Gordon, 1995; Davis, Marslen-Wilson, & Gaskell, 2002; Kemps, Ernestus,
Schreuder, & Baayen, 2005; Kemps, Wurm, Ernestus, Schreuder, & Baayen, 2005) and also to be
learnable aspects of idiosyncratic behaviour (Allen & Miller, 2004; Nielsen, 2011). Second, the two
speakers differed with respect to the spectral structure of vowels and details of stop consonant
realisation, whose role in juncture perception and speaker adaptation is less well tested, but which
are—in common with many other phonetic properties—likely to be learned about where relevant to
the task (cf., for vowels, Sidaras, Alexander, & Nygaard, 2009).
Stimuli for the pre- and post-test were made from tokens of the 48 sentences (e.g. So he diced
them). For each speaker, one token of each sentence was selected at random from the eight originally
recorded in the production experiment. This chosen token was mixed with randomly-varying cafeteria
noise at an average signal-to-noise ratio of +2 dB (average amplitude of sentence:average amplitude of
noise). The noise was ramped up to its maximum amplitude over 5 s before the sentence began, continued
at this average amplitude for 15 s after sentence offset and was then ramped down to zero over a further 5
s. The 15 s response period was established in pilot tests as necessary for the response to be typed.
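The noise-embedding step described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: it assumes that "average amplitude" means RMS amplitude, that the ramps are linear, and that audio is held as plain lists of samples. The function names and the `sr`, `ramp_s` and `tail_s` parameters are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def mix_at_snr(sentence, noise, snr_db=2.0, sr=16000, ramp_s=5.0, tail_s=15.0):
    """Embed `sentence` in noise at an amplitude SNR of `snr_db`.

    Timeline: noise ramps up for `ramp_s` seconds, the sentence plays,
    noise continues for `tail_s` seconds, then ramps down for `ramp_s`.
    """
    n_ramp = int(ramp_s * sr)
    n_tail = int(tail_s * sr)
    total = n_ramp + len(sentence) + n_tail + n_ramp
    if len(noise) < total:
        raise ValueError("noise recording is too short")
    # Assumption: SNR (dB) = 20 * log10(rms_sentence / rms_noise),
    # measured on the noise before the ramps are applied.
    target_noise_rms = rms(sentence) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise[:total])
    out = []
    for i in range(total):
        if i < n_ramp:                      # linear ramp up
            env = i / n_ramp
        elif i >= total - n_ramp:           # linear ramp down
            env = (total - i) / n_ramp
        else:                               # steady-state noise
            env = 1.0
        sample = gain * env * noise[i]
        j = i - n_ramp                      # index into the sentence
        if 0 <= j < len(sentence):
            sample += sentence[j]
        out.append(sample)
    return out
```

With the default arguments, the noise extends 5 s before and 20 s after the sentence, matching the timings given above.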
Stimuli for the familiarization session were tokens of the same 48 sentences, each in its
disambiguating context (e.g. He wanted the carrots to cook fast. So he diced them.). For each speaker, six
different tokens of each context+sentence were selected at random from the seven remaining after random
selection of the pre- and post-test token.
In each phase (pre-test, familiarization and post-test), sentences were presented in pseudo-random
order with the constraint that members of a sentence pair never occurred adjacent to one another.
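One simple way to obtain such an order is rejection sampling: shuffle repeatedly until no two members of a pair are adjacent. The sketch below is an assumption about how the constraint could be implemented, not a description of the software actually used. The `(pair_id, member)` item representation and the function name are hypothetical.

```python
import random

def order_without_adjacent_pairs(items, pair_of, rng=None, max_tries=10000):
    """Return a shuffled copy of `items` in which no two neighbouring
    items belong to the same pair, as identified by `pair_of(item)`."""
    rng = rng or random.Random()
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(pair_of(a) != pair_of(b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; raise max_tries")

# 24 sentence pairs, each with an Early (EB) and a Late (LB) boundary member
items = [(pair_id, member) for pair_id in range(24) for member in ("EB", "LB")]
order = order_without_adjacent_pairs(items, pair_of=lambda it: it[0],
                                     rng=random.Random(0))
```

For 24 pairs, a substantial proportion of random shuffles already satisfies the constraint, so very few attempts are normally needed.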
3.2.2. Participants
Participants were 80 speakers of Standard Southern British English (SSBE), 20 in each of four
conditions: 1) Test Voice MJ, Same Familiarization Voice; 2) Test Voice MJ, Different Familiarization
Voice (i.e. PF), 3) Test Voice PF, Same Familiarization Voice; and 4) Test Voice PF, Different
Familiarization Voice (i.e. MJ). All were aged between 18 and 35 and were students or staff of the
Universities of Cambridge or Glasgow. 72 participants were tested at the University of Cambridge in a
sound-treated room, and 8 at the University of Glasgow in a quiet room.
3.2.3. Procedure
Participants were tested individually using high-quality Sennheiser headphones and a PC laptop
running DMDX. Participants did the pre-test (48 items, 25 minutes), then the familiarization session (288
items, 40 minutes), and finally the post-test (48 items, 25 minutes). The pre- and post-tests were each
preceded by one practice item and the familiarization session by two practice items.
In the pre- and post-tests, participants’ task was to type what they heard into a computer. The
words appeared on the screen as they typed, and they were instructed to type as many words as they had
understood. A short break occurred after half the items.
In the familiarization session, participants heard each sentence in its disambiguating context.
They were told that they would hear descriptions of events, and should judge whether it was LIKELY or
UNLIKELY that an event involved the object, person, emotion or idea specified by a question
displayed on the computer screen immediately after they had heard the sentence. Participants
responded by pressing one of two labelled keys on a keyboard. Assignment of responses to
hands was counterbalanced over participants. The comprehension questions appeared on the
computer screen for 3 s, and each took the form Does the event involve X? A short break occurred after
every 20 items, and a longer self-paced break half-way through the familiarization session. Participants
were offered self-paced breaks between pre-test, familiarization and post-test.
3.3. Results
3.3.1. Analysis
All analyses were carried out using generalized linear mixed-effects modelling with logistic
regression. Logistic regressions are performed on ratios of correct to incorrect responses, from which
odds ratios can be calculated. However, for ease of understanding, the Figures and Tables, and the text
below, are expressed in terms of percentages of correct responses.
Responses were measured in three different ways: Words, Boundary, and Boundary2, as
illustrated in Table 4. Words (W) measures the percentage of words correct in the entire sentence. To be
scored as correct, words had to be typed in the same order as in the actual spoken sentence. Obvious
misspellings and homophones were scored correct, morphological variants incorrect. The examples in
column 3 of Table 4 show number of words correct for the two six-word sentences illustrated; for the
presentation of the results, values were summed over all responses, and converted to a percentage of the
total number of possible correct words.
Table 4 Examples of the three scoring systems used in the intelligibility experiment, applied to one pair of sentences: (I) EB It may have been eye strain and (II) LB It may have been iced rain. ‘Words’ scores indicate the number of words correct for that sentence. ‘Boundary’ and ‘Boundary2’ scores for a single sentence are binary: 1 for correct, 0 for incorrect. In both cases, a correct score (1) indicates that the correct phoneme sequence was provided with the word boundary correctly located within it. The correct phoneme sequence was the final syllabic constituent of the first word (Word1End), and the initial syllabic constituent of the second word (Word2Start). Under the Boundary criterion, all other responses were scored as incorrect (0). In contrast, all incorrect Boundary2 responses contained the correct phoneme sequence, but with the word boundary misplaced within that sequence; any other incorrect response was excluded from the Boundary2 analysis. The difference between Boundary and Boundary2 is thus that Boundary2 was restricted to correct phoneme strings in the vicinity of the word boundary and focussed entirely on the location of the boundary. See text for further explanation.

I. Response to sentence “It may have been eye strain”

| Response | Phoneme string | Words (# of 6) | Boundary Word1End | Boundary Word2Start | Boundary2 Word1End | Boundary2 Word2Start |
| It may have been eye strain | /aɪ str/ | 6 | 1 | 1 | 1 | 1 |
| He may have been high strung | /aɪ str/ | 3 | 1 | 1 | 1 | 1 |
| He may have seen mice trained | /aɪs tr/ | 2 | 0 | 0 | 0 | 0 |
| He may have seen my drain | /aɪ dr/ | 2 | 1 | 0 | 1 | excluded |
| He may have seen my Dane | /aɪ d/ | 2 | 1 | 0 | 1 | excluded |
| He may have seen my rain | /aɪ r/ | 2 | 1 | 0 | 1 | excluded |
| It may have been the test train | /ɛst tr/ | 4 | 0 | 0 | excluded | 0 |
| It may have enticed strain | /aɪst str/ | 3 | 0 | 1 | 0 | 1 |
| It may have been an A train | /ɛɪ tr/ | 4 | 0 | 0 | excluded | excluded |

II. Response to sentence “It may have been iced rain”

| Response | Phoneme string | Words (# of 6) | Boundary Word1End | Boundary Word2Start | Boundary2 Word1End | Boundary2 Word2Start |
| It may have been iced rain | /aɪst r/ | 6 | 1 | 1 | 1 | 1 |
| He may have been high strung | /aɪ str/ | 3 | 0 | 0 | 0 | 0 |
| He may have seen mice trained | /aɪs tr/ | 2 | 0 | 0 | 0 | 0 |
| He may have seen my drain | /aɪ dr/ | 2 | 0 | 0 | excluded | excluded |
| He may have said it might rain | /aɪt r/ | 3 | 0 | 1 | excluded | 1 |
| He may have seen my rain | /aɪ r/ | 3 | 0 | 0 | excluded | 1 |
| It may have been the test train | /ɛst tr/ | 4 | 1 | 0 | 1 | 0 |
| It may have enticed strain | /aɪst str/ | 3 | 1 | 0 | 1 | 0 |
The Boundary and Boundary2 measures were introduced because the number of words correct in
the sentence does not necessarily tell us much about listeners’ use of phonetic detail to segment the
critical words. Boundary and Boundary2 allow us to test not only whether familiarity with a voice allows
subjects to identify more words, but also whether it helps them segment phoneme sequences which can be
parsed in more than one way. Boundary (B) reflects the percentage of responses that contained all the
correct phonemes, in the correct sequence, in the correct syllabic constituent abutting the word boundary:
that is, the correct phoneme(s) in the nucleus or coda of the word before the boundary (/aɪ/ in the case of
eye strain, /st/ in the case of iced rain), and the correct phoneme(s) in the onset or nucleus directly after
the word boundary (/str/ in the case of eye strain, /r/ in the case of iced rain, and, not illustrated in Table
4, /aɪ/ in the case of he’d iced). However, phonemes before and after the critical word boundary were
scored separately: there was no requirement for both to be correct in order to obtain a correct score for
one of them, and therefore no requirement for the whole sequence to be correct across the word boundary.
Thus B has two subparts: B Word1End was scored correct if a response contained all and only the correct
phoneme(s) in the last syllabic constituent before the word boundary, and B Word2Start was scored
correct if a response contained all and only the correct phoneme(s) in the first syllabic constituent after
the word boundary. Columns 4 and 5 of Table 4 show the numerical score assigned to each response
illustrated under the B criterion: either 1 (correct) or 0 (incorrect). Although, as with the W score, Table 4
shows the B scores that each particular response would be assigned, observed scores are expressed in the
rest of this paper as a percentage of all possible responses.
The B measure is position-sensitive in that for a correct score to be achieved the listener must not
only identify the correct phoneme(s) but assign them to the correct position in syllable. Nevertheless,
improvement on this measure could conceivably be due to a general improvement in phoneme
identification, such as might be observed if exposure to a speaker’s voice led to phoneme-category
retuning. Therefore the third measure, Boundary2 (B2), was introduced as a more stringent test of
whether or not position-sensitive allophonic quality is encoded in a way that could in fact be mediated by
context-free phonemic categories. The B2 measure was identical to B in the way correct responses were
defined, but differed in the way incorrect responses were defined. B2 excluded from the count of incorrect
responses those cases where some or all of the correct phonemes were simply not reported at all, or, if
reported, were in the wrong sequence. Thus, certain types of error led to a score of 0, while others that
would score 0 under criterion B were excluded altogether from the B2 analysis, with the result that the
ratios and percentages of correct scores were calculated relative to a smaller total number of observations
for B2 than for B. Specifically, to be included as an incorrect B2 response, the response had to contain the
correct phonemes in the correct order in the vicinity of the critical word boundary, but some or all of
those phonemes would be in the wrong position relative to the word boundary. Further, a phoneme could
be duplicated on each side of the boundary, as long as the right sequential order was maintained. Thus,
the error was that not all the critical phonemes were in the correct syllabic constituent: one or more was in
the onset/nucleus of the post-boundary word when they should have been in the nucleus/coda of the
preceding word, or vice versa. Hence, for a B2 score of 0, all the phonemes were right, but one or more of
the allophones were wrong.
Since the description of B and B2 criteria is complicated, we discuss them in some detail in this
and the next paragraph. In this paragraph, we work through some of the examples in Table 4; the next
paragraph compares the consequences of the two measures. The second rows of Table 4.I and 4.II give
the correct answer to the heard stimulus (I: it may have been eye strain, II: it may have been iced rain)
and so both answers score maximally correct under all three criteria, W, B and B2. The second answer,
He may have been high strung, has three of the six words correct for both sentences, so the W score is 3/6
in both cases. As a response to eye strain, the phonemes around the word boundary, /aɪ str/, are all
present, correctly sequenced, and correctly aligned relative to the word boundary, so this response scores
1 for all four B and B2 criteria in Table 4.I. But as a response to iced rain (Table 4.II), /aɪ str/ scores 0
under all four B and B2 criteria because, though the right phonemes are in the right sequence, /st/ is on the
wrong side of the word boundary—and hence would be spoken with the wrong allophones. Finally
consider the answer in row 5 of each part of Table 4: He may have seen my drain. Two of the words are
correct, so W = 2/6 in both cases. The first of the two words around the critical boundary, my, lacks a
coda and has the correct nucleus /aɪ/ for eye strain, so Word1End scores 1 (correct) for both B and B2 in
Table 4.I. But in Table 4.II, my for iced scores 0 (incorrect) for Word1End under criterion B, and it is
excluded from the B2 Word1End analysis because it is the coda /st/ that must be right in this case, rather
than the nucleus /aɪ/. Likewise, drain scores 0 for B Word2Start as a response to both eye strain and iced
rain (Tables 4.I and 4.II respectively); although drain does contain /r/, it lacks /s/, and (perhaps as a
consequence) the stop has been written as the wrong phoneme too: /d/ instead of /t/, a point we return to
below. Under criterion B2, drain is of course excluded from both Word2Start analyses (Table 4.I and 4.II)
because it does not contain all the right phonemes.
B and B2 test subtly different, but nevertheless closely related predictions. Because both are
position-sensitive, any improvement observed on them after exposure to a voice implies that listeners are
sensitive to the particular way that position-in-word, and hence allophonic quality, is encoded by the
speaker that they are adapting to. The difference is that if the B measure shows improvement with
exposure, then it could reflect phoneme retuning; if the B2 measure shows improvement with exposure,
then it must reflect context-sensitive allophonic learning, without necessary recourse to context-free
phonemes. Specifically, improvement on the B measure could conceivably be due to a general
improvement in phoneme identification. In contrast, the difference between correct and incorrect B2
measures reflects solely the placement of the word boundary within a correct phonemic sequence,
because both phoneme identification and phoneme sequence are constant in B2. So any improvement
observed on the B2 measure would be more appropriately modelled in terms of learning about categories
that directly reflect syllable and word structure—in short, improvement on B2 must reflect learning about
allophonic identity, but not phonemic identity.
In summary, the W measure is a standard word-based measure of speech intelligibility in noise;
the B measure allows allophonic accuracy around the critical word boundary to be assessed; the B2
measure is a more stringent test of allophonic accuracy which is independent of phoneme identification
in that it was conducted only on responses in which (context-free) phoneme identification was correct, but
(context-sensitive) allophonic identification might or might not be.
As noted, the B2 analysis is reduced in statistical power compared with the B analysis, because a
subset of incorrect responses is excluded from the analysis. In fact, the B2 dataset was only 75% the size
of the B dataset. Comparisons between the two analyses must therefore be made cautiously (see below).
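The effect of exclusion on the denominator can be illustrated with the nine example responses in Table 4.I (Word1End columns). The sketch below is purely illustrative: the scores are the worked examples from the table, not experimental data, and the function name and the use of `None` to mark an excluded response are assumptions.

```python
def percent_correct(scores):
    """Percentage of 1s among scored responses; None marks an excluded
    response and is left out of both numerator and denominator."""
    kept = [s for s in scores if s is not None]
    if not kept:
        raise ValueError("no scorable responses")
    return 100.0 * sum(kept) / len(kept), len(kept)

# Word1End scores for the nine example responses in Table 4.I
b_word1end = [1, 1, 0, 1, 1, 1, 0, 0, 0]
b2_word1end = [1, 1, 0, 1, 1, 1, None, 0, None]

b_pct, b_n = percent_correct(b_word1end)      # 5 correct of 9 responses
b2_pct, b2_n = percent_correct(b2_word1end)   # 5 correct of 7 responses
```

The same five correct responses yield a higher percentage under B2 than under B, because two incorrect responses are dropped from the B2 total rather than scored 0.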
Generalized linear mixed-effects modelling was applied to the results of each measurement
criterion (W, B, B2). First, a full model was fitted, with predictors Test (Pre-test vs. Post-test), Voice
heard in tests (MJ vs. PF), Familiarization Voice (Same vs. Different as tests), Boundary Position (Early
vs. Late) and all their interactions. Following the same procedure as in the production experiment, non-
significant predictors were incrementally removed until the simplest model had been found; Appendix D
shows the fixed effects included in this final model. Also in the full model were control variable Sentence
Group, and random effects for Subject and Sentence. The inclusion of both sets of random effects ensures
that significant results, where obtained, are robust across participants and items. In all modelling, we
found greater variances associated with items than participants. The crucial prediction was of an
interaction between Test and Familiarization Voice: more improvement from pre-test to post-test was
expected when the voice heard in the familiarization period was the Same as that heard in the tests.
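For orientation, the full model described above can be written schematically as follows. This is a sketch under stated assumptions rather than the exact specification fitted: the symbols, the treatment of Sentence Group as a single fixed-effect term, and the normally distributed random intercepts are our notation for illustration. Here $T$, $V$, $F$ and $B$ stand for Test, Voice heard in tests, Familiarization Voice and Boundary Position, $G$ for Sentence Group, and $i$ and $j$ index subjects and sentences.

```latex
\operatorname{logit} P(y_{ij} = 1)
  = \beta_0 + \beta_T T + \beta_V V + \beta_F F + \beta_B B
  + \beta_{TF}\, T F + (\text{all other interactions of } T, V, F, B)
  + \gamma\, G + s_i + w_j,
\qquad
s_i \sim \mathcal{N}(0, \sigma_s^2), \quad
w_j \sim \mathcal{N}(0, \sigma_w^2)
```

In this notation the crucial prediction concerns $\beta_{TF}$, the coefficient of the Test by Familiarization Voice interaction.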
Tables detailing the statistical results are again presented in Appendices; differences reported in
the main text as statistically significant always had p < 0.05 or better, and in most cases p < 0.001.
3.3.2. Main results
Figure 4 and Table 5 show the percentages of correctly-reported Words and B and B2 syllable
constituents at Word1End and Word2Start. Relevant statistical results are in Appendix D. Results were
broadly similar for all measures. The most important result is that, as predicted, the improvement from
pre-test to post-test was slightly but significantly greater when the Same Voice was heard in both
familiarization and tests, than when a Different Voice was heard, as reflected in interactions between Test
and Familiarization Voice.
The increase in improvement for the Same Voice relative to the Different Voice condition was
very similar regardless of whether the B or B2 measure was considered (B Word1End: 5.1%, B2
Word1End: 4.6%; B Word2Start: 4.5%, B2 Word2Start: 4%). Because of the reduction in power for the
B2 analyses, when comparing statistical results for B and B2, it is more appropriate to consider the
estimate sizes for the crucial interaction between Test and Familiarization Voice, rather than their
p-values. These estimates can be expressed as odds ratios (Rice, 1995), where an odds ratio of 1 corresponds
to no difference between conditions, and larger odds ratios reflect larger differences. The estimated odds
ratios from the logistic regression models for the interaction of pre/post Test with Familiarization Voice
were: for Word1End, 1.26 and 1.21 for B and B2 respectively; and for Word2Start, 1.24, and 1.20 for B
and B2 respectively. Appendix D1 shows that neither of the B2 values reaches statistical significance (p
= 0.096 for B2 Word1End and p = 0.1 for B2 Word2Start) whereas the equivalent p values are 0.017 and
0.036 for the B measure, but this difference reflects the reduction in statistical power from the smaller B2
sample. What is important is that the odds ratios for the same-voice advantage using the B2 measure are
slightly smaller, but essentially comparable in magnitude to those in the B measure.
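To make the odds-ratio scale concrete, the sketch below shows how an interaction odds ratio relates to four cell proportions, and how an odds ratio maps back onto a log-odds estimate. The proportions in the example are hypothetical values chosen for illustration, not results from this experiment, and an odds ratio computed from raw proportions will not in general equal the estimate from a mixed-effects model. Function names are our own.

```python
import math

def odds(p):
    """Odds of a correct response given proportion correct p (0 < p < 1)."""
    return p / (1.0 - p)

def interaction_odds_ratio(pre_same, post_same, pre_diff, post_diff):
    """Ratio of the pre-to-post odds ratio in the Same Voice condition
    to the pre-to-post odds ratio in the Different Voice condition."""
    same_gain = odds(post_same) / odds(pre_same)
    diff_gain = odds(post_diff) / odds(pre_diff)
    return same_gain / diff_gain

def log_odds_estimate(odds_ratio):
    """Logistic regression coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)

# Hypothetical proportions correct, for illustration only
example_or = interaction_odds_ratio(0.40, 0.60, 0.40, 0.55)
# Coefficient implied by the reported odds ratio of 1.26 (B Word1End)
beta_b_word1end = log_odds_estimate(1.26)
```

An odds ratio of 1 corresponds to a coefficient of 0, so the reported values of 1.20 to 1.26 correspond to small positive coefficients on the log-odds scale.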
Figure 4 Percentage of correctly reported Words in whole sentence (a); B syllable constituents (b, c); B2 syllable constituents (d, e), at pre- and post-test. Error bars represent one standard error.
Panels: a) Words; b) B Word1End; c) B Word2Start; d) B2 Word1End; e) B2 Word2Start
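Table 5 Percentage of correctly reported Words and B and B2 syllable constituents, in pre-test and post-test according to Early or Late Boundary Position (top half of table, pooling across Voice) and Voice tested (bottom half of table, pooling across Boundary Position). The first three numeric columns give mean % correct responses; the last gives the percentage change (improvement) from pre-test to post-test.

| Measure | Condition | Overall mean | Pre-test | Post-test | Change |
| Words | Early Boundary | 59 | 46 | 72 | +26 |
| Words | Late Boundary | 58 | 45 | 71 | +26 |
| B Word1End | Early Boundary | 47 | 37 | 56 | +19 |
| B Word1End | Late Boundary | 45 | 37 | 52 | +15 |
| B2 Word1End | Early Boundary | 55 | 49 | 60 | +11 |
| B2 Word1End | Late Boundary | 65 | 67 | 64 | -3 |
| B Word2Start | Early Boundary | 54 | 45 | 63 | +18 |
| B Word2Start | Late Boundary | 38 | 27 | 48 | +21 |
| B2 Word2Start | Early Boundary | 68 | 66 | 70 | +4 |
| B2 Word2Start | Late Boundary | 54 | 49 | 58 | +9 |
| Words | PF | 69 | 56 | 81 | +25 |
| Words | MJ | 48 | 34 | 62 | +28 |
| B Word1End | PF | 54 | 45 | 62 | +17 |
| B Word1End | MJ | 37 | 29 | 45 | +16 |
| B2 Word1End | PF | 56 | 59 | 65 | +6 |
| B2 Word1End | MJ | 62 | 53 | 58 | +5 |
| B Word2Start | PF | 56 | 46 | 66 | +20 |
| B Word2Start | MJ | 36 | 26 | 45 | +19 |
| B2 Word2Start | PF | 65 | 61 | 68 | +7 |
| B2 Word2Start | MJ | 56 | 53 | 58 | +5 |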
Taken together, therefore, the B and B2 measures suggest that the Same-Voice advantage for
identifying syllable constituents at word boundaries is attributable only in small part to an overall
improvement in phoneme identification. In consequence, as discussed further below, retuning of context-
free phonemic categories is unlikely to be the best way to model the observed sensitivity to syllable
structure and position in word.
The other significant effects do not weaken this main finding, as Table 5 and Appendix D show.
As results for B and B2 measures were mostly similar, the following text does not distinguish them,
except where specifically indicated. First, subjects in all conditions improved considerably from pre- to
post-test, presumably because familiarization provided useful experience of the test sentences in
meaningful contexts (Test, Appendix D; cf. Davis et al., 2005). Improvement was greater on B than B2
measures, suggesting that from pre- to post-test listeners got better partly because they became able to
identify sounds that they had missed altogether, or whose phonemic identity they had misidentified, at the
pre-test. Second, responses to PF were more accurate than to MJ overall (lower half of Table 5), although
improvement was about the same for each speaker, within each measure. Third, responses to Early
Boundary items differed from responses to Late Boundary items. Responses to Early Boundary items
were very slightly but significantly more accurate than to Late Boundary items (by 1-2%) for Words and
B Word1End, but the opposite pattern was found for B2 Word1End, suggesting that if listeners were able
to identify the correct word-final phonemes for Late Boundary items at all, they were also quite accurate
at identifying their position in syllable. In contrast, for Word2Start, responses to Early Boundary items
were considerably more accurate than to Late Boundary items (B: by 16%; B2: by 14%) (upper half of
Table 5). That is, Late Boundary word beginnings, which generally corresponded to onsetless syllables
(e.g. iced), were especially poorly identified.
Finally, there were some differences among the W and B measures, which took the form of
significant interactions involving Boundary Position (Appendix D). These patterns all reflect the fact that the
prosodically weaker consonants in Late Boundary items had less perceptual salience than those in Early Boundary
items, and were thus more difficult to identify, especially in the less intelligible voice (MJ’s). Thus, for
Word1End and Word2Start, the accuracy advantage for PF over MJ was greater for Late than Early
Boundary sentences (Appendix D). Because they were more difficult initially, Late Boundary sentences
generally had the scope to improve more as participants learned about how they were spoken. Thus, for
Word2Start, Late Boundary sentences improved significantly more than Early Boundary sentences from
pre- to post-test (Table 5; Test x Boundary Position interaction, Appendix D). For Words, greater
improvement was again found for Late than Early Boundary sentences for PF’s voice, but the opposite
pattern was found for MJ’s Late Boundary items (Test x Voice x Boundary Position interaction,
Appendix D). The difficulty of these items in MJ’s speech may have been so great that learning about
them was slower.
Taken together, these relationships all support the idea that the prosodically weaker consonants in
Late Boundary sequences make them more difficult to identify than Early Boundary sequences.
Importantly, however, the Same-Voice advantage was constant across all measures and was not affected
by these differences involving Early vs. Late Boundaries.
3.4. Discussion
The perception experiment showed that 40 minutes’ exposure to sentences in a speaker's voice
led to more improvement in understanding novel tokens of those sentences in background noise than did
exposure to the identical sentences in a different voice. This Same-Voice advantage was significant not
only for identifying words, but also for identifying allophones—that is, phonemic sequences in their
correct syllabic and sequential context. Specifically and crucially, the Same-Voice advantage was
significant for segmenting difficult phonemic sequences, as reflected in listeners’ ability to assign
segments to the correct structural position as a syllable coda or a syllable onset, and hence to the correct
critical word (e.g. to correctly identify the /d/ in we’d rank as a word-final /d/; Figure 4). The results for
word identification parallel those of Nygaard & Pisoni (1998), though that study used multiple speakers,
and a longer familiarization phase; the results for difficult word segmentation are novel. The results
provide evidence of learning of speaker-specific phonetic detail at word boundaries: evidently, an
advantage of a familiar voice during word segmentation emerges when, as here, the task is difficult, yet
the stimuli are sufficiently similar to a previous experience to make prior knowledge systematically
useful. This advantage presumably contributes to the intelligibility benefit that familiarity with a voice
provides to listeners in segregating a single talker from background noise (Newman & Evers, 2007).
Poorer performance was found for Late Boundary than Early Boundary sentences. In other words,
coda consonants and onsetless syllables were identified less accurately than onset consonants (cf. Pickett,
Bunnell & Revoile, 1995; Ohala, 1996). When a consonant has ambiguous syllable and word affiliation,
English listeners are more likely to parse it as belonging to a syllable or word onset. The words in the
Early and Late Boundary phrases had roughly equal frequencies and phonological probabilities (two-
tailed sign tests: frequency using Kučera and Francis (1968), p = 0.15; phonological probability using
Coleman's (2000) prosodic probabilistic grammar, p = 0.54). However, syllables with onsets have a
higher type-frequency overall in English than syllables without (O’Connor & Tooley, 1964), so the better
performance for Early Boundary sentences may reflect listeners’ implicit knowledge of this pattern. It
seems reasonable to speculate, however, that much of the difference reflects the fact that, in Fougeron and
Keating’s (1997) sense, consonants were normally ‘stronger’ in onset than in coda position (Figure 1 and
Table 3).
Poorer absolute performance was found for speaker MJ than PF. It is possible that PF was
understood better than MJ because he marks his word boundaries particularly clearly, but other factors
may also be at play: influences on intelligibility are complex and poorly understood (Markham & Hazan,
2004). Possible partial explanations are that PF’s lower pitch made his voice easier to segregate from the
cafeteria noise, and that PF is a professional teacher, while MJ is less used to public speaking. Regardless
of the intelligibility difference between the speakers, the advantage due to familiarization with a specific
voice was of equivalent magnitude for both.
What are the implications of these results for the kind of learning that took place? They must
reflect some degree of abstraction over experience, because test and familiarization never used identical
tokens of sentences. While previous studies have mostly assumed that listeners abstract to phonemic
categories, the present results suggest a more complex picture. Although phonemic retuning may be part
of the story, the data do not support the view that the listeners were learning only about categories as
abstract as phonemes. Learning exclusively about phonemes would require assuming that listeners first
generalise from specific contexts to context-free phonemic categories, and then throw away structure-
specific information. Not only does this seem unlikely given this particular task, but it would not help in
the assignment of sounds to the correct position in syllables and words. The results showed that exposure
to a specific speaker’s voice improved listeners’ ability to identify phonemes in specific positions in
syllables and words, and the natural conclusion is that the learning that took place was sensitive to
position in syllable.
4. General Discussion
This study investigated production and perception of speaker-specific variation in patterns of
phonetic detail at word boundaries. The production experiment found quantitative and qualitative
variation in six speakers' productions of junctural minimal pairs. Speakers varied in the extent and way in
which they used acoustic durations, /d/ realisation and vowel formant frequencies to mark various types
of word boundary. The perception experiment used two speakers who differed in a range of these
dimensions, and investigated whether familiarization with a single voice led to learning about their
speaker-specific properties. Familiarization with the specific voice improved participants' ability to
identify words and syllable constituents at word boundaries in hard-to-segment, phonemically-identical
sequences presented in background cafeteria noise.
In production, we found that speakers varied in multiple aspects of their phonetic realisation of
word junctures. Such inter-speaker variation has of course been known since measurements of speech first began, but
while group trends were traditionally sought, we report the individual differences and exploit them in a perceptual
learning task. However, the differences and their implications are of interest in their own right. While
initial position in syllable and word was phonetically stronger than final position for all speakers, the
variation beyond this basic regularity was complex, and further work is required to elucidate factors
governing it. Interestingly, rate of speech, degree of lenition of /d/ consonants, and degree of durational
differentiation between positions in syllable, patterned partly together, but partly separately. For example,
speaker JR spoke fast, had relatively little durational differentiation between positions in syllable, and
generally lenited his /d/s. Speakers PF and MJ both spoke more slowly than JR, and at similar rates to
each other, but PF showed the greatest durational differentiation between positions in syllable, yet also
lenited his onset and coda /d/s much more than MJ and comparably to JR. Thus, we doubt that a single
parameter such as rate or ‘style’ governs all aspects of the variation. Further, while we have framed our
investigation in terms of idiosyncratic (indexical) differences, some aspects may be micro-dialectal or
socially stratified. For example, the variation we observed in the vowel in he’d, she’d, etc., seems
plausibly to be an age-graded change in the small set of contracted auxiliary verbs, related to
their grammatical and/or metrical properties, while some aspects of the /d/ realisation seem to
pattern with gender.
Although production models lacking an exemplar component can explain individual
differences in production by invoking individual differences in phonetic implementation of
linguistic units, we consider that our findings are most naturally accounted for in a model with an
exemplar component, specifically a hybrid exemplar-abstract model such as those proposed by
Pierrehumbert (2002, 2006) or Walsh, Möbius, Wade and Schütze (2010). For example,
Pierrehumbert (2002) presents a model that involves storage of exemplar chunks of speech,
combined with prosodic parsing/tagging of these chunks, and mapping of them to labelled
phonemic and lexical categories. Production goals are selected by sampling regions from within
the exemplar space corresponding to the selected label(s). Lexical knowledge acts as an
attentional bias on the selection of particular exemplars for production; this approach allows for
fine-grained allophonic differences to develop between phonemically similar structures (e.g. the
words realign and realise exhibit different patterns of glottalisation at the hiatus, due to the
morphological relationship between realign and align). Walsh et al. (2010) likewise present a
hybrid model in which production targets are generated from exemplar distributions labelled at
multiple levels corresponding to units of different sizes (e.g. phonemes and words). In general, a
larger unit (a word) will influence production more than its constituents (phonemes), as long as
the word is frequent enough to have acquired a critical mass of associated exemplars; if not,
production will be influenced mainly by the constituent phonemes. Thus, in hybrid exemplar
models, production targets can be generated on the basis of phonemic categories alone, but will
normally be influenced also by larger units, especially for experienced speakers. This idea
accords fairly well with the principles of adaptive resonance theory (Grossberg, 2003), of task
dynamics (Saltzman, 1995), where control of behaviour becomes more global as skill develops
(for an embodied formulation, see Simko and Cummins, 2010), and indeed of perception-action
robotics models (e.g., Sprague, Ballard and Robinson, 2007; Roy, 2005).
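The critical-mass idea in the hybrid account described above can be illustrated with a minimal sketch. The following Python fragment is not Walsh et al.'s (2010) implementation: the threshold CRITICAL_MASS, the function name production_target, the reduction of each exemplar to a single duration value, and the hard switch between word-level and phoneme-level exemplars are all assumptions made purely for illustration.

```python
import random
from statistics import mean

# Hypothetical threshold (an assumption): how many word-level exemplars must be
# stored before the word's own exemplar cloud drives production.
CRITICAL_MASS = 20


def production_target(word_exemplars, phoneme_exemplars, rng, n_samples=5):
    """Return a target duration (ms) averaged over randomly sampled exemplars.

    If the word has at least CRITICAL_MASS stored exemplars, sample from the
    word-level cloud; otherwise fall back on the constituent phoneme's cloud.
    """
    if len(word_exemplars) >= CRITICAL_MASS:
        pool = word_exemplars
    else:
        pool = phoneme_exemplars
    k = min(n_samples, len(pool))
    return mean(rng.sample(pool, k))


if __name__ == "__main__":
    rng = random.Random(1)
    # Invented durations, for illustration only.
    frequent_word = [rng.gauss(90, 5) for _ in range(50)]
    rare_word = [rng.gauss(90, 5) for _ in range(3)]
    phoneme = [rng.gauss(60, 10) for _ in range(200)]
    print(production_target(frequent_word, phoneme, rng))  # word-level cloud used
    print(production_target(rare_word, phoneme, rng))      # phoneme-level cloud used
```

A graded weighting between levels would probably be closer to the spirit of the hybrid models than this all-or-nothing switch; the sketch only shows how the amount of stored experience with a larger unit could determine which level supplies the production target.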
Our present findings support hybrid approaches in that they indicate that production is
influenced by multiple levels of structure (e.g. syllable and word identity, grammatical and
prosodic structure). Further, they suggest that the influences of these different levels vary
according to individual speaker. Some speakers (e.g. PF) showed stronger influences from
syllable structure and prosodic shape, but did not show, for example, the influence of grammar
noted above for MJ and RS. We would expect that attentional and situational biases could alter
the weightings of the various influences, as suggested by the intriguing data of Goldinger &
Azuma (2003), who found variation in word production according to whether the experimenter’s
introduction generated an implicit bias to attend to syllables or segments.
For perception, our results provide direct support for an exemplar component in modelling. The
perception experiment demonstrated a voice familiarization advantage that was small but consistent
across speakers and measures. Given that people do have experience throughout life with a variety of
speakers, any processing benefit due to experience with a particular speaker, no matter how small that
benefit, indicates that even the adult speech perceptual system is remarkably flexible. Furthermore, the
effect we observed is potentially important because the stimuli were more natural than those used in many
experiments on phonetic aspects of perceptual learning. For example, Allen & Miller (2004) tested inter-
speaker differences in VOT that were exaggerated to roughly twice the range found in Allen, Miller and
DeSteno's (2003) production study. Their study, and those of Norris, McQueen & Cutler (2003) and
Eisner & McQueen (2005) also used isolated words, whereas the present stimuli comprised meaningful
connected speech read very fluently in a casual style, and the emphasis in the perceptual tasks was on
accessing sentence meaning. Moreover, unlike other studies which use fluent read speech (Nygaard,
Sommers, & Pisoni, 1994; Nygaard & Pisoni 1998), the stimuli were also controlled enough to permit
conclusions about the phonetic nature of perceptual learning: in this case, that learning about the
particular pronunciations associated with individual speakers’ voices helps correct identification of
allophones at word boundaries, and hence facilitates access to meaning via lexical identification.
The B and B2 analyses allow us to explore to what extent the data can be reconciled with
accounts based on the retuning of sub-lexical, specifically phonemic, category representations (Norris,
McQueen, & Cutler 2003; Eisner & McQueen 2005). The data are compatible with this category retuning
account, as long as it is modified so that perceptual learning is not based solely on adjustments to
phonemic representations, but can also involve aspects of syllabic and prosodic structure (Hawkins &
Smith 2001; Hawkins 2003; Hawkins 2010; Pierrehumbert 2002, 2006). By this account, listeners would
not only store phonemic categories and learn about speaker-specific realisation of these (Norris et al.
2003), but would additionally store information about prosodic structures and processes, and learn about
speaker-specific phonetic implementation of these. For example, listeners undoubtedly know about
strengthening of syllable-onset consonants and lenition of syllable-coda consonants; they might learn that
such strengthening or lenition tends to be particularly pronounced for a particular speaker. Similarly,
listeners will know that pitch-accenting a syllable produces various phonetic changes in that syllable, and
might learn that certain of these changes tend to be particularly large or small for a given speaker.
Thus, while perceptual learning and category retuning may sometimes focus on phonemic
categories, many other sound categories are also relevant. Like many others (e.g. Saffran, Newport and
Aslin, 1996) we have argued that phonemic categories may be learned as “self-organizing” and
“emergent”, related by virtue of shared attributes or occurrence: the brain naturally works with contrasts,
and classifies like things together (Hawkins, 1995; Hawkins & Smith, 2001). The relationship between
sounds and phonemic categories may be learned very gradually as a speaker’s language system matures,
by association rather than because phonemes play a fundamental role in speech communication. Hawkins
(2010:500-501) argues that it follows that phonemic representations might not be comprehensive or fully
systematic. Assuming that the shared attributes of phonemes can be learned via any relevant experience,
those that share consistent articulatory, auditory or orthographic properties (for people literate in an
alphabetic writing system) seem particularly likely to form categories. In English, candidates for
systematic phonemic categories might be /s/, which varies little acoustically across contexts (although it is
indexically variable, e.g. Stuart-Smith, 2007), and bilabial (but not alveolar) stops, because bilabial stops
vary little in critical aspects of articulation. Other sounds, for example vowel qualities conditioned by
certain consonants, e.g. those in whole and hope in several varieties of British English, might never be
grouped appropriately into phonemes. Orthographic consistency presumably influences the system’s
development and functioning (cf. Ranbom & Connine, 2011). If the language allows it, then a rough
phoneme inventory could develop, with some phonemes being more clearly defined as categories and
accessible to consciousness than others, and with individuals, and speakers of different languages,
differing in how systematic their phonemic inventories are. The implication for perceptual learning is that
learning may be about phonemic categories when those categories are particularly systematic, but may be
about allophonic, featural, syllabic, prosodic or grammatical categories when these are systematic. These
arguments are compatible with those made by Nielsen (2011), Dahan & Mead (2010) and Kraljic &
Samuel (2006), who all found generalization of perceptual learning to non-phonemic categories, namely
features or positions in syllable, and who emphasise the flexibility of perceptual learning as the key to
understanding the range of units that may be involved.
What emerges clearly from the present work and other studies on indexical variability in
production and perception is that when a listener encounters the mixture of personal and linguistic
information in an individual’s voice, understanding the message can be facilitated by learning about how
the individual speaks, in order that phonetic detail can be effectively mapped to linguistic and personal
perceptual dimensions. The results that we have observed demonstrate that speaker-specific variation exists in the
phonetic detail that indicates word boundary location, and that listeners learn about it. Our results suggest that
by widening the range of phonetic properties investigated in perceptual learning studies, we will be able
to discover more about the aspects of speaker variation that help listeners to decode messages, and hence
more about the types of processes involved.
Acknowledgements
Parts of this work were done in partial fulfilment of the requirements for the PhD degree awarded to the first author by the University of Cambridge. Parts of it were presented in Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, August 2007. We thank Harald Baayen, Mirjam Ernestus and Sam Miller for statistical help, Naomi Hilton for assistance testing subjects, and Pam Beddor, Gerry Docherty, Gareth Gaskell, Jane Stuart-Smith and several anonymous reviewers for helpful comments on previous drafts. Financial support was provided by the Arts and Humanities Research Board and Newton Trust at the University of Cambridge, and the John Robertson Bequest at the University of Glasgow.
Appendix A: experimental sentences and precursors

Critical portions of experimental sentences are underlined.

Group D
D1. EB: We lined up all the wines from the cheapest to the most expensive. Then we drank them in order.
    LB: We decided that we’d mark all the tests first. Then we’d rank them in order.
D2. EB: I bought a new car stereo in time for the trip. I drove all over Europe playing music.
    LB: I thought I would become a travelling bard. I’d rove all over Europe playing music.
D3. EB: He wanted the carrots to cook fast. So he diced them.
    LB: The top of the cakes had come out looking uneven. So he’d iced them.
D4. EB: Her teenage son insisted all his T-shirts had to be black. She dyed them resentfully.
    LB: She hadn’t been able to afford the jeans in the shop window. She’d eyed them resentfully.
D5. EB: All writers are terrified when their book first comes out. We dread the reviews.
    LB: We expected the play to be unusual. We’d read the reviews.
D6. EB: John thought they needed a challenge. He was pleased that we dared them.
    LB: All Mark’s clothes had been damp. He was pleased that we’d aired them.

Group S
S1. EB: There are some people out there with a morbid streak. Apparently the gentlemen collect skulls.
    LB: We have a dedicated bird collector living next door. Apparently the gentleman collects gulls.
S2. EB: Most people would have borrowed money to pay for it. Pete stole money.
    LB: She eventually realised what the cheque on the table was. Pete’s dole money.
S3. EB: Chocolate and sugar are comforting when you’re in danger. That’s why the airmen eat sweet products.
    LB: Carbohydrates are good for endurance. That’s why the airman eats wheat products.
S4. EB: “The Lord is my Shepherd” is popular at St. Mary’s. The congregation certainly like psalms.
    LB: The pastor and his flock are gun-toting NRA members. The congregation certainly likes arms.
S5. EB: For a while the fallen trees blocked access. But Pat sawed them.
    LB: To begin with they were unimpressed. But Pat’s awed them.
S6. EB: Among the dog baskets, Sarah was surprised to see some much smaller ones. She said, “Those are cat size”.
    LB: I wanted to know what the little lights in the road were. She said, “Those are cat’s eyes”.

Group A
A1. EB: “The film was sentimental, but one scene was genuinely moving,” she said. “That surprise for the child.”
    LB: “See the mountain bike over there?” he said. “That’s a prize for the child.”
A2. EB: I agree that Simon excels. But Ralph surpasses.
    LB: Geoff’s grades are all distinctions. But Ralph’s are passes.
A3. EB: Don’t say venereal. Say veneer for me.
    LB: Don’t eat the whole of the chocolate rabbit. Save an ear for me.
A4. EB: John gave up pretty early. Ruth sustained.
    LB: I’m not sure whose trousers to borrow. Ruth’s are stained.
A5. EB: He hasn’t worked with the Gurkhas before. It’s no wonder he didn’t recognise that salute.
    LB: He doesn’t know anything about music. It’s no wonder he didn’t recognise that’s a lute.
A6. EB: There was still one thing she didn’t like about the fireplace. That surround.
    LB: That song’s not a fugue. That’s a round.

Group T
T1. EB: They’d told him not to leave the house. But he went for a sly stroll.
    LB: There were lots of fancy sandwiches on the table. But he went for a sliced roll.
T2. EB: After each show I arrange the props backstage, ready for the next day. And I lay Steve’s costume out in the wings.
    LB: Just before the curtain went up, I did the actors’ make-up in the green room. And I laced Eve’s costume out in the wings.
T3. EB: His parents take the idea of corporal punishment too far. They whack Stan’s legs quite painfully.
    LB: I don’t recommend the King’s Road salon. They waxed Ann’s legs quite painfully.
T4. EB: I don’t know what caused my migraines. It may have been eye strain.
    LB: You must have misread the sentence: hail isn’t acid rain. It may have been iced rain.
T5. EB: It’s not just that his foster parents are well off. They also offer Mick stability.
    LB: The yoga centre has a few beginners’ classes this year. They also offer mixed ability.
T6. EB: They’ll have a game of football with anyone that’s willing. They even play strangers in the park.
    LB: The local authority took all sorts of measures to combat crime. They even placed rangers in the park.
Appendix B: Statistical results for production experiment
Table B1 Mean durations, and results of mixed-effects modelling for relative durations in the four sentence groups, and for vowel formants in Group D. Statistically significant effects (p ≤ 0.05) are shaded grey. ndf = numerator degrees of freedom; ddf = denominator degrees of freedom.

| Variable | Early, Late boundary means (ms) | Early, Late boundary means (% of critical phrase) | ddf | Boundary position: F (ndf 1) | p | Speaker: F (ndf 5) | p | Boundary position x Speaker: F (ndf 5) | p |
|---|---|---|---|---|---|---|---|---|---|
| Group D (e.g. he diced vs. he’d iced) | | | | | | | | | |
| PrecSyl | 96, 101 | 21.9, 23.3 | 554 | 24.3 | <0.0001 | 20.2 | <0.0001 | 2.3 | 0.04 |
| /d/ | 92, 63 | 21.4, 15.0 | 547 | 751.2 | <0.0001 | 34.2 | <0.0001 | 3.2 | 0.007 |
| FolSyl | 246, 264 | 56.5, 61.5 | 548 | 371.1 | <0.0001 | 66.6 | <0.0001 | 6.4 | <0.0001 |
| Group S (e.g. cat size vs. cats’ eyes) | | | | | | | | | |
| PrecSyl | 248, 241 | 34.1, 34.1 | 542 | 0.1 | 0.82 | 18.1 | <0.0001 | 2.4 | 0.03 |
| /s/ | 136, 103 | 19.9, 15.5 | 548 | 521.0 | <0.0001 | 9.1 | <0.0001 | 7.2 | <0.0001 |
| FolSyl | 340, 362 | 45.8, 50.3 | 569 | 198.9 | <0.0001 | 21.2 | <0.0001 | n/a | n/a |
| Group A (e.g. that surprise vs. that’s a prize) | | | | | | | | | |
| PrecSyl | 214, 201 | 27.5, 26.9 | 552 | 1.9 | 0.17 | 30.2 | <0.0001 | 4.3 | 0.0007 |
| /s/ or /v/ | 113, 92 | 14.6, 12.1 | 553 | 238.8 | <0.0001 | 3.4 | 0.01 | 6.4 | <0.0001 |
| /ə/ | 39, 45 | 5.1, 5.9 | 543 | 44.3 | <0.0001 | 8.8 | <0.0001 | n/a | n/a |
| Consonant | 86, 90 | 10.9, 11.5 | 556 | 21.6 | <0.0001 | 18.3 | <0.0001 | 5.1 | 0.0001 |
| FolSyl | 325, 329 | 41.7, 43.4 | 553 | 33.0 | <0.0001 | 57.6 | <0.0001 | 6.6 | <0.0001 |
| Group T (e.g. lay Steve’s costume vs. laced Eve’s costume) | | | | | | | | | |
| PrecSyl | 226, 212 | 37.6, 36.8 | 569 | 18.3 | <0.0001 | 20.6 | <0.0001 | n/a | n/a |
| /s/ | 104, 83 | 17.2, 14.1 | 552 | 410.6 | <0.0001 | 96.1 | <0.0001 | 8.9 | <0.0001 |
| /t/ closure | 40, 35 | 6.7, 6.2 | 564 | 11.8 | <0.0006 | 9.4 | <0.0001 | 5.2 | 0.0001 |
| /t/ VOT | 31, 35 | 5.2, 5.9 | 551 | 4.0 | 0.05 | 23.2 | <0.0001 | 6.8 | <0.0001 |
| FolSyl | 216, 237 | 32.9, 36.7 | 551 | 500.1 | <0.0001 | 37.3 | <0.0001 | 2.7 | 0.02 |
| Group D /iː/ (Early, Late boundary means in Bark) | | | | | | | | | |
| F2-F1 | 11.0, 10.0 (Bark) | | 461 | 665.8 | <0.0001 | 504.7 | <0.0001 | 63.8 | <0.0001 |
Table B2 Pairwise comparisons for variables where a significant interaction between Boundary Position x Speaker was found. Only comparisons significant with the Bonferroni-Holm procedure are reported (see text for details).

| Variable | EB > LB | LB > EB | Differences among speakers in size of EB-LB difference |
|---|---|---|---|
| Group D (e.g. he diced vs. he’d iced) | | | |
| Preceding syllable | -- | AK (p = 0.0001); PF (p = 0.0091); SC (p = 0.001) | AK > MJ (p = 0.0034) |
| /d/ | all speakers (p < 0.0001) | -- | PF > JR (p = 0.0014) |
| Following syllable | -- | all speakers (p < 0.0001) | MJ > {AK, JR, SC} (all p < 0.0002) |
| /iː/ F2-F1 | all speakers (p < 0.0001) | -- | MJ > {AK, JR, PF, SC} (all p < 0.0001); RS > {AK, JR, PF, SC} (all p < 0.0001) |
| Group S (e.g. cat size vs. cats’ eyes) | | | |
| Preceding syllable | PF (p = 0.0217) | MJ (p = 0.0126) | PF ≠ MJ (p = 0.0007) |
| /s/ | all speakers (p < 0.0001) | -- | MJ > {AK, PF, SC} (all p < 0.0025); JR > PF (p = 0.0021) |
| Group A (e.g. that surprise vs. that’s a prize) | | | |
| Preceding syllable | PF (p = 0.0007) | RS (p = 0.0314) | PF ≠ JR (p = 0.001); PF ≠ RS (p = 0.0001) |
| /s/ or /v/ | all speakers (p = 0.006) | -- | {JR, MJ, PF, RS} > AK (all p < 0.005); {MJ, RS} > SC (both p < 0.002) |
| Consonant | -- | PF (p < 0.0001) | PF > {AK, JR, MJ, SC} (all p < 0.002) |
| Following syllable | SC (p = 0.0173) | AK (p = 0.0075); JR (p = 0.0127); MJ (p < 0.0001); PF (p = 0.0012); RS (p = 0.0361) | SC ≠ {AK, JR, MJ, PF, RS} (all p < 0.002) |
| Group T (e.g. lay Steve’s costume vs. laced Eve’s costume) | | | |
| /s/ | all speakers (p < 0.0001) | -- | SC > {AK, JR, MJ, PF, RS} (all p < 0.0017) |
| /t/ closure | PF (p < 0.0001) | -- | PF > {AK, JR, MJ, RS, SC} (all p < 0.0028) |
| /t/ VOT | -- | SC (p < 0.0001) | SC > {AK, JR, MJ, PF, RS} (all p < 0.0036) |
| Following syllable | -- | all speakers (p < 0.0001) | PF > MJ (p = 0.0007) |
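For readers unfamiliar with the Bonferroni-Holm procedure (Holm, 1979) named in the caption of Table B2, a minimal Python sketch of the general step-down method follows. It is offered only as an illustration: the function name holm_significant, the family-wise level of 0.05 and the example p-values are assumptions, and the sketch does not reproduce how comparisons were grouped into families in the present analyses.

```python
def holm_significant(p_values, alpha=0.05):
    """Return a list of booleans, True where a comparison survives Holm's
    sequentially rejective procedure at family-wise level alpha."""
    m = len(p_values)
    # Visit comparisons from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    keep = [False] * m
    for rank, idx in enumerate(order):
        # Smallest p is tested at alpha/m, the next at alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            keep[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return keep


if __name__ == "__main__":
    # Hypothetical p-values, not taken from Table B2.
    print(holm_significant([0.001, 0.04, 0.03, 0.012]))
    # prints [True, False, False, True]
```

In the example, 0.03 fails its threshold of 0.05/2 = 0.025, so 0.04 is also rejected as non-significant even though it is below 0.05; this is the sense in which the procedure is less conservative than a plain Bonferroni correction but still controls the family-wise error rate.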
Appendix C: Similarities and differences between the speakers used in the perception experiment
Table C1 Summary of the differences in phonetic detail at word boundaries between the two speakers whose voices were heard in the perception experiment, PF and MJ.

| Category | Group | Specific variable | PF | MJ |
|---|---|---|---|---|
| PF makes a statistically significant distinction that is not significant for MJ | | | | |
| Durations | D | /hiː/ longer in he’d than he by | 6 ms (6%) | 2 ms (2%), n.s. |
| Durations | A | /ðat/ longer in that surprise than that’s a prize by | 41 ms (17%) | 8 ms (4%), n.s. |
| Durations | A | /p/ longer in prize than surprise by | 17 ms (21%) | 3 ms (4%), n.s. |
| Durations | T | /t/ closure longer in lay Steve’s than laced Eve’s by | 16 ms (57%) | 0 ms (0%), n.s. |
| Both speakers make a statistically significant distinction in the same direction, but magnitude is greater for PF | | | | |
| Durations | T | /iːvz/ longer in Eve’s than Steve’s by | 27 ms (11%) | 18 ms (9%) |
| Consonant realisation | D | Continuous formant structure present during /d/ | 19% overall | 6% overall |
| Consonant realisation | D | Continuous formant structure present more often in he’d than he by | 20 percentage points | 4 percentage points |
| Both speakers make a statistically significant distinction in the same direction, but magnitude is greater for MJ | | | | |
| Durations | S | /s/ longer in cat size than cat’s eyes | 26 ms (23%) | 53 ms (56%) |
| Spectral | D | /iː/ F2-F1 difference greater in he than he’d by | 0.4 Bark | 1.8 Bark |
| Consonant realisation | D | Continuous voicing present during /d/ | 80% overall | 28% overall |
| Consonant realisation | D | Continuous voicing present more often in he’d than he by | 28 percentage points | 40 percentage points |
| Both speakers make a statistically significant distinction, but in opposite directions | | | | |
| Durations | S | /kat/ in cat vs cat’s | longer by 10 ms (4%) | shorter by 3 ms (1%) |
| Speakers do not differ significantly | | | | |
| Rate | All | Mean duration of critical phrase | 679 ms | 606 ms |
| Durations | D | /d/ longer in he diced than he’d iced by | 38 ms (60%) | 37 ms (57%) |
| Durations | D | /aɪst/ longer in he’d iced than he diced by | 20 ms (8%) | 30 ms (13%) |
| Durations | A | /s/ or /v/ longer in that surprise than that’s a prize | 23 ms (22%) | 27 ms (33%) |
| Durations | A | /raɪz/ longer in that’s a prize than that surprise | 22 ms (7%) | 22 ms (7%) |
| Durations | T | /s/ longer in lay Steve’s than laced Eve’s | 14 ms (17%) | 17 ms (22%) |
| Consonant realisation | D | Continuous frication present during /d/ | 15% overall | 19% overall |
| Consonant realisation | D | Continuous frication present more often in he’d iced than he diced | 25 percentage points | 29 percentage points |
Appendix D: Statistical results for perception experiment
Table D1 Statistical results for the perception experiment. Only significant predictors and non-significant main effects that participated in significant interactions are reported. Terms not in the best fitting model are labelled “—”. An asterisk (*) in the df column indicates that χ², df and significance levels are for the named factor plus any interaction terms in the table that included that factor. Marginally significant effects (0.05 < p ≤ 0.1) are underlined. Nonsignificant effects are labelled n.s.

| Effect | Words: χ² | df | p |
|---|---|---|---|
| Test (pre/post) | 3434.4 | 5* | <0.0001 |
| Voice (MJ/PF) | 110.7 | 4* | <0.0001 |
| Familiarization Voice (Same/Different) | 20.6 | 2* | <0.0001 |
| Boundary Position (EB/LB) | 88.6 | 4* | <0.0001 |
| Sentence Group (D/S/A/T) | — | — | — |
| Test x Voice | 12.1 | 2* | 0.002 |
| Test x Familiarization Voice | 19.5 | 1 | <0.0001 |
| Test x Boundary Position | 13.9 | 2* | 0.001 |
| Voice x Boundary Position | 10.7 | 2* | 0.005 |
| Test x Voice x Boundary Position | 10.7 | 1 | 0.001 |

| Effect | B Word1End: χ² | df | p | B Word2Start: χ² | df | p | B2 Word1End: χ² | df | p | B2 Word2Start: χ² | df | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test | 245.3 | 2* | <0.0001 | 351.3 | 3* | <0.0001 | 51.7 | 3* | <0.0001 | 38.0 | 3* | <0.0001 |
| Voice | 75.0 | 2* | <0.0001 | 95.9 | 2* | <0.0001 | 28.5 | 3* | <0.0001 | 54.9 | 2* | <0.0001 |
| Familiarization Voice | 6.0 | 2* | 0.05 | 5.1 | 2* | n.s. | 2.8 | 2* | n.s. | 2.6 | 2* | n.s. |
| Boundary Position | 11.5 | 2* | 0.003 | 251.1 | 3* | <0.0001 | 99.6 | 4* | <0.0001 | 134.9 | 3* | <0.0001 |
| Sentence Group | 9.1 | 3 | 0.028 | 8.3 | 3 | 0.039 | 8.6 | 3 | 0.036 | 5.5 | 3 | 0.1399 |
| Test x Voice | — | — | — | — | — | — | — | — | — | — | — | — |
| Test x Familiarization Voice | 5.7 | 1 | 0.017 | 4.4 | 1 | 0.036 | 2.8 | 1 | 0.096 | 2.6 | 1 | 0.109 |
| Test x Boundary Position | — | — | — | 7.0 | 1 | 0.008 | 32.2 | 1 | <0.0001 | 3.6 | 1 | 0.056 |
| Voice x Boundary Position | 7.9 | 1 | 0.005 | 8.5 | 1 | 0.003 | 3.3 | 1 | 0.068 | 8.2 | 1 | 0.004 |
References
Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset time. Journal of the Acoustical Society of America, 113, 544-552.
Allen, J. S., & Miller, J. L. (2004). Listener sensitivity to individual talker differences in voice-onset-time. Journal of the Acoustical Society of America, 116, 3171-3183.
Baayen, R. H. (2008). Analysing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press.
Baker, R. (2008). Grammatical and probabilistic influences on the production and perception of fine phonetic detail. Unpublished Ph.D. dissertation, University of Cambridge.
Best, C. T. (1995). A direct realist perspective on cross-language speech perception. In Strange, W. (Ed.), Speech perception and linguistic experience: Theoretical and methodological issues in cross-language speech research (pp. 167-200). York Press.
Boersma, P., & Weenink, D. (2006). Praat: doing phonetics by computer (Version 4.5) [Computer program]. Retrieved October 26, 2006, from http://www.praat.org/
Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106, 707-729.
Bradlow, A., Nygaard, L. C., & Pisoni, D. B. (1999). Effects of talker, rate and amplitude variation on recognition memory for spoken words. Perception and Psychophysics, 61, 206-219.
Bybee, J. (2006). From usage to grammar: The mind's response to repetition. Language, 82, 711-733.
Charles-Luce, J. (1997). Cognitive factors involved in preserving a phonemic contrast. Language and Speech, 40, 229-248.
Cho, T., McQueen, J. M., & Cox, E. A. (2007). Prosodically driven phonetic detail in speech processing: The case of domain-initial strengthening in English. Journal of Phonetics, 35, 210-243.
Church, B. A., & Schacter, D. L. (1994). Perceptual specificity of auditory priming: Implicit memory for voice intonation and fundamental frequency. Journal of Experimental Psychology: Learning, Memory, & Cognition, 20, 521-533.
Coleman, J. (2000). Candidate selection. The Linguistic Review, 17, 167-179.
Cooper, W. E., & Paccia-Cooper, J. (1980). Syntax and speech. Cambridge, MA: Harvard University Press.
Cruttenden, A. (1994). Gimson's introduction to the pronunciation of English (5th ed.). London: Arnold.
Dahan, D., & Mead, R. L. (2010). Context-conditioned generalization in adaptation to distorted speech. Journal of Experimental Psychology: Human Perception and Performance, 36, 704-728.
Davis, M. H., Marslen-Wilson, W., & Gaskell, M. G. (2002). Leading up the lexical garden-path: segmentation and ambiguity in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 28, 218-244.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General, 134, 222-241.
Eisner, F., & McQueen, J. M. (2005). The specificity of perceptual learning in speech processing. Perception and Psychophysics, 67, 224-238.
Fougeron, C., & Keating, P. A. (1997). Articulatory strengthening at edges of prosodic domains. Journal of the Acoustical Society of America, 101, 3728-3740.
Fougeron, C. (2001). Articulatory properties of initial segments in several prosodic constituents in French. Journal of Phonetics, 29, 109-135.
Gårding, E. (1967). Internal juncture in Swedish. Lund: CWK Gleerup.
Goldinger, S. D., Pisoni, D. B., & Logan, J. S. (1991). On the nature of talker variability effects on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 152-162.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166-1183.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251-279.
Goldinger, S.D. (2007). A complementary-systems approach to abstract and episodic speech perception. In J. Trouvain & W.J. Barry (Eds.), Proceedings of the 16th International Congress of Phonetic Sciences (pp. 49-54). Saarbrücken.
Goldinger, S.D., & Azuma, T. (2003). Puzzle-solving science: The quixotic quest for units in speech perception. Journal of Phonetics, 31, 305-320.
Gow, D. W., & Gordon, P. C. (1995). Lexical and prelexical influences on word segmentation: Evidence from priming. Journal of Experimental Psychology: Human Perception and Performance, 21, 344-359.
Grossberg, S. (2003). Resonant neural dynamics of speech perception. Journal of Phonetics, 31, 423-445.
Hawkins, S. (1995). Arguments for a nonsegmental view of speech perception. In K. Elenius & P. Branderud (Eds.), Proceedings of the 13th International Congress of Phonetic Sciences, 3 (pp. 18-25). Stockholm.
Hawkins, S., & Smith, R. H. (2001). Polysp: a polysystemic, phonetically-rich approach to speech understanding. Italian Journal of Linguistics-Rivista di Linguistica, 13, 99-188.
Hawkins, S. (2003). Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31, 373-405.
Hawkins, S. (2010). Phonetic variation as communicative system: Perception of the particular and the abstract. In C. Fougeron, B. Kühnert, M. d'Imperio, & N. Vallée (eds.), Laboratory Phonology 10: Variability, Phonetic Detail and Phonological Representation. Berlin: Mouton de Gruyter. 479-510.
Hay, J., & Bresnan, J. (2006). Spoken syntax: The phonetics of giving a hand in New Zealand English. The Linguistic Review, 23, 351-379.
Heinrich, A., Flory, Y., & Hawkins, S. (2010). Influence of English r-resonances on intelligibility of speech in noise for native English and German listeners. Speech Communication, 52, 1038-1055.
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., & Carlyon, R. P. (2008). Perceptual learning of noise vocoded words: Effects of feedback and lexicality. Journal of Experimental Psychology: Human Perception and Performance, 34, 460-474.
Hoard, J. E. (1966). Juncture and syllable structure in English. Phonetica, 15, 96-109.
Holm, S. A. (1979). A simple sequential rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson, & J. W. Mullennix (Eds.), Talker variability in speech processing (pp.145-165). San Diego: Academic Press.
Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34, 485-499.
Jones, D. (1931). The 'word' as a phonetic entity. Le Maître Phonétique, 3rd series 36, 60-65.
Jurafsky, D., Bell, A., & Girand, C. (2002). The role of the lemma in form variation. In C. Gussenhoven, & N. Warner (Eds.), Laboratory phonology 7 (pp. 1-34). Berlin: Mouton de Gruyter.
Kemps, R. J. J. K., Ernestus, M., Schreuder, R., & Baayen, R. H. (2005). Prosodic cues for morphological complexity: The case of Dutch plural nouns. Memory & Cognition, 33, 430-446.
Kemps, R., Wurm, L., Ernestus, M., Schreuder, R., & Baayen, R. (2005). Prosodic cues for morphological complexity in Dutch and English. Language and Cognitive Processes, 20, 43-73.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1221.
Krakow, R. A. (1999). Physiological organisation of syllables: A review. Journal of Phonetics, 27, 23-54.
Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin & Review, 13, 262-268.
Kucera, H., & Francis, W. N. (1967). Computational Analysis of Present-day American English. Providence: Brown University Press.
Kwong, K., & Stevens, K. N. (1999). On the voiced-voiceless distinction for writer/rider. Speech Communication Group Working Papers, Research Laboratory of Electronics, Massachusetts Institute of Technology, volume 9, pp. 1–20.
Lachs, L., McMichael, K., & Pisoni, D. B. (2003). Speech perception and implicit memory: Evidence for detailed episodic encoding. In J. S. Bowers, & C. J. Marsolek (Eds.), Rethinking implicit memory (pp. 215-235). Oxford: Oxford University Press.
Lehiste, I. (1960). An acoustic-phonetic study of internal open juncture. Phonetica, 5, Supplement, 5-54.
Lehiste, I. (1972). Suprasegmentals. Cambridge, MA: MIT Press.
Local, J. K. (2003). Variable domains and variable relevance: Interpreting phonetic exponents. Journal of Phonetics, 31, 321-339.
Luce, P. A., McLennan, C. T., & Charles-Luce, J. (2003). Abstractness and specificity in spoken word recognition: Indexical and allophonic variability in long-term repetition priming. In Bowers, J. & Marsolek, C. (Eds.), Rethinking implicit memory (pp. 197-214). Oxford: Oxford University Press.
Markham, D., & Hazan, V. (2004). The effect of talker- and listener-related factors on intelligibility for a real-word, open-set perception test. Journal of Speech, Language, and Hearing Research, 47, 725-737.
McLennan, C.T., Luce, P.A., & Charles-Luce, J. (2003). Representation of lexical form. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 539-553.
McQueen, J.M., Cutler, A., & Norris, D. (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30, 1113-1126.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329-335.
Mattys, S. L., White, L., & Melhorn, J. F. (2005). Integration of multiple speech segmentation cues: A hierarchical framework. Journal of Experimental Psychology: General, 134, 477-500.
Newman, R.S., & Evers, S. (2007). The effect of talker familiarity on stream segregation. Journal of Phonetics, 35, 85-103.
Nielsen, K. Y. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39, 132-142.
Norris, D., McQueen, J., Cutler, A., & Butterfield, S. (1997). The possible-word constraint in the segmentation of continuous speech. Cognitive Psychology, 34, 191-243.
Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47, 204-238.
Nygaard, L. C., Burt, S. A., & Queen, J. S. (2000). Surface form typicality and asymmetric transfer in episodic memory for spoken words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1228-1244.
Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5, 42-46.
Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception and Psychophysics, 60, 355-376.
O’Connor, J. D., & Tooley, O. M. (1964). The perceptibility of certain word-boundaries. In D. Abercrombie, D. B. Fry, P. A. D. MacCarthy, N. C. Scott, & J. L. M. Trim (Eds.), In honour of Daniel Jones: Papers contributed on the occasion of his eightieth birthday, 12 September 1961 (pp. 171-176). London: Longmans.
Ogden, R. A. (1999). A declarative account of strong and weak auxiliaries in English. Phonology, 16, 55-92.
Ogden, R. A., Hawkins, S., House, J., Huckvale, M., Local, J. K., Carter, P., Dankovičová, J., & Heid, S. (2000). Prosynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis. Computer Speech and Language, 14, 177-210.
Ohala, J. J. (1996). Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America, 99, 1718-1725.
Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 309-328.
Pickett, J. M., Bunnell, H. T., & Revoile, S. G. (1995). Phonetics of intervocalic consonant perception: Retrospect and prospect. Phonetica, 52, 1-40.
Pierrehumbert, J. (2002). Word-specific phonetics. In C. Gussenhoven, & N. Warner (Eds.), Laboratory phonology 7 (pp. 101-139). Berlin: Mouton de Gruyter.
Pierrehumbert, J. (2006). The next toolkit. Journal of Phonetics, 34, 516-530.
Pisoni, D. B. (1997). Some thoughts on “normalization” in speech perception. In K. Johnson, & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9–32). San Diego: Academic Press.
Quené, H. (1992). Durational cues for word segmentation in Dutch. Journal of Phonetics, 20, 331-350.
Quené, H. (1993). Segment durations and accent as cues to word segmentation in Dutch. Journal of the Acoustical Society of America, 94, 2027-2035.
Ranbom, L. J., & Connine, C. M. (2011). Silent letters are activated in spoken word recognition. Language and Cognitive Processes, 26, 236-261.
Rice, J. A. (1995). Mathematical Statistics and Data Analysis (2nd edition). Belmont, CA: Duxbury.
Rietveld, A. C. M. (1980). Word boundaries in the French language. Language and Speech, 23, 289-296.
Roy, D. (2005). Semiotic schemas: A framework for grounding language in action and perception. Artificial Intelligence, 167, 170–205.
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606-621.
Saltzman, E. (1995). Dynamics and coordinate systems in skilled sensorimotor activity. In R. F. Port & T. van Gelder (Eds.), Mind as motion: Explorations in the dynamics of cognition (pp. 149–173). Cambridge, MA: MIT Press.
Salverda, A. P., Dahan, D., & McQueen, J. M. (2003). The role of prosodic boundaries in the resolution of lexical embedding in speech comprehension. Cognition, 90, 51-89.
Shockley, K., Sabadini, L., & Fowler, C. A. (2004). Imitation in shadowing words. Perception and Psychophysics, 66, 422-429.
Sidaras, S.K., Alexander, J.E.D., & Nygaard, L.C. (2009). Perceptual learning of accented speech. Journal of the Acoustical Society of America, 125, 3306-3316.
Simko, J., & Cummins, F. (2010). Embodied task dynamics. Psychological Review, 117, 1229-1246.
Smith, R. (2003). Influence of talker-specific phonetic detail on word segmentation. In M. J. Solé, D. Recasens, & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (pp. 1465-1468).
Smith, R. H. (2004). The role of fine phonetic detail in word segmentation. Unpublished Ph.D. thesis, University of Cambridge.
Sommers, M. S., & Barcroft, J. (2006). Stimulus variability and the phonetic relevance hypothesis: Effects of variability in speaking style, fundamental frequency, and speaking rate on spoken word identification. Journal of the Acoustical Society of America, 119, 2406-2416.
Sprague, N., Ballard, D., & Robinson, A. (2007). Modeling embodied visual behaviors. ACM Transactions on Applied Perception, 4, Article 11.
Stevens, K. N., & House, A. S. (1963). Perturbation of vowel articulations by consonantal context: An acoustical study. Journal of Speech and Hearing Research, 6, 111-128.
Stuart-Smith, J. (2007). Empirical evidence for gendered speech production: /s/ in Glaswegian. In J. Cole & J. Hualde (Eds.), Change in Phonology (Laboratory Phonology 9) (pp. 65-86). Berlin: Mouton de Gruyter.
Sumner, M. & Samuel, A. G. (2005). Perception and representation of regular variation: The case of final /t/. Journal of Memory and Language, 52, 322-338.
Traunmüller, H. (1990). Analytical expressions for the tonotopic sensory scale. Journal of the Acoustical Society of America, 88, 97-100.
Turk, A.E., & White, L. (1999). Structural influences on accentual lengthening in English. Journal of Phonetics, 27, 171-206.
Turk, A. E., & Shattuck-Hufnagel, S. (2000). Word-boundary-related duration patterns in English. Journal of Phonetics, 28, 397-440.
Umeda, N., & Coker, C. H. (1975). Subphonemic details in American English. In G. M. Fant and M. A. A. Tatham (Eds.), Auditory analysis and perception (pp. 539-564). London: Academic Press.
van Santen, J. P. H. (1992). Contextual effects on vowel duration. Speech Communication, 11, 513-546.
Walsh, M., Möbius, B., Wade, T., & Schütze, H. (2010). Multilevel exemplar theory. Cognitive Science, 34, 537-582.
White, L., & Mattys, S. L. (2007). Rhythmic typology and variation in first and second languages. In P. Prieto, J. Mascaró, & M.-J. Solé (Eds.), Segmental and Prosodic issues in Romance Phonology. Current Issues in Linguistic Theory series (pp. 237-257). Amsterdam: John Benjamins.
Wyld, H. C. (1913). Collected papers of Henry Sweet. Oxford: Oxford University Press.