

Evaluating Low-Level Speech Features Against Human Perceptual Data

Caitlin Richter
Dept. of Linguistics
University of Pennsylvania
[email protected]

Naomi H. Feldman
Dept. of Linguistics and UMIACS
University of Maryland
[email protected]

Harini Salgado
Dept. of Computer Science
Pomona College
[email protected]

Aren Jansen
HLTCOE
Johns Hopkins University
[email protected]

Abstract

We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings. We evaluate two speaker normalization techniques using this method and find that in both cases, speech features that are normalized across speakers predict human data better than unnormalized speech features, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a new framework for evaluating low-level representations of speech on their match to human perception, and lays the groundwork for creating more ecologically valid models of speech perception.

Index Terms: speech perception, automatic speech recognition, speaker normalization, cognitive models

1 Introduction

Understanding the features that listeners extract from the speech signal is a critical part of understanding phonetic learning and perception. Different feature spaces imply different statistical distributions of speech sounds in listeners' input (Figure 1), and these statistical distributions of speech sounds influence speech perception in both adults and infants (Clayards et al., 2008; Maye et al., 2002).

The performance of automatic speech recognition (ASR) systems is also affected by the way in which these systems represent the speech signal. For example, changing the signal processing methods that are used to extract features from the speech waveform (Hermansky, 1990; Hermansky and Morgan, 1994) and applying speaker normalization techniques to these features (Wegmann et al., 1996; Povey and Saon, 2006) can improve the performance of a recognizer. Recently there has been considerable interest in representation learning, in which new features that are created through exposure to data from the language to be recognized, or through exposure to other languages, lead to better system performance (Grézl et al., 2014; Heigold et al., 2013; Kamper et al., 2015; Sercu et al., 2016; Thomas et al., 2012; Wang et al., 2015). It is potentially useful to know how closely the feature representations in ASR resemble those of human listeners, particularly for low-resource settings, where systems rely heavily on these features to guide generalization across speakers and dialects.

In this paper we introduce a method for measuring the correspondence between low-level speech feature representations and human speech perception. This allows us to compare different feature spaces in terms of how well the locations of sounds in the space can predict listeners' perception of acoustic similarity. Our method has potential relevance both in ASR, for understanding how well the feature representations in ASR systems capture the similarity structure that guides human perception, and in cognitive science, for evaluating hypotheses regarding the feature spaces that human listeners use.

We evaluate the ability of a feature space to capture listeners' same-different responses in AX discrimination tasks, in which listeners hear two sounds and decide whether they are acoustically identical. Rather than assuming that listeners' ability to discriminate


Transactions of the Association for Computational Linguistics, vol. 5, pp. 425–440, 2017. Action Editor: Eric Fosler-Lussier. Submission batch: 11/2016; Revision batch: 3/2017; Published 11/2017.

© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


[Figure 1: scatter plots of vowel tokens in two feature spaces. Left panel, "Unnormalized Vowels": F1 (400-1400 Hz) against F2 (500-3500 Hz). Right panel, "Normalized Vowels": normalized F1 (-2 to 3) against normalized F2 (-2 to 2.5). Legend: /æ/ (bat), /a/ (bot), /ɔ/ (bought), /ɛ/ (bet), /e/ (bait), /ɝ/ (Bert), /ɪ/ (bit), /i/ (beet), /o/ (boat), /ʊ/ (book), /ʌ/ (but), /u/ (boot).]

Figure 1: Acoustic characteristics of vowels produced in hVd contexts by men, women, and children from Hillenbrand et al. (1995), plotted as raw formant frequencies (left) and z-scored formant frequencies (right). These feature spaces yield different distributions of sounds in acoustic space. If listeners' perception is biased toward peaks in these distributions, the feature spaces make different predictions for listeners' behavior in perceptual discrimination tasks.

sounds is directly related to those sounds' distance in a feature space, we instead adopt a cognitive model of speech perception which predicts that listeners' perception of sounds is biased toward peaks in the distribution of sounds in their input. This leads listeners to perceive some sounds as closer together in the feature space than they actually are, and others as farther apart (Figure 2). The model has previously been shown to accurately predict listeners' same-different discrimination judgments when listening to pairs of sounds (Feldman et al., 2009; Kronrod et al., 2016).

Our innovation in this work is to estimate the distribution of sounds in the input from a corpus, using different feature spaces. Under the model, listeners are expected to bias their perception toward peaks in the distribution of sounds in the corpus. Those peaks occur in different locations for different feature spaces. Thus, different feature representations yield different predictions about listeners' discrimination. Features that yield a better match with listeners' discrimination are assumed to more closely reflect the way in which listeners generalize their previous experience when perceiving speech.

In addition to providing a way to evaluate feature representations, adapting a cognitive model to operate directly over speech recordings lays the groundwork for building more ecologically valid models of speech perception, by enabling cognitive scientists to make use of the same rich corpus data that is often used by researchers in ASR. These speech corpora will allow models to be trained and tested on data that more faithfully simulate the speech environment that listeners encounter, rather than on artificially simplified data.

As an initial case study, we use the perceptual model to evaluate features derived from two speaker normalization algorithms, which aim to reduce variability in the speech signal stemming from physical characteristics of the vocal tract. We compare these normalized features to a baseline set of unnormalized features. Speaker normalization has previously been found to improve phonetic categorization (Cole et al., 2010; Nearey, 1978), increase a phonetic categorization model's match with human behavior (Apfelbaum and McMurray, 2015; McMurray and Jongman, 2011), and improve the performance of speech recognizers (Povey and Saon, 2006; Wegmann et al., 1996). However, the degree to which specific normalization techniques from ASR match human perception is not yet known. Experiments in this paper replicate the general benefit of speaker normalization using our perceptual model, while also providing new data on the degree to which different normalization algorithms from ASR capture information that is similar to what humans use in perceptual tasks.

The paper is organized as follows. We begin by characterizing the speech features tested in this paper. The following section describes the method we use for evaluating these features against human discrimination data. Experiments are presented testing how well each representation predicts human data. We compare our method to previous work, and conclude


[Figure 2: two rows of stimuli, labeled "Actual Stimulus" (top) and "Perceived Stimulus" (bottom).]

Figure 2: Illustration from Feldman et al. (2009) showing listeners' perceptual bias toward peaks in their prior distribution over sounds in a feature space. This bias leads listeners to perceive some sounds as closer together, and other sounds as farther apart, than they actually are. Reprinted from "The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference" by N. H. Feldman, T. L. Griffiths, & J. L. Morgan, 2009, Psychological Review, 116, p. 757. Copyright 2009 by the American Psychological Association. Reprinted with permission.

by discussing implications and future directions.

2 Speech features

Speech recognition systems typically consist of a front-end, which transforms short time windows of the speech signal into numeric feature vectors, and a back-end, which uses the output of the front-end system together with phonetic models, language models, dictionaries, and so on to infer what was said. Our work analyzes the front-end feature vectors directly, examining the effects of speaker normalization algorithms on these feature vectors.

Speech contains commingled effects of linguistic, paralinguistic, and purely physical sources of variation. It has often been proposed that human listeners normalize the speech signal to generalize phonetic content across speakers. Listeners could normalize for vocal tract length, for example, either by attending to cues that are relatively constant across speakers (Miller, 1989; Monahan and Idsardi, 2010; Peterson, 1961) or by relying on relative, rather than absolute, encoding of cues that are affected by vocal tract length (Barreda, 2012; Cole et al., 2010). Alternatives to normalization include theories of adaptation in which listeners create representations that are specific to individual speakers or groups of speakers (Dahan et al., 2008; Kleinschmidt and Jaeger, 2015), and theories which argue that no normalization or adaptation mechanisms are necessary at all (Johnson, 1997; Johnson, 2006). The empirical literature on speaker normalization and adaptation includes results that support each of these approaches.

Our focus here is on replicating an advantage of normalized features found in cognitive models of categorization. McMurray and Jongman (2011) and Apfelbaum and McMurray (2015) conducted a direct comparison between normalized and unnormalized encodings of cues to fricative categorization. They found that using normalized features yields a better match with human categorization performance. We aim to replicate the advantage of normalized over unnormalized features using our discrimination model, in order to validate our new evaluation method against an existing cognitive modeling approach.

We test two methods for speaker normalization: vocal tract length normalization (VTLN), which is widely used in ASR systems (Wegmann et al., 1996); and z-scoring, which has been applied in linguistic analyses as well as in ASR systems (Lobanov, 1971; Viikki and Laurila, 1998). Neither VTLN nor z-scoring is a cognitive model of speaker normalization, but both have the potential to capture information similar to what listeners might obtain if they generalize across utterances from speakers with different vocal tract lengths.

We apply these normalization methods to vowels that are represented by mel frequency cepstral coefficients (MFCCs), which have been widely employed as an input representation in ASR systems (Davis and Mermelstein, 1980). MFCCs are a 12-dimensional vector that describes a timepoint of speech by capturing information about the spectral envelope, reflecting vocal tract shape. They are computed by taking the discrete time Fourier transform of the speech signal, mapping it onto a mel frequency scale with log power (to reflect how people process frequency and loudness), and applying the discrete cosine transform to obtain a cepstrum (inverse spectrum) whose real values are the MFCC representation. This representation of the signal is indirectly related to formant frequencies, which are peaks in the spectral envelope that are known to correlate with vowel identity and are often used to characterize vowels in studies of human speech perception.


However, MFCCs have the advantage that they are computed deterministically from the speech signal, and thus are not subject to the error inherent in automatic formant tracking.
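As a concrete illustration of this pipeline, the sketch below computes per-frame MFCCs with the open-source librosa library; the file name is a placeholder, and while the 25 ms window and 10 ms step match the settings reported in Section 4, this is not the front end used to produce the results in this paper.

import librosa

# Load a vowel recording (file name is hypothetical).
y, sr = librosa.load("hvd_token.wav", sr=16000)

# Mel filterbank on a power spectrum, log compression, then DCT,
# yielding one cepstral vector per 10 ms frame.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                       # c0 through c12
    n_fft=int(0.025 * sr),           # 25 ms analysis window
    hop_length=int(0.010 * sr),      # 10 ms step
)

features = mfcc[1:, :]               # drop c0 (overall energy), keep 12 coefficients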

This section gives an overview of both normalization methods and characterizes their effects; we defer details of their technical implementation to Section 4.

2.1 Vocal tract length normalization

The first normalization method we test, vocal tract length normalization (VTLN), is a technique developed for automatic speech recognition (Cohen et al., 1995; Wegmann et al., 1996). The aim of VTLN is to adjust the corpus so that it is as if all the speakers had identical vocal tract lengths. VTLN compensates for speaker differences in vocal tract length, which affects the resonant frequencies of the vocal tract, by applying a piecewise linear transformation to expand or compress the frequency axis of the productions of each speaker. The particular adjustment to apply for each speaker can be chosen by attempting to maximise the similarity of their adjusted speech to other speakers. We use /i/ tokens as the basis of this comparison across speakers. VTLN is widely successful in ASR systems, where it substantially decreases the word error rate (Giuliani et al., 2006).
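The warp itself can be pictured as a simple function of frequency. The sketch below shows one common piecewise linear form, in which frequencies are scaled by a per-speaker factor and the warp is bent near the band edge so that the Nyquist frequency maps to itself; the cutoff value and the exact shape used by Wegmann et al. (1996) may differ, so treat the numbers as illustrative.

import numpy as np

def vtln_warp(freqs_hz, alpha, f_nyquist=8000.0, f_cut=7000.0):
    """Piecewise linear VTLN warp: scale by alpha below f_cut, then
    interpolate linearly so that f_nyquist maps to itself."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    warped = alpha * freqs_hz
    hi = freqs_hz > f_cut
    slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    warped[hi] = alpha * f_cut + slope * (freqs_hz[hi] - f_cut)
    return warped

# A speaker with a short vocal tract (high resonances) might get alpha < 1,
# compressing their frequency axis toward a canonical speaker.
print(vtln_warp([500.0, 1500.0, 3000.0], alpha=0.9))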

2.2 Z-scoring

The second normalization algorithm, within-speaker z-scoring (Lobanov, 1971), has been suggested for descriptive sociolinguistic research (Adank et al., 2004), as it highlights learned linguistic content while removing speaker-body confounds, such as vocal tract length, from vowel formants. Z-scoring is also used in ASR (Viikki and Laurila, 1998), where it is often referred to as cepstral mean and variance normalization (CMVN). Z-scoring has been shown to reduce word error rate in speech recognizers, to improve generalization to noisy input, and to improve generalization across speakers (Haeb-Umbach, 1999; Molau et al., 2003; Tsakalidis and Byrne, 2005).
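A minimal sketch of per-speaker z-scoring over a feature matrix is given below: each speaker's tokens are shifted and scaled so that every feature dimension has zero mean and unit variance within that speaker, which is the sense in which the term is used here. The function and variable names are illustrative.

import numpy as np

def zscore_by_speaker(features, speaker_ids):
    """features: (n_tokens, n_dims) array; speaker_ids: length-n_tokens labels."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(features)
    for speaker in np.unique(speaker_ids):
        rows = speaker_ids == speaker
        mean = features[rows].mean(axis=0)
        std = features[rows].std(axis=0)
        normalized[rows] = (features[rows] - mean) / std
    return normalized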

2.3 Effects of normalization

The method that we use in Section 4 for testing hypothesized representations of speech against human perception relies on the idea that different feature representations imply different distributions of sounds in listeners' input. To examine these distributions of sounds, we computed MFCCs, MFCCs with VTLN, and z-scored MFCCs from vowel recordings in the Vowels subcorpus of the Nationwide Speech Project (NSP) (Clopper and Pisoni, 2006). This corpus contains ten different vowels pronounced in the context hVd (hid, had, etc.) by 5 female and 5 male speakers from each of 6 dialect regions of the United States. Each of these 60 speakers repeats each of the 10 hVd words 5 times, for a total of 3000 hVd tokens balanced across vowel, gender, and dialect. For our analyses, these vowel tokens are represented by the speech features of their midpoint.1 Because the temporal midpoint of a word may not be the midpoint of its vowel, we detected the vocalic portion of each NSP token as determined by relatively high total energy (more than one standard deviation above the mean) and low zero-crossing rate (below 30%). We used the temporal midpoint of this span.
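The sketch below restates this midpoint heuristic in code: frames whose energy is more than one standard deviation above the token mean and whose zero-crossing rate is below 30% are treated as vocalic, and the midpoint of that span is returned. The framing parameters and fallback behavior are illustrative rather than the exact implementation used here.

import numpy as np

def vowel_midpoint_frame(signal, sr, frame_len=0.025, hop=0.010):
    """Return the index of the frame at the midpoint of the vocalic span."""
    signal = np.asarray(signal, dtype=float)
    n, h = int(frame_len * sr), int(hop * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, h)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    # Zero-crossing rate: proportion of adjacent samples with a sign change.
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    vocalic = np.where((energy > energy.mean() + energy.std()) & (zcr < 0.30))[0]
    if len(vocalic) == 0:
        return len(frames) // 2   # fall back to the overall midpoint
    return (vocalic[0] + vocalic[-1]) // 2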

We characterize the effects of each normalization method on the distribution of vowels in the NSP corpus by comparing vowel distributions across genders and dialect regions (averaged over 15 pairwise comparisons of 6 dialect regions). Male and female speakers differ in their vocal tract lengths, and thus normalization algorithms would be expected to increase similarity between their vowel productions. Dialects also differ in their pronunciations of different vowels; although this variation is not related to vocal tract length, it may nevertheless be impacted by normalization algorithms that seek to neutralize speaker differences. To compare these distributions, we computed symmetrized Kullback-Leibler divergence, a measure of difference between two probability distributions, using a nearest-neighbor algorithm (Wang et al., 2006). Lower K-L divergence indicates greater similarity between two distributions.
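For reference, the sketch below gives one standard form of the nearest-neighbor K-L estimator of Wang et al. (2006), symmetrized by summing the two directed divergences; X and Y are samples from the two distributions being compared (e.g., MFCC vectors for male and female tokens), and the details may differ from the exact estimator variant used here.

import numpy as np
from scipy.spatial import cKDTree

def knn_kl(X, Y):
    """Nearest-neighbor estimate of KL(P || Q) from samples X ~ P, Y ~ Q."""
    n, d = X.shape
    m = Y.shape[0]
    rho = cKDTree(X).query(X, k=2)[0][:, 1]   # distance to nearest other point in X
    nu = cKDTree(Y).query(X, k=1)[0]          # distance to nearest point in Y
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

def symmetrized_kl(X, Y):
    return knn_kl(X, Y) + knn_kl(Y, X)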

K-L divergence between genders and between dialect pairs is highest in MFCCs with no speaker normalization, reflecting the effects these factors have on the original acoustic signal (Table 1). Both VTLN and z-scoring reduce K-L divergence between genders, as predicted, so that male and female speakers saying the same vowel appear more similar after either of these normalizations than they are in unnormalized MFCCs.

1 We believe that dynamic representations of sounds are critical in human perception. However, this simplification is unlikely to affect model performance for perceiving the sounds in our experiments (Section 4), which have constant formant values across the duration of each vowel token.


Table 1: Effect of normalization on symmetrized K-L divergence.

                 MFCC    VTLN    z-scored
Gender KLDiv     7.84    6.14    4.58
Dialect KLDiv    4.41    4.19    2.09

Z-scoring using all 10 NSP vowels also sharply reduces K-L divergence between dialect pairs. In contrast, VTLN that matches across speakers on the basis of /i/, which differs little across US English dialects (Clopper and Pisoni, 2006), has minimal effects on dialect K-L divergence: cross-dialect vowel pronunciation differences remain distinct after VTLN. Overall, K-L divergence is lowest when z-scoring by speaker. VTLN removes somewhat less of the gender and dialect variation.

In summary, MFCCs, MFCCs with VTLN, and z-scored MFCCs each correspond to different distributions of vowels in the input, with z-scoring being the most effective at increasing the overlap between the distributions of vowels spoken by different speakers. The next section describes a cognitive model that uses these input distributions to quantitatively predict listeners' vowel discrimination in the laboratory, allowing us to ask which of these feature representations best corresponds with human perception.

3 Model of speech perception

Our goal is to evaluate MFCCs, MFCCs with VTLN, and z-scored MFCCs on their match to human perception. We perform this evaluation by integrating each type of speech feature into a cognitive model of speech perception. The model we adopt has previously been shown to accurately predict listeners' perception of both vowels and consonants (Feldman et al., 2009; Kronrod et al., 2016), but has not yet been implemented directly on speech recordings.

3.1 Model overview

The model formalizes speech perception as an inference problem. Listeners perceive sounds by inferring the acoustic detail of a speaker's target production through a noisy speech signal. This differs from the problem of inferring a phoneme label, which is typically the goal of ASR systems, but is consistent with

a large body of empirical evidence showing that human listeners recover acoustic detail from the speech signal in addition to category information (Andruski et al., 1994; Blumstein et al., 2005; Joanisse et al., 2007; Pisoni and Tash, 1974; Toscano et al., 2010).

To correct for noise in the speech signal, listeners bias their perception toward acoustic values that have high probability under their prior distribution. This creates a dependency between the distribution of sounds in the input, which determines listeners' prior distribution, and listeners' perception of sounds.

Formally, speakers and listeners share a prior distribution over possible acoustic values that can be produced in a language, p(T). The acoustic values are assumed to lie in a continuous feature space, such as MFCCs. Prototypical sounds in the language have highest probability under this distribution, but the distribution is non-zero over a wide range of acoustic values, corresponding to all the ways in which speech sounds might be realized. Speakers sample a target production T from this distribution. The target production carries meaningful information aside from category identity, such as dialect information or coarticulatory information about upcoming sounds, making its acoustic value something that listeners wish to recover. The stimulus S heard by listeners is similar to the target production T, but is assumed to be corrupted by a noise process defined by a Gaussian likelihood function,

p(S|T) = N(T, Σ_S)    (1)

Both T and S are in R^d, where d is the dimensionality of the feature space. In our experiments, this feature space is defined by either MFCCs, MFCCs with VTLN, or z-scored MFCCs.

Listeners hear S and reconstruct T by drawing a sample from the posterior distribution,

p(T|S) ∝ p(S|T) p(T)    (2)

We refer to a listener's sample from the posterior distribution as a percept. The top row of Figure 2 corresponds to S, and the bottom row corresponds to listeners' reconstruction of T.

3.2 Implementation on speech corpora

Previous work using this model has inferred listeners' prior distribution by measuring their perceptual


categorization, and estimating the parameters of Gaussian phonetic categories that listeners appear to be using to make these categorization judgments. In the experiments below, we instead use vowel productions from the NSP corpus to estimate listeners' prior distributions. We avoid making parametric assumptions about the form of this prior distribution (such as expecting sounds from the corpus to fall into Gaussian categories) by using importance sampling, as proposed by Shi et al. (2010).

We assume that vowels in the NSP corpus constitute a set of samples {T^(i)} drawn from listeners' prior distribution p(T). We apply likelihood weighting, weighting each sound T^(i) by the probability that it generated the stimulus being perceived, p(S|T^(i)). We then sample a sound from the corpus according to its weight. A sound from the corpus that is sampled in this way behaves as though it were drawn from the posterior distribution, p(T|S).

This way of approximating the model through importance sampling allows us to sample a percept from the model's posterior distribution without knowing the analytical form of the prior distribution. We require only samples drawn from the prior, and can reweight these in order to obtain a sample from the posterior. In addition, this implementation of the model does not require sounds from the corpus to have phoneme labels, as the weights p(S|T^(i)) are defined by the model's Gaussian likelihood function, corresponding to speech signal noise.
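The sketch below shows this importance-sampling step under the assumption of a diagonal noise covariance: each corpus token is weighted by its Gaussian likelihood of having generated the stimulus, and a token is then drawn in proportion to those weights, behaving as a sample from the posterior. Names and array layouts are illustrative.

import numpy as np

def sample_percept(stimulus, prior_samples, noise_var, rng=np.random):
    """stimulus: (d,); prior_samples: (n_tokens, d); noise_var: (d,) diagonal of Sigma_S."""
    diff = prior_samples - stimulus
    log_w = -0.5 * np.sum(diff ** 2 / noise_var, axis=1)   # log p(S | T_i) up to a constant
    weights = np.exp(log_w - log_w.max())                  # stabilize before normalizing
    weights /= weights.sum()
    idx = rng.choice(len(prior_samples), p=weights)
    return prior_samples[idx]                              # behaves as a draw from p(T | S)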

3.3 Modeling discrimination data

The model can be used to predict listeners' discrimination behavior. In AX discrimination tasks, listeners hear two stimuli and decide whether they are acoustically identical. We assume that for each stimulus, listeners sample a percept from their posterior distribution, p(T|S). They then compute the distance between their percepts of the two stimuli and compare it to a threshold ε. If the percepts are separated by a distance less than ε, listeners respond same; otherwise they respond different. Given these assumptions, the proportion of same responses for two stimuli, S_1 and S_2, follows a binomial distribution whose parameter is the probability that the percepts for the two sounds are within a distance ε of each other,

p(same) = p(|T_1 − T_2| < ε | S_1, S_2)    (3)

We approximate the probability from (3) as

p(same) ≈ (1/n) Σ_{i=1}^{n} 1[ |T_1^(i) − T_2^(i)| < ε ]    (4)

where T_1^(i) and T_2^(i) are samples drawn through importance sampling from posterior distributions p(T_1|S_1) and p(T_2|S_2). We draw 100 pairs of target productions from the posterior for each experimental trial and approximate listeners' probability of responding same by estimating the proportion of these pairs that yielded a same response, using add-one smoothing.2 We use this estimate to compute model likelihoods – the probability of the human responses, given the model predictions – based on listeners' actual same-different responses.
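A direct reading of (4), together with the Mahalanobis scaling described in footnote 2 and one way of applying add-one smoothing, is sketched below; sample_percept is the importance-sampling step from the earlier sketch, and the smoothing formula is an assumption about how the counts were smoothed.

import numpy as np

def p_same(s1, s2, prior_samples, noise_var, epsilon, n_pairs=100, rng=np.random):
    """Estimate the probability of a 'same' response for stimuli s1 and s2."""
    hits = 0
    for _ in range(n_pairs):
        t1 = sample_percept(s1, prior_samples, noise_var, rng)  # defined in the earlier sketch
        t2 = sample_percept(s2, prior_samples, noise_var, rng)
        # Mahalanobis distance with the diagonal noise variance as covariance.
        dist = np.sqrt(np.sum((t1 - t2) ** 2 / noise_var))
        hits += dist < epsilon
    return (hits + 1) / (n_pairs + 2)   # add-one smoothing over the two outcomes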

The noise covariance Σ_S and the response threshold ε are free parameters of the model, which we fit to maximize model likelihoods.3 We implement the model several times with different speech representations: MFCCs, z-scored MFCCs, or MFCCs with VTLN. Comparing model likelihoods across the three speech representations allows us to ask which representations best predict listeners' discrimination.

4 Experiments

Experiments implemented the perceptual model with normalized and unnormalized representations, comparing model predictions to human discrimination data. To the extent that different representations of speech yield different distributions of sounds in a corpus, they should make different predictions about the biases that listeners will exhibit in a speech perception experiment. Representations that yield more accurate predictions can be assumed to correspond to a similarity space more similar to the dimensions that human listeners use in speech perception.

2 The distance computation in (4) uses the Mahalanobis distance between T_1^(i) and T_2^(i), where the covariance matrix is the noise variance Σ_S. If listeners' sensitivity is related to their perceptual noise, then Σ_S is the appropriate scaling factor for this distance. In support of this, Feldman et al. (2009) found in their one-dimensional model that increasing Σ_S by adding white noise increases listeners' decision threshold to the same degree.

3 Although our use of Mahalanobis distance allows a tradeoff between Σ_S and ε in the distance computation, Σ_S also affects the weighting of exemplars through the perceptual model's likelihood function. Thus, these parameters are identifiable.


4.1 Human perceptual data

We use vowel discrimination data from an AX discrimination task conducted by Feldman et al. (2009) in quiet listening conditions (Figure 3). Twenty participants heard stimuli consisting of 13 isolated vowels ranging from /i/ (as in 'beet') to /e/ (as in 'bait'), which were synthesized to simulate a male speaker. To control the placement of stimuli along the vowel continuum, first and second formant frequencies were equally spaced between /i/ and /e/ values along the mel frequency scale. Participants heard all ordered pairs of stimuli and judged whether each pair was acoustically identical, responding different if they could hear any difference between the stimuli. MFCCs, MFCCs with VTLN, and z-scored MFCCs computed from these 13 stimuli serve as input to the model, as the stimulus S. Model predictions are then compared with listeners' same-different responses to each pair of stimuli.
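Equal spacing on the mel scale can be obtained by converting the endpoint formants to mels, interpolating linearly, and converting back, as in the sketch below; the /i/ and /e/ endpoint values shown are placeholders rather than the synthesis values used by Feldman et al. (2009).

import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def formant_continuum(f_start_hz, f_end_hz, n_steps=13):
    """Formant values equally spaced in mels between two endpoints."""
    mels = np.linspace(hz_to_mel(f_start_hz), hz_to_mel(f_end_hz), n_steps)
    return mel_to_hz(mels)

f1_steps = formant_continuum(270.0, 530.0)    # illustrative /i/ -> /e/ F1 endpoints
f2_steps = formant_continuum(2290.0, 1840.0)  # illustrative /i/ -> /e/ F2 endpoints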

4.2 Speech representations

The samples {T^(i)} that serve as the model's prior distribution are vowels from the Vowels subcorpus of the NSP. Vowel midpoints were represented in the model either as MFCCs, MFCCs with VTLN, or z-scored MFCCs, computed using a 25 ms window and a 10 ms step size.4 Features were scaled to have zero mean and a variance of one across all vowels.5

VTLN was implemented using a procedure from Wegmann et al. (1996) adapted for an unsupervised setting, which compensates for vocal tract length variation by finding a frequency adjustment for each speaker that maximizes the similarity of their tokens to all the other speakers'. This VTLN implementation first applies and then selects among several frequency rescaling factors, ranging from -20% to +20% of the original vocal tract length in steps of 5%.

4 The zeroth cepstral dimension was excluded from the representation in experiments; it mainly describes the total amount of energy at each time in the speech signal, which is useful in detecting whether or not the speech is vocalic but contributes little to distinguishing different vowel sounds. Additionally, although ASR systems frequently append deltas and double deltas to the cepstral coefficients, corresponding to the first and second derivatives of the MFCCs, we did not use these as they add free parameters to the model.

5 This was different from the z-scoring method for speaker normalization, which scaled the vowels from each individual speaker to have zero mean and a variance of one.

Together with the original speech (scale factor of 0%) there are therefore 9 possible variants. The rescaling factor for each speaker is chosen using expectation maximization (Dempster et al., 1977). The E-step trains a Gaussian mixture model with 9 components and spherical covariance from all /i/ vowel frames in the NSP corpus. The M-step finds a rescaling factor for each speaker that maximizes the likelihood under the current Gaussian mixture model. After EM converges, the maximum likelihood rescaling factor for the stimuli is selected under the final NSP-based mixture model. Normalizing the stimuli is straightforward, due to the reliance of our VTLN procedure on only /i/ tokens. Vowel midpoints of corpus tokens and stimuli were excerpted after normalization.
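The alternation described above can be summarized as in the following sketch, which assumes a hypothetical helper warp_features(frames, alpha) that recomputes MFCCs with the frequency axis rescaled by alpha; the GMM fitting uses scikit-learn, and details such as the iteration count and convergence check are simplified.

import numpy as np
from sklearn.mixture import GaussianMixture

ALPHAS = np.arange(0.80, 1.21, 0.05)   # nine candidate rescaling factors

def select_warp_factors(i_frames_by_speaker, warp_features, n_iter=10):
    """i_frames_by_speaker: dict mapping speaker -> array of /i/ frames."""
    factors = {spk: 1.0 for spk in i_frames_by_speaker}
    for _ in range(n_iter):
        # Fit a 9-component spherical GMM to all currently warped /i/ frames.
        pooled = np.vstack([warp_features(frames, factors[spk])
                            for spk, frames in i_frames_by_speaker.items()])
        gmm = GaussianMixture(n_components=9, covariance_type="spherical").fit(pooled)
        # Each speaker keeps the factor that maximizes their likelihood under the GMM.
        for spk, frames in i_frames_by_speaker.items():
            scores = [gmm.score(warp_features(frames, a)) for a in ALPHAS]
            factors[spk] = float(ALPHAS[int(np.argmax(scores))])
    return factors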

Z-scoring was applied to MFCCs at vowel midpoints. The NSP is an ideally balanced case for the z-scoring normalization, because each speaker says the same vowels the same number of times. MFCCs for the stimulus 'speaker' were only available for the /i/-/e/ vowel continuum, so it was not possible to apply z-scoring directly to the stimuli in the same way as it was applied to each NSP speaker. However, the stimuli were synthesized according to average formant values for male speakers (Peterson and Barney, 1952), and we verified that normalizing MFCCs for these stimuli according to the average z-scoring factors of the 30 male NSP speakers projected the stimuli into an appropriate region of the acoustic space.

Finally, although speech recognition systems typically use 12 MFCC dimensions, we additionally include experiments for all feature types that omit subsets of the higher dimensions, as formant information that is useful for perceiving vowels is primarily reflected in the lower dimensions.

4.3 Fitting parameters

The NSP corpus was divided into equally sized sets to fit parameters and test model likelihoods in 2-fold cross-validation. Two methods were used for dividing the corpus. In one case, the two halves contained different tokens from the same speakers. In the other case, the two halves contained tokens from different speakers (balanced for gender and dialect region). Each division of the corpus was created once, and used across experiments with all speech feature types.

The fitting procedure consisted of selecting the


response threshold ε and the noise variance Σ_S, constrained to be diagonal, to best fit the perceptual data. This was done using Metropolis-Hastings sampling (Hastings, 1970) with parallel tempering, which flexibly interleaves broad exploration of the search space with fine-tuning of the best parameter values. The sampler used Gaussian proposal distributions for each parameter, with proposals Σ_S* and ε* drawn based on current estimates Σ_S and ε as Σ_S* ∼ N(Σ_S, σ_1^2 I) and ε* ∼ N(ε, σ_2^2). Given this symmetric proposal distribution, the acceptance ratio is

A = p(k|Σ_S*, ε*) / p(k|Σ_S, ε)    (5)

where k is a vector of counts corresponding to same-different responses from human participants in the experiment. The likelihood p(k|Σ_S, ε) is a product of the binomial likelihood functions for each speech sound contrast, whose parameters are approximated according to (4). We run several chains, indexed 1..C, at temperatures τ_1..τ_C, by raising each likelihood p(k|Σ_S, ε) to a power of 1/τ. At each iteration there is a proposal to swap the parameters between neighboring chains. The acceptance ratio for swapping parameters between chains c1 and c2 is

A = [ p(k|Σ_S^(c2), ε^(c2))^(1/τ_c1) · p(k|Σ_S^(c1), ε^(c1))^(1/τ_c2) ] / [ p(k|Σ_S^(c1), ε^(c1))^(1/τ_c1) · p(k|Σ_S^(c2), ε^(c2))^(1/τ_c2) ]    (6)

where τ_c1 and τ_c2 are the temperatures associated with chains c1 and c2, respectively. Proposals are accepted with probability min(1, A). Model parameters were fit using this procedure, with 50,000 iterations through eleven chains at different temperatures ranging from 0.01 to 1,000. Our analyses use the last sample in the lowest temperature chain, with a temperature of 0.01.
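For concreteness, a sketch of a single between-chain swap proposal under the acceptance ratio in (6) is given below, working in log space; loglik holds each chain's untempered data log-likelihood at its current parameters, and the within-chain Metropolis-Hastings updates for Σ_S and ε are omitted. The data structures are illustrative.

import numpy as np

def propose_swap(params, loglik, temps, c1, c2, rng=np.random):
    """Propose swapping the parameter states of neighboring chains c1 and c2."""
    # log A = (1/temps[c1] - 1/temps[c2]) * (loglik[c2] - loglik[c1]), following (6).
    log_accept = (1.0 / temps[c1] - 1.0 / temps[c2]) * (loglik[c2] - loglik[c1])
    if np.log(rng.uniform()) < min(0.0, log_accept):
        params[c1], params[c2] = params[c2], params[c1]
        loglik[c1], loglik[c2] = loglik[c2], loglik[c1]
    return params, loglik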

In each run of the model, one half of the corpus was used to fit model parameters, while the other half was used to compute model likelihoods at test. The roles of the two sets were then reversed, resulting in 2-fold cross-validation. Each half of the corpus served as a test set for 10 runs of the model, so points and error bars in Figures 4 and 5 represent means and standard error calculated across all 20 replications; the relatively small error bars indicate that results were consistent across replications.6

4.4 Results

Stimuli in the behavioral discrimination task were synthesized to mimic a single speaker, and thus VTLN and z-scoring are not expected to directly impact relative distances between stimuli S in the feature space. Instead, differences in model behavior across the three feature types should arise primarily because VTLN and z-scoring impact the distribution of vowels from the NSP corpus. This affects the model's prior distribution and thus its perceptual biases, leading to different reconstructions of T.

The predicted discrimination patterns that result from different speech representations are illustrated qualitatively in Figure 3, together with human data and a model benchmark which is described at the end of this section. Patches of darker shading in the confusion matrices of Figure 3 indicate higher proportions of same responses, corresponding to reduced discrimination between stimuli in these regions. The models vary in how well they reflect humans' perceptual biases. One-dimensional models are extremely poor regardless of feature type. Unnormalized MFCC models, as shown most clearly in 4 and 6 dimensions, often differ from human listeners by excessively reducing discrimination around the initial (/i/) end of the stimulus continuum. Z-scoring models, particularly in the low to mid range of dimensionality, reflect human judgments better than MFCC models but still tend to deviate from human performance in visible areas of reduced discrimination centered around stimuli 6 and 9. Models using VTLN representations of moderately low dimensionality show the most faithful fit with human patterns of perceptual bias.

These differences in model performance across feature types can be quantified by computing the log likelihood of a model, given the human data. This measure treats the human same and different responses as ground truth, and assesses model performance on the binary decision task against that ground truth; higher log likelihoods indicate a closer match of a model to human perceptual data. Log likelihoods shown in Figure 4 quantify the performance of models across feature types; this complements the detailed views of model behavior in Figure 3.

6 Numerical values fit for model parameters were also consistent across the 20 replications for each speech feature type.


Figure 3: Confusion matrices for how often each pair of stimuli is judged to be the same (black) vs. different (white) by humans, by a benchmark exemplar model, and by representative models testing the different speech features on familiar speakers. Axis labels of 1-13 denote stimulus numbers from an AX trial.

Recall that our primary goal is to replicate previous findings with categorization models showing that normalized representations are more consistent with listeners' perception than unnormalized representations. We do replicate this result with our discrimination model: in almost all cases, unnormalized MFCCs have the lowest likelihoods among the three representations tested (Figure 4).
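The fit measure can be written as a sum of binomial log-probabilities of listeners' same counts under the model's predicted p(same) for each stimulus pair, as in the sketch below; the function and variable names are illustrative rather than taken from the authors' code.

import numpy as np
from scipy.stats import binom

def model_log_likelihood(same_counts, trial_counts, p_same_predicted):
    """Log-likelihood of human same/different counts under the model's predictions."""
    same = np.asarray(same_counts)
    trials = np.asarray(trial_counts)
    p = np.asarray(p_same_predicted)
    return float(np.sum(binom.logpmf(same, trials, p)))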

We also find that MFCCs normalized by VTLN outperform z-scored MFCCs. This occurs in spite of the fact that z-scoring within speakers puts male and female speech into the most directly comparable space and in general neutralizes the most differences between speakers as measured by K-L divergence (Table 1). This indicates that more effective speaker normalization algorithms (in the sense of minimizing differences between speakers in a feature space) do not always translate into better matches with human perception. Furthermore, it provides evidence that the better performance of the VTLN models is not merely an artifact of high acoustic similarity among vowels in the corpus.

As suggested by Figure 3, models using speech representations of one dimension are always extremely poor, with log likelihoods of −1300 to −1600; the first cepstral coefficient on its own does not provide enough information for the perceptual task. In some experiments, such as for z-scored features, test likelihood decreases with higher numbers of dimensions, likely due to overfitting of parameters to particular sounds from the corpus.7 The lower MFCC dimensions, particularly the second dimension, appear to contain information relevant to listeners' discrimination of an /i/-/e/ continuum.

As a representative example of the best fitting parameters, one trained model using four dimensions of VTLN features found parameters8 of Σ_S = 0.1617, 0.0092, 2.2400, 1.8194 and ε = 3.4388.9 Dimensions where noise is low indicate those that the model found informative for this perceptual task: low noise causes even numerically small differences between tokens to be attended to as differences in perceived vowel quality, whereas high noise variance generates a strong perceptual bias toward the center of vowel

7 The corresponding training likelihoods for z-scored dimensions remained stable or even increased at higher numbers of dimensions; the training procedure did not fail to converge. Similarly, the low likelihood observed at six dimensions for MFCCs was due to several runs that achieved high likelihood on the training set and low likelihood on the test set.

8 The covariance matrix Σ_S was constrained to be diagonal; here we list only the diagonal elements.

9 Across the 10 training replications, the standard deviation in these trained parameters was respectively 0.0823, 0.0006, 0.8684, 1.1518 for Σ_S and 0.1645 for ε.


[Figure 4: two panels, "Tested on familiar speakers" (left) and "Tested on new speakers" (right), plotting log likelihood (−800 to −200) against feature dimensionality (2-12) for MFCC, VTLN, z-scored, and Benchmark.]

Figure 4: Model fit to human data, when generalizing to familiar speakers (left) and new speakers (right).

space that collapses the space between vowel percepts along that dimension. Dimensions with high noise parameter values therefore indicate that the model benefited from perceiving all samples from the prior distribution as being similar along that dimension, with variability along those dimensions treated as noise relative to listeners' perceptual judgments about this vowel contrast. Across all feature types, successful models had low noise in dimension 2 and low variance in this noise parameter across training replications. This consistency, together with the large difference in the model's ability to match human performance between feature dimensionalities of 1 and 2, indicates that the second cepstral coefficient is important for capturing listeners' discrimination of sounds along the /i/-/e/ vowel continuum.

Our next analyses examine the speech tokens sampled as percepts by the model when perceiving the experimental stimuli. The model did not use the NSP's speaker and phoneme labels for selecting these percepts, but we take advantage of these labels for our analyses. Across all models, the 100 percepts drawn from the posterior distribution for each stimulus contained on average 30 different tokens, indicating that a number of speech productions (from different speakers) are treated as linguistically similar to each other. As a measure of model quality and interpretability, we examine two aspects of the tokens sampled by the model during each perceptual judgment (Figure 5). The percentage of samples that belong to the classes of vowels along the /i/-/e/ continuum (NSP 'heed', 'hayed', and 'hid' tokens; henceforth referred to as high front vowels) gives information on model quality, because all the stimuli are perceived by US English speakers as falling along this continuum. The proportion of times a model samples female tokens to recover the linguistic target for this experiment's male-speaker stimuli gives an indication of the model's ability to generalize linguistic content across genders.

Models using unnormalized MFCCs tend to make the least use of female tokens, indicating that this representation does not recognize very much similarity between male and female speakers saying the same vowel. Models with z-scored features are closest to sampling 50% female tokens; this confirms that the z-scored representation is highly effective at neutralizing differences between speakers of different genders (Table 1), although as noted above, it is not the representation that gives the best match to human discrimination performance (Figure 4).

While models with 2 through 6 dimensions consistently treated the experimental stimuli as being most similar to high front vowels in the corpus, models with higher orders of cepstral coefficients did not (Figure 5), reinforcing the importance of the lower MFCC dimensions in capturing listeners' perception of these stimuli. We suspect that failure to perceive stimuli as high front vowels in models with higher numbers of dimensions emerged due to the artificial synthesis of the experimental stimuli, which resulted in these stimuli being most similar to low back vowels from the corpus in two of the higher MFCC dimensions. The model can perceive stimuli as low


[Figure 5: two panels, "Vowel quality" (% high front exemplars, 20-100) and "Gender distribution" (% female exemplars, 0-60), each plotted against feature dimensionality (2-12) for MFCC, VTLN, and z-scored features.]

Figure 5: Secondary evaluations, showing how often the stimuli were correctly perceived as high front vowels (left) and how often the model based perceptions on female utterances (right). Data are averaged across the familiar and new speaker test cases, which were very similar on these measures.

back vowels in cases where it generalizes along those dimensions. This underscores the difficulty of bringing together ecologically valid speech corpora with the type of controlled stimuli typically used in experimental settings, and illuminates areas in which future research may provide insight by addressing these challenges. Nevertheless, even if we omit results from models that were affected by this mismatch between corpus tokens and experimental stimuli, the relative ordering of the three feature types was qualitatively consistent across a range of dimensionalities.

Finally, we can compare the log likelihoods from Figure 4 to a benchmark showing ideal model performance. Feldman et al. (2009) estimated listeners' prior distributions for the /i/ and /e/ categories from perceptual categorization data, rather than from production data in speech recordings. This method of estimating listeners' prior distributions sidesteps the problem of mapping between perception and production in that it estimates prior distributions directly from perceptual data. Using their estimate, the model yields an average log likelihood of -233, somewhat higher than those obtained here. Our models should approach this value as the distribution of sounds in the corpus approaches the prior distribution that listeners use in perceptual tasks.

5 Relation to previous work

To our knowledge, this is the first time a cognitive model has been used to evaluate ASR representations

against human behavioral data. Front-end feature representations in ASR are typically evaluated on their ability to improve performance in an ASR system (Junqua et al., 1993; Welling et al., 1997). Measuring performance in an ASR system provides a direct evaluation relative to a particular task, but provides only indirect evidence about the correspondence of feature representations with human perception. Our work measures the extent to which these feature representations match human judgments, parallel to analyses that have been conducted in other domains of language to compare representations used in language technology to human data (Chang et al., 2009; Gershman and Tenenbaum, 2015). As a feature evaluation metric, our approach also has the advantage that it examines front-end feature representations directly, and is not subject to arbitrary decisions about back-end systems in an ASR pipeline.

There have also been proposals that use direct measures of feature variability (Haeb-Umbach, 1999) or minimal pair ABX discrimination tasks (Carlin et al., 2011; Schatz et al., 2013; Schatz et al., 2014) to evaluate feature representations, and thus do not depend on particular back-end ASR systems. These approaches differ from ours in that they evaluate whether discrimination performance is correct, rather than whether discrimination performance is similar to that of human listeners. In addition, these approaches assume a direct mapping between distance in the feature space and perceived similarity between


two stimuli, whereas ours takes into account human listeners' perceptual bias toward sounds that occur frequently within the feature space.

Research in cognitive science has previously examined the dimensions that guide listeners' perception, but has relied on data from categorization tasks, in which listeners decide which category a sound belongs to (Apfelbaum and McMurray, 2015; Idemaru et al., 2012; McMurray and Jongman, 2011; Nittrouer and Miller, 1997; Repp, 1982). Our approach instead focuses on acoustic discrimination tasks. Discrimination provides several advantages over categorization. It is a more fine-grained measure than categorization, and can be reliably measured in both adults and infants, even when listeners do not have well-formed categories for a given set of sounds. In addition, whereas building a categorization model requires labeled training data, building a discrimination model does not, as explained in Section 3.

6 Discussion

In this paper a cognitive model of speech perception was implemented directly on speech recordings and used to evaluate the low-level feature representations corresponding to two speaker normalization methods: VTLN and z-scoring. Both normalization methods improved the model's fit to human perceptual data. Between the two normalization methods, VTLN outperformed z-scoring, despite being less effective at collapsing gender and dialect differences.

The ability of VTLN and z-scoring to improve the match of a feature space with human perception may be a relevant consideration in supporting the use of these normalization methods in ASR systems. Furthermore, although we do not mean to imply that VTLN or z-scoring are cognitive models of speaker normalization, the better performance of VTLN over z-scoring does suggest that VTLN facilitates generalization across speakers in a way that is more similar to human perceivers. For example, human listeners may collapse across genders while retaining differences across dialects. Knowing how to model the way in which listeners generalize across speakers could have practical implications for assisting listeners with hearing impairments (Liu et al., 2008), and our work complements previous experimental and computational evidence related to this question (Barreda, 2012; Halberstam and Raphael, 2004; McMurray and Jongman, 2011; Nearey, 1989).

Speaker normalization and speaker adaptation are often treated as competing theories of human speech perception (Dahan et al., 2008), whereas speaker normalization algorithms are applied in conjunction with speaker adaptation algorithms in the ASR literature (Gales and Young, 2007). Our experiments focused only on the effects of normalization, without considering whether there might be a role of adaptation in explaining listeners' discrimination behavior. It is not obvious how we would apply speaker adaptation in our discrimination model, which does not incorporate explicit category representations.

Despite the advantage observed for normalized over unnormalized features, none of the representations tested here match human perception exactly. A prior distribution estimated from perceptual categorization still outperforms prior distributions measured from a speech corpus. Our primary goal has been to introduce a new method for assessing speech feature representations against human data, and to validate that method against previous findings that showed a benefit for normalization.

Our method is compatible with a wide range of feature spaces. The importance sampling approximation used here can be implemented on any speech representation for which a likelihood function p(S|T) can be computed between experimental stimuli and sounds from a corpus, and for which a distance metric between sounds in the corpus can be compared to a threshold ε. The likelihood function does not need to be Gaussian, and the feature space does not need to have fixed or even finite dimensionality. Even when assuming a Gaussian likelihood function, not all feature spaces have uncorrelated dimensions, and the model may need to be extended to use a full covariance matrix for Σ_S. This extension is straightforward to derive mathematically, but would require additional human perceptual data to constrain the model's free parameters.

The speech corpus used in these experiments consisted of /hVd/ utterances that were balanced across vowel, gender, and dialect. Conducting initial experiments with a corpus of vowels in neutral contexts allowed us to investigate algorithms for speaker normalization while sidestepping issues of how listeners generalize across phonological contexts. However,


differences between this controlled corpus and listeners' experience may account in part for the lower performance of the corpus-based models compared to the model whose prior distribution was estimated from perceptual categorization data. Future work that estimates listeners' prior distributions from speech recordings can do so using more realistic corpora of conversational speech. The model's prior distribution can in principle be estimated from any speech corpus, and does not require phoneme labels.

Using larger corpora of conversational speech would be particularly beneficial because it could provide sufficient data to derive and evaluate features from representation learning algorithms, which typically involve training models with many parameters. Recent work in ASR has proposed ways of optimizing feature representations based on linguistic data (Grézl et al., 2014; Heigold et al., 2013; Kamper et al., 2015; Sercu et al., 2016; Thomas et al., 2012; Wang et al., 2015). These proposals differ from each other in their objective functions, and our method can help determine which of these objective functions yield representations that capture human listeners' generalizations. Testing the outcome of these learning algorithms against human data in high resource languages could help select representation learning algorithms for low resource settings. Such investigations would not necessarily be limited to speaker normalization algorithms, but could apply to any representation learning algorithm designed to optimize perception for a particular language.

Perhaps the most exciting future application of this work is in investigating the way in which human listeners' feature spaces become tuned to their native language. Research in cognitive science has begun asking how language-specific feature spaces might be learned (Holt and Lotto, 2006; Idemaru and Holt, 2011; Idemaru and Holt, 2014; Liu and Holt, 2015; Nittrouer, 1992). The method proposed here provides a way of testing the predictions of these learning theories against human data, and of testing whether representation learning algorithms from the ASR literature have analogues in human perceptual development. Unlabeled speech corpora are available in multiple languages, and cross-linguistic perceptual differences are typically measured through discrimination tasks; our discrimination model can take advantage of these data. Thus, using the method proposed here,

predictions of human representation learning theories can be evaluated with respect to adults' cross-linguistic differences in sensitivity to perceptual dimensions. If the cognitive model were extended to infant discrimination tasks, a similar strategy could be used to evaluate theories of human representation learning with respect to children's discrimination abilities throughout the learning process. In this way, implementing a cognitive model directly on corpora of natural speech can lead to a richer understanding of the way in which listeners' perception is shaped by their environment.

Acknowledgments

We thank Josh Falk for help in piloting the model, Phani Nidadavolu for help computing speech features, and Eric Fosler-Lussier, Hynek Hermansky, Bill Idsardi, Feipeng Li, Vijay Peddinti, Amy Weinberg, the UMD probabilistic modeling reading group, and three anonymous reviewers for helpful comments and discussion. Previous versions of this work were presented at the 8th Northeast Computational Phonology Workshop and the 38th Annual Conference of the Cognitive Science Society. The work was supported by NSF grants BCS-1320410 and DGE-1321851.

References

Patti Adank, Roel Smits, and Roeland van Hout. 2004. A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America, 116(5):3099–3107.

Jean E. Andruski, Sheila E. Blumstein, and Martha Burton. 1994. The effect of subphonemic differences on lexical access. Cognition, 52:163–187.

Keith S. Apfelbaum and Bob McMurray. 2015. Relative cue encoding in the context of sophisticated models of categorization: Separating information from categorization. Psychonomic Bulletin and Review, 22(4):916–943.

Santiago Barreda. 2012. Vowel normalization and the perception of speaker changes: An exploration of the contextual tuning hypothesis. Journal of the Acoustical Society of America, 132(5):3453–3464.

Sheila E. Blumstein, Emily B. Myers, and Jesse Rissman. 2005. The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience, 17(9):1353–1366.

Michael A. Carlin, Samuel Thomas, Aren Jansen, and Hynek Hermansky. 2011. Rapid evaluation of speech representations for spoken term discovery. Proceedings of Interspeech.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems 22.

Meghan Clayards, Michael K. Tanenhaus, Richard N. Aslin, and Robert A. Jacobs. 2008. Perception of speech reflects optimal use of probabilistic speech cues. Cognition, 108(3):804–809.

Cynthia G. Clopper and David B. Pisoni. 2006. The nationwide speech project: A new corpus of American English dialects. Speech Communication, 48(6):633–644.

Jordan Cohen, Terri Kamm, and Andreas G. Andreou. 1995. Vocal tract normalization in speech recognition: Compensating for systematic speaker variability. Journal of the Acoustical Society of America, 97(5):3246–3247.

Jennifer Cole, Gary Linebaugh, Cheyenne M. Munson, and Bob McMurray. 2010. Unmasking the acoustic effects of vowel-to-vowel coarticulation: A statistical modeling approach. Journal of Phonetics, 38:167–184.

Delphine Dahan, Sarah J. Drucker, and Rebecca A. Scarborough. 2008. Talker adaptation in speech perception: Adjusting the signal or the representations? Cognition, 108:710–718.

Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Proceedings of the IEEE, pages 357–366.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–38.

Naomi H. Feldman, Thomas L. Griffiths, and James L. Morgan. 2009. The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological Review, 116(4):752–782.

Mark Gales and Steve Young. 2007. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304.

Samuel J. Gershman and Joshua B. Tenenbaum. 2015. Phrase similarity in humans and machines. Proceedings of the 37th Annual Conference of the Cognitive Science Society.

Diego Giuliani, Matteo Gerosa, and Fabio Brugnara. 2006. Improved automatic speech recognition through speaker normalization. Computer Speech & Language, 20(1):107–123.

Frantisek Grezl, Ekaterina Egorova, and Martin Karafiat. 2014. Further investigation into multilingual training and adaptation of stacked bottle-neck neural network structure. IEEE Spoken Language Technology Workshop, pages 48–53.

Reinhold Haeb-Umbach. 1999. Investigations on inter-speaker variability in the feature space. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 397–400.

Benjamin Halberstam and Lawrence J. Raphael. 2004. Vowel normalization: The role of fundamental frequency and upper formants. Journal of Phonetics, 32:423–434.

W. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Georg Heigold, Vincent Vanhoucke, Andrew Senior, Phuongrang Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jesse Dean. 2013. Multilingual acoustic models using distributed deep neural networks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8619–8623.

Hynek Hermansky and Nelson Morgan. 1994. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4):578–589.

Hynek Hermansky. 1990. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752.

James Hillenbrand, Laura A. Getty, Michael J. Clark, and Kimberlee Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5):3099–3111.

Lori L. Holt and Andrew J. Lotto. 2006. Cue weighting in auditory categorization: Implications for first and second language acquisition. Journal of the Acoustical Society of America, 119(5):3059–3071.

Kaori Idemaru and Lori L. Holt. 2011. Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37(6):1939–1956.

Kaori Idemaru and Lori L. Holt. 2014. Specificity of dimension-based statistical learning in word recognition. Journal of Experimental Psychology: Human Perception and Performance, 40(3):1009–1021.

Kaori Idemaru, Lori L. Holt, and Howard Seltman. 2012. Individual differences in cue weights are stable across time: The case of Japanese stop lengths. Journal of the Acoustical Society of America, 132(6):3950–3964.

Marc F. Joanisse, Erin K. Robertson, and Randy Lynn Newman. 2007. Mismatch negativity reflects sensory and phonetic speech processing. NeuroReport, 18(9):901–905.

Keith Johnson. 1997. Speech perception without speaker normalization: An exemplar model, pages 145–165. Academic Press, New York.

Keith Johnson. 2006. Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4):485–499.

Jean-Claude Junqua, Hisashi Wakita, and Hynek Hermansky. 1993. Evaluation and optimization of perceptually-based ASR front-end. IEEE Transactions on Speech and Audio Processing, 1(1):39–48.

Herman Kamper, Micha Elsner, Aren Jansen, and Sharon Goldwater. 2015. Unsupervised neural network based feature extraction using weak top-down constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

Dave F. Kleinschmidt and T. Florian Jaeger. 2015. Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2):148–203.

Yakov Kronrod, Emily Coppess, and Naomi H. Feldman. 2016. A unified account of categorical effects in phonetic perception. Psychonomic Bulletin and Review, 23(6):1681–1712.

Ran Liu and Lori L. Holt. 2015. Dimension-based statistical learning of vowels. Journal of Experimental Psychology: Human Perception and Performance, 41(6):1783–1798.

Chuping Liu, John Galvin III, Qian-Jie Fu, and Shrikanth S. Narayanan. 2008. Effect of spectral normalization on different talker speech recognition by cochlear implant users. Journal of the Acoustical Society of America, 123(5):2836–2847.

B. M. Lobanov. 1971. Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49(2):606–608.

Jessica Maye, Janet F. Werker, and LouAnn Gerken. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82:B101–B111.

Bob McMurray and Allard Jongman. 2011. What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118(2):219–246.

James D. Miller. 1989. Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85(5):2114–2134.

Sirko Molau, Florian Hilger, and Hermann Ney. 2003. Feature space normalization in adverse acoustic conditions. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 656–659.

Philip J. Monahan and William J. Idsardi. 2010. Auditory sensitivity to formant ratios: Toward an account of vowel normalisation. Language and Cognitive Processes, 25(6):808–839.

Terrance Michael Nearey. 1978. Phonetic feature systems for vowels, volume 77. Indiana University Linguistics Club.

Terrance M. Nearey. 1989. Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85(5):2088–2113.

Susan Nittrouer and Marnie E. Miller. 1997. Predicting developmental shifts in perceptual weighting schemes. Journal of the Acoustical Society of America, 101(4):2253–2266.

Susan Nittrouer. 1992. Age-related differences in perceptual effects of formant transitions within syllables and across syllable boundaries. Journal of Phonetics, 20:351–382.

Gordon E. Peterson and Harold L. Barney. 1952. Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24(2):175–184.

Gordon E. Peterson. 1961. Parameters of vowel quality. Journal of Speech and Hearing Research, 4(1):10–29.

David B. Pisoni and Jeffrey Tash. 1974. Reaction times to comparisons within and across phonetic categories. Perception and Psychophysics, 15(2):285–290.

Daniel Povey and George Saon. 2006. Feature and model space speaker adaptation with full covariance Gaussians. Proceedings of Interspeech.

Bruno H. Repp. 1982. Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92(1):81–110.

Thomas Schatz, Vijayaditya Peddinti, Francis Bach, Aren Jansen, Hynek Hermansky, and Emmanuel Dupoux. 2013. Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. Proceedings of Interspeech, pages 1781–1785.

Thomas Schatz, Vijayaditya Peddinti, Xuan-Nga Cao, Francis Bach, Hynek Hermansky, and Emmanuel Dupoux. 2014. Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. Proceedings of Interspeech.

Tom Sercu, Christia Puhrsch, Brian Kingsbury, and Yann LeCun. 2016. Very deep multilingual convolutional neural networks for LVCSR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

Lei Shi, Thomas L. Griffiths, Naomi H. Feldman, and Adam N. Sanborn. 2010. Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin and Review, 17(4):443–464.

Samuel Thomas, Sriram Ganapathy, and Hynek Hermansky. 2012. Multilingual MLP features for low-resource LVCSR systems. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4269–4272.

Joseph C. Toscano, Bob McMurray, Joel Dennhardt, and Steven J. Luck. 2010. Continuous perception and graded categorization: Electrophysiological evidence for a linear relationship between the acoustic signal and perceptual encoding of speech. Psychological Science, 21(10):1532–1540.

Stavros Tsakalidis and William Byrne. 2005. Acoustic training from heterogeneous data sources: Experiments in Mandarin conversational telephone speech transcription. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 461–464.

Olli Viikki and Kari Laurila. 1998. Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication, 25(1):133–147.

Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdu. 2006. A nearest-neighbor approach to estimating divergence between continuous random vectors. IEEE International Symposium on Information Theory, pages 242–246.

Weiran Wang, Raman Arora, Karen Livescu, and Jeff A. Bilmes. 2015. Unsupervised learning of acoustic features via deep canonical correlation analysis. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.

Steven Wegmann, Don McAllaster, Jeremy Orloff, and Barbara Peskin. 1996. Speaker normalization on conversational telephone speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 339–341.

L. Welling, N. Haberland, and H. Ney. 1997. Acoustic front-end optimization for large vocabulary speech recognition. Proceedings of Eurospeech, pages 2099–2102.
