SUBMITTED TO IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

Transcribing Mandarin Broadcast Speech Using

Multi-Layer Perceptron Acoustic Features

Fabio Valente, Mathew Magimai Doss, Christian Plahl, Suman Ravuri and Wen Wang, Member, IEEE

Abstract

Recently, several MLP-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementary properties to conventional spectral features. Although widely used in multiple Mandarin systems, no systematic comparison of all the different approaches, or of their scalability, has been proposed. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which the MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1,600 hours) and the parametric system complexity (maximum likelihood versus discriminative training, speaker adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce improvements in the range of 15%-23% relative at the different steps of a multipass system when compared to MFCC and PLP features, suggesting that the improvements scale with the amount of data and with the complexity of the system. The integration of those features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.

Index Terms

TANDEM features, Multi-Layer Perceptron, Multi-stream, GALE project, Automatic Speech Recognition (ASR), Broadcast data.

I. INTRODUCTION

Recently, a growing number of Large Vocabulary Continuous Speech Recognition (LVCSR) systems make use of Multi-Layer Perceptron (MLP) features. MLP features were originally introduced by Hermansky and his colleagues in [1], where the output of an MLP classifier is used as the acoustic front-end for conventional speech recognition systems based on Hidden Markov Models/Gaussian Mixture Models (HMM/GMM).

A large number of studies have proposed different types of MLP-based front-ends (see [2],[3],[4],[5]) and investigated their use for transcribing English (see [6],[7]). The most common application is in concatenation with MFCC or PLP features, where MLP features show considerable complementary properties. In recent years, in the framework of the GALE(1) program, MLP features have been extensively used in ASR systems for the Mandarin and Arabic languages (see [5],[8],[9],[10],[11]).

Revised version - Fabio Valente ([email protected]) and Mathew Magimai Doss ([email protected]) are with the Idiap Research Institute, Martigny, Switzerland; Christian Plahl ([email protected]) is with the Computer Science Department, RWTH Aachen University, Germany; Suman Ravuri ([email protected]) is with the International Computer Science Institute, Berkeley, CA, USA; Wen Wang ([email protected]) is with the Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.

(1) http://www.darpa.mil/ipto/programs/gale/gale.asp


Since the original work [1], MLP front-ends have progressed along two main directions:

1) the use of different input representations to the MLP;

2) the use of complex MLP architectures beyond the conventional three-layer perceptron.

The first category includes speech representations that aim at using long time spans of the speech signal, which can capture long-term phenomena (such as co-articulation) and are complementary to MFCC or PLP features [7]. Because of the large dimension of these signal time spans, a number of techniques for efficiently encoding this information have been proposed, like MRASTA [4], DCT-TRAPS [12], and wLP-TRAPS [13]. The second category includes a heterogeneous set of techniques that aim at overcoming the pitfalls of the single MLP classifier. They are based on the probabilistic combination of MLP outputs obtained using different input representations. Those combinations can happen in a parallel fashion, as in the multi-stream approach [14],[2], or in a hierarchical fashion [15]. Furthermore, the probabilistic features generated by three-layer MLPs have recently also been replaced by the bottleneck features extracted by four-layer and five-layer MLPs [16].

While previous works, e.g., [9], have discussed the development of the Mandarin LVCSR systems that use those features, no exhaustive comparison and analysis of the different front-ends has been presented in the literature. Without such a side-by-side comparison, it is not possible to assess which of the recent advances actually produced improvements in the final system. This correspondence focuses on those recent advances in training, scaling, and integrating MLP front-ends for Mandarin transcription. The novelty of this work is mainly experimental, and the correspondence provides two contributions.

First, the various MLP-based front-ends recently developed at multiple sites are described and compared on a common experimental setup in a systematic way. The comparison covers all the MLP features used in GALE and is done using the same phoneme set, the same speech-silence segmentation, the same amount of training data, and the same number of free parameters. The study is done using a simplified version of the system described in [9], trained on 100 hours of Mandarin broadcast news and conversation recordings. The investigation covers MLP acoustic front-ends both as stand-alone features and in concatenation with conventional MFCC features. To the best of our knowledge, this is the most exhaustive comparison of MLP front-ends for Mandarin speech recognition. The comparison reveals a number of novel facts about the different features and their use in LVCSR systems.

The second contribution is a study of how the performances scale with the amount of training data (from 100 hours to 1,600 hours of broadcast audio) and with the parametric model complexity of the system (including Speaker Adaptive Training, lattice-level combination, and discriminative training). As before, the contrastive experiments are run with and without the MLP features to assess the maximum relative improvement that can be obtained.

The remainder of the paper is organized as follows: section II describes features obtained using three-layer MLPs with various input representations, and section III describes features obtained using modifications to the three-layer architecture. Section IV reports experiments with those features in a system trained on 100 hours and analyzes and discusses the results of the comparison. Section V extends the experiments to a large-scale multi-pass evaluation system, and finally the paper is concluded in section VI.

II. INPUT REPRESENTATION FOR THREE-LAYER MLP FEATURES

The simplest MLP feature extraction is based on the following steps. At first, a three-layer MLP classifier is trained to minimize the cross-entropy between its output and a set of phonetic labels. Such a classifier produces phoneme posterior probabilities conditioned on the input representation at a given time instant [17].

In order to exploit this representation in HMM/GMM models, the phoneme posterior probabilities are first Gaussianized by applying a logarithm and then decorrelated using a Principal Component Analysis (PCA) transform. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors are used as conventional acoustic features in ASR systems. This framework is also known as TANDEM [1].
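As an illustration, the following minimal Python sketch implements the log/PCA post-processing just described; `posteriors` is assumed to be a frames-by-targets matrix of MLP outputs, and the function name and the default 95% variance threshold are illustrative choices rather than code from any of the toolkits used in this work.

    import numpy as np

    def tandem_transform(posteriors, var_kept=0.95):
        """Gaussianize MLP phoneme posteriors with a log, then decorrelate
        with PCA, keeping the leading components that account for
        `var_kept` of the total variability (illustrative sketch)."""
        logp = np.log(posteriors + 1e-10)   # log makes the outputs more Gaussian
        logp -= logp.mean(axis=0)           # center before estimating the PCA
        eigval, eigvec = np.linalg.eigh(np.cov(logp, rowvar=False))
        order = np.argsort(eigval)[::-1]    # sort components by variance
        eigval, eigvec = eigval[order], eigvec[:, order]
        n = int(np.searchsorted(np.cumsum(eigval) / eigval.sum(), var_kept)) + 1
        return logp @ eigvec[:, :n]         # decorrelated, reduced features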

The input to the MLP classifier can be conventional short-term features like PLP/MFCC or long-term features which aim at capturing the dynamic characteristics of the speech signal over large time spans. Let us briefly describe four different MLP inputs proposed and used for the transcription of Mandarin broadcasts.

A. TANDEM-PLP

In TANDEM-PLP features, the input to the MLP is represented by 9 consecutive frames of PLP cepstral features. Mandarin is a tonal language, thus the PLP vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18]. The PLP features undergo vocal tract length normalization and speaker-level mean and variance normalization. The final dimension of this vector is 42, thus the input to the MLP is a vector of size 42 x 9 = 378.
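A sketch of the context-window construction follows, under the assumption that `feats` holds one 42-dimensional PLP+pitch vector per row; edge padding at the utterance boundaries is one common choice, not necessarily the one used in [18].

    import numpy as np

    def stack_frames(feats, context=4):
        """Concatenate each frame with its +/- `context` neighbours,
        turning 42-dim frames into the 9-frame, 42 x 9 = 378-dim input."""
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + len(feats)]
                          for i in range(2 * context + 1)])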

TANDEM-PLP was the first MLP-based feature to be proposed and aims at using a few consecutive frames of short-term spectral features. Alternatively, the input to the MLP can be represented by critical-band temporal trajectories (up to half a second long) aiming at modeling long time patterns of the speech signal (also known as Temporal Patterns or TRAPS [19]). The dimensionality of TRAPS is quite large; considering, for instance, 500 ms trajectories (51 frames at a 10 ms step) in a 19-band critical-band spectrogram would produce a vector of dimension 19 x 51 = 969. Several methods have been considered for efficiently encoding this information while reducing the dimension, and they are briefly reviewed in the following.

B. Multiple RASTA

Multiple RASTA (MRASTA) filtering [4] is an extension of RASTA filtering and aims at using long signal time spans at the input of the MLP. The model is consistent with studies on human perception of modulation frequencies, modeled using a bank of filters equally spaced on a logarithmic scale [20]. This bank of filters subdivides the available modulation frequency range into separate channels, with a resolution that decreases when moving from slow to fast modulations.

Feature extraction is composed of the following parts: a 19-band critical-band auditory spectrum is extracted from the Short-Time Fourier Transform of the signal every 10 ms. A 600 ms long temporal trajectory in each critical band is then filtered with a bank of band-pass filters. Those filters represent first derivatives G1 = [g1,σi] (Eq. 1) and second derivatives G2 = [g2,σi] (Eq. 2) of Gaussian functions with variance σi varying in the range 8-60 ms:

g1,σi(x) ∝ −(x / σi²) exp(−x² / (2σi²))    (1)

g2,σi(x) ∝ (x² / σi⁴ − 1 / σi²) exp(−x² / (2σi²))    (2)

with σi = {0.8, 1.2, 1.8, 2.7, 4, 6}.
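The filter-bank itself is easy to reproduce; the sketch below samples Eqs. (1) and (2) on a 61-tap grid (600 ms at a 10 ms frame step) with σ expressed in frames. The peak normalization is an arbitrary choice, since the equations only define the filters up to a scale factor.

    import numpy as np

    SIGMAS = [0.8, 1.2, 1.8, 2.7, 4.0, 6.0]   # in 10 ms frames, i.e. 8-60 ms

    def mrasta_filters(sigmas=SIGMAS, half_len=30):
        """Sample the G1 (first-derivative) and G2 (second-derivative)
        Gaussian filters of Eqs. (1)-(2) over a 600 ms trajectory."""
        x = np.arange(-half_len, half_len + 1, dtype=float)
        g1 = np.array([-x / s**2 * np.exp(-x**2 / (2 * s**2))
                       for s in sigmas])
        g2 = np.array([(x**2 / s**4 - 1 / s**2) * np.exp(-x**2 / (2 * s**2))
                       for s in sigmas])
        # Scale is arbitrary: the definitions are proportionalities only.
        g1 /= np.abs(g1).max(axis=1, keepdims=True)
        g2 /= np.abs(g2).max(axis=1, keepdims=True)
        return g1, g2   # filtering = dot product with each band's trajectory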

In effect, the MRASTA filters are multi-resolution band-pass filters on modulation frequency, dividing the available modulation frequency range into its individual sub-bands(2). In the modulation frequency domain, they correspond to a filter-bank with equally spaced filters on a logarithmic scale. Identical filters are used for all critical bands. Thus, they provide a multiple-resolution representation of the time-frequency plane.

After MRASTA filtering, frequency derivatives across three consecutive critical bands are introduced. The total number of features used as input to the three-layer MLP is 432.

(2) Unlike in [4], the filter-banks G1 and G2 are composed of six filters rather than eight, leaving out the two filters with the longest impulse responses.


C. DCT-TRAPS

The DCT-TRAPS front-end aims at reducing the dimension of the trajectories using a Discrete Cosine Transform (DCT). As described in [12], the results obtained using the DCT basis are very similar to those obtained using a Principal Component Analysis. A critical-band auditory spectrum is extracted from the Short-Time Fourier Transform of the signal every 10 ms. Then, 500 ms long energy trajectories are extracted for each of the 19 critical bands that compose the spectrogram. Those are projected on the first 16 coefficients of a DCT, resulting in a vector of size 19 x 16 = 304 used as input to the MLP. In contrast to MRASTA, DCT-TRAPS does not emulate the varying sensitivity of human hearing to different modulation frequencies.
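A minimal sketch of this projection, assuming `spectrogram` is a frames-by-19 matrix of critical-band energies and using edge padding at the utterance boundaries (an assumption, as the paper does not specify the border handling):

    import numpy as np
    from scipy.fftpack import dct

    def dct_traps(spectrogram, context=25, n_coef=16):
        """Project the 51-frame (about 500 ms) trajectory of each of the
        19 critical bands onto its first 16 DCT coefficients, giving the
        19 x 16 = 304-dim MLP input."""
        n_frames, n_bands = spectrogram.shape
        padded = np.pad(spectrogram, ((context, context), (0, 0)), mode="edge")
        out = np.empty((n_frames, n_bands * n_coef))
        for t in range(n_frames):
            traj = padded[t:t + 2 * context + 1]            # 51 x 19
            out[t] = dct(traj, axis=0, norm="ortho")[:n_coef].T.ravel()
        return out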

D. wLP-TRAPS

A third alternative for extracting information from long signal time spans is represented by the wLP-TRAPS [13]. In contrast to the previous front-ends, the process does not use the short-term spectrum, thus potentially providing more complementarity to the MFCC features. Those features are obtained by warping the temporal axis in the LP-TRAP feature calculation [21]. The feature extraction is composed of the following steps: at first, linear prediction is used to model the Hilbert envelopes of pre-warped 500 ms long energy trajectories in auditory-like frequency sub-bands. The warping ensures that more emphasis is given to the center of the trajectories compared to the borders [13], thus again emulating human perception. The 25 LPC coefficients in each of the 19 frequency bands are then used as input to the MLP, producing a feature vector of dimension 19 x 25 = 475.
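The linear-prediction step can be sketched as below for a single sub-band trajectory (the temporal warping of [13] is omitted here); `solve_toeplitz` solves the autocorrelation normal equations, which is what the Levinson-Durbin recursion does internally.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_coefficients(trajectory, order=25):
        """Fit `order` linear-prediction coefficients to one 500 ms
        sub-band energy trajectory; 25 coefficients in each of the
        19 bands give the 475-dim feature vector."""
        x = trajectory - trajectory.mean()
        r = np.correlate(x, x, mode="full")[len(x) - 1:][:order + 1]
        return solve_toeplitz((r[:-1], r[:-1]), r[1:])   # solve R a = r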

All three representations described in sections II-B, II-C, and II-D aim at using long temporal time spans; however, they differ from each other in a number of implementation choices, such as the use of the short-time power spectrum, the use of zero-mean filters, and the warping of the time axis. Those differences are summarized in Table I.

TABLE I
DIFFERENCES BETWEEN THE THREE INPUT REPRESENTATIONS THAT USE LONG TEMPORAL TIME SPANS.

                            MRASTA   DCT-TRAPS   wLP-TRAPS
Short-time power spectrum   Yes      Yes         No
Time warping                Yes      No          Yes
Mean-removal filters        Yes      No          No

As Mandarin is a tonal language, those representations can be augmented with the smoothed log-pitch estimate obtained as described in [18] and with the values of the critical-band energies (19 features per frame). In the following, we will refer to them as Augmented features.

III. MLP ARCHITECTURES

The second direction along which the front-ends have evolved is the use of more complex architectures that overcome limitations of the three-layer MLP in different ways. Most of them are based on the combination of several MLP outputs trained using different input representations. This combination can happen in a parallel or hierarchical fashion. Again, no side-by-side comparisons of these architectures have been presented in the literature. The following paragraphs briefly describe these front-ends as used in LVCSR systems.


A. Hidden Activation TRAPS (HATS)

HATS feature extraction is motivated by observations on human speech recognition [22], which conjecture that humans recognize speech independently in each critical band and obtain a final decision by recombining those estimates. HATS aims at using information extracted from long time spans of critical-band energies, which are fed into a set of independent classifiers instead of a single MLP classifier. At first, a 19-band critical-band auditory spectrum is extracted from the Short-Time Fourier Transform of the signal every 10 ms. After that, HATS [2] feature extraction is composed of two steps:

1) In the first stage, an independent MLP is trained for each of the 19 critical bands to classify phonemes. The input to each MLP is a 500 ms long log critical-band energy trajectory (i.e., a 51-dimensional input). The input undergoes utterance-level mean and variance normalization.

2) In the second stage, a merger MLP is trained using the hidden activations obtained from the 19 MLPs of the first stage. The merger classifier aims at obtaining a single phoneme posterior estimate out of the independent estimates coming from each critical band. The phoneme posteriors obtained from the merger MLP are then transformed and used as features.

The rationale behind this architecture is that corruption of particular critical bands should have less effect on the final recognition results.
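The first stage can be sketched as follows; `band_nets` is a hypothetical list of 19 trained per-band networks, each a dict holding its input-to-hidden weights, and only the hidden activations (not the per-band posteriors) are passed on to the merger.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def hats_merger_input(band_trajs, band_nets):
        """For each critical band, push its 51-frame trajectory through
        that band's MLP and keep the hidden activations; their
        concatenation is the input of the second-stage merger MLP."""
        hidden = [sigmoid(traj @ net["W1"] + net["b1"])
                  for traj, net in zip(band_trajs, band_nets)]
        return np.concatenate(hidden)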

B. Multi-Stream

The outputs of MLPs are posterior probabilities of phonetic targets, which can be combined into a single estimate using probabilistic rules. This approach is typically referred to as multi-stream and was introduced in [14]. The rationale behind it is that MLPs trained using different input representations will perform differently in different conditions. To take advantage of both representations, the combination rule should be able to dynamically select the best posterior stream. Typical combination rules weight the posterior probabilities using a function of the output entropy (see [23] and [24]). Posteriors obtained from TANDEM-PLP (short signal time spans) and HATS (long signal time spans) are combined using the Dempster-Shafer method [24] and used as features after a log/PCA transform. Multi-stream comes at the obvious cost of doubling the total number of parameters in the system.
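For concreteness, the sketch below implements inverse-entropy weighting, the simplest of the entropy-based rules cited above; the Dempster-Shafer combination actually used in this work [24] is more involved but follows the same frame-by-frame confidence-weighting idea.

    import numpy as np

    def inverse_entropy_combination(p1, p2, eps=1e-10):
        """Combine two posterior streams frame by frame, weighting each
        by the inverse of its entropy so that the more confident
        (lower-entropy) MLP dominates the combination."""
        def entropy(p):
            return -(p * np.log(p + eps)).sum(axis=1, keepdims=True)
        w1, w2 = 1.0 / (entropy(p1) + eps), 1.0 / (entropy(p2) + eps)
        combined = (w1 * p1 + w2 * p2) / (w1 + w2)
        return combined / combined.sum(axis=1, keepdims=True)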

C. Hierarchical processing

While multi-stream approaches combine MLP outputs in parallel, studies on English and Mandarin data [15],[25] showed that the most effective way of combining classifiers trained on separate ranges of modulation frequencies, i.e., on different temporal spans, is based on hierarchical (sequential) processing. The hierarchical processing is based on the following steps.

MRASTA filters cover the whole range of modulation frequencies. The filter-banks G1 and G2 (6 filters each) are split into two separate filter-banks: G1-High and G2-High, which filter fast modulation frequencies, and G1-Low and G2-Low, which filter slow modulation frequencies. G-High and G-Low are defined as follows:

G-High = [G1-High, G2-High] = [g1,σi, g2,σi] with σi = {0.8, 1.2, 1.8}    (3)

G-Low = [G1-Low, G2-Low] = [g1,σi, g2,σi] with σi = {2.7, 4, 6}    (4)


The filters in G1-High and G2-High are short and process high modulation frequencies; the filters in G1-Low and G2-Low are long and process low modulation frequencies. The cutoff frequency between the two filter-banks G-High and G-Low is approximately 10 Hz.

The output of the MRASTA filtering is processed by a hierarchy of MLPs, progressively moving from high to low modulation frequencies (i.e., from short to long temporal contexts). The rationale behind this processing is that the errors produced by the first MLP can be corrected by a second one using the estimates from the first MLP together with evidence from another range of modulation frequencies.

The first MLP is trained on the first feature stream, represented by the output of the filter-banks G-High, which extract high modulation frequencies. This MLP estimates the first set of phoneme posterior probabilities. These posteriors are transformed by a log/PCA transform and then concatenated with the second feature stream, forming the input to the second phoneme-posterior-estimating MLP. In this way, the phoneme estimates from the first MLP are modified by the second net using evidence from a different feature stream. This process is depicted in Figure 1.
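Schematically, the cascade amounts to the following sketch, where `mlp_high` and `mlp_low` stand for the two trained networks' forward passes and `pca` for the PCA projection matrix (all hypothetical names).

    import numpy as np

    def hierarchical_posteriors(high_feats, low_feats, mlp_high, mlp_low, pca):
        """First estimate posteriors from the fast-modulation stream,
        apply the log/PCA transform, then let the second MLP re-estimate
        them together with the slow-modulation stream."""
        post1 = mlp_high(high_feats)                     # first estimate
        tandem1 = np.log(post1 + 1e-10) @ pca            # log + PCA
        return mlp_low(np.hstack([tandem1, low_feats]))  # corrected posteriors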

D. Bottleneck Features

Bottleneck features are recently introduced, non-probabilistic MLP features [16]. The conventional three-layer MLP is replaced with a four- or five-layer MLP where the first layer takes the input features and the last layer corresponds to the phonetic targets. As discussed in [26], the five-layer architecture provides slightly better performance than the four-layer one. The size of the second layer is large in order to provide enough modeling power, the size of the third layer is small, typically equal to the desired feature dimension, and the size of the fourth layer is approximately half that of the second [26]. Instead of using the output of the MLP, features are obtained from the linear activations of the third layer. Bottleneck features do not require a dimensionality reduction, as the desired dimension can be obtained by fixing the size of the bottleneck layer. Furthermore, the linear activations are already approximately Gaussian distributed, thus they do not require any log transform. The most common inputs to the non-probabilistic bottleneck features are long-term features such as the DCT-TRAPS and wLP-TRAPS described in sections II-C and II-D.
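Feature extraction from a trained five-layer bottleneck MLP then reduces to a truncated forward pass, as in this sketch (`layers` is a hypothetical list of weight dicts for the trained net; the choice of tanh is illustrative).

    import numpy as np

    def bottleneck_features(x, layers):
        """Propagate the input up to the small third layer and return its
        linear activations: 35-dim features, no log or PCA required."""
        h = np.tanh(x @ layers[0]["W"] + layers[0]["b"])  # large 2nd layer
        return h @ layers[1]["W"] + layers[1]["b"]        # linear bottleneck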

IV. SMALL SCALE EXPERIMENTS

The following preliminary experiments are based on the large-vocabulary ASR system for transcribing Mandarin broadcasts described in [9], developed by SRI/UW/ICSI for the GALE project. The recognition is performed using the SRI Decipher recognizer, and results are reported in terms of Character Error Rate (CER). The training is done using approximately 100 hours of manually transcribed broadcast news and conversation data, including speaker labels. Results are reported on the GALE 2006 evaluation data, simply referred to as eval06 in the following.

The baseline system uses 13 standard mel-frequency cepstral coefficients (MFCC) plus first- and second-order temporal derivatives. Vocal Tract Length Normalization (VTLN) and speaker-level mean-variance normalization are applied. Mandarin is a tonal language, thus the MFCC vector is augmented with the smoothed log-pitch estimate plus its first- and second-order temporal derivatives as described in [18], resulting in a feature vector of dimension 42. In the following we will refer to this system simply as the MFCC baseline.

The training is based on conventional Maximum Likelihood. The acoustic models are composed of within-word triphone HMMs, and a 32-component diagonal-covariance GMM is used for modeling the acoustic emission probabilities. Parameters are shared across different triphones according to a phonetic decision tree. Recognition networks are compiled from trigram language models trained on over one billion words, with a 60K-vocabulary lexicon [9].


The decoding phase consists of two decoding passes: a speaker-independent (si) decoding followed by a speaker-adapted (sa) decoding. Speaker adaptation is done using 1-class constrained Maximum Likelihood Linear Regression (CMLLR) followed by 3-class MLLR. The performance of this baseline system on the eval06 data is reported in Table II for both speaker-independent (si) and speaker-adapted (sa) models.

TABLE II
BASELINE SYSTEM PERFORMANCE ON THE EVAL06 DATA.

Features   CER (si)   CER (sa)
MFCC       27.8       25.8

In this set of experiments, three-layer MLPs are trained on all of the available 100 hours of acoustic model training data. The Mandarin toneme set is composed of 72 elements. The training is done using the ICSI QuickNet software(3).

A. MLP features

This section discusses experiments with features obtained using three-layer MLP architectures with different input representations. Unless explicitly mentioned otherwise, the total number of parameters in the different MLP architectures is equalized to approximately one million in order to ensure a fair comparison between the different approaches. The size of the input layer equals the feature dimension, the size of the output layer equals the number of phonetic targets (72), and the size of the hidden layer is modified so that the total number of parameters equals one million. After PCA, a dimensionality reduction accounting for 95% of the total variability is applied. The resulting feature vectors have dimension 35 for all the different MLP features.

The investigation was carried out with MLP features both as a stand-alone front-end and in concatenation with spectral features, i.e., MFCC. Results are reported in terms of Character Error Rate (CER) on the eval06 data. Let us first consider the TANDEM-PLP features described in section II-A. The performances of those features are reported in Table III, as well as the relative improvements with respect to the MFCC baseline with and without speaker adaptation.

TABLE III
TANDEM-9FRAMESPLP PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

Features                      CER (si)      CER (sa)
TANDEM-9framesPLP w/o MFCC    27.6 (+1%)    25.5 (+1%)
TANDEM-9framesPLP with MFCC   23.4 (+16%)   22.2 (+14%)

When used as stand-alone features, TANDEM-PLP does not outperform the baseline, whereas a relative improvement of +16% is obtained when they are used in concatenation with MFCC. After speaker adaptation, the relative improvement drops slightly, by 2%, to a still substantial 14% relative improvement over the MFCC baseline.

Let us now consider the use of MLP features obtained using long time spans of the speech signal as described in subsections II-B, II-C, and II-D. Table IV shows that these features perform quite poorly as stand-alone features, whereas they can provide improvements of around 10% relative in concatenation with the MFCC features. As a stand-alone front-end, the wLP-TRAPS outperforms the other two, whereas in concatenation with spectral features and after adaptation the three representations are comparable; their performances are, however, inferior to those of the conventional TANDEM-9framesPLP.

(3) http://www.icsi.berkeley.edu/Speech/qn.html


TABLE IV
MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT. PERFORMANCE IS REPORTED ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND-ALONE FEATURES AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

MLP without MFCC
Features    CER (si)      CER (sa)
MRASTA      32.4 (-17%)   30.7 (-19%)
DCT-TRAPS   32.2 (-16%)   31.7 (-23%)
wLP-TRAPS   29.9 (-7%)    28.2 (-9%)

MLP with MFCC
Features    CER (si)      CER (sa)
MRASTA      24.4 (+12%)   23.1 (+10%)
DCT-TRAPS   24.6 (+12%)   23.2 (+10%)
wLP-TRAPS   23.2 (+16%)   23.0 (+11%)

The performances of these features augmented with the values of the critical-band energies (19 features per frame) and the smoothed log-pitch estimates are reported in Table V. Augmenting the long-term features produces consistent improvements in all cases and brings the performances of these front-ends to the same level as the TANDEM-PLP when tested in concatenation with MFCC.

TABLE V
MLP FEATURES MAKING USE OF LONG TIME SPANS OF THE SIGNAL AS INPUT, AUGMENTED WITH CRITICAL-BAND ENERGY AND LOG-PITCH. PERFORMANCE IS REPORTED ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES AS STAND-ALONE FEATURES AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

MLP without MFCC
Features     CER (si)      CER (sa)
A-MRASTA     29.0 (-4%)    26.6 (-3%)
A-DCTTRAPS   31.2 (-12%)   28.9 (-12%)
A-wLPTRAPS   29.0 (-5%)    27.3 (-6%)

MLP with MFCC
Features     CER (si)      CER (sa)
A-MRASTA     23.4 (+16%)   22.2 (+14%)
A-DCTTRAPS   24.0 (+13%)   22.5 (+13%)
A-wLPTRAPS   23.2 (+17%)   22.2 (+14%)

As before, the relative improvements are always reduced after speaker adaptation. In concatenation with spectral features, the three input representations have similar performances.

In summary, MLP front-ends obtained using a three-layer MLP with different input representations do not outperform the conventional MFCC as stand-alone features; on the other hand, they produce relative improvements in the range of 10%-14% when used in concatenation with spectral features. The TANDEM-PLP front-end outperforms the long-term features. The various coding schemes, MRASTA, DCT-TRAPS, and wLP-TRAPS, give similarly poor results as stand-alone features and


similar improvements (approximately 11%) when used in concatenation with spectral features. Augmenting the long-term input with a vector of short-term energy and pitch brings the performances close to those of the TANDEM-PLP features.

The relative improvements after speaker adaptation are generally reduced by 2% with respect to the speaker-independent systems. This is consistent with what has already been observed in English ASR experiments [27].

B. MLP architectures

This section discusses experiments with the different MLP architectures; the input signal representations are similar to those used in the previous section, while the information is exploited differently by the different MLP architectures. The results obtained using these methods are compared with their counterparts based on three-layer MLPs.

1) Hidden Activation TRAPS: HATS aims at using information extracted from long time spans of critical-band energies, but the recognition is done independently in each critical band using 19 independent MLPs. The final posterior estimates are obtained by merging all these estimates (see subsection III-A). Results with HATS features are reported in Table VI. As stand-alone features, HATS performs significantly worse than MFCC, whereas a +12% relative improvement is obtained when it is used in concatenation with MFCC. Comparing Tables IV and VI, it is noticeable that this approach is marginally better than those that use long-term features in a single MLP.

TABLE VI
HATS PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

Features         CER (si)      CER (sa)
HATS w/o MFCC    30.5 (-10%)   29.1 (-13%)
HATS with MFCC   23.8 (+14%)   22.7 (+12%)

2) Multi-stream MLP features: Table VII reports the performance of the multi-stream front-end that combines information from TANDEM-PLP (short time spans of the signal) and HATS (long time spans of the signal). These features outperform the MFCC by 10% relative when used stand-alone and by 16% relative in concatenation with MFCC.

Those numbers must be compared to the performances of the individual streams of TANDEM-PLP (Table III) and HATS (Table VI). The combination provides a large improvement in the case of stand-alone features (TANDEM-PLP 25.5%, HATS 29.1%, multi-stream 23.1%); however, the improvements are smaller when used in concatenation with MFCC (TANDEM-PLP 22.2%, HATS 22.7%, multi-stream 21.7%). This can easily be explained by the fact that, when used in concatenation with the MFCC, the feature vector contains the spectral information twice: through the MFCC and through the TANDEM features.

TABLE VII
MULTI-STREAM MLP FEATURE PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

Features                 CER (si)      CER (sa)
Multi-stream w/o MFCC    24.6 (+11%)   23.1 (+10%)
Multi-stream with MFCC   22.8 (+18%)   21.7 (+16%)


Fig. 1. Proposed scheme for the MLP-based feature extraction as used in the GALE 2008 evaluation. The auditory spectrum is filtered with a set of multiple-resolution filters that extract fast modulation frequencies. The resulting vector is concatenated with short-term critical-band energy and pitch estimates and is used as input to the first MLP, which estimates phoneme posterior distributions. The output of the first MLP is then concatenated with features obtained using slow modulation frequencies, short-term critical-band energy, and pitch estimates, and is used as input to the second MLP.

3) Hierarchical processing: Next, we discuss experiments with the hierarchical processing described in section III-C. Results are reported in Table VIII for both the MRASTA and Augmented MRASTA inputs (the processing is depicted in Figure 1). Comparing Table VIII with Tables IV and V, it is noticeable that the hierarchical approach produces considerable improvements with respect to the single-classifier approach, both with and without MFCC features. It is important to notice that the total number of parameters is kept constant; the improvements are thus produced by the sequential architecture, in which short signal time spans are used first and then integrated with the longer ones.

TABLE VIII
HIERARCHICAL FEATURE PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

MLP without MFCC
Features   CER (si)      CER (sa)
Hier       27.8 (+0%)    26.5 (-3%)
A-Hier     26.4 (+5%)    24.1 (+6%)

MLP with MFCC
Features   CER (si)      CER (sa)
Hier       22.9 (+18%)   21.9 (+15%)
A-Hier     22.3 (+20%)   21.2 (+18%)

4) Bottleneck features: Tables IX and X report the performances of the bottleneck features obtained using the different long-term inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) and their augmented versions. The dimension of the bottleneck is fixed to 35 in order to allow comparison with the other MLP features. The results reveal that bottleneck features always outperform their probabilistic counterparts obtained using the three-layer MLP. This is verified for all the different input features and their augmented versions.


TABLE IX
BOTTLENECK FEATURE PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

MLP without MFCC
Features              CER (si)      CER (sa)
bottleneck-MRASTA     27.8 (+0%)    25.9 (+0%)
bottleneck-DCTTRAPS   27.9 (+0%)    25.7 (+0%)
bottleneck-wLPTRAPS   26.4 (+5%)    24.9 (+3%)

MLP with MFCC
Features              CER (si)      CER (sa)
bottleneck-MRASTA     22.8 (+18%)   21.5 (+17%)
bottleneck-DCTTRAPS   23.1 (+17%)   22.0 (+15%)
bottleneck-wLPTRAPS   22.2 (+20%)   21.5 (+17%)

TABLE X
AUGMENTED BOTTLENECK FEATURE PERFORMANCE ON THE EVAL06 DATA. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

MLP without MFCC
Features                CER (si)      CER (sa)
A-bottleneck-MRASTA     26.0 (+6%)    24.0 (+6%)
A-bottleneck-DCTTRAPS   27.2 (+1%)    24.9 (+3%)
A-bottleneck-wLPTRAPS   26.1 (+6%)    24.1 (+6%)

MLP with MFCC
Features                CER (si)      CER (sa)
A-bottleneck-MRASTA     22.2 (+18%)   21.2 (+18%)
A-bottleneck-DCTTRAPS   22.7 (+18%)   21.5 (+17%)
A-bottleneck-wLPTRAPS   22.1 (+20%)   21.2 (+18%)

For comparison purposes, Table XI also reports the performance of bottleneck features when the input to the MLP is 9 frames of PLP features augmented with pitch.

TABLE XI
BOTTLENECK FEATURE PERFORMANCE ON THE EVAL06 DATA WHEN THE 9FRAMESPLP AND PITCH INPUT IS USED. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE BASELINE IS REPORTED IN PARENTHESES.

Bottleneck with 9frames PLP input
                   CER (si)      CER (sa)
MLP without MFCC   24.9 (+9%)    23.8 (+7%)
MLP with MFCC      23.0 (+17%)   22.1 (+14%)

In summary, replacing the three-layer MLP with a more complex MLP structure (while keeping the total number of parameters constant) produces a reduction in the error both with and without concatenation of spectral features. The multi-stream approach, which combines in parallel MLPs trained on long and short temporal features, produces the lowest CER as a stand-alone front-end (a 16% relative CER reduction compared to the MFCC). On the other hand,


Fig. 2. RWTH evaluation system composed of two subsystems trained on MFCC and PLP features. The two subsystems consist of ML training followed by SAT/CMLLR training. The lattice outputs from the subsystems are combined in the end.

hierarchical and bottleneck structures that go beyond the three-layer MLP appear to produce the highest complementarity to MFCC, yielding an improvement of 17%-18% relative when used in concatenation. The reasons for these effects are investigated in the next section, where the front-ends are compared in terms of phonetic confusions.

C. Analysis of results

In order to understand the differences between the various MLP front-ends, let us now analyze the errors they produce in terms of phonetic targets. Table XII reports the phonetic set, composed of 72 tonemes, used for training the MLPs. The set is sub-divided into six broad phonetic classes for analysis purposes. The numbers beside the vowels represent the tonal accents. The frame-level accuracy of a three-layer MLP trained using 9frames-PLP features in classifying the phonetic targets is 69.8%.

TABLE XII
PHONETIC SET USED TO TRAIN THE DIFFERENT MLPS, DIVIDED INTO BROAD PHONETIC CLASSES. AS MANDARIN IS A TONAL LANGUAGE, THE NUMBER BESIDE EACH VOWEL DESIGNATES THE ACCENT OF THE TONEME.

Vowels:        A1 A2 A3 A4 E1 E2 E3 E4 I1 I3 I4 IH1 IH2 IH3 IH4 W Y a1 a2 a3 a4 e1 e2 e3 e4 er2 er3 er4 i1 i2 i3 i4 o1 o2 o3 o4 u1 u2 u3 u4 yu1 yu2 yu3 yu4
Stops:         b d g k p t
Fricatives:    c f h s sh v x z
Affricates:    ch q zh j
Approximants:  l r w y
Nasals:        ng m n

Figure 3 plots the per-class accuracy. Let us now consider the accuracies of the three-layer MLPs trained using the long-term input representations, i.e., MRASTA, DCT-TRAPS, and wLP-TRAPS. They are 64%, 62.9%, and 65.2% respectively, all worse than the accuracy of the 9frames-PLP. The HATS features, which are based on long-term critical-band trajectories, have a similar frame-level accuracy, i.e., 65.7%.

While the overall performance of the MLP trained on spectral features is superior to that of the MLPs trained on long time spans of the speech signal, the latter appear to perform better on some phonetic classes. Figure 3 plots the accuracy of recognizing each of the phonetic classes for HATS. It is noticeable that, in spite of an overall inferior performance, HATS outperforms the TANDEM-PLP on almost all the stop consonants ('p', 't', 'k', 'b', 'd') and on the affricate 'ch'. Stop consonants are short sounds characterized by a burst of acoustic energy following a short period of silence and are known to be prone to strong co-articulation with the following vowel. Studies like [28] have shown that stop consonant recognition can be largely improved by considering information


from the following vowel; this explains why using longer speech time spans produces higher recognition performance compared to conventional short-term spectral features. Also, the affricate 'ch' (composed of a plosive and a fricative) is confused with 'zh' and the fricative 's' by the short-term features, while this confusion is significantly reduced by the long-term features. Vowels and the other consonants are still better recognized by the short-term features.

Those facts are verified for all the MLP front-ends that use long temporal inputs (MRASTA, DCT-TRAPS, and wLP-TRAPS) as well as for HATS. In summary, MLPs trained on short-term spectral input outperform MLPs trained on long-term temporal input on most of the phonetic classes, apart from a few that include the plosives and affricates.

Let us now consider the multi-stream approach, which dynamically weights the posterior estimates from the 9frames-PLP and HATS according to the confidence of each MLP. The frame accuracy becomes 73%, and the phoneme-level confusion shows that the performances are never inferior to the best of the two streams that compose the combination. In other words, the combined distribution appears to perform like HATS on the stop consonants and affricates and like the 9frames-PLP on the remaining phonemes. Those results translate into a significant reduction of the CER, never worse than that obtained using the individual MLP features (see the experiments in section IV-B2).

The hierarchical approach described in section III-C is based on a completely different idea. This method uses an initial MLP trained on energy trajectories filtered with short temporal filters (G1-High and G2-High). This provides an initial posterior estimate, which is then fed into the second MLP together with energy trajectories filtered with long temporal filters (G1-Low, G2-Low). The second MLP re-estimates the phonetic posteriors obtained from the first MLP using information from a longer temporal context. The hierarchical framework achieves a frame accuracy of 72%. Interestingly, this is done without using any spectral feature (MFCC or PLP) and while keeping the number of parameters constant; only the architecture of the MLP is changed, with the temporal context of the input features increased sequentially. In other words, the first MLP, trained on the short temporal context, is effective on most of the phonetic classes apart from stops and affricates. Those estimates are then corrected by the second MLP using information from the longer temporal context. Figure 4 plots the phonetic-class accuracy obtained by the three-layer MLP trained using the MRASTA input and by the hierarchical approach. It is noticeable that the hierarchical approach outperforms training on the plain MRASTA input on all the targets. Recognition results show that the hierarchical approach (where the processing moves from short to long temporal contexts) reduces the CER with respect to the single MLP features (where the different time spans are processed by the same MLP). Augmenting the input with pitch estimates and energy further reduces the CER.

Another interesting finding is that, as stand-alone features, the multi-stream approach has the lowest CER, while in concatenation with MFCC the augmented hierarchical approach produces the largest CER reduction (compare Tables VIII and VII). This effect can be explained by the fact that the multi-stream approach makes use of spectral information (through the 9frames-PLP). This information produces a frame accuracy of 73% but does not appear complementary to the MFCC features, as both represent spectral information. On the other hand, the hierarchical approach achieves a frame accuracy of 72% without the use of any spectral features and appears more complementary when used in concatenation with the MFCC.

Results from the bottleneck features cannot be analyzed in a similar way, as these are non-probabilistic features without any explicit mapping to a phonetic target. However, the recognition results in Tables IX and X show that replacing the three-layer MLP with the bottleneck architecture reduces the CER for all the different input representations (MRASTA, DCT-TRAPS, wLP-TRAPS). The bottleneck and hierarchical approaches produce similar improvements in concatenation with MFCC features.


Fig. 3. Phonetic-class accuracy obtained by the TANDEM-9framesPLP and HATS. The former outperforms the latter on most of the classes apart from stops and affricates.

Fig. 4. Phonetic-class accuracy obtained by the MRASTA and the hierarchical MRASTA. The latter improves the performance on all the phonetic targets without the use of any spectral information.

V. LARGE SCALE EXPERIMENTS

Contrastive experiments in the literature are typically reported on small setups like the one presented so far. However, the GALE evaluation systems are trained on a much larger amount of data, make use of multi-pass training, and are composed of a number of individual sub-systems. In order to study how the previous results generalize to more complex LVCSR systems and a larger amount of training data, the experiments are extended using a highly accurate automatic speech recognizer for continuous Mandarin speech trained on 1,600 hours of data collected by the LDC (GALE releases P1R1-4, P2R1-2, P3R1-2, P4R1). The training transcripts were preprocessed, and the audio data were segmented into waveforms based on sentence boundaries defined in the manual transcripts. Both were provided by UW-SRI as described in [9].

This comparison covers the multi-stream and the hierarchical MRASTA front-ends, which will simply be referred to as MLP1 and MLP2 in the remainder of this paper. These two features were used in the GALE 2008 Mandarin evaluation. The 1,600 hours of data are used for training the HMM/GMM systems as well as the MLP front-ends. The evaluation is done on the GALE 2007 development test set (dev07), which is used for tuning hyper-parameters, the GALE 2008 development test set (dev08), and the sequestered data of the GALE 2007 evaluation (eval07-seq), for a total amount of 5 hours of data. Statistics of the different test sets are summarized in Table XIII. The number of parameters in the MLP architectures is increased to five million for the large-scale setup. The training of the MLP1 and MLP2 networks took approximately 5 weeks on an 8-core machine (AMD Opteron(tm) Dual Core, 2192 MHz, 2x4-core CPUs). The MLP1 networks were trained at ICSI and the MLP2 networks at IDIAP. On the other hand, the generation of the features is quite fast, approximately 0.09xRT on a single CPU.

The RWTH evaluation system is composed of two subsystems which differ only in their acoustic front-ends.


TABLE XIII
ACOUSTIC DATA FOR TRAINING AND TESTING.

                   Train set   dev07   dev08   eval07-seq
total amount       1,600h      2.55h   1.0h    1.63h
# segments         1.3M        1985    619     1013
# running words    16.5M       28K     11K     17K
# distinct words   63K         5.3K    3K      4.1K

TABLE XIV
PERFORMANCES OF THE BASELINE SYSTEMS USING MFCC OR PLP FEATURES.

          GALE-dev07            GALE-dev08            GALE-eval07-seq
Feature   CER (si)   CER (sa)   CER (si)   CER (sa)   CER (si)   CER (sa)
MFCC      15.7       14.0       14.0       12.5       15.8       14.5
PLP       16.4       14.4       14.9       13.4       16.2       14.5

The acoustic front-ends of the subsystems consist of conventional MFCCs and PLPs augmented with the log-pitch estimates [18]. The filter banks underlying the MFCC and PLP feature extraction undergo VTLN. After that, the features are mean and variance normalized and fed into a sliding window of length nine. All feature vectors within the sliding window are concatenated and projected to a 45-dimensional feature space using linear discriminant analysis (LDA). The system uses a word-based pronunciation dictionary described in [9] that maps words to phoneme sequences, where each phoneme carries the tone information and is therefore usually referred to as a toneme. The acoustic models for all systems are based on triphones with cross-word context, modelled by 3-state left-to-right HMMs. Decision-tree-based state tying is applied, resulting in a total of 4,500 generalized triphone states. The acoustic models consist of Gaussian mixture distributions with a globally pooled diagonal covariance matrix.

The first pass consists of Maximum Likelihood training. We will refer to this system as the Speaker Independent (SI) system. The second pass consists of Speaker Adaptive Training (SAT). Furthermore, during decoding, Maximum Likelihood Linear Regression is applied to the means for speaker adaptation. We will refer to this system as the Speaker Adapted (SA) system. Finally, the outputs of the different subsystems are combined at the lattice level using the min.fWER combination method described in [29]. The min.fWER method has been shown to outperform other lattice combination methods such as ROVER or Confusion Network Combination (CNC) [29]. Figure 2 schematically depicts the RWTH evaluation system.

The language model (LM) used in this work was kindly provided by SRI and UW. The vocabulary size is 60K. Experimental results with the full LM are reported only for the system combination, while a pruned version is applied in all other recognition steps.

Table XIV reports the CER for the speaker-independent and speaker-adapted subsystems trained using MFCC and PLP features only. The error rate is in the range of 12.5%-14.5% for the different test sets.

Let us now consider the integration of the MLP1 and MLP2 front-ends. Table XV reports the performance of the subsystems when they are trained using MLP1 and MLP2 features only, and when MFCC and PLP are concatenated with MLP1 and MLP2. The results show similar trends as in the 100-hour system; in other words, the MLP feature performance scales with the amount of training data.


TABLE XV
SUMMARY OF FEATURE PERFORMANCES ON THE GALE DEV07/DEV08/EVAL07-SEQ TEST SETS. RESULTS ARE REPORTED WITH MLP FEATURES ALONE AND IN CONCATENATION WITH MFCC OR PLP. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC AND PLP BASELINES IS REPORTED IN PARENTHESES.

GALE-dev07
Features   MLP w/o MFCC               MLP with MFCC
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.3 (+15%)   12.3 (+12%)  11.6 (+26%)   10.6 (+24%)
MLP1       13.1 (+17%)   12.4 (+11%)  12.3 (+22%)   11.3 (+19%)

GALE-dev07
Features   MLP w/o PLP                MLP with PLP
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.3 (+18%)   12.3 (+14%)  11.7 (+28%)   11.1 (+22%)
MLP1       13.1 (+20%)   12.4 (+14%)  12.3 (+25%)   11.4 (+20%)

GALE-dev08
Features   MLP w/o MFCC               MLP with MFCC
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.1 (+6%)    11.6 (+7%)   11.3 (+19%)   10.2 (+18%)
MLP1       12.4 (+11%)   11.4 (+9%)   11.5 (+18%)   10.1 (+19%)

GALE-dev08
Features   MLP w/o PLP                MLP with PLP
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.1 (+12%)   11.6 (+13%)  11.2 (+24%)   10.3 (+23%)
MLP1       12.4 (+16%)   11.4 (+14%)  11.3 (+24%)   10.4 (+22%)

GALE-eval07-seq
Features   MLP w/o MFCC               MLP with MFCC
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.9 (+13%)   13.2 (+9%)   12.8 (+19%)   11.8 (+19%)
MLP1       13.8 (+13%)   13.4 (+8%)   13.1 (+18%)   12.2 (+16%)

GALE-eval07-seq
Features   MLP w/o PLP                MLP with PLP
           CER (si)      CER (sa)     CER (si)      CER (sa)
MLP2       13.9 (+14%)   13.2 (+9%)   12.7 (+21%)   12.1 (+16%)
MLP1       13.8 (+14%)   13.4 (+7%)   12.9 (+20%)   12.2 (+16%)

TABLE XVI
SYSTEM COMBINATION OF THE MFCC AND PLP SUBSYSTEMS, DESIGNATED WITH ⊕. THE RELATIVE IMPROVEMENT WITH RESPECT TO THE MFCC ⊕ PLP BASELINE IS REPORTED IN PARENTHESES.

Features                 GALE-dev07    GALE-dev08    GALE-eval07-seq
MFCC ⊕ PLP               12.9          11.9          13.5
MLP1 ⊕ MLP2              11.1 (+14%)   10.7 (+10%)   12.3 (+9%)
MFCC+MLP2 ⊕ PLP+MLP1     9.9 (+23%)    9.4 (+21%)    11.0 (+19%)

In particular, the MLP1 and MLP2 front-ends outperform the spectral features and produce a relative improvement in the range of 15%-25% when used in concatenation with MFCC or PLP, reducing the CER to the range 10.1%-12.2% on the different datasets. The improvements are verified on all three test sets. The relative improvements after SAT are generally reduced with respect to the speaker-independent system. After SAT, the MLP2 features (based on the hierarchical approach) yield the best performance in concatenation with both MFCC and PLP.

The lattice combination results of the MFCC and PLP sub-systems are reported in Table XVI (first row). For investigation purposes, the corresponding sub-systems trained using the MLP1 and MLP2 front-ends are combined in the same way, and their performance is reported in Table XVI (second row). Their performance is superior to that of the MFCC/PLP system by 9%-14% relative, showing that the improvements hold after lattice-level combination.

In order to increase the complementarity of the sub-systems, the MLP1 and MLP2 features were then concatenated with PLP and MFCC, respectively. The performance of the lattice-level combination of those two sub-systems is reported in Table XVI (third row).


TABLE XVII
EFFECT OF DISCRIMINATIVE TRAINING ON THE DIFFERENT SUBSYSTEMS AND THEIR COMBINATION (DESIGNATED WITH ⊕).

Features                GALE-dev07   GALE-dev08   GALE-eval07-seq
MFCC+MLP2               9.6 (+9%)    9.2 (+9%)    11.0 (+7%)
PLP+MLP1                9.9 (+13%)   9.3 (+10%)   11.0 (+9%)
MFCC+MLP2 ⊕ PLP+MLP1    8.8 (+11%)   8.5 (+10%)   10.4 (+6%)

The results show that using the two MLP front-ends in concatenation with the MFCC/PLP features produces an additional relative improvement, in the range of 18%-23% after system combination.

For the GALE 2008 evaluation, discriminative training was further applied to the two subsystems before the lattice level combination. Discriminative training is based on a modified Minimum Phone Error (MPE) criterion described in [30]. Table XVII reports the CER obtained after discriminative training. Results are reported for the PLP+MLP1 system, the MFCC+MLP2 system, and their lattice level combination.
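As background for this step, the standard (unmodified) MPE criterion maximizes the expected phone accuracy of the training data; the margin-based modification of [30] alters this objective and is not reproduced here. A schematic form of the standard criterion is

F_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \frac{\sum_{W} p_{\lambda}(O_r \mid W)^{\kappa}\, P(W)\, A(W, W_r)}{\sum_{W'} p_{\lambda}(O_r \mid W')^{\kappa}\, P(W')},

where O_r is the r-th training utterance, A(W, W_r) is the raw phone accuracy of hypothesis W against the reference W_r, P(W) is the language model probability, and \kappa is an acoustic scaling factor.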

In all three cases, discriminative training reduced the CER in the range of 6-13% relative, showing that it is also effective when used together with different MLP front-ends. For computational reasons, fully contrastive results with and without discriminative training are not available on the 1,600 hours system.

This system, including the two most recent MLP-based front-ends, proved very competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].

VI. DISCUSSION AND CONCLUSION

During the GALE evaluation campaigns, several MLP based front-ends have been used in different LVCSR systems, although no exhaustive and systematic study of their performances has been reported in the literature. Without such a comparison, it is not possible to verify which of the modifications to the original MLP features produced improvements in the final system.

This correspondence describes and compares in a systematic manner all the MLP front-ends developed recently at multiple sites and used during the GALE project for Mandarin transcription. The initial investigation is carried out on a small-scale experimental setup (100 hours) and covers the two directions along which the MLP features have recently evolved: the use of different inputs to the conventional three-layer MLP and the use of complex MLP architectures. The experimentation is done both using MLP front-ends as stand-alone features and in concatenation with MFCC.
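For readers unfamiliar with the TANDEM pipeline, the sketch below shows the usual post-processing of MLP outputs before HMM/GMM modeling: log-compression of the posteriors followed by a PCA/KLT decorrelation. The function name and the precomputed pca_transform are illustrative assumptions, not the exact recipe of any particular system compared here.

import numpy as np

def tandem_features(posteriors, pca_transform):
    """TANDEM post-processing sketch: log posteriors + PCA/KLT.

    posteriors:    (n_frames, n_targets) MLP output probabilities
    pca_transform: (n_targets, n_kept) decorrelating transform,
                   assumed estimated beforehand on training data
    """
    logp = np.log(np.clip(posteriors, 1e-10, None))   # Gaussianize
    return (logp - logp.mean(axis=0)) @ pca_transform  # decorrelate

The log-compression and decorrelation make the features better suited to the diagonal-covariance Gaussians of the HMM/GMM back-end.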

Three-layer MLPs are trained using conventional spectral features (9frames-PLP) and features extracted from long time spans of the signal (MRASTA, DCT-TRAPS, wLP-TRAPS, and their augmented versions). Results reveal that, as stand-alone features, none of them outperforms the conventional MFCC features. The performances of the MLPs trained on long time spans of the speech signal (MRASTA, DCT-TRAPS, wLP-TRAPS) are quite poor compared to those obtained from training on short-term spectral features (9frames-PLP). The latter is superior on most of the phonetic targets, apart from a few phonetic classes such as plosives and affricates.

Features based on the three-layer MLP produce relative improvements in the range of 10-14% when used in concatenation with the MFCC. Even when their performances are poor as stand-alone front-ends, they always appear to provide complementary information to the MFCC. After concatenation with MFCC, the various representations (MRASTA, DCT-TRAPS, wLP-TRAPS) produce comparable performances.


Over time, several alternative architectures have been proposed to replace the three-layer MLP, with different motivations. This work experiments with Multi-stream, Hierarchical and Bottleneck approaches. Results using those architectures reveal the following novel findings:

• The Multi-stream framework that combines MLPs trained on long and short time spans outperforms the MFCC by approximately 10% relative as a stand-alone feature. Furthermore, it reduces the CER by 16% relative in concatenation with MFCC (a sketch of one such posterior combination follows this list).

• The hierarchical approach that sequentially increases the time context through a hierarchy of MLPs outperforms the MFCC by approximately 6% relative as a stand-alone feature and reduces the CER by 18% relative in concatenation with MFCC. Results obtained using the bottleneck approach (five-layer MLP; see the sketch after the summary paragraph below) show a similar trend.

• The MLP front-end that provides the lowest CER as a stand-alone feature is different from the front-end that provides the highest complementarity to spectral features. This effect is discussed in Section IV-C.

• MLPs trained using long time spans of the signal at the input become effective only when coupled with architectures that go beyond the three-layer structure, i.e., hierarchies or bottlenecks.
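As mentioned in the first bullet above, multi-stream processing combines the posterior streams of several MLPs. One common combination rule, in the spirit of the entropy-based method of [23], weights each stream by the inverse entropy of its frame-level posteriors; the sketch below is an illustrative reconstruction, not necessarily the rule used in these experiments.

import numpy as np

def inverse_entropy_combination(post_a, post_b, eps=1e-10):
    """Weight each stream, frame by frame, by the inverse entropy of its
    posterior distribution, so the more confident stream dominates.
    Both inputs have shape (n_frames, n_targets) and rows sum to one."""
    def entropy(p):
        return -np.sum(p * np.log(p + eps), axis=1, keepdims=True)
    w_a = 1.0 / (entropy(post_a) + eps)
    w_b = 1.0 / (entropy(post_b) + eps)
    combined = (w_a * post_a + w_b * post_b) / (w_a + w_b)
    return combined / combined.sum(axis=1, keepdims=True)  # renormalize

The combined posteriors can then be post-processed into TANDEM features exactly as the output of a single MLP.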

In summary, the most recent improvements are obtained by the use of architectures that go beyond the three-layer MLP rather than by the various input representations.
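To make the "beyond three layers" point concrete, the sketch below defines a five-layer bottleneck MLP in the style of [16], [26], written with PyTorch purely for illustration (the experiments reported here predate it); the layer sizes and sigmoid activations are assumptions. The narrow linear layer, not the output posteriors, supplies the acoustic features.

import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Five-layer MLP whose narrow middle layer provides the features."""
    def __init__(self, n_in, n_hidden, n_bn, n_targets):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.bn = nn.Linear(n_hidden, n_bn)  # narrow bottleneck layer
        self.post = nn.Sequential(nn.Sigmoid(),
                                  nn.Linear(n_bn, n_hidden), nn.Sigmoid(),
                                  nn.Linear(n_hidden, n_targets))

    def forward(self, x):
        # Full network, used only for training against phonetic targets
        # (class posteriors are obtained via softmax in the loss).
        return self.post(self.bn(self.pre(x)))

    def features(self, x):
        # Bottleneck activations serve as the acoustic front-end.
        return self.bn(self.pre(x))

After training, the upper layers are discarded and the bottleneck activations are decorrelated and fed to the HMM/GMM system, which fixes the feature dimensionality independently of the number of phonetic targets.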

These results have been obtained by training the HMM/GMM and MLPs on 100 hours of speech data and testing them in a simple LVCSR system. Evaluation systems are typically trained on a much larger amount of data, make use of multipass training, and are composed of a number of individual sub-systems that are combined together to provide the final recognition output. In this work, MLP features are investigated with a large amount of training data as well as on a state-of-the-art multipass system. The improvements from the small-scale study hold for the large amount of training data on speaker-independent and speaker-adapted systems, and after the lattice level combination. This is verified in concatenation with both MFCC and PLP features. When MLP features are used together with spectral features, the gain after lattice combination is in the range of 19-23% relative for the 5 hours of evaluation data. The comprehensive contrastive experiment on a multipass evaluation system shows that the improvements obtained on a small setup scale with the amount of training data and the parametric complexity of the system.

To the best of our knowledge, this is the most extensive study on MLP features for Mandarin LVCSR, covering all the front-ends including the most recent ones used in the 2008 GALE evaluation systems. The final evaluation system proved very competitive with current Mandarin LVCSR systems evaluated on the same test sets [31], [32].

VII. ACKNOWLEDGMENTS

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The authors would like to thank colleagues involved in the GALE project and Dr. Petr Fousek for their help.

REFERENCES

[1] Hermansky H. et al., "Connectionist feature extraction for conventional HMM systems," Proceedings of ICASSP, 2000.
[2] Chen B. et al., "Learning discriminative temporal patterns in speech: Development of novel TRAPS-like classifiers," in Proceedings of Eurospeech, 2003.
[3] Morgan N. et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition," Proceedings of ICASSP, 2004.
[4] Hermansky H. and Fousek P., "Multi-resolution RASTA filtering for TANDEM-based ASR," in Proceedings of Interspeech, 2005.


[5] Fousek P., Lamel L., and Gauvain J.-L., "Transcribing broadcast data using MLP features," Proceedings of Interspeech, 2008.
[6] Ellis D. et al., "Tandem acoustic modeling in large-vocabulary recognition," Proceedings of ICASSP, 2001.
[7] Morgan N. et al., "Pushing the envelope - aside," IEEE Signal Processing Magazine, vol. 22, no. 5, 2005.
[8] Vergyri D. et al., "Development of the SRI/Nightingale Arabic ASR system," Proceedings of Interspeech, 2008.
[9] Hwang M.-Y. et al., "Building a highly accurate Mandarin speech recognizer with language-independent technologies and language-dependent modules," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 7, 2009.
[10] Plahl C. et al., "Development of the GALE 2008 Mandarin LVCSR system," Proceedings of Interspeech, 2009.
[11] Park J. et al., "Efficient generation and use of MLP features for Arabic speech recognition," in Proceedings of Interspeech, Brighton, UK, September 2009, pp. 236-239.
[12] Schwarz P., Matejka P., and Cernocky J., "Extraction of features for automatic recognition of speech based on spectral dynamics," in Proceedings of TSD04, Brno, Czech Republic, September 2004, pp. 465-472.
[13] Fousek P., Extraction of Features for Automatic Recognition of Speech Based on Spectral Dynamics, Ph.D. thesis, Czech Technical University in Prague, Faculty of Electrical Engineering, 2007.
[14] Hermansky H. et al., "Towards ASR on partially corrupted speech," Proceedings of ICSLP, 1996.
[15] Valente F. and Hermansky H., "Hierarchical and parallel processing of modulation spectrum for ASR applications," in Proceedings of ICASSP, 2008.
[16] Grezl F. et al., "Probabilistic and bottle-neck features for LVCSR of meetings," in Proceedings of ICASSP, Honolulu, 2007.
[17] Bourlard H. and Morgan N., Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, 1994.
[18] Lei X. et al., "Improved tone modeling for Mandarin broadcast news speech recognition," Proceedings of Interspeech, 2006.
[19] Hermansky H. and Sharma S., "Temporal patterns (TRAPS) in ASR of noisy speech," in Proceedings of ICASSP, Phoenix, Arizona, USA, 1999.
[20] Dau T. et al., "Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers," Journal of the Acoustical Society of America, no. 102, pp. 2892-2905, 1997.
[21] Athineos M., Hermansky H., and Ellis D. P. W., "LP-TRAP: Linear predictive temporal patterns," in Proceedings of ICSLP, 2004, pp. 1154-1157.
[22] Allen J. B., Articulation and Intelligibility, Morgan and Claypool, 2005.
[23] Misra H., Bourlard H., and Tyagi V., "Entropy-based multi-stream combination," in Proceedings of ICASSP, 2003.
[24] Valente F. and Hermansky H., "Combination of acoustic classifiers based on Dempster-Shafer theory of evidence," Proceedings of ICASSP, 2007.
[25] Valente F. et al., "Hierarchical modulation spectrum for the GALE project," in Proceedings of Interspeech, 2009.
[26] Grezl F. and Fousek P., "Optimizing bottleneck features for LVCSR," in Proceedings of ICASSP, Las Vegas, 2008.
[27] Zhu Q. et al., "On using MLP features in LVCSR," Proceedings of ICSLP, 2004.
[28] Suchato A., Classification of Stop Place of Articulation, Ph.D. thesis, Massachusetts Institute of Technology, 2004.
[29] Hoffmeister B. et al., "Frame based system combination and a comparison with weighted ROVER and CNC," in Proceedings of Interspeech, Pittsburgh, PA, USA, Sept. 2006, pp. 537-540.
[30] Heigold G. et al., "Margin-based discriminative training for string recognition," Journal of Selected Topics in Signal Processing - Statistical Learning Methods for Speech and Language Processing, to appear December 2010.
[31] Chu S. M. et al., "Recent advances in the GALE Mandarin transcription system," in Proceedings of ICASSP, Las Vegas, NV, USA, Apr. 2008, pp. 4329-4333.
[32] Ng T. et al., "Progress in the BBN Mandarin speech to text system," in Proceedings of ICASSP, Las Vegas, NV, USA, Apr. 2008, pp. 1537-1540.
