Speech segmentation and interpretation using a semantic syntax-directed translation

Pattern Recognition Letters 1 (1982) 121-124, December 1982, North-Holland Publishing Company


R. DE MORI Department of Computer Science, Concordia University, Montreal H3G 1M8, Quebec, Canada

Attilio GIORDANA Istituto di Scienze dell'Informazione, Università di Torino, 10125 Torino, Italy

Pietro LAFACE CENS - Istituto di Elettrotecnica Generale, Politecnico di Torino, 10129 Torino, Italy

Received 19 July 1982

Abstract: A Semantic Syntax-Directed Translation is presented. Its rules are used to segment continuous speech and, at the same time, to produce phonetic interpretations.

Key words: Semantic syntax-directed translation, attributed grammar, speech recognition.

1. Introduction

Semantic Syntax-Directed Translation (SSDT) has recently been used for pictorial pattern recognition, as described by You and Fu (1979). This paper proposes another version of an SSDT, suitable for segmenting continuous speech into Pseudo-Syllabic Segments (PSS) and for assigning phonetic interpretations to them.

The speech signal is transformed into the frequency domain by a Fast Fourier Transform, from which some acoustic cues are extracted. These cues are unambiguously described by a language L1. An SSDT transforms these descriptions into a lattice of phonetic interpretations and delimits the bounds of each PSS, as described by De Mori (1982). Such a translation is controlled by rules inferred by the designers, based on a large number of experiments and on their knowledge of acoustic phonetics.

2. The semantic syntax directed translation scheme

Definition 2.1. According to You and Fu (1979), an SSDT is a 5-tuple

$(N, \Sigma, \Delta, S, P)$

where $N$ is a set of nonterminal symbols; $\Sigma$ is a set of input symbols; $\Delta$ is a set of output symbols; $S \in N$ is the start symbol; and $P$ is the set of rules.

The set of input symbols is given in Table 1. Every symbol is associated with a vector of attributes. The meaning of the attributes is given in Table 2. The set of output symbols is given in Table 3.
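The 5-tuple of Definition 2.1 can be written down as a small record, which is sometimes a convenient starting point for experiments. The following Python sketch is only an illustration: the class name `SSDT`, its field names, and the disjointness check are assumptions of this example, while the symbol names are those of Tables 1 and 3 and of the rules in Section 3. Rules are represented by their names only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SSDT:
    """Container for the 5-tuple of Definition 2.1 (hypothetical layout)."""
    nonterminals: FrozenSet[str]   # N
    inputs: FrozenSet[str]         # input symbols (Table 1)
    outputs: FrozenSet[str]        # output symbols (Table 3)
    start: str                     # S, an element of N
    rules: Tuple[str, ...]         # P, rule names only in this sketch

    def __post_init__(self) -> None:
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a nonterminal")
        # Assumption of this sketch: the two alphabets do not overlap.
        if self.nonterminals & self.inputs:
            raise ValueError("nonterminals and input symbols must be disjoint")


scheme = SSDT(
    nonterminals=frozenset(
        {"PSS", "a", "b", "V", "UN", "X1", "X2", "X3", "X4", "X5", "X6", "Y1"}
    ),
    inputs=frozenset(
        {"VOCPK", "SNPK", "BRSTPK", "NSPK", "UPK", "HIGH-DIP",
         "LONG-MIDDLE-DIP", "SHORT-MIDDLE-DIP", "LONG-DEEP-DIP",
         "SHORT-DEEP-DIP", "BUZZDIP", "NSDP", "TRNS", "SN", "NS", "VOC"}
    ),
    outputs=frozenset(
        {"VOCALIC", "NI", "NA", "NC", "SON", "SNCL", "VC", "SINIL"}
    ),
    start="PSS",
    rules=tuple(f"p{k}" for k in range(1, 26)),   # p1 ... p25
)
```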

The start symbol S is denoted PSS (Pseudo-Syllabic Segment). The rewriting rules in P have the following general form:

0167-8655/82/0000-0000/$02.75 © 1982 North-Holland


Table 1

Symbol                                          Attributes                    Description
VOCPK                                           tb, te, m1, m2, rmin, rmax    Vocalic peak of Total Energy (TE)
SNPK                                                                          Peak of TE for a sonorant consonant
BRSTPK                                                                        Burst peak
NSPK                                                                          Frication peak
UPK                                                                           Uncertain peak
(LONG | SHORT)^a (MIDDLE | DEEP | HIGH)^a DIP   tb, te, emin                  Dips of TE
BUZZDIP                                                                       Dip with buzz-bar
NSDP                                                                          Dip with frication
TRNS                                            tb, te, rmin                  Prevocalic consonantal transition
SN                                              tb, te                        Sonorant tract in a peak
NS                                              tb, te                        Nonsonorant tract in a peak
VOC                                             tb, te                        Vocalic tract in a peak

a LONG and SHORT are linguistic attributes concerning the duration, while MIDDLE, DEEP, and HIGH indicate the level of depth.

Table 2

Attribute   Description
tb          Time of beginning
te          Time of end
m1          Maximum signal energy in the peak
m2          Maximum energy in the 3-5 kHz band
rmin        Minimum ratio between low (200-900 Hz) and high (5-10 kHz) frequency energies
rmax        Maximum ratio between low and high frequency energies
emin        Minimum energy in a peak
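To make the attributes rmin and rmax of Table 2 concrete, the following Python sketch computes the ratio between low-band and high-band energy for each spectral frame of a peak and returns its extremes. Only the band limits come from Table 2. The function names, the use of a linear (not logarithmic) ratio, the assumption that each frame holds bins 0..N/2 of an N-point power spectrum, and the small floor that avoids division by zero are all assumptions of this example, not details given in the paper.

```python
from typing import List, Sequence, Tuple

LOW_BAND = (200.0, 900.0)       # Hz, from Table 2
HIGH_BAND = (5000.0, 10000.0)   # Hz, from Table 2


def band_energy(power: Sequence[float], sample_rate: float,
                band: Tuple[float, float]) -> float:
    """Sum the power-spectrum bins whose frequency lies in [band[0], band[1])."""
    n_fft = 2 * (len(power) - 1)        # power holds bins 0..N/2 of an N-point FFT
    hz_per_bin = sample_rate / n_fft
    lo, hi = band
    return sum(p for k, p in enumerate(power) if lo <= k * hz_per_bin < hi)


def ratio_extremes(frames: List[Sequence[float]], sample_rate: float,
                   floor: float = 1e-12) -> Tuple[float, float]:
    """Return (rmin, rmax) over the frames belonging to one peak."""
    ratios = [
        band_energy(p, sample_rate, LOW_BAND)
        / max(band_energy(p, sample_rate, HIGH_BAND), floor)
        for p in frames
    ]
    return min(ratios), max(ratios)


# Two synthetic frames: a 200-point spectrum sampled at 20 kHz (100 Hz per bin).
flat = [1.0] * 101
boosted = [10.0 if 2 <= k <= 8 else 1.0 for k in range(101)]
rmin, rmax = ratio_extremes([flat, boosted], sample_rate=20000.0)
```

With these synthetic frames the low band covers 7 bins and the high band 50 bins, so the flat frame gives a ratio of 7/50 and the boosted frame a ratio of 70/50.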

Table 3

Symbol    Primary phonetic feature
VOCALIC   Vowel
NI        Nonsonorant interrupted consonant
NA        Nonsonorant affricate consonant
NC        Nonsonorant continuant consonant
SON       Sonorant consonant
SNCL      Cluster of sonorant consonants
VC        The /v/ consonant
SINIL     Single intervocalic nonsonorant interrupted Lax consonant

$p_k:\ X \to YB;\ YG;\ \mathrm{Alg}(k)$

where $X \in N$ is a nonterminal symbol; $Y \in N^*$ is a possibly empty string of nonterminal symbols; $B \in \Sigma^*$ is a possibly empty string of input symbols; and $G \subset \Delta^*$ is a set of strings of output symbols.

The sequences YB, YG can appear in the reverse order, i.e., BY, GY. In any case, Y is in the same position in both expressions.

Alg(k) is an algorithm that may contain a condition made of a logical expression of predicates defined by semantic attachments; the rule can be applied only if the condition is satisfied.

Each symbol is associated with a vector of attributes. The attributes associated with X belong to the vector A(X). In a similar way, the attributes associated with the symbols of B are grouped into A(B).

The algorithm Alg(k) may contain a semantic rule f_k(A(Y), A(B)) which computes the attributes A(X) of X, given the attributes A(Y) and A(B). Another semantic rule f'_k(A(B)) computes the attributes of the output hypotheses G, given the attributes of the symbols in B which have been translated into G. The portion of the rule corresponding to the generation of hypotheses may not appear in some rules.
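The three parts of a rule (rewriting part, translation part, and Alg(k) with its condition and semantic rules) can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the implementation used by the authors: the paper omits the attribute composition rules, so the choice of composing only the time span (earliest tb, latest te) and of giving each output hypothesis that same span is hypothetical, as are the names `Rule`, `apply_rule`, `span`, and the threshold value. The rule shapes are those of p10 and p5 in Section 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Attrs = Dict[str, float]                 # attribute vector, e.g. {"tb": ..., "te": ...}
Hypothesis = Tuple[str, float, float]    # (output symbol, tb, te)


def always(_: List[Attrs]) -> bool:
    """Condition of a rule whose Alg(k) contains no predicate."""
    return True


def span(attrs: List[Attrs]) -> Attrs:
    """Assumed f_k: the left-hand side covers the whole right-hand side."""
    return {"tb": min(a["tb"] for a in attrs), "te": max(a["te"] for a in attrs)}


@dataclass
class Rule:
    name: str
    lhs: str                  # X
    rhs: List[str]            # Y and B, in order
    outputs: List[str]        # output symbols generated by the rule
    condition: Callable[[List[Attrs]], bool] = always
    f_k: Callable[[List[Attrs]], Attrs] = span


def apply_rule(rule: Rule, rhs_attrs: List[Attrs]
               ) -> Optional[Tuple[Attrs, List[Hypothesis]]]:
    """Return (A(X), hypotheses) if the condition holds, otherwise None."""
    if not rule.condition(rhs_attrs):
        return None
    lhs_attrs = rule.f_k(rhs_attrs)
    # Assumed f'_k: every hypothesis inherits the span of the rule.
    hyps = [(g, lhs_attrs["tb"], lhs_attrs["te"]) for g in rule.outputs]
    return lhs_attrs, hyps


MIN_CONSONANT_DURATION = 0.08   # seconds; hypothetical threshold for P3


def p3(attrs: List[Attrs]) -> bool:
    """Sketch of predicate P3: the consonantal part lasts longer than a threshold."""
    s = span(attrs)
    return s["te"] - s["tb"] > MIN_CONSONANT_DURATION


p10 = Rule("p10", "V", ["VOCPK"], ["VOCALIC"])
p5 = Rule("p5", "a", ["X3"], ["SNCL"], condition=p3)
```

For instance, `apply_rule(p10, [{"tb": 0.40, "te": 0.55}])` yields the hypothesis VOCALIC over the same interval, while `apply_rule(p5, [{"tb": 0.0, "te": 0.05}])` returns `None` because the condition fails.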

A parser analyzes a description of acoustic cues. Whenever a string of the input description appears to be generated by a rule pk, the second part of pk is used for generating phonetic hypotheses. Every time a complete derivation of the symbol PSS, of the type

$\mathrm{PSS} \Rightarrow^{*} \text{description of acoustic cues}$

is performed, a PSS is delimited. PSSs are described by the set of phonetic features produced by the translation. Each PSS, with its phonetic description, is then used for generating hypotheses about more detailed acoustic cues using context-dependent rules. Context-dependent rules will not be described in this paper.

The segmentation grammar is a context-free grammar. The use of conditions considerably reduces the nondeterminism, allowing fast parsing. Parsing is seen as a problem-solving activity and is not described here for the sake of brevity.


3. The rules of the segmentation grammar

The rules of the segmentation grammar are introduced in this section, together with comments that help in understanding them.

p1   PSS := a b ; a b ; Alg(1).
p2       := a V ; a V ; Alg(2).

Rules p1 and p2 establish that a PSS generates ab if the predicate P1, contained in Alg(1), is true; otherwise it generates aV. As will be seen later, b and V always generate one vocalic segment. The predicate P1 is true if, between the vocalic segment in b or V and the next vocalic segment, there are no strings generated by the nonterminal symbol UN, which will be introduced later.

Rules p1 and p2 have neither a semantic nor a translation part associated with them. The symbols a, b, and V are rewritten as follows:

p3    a := X1     ; SON        ; Alg(3).
p4    a := X2     ; SINIL      ; Alg(4).
p5    a := X3     ; SNCL       ; Alg(5).
p6    a := UN     ; UN         ; Alg(6).
p7    a := UN X4  ; UN PREVS   ; Alg(7).
p8    b := V      ; V          ; Alg(8).
p9    b := V X4   ; V POSTVS   ; Alg(9).
p10   V := VOCPK  ; VOCALIC    ; Alg(10).
p11   V := VOC    ; VOCALIC    ; Alg(11).
p12   V := UPK    ; VOCALIC    ; Alg(12).

The semantic rules for attribute composition in the algorithms Alg(3)-Alg(12) are not explained for the sake of brevity. SINIL means Single Intervocalic Nonsonorant Interrupted Lax consonant, while PREVS and POSTVS stand, respectively, for PREVocalic and POSTVocalic Sonorant consonant in a cluster with a nonsonorant consonant.

Alg(4) contains a predicate P2 which is true if X2 is rewritten with a simple symbol of the alphabet; Alg(5) contains a predicate P3 which is true if the duration of the consonantal part exceeds a threshold.

Alg(12) contains a predicate P4 which is true if UPK does not follow a V. P5 in Alg(9) is true if X4 precedes a UN.

The remaining rules are given in Table 4. For the sake of brevity, rules for attribute composition have been omitted.

Table 4

p13   UN := X5 ; NC
p14      := X5 Y1 ; NC
p15      := X6 ; NI
p16      := X6 Y1 ; NI
p17      := X6 X5 ; NA
p18      := X6 X5 Y1 ; NA
p19   X1 := (HIGH-DIP + LONG-MIDDLE-DIP + SNPK + SHORT-MIDDLE-DIP + TRNS)^{k>0}
p20   X2 := (SHORT-DEEP-DIP + SHORT-MIDDLE-DIP + BUZZDIP) (BRSTPK + SNPK)^{o} (TRNS)^{o}
p21   X3 := ((BUZZDIP + SHORT-MIDDLE-DIP + HIGH-DIP + LONG-MIDDLE-DIP) (SNPK + BRSTPK))^{k>0} (TRNS)^{o}
p22   X4 := (SNPK HIGH-DIP)^{k>0}
p23   X5 := (NS + NSDP + NSPK)
p24   Y1 := (HIGH-DIP + SHORT-MIDDLE-DIP + TRNS)
p25   X6 := (LONG-DEEP-DIP + SHORT-DEEP-DIP + BUZZDIP + SHORT-MIDDLE-DIP)

'o' means that the expression of which it is the exponent can be present only once or absent.
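Since the right-hand sides in Table 4 are regular expressions over input symbols, the rules for UN can be tried out with an ordinary regular-expression library. The Python sketch below covers only rules p13-p18 together with X5, X6, and Y1, which carry no exponents in the table as printed here. The encoding of a symbol string as names separated by spaces and the helper names `alt`, `encode`, and `un_hypotheses` are assumptions of this example; conditions and attributes are ignored.

```python
import re
from typing import List, Tuple


def alt(*names: str) -> str:
    """Alternation ('+' in Table 4) of symbol names, each followed by a space."""
    return "(?:" + "|".join(re.escape(n) + " " for n in names) + ")"


X5 = alt("NS", "NSDP", "NSPK")                                              # p23
X6 = alt("LONG-DEEP-DIP", "SHORT-DEEP-DIP", "BUZZDIP", "SHORT-MIDDLE-DIP")  # p25
Y1 = alt("HIGH-DIP", "SHORT-MIDDLE-DIP", "TRNS")                            # p24

# (rule name, pattern for the rewriting part, output symbol)
UN_RULES = [
    ("p13", X5, "NC"),
    ("p14", X5 + Y1, "NC"),
    ("p15", X6, "NI"),
    ("p16", X6 + Y1, "NI"),
    ("p17", X6 + X5, "NA"),
    ("p18", X6 + X5 + Y1, "NA"),
]


def encode(symbols: List[str]) -> str:
    """Join symbol names so that every name is terminated by one space."""
    return "".join(s + " " for s in symbols)


def un_hypotheses(symbols: List[str]) -> List[Tuple[str, str]]:
    """Return the (rule, output) pairs whose rewriting part generates `symbols`."""
    text = encode(symbols)
    return [(name, out) for name, pattern, out in UN_RULES
            if re.fullmatch(pattern, text)]
```

For example, `un_hypotheses(["LONG-DEEP-DIP"])` selects p15 with output NI, and `un_hypotheses(["LONG-DEEP-DIP", "NSPK"])` selects p17 with output NA. Terminating every name with a space keeps NS from matching a prefix of NSDP or NSPK.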

A special parser has been designed for using the translation rules. The specific knowledge about the type of rules and the predicates has made the parser design particularly effective, avoiding backtracking. As the parser is an 'ad hoc' tool for a specific application, it will not be described.

4. Example

The following acoustic description was obtained for the Italian word /prenotazione/ (reservation). For the sake of brevity, parameters and parameter composition are omitted in this example.

LONG-DEEP-DIP(t1, t2) SNPK(t2, t3) HIGH-DIP(t3, t4) VOCPK(t4, t5) LONG-MIDDLE-DIP(t5, t6) UPK(t6, t7) SHORT-DIP(t7, t8) TRNS(t8, t9) VOC(t9, t10) LONG-DEEP-DIP(t10, t11) NSPK(t11, t12) VOCPK(t12, t13) SHORT-MIDDLE-DIP(t13, t14) UPK(t14, t15)

As there is no UN between t5 and t7 and UPK(t6, t7) does not follow a V, P4 is true on UPK(t6, t7) and P1 is true on VOCPK(t4, t5); thus Alg(1) assigns t5 as the ending time of the PSS and p1 is applied.

As P1 is true, P5 is false and p9 cannot be applied; p8 and p10 are applied and VOCPK(t4, t5) is translated into VOCALIC(t4, t5).

P2 is false; p3, p4, and p5 cannot be applied because the first symbol that must be generated by a is LONG-DEEP-DIP. Based on the above considerations, the following chain of rules can be applied for generating the acoustic description between t1 and t5:

PSS  =(p1)=>   a b
     =(p7)=>   UN X4 b
     =(p15)=>  X6 X4 b
     =(p25)=>  LONG-DEEP-DIP X4 b
     =(p22)=>  LONG-DEEP-DIP SNPK HIGH-DIP b
     =(p8)=>   LONG-DEEP-DIP SNPK HIGH-DIP V
     =(p10)=>  LONG-DEEP-DIP SNPK HIGH-DIP VOCPK
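The chain of rules above can be replayed mechanically. The Python sketch below rewrites, at each step, the leftmost occurrence of the left-hand side of the named rule and records every sentential form. It is an illustration only: for p22 and p25, whose right-hand sides are regular expressions, the dictionary holds just the particular expansion used in this example, and the names `RULES`, `apply_rule`, and `derive` are assumptions of the sketch. Conditions, attributes, and the translation part are ignored.

```python
from typing import Dict, List, Tuple

# Rewriting parts used in the example derivation (one expansion per rule).
RULES: Dict[str, Tuple[str, List[str]]] = {
    "p1": ("PSS", ["a", "b"]),
    "p7": ("a", ["UN", "X4"]),
    "p8": ("b", ["V"]),
    "p10": ("V", ["VOCPK"]),
    "p15": ("UN", ["X6"]),
    "p22": ("X4", ["SNPK", "HIGH-DIP"]),
    "p25": ("X6", ["LONG-DEEP-DIP"]),
}


def apply_rule(form: List[str], rule_name: str) -> List[str]:
    """Rewrite the leftmost occurrence of the rule's left-hand side."""
    lhs, rhs = RULES[rule_name]
    i = form.index(lhs)          # raises ValueError if the rule is not applicable
    return form[:i] + rhs + form[i + 1:]


def derive(start: str, chain: List[str]) -> List[List[str]]:
    """Return every sentential form obtained while applying `chain` in order."""
    form = [start]
    steps = [form]
    for name in chain:
        form = apply_rule(form, name)
        steps.append(form)
    return steps


steps = derive("PSS", ["p1", "p7", "p15", "p25", "p22", "p8", "p10"])
for form in steps:
    print(" ".join(form))
```

The last sentential form is the string of input symbols observed between t1 and t5.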

The translation rules and the corresponding semantic rules associated with p7, p15, p22, and p10 generate the following phonetic transcription:

NI(t1, t2) PREVS(t2, t4) VOCALIC(t4, t5)

The parser now starts from the description corresponding to the PSS ending symbol and attempts to delimit the second PSS and to generate phonetic features about it. The results of this operation are given in the following.

PSS2: SON(t5, t6), SNCL(t5, t6) VOCALIC(t6, t7)

PSS3: NI(t7, t9) VOCALIC(t9, t10)

PSS4: NA(t10, t12) VOCALIC(t12, t13)

PSS5: SON(t13, t14), SINIL(t13, t14) VOCALIC(t14, t15)

5. Results

The Translation System has been extensively tested using unconstrained sentences of the Italian language spoken by many male and female speakers. Limited experiments have also been performed with other languages spoken by native speakers.

Segmentation errors were less than 1% on the average, two phonetic hypotheses were generated on a set of 9 phonetic classes. The right hypothesis was not generated in less than 1% of the cases.

References

You, K.C. and K.S. Fu (1979). A syntactic approach to shape recognition using attributed grammars. IEEE Transactions on Systems, Man, and Cybernetics SMC-9, 334-345.

De Mori, R. (1982). Computer model of speech using fuzzy algorithms. Plenum Press, New York.
