Click here to load reader

Grapheme-to-phoneme rules: A review - KTH

  • View
    0

  • Download
    0

Embed Size (px)

Text of Grapheme-to-phoneme rules: A review - KTH

Grapheme-to-phoneme rules: A reviewQuarterly Progress and Status Report
Grapheme-to-phoneme rules: A review Hunnicutt, S.
journal: STL-QPSR volume: 21 number: 2-3 year: 1980 pages: 038-060
S. Hunnicut'k*
Abstract
A set of mgl ish rules has been written for the speech synthesis sys- tem developed a t the Royal Ins t i tu te of Technology (KTH) i n Stockholm. The rules were written in the formalism provided by this system, an important feature of which is a higher-level progranming language which greatly f ac i l i t a t e s the extension of the system t o other lan- guages. A categorization of grapheme-to-phoneme rules i n terms of special contexts is presented here, and som3 aspects of the u t i l i t y of the formalism are discussed. The resul t s of a preliminary evalua- t ion of the grapheme-to-phoneme and lexical stress rules are also in- cluded.
Introduction
The value of a ccanprehensive set of grapheme-to-phoneme rules in
a text-to-speech system is well appreciated. Such rules allow the ,
processing of unrestricted text i n a ccanprehensive and ef f ic ient man-
ner. It is also recognized that sane form of "exceptions lexicon" is . ,
important since correctness of high-frequency m r d s in synthetic speech
is thus ensured. As text-to-speech m r k developed (see Cooper et al,
1 972) , it became clear that lexicons and grapheme-to-phoneme algorithms
muld be a natural, and eventually, necessary adjunct to phoneme-to-
speech processing. In order t o develop camplete systems, a n&r of
groups mrking w i t h synthetic speech inst i tuted research to construct
lexicons of various s izes and grapheme-to-phoneme and lexical stress
algorithms. Because grapheme-to-phoneme algorithms in English could ?
be quite ccanplex due to the lack of one-to-one grapheme-to-phone
correspondences, these researchers drew on studies previously done by
a number of l inguists such a s Chamsky and Halle (1968) and Venezb ,
(1 970) . Scarre early m r k on graphe-to-phoneme correspondences was done
a t Haskins Laboratories. An algorithm for the conversion of graphemes
t o phonemes i n English was written by Ainsmrth a t the University of
Keele in 1973. In 1974, algorithms were presented by both McIlroy of
Bell Laboratories and Hunnicutt of MIT. An Ehglish rule system was
* Employed a s a research assis tant a t the Dept. of Speech ccmmmica- t ion, KTH during the period February 15 - August 15, 1980.
STL-QPSR 2-3/1980
("Brown Corpus", Kucera & Francis, 1967) produces the correct pho-
nemic output for over f i f t y percent (53.6%) of the mrds i n running
text. The mst frequent 1,000 m r d s -- f ive times a s many -- cover
68.9% of the w r d s i n running text, an additional coverage of only
15.3% for 800 m r e mrds . This trend is mre and mre exaggerated
as w consider lower frequency mrds . Once again, increasing our
vocabulary by a factor of nine (considering the mst frequent
20,000 mrds) adds another 23.7% coverage. And the 40,000 lowest
frequency m r d s add a mre 7.4%. It is quite clear, then, that al-
though a small lexicon is remrkably productive, increasing its s ize
w i l l yield quickly dimishing returns. This realization has led t o
a much mre eff ic ient solution -- a modest set of grapheme-to-pho-
nem rules.
It should be noted, however, tha t t h i s solution precludes the
use of a lexicon for parts-of-speech and semantic information which
are necessary t o completely characterize an utterance. The exten-
sive mrph lexicon developed a t MIT (Allen e t a l , 1979) was construct-
ed with such considerations in mind.
Note: A comparison of the graphs for Swedish ( A l l & , 1970) and
English indicates that f e m r m r d s i n English cover a greater percent-
age of running text . It may also be observed tha t aver twice as many
different m r d s =re found i n the %dish texts a s in the English.
These phenomens indicate the greater incidence of inflectional
forms in Swedish and the greater freedom to compound; encouraging
new compound forms rather than s tr ings of descriptive adjectives
and nouns.
A diagram of the KTH text-to-speech system (Carlson & Granstrh,
1976) is shown i n Fig.111-A-2. The system accepts unrestricted input
text, and its f i r s t operation is t o convert t h i s text t o phonemes.
This conversion is accomplished e i ther by a small lexicon o r by tm
paral lel sets of rules: a set of grapheme-to-phoneme and lexical
stress rules, and a s e t of number-to-phoneme rules. The next set of
1-1 Lexicon 1 4 ,
numbers and yes, expressions F
P a
no
The KTH rule notation
In order t o separate the l inguist ic and phonetic knowledge base
of a text-ot-speech system from the programning necessary t o imple-
ment t h i s knowledge, a special higher-level programning language was
developed a t KTH (Carlson & GranstrGm, 1975a). This language was
written t o conform as closely as possible t o phonetic and linguistic
terminology, so tha t the knowledge represented by the rule sets in
the system wuld be clear t o those mrking i n language and speech re-
search.
The basic structure of the rules is:
where & marks the occurrence of the s tructural description X in the
context A B. The structural change is denoted by Y; other symbols
are delimiters. Either A or B may be empty; both A and B empty de-
fines a context-free rule. An insertion may be defined by X unspeci-
fied, and a deletion by unspecified Y. Recursive rule application is
made possible by a special rule label.
The symbols A, B, X , and Y are s t r ing elements which are defined
by the user. Such a definition is composed of up t o 32 dis t inct ive
features such as "voiced" or "vocalic," and variables such as formant
frequency and bandwidth. Basic features can be grouped into natural
classes, and both one- and tw-dimensional variables can be defined.
There is also a f ac i l i ty for specifying optional elements. Any of
the s tr ing elements i n a rule, then, might be a symbol o r a list of
specified features and values for variables (including arithmetic ex-
pressions) . Consider, for example, the grapheme-to-phoneme rule for conver-
sion of the past-tense/past part icipial mrd-final suff ix ED i n Eng-
l i s h a f t e r a voiced segment
The phoneme /D/ and the features "voiced," "segmental," "formative
boundary" and "mrd boundary" are defined by the user. Note the pos-
s i b l i t y of a ternary (+, - or unspecified) specification for each
feature.
I - Spel l ing-Out Rules (16) 1
I I n f l e c t i o n a l S u f f i x S t r i p p i n g ( 7 ) 1 wanted-want - e d
r a i n i n g - r a i n - i n g I
th--+/ 'a / b e f o r e a f i n a l " e " e . g . , b a t h e
4 S i n g l e Consonant Rules ( 3 4 ) - 1
p-6 b e f o r e a n o n - l i q u i d consonant i n m o r p h - i n i t i a l p o s i t i o n , e . g . , pneumatic
I
S u f f i x Rules (100) 2
v i b r a t o r y - v i b r - / a t 3 r i / - a t o r y e x c e p t e d from t h e stress c y c l e and marked a s " l a x o r " by two f e a t u r e s
P r e f i x Rules ( 2 4 ) - p e r m i t t i n g - - + / p a r + / m i t t i n g
v
Vowel Digraph Rules ( 3 4 )
I ea-+/n/ p r e c e d i n g "r - consonan tn I h
S i n g l e Vowel Rules ( 7 4 )
a + / € / p r e c e d i n g a s i n g l e con- s o n a n t and morph-f ina l " e " , e . g . , bake
I - S t r e s s Rules (19) I 2 1 d l s + a p p o l n t - ed
1 s e l e c t - i o n - s
Fig.111-A-3. Order of Grapheme-to-Phoneme and Lexical Stress Rules.
STL-QPSR 2-3/1980 47.
Swcia l contexts
Special contexts, in which less frequent grapheme-to-phoneme
correspondences occur, are seen t o be specified by a s few as ten
categories. Furthemre, these categories frequently define the
contexts for special pronunciations of both consonants and vowels.
These categories are shown i n Fig.111-A-4; the same, o r similar,
contexts for -1s and consonants are found opposite each other.
Examples of graphemes receiving correct pronunciations by rules in
these categories are also shown.
bbst special contexts can be defined in terms of mrpheme
boundaries. Categories (1 ) through (7) are of t h i s type. Cate-
gory (1) expresses the notion of mrph-ini t ial , providing palatal
versions of i n i t i a l /u/ and /v/, labelling i n i t i a l "y" as a con-
sonant and forming the context for special pronunciations i n i n i t i a l
consonant clusters. Category (2) for mrph-ini t ial syllables, Ca-
tegory (4 ) for morph- and mrd-final syllables, and a special case
of both, i.e. Category (3) for single syllables, define contexts i n
which long -1s are found and a few consonants can be singled out.
Long v w l s are also expected in the contexts defined by Categories
(5) and ( 6 ) , but short vowls are found in the context defined by
Category (7) . * Categories (5) and (6) also define contexts for syl-
labic "1" and " r , " and for a few other consonants and consonant clus-
ters.
The mst prol i f ic exception-generating contexts appear to be
those in which a liquid occurs, Category (8) . Twenty-six such rules
are currently included, although f ive are classif ied i n other cate- I
i
gories a s ~ 1 1 . The consonant examples l i s t ed in parenthesis re-
present cases where other rules do not apply due t o the presence - of a l iquid, i.e., "1" is not syllabified a f t e r a liquid and initial
"p" is not s i l en t before a liquid.
Category (9) is a vis ible and active category. This category
defines contexts i n both roots and suffixes (7 suffixes in the pre-
sent set of ru les) . It requires the lengthening of the f i r s t v m l ,
* Another special feature, N O R , is defined for suffixes which define the Category (7) context.
VOWELS
3) SINGLE-SmLE MORPH pie , - sour, - raced, striate, vocal, rn - - - -
4) WORDFINAL /MoRPHFINAL bonnie, - toe , - potato, bake, ma - - -
5) P-ING C - VOCALIC INFLECTIONAL SUFFIX baking, - miles, - loner, themes - -
6 ) PRECEDING C - Mom-FINAL "en bake, - mile, alone, theme - - -
7 ) PRECEDING C - SPECIAL "LAXING" SUFFIX r a t i f y , - u t i l i t y , - conic, edible - -
8) PRECEDING LIQUID / C - LIQUID a i r , earth, ward, shoulder, do l l , o r ... - - - - - -
9) I N ? H E m v - C - v h i h - v alienate, a m l i o r a t e , uzual , . . . at ion -- -- - - --
10) PRECEDING AN- VOWEL Ptolemaic, - meteor, - q o p i c , experience -
11) OTHERS ru l e , fue l , took, qual i ty , wrong, waste, hind, hi - - - - - - -
Fig. EI-A-4. SPECIAL CONTEXTS
Note : "C" indicates a s ingle consonant, and "V" a s ingle -1.
CONSONANTS
4) WORD-FINAL / MORPH-FINAL sing, t ic, inch, arguable, pariah_ - - - -
5) PRlTEDrn VOCF;LIC INFLECTIONAL SUFFIX wreathes, a n t i ~ e s , appLes, acKes -
6 ) PRMlEDING MOIiTH-FINAL "e" wreathe, anti%e, apple, acre, o r q e , cheese - - - -
8) PRECEDING / FOLLOWING A LIQUID (quadri l le , o r l e , place) - -
9) I N c m v - c - vhigb - v revision, dispersion, Russian, racial,,,ation - - - - -
10) PFUXEDING ANOTHER CONSONANT picture , clear, r a q , r a ~ b e r r y - -
11) OTHERS came, c i s t e rn , rancor, rancid, qwern, ayrosoope - - - -
rule context is compared. This method appears to be mre efficient,
and does not require the program code needed in the MIT method to
direct the various passes with the appropriate set of rules. In
fact, no new code was written for the English system at all: the
code existent for the Swedish system serves for the English rules
as well.
The major difference between the multi-pass method and this
one-pass procedure is in the manner of processing and ordering af-
fixes. Recognition and removal of all affixes as a first step in
the M T algorithm corresponds to seven rules in the KTH system which
recognize vocalic inflectional suffixes and insert a mrph boundary
marked with the feature "inflectional." Except for the loss of
"transparency" in the spelling of affixes, the effect df not recog-
nizing all affixes before consonant conversion appears to be rather
mall. Initial consonant clusters after unrecognized prefixes have , I
been observed to be mispronounced in a few cases in the KTH system. I
Hmver, the opposite effect may be obsewed in the MIT system:
strings incorrectly recognized as prefixes before application of
the consonant rules also lead to mistaken pronunciations.
There is a significant difference in the ordering of suffix I
i rules in the t m algorithms. Suffixes in the MIT algorithm are re-
cognized first and converted later (in any order). Recognition of
multiple suffixes, however, is contingent upon a verification of
the canpatability of their parts of speech. This verification re-
quires additional code and a table of parts of speech of each suf-
fix and its possible predecessor. In the one-pass system, suffixes
must be listed in the order of their probable occurrence frm the
right-hand side of the mrd so that their mrd-final or mrph-final
position is verified. A short study was undertaken for the purpose
of determining the best order for the suffix rules. Compound af-
fixes found in the mst frequent t~ thousand English mrds accord-
ing to the Brown Corpus (through a frequency of 48 per million) were
inspected. The resulting ordering of suffixes, with a few compound
suffixes added, correctly raves affixes frm this sample of mrds.
STL-QPSR 2-3/1980
Special stress effects due to suffixation, which are included
in the operation of the rule cycle in the MIT stress rules, are ac-
camplished in t~ ways in the KTH system. Stress-carrying suffixes
are pre-stressed in the suffix rules by noting primary or secondary
stress as a feature of the appropriate vowel. This stress m y be
adjusted later by the stress rules themselves. In the MIT rules,
special categories of stress-affecting affixes are stipulated as ex-
ceptions to the Main Stress Rule, the first rule in the cycle, and
may be adjusted by later rules, as in the KTH system. Suffixes
which have no effect on the stress cycle are recognized in both sys-
tems, by special categorization for both the Main Stress Iiule and
the Compound Stress Rule in the MIT rules, and by a feature "minus
stress cycle" in the KTH rules. This feature is also assigned to
mrd boundary symbols such as "space" and "period" in the KTH defi-
nitions of these symbols, and becanes part of the right context in
many stress rules.
Unlike the MIT system, the KTH formalism provides no device
with which to retain graphemes after their conversion to phonemes.
The retention of graphemes in the MIT formalism provides for the
specification of either letters of phonemes in both left and right
contexts. As a consequence, a substantial subset of rules differ
in specification of context. Although parallel tests have not been
made, it is believed that this difference gives neither set of
rules an advantage mrthy of note.
In addition, as previously discussed, the KTH programning lan-
guage allaws each phoneme and punctuation mark to be expressed in
terms of distinctive features. This type of specification makes the
rules more "transparent" than those in the MIT listing where variables
are used which, for the mst part, are lists of phonemes with a part-
icular feature. The facility of specifying optional elemnts in this
programning language has also allowed rules to be expressed mre
succinctly in several cases.
A ~reliminarv evaluation
Preliminary tests have been made to evaluate the usage of the
grapheme-to-phoneme and lexical stress rules, the correctness of
their phonemic output, and the speech output of the system with the
STL-QPSR 2-3/1980
complete se t of English rules. A l l t e s t s t o date, have employed
various samples from the Brown Corpus.
Tests for correctness of phonemic output and speech output
have been made on seven sets of t m hundred mrds each having the
following frequency range and correctness score:
TEST n&r Frequency range Score (Occurrence per million)
TEST 1 69,971 - 426 66.5% (100%)
TEST 2 108 - 91 73%
TEST 3 57 - 52 69%
TEST 4 32 - 30 76.5%
TEST 5 15 73%
TEST 6 7 67%
TEST 7 3 66%
Note that these scores are - not frequency weighted.
A l l mrds in TEST 1 are included in the lexicon. Wst of them
are function words, and are l is ted in the lexicon with a "+" follow-
ing their phonemic representation t o allow for their inclusion in
phrases. Their score, i f they had not been i n the lexicon (66.5%)
is rather low, reflecting their exceptional nature as high-frequen-
cy mrds. Words of frequency 7 and 3 also have rather low scores.
This seems t o be, for the most…