
Bootstrapping pronunciation models: a South African case study


M. Davel* and E. Barnard*

Introduction

Human language technologies (HLT) hold much promise for the developing world, especially for user communities that have a low literacy rate, speak a minority language, or reside in areas where access to conventional information infrastructure is limited. For example, information systems that provide speech-enabled services via a telephone can serve a user in his or her language of choice in a remote location without requiring additional expertise from the user or a sophisticated Internet infrastructure. In South Africa, while Internet penetration is still low, more than 90% of the population has access to a telephone and the potential of voice services for improving access to information is receiving increasing attention.1

The development of various forms of HLT — such as speech-based information systems, speech recognition, speech synthesis or multilingual information retrieval systems — requires the availability of extensive language resources. The development of these resources involves significant effort and can be a prohibitively expensive task when such technologies are developed for a resource-scarce language. This presents a significant obstacle to the development of HLT in the developing world, where few electronic resources are available for local languages, skilled computational linguists are scarce and linguistic diversity is high.

One such language resource required by many speech processing tasks is a pronunciation model. A pronunciation model for a specific language describes the process of letter-to-sound conversion: given the orthography of a word, it provides a prediction of the phonemic realization of that word. This component is required by many speech processing tasks — including general domain speech synthesis and large vocabulary speech recognition — and is often one of the first resources required when developing speech technology in a resource-scarce language.

In order to accelerate the uptake of HLT in the developing world, techniques are required that support the efficient development of language resources. In prior work,2–4 we have developed bootstrapping techniques to accelerate the development of a pronunciation model in a resource-scarce language. During the procedure known as ‘bootstrapping’, a model is improved iteratively via a controlled series of increments, at each stage using the previous model to generate the next. This self-improving circularity distinguishes bootstrapping from other incremental learning processes and is the key to its efficiency, as we discuss below.

In this work we define a generic framework for analysing a bootstrapping process and apply this framework to the task of creating pronunciation models for resource-scarce languages. We use the framework to derive a mathematical model that can be used to predict the amount of time required for the development of a pronunciation lexicon of a given size, analyse the effectiveness of the approach when developing a medium-sized (5000–10 000 word) pronunciation lexicon and demonstrate the various tools that can accelerate the bootstrapping process.

The paper is structured as follows: in the following section we provide background with regard to the bootstrapping of pronunciation models, and then we define a generic framework for analysing a bootstrapping process. We next describe the specific bootstrapping approach followed during the development of an Afrikaans pronunciation lexicon, analyse the results achieved and evaluate the efficiency of the process. Finally, we discuss initial results obtained when bootstrapping pronunciation models in two additional languages, Zulu and Sepedi, and discuss the implications of our results.

Background

In prior work, we developed an audio-enabled approach to the bootstrapping of pronunciation models, and demonstrated the efficiency of this process for small lexicons.2,3 The core of this bootstrapping process relies on an efficient grapheme-to-phoneme pronunciation prediction algorithm. This algorithm is used to generalize from the existing lexicon (dictionary) in order to predict additional entries, increasing the size of the lexicon in an incremental fashion. The bootstrapping system is initialized with a large word list (containing no pronunciation information), or with a pre-existing pronunciation lexicon if such a resource is available. The system chooses the next ‘best’ word to consider, predicts a pronunciation for this word and presents a human dictionary developer with an audio version [a] of the predicted pronunciation. The human acts as a ‘verifier’ and provides a verdict with regard to the accuracy of the word-pronunciation pair: whether the pronunciation is correct as predicted, or not.


*HLT Research Group, CSIR Meraka Institute, P.O. Box 395, Pretoria 0001, South Africa. E-mail: [email protected]; [email protected]

Abstract

Bootstrapping techniques can accelerate the development of language technology for resource-scarce languages. We define a framework for the analysis of a general bootstrapping process whereby a model is improved through a controlled series of increments, at each stage using the previous model to generate the next. We apply this framework to the task of creating pronunciation models for resource-scarce languages, iteratively combining machine learning and human knowledge in a way that minimizes the human intervention required during this process. We analyse the effectiveness of such an approach when developing a medium-sized (5000–10 000 word) pronunciation lexicon. We develop such an electronic pronunciation lexicon in Afrikaans, one of South Africa’s official languages, and provide initial results obtained for similar lexicons developed in Zulu and Sepedi, two other South African languages. We derive a mathematical model that can be used to predict the amount of time required for the development of a pronunciation lexicon of a given size, demonstrate the various tools that can accelerate the bootstrapping process, and evaluate the efficiency of these tools in practice.

[a] The audio version is created from the predicted phoneme string by concatenating audio samples of the specified phonemes, that is, the word is ‘sounded’ rather than synthesized. This implies that the only additional resource requirement is a set of sample phonemes, rather than a more complex speech synthesizer.
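To make the ‘sounding’ idea concrete, the sketch below concatenates pre-recorded per-phoneme audio samples into a single waveform. It is an illustrative assumption rather than the authors' implementation: the directory layout, file names and the use of WAV files are hypothetical, and all samples are assumed to share one audio format.

```python
# Minimal sketch of 'sounding' a predicted pronunciation by concatenating
# per-phoneme audio samples (footnote [a]); paths and file names are hypothetical.
import wave

def sound_word(phonemes, sample_dir="phoneme_samples", out_path="sounded.wav"):
    """Concatenate per-phoneme WAV samples that all share the same format."""
    frames, params = [], None
    for ph in phonemes:
        with wave.open(f"{sample_dir}/{ph}.wav", "rb") as w:
            if params is None:
                params = w.getparams()          # take the format from the first sample
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)
    return out_path

# Example: sound_word(["r", "iy", "d"]) would 'sound' the pronunciation /r iy d/.
```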

The verifier can also indicate that the word itself is invalid (e.g. from a different language or a misspelled word), ambiguous depending on context (e.g. the word ‘read’ that has two pronunciations: /r iy d/ and /r eh d/), or that he or she is uncertain about the status of the word-pronunciation pair. If the word is valid but the pronunciation is wrong, the verifier specifies the correct pronunciation by removing, adding or replacing phonemes in the presented pronunciation. A new audio version is generated, for which the verifier can specify a new verdict. At this stage, the learning algorithm updates the word-to-pronunciation model in order to account for the corrected pronunciation. The process is repeated with updated predictions until a pronunciation lexicon of sufficient size is obtained. A related approach was developed by Maskey et al.,5 achieving comparable results.

During prior experimentation, a number of tools were developed in support of the bootstrapping process, including an efficient grapheme-to-phoneme rule extraction algorithm (Default&Refine6), automatic error detection tools4 and a software system that supports the bootstrapping process. While rule extraction is typically computationally efficient for small lexicons, the time required to extract grapheme-to-phoneme rules from a sizeable lexicon slows the process significantly, especially if continuous rule updating is required. In our experience, the bootstrapping system became noticeably cumbersome at approximately 2000 words. All the lexicons developed during prior experimentation were therefore fairly small (approximately 1000 to 2000 words). To overcome this barrier, a new incremental version of the grapheme rule extraction algorithm was defined that performs local refinement of grapheme-to-phoneme rules, accepting a small degradation in learning efficiency in exchange for fast update times.4 Incremental updating is performed continuously, after every word that is edited by the dictionary developer. After a pre-defined interval the system performs a full rule update (a global optimization operation we term ‘synchronization’), which is slower but results in more accurate models. The length of the update interval is defined by the dictionary developer and indicates the amount of ‘model drift’ that is allowed. (The length of a typical update interval is 50 to 100 corrected words; note that the number of actual words added to the lexicon during this interval is larger.) In this paper, we evaluate these tools during the development of a medium-sized pronunciation lexicon in a resource-scarce language, during a continuous process that starts well beyond the previous 2000-word barrier.
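The control flow of this predict-and-verify cycle can be summarized in a few lines. The sketch below is deliberately simplified: the one-phoneme-per-grapheme ‘model’ stands in for Default&Refine, and the verifier is simulated against a small gold lexicon, so the names and logic are illustrative assumptions rather than the system described above.

```python
# Simplified sketch of the bootstrapping loop: predict, let a verifier accept,
# correct or reject, and retrain on the growing lexicon. The naive per-grapheme
# model below is a stand-in for Default&Refine, not the actual algorithm.
from collections import Counter, defaultdict

def train_g2p(lexicon):
    """Map each grapheme to its most frequent phoneme in the current lexicon."""
    votes = defaultdict(Counter)
    for word, phones in lexicon.items():
        if len(word) == len(phones):                 # naive one-to-one alignment only
            for g, p in zip(word, phones):
                votes[g][p] += 1
    return {g: c.most_common(1)[0][0] for g, c in votes.items()}

def predict(model, word):
    return [model.get(g, g) for g in word]           # fall back to the grapheme itself

def ask_verifier(word, pron, gold):
    """Simulated verifier: reject invalid words, accept or correct pronunciations."""
    if word not in gold:
        return "invalid", None
    return ("correct", pron) if pron == gold[word] else ("corrected", gold[word])

def bootstrap(word_list, lexicon, gold):
    for word in word_list:                           # in practice: the next 'best' word
        model = train_g2p(lexicon)                   # incremental update each cycle
        verdict, pron = ask_verifier(word, predict(model, word), gold)
        if verdict in ("correct", "corrected"):
            lexicon[word] = pron                     # only valid words enter the lexicon
    return lexicon

gold = {"pen": ["p", "e", "n"], "net": ["n", "e", "t"], "ten": ["t", "e", "n"]}
print(bootstrap(["pen", "net", "ten", "xyz"], {"pet": ["p", "e", "t"]}, gold))
```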

Bootstrapping framework

During bootstrapping, a model is grown systematically, becoming increasingly accurate from one increment to the next. When analysing the bootstrapping process, it soon becomes apparent that the process relies on an automated mechanism to convert among various representations of the model considered. Each representation describes the same task in a format that provides a specific benefit, either because the representation is amenable to automated modelling and analysis or because it describes the current model in a way that is convenient for a human to verify and improve. For example, during the bootstrapping of pronunciation models, the correctness of an entry in the pronunciation lexicon (the first representation) is easily verified by a human, whereas automated analysis of the extracted grapheme-to-phoneme rules (the second representation) can be used to identify possible errors that require re-verification. In contrast, during the bootstrapping of acoustic models for speech recognition, both representations are amenable to automated analysis and updating: the acoustic models (the first representation) are typically built from the phonemic segmentation of the audio data (the second representation) through automated training, clustering and re-training of acoustic models, while the phonemic segmentations of the audio data are created through automated Viterbi alignment of the phonemic transcriptions, utilizing the current acoustic models. By converting back and forth between these two representations in a bootstrapping fashion, both the phonemic segmentations and the acoustic models become increasingly accurate.

Bootstrapping components

The general bootstrapping concept using two model representations is depicted in Fig. 1. The number of representations is limited to two for the sake of simplicity — three or more representations can also be included in the model.


Fig. 1. General bootstrapping concept, using two model representations.


The following components play a role during bootstrapping:

• Alternative representations: two or more representations of the same model lie at the heart of the bootstrapping process. In Fig. 1 these are indicated as A and B. During the bootstrapping of pronunciation models, one of these representations (say, ‘A’) can easily be verified by a human, while the second representation (say, ‘B’) is more amenable to verification through automated analysis. For other applications, either none or both representations may be amenable to either type of verification.

• Conversion mechanisms: each conversion mechanism (indicated as A→B and B→A) provides an automated or semi-automated means to convert data from one representation to another.

• Verification mechanisms: once converted to a specific representation, the model can be improved via automated or human (manual) verification, indicated in the figure by the ‘Verify’ components.

• Base data: this term is used to refer to the domain of the model. The current base indicates the domain that has been used in training the current model, and consists of a subset of or the full base data set. The current base data are implicitly or explicitly included in each of the two representations.

• Increment mechanisms: the Add components are used to increase the current base during bootstrapping. At the one extreme, all model instances can be included in a single increment; at the other, a single instance can be added per bootstrapping cycle. As a possible extension, the increment mechanisms may use active learning techniques7,8 in order to select an appropriate set of instances to add.

• External data: this term refers to additional data sources that are used during bootstrapping. Typically, external data are used to initialize a bootstrapping system with models that were developed on a related task.

Bootstrapping process

Prior to bootstrapping, the various representations are initialized in preparation for the first iteration. Typically only a single representation requires initialization (A in this instance). External data may be included in this process, or the bootstrapping process starts without any initial knowledge of the task not included in the base data. The increment mechanism chooses the first base set to use. Once initialized, the bootstrapping process consists of the following steps, many of which are optional in the general case, as indicated:

1. The current base, as well as the current representation A, is used to generate the next representation B. During pronunciation modelling, A consists of word-pronunciation pairs and B of grapheme-to-phoneme rules.

2. B is optionally verified, either manually or automatically. During pronunciation modelling, this step is typically omitted, unless the system is requested to flag possible errors for re-verification.

3. Based on the current state of the bootstrapping system, the increment mechanism increases the current base set. During pronunciation modelling, additional words are added to the word list being considered.

4. The current base, as well as the current representation B, is used to generate representation A.

5. A is optionally verified, either manually or automatically. During pronunciation modelling, this step is not optional and consists of human verification of the newly generated word-pronunciation pairs.

6. Based on the current state of the bootstrapping system, the increment mechanism increases the current base set. (In the general case either step (3) or (6) is optional. During pronunciation modelling, step (6) is omitted.)

This cycle is repeated until a sufficiently accurate and/or comprehensive model is obtained. Note that while step (5) is required during the bootstrapping of pronunciation models, both steps (2) and (5) may be optional for other applications (such as the bootstrapping of acoustic models described above) where model improvement occurs during the automated representation conversion process rather than through post-conversion verification.
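The skeleton below renders these steps as code, with both representations and all mechanisms supplied as functions. The toy instantiation at the end (representation A is a set of labelled numbers, representation B is a numeric threshold, and the ‘human’ verifier knows the true threshold) is an illustrative assumption chosen only to keep the skeleton executable; it is not a claim about any particular task.

```python
# Generic rendering of the cycle in Fig. 1 and steps 1-6 above.
def bootstrap(full_base, init_a, a_to_b, verify_b, b_to_a, verify_a, increment, cycles):
    base = increment([], full_base)            # choose the first base set
    A, B = init_a(base), None                  # initialize representation A
    for _ in range(cycles):
        B = a_to_b(A)                          # step 1: generate B from A
        B = verify_b(B)                        # step 2: optional verification of B
        base = increment(base, full_base)      # step 3: grow the current base
        A = b_to_a(B, base)                    # step 4: regenerate A from B
        A = verify_a(A)                        # step 5: (human) verification of A
    return A, B

data = [0.2, 0.4, 0.9, 1.3, 1.7, 2.5]
A_final, B_final = bootstrap(
    data,
    init_a=lambda base: [(x, x > 1.0) for x in base],
    a_to_b=lambda A: sum(x for x, _ in A) / len(A),        # toy 'rule': mean as threshold
    verify_b=lambda B: B,                                  # no automated check of B
    b_to_a=lambda B, base: [(x, x > B) for x in base],     # predict labels for the base
    verify_a=lambda A: [(x, x > 1.0) for x, _ in A],       # 'human' corrects the labels
    increment=lambda cur, full: full[: len(cur) + 2],      # add two instances per cycle
    cycles=3,
)
print(A_final, B_final)
```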

Bootstrapping efficiency

The main aim of a bootstrapping system is to obtain as accurate a model as possible from available data. When human intervention is used to supplement or create the training data themselves, the aim shifts towards minimizing the amount of human effort required during the process. This is the focus of our analysis, and we therefore measure bootstrapping efficiency as a function of model accuracy:

Efficiency(a) = t_manual(a) / t_bootstrap(a),   (1)

where a is the accuracy of the current model as measured against an independent test set, and t_bootstrap(a) and t_manual(a) specify the time (measured according to amount of human intervention) required to develop a model of accuracy a with and without bootstrapping, respectively.

Bootstrapping is analysed according to bootstrapping cycles. While bootstrapping, not all base instances result in valid data that can be included in the model training process. Of the instances that define valid base data, some will be correctly represented by the initial representation (B), and others will contain errors. Recall from the Background section that the verifier can specify four verdicts per word: identifying the word-pronunciation pair as invalid, ambiguous, uncertain or correct, possibly after correcting the pronunciation manually. We define a number of variables to assist us in the analysis of these instances: at the start of cycle x of the bootstrapping process, we define n(x) as the number of instances included in the current base, n_invalid(x) as the number of instances that are invalid, ambiguous or uncertain, n_correct(x) as the number of instances that are valid and correct, and n_error(x) as the number of instances that are valid and incorrect. For these variables, the following will always hold:

n(x) = n_invalid(x) + n_valid(x),
n_valid(x) = n_correct(x) + n_error(x).   (2)

Related incremental variables are used to represent the increase of word-pronunciation pairs of a specific type during cycle x, namely inc_n(x), inc_n_invalid(x), inc_n_valid(x), inc_n_correct(x) and inc_n_error(x). The same intervention mechanism may have different cost implications based on the status of the instance. In the simplest case, the status of an instance may simply be correct, incorrect or invalid, but subtler differences are possible, e.g. the number of changes required to move from an incorrect to a correct version. The expected status of a newly predicted instance changes as the system becomes more accurate. Prior to human intervention at stage x of the bootstrapping process, the number of instances of each status within the current increment is given by:

inc_n(x) = Σ_{s ∈ status} inc_n(s, x),   status = {invalid, correct, error_1, error_2, error_3, ...},   (3)

where error_x indicates that the current pronunciation (prior to verification) contains x phoneme errors, that is, x substitutions, deletions or insertions are required to correct the pronunciation. Note that the correct status is equivalent to the error_0 status.
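In code, this bookkeeping amounts to tallying the verifier's verdicts per increment; the sketch below uses illustrative verdict strings to show how inc_n(s, x) and the derived counts in Equation (2) can be read off directly.

```python
# Tally one cycle's verdicts by status: inc_n(s, x) and the counts of Equation (2).
from collections import Counter

verdicts = ["correct", "error1", "correct", "invalid", "error2",
            "correct", "ambiguous", "error1", "correct", "uncertain"]

inc_n = Counter(verdicts)                                    # inc_n(s, x) for this cycle
inc_n_invalid = sum(inc_n[s] for s in ("invalid", "ambiguous", "uncertain"))
inc_n_error = sum(v for s, v in inc_n.items() if s.startswith("error"))
inc_n_correct = inc_n["correct"]
inc_n_valid = inc_n_correct + inc_n_error                    # Equation (2), per increment

print(inc_n_valid, inc_n_correct, inc_n_error, inc_n_invalid)  # -> 7 4 3 3
```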

Combining machine learning and human intervention in a way that minimizes the amount of human effort required during the process can be achieved in two ways: (a) by minimizing the effort required by the human verifier to identify errors accurately, and (b) by optimizing the speed and accuracy with which the system learns from the human input. The remainder of this section describes the various factors that influence the efficiency of the bootstrapping process from both these perspectives. By analysing these factors individually, it becomes possible to understand better the effect on the overall process when optimizing any single factor, and it also becomes possible to predict better the effort involved when bootstrapping a specific resource of a specific size.

Human factors

The first human factor that impacts on the efficiency of the bootstrapping process relates to required user expertise: whether the task requires expert skills, or whether a limited amount of task-directed training is sufficient. If it is assumed that the user has the skills required, the following measurements provide an indication of the efficiency of the bootstrapping process for a specific user:

• User learning curve: the time it takes for a specific user to become fully proficient using the bootstrapping system. Measured as t_train; initial training data are assumed to be discarded.

• Cost of intervention: the average amount of user time required per intervention i when an instance has status s, for a fully trained user using the bootstrapping system. Measured as t_verify(i, s); a different average cost may be associated with different types of interventions. If more than one intervention is used to generate a single instance during one cycle of bootstrapping, the combination of mechanisms is modelled as an additional (single) mechanism. Depending on the bootstrapping process, it may be more realistic to measure this value for a set of instances. During pronunciation modelling, there are only two types of interventions: verifying predictions and verifying the list of errors.

• Task difficulty: the average number of errors generated by a fully trained user using the bootstrapping system. Indicated by error_rate_bootstrap(i, s), this is measured as the percentage of errors generated by a user using intervention mechanism i to verify an instance initially in state s. For example, when a single predicted pronunciation that contains 2 errors is verified, error_rate_bootstrap(verify_single, error_2) is less than 0.5%.

• Quality and cost of user verification mechanisms: implicit in the above two measurements are the cost and effect on error rate of additional assistance provided during user intervention. Rather than modelling additional user assistance provided during existing interventions separately, the combined intervention is again modelled as an additional type of intervention. In the same way, automated verification mechanisms are modelled as additional interventions.

• Difficulty of manual task: the average number of errors for a fully trained user developing instances manually. Indicated by error_rate_manual, this is measured in percentage as the average number of errors per 100 manual instances developed, where each manual instance can be associated with an individual base data instance in the bootstrapped system.

• Manual development speed: the average amount of time per instance development for a fully trained user performing this task manually, measured as t_develop.

• Initial set-up cost: the time it takes for a user to prepare the initial system for manual development or bootstrapping, measured in time as t_setup_manual and t_setup_bootstrap, respectively.

Machine learning factors

The faster a system learns between verification cycles, the fewer corrections are required from a human verifier, and the more efficient the bootstrapping process becomes. From a machine learning perspective, learning speed and accuracy are directly influenced by the following factors:

• Predictive accuracy of current base: modelled as the expected number of instances of each status at a specific cycle of the bootstrapping process, and indicated by E(inc_n(s, x)). Implicit in this measurement are four factors:
  — Accuracy of representations: the ability of the chosen representations to model the specific task.
  — Set sampling ability: the ability to identify the next ‘best’ instance or instances to add to the knowledge base, possibly using active learning techniques.
  — System continuity: the speed at which the system updates its knowledge base. This has a significant effect on system responsiveness, especially during the initial stages of bootstrapping.
  — Robustness to human error: the stability of the conversion mechanisms and chosen representations in the presence of noise introduced by human error.

• On-line conversion speed: any additional time costs introduced when computation is performed while a human verifier is required to be present (but idle while waiting for the computation to complete). It is measured as the average time per number of valid instances developed and indicated by t_idle(n).

• Quality and cost of verification mechanisms: the average amount of time required to utilize additional assistance mechanism j — from a computational perspective — when an instance is in status s, measured as t_auto(j, s).

• Validity of base data: using invalid data slows the bootstrapping process, especially if human intervention is required to verify the validity of base data. This is measured in percentage of base data, and is indicated by valid_ratio.

Two additional factors that are not included explicitly in the general model, but can be included based on the requirements of the specific bootstrapping task, are:

• Conversion accuracy: the ability of the conversion model to convert between representations without loss of accuracy.

• Effect of incorporating additional data sources: the ability of the system to boost accuracy by incorporating external data sources at appropriate times.

System analysis

The combined effect of the machine learning factors and human factors provides an indication of the expected cost of using the bootstrapping system. The time to develop a bootstrapped model via N cycles of bootstrapping, using a set of interventions I, is given by:

t_bootstrap(N, I) = t_setup_bootstrap + t_train + t_iterate(N, I),   (4)

where t_iterate(N, I) combines the cost of the various iterations, excluding the cost associated with system setup and user training. The expected value of inc_n(s, x) depends on the specific conversion mechanism, and is influenced by valid_ratio and error_rate_bootstrap(i, s).

This cost of bootstrapping can be compared with the expected cost of developing n_manual instances via a manual process:

t_manual = t_setup_manual + t_develop · n_manual.   (5)

If n_bootstrap and n_manual are chosen such that

E[inc_n(correct, n_bootstrap)] = E[inc_n(correct, n_manual)],   (6)

where the number of valid instances generated during bootstrapping is given by:

n_bootstrap = Σ_{x=1}^{N} inc_n(valid, x),   (7)

the accuracy of each of the two systems is approximately equivalent, and the values of Equations (4) and (5) can be combined according to Equation (1) in order to obtain a measure of the expected efficiency of the bootstrapping process.
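The comparison can be sketched numerically. The snippet below collapses Equation (9) into a single average verification cost per word plus a fixed idle cost per cycle, which is a simplifying assumption; all numeric values are placeholders rather than measurements.

```python
# Hedged sketch of Equations (4)-(7) and (1): bootstrapped versus manual cost
# at (approximately) the same number of valid entries. Values are placeholders.
def bootstrap_cost(cycles, words_per_cycle, t_verify_avg, t_idle, t_setup=0.0, t_train=0.0):
    """Equation (4): set-up + training + per-cycle verification and idle time."""
    per_cycle = words_per_cycle * t_verify_avg + t_idle
    return t_setup + t_train + cycles * per_cycle

def manual_cost(n_words, t_develop, t_setup=0.0):
    """Equation (5): t_setup_manual + t_develop * n_manual."""
    return t_setup + n_words * t_develop

n_valid = 20 * 400                                   # Equation (7): valid words produced
t_boot = bootstrap_cost(cycles=20, words_per_cycle=400, t_verify_avg=4.0, t_idle=180.0)
t_man = manual_cost(n_valid, t_develop=19.2)
print(t_man / t_boot)                                # Equation (1): efficiency estimate
```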

Bootstrapping a medium-sized lexicon

Experimental approach

We use the above framework to improve and analyse the efficiency of our pronunciation model bootstrapping process. During prior experimentation, a number of bootstrapped lexicons were created, each less than 2000 words. In this work we first cross-analyse these prior lexicons and manually verify discrepancies in order to obtain a verified lexicon of approximately 5000 words. We then perform bootstrapping according to the bootstrapping framework described above, and initialize the bootstrapping system using the verified lexicon. We use a linguistically sophisticated first-language speaker (Developer ‘C’) who has previous experience in using the bootstrapping system to develop the lexicon, and measure the time taken for each action.[b] We obtain a word list from random text from the Internet, and order words randomly (in the list of new words to be predicted). We use incremental Default&Refine for active learning in between synchronization sessions and standard Default&Refine during synchronization. We set the update interval (number of words modified between synchronizations) to 50.

At the end of the bootstrapping session we perform error detection. (No additional error detection is performed during bootstrapping.) We first extract the list of graphemic nulls, and identify possible word errors from the graphemic null generators.[c] We then extract Default&Refine rules from the full lexicon with the purpose of using these rules to identify errors, similar to the process described in our earlier work.4 Only words that generate new grapheme-to-phoneme rules are considered for manual re-verification. We list as possible errors all words from word sets that result in a new rule and contain fewer than five words, and verify these words manually.[d]
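The rule-based part of this error-detection step reduces to a simple grouping-and-filtering heuristic. In the sketch below, extract_rules is a hypothetical stand-in for Default&Refine that returns (rule, supporting word) pairs; only the flagging logic and a toy usage are illustrated.

```python
# Flag words whose entries generate a new rule supported by a small word set.
from collections import defaultdict

def flag_suspect_words(lexicon, extract_rules, known_rules, min_support=5):
    words_per_rule = defaultdict(set)
    for rule, word in extract_rules(lexicon):        # (rule, supporting word) pairs
        words_per_rule[rule].add(word)
    suspects = set()
    for rule, words in words_per_rule.items():
        if rule not in known_rules and len(words) < min_support:
            suspects.update(words)                    # new rule with weak support
    return sorted(suspects)

# Toy usage: each 'rule' is simply an observed (grapheme, phoneme) pair.
toy = {"pen": "p e n", "ten": "t e n", "net": "n e t",
       "pin": "p i n", "tin": "t i n", "nip": "n i p", "pan": "p a n"}
toy_rules = lambda lex: [((g, p), w) for w, ps in lex.items()
                         for g, p in zip(w, ps.split())]
print(flag_suspect_words(toy, toy_rules, known_rules=set(), min_support=2))  # -> ['pan']
```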

Results

The parameters of the bootstrapped lexicon are listed in Table 1. We measure the time taken by the verifier to perform each verification action, and analyse the effectiveness of the verification process from a human factors perspective. Figure 2 illustrates the verification process as the lexicon grows from 5500 to 7000 words. We plot the time taken to verify each valid word, indicating whether 0, 1, 2 or 3 corrections are required for each word as it is added to the lexicon. (The number of training words on the x-axis includes both valid and invalid words.)

We note the following:

• User learning curve: the verifier was proficient in using the system prior to the current bootstrapping session, and further training was not required. The value of t_train during earlier sessions was <120 min.

• Cost of intervention: in this experiment we used two intervention mechanisms: verifying predictions and verifying the list of possible errors. Table 2 provides the average verification times observed for Developer C where the intervention mechanism is a single verification of a prediction (t_verify(single, s)) for words that are in different states s prior to verification. Verification of the list of possible errors took approximately 27 minutes (for approximately 3000 words).

• Task difficulty: during the bootstrapping process, 3019 words were added to the lexicon, of which 181 were invalid or ambiguous. During error detection, 9 errors were found in the remaining 2838 valid words. Given our previous analysis,4 we estimate that this represents at least 50% of the errors, and therefore estimate the actual error rate to be 0.6%.[e] It is interesting to note that, while our error detection protocol resulted in a re-verification of 3.3% of the full lexicon (1832 grapheme-specific patterns, or about 300 words), the average position of each error in the ordered error prediction list was at 0.67% of the full training lexicon, with the majority of errors found in the first 0.1% of words, that is, the first or second pattern on the per-grapheme list of potential errors.


Table 1. Parameters of the Afrikaans bootstrapped lexicon.

Number of graphemes in orthography: 40
Number of phonemes in phoneme set: 43
Number of words in starting lexicon: 4923
Total number of words in bootstrapped lexicon: 8053
Number of valid words in bootstrapped lexicon: 7782
Number of derived rules (Default&Refine): 1471

Fig. 2. Time taken to verify words requiring zero, one, two or three corrections, as a function of the number of words verified. For the first three measures, the averages were computed for blocks of five words each.

[b] In prior work we have shown that even linguistically unsophisticated users can be trained to use the bootstrapping system with high accuracy.6
[c] Graphemic nulls are inserted during alignment where a single grapheme maps to more than one phoneme. See ref. 16 for a detailed description of graphemic nulls and graphemic null generators.
[d] A word set associated with a rule tends to have either only one or two words associated with it, or a large set of words: within an acceptable range, the error detection process is not sensitive with regard to the exact cut-off point selected.
[e] 18 errors in 2838 valid words.


• Difficulty of manual task: error_rate_manual is assumed to be <0.5%, which is an optimistic estimate for the range of manual development speeds evaluated.

• Manual development speed: different values of t_develop are used for comparison, ranging from 19.2 s to 30 s (as listed in Table 3), with 19.2 s again an optimistic estimate.

• Initial set-up cost: as this is an extension of an existing system, no further set-up cost was incurred.[f]

From a machine learning perspective, the following is observed:

Predictive accuracy of current base: measured directly during experimentation, the number of corrections required per word added to the lexicon (inc_n(s, n)) is depicted in Fig. 3. We plot the running average (per blocks of 50 words) of the number of corrections as a function of the number of words verified.

On-line conversion speed: the average time taken for a synchronization event was 50.15 seconds (σ = 7.72 s). This value increased gradually from 35 s during the initial cycle, to 56 s in the final cycle.

Quality and cost of verification mechanisms: the computational times required for both verification mechanisms are included in the verification times. No additional processing is required.

Validity of base data: 94% of the base data was valid.

Based on our observations during this experiment, we can assign approximate values to the different costs and efficiencies involved during bootstrapping of an Afrikaans lexicon up to 10 000 words. We list these values in Table 3.

Using the above observations, we obtain the following expected cost of N cycles of bootstrapping:

E[t_bootstrap(N)] = E[t_setup_bootstrap] + E[t_train] + E[t_iterate(N)],   (8)

where:

E[t_iterate(N)] = Σ_{x=1}^{N} [ Σ_{s ∈ status} E(t_verify(single, s)) · E(inc_n(s, x)) + t_idle(inc_n(valid, x)) ] + Σ_{x=1}^{N} t_verify(error-det)(inc_n(valid, x)).   (9)

We assume an update event after every 100 errors (approximately 400 words verified). As t_idle is dominated by t_verify(error-det) during the initial 10 000 words, we keep this value constant as the number of words in the training lexicon increases,[g] and estimate it at:

t_verify(error-det)(400) + t_idle(400) = 180 seconds.   (10)

From Table 3 we estimate E(t_verify(single, s)) as t_0 + t_e·x seconds, where x indicates the number of corrections required, t_0 = 2 and t_e = 4.5.
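As a quick consistency check, the linear model t_0 + t_e·x with t_0 = 2 and t_e = 4.5 can be compared against the mean verification times reported in Table 2; the agreement is close for zero to three corrections.

```python
# Compare the linear cost model (2 + 4.5x) s against the Table 2 means.
t0, te = 2.0, 4.5
table2_means = {0: 1.95, 1: 5.79, 2: 10.74, 3: 17.91}     # mean seconds per verdict
for x, observed in table2_means.items():
    print(x, round(t0 + te * x, 1), observed)             # corrections, predicted, observed
```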

In order to estimate Σ_{x=1}^{N} E(t_verify(single, s)) · E(inc_n(s, x)) for different states (different numbers of corrections per word), we smooth the number of errors across the training data — as if a word could have only one error — and fit an exponential curve through the accuracy measurements depicted in Fig. 3. That is, we assume the probability that the system will predict an error when the training lexicon is of size d is given by p_e(d), where:

p_e(d) = P_0 e^{dk},   i.e.   log p_e(d) = log P_0 + dk,   (11)

and P_0 and k are parameters to be estimated. The time T(d) required to verify d words (excluding synchronization events) is then given by:

T(d) = Σ_{i=0}^{d−1} (t_0 + t_e P_0 e^{ik}) = d·t_0 + t_e P_0 (e^{dk} − 1)/(e^{k} − 1).   (12)

For the specific data depicted in Fig. 3 we obtain the estimates:

log P_0 = −1.2741,   k = −3.49 × 10^{−5}.   (13)
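The fit itself is a straight line in log space. The snippet below illustrates the procedure on synthetic data points (not the measurements behind Fig. 3) and then evaluates the closed form of Equation (12).

```python
# Fit log p_e(d) = log P0 + d*k by least squares and evaluate T(d) of Equation (12).
import numpy as np

d_obs = np.array([1000, 2000, 4000, 6000, 8000])        # lexicon sizes (synthetic)
p_err = np.array([0.27, 0.26, 0.24, 0.23, 0.21])        # error probabilities (synthetic)

k, logP0 = np.polyfit(d_obs, np.log(p_err), 1)          # slope k, intercept log P0
P0 = np.exp(logP0)

def T(d, t0=2.0, te=4.5):
    """Equation (12): d*t0 + te*P0*(exp(d*k) - 1)/(exp(k) - 1), in seconds."""
    return d * t0 + te * P0 * np.expm1(d * k) / np.expm1(k)

print(P0, k, T(10_000) / 3600)                          # hours to verify 10 000 words
```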

We can combine Equations (9) and (12) in order to estimate the value of E[t_iterate(d/400)] for various values of total lexicon size d:

E[t_iterate(d/400)] = d·t_0 + t_e P_0 (e^{dk} − 1)/(e^{k} − 1) + (d/400)·[t_verify(error-det)(400) + t_idle(400)].   (14)

In Fig. 4 we plot Equation (14) for different values of d, using the estimates from Equations (10) and (13). On the same graph we plot the cost of manual lexicon development (again excluding set-up cost) using estimates for t_develop(d) of 19.2 and 30 s, both optimistic estimates.

Table 2. Statistics of the time taken to verify words requiring 0, 1, 2 or 3 errors, or to identify a word as invalid or ambiguous (µ is the mean and σ the standard deviation).

Verdict      Time in seconds
             µ        σ
Correct      1.95     1.35
1 error      5.79     2.3
2 errors     10.74    3.19
3 errors     17.91    6.12
Invalid      3.39     4.71
Ambiguous    8.92     5.08

Fig. 3. The average number of corrections required per word as a function of the number of words verified. Averages were computed for blocks of 50 words each.

[f] In the previous experiment t_setup_bootstrap − t_setup_manual < 60 min.
[g] This value is influenced by the number of words corrected per cycle – a number that remains constant per cycle.


For these estimates we assume that the same base data (or at least data with a similar validity ratio) are used for both approaches. We also assume that the error rates for the bootstrapping system with error detection and the manual process are approximately equal. As can be seen from the figure, the bootstrapping approach requires significantly less effort than the manual approach. A manual developer would have to be able to generate a correct instance every 3 seconds in order to develop a comparable lexicon within the same time frame. In Fig. 5 we plot the efficiency estimates of the bootstrapping process as compared to a manual lexicon development process for the same values as Fig. 4.
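Under the reconstruction of Equation (14) above, the two cost curves are easy to tabulate. The snippet below uses the estimates of Equations (10) and (13); the printed efficiencies are indicative only and are not the values read off Figs 4 and 5.

```python
# Evaluate Equation (14) against a flat manual cost of t_develop seconds per word.
import math

logP0, k = -1.2741, -3.49e-5                  # Equation (13)
P0, t0, te = math.exp(logP0), 2.0, 4.5

def t_bootstrap(d):
    """Equation (14), in seconds, for a lexicon of d words (one 180 s block per 400 words)."""
    core = d * t0 + te * P0 * math.expm1(d * k) / math.expm1(k)
    return core + (d / 400.0) * 180.0         # Equation (10) per update interval

def t_manual(d, t_develop=19.2):
    return d * t_develop                      # Equation (5), excluding set-up

for d in (2000, 5000, 10000):
    print(d, round(t_manual(d) / t_bootstrap(d), 2))   # efficiency, Equation (1)
```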

Conclusion

A pronunciation model is one of the key components required for the development of speech technology in a resource-scarce language. In this paper we developed a framework for the analysis of a typical bootstrapping process, and demonstrated the practical application of such a framework during the bootstrapping of a pronunciation model in a resource-scarce language. We reported on results for a lexicon that is large enough to be of practical use, with final results summarized in Figs 4 and 5.

Upon completion of the Afrikaans pronunciation lexicon, two further bootstrapped pronunciation lexicons were developed in Zulu and Sepedi, respectively. The Zulu lexicon has since been integrated in a general-purpose text-to-speech (TTS) system developed in the Festival9 framework. This was accomplished as part of the Local Language Speech Technology Initiative (LLSTI),10 an internationally collaborative project that aims to support the development of speech technology systems in local languages. A small grapheme-to-phoneme rule set (84 rules from 855 words) was generated using the bootstrapping system and converted to the Festival letter-to-sound format. The TTS system used the Multisyn approach to synthesis and is described in more detail elsewhere.11,12 The completed system was evaluated for intelligibility and naturalness by both technologically sophisticated and technologically unsophisticated users.13

The Sepedi lexicon (Sepedi is a dialect of Northern Sotho) was developed and integrated in an automatic speech recognition (ASR) system in collaboration with partners from the University of Limpopo. Again a fairly small number of words were bootstrapped in order to develop a concise set of letter-to-sound rules (90 rules from 2827 words). These were then used to develop14 a speech recognition system using the HTK15 framework.

The bootstrapping approach has been shown to be highly efficient for the development of pronunciation lexicons. We foresee that further application of the approach developed will ensure the availability of pronunciation models suitable for speech processing systems in all of South Africa’s official languages.

1. Barnard E., Cloete J.P.L. and Patel H. (2003). Language and technology literacy barriers to accessing government services. Lecture Notes in Computer Science 2739, 37–42.

2. Davel M. and Barnard E. (2003). Bootstrapping for language resource generation. In Proc. Annual Symposium of the Pattern Recognition Association of South Africa, pp. 97–100.

3. Davel M. and Barnard E. (2004). The efficient creation of pronunciation dictionaries: human factors in bootstrapping. In Proc. Interspeech, pp. 2797–2800, Jeju, Korea.

4. Davel M. and Barnard E. (2005). Bootstrapping pronunciation dictionaries: practical issues. In Proc. Interspeech, Lisbon, Portugal.

Fig. 4. Time estimates for creating different sized lexicons. Manual development is illustrated for values of t_develop(1) of 19.2 and 30 seconds, respectively.

Fig. 5. Estimates of the efficiency of bootstrapping, as compared with manual development, for values of t_develop(1) of 19.2 and 30 seconds, respectively.

Table 3. Observed values for various bootstrapping parameters.

Bootstrapping parameter: Estimated value

Training cost t_train: <120 min
Verification cost for single words, with x corrections required for a word in state s, t_verify(single, s): (2 + 4.5x) s
Verification cost during error detection (per 1000 words), t_verify(error-det): <10 min
Verification cost during error detection (per 400 words), t_verify(error-det): <3 min
Task difficulty – bootstrapping, no error detection, error_rate_bootstrap: 0–1%
Task difficulty – bootstrapping, error detection, error_rate_bootstrap: 0–0.5%
Task difficulty – manual, error_rate_manual: 0–0.5%
Manual development speed t_develop: 19.2–30 s
Initial set-up cost t_setup_bootstrap − t_setup_manual: <60 min
Update interval: 100 errors (~400 words)

5. Maskey S., Tomokiyo L. and Black A. (2004). Bootstrapping phonetic lexicons for new languages. In Proc. Interspeech, pp. 69–72, Jeju, Korea.
6. Davel M. and Barnard E. (2004). A default-and-refinement approach to pronunciation prediction. In Proc. Annual Symposium of the Pattern Recognition Association of South Africa, pp. 119–123.
7. Cohen D., Ghahramani Z. and Jordan M. (1990). Active learning with statistical models. J. Artific. Intell. Res. 4, 129–145.
8. Seeger M. (2002). Learning with labeled and unlabeled data. Technical Report, Institute for Adaptive and Neural Computation, University of Edinburgh.
9. Black A., Taylor P. and Caley R. (1999). The Festival Speech Synthesis System. Online: http://festvox.org/festival/
10. Local Language Speech Technology Initiative (LLSTI) (2005). Online: http://www.llsti.org
11. Davel M. and Barnard E. (2004). LLSTI isiZulu TTS Project Report. CSIR, Pretoria.
12. Louw J.A., Davel M. and Barnard E. (in press). A general purpose isiZulu TTS system. South African Journal of African Languages.
13. Davel M. and Barnard E. (2004). LLSTI isiZulu TTS Evaluation Report. CSIR, Pretoria.
14. Modiba T.M. (2004). Aspects of automatic speech recognition with respect to Northern Sotho. M.Eng. thesis, University of the North, South Africa.
15. Young S., Kershaw D., Odell J., Ollason D., Valtchev V. and Woodland P. (2000). The HTK Book. Revised HTK Version 3.0. Online: http://htk.eng.cam.ac.uk/
16. Davel M. (2005). Pronunciation modelling and bootstrapping. Ph.D. thesis, University of Pretoria, South Africa.


Design and optimization of a pulsed CO2 laser for laser ultrasonic applications

A. Forbes[a,b]*, L.R. Botha[a], N. du Preez[c] and T.E. Drake[d]

Introduction

Polymer–matrix composites are increasingly used in the aerospace industry, particularly in the manufacture of modern fighter planes.1–3 The number and complexity of such composites are also increasing steadily. The aerospace industry requires non-destructive testing (NDT) of all parts during the manufacturing process. In the case of composite materials, testing is particularly necessary for detecting the presence of delaminations and inclusions. The conventional approach involves the use of ultrasonics, based on water jets and piezoelectric transducers. A requirement for this process is that the transducers must be normal to the surface to ensure good signal-to-noise ratio. Since most modern aircraft parts have a complex geometry, this process is slow and therefore expensive.

An alternative ultrasonic inspection technique is laser ultrasonics (LU). In this technique two lasers are used to illuminate the part: a short pulse laser with a wavelength chosen so as to be absorbed by the material, and a second laser beam to detect the impact of the ultrasonic waves. The first laser pulse is rapidly absorbed by the material, causing fast thermal expansion, which results in ultrasonic waves. The wave amplitudes themselves are small enough not to cause any damage. The waves so generated propagate through the material and are reflected from interfaces (such as defects) inside the material. On returning to the surface, the reflected waves generate very small surface displacements of the material. A second, narrow bandwidth, frequency-stable laser is used in conjunction with an interferometer to detect these mechanical displacements on the surface of the material.1–3 The advantage of LU is that these measurements can be done at large distances from the sources and at large angles with respect to the sample surface. This gives LU a competitive advantage and ensures that it is currently the NDT method of choice for the testing of composite materials in the aerospace industry.

For efficient and reliable operation, LU requires the efficient generation of ultrasound by the first laser pulse, which places stringent requirements on the laser system. The properties required of the laser pulse include good beam quality, a pulse of short duration (<100 ns), as well as a minimum energy (>200 mJ) and pulse repetition rate (>200 Hz) for the process to be suitable for industrial applications. Furthermore, since these systems will typically be used in a production environment, the final design must result in a laser system that is reliable (with low downtime) and economical. Numerical simulations and experimental investigations have indicated that generation efficiency can be improved by using short pulses in the 3–4 µm and 10 µm spectral regions.1 Short-pulse 10-µm radiation can be produced by transversely excited, atmospheric CO2 (TEA CO2) lasers. Owing to the technological maturity of these lasers, they are particularly suited to industrial environments.

In this paper we discuss the design of a TEA CO2 laser for LU applications. We cover the basic laser parameters to be optimized, and report experimental data for the optimization of the output coupler reflectivity and gas mix. The impact on laser chemistry is discussed in detail.

Laser parameters

It is useful to start with a summary of the important laser parameters for LU, and their interrelationships. Since the parameters are not independent of one another, the critical aspect of the parameter testing is to explore regimes in terms of combinations of parameters. Only in some cases can the parameters be chosen independently. For LU a strong ultrasonic signal at as high a repetition rate as possible must be generated. The strong signal ensures a good signal-to-noise ratio, while a higher repetition rate allows more samples to be tested in a given time, thereby increasing productivity. The LU signal is a product of the efficiency of the laser pulse in generating ultrasound, determined

Laser ultrasonics is currently the optimal method for non-destructive testing of composite materials in the aerospace industry. The process is based on a laser-generated ultrasound wave which propagates inside the composite. The response at the material surface is detected and converted into a defect map across the aircraft. The design and optimization of a laser system for this application, together with the basic science involved, is reviewed in this paper. This includes the optimization of laser parameters, such as output couplers and gas mixture, and the impact these choices have on the laser chemistry. We present a theory for the catalytic recombination of the gas which shows excellent agreement with experiment. Finally, an operating laser system for this application, yielding a sixfold improvement in performance over conventional laser systems, is described.

[a] CSIR National Laser Centre, P.O. Box 395, Pretoria 0001, South Africa.
[b] School of Physics, University of KwaZulu-Natal, Private Bag X54001, Durban 4000, South Africa.
[c] Scientific Development and Integration (Pty) Ltd, P.O. Box 1559, Pretoria 0001, South Africa.
[d] Lockheed Martin Aeronautics Company, Lockheed Martin Blvd, Fort Worth, Texas 76108, U.S.A.
*Author for correspondence. E-mail: [email protected]