
Complex Systems 1 (1987) 145-168

Parallel Networks that Learn to Pronounce English Text

Terrence J. Sejnowski
Department of Biophysics, The Johns Hopkins University,

Baltimore, MD 21218, USA

Charles R. Rosenberg
Cognitive Science Laboratory, Princeton University,

Princeton, NJ 08542, USA

Abstract. This paper describes NETtalk, a class of massively-parallel network systems that learn to convert English text to speech. The memory representations for pronunciations are learned by practice and are shared among many processing units. The performance of NETtalk has some similarities with observed human performance. (i) The learning follows a power law. (ii) The more words the network learns, the better it is at generalizing and correctly pronouncing new words. (iii) The performance of the network degrades very slowly as connections in the network are damaged: no single link or processing unit is essential. (iv) Relearning after damage is much faster than learning during the original training. (v) Distributed or spaced practice is more effective for long-term retention than massed practice.

Network models can be constructed that have the same performance and learning characteristics on a particular task, but differ completely at the levels of synaptic strengths and single-unit responses. However, hierarchical clustering techniques applied to NETtalk reveal that these different networks have similar internal representations of letter-to-sound correspondences within groups of processing units. This suggests that invariant internal representations may be found in assemblies of neurons intermediate in size between highly localized and completely distributed representations.

1. Introduction

Expert performance is characterized by speed and effortlessness, but this fluency requires long hours of effortful practice. We are all experts at reading and communicating with language. We forget how long it took to acquire these skills because we are now so good at them and we continue to practice every day. As performance on a difficult task becomes more

© 1987 Complex Systems Publications, Inc.


automatic, it also becomes more inaccessible to conscious scrutiny. The acquisition of skilled performance by practice is more difficult to study and is not as well understood as memory for specific facts [4,55,78].

The problem of pronouncing written English text illustrates many of the features of skill acquisition and expert performance. In reading aloud, letters and words are first recognized by the visual system from images on the retina. Several words can be processed in one fixation, suggesting that a significant amount of parallel processing is involved. At some point in the central nervous system the information encoded visually is transformed into articulatory information about how to produce the correct speech sounds. Finally, intricate patterns of activity occur in the motoneurons which innervate muscles in the larynx and mouth, and sounds are produced. The key step that we are concerned with in this paper is the transformation from the highest sensory representations of the letters to the earliest articulatory representations of the phonemes.

English pronunciation has been extensively studied by linguists and much is known about the correspondences between letters and the elementary speech sounds of English, called phonemes [83]. English is a particularly difficult language to master because of its irregular spelling. For example, the "a" in almost all words ending in "ave", such as "brave" and "gave", is a long vowel, but not in "have", and there are some words such as "read" that can vary in pronunciation with their grammatical role. The problem of reconciling rules and exceptions in converting text to speech shares some characteristics with difficult problems in artificial intelligence that have traditionally been approached with rule-based knowledge representations, such as natural language translation [27].

Another approach to knowledge representation which has recently become popular uses patterns of activity in a large network of simple processing units [22,30,56,42,70,35,36,12,51,19,46,5,82,41,7,85,13,67,50]. This "connectionist" approach emphasizes the importance of the connections between the processing units in solving problems rather than the complexity of processing at the nodes.

The network level of analysis is intermediate between the cognitive and neural levels [11]. Network models are constrained by the general style of processing found in the nervous system [71]. The processing units in a network model share some of the properties of real neurons, but they need not be identified with processing at the level of single neurons. For example, a processing unit might be identified with a group of neurons, such as a column of neurons [14,54,37]. Also, those aspects of performance that depend on the details of input and output data representations in the nervous system may not be captured with the present generation of network models.

A connectionist network is "programmed" by specifying the architectural arrangement of connections between the processing units and the strength of each connection. Recent advances in learning procedures for such networks have been applied to small abstract problems [73,66] and


Figure 1: Schematic drawing of the NETtalk network architecture. A window of letters in an English text is fed to an array of 203 input units. Information from these units is transformed by an intermediate layer of 80 "hidden" units to produce patterns of activity in 26 output units. The connections in the network are specified by a total of 18629 weight parameters (including a variable threshold for each unit).

more difficult problems such as forming the past tense of English verbs [68].

In this paper we describe a network that learns to pronounce English text. The system, which we call NETtalk, demonstrates that even a small network can capture a significant fraction of the regularities in English pronunciation as well as absorb many of the irregularities. In commercial text-to-speech systems, such as DECtalk [15], a look-up table (of about a million bits) is used to store the phonetic transcription of common and irregular words, and phonological rules are applied to words that are not in this table [3,40]. The result is a string of phonemes that can then be converted to sounds with digital speech synthesis. NETtalk is designed to perform the task of converting strings of letters to strings of phonemes. Earlier work on NETtalk was described in [74].

2. Network Architecture

Figure 1 shows the schematic arrangement of the NETtalk system. Three layers of processing units are used. Text is fed to units in the input layer. Each of these input units has connections with various strengths to units in an intermediate "hidden" layer. The units in the hidden layer are in turn connected to units in an output layer, whose values determine the output phoneme.

The processing units in successive layers of the network are connected by weighted arcs. The output of each processing unit is a nonlinear function


Figure 2: (a) Schematic form of a processing unit receiving inputs from other processing units. (b) The output P(E) of a processing unit as a function of the sum E of its inputs.

of the sum of its inputs, as shown in Figure 2. The output function has a sigmoid shape: it is zero if the input is very negative, then increases monotonically, approaching the value one for large positive inputs. This form roughly approximates the firing rate of a neuron as a function of its integrated input: if the input is below threshold there is no output; the firing rate increases with input, and saturates at a maximum firing rate. The behavior of the network does not depend critically on the details of the sigmoid function, but the explicit one used here is given by

s_i = P(E_i) = \frac{1}{1 + e^{-E_i}} \qquad (2.1)

where s_i is the output of the ith unit and E_i is the total input

E_i = \sum_j w_{ij} s_j \qquad (2.2)

where w_{ij} is the weight from the jth to the ith unit. The weights can have positive or negative real values, representing an excitatory or inhibitory influence.

In addition to the weights connecting them, each unit also has a threshold which can also vary. To make the notation uniform, the threshold was implemented as an ordinary weight from a special unit, called the true unit, that always had an output value of 1. This fixed bias acts like a threshold whose value is the negative of the weight.
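As a minimal sketch (in C, the language the authors mention later for their simulator), this is one way a single unit's output could be computed; the flat-array layout and the convention that index 0 holds the true unit are our own illustrative assumptions.

```c
#include <math.h>

/* Output of one processing unit: s = P(E) = 1 / (1 + exp(-E)), Eq. (2.1),
 * where E is the weighted sum of the unit's inputs, Eq. (2.2).
 * s[0] is the "true" unit, fixed at 1.0, so w[0] plays the role of the
 * (negated) threshold described above. */
double unit_output(const double *w, const double *s, int n_inputs)
{
    double E = 0.0;
    for (int j = 0; j <= n_inputs; j++)   /* j = 0 is the true unit */
        E += w[j] * s[j];
    return 1.0 / (1.0 + exp(-E));
}
```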

Learning algorithm. Learning algorithms are automated procedures that allow networks to improve their performance through practice [63,87,2,75].


Supervised learning algorithms for networks with "hidden units" between the input and output layers have been introduced for Boltzmann machines [31,1,73,59,76], and for feed-forward networks [66,44,57]. These algorithms require a "local teacher" to provide feedback information about the performance of the network. For each input, the teacher must provide the network with the correct value of each unit on the output layer. Human learning is often imitative rather than instructive, so the teacher can be an internal model of the desired behavior rather than an external source of correction. Evidence has been found for error-correction in animal learning and human category learning [60,79,25,80,26]. Changes in the strengths of synapses have been experimentally observed in the mammalian nervous system that could support error-correction learning [28,49,61,39]. The network model studied here should be considered only a small part of a larger system that makes decisions based on the output of the network and compares its performance with a desired goal.

We have applied both the Boltzmann and the back-propagation learning algorithms to the problem of converting text to speech, but only results using back-propagation will be presented here. The back-propagation learning algorithm [66] is an error-correcting learning procedure that generalizes the Widrow-Hoff algorithm [87] to multilayered feedforward networks [23]. A superscript will be used to denote the layer for each unit, so that s_i^{(n)} is the ith unit on the nth layer. The final, output layer is designated the Nth layer.

The first step is to compute the output of the network for a given input. All the units on successive layers are updated. There may be direct connections between the input layer and the output layer as well as through the hidden units. The goal of the learning procedure is to minimize the average squared error between the values of the output units and the correct pattern, s_i^*, provided by a teacher:

Error = \sum_{i=1}^{J} (s_i^* - s_i^{(N)})^2 \qquad (2.3)

where J is the number of units in the output layer. This is accomplished by first computing the error gradient on the output layer:

\delta_i^{(N)} = (s_i^* - s_i^{(N)}) \, P'(E_i^{(N)}) \qquad (2.4)

and then propagating it backwards through the network, layer by layer:

\delta_i^{(n)} = \sum_j \delta_j^{(n+1)} w_{ji}^{(n)} \, P'(E_i^{(n)}) \qquad (2.5)

where P'(E_i) is the first derivative of the function P(E_i) in Figure 2(b). These gradients are the directions in which each weight should be altered to reduce the error for a particular item. To reduce the average error for all the input patterns, these gradients must be averaged over all the training


patterns before updating the weights. In practice, it is sufficient to average over several inputs before updating the weights. Another method is to compute a running average of the gradient with an exponentially decaying filter:

\Delta w_{ij}^{(n)}(u+1) = \alpha \, \Delta w_{ij}^{(n)}(u) + (1 - \alpha) \, \delta_i^{(n+1)} s_j^{(n)} \qquad (2.6)

where \alpha is a smoothing parameter (typically 0.9) and u is the number of input patterns presented. The smoothed weight gradients \Delta w_{ij}^{(n)}(u) can then be used to update the weights:

w_{ij}^{(n)}(t+1) = w_{ij}^{(n)}(t) + \epsilon \, \Delta w_{ij}^{(n)}(t) \qquad (2.7)

where t is the number of weight updates and \epsilon is the learning rate (typically 1.0). The error signal was back-propagated only when the difference between the actual and desired values of the outputs was greater than a margin of 0.1. This ensured that the network did not overlearn on inputs that it was already getting correct. This learning algorithm can be generalized to networks with feedback connections and multiplicative connections [66], but these extensions were not used in this study.

The definitions of the learning parameters here are somewhat different from those in [66]. In the original algorithm \epsilon is used rather than (1 - \alpha) in Equation (2.6). Our parameter \alpha is used to smooth the gradient in a way that is independent of the learning rate \epsilon, which only appears in the weight update Equation (2.7). Our averaging procedure also makes it unnecessary to scale the learning rate by the number of presentations per weight update.
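Read concretely, Equations (2.4)-(2.7) fit together as in the C sketch below, which computes the output and hidden deltas, smooths the gradient, and moves one weight; the function names, data layout, and the way the 0.1 margin is applied are illustrative assumptions on our part, not the original simulator's code.

```c
#include <math.h>

#define ALPHA   0.9   /* gradient smoothing parameter, Eq. (2.6)          */
#define EPSILON 1.0   /* learning rate, Eq. (2.7)                         */
#define MARGIN  0.1   /* no error propagated when |s* - s| is below this  */

/* Derivative of the logistic function P, written in terms of its output s. */
static double P_prime(double s) { return s * (1.0 - s); }

/* Eq. (2.4): error gradient (delta) for an output unit, with the margin. */
double output_delta(double target, double s)
{
    double diff = target - s;
    if (fabs(diff) < MARGIN) diff = 0.0;    /* close enough: do not learn */
    return diff * P_prime(s);
}

/* Eq. (2.5): delta for a hidden unit i, summing over the units j it feeds. */
double hidden_delta(double s_i, const double *delta_next,
                    const double *w_from_i, int n_next)
{
    double sum = 0.0;
    for (int j = 0; j < n_next; j++)
        sum += delta_next[j] * w_from_i[j];
    return sum * P_prime(s_i);
}

/* Eqs. (2.6) and (2.7): smooth the gradient, then move the weight w_ij. */
void update_weight(double *w, double *dw_smooth, double delta_i, double s_j)
{
    *dw_smooth = ALPHA * (*dw_smooth) + (1.0 - ALPHA) * delta_i * s_j;  /* (2.6) */
    *w += EPSILON * (*dw_smooth);                                       /* (2.7) */
}
```

With ALPHA set to zero the filter disappears and the rule reduces to plain per-pattern gradient descent.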

The back-propagation learning algorithm has been applied to several problems, including knowledge representation in semantic networks [29,65], bandwidth compression by dimensionality reduction [69,89], speech recognition [17,86], computing the shape of an object from its shaded image [45], and backgammon [81]. In the next section a detailed description will be given of how back-propagation was applied to the problem of converting English text to speech.

Representations of letters and phonemes. The standard network had seven groups of units in the input layer, and one group of units in each of the other two layers. Each input group encoded one letter of the input text, so that strings of seven letters are presented to the input units at any one time. The desired output of the network is the correct phoneme, associated with the center, or fourth, letter of this seven letter "window". The other six letters (three on either side of the center letter) provided a partial context for this decision. The text was stepped through the window letter-by-letter. At each step, the network computed a phoneme, and after each word the weights were adjusted according to how closely the computed pronunciation matched the correct one.

We chose a window with seven letters for two reasons. First, [48] have shown that a significant amount of the information needed to correctly pronounce a letter is contributed by the nearby letters (Figure 3). Secondly, we were limited by our computational resources to exploring small networks

Information Gain at Several Letter Positions

Figure 3: Mutual information between neighboring letters and the correct pronunciation of the center letter, as a function of distance from the center letter. (Data from [48].)

and it proved possible to train a network with a seven letter window in a few days. The limited size of the window also meant that some important nonlocal information about pronunciation and stress could not be properly taken into account by our model [10]. The main goal of our model was to explore the basic principles of distributed information coding in a real-world domain rather than achieve perfect performance.

The letters and phonemes were represented in different ways. The letters were represented locally within each group by 29 dedicated units, one for each letter of the alphabet, plus an additional 3 units to encode punctuation and word boundaries. Only one unit in each input group was active for a given input. The phonemes, in contrast, were represented in terms of 21 articulatory features, such as point of articulation, voicing, vowel height, and so on, as summarized in the Appendix. Five additional units encoded stress and syllable boundaries, making a total of 26 output units. This was a distributed representation since each output unit participates in the encoding of several phonemes [29].
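A sketch of how the local letter code could be laid out in C: seven groups of 29 units (26 letters plus 3 punctuation and word-boundary units), with exactly one unit on per group. The particular characters mapped to the three extra units are our assumption; the paper does not list them.

```c
#include <string.h>

#define WINDOW    7
#define UNITS_PER 29                      /* 26 letters + 3 punctuation/boundary units */
#define N_INPUT   (WINDOW * UNITS_PER)    /* = 203 input units, as in Figure 1 */

/* Map a character to a unit index 0..28; the assignment of the three
 * non-letter units is illustrative only. */
static int letter_index(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c == ' ') return 26;              /* word boundary */
    if (c == '.') return 27;              /* period        */
    return 28;                            /* other punctuation */
}

/* Turn a seven-letter window into the 203-element input vector. */
void encode_window(const char window[WINDOW], double input[N_INPUT])
{
    memset(input, 0, N_INPUT * sizeof(double));
    for (int g = 0; g < WINDOW; g++)
        input[g * UNITS_PER + letter_index(window[g])] = 1.0;
}
```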

The hidden units neither received direct input nor had direct output, but were used by the network to form internal representations that were


appropriate for solving the mapping problem of letters to phonemes. The goal of the learning algorithm was to search effectively the space of all possible weights for a network that performed the mapping.

Learning. Two texts were used to train the network: phonetic transcriptions from informal, continuous speech of a child [9] and Merriam-Webster's Pocket Dictionary. The corresponding letters and phonemes were aligned and a special symbol for continuation, "-", was inserted whenever a letter was silent or part of a graphemic letter combination, as in the conversion from the string of letters "phone" to the string of phonemes /f-on-/ (see Appendix). Two procedures were used to move the text through the window of 7 input groups. For the corpus of informal, continuous speech the text was moved through in order with word boundary symbols between the words. Several words or word fragments could be within the window at the same time. For the dictionary, the words were placed in random order and were moved through the window individually.

The weights were incrementally adjusted during the training according to the discrepancy between the desired and actual values of the output units. For each phoneme, this error was "back-propagated" from the output to the input layer using the learning algorithm introduced by [66] and described above. Each weight in the network was adjusted after every word to minimize its contribution to the total mean squared error between the desired and actual outputs. The weights in the network were always initialized to small random values uniformly distributed between -0.3 and 0.3; this was necessary to differentiate the hidden units.
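That initialization step might look like the following sketch, assuming the weights and thresholds are stored in one flat array; the use of rand() is purely illustrative.

```c
#include <stdlib.h>

/* Initialize every weight (and threshold) to a small random value
 * uniformly distributed between -0.3 and 0.3, so that the hidden units
 * start out differentiated from one another. */
void init_weights(double *w, int n_weights)
{
    for (int i = 0; i < n_weights; i++)
        w[i] = 0.6 * rand() / (double)RAND_MAX - 0.3;
}
```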

A simulator was written in the C programming language for configuring a network with arbitrary connectivity, training it on a corpus and collecting statistics on its performance. A network of 10,000 weights had a throughput during learning of about 2 letters/sec on a VAX 11/780 FPA. After every presentation of an input, the inner product of the output vector was computed with the codes for each of the phonemes. The phoneme that made the smallest angle with the output was chosen as the "best guess". Slightly better performance was achieved by choosing the phoneme whose representation had the smallest Euclidean distance from the output vector, but these results are not reported here. All performance figures in this section refer to the percentage of correct phonemes chosen by the network. The performance was also assayed by "playing" the output string of phonemes and stresses through DECtalk, bypassing the part of the machine that converts letters to phonemes.
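The "best guess" step can be sketched as follows: compare the 26-unit output vector against each phoneme's code and pick the code that makes the smallest angle with it, i.e. the largest normalized inner product. The number of codes and the data layout here are illustrative assumptions.

```c
#include <math.h>

#define N_OUT      26   /* output units                                    */
#define N_PHONEMES 54   /* illustrative count of phoneme/punctuation codes */

/* Return the index of the phoneme code making the smallest angle with the
 * output vector (equivalently, the largest cosine of the angle). */
int best_guess(const double output[N_OUT],
               const double codes[N_PHONEMES][N_OUT])
{
    double out_norm = 0.0;
    for (int k = 0; k < N_OUT; k++)
        out_norm += output[k] * output[k];
    out_norm = sqrt(out_norm);

    int best = 0;
    double best_cos = -2.0;
    for (int p = 0; p < N_PHONEMES; p++) {
        double dot = 0.0, norm = 0.0;
        for (int k = 0; k < N_OUT; k++) {
            dot  += output[k] * codes[p][k];
            norm += codes[p][k] * codes[p][k];
        }
        double c = dot / (out_norm * sqrt(norm) + 1e-12);
        if (c > best_cos) { best_cos = c; best = p; }
    }
    return best;
}
```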

3. Performance

Continuous informal speech. [9] provide phonetic transcriptions of children and adults that were tape recorded during informal sessions. This was a particularly difficult training corpus because the same word was often pronounced several different ways; phonemes were commonly elided or modified at word boundaries, and adults were about as inconsistent as children.


Figure 4: Learning curves for phonemes and stresses during training on the 1024 word corpus of continuous informal speech. The percentages of correct phonemes and stresses are shown as functions of the number of training words.


We used the first two pages of transcriptions, which contained 1024 words from a child in first grade. The stresses were assigned to the transcriptions so that the training text sounded natural when played through DECtalk. The learning curve for 1024 words from the informal speech corpus is shown in Figure 4. The percentage of correct phonemes rose rapidly at first and continued to rise at a slower rate throughout the learning, reaching 95% after 50 passes through the corpus. Primary and secondary stresses and syllable boundaries were learned very quickly for all words and achieved nearly perfect performance by 5 passes (Figure 4). When the learning curves were plotted as error rates on double logarithmic scales they were approximately straight lines, so that the learning follows a power law, which is characteristic of human skill learning [64].
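In other words, if the error after t training words falls off as a power law, E(t) = A t^{-b}, then log E(t) = log A - b log t, so error plotted against training on double logarithmic axes is a straight line with slope -b.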

The distinction between vowels and consonants was made early; however, the network predicted the same vowel for all vowels and the same consonant for all consonants, which resulted in a babbling sound. A second stage occurred when word boundaries were recognized, and the output then resembled pseudowords. After just a few passes through the network many of the words were intelligible, and by 10 passes the text was understandable.

When the network made an error it often substituted phonemes that sounded similar to each other. For example, a common confusion was between the "th" sounds in "thesis" and "these", which differ only in voicing.


Figure 5: (a) Performance of a network as a function of the amount of damage to the weights. (b) Retraining of a damaged network compared with the original learning curve starting from the same level of performance. The network was damaged by adding a random component to all the weights uniformly distributed on the interval [-1.2, 1.2].

Few errors in a well-trained network were confusions between vowels and consonants. Some errors were actually corrections to inconsistencies in the original training corpus. Overall, the intelligibility of the speech was quite good.

Did the network memorize the training words or did it capture the regular features of pronunciation? As a test of generalization, a network trained on the 1024 word corpus of informal speech was tested without training on a 439 word continuation from the same speaker. The performance was 78%, which indicates that much of the learning was transferred to novel words even after a small sample of English words.

Is the network resistant to damage? We examined performance of a highly-trained network after making random changes of varying size to the weights. As shown in Figure 5(a), random perturbations of the weights uniformly distributed on the interval [-0.5, 0.5] had little effect on the performance of the network, and degradation was gradual with increasing damage. This damage caused the magnitude of each weight to change on average by 0.25; this is the roundoff error that can be tolerated before the performance of the network begins to deteriorate, and it can be used to estimate the accuracy with which each weight must be specified. The weights had an average magnitude of 0.8 and almost all had a magnitude of less than 2. With 4 binary bits it is possible to specify 16 possible values, or -2 to +2 in steps of 0.25. Hence, the minimum information needed to specify each weight in the network is only about 4 bits.
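Both manipulations described in this paragraph are easy to state in code. The sketch below adds uniform noise to every weight and, separately, rounds the weights to steps of 0.25 on [-2, 2], the resolution implied by the 4-bit estimate; the use of rand() and the function names are illustrative.

```c
#include <stdlib.h>
#include <math.h>

/* Uniform random number on [-d, d]; illustrative, not the original RNG. */
static double uniform(double d)
{
    return d * (2.0 * rand() / (double)RAND_MAX - 1.0);
}

/* Damage: add an independent random component on [-d, d] to every weight. */
void damage_weights(double *w, int n, double d)
{
    for (int i = 0; i < n; i++)
        w[i] += uniform(d);
}

/* Quantize: clip to [-2, 2] and round to the nearest 0.25 (about 16 levels,
 * i.e. roughly 4 bits per weight). */
void quantize_weights(double *w, int n)
{
    for (int i = 0; i < n; i++) {
        double v = w[i];
        if (v >  2.0) v =  2.0;
        if (v < -2.0) v = -2.0;
        w[i] = 0.25 * floor(v / 0.25 + 0.5);
    }
}
```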

If the damage is not too severe, relearning was much faster than the


Figure 6: (a) Learning curves for training on a corpus of the 1000 most common words in English using different numbers of hidden units, as indicated beside each curve. (b) Performance during learning of two representative phonological rules, the hard and soft pronunciation of the letter "c".

original learning starting from the same level of performance, as shown in Figure 5(b). Similar fault tolerance and fast recovery from damage have also been observed in networks constructed using the Boltzmann learning algorithm [32].

Dictionary. The Merriam-Webster's Pocket Dictionary that we used had 20,012 words. A subset of the 1000 most commonly occurring words was selected from this dictionary based on frequency counts in the Brown corpus [43]. The most common English words are also amongst the most irregular, so this was also a test of the capacity of the network to absorb exceptions. We were particularly interested in exploring how the performance of the network and learning rate scaled with the number of hidden units. With no hidden units, only direct connections from the input units to the output units, the performance rose quickly and saturated at 82% as shown in Figure 6(a). This represents the part of the mapping that can be accomplished by linearly separable partitioning of the input space [53]. Hidden units allow more contextual influence by recognizing higher-order features amongst combinations of input units.

The rate of learning and asymptotic performance increased with the number of hidden units, as shown in Figure 6(a). The best performance achieved with 120 hidden units was 98% on the 1000 word corpus, significantly better than the performance achieved with continuous informal speech, which was more difficult because of the variability in real-world speech. Different letter-to-sound correspondences were learned at different rates and two examples are shown in Figure 6(b): the soft "c" takes longer to learn, but eventually achieves perfect accuracy. The hard "c" occurs


about twice as often as the soft "c" in the training corpus. Children show a similar difficulty with learning to read words with the soft "c" [84].

The ability of a network to generalize was tested on a large dictionary. Using weights from a network with 120 hidden units trained on the 1000 words, the average performance of the network on the dictionary of 20,012 words was 77%. With continued learning, the performance reached 85% at the end of the first pass through the dictionary, indicating a significant improvement in generalization. Following five training passes through the dictionary, the performance increased to 90%.

The number of input groups was varied from three to eleven. Both the speed of learning and the asymptotic level of performance improved with the size of the window. The performance with 11 input groups and 80 hidden units was about 7% higher than a network with 7 input groups and 80 hidden units up to about 25 passes through the corpus, and reached 97.5% after 55 passes compared with 95% for the network with 7 input groups.

Adding an extra layer of hidden units also improved the performance somewhat. A network with 7 input groups and two layers of 80 hidden units each was trained first on the 1000 word dictionary. Its performance after 55 passes was 97% and its generalization was 80% on the 20,012 word dictionary without additional training, and 87% after the first pass through the dictionary with training. The asymptotic performance after 11 passes through the dictionary was 91%. Compared to the network with 120 hidden units, which had about the same number of weights, the network with two layers of hidden units was better at generalization but about the same in absolute performance.

4. Analysis of the Hidden Units

There are not enough hidden units in even the largest network that we studied to memorize all of the words in the dictionary. The standard network with 80 hidden units had a total of 18,629 weights, including variable thresholds. If we allow 4 bits of accuracy for each weight, as indicated by the damage experiments, the total storage needed to define the network is about 80,000 bits. In comparison, the 20,012 word dictionary, including stress information, required nearly 2,000,000 bits of storage. This data compression is possible because of the redundancy in English pronunciation. By studying the patterns of activation amongst the hidden units, we were able to understand some of the coding methods that the network had discovered.

The standard network used for analysis had 7 input groups and 80 hidden units and had been trained to 95% correct on the 1000 dictionary words. The levels of activation of the hidden units were examined for each letter of each word using the graphical representation shown in Figure 7. On average, about 20% of the hidden units were highly activated for any given input, and most of the remaining hidden units had little or no activation.


Figure 7: Levels of activation in the layer of hidden units for a variety of words, all of which produce the same phoneme, /E/, on the output. The input string is shown at the left with the center letter emphasized. The level of activity of each hidden unit is shown to the right, in two rows of 40 units each. The area of the white square is proportional to the activity level.


Thus, the coding scheme could be described neither as a local representation, which would have activated only a few units [6,20,8], nor as a "holographic" representation [88,47], in which all of the hidden units would have participated to some extent. It was apparent, even without using statistical techniques, that many hidden units were highly activated only for certain letters, or sounds, or letter-to-sound correspondences. A few of the hidden units could be assigned unequivocal characterizations, such as one unit that responded only to vowels, but most of the units participated in more than one regularity.

To test the hypothesis that letter-to-sound correspondences were the primary organizing variable, we computed the average activation level of each hidden unit for each letter-to-sound correspondence in the training corpus. The result was 79 vectors with 80 components each, one vector for each letter-to-sound correspondence. A hierarchical clustering technique was used to arrange the letter-to-sound vectors in groups based on a Euclidean metric in the 80-dimensional space of hidden units. The overall pattern, as shown in Figure 8, was striking: the most important distinction was the complete separation of consonants and vowels. However, within these two groups the clustering had a different pattern. For the vowels, the next most important variable was the letter, whereas consonants were clustered according to a mixed strategy that was based more on the similarity of their sounds. The same clustering procedure was repeated for three networks starting from different random starting states. The patterns of


Figure 8: Hierarchical clustering of hidden units for letter-to-sound correspondences. The vectors of average hidden unit activity for each correspondence, shown at the bottom of the binary tree (l-p for letter 'l' and phoneme 'p'), were successively grouped according to an agglomerative method using complete linkage [18].

weights were completely different but the clustering analysis revealed the same hierarchies, with some differences in the details, for all three networks.
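A bare-bones sketch of the clustering step: compute Euclidean distances between the 79 average-activation vectors in the 80-dimensional hidden-unit space, then merge clusters agglomeratively under complete linkage, where the distance between two clusters is the distance between their farthest members. This is our illustration of the procedure in [18], not the analysis code actually used.

```c
#include <math.h>
#include <stdio.h>

#define N_VEC 79   /* letter-to-sound correspondences */
#define N_HID 80   /* hidden units                    */

/* Agglomerative complete-linkage clustering of the average-activation
 * vectors.  Under complete linkage, the distance from a merged cluster to
 * any other cluster is the larger of the two old distances, so the
 * distance matrix can be updated in place. */
void cluster(const double vec[N_VEC][N_HID])
{
    double d[N_VEC][N_VEC];
    int alive[N_VEC];

    /* Pairwise Euclidean distances in the 80-dimensional space. */
    for (int a = 0; a < N_VEC; a++) {
        alive[a] = 1;
        for (int b = 0; b < N_VEC; b++) {
            double sum = 0.0;
            for (int k = 0; k < N_HID; k++) {
                double diff = vec[a][k] - vec[b][k];
                sum += diff * diff;
            }
            d[a][b] = sqrt(sum);
        }
    }

    for (int step = 0; step < N_VEC - 1; step++) {
        int ia = -1, ib = -1;
        double best = 1e30;
        for (int a = 0; a < N_VEC; a++)          /* find the closest pair */
            for (int b = a + 1; b < N_VEC; b++)
                if (alive[a] && alive[b] && d[a][b] < best) {
                    best = d[a][b]; ia = a; ib = b;
                }
        printf("merge %d and %d at height %.3f\n", ia, ib, best);
        for (int k = 0; k < N_VEC; k++) {        /* complete-linkage update */
            double far = d[ia][k] > d[ib][k] ? d[ia][k] : d[ib][k];
            d[ia][k] = d[k][ia] = far;
        }
        alive[ib] = 0;                           /* cluster ib absorbed into ia */
    }
}
```

The sequence of merges and their heights is what a dendrogram such as the one in Figure 8 summarizes.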

5. Conclusions

NETtalk is an illustration in miniature of many aspects of learning. First, the network starts out with considerable "innate" knowledge in the form of input and output representations that were chosen by the experimenters, but with no knowledge specific for English; the network could have been trained on any language with the same set of letters and phonemes. Second, the network acquired its competence through practice, went through several distinct stages, and reached a significant level of performance. Finally, the information was distributed in the network such that no single unit or link was essential. As a consequence, the network was fault tolerant and degraded gracefully with increasing damage. Moreover, the network recovered from damage much more quickly than it took to learn initially.

Despite these similarities with human learning and memory, NETtalk is too simple to serve as a good model for the acquisition of reading skills in humans. The network attempts to accomplish in one stage what occurs in two stages of human development. Children learn to talk first, and only


after representations for words and their meanings are well developed do they learn to read. It is also very likely that we have access to articulatory representations for whole words, in addition to our ability to use letter-to-sound correspondences, but there are no word level representations in the network. It is perhaps surprising that the network was capable of reaching a significant level of performance using a window of only seven letters. This approach would have to be generalized to account for prosodic features in continuous text, and a human level of performance would require the integration of information from several words at once.

NETtalk can be used as a research tool to explore many aspects of network coding, scaling, and training in a domain that is far from trivial. Those aspects of the network's performance that are similar to human performance are good candidates for general properties of network models; more progress may be made by studying these aspects in the small test laboratory that NETtalk affords. For example, we have shown elsewhere [62] that the optimal training schedule for teaching NETtalk new words is to alternate training of the new words with old words, a general phenomenon of human memory that was first demonstrated by Ebbinghaus [16] and has since been replicated with a wide range of stimulus materials and tasks [33,34,58,38,77,24]. Our explanation of this spacing effect in NETtalk [62] may generalize to more complex memory systems that use distributed representations to store information.

After training many networks, we concluded that many different sets of weights give about equally good performance. Although it was possible to understand the function of some hidden units, it was not possible to identify units in different networks that had the same function. However, the activity patterns in the hidden units were interpretable in an interesting way. Patterns of activity in groups of hidden units could be identified in different networks that served the same function, such as distinguishing vowels and consonants. This suggests that the detailed synaptic connectivity between neurons in cerebral cortex may not be helpful in revealing the functional properties of a neural network. It is not at the level of the synapse or the neuron that one should expect to find invariant properties of a network, but at the level of functional groupings of cells. We are continuing to analyze the hidden units and have found statistical patterns that are even more detailed than those reported here. Techniques that are developed to uncover these groupings in model neural networks could be of value in uncovering similar cell assemblies in real neural networks.

Acknowledgments

We thank Drs. Alfonso Caramazza, Francis Crick, Stephen Hanson, James McClelland, Geoffrey Hinton, Thomas Landauer, George Miller, David Rumelhart and Stephen Wolfram for helpful discussions about language and learning. We are indebted to Dr. Stephen Hanson and Andrew Olson, who made important contributions in the statistical analysis of the hidden


units. Drs. Peter Brown, Edward Carterette, Howard Nusbaum and Alex Waibel assisted in the early stages of development. Bell Communications Research generously provided computational support.

TJS was supported by grants from the National Science Foundation, System Development Foundation, Sloan Foundation, General Electric Corporation, Allied Corporation Foundation, Richard Lounsbery Foundation, Seaver Institute, and the Air Force Office of Scientific Research. CRR was supported in part by grants from the James S. McDonnell Foundation, research grant 487906 from IBM, by the Defense Advanced Research Projects Agency of the Department of Defense, the Office of Naval Research under Contracts Nos. N00014-85-C-0456 and N00014-85-K-0465, and by the National Science Foundation under Cooperative Agreement No. DCR-8420948 and grant number IST-8503968.


Appendix A. Representation of Phonemes and Punctuations

Phoneme  Sound    Articulatory Features
/a/      father   Low, Tensed, Central2
/b/      bet      Voiced, Labial, Stop
/c/      bought   Medium, Velar
/d/      debt     Voiced, Alveolar, Stop
/e/      bake     Medium, Tensed, Front2
/f/      fin      Unvoiced, Labial, Fricative
/g/      guess    Voiced, Velar, Stop
/h/      head     Unvoiced, Glottal, Glide
/i/      Pete     High, Tensed, Front1
/k/      Ken      Unvoiced, Velar, Stop
/l/      let      Voiced, Dental, Liquid
/m/      met      Voiced, Labial, Nasal
/n/      net      Voiced, Alveolar, Nasal
/o/      boat     Medium, Tensed, Back2
/p/      pet      Unvoiced, Labial, Stop
/r/      red      Voiced, Palatal, Liquid
/s/      sit      Unvoiced, Alveolar, Fricative
/t/      test     Unvoiced, Alveolar, Stop
/u/      lute     High, Tensed, Back2
/v/      vest     Voiced, Labial, Fricative
/w/      wet      Voiced, Labial, Glide
/x/      about    Medium, Central2
/y/      yet      Voiced, Palatal, Glide
/z/      zoo      Voiced, Alveolar, Fricative
/A/      bite     Medium, Tensed, Front2 + Central1
/C/      chin     Unvoiced, Palatal, Affricative
/D/      this     Voiced, Dental, Fricative
/E/      bet      Medium, Front1 + Front2
/G/      sing     Voiced, Velar, Nasal
/I/      bit      High, Front1
/J/      gin      Voiced, Velar, Nasal
/K/      sexual   Unvoiced, Palatal, Fricative + Velar, Affricative
/L/      bottle   Voiced, Alveolar, Liquid
/M/      abysm    Voiced, Dental, Nasal
/N/      button   Voiced, Palatal, Nasal
/O/      boy      Medium, Tensed, Central1 + Central2
/Q/      quest    Voiced, Labial + Velar, Affricative, Stop
/R/      bird     Voiced, Velar, Liquid


/S/      shin     Unvoiced, Palatal, Fricative
/T/      thin     Unvoiced, Dental, Fricative
/U/      book     High, Back1
/W/      bout     High + Medium, Tensed, Central2 + Back1
/X/      excess   Unvoiced, Affricative, Front2 + Central1
/Y/      cute     High, Tensed, Front1 + Front2 + Central1
/Z/      leisure  Voiced, Palatal, Fricative
/@/      bat      Low, Front2
/!/      Nazi     Unvoiced, Labial + Dental, Affricative
/#/      examine  Voiced, Palatal + Velar, Affricative
/*/      one      Voiced, Glide, Front1 + Low, Central1
/^/      but      Low, Central1
/-/      Continuation       Silent, Elide
/_/      Word Boundary      Pause, Elide
/./      Period             Pause, Full Stop

Symbol   Stress/Boundary     Features
<        Syllable Boundary   right
>        Syllable Boundary   left
1        Primary Stress      strong, weak
2        Secondary Stress    strong
0        Tertiary Stress     weak
-        Word Boundary       right, left, boundary

Output representations for phonemes, punctuations, and stresses on the 26 output units. The symbols for phonemes in the first column are a superset of ARPAbet and are associated with the sound of the italicized part of the adjacent word. Compound phonemes were introduced when a single letter was associated with more than one primary phoneme. Two or more of the following 21 articulatory feature units were used to represent each phoneme and punctuation: Position in mouth: Labial = Front1, Dental = Front2, Alveolar = Central1, Palatal = Central2, Velar = Back1, Glottal = Back2; Phoneme Type: Stop, Nasal, Fricative, Affricative, Glide, Liquid, Voiced, Tensed; Vowel Height: High, Medium, Low; Punctuation: Silent, Elide, Pause, Full Stop. The continuation symbol was used when a letter was silent. Stress and syllable boundaries were represented with combinations of 5 additional units, as shown at the end of the table. Stress was associated with vowels, and arrows were associated with the other letters. The arrows point toward the stress and change direction at syllable boundaries. Thus, the stress assignments for "atmosphere" are "1<>0>>>2<<". The phoneme and stress assignments were chosen independently.
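As a purely illustrative example of how rows of this table become training targets, the fragment below builds 26-unit target vectors for two entries; the index assignments for the feature units are our own assumption, since the paper does not give an ordering of the 21 articulatory-feature units and 5 stress units.

```c
#define N_OUT 26   /* 21 articulatory-feature units + 5 stress/boundary units */

/* Illustrative indices for a few of the feature units (ordering assumed). */
enum { UNVOICED = 0, VOICED, VELAR, STOP, HIGH, TENSED, FRONT1 /* ... */ };

/* Target vector for /k/ as in "Ken": Unvoiced, Velar, Stop. */
static const double target_k[N_OUT] = {
    [UNVOICED] = 1.0, [VELAR] = 1.0, [STOP] = 1.0
};

/* Target vector for /i/ as in "Pete": High, Tensed, Front1. */
static const double target_i[N_OUT] = {
    [HIGH] = 1.0, [TENSED] = 1.0, [FRONT1] = 1.0
};
```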

References

[1] D. H. Ackley, G. E. Hinton and T. J. Sejnowski, "A learning algorithm for Boltzmann machines", Cognitive Science, 9 (1985) 147-169.

[2] M. A. Arbib, Brains, Machines & Mathematics, 2nd edition, (McGraw-Hill Press, 1987).


[3] J. Allen, From Text to Speech: The MITalk System, (Cambridge University Press, 1985).

[4] J. R. Anderson, "Acquisition of cognitive skill", Psychological Review, 89 (1982) 369-406.

[5] D. H. Ballard, G. E. Hinton, and T. J. Sejnowski, "Parallel visual computation", Nature, 306 (1983) 21-26.

[6] H. B. Barlow, "Single units and sensation: A neuron doctrine for perceptual psychology?", Perception, 1 (1972) 371-394.

[7] A. G. Barto, "Learning by statistical cooperation of self-interested neuron-like computing elements", Human Neurobiology, 4 (1985) 229-256.

[8] E. B. Baum, J. Moody, and F. Wilczek, "Internal representations for associative memory", preprint, Institute for Theoretical Physics, University of California, Santa Barbara (1987).

[9] E. C. Carterette and M. G. Jones, Informal Speech, (Los Angeles: University of California Press, 1974).

[10] K. Church, "Stress assignment in letter to sound rules for speech synthesis", in Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (1985).

[11] P. S. Churchland, Neurophilosophy, (MIT Press, 1986).

[12] M. A. Cohen and S. Grossberg, "Absolute stability of global pattern formation and parallel memory storage by competitive neural networks", IEEE Transactions on Systems, Man and Cybernetics, 13 (1983) 815-825.

[13] L. N. Cooper, F. Liberman and E. Oja, "A theory for the acquisition and loss of neuron specificity in visual cortex", Biological Cybernetics, 33 (1979) 9-28.

[14] F. H. C. Crick and C. Asanuma, "Certain aspects of the anatomy and physiology of the cerebral cortex", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, edited by J. L. McClelland & D. E. Rumelhart, (MIT Press, 1986).

[15] Digital Equipment Corporation, "DECtalk DTC01 Owner's Manual", (Digital Equipment Corporation, Maynard, Mass.; document number EK-DTC01-OM-002).

[16] H. Ebbinghaus, Memory: A Contribution to Experimental Psychology, (Reprinted by Dover, New York, 1964; originally published 1885).

[17] J. Elman and D. Zipser, "University of California at San Diego, Institute for Cognitive Science Technical Report" (1985).

[18] B. Everitt, Cluster Analysis, (Heinemann: London, 1974).


[19] S. E. Fahlman, G. E. Hinton and T. J. Sejnowski, "Massively-parallel architectures for AI: NETL, THISTLE and Boltzmann Machines", Proceedings of the National Conference on Artificial Intelligence, (Washington, D.C., 1983) 109-113.

[20] J. A. Feldman, "Dynamic connections in neural networks", Biological Cybernetics, 46 (1982) 27-39.

[21] J. A. Feldman, "Neural representation of conceptual knowledge", Technical Report TR-189, University of Rochester Department of Computer Science (1986).

[22] J. A. Feldman and D. H. Ballard, "Connectionist models and their properties", Cognitive Science, 6 (1982) 205-254.

[23] A. L. Gamba, G. Gamberini, G. Palmieri, and R. Sanna, "Further experiments with PAPA", Nuovo Cimento Suppl., No. 2, 20 (1961) 221-231.

[24] A. M. Glenberg, "Monotonic and nonmonotonic lag effects in paired-associate and recognition memory paradigms", Journal of Verbal Learning and Verbal Behavior, 15 (1976) 1-16.

[25] M. A. Gluck and G. H. Bower, "From conditioning to category learning: An adaptive network model", in preparation.

[26] M. A. Gluck and R. F. Thompson, "Modeling the neural substrates of associative learning and memory: A computational approach", Psychological Review (1986).

[27] W. Haas, Phonographic Translation, (Manchester: Manchester University Press, 1970).

[28] D. O. Hebb, Organization of Behavior, (John Wiley & Sons, 1949).

[29] G. E. Hinton, "Learning distributed representations of concepts", Proceedings of the Eighth Annual Conference of the Cognitive Science Society, (Hillsdale, New Jersey: Erlbaum, 1986) 1-12.

[30] G. E. Hinton and J. A. Anderson, Parallel Models of Associative Memory, (Hillsdale, N.J.: Erlbaum Associates, 1981).

[31] G. E. Hinton and T. J. Sejnowski, "Optimal perceptual inference", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (Washington, D.C., 1983) 448-453.

[32] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, edited by J. L. McClelland & D. E. Rumelhart, (MIT Press, 1986).

[33] D. L. Hintzman, "Theoretical implications of the spacing effect", in Theories in Cognitive Psychology: The Loyola Symposium, edited by R. L. Solso, (Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1974).


[34] D. L. Hintzman, "Repetition and memory", in The Psychology of Learning and Motivation, edited by G. H. Bower, (Academic Press, 1976).

[35] J. J. Hopfield and D. Tank, "Computing with neural circuits: A model", Science, 233 (1986) 624-633.

[36] T. Hogg and B. A. Huberman, "Understanding biological computation", Proceedings of the National Academy of Sciences USA, 81 (1986) 6871-6874.

[37] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interactions, and functional architecture in the cat's visual cortex", Journal of Physiology, 160 (1962) 106-154.

[38] L. L. Jacoby, "On interpreting the effects of repetition: Solving a problem versus remembering a solution", Journal of Verbal Learning and Verbal Behavior, 17 (1978) 649-667.

[39] S. R. Kelso, A. H. Ganong, and T. H. Brown, "Hebbian synapses in hippocampus", Proceedings of the National Academy of Sciences USA, 83 (1986) 5326-5330.

[40] D. Klatt, "Software for a cascade/parallel formant synthesizer", Journal of the Acoustical Society of America, 67 (1980) 971-995.

[41] C. Koch, J. Marroquin, and A. Yuille, Proceedings of the National Academy of Sciences USA, 83 (1986) 4263-4267.

[42] T. Kohonen, Self-Organization and Associative Memory, (New York: Springer Verlag, 1984).

[43] H. Kucera and W. N. Francis, Computational Analysis of Modern-Day American English, (Providence, Rhode Island: Brown University Press, 1967).

[44] Y. Le Cun, "A learning procedure for asymmetric network", Proceedings of Cognitiva (Paris), 85 (1985) 599-604.

[45] S. Lehky and T. J. Sejnowski, "Computing shape from shading with a neural network model", in preparation.

[46] W. B. Levy, J. A. Anderson and W. Lehmkuhle, Synaptic Change in the Nervous System, (Hillsdale, New Jersey: Erlbaum, 1984).

[47] H. C. Longuet-Higgins, "Holographic model of temporal recall", Nature, 217 (1968) 104-107.

[48] J. M. Lucassen and R. L. Mercer, "An information theoretic approach to the automatic determination of phonemic baseforms", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (1984) 42.5.1-42.5.4.

[49] G. Lynch, Synapses, Circuits, and the Beginnings of Memory, (MIT Press, 1986).


[50] J. L. McClelland and D. E. Rumelhart, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, (MIT Press, 1986).

[51] D. Marr and T. Poggio, "Cooperative computation of stereo disparity", Science, 194 (1976) 283-287.

[52] W. S. McCulloch and W. H. Pitts, "A logical calculus of ideas immanent in nervous activity", Bull. Math. Biophysics, 5 (1943) 115-133.

[53] M. Minsky and S. Papert, Perceptrons, (MIT Press, 1969).

[54] V. B. Mountcastle, "An organizing principle for cerebral function: The unit module and the distributed system", in The Mindful Brain, edited by G. M. Edelman & V. B. Mountcastle, (MIT Press, 1978).

[55] D. A. Norman, Learning and Memory, (San Francisco: W. H. Freeman, 1982).

[56] G. Palm, "On representation and approximation of nonlinear systems, Part II: Discrete time", Biological Cybernetics, 34 (1979) 49-52.

[57] D. B. Parker, "A comparison of algorithms for neuron-like cells", in Neural Networks for Computing, edited by J. S. Denker, (New York: American Institute of Physics, 1986).

[58] L. R. Peterson, R. Wampler, M. Kirkpatrick and D. Saltzman, "Effect of spacing presentations on retention of a paired-associate over short intervals", Journal of Experimental Psychology, 66 (1963) 206-209.

[59] R. W. Prager, T. D. Harrison, and F. Fallside, "Boltzmann machines for speech recognition", Cambridge University Engineering Department Technical Report TR.260 (1986).

[60] R. A. Rescorla and A. R. Wagner, "A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement", in Classical Conditioning II: Current Research and Theory, edited by A. H. Black & W. F. Prokasy, (New York: Appleton-Crofts, 1972).

[61] E. T. Rolls, "Information representation, processing and storage in the brain: Analysis at the single neuron level", in Neural and Molecular Mechanisms of Learning, (Berlin: Springer Verlag, 1986).

[62] C. R. Rosenberg and T. J. Sejnowski, "The spacing effect on NETtalk, a massively-parallel network", Proceedings of the Eighth Annual Conference of the Cognitive Science Society, (Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1986) 72-89.

[63] F. Rosenblatt, Principles of Neurodynamics, (New York: Spartan Books, 1959).


[64] P. S. Rosenbloom and A. Newell, "The chunking of goal hierarchies: A generalized model of practice", in Machine Learning: An Artificial Intelligence Approach, Vol. II, edited by R. S. Michalski, J. G. Carbonell & T. M. Mitchell, (Los Altos, California: Morgan Kaufmann, 1986).

[65] D. E. Rumelhart, "Presentation at the Symposium on Connectionism: Multiple Agents, Parallelism and Learning", Geneva, Switzerland (September 9-12, 1986).

[66] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, edited by D. E. Rumelhart & J. L. McClelland, (MIT Press, 1986).

[67] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, (MIT Press, 1986).

[68] D. E. Rumelhart and J. L. McClelland, "On learning the past tenses of English verbs", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, edited by J. L. McClelland & D. E. Rumelhart, (MIT Press, 1986).

[69] E. Saund, "Abstraction and representation of continuous variables in connectionist networks", Proceedings of the Fifth National Conference on Artificial Intelligence, (Los Altos, California: Morgan Kaufmann, 1986) 638-644.

[70] T. J. Sejnowski, "Skeleton filters in the brain", in Parallel Models of Associative Memory, edited by G. E. Hinton & J. A. Anderson, (Hillsdale, N.J.: Erlbaum Associates, 1981).

[71] T. J. Sejnowski, "Open questions about computation in cerebral cortex", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, edited by J. L. McClelland & D. E. Rumelhart, (MIT Press, 1986).

[72] T. J. Sejnowski and G. E. Hinton, "Separating figure from ground with a Boltzmann Machine", in Vision, Brain & Cooperative Computation, edited by M. A. Arbib & A. R. Hanson, (MIT Press, 1987).

[73] T. J. Sejnowski, P. K. Kienker and G. E. Hinton, "Learning symmetry groups with hidden units: Beyond the perceptron", Physica, 22D (1986).

[74] T. J. Sejnowski and C. R. Rosenberg, "NETtalk: A parallel network that learns to read aloud", Johns Hopkins University Department of Electrical Engineering and Computer Science Technical Report 86/01 (1986).

[75] T. J. Sejnowski and C. R. Rosenberg, "Connectionist models of learning", in Perspectives in Memory Research and Training, edited by M. S. Gazzaniga, (MIT Press, 1986).


[76] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, edited by J. L. McClelland & D. E. Rumelhart, (MIT Press, 1986).

[77] R. D. Sperber, "Developmental changes in effects of spacing of trials in retardate discrimination learning and memory", Journal of Experimental Psychology, 103 (1974) 204-210.

[78] L. R. Squire, "Mechanisms of memory", Science, 232 (1986) 1612-1619.

[79] R. S. Sutton and A. G. Barto, "Toward a modern theory of adaptive networks: Expectation and prediction", Psychological Review, 88 (1981) 135-170.

[80] G. Tesauro, "Simple neural models of classical conditioning", Biological Cybernetics, 55 (1986) 187-200.

[81] G. Tesauro and T. J. Sejnowski, "A parallel network that learns to play backgammon", in preparation (1987).

[82] G. Toulouse, S. Dehaene, and J. P. Changeux, "Spin glass model of learning by selection", Proceedings of the National Academy of Sciences USA, 83 (1986) 1695-1698.

[83] R. L. Venezky, The Structure of English Orthography, (The Hague: Mouton, 1970).

[84] R. L. Venezky and D. Johnson, "Development of two letter-sound patterns in grades one through three", Journal of Educational Psychology, 64 (1973) 109-115.

[85] C. von der Malsburg and E. Bienenstock, "A neural network for the retrieval of superimposed connection patterns", in Disordered Systems and Biological Organization, edited by F. Fogelman, F. Weisbuch & E. Bienenstock, (Springer-Verlag: Berlin, 1986).

[86] R. L. Watrous, L. Shastri, and A. H. Waibel, "Learned phonetic discrimination using connectionist networks", University of Pennsylvania Department of Electrical Engineering and Computer Science Technical Report (1986).

[87] B. Widrow and M. E. Hoff, "Adaptive switching circuits", Institute of Radio Engineers Western Electronic Show and Convention, Convention Record 4 (1960) 96-104.

[88] D. Willshaw, "Holography, associative memory, and inductive generalization", in Parallel Models of Associative Memory, edited by G. E. Hinton & J. A. Anderson, (Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1981).

[89] D. Zipser, "Programming networks to compute spatial functions", ICS Technical Report, Institute for Cognitive Science, University of California at San Diego (1986).