
Neural architectures underlying language acquisition

A simulation approach combining neural networks and genetic algorithms

A.J. van der Meij
5996287

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dhr. prof. dr. J.M.J. Murre
Brain and Cognition
Department of Psychology
Weesperplein 4
1018 XA Amsterdam

July 24th, 2012


Copyright © A.J. van der Meij. This thesis and the work described herein were created exclusively using free open-source software.


Abstract

The design of neural networks can be challenging. Many of the initial parameters of a network can severely affect its potential for learning. In this thesis, a language acquisition task is considered. It has been shown that some areas of the brain operate using specialised modules. Some researchers have argued that there also exists an innate neural faculty that allows us to learn grammar at a young age. Through genetic search, a variety of neural network architectures are trained to learn a set of grammatical rules. Both their ability to generalise on previously unseen items and their potential to learn additional languages are evaluated. The properties of the best performing architectures are analysed in order to find common design properties that provide advantages in language learning.

Keywords. neural networks, language acquisition, genetic algorithms, modu-larity, Elman networks.


Acknowledgements

First and foremost, I would like to thank my supervisor, Jaap Murre, for his insights and expertise. When taking the first steps in an unfamiliar forest, it is easy to get lost. The sheer number of trees can at times be overwhelming. I thank dhr. Murre for pointing me in the right direction when necessary.

I would also like to thank Margriet Lok. Margriet is a trusted friend, talented teacher and someone whose mastery of the English language I dare not challenge. I thank her for proofreading this thesis.

I thank the SARA institute for allowing me access to their Lisa cluster. Someday, we will laugh at the thought of systems with only a few thousand processor cores, but not today. The computing power of Lisa has allowed me to greatly widen the breadth of my simulations.

I thank Anouk Lasairiona for confirming that the grammar production rules used to generate Japanese relative clauses have some basis in reality.

Finally, I thank the dark poet, Bill Hicks, for his perspective and comic relief during this period. Don't worry, don't be afraid, ever, because, this is just a ride.

A.J. van der Meij, July 24th, 2012


Contents

1 Introduction

2 Related work

3 Background
  3.1 Artificial neural networks
      3.1.1 Artificial neurons
      3.1.2 Network topology
      3.1.3 Learning
      3.1.4 Elman networks
      3.1.5 Modular networks
      3.1.6 Design considerations
  3.2 Genetic algorithms
      3.2.1 Reproduction, Crossover & Mutation
      3.2.2 Advantages of genetic search

4 Methodology
  4.1 Task and stimuli
      4.1.1 Grammatical properties
      4.1.2 Representation
      4.1.3 Context-dependent likelihood vectors
  4.2 Encoding of the modular architecture
      4.2.1 Architecture constraints
      4.2.2 Chromosome encoding

5 Results
  5.1 Language generalisation
      5.1.1 Basic sentence structures
      5.1.2 Complex sentence structures
  5.2 Knowledge transfer

6 Conclusion

7 References

A Grammars
  A.1 English
  A.2 Dutch
  A.3 Japanese


1 Introduction

The acquisition of language is an area of research that has historically troubled many interested parties. The many intricacies involved with natural language often hinder straightforward (computational) modelling.

Despite the complex nature of language, a developing child is able to learn language at an age when other higher cognitive abilities are not yet fully developed. This has led some researchers to suggest that there exists an innate neural component specifically dedicated to language acquisition.

The existence of such a faculty has surfaced in academic discussions in many forms over the years. A prominent example, known as the language acquisition device (LAD), was postulated by linguist Noam Chomsky. The LAD was an element of his Universal Grammar theory, which proposed that many aspects of language are governed by an innate process. This theory was based on the premise that all natural languages seem to share underlying principles. Chomsky argued that an innate process must exist for languages to develop such similarities. He has since gradually abandoned this theory in favour of other frameworks of language, but remains supportive of the nativist view of language acquisition.

The question as to what degree the mind is modular remains an open topic in the debate on developmental theories of cognition. Neurological research seems to indicate that the human brain is modular to at least some extent. The primary visual cortex is a well studied area that seems to be organised into functional modules, each functioning independently to process one of the sub-tasks involved in vision. Other functionally specialised areas have also been identified [12].

Happel & Murre explored the question of modularity in a 1994 paper [9] in which they trained modular neural networks to learn a visual task. Neural networks are computational models for learning that are inspired by the workings of the brain; their basic concepts are explained in Section 3. Remarking that a general theory for the design of task-specific network configurations does not exist, they opted to use genetic search methods instead. Happel & Murre found that the initial design properties not only affected the network's potential for learning, but also its ability to generalise. Furthermore, the best performing architectures seemed to have incorporated some properties of the biological visual system.

This thesis will investigate whether general design properties exist for language acquisition tasks. By using genetic search algorithms, a wide variety of network architectures will be generated and evaluated on our task. Using these results, we hope to find networks with architectural properties that enhance their ability to learn the language task. It is hoped that these architectures reveal some underlying design pattern that they all share.


2 Related work

Early work on grammatical inference by neural networks used a localist approach. In this approach, every lexical concept was represented by its own node in the network. Concepts that were linked by some relationship, such as their spatial assignment [13, 17] or concurrent activation [16], were bound together using one or more additional concept nodes. Together, these concept nodes formed an input representation layer, providing the network with knowledge of correct word structures.

A disadvantage of localist networks is their scalability. A network where each concept is represented using one or multiple nodes would see an exponential growth in nodes as the language task increases in complexity.

Another concern regarding scalability is that the input representation layer is formed by the creator of the network. An increase in task complexity also increases the time required to prepare this layer.

An alternative approach to learning in neural networks is to use distributed representations with a learning algorithm. Unlike the localist approach, this approach does not require representations that are defined a priori and explicitly. Instead, the representations are formed as an emergent property of the internal network through training.

Traditional network implementations, however, struggle with the temporal nature of language. Some attempts were made to present several sequences at once to the network by assigning each of them to part of the input nodes. This, however, led to complications in processing sentences with input vectors of diverse length: the network's input layer would always have to be able to account for the longest sentence [4].

In the 1991 paper Distributed representations, simple recurrent networks and grammatical structure [6], Jeff Elman used a simple recurrent network architecture that would later become known as the Elman network. He used these networks to learn grammatical sentences of varied complexity. His networks included a mechanism that allowed the network to use its earlier internal states at subsequent iterations. This gave the network some idea of context when input was presented to it sequentially.

Elman evaluated his networks on sentences that included syntactic structure, requiring the network to infer hierarchical and recursive relationships in order to be successful. By analysing the internal (hidden) representations of the trained network, he concluded that the network had learned to deduce the abstract grammatical rules that governed his input.


3 Background

3.1 Artificial neural networks

Artificial neural networks are computational models used for adaptive learning. They are inspired by biological data, but do not presume to be precise models. Rather, they employ the general principles of neural transmission to learn complex functions.

3.1.1 Artificial neurons

A neural network is comprised of artificial neurons, commonly referred to as units or nodes. Each unit j is linked with a fixed number n of input units i, each connected to it through a weight w. The activation output of a neuron depends on the activation values of its predecessors and the strength of their weights to the neuron j.

Figure 3.1: Model of an artificial neuron, from [18]

The activation value of a unit is calculated using a two-step process. First, the net input value is determined by taking the weighted sum of the activation values of the input units and the strengths of their weights to unit j, as per formula 3.1.

in(j) = \sum_i (a_i \times w_{i,j})   (3.1)

The resulting net input value is then passed through a (non-linear) function known as the activation function. This function serves as a squashing mechanism that restricts the resulting activation value to a certain numerical interval. Most networks apply a sigmoid activation function, which allows the internal units of the network to respond non-linearly to their input.

The suitable activation function depends on the type of problem and the value range of its input patterns. Two common choices are the logistic function, which squashes to the interval 0 < x < 1, and the hyperbolic tangent function, which squashes to the interval −1 < x < 1.
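As a concrete illustration, the sketch below implements formula 3.1 together with the logistic squashing function in Python. The function names, example inputs and weights are purely illustrative; the hyperbolic tangent variant would simply substitute math.tanh for the logistic function.

import math

def logistic(x):
    # Squashes the net input into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def unit_activation(inputs, weights, activation=logistic):
    # Formula 3.1: weighted sum of the predecessor activations,
    # followed by the (non-linear) activation function.
    net_input = sum(a_i * w_ij for a_i, w_ij in zip(inputs, weights))
    return activation(net_input)

# Example: one unit j with three incoming connections.
print(unit_activation([0.2, 0.9, 0.5], [0.1, -0.4, 0.7]))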


3.1.2 Network topology

A neural network consists of a limited number of artificial neurons that are structurally connected. A wide variety of different topologies exist, but some general classifications can be made.

Figure 3.2: Feed-forward architecture
Figure 3.3: Recurrent architecture

Feed-forward networks, shown in Figure 3.2, consist of layers that each contain a limited number of units. A typical configuration uses an input layer i and an output layer o, with one or multiple hidden layers h in between. There is a uni-directional data flow from the input to the output layer. Networks of this type do not include any feedback connections: a given unit only receives weight connections from units in the preceding layer.

Recurrent networks, shown in Figure 3.3, are a variation on the feed-forward paradigm that includes feedback loops. They contain units which have weight connections to themselves or to units earlier in the otherwise feed-forward cycle. This gives the network the ability to maintain states from earlier cycles, effectively creating a limited short-term memory system.

3.1.3 Learning

A neural network is able to produce the appropriate response to an input pattern when the internal weights are calibrated in a manner that propagates the correct activation values. In a supervised learning task, the accuracy of a response can be measured by comparing the desired output values d with the actual output values a. The error of a network is typically calculated using a sum-of-squares error measure, as shown in formula 3.2.

E = \frac{1}{2} \sum_u (d_u - a_u)^2   (3.2)


A popular learning rule for training a neural network is the back-propagation algorithm (a more detailed description of the algorithm can be found in textbooks such as Connectionism and the Mind by Bechtel & Abrahamsen [2]). Back-propagation employs a form of gradient descent optimisation to reduce the total error of the network over a series of training iterations.

Training is initialised after a pattern has been presented and propagated through the network layers. After the error of the network is determined, the back-propagation algorithm executes a backwards sweep through the layers of the network.

The network's total error is used to determine the individual error of each weight connected to the output layer. These weights are adjusted by a margin proportionate to the learning rate η. The errors for units in the other layers are determined recursively, using the error values of the units to which they connect and the weights of their connections.

When η is high, the initial error reduction can be quick, but the network might overshoot a state of minimal error. A low learning rate causes the network to learn slowly, requiring more training cycles.
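To make the error measure and the weight update concrete, the sketch below shows formula 3.2 and a standard gradient-descent update for the weights feeding a single logistic output unit. The function names, learning rate value and example numbers are illustrative and not taken from the thesis.

def network_error(desired, actual):
    # Formula 3.2: half the sum of squared differences between the
    # desired and actual output values.
    return 0.5 * sum((d - a) ** 2 for d, a in zip(desired, actual))

def update_output_weights(weights, hidden_acts, desired, actual, eta=0.1):
    # Error signal for a logistic output unit: derivative of the squared
    # error multiplied by the derivative of the logistic function.
    delta = (desired - actual) * actual * (1.0 - actual)
    # Each weight is adjusted by a margin proportionate to eta.
    return [w + eta * delta * a for w, a in zip(weights, hidden_acts)]

print(network_error([1.0, 0.0], [0.8, 0.3]))
print(update_output_weights([0.2, -0.1], [0.6, 0.4], desired=1.0, actual=0.8))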

3.1.4 Elman networks

Simple recurrent networks combine elements of both the feed-forward and recurrent topologies described in Section 3.1.2. Unlike other networks with recurrent connections, this class of network uses sequential cycles and can implement the default back-propagation algorithm.

The Elman network [5], or simple recurrent network, was developed in 1990 to pursue grammatical inference tasks. The defining feature of the Elman architecture is the addition of a context layer to an otherwise typical feed-forward layout.

Figure 3.4: Layers of the Elman network.

The units in the context layer function as an extra input layer to the hidden units. Unlike the input units, however, they are not accessible from outside the network. Instead, they serve to store the activation values of the hidden units after each iteration. The Elman training procedure works as follows (a sketch in code is given after the list):

1. The context unit activations are set to 0.

2. A pattern p is fed forward through the network.

3. The back-propagation learning rule is applied.

4. The hidden unit activations are copied to their corresponding context units.

5. Steps 2-4 are repeated for all patterns p.
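The sketch below shows steps 1-4 for one pass over a sequence of input patterns. The layer sizes, random weights and dummy patterns are illustrative, and the back-propagation step of the real procedure is only indicated by a comment.

import math, random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def elman_forward(pattern, context, w_in, w_ctx, w_out):
    # The hidden units receive the current input pattern plus the
    # previous hidden state stored in the context units.
    hidden = [logistic(sum(p * w for p, w in zip(pattern, w_in[j])) +
                       sum(c * w for c, w in zip(context, w_ctx[j])))
              for j in range(len(w_in))]
    output = [logistic(sum(h * w for h, w in zip(hidden, w_out[k])))
              for k in range(len(w_out))]
    return hidden, output

n_in, n_hidden, n_out = 24, 10, 24
rnd = random.Random(0)
w_in  = [[rnd.uniform(0, 0.1) for _ in range(n_in)]     for _ in range(n_hidden)]
w_ctx = [[rnd.uniform(0, 0.1) for _ in range(n_hidden)] for _ in range(n_hidden)]
w_out = [[rnd.uniform(0, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]

patterns = [[0.0] * n_in for _ in range(3)]   # dummy word vectors
context = [0.0] * n_hidden                    # step 1: context reset to 0
for p in patterns:
    hidden, output = elman_forward(p, context, w_in, w_ctx, w_out)  # step 2
    # step 3: apply the back-propagation learning rule (omitted here)
    context = hidden                          # step 4: copy hidden to context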

Elman networks have been shown to be able to learn simple language tasks [15].

However, the feedback cycle that exists through the context layer only provides the network with limited temporal capabilities. Patterns with a recursive character above a certain complexity level are generally not handled correctly.

3.1.5 Modular networks

The network architectures that have been discussed so far all consisted of a single network module. Networks with this property will from now on be referred to as monolithic.

Modular neural network (MNN) is a somewhat ambiguous term. It denotes not a defined network architecture, but rather a design principle that may be implemented in many forms. The core concept behind an MNN is the use of several independently functioning network modules that are connected together in a broader architecture.

Using independent modules leads to two advantages. The modules are unable to interfere with each other, which reduces the negative effects that new data can have on an already trained network. Furthermore, the function to be learned can be divided into sub-tasks, emulating the functional specialisation of some cognitive tasks.

3.1.6 Design considerations

Both the overall architecture and the internal parameters of a neural network can affect its potential for learning. Worse still, an ill-chosen configuration can cause a network to learn too slowly, or prevent it from converging to an error minimum at all.

Weight configuration The rate of convergence of a neural network is known to be sensitive to the initial values of its weights. Networks are usually initialised with random weights, adding an element of randomness to the speed of convergence. Restricting the initial value range can help a network converge faster. Another option is modifying the learning rate η to increase the weight adjustments per iteration.
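A minimal sketch of such an initialisation, assuming the restricted range 0 ≤ w ≤ 0.1 that is also used for the simulations in Section 5; the layer dimensions and the learning rate value are illustrative.

import random

def init_weights(n_weights, low=0.0, high=0.1, seed=None):
    # Drawing the initial weights from a narrow interval reduces the
    # variance that random initialisation adds to the speed of convergence.
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n_weights)]

weights = init_weights(24 * 10, seed=42)   # e.g. a 24-input, 10-hidden layer
eta = 0.1                                  # illustrative learning rate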


Number of hidden units Another challenge in neural network design is choosing the right number of hidden units. When the data is too complex for the number of hidden units, the network will return both a high training error and a high generalisation error. Too many hidden units, on the other hand, risk over-fitting the training data, leading to poor generalisation.

3.2 Genetic algorithms

Genetic algorithms are biologically inspired search methods based on the principles of natural selection. They combine the genetic encoding of information with a survival-of-the-fittest strategy to converge to optimal search results.

A genetic algorithm approach requires potential solutions to the search problem at hand to be represented as a set of parameters. These parameters are encoded as genes within a finite-length string known as a chromosome. Binary encodings are often used, as they allow easy manipulation of the individual bits. Also, their limited representational ability per bit leads to a large number of unique chromosomes.

Figure 3.5: Elements of a genetic search approach, from [18]

An initial population (a) of chromosomes is generated randomly. Each chromosome represents a unique solution to the problem. Each chromosome is evaluated and assigned a fitness score (b) based on its performance. The fitness score represents the chromosome's chance of being selected as a parent for the next generation, and is usually based on some measure of performance of the candidate solution in relation to the problem.

3.2.1 Reproduction, Crossover & Mutation

New chromosomes in subsequent generations are formed through three main operators.

Selection (c) is the reproduction mechanism that determines which chromosomes from the previous generation are allowed to pass their genes on to form new chromosomes. A common strategy is roulette wheel selection, in which the chance of a chromosome being chosen is directly proportional to its fitness score.


Crossover (d) forms new strings by allowing two chromosomes selected by reproduction to mate. This operator selects a crossover point at a random position in the original strings and swaps all bits after that point. The two resulting strings after each crossover are added to the new generation of chromosomes.

Mutation (e) alters one or more gene values in selected chromosomes. As a result, new unique chromosomes are formed that might prove to have a beneficial mutation. Mutation should be configured as a low-probability event, to avoid introducing too much randomness into the otherwise directed search.
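A minimal sketch of the three operators on binary strings follows. The 86-bit chromosome length matches the encoding introduced later in Section 4.2.2, but the population size, mutation probability and placeholder fitness scores are purely illustrative.

import random

def select(population, fitness, rng):
    # Roulette-wheel selection: the chance of being chosen is directly
    # proportional to the chromosome's fitness score.
    return rng.choices(population, weights=fitness, k=1)[0]

def crossover(parent_a, parent_b, rng):
    # Single-point crossover: swap all bits after a random position.
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, rng, p=0.01):
    # Low-probability bit flips add a degree of random exploration.
    return ''.join('10'[int(bit)] if rng.random() < p else bit
                   for bit in chromosome)

rng = random.Random(0)
population = [''.join(rng.choice('01') for _ in range(86)) for _ in range(20)]
fitness = [rng.random() for _ in population]      # placeholder fitness scores
a, b = select(population, fitness, rng), select(population, fitness, rng)
child_a, child_b = crossover(a, b, rng)
child_a = mutate(child_a, rng)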

3.2.2 Advantages of genetic search

The goal of a search algorithm, or any optimisation method, is to find solutions that satisfy some level of performance. Such methods often employ a process referred to as hill climbing, in which the search moves in the direction of the local gradient.

A disadvantage of hill climbing is its tendency to get stuck in local maxima. Figure 3.6 illustrates how the initial search state can limit the success of a search routine.

Figure 3.6: Hill climbing through a complex search space, from [18]

The advantage of using a genetic search approach is that, by its inherent design, the search space is searched from multiple initial states. Increasing the size of the initial population of chromosomes also increases the chance of starting in a state near the global maximum. Furthermore, the flexible nature of the crossover and mutation operators complements the hill-climbing approach with some degree of random exploration.


4 Methodology

4.1 Task and stimuli

Three languages were considered for our simulations: English, Dutch and Japanese. A limited grammar, given in Appendix A, was used for each language. These three languages were selected for their differences in syntax for the type of sentences that were examined.

4.1.1 Grammatical properties

Both English and Dutch use the Subject Verb Object (SVO) constituent word order in basic sentences with a single clause. Japanese, however, uses the Subject Object Verb (SOV) order for similar sentences. An example of this type of sentence is shown in Table 4.1. We will refer to such sentences as basic sentences.

Language    Sentence*          Order
English     boy reads book     SVO
Dutch       boy reads book     SVO
Japanese    boy book reads     SOV

Table 4.1: Word order for basic sentences

Differences in word order become more apparent for sentences that include a relative clause. The sentences used in our dataset include a relative clause that provides additional descriptive information on the subject of the sentence. This type is referred to as a complex sentence in our dataset. Table 4.2 provides an example of this type.

Language    Sentence*                        Order
English     boy who sees girl reads book     SVO
Dutch       boy who girl sees reads book     SVO/SOV
Japanese    book reads boy girl sees         SOV

Table 4.2: Word order for complex sentences

Unlike its English counterpart, most Dutch sentences of this type use an SOV order within the relative clause. A similar statement in Japanese diverges even more, as relative pronouns do not exist in that language at all. The complete production rules that dictate correct syntax for each language can be found in Appendix A.

4.1.2 Representation

The lexicon for each language considered consisted of 24 items. Each item was represented using a 24-bit orthogonal vector in which a single bit was set to 1.


Figure 4.1: Categorical overview of the lexicon

Figure 4.1 illustrates the word categories that the items in the lexicon belong to. The misc category represents the relative pronoun and the dot character that was used as a stop symbol. It should be emphasised that none of the categorical or syntactic information is part of the representation.
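A sketch of this one-hot representation follows. The 24 lexical items are taken from the English grammar in Appendix A (11 nouns, 8 verbs, 3 proper nouns, the relative pronoun and the stop symbol), but their ordering within the vector is an assumption.

LEXICON = ["boy", "girl", "cat", "dog", "book", "banana", "phone", "shoe",
           "science", "philosophy", "good",                       # nouns
           "sees", "hears", "eats", "reads", "feeds", "uses",
           "enjoys", "tastes",                                    # verbs
           "Alexander", "Anneloes", "Jaap",                       # proper nouns
           "who", "."]                                            # misc

def encode(word):
    # 24-bit orthogonal vector: a single bit set to 1, carrying no
    # categorical or syntactic information whatsoever.
    vector = [0] * len(LEXICON)
    vector[LEXICON.index(word)] = 1
    return vector

print(encode("boy"))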

4.1.3 Context-dependent likelihood vectors

The nature of our language task does not allow us to use the typical error measure described in Section 3.1.3. The network is trained to predict legal successors in the context of a sentence, and in most cases several words can be legal successors given an identical context. We instead test the network's ability to approximate the probabilities of each word occurring. This implies that we cannot use straightforward single-word target vectors as our measure for evaluation.

Algorithm 4.1 Generate context-dependent likelihood vectors
    context ← {}
    n ← max(length)
    for i ← 0 to n do
        total ← count(sentences where length ≥ i)
        for each word ← sentence[i + 1] do
            likelihood ← count(word at position i + 1) / total
            context ← context + (sentence(0 → i), likelihood)

Instead, a context-dependent likelihood (CDL) vector is created as a target for each valid sentence context. The positions of the vector correspond with those of the input patterns. Each value in the 24-bit vector represents the probability of that word occurring, given its frequency in that particular context. The algorithm is shown as pseudo-code in Algorithm 4.1.
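The sketch below gives one plausible reading of Algorithm 4.1 in Python: for every sentence prefix, the frequencies of the possible successor words are normalised into a probability vector. The toy corpus and the reduced lexicon are illustrative only.

from collections import Counter, defaultdict

def cdl_vectors(sentences, lexicon):
    # Count, for every sentence prefix (the context), how often each
    # word follows it, then normalise the counts into probabilities.
    followers = defaultdict(Counter)
    for sentence in sentences:
        for i in range(len(sentence)):
            followers[tuple(sentence[:i])][sentence[i]] += 1
    targets = {}
    for prefix, counts in followers.items():
        total = sum(counts.values())
        targets[prefix] = [counts[w] / total for w in lexicon]
    return targets

sentences = [["boy", "reads", "book", "."],
             ["boy", "sees", "girl", "."],
             ["girl", "eats", "banana", "."]]
lexicon = sorted({w for s in sentences for w in s})
vectors = cdl_vectors(sentences, lexicon)
print(vectors[("boy",)])   # likelihood of each word directly following "boy"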

4.2 Encoding of the modular architecture

4.2.1 Architecture constraints

The architectures of the modular networks used in our experiments were constrained in several ways to reduce complexity. All generated architectures consisted of two levels comprised of simple recurrent networks. Modules on the first level acted as input modules, whereas a single module was allowed on the second level, acting as the decision output module.

Figure 4.2: Network architecture blueprint

Each input module m had a variable number of input, hidden and output units. The number of input units determined the size of the partial input pattern P that a particular module would receive. The total sum of input units was enforced to match the dimensions of the input pattern. The number of input units for the decision module was restricted to match the total number of input-module output units.

n modules       2      3      4      5      6      >7
% of strings    0.05   0.19   0.35   0.29   0.10   0.02

Table 4.3: Likelihood of number of modules by string percentage

The implementation used did not limit the number of input modules. However, the particular format used by the genetic encoding of this value did make some values more likely to occur than others, as shown in Table 4.3.

4.2.2 Chromosome encoding

Each modular architecture was encoded using a binary string of 86 bits. These bits served to set four different parameter groups.


Segment 1
encoding        1110 011110 010 1110
relative size   [4, 6, 3, 4]
actual size     [6, 8, 4, 6]

Segment 1: 18 bits. The first segment determined the number of input modules and the size of their input layers. The algorithm to determine these values was inspired by the method used by Happel & Murre [9]. Each 18-bit string was divided into n substrings of varying size, using the pattern "10" (a one followed by a zero) as a delimiter. The number of modules was set to n, with each module having a relative size equal to the length of the substring that depicted it. The actual number of input units for each module was then calculated by scaling the relative sizes so that their sum matched the size of the input pattern vector.
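A sketch of this decoding step, reproducing the worked example above; how trailing bits without a closing "10" are treated and how rounding errors are corrected are assumptions, since the thesis does not spell these details out.

def decode_segment1(bits, n_inputs=24):
    # Cut the segment after every occurrence of the pattern "10"; the
    # length of each resulting substring is one module's relative size.
    # Trailing bits without a closing "10" are ignored (assumption).
    sizes, start = [], 0
    for i in range(1, len(bits)):
        if bits[i - 1:i + 1] == "10":
            sizes.append(i + 1 - start)
            start = i + 1
    # Scale the relative sizes so the input units sum to n_inputs,
    # pushing any rounding residue onto the largest module (assumption).
    total = sum(sizes)
    scaled = [round(s * n_inputs / total) for s in sizes]
    scaled[scaled.index(max(scaled))] += n_inputs - sum(scaled)
    return scaled

# Relative sizes [4, 6, 3, 4] scale to actual sizes [6, 8, 4, 6].
print(decode_segment1("1110" + "011110" + "010" + "1110"))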

Segment 2 & 3
encoding     1101 11 0 11 0000 1001101100000111000
operators    [+, +, -, +]
diffvalues   [2, 1, 2, 4]

Segment 2 & 3: 32 bits each. The next two sets of bits determined the number of hidden units (segment 2) and the number of output units (segment 3). Their interpretation depended on the values found in segment 1. Based on the number n of modules to be created, the first n bits of each string were reserved to function as operators. The remainder of the string acted as data. This data string was iterated bit by bit and divided each time the bit value changed; the length of each resulting substring acted as a diffvalue.

Using a base number b derived from the size of the input layer, the number of hidden and output units of each module was determined by applying its operator and diffvalue to b. A constraint was in place to avoid negative values by taking the absolute value of each result. Any unused remainder of the data string was disregarded.

Segment 4: 4 bits. The last 4 bits determined the number of hidden units for the decision module. This value was calculated by counting the number of "1" values in the 4-bit string and adding that count to a base value that was set to 2. Preliminary simulations seemed to show that 2-6 hidden units were sufficient for this module.
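This last decoding step is simple enough to show in a few lines; only the example bit string is made up.

def decode_segment4(bits, base=2):
    # Hidden units of the decision module: base value 2 plus the
    # number of '1' bits in the 4-bit segment, giving 2-6 units.
    return base + bits.count("1")

print(decode_segment4("1011"))   # -> 5 hidden units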


5 Results

5.1 Language generalisation

A set of networks using either a monolithic or a modular architecture was generated and evaluated on their ability to generalise to new data. Sentences from the English dataset were used for this set of simulations.

Each network was trained for 100 iterations. This relatively low number of cycles was used to reduce overall training time. A low initial weight range of 0 ≤ x ≤ 0.1 was used to minimise the effect of random weight configurations on initial performance.

5.1.1 Basic sentence structures

Only the set of basic sentences was considered for this experiment. 400 monolithic networks were randomly created using a trial-and-error approach; preliminary simulations indicated between 4 and 20 hidden units as appropriate.

Nearly 3000 modular architectures were generated over three generations. Subsequent generations did not show significant improvement in average fitness scores. This suggests that our population size was either too limited, or that the combination of optimal network parameters was too chaotic.

Each sentence contained either a Subject-Verb structure of length 2, or a Subject-Verb-Object structure of length 3. Every training iteration consisted of learning 142 examples, after which the networks were evaluated on 95 previously unseen and unique test sentences.

Figure 5.1: Prediction of valid word choices: (a) verb prediction, (b) object prediction

As shown in Figure 5.1, the trained networks were able to correctly predict legal successors in the context of a partial sentence. Panel (a) illustrates that the network was able to infer that a pronoun is followed by a verb. Panel (b) demonstrates that it is able not only to infer the VP → V NP rule, but also which nouns can function as objects.

Generalisation The 400 monolithic network configurations, trained using between 2 and 24 hidden units, had an initial generalisation score of ∼0.72. The generalisation score is determined by calculating the error between the actual output and the target as defined in Section 4.1.3. After training, the average score was 0.047 (SD = 0.014), with the best performing architecture attaining a score of 0.031. The weighted average number of hidden units was 13.08.

Figure 5.2: Average generalisation scores per n modules

The best performing modular architecture reached a generalisation score of 0.043, with the average being 0.09 (SD = 0.03), down from an initial average of 0.22.

The average generalisation score per module count n is illustrated in Figure 5.2. There seemed to be a clear benefit for network architectures with a high module count. It must be noted that the majority of the networks contained between 3 and 6 modules, as the particular genetic encoding used favoured those values.

Training time The CPU cycles required for a training iteration are mostly determined by the number of units and weights that the network contains. While the modular networks did not achieve the optimal generalisation score, they were quicker to train. The optimal 10 monolithic architectures contained a sum total of 65,000 weights; their 10 modular counterparts consisted of only 23,000 connections, giving them an advantage in training time per iteration.


5.1.2 Complex sentence structures

For this experiment, the basic sentence dataset described in Section 5.1.1 was complemented with 200 complex items that included a relative clause. A new set of both monolithic (150) and modular (750) networks was generated.

The average number of hidden units was raised by 2 to account for the increased complexity of the task. The method described by Elman [6], which uses multiple training phases in which the ratio of complex to simpler patterns increases per phase, did not seem to provide significant performance improvements for this task.

Figure 5.3: Prediction at various stages of the sentence sequence

The trained networks seemed to be able to infer, to a degree, that the relative clause used in the complex sentences broke the regular Subject-Verb-Object order of the basic sentences, as shown in Figure 5.3.

Generalisation The 150 monolithic network configurations, using between 4 and 24 hidden units, had an initial generalisation score of ∼1.49. Again, the generalisation score represents the error between the actual output and the target. After training, the average score was 0.145 (SD = 0.01), with the best performing architecture attaining a score of 0.131. The weighted average number of hidden units was 14.28.

The 750 modular architectures achieved an average of 0.159 (SD = 0.015), down from an initial generalisation score of ∼0.195. The best performing architecture attained a score of 0.134, barely trailing the best performing monolithic configuration. Similar to the previous experiment, a preference for a high number of modules was observed.


5.2 Knowledge transfer

A selection of the networks trained for the complex sentence task in Section 5.1.2 was also trained on an additional language. Fifty networks were randomly chosen from each of the monolithic and modular sets.

Each network was tested on its ability to transfer knowledge from earlier training sessions. The weight configurations of the selected networks were restored, both from before and from after training on English. They were then trained to learn either Dutch or Japanese.

Recall that even though some of the grammatical rules of English are different from their Dutch or Japanese counterparts, there are also similarities. The phrase structure rules used for Dutch, in particular, differ from English in only one aspect. Based on this property, we expect that prior training iterations in English will influence the learning of a second language.

Figure 5.4: Sum of generalisation scores for monolithic (left) and modular (right) networks

For all networks, the generalisation score after training on a second language is taken, both when they have received prior English training and when they have not. The sums over all tested networks are graphed in Figure 5.4.

No significant knowledge transfer effect is observed for the monolithic networks; prior exposure to English, however, seems detrimental to learning Japanese. One possible explanation could be the influence of the relative pronoun on the monolithic structure: this word type plays a crucial role in the correct prediction of successor words in English, whereas Japanese does not use it at all.

The modular architectures seem to be unaffected by this phenomenon. For either language, the capacity for second language learning is not impacted by prior exposure to English. It appears that using independent modules negates the instability that affected the monolithic structures under similar learning circumstances.


6 Conclusion

For both generalisation experiments, the monolithic network configurations seemed to outperform the modular architectures. The disparity in performance, however, became smaller when the complexity of the task was increased. Modular neural networks are especially suited for tasks using high-dimensional input; the language tasks presented in this project might have been too simple for a modular approach to be optimally effective.

The modular neural networks did seem to attain better generalisation scores when a high number of modules was used. This might indicate that optimal performing architectures use the independent functioning of their modules to create categorical distinctions, similar to the localist approach of input representation. The data gathered in this project, however, are too inconclusive to confirm this.

In learning a second language, a positive effect was observed when using modular neural networks. One of the advantages noted for MNNs is the lack of interference between modules, which allows the network to stay more stable when new data patterns are introduced.

Search through genetic algorithms is a proven method of optimisation. In the domain of neural networks, however, the fitness of a solution is usually determined by either the network's error or its ability to generalise; neither value can be determined until after training.

Training, unfortunately, can be a time-consuming process, and genetic search has the potential to create an immense search space. It is therefore of the utmost importance to do extensive preparation to determine appropriate parameter boundaries.

As Happel & Murre remarked in their 1994 paper [9], there does not seem to be a general framework that provides clear design considerations for specific neural functions. The absence of such a framework leaves those interested in designing neural networks with only some general guidelines, and with search methods that can be time-inefficient if not provided with effective constraints.

Rather than relying only on search methods, we suggest further exploring the properties that affect efficient network design.


7 References

[1] G. Auda and M. Kamel. Modular neural networks: a survey. International Journal of Neural Systems, 9:129–152, 1999.

[2] W. Bechtel and A. Abrahamsen. Connectionism and the Mind: Parallel Processing, Dynamics, and Evolution in Networks. Blackwell, 2002.

[3] D. Dasgupta and D.R. McGregor. Designing application-specific neural networks using the structured genetic algorithm. In Combinations of Genetic Algorithms and Neural Networks (COGANN-92), International Workshop on, pages 87–96. IEEE, 1992.

[4] J.L. Elman and D. Zipser. Learning the hidden structure of speech. Journal of the Acoustical Society of America, 83:1615–1626, 1988.

[5] J.L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[6] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2):195–225, 1991.

[7] J.A. Fodor. The Modularity of Mind. MIT Press, 1983.

[8] D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.

[9] B.L.M. Happel and J.M.J. Murre. The design and evolution of modular neural network architectures. Neural Networks, 7:985–1004, 1994.

[10] S.A. Harp, T. Samad, and A. Guha. Towards the genetic synthesis of neural networks. In Proceedings of the Third International Conference on Genetic Algorithms, pages 360–369. Morgan Kaufmann Publishers Inc., 1989.

[11] M.D. Hauser, N. Chomsky, and W.T. Fitch. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579, 2002.

[12] N. Kanwisher, J. McDermott, and M.M. Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. The Journal of Neuroscience, 17(11):4302–4311, 1997.

[13] A.H. Kawamoto. Distributed representations of ambiguous words and their resolution in a connectionist network. In Lexical Ambiguity Resolution: Perspectives from Psycholinguistics, Neuropsychology, and Artificial Intelligence, pages 195–228. Morgan Kaufmann, San Mateo, CA, 1988.

[14] B. Krose and P. van der Smagt. An introduction to neural networks. 1996.


[15] S. Lawrence, C.L. Giles, and S. Fong. Can recurrent neural networks learn natural language grammars? In Neural Networks, 1996, IEEE International Conference on, volume 4, pages 1853–1858. IEEE, 1996.

[16] J.L. McClelland, M. St. John, and R. Taraban. Sentence comprehension: A parallel distributed processing approach. Language and Cognitive Processes, 4(3-4), 1989.

[17] R. Miikkulainen and M.G. Dyer. A modular neural network architecture for sequential paraphrasing of script-based stories. In Neural Networks, 1989 (IJCNN), International Joint Conference on, pages 49–56. IEEE, 1989.

[18] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1995.

[19] W.S. Sarle et al. Neural Network FAQ. Periodic posting to the Usenet newsgroup comp.ai.neural-nets, ftp://ftp.sas.com/pub/neural/FAQ.html, 1997.

[20] N. Sharkey, A. Sharkey, and S. Jackson. Are SRNs sufficient for modelling language acquisition? In Models of Language Acquisition: Inductive and Deductive Approaches, pages 33–54, 2000.

[21] I. Tsoulos, D. Gavrilis, and E. Glavas. Neural network construction and training using grammatical evolution. Neurocomputing, 72(1):269–277, 2008.


A Grammars

The following are the formal grammars for the languages used. The abbreviations used throughout this section follow the conventions defined in transformational grammar theory.

A.1 English

Phrase structure rules for English
S → NP VP
NP → PropN | N | N RC
VP → V (NP)
RC → who VP (NP)
PropN → Alexander | Anneloes | Jaap
N → boy | girl | cat | dog | book | banana | phone | shoe | science | philosophy | good
V → sees | hears | eats | reads | feeds | uses | enjoys | tastes

Example: SVO constituent order for single-clause sentences

[S [NP [PropN Alexander]] [VP [V eats]]]
[S [NP [N boy]] [VP [V reads] [N book]]]

Example: SVO constituent order for non-defining relative clause sentences

[S [NP [N boy] [RC who [VP [V reads] [NP [N book]]]]] [VP [V hears] [N dog]]]


A.2 Dutch

Phrase structure rules for Dutch
S → NP VP
NP → PropN | N | N RC
VP → V (NP)
RC → who NP VP
PropN → Alexander | Anneloes | Jaap
N → jongen | meisje | kat | hond | boek | banaan | telefoon | schoen | wetenschap | filosofie | goed
V → ziet | hoort | eet | leest | voed | gebruikt | geniet | proeft

Notes & Restrictions

• The Dutch language uses an SVO word order for main clauses. Unlike many other SVO languages, such as English, an SOV word order is used in most cases within relative clauses.

Example: SVO constituent order for single-clause sentences

[S [NP [PropN Alexander]] [VP [V eet (eats)]]]
[S [NP [N jongen (boy)]] [VP [V leest (reads)] [N boek (book)]]]

Example: SVO constituent order for non-defining relative clause sentences

[S [NP [N jongen (boy)] [RC die (who) [NP [N boek (book)]] [VP [V leest (reads)]]]] [VP [V hoort (hears)] [N hond (dog)]]]


A.3 Japanese

Phrase structure rules for Japanese
S → NP VP
NP → PropN | N | RC N
VP → (NP) V
RC → (NP) V
PropN → Alexander | Anneloes | Jaap
N → shonen | onnanoko | neko | inu | chosaku | banana | denwa | kutsu | rika | tetsugaku | yoi
V → shiso | kiku | taberu | yomu | shiryou | tentetsu | enjoi | shumi

Notes & Restrictions

• The Japanese language does not use relative pronouns to link relative clauses to their antecedent. Instead, the relative clause is placed before the noun phrase in the same structure.

Example: Constituent order for single-clause sentences

[S [NP [PropN Alexander]] [VP [V taberu (eats)]]]
[S [NP [N shonen (boy)]] [VP [N chosaku (book)] [V yomu (reads)]]]

Example: Constituent order for non-defining relative clause sentences

[S [NP [RC [NP [N chosaku (book)]] [V yomu (reads)]] [N shonen (boy)]] [VP [NP [N inu (dog)]] [V kiku (hears)]]]
