
Genetic Programming for Grammar Induction

MÅRTEN SVANTESSON

Master of Science Thesis
Stockholm, Sweden 2011

Master's Thesis in Computer Science (30 ECTS credits) at the School of Engineering Physics, Royal Institute of Technology, year 2011.
Supervisor at CSC was Viggo Kann. Examiner was Stefan Arnborg.

TRITA-CSC-E 2011:077
ISRN-KTH/CSC/E--11/077--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc


Abstract

Genetic Programming for Grammar Induction

In this thesis the problem of Grammar Induction is approached with Genetic Programming. We used the framework PerlGP to implement the evolution of grammars.

A simple grammar was used to generate an artificial text, which in the next step was used for evaluating how grammars were evolved by the GP method. We evaluated the influence of some variations: how many words were used in the artificial text and with what frequencies, and whether the grammar needed to match phrases in their entirety or not. An implementation allowing reuse of parts of the grammar, as well as recursion, was also evaluated. The results were encouraging, though the implementation scaled badly with the number of distinct words in the text. In our experiments with natural texts, the texts were thus first tagged with word classes to reduce the vocabulary used in the GP process.

Sammanfattning

Genetisk programmering för härledning av grammatik

In this report the problem of automatically inferring grammars is approached using genetic programming (GP). The framework PerlGP was used to implement the evolution of grammars.

A simple grammar was used to generate an artificial text. This text was used to evaluate how grammars develop under GP. We evaluated the influence of some variations: how many words were used in the artificial text and with what frequencies, and whether the whole grammar needed to match phrases or not. An implementation that allowed reuse of parts of a grammar, as well as recursion, was also evaluated. The results were encouraging, but the implementation scaled badly with the number of distinct words in the text. In our experiments with natural texts we therefore first tagged the words with word classes to reduce the vocabulary used in the genetic programming process.


Contents

1 Introduction
2 Theory
  2.1 Formal grammars
    2.1.1 Notations
    2.1.2 Grammar classes
  2.2 Genetic Programming
3 Genetic Programming framework used
  3.1 Basics of PerlGP
  3.2 Specifics on setup of PerlGP for grammar induction
    3.2.1 The fitness function
    3.2.2 The corpora
4 Performance measurement
  4.1 Performance of grammars
  4.2 Performance of the evolution
5 The experiments
  5.1 Simple artificial corpora
    5.1.1 Quality of grammars
    5.1.2 Significance of vocabulary size
    5.1.3 Significance of word frequency
    5.1.4 Optional parts of grammar
    5.1.5 Recursive grammars
  5.2 Tagged medical texts
    5.2.1 Results
6 Conclusions
  6.1 Execution time
  6.2 Further work
Bibliography
A Words used in 5.1


Chapter 1

Introduction

The problem examined in this work is Grammar Induction, also referred to as Grammar Inference. This is the problem of deducing a formal grammar of a language, given a corpus written in that language.

In the general case, inference is done from both positive and negative examples. Positive examples are pieces of text known to be correct and negative examples are pieces of text known to be incorrect. Gold (1967) investigated the consequences of the two normal cases: using only positive examples, or using both positive and negative examples. Gold showed that a language with an infinite number of sentences (like any human language) can not be fully learnt from positive examples only. Since observations show that children are rarely exposed to negative examples (Baker (1976), among others), this has been taken as an indication of innate human linguistic abilities.

But Gold treated only a very specific case of learning, which he calls identification in the limit. That a language of a certain class (see section 2.1) is identifiable in the limit through a certain method means that it is possible within finite time to identify exactly which language of a set of possible languages is the unknown language. Here by language we mean a formal language: a set of finite-length words composed from some finite alphabet. For illustration, let us define the language English* to be English without the sentence "I drink coffee". Noticing that English and English* are two different languages makes Gold's results understandable. It also indicates that for practical purposes, identification in the limit might be an unnecessarily strict definition of learning.

Gold himself also acknowledges that besides the question of the applicability of identification in the limit, his assumptions on how learning is done might not be applicable to real learning. He assumed that no inferences can be made from the statistical composition of the text used for learning. Others have expanded on this and introduced the concept of implicit negative evidence. Implicit negative evidence is negative evidence picked up by a learner without getting explicit rejections of incorrect phrases. In the context of using the statistical composition of text, this means information extracted from the absence of grammatical constructs in the text (Angluin, 1988). This should be distinguished from signals a learner picks up without being overtly corrected (Bohannon and Stanowicz, 1988), which is also referred to as implicit negative evidence and is mentioned by Gold. Examples of the latter include parents correctly rephrasing faulty utterances of a child.

The method examined for finding the formal grammar is Genetic Programming, with hypothesis grammars as individuals in the population.

In this work only positive examples are used as input for the system. The motive for this is mainly that genuine positive examples are easier to come by. Confirmed negative examples have to be constructed more or less manually, which would be impractical in many applications. The statistical composition of this input is also used, as the probability of a grammar persisting increases with the amount of phrases it matches.

The algorithms that are grouped together as Genetic Programming are very diverse. So even though Grammar Induction and Genetic Programming are large research areas, few articles approach the combination in a manner close to what we planned for this project. Because of the number of parameters, our results should not be regarded as definitive numbers on the performance of GP algorithms applied to Grammar Induction. The lack of standards for quantitative performance measurement also obstructs comparison.

Smith and Witten (1995) describe small scale experiments with Genetic Programming. Their setup is in many ways similar to the setup in our work. Similarities include letting a population of grammars evolve with a fitness function dependent on the matched amount of the corpus, and processing the grammars as trees, using standard GP techniques for tree transformation during reproduction and mutation.

While we start with a corpus of some size, Smith and Witten start with only one sentence and add a sentence at a time once at least one grammar matches all sentences. It seems the sentences used were constructed by hand. Also, their fitness function was only influenced by whole matched sentences.

Comparing results is harder, since the scale of their experiment only allows for deducing qualitative results that are hard to compare with the results of our work. They indicate, though, that as the corpora get larger, the number of generations needed to generate a grammar matching all sentences (including a newly added one) increases rapidly, hampering the scalability of the setup.

Our setup is based on the PerlGP Genetic Programming framework (MacCallum, 2003). As implied by the name, Perl is the programming language. When working with textual data this is convenient, though presumably not optimal performance-wise.

We only consider grammars describing the language on the level of whole words. This is a significant constraint, since suffix and prefix rules are very important parts of a full grammar. Also, references and other constructs involving several sentences are not handled.

Applications of grammar induction are a research field of its own. Two applications within bioinformatics are improvements of search tools for literature and finding regularities in 'languages' like DNA (MacCallum, 2005).

A search tool having knowledge of the structure of texts within the domain of its application could for example be used like this: a basic search for a word giving an overwhelming number of results could be presented with the generalised contexts the word is used in within the domain. The user could then choose the wanted context, thus restricting the number of results.

The work was done at Stockholm Bioinformatics Center by me, Mårten Svantesson. The work was guided by and discussed with my supervisor Robert M. MacCallum.


Chapter 2

Theory

2.1 Formal grammars

A formal grammar consists of a set of rules describing how strings can be formed. Another term for formal grammar is generative grammar.

The rules of a formal grammar work on nonterminal and terminal symbols. Terminals are symbols that are part of the language, while nonterminals are intermediaries that eventually are to be replaced by terminals. Let us call the set of nonterminal symbols N, the set of terminal symbols Σ and the set of rules P. A special symbol S ∈ N is the start symbol.

The rules look like X → Y, meaning X is transformed into Y. X and Y consist of combinations of terminals and nonterminals. The allowed combinations are determined by the notation in use and by which class the grammar is supposed to belong to.

2.1.1 Notations

Several different notations exist. With these you can express the same grammars while striking the balance between efficiency and simplicity differently. Separate from notation is the more important notion of grammar classes.

In the simplest notation, the right and left hand sides of the rules are strings from the set (N ∪ Σ)*, the symbol * being the Kleene star. It denotes the set of strings of any length possible to construct by concatenating symbols from the set it is applied to. This includes the empty string. A convention is to write nonterminal symbols in upper case and terminal symbols in lower case.


A simple example

S → aSa    (2.1)
S → b      (2.2)

The initial string is the start symbol S. It can then be transformed into aSa or b. In the second case one does not continue, since the whole string consists of terminal symbols. Continuing with the first gives aSa. Applying (2.2) gives aba, while applying (2.1) (giving aaSaa) forces you to continue, and so on.

So this grammar can form the strings b, aba, aabaa and so on. This is the language of the grammar, denoted L(G), where the grammar G = (N, Σ, P, S). However, one can not practically form a formal grammar for a human language, since human languages are not regular enough. This is disputed, though, and close approximations are possible.

2.1.2 Grammar classes

The grammars one can form by applying different restrictions on the right and left hand sides of the rules form different grammar classes. The most famous ones are the four classes forming the Chomsky hierarchy. These are:

Type-0 grammars (unrestricted grammars) include all formal grammars. 'Unrestricted' does not, however, mean that every language can be generated: there are languages that can not be formed by any such grammar.

Type-1 grammars (context-sensitive grammars) have rules of the form αAβ → αγβ, where A is a nonterminal and the Greek letters signify strings of terminals and nonterminals. These might be empty, except γ. A special exception is that the rule S → ε may exist if S does not occur in the right hand side of any rule. The Greek letter ε signifies the empty string.

Type-2 grammars (context-free grammars) have rules of the form A → γ, where γ again is a string of terminals and nonterminals.

Type-3 grammars (regular grammars) have a left hand side consisting of a single nonterminal, and the right hand side must consist of a single terminal, optionally followed by a single nonterminal.

2.2 Genetic Programming

The vocabulary in the area of Genetic Programming is not fully consistent, due to its recent inception. In the terminology of Banzhaf et al. (1997), Genetic Programming is a catch-all name for computational problem solving methods inspired by biological evolution. Since the publication of this book, the term Evolutionary Algorithms seems to have been established for this. Genetic Programming is instead used for the specific case where one uses an evolutionary algorithm to find computer programs.

The common trait of all Evolutionary Algorithms is that solutions are sought in a population of potential solutions. These solutions are represented by chromosomes, which somehow encode the solutions. The chromosomes are subjected to processes inspired by natural evolution: mutation and reproduction (involving recombination). How the chromosomes represent the solutions varies a lot. The Genetic Algorithm presented in Holland (1975) is considered the first Evolutionary Algorithm. He represents the solutions as bit strings.

To get evolution, the population is subjected to a selection mimicking the survival of the fittest in nature. Here the chromosomes are decoded to individuals on which a fitness function is applied. The fitness function's return value should be a measure of how well the individual solves the problem, or how close it is to solving the problem. The values of the fitness function decide which chromosomes will survive and/or reproduce.

The original population of chromosomes is normally generated randomly, though hopefully with the help of domain-specific knowledge.

Figure 2.1. Tree structure encoding a ∗ 3 + b: the root is an addition node whose children are a multiplication node (with children a and 3) and the leaf b.

In Genetic Programming the solutions are bits of program code. The most common way to encode this program code is by using tree structures. A simple illustration of how the chromosome for an arithmetic expression can be formed is seen in figure 2.1.

Subjecting tree structured chromosomes to mutation means that you randomly (under the constraints of rules) manipulate nodes and subtrees of the structure. These manipulations include (but are not limited to) changing values of data, changing places of subtrees, deleting and inserting nodes, and changing operators.

In natural sexual reproduction, the genetic material of a child is generated by combining parts of the parents' genetic material. With tree based GP this is mimicked by building the new chromosome from subtrees of the parents. Technically you would not need to restrict yourself to using subtrees from two parents, but this is normally the case.

To increase the possibility of getting working code when decoding the child's chromosome, it must be made likely that the different functional parts of the parents are present. A common way of doing this is to start off with the chromosome of one parent and then randomly replace some of its subtrees with similar subtrees of the other parent's chromosome. In finding a similar subtree, factors can be the size, the composition of different kinds of nodes and the location in the whole tree.
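As an illustration only, the following sketch shows how tree-encoded chromosomes and a naive subtree crossover might look in Perl, the language of this work. It is not PerlGP's actual implementation; in particular, a real crossover would also match node types, sizes and positions as described above.

use strict;
use warnings;

# A chromosome as a nested array: [operator, left, right]; leaves are strings.
# $parent1 encodes a * 3 + b, as in figure 2.1.
my $parent1 = [ '+', [ '*', 'a', '3' ], 'b' ];
my $parent2 = [ '+', [ '-', 'b', '2' ], 'a' ];

# Deep-copy a tree so the parents are never modified.
sub clone {
    my ($node) = @_;
    return ref $node ? [ map { clone($_) } @$node ] : $node;
}

# Collect references to every inner node, so one can be picked at random.
sub subtrees {
    my ($node) = @_;
    return () unless ref $node;
    return ($node, map { subtrees($_) } @$node[1, 2]);
}

# Naive crossover: overwrite a random subtree of a copy of parent 1
# with a copy of a random subtree of parent 2.
sub crossover {
    my ($p1, $p2) = @_;
    my $child  = clone($p1);
    my @sites  = subtrees($child);
    my @donors = subtrees($p2);
    @{ $sites[int rand @sites] } = @{ clone($donors[int rand @donors]) };
    return $child;
}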


Chapter 3

Genetic Programming framework used

3.1 Basics of PerlGP

PerlGP is a framework for developing Genetic Programming systems. Though it is highly configurable, we have used the basic setup of tournament-based Genetic Programming working with a population of tree-encoded solutions.

Naturally, the solutions in PerlGP consist of Perl code, normally given as the function evaluateOutput. A grammar specifies how this program code is allowed to be formed. The grammar is stored as strings in two hash tables. One hash table (%T) stores possible leaf nodes (terminals) and one hash table (%F) stores possible inner nodes (functions). The original population of individuals is generated randomly from this grammar. A simple grammar (coincidentally also for a human language) could for example look like code listing 3.1.

Here the resulting code will print simple Swedish sentences. The curly braces signify symbols that are to be replaced by one of the elements of the corresponding entry in %T or %F. Each of the elements in the list of an entry is equally probable to be chosen. To make a certain element more likely to be chosen, you simply have several copies of it in the list.

In the generation of an individual, the tree will always start with the node ROOT, defined by the entry of the same name in %F. The definition is then scanned for symbols. Here S is found. Since only one definition of S exists ({SUBJ} {PRED}), this will be the definition of the sole child (S) of ROOT. Several definitions of the children of S are possible though. As an example, PRED could be defined as {VP} {OBJ} and SUBJ as {NP}. Figure 3.1 gives a complete example.
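As an illustrative sketch (not PerlGP's actual generator, which builds an explicit tree rather than a flat string), generation from such a grammar can be pictured as a recursive expansion:

use strict;
use warnings;
our (%F, %T);    # filled in as in code listing 3.1

# Pick one alternative for a symbol, then expand every {CHILD} inside it.
sub expand {
    my ($symbol) = @_;
    my $alts = exists $F{$symbol} ? $F{$symbol} : $T{$symbol};
    my $body = $alts->[ int rand @$alts ];    # uniform choice of alternative
    $body =~ s/\{(\w+)\}/expand($1)/ge;       # recurse into child symbols
    return $body;
}

# eval expand('ROOT');   # runs e.g. print "du såg barnet igår\n";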

The example in code listing 3.1 is actually similar to the code used to generate the artificial corpora for the experiments described in section 5.1. The PerlGP grammar used to generate the regular expressions which we hope will resemble natural language grammars is more complex, and is described in section 3.2.


# Rules
$F{ROOT}   = [ 'print "{S}\n";' ];
$F{'S'}    = [ '{SUBJ} {PRED}' ];
$F{'PRED'} = [ '{VP} {OBJ}', '{VP} {OBJ} {ADV}' ];
$F{'OBJ'}  = [ '{NP}', '{ACKPRON}' ];
$F{'SUBJ'} = [ '{NP}', '{PRON}' ];

# Dictionary
$T{'PRON'}    = ['han', 'hon', 'jag', 'du'];
$T{'NP'}      = ['pojken', 'flickan', 'mannen', 'barnet', 'Kalle', 'Lotta'];
$T{'VP'}      = ['träffade', 'såg', 'hämtade', 'älskade', 'hatade', 'retade', 'lurade'];
$T{'ADV'}     = ['igår', 'isöndags', 'nyss', 'förut'];
$T{'ACKPRON'} = ['mig', 'honom', 'henne', 'dig'];

Code listing 3.1. Example grammar in PerlGP

Figure 3.1. Visualisation of a random individual generated from the example grammar: ROOT holds print "_\n";, S expands to SUBJ PRED, SUBJ to PRON (du), and PRED to VP (såg), OBJ (NP: barnet) and ADV (igår). This individual encodes the snippet of Perl code print "du såg barnet igår\n";.


As with other machine learning methods, the normal procedure is to have a set of input data to learn on. In the case of PerlGP this data is typically processed by the function evaluateOutput, and the return value is then evaluated by an appropriate fitness function (aptly called fitnessFunction()), often with the help of known correct output.

As mentioned earlier, PerlGP is tournament based. This means that, repeatedly, a random fixed-size group from the population is selected for a tournament. The individuals in the tournament with the highest fitness (according to fitnessFunction()) are allowed to reproduce. In the reproductive process, the genomes of the two parents are mixed to produce the genome of the offspring. The offspring replace the individuals in the tournament with the worst fitness.

Also, any individual having taken part in more than a certain number of tournaments will always be replaced, regardless of its fitness value. The purpose of this anti-elitist strategy is to ensure development in the population.
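A minimal sketch of one such tournament, assuming individuals are hashes carrying a genome and a cached fitness; the helper offspring(), standing in for crossover plus mutation, is hypothetical, and this is not PerlGP's actual code:

use strict;
use warnings;
use List::Util qw(shuffle);

sub tournament {
    my ($population, $size) = @_;
    my @group  = (shuffle @$population)[ 0 .. $size - 1 ];
    my @ranked = sort { $b->{fitness} <=> $a->{fitness} } @group;

    # The two fittest reproduce; their offspring replace the two least fit.
    my ($mum, $dad) = @ranked[0, 1];
    %{ $ranked[-1] } = %{ offspring($mum, $dad) };
    %{ $ranked[-2] } = %{ offspring($dad, $mum) };
}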

The mixing of the parents' genomes is done by crossovers. This means that with a certain probability the parents exchange similar subtrees. Similar means that the roots of the subtrees have the same type, neither of the subtrees is already a subtree of the other (from an earlier crossover), and they have a similar size and composition. For more details, see MacCallum (2003).

There is also a probability for mutations after the crossover. The kinds of mutations that exist are deletion and insertion of nodes, replacement of a subtree (with either a random tree or the subtree of another individual), swapping of subtrees of two individuals, and alteration of a value in a node (applicable mainly to numeric values).

All crossovers and mutations are done under the constraint that the individuals stay valid according to the grammar.

The fact that any code or data of the GP system itself can be chosen to be generated by the evolutionary process enabled the parameters controlling the mutations to be subjected to evolution. Such parameters include the probability for a node to be subject to mutation and the probability for the subtree of a certain node to be subject to crossover. These parameters were given initial values that could then be numerically mutated, as in MacCallum (2003).
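A sketch of what such a numeric mutation could look like; the jitter factor is our assumed choice, not a value from PerlGP:

use strict;
use warnings;

# Nudge a mutation-rate parameter by a small random factor, keeping it
# a valid probability.
sub mutate_rate {
    my ($rate) = @_;
    $rate *= 0.8 + 0.4 * rand;    # multiply by a factor in [0.8, 1.2)
    return $rate > 1 ? 1 : $rate;
}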

3.2 Specifics on setup of PerlGP for grammar induction

In the individuals of the population, grammars are implemented by means of Perl regular expressions. These are evolved under the restriction to parse as many of the sentences in the training set, and as much of each, as possible. The regular expressions match the sentences on the word level. This means that no morphological rules may be evolved.


In the vocabulary of formal grammars, the words are the terminal symbols. This is a significant constraint, since suffix and prefix rules are important parts of a full grammar. In our original system no recursion was possible, implying that the class of grammars that was recognised did not strictly correspond to any of the Chomsky hierarchy grammars. Notably, the generated language is finite without recursion.

The rules of the grammars follow the templates

Si → Sj Sk     (3.1)
Si → Sj | Sk   (3.2)
Si → word      (3.3)

where the Si are nonempty nonterminals. That no recursion is possible can be formalised as the restriction that the rules of a grammar must not allow an Si to transform into a string that includes itself. Rule (3.2) is equivalent to

Si → Sj    (3.4)
Si → Sk    (3.5)

In the program, the grammars are expressed in a different but equally expressive way compared to the description above. Code listing 5.1 shows an extended Perl implementation of this grammar template. In effect, the grammars were Perl regular expressions. A simple example:

(?:Flickan |(?:Lotta |Barnet ))(?:(?:lurade |hatade )|retade )(?:(?:Lotta |mannen )|Kalle )(?:nyss |förut )

This expresses the same as

S → ABCD
A → Flickan | Lotta | Barnet
B → lurade | hatade | retade
C → Lotta | mannen | Kalle
D → förut | nyss


or (more strictly following the templates and using the simplest notation of formal grammars)

S → XY        A → Barnet    C → mannen
X → AB        B → lurade    C → Kalle
Y → CD        B → hatade    D → nyss
A → Flickan   B → retade    D → förut
A → Lotta     C → Lotta

3.2.1 The fitness function

When creating the fitness function, we make use of the Minimum Description Length (MDL) principle as presented in Grünwald (1994). That is, we reward grammars not only for the amount of text they match, but also for how short they are. But the exact composition of the fitness function can not be deduced theoretically, and could be modified indefinitely.

From previous unpublished studies by Robert M. MacCallum we have found the following fitness function to be reasonable in its trade-off between rewarding the matching of many long phrases and conciseness:

f = \frac{\sum_{n=1}^{k} (w_n - 1)^4}{s + 1}    (3.6)

The numerator is a function of the number of words w_n in each unique phrase (of in total k phrases) matched by the evolved regular expression against the corpus. The expression w_n − 1 is used to avoid counting the case of matching only one word. The denominator is a function of the size of the grammar, expressed as the number of nodes s in the tree structure representing the regular expression/grammar. This is equal to the number of rules used in the formal grammar in the form of the templates (3.1)–(3.3). The addition of one to s is to avoid the fitness function becoming undefined when the grammar has size zero.
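A sketch of how (3.6) can be computed in Perl, assuming $re holds the evolved regular expression, @corpus the sentences in use, and $s the node count of the genome (the names are ours, not PerlGP's):

use strict;
use warnings;

sub fitness {
    my ($re, $s, @corpus) = @_;
    my %seen;
    my $numerator = 0;
    for my $sentence (@corpus) {
        while ($sentence =~ /($re)/g) {    # every phrase matched in the sentence
            next if $seen{$1}++;           # count each unique phrase once
            my $w = () = split ' ', $1;    # number of words in the phrase
            $numerator += ($w - 1) ** 4;
        }
    }
    return $numerator / ($s + 1);
}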

3.2.2 The corpora

In each run, two different corpora, called the training corpus and the testing corpus, are used when evaluating the fitness function. The fitness values emanating from the training corpus are used to drive the evolution, while occasional fitness evaluations using the testing corpus are used as a reference. If the fitness value from the testing corpus differs a lot from the fitness value from the training corpus, it is a sign of over-specialisation. For this to be true, the two corpora must be of similar composition. But they should not be too similar, because otherwise one will not see the over-specialisation.


Chapter 4

Performance measurement

There are two types of performance we are interested in: the performance of the final evolved grammars, and the performance of the evolution itself.

4.1 Performance of grammars

A grammar for a language should match the phrases in the language but not the phrases that are not in the language. Moreover, the grammar should be concise. In GP these goals should of course be encoded in the fitness function.

In these experiments, the sizes of the grammars are given as the number of nodes in the genome that generates the grammar.

When not otherwise mentioned, the grammar searches for phrases within sentences, and several phrases might be found in one sentence.

It is hard to explicitly encode the more elusive demands for a structure that seems intuitively correct. We will not try to extract any explicit figure for this qualitative property. Hopefully, the combination of the allowed structure of the grammar and the demand for conciseness will be enough for our purpose. This approach, but using very different techniques, is described in Grünwald (1994).

The fitness values themselves do not give the information necessary to get an understanding of how the grammars perform.

4.2 Performance of the evolution

Generally, the performance of the evolution is a matter of what return in final performance of the grammars you gain by putting different resources into the evolution. In this case, the factors in the performance of the evolution are its speed and its demands on the corpora.

Measuring the speed of improvement, using the fitness value, for a GP algorithm versus the number of tournaments might seem like the natural thing to do. But in real time, the execution time of one tournament can vary a lot. When space permits, we will present both real time and number of tournaments.

But what are the relevant figures to look at when assessing the performance in an experiment? For the cases where it is clear at what time a stable maximum is reached, one can simply look at the time for reaching this maximum (or a certain fraction thereof) and compare it to the performance. Here we have chosen 63% of final fitness. When a stable maximum is not reached, this is stated without any details.

The figure 63% bears no special significance, except that at this point in time the seemingly random oscillations that dominate in the beginning had diminished in all runs, while the fitness value was still rising steadily enough for the smaller oscillations not to disturb the time measurements much. (The number, 63% ≈ 1 − 1/e, might be recognised from the time constants of electronics.)

In benchmarks and competitions, the current practice seems to be to set a tight limit on computing resources and then measure the performance of the grammars after a certain time. This stems from the problem of comparing results emanating from different conditions. Therefore, all the experiment results we present were obtained on identical hardware (2166 MHz Athlon XP, 8664 MFlop/s, 1024 MB of RAM).


Chapter 5

The experiments

We prepared and executed a number of experiments. They differ in regard to the composition of the input data and how the GP system was set up.

5.1 Simple artificial corpora

In our first experiments, we used corpora generated from the following very small subset of Swedish grammar, taken from page 73 of Sigurd (1967).

S → SUBJ PRED       (5.1)
PRED → VP OBJ       (5.2)
PRED → VP OBJ ADV   (5.3)
OBJ → NP            (5.4)
OBJ → ACKPRON       (5.5)
SUBJ → NP           (5.6)
SUBJ → PRON         (5.7)

The symbols PRON, NP, VP, ADV and ACKPRON are all non-terminals produced with rules consisting of individual words on the right hand side. The individual words are listed in Appendix A.

The method for generating the corpora is mentioned in section 3.1. The corpora used for training and for testing consistently consist of 1,000 sentences each. But not all of the training sentences are used in every tournament: specifically, there is a 50% chance for each sentence in the training corpus to be used by the fitness function. The reason for this is twofold: saving processing time, and avoiding over-specialisation of the evolving grammars. The latter is achieved since a changing set of sentences prevents the individuals from stagnating.

As indicated in appendix A, we did several runs with sets of corpora (training and testing) generated from grammars consisting of different numbers of words. For each set of corpora, 10 runs were made, to be able to draw some conclusions despite the stochastic nature of GP.

The volume of data produced by the experiments was quite large. We have chosen to be sparse with giving data in the text. It is included only in the cases where data is needed to improve understanding, or where we have managed to condense the data to such an extent that a reader can analyse it without it taking up too much space. The full datasets, and the scripts used to analyse them, are available from the author on request.

5.1.1 Quality of grammars

The GP system is good at generating a grammar for the artificially generated corpora. When using the grammar templates (3.1)–(3.3), the generated grammars did not match the whole corpus. The reason is that the optional ADV in PRED would make it necessary to repeat the whole PRED part of the grammar, but this would not increase the fitness value as much as it would be decreased by the increased size of the grammar. There were variations in exactly how this turned out, but we will not give a detailed analysis for each of the used corpora.

For the 25 word corpora, all evolved grammars turned out to contain all the words. Combined with the previous paragraph, this means that sentences generated with rule (5.2) will not match. The reason is quite clearly that this gives a higher fitness, because the length of the matches is so rewarded. Note that matches to parts of sentences are also rewarded in the fitness function. The significance of this shows in the cases with larger vocabularies, where grammars matching shorter sentences sometimes survive.

The total number of sentences possible to generate with the grammar using 25 words is 3,500. The training and testing corpora, of in total 2,000 sentences, make up a large part of these possible sentences. This changes with bigger vocabularies, making the evolution specialise on the sentences in the training set. This specialisation shows in that the percentage of the vocabulary not appearing in the grammar increases with vocabulary size.

How words are selected for potential use in the grammar adds to this effect: first, the 32 most common words are used to generate the first individuals. Then, after 1,000 tournaments, more words are made available for insertion by mutations. These words are chosen for being either adjacent to the phrases already being matched, or part of previously unmatched phrases during a search with a grammar where a word was replaced with a wild card. The goal is to make it possible for the grammars to improve while minimising the risk of them being destroyed.

These things add to the difficulty of concisely describing the rest of the evolved grammars. But some general features can be observed: the grammars which do reach a plateau in the development of the fitness function all match the sentences from their beginnings. But the part of the grammar trying to match the PRED rules varies, from only being able to match sentences generated from rule (5.3) to not having any ADV words at all. In between we have those encoding some of each.

A change in a bigger grammar has a smaller relative effect, which makes small flaws in the grammars persist.

No fitness value was still on the rise when the runs were terminated.

5.1.2 Significance of vocabulary size

The comparative runs with corpora generated from different numbers of words were also used to see how the performance of the evolution (in the sense defined in section 4.2) was affected. The hypothesis that a bigger vocabulary would yield a slower evolution was affirmed. One has to keep in mind, though, that the algorithm is stochastic and we only did 10 runs of each kind. Also, of the 50 runs there were 3 we consider failures; that is, the fitness values never began to rise steadily but only oscillated. So the data do not allow for any finer analysis, though it seems like the time to reach 63% of the final fitness value was roughly proportional to the size of the vocabulary (figure 5.1).

Most runs began with a period of oscillation before taking to a steady rise. But the curves are not smooth on a smaller scale, so it is not a monotone rise. See figure 5.2.

5.1.3 Significance of word frequency

In the experiments described above, all words had the same probability of appearing in the corpus. Since this is not the case in natural languages, we redid the experiments with varying probabilities for the different words. We then compared the times spent to reach 63% of the final fitness values.

Looking at the best fitness value of tournaments as a function of time shows that its shape is slightly different when the word frequency is no longer uniform. But there is no significant change in the times spent to reach 63% of the final value. The final fitness values did reach higher, though. This is not surprising, considering that a similar grammar can match a larger part of the corpus if it consists of well selected words, now that the corpus is biased to contain some words more than others.


Figure 5.1. Time (hh:mm, 00:00–20:00) to reach 63% of the final value, plotted against the number of words in the corpora (0–100).

Figure 5.2. Example of a run with the artificial grammar: fitness value against number of tournaments (0–120,000). This is run number 5 with a vocabulary size of 100 words. The plot shows the fitness calculated using the testing corpus. The run took 24 hours.


Figure 5.3 gives some qualitative understanding. Despite the observed differences, no figures can be given that are statistically significant.

Figure 5.3. Development of the best of tournament testing fitness for all runs with 25 word corpora, plotted against tournament number (0–8,000). Graphs are marked as coming from runs with either equal or differing probabilities for different words to occur. The times to reach tournament 8,000 vary between 2:06 h and 2:45 h, except one run taking 5:44 h.

5.1.4 Optional parts of grammar

We tried adding an extra rule to the grammar template that allowed for optional parts of the grammar, simply by allowing a question mark after each part of the regular expression implementing the grammar. This means that we allow all left hand symbols in the rules following the templates (3.1)–(3.3) to have the additional rule

Si → ε (5.8)

When using the 25 word corpora, grammars matching the whole corpus were generated. They were also as concise as possible. The results deteriorated with bigger vocabularies, however. With the 100 word corpora, five of the ten runs were successful in the sense that the original grammar was regenerated, but not all words were included, presumably due to the effects described in section 5.1. The bigger corpora also exposed flaws in the implementation causing crashes. This seems to be the main reason that four of the runs failed to produce any grammar.



The risk with allowing parts of a grammar to be optional is that a grammar may match sentences without conveying their structure. For example, a fragment of a grammar could be

(the )?(a )?(boy )?(girl )?

instead of the more structured

(the |a )(boy |girl )

Examples of this can be seen below (section 5.2.1). In light of this, the reasonably good results were a bit surprising. Maybe the precaution of giving a certain penalty for making part of a grammar optional helped: each rule following template (5.8) is counted as three in the fitness function.

5.1.5 Recursive grammars

We developed a possibility for the grammars to be 'recursive' in the sense described in section 3.2. This was done in the hope that grammars would evolve that reuse parts of the grammar (like words of the same class) and encode repetitive structures by means of recursion.

Details of the implementation

An array of grammars was allowed to evolve. In the initial individuals this array had at least one entry; the probability of adding an extra entry was 50%. The right hand side symbols of rules (3.1) and (3.2) can then be transformed into the grammars in this array. In other words, a grammar can include a reference to any grammar in the array (including itself). The Perl code for this is shown in code listing 5.1.

The words used for the symbol WORD are extracted from the text as explained in section 5.1, and copies is a function that copies its second argument the number of times given in the first argument. This is done to set the likelihood of the different alternatives.

By incorporating the self-referencing regular expression in the PHRASE entry of code listing 5.1, the grammar is enabled to reference itself. It makes use of the experimental regular expression pattern (??{ code }) introduced in Perl 5.6. The Perl code in the pattern is evaluated, and the returned value is evaluated in place as a regular expression.


$F{ROOT} = [ <<'___',
sub getPhraseMatch {
  use re 'eval';
  return qr/(??{local $c=0; local @p=(qr'{PHRASE}', {REGS}); $p[0];})/;
}
___
];

$F{REGS} = [
  'qr\'{PHRASE}\', {REGS}',
  ''
];

$F{PHRASE} = [
  copies(4,  '(??{local $c=$c+1; ($c < 5) ? $p[{INDEX}%@p] : "zzz"})'),
  copies(6,  '{PHRASE}{PHRASE}'),
  copies(10, '(?:{PHRASE}|{PHRASE})'),
  copies(28, '{WORD} '),    #< note the trailing space
];

$T{INDEX} = [ 0 .. 30 ];

Code listing 5.1. Implementation of recursive grammars

By design, the pattern can be used no more than 5 times in the search of a particular phrase. This is to avoid endless recursion leading to slow execution and even crashes. To make sure that only existing array elements are accessed, the index is formed as an integer modulo the length of the array: {INDEX}%@p.
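The (??{ code }) construct is easier to see in isolation. The following standalone snippet is our illustration, adapted from the classic balanced-parentheses example in the Perl documentation, and is not part of the thesis system; it recurses in the same way as the PHRASE pattern's reference into @p:

use strict;
use warnings;
use re 'eval';

my $balanced;                    # declare first so the code block can see it
$balanced = qr/
    \(
    (?:
        (?> [^()]+ )             # plain characters, without backtracking
      |
        (??{ $balanced })        # recurse into a nested group
    )*
    \)
/x;

print "matched\n" if '(a(b)(c(d)))' =~ /\A$balanced\z/;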

Results

Unfortunately, with this implementation no grammars with sub grammars identifiable as phrase elements were ever evolved (50 runs were made). Thinking about it, the cause of the results is evident: the probability for a symbol to by chance point to a grammar that works as a 'sub grammar' at that point is very small.

It was noticed that the evolutionary pressure reduced the average number of grammars in the grammar array. The reason was evidently that this was an effective way of reducing the number of nodes.

But recursion can occur even with only a single regular expression. We did a run with the 100 word corpora where any reference to a grammar was optional, in the sense of section 5.1.4. An example of a result can be seen in figure 5.4. It is clear that this grammar matches any sentence of arbitrary length containing any word in the grammar.

Figure 5.4. Excerpt from a recursive expression. The triangle indicates an optional (indicated by the question mark) self reference to the whole expression. An unmarked node means that the right branch is concatenated with the left, while the vertical bar in a circle means alternative: either the left or the right branch. The truncated part of the grammar (signified by broken lines) contains only alternatives and leaf nodes.

A hypothetical grammar matching all phrases while having optional references could look like figure 5.5. In reality no grammar matching all phrases occurred, but grammars constructed in a similar way did occur.

Figure 5.5. Hypothetical recursive grammar matching all phrases in a text without the self reference (triangle) being optional.

5.2 Tagged medical texts

The result in section 5.1.2, that the time consumption of the evolution was roughly proportional to the size of the vocabulary, steered us away from trying grammar inference on unrestricted real texts. Instead we chose to work on tagged texts, using MedPost (Smith et al., 2004), a part-of-speech tagger for biomedical texts.

A corpus of abstracts was downloaded from Pubmed (the U.S. National Library of Medicine's database of biomedical citations and abstracts) using the query string "transcription factors". These were tagged and then reformatted slightly to facilitate the processing in the GP system. An example of a tagged sentence:

In_II addition_NN ,_, regions_NNS corresponding_VVG to_II the_DD inner_JJ and/or_CC outer_JJ segments_NNS of_II the_DD photoreceptor_NN cells_NNS showed_VVD positive_JJ staining_VVGN for_II cytokine_NN signaling_VVGJ components_NNS ._.

A difference from the earlier experiments was that the words used to form the grammar were a fixed set of tags, instead of words extracted from the corpus.

The fitness function mainly used, (3.6), has the drawback of being hard to motivate and analyse. This is particularly true for the exponent 4 in the numerator. An alternative fitness function was tried out:

f = \frac{\sum_{n=1}^{k} (w_n - 1)}{s + c}    (5.9)

Compared with the earlier fitness function, there is the extra symbol c. This is a constant which we set to different values to see the effect.


5.2.1 Results

Earlier experiments using function (3.6) showed matched phrases of lengths up to 30 words. The result is due to the exponent in the numerator, which gives a good return in fitness value. A suspected drawback is that this return would diminish the evolutionary pressure to minimise the grammar, since the denominator is only a linear function of the size of the grammar. Real grammatical structures do not consist of that many parts; instead, phrases of such a length are bound to consist of repeating structures. But since the evolved grammars can not use any recursion, they can not reflect this, and thus can not be the grammar we are looking for.

Figure 5.6. Histogram over the relative frequencies of the lengths of the matches of the best grammars of the runs with 10,000 sentence corpora (x-axis: number of words in matched phrase, 1–29; y-axis: normalised frequency of phrase length, on a logarithmic scale to display the whole range of values). Series shown: fitness function (5.9) with one match per sentence and bias 10, 50 and 100; fitness function (3.6); and fitness function (3.6) with optional parts of grammar.

This, together with the wish for a fitness function easier to understand, led us to try function (5.9), initially with c = 1. When using this fitness function, the evolved grammars tended to match only many short phrases, and thus not whole sentences. A contributing reason was that the fitness function was defined in such a way that the grammar was allowed to match several phrases in each sentence, and all matched words were counted. To encourage grammars to match longer stretches of words, the fitness function was redefined to allow only one match per sentence. This made matched phrases longer (data not shown), but only by a word or two, and thus was not very encouraging. Obviously, the evolutionary pressure to match long sentences becomes too weak compared to the pressure to get a small grammar.

To counter this, the bias factor c of the fitness function was increased, to let the grammars reach a certain size before the evolutionary pressure for minimisation takes effect.

Figure 5.6 shows histograms over the lengths of the matched phrases in a few cases. One should note that the fitness values of the grammars evolved with function (3.6) never levelled out, but were still rising when the run ended after 6 days.

An example of a full grammar can be seen in Figure 5.7.

Figure 5.7. Example of a full evolved grammar over part-of-speech tags. The fitness function used to evolve this grammar allowed only one match per sentence, has a linear numerator and a bias of 50. Legend: NN, noun; NNS, plural noun; DD, determiner; JJ, adjective; II, preposition; CC, coordinating conjunction.


Chapter 6

Conclusions

Because of the large number of parameters involved in setting up a Genetic Programming run, the conclusions can not be said to apply to GP applied to grammar induction in general.

Methods beyond those used in this work are needed for usable grammars to evolve from corpora of real human language.

6.1 Execution time

The running time for one tournament varies widely. Some factors are:

1. The size of the vocabulary.

2. The complexity of the fitness function.

3. The complexity of the regular expression making up the grammar.

4. The length of the phrases matched.

Both items 3 and 4 tend to increase during a run, making the time per tournament increase.

6.2 Further work

Especially the grammars described in sections 5.1.5 and 5.2.1 would obviously also match a lot of incorrect phrases. One would like to put an evolutionary pressure against this. The only way we can think of is using a corpus of incorrect phrases and redefining the fitness function to reduce the fitness value of an individual according to the number of matches in this corpus. That is, one would need to add negative reinforcement. Corpora with incorrect phrases are harder to come by, so one would like to construct them automatically. The simple method of generating 'sentences' randomly from the words (tags in this case) of the original corpus might work reasonably well, as sketched below. This stems from the facts that most of these random sentences would be incorrect, and that GP tends to be resilient to imperfect input.
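A sketch of such a generator, assuming the positive (tagged) sentences are available in @corpus; the sentence-length range is an arbitrary choice of ours:

use strict;
use warnings;

our @corpus;    # the positive sentences, one per element

# Build the vocabulary, then draw random 'sentences' from it. Most of
# these should be ungrammatical and can serve as negative examples.
my %seen;
my @vocabulary = grep { !$seen{$_}++ } map { split ' ' } @corpus;

my @negative;
for (1 .. scalar @corpus) {
    my $length = 3 + int rand 12;    # lengths 3 to 14, chosen arbitrarily
    push @negative,
        join ' ', map { $vocabulary[ int rand @vocabulary ] } 1 .. $length;
}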

It might be rewarding to tailor the numerator of the fitness function to have a clear peak for desired phrase lengths. That function might look similar to the normal distribution. One approach could be letting the function change over time: first having a peak for sentences of a few words, making the grammar cover basic cases, and then moving the peak upwards to longer phrases, hoping that the population includes some specialised grammars that can combine using mutation and the recursion mechanisms.
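As a sketch only, with a target phrase length μ and width σ as hand-chosen tuning parameters (neither appears in this work), such a numerator could replace the polynomial term while keeping the denominator of (5.9):

f = \frac{\sum_{n=1}^{k} \exp\left( -\frac{(w_n - \mu)^2}{2\sigma^2} \right)}{s + c}

Scheduling μ to grow during a run would implement the moving peak described above.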


Bibliography

Angluin, D. 1988. Identifying Languages from Stochastic Examples. Technical Report YALEU/DCS/RR-614, Yale University.

Baker, C. L. 1976. Syntactic theory and the projection problem. Linguistic Inquiry 10, 533–581.

Banzhaf, Wolfgang, Peter Nordin, Robert E. Keller, and Frank D. Francone. 1997. Genetic Programming: An Introduction. Morgan Kaufmann Publishers, Inc.

Bohannon, N. J. and L. Stanowicz. 1988. The issue of negative evidence: Adult responses to children's language errors. Developmental Psychology 24, 684–689.

Gold, Mark E. 1967. Language identification in the limit. Information and Control 10, no. 5, 447–474.

Grünwald, Peter. 1994. A minimum description length approach to grammar inference. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language, pp. 203–216.

Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.

MacCallum, Robert M. 2003. Introducing a Perl Genetic Programming System – and Can Meta-evolution Solve the Bloat Problem? 6th European Conference on Genetic Programming, EuroGP (Essex, UK, 2003), Lecture Notes in Computer Science, vol. 2610, Springer, pp. 364–373.

MacCallum, Robert M. 2005. Personal communication.

Sigurd, Bengt. 1967. Språkstruktur: Den moderna språkforskningens metoder och problemställningar. Wahlström & Widstrand.

Smith, L., T. Rindflesch, and W. J. Wilbur. 2004. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20, no. 14, 2320–2321.

Smith, Tony C. and Ian H. Witten. 1995. Learning language using genetic algorithms. Learning for Natural Language Processing, pp. 132–145.


Appendix A

Words used in 5.1

Words added to the grammar to reach the specified vocabulary size (25, 44, 63, 81 and 100 words) for each run, by word class:

PRON
  25: han, hon, jag, du
  44: vi, de
  63: ni

NP
  25: pojken, flickan, mannen, barnet, Kalle, Lotta
  44: Ragnar, Bob, polisen, muraren, löparen
  63: vem, vilka, sonen, dottern, modern, fadern, rökaren, försäljaren, idioten, läkaren, fysikern, sångaren
  81: Otto, Svante, Anders, Lena, Ellen, Elin, Dan, vännen
  100: chefen, danskan, albanen, kvinnan, mördaren, offret, kändisen

VP
  25: träffade, såg, hämtade, älskade, hatade, retade, lurade
  44: kittlade, rånade, lämnade, sköt, torkade, konverserade
  63: kysste, undervisade, tolererade
  81: frågade, letade_efter, konspirerade_med, radade, hotade, skällde_ut
  100: utnyttjade, skämde_bort, skämde_ut, högg, antastade, smekte, lugnade, trostade, berömde

ADV
  25: igår, i_söndags, nyss, förut
  44: imorse, alltid, snabbt, långsamt
  63: brutalt, tidigt, sent, ofta
  81: tidigare, senare
  100: i_fredags, klockan_nio, regelbundet

ACKPRON
  25: mig, honom, henne, dig
  44: oss, dem
  63: er
