ABSTRACT Genetic algorithms (GAs) are an optimization technique for searching very large spaces; they model the role of the genetic material in living organisms. A small population of individual exemplars can effectively search a large space because they contain schemata, useful substructures that can be potentially combined to make fitter individuals. Formal studies of competing schemata show that the best policy for replicating them is to increase them exponentially according to their relative fitness. This turns out to be the policy used by genetic algorithms. Fitness is determined by examining a large number of individual fitness cases. This process can be very efficient if the fitness cases also evolve by their own GAs.

1 Introduction

Network models, such as multilayered perceptrons, make local changes and so find local minima. To find more global minima a different kind of algorithm is called for. Genetic algorithms fill the bill. These algorithms also allow large changes in the state to be made easily. They have a number of distinct properties:

• They work with an encoding of the parameters.

• They search by means of a population of individuals.

• They use a fitness function that does not require the calculation of derivatives.

• They search probabilistically.

In broad outline the idea of a GA is to encode the problem in a string. For example, if the problem is to find the maximum of a scalar function f(x), the string can be obtained by representing x in binary. A population represents a diverse set of strings that are different possible solutions to the problem. The fitness function scores each of these as to its optimality. In the example, the associated value of f for each binary string is its fitness. At each generation, operators produce new individuals (strings) that are on average fitter. After a given number of generations, the fittest member in the population is chosen as the answer.

The two key parameters are the number of generations NG and the population size Np. If the population is too small, there will not be sufficient diversity in the strings to find the optimal string by a series of operations. If the number of generations is too small, there will not be enough chances to find the optimum. These two parameters are not independent: the larger the population, the smaller the number of generations needed to have a good chance of finding the optimum. Note that finding the optimum is not guaranteed; there is just a good chance of finding it. Certain problems will be very hard; technically they are called deceptive when the global optimum is hard to find. Note that for the solution to go unfound, the probability of generating it from the population as it traverses the generations has to be small.

The structure of a GA uses a reproductive plan that works as shown in Algorithm I.

The raw fitness score for the ith individual, h(i), is any way of assessing good solutions that you have. Of course the performance of the algorithm will be sensitive to the function that you pick. Rather than work with h, it is more useful to convert raw fitness to normalized fitness f, where

f(i) = h(i) / ∑_{j=1}^{Np} h(j)

Since now ∑_{i=1}^{Np} f(i) = 1, this method allows the fitness values to be used as probabilities.

Probabilistic operations enter the algorithm in three different phases. First, the initial population must be selected. This choice can be made randomly (or if you have some special knowledge of good starting points, these can be chosen). Next, members of the population have to be selected for reproduction. One way is to select individuals based on fitness, produce offspring, score the offspring, and then delete individuals from the resultant population

Algorithm I: The Genetic Algorithm

Choose a population size.
Choose the number of generations NG.
Initialize the population.
Repeat the following for NG generations:

1. Select a given number of pairs of individuals from the population probabilistically after assigning each structure a probability proportional to observed performance.

2. Copy the selected individual(s), then apply operators to them to produce new individual(s).

3. Select other individuals at random and replace them with the new individuals.

4. Observe and record the fitness of the new individuals.

Output the fittest individual as the answer.

based on 1 − f. The third way probabilities enter into consideration is in the selection of the genetic operation to be used.
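The two probabilistic choices just described — selection for reproduction with probability f and selection for removal with probability proportional to 1 − f — are both weighted draws over the population. The sketch below is a minimal illustration in Python (the function names are ours, not from the text), using a cumulative-sum "roulette wheel."

```python
import random

def select_index(norm_fitness):
    """Roulette-wheel selection: return index i with probability f(i)."""
    r = random.random()
    cumulative = 0.0
    for i, f in enumerate(norm_fitness):
        cumulative += f
        if r <= cumulative:
            return i
    return len(norm_fitness) - 1  # guard against floating-point rounding

def select_for_removal(norm_fitness):
    """Return an index with probability proportional to 1 - f(i)."""
    weights = [1.0 - f for f in norm_fitness]
    r = random.uniform(0.0, sum(weights))
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1
```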

1.1 Genetic Operators

Now let's define the genetic operators. Let the symbol a denote a gene, and denote a chromosome by a sequence of genes a1 a2 a3 . . . an. The set of alleles {ci1, . . . , ciji} represents the possible codes that can be used at location i. Thus gene i will take one of these values,

ai ∈ {ci1, . . . , ciji}

but to take a specific example, for an alphabet of the first three letters, the alleles would be {a, b, c} at every position.

The first operator is crossover, wherein subsequences of two parent strings are interchanged. The point of the operator is to combine two good subsequences on the same string. To implement the operation, crossover points

Figure 1: Genetic operations on a three-letter alphabet of {a, b, c}. (top) Crossover swaps strings at a crossover point. (middle) Inversion reverses the order of letters in a substring. (bottom) Mutation changes a single element.

must be selected probabilistically. Then the substrings are swapped, as shown in Figure 1 (top).

Once good subsequences appear in an individual, it is advantageous to preserve them from being broken up by further crossover operations. Shorter subsequences have a better chance of surviving crossover. For that reason the inversion operation is useful for moving good sequences closer together. Inversion is defined in the middle section of Figure 1.

As the GA progresses, the diversity of the population may be removed as the individuals all take on the characteristics of an exemplar from a local minimum. In this case mutation is a way to introduce diversity into the population and avoid local minima. Mutation works by choosing an element of the string at random and replacing it with another symbol from the code, as shown in the bottom section of Figure 1.
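As a concrete illustration of the three operators on strings over the alphabet {a, b, c}, here is a minimal Python sketch (ours, for illustration; the text defines the operators, not this code):

```python
import random

def crossover(parent1, parent2):
    """Swap the tails of two equal-length strings at a random crossover point."""
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def inversion(individual):
    """Reverse the order of the symbols in a randomly chosen substring."""
    i, j = sorted(random.sample(range(len(individual) + 1), 2))
    return individual[:i] + individual[i:j][::-1] + individual[j:]

def mutation(individual, alphabet="abc"):
    """Replace one randomly chosen symbol with a different symbol from the code."""
    pos = random.randrange(len(individual))
    replacement = random.choice([c for c in alphabet if c != individual[pos]])
    return individual[:pos] + replacement + individual[pos + 1:]

print(crossover("aabbcc", "ccbbaa"))
```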

1.2 An Example

Consider finding the maximum of the function f(x) = −x² + 12x + 300, where x takes on integer values 0, . . . , 31. This example is easily solved with direct methods, but it will be encoded as a GA search problem to demonstrate the operations involved.

The first step in defining the GA is to code the search space. The encoding used for genetic operations is the binary code for the independent variable x. The value of f(x) is the fitness value. Normalized fitness over the whole population determines the probability Pselect of being selected for reproduction. Table 1 shows an initial condition for the algorithm starting with five individuals.

Now select two individuals for reproduction. Selections are made by accessing a random number generator, using the probabilities shown in the column

Table 1: Initial condition for the genetic algorithm example.

Individual   Genetic Code   x    f(x)   Pselect

1            10110          22    80    0.08
2            10001          17   215    0.22
3            11000          24    12    0.01
4            00010           2   320    0.33
5            00111           7   335    0.36
Average                          192

Table 2: Mating process.

Mating Pair Site New Individual f(x) Pselect

00010         2      10010            192    0.14
10001         2      00001            311    0.23

Table 3: The genetic algorithm example after one step.

Individual Genetic Code x f(x) Pselect

1            10010          18   192    0.14
2            10001          17   215    0.16
3            00001           1   311    0.23
4            00010           2   320    0.23
5            00111           7   335    0.24
Average                          275

Pselect. Suppose that individuals 2 and 4 are picked. The operation we will use is crossover, which requires picking a locus on the string for the crossover site. This is also done using a random number generator. Suppose the result is two, counting the front as zero. The two new individuals that result are shown in Table 2 along with their fitness values. Now these new individuals have to be added to the population, maintaining population size. Thus it is necessary to select individuals for removal. Once again the random number generator is consulted. Suppose that individuals 1 and 3 lose this contest. The result of one iteration of the GA is shown in Table 3.

Note that after this operation the average fitness of the population has increased. You might be wondering why the best individuals are not selected at the expense of the worst. Why go to the trouble of using the fitness values? The reason can be appreciated by considering what would happen if it turned out that all the best individuals had their last bit set to 1. There would be no way to fix this situation by crossover. The solution has a 0 in the last bit position, and there would be no way to generate such an individual. In this case the small population would get stuck at a local minimum. Now you see why the low-fitness individuals are kept around: they are a source of diversity with which to cover the solution space.
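The whole example can be run end to end in a few lines. The Python sketch below follows the recipe above (fitness-proportional selection, one-point crossover, replacement biased toward low fitness); details such as clamping fitness to stay positive for the roulette wheel are our own practical additions, not part of the text.

```python
import random

def fitness(bits):
    """Decode a 5-bit string to x and return f(x) = -x^2 + 12x + 300."""
    x = int(bits, 2)
    return -x * x + 12 * x + 300

def pick(population, weights):
    """Roulette-wheel choice of one individual given (unnormalized) weights."""
    return random.choices(population, weights=weights, k=1)[0]

def generation(population):
    """Select two parents by fitness, cross them over at a random site, and
    let the offspring replace members chosen with a bias toward low fitness."""
    fits = [max(fitness(p), 1) for p in population]   # clamp keeps weights positive
    p1, p2 = pick(population, fits), pick(population, fits)
    site = random.randint(1, 4)
    children = [p1[:site] + p2[site:], p2[:site] + p1[site:]]
    removal_bias = [sum(fits) - f for f in fits]      # proportional to 1 - normalized f
    for child in children:
        victim = pick(population, removal_bias)
        population[population.index(victim)] = child
    return population

population = ["10110", "10001", "11000", "00010", "00111"]   # Table 1
for _ in range(20):
    generation(population)
print(max(population, key=fitness))   # often converges near x = 6, i.e., "00110"
```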

2 Schemata

In very simple form, the example exhibits an essential property of the genetic coding: that individual loci in the code can confer fitness on the individual. Since the optimum is at x = 6, all individuals with their leading bit set to 0 will be fit, regardless of the rest of the bits. In general, the extent to which loci can confer independent fitness will simplify the search. If the bits were completely independent, they could be tested individually and the problem would be very simple, so the difficulty of the problem is related to the extent to which the bits interact in the fitness calculation.

A way of getting a handle on the impact of sets of loci is the concept of schemata (singular: schema).1 A schema denotes a subset of strings that have identical values at certain loci. The form of a schema is a template in which the common bits are indicated explicitly and a "don't care" symbol (∗) is used to indicate the irrelevant part of the string (from the standpoint of the schema). For example, 1∗101 denotes the strings {10101, 11101}. Schemata contain information about the diversity of the population. For example, a population of n individuals using a binary genetic code of length l contains somewhere between 2^l and n·2^l schemata.
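The "template" reading of a schema is easy to make concrete. A small Python sketch (ours, for illustration): matching a string against a schema, enumerating a schema's instances, and listing the 2^l schemata that a single string belongs to.

```python
from itertools import product

def matches(schema, string):
    """True if the string is an instance of the schema ('*' is don't-care)."""
    return all(s == "*" or s == c for s, c in zip(schema, string))

def instances(schema):
    """All binary strings described by a schema."""
    choices = ["01" if s == "*" else s for s in schema]
    return ["".join(bits) for bits in product(*choices)]

def schemata_of(string):
    """Every schema that a single string belongs to: 2**len(string) of them."""
    options = [(c, "*") for c in string]
    return {"".join(s) for s in product(*options)}

print(instances("1*101"))          # ['10101', '11101']
print(len(schemata_of("10101")))   # 32 = 2**5
```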

Not all schemata are created equal, because of the genetic operations, which tend to break up some schemata more than others. For example, 1∗∗∗∗1 is more vulnerable to crossover than ∗11∗∗∗. In general, short schemata will be the most robust.

2.1 Schemata Theorem

To see the importance of schemata, let's track the number of representatives of a given schema in a population. It turns out that the growth of a particular schema in a population is very easy to determine. Let t be a variable that denotes a particular generation, and let m(S, t) be the number of exemplars of schema S in a population at generation t. To simplify matters, ignore the effects of crossover in breaking up schemata. Then the number of this schema in the new population is directly proportional to the chance of picking an individual that has the schema. Considering the entire population, this is

m(S, t + 1) = m(S, t) · n f(S) / ∑i fi

because an individual is picked with probability f(S)/∑i fi and there are n picks. This equation can be written more succinctly as

m(S, t + 1) = m(S, t) · f(S)/fave

To see the effects of this equation more vividly, adopt the further simplifying assumption that f(S) remains above the average fitness by a constant amount. That is, for some c, write

f(S) = fave(1 + c)

Then it is easy to show that

m(S, t) = m(S, 0)(1 + c)^t

In other words, for a fitness that is slightly above the mean, the number of schema instances will grow exponentially, whereas if the fitness is slightly below the average (c negative), the schema will decrease exponentially. This equation is just an approximation because it ignores things like new schemata that can be created with the operators, but nonetheless it captures the main dynamics of schema growth.
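As a quick numeric illustration (with made-up values, not taken from the text): a schema that starts with m(S, 0) = 3 exemplars and stays 10% above average fitness (c = 0.1) grows to m(S, 20) = 3(1.1)^20 ≈ 20 exemplars after 20 generations, while one that stays 10% below average (c = −0.1) shrinks to 3(0.9)^20 ≈ 0.4; that is, it effectively disappears.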

2.2 The Bandit Problem

The upshot of the previous analysis has been to show that fit schemata propagate exponentially at a rate that is proportional to their relative fitness. Why is this a good thing to do? The answer can be developed in terms of a related problem, the two-armed bandit problem (Las Vegas slot machines are nicknamed "one-armed bandits"). The two-armed slot machine is constructed so that the arms have different payoffs. The problem for the gambler is to pick the arm with the higher payoff. If the arms had the same payoff on each pull the problem would be easy: just pull each lever once and then pull the winner after that. The problem is that the arms pay a random amount, say with means m1 and m2 and corresponding variances σ1² and σ2².

This problem can be analyzed by choosing a fixed strategy for N trials. One such strategy is to pull both levers n times (where 2n < N) and then pull the best for the remaining number of trials. The expected loss for this strategy is

L(N, n) = |m1 − m2|{(N − n)p(n) + n[1 − p(n)]}    (1)

where p(n) is the probability that the arm that is actually worst looks the best after n trials. We can approximate p(n) by

p(n) ≈ e^(−x²/2) / (√(2π) x)

where

x = (m1 − m2)√n / √(σ1² + σ2²)

With quite a bit of work this equation can be differentiated with respect to n to find the optimal experiment size n∗. The net result is that the total number of trials grows at a rate that is greater than an exponential function of n∗. More refined analyses can be done, but they only confirm the basic result: once you think you know the best lever, you should pull that lever an exponentially greater number of times. You only keep pulling the bad lever on the remote chance that you are wrong. This result generalizes to the k-armed bandit problem: resources should be allocated among the k arms so that the best arms receive an exponentially increasing number of trials, in proportion to their estimated advantage.
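Rather than differentiate, the trade-off in equation (1) can simply be scanned numerically. The Python sketch below evaluates L(N, n) with the approximation for p(n) and picks the best experiment size n∗; the payoff statistics are invented for illustration, and clamping p(n) at 1 is our own guard (the tail approximation is only accurate when x is large).

```python
import math

def p_wrong(n, m1, m2, s1, s2):
    """Approximate probability that the worse arm looks best after n pulls of each."""
    x = abs(m1 - m2) * math.sqrt(n) / math.sqrt(s1**2 + s2**2)
    return min(1.0, math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * x))

def expected_loss(N, n, m1, m2, s1, s2):
    """L(N, n): committing to the wrong arm for N - n pulls, plus the n wasted
    exploratory pulls of the worse arm when the right arm is identified."""
    p = p_wrong(n, m1, m2, s1, s2)
    return abs(m1 - m2) * ((N - n) * p + n * (1 - p))

N, m1, m2, s1, s2 = 10_000, 1.0, 0.8, 1.0, 1.0
n_star = min(range(1, N // 2), key=lambda n: expected_loss(N, n, m1, m2, s1, s2))
print(n_star, expected_loss(N, n_star, m1, m2, s1, s2))
```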

In the light of this result let us return to the analysis of schemata. In particular, consider schemata that compete with each other. For example, the following schemata all compete with each other:

∗∗0∗00∗
∗∗0∗01∗
∗∗0∗10∗
∗∗0∗11∗
∗∗1∗00∗
∗∗1∗01∗
∗∗1∗10∗
∗∗1∗11∗

Do you see the relationship between the bandit problem and schemata? If these schemata act independently to confer fitness on the individuals that contain them, then the number of each schema should be increased exponentially according to its relative fitness. But this is what the GA is doing!

Summary You should recognize that all of the foregoing discussion has not been a proof that GAs work, but merely an argument. The summary of the argument is as follows:

• GAs seem to work. In practice, they find solutions much faster (with higher probability) than would be expected from random search.

• If all the bits in the GA encoding were independent, it would be a simple matter to optimize over each one of the bits independently. This is not the case, but the belief is that for many problems, one can optimize over subsets of bits. In GAs these are schemata.

• Short schemata have a high probability of surviving the genetic operations.

• Focusing on short schemata that compete shows that, over the short run, the fittest are increasing at an exponential rate.

• This has been shown to be the right thing to do for the bandit problem, which optimizes reward for competing alternatives with probabilistic payoffs that have stable statistics.

• Ergo, if all of the assumptions hold (we cannot tell whether they do, but we suspect they do), GAs are optimal.

3 Determining Fitness

In the simple example used to illustrate the basic mechanisms of the GA, fitness could be calculated directly from the code. In general, though, it may depend on a number of fitness cases taken from the environment. For example, suppose the problem is to find a parity function over binary inputs of length four. All 2^4 inputs may have to be tested on each individual in the population. In these kinds of problems, fitness is often usefully defined as the fraction of successful answers to the fitness cases. When the number of possible cases is small, they can be exhaustively tested, but as this number becomes large, the expense of the overall algorithm increases proportionately. This is the impetus for some kind of approximation method that would reduce the cost of the fitness calculation. The following sections describe two methods for controlling this cost.

3.1 Racing for Fitness

One of the simplest ways of economizing on the computation of fitness is to estimate fitness from a limited sample. Suppose that the fitness score is the mean of the fitnesses for each of the samples, or fitness cases. As these are evaluated, the estimate of the mean becomes more and more accurate. The fitness estimate is going to be used for the selection of members of the population for reproduction. It may well be that this fitness can be estimated to a useful level of accuracy long before the entire retinue of fitness cases has been evaluated.2 The estimate of fitness is simply the average of the individual fitness cases tested so far. Assuming there are n of them, the sample mean is given by

f̄ = (1/n) ∑_{i=1}^{n} fi

and the sample variance by

s² = (1/(n − 1)) ∑_{i=1}^{n} (fi − f̄)²

Given that many independent estimates are being summed, make the assumption that the distribution of estimates of fitness is normal, that is, N(µ, σ). Then the marginal posterior distribution of the mean is a Student distribution with mean f̄, variance s²/n, and n − 1 degrees of freedom. A Student distribution is the distribution of the random variable

t = (f̄ − µ)√(n − 1) / s

which is a way of estimating the mean of a normal distribution that does not depend on the variance.3 The cumulative density function can be used to design a test for acceptance of the estimate. Once the sample mean is sufficiently close to the mean, testing fitness cases can be discontinued, and the estimate can be used in the reproduction phase.

Example: Testing Two Fitness Cases Suppose that we have the sample means and variances of the fitness cases for two different individuals after ten tests. These are f̄1 = 21, f̄2 = 17, and s1 = 3. Let us stop testing if the two means are significantly different at the 5% level. To do so, choose an interval (−a, a) such that

P (−a < t < a) = 0.95

Since n − 1 = 9, there are 9 degrees of freedom. A table for the Student distribution with nine degrees of freedom shows that

P (−2.26 < t < 2.26) = 0.95

Now for the test use the sample mean of the second distribution as the hypothetical mean of the first; that is,

t = (f̄1 − f̄2)√(n − 1) / s1

With f̄1 = 21, f̄2 = 17, and s1 = 3,

t = 4

which is outside of the range of 95% probability. Thus we can conclude that the means are significantly different and stop the test.
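A racing loop built on this test is straightforward to sketch. The Python below is our illustration (eval1, eval2, and the fixed critical value are assumptions; a fuller version would look up the critical value for the current degrees of freedom): it evaluates fitness cases one at a time and stops as soon as the two estimates differ significantly. With the numbers above (n = 10, means 21 and 17, s1 = 3) the statistic is (21 − 17)·√9/3 = 4 > 2.26, so testing stops.

```python
import math
from statistics import mean, stdev

def t_statistic(scores1, scores2):
    """Treat the second sample mean as the hypothetical mean of the first:
    t = (mean1 - mean2) * sqrt(n - 1) / s1."""
    n = len(scores1)
    return (mean(scores1) - mean(scores2)) * math.sqrt(n - 1) / stdev(scores1)

def race(eval1, eval2, fitness_cases, critical_t=2.26, min_cases=10):
    """Score two individuals case by case; stop early once they differ
    significantly (2.26 is the 5% Student value for 9 degrees of freedom)."""
    scores1, scores2 = [], []
    for case in fitness_cases:
        scores1.append(eval1(case))
        scores2.append(eval2(case))
        if len(scores1) >= min_cases and stdev(scores1) > 0:
            if abs(t_statistic(scores1, scores2)) > critical_t:
                break
    return mean(scores1), mean(scores2), len(scores1)
```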

3.2 Coevolution of Parasites

Up to this point the modeling effort has focused on a single species. But we know that "nature is red in tooth and claw"; that is, the competition between different species can improve the fitness of each. To demonstrate

Figure 2: An eight-element sorting network. The eight lines denote the eight possible inputs. A vertical bar between the lines means that the values on those lines are swapped at that point. The disposition of the particular input sample is tracked through the network by showing the results of the swaps.

this principle we will study the example of efficient sorting networks. This example was introduced by Hillis,4 who also introduced new features to the basic genetic algorithm.

The idea of a sorting network is shown in Figure 2. Numbers appear at the input lines on the left. At each crossbar joining two lines, the numbers are compared and then swapped if the higher number is on the lower line (and otherwise left unchanged).

Recalling the basic result from complexity analysis that sorting takes at least n log n operations, you might think that the best 16-input network would contain at least 64 swaps.5 This result holds only in the limit of large n; for 16-input networks, a solution has been found that does the job in 60 swaps. Hillis shows that a GA encoding can find a 65-swap network, but that introducing parasites results in a 61-swap network.

The encoding for the sorting problem distinguishes a model genotype from a phenotype. In the genotype a chromosome is a sequence of pairs of numbers. Each pair specifies the lines that are to be tested for swapping at that point in the network. A diploid encoding contains a second chromosome with a similar encoding.

To create a phenotype, the chromosome pair is used as follows. At each gene location, if the alleles are the same—that is, have the same pairs of numbers—then only one pair is used. If they are different, then both are used (see Figure 3). The result is that heterozygous pairs result in larger sorting networks, whereas homozygous pairs result in shorter networks. The advantage of this encoding strategy is that the size of the network does not have to be explicitly included in the fitness function. If a sequence is useful, it is likely to appear in many genes and will thus automatically shorten the network during the creation of the phenotype.
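The genotype-to-phenotype translation is just a pass over paired genes. A minimal Python sketch of the rule as described (our own illustration, not Hillis's code):

```python
def phenotype(chromosome_a, chromosome_b):
    """Translate a diploid genotype into a comparator list.
    Each chromosome is a list of (i, j) pairs naming the lines to compare.
    Homozygous positions contribute one comparator, heterozygous positions
    contribute both, so a useful pair that spreads to both chromosomes
    automatically shortens the network."""
    network = []
    for gene_a, gene_b in zip(chromosome_a, chromosome_b):
        network.append(gene_a)
        if gene_b != gene_a:
            network.append(gene_b)
    return network

# Example: three gene positions, the first two homozygous.
print(phenotype([(0, 1), (2, 3), (1, 2)],
                [(0, 1), (2, 3), (0, 3)]))   # [(0, 1), (2, 3), (1, 2), (0, 3)]
```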

Figure 3: The translation of genotype to phenotype used by Hillis demonstrated on a four-input sorting network. The genetic code contains a diploid representation with two possible swaps at each location. When the possible swaps are identical, only one is used in constructing the phenotype, resulting in a shorter network. In the example, the last pair results in two links, the second of which is redundant.

The fitness function is the percentage of the sequences that are sorted correctly. Experience shows that indexing the mating program with spatial locality is helpful. Rather than randomly selecting mates solely on fitness, each individual is assigned a location on a two-dimensional grid. The choice of mates is biased according to a Gaussian distribution, which is a function of grid distance.

Another nice property of the sorting networks is that just testing them with sequences of 0s and 1s is sufficient; if the network works for all arrangements of 0s and 1s, it will work for any sequence of numbers.6 Even so, it is still prohibitively expensive to use all 2^16 test cases of 0s and 1s to evaluate fitness. Instead, a subset of cases is used. A problem with this approach is that the resultant networks tend to cluster around solutions that get most of the cases right but cannot handle a few difficult examples. This limitation is the motivation for parasites. Parasites represent sets of test cases. They evolve similarly, breeding by combining the test cases that they represent. They are rewarded by the reverse criterion from the networks: the number of cases that the network gets wrong. The parasites keep the sorting-network population from getting stuck in a local minimum. If the population does start to converge on such a minimum, the parasites will evolve test cases targeted expressly at it. A further advantage of parasites is that testing is more efficient.
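The two fitness functions—hosts scored by cases sorted, parasites scored by cases missed—are easy to state in code. A Python sketch (ours; the comparator convention and the small 4-line example are assumptions for illustration):

```python
from itertools import product
import random

def run_network(network, values):
    """Apply each comparator (i, j): swap so the smaller value ends up on line i."""
    values = list(values)
    for i, j in network:
        if values[i] > values[j]:
            values[i], values[j] = values[j], values[i]
    return values

def network_fitness(network, test_cases):
    """Host score: fraction of the 0/1 test cases that come out sorted."""
    return sum(run_network(network, c) == sorted(c) for c in test_cases) / len(test_cases)

def parasite_fitness(network, test_cases):
    """Parasite score: number of its cases that the network gets wrong."""
    return sum(run_network(network, c) != sorted(c) for c in test_cases)

# The zero-one principle lets 0/1 vectors stand in for arbitrary numbers, but all
# 2**16 of them is too many, so a parasite carries only a small, evolving subset.
cases_4 = list(product([0, 1], repeat=4))                  # tiny 4-line illustration
network = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]         # a correct 4-input network
parasite = random.sample(cases_4, 6)
print(network_fitness(network, cases_4), parasite_fitness(network, parasite))
```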

Notes

1. The exposition here follows David E. Goldberg's text Genetic Algorithms in Search, Optimization, and Machine Learning (Reading, MA: Addison-Wesley, 1989), which has much additional detail.

2. This idea has been proposed by A. W. Moore and M. S. Lee in "Efficient Algorithms for Minimizing Cross Validation Error," Proceedings of the 11th International Machine Learning Conference (San Mateo, CA: Morgan Kaufmann, 1994). In their proposal they advocate using gradient search instead of the full-blown mechanics of a genetic algorithm, and show that it works well for simple problems. Of course, since gradient search is a local algorithm, its overall effectiveness will be a function of the structure of the search space.

3. H. D. Brunk, An Introduction to Mathematical Statistics, 3rd ed. (Lexington, MA: Xerox College, 1975).

4. W. Daniel Hillis, "Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure," in Christopher Langton et al., eds., Artificial Life II, SFI Studies in the Sciences of Complexity, vol. 10 (Reading, MA: Addison-Wesley, 1991).

5. Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms (Cambridge, MA: MIT Press, 1990).

6. Ibid.

ABSTRACT Genetic programming (GP) applies the genetic algorithm directly to programs. The generality of programs allows many different problems to be tackled with the same methodology. Experimental results show that even though GP is searching a vast space, it has a high probability of generating successful results. Extensions to the basic GP algorithm add subroutines as primitives. This has been shown to greatly increase the efficiency of the search.

4 Introduction

In genetic algorithms the problem is encoded as a string. This means that there is always a level of indirection whereby the meaning of the string has to be interpreted. A more direct way of solving a problem would be to have the string encode the program for solving the problem. Then the search strategy would be to have the genetic operations act on the programs themselves. The advantage of this approach would be that programs could rate themselves according to how well they are doing and rewrite themselves to do better. The difficulty is that, in general, making a change to a program will be disastrous to its function. To make it work, there must be some way of encoding changes that makes the genetic modifications less destructive. This is the goal of genetic programming, which works directly with programs encoded as trees instead of strings.

A beautiful idea is to use the LISP programming language.1 This is a functional language based on lists. For example, the function call f(x, y) is encoded simply as

(f x y)

which is a function followed by a list of arguments to that function. Running the program, or evaluating the function, results in the value of the function being computed. For example, evaluating the LISP list

(* 3 4)

results in a value of 12. Complex programs are expressed by using composite functions. For example,

(+ 2 (IF (> X 3) 4 7))

evaluates to 6 if X is greater than 3; otherwise it evaluates to 9. The list structure consists of terminals and functions. For instance, in the present example,

{X, 2, 3, 4, 7}

are terminals, whereas

{IF, +, >}

are functions.2

Every LISP program has this simple list structure. This has several advantages:

• The uniform list structure represents both program code and data. This simple syntax allows genetic operations to be simply defined.

• The format of the instructions is one of a function applied to arguments. This represents the control information in a common syntax, which is again amenable to modification with genetic operators.

• The variable-length structure of lists escapes the fixed-length limitation of the strings used in genetic algorithms.

5 Genetic Operators for Programs

A LISP program can be written as a nested list and drawn as a tree. The former fact is helpful because it allows the definition of the effects of operators on programs. The latter fact helps immensely because it allows the straightforward visualization of these effects. For example,

(* x (+ x y))

Figure 4: Functions in LISP can be interpreted as trees. Two examples are (* x (+ x y)) and (+ (* z y) (/ y x)).

Figure 5: The genetic programming crossover operator works by picking fracture points in each of two programs. Then these sublists are swapped to produce two new offspring.

and

(+ (* z y) (/ y x))

can be thought of as representing the programs that compute x(x + y) and zy + y/x. These programs in their list structure can also be interpreted as trees, as in Figure 4.

Operators have a very elegant structure in the list format. The first step in applying an operator consists of identifying fracture points in the list for modification. These are analogous to crossover points in the string representation used by genetic algorithms. A fracture point may be either the beginning of a sublist or a terminal. The following examples illustrate the use of genetic operators.

To implement crossover, pick two individual programs. Next select any two sublists, one from each parent. Switch these sublists in the offspring. For example, pick y and (/ y x) from the two preceding examples. Then the offspring will be (* x (+ x (/ y x))) and (+ (* z y) y), shown in tree form in Figure 5.

To implement inversion, pick one individual from the population. Next select two fracture points within the individual. The new individual is obtained by switching the delimited subtrees. For example, using (+ (* z y) (/ y x)), let's pick z and (/ y x). The result is (+ (* (/ y x) y) z), shown in tree form in Figure 6.

Figure 6: The genetic programming inversion operator works by picking two fracture points in a single program. These sublists are swapped to produce the new offspring.

Figure 7: The genetic programming mutation operator works by picking a fracture point in a single program. Then the delimited terminal or nonterminal is changed to produce the new offspring.

To implement mutation, select one parent. Then replace randomly any function symbol with another function symbol or any terminal symbol with another terminal symbol. That is, (+ (* z y) (/ y x)) could become (+ (+ z y) (/ y x)), shown in tree form in Figure 7. Mutation may also be implemented by replacing a subtree with a new, randomly generated subtree.

What you have seen is that the genetic operators all have a very simple implementation in the LISP language owing to its list structure. But there is one caveat. These examples are all arithmetic functions that return integers, so there is no problem with swapping their arguments. However, there may be a problem if the results returned by the swapped functions are not commensurate. For example, if one function returns a list and it is swapped with another that returns an integer, the result will be a syntactically invalid structure. To avoid this error (at some cost), either all functions should return the same type of result, or some type checking has to be implemented.
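Because the programs are just nested lists, the operators stay simple even outside LISP. The Python sketch below (our illustration) represents a program as a nested Python list mirroring the LISP form and implements crossover by swapping subtrees at randomly chosen fracture points; inversion and mutation follow the same pattern.

```python
import copy, random

# A program is a nested list mirroring the LISP form, e.g. ["*", "x", ["+", "x", "y"]].

def fracture_points(program, path=()):
    """Paths (tuples of child indices) to every sublist or terminal below the root."""
    points = []
    if isinstance(program, list):
        for i, arg in enumerate(program[1:], start=1):
            points.append(path + (i,))
            points.extend(fracture_points(arg, path + (i,)))
    return points

def get_subtree(program, path):
    for i in path:
        program = program[i]
    return program

def set_subtree(program, path, subtree):
    for i in path[:-1]:
        program = program[i]
    program[path[-1]] = subtree

def crossover(parent1, parent2):
    """Swap one randomly chosen subtree from each parent (cf. Figure 5)."""
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1 = random.choice(fracture_points(child1))
    p2 = random.choice(fracture_points(child2))
    s1, s2 = get_subtree(child1, p1), get_subtree(child2, p2)
    set_subtree(child1, p1, copy.deepcopy(s2))
    set_subtree(child2, p2, copy.deepcopy(s1))
    return child1, child2

a = ["*", "x", ["+", "x", "y"]]
b = ["+", ["*", "z", "y"], ["/", "y", "x"]]
print(crossover(a, b))
```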

6 Genetic Programming

The structure of the genetic programming algorithm, Algorithm II, is identical to that of the basic genetic algorithm (Algorithm I). The principal difference is the interpretation of the string as a program. This difference shows up in the evaluation of fitness, wherein the program is tested on a set of inputs by evaluating it.

As examples of genetic programming we consider two problems. One is the parity problem: given a set of binary inputs of fixed size n, the program must correctly determine their parity. Although it is easy to build a parity function without using GP, this problem is valuable as a test case, as it is a difficult problem for learning algorithms. Knowing the parity of subsets of the input is insufficient to predict the answer.

Algorithm II Genetic Programming

Choose a population size Np.
Choose the number of generations NG.
Initialize the population.
Repeat the following for NG generations:

1. Select a given number of pairs of individuals from the population probabilistically after assigning each structure a probability proportional to observed performance.

2. Copy the selected structure(s), then apply operators to them to produce new structure(s).

3. Select other elements at random and replace them with the new structure(s).

4. Observe and record the fitness of the new structure.

Output the fittest individual as the answer.

The second example is that of the Pac-Man video game. In this case the program must learn to control a robotic agent in a simulated environment that has rewards and punishments.

Example 1: Parity Consider the problem of discovering the even-n parity function. The even-3 parity function is shown in Table 4. For this problem the terminals are given by three symbols that represent the three possible inputs to be tested for parity:

T = {D0, D1, D2}

The function set is given by the set of primitive Boolean functions of two arguments

F = {AND, OR, NAND, NOR}

Using a population size of 4,000, a solution was discovered at generation 5. The solution contains 45 elements (the number of terminals plus the number of nonterminals). The individual comprising the solution is shown in the

Table 4: The even-3 parity function.

Input   Output
000     1
001     0
010     0
011     1
100     0
101     1
110     1
111     0

following program. As you can see, this solution is not the minimal function that could be written by hand but contains many uses of the primitive function set, mostly because there is no pressure on the selection mechanism to choose minimal functions.

(AND (OR (OR D0 (NOR D1 D2)) D2)

(AND (NAND (NOR (NOR D0 D2)

(AND (AND D1 D1) D1))

(NAND (OR (AND D0 D1) D2) D0))

(OR (NAND (AND (D0 D2)

(OR (NOR D0 (OR D2 D0) D1))

(NAND (NAND D1 (NAND D0 D1)) D2)))
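The fitness computation behind this run is just the fraction of the 2^3 fitness cases the program gets right. A Python sketch (ours, for illustration): programs are nested lists over the terminal and function sets above, and the sanity-check program at the bottom is a hand-written even-3 parity function, not the evolved one.

```python
from itertools import product

FUNCTIONS = {
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR":  lambda a, b: not (a or b),
}

def evaluate(expr, env):
    """Evaluate a nested-list Boolean expression over terminals D0, D1, D2."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    return FUNCTIONS[op](*(evaluate(a, env) for a in args))

def parity_fitness(expr):
    """Fraction of the 2**3 fitness cases on which expr matches even-3 parity."""
    hits = 0
    for bits in product([False, True], repeat=3):
        env = {"D0": bits[0], "D1": bits[1], "D2": bits[2]}
        target = (sum(bits) % 2 == 0)          # even parity, as in Table 4
        hits += bool(evaluate(expr, env)) == target
    return hits / 8

xnor = lambda a, b: ["NAND", ["OR", a, b], ["NAND", a, b]]   # even-2 parity
invert = lambda a: ["NOR", a, a]                             # NOT built from NOR
program = xnor(invert("D0"), xnor("D1", "D2"))               # even-3 parity
print(parity_fitness(program))   # 1.0
```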

Example 2: Pac-Man As another illustration of the use of GP, consider learning to play the game of Pac-Man.3 The game is played on a 31 × 28 grid, as shown in Figure 8. At every time step, the Pac-Man can remain stationary or move one step in any possible direction in the maze. The goal of the game is to maximize points. Points are given for eating the food pellets arrayed along the maze corridors. Energizers are special objects and are worth 50 points each. At fixed times t = 25 and t = 125 a very valuable piece of fruit worth 2,000 points appears, moves on the screen for 75 time steps, and then disappears.

In the central den are four monsters. After the game begins they emerge from the den and chase the Pac-Man. If they catch him the game ends. Thus the game ends when the Pac-Man is eaten or when there is nothing left for him to eat. In this eat-or-be-eaten scenario the energizers play a valuable role. After the Pac-Man has eaten an energizer, all the monsters are vulnerable for 40 time steps and can be eaten. (In the video game the monsters turn blue for this period.) Thus a strategy for a human player is to lure the monsters close to an energizer, eat the energizer, and then eat the monster. Eating more than one monster is especially valuable. The first is worth 200 points, the next 400, the next 800, and the last 1,600.

Figure 8: The board for the Pac-Man game, together with an example trace of a program found by genetic programming. The path of the Pac-Man is shown with vertical dots, and the paths of the monsters are shown with horizontal dots. In this example the Pac-Man starts in the middle of the second row from the bottom. The four monsters start in the central den. The Pac-Man first moves toward the upper right, where it captures a fruit and a pill, and subsequently lures the monsters to the lower left corner. After eating a pill there it eats three monsters and almost catches the fourth.

Table 5: The GP Pac-Man nonterminals (control functions). (After Koza, 1992.)

Control Function Name            Purpose
If-Blue (IFB)                    A two-argument branching operator that executes the first branch if the monsters are blue; otherwise the second.
If-Less-Than-or-Equal (IFLTE)    A four-argument branching operator that compares its second argument to its first. If it is less, the third argument is executed; otherwise the fourth is executed.

The monsters are as fast as the Pac-Man but are not quite as dogged. Of every 25 time steps, 20 are spent in pursuit of the Pac-Man and 5 in moving at random. But since there are more of them, they can succeed by cornering the Pac-Man.

The parity problem was selected for its special appeal: its difficulty makes it a good benchmark. In the same way the Pac-Man problem is especially interesting as it captures the problem of survival in the abstract. The Pac-Man has to feed itself and survive in the face of predators. It has primitive sensors that can identify food and foes as well as motor capability. As such it is an excellent test bed to study whether GP can successfully find a good control strategy. Furthermore, the GP operations are so simple you can imagine how they might be done neurally, in terms of changing wiring patterns that introduce a changing control structure. Thus the evolutionary roots of the algorithm may be translatable into neural growth.

Tables 5 and 6 show the GP encoding of the problem. The two nonterminals appear as interior nodes of the tree and allow the Pac-Man to change behavior based on current conditions. The terminals either move the Pac-Man in the maze or take a measurement of the current environment, but they cannot call other functions.

Figure 9 shows the solution obtained in a run of GP after 35 generations.4 Note the form of the solution. The program solves the problem almost by rote, since the environment is similar each time. The monsters appear at the same time and move in the same way (more or less) each time. The regularities in the environment are incorporated directly into the program code. Thus while the solution is better than what a human programmer might come up with, it is very brittle and would not work well if the environment changed significantly.

The dynamics of GP can also be illustrated with this example. Figure 10 shows how the population evolves over time. Plotted are the number of individuals with a given fitness value as a function of both fitness value and generations. An obvious feature of the graph is that the maximum fitness value increases as a function of generations. However, an interesting subsidiary feature of the graph is the dynamics of genetic programming. Just as in the analysis of GAs, fit schemata will tend to increase in the population exponentially,

[Figure 9 reproduces the complete evolved program: a deeply nested composition of IFB and IFLTE tests over the terminals of Table 6. The listing does not survive extraction and is omitted here.]

Figure 9: Program code that plays Pac-Man shows the ad hoc nature of the solution generated by GP. The numbers at the line beginnings indicate the level of indentation in the LISP code. (After Koza, 1992.)

Table 6: The GP Pac-Man terminals. (After Koza, 1992.)

Terminal Function Name              Purpose
Advance-to-Pill (APILL)             Move toward nearest uneaten energizer
Retreat-from-Pill (RPILL)           Move away from nearest uneaten energizer
Distance-to-Pill (DPILL)            Distance to nearest uneaten energizer
Advance-to-Monster-A (AGA)          Move toward monster A
Retreat-from-Monster-A (RGA)        Move away from monster A
Distance-to-Monster-A (DISA)        Distance to monster A
Advance-to-Monster-B (AGB)          Move toward monster B
Retreat-from-Monster-B (RGB)        Move away from monster B
Distance-to-Monster-B (DISB)        Distance to monster B
Advance-to-Food (AFOOD)             Move toward nearest uneaten food
Distance-to-Food (DISD)             Distance to nearest food
Advance-to-Fruit (AFRUIT)           Move toward nearest fruit
Distance-to-Fruit (DISF)            Distance to nearest fruit

Figure 10: The dynamics of GP are illustrated by tracking the fitness of individuals as a function of generations. The vertical axis shows the number of individuals with a given fitness value. The horizontal axes show the fitness values and the generations. The peaks in the figure show clearly the effects of good discoveries. The fittest of the individuals rapidly saturate almost the whole population.

and that fact is illustrated here. When a fit individual is discovered, the number of individuals with that fitness level rapidly increases until saturation. At the same time the sexual reproduction operators are working to make the population more diverse; that is, they are working to break up this schema and increase the diversity of the population. Once the number of this schema has saturated, then the diversifying effects of the sexual reproduction can catch up to the slowed growth rate of individuals with the new schema.

7 Analysis

Since the performance of GP is too difficult to approach theoretically, we can attempt to characterize it experimentally.5 The experimental approach is to observe the behavior of the algorithm on a variety of different runs of GP while varying NG and Np, and then attempt to draw some general conclusions. The most elementary parameter to measure is the probability of generating a solution at a particular generation k using a population size Np. Call this Pk(Np), and let the cumulative probability of finding a solution up to and including generation i be given by

P(Np, i) = ∑_{k=1}^{i} Pk(Np)

One might think that Pk would increase monotonically with k, so that using more generations would inevitably lead to a solution, but the experimental evidence shows otherwise. The fitness of individuals tends to plateau, meaning that the search for a solution has gotten stuck in a local minimum. Thus for most cases Pk eventually decreases with k. The recourse is to assume that successive experiments are independent and find the optimum with repeated runs that use different initial conditions.

A simple formula dictates the number of runs needed based on empirical success measures. This process leads to a trade-off between population size and multiple runs.

Assuming the runs are independent, you can estimate the number needed as long as P(Np, i) is known. This process works as follows. Let the probability of getting the answer after G generations be PA. Then PA is related to the number of experiments r by

PA = 1 − [1 − P(Np, G)]^r

Since the variable of interest is r, take logarithms of both sides,

ln(1 − PA) = r ln[1 − P(Np, G)]

or

r = ln ε / ln[1 − P(Np, G)]

where ε is the error, (1 − PA). These formulas allow you to get rough estimates for the GP algorithm on a particular problem, but a more basic question that you might have is about the efficacy of GP versus ordinary random search. Could one do as well by just taking the search effort and expending it by testing randomly generated examples? It turns out that the answer to this question is problem dependent, but for most hard problems the experimental evidence is that the performance of GP is much better than random search.6
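The run-count formula is a one-liner in code. A small Python sketch (our illustration; the 25% single-run success rate is an invented number):

```python
import math

def runs_needed(p_single_run, target=0.99):
    """Independent runs r needed so the chance of at least one success
    reaches `target`, given P(Np, G) = p_single_run for one run."""
    epsilon = 1.0 - target                      # allowed failure probability
    return math.ceil(math.log(epsilon) / math.log(1.0 - p_single_run))

# If one run of G generations succeeds 25% of the time, 17 independent runs
# push the overall success probability above 99%.
print(runs_needed(0.25))   # 17
```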

8 Modules

A striking feature of the Pac-Man solution shown in Figure 9 is its nonhierarchical structure. Each piece of the program is generated without taking advantage of similarities with other parts of the code. In the same way the parity problem would be a lot easier to solve if functions that solved subproblems could be used. Ideally one would like to define a hierarchy of functions, but it is not at all apparent how to do so. The obvious way would be to denote a subtree as a function and then add it to the library of functions that can be used in the GP algorithm. In LISP, the function call (FUNC X Y) would simply be replaced by a defined function name, say FN001, whose definition would be (FUNC X Y), and the rest of the computations would proceed as in the standard algorithm. The problem with this idea is the potential explosion in the number of functions. You cannot allow functions of all sizes in the GP algorithm owing to the expense of testing them. Each program is a tree containing subprograms as subtrees. Naively testing all subtrees in a population increases the workload by a factor of N². However, it is possible to test small trees that have terminals at their leaves efficiently by using a data structure that keeps track of them.7 This way the price of testing the trees is reduced to a manageable size. Thus to make modules, all trees of size less than M are tested, for some M that is chosen to be practical computationally.

The modular version of GP for the parity problem tests each of these small subtrees to see if they solve the problem for their subset of inputs. If they do, then each of those subtrees is made a function with the number of arguments equal to the number of variables used in the subtree. This function is then added to the pool of nonterminals and can be used anywhere in the population. This process is recursive. Once a new function is added to the nonterminal pool, it can be used as part of yet newer function compositions. If later generations

Algorithm III Modular Genetic Programming

Choose a population size.
Choose the number of generations NG.
Initialize the population.
Repeat the following for NG generations:

1. Select a given number of pairs of individuals from the population probabilistically after assigning each structure a probability proportional to observed performance.

2. Copy the selected structure(s), then apply operators to them to produce new structure(s).

3. Select other elements at random and replace them with the new structure(s).

(a) Test all the subtrees of size less than M. If there are any subtrees that solve a subproblem, add them to the nonterminal set.

(b) Update the subtree data set.

4. Observe and record the fitness of the new structure.

Output the fittest individual as the answer.

solve a larger parity problem using the new function as a component, they can be treated in the same way: another new function is added to the nonterminal set. These constraints are formalized in Algorithm III.

The modular approach works spectacularly well for parity, as shown in Table 7. Consider the entries for the even-5 parity problem. Genetic programming without modules requires 50 generations, whereas the same problem can be solved in only 5 generations using modular GP. The reason, of course, is that functions that solve the parity of subsets of the data are enormously helpful as components of the larger solution.

Figure 11 shows the generation and use of new functions in the course of solving the even-8 parity problem. In the course of the solution, functions that solved the 2, 3, 4, and 5 parity problems were generated at generations 0, 1, 3, and 7, respectively. The specific code for each of these functions is shown in Table 8. These functions may look cumbersome, but they are readily interpretable. Function F681 is the negation of its single argument. Note that this was not in the initial function library. Function F682 computes the parity of its two arguments. Not surprisingly this function is extremely useful in larger parity problems, and you can see that it is used extensively in the final solution shown in the table.

Table 7: Comparison of GP and modular GP for parity problems of different sizes. The table entry is the generation at which the solution was found. Standard GP fails to find a solution to the even-8 parity problem after 50 generations, but modular GP solves the same problem at generation 10.

Method        Even-3   Even-4   Even-5   Even-8
GP            5        23       50       —
Modular GP    2        3        5        10

Table 8: Important steps in the evolutionary trace for a run of even-8 parity. "LAMBDA" is a generic construct used to define the body of new functions. (From Rosca and Ballard, 1994.)

Generation 0. New functions:
[F681]: (LAMBDA (D3) (NOR D3 (AND D3 D3)))
[F682]: (LAMBDA (D4 D3) (NAND (OR D3 D4) (NAND D3 D4)))

Generation 1. New function:
[F683]: (LAMBDA (D4 D5 D7) (F682 (F681 D4) (F682 D7 D5)))

Generation 3. New functions:
[F684]: (LAMBDA (D4 D5 D0 D1 D6) (F683 (F683 D0 D6 D1) (F681 D4) (OR D5 D5)))
[F685]: (LAMBDA (D1 D7 D6 D5) (F683 (F681 D1) (AND D7 D7) (F682 D5 D6)))

Generation 7. The solution found is:
(OR (F682 (F682 (F683 D4 D2 D6) (NAND (NAND (AND D6 D1) (F681 D5)) D1)) (F682 (F683 D5 D0 D3) (NOR D7 D2))) D5)

Figure 11: Call graph for the extended function set in the even-8 parity example showing the generation when each function was discovered. For example, even-5 parity was discovered at generation three and uses even-3 parity, discovered at generation one, and even-2 parity and NOT, each appearing at generation zero. (From Rosca and Ballard, 1994.)

Algorithm IV Selecting a Subroutine

For individuals that have small height:

1. Select a set of promising individuals that have positive differential fitness.

2. Compute the number of activations of each individual, and remove individuals with inactive nodes.

3. For each individual with terminals Ts and block b do the following:

(a) Create a new subroutine having a random subset of the terminals Ts with body (b, Ts).

(b) Choose a program that uses the block b, and replace it with the new subroutine.

8.1 Testing for a Module Function

The previous strategy for selecting modules used all the trees less than some height bound. This strategy can be improved by keeping track of differential fitness. The idea is that promising modules will confer a fitness on their host that is greater than the average, that is,

∆fs = fhost − fparents

Thus a more useful measure of subroutine fitness is to select candidates that have a positive differential fitness. With this modification, you can use Algorithm IV to select subroutines.

8.2 When to Diversify

The distribution of the fitness among individuals in a population is captured by the fitness histogram, which plots, for a set of discrete fitness ranges, the number of individuals within each range. Insight into the dynamics of propagation of individuals using subroutines can be seen by graphing the fitness histogram as a function of generation, as is done in Figure 12. This figure shows that once a good individual appears it grows exponentially to saturate the population. These individuals are subsequently replaced by the exponential growth of a fitter individual. A stunning improvement is seen for the Pac-Man example. Comparing this result with that of Figure 10, the plot of the fitness histogram shows that not only is the population more diversified, but also that the fitness increases faster and ultimately settles at a higher value.

Further insight into the efficacy of the modular algorithm can be obtained by studying the entropy of the population. Let us define an entropy measure by equating all individuals with the same fitness measure.

Figure 12: The dynamics of GP with subroutines are illustrated by tracking the fitness of individuals as a function of generations. The vertical axis shows the number of individuals with a given fitness value. The horizontal axes show the fitness values and the generations. The beneficial effects of subroutines can be seen by comparing this figure with Figure 10.

Figure 13: Entropy used as a measure to evaluate the GP search process. If the solution is found or if the population is stuck on a local fitness maximum, the entropy of the population tends to slowly decrease (top). In searching for a solution the diversity of a population typically increases, as reflected in the fluctuations in entropy in the modular case (bottom).

If fitness is normalized, then it can be used like a probability. Then the entropy is just

E = −∑i fi log fi

When the solution is not found, as in nonmodular GP, the search process decreases the diversity of the population, as shown by the top graph in Figure 13. This result is symptomatic of getting stuck in a local fitness maximum. But the nonlocal search of modular GP continually introduces diversity into the population. This process in turn keeps the entropy measure high, as shown in the lower graph in Figure 13. This behavior can be used to detect when the search is stuck. If the entropy is decreasing but the fitness is constant, then it is extremely likely that the population is being dominated by a few types of individuals of maximum fitness. In this case the diversity of the population can be increased by increasing the rate of adding new modules.
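A sketch of the diagnostic just described: compute the entropy of the fitness histogram each generation and flag stagnation when entropy falls while the best fitness stays flat. The Python below is our illustration (the binning choices and the stagnation window are assumptions); it follows the histogram reading of the entropy measure, with the conventional minus sign.

```python
import math
from collections import Counter

def population_entropy(fitness_values, num_bins=20):
    """Entropy of the fitness histogram: bin the fitness values, turn bin
    counts into probabilities, and return -sum(p log p)."""
    lo, hi = min(fitness_values), max(fitness_values)
    width = (hi - lo) / num_bins or 1.0        # guard: all-equal fitness
    bins = Counter(min(int((f - lo) / width), num_bins - 1) for f in fitness_values)
    n = len(fitness_values)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

def looks_stuck(entropy_trace, best_fitness_trace, window=5):
    """Entropy falling while best fitness is flat suggests a few individuals
    dominate the population: time to add modules or otherwise diversify."""
    return (entropy_trace[-1] < entropy_trace[-window] and
            best_fitness_trace[-1] == best_fitness_trace[-window])
```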

Finally, you can see the effects of the subroutine strategy from Table 9, which compares runs of GP with and without subroutines.

9 Summary

The features of GP search are qualitatively similar to those of GAs. Once an individual of slightly increased differential fitness is produced, its features tend to spread exponentially through the population. During this

Table 9: Comparison between solutions obtained with standard GP, subroutines, and a carefully hand-designed program. The GP parameters have the following values: population size = 500; crossover rate = 90%; reproduction rate = 10% (no mutation). During evolution, fitness is determined by simulation in 1 or 3 different environments. Each table entry shows the total number of points of an evolved Pac-Man controller for the corresponding GP implementation and, in parentheses, the generation when that solution was discovered. For all entries, the best results over 100 simulations were taken.

                 GP               Subroutines        Hand
Environments     1       3        1        3         —
Points           6380    4293     9840     5793      5630
Generation       (67)    (17)     (20)     (36)      —

period the mechanism of sexual reproduction helps this process along. However, once the features saturate the population, sexual reproduction continues by breaking up program segments in the course of searching for new above-average features. Since most changes tend to be harmful, the course of sexual reproduction and schemata growth can be seen as being in dynamic balance.

In the world, nature provides the fitness function. This function is extremely complicated, reflecting the complexity of the organism in its milieu. Inside the GP algorithm, however, fitness must be estimated, using both a priori knowledge and knowledge from the search process itself. The power of subroutines in this context is that they provide a way of pooling the results of the organism's experiments across the population. Thus they provide a way of estimating the effects of a schema en route to a solution.

Notes

1. John R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (Cambridge, MA: MIT Press, Bradford, 1992).

2. Actually, one does not put all the integers into the terminal set. Instead a function random-integer is used; whenever it is called, a random integer is generated and inserted into the program.

3. The Pac-Man example was introduced by Koza in Genetic Programming and refined by Justinian P. Rosca and Dana H. Ballard in "Discovery of Subroutines in Genetic Programming," Chapter 9 in P. Angeline and K. E. Kinnear, Jr., eds., Advances in Genetic Programming 2 (Cambridge, MA: MIT Press, in press).

4. Koza, Genetic Programming.

5. This analysis was first done by Koza in Genetic Programming.

6. See, for example, Chapter 9 of Koza, Genetic Programming.

7. Justinian P. Rosca and Dana H. Ballard, "Genetic Programming with Adaptive Representations," TR 489, Computer Science Department, University of Rochester, February 1994.
