
Evolutionary Computation and Fractal Visualization of Sequence Data

Dan Ashlock
Mathematics Department and Bioinformatics and Computational Biology
Iowa State University
Ames, IA [email protected]

Jim Golden, Ph.D.
Director of Bioinformatics
CuraGen Corporation
555 Long Wharf Drive, 9th Floor
New Haven, CT 06511
http://www.curagen.com

Abstract

Fractals are useful for conveying multiple types of information with shape and color in a single picture. We report on a standard technique for visualizing DNA or other sequence data with a fractal algorithm and then twice generalize the technique to obtain two types of evolvable fractals. Both are forms of iterated function systems (IFS), collections of randomly driven or data driven contraction maps. The first, an indexed IFS, uses incoming data to choose which contraction map will be applied next. The second, called a chaos automata, drives a finite state machine with incoming sequence data and associates a contraction map with each state. Both evolvable fractals are tested on their ability to separate DNA from distinct microbial genomes. Design of the fitness function is a critical issue, as we are trying to create fractals that look good and convey information about the sequences driving them. It is demonstrated that chaos automata are superior to the indexed IFS on the test problem. Potential improvements in the fractal chromosomes and fitness functions, and issues to be resolved to obtain useful applications, are discussed.

1 Introduction

There are several senses in which DNA can be said to have a fractal character. The action of transposable elements upon the genome is akin, for example, to algorithms used to generate fractal objects. In this chapter we will not be documenting or exploiting the fractal character of DNA but rather using randomized fractal algorithms to visualize DNA. The basic idea is simple: replace random numbers with DNA bases in an algorithm that generates a fractal, and the resulting modified fractal is a visualization of the DNA. The fractal character of DNA, with its substantial component of non-randomness, will force the DNA driven fractals to look distinct from fractals produced from streams of random numbers. Our goal is to make those differences informative.

Figure 1: The Sierpinski Triangle

This goal of locating fractals that display informative differences when driven by DNA instead of random numbers, or when driven by DNA from distinct sources, requires that we be able to search the space of fractal generation algorithms for those algorithms that perform as we require. That search is the point at which evolutionary computation enters the mix. We will present three technologies for producing DNA driven fractals. The first is a modification of the well known chaos game and does not require evolutionary computation. It will serve as a point of departure for evolvable fractal generating algorithms. The second technology is an evolvable form of iterated function system (IFS) [Barnsley, 1993]. The third technology is also evolvable and fuses IFS technology with a finite state automaton to create a structure that can respond to long range effects in DNA. In addition to being evolvable, the third technique we present is novel as a fractal generating algorithm, having the potential to permit long range effects in the DNA to influence the character of the fractal.

2 The chaos game

The simplest form of the chaos game starts with the choice of three fixed points in the plane, not on a line. A moving point is initialized at one of these three points and then the following random process is repeated indefinitely. One of the three fixed points is selected at random and the moving point is then moved halfway from its current position to the selected point. The set of points that can be visited by the moving point is called the Sierpinski triangle, shown in Figure 1.
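The process just described can be sketched in a few lines of Python. This is only an illustration: the corner coordinates, the function name `chaos_game`, and the `weight` parameter (0.5 reproduces "move halfway") are assumptions made for the example, not details taken from the chapter.

```python
import random

def chaos_game(corners, steps, rng, weight=0.5):
    """Repeatedly move the point part of the way toward a randomly chosen corner."""
    x, y = corners[0]                          # start at one of the fixed points
    points = []
    for _ in range(steps):
        cx, cy = rng.choice(corners)           # pick a fixed point uniformly at random
        x = (1 - weight) * x + weight * cx     # weight = 0.5 means "move halfway"
        y = (1 - weight) * y + weight * cy
        points.append((x, y))
    return points

# Three fixed points that do not lie on a line (coordinates are arbitrary).
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
points = chaos_game(triangle, 10_000, random.Random(1))
```

Plotting `points` with any scatter-plot tool should show the familiar triangle with its central hole; the visited points never fall into the middle of that hole.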

One of the most common questions asked after someone is introduced to the chaos game is “what happens if four fixed points are used instead of three?” Imagine that these points are the corners of a square with side length 1. Then along each dimension of the square all points with coordinates of the form

$$\sum_{k=1}^{\infty} \frac{x_k}{2^k}, \qquad x_k \in \{0, 1\} \tag{1}$$

can be reached by the moving point. The numbers of this form with only finitely many nonzero terms are the dyadic rationals within the interval, and these are the coordinates the moving point reaches after finitely many moves. The dyadic rationals form a dense subset of the interval, and so of each side of the square. In plain language, the square fills in because every point has, within any positive distance of it, a point that the moving point can reach. For the square to fill in, however, the choice of fixed point to move toward must be made close to uniformly at random. If the choices of fixed point are not made at random then the square need not fill in. The holes left by the non-randomness form a visualization of the deviation from uniform randomness of the information driving the chaos game. Examples of chaos games with four fixed points are shown in Figure 2.

The fractal generated by the HIV genome displays the lack of methylation sites in the HIV genome as the large blank space in the upper right quadrant together with its shadows in the other quadrants. This blank space results from avoiding the three DNA letters in sequence that correspond to a methylation site.

The four-cornered chaos game permits us to see missing sequences of particular sorts in DNA. Typically these are compact patterns, with no more than log_2(L) DNA bases in them, where L is the side length of the square in pixels. It is also difficult to see patterns whose length in DNA bases is close to log_2(L), as they involve only a few pixels. The key to understanding these assertions is the updating rule “move halfway to the selected point”. This updating of the moving point has the effect of dividing the size of features currently encoded by the moving point in half and then adding in the new feature. The averaging is an unweighted average of the positions of the moving point and the selected fixed point. If we want to think of it this way, we could also deem it a weighted average in which the weight of each point happens to be one-half. This one-half weight is, itself, a parameter that we can play with. In Figure 3 the boundedly random fractal is shown in its original form and with the weighting shifted plus-or-minus one tenth toward the fixed or moving points.

In general, the sparser in information a given stretch of DNA is, the sparser the standard four-cornered chaos game fractal derived from it is. If most six letter words over the DNA alphabet do not appear then most locations in a 2^6 × 2^6 four-cornered chaos game will be blank. Playing with the averaging weight permits us to choose the degree of sparsity of the resulting picture, but at the cost of the unique correspondence of pixels within the chaos game to patterns in the DNA.
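A minimal sketch of a DNA driven four-cornered chaos game on a 2^6 × 2^6 pixel grid follows. Several details are assumptions made for illustration: the corner layout (C and G on top, A and T below, read off the diagram in Figure 2), starting the moving point at the center of the square, and the names `CORNERS` and `dna_chaos_game`. The `weight` argument is the averaging weight discussed above.

```python
# Assumed corner layout, following the diagram in Figure 2: C and G on top, A and T below.
CORNERS = {"C": (0.0, 1.0), "G": (1.0, 1.0), "A": (0.0, 0.0), "T": (1.0, 0.0)}

def dna_chaos_game(sequence, size=64, weight=0.5):
    """Return a size x size grid of hit counts for a DNA driven chaos game."""
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5                                # assumed start: center of the unit square
    for base in sequence:
        cx, cy = CORNERS[base]
        x = (1 - weight) * x + weight * cx         # weight = 0.5 is the standard rule
        y = (1 - weight) * y + weight * cy
        col = min(int(x * size), size - 1)         # clamp the edge case x == 1.0
        row = min(int((1 - y) * size), size - 1)   # row 0 is the top of the picture
        grid[row][col] += 1
    return grid

grid = dna_chaos_game("ACGT" * 500 + "CCCGGGAAATTT" * 100)
occupied = sum(1 for row in grid for count in row if count > 0)
```

With `weight=0.5` and `size=64`, each pixel corresponds to one six letter word, so `occupied` counts the distinct six letter words seen (apart from the first few moves). Changing `weight` breaks that one-to-one correspondence, as the text notes.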

Another source of sparseness in a chaos game is short data. In the HIV derived chaos game shown in Figure 2 we simply cycled through the HIV genome twice to make sure that all the data that appear were represented as plotted black dots. It would be natural, however, to run the chaos game for a long time and save the number of times a pixel was hit as a measure of how common a pattern is. This added information could then be displayed in gray scale. If we are to do this, however, we must have enough data to provide an honest sampling of events. In this situation a Markov model of the DNA data may be used to smooth the data and extend it. In essence this is a resampling technique in which the dependent (Markov) distribution of short words is computed from the data and then the Markov process is used to generate synthetic data with the correct dependent distribution of short words. As the only thing displayed by the four-cornered chaos game is the presence and absence of these short words, using Markov models of the DNA should only result in cleaner pictures.

[Diagram for Figure 2: the square with corners labeled C, G, A, and T, and nested subsquares labeled with DNA prefixes such as CC, CG, CT, and CAG.]

Figure 2: Driving the four-cornered chaos game with boundedly random data and with an example of the HIV-1 genome. The diagram at the right shows how DNA prefixes are associated with portions of the square. The moving point is plotted in black. The boundedly random data permits the moving point to move toward the current corner or the next two corners clockwise with equal probability, excluding the remaining corner. The HIV-1 genome used has GenBank Accession NC_001802.

Figure 3: Driving the four-cornered chaos game with boundedly random data for three settings of the averaging weight. The boundedly random data permits the moving point to move toward the current corner or the next two corners clockwise with equal probability, excluding the remaining corner. From left to right the averaging parameter gives 60% weight to the fixed point, weighs the fixed and moving point equally, and gives 60% weight to the moving point.

To construct a Markov model of the data, we fix a window size, n. For each contiguous substring of the data of length n, other than the last, we tabulate how often that window is followed by each of the four bases. This tabulation is then normalized to yield an empirical probability distribution on the following base for each n-character pattern. When generating synthetic data, a current word of length n is saved. The next base after this current word is generated with reference to the Markov model, and then the current word has its oldest (first) character removed and the newly generated character appended to it. The stream of characters generated in this fashion is the synthetic data. To initialize the current word we start with a standard word, e.g., all A, and then generate a few thousand bases of synthetic data. This has the effect of placing the Markov process into the set of words that appear in the data from which the Markov model was computed. This process is called burning in the Markov model. Our fractal algorithms in subsequent sections will also require a similar type of burn in. To enable burn in we set the distribution of next bases for n-character patterns that do not appear in the data to the uniform distribution.
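The tabulation and synthesis steps can be sketched as follows. This is one plausible reading of the description, not the authors' code: the function names, the burn-in length of 2000, and the edge handling in the tabulation are assumptions, so exact probabilities may differ slightly from tables produced by other implementations.

```python
import random
from collections import defaultdict

BASES = "CGAT"

def build_markov_model(data, n):
    """Tabulate, for each length-n window, how often each base follows it."""
    counts = defaultdict(lambda: {b: 0 for b in BASES})
    for i in range(len(data) - n):             # the last window has no following base
        counts[data[i:i + n]][data[i + n]] += 1
    model = {}
    for word, follow in counts.items():
        total = sum(follow.values())
        model[word] = {b: follow[b] / total for b in BASES}
    return model

def synthesize(model, n, length, rng, burn_in=2000):
    """Generate synthetic bases; unseen windows use the uniform distribution."""
    uniform = {b: 0.25 for b in BASES}
    word = "A" * n                             # standard starting word: all A
    output = []
    for step in range(burn_in + length):
        dist = model.get(word, uniform)
        base = rng.choices(list(BASES), weights=[dist[b] for b in BASES])[0]
        if step >= burn_in:                    # discard the burn-in bases
            output.append(base)
        word = word[1:] + base                 # drop oldest character, append newest
    return "".join(output)

data = "CCCGGGAAATTT" * 3
model = build_markov_model(data, 3)
synthetic = synthesize(model, 3, 36, random.Random(0))
```

For this periodic input every window of length 3 has a single possible successor, so `synthetic` is a phase-shifted copy of the input.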

In Figure 4 we illustrate the effect of window size on the Markov process for making synthetic data. Starting with regular, periodic data we generate Markov processes with window sizes 2 and 3. The size three windows reproduce a phase-shifted version of the input data. The size two windows produce synthetic data that has the right general sequence of bases but fails to reproduce the run lengths in the original data. When using Markov processes to synthesize additional data we will use a window size of 6 to achieve relatively high data fidelity. Figure 5 gives some notion of how the Markov chain looks from the inside. Notice that for the patterns CC and TT the probabilities are slightly different from those for the patterns GG and AA. This is an edge effect, resulting from the original data starting CCC and ending TTT.

Input    : CCCGGGAAATTTCCCGGGAAATTTCCCGGGAAATTT
Markov-2 : GGAATTCCGGAATTTTTTCCCCGGAATTTTTTTTTC
Markov-3 : AATTTCCCGGGAAATTTCCCGGGAAATTTCCCGGGA

Figure 4: A sequence and synthetic data similar to that sequence generated by Markov chains with window sizes 2 and 3.

Conditional probabilities

Exists  Word  P(C)  P(G)  P(A)  P(T)
1       CC    0.4   0.6   0     0
1       GC    0     1     0     0
0       AC    0.25  0.25  0.25  0.25
0       TC    0.25  0.25  0.25  0.25
0       CG    0.25  0.25  0.25  0.25
1       GG    0     0.5   0.5   0
1       AG    0     0     1     0
0       TG    0.25  0.25  0.25  0.25
0       CA    0.25  0.25  0.25  0.25
0       GA    0.25  0.25  0.25  0.25
1       AA    0     0     0.5   0.5
1       TA    0     0     0     1
1       CT    1     0     0     0
0       GT    0.25  0.25  0.25  0.25
0       AT    0.25  0.25  0.25  0.25
1       TT    0.4   0     0     0.6

Figure 5: A tabulation of the window size two Markov chain from Figure 4. Notice that patterns that do not exist in the input data have uniform distributions of next letters.

With the Markov chain technology in place we can now produce four-cornered chaos games with relatively high fidelity from even relatively sparse sequence data. In Figure 6 we show gray scale four-cornered chaos games for three types of DNA. One is the HIV sequence from Figure 2 for comparison.

The gray scale chaos games clearly distinguish the three types of sequence used to generate them and, as we can see by comparing with the black-and-white example for HIV, contain more information about the distribution of small DNA subwords. Many of the disadvantages of the black-and-white technology remain in the gray scale pictures. Features depend on words of at most a fixed length, with feature size in the picture depending inverse-exponentially upon word length. With the Markov chain synthesis of data with the right word distribution we could zoom to some degree and make bigger pictures, but this in turn would require Markov chains with window sizes sufficient to still be producing the pictures at the scale of the zoom. This notion rapidly undergoes death by exponential growth.

While we can tell that the pictures shown in Figure 6 are different from one another, a good deal of training is needed to tell what the differences mean. In addition, these differences are essentially visual summaries of the small-word statistics of the DNA. Such a summary could be used more efficiently for most tasks by simply feeding it into a statistics package. To address these limitations we will introduce evolvable fractals that can be tailored to emphasize particular sequence features of interest rather than necessarily displaying small word statistics.

Our generalizations should not be taken as a statement that the chaos game visualizations are useless. With training the four-cornered chaos game does convey information about the sequence and is useful for quick visual homology checks. A long run of repetitive sequence with a small anomalous insert will display that insert very clearly in a chaos game plot. It is also true that we do not examine techniques of implicit annotation of the chaos games with colors or schemes for using more than four points in more than two dimensions, so as to build chaos games around richer feature sets.

3 Iterated Function Systems

[Figure 6 panels: HIV, Methanococcus, Helicobacter.]

Figure 6: Four-cornered chaos games for HIV-1 (NC_001802), Methanococcus jannaschii small extra-chromosomal element (NC_001733.1), and Helicobacter pylori, strain J99 complete genome (AE001439). For each organism the DNA was used to build a Markov model with window size 6. Synthetic data was then used to produce hit counts in each pixel with the relative number of hit counts displayed by 256-shade gray scale, white = no hits, black = maximum hits.

Chaos games are a particular type of iterated function system [Barnsley, 1993]. In an iterated function system a number of maps from a metric space to itself are chosen. In this case the metric space is the real plane, R^2, with the standard Euclidean metric. In exact analogy with the fixed points of the chaos game, these maps are then called in a random order according to some distribution, to move a point. The orbit of this point in the metric space is called the attractor of the iterated function system. The orbit is the fractal, save for other details like color that we may choose to add. In [Barnsley, 1993] a number of theorems about iterated function systems are established; in particular, if we wish to obtain a bounded orbit (finite fractal) we cannot use just any functions from the plane to itself. A function from a metric space to itself is called a contraction map if, for any pair of points, mutual distance decreases with the application of the map. Formally, f : R^2 → R^2 is a contraction map if

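$$d(p, q) \ge d(f(p), f(q)) \quad \forall\, p, q \in \mathbb{R}^2. \tag{2}$$

Strictly, the boundedness and fixed point theorems used in this chapter need a uniform contraction factor: there must be a constant 0 ≤ c < 1 with d(f(p), f(q)) ≤ c · d(p, q) for all p and q. The similitudes used in this chapter satisfy this with c equal to their scaling factor.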

One of the theorems proved is that an iterated function system made entirely of contraction maps has a bounded orbit for its moving point and hence a finite fractal attractor.

A rich class of maps that are guaranteed to be contraction maps are similitudes. A similitude is a map that performs a rigid rotation of the plane, displaces the plane by a fixed amount, and then contracts the plane toward the origin by a fixed scaling factor. The derivation of a new point (x_new, y_new) from an old point (x, y) with a similitude that uses rotation t, displacement (Δx, Δy), and scaling factor 0 < s < 1 is given by:

$$x_{new} = s \cdot (x \cos t - y \sin t + \Delta x) \tag{3}$$

$$y_{new} = s \cdot (x \sin t + y \cos t + \Delta y) \tag{4}$$

To see that a similitude must always reduce the distance between two points, note that rotation and displacement are isometries: they do not change distances between points. This means any change is due to the scaling factor, which necessarily causes a reduction in the distance between pairs of points.
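As a quick numerical check of this argument, the sketch below implements Equations 3 and 4 and confirms that the distance between two points shrinks by exactly the factor s. The helper name `similitude` and the test points are illustrative; the parameter values are those of the first similitude listed in Figure 9.

```python
import math

def similitude(t, dx, dy, s):
    """Return the map of Equations 3 and 4: rotate by t, displace, then scale by s."""
    def apply(x, y):
        x_new = s * (x * math.cos(t) - y * math.sin(t) + dx)
        y_new = s * (x * math.sin(t) + y * math.cos(t) + dy)
        return x_new, y_new
    return apply

f = similitude(t=1.15931, dx=0.89346, dy=0.802276, s=0.166377)

# The ratio of distances after and before the map should equal s up to rounding.
p, q = (0.3, -1.2), (2.0, 0.7)
ratio = math.dist(f(*p), f(*q)) / math.dist(p, q)

# Iterating the map from any start converges to its unique fixed point.
z = (0.0, 0.0)
for _ in range(100):
    z = f(*z)
```

After the loop, `z` is numerically unchanged by `f`, which is the fixed point that plays the role of a corner in the chaos game.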

At this point we know that a collection of similitudes, applied to a moving point in the same fashion that we averaged towards fixed points in the chaos game, will select a bounded fractal subset of the plane. Examples of such similitude-based iterated function systems are shown in Figure 7. If we were to drive the selection of a fixed set of similitudes with data then we would have a visualization of that data. The process of selecting how many similitudes to use, which similitudes to use, and how to connect data items with similitudes gives us an exceedingly rich space of data driven fractal generation algorithms. We now require the tools for picking out good algorithms and visualizations.

3.1 Evolvable Fractals I

Our goal is to use a data driven fractal, generalizing the four-cornered chaos game, to provide a visual representation of sequence data. It would be nice if this fractal representation could work smoothly with DNA, protein, and codon data. These sequences, while derived from one another, have varying amounts of information and appear at different levels of the biological processes that run a cell. The raw DNA data contains the most information and the least degree of interpretation. The segregation of the DNA data into codon triples has more interpretation (and requires us to work on coding, as opposed to intronic, untranslated, or junk DNA). The choice of DNA triplet used to code for a given amino acid can be exploited, for example, to vary the thermal stability of the DNA (more G and C bases yield a higher melting temperature) and so the codon data contains information that disappears when the codons are translated into amino acids. The amino acid sequence contains information focused on the mission, within the cell, of the protein. This sequence specifies the protein’s fold and function without the codon usage information muddying the waters. Given all of this it is natural to tie sequence data to choice of next contraction map to apply by segregating the data into the 64 possible DNA triples.

Figure 7: Examples of fractals obtained by using a set of eight similitudes selected uniformly at random. These are samples of output of fractal visualizers in the space of fractal algorithms we will be searching with an evolutionary algorithm.

The data structure or chromosome we use to hold the evolvable fractal has two parts: a list of similitudes and an index of DNA triples into that list of similitudes. This permits smooth use of the fractal generating algorithm on DNA, DNA triplets, or amino acids. Amino acids require modifying the way the data are interpreted by the indexing function, but simply applying the many-to-one map of DNA triples to amino acids and stop codons would allow us to regroup the index function. A diagram of the data structure is given in Figure 8. An evolved instance of this structure is shown in Figure 9. Each similitude is defined by four real parameters in the manner described in Equations 3 and 4. The index list is simply a sequence of 64 integers that specify, for each of the 64 possible DNA codon triplets, which similitude to apply when that triplet is encountered.

Interpretation        Contains
First similitude      t_1   (Δx_1, Δy_1)   s_1
Second similitude     t_2   (Δx_2, Δy_2)   s_2
· · ·
Last similitude       t_n   (Δx_n, Δy_n)   s_n
Triplet index list    N_CCC, N_CCG, ..., N_TTT

Figure 8: The data structure that serves as the chromosome for an evolvable DNA driven fractal. In this work we use n = 8 similitudes, and so 0 ≤ N_xyz ≤ 7, where xyz is a DNA triplet.

In order to derive a fractal from DNA in our first set of experiments, the DNA is first Markov modeled with window size six. The Markov model is then used to create DNA, in triples. These triplets are then used, via the index portion of the fractal chromosome, to choose a similitude to apply to the moving point. This permits evolution both to choose the shape of the maximal fractal (the one we would see if we drove the process with data chosen uniformly at random) and also to choose which DNA codon triplets are associated with the use of each similitude. It is a theorem that any contraction map has a unique fixed point. The fixed points of the eight similitudes in each fractal chromosome play the same role that the four corners of the square did in the chaos game.
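One way to hold this chromosome and drive it with DNA is sketched below. The class and function names are hypothetical. The triplet ordering (CCC, CCG, ..., TTT, generated over the alphabet C, G, A, T) is an assumption chosen to agree with the first two and last entries shown in Figure 8, and the random initialization uses the parameter ranges given in Section 3.2.

```python
import itertools
import math
import random
from dataclasses import dataclass

BASES = "CGAT"
# Assumed triplet order: CCC, CCG, CCA, CCT, ..., TTT.
TRIPLETS = ["".join(t) for t in itertools.product(BASES, repeat=3)]

@dataclass
class IndexedIFS:
    similitudes: list   # n tuples (t, dx, dy, s)
    index: dict         # maps each of the 64 triplets to a position in `similitudes`

    def run(self, dna, start=(0.0, 0.0)):
        """Apply one similitude per complete DNA triplet and record the moving point."""
        x, y = start
        points = []
        for i in range(0, len(dna) - 2, 3):
            t, dx, dy, s = self.similitudes[self.index[dna[i:i + 3]]]
            x, y = (s * (x * math.cos(t) - y * math.sin(t) + dx),
                    s * (x * math.sin(t) + y * math.cos(t) + dy))
            points.append((x, y))
        return points

def random_ifs(rng, n=8):
    """Random chromosome: t in [0, 2*pi], displacements in [-1, 1], s averaged from two draws."""
    sims = [(rng.uniform(0, 2 * math.pi), rng.uniform(-1, 1), rng.uniform(-1, 1),
             (rng.random() + rng.random()) / 2) for _ in range(n)]
    return IndexedIFS(sims, {trip: rng.randrange(n) for trip in TRIPLETS})

ifs = random_ifs(random.Random(3))
points = ifs.run("ACGTTGCAA" * 10)
```

Regrouping the index by amino acid, as the text suggests, would only require replacing the keys of `index` with the result of a codon-to-amino-acid lookup.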

Similitudes:
Rotate: 1.15931 radians. Displace: (0.89346, 0.802276). Shrink: 0.166377.
Rotate: 5.3227 radians. Displace: (0.491951, 0.453776). Shrink: 0.596091.
Rotate: 5.73348 radians. Displace: (0.985895, 0.123306). Shrink: 0.33284.
Rotate: 1.2409 radians. Displace: (0.961873, 0.63981). Shrink: 0.375387.
Rotate: 5.91113 radians. Displace: (0.41471, 0.15982). Shrink: 0.571791.
Rotate: 0.538828 radians. Displace: (0.660229, -0.137627). Shrink: 0.452674.
Rotate: 5.87837 radians. Displace: (0.993039, 0.345593). Shrink: 0.330442.
Rotate: 5.09863 radians. Displace: (0.987267, 0.338542). Shrink: 0.408452.
Index:
2754772412552151226422122167225477554711166152612751742621555666

Figure 9: An evolved DNA driven fractal, achieving best fitness in run 1 of the first evolutionary algorithm.

If we are to apply an evolutionary algorithm to the chromosome given in Figure 8 then we must specify variation (crossover and mutation) operators. We employ a single two-chromosome variation operator (crossover operator) that performs one point crossover on the list of eight similitudes and two point crossover on the list of 64 indices, treating the lists independently. We have two single chromosome variation operators (mutations) available. The first, termed a similitude mutation, modifies a similitude selected uniformly at random. It does this by picking one of the four parameters that define the similitude, uniformly at random, and adding a number selected uniformly in the interval [-0.1, 0.1] to that parameter. The scaling parameter is kept in the range [0, 1] by reflecting the value at the boundaries so that numbers s in excess of 1 are replaced by 2 − s and values s below zero are replaced by −s. Our second mutation operator, called an index mutation, acts on the index list by picking the index associated with a DNA triple chosen uniformly at random and replacing it with an index selected uniformly at random.
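These operators might be sketched as follows, with a chromosome represented as a pair (list of similitudes, list of 64 indices). The representation, the function names, and the way cut points are drawn are assumptions made for illustration; only the operator behavior follows the description above.

```python
import random

def reflect_unit(s):
    """Reflect a scaling factor back into [0, 1]."""
    if s > 1:
        return 2 - s
    if s < 0:
        return -s
    return s

def similitude_mutation(sims, rng):
    """Add a uniform step in [-0.1, 0.1] to one parameter of one similitude."""
    sims = [list(sim) for sim in sims]             # each sim is [t, dx, dy, s]
    which, param = rng.randrange(len(sims)), rng.randrange(4)
    sims[which][param] += rng.uniform(-0.1, 0.1)
    sims[which][3] = reflect_unit(sims[which][3])  # only the scale is range-limited
    return sims

def index_mutation(index, n_sims, rng):
    """Replace the entry for one randomly chosen triplet with a random similitude number."""
    index = list(index)
    index[rng.randrange(len(index))] = rng.randrange(n_sims)
    return index

def crossover(parent_a, parent_b, rng):
    """One point crossover on similitudes, two point crossover on the index list."""
    (sims_a, idx_a), (sims_b, idx_b) = parent_a, parent_b
    cut = rng.randrange(1, len(sims_a))
    lo, hi = sorted(rng.sample(range(len(idx_a) + 1), 2))
    child_a = (sims_a[:cut] + sims_b[cut:], idx_a[:lo] + idx_b[lo:hi] + idx_a[hi:])
    child_b = (sims_b[:cut] + sims_a[cut:], idx_b[:lo] + idx_a[lo:hi] + idx_b[hi:])
    return child_a, child_b

rng = random.Random(0)
parent_a = ([[0.0, 0.0, 0.0, 0.95] for _ in range(8)], [0] * 64)
parent_b = ([[1.0, 1.0, 1.0, 0.50] for _ in range(8)], [1] * 64)
child_a, child_b = crossover(parent_a, parent_b, rng)
mutated_sims = similitude_mutation(child_a[0], rng)
mutated_index = index_mutation(child_b[1], 8, rng)
```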

3.2 Designing the evolutionary algorithm

The most difficult task, after selecting a data structure for the fractal chromosome, was selecting a fitness function. The moving point is plotted a large number of times and we want the fractal in question to look “different” for different types of data. In the experiments reported here we seek to separate two sorts of data, Markov models of an example of the HIV-1 genome and of the Methanococcus jannaschii small extra-chromosomal element described in Figure 6. For this relatively simple task we chose to track the average position of the moving point when it was being driven by the two types of data and make the fitness function the distance between the two average positions. In a mild abuse of notation, we will call the mean position of the moving point when the fractal is being driven by data of a particular type (μ_x^type, μ_y^type). Using this notation, the fitness function for a given evolvable fractal is:

$$\sqrt{(\mu_x^{HIV1} - \mu_x^{MJ})^2 + (\mu_y^{HIV1} - \mu_y^{MJ})^2}, \tag{5}$$

where the fractal is alternately presented with both types of data in runs of uniformly distributed length 200-500.

We report here on one set of simulations. There was an initial “burn in” period of 1000 iterations in which the moving point was subjected to fully random data. During this period the mean position of the moving point and the maximum distance it achieved from that mean were estimated to permit normalization of the fractal image. By “fully random data” we mean that each of the eight similitudes was selected uniformly at random during the burn in period. Subsequent to burn in, the moving point was acted upon by the similitudes driven by Markov-modeled DNA data as described previously for about 200,000 iterations, with each type of data being used in runs of 200-500 bases. The length of these runs was selected uniformly at random. The fitness evaluation ceased when the point had just finished a run of one type of data and the total was also over 200,000 iterations.
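The fitness evaluation of Equation 5 can be sketched as below. This is a simplified reading: burn in and image normalization are omitted, and the interface (`apply_map`, `sources`, and the toy driver at the end) is hypothetical. The toy driver uses a four-cornered chaos game in place of an evolved fractal so the example is self-contained.

```python
import math
import random

def separation_fitness(apply_map, sources, rng, total=200_000, run_range=(200, 500)):
    """Distance between the mean positions of the moving point under two data sources.

    apply_map(x, y, symbol) returns the new point; sources maps two labels to
    zero-argument functions that return the next data symbol.
    """
    labels = list(sources)                       # exactly two labels are assumed
    sums = {label: [0.0, 0.0, 0] for label in labels}
    x, y = 0.0, 0.0
    done, turn = 0, 0
    # Stop only at the end of a run, after both sources have been used.
    while done < total or turn < 2:
        label = labels[turn % 2]
        acc = sums[label]
        for _ in range(rng.randint(*run_range)): # run length uniform in 200-500
            x, y = apply_map(x, y, sources[label]())
            acc[0] += x
            acc[1] += y
            acc[2] += 1
            done += 1
        turn += 1
    (ax, ay, an), (bx, by, bn) = sums[labels[0]], sums[labels[1]]
    return math.hypot(ax / an - bx / bn, ay / an - by / bn)

# Toy driver: move halfway to a corner; one source uses left corners, the other right.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
data_rng = random.Random(0)
toy_sources = {"left": lambda: data_rng.choice([0, 1]),
               "right": lambda: data_rng.choice([2, 3])}
toy_map = lambda x, y, k: ((x + corners[k][0]) / 2, (y + corners[k][1]) / 2)
fit = separation_fitness(toy_map, toy_sources, random.Random(1), total=5000)
```

In the toy case the two mean positions sit near the left and right edges of the square, so `fit` comes out close to 1.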

Individual fractals were initialized with similitude parameters chosen as follows: 0 ≤ t ≤ 2π, −1 ≤ Δx, Δy ≤ 1, and s is the average of two uniformly distributed random variables on the interval [0, 1]. This latter averaging prevented absurd initial values of the shrink parameter from being too common. The evolutionary algorithm used is steady state [Syswerda, 1991]. In each mating event, four individuals are chosen uniformly at random, without replacement, from the population. The two most fit are then crossed over and the resulting new fractals replace the two least fit. A number of preliminary simulations, reported in [Ashlock and Golden, 2000], were performed to find workable settings for population size, number of mating events, and frequency of mutation. A relatively low mutation rate seemed to generate a steady increase in fitness in preliminary simulations, and so one of the new fractals is subjected to a similitude mutation and the other to an index mutation. Each simulation was run for 10,000 mating events, well past the mean point at which fitness plateaued in the preliminary studies. We performed 30 independent simulations.
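The mating event described here can be sketched as a generic loop. The function signature is an assumption, and the toy problem at the end (maximizing the number of ones in a bit string) merely stands in for the fractal chromosome so that the sketch runs on its own.

```python
import random

def steady_state_ea(population, fitness, crossover, mutations, rng, mating_events=10_000):
    """Steady state loop: pick four, cross the best two, replace the worst two."""
    scores = [fitness(ind) for ind in population]
    for _ in range(mating_events):
        chosen = rng.sample(range(len(population)), 4)       # without replacement
        chosen.sort(key=lambda i: scores[i], reverse=True)   # most fit first
        best, second, third, worst = chosen
        children = crossover(population[best], population[second], rng)
        for slot, child, mutate in zip((third, worst), children, mutations):
            child = mutate(child, rng)                       # one mutation per child
            population[slot] = child
            scores[slot] = fitness(child)
    champion = max(range(len(population)), key=lambda i: scores[i])
    return population[champion], scores[champion]

# Toy stand-in problem: maximize the number of ones in a 20-bit string.
def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_bit(ind, rng):
    ind = list(ind)
    ind[rng.randrange(len(ind))] ^= 1
    return ind

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(20)] for _ in range(12)]
start_best = max(sum(ind) for ind in population)
best, score = steady_state_ea(population, sum, one_point, (flip_bit, flip_bit), rng,
                              mating_events=500)
```

For the fractal chromosome, `mutations` would be the pair (similitude mutation, index mutation), so that each child receives one of the two. Because only the two least fit of the four are overwritten, the best score in the population never decreases.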

3.3 Results for iterated function systems

The fractals located during our simulations had little trouble separating the two types of data. For each simulation run we created two images from the most fit member of each population. The first showed the fractal with green plotted for data from the first source (HIV-1) and red plotted for data from the second source (Methanococcus jannaschii). The second was generated by assigning each of the eight basic RGB colors, black, red, green, blue, cyan, magenta, yellow, and white, to one of the eight similitudes used. When a similitude was called the current color used for the moving point was shifted, destroying the least significant of eight bits, and the color associated with the similitude was shifted into the high bits for the three color channels. Examples of these two sorts of plots for four simulation runs are shown in Figure 10. Looking at the images that track similitude use, one can deduce that simulations 0 and 24 produced fractals that use most of their similitudes in separating the data while simulations 20 and 27 located fractals that rely on three similitudes: those associated with white, blue, and black for simulation 20 and red, magenta, and white for simulation 27. This indicates the presence of multiple optima within the fitness landscape.

[Figure 10 panels: Simulation 0, Simulation 20, Simulation 24, Simulation 27.]

Figure 10: The attractor for the fractal receiving best fitness in simulations 0, 20, 24, and 27 comparing HIV-1 and Methanococcus jannaschii derived data. The left images display green and red for points plotted for HIV-1 and Methanococcus jannaschii data, respectively. The right images show similitude use by associating the eight fundamental RGB colors with the similitudes, averaging the correct color into the current color as similitudes are called.
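The color update used for the similitude-use images can be sketched with bit shifts on 8-bit channels. The assignment of colors to similitude numbers below simply follows the order in which the colors are listed in the text and is an assumption, as are the names `BASIC_COLORS` and `shift_in_color`.

```python
# One bit per channel (R, G, B): black, red, green, blue, cyan, magenta, yellow, white.
BASIC_COLORS = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
                (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]

def shift_in_color(current, basic):
    """Drop each channel's lowest bit and shift the similitude's color bit into the top."""
    return tuple((channel >> 1) | (bit << 7) for channel, bit in zip(current, basic))

color = (0, 0, 0)
for sim in [1, 1, 7]:                  # similitudes colored red, red, then white
    color = shift_in_color(color, BASIC_COLORS[sim])
```

Each call halves the contribution of older similitudes, so the resulting color is a fading record of the last eight similitudes applied; here `color` ends as (224, 128, 128).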

The fitness function, intuitively, should select for fractals where the means of the red and green dots are well separated. The typical result, as one can see in the examples provided, was to have redder and greener regions with substantial interdigitation. The average red and green positions are several orders of magnitude farther apart in the best-of-run fractals than they were in the initial random populations. Nevertheless, we did not achieve clean separation into distinct regions for biological data. This is partly because there is substantial similarity in some parts of the Markov models of the two types of DNA. In addition, the self-similar nature of the iterated function system fractals makes parts of the fractals copies of the whole. This does not prevent different parts of the fractal being associated with different similitudes and hence different data types, but it does mean that almost all the fractal algorithms mix the red and green points. Part of our mission in the next section will be to experiment with different fitness functions. In addition, we will modify the fractal chromosome to have a form of memory. This will both reduce the interaction of self-similarity with reaction to distinct data types and also create the potential for detection of long range correlations in the visualization of DNA data.

4 Chaos automata, adding memory

This section introduces a new way of coding iterated function system fractals called chaos automata. Chaos automata differ from standard iterated function systems in that they retain internal state information and so gain the ability to visually associate events that are not proximate in their generation histories. This internal memory also grants fractals generated by chaos automata a partial exemption from self similarity in the fractals they specify. When driven by multiple types of input data the device can “remember” what type of data it is processing and so plot distinct types of shapes for distinct data. Two more-or-less similar sequences separated by a unique marker could, for example, produce very different chaos-automata based fractals by having the finite state transitions recognize the marker and then use different contraction maps on the remaining data. Comparison with the standard iterated function system fractals already presented motivates the need for this innovation in the representation of data driven fractals. The problem we seek to address by incorporating state information into our evolvable fractals is that data items are “forgotten” as their influence vanishes into the contractions of space associated with each function. An example of a chaos automata, evolved to be driven with DNA data, is shown in Figure 11.

Starting State: 6

        Transitions    Similitudes
State   C  G  A  T     Rotate   Displace            Shrink
----------------------------------------------------------
0)      2  5  6  0  :  1.518    ( 0.768,  0.937)    0.523
1)      5  7  1  2  :  6.018    (-0.822,  1.459)    0.873
2)      2  1  2  2  :  0.004    ( 0.759,  0.880)    0.989
3)      4  4  4  7  :  4.149    (-0.693, -0.903)    0.880
4)      3  3  3  3  :  1.399    ( 0.693, -0.724)    0.758
5)      2  5  7  1  :  6.104    (-0.951, -0.077)    0.852
6)      1  0  2  1  :  2.195    ( 0.703,  0.864)    0.572
7)      1  4  5  2  :  1.278    ( 0.249,  1.447)    0.715

Figure 11: A chaos automata evolved to visually separate two classes of DNA. The automata starts in state 6 and makes state transitions depending on inputs from the alphabet C, G, A, T. As the automata enters a given state it applies the similitude defined by a rotation (R), displacement (D), and shrinkage (S).

We can now give a formal definition of chaos automata. A chaos automata is a 5-tuple (S, C, A, t, i) where S is a set of states, C is a set of contraction maps from R^2 to itself, A is an input alphabet, t : S × A → S × C is a transition function that maps a state and input to a next state and contraction map, and i ∈ S is a distinguished initial state. To generate a fractal from a stream of data with a chaos automata we use the algorithm given in Figure 12. Readers familiar with finite state machines will note we have made the somewhat arbitrary choice of associating our contraction maps with states rather than transitions. We thus are using “Moore” chaos automata rather than “Mealy” chaos automata.

4.1 The Data Structure and Its Variation Operators

In order to use an evolutionary algorithm to evolve chaos automata that visually separate classes of data, they must be implemented as a data structure, and variation operators must then be designed for that data structure. The data structure used contains an integer that designates the initial state and a vector of nodes. Each node holds the information defining one state of the chaos automaton: four integers, giving the next state to use for inputs of C, G, A, and T, respectively, and a similitude to apply when the automaton makes a transition to the state. The similitude is defined exactly as in our previous fractal chromosome.

Set state to initial state i.
Set (x,y) to (0,0).
Repeat
    Apply the current similitude to (x,y).
    Use(x,y).
    Update the state with the transition rule.
Until (out of input).

Figure 12: Basic algorithm for generating fractals with a chaos automaton. The action Use(x,y) can consist of ignoring the values (during burn-in), using them to estimate the center and radius of the fractal being generated, or plotting a point in the normalized fractal.
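The data structure and the algorithm of Figure 12 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the names `Node`, `ChaosAutomaton`, `apply_similitude`, and `run` are hypothetical, the rotation is assumed to be in radians, and the similitude is assumed to rotate, then shrink, then displace. The table values are those of Figure 11.

```python
import math
from dataclasses import dataclass

BASES = "CGAT"  # column order of the transition table in Figure 11


@dataclass(frozen=True)
class Node:
    next_state: tuple  # next state for inputs C, G, A, T
    rotate: float      # rotation angle, assumed to be in radians
    displace: tuple    # displacement (dx, dy)
    shrink: float      # contraction factor


@dataclass(frozen=True)
class ChaosAutomaton:
    initial: int
    nodes: tuple


def apply_similitude(node, x, y):
    # Assumed order of operations: rotate, shrink, then displace.
    c, s = math.cos(node.rotate), math.sin(node.rotate)
    rx, ry = c * x - s * y, s * x + c * y
    return (node.shrink * rx + node.displace[0],
            node.shrink * ry + node.displace[1])


def run(automaton, sequence):
    """Follow Figure 12, returning one (x, y, state) triple per input base."""
    state, x, y = automaton.initial, 0.0, 0.0
    trace = []
    for base in sequence:
        x, y = apply_similitude(automaton.nodes[state], x, y)
        trace.append((x, y, state))  # stands in for Use(x, y)
        state = automaton.nodes[state].next_state[BASES.index(base)]
    return trace


# The automaton of Figure 11.
FIGURE_11 = ChaosAutomaton(initial=6, nodes=(
    Node((2, 5, 6, 0), 1.518, (0.768, 0.937), 0.523),
    Node((5, 7, 1, 2), 6.018, (-0.822, 1.459), 0.873),
    Node((2, 1, 2, 2), 0.004, (0.759, 0.880), 0.989),
    Node((4, 4, 4, 7), 4.149, (-0.693, -0.903), 0.880),
    Node((3, 3, 3, 3), 1.399, (0.693, -0.724), 0.758),
    Node((2, 5, 7, 1), 6.104, (-0.951, -0.077), 0.852),
    Node((1, 0, 2, 1), 2.195, (0.703, 0.864), 0.572),
    Node((1, 4, 5, 2), 1.278, (0.249, 1.447), 0.715),
))

trace = run(FIGURE_11, "CGAT")
print([state for _, _, state in trace])  # states whose similitudes were applied
```

Driving the automaton with the four bases C, G, A, T visits states 6, 1, 7, and 5 in turn, which can be checked by hand against the transition columns of Figure 11.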

We now give the variation operators. A two-point crossover operator is used. This crossover operator treats the vector of nodes as a linear chromosome. Each node is an indivisible object to the crossover operator, so the crossover is essentially the classical two-point crossover relative to these indivisible objects. The integer that identifies the initial state is attached to the first node in the chromosome and moves with it during crossover. There are three kinds of things that could be changed by a mutation operator. Primitive mutation operators are defined for each of these and then used in turn to define a master mutation operator that calls the primitive mutations with a fixed probability schedule. The first primitive mutation acts on the initial state, picking a new initial state uniformly at random. The second primitive mutation acts on a transition to a next state and selects a new next state uniformly at random. The third primitive mutation is similitude mutation, which is again exactly the one used in mutating our last fractal chromosome. The master mutation mutates the initial state 10% of the time, a transition 50% of the time, and a similitude 40% of the time. These percentages were chosen in an ad hoc manner.
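The variation operators described above might be sketched as follows. The genome layout (a pair of initial state and node list, each node a pair of next-state tuple and similitude tuple) and all function names are assumptions made for illustration. In particular, `jitter_similitude` is only a placeholder: the actual similitude mutation is the one defined for the earlier fractal chromosome and is not reproduced here.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Genomes are (initial_state, nodes) pairs with equally long node lists."""
    (init_a, nodes_a), (init_b, nodes_b) = parent_a, parent_b
    i, j = sorted(rng.sample(range(len(nodes_a) + 1), 2))
    child_a = nodes_a[:i] + nodes_b[i:j] + nodes_a[j:]
    child_b = nodes_b[:i] + nodes_a[i:j] + nodes_b[j:]
    if i == 0:
        # The first node was exchanged, so the initial state moves with it.
        init_a, init_b = init_b, init_a
    return (init_a, child_a), (init_b, child_b)


def jitter_similitude(sim, rng, scale=0.1):
    # Hypothetical placeholder, not the similitude mutation of the paper.
    rotate, dx, dy, shrink = sim
    return (rotate + rng.gauss(0, scale), dx + rng.gauss(0, scale),
            dy + rng.gauss(0, scale), shrink)


def master_mutation(genome, rng, mutate_similitude=jitter_similitude):
    """Apply one primitive mutation using the 10% / 50% / 40% schedule."""
    initial, nodes = genome
    nodes = list(nodes)              # copy so the parent is left untouched
    n = len(nodes)
    roll = rng.random()
    if roll < 0.10:                  # new initial state
        initial = rng.randrange(n)
    elif roll < 0.60:                # new target for one transition
        k = rng.randrange(n)
        next_states, sim = nodes[k]
        next_states = list(next_states)
        next_states[rng.randrange(4)] = rng.randrange(n)
        nodes[k] = (tuple(next_states), sim)
    else:                            # similitude mutation
        k = rng.randrange(n)
        next_states, sim = nodes[k]
        nodes[k] = (next_states, mutate_similitude(sim, rng))
    return initial, nodes
```

Keeping nodes as immutable tuples means crossover can share them between parents and children without a later mutation of a child altering its parents.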

4.2 Evolution and Fitness

As we did with our first fractal algorithm chromosome, we use an evolutionary algorithm to locate chaos automata that have the property of visually separating data in the fractal generated by the chaos automaton. In an initial set of experiments with chaos automata we tried several fitness functions. If the only goal were to computationally separate two or more classes of data, then a simple assessment of the fraction of times the evolving structure separated the data classes of interest would be enough (and far simpler techniques would solve the problem). To visually separate data, a more sophisticated fitness function is required. As with the first fractal chromosome, chaos automata were initially evolved to separate two types of data by maximizing the normalized separation between the mean positions of the moving point on each sort of data. The evolutionary algorithm for both the first chromosome and for chaos automata rapidly achieved relatively high fitness values that climbed steadily with evolutionary time. Examples of the results for the first chromosome appear in Figure 10. When the fractals that this fitness function produced for chaos automata were examined, they often consisted of two dots, one for each type of data. This solution separates the classes of data almost perfectly, but provides only the most minimal of visual cues. This experience not only motivated the examination of several new fitness functions, but also suggested that imposing a lower bound on the contraction factor of the similitudes would help. To jump the moving point into a small neighborhood, the chaos automaton exploits similitudes with contraction factors near zero. The good news is that the finite state memory of the chaos automaton is in fact doing its job. A state transition to a distinct part of the automaton serves to enable the point contractions that the first fractal chromosome was unable to achieve using only an index table.

To describe the fitness functions we extend the earlier abuse of notation. The moving point of our chaos games, used to generate fractals from chaos automata driven by data, is referred to as if its coordinates were a pair of random variables. Thus $(X, Y)$ is an ordered pair of random variables that gives the position of the moving point of the chaos game. When working to separate several types of data $\{d_1, d_2, \ldots, d_n\}$, the points described by $(X, Y)$ are partitioned into $\{(X_{d_1}, Y_{d_1}), (X_{d_2}, Y_{d_2}), \ldots, (X_{d_n}, Y_{d_n})\}$, which are the positions of the moving point when a chaos automaton was being driven by data of types $d_1, d_2, \ldots, d_n$ respectively. For any random variable $R$ we use $\mu(R)$ and $\sigma^2(R)$ for the sample mean and variance of $R$.

We burn in chaos automata in much the same fashion as our earlier fractals. The moving point is initialized to lie at the origin of $\mathbb{R}^2$ before burn-in. In the first part of burn-in the chaos automaton being evaluated is driven with data selected uniformly at random from $\{C, G, A, T\}$ for 1000 iterations. This should place the moving point in or near the attractor of the chaos automaton for uniform, random data. In the second part of burn-in, the chaos automaton is driven with uniformly random data for 1000 additional steps, generating 1000 sample values of $(X, Y)$. The values of $\mu(X)$ and $\mu(Y)$ during this phase of burn-in are saved for later use in normalizing both fitness and any images generated. In the third part of burn-in the chaos automaton is driven for an additional 1000 steps. Over those 1000 samples of $(X, Y)$ the maximum Euclidean distance $R_{max}$ of any instance of the moving point from $(\mu(X), \mu(Y))$ is computed. We term $R_{max}$ the radius of the fractal associated with the chaos automaton. A number of fitness functions were evaluated and the best one turned out to be:

$$F_2 = \tan^{-1}\bigl(\sigma(X_{d_1})\,\sigma(Y_{d_1})\,\sigma(X_{d_2})\,\sigma(Y_{d_2})\bigr)\,F_1, \qquad (6)$$

where $F_1 = \sqrt{(\mu_{X_1} - \mu_{X_2})^2 + (\mu_{Y_1} - \mu_{Y_2})^2}$ is our original fitness function, the same as Equation 5 save for naming of variables. This function simply computes the Euclidean separation of mean point position. This new function numerically requires the moving point separate the two classes of data, and then gives a bounded reward for scattering the points plotted for each data type. This bounded reward fluffs out the fractal; when unbounded, such a reward led to fractals with enormous radius and little in the way of features clearly tied to the distinct data types.
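The three-phase burn-in that precedes Equation 6 can be sketched as below. The sketch is written against any object that can be stepped one base at a time. The function names, the `step` callback, and the normalization (subtract the saved center, divide by the radius) are assumptions for illustration. To keep the example self-contained, a simple four-cornered chaos game with a hypothetical corner assignment stands in for a chaos automaton.

```python
import math
import random

BASES = "CGAT"


def burn_in(step, rng, phase_len=1000):
    """Return (center, radius) after the three burn-in phases.

    `step(base)` must advance the fractal by one input and return (x, y).
    """
    for _ in range(phase_len):            # phase 1: settle near the attractor
        step(rng.choice(BASES))
    pts = [step(rng.choice(BASES)) for _ in range(phase_len)]  # phase 2: mean
    center = (sum(x for x, _ in pts) / phase_len,
              sum(y for _, y in pts) / phase_len)
    radius = 0.0
    for _ in range(phase_len):            # phase 3: largest distance, R_max
        x, y = step(rng.choice(BASES))
        radius = max(radius, math.hypot(x - center[0], y - center[1]))
    return center, radius


def normalize(point, center, radius):
    # Assumed normalization: center on the burn-in mean, scale by R_max.
    return ((point[0] - center[0]) / radius, (point[1] - center[1]) / radius)


def make_toy_stepper():
    # Hypothetical stand-in for a chaos automaton: move the point halfway
    # toward the corner of the unit square assigned to each base.
    corners = {"C": (0.0, 0.0), "G": (1.0, 0.0),
               "A": (0.0, 1.0), "T": (1.0, 1.0)}
    pos = [0.0, 0.0]                      # moving point starts at the origin

    def step(base):
        cx, cy = corners[base]
        pos[0], pos[1] = (pos[0] + cx) / 2, (pos[1] + cy) / 2
        return pos[0], pos[1]

    return step


center, radius = burn_in(make_toy_stepper(), random.Random(0))
print(center, radius)
```

For the toy stepper the attractor fills the unit square, so the estimated center should land near (0.5, 0.5) and the radius near the distance from the center to a corner.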

4.3 The Evolutionary Algorithm

The evolutionary algorithm used to evolve chaos automata is a minimal modification of the one used to evolve the first fractal chromosome. Burn-in and fitness evaluation happen as before, and the same steady-state model of evolution is used. We extended the algorithm to use three distinct types of data instead of two. The algorithm cycled through the three data types during fitness evaluation, and the fitness function, Equation 6, was generalized in the following fashion. The product forming the argument of the arctangent function included all six variance terms rather than four, and the sum of all three Euclidean separations between pairs of mean positions for the three data types replaced the single pairwise separation.
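A compact way to see how Equation 6 and its three-class generalization relate is to write one function that handles any number of classes. The following is a sketch under stated assumptions: the name `scatter_fitness` is hypothetical, the points are assumed to be already normalized, and sample standard deviations are used for the dispersion terms because Equation 6 is written with $\sigma$ even though the prose speaks of variances.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def scatter_fitness(classes):
    """classes: one list of (x, y) points per data type (two or more points each)."""
    centers = []
    spread = 1.0
    for pts in classes:
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        centers.append((mean(xs), mean(ys)))
        spread *= stdev(xs) * stdev(ys)   # two dispersion terms per class
    # Sum of Euclidean separations over all pairs of class means.
    separation = sum(math.dist(a, b) for a, b in combinations(centers, 2))
    return math.atan(spread) * separation


two = [[(0, 0), (2, 2)], [(10, 0), (12, 2)]]
three = two + [[(0, 10), (2, 12)]]
print(scatter_fitness(two), scatter_fitness(three))
```

With two classes the function reduces to Equation 6: four dispersion terms and a single separation. With three classes it uses six dispersion terms and three separations. A "two-dot" fractal, in which every point of a class coincides, has zero spread and therefore zero fitness.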

4.4 Experimental Design

A series of simulations reported in [Ashlock and Golden, 2001] was performed to characterize the behavior of several fitness functions and their interaction with the chaos automata data structure on data derived from random sources with distinct distributions. This led to the selection of Equation 6 as our fitness function. To see that the chaos automata do yield a lesser degree of self-similarity, examine Figure 13.

In all of the simulations a population of 120 8-state chaos automata was evolved for 20,000 mating events, saving the most fit chaos automaton from the final generation in each simulation. The populations were initialized by filling in the chaos automata with appropriate uniformly distributed random values. Each type of simulation, defined by a data type and fitness function choice, was run 50 times with distinct random populations. As reported in [Ashlock and Golden, 2001], initial sets of simulations exposed two potential flaws. First, as the position of the moving point is normalized by the radius, $R_{max}$, of the fractal, it is possible to obtain incremental fitness by reducing the radius of the fractal. Because of this, some populations exhibited an average radius of about $10^{-15}$, yielding inferior results. This flaw was removed by fiat: any fractal with a radius of 0.1 or less was awarded a fitness of zero. The number 0.1 was chosen because it was smaller than the radius of most of a sample of 1000 randomly generated chaos automata driven with uniform, random data for 5000 bases. The second flaw, mentioned previously, was related to the desire to have visually appealing fractals. If the contraction factors of the similitudes in a chaos automaton are close to zero, then the automaton displays only a sparse dust of points. Such a sparse dust can still obtain a high fitness as it does spatially separate the points plotted for each data type. Multi-point dusts can even achieve large, if uninformative, scatter. To amend this flaw we placed a lower bound on the contraction factor $s$; the reflection previously used to correct mutations that drifted below zero was replaced with reflection at the lower bound. This lower bound was set to 0.5 for the simulations reported here.
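The two repairs described above are small enough to state as code. This is a sketch only: the function names are hypothetical, and the reflection assumes that a mutation never overshoots the bound by so much that the reflected value itself becomes unreasonable.

```python
def reflect_at_lower_bound(s, lower=0.5):
    """Reflect a mutated contraction factor that fell below the lower bound."""
    return 2 * lower - s if s < lower else s


def guarded_fitness(raw_fitness, radius, min_radius=0.1):
    """Award zero fitness to any fractal whose radius is 0.1 or less."""
    return 0.0 if radius <= min_radius else raw_fitness


print(reflect_at_lower_bound(0.45))   # a value just below 0.5 lands just above it
print(guarded_fitness(3.2, 1e-15))    # a collapsed fractal earns nothing
```

Reflection, rather than clamping, avoids piling mutated contraction factors up exactly at the bound.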

4.5 Results

Chaos automata were evolved to separate Markov modeled HIV-1 and Methanococcus jannaschii sequence, and then again to separate these two data types together with Markov modeled Helicobacter pylori sequence. All Markov models used window size six, yielding a relatively high probability of saving unique local features for use by the finite state portion of the chaos automata. For both the two-source and three-source data, fifty simulations were performed. Figure 14 shows a sample of the fractals resulting when the evolutionary algorithm was attempting to separate the two-source data. Good visual separation was achieved for the two-source data relative to the experiments performed with the first fractal chromosome; compare with Figure 10. In addition to the red-green plots tracking what data was being plotted, we made eight-shade plots to track state usage, in the same fashion that similitude usage was tracked when evolving the previous chromosome. From the use of color in these plots it appears that the chaos automata were using their state transition diagrams to get by with using fewer similitudes.

Figure 13: The attractors for chaos automata evolved to separate two and three classes of synthetic data. The two-class data was uniform random strings of CGA and GAT; the three-class data was CGA, CGT, and CAT. Separating such random data is itself a trivial task, but these images show that the resulting fractals are not self-similar between the distinct types of data.

[Figure 14 panels: Simulation 5, Simulation 12, Simulation 20, Simulation 45]

Figure 14: The attractors for the fractal receiving best fitness in simulations 5, 12, 30, and 45 comparing HIV-1 and Methanococcus jannaschii derived data with chaos automata. The left images display the points plotted for HIV-1 and Methanococcus jannaschii data in green and red, respectively. The right images show state use, associating states with colors as similitudes were in Figure 10.
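The text does not spell out how the Markov modeled sequences were produced, so the following is only one plausible reading, offered as an illustration: each window of six consecutive bases in the source sequence records which bases followed it, and new sequence is sampled from those counts. The function names and the fallback to a uniformly random base for unseen windows are assumptions.

```python
import random
from collections import Counter, defaultdict


def train(sequence, window=6):
    """Count, for every window of `window` bases, the bases that follow it."""
    model = defaultdict(Counter)
    for i in range(len(sequence) - window):
        model[sequence[i:i + window]][sequence[i + window]] += 1
    return model


def generate(model, seed, n, rng):
    """Extend `seed` by n bases sampled from the successor counts."""
    window = len(seed)
    out = list(seed)
    for _ in range(n):
        counts = model.get("".join(out[-window:]))
        if not counts:
            # Assumed fallback for an empty window: a uniformly random base.
            out.append(rng.choice("CGAT"))
            continue
        bases, weights = zip(*sorted(counts.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


model = train("ACGT" * 5)
print(generate(model, "ACGTAC", 8, random.Random(0)))
```

On this reading, a longer source sequence fills more of the 4^6 possible windows, which matches the later remark that the Helicobacter pylori model has more non-empty windows.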

Separating three types of data was more difficult than separating two. A sample of the fractals evolved to separate three biological data types is shown in Figure 15. The third data type, plotted in blue and drawn from the Helicobacter pylori complete genome, was derived from a much longer DNA sequence and so has more non-empty windows in its Markov model. This makes the third data class harder to pull away from the others. Simulations 10 and 35 both found ways of performing the separation. In simulation 10 a radical rotation appears to be the key, as only two states of the automaton are apparently used, as can be seen in the eight-shade picture tracking state usage. The “fractal” from simulation 10 does not have a pronounced fractal character, and was one of a group of thirteen more-or-less spiral-shaped attractors that appeared as best-of-run results among the fifty simulations performed. These essentially non-fractal visualizations fail to exploit the shape-related portion of the visual bandwidth and are a good target for removal by additional refinement of the fitness function.

Simulation 35 is an example of an attractor that manages to put the difficult, blue data in its own part of the fractal, yielding about the sort of result we desired. Red, green, and blue have their own relatively exclusive portions of the fractal. Simulations 2 and 23 give the most typical sort of result, in which the difficult data is displayed as a majority element of a mixed central region while the two easier-to-detect data types are placed in their own regions in a tail or along an edge.

The diversity of apparent solutions encountered in the space of fractals trained to separate types of data indicates a fairly rough fitness landscape. Best fitness figures varied by two orders of magnitude between simulations. To cursory inspection there was no real correlation between high fitness and visually effective fractals: the disk-like solutions typified by the attractor displayed for simulation 10 in Figure 15 received the highest fitness. This indicates that the fitness functions used, designed as they were with a combination of intuition and ad hoc testing, can be substantially improved upon. Chaos automata are a substantial departure from the original chaos games for visualizing DNA and we invite others to help us learn to use them effectively.

5 Conclusions

Chaos games, like the four-cornered chaos games described here, have the potential to provide an easily explained visualization of data that portrays some part of the local deviation from uniform randomness. When used to visualize DNA, they permit a trained user, or an untrained user with a point-and-query tool, to rapidly figure out what short DNA fragments are present or absent. The two fractal chromosomes presented here, evolvable iterated function systems and chaos automata, are superior to chaos games in tapping a far larger space of fractal visualization algorithms and permitting the training of the fractal algorithms to pick out particular classes of training data. They are clearly inferior in being more difficult to interpret. While the meaning of a pixel in a four-cornered chaos game is tied to a particular small DNA sequence, recovering the meaning of pixels within the more complex visualizations requires tracing of the data structures or user habituation on known data. In Section 6 we will discuss the potential for achieving such habituation.

Comparing the results from the evolved iterated function systems and the chaos automata, it is clear that the chaos automata are able to deal more effectively with points plotted for different data types than the indexed iterated function systems. The separation of points plotted for each data type is larger for the chaos automata, and the chaos automata also manage to associate shapes with data. This latter quality of shape differentiation is the result of the finite state structure permitting the detection of data features in a manner distinct from the similitude manipulations of the moving point. The state transitions implicitly assign different similitudes to different types of data.

It is important to keep in mind, though, that were we merely wanting to separate three types of data, the fractal would be unnecessary. Our goal is to provide a visualization of the data, not simply to analyze it. Analysis alone could be accomplished more efficiently by a number of different techniques: neural nets, finite state machines with tagged states, or routine statistical study of the distribution of subwords. The value of the fractal visualizations recorded here requires that some visualizations be fixed and that a human user habituate themselves to those visualizations.

[Figure 15 panels: Simulation 2, Simulation 10, Simulation 23, Simulation 35]

Figure 15: The attractors for the fractal receiving best fitness in simulations 2, 10, 23, and 35 comparing HIV-1, Methanococcus jannaschii, and Helicobacter pylori derived data with chaos automata. The left images display the points plotted for HIV-1, Methanococcus, and Helicobacter data in green, red, and blue, respectively. The right images show state use, associating states with colors as similitudes were in Figure 10.

6 Roads not yet taken

There are two general directions in which the research described here can be extended. One is attempting to transfer the work to data analysis software as a complementary visualization tool. The other is to extend the technology used in the evolvable fractal algorithms. This consists of both continuing to play with the fitness function and extending the data structure to incorporate new features. We will begin with applications, briefly discuss possible fitness functions, and then talk about some potential extensions to the data structure.

6.1 Applications

One clear advantage the four-cornered chaos game exhibits over both the indexed iterated function system and the chaos automata is the fixed and easily established meaning, relative to the sequence data used, of the pixels and areas of the fractal generated. While the evolvable fractals produce more complex shapes, those shapes are not tied to particular sequence features. The same curly twist could designate DNA in an open reading frame in one instance and the presence of the long terminal repeat of a transposon in another, depending on the training data used to evolve the fractal. In order to be useful, some set of class variables must be chosen and fractals evolved that separate them effectively. These fixed fractals could then be incorporated into sequence viewing or analysis software as a form of graphical annotation. As sequences are viewed by the user, the fractal data flakes build up. With the particular fractal visualizations fixed, these data flakes not only will tell the habituated user the values of the class variables used to evolve the fractals but also let them see partial or intermediate class memberships in some cases. The visual bandwidth substantially exceeds that required to display the class variables and so additional information drawn from the sequence data is displayed. It may be worth noting that both evolvable fractal chromosomes described here can be automatically compiled into C++ or other languages and, as algorithms, run rapidly.

The fractal visualizations presented in this paper are made with knowledge of which class of data is driving the algorithm. The red-green and red-green-blue plots can thus use color to display spatial separation of the data. If we embed the visualization algorithms into a software tool, it will typically not know the class of the data being used to drive the automaton. Spatial separation itself will visually separate classes of data and we can simply assign different colors to different data sources. It may be possible, as well, to get an estimate of the data class from either the location of the dots plotted (this assumes no data from outside the training classes is presented) or from the states in a chaos automaton. In the latter case we could train on an additional class of data, uniform random data, and write the fitness function to reward separating data classes by internal state. This would include the random “none-of-the-above” classification and so permit estimation of the class (and hence plotting color) as data flow through the visualizer.

While we chose to drive the visualizations described here with DNA bases and triples, they can be driven with almost any sort of discrete (or discretized continuous) data. Amino acid residues would be a natural candidate. For visualization of secondary RNA structure we might use curvature as an input variable. With protein crystal structure, the direction of the backbone in 3-space could supply an interesting driving variable. The number of data sources that could be used to drive fractals is enormous; the difficult task will be making the fractals valuable to the human users.
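As a small illustration of what "discretized continuous" input could look like, the sketch below bins real values into a four-letter alphabet so that an existing four-input automaton could be reused. The function name, the cut points, and the choice to relabel bins as C, G, A, T are all assumptions made for the example.

```python
from bisect import bisect_right


def discretize(values, cut_points, alphabet="CGAT"):
    """Map each value to the symbol of the bin it falls in.

    `cut_points` must be sorted and hold len(alphabet) - 1 thresholds.
    """
    return "".join(alphabet[bisect_right(cut_points, v)] for v in values)


# Hypothetical curvature-like values binned at the quartiles of [0, 1].
print(discretize([0.1, 0.3, 0.6, 0.9], [0.25, 0.5, 0.75]))
```

How the cut points are chosen matters: bins with very unequal occupancy would drive the automaton mostly through a single column of its transition table.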

6.2 Fitness

In this work we have reported on two fitness functions. The first, Equation 5, simply computes the Euclidean separation of the mean positions of the moving point for two data classes. The second, Equation 6, multiplies the first fitness function by the arctangent of the product of the variances of the coordinates of the positions of the moving points for each class of data. The variance rewards “fluffing out” the fractals and the arctangent bounds that reward so they do not fluff out too much. In [Ashlock and Golden, 2001] the fitness function without the arctangent was tried and it was clearly established that almost all the resulting fractals are the disk-like smears typified by simulation 10 in Figure 15. Another fitness strategy was also tested: creating two hills and making the fitness the sum, over all points plotted, of the height of one of the hills where the point was plotted, letting the data class designate which hill was currently in use. This fitness strategy produced “two-point” fractals with the points on the hilltops.

Logical things to try would include modifying the hill-based fitness with a bounded scatter reward or varying the slope and position of the hills. The motivation for trying such a hill-based function was that mentioned in Section 6.1: regaining some control over the appearance of the fractal. If fitness is rewarded for plotting data in distinct, pre-specified areas for each data class, then some part of the appearance of the fractal returns from the capricious hands of evolution to the researcher. The feathery, leafy shapes would remain in many cases but some shred of fixed meaning would have returned to the spatial arrangement of the fractal.
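The hill-based fitness can be made concrete with a short sketch. The shape of the hills is not given in the text, so Gaussian hills of a fixed width are assumed here, and the function names and hill positions are hypothetical.

```python
import math


def hill_height(point, center, width=0.5):
    # Assumed hill shape: a Gaussian bump of height 1 at `center`.
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * width * width))


def hill_fitness(labelled_points, hill_centers):
    """Sum, over all plotted points, the height of the hill of the point's class."""
    return sum(hill_height(p, hill_centers[cls]) for cls, p in labelled_points)


hills = {0: (-0.5, 0.0), 1: (0.5, 0.0)}
on_top = [(0, (-0.5, 0.0)), (1, (0.5, 0.0))]
swapped = [(0, (0.5, 0.0)), (1, (-0.5, 0.0))]
print(hill_fitness(on_top, hills), hill_fitness(swapped, hills))
```

The sum is largest when every point of a class sits exactly on its hilltop, which is consistent with the "two-point" fractals reported above and suggests why a bounded scatter reward is a natural addition.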

6.3 Other fractal chromosomes

Both the index table based on DNA triples and the finite state transition diagram in chaos automata are widgets for connecting data to similitudes. Any transducer from finite sets to finite sets could be tried as such a widget. The key is to find transducers that produce good fractals. Genetic programming [Koza, 1992, Koza, 1994] could be used as such a connector. A group of parse trees, each with an associated similitude, would be presented inputs from a moving window of sequence data. The parse tree that produced the largest output would have its similitude used to drive the moving point. This notion is inspired by Peter Angeline’s MIPS nets, and, were we to fully implement Angeline’s idea, the parse trees would also have access to one another’s output. This innovation would permit the system to store state information.
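The selection rule proposed here, in which the parse tree with the largest output chooses the similitude, can be illustrated without a genetic programming system. In the sketch below ordinary Python functions stand in for evolved parse trees and strings stand in for similitudes. All names are hypothetical.

```python
def pick_similitude(window, scorers):
    """scorers: list of (function, similitude) pairs.

    Each function plays the role of an evolved parse tree that maps a
    window of sequence data to a number; the largest output wins.
    """
    _, best_similitude = max(scorers, key=lambda pair: pair[0](window))
    return best_similitude


# Toy stand-ins for parse trees: count occurrences of one base.
scorers = [
    (lambda w: w.count("G"), "similitude_0"),
    (lambda w: w.count("A"), "similitude_1"),
]
print(pick_similitude("GGCA", scorers))
```

Ties go to the earliest entry in the list because `max` returns the first maximal element. An evolved system would need some such tie-breaking convention.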

GP-Automata [Ashlock, 1997, Ashlock, 2000], which modify finite state machines by adding a parse tree to each state to interpret (and compress) input, could be used as a generalization of chaos automata. Here, the parse trees would permit the finite state transition diagram to be driven by information extracted from several bases, perhaps in a moving window or in an evolvable sample pattern. Such sample patterns could be part of the GP-Automaton chromosome or could be preselected to match the data. This would involve finding patterns of bases within a window that exhibit relatively low randomness across the entire data set. Techniques for performing this type of computation are described in [Ashlock and Davidson, 1999].

6.4 More general contraction maps

Similitudes are a simple and extensive family of contraction maps. We might obtain better results by using more general and even evolvable families of contraction maps. Recent results have shown that we need not have strictly contracting maps to obtain a bounded fractal. In fact the expected log of the contraction factor need only be negative. With this in mind we could either design a class of maps that contract on average, or design an operation and terminal set for genetic programming that would favor contractive maps. In the latter case we would need to use the fitness function to filter out any unbounded fractals, but this should be easy. Either death by excessive radius or a multiplicative penalty for excessive radius should suffice.
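The "contract on average" condition is easy to check numerically for maps with known scaling factors. The sketch below is illustrative only: the function names are hypothetical, the maps are assumed to be chosen independently with the given probabilities, and the particular form of the radius penalty is one simple possibility among many.

```python
import math


def expected_log_contraction(factors, probabilities):
    """Expected log of the scaling factor; negative means contraction on average."""
    return sum(p * math.log(s) for s, p in zip(factors, probabilities))


def radius_penalty(radius, max_radius=10.0):
    # Hypothetical multiplicative penalty: 1 inside the bound, decaying beyond it.
    return 1.0 if radius <= max_radius else max_radius / radius


# One expanding map (1.2) paired with a strongly contracting one (0.5).
print(expected_log_contraction([1.2, 0.5], [0.5, 0.5]))
# An expanding map (1.5) paired with a weakly contracting one (0.9).
print(expected_log_contraction([1.5, 0.9], [0.5, 0.5]))
```

The first pair has a negative expected log (0.6 is less than 1 when the factors are multiplied) and so contracts on average despite containing an expanding map. The second pair has a positive expected log and would be a candidate for the radius penalty.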

References

[Ashlock, 1997] Ashlock, D. (1997). GP-automata for dividing the dollar. In Genetic Programming 1997, Proceedings of the Second Annual Conference on Genetic Programming, pages 18–26.

[Ashlock, 2000] Ashlock, D. (2000). Data crawlers for optical character recognition. In Proceedings of the 2000 Congress on Evolutionary Computation, pages 706–713.

[Ashlock and Davidson, 1999] Ashlock, D. and Davidson, J. (1999). Texture synthesis with tandem genetic algorithms using nonparametric partially ordered Markov models. In Proceedings of the 1999 Congress on Evolutionary Computation, pages 1157–1163.

[Ashlock and Golden, 2000] Ashlock, D. and Golden, J. B. (2000). Iterated function system fractals for the detection and display of DNA reading frame. In Proceedings of the 2000 Congress on Evolutionary Computation, pages 1160–1167.

[Ashlock and Golden, 2001] Ashlock, D. and Golden, J. B. (2001). Chaos automata: Iterated function systems with memory. Submitted to Physica D.

[Barnsley, 1993] Barnsley, M. F. (1993). Fractals Everywhere. Academic Press, Cambridge, MA.

[Koza, 1992] Koza, J. R. (1992). Genetic Programming. The MIT Press, Cambridge, MA.

[Koza, 1994] Koza, J. R. (1994). Genetic Programming II. The MIT Press, Cambridge, MA.

[Syswerda, 1991] Syswerda, G. (1991). A study of reproduction in generational and steady state genetic algorithms. In Foundations of Genetic Algorithms, pages 94–101. Morgan Kaufmann.
