
Parallel Computing 23 (1997) 1183-1198

Practical aspects and experiences

Parallel genetic programming and its application to trading model induction

Mouloud Oussaidène a,1, Bastien Chopard a,*, Olivier V. Pictet b,2, Marco Tomassini c,3

a CUI, University of Geneva, CH-1211 Geneva 4, Switzerland
b Olsen & Associates, Seefeldstrasse 233, 8008 Zürich, Switzerland
c DI-LSL, EPFL, 1015 Lausanne, Switzerland

Received 10 July 1996; revised 23 February 1997

Abstract

This paper presents a scalable parallel implementation of genetic programming on distributed memory machines. The system runs multiple master-slave instances, each mapped on all the allocated nodes, and multithreading is used to overlap message latencies with useful computation. Load balancing is achieved using a dynamic scheduling algorithm, and a comparison with a static algorithm is reported. To alleviate premature convergence, asynchronous migration of individuals is performed among processes. We show that nearly linear speedups can be obtained for problems of large enough size. The system has been applied to the induction of robust trading strategies, a compute-intensive financial application.

Keywords: Parallel genetic programming; Performance analysis; Financial trading models

1. Introduction

Evolutionary processes are a generalization of classical genetic algorithms (GAs) originally conceived by Holland [1]. These algorithms are based on the principle of evolution (survival of the fittest): fitter solutions in the evolved population are selected to undergo genetic transformations, thus giving rise to a new, more adapted

* Corresponding author. E-mail: [email protected].
1 E-mail: [email protected].
2 E-mail: [email protected].
3 E-mail: [email protected].

0167-8191/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved.

PII S0167-8191(97)00045-8


population of individuals. Koza [2] extended this genetic model of learning into the space of programs and thus introduced the concept of genetic programming. Each solution to a given problem is represented by a genetic program (GP), traditionally using the Lisp syntax. Genetic programming is now widely recognized as an effective search paradigm in artificial intelligence, databases, classification, robotics and many other areas.

However, the inherent convergence characterizing traditional GAs makes it difficult to maintain different high-fitness individuals in a single population: once a suboptimal individual dominates the population, selection is likely to keep it and prevent further adaptation. Evolving multiple, independent subpopulations with occasional interchange (migration) of solutions between these subpopulations is an alternative approach to deal with this premature convergence problem. This not only allows a better exploration of the global search space (each subpopulation can explore a different part of the search space); it also delays premature convergence by introducing, via migration, diversity from other subpopulations.

We present here a parallel genetic programming system (PGPS) which maintains multiple independent subpopulations interacting asynchronously over a ring topology. For many problems of large enough size, the fitness evaluation phase takes most of the run time. In particular, when evolving trading strategies in a financial application, each GP is evaluated over a large price time series and the time spent in the selection and reproduction phases is practically negligible.

Parallelizing the different evaluation phases related to each evolutionary process in the subpopulations is not a trivial task: since the run-time complexities of the GPs inside each subpopulation may differ widely, the work load is unbalanced and some processing nodes become idle while others are active. This problem is equivalent to finding an optimal schedule of p independent tasks on m < p machines and is known to be NP-hard.

In our parallel implementation, the evaluation phase of each evolutionary process is separated from the rest of the population management calculations. We compare a static scheduling algorithm, which distributes the genetic programs over the processing nodes at the evaluation phase, with a dynamic load balancing algorithm based on the run-time GP complexity.

Interleaving the different evaluation phases makes it possible to hide message latencies by switching between multiple threads. Our implementation shows that the parallelization of genetic programming on distributed memory machines is linearly scalable with respect to the number of processors, provided that the problem is large enough. The real-life benchmark application addressed in this paper shows that promising trading strategies can be inferred in reasonable time using parallel genetic programming.

The code was implemented on the IBM SP-2 machine and written in C++, using the PVM3 message passing library [3]; it can easily be ported to other parallel machines, such as the Cray T3D, and to workstation clusters. The sequential version of the code is based on [4].

The organization of this paper is as follows. The evolution of genetic programs is outlined in Section 2. Section 3 discusses the PGPS parallel scheme. Section 4 highlights the load balancing problem. Section 5 presents some benchmarks, and the time complexity and computational model are analyzed in Section 6. Finally, in Section 7, the system is applied to a difficult problem of trading model search. Concluding remarks are made in Section 8. Note that a preliminary study of this problem is given in Ref. [5].

2. Evolution of genetic programs

A GP can be regarded as a Lisp function or S-expression. It is usually represented by a parse tree. The number of nodes (including the terminal nodes) in the parse tree gives a measure of the space complexity of that GP. Thus, the GP (* (+ x y) (- t b)) has space complexity 7. In the creation phase, each parse tree is built in a recursive way, starting from the root node. Terminal and non-terminal nodes are randomly taken from some well-defined TerminalSet and FunctionSet, respectively. These sets are usually small and their choice is normally driven by previous knowledge of the problem. To ensure syntactic validity and to control the complexity expansion of GPs during this creation process, some rules must be observed. If the current node in the parse tree is a function taking r arguments, then r nodes are chosen from the TerminalSet/FunctionSet to be its child nodes. Otherwise, the current node is a terminal and the expansion of the branch stops. A maximum depth d is fixed before the execution, so that when a branch in the parse tree reaches this level a terminal must be chosen. The depth level corresponding to the root node is zero.
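To make the creation phase concrete, here is a minimal sketch of the recursive tree-growing procedure just described. The Node and Primitive types, and the even choice between functions and terminals before the depth limit, are our own assumptions rather than the paper's actual code.

```cpp
// Hypothetical sketch of the GP creation phase (not the authors' code).
#include <cstdlib>
#include <string>
#include <vector>

struct Primitive { std::string symbol; int arity; };  // arity == 0 for a terminal

struct Node {
    const Primitive* prim;
    std::vector<Node*> children;
};

// Grow a random parse tree. Once the maximum depth d is reached, only a
// terminal may be chosen, which both guarantees termination and enforces
// the depth constraint described in the text (root is at depth zero).
Node* create(const std::vector<Primitive>& functions,
             const std::vector<Primitive>& terminals,
             int depth, int maxDepth)
{
    bool pickTerminal = (depth == maxDepth) || (std::rand() % 2 == 0);
    const Primitive& p = pickTerminal
        ? terminals[std::rand() % terminals.size()]
        : functions[std::rand() % functions.size()];

    Node* n = new Node{&p, {}};
    for (int i = 0; i < p.arity; ++i)   // an r-ary function gets r child nodes
        n->children.push_back(create(functions, terminals, depth + 1, maxDepth));
    return n;
}
```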

Let r be the maximum function arity in the FunctionSet. At run time, the space complexity c of a GP is therefore bounded by $c \le (r^{d+1}-1)/(r-1)$, for $r \neq 1$.
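The bound follows by counting nodes level by level: a tree whose functions take at most r arguments holds at most $r^k$ nodes at depth k, so summing the geometric series over the $d+1$ levels gives
$$c \;\le\; \sum_{k=0}^{d} r^{k} \;=\; \frac{r^{d+1}-1}{r-1}, \qquad r \neq 1 .$$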

Two genetic operators are used in the reproduction phase: crossover and mutation. The crossover operator selects two GPs from the population and chooses one node (crosspoint) on each. Each node is, by definition, the root of some complete subtree. The two subtrees are extracted and swapped with each other. Note that the resulting GPs are syntactically valid and only the maximum depth constraint has to be checked.

For illustration, if we choose the nodes 2 and 1 (numbering from the root), respectively, on the GPs (+ a (* b c)) and (- (/ y (+ 5 z)) x), then the new GPs will be (+ a (/ y (+ 5 z))) and (- (* b c) x). This example is shown in Fig. 1.
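A compact sketch of this subtree swap is given below, reusing the hypothetical Node type from the creation sketch above. Drawing the crosspoints uniformly over all links (including the roots) is our assumption; the maximum-depth check on the offspring is only indicated.

```cpp
// Hypothetical sketch of GP crossover by subtree swapping.
#include <cstdlib>
#include <utility>
#include <vector>

// Collect the addresses of every child link; assigning through such an
// address replants a complete subtree in one step. Slot 0 is the root.
void collectLinks(Node*& link, std::vector<Node**>& out) {
    out.push_back(&link);
    for (Node*& c : link->children) collectLinks(c, out);
}

void crossover(Node*& a, Node*& b) {
    std::vector<Node**> la, lb;
    collectLinks(a, la);
    collectLinks(b, lb);
    Node** pa = la[std::rand() % la.size()];   // crosspoint in first parent
    Node** pb = lb[std::rand() % lb.size()];   // crosspoint in second parent
    std::swap(*pa, *pb);                       // exchange the two subtrees
    // The offspring remain syntactically valid by construction; only the
    // maximum-depth constraint would still have to be checked here.
}
```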

Contrary to genetic algorithms, the mutation operation is seldom used in genetic programming. The goal of the mutation operator is to reintroduce some diversity in an otherwise stagnant population. While this is needed in GAs, in GP the crossover operator, being less restricted, makes mutation largely unnecessary [2].

Fig. 1. Example of crossover of two genetic programs.


The evaluation phase consists of assigning a fitness value to each GP in the population. This fitness calculation requires evaluating each GP as many times as there are fitness cases related to the problem to solve. In image compression, the fitness cases are the pixels of a 2D array. In a classification problem, the number of fitness cases corresponds to the amount of data to classify. In designing logical circuits taking k input bits, there are 2^k possible cases. In evolving trading strategies, the benchmark application used in this study, each GP is evaluated on a price time series and each series element represents a fitness case.

In this analysis, n indicates the number of fitness cases (which is problem dependent), p the population size, g the number of generations and d the maximum depth of the GPs.

3. PGPS parallel scheme

PGPS consists of multiple master-slave instances, each mapped on all the allocated processing nodes (see Fig. 2). Each master process implements the genetic programming management system. A slave process corresponds to the user-defined problem. The number of master processes may vary from 1 to m, where m is the total number of available processors.

The conventional master-slave paradigm is used as a model template where each instance evolves its own population. The subpopulation maintained by each evolutionary process is evaluated using all the allocated nodes. The genetic programming management system creates the initial population, applies the genetic operators (crossover and mutation) and performs the selection of the genetic programs which will be the candidates for the reproduction phase. At the evaluation phase, each master process distributes the work load among all the processing nodes in the virtual machine, including the processor on which the current master process runs. This computational load distribution can be accomplished by different strategies. In Section 4 we compare a static scheduling algorithm with a dynamic one.

The interprocessor communications are achieved using communication interface routines. The genetic programming management system packs each parse tree from its memory representation into a buffer and sends it, as a string of characters, to the appropriate slave process using PVM routines.

Fig. 2. Example of the PGPS parallel architecture with 4 subpopulations on 8 processors. There is a master (population management) process Mi for each subpopulation and each physical processor runs a slave process Sk which evaluates the individuals from any subpopulation. The solid lines show the communication pattern between master and slave processes. The dotted line shows the ring topology that allows individual migration across masters.

Each slave process then performs the unpack operation to rebuild the equivalent parse tree in memory. After the fitness calculation of a genetic program, each slave process sends the obtained fitness value back to the source master process. Note that the GPs can be evaluated independently, so there is no communication between the slaves.
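The master-side half of this exchange could look as follows. The PVM3 calls (pvm_initsend, pvm_pkstr, pvm_send, pvm_nrecv) are the library's own; the message tags, the serialization to an S-expression string and the index used to match replies to GPs are our assumptions.

```cpp
// Hypothetical sketch of the master-side GP distribution (PVM3 calls real).
#include <pvm3.h>

const int TAG_EVAL    = 1;   // assumed tag: GP sent out for evaluation
const int TAG_FITNESS = 2;   // assumed tag: fitness value coming back

void sendForEvaluation(int slaveTid, int gpIndex, char* sexpr) {
    pvm_initsend(PvmDataDefault);   // fresh send buffer
    pvm_pkint(&gpIndex, 1, 1);      // lets the reply be matched to its GP
    pvm_pkstr(sexpr);               // the parse tree as a character string
    pvm_send(slaveTid, TAG_EVAL);
}

// Non-blocking probe for a returned fitness value, so the master can keep
// evaluating its local tasks while replies trickle in.
bool tryReceiveFitness(int& gpIndex, double& fitness) {
    if (pvm_nrecv(-1, TAG_FITNESS) <= 0) return false;  // nothing arrived yet
    pvm_upkint(&gpIndex, 1, 1);
    pvm_upkdouble(&fitness, 1, 1);
    return true;
}
```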

After work distribution, each master process first evaluates the individuals belonging to its own subpopulation (local tasks), then switches to the GPs coming from other subpopulations (external tasks). The local tasks require no communication. The load distribution cost and the communication overhead are negligible compared with the computational cost associated with fitness calculation.

Each new GP is sent individually, as soon as it is created. Individuals bound for the same destination are not clustered into a single message. This technique avoids delaying the fitness computations and keeps the slave processes busy. In this way, fitness computation and work distribution overlap.

The global evaluation ends when all subpopulations are evaluated (each processing node has performed its assigned tasks). Once evaluation completes, the subpopulations interact, to delay convergence, using for instance a ring topology. Each master node selects a small fraction of good genetic programs, sends it to its next neighbor and receives asynchronously, using a nonblocking primitive, an individual from its previous neighbor. When the individual is available, it is inserted in the subpopulation by replacing the least fit one; otherwise the node continues its evolutionary process. Note that similar loosely coupled subpopulation models have been described before [6]. The termination signal of the parallel execution is given when a pre-assigned maximum number of generations g has been reached. The main steps of the parallel algorithm are as follows (* indicates that the task is performed only by the first loaded master process):

Master process:
1. Load the other master processes *.
2. Load the slave processes *.
3. Create the initial population.
4. Distribute work to the processing nodes.
5. Execute the local evaluation tasks.
6. Execute the external tasks (requested by other master processes) and send back the results.
7. Receive the fitness values (sent either by a slave or a master process).
8. Select and send an individual to the next neighbor.
9. If an individual arrived from the previous neighbor, insert it in the subpopulation.
10. Perform the selection phase.
11. Perform the reproduction phase.
12. Repeat steps 4-11 up to the maximum number of generations.
13. Terminate the slave processes *.

Slave process:
1. Receive a genetic program.
2. Calculate the fitness of the received program.


3. Send the fitness value to the source master process.
4. Repeat steps 1-3 until reception of the termination signal.
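Put together, the slave side (steps 1-4 above) reduces to a small receive-evaluate-reply loop. This sketch shares the assumed tags of the master fragment; parseTree() and evaluate() are hypothetical stand-ins for the unpack and fitness routines.

```cpp
// Hypothetical sketch of the slave loop (PVM3 calls real, names assumed).
#include <pvm3.h>

const int TAG_FITNESS = 2;   // assumed tags, matching the master sketch
const int TAG_STOP    = 3;

struct Node;                           // parse tree, as in earlier sketches
Node*  parseTree(const char* sexpr);   // rebuild the tree from the string
double evaluate(Node* gp);             // fitness over all n fitness cases

void slaveLoop() {
    char buffer[4096];
    for (;;) {
        int bufid = pvm_recv(-1, -1);              // any master, any tag
        int bytes, tag, srcTid;
        pvm_bufinfo(bufid, &bytes, &tag, &srcTid);
        if (tag == TAG_STOP) break;                // termination signal

        int gpIndex;
        pvm_upkint(&gpIndex, 1, 1);
        pvm_upkstr(buffer);                        // the S-expression string
        double fitness = evaluate(parseTree(buffer));

        pvm_initsend(PvmDataDefault);
        pvm_pkint(&gpIndex, 1, 1);
        pvm_pkdouble(&fitness, 1, 1);
        pvm_send(srcTid, TAG_FITNESS);             // back to the source master
    }
    pvm_exit();
}
```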

The evaluation of all subpopulations (global evaluation) is performed by interleaving local and external tasks. Multiple threads are maintained on each master node and switching among them overlaps message latencies by performing useful computation while other threads wait for synchronization signals.

4. Load balancing

This section discusses how to distribute the computational load equally among the processing nodes. Good load balancing is crucial for a parallel system to deliver a significant speedup. In the terminology used here, a task is the fitness computation of a GP, and the task size refers to the time complexity of the GP to be evaluated. Time complexity refers to the number of arithmetic operations required for a single evaluation of the GP, times some weight factor taking into account the individual cost of each operation (a division is more costly than an addition, and the FunctionSet may contain macro-operations involving several basic operations). In what follows, we will denote by C_j the time complexity of individual j.

A task is the smallest unit that can be scheduled on a processing node. Furthermore, once initiated, each task runs uninterrupted until termination. After task termination, each processor activates a task taken from its local wait queue according to a FIFO policy. Load balancing can be either static or dynamic.

In a static load balancing scheme, the work load is distributed in an ordered way defined at compilation time. A round-robin policy is used to assign tasks to the processing nodes; the balancing criterion is the number of tasks assigned to each processor, task i being assigned to processing node i mod m.

Even though this algorithm regulates the number of tasks assigned to each processor, it does not reflect dynamic factors which may change during the system evolution. Due to the way programs are created by the genetic process, the computational load may be irregular because the programs are not all of the same size and complexity.

In a dynamic load balancing scheme, the decision of which processor should run a given task is made at run time. The processor allocation is a function of the population complexity at each generation of the evolutionary process. Given m processing nodes M_i (i = 1, ..., m) and p tasks T_j (j = 1, ..., p) of size C_j, one wishes to find an optimal schedule such that the total completion time is minimal. This problem is NP-hard and, here, we propose a 'greedy' heuristic which, on average, gives a good solution. Let l_i be the work load of processor i. The dynamic load balancing algorithm is the following:
1. Sort the C_j in decreasing order.
2. Initialize l_i to zero for all i = 1, ..., m.
3. For j := 1 to p do:
   find M_i*, the least loaded processor, such that l_i* <= l_k for all k (1 <= k <= m);
   assign T_j to M_i*;
   set l_i* := l_i* + C_j.

Page 7: Parallel genetic programming and its application to trading model induction

M, Oussuidbze et al./Purullel Computing 23 (1997) 1183-1198 1189

The tasks are thus sorted by size in decreasing order and the distribution starts from the largest task. At each iteration, the algorithm assigns the current task to the least loaded processing node. When a processor is selected to perform a task, its work load is increased by the size of that task.
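This is the classical longest-processing-time rule; with a min-heap over the processor loads it runs in O(p log p + p log m) time, as the following sketch shows (the complexities C_j are assumed to be known before scheduling):

```cpp
// Greedy dynamic scheduling sketch: largest task first, always to the
// currently least loaded node.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> schedule(const std::vector<double>& complexity /* C_j */, int m) {
    // 1. Order task indices by decreasing complexity.
    std::vector<int> order(complexity.size());
    for (std::size_t j = 0; j < order.size(); ++j) order[j] = (int)j;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return complexity[a] > complexity[b]; });

    // 2. Min-heap of (load l_i, node i); all loads start at zero.
    typedef std::pair<double, int> Load;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load> > heap;
    for (int i = 0; i < m; ++i) heap.push(Load(0.0, i));

    // 3. Assign each task T_j to the least loaded node M_i*.
    std::vector<int> nodeOf(complexity.size());
    for (int j : order) {
        Load least = heap.top(); heap.pop();
        nodeOf[j] = least.second;
        least.first += complexity[j];   // l_i* := l_i* + C_j
        heap.push(least);
    }
    return nodeOf;                      // nodeOf[j] = processor chosen for task j
}
```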

5. Parallel programming performance

This section discusses the parallel performance of the system. The benchmark is a test-case application, the logistic map interpolation, where a GP is sought that predicts a chaotic time series. Starting from a learning sequence $x_1, x_2, \ldots, x_n$, where
$$x_{i+1} = F(x_i) \quad \text{with} \quad F(x) = 4x(1-x),$$
our system evolves a program able to compute $x_{i+1}$ knowing $x_i$. The fitness function is $\sum_{i=1}^{n-1} [x_{i+1} - GP(x_i)]^2$, that is, the sum of the squares of the differences between the predicted values and the expected ones. After a few tens of generations, the genetic programming system finds the correct individual, namely $GP(x) = F(x)$.
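For concreteness, the fitness computation for this benchmark reduces to a few lines; evalGP is a hypothetical stand-in for evaluating a candidate parse tree at one point.

```cpp
// Sum of squared prediction errors over the learning sequence.
#include <vector>

double logisticFitness(const std::vector<double>& x,
                       double (*evalGP)(double)) {
    double err = 0.0;
    for (std::size_t i = 0; i + 1 < x.size(); ++i) {
        double d = x[i + 1] - evalGP(x[i]);   // predicted minus expected
        err += d * d;                          // 0 for the perfect GP(x) = F(x)
    }
    return err;
}

// The learning sequence itself comes from iterating F(x) = 4x(1 - x).
std::vector<double> makeSeries(double x0, int n) {
    std::vector<double> x(n);
    x[0] = x0;
    for (int i = 1; i < n; ++i) x[i] = 4.0 * x[i - 1] * (1.0 - x[i - 1]);
    return x;
}
```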

The performance measurements were obtained with a number of fitness cases of n = 1000 and are consistent with the measurements obtained in the more demanding trading model application.

Our parallel genetic programming system was implemented on the distributed memory machine IBM SP-2, giving a peak performance of 125 MFLOPS per node, 40 Mbyte/s peak channel bandwidth and 500 ns hardware latency [7]. We used the IBM PVMe message passing library [8].

The effective performance of a point-to-point communication operation using dedicated nodes connected with the high-performance switch was measured using the wall clock (gettimeofday). Each time reported is half the round-trip (ping-pong) communication time for a message of variable size. In order to minimize the effect of other users' traffic in the network, each measurement is repeated 100 times and the minimum value is retained.

Fig. 3. Measured performance of point-to-point communication operation on SP-2.


Table 1
A summary of the parameters for the two problem sizes

Measurement    n      d     p      g
(A)            1000   6     100    100
(B)            100    12    100    10

The curve in Fig. 3 shows a discontinuity at 100 bytes. For long messages (larger than 500 bytes), the communication time grows nearly linearly with the message size, leading to an asymptotic bandwidth of 28.5 Mbytes/s.
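A sketch of this measurement protocol is given below. The gettimeofday timing and the minimum-over-repetitions filtering follow the text; the tags and the echoing peer are assumptions, while pvm_pkbyte and PvmDataInPlace are ordinary PVM3 facilities used here for illustration.

```cpp
// Ping-pong latency/bandwidth measurement sketch (peer echoes the message).
#include <pvm3.h>
#include <sys/time.h>
#include <vector>

double oneWaySeconds(int peerTid, std::vector<char>& msg, int repeats = 100) {
    double best = 1e30;
    for (int r = 0; r < repeats; ++r) {
        timeval t0, t1;
        gettimeofday(&t0, 0);
        pvm_initsend(PvmDataInPlace);                  // avoid an extra copy
        pvm_pkbyte(&msg[0], (int)msg.size(), 1);
        pvm_send(peerTid, 10);                         // ping (assumed tag)
        pvm_recv(peerTid, 11);                         // pong (assumed tag)
        gettimeofday(&t1, 0);
        double roundTrip = (t1.tv_sec - t0.tv_sec)
                         + 1e-6 * (t1.tv_usec - t0.tv_usec);
        if (roundTrip < best) best = roundTrip;        // minimum filters traffic
    }
    return best / 2.0;                                 // half the round trip
}
```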

All the speedup measurements reported here are calculated in the traditional form. We have fixed not only the problem size but also the random number generator seed, so that the output of a parallel execution is exactly the same as that of the corresponding sequential execution. Other authors suggest fixing only the problem size and averaging over many runs [6].

Here, the speedup was measured on two different problem instances, (A) and (B), and for each of them we report both static and dynamic load balancing performances. Table 1 summarizes the parameters used for each problem size. The columns n, d, p and g indicate the problem size, maximum depth, population size and number of generations, respectively. Fig. 4 shows the speedup curves obtained by measuring the elapsed time in the dedicated mode. For n = 1000 and d = 6, the sequential execution takes 235 s on a single node of SP-2. Using 10 SP-2 nodes, the static load balancing scheme completes in 34 s whereas the dynamic load balancing algorithm takes about 30 s. When reducing the problem size to n = 100 and increasing the maximum depth up to d = 12, the sequential execution time takes 744 s on a single node. On 10 nodes, the static scheduling algorithm completes in 140 s while the dynamic one takes 106 s.

In Section 6 we propose a performance model in order to interpret the above data.

Fig. 4. Speedup for n = 1000 (left) and n = 100 (right).


6. Time complexity and scalability analysis

In this analysis, we model the distributed memory machine as a set of high-performance sequential machines interacting through a low-latency, high-bandwidth network. We assume that the time to send a message of size s from one processor to another can be modeled as $\tau_s + s\tau_d$, where $\tau_d$ is the transmission time per byte and $\tau_s$ is the startup time, which includes the software and communication protocol overheads associated with each communication operation.

On a sequential machine, the evaluation phase of the genetic process, for one generation, can be performed in
$$\alpha\, n p C^*$$
units of time, where p and n have been defined in Section 5 and $C^*$ is the average population complexity
$$C^* = \frac{1}{p} \sum_{j=1}^{p} C_j .$$
The constant $\alpha$ is the average time required to perform each arithmetic operation in the tree evaluation.

The run time of the parallel implementation can be estimated as follows. On m processing nodes, using the dynamic scheduling algorithm, the work load $l_{max}$ assigned to the most loaded processing node cannot exceed $l_{min}$, the work load of the least loaded processor, by an amount larger than $C_{max}$, where $C_{max} = \max(C_j,\ 1 \le j \le p)$ is the maximum complexity in the population.

If this were not the case, the excess of work $l_{max} - l_{min}$ would be composed of at least two tasks, and, due to our load balancing strategy, one of these tasks would have been assigned to the least loaded processor. Thus $l_{max} - l_{min} \le C_{max}$. Since $l_{min}$ is certainly smaller than the average load per processor $pC^*/m$, we have the inequality
$$l_{max} \le \frac{pC^*}{m} + C_{max} .$$
When the population size p is large enough, we may then estimate that
$$l_{max} \approx \frac{pC^*}{m} .$$

Since each GP is evaluated for n fitness cases, the evaluation phase will take
$$T_{comp} = \alpha\, \frac{n p C^*}{m} .$$

Communication should also be taken into account. A genetic program of complexity C is packed as a message of length smaller than 6C + 4 bytes, due to the parentheses and extra characters in the S-expression representation. The total communication time required to move the whole population of GPs to the evaluation processors is the time to send p


Fig. 5. Validation of the performance model on the SP2.

messages of variable length. Assuming that the GPs are all coded with O(C) bytes, the communication time can be written
$$T_{com} = p\tau_s + \beta\, p C^* .$$

When C’ is large, one can drop out the term pan and the total execution time T,,, of the evaluation phase can be modeled as

qar = Lmp + Tcom npC *

= cl- + ppc* m

The quantities $\alpha$ and $\beta$ can be determined from actual runs on a parallel machine. Fig. 5 shows $T_{tot}/(pC^*)$ as a function of n/m for various executions on the SP-2. The plot should be a straight line if our performance model is a good approximation. We have considered runs with m in {2, 3, 4} and n varying between 2 and 2000. The values derived for the SP-2 are

$$\alpha = 1.45 \times 10^{-6}, \qquad \beta = 1.67 \times 10^{-5} .$$

These quantities are given in seconds and are inversely proportional to an effective Mflops rate and bandwidth, respectively.

From Eq. (l), the speedup on m processors can be approximated as anm

s= crn+pm

For a fixed problem size II, the speedup saturates at a value n(o/P) when increasing to infinity the number m of processors. Hence, the speedup is bounded by 87 for the problem instance (A) of Section 5 and by 8.7 for the problem instance (B). However, for a given number of processors m, nearly linear speedups can be obtained simply by taking a big enough instance of the problem.

The scalability of an application is characterized by the ease with which a given efficiency
$$E = \frac{S}{m} = \frac{\alpha n}{\alpha n + \beta m} \qquad (2)$$


is achieved. The iso-efficiency function n = F(m) indicates how the problem size n must grow as the number of processors m increases in order to maintain a given efficiency E. From Eq. (2), we obtain
$$n = F(m) = \left(\frac{E}{1-E}\right) \left(\frac{\beta}{\alpha}\right) m ,$$
which is linear with respect to m, making our implementation linearly scalable when the population size p and the average complexity $C^*$ are large enough.

With the values we found for $\alpha$ and $\beta$, an efficiency of 90% on 10 processors requires a problem size $n \approx 10^3$, and the same efficiency can be maintained on 100 processors by increasing n to $10^4$.

The difference in performance between the static and the dynamic load balancing schemes can be explained as follows. Increasing the program depth d augments the average task size $C^*$, which scales the whole processing time. Furthermore, increasing d is expected to make $C_{max}$, the size of the largest task, grow exponentially. Using the static load balancing scheme, each node receives at most $\lfloor p/m \rfloor + 1$ tasks to evaluate. In the worst case, the longest tasks are all assigned to the same node, so its work load is bounded by $(\lfloor p/m \rfloor + 1)\, n\, C_{max}$. Following the same reasoning as above, we derive, for the static load balancing algorithm, the limit $(\alpha/\beta)(C^*/C_{max})\, n$ of the speedup when m goes to infinity.

When $C^*/C_{max} < 1$, the dynamic algorithm gives better results than the static one; this explains, in problem (B), the speedups of 5.3 and 7 obtained with the static and the dynamic load balancing schemes, respectively, on 10 processors.

When $C^*/C_{max} \approx 1$, both the static and the dynamic load balancing schemes lead to similar speedup performances. This happens either when d is small, as in problem (A), making all the tasks equal sized, or when there is one task so large that the rest of the tasks in the pool can be neglected.

7. Evolving trading models

Genetic programming provides a natural way to represent and evolve decision trees [9]. In this section, we present an application of genetic programming to learning technical trading models for the foreign exchange (FX) market. The recommendations given by a trading model are based purely on the past prices (price time series) of the exchange rate being analyzed. The price history is summarized in the form of variables called indicators.

A trading model is a system of rules catching the movement of the market and providing explicit trading recommendations for financial assets. A simple form of a trading rule could be

IF |I| > K THEN G := SIGN(I) ELSE G := 0    (3)

where I is an indicator whose sign and value model the current trend, and K is a threshold constant called the break-level. The gearing, G, is the recommended position of the model. The value G = +1 corresponds to a 'buy' signal, G = -1 corresponds to a 'sell'


signal and G = 0 corresponds to the neutral position. A trading rule with a more complex strategy may use more than one indicator: typically, an indicator giving the current trend can be used in conjunction with an indicator reflecting the volatility of prices.

Since no single indicator can ever be expected to signal all the trends, it is essential to combine a set of indicators so that an overall picture of the market can be built up. All the indicators used here are functions of time and of the price history. The construction of these indicators is based on the concept of momentum. A momentum of the price x (logarithmic middle price) is computed according to
$$I_{x,r}(t) = x(t) - \text{EMA}(\Delta t_r, t) , \qquad (4)$$
where the first term x(t) is the price and the second term is an exponential moving average (EMA) of the price computed on the range (depth in the past) $\Delta t_r$. The exponential moving average is computed recursively from the price history.
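The scanned source does not preserve the EMA formula itself, so the sketch below substitutes the standard recursive exponential average, $\text{EMA}(t) = \mu\,\text{EMA}(t-\delta t) + (1-\mu)\,x(t)$ with $\mu = e^{-\delta t/\Delta t_r}$; treat it as a plausible stand-in rather than the authors' exact operator.

```cpp
// Hypothetical momentum indicator built on a recursive EMA.
#include <cmath>

class MomentumIndicator {
    double range_;    // depth in the past, Delta t_r
    double ema_;
    bool   started_;
public:
    explicit MomentumIndicator(double range)
        : range_(range), ema_(0.0), started_(false) {}

    // x is the logarithmic middle price, dt the time since the last tick.
    double update(double x, double dt) {
        if (!started_) { ema_ = x; started_ = true; }
        double mu = std::exp(-dt / range_);
        ema_ = mu * ema_ + (1.0 - mu) * x;
        return x - ema_;              // I_{x,r}(t) = x(t) - EMA(Delta t_r, t)
    }
};
```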

As the trading rules contain constant break-levels which are independent of the FX rate being analyzed, the indicators $I_{x,r}(t)$ are normalized by a scaling factor given by the square root of a long-range momentum of the squared values of the indicator. These normalized indicators $\bar{I}_{x,r}(t)$ are used, according to Eq. (3), to obtain the signal indicators
$$G(x, r, K) = G(\bar{I}_{x,r}(t), K) .$$

The different signal indicators are combined by logical functions to form a decision tree, or S-expression, corresponding to a specific trading model. These logical S-expressions are evolved using genetic programming, where each genetic program represents a trading model. Because of the presence of three possible values (-1, 0, +1) for the gearing signal, we have modified the basic logical functions. The OR operator returns the sign of the sum of its arguments; the NOT function returns the opposite decision of its argument; the AND function returns the neutral signal when one of its arguments is zero and otherwise returns the OR value. The IF function takes three arguments: it returns the second argument if the first one is true, otherwise it returns its third argument.
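These definitions translate directly into code; the only reading we add is that a nonzero first argument of IF counts as 'true'.

```cpp
// Three-valued gearing logic on {-1, 0, +1}, as defined in the text.
inline int sign(int v) { return (v > 0) - (v < 0); }

inline int OR3(int a, int b)  { return sign(a + b); }     // sign of the sum
inline int NOT3(int a)        { return -a; }              // opposite decision
inline int AND3(int a, int b) {                           // neutral if any zero
    return (a == 0 || b == 0) ? 0 : OR3(a, b);
}
inline int IF3(int c, int a, int b) {                     // nonzero = true (assumed)
    return (c != 0) ? a : b;
}
```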

The major problem in optimizing trading models is to avoid the overfitting caused by the presence of noise. Overfitting means building trading models that fit a price history very well but generalize badly. The idea here, to avoid this phenomenon, is to build genetic programs based on pre-optimized building blocks corresponding to robust indicators. An indicator is said to be robust if it performs well in both the learning period (in-sample) and the test period (out-of-sample). These robust indicators are optimized using a niching genetic algorithm based on a fitness sharing scheme [10].

The fitness measure of a trading model quantifies not only the return but also the risk involved by the model [11]. The return of a deal is given by the ratio
$$(P_{t_i} - P_{t_{i-1}}) / P_{t_{i-1}} ,$$
where $P_{t_i}$ is the current transaction price and $P_{t_{i-1}}$ is the transaction price of the previous deal. The total return is then obtained by continuously cumulating the deal


return (which is null for deals starting from the neutral position). The fitness function, termed effective return ($X_{eff}$), combines the annualized average total return $\bar{E}$ with a risk term,
$$X_{eff} = \bar{E} - C\,\sigma^2 ,$$
where C is a risk aversion constant and $\sigma^2$ is the variance of the total return over time. Since the variance is a measure of the stability of the return, a high effective return also means a highly stable return. The notion of robustness is directly related to the ability to generalize the results beyond the training sample. For this purpose, each trading model is tested on more than one exchange rate time series. The fitness measure is then extended to
$$\bar{X}_{eff} = \langle X_{eff} \rangle - c\,\sigma_{X_{eff}} ,$$
where the first term is the average fitness value obtained for the different exchange rates and the second term is a penalty proportional to the standard deviation of these values.

In this application, the evaluation of the fitness value corresponding to each particular GP (trading model decision tree) implies a very heavy computation. For each price of a given time series, all the indicators are updated and the GP is evaluated to obtain the new trading recommendation. A new transaction is executed each time the GP provides a different gearing signal. The total return curve generated over the full optimization period is used to evaluate the fitness of the GP on the specific time series. The fitness measure returned to the genetic program is the corrected average fitness value obtained for the different exchange rates.
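Schematically, evaluating one model on one series is a single pass over the prices. All names here are hypothetical, and crediting short positions by multiplying the deal return by the previous gearing is our assumption.

```cpp
// One-pass evaluation sketch of a trading model on a price series.
#include <vector>

double evaluateModel(const std::vector<double>& price /* transaction prices */,
                     int (*gearing)(double)) {  // updates indicators, runs the GP
    double totalReturn = 0.0;
    int    position   = 0;      // current gearing G
    double entryPrice = 0.0;
    for (std::size_t i = 0; i < price.size(); ++i) {
        int g = gearing(price[i]);
        if (g != position) {    // the GP recommends a new position: transact
            if (position != 0)  // deals opened from neutral yield no return
                totalReturn += position * (price[i] - entryPrice) / entryPrice;
            position   = g;
            entryPrice = price[i];
        }
    }
    return totalReturn;         // feeds the effective-return fitness measure
}
```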

For this problem, the number of evaluations of each GP is given by the size of the time series to be learned and is typically of the order of $2.5 \times 10^5$. To give an order of magnitude of the time spent in the evaluation phase: evolving a population of 100 GPs over 100 generations, with the maximum depth set to d = 6, takes more than 30 h on four SP-2 nodes.

7.1. Trading model performance

The optimization of the trading models is performed on seven exchange rates (GBP/USD, USD/DEM, USD/ITL, USD/JPY, USD/CHF, USD/FRF, USD/NLG), where each time series contains hourly data and is divided into alternating periods (in-sample/out-of-sample) of one and a half years. The optimization period starts on January 1, 1987 and ends on December 31, 1994.

In a first phase, the FunctionSet is restricted to the logical functions {AND, OR, NOT, IF} and the TerminalSet contains a few pre-optimized indicators and two constant values: {G(x,16,0.32), G(x,10,0.28), G(x,8,0.34), +1, -1}. The evaluation is done with 20 independent runs, and each run evolves a population of 100 GPs over 100 generations with a maximum tree depth of 6. Among the trading models inferred with this approach, we obtain logical S-expressions such as

Model_1: (IF (OR 1 G(x, 10, 0.32)) (AND G(x, 16, 0.32) 1) G(x, 16, 0.32))


Fig. 6. Behaviour of trading model Model_1 monitoring the USD/DEM FX rate over a one-year analysis period.

The behavior of this model is illustrated in Fig. 6. Comparing the average performance of the best solutions found in these runs with the performance of the pre-optimized indicators shows a slightly better average performance in-sample. Because the function set may be too restrictive, we also consider a model with an extended function set. In this second phase, we enlarge the search space by including in the function set some basic mathematical functions, and in the terminal set a volatility indicator (range 16 days) plus a random number in the range [-2.0, 2.0]. We also use a less restrictive rule for generating the indicator signal:

IF |I| > K THEN M := I ELSE M := 0
The FunctionSet used in this application is now composed of {*, /, +, -, <, >, MIN, MAX, ABS, NOT, OR, AND, IF} and the TerminalSet of {M(x,16,0.32), M(x,10,0.28), M(x,8,0.34), v16, rnd[-2.0, 2.0]}, where v16 is the volatility indicator. Eq. (3) is used to transform the signal of the root node into the recommended position of the model, and the associated break-level K is set to 0.3.

The full optimization is performed in 20 independent runs. Each run is executed with 4 master nodes, each evolving a population of 100 GPs over 100 generations (maximum tree depth of 6). A migration of 5% of the GPs is done asynchronously each time a master has completed 10 generations. The imported GPs are selected randomly from the best 5% of the GPs of the other master nodes. All the master nodes use the same function and terminal sets.

In Table 2 we present the average quality of the results of the different runs compared to the average quality of the pre-optimized indicators. The results of the runs are grouped into 4 classes of decreasing out-of-sample performance.

In the first group of results (runs 1-5), it can be seen that the average yearly return is stable in both the learning and the test periods. However, the return time series itself presents more fluctuations (reflected in the fitness value) during the out-of-sample period. In the other groups, the increase of the in-sample quality is paid for by a clear decrease of the out-of-sample performance, which shows a higher degree of overfitting. This is also true of the GP model using the restricted function set described before.


Table 2
Average complexity, average yearly return E and fitness value X_eff (in percent) corresponding to the in-sample and out-of-sample periods. The results are given for the pre-optimized indicators and for each group of runs

Trading models   Average complexity   In sample (%)      Out of sample (%)
                                      E       X_eff      E       X_eff
Indicators        1                   5.85    1.72       4.65    -1.83
Runs 1-5         33                   5.97    1.98       5.61    -2.85
Runs 6-10        41                   6.32    2.14       4.15    -5.34
Runs 11-15       46                   6.12    2.54       3.85    -6.34
Runs 16-20       55                   8.23    2.86       2.61    -10.34


The best selected model in the first group has a complexity of 36 and is given by the S-expression:

(IF (MIN (ABS (MIN M(x,16,0.32) M(x,10,0.28))) (> -0.74 v16))
    (- M(x,16,0.32) (MIN v16 M(x,16,0.32)))
    (- (ABS (- (ABS (MAX 0.74 v16))
               (> (> -0.74 v16) (MIN v16 M(x,16,0.32)))))
       (> (IF 0.3 M(x,16,0.32) v16) (MIN v16 M(x,16,0.32)))))

The in-sample average return is 8.98% and the out-of-sample return is 5.12%. Such a decision tree is not easy to interpret, and some of its branches are duplicated or do not bring any information. The solutions provided by the other runs are in general more complex and less robust.

This study shows that genetic programming is an interesting tool for trading model optimization, but more work needs to be done to reduce the average complexity and increase the robustness of the solutions. A modification of the fitness function to penalize overly complex decision trees, and the introduction of building-block functions, would probably allow the genetic program to find a larger number of robust solutions.

8. Concluding remarks

We have presented a linearly scalable parallel implementation of the genetic programming paradigm on distributed memory machines. The parallel scheme consists of multiple master-slave instances sharing the processing nodes. Each instance is mapped on all the allocated nodes and each master node runs an evolutionary process evolving its own subpopulation. To relieve premature convergence, the different subpopulations exchange individuals asynchronously through the ring topology. As the evaluation time is scaled by the number of fitness cases, large problem sizes make the evaluation phase


compute intensive. The dynamic load balancing algorithm enhances processor utilization by considering the task grain variance at run time, as opposed to the static algorithm, which assumes equal-sized tasks. At evaluation, multiple threads are maintained on each node to overlap communication with useful computation. Given a problem with n fitness cases, the speedup saturates at about $8.7 \times 10^{-2}\, n$ on the IBM SP-2 machine. For $n = 10^2$ and $n = 10^3$, we obtained speedups of 7 and 8, respectively, on 10 processing nodes. Typical values of n being larger, we can expect interesting parallel performances.

Genetic programming has been used to infer robust trading strategies. A real-world price data set covering seven exchange rates was used to optimize the trading strategies. The average returns provided by the inferred models exceed 5%. These models can be expressed as logical combinations of robust indicators. Separating the data (in-sample/out-of-sample) inside each time series and diversifying the exchange rates contribute to reducing the overfitting caused by the presence of noisy data. However, the fitness measure, as it drives the evolutionary process, may direct the search towards unstable regions. Our results show that when addressing such highly complex problems it becomes necessary to decompose the problem representation, so that an overall solution can be viewed as a combination of small modules. Furthermore, we do not expect to have adopted an optimal fitness function. Genetic programming has exhibited reasonable promise for trading model optimization, which is not a well-understood domain.

Acknowledgements

The authors would like to acknowledge the Swiss National Science Foundation for its financial support. We would like to thank Olsen & Associates research institute for providing the price data and for their collaboration in the trading model application.

References

[1] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.

[2] J. Koza, Genetic Programming, MIT Press, 1992.

[3] Oak Ridge National Laboratory, PVM3 User's Guide and Reference Manual, 1994.

[4] A.P. Fraser, Genetic Programming in C++, 1994. Public domain genetic programming system.

[5] M. Oussaidène, B. Chopard, O.V. Pictet, M. Tomassini, Parallel genetic programming: An application to trading models evolution, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Genetic Programming, Proceedings of the First Annual Conference, July 28-31, Stanford University, The MIT Press, Cambridge, MA, 1996, pp. 357-362.

[6] J. Koza, D. Andre, Parallel genetic programming on a network of transputers, Technical Report CS-TR-95-1542, Computer Science Department, Stanford University, 1995.

[7] IBM Syst. J. 34 (2) (1995).

[8] IBM, IBM AIX PVMe User’s Guide and Subroutine Reference, 1994.

[9] F. Allen, R. Karjalainen, Using genetic algorithms to find technical trading rules, Technical report, Wharton School, University of Pennsylvania, 1993.

[10] O.V. Pictet, M.M. Dacorogna, B. Chopard, M. Oussaidène, R. Schirru, M. Tomassini, Using genetic algorithms for robust optimization in financial applications, Neural Network World 4 (1995) 573-587.

[11] O.V. Pictet, M.M. Dacorogna, U.A. Mueller, R.B. Olsen, J.R. Ward, Real-time trading models for foreign exchange rates, Neural Network World 6 (1992) 713-744.