
1

Testing neural networks using the complete round robin method

Boris Kovalerchuk, Clayton Todd, Dan Henderson

Dept. of Computer Science, Central Washington University, Ellensburg, WA, 98926-7520

[email protected] [email protected] [email protected]

2

Problem

Common practice: Patterns discovered by learning methods such as neural networks (NN) are tested on the data before they are used for their intended purpose.

Reliability of testing: The conclusion about the reliability of discovered patterns depends on the testing method used.

Problem: A testing method may not test critical situations.

3

Result: a new testing method that

tests a much wider set of situations than the traditional round robin method, and

speeds up the required computations.

4

Novelty and Implementation

Novelty: The mathematical mechanism is based on the theory of monotone Boolean functions and on multithreaded parallel processing.

Implementation: The method has been implemented for backpropagation neural networks and successfully tested on 1024 neural networks using SP500 data.

5

1. Approach and Method

Common approach:

1. Select subsets in the data set D: a subset Tr for training and a subset Tv for validating the discovered patterns.

2. Repeat step 1 several times for different subsets.

3. Compare: if the results are similar to each other, then a discovered regularity can be called reliable for data D.

6

Known methods [Dietterich, 1997]

Random selection of subsets: bootstrap aggregation (bagging).

Selection of disjoint subsets: cross-validated committees.

Selection of subsets according to a probability distribution: boosting.

7

Problems of sub-sampling

Different regularities arise for different sub-samples of Tr.

Rejecting and accepting regularities heavily depends on the specific splitting of Tr.

Example: a non-stationary financial time series, with bear and bull market trends over different time intervals.

8

Splitting sensitive regularities for non-stationary data

[Diagram: training sample Tr plus independent data C. One split gives A and A’ with regularity 1 (bear market); another gives B’ and B with regularity 2 (bull market).]

Reg2 does not work on B’ = A.

Reg1 does not work on A’ = B.

9

Round robin method

Eliminates arbitrary splitting by examining several groups of subsets of Tr.

The complete round robin method examines all groups of subsets of Tr.

Drawback: there are 2^n possible subsets, where n is the number of groups of objects in the data set. Learning 2^n neural networks is a computational challenge.

10

Implementation

The complete round robin method is applicable to both attribute-based and relational data mining methods.

Here the method is illustrated for neural networks.

11

Data

Let M be a learning method and D a data set of N objects, each represented by m attributes: D = {d_i}, i = 1,…,N, with d_i = (d_i1, d_i2, …, d_im). Method M is applied to data D for knowledge discovery.

12

Grouping data

Example -- a stock time series: the first 250 data objects (days) belong to 1980, the next 250 objects (days) belong to 1981, and so on.

Similarly, half years, quarters and other time intervals can be used.

Any of these subsets can be used as training data.

13

Testing data

Example 1 (general data): training -- the odd groups #1, #3, #5, #7 and #9; testing -- #2, #4, #6, #8 and #10.

Example 2 (time series): training -- #1, #2, #3, #4 and #5; testing -- #6, #7, #8, #9 and #10.

A third part, completely independent later data C, is also used for testing.
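The grouping and the odd/even split of Example 1 can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names and the 250-day group size (one trading year, as in the slides) are assumptions.

```python
# Sketch (not the original software): cut a daily series into year-sized
# groups of 250 trading days, then take odd-numbered groups for training
# and even-numbered groups for testing, as in Example 1.

def make_groups(series, group_size=250):
    """Split a flat list of daily observations into consecutive groups."""
    return [series[i:i + group_size] for i in range(0, len(series), group_size)]

def split_odd_even(groups):
    """Training = groups #1, #3, #5, ...; testing = #2, #4, #6, ... (1-based)."""
    train = [g for k, g in enumerate(groups, start=1) if k % 2 == 1]
    test = [g for k, g in enumerate(groups, start=1) if k % 2 == 0]
    return train, test

series = list(range(2500))            # stand-in for 10 years of 250 days each
groups = make_groups(series)          # 10 yearly groups
train, test = split_odd_even(groups)  # 5 training groups, 5 testing groups
```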

14

Hypothesis of monotonicity

Let D1 and D2 be training data sets, and Perform1 and Perform2 the performance indicators of the models learned using D1 and D2.

Perform is a binary index: 1 stands for appropriate performance and 0 for inappropriate performance.

The hypothesis of monotonicity (HM): if D1 ⊇ D2 then Perform1 ≥ Perform2.  (1)

15

Assumption

If data set D1 covers data set D2, then the performance of method M on D1 should be better than or equal to the performance of M on D2.

That is, extra data bring more useful information than noise for knowledge discovery.

16

Experimental testing of the hypothesis

The hypothesis P(M, Di) ≥ P(M, Dj) for performance is not always true.

We generated neural networks for all 1024 subsets of the ten years

and found 683 comparable pairs of subsets with Di ⊇ Dj. Surprisingly, in this experiment

monotonicity was observed for all 683 combinations of years.

17

Discussion

The experiment provides strong evidence for using monotonicity along with the complete round robin method.

The incomplete round robin method assumes some kind of independence of the training subsets used.

The discovered monotonicity shows that this independence should at least be tested.

18

Formal notation

The error Er for data set D = {d_i}, i = 1,…,N, is the normalized error over all its components d_i:

Er = Σ_{i=1..N} (T(d_i) − J(d_i))² / Σ_{i=1..N} T(d_i)²

where T(d_i) is the actual target value for d_i and J(d_i) is the target value forecast delivered by the discovered model J, i.e., the trained neural network in our case.

19

Performance

Performance is measured via an error tolerance (threshold) Q0 on the error Er:

Perform = 1, if Er ≤ Q0;  Perform = 0, if Er > Q0.
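The error and performance indicators on these two slides can be sketched in Python. The denominator of the normalization (sum of squared targets) is an assumed reconstruction of the slide's formula, and the function names are illustrative.

```python
# Sketch of the Er and Perform indicators from the slides. The
# normalization by the sum of squared targets is an assumption
# reconstructed from the garbled formula; names are illustrative.

def normalized_error(T, J):
    """Er = sum_i (T(d_i) - J(d_i))^2 / sum_i T(d_i)^2 over all N objects."""
    numerator = sum((t - j) ** 2 for t, j in zip(T, J))
    denominator = sum(t ** 2 for t in T)
    return numerator / denominator

def perform(T, J, Q0=0.03):
    """Binary Perform index: 1 if Er <= Q0 (appropriate), else 0."""
    return 1 if normalized_error(T, J) <= Q0 else 0
```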

20

Binary hypothesis of monotonicity

Combinations of years are coded as binary vectors vi = (vi1, vi2, …, vi10), e.g. 0000011111, with 10 components ranging from (0000000000) to (1111111111):

in total 2^10 = 1024 data subsets.

Hypothesis of monotonicity: if vi ≥ vj then Performi ≥ Performj.  (2)

Here vi ≥ vj ⇔ vik ≥ vjk for all k = 1,…,10.
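The coding and the componentwise comparison in (2) can be sketched as follows; the function names are illustrative, not from the original software.

```python
# Sketch: code a set of training years as a 10-bit vector
# (1980 -> first component, ..., 1989 -> last component) and compare
# two vectors componentwise as in hypothesis (2).

def years_to_vector(years, base=1980, n=10):
    """E.g. {1985, 1986, 1988, 1989} -> (0,0,0,0,0,1,1,0,1,1)."""
    return tuple(1 if base + k in years else 0 for k in range(n))

def geq(vi, vj):
    """vi >= vj iff vik >= vjk for every component k."""
    return all(a >= b for a, b in zip(vi, vj))

v1 = years_to_vector({1985, 1986, 1988, 1989})
v2 = years_to_vector({1986, 1989})
```

Note that, as the next slide says, not every pair of vectors is comparable: for example, (1,0,…) and (0,1,…) satisfy neither vi ≥ vj nor vj ≥ vi.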

21

Monotone Boolean Functions

Not every vi and vj are comparable with each other by the "≥" relation.

Perform as a quality indicator Q: Q(M, D, Q0) = 1 ⇔ Perform = 1,  (3) where Q0 is some performance limit.

Monotonicity (2) rewritten: if vi ≥ vj then Q(M, Di, Q0) ≥ Q(M, Dj, Q0).  (4)

Q(M, D, Q0) is a monotone Boolean function of D [Hansel, 1966; Kovalerchuk et al., 1996].

22

Use of Monotonicity

Given a method M and data sets D1 and D2 with D2 ⊆ D1, monotonicity implies:

if method M does not perform well on data D1, then it will not perform well on data D2 either.

Under this assumption, we do not need to test method M on D2.
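A minimal sketch of this pruning rule follows. Here `perform_fn` stands in for the expensive train-and-test step on one subset; all names are hypothetical, and this is not the authors' implementation (which uses Hansel chains, described later).

```python
# Sketch of pruning with monotonicity: a failing larger set (Perform = 0)
# decides all of its subsets, and a succeeding set decides all of its
# supersets, so those vectors are never trained.

def evaluate_with_pruning(vectors, perform_fn):
    known = {}
    for v in sorted(vectors, key=sum, reverse=True):  # larger subsets first
        if v in known:
            continue
        known[v] = perform_fn(v)
        for w in vectors:
            if w in known:
                continue
            if known[v] == 0 and all(a >= b for a, b in zip(v, w)):
                known[w] = 0  # w is a subset of a failing set
            elif known[v] == 1 and all(b >= a for a, b in zip(v, w)):
                known[w] = 1  # w is a superset of a succeeding set
    return known
```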

23

Experiment

SP500: monotonicity allowed running method M 250 times instead of the complete 1024 times (see the tables that follow).

24

Experiments with SP500 and Neural Networks

A backpropagation neural network for predicting SP500 [Rao, Rao, 1993] achieved a

0.6% prediction error on the SP500 test data (50 weeks) with 200 weeks (about four years) of training data.

Is this result reliable? Will it be sustained for wider training and testing data?

25

Data for testing reliability

The same data:

training data -- all trading weeks from 1980 to 1989, and

independent testing data -- all trading weeks from 1990 to 1992.

26

Complete round robin

Generated all 1024 subsets of the ten training years (1980-1989).

Computed the 1024 corresponding backpropagation neural networks and their Perform values.

A 3% error tolerance threshold was used (Q0 = 0.03), higher than the 0.6% used for the smaller data set in [Rao, Rao, 1993].

27

Performance

Table 1. Performance of 1024 neural networks

Training  Testing  Number of neural networks  % of neural networks
   0         0              289                    28.25
   0         1               24                     2.35
   1         0               24                     2.35
   1         1              686                    67.05

A satisfactory performance is coded as 1 and a non-satisfactory performance as 0.

28

Analysis of performance

Consistent performance (training and testing): 67.05% + 28.25% = 95.3% with 3% error tolerance.

Sound performance on training data => sound performance on testing data (67.05%).

Unsound performance on training data => unsound performance on testing data (28.25%).

A random choice of training data from the ten-year SP500 data will fail to produce a regularity in 32.95% of cases, although regularities useful for forecasting do exist.

29

Specific Analysis

Table 2 shows 9 nested data subsets out of the possible 1024 subsets.

The nested sequence begins with a single year (1986).

This single year's data is too small to train a neural network to a 3% error tolerance.

30

Specific analysis

Table 2. Backpropagation neural network performance for different data subsets.

Binary code for set of years          Performance
(years 80 81 82 83 84 85 86 87 88 89)  Training  Testing (90-92)
0000001000                                0          0
0000001001                                0          0
0000001011                                0          0
0000011011                                1          1
0000111011                                1          1
0001111011                                1          1
0011111011                                1          1
0111111011                                1          1
1111111011                                1          1

9 nested data subsets of the possible 1024 subsets.

31

Nested sets

A single year (1986) is too small to train a neural network to a 3% error tolerance.

1986, 1988 and 1989 produce the same negative result.

With 1986, 1988, 1989 and 1985, the error moves below the 3% threshold.

Five or more years of data also satisfy the error criterion.

The monotonicity hypothesis is confirmed.

32

Analysis

Across all other 1023 combinations of years:

only a few combinations of four years satisfy the 3% error tolerance;

practically all five-year combinations satisfy the 3% threshold;

all combinations of more than five years satisfy this threshold.

33

Analysis

Four-year training data sets produce marginally reliable forecasts.

Example: 1980, 1981, 1986, and 1987, corresponding to the binary vector (1100001100), do not satisfy the 3% error tolerance.

34

Further analyses

Three levels of error tolerance Q0 were examined: 3.5%, 3.0% and 2.0%.

The number of neural networks with sound performance drops from 81.82% to 11.24% as the error tolerance moves from 3.5% to 2%.

NNs with a 3.5% error tolerance are much more reliable than networks with a 2.0% error tolerance.

35

Three levels of error tolerance

Table 3. Overall performance of 1024 neural networks with different error tolerances

Error tolerance  Training  Testing  Number of neural networks  % of neural networks
3.5%                0         0             167                     16.32
                    0         1               3                      0.293
                    1         0               6                      0.586
                    1         1             837                     81.81
3.0%                0         0             289                     28.25
                    0         1              24                      2.35
                    1         0              24                      2.35
                    1         1             686                     67.05
2.0%                0         0             845                     82.60
                    0         1              22                      2.15
                    1         0              31                      3.03
                    1         1             115                     11.24

36

Performance

[Chart: % of neural networks for each performance pair <training, testing> -- <0,0>, <1,1>, <0,1>, <1,0> -- at error tolerances 3.5%, 3.0% and 2.0%.]

37

Standard random choice method

With a 2% error tolerance, a random choice of training data will more often reject the training data as insufficient.

This standard approach does not even let us know how unreliable the result is without running all 1024 subsets of the complete round robin method.

38

Reliability of the 0.6% error tolerance

Rao and Rao [1993] obtained the 0.6% error for 200 weeks (about four years) drawn from the 10-year training data.

This result is unreliable (Table 3).

39

Underlying mechanism

The number of computations depends on the sequence of tested data subsets Di.

To optimize the testing sequence, Hansel's lemma [Hansel, 1966; Kovalerchuk et al., 1996] from the theory of monotone Boolean functions is applied to so-called Hansel chains of binary vectors.
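The slides do not show the chain construction itself, but one standard way to build Hansel chains (a symmetric chain decomposition of the binary n-cube, via the classical "grow and shrink" recursion) can be sketched as follows; the function name is illustrative.

```python
# Sketch: build Hansel chains for the n-cube. Each chain is an increasing
# sequence of binary vectors whose weights rise by one per step; together
# the chains partition all 2^n vectors.

def hansel_chains(n):
    chains = [[(0,), (1,)]]                      # the single chain for n = 1
    for _ in range(n - 1):
        new = []
        for chain in chains:
            # extend every vector with 0, then hop to the top vector with 1:
            grown = [v + (0,) for v in chain] + [chain[-1] + (1,)]
            # the remaining vectors extended with 1 form a shorter chain:
            shrunk = [v + (1,) for v in chain[:-1]]
            new.append(grown)
            if shrunk:
                new.append(shrunk)
        chains = new
    return chains
```

For n = 10 this yields C(10,5) = 252 chains that together cover all 1024 subset vectors.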

40

Computational challenge of Complete round robin method

Monotonicity and multithreading significantly speed up computing.

Using monotonicity with 1023 threads decreased the average runtime about 3.5 times, from 15-20 minutes to 4-6 minutes, to train 1023 neural networks in the case of mixed 1s and 0s for output.

41

Error tolerance and computing time

Different error tolerance values can change the output and runtime; the extreme cases have all 1's or all 0's as outputs.

File preparation is needed to train the 1023 neural networks.

The largest share of file-preparation time (41.5%) is taken by files using five years in the data subset (Table 5).

42

Runtime

Table 4. Runtime for different error tolerance settings (average time for 1023 neural networks)

Method                                           1 processor, no threads   1 processor, 1023 threads
Round robin with monotonicity,
  mixed 1's and 0's as output                    15-20 min.                4-6 min.
Round robin with monotonicity,
  all 1's as output                              10 min.                   3.5 min.

43

Time distribution

Table 5. Time for backpropagation and file preparation

Set of years   % of time in file preparation   % of time in backpropagation
0000011111              41.5%                            58.5%
0000000001              36.4%                            63.6%
1111111111              17.8%                            82.2%

44

General logic of software

The exhaustive option: generate all 1024 subsets using a file preparation program and compute backpropagation for all subsets.

The optimized option: generate Hansel chains, store them, and compute backpropagation only for the specific subsets dictated by the chains.

45

Implemented optimized option

1. For a given binary vector, the corresponding training data are produced.

2. Backpropagation is computed, generating the Perform value.

3. The Perform value is used along with the stored Hansel chains to decide which binary vector (i.e., subset of data) will be used next for learning neural networks.
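Under monotonicity, Perform along one Hansel chain is a block of 0's followed by a block of 1's, so a few evaluations locate the switch point and the rest of the chain is inferred without training. The sketch below uses a binary search within a single chain; this is a simplified view of step 3, since the original software also reuses results across chains.

```python
# Sketch: along one Hansel chain a monotone Perform looks like 0...0 1...1.
# Binary search finds the first vector with Perform = 1; every other value
# on the chain is then inferred, not trained. perform_fn stands in for the
# train-and-test step; names are illustrative.

def evaluate_chain(chain, perform_fn):
    lo, hi = 0, len(chain)          # hi = index of first Perform = 1 vector
    while lo < hi:
        mid = (lo + hi) // 2
        if perform_fn(chain[mid]) == 1:
            hi = mid                # switch point is at mid or earlier
        else:
            lo = mid + 1            # everything up to mid is 0
    return {v: (1 if i >= lo else 0) for i, v in enumerate(chain)}
```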

46

System Implementation

The exhaustive set of Hansel chains is generated first and stored for the desired case.

All 1024 subsets of data are created using a file preparation program.

In the sequence dictated by the Hansel chains, we produce the corresponding training data and compute backpropagation to generate the Perform value. The next vector to be used is chosen from the stored Hansel chains and the Perform value.

47

User Interface and Implementation

Built to take advantage of Windows NT threading.

Built with Borland C++ Builder 4.0.

Parameters for the neural networks and sub-systems are set up through the interface.

Various options, such as monotonicity, can be turned on or off through the interface.

48

Multithreaded Implementation

Independent sub-processes: the learning of an individual neural network or of a group of neural networks.

Each sub-process is implemented as an individual thread.

Threads run in parallel on several processors to further speed up computations.

49

Threads

There are two different types of threads: worker threads and communication threads.

Communication threads send and receive the data.

Worker threads are started by the servers to perform the calculations on the data.
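The original implementation used Windows NT threads from Borland C++. Purely as an illustration of the worker-thread pattern described here, the following Python sketch queues jobs (data subsets), lets worker threads pull and process them, and collects the results; all names are hypothetical.

```python
# Sketch of the worker-thread pattern: jobs are queued, worker threads
# pull a job, run the expensive step via handle(), and push the result
# back. A Python stand-in for the threading described on the slide.
import queue
import threading

def run_workers(jobs, handle, n_workers=4):
    in_q, out_q = queue.Queue(), queue.Queue()
    for job in jobs:
        in_q.put(job)

    def worker():
        while True:
            try:
                job = in_q.get_nowait()   # pull the next job to process
            except queue.Empty:
                return                    # no work left: thread finishes
            out_q.put((job, handle(job)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(out_q.queue)              # job -> result
```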

50

Client/Server Implementation

Decreases the workload per computer: the client does the data formatting; the servers perform all the data manipulation.

A client connects to multiple servers.

Each server performs work on the data and returns the results to the client.

The client formats the returned results and gives a visualization of them.

51

Client

Loads vectors into a linked list. Attaches a thread to each node. Connects each thread to an active server. Threads send the server an activation packet. Waits for the server to return results. Formats the results for output. Sends more work to the server.

52

Server

Obtains a message from the client that the client is connected.

Starts work on the data given to it by the client.

Returns the finished packet to the client. Loops.

53

System Diagram

[System diagram: Interface and Thread Monitor on the client side; threads communicate over network communication streams with Server 1, Server 2, …, Server 250.]