Neural Computing Group, Department of Computing, University of Surrey. Matthew Casey, 21st April 2004. http://www.computing.surrey.ac.uk/personal/st/M.Casey/




Neural Computing Group

• Novel neural network architectures
  – Multi-net systems: ensemble and modular approaches
  – Combining GA with NN: representation
• Theoretical underpinnings
  – Multi-net systems: extending traditional techniques to define architecture, algorithm and explore properties


Neural Computing Group

• Cognitive modelling
  – Extrapolation: generalising to patterns not found in the training data
  – Numeric and language abilities: simulating child abilities and exploring biological aspects of neural networks
• Applications
  – Classification; clustering; prediction; information retrieval; bioinformatics


Multi-net Systems

• Learning and collaboration in multi-net systems
  – Single-net versus multi-net systems
  – In-situ learning
• Experimental results
  – Parallel combination of networks: ensemble
  – Sequential combination of networks: modular

• Formalising multi-net systems


Multi-net Systems

• Biological motivation for neural networks:
  – Hebb's neurophysiological postulate [1]
  – Learning across cell assemblies: neural integration
  – Functional specialism: analogy to multi-net systems
• Theoretical motivation
  – Generalisation improvements with multi-net systems
  – Ensemble and modular

• Learning in collaboration with modularisation


Single-net Systems

• Systems of one or more artificial neurons combined together in a single network
  – Parallel distributed processing [2] systems: (multi-layer) perceptron systems
  – Unsupervised learning: Kohonen's SOM [3]


Single-nets as Multi-nets?

XOR truth table:

  x1   x2  |  y
  -1   -1  | -1
  -1    1  |  1
   1   -1  |  1
   1    1  | -1

[Figure: XOR solved by a combination of linear decision boundaries; the four input points are labelled True (1) and False (-1), and no single linear boundary separates them.]
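To make the point concrete, here is a minimal numpy sketch (an illustration, not part of the original slides) in which two hand-picked linear decision boundaries, an OR unit and a NAND unit, are combined by a third unit to compute XOR over ±1 inputs:

```python
import numpy as np

def sign(v):
    # Threshold activation returning +/-1 (0 is mapped to +1 for simplicity).
    return np.where(v >= 0, 1, -1)

# Hand-picked weights: each hidden unit realises one linear decision boundary.
W_hidden = np.array([[ 1.0,  1.0],   # unit 1: fires for x1 OR x2
                     [-1.0, -1.0]])  # unit 2: fires for NOT (x1 AND x2)
b_hidden = np.array([1.0, 1.0])
w_out = np.array([1.0, 1.0])         # output unit: AND of the two boundaries
b_out = -1.0

def xor_net(x):
    h = sign(W_hidden @ x + b_hidden)   # outputs of the two linear boundaries
    return sign(w_out @ h + b_out)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, '->', int(xor_net(np.array([x1, x2]))))
# Prints -1, 1, 1, -1: the y column of the truth table above.
```

Neither hidden unit on its own separates the four points; only their combination, exactly as in the figure, computes XOR.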


From Single-nets to Multi-nets

• Multi-net systems appear to be a development of the parallel processing paradigm

• Can multi-net systems improve generalisation?
  – Modularisation with simpler networks?
  – Limited theoretical and empirical evidence
• Generalisation:
  – Balance prior knowledge and training
  – VC Dimension [4,5,6]
  – Bias/variance dilemma [7] (the standard decomposition is sketched below)
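For reference, the decomposition behind the dilemma (standard textbook form, not reproduced on the original slide): the expected squared error of an estimator f̂ splits into bias, variance and irreducible noise terms.

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^{2}\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{noise}}
```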


Multi-net Systems: Ensemble or Modular?

• Ensemble systems:
  – Parallel combination
  – Each network performs the same task
  – Simple ensemble
  – AdaBoost [8]
• Modular systems:
  – Each network performs a different (sub-)task
  – Mixture-of-experts [9] (top-down parallel competitive)
  – Min-max [10] (bottom-up static parallel/sequential)
  (a minimal sketch of the two combination styles follows below)
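As a sketch of the distinction (illustrative only; the member outputs and gate values below are made up), a simple ensemble takes the unweighted mean of members performing the same task, whereas a mixture-of-experts-style modular system weights members by a gating function so they can specialise:

```python
import numpy as np

def simple_ensemble(member_outputs):
    # Ensemble combination: every member does the same task; take the mean output.
    return np.mean(member_outputs, axis=0)

def gated_combination(member_outputs, gate_logits):
    # Modular (mixture-of-experts style) combination: a gate weights each member,
    # so different members can specialise on different sub-tasks.
    g = np.exp(gate_logits - np.max(gate_logits))
    g = g / g.sum()                                  # softmax gating weights
    return np.tensordot(g, member_outputs, axes=1)

# Toy usage with three members producing 2-dimensional outputs.
outputs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
print(simple_ensemble(outputs))                                  # unweighted mean
print(gated_combination(outputs, np.array([2.0, 1.0, -1.0])))    # gate favours member 1
```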


Categorising Multi-net Systems

• Sharkey's [11,12] combination strategies:
  – Parallel: co-operative or competitive; top-down or bottom-up; static or dynamic
  – Sequential
  – Supervisory
• Component networks may be [13]:
  – Pre-trained (independent)
  – Incrementally trained (iterative)
  – In-situ trained (simultaneous)


Multi-net Systems

• Categorisation schemes appear not to support the generalisation of multi-net system properties beyond specific examples
  – Ensemble: bias, variance and diversity [14]
  – Modular: bias and variance
  – What about measures such as the VC Dimension?
• Some use of in-situ learning
  – ME and HME [15]
  – Negative correlation learning [16]


So?

• Multi-net systems appear to offer a way in which generalisation performance and learning speed can be improved:
  – Yet limited theoretical and empirical evidence
  – Focus on parallel systems
• Limited use of in-situ learning despite motivation
  – Existing results show improvement
  – Can the approach be generalised?
• No general framework for multi-net systems
  – Difficult to generalise properties from categorisation


Ongoing Research

• Explore (potential) benefit of multi-net systems
  – Can we combine ‘simple’ networks to solve ‘complex’ problems: ‘superordinate’ systems with faster learning?
• Can learning improve generalisation?
  – Parallel: in-situ learning in the simple ensemble
  – Sequential: combining networks with in-situ learning
  – Does in-situ learning provide improved generalisation?
• Can we formally define multi-net systems?
  – A method to describe the architecture and learning algorithm for a general multi-net system


In-situ Learning

• Simultaneous training of neural networks within a combined system
  – Existing techniques focus more on pre-training
• Systems being explored:
  – Simple learning ensemble (SLE)
  – Sequential learning modules (SLM)
• Classification
  – MONK's problems (MONK 1, 2, 3) [17]
  – Wisconsin Breast Cancer Database (WBCD) [18]


Simple Learning Ensemble

• (Pre-trained) ensembles:
  – Train each network individually
  – Parallel combination of network outputs: mean output
  – Pre-training: how can we understand or control the combined performance of the ensemble?
  – Incremental: AdaBoost [8]
  – In-situ: negative correlation [16]
• In-situ learning:
  – Train in-situ and assess combined performance during training using early stopping
  – Generalisation loss early stopping criterion [19] (a sketch of this criterion follows below)
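A minimal sketch of the generalisation-loss criterion GL(t) from Prechelt [19]; the threshold alpha and the error trace below are illustrative values, not those used in the experiments:

```python
def generalisation_loss(val_errors):
    # GL(t) = 100 * (E_va(t) / E_opt(t) - 1), where E_opt(t) is the lowest
    # validation error of the combined system observed up to epoch t [19].
    e_opt = min(val_errors)
    return 100.0 * (val_errors[-1] / e_opt - 1.0)

def should_stop(val_errors, alpha=5.0):
    # Stop training in-situ once the generalisation loss exceeds the threshold alpha.
    return generalisation_loss(val_errors) > alpha

# Illustrative trace of the ensemble's combined validation error per epoch.
history = []
for err in [0.40, 0.31, 0.25, 0.24, 0.26, 0.27]:
    history.append(err)
    if should_stop(history):
        print(f"stop at epoch {len(history)}, GL = {generalisation_loss(history):.1f}%")
        break
```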


Sequential Learning Modules

• Sequential systems:
  – Can a combination of (simpler) networks give good generalisation and learning speed?
  – Typically pre-trained and for specific processing
• Sequential in-situ learning:
  – How can in-situ learning be achieved with sequential networks: target error/output?
  – Use unsupervised networks
  – Last network has target output and hence can be supervised (see the sketch below)
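An illustrative numpy sketch of the sequential in-situ idea (not the SLM implementation itself; the map size, learning rates and toy target rule are assumptions): an unsupervised SOM-like first module learns from the input alone, and only the final module, which receives the target, is trained with a supervised delta rule on the winner code:

```python
import numpy as np

def train_step(som, w_out, x, t, lr_som=0.1, lr_out=0.05):
    # Module 1 (unsupervised, SOM-like): move the winning unit towards the input.
    winner = np.argmin(np.linalg.norm(som - x, axis=1))
    som[winner] += lr_som * (x - som[winner])
    # Module 2 (supervised): delta rule on the one-hot winner code; only this
    # last module ever sees the target t, so the whole system can learn in-situ.
    h = np.zeros(som.shape[0])
    h[winner] = 1.0
    y = w_out @ h
    w_out += lr_out * np.outer(t - y, h)   # in-place update, visible to the caller
    return y

rng = np.random.default_rng(0)
som = rng.normal(scale=0.1, size=(9, 6))   # 3x3 map over 6 inputs (sizes are illustrative)
w_out = np.zeros((1, 9))

# Toy in-situ training loop on random +/-1 patterns with a made-up target rule.
for _ in range(200):
    x = rng.choice([-1.0, 1.0], size=6)
    t = np.array([1.0 if x[0] * x[1] > 0 else -1.0])
    train_step(som, w_out, x, t)
```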


Systems

• 100 trials of each system, each with different random initial weights

• Generalisation assessed using test data to give mean response

System   | Training       | Number of Components | Component Topology (MONK 1 / MONK 2 / MONK 3 / WBCD)
MLP (ES) | Early Stopping | 1                    | 6-3-1 / 6-2-1 / 6-4-1 / 9-5-2
MLP      | Fixed          | 1                    | 6-3-1 / 6-2-1 / 6-4-1 / 9-5-2
SE       | Early Stopping | 2 to 20              | 6-3-1 / 6-2-1 / 6-4-1 / 9-5-2
SLE      | Early Stopping | 2 to 20              | 6-3-1 / 6-2-1 / 6-4-1 / 9-5-2
SLM      | Fixed          | 2                    | 6-5x5: 25-1, 6-10x10: 100-1 or 6-20x20: 400-1 (MONK 1-3); 9-5x5: 25-2 or 9-10x10: 100-2 (WBCD)


Experimental Results

MONK 1
System   | Networks | Epochs | Response
MLP (ES) | 1        | 23     | 57.13%
MLP      | 1        | 1000   | 84.44%
SE       | 3        | –      | 55.75%
SLE      | 20       | 354    | 90.21%
SLM      | 10x10    | 1000   | 75.63%

MONK 2
System   | Networks | Epochs | Response
MLP (ES) | 1        | 314    | 65.21%
MLP      | 1        | 1000   | 66.29%
SE       | 20       | –      | 66.25%
SLE      | 20       | 1000   | 69.49%
SLM      | 20x20    | 1000   | 75.09%

MONK 3
System   | Networks | Epochs | Response
MLP (ES) | 1        | 5      | 63.10%
MLP      | 1        | 1000   | 83.39%
SE       | 18       | –      | 66.03%
SLE      | 19       | 47     | 78.57%
SLM      | 10x10    | 1000   | 84.10%

WBCD
System   | Networks | Epochs | Response
MLP (ES) | 1        | 3      | 76.01%
MLP      | 1        | 1000   | 92.68%
SE       | 20       | –      | 87.23%
SLE      | 16       | 4      | 88.47%
SLM      | 10x10    | 1000   | 97.29%


Comparison

• MONK's problems
  – Comparison of maximum responses [17]
  – MONK 1: 100%; SLE 98.4%; SLM 84.7%
  – MONK 2: 100%; SLE 74.5%; SLM 81.0%
  – MONK 3: 97.2%; SLE 83.1%; SLM 87.5%
  – Small spread of values, especially SLE
• WBCD
  – Comparison of mean responses [20]
  – AdaBoost 97.6%; SLE 88.47%; SLM 97.29%


SLE Validation Error: MONK 1

[Chart: SLE validation error on MONK 1 against the number of components (MLP (ES), 2, 4, 6, 8, 16, 20, MLP); y-axis: validation error (0 to 100).]


SLE Epochs: MONK 1

[Chart: SLE training epochs on MONK 1 against the number of components (MLP (ES), 2, 4, 6, 8, 16, 20); y-axis: epochs (0 to 500).]


In-situ Learning

• In-situ learning in multi-net systems:
  – Encouraging results from non-optimal systems
  – Comparison with existing single-net and multi-net systems (with and without early stopping)
  – Computational effort?
• Ensemble systems:
  – Effect of training times on bias, variance and diversity?
• Sequential systems:
  – Encouraging empirical results: theoretical results?
  – Automatic classification of unsupervised clusters


The Next Step?

• Empirical results are encouraging
  – But what is the theoretical basis of multi-net systems?
  – Are multi-net systems any better than monolithic solutions, and if so, which configurations are better?

• Early work: a formal framework for multi-net systems


Multi-net System Framework

• Previous work:
  – Framework for the co-operation of learning algorithms [21]
  – Stochastic model [22]
  – Importance Sampled Learning Ensembles (ISLE) [23]
  – Focus on supervised learning and specific architectures
• Jordan and Xu's (1995) definition of HME [24]:
  – Generalisation of HME
  – Abstraction of architecture from algorithm
  – Theoretical results for convergence


Multi-net System Framework

• Propose a modification to Jordan and Xu's definition of HME to provide a generalised multi-net system framework
  – HME combines the output of the expert networks through a weighting generated by a gating network
  – Replace the weighting by the (optional) operation of a network
  – Can be used for parallel, sequential and supervisory systems


Example: HME

[Figure: HME as an ordered tree of depth r = 2. The root v1 produces y1 by combining the outputs y11 and y12 of nodes v11 and v12 with gating weights g11 and g12; node v11 likewise combines the outputs y111 and y112 of experts v111 and v112 with gating weights g111 and g112. The input x is fed to the experts and gating networks.]


Example: Framework

[Figure: the same system expressed in the proposed framework. The gating networks now appear as ordinary child nodes: the root v1 has children v11, v12 and v13 (with outputs y11, y12 and y13), and node v11 has children v111, v112 and v113 (with outputs y111, y112 and y113); leaf nodes receive the input x, and each parent's output (y1, y11) is produced by its own function applied to its children's outputs.]


Definition

• A multi-net system consists of the ordered tree of depth r defined by the nodes v, with the root of the tree v_1 associated with the output y_1 ∈ R^m, such that:

  y_v = \begin{cases} f_v(y_{v1}, \ldots, y_{vK_v}) & \text{if } K_v > 0 \\ f_v(x) & \text{if } K_v = 0 \end{cases}

  where K_v is the number of children of node v.
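A minimal sketch of how the definition can be read operationally (an illustration under the definition above, not the authors' implementation; `Node`, `f` and the toy functions are hypothetical names): a node with children applies its function to the children's outputs, and a leaf applies its function directly to the input x.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    f: Callable                    # the node's component network / combining function
    children: List["Node"] = field(default_factory=list)

    def output(self, x):
        if self.children:          # K_v > 0: combine the children's outputs
            return self.f([child.output(x) for child in self.children])
        return self.f(x)           # K_v = 0: a leaf operates on the input directly

# Toy usage: a root that averages two 'expert' leaves (a simple ensemble in this framework).
expert1 = Node(f=lambda x: 0.9 * x)
expert2 = Node(f=lambda x: 1.1 * x)
root = Node(f=lambda ys: sum(ys) / len(ys), children=[expert1, expert2])
print(root.output(2.0))            # -> 2.0, the mean of the two expert outputs
```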


Multi-net System Framework

• Learning algorithm operates by modifying the state of the system as defined by the state associated with each node v
• Includes:
  – Pre-training
  – In-situ training
  – Incremental training (through pre- or in-situ training)


Future Work

• In-situ learning:
  – Look further at in-situ learning: is it a valid approach?
  – SLE: does learning promote diversity?
  – SLM: expand and explore limitations
• Framework:
  – Explore properties such as bias, variance, diversity and VC Dimension
  – Relate framework to existing systems and explore their properties


Questions?


References

1. Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: John Wiley & Sons.

2. McClelland, J.L. & Rumelhart, D.E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. Cambridge, MA.: A Bradford Book, MIT Press.

3. Kohonen, T. (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, vol. 43, pp. 59-69.

4. Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and Its Applications, vol. XVI(2), pp. 264-280.

5. Baum, E.B. & Haussler, D. (1989). What Size Net Gives Valid Generalisation? Neural Computation, vol. 1(1), pp. 151-160.


References

6. Koiran, P. & Sontag, E.D. (1997). Neural Networks With Quadratic VC Dimension. Journal of Computer and System Sciences, vol. 54(1), pp. 190-198.

7. Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, vol. 4(1), pp. 1-58.

8. Freund, Y. & Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference, pp. 148-156. Morgan Kaufmann.

9. Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250.

10. Lu, B. & Ito, M. (1999). Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks, vol. 10(5), pp. 1244-1256.

11. Sharkey, A.J.C. (1999). Multi-Net Systems. In Sharkey, A. J. C. (Ed), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 1-30. London: Springer-Verlag.


References

12. Sharkey, A.J.C. (2002). Types of Multinet System. In Roli, F. & Kittler, J. (Ed), Proceedings of the Third International Workshop on Multiple Classifier Systems (MCS 2002), pp. 108-117. Berlin, Heidelberg, New York: Springer-Verlag.

13. Liu, Y., Yao, X., Zhao, Q. & Higuchi, T. (2002). An Experimental Comparison of Neural Network Ensemble Learning Methods on Decision Boundaries. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 1, pp. 221-226. Los Alamitos, CA.: IEEE Computer Society Press.

14. Kuncheva, L.I. (2002). Switching Between Selection and Fusion in Combining Classifiers: An Experiment. IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 32(2), pp. 146-156.

15. Jordan, M.I. & Jacobs, R.A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, vol. 6(2), pp. 181-214.

16. Liu, Y. & Yao, X. (1999a). Ensemble Learning via Negative Correlation. Neural Networks, vol. 12(10), pp. 1399-1404.


References

17. Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Dzeroski, S., Fahlman, S.E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., van de Welde, W., Wenzel, W., Wnek, J. & Zhang, J. (1991). The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report CMU-CS-91-197. Pittsburgh, PA.: Carnegie-Mellon University, Computer Science Department.

18. Wolberg, W.H. & Mangasarian, O.L. (1990). Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences, USA, vol. 87(23), pp. 9193-9196.

19. Prechelt, L. (1996). Early Stopping - But When? In Orr, G. B. & Müller, K-R. (Ed), Neural Networks: Tricks of the Trade, LNCS 1524, pp. 55-69. Berlin, Heidelberg, New York: Springer-Verlag.

20. Drucker, H. (1999). Boosting Using Neural Networks. In Sharkey, A. J. C. (Ed), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 51-78. London: Springer-Verlag.


References

21. Bottou, L. & Gallinari, P. (1991). A Framework for the Cooperation of Learning Algorithms. In Lippmann, R.P., Moody, J.E. & Touretzky, D.S. (Ed), Advances in Neural Information Processing Systems, vol. 3, pp. 781-788.

22. Amari, S.-I. (1995). Information Geometry of the EM and em Algorithms for Neural Networks. Neural Networks, vol. 8(9), pp. 1379-1408.

23. Friedman, J.H. & Popescu, B. (2003). Importance Sampling: An Alternative View of Ensemble Learning. Presented at the 4th International Workshop on Multiple Classifier Systems (MCS 2003). Guildford, UK.

24. Jordan, M.I. & Xu, L. (1995). Convergence Results for the EM Approach to Mixtures of Experts Architectures. Neural Networks, vol. 8, pp. 1409-1431.