Aboul-Ella Hassanien, Ajith Abraham, and Francisco Herrera (Eds.) Foundations of Computational Intelligence Volume 2

Page 1: Foundations of Computational Intelligence

Studies in Computational Intelligence, Volume 202

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 181. Georgios Miaoulis and Dimitri Plemenos (Eds.)
Intelligent Scene Modelling Information Systems, 2009
ISBN 978-3-540-92901-7

Vol. 182. Andrzej Bargiela and Witold Pedrycz (Eds.)
Human-Centric Information Processing Through Granular Modelling, 2009
ISBN 978-3-540-92915-4

Vol. 183. Marco A.C. Pacheco and Marley M.B.R. Vellasco (Eds.)
Intelligent Systems in Oil Field Development under Uncertainty, 2009
ISBN 978-3-540-92999-4

Vol. 184. Ljupco Kocarev, Zbigniew Galias and Shiguo Lian (Eds.)
Intelligent Computing Based on Chaos, 2009
ISBN 978-3-540-95971-7

Vol. 185. Anthony Brabazon and Michael O’Neill (Eds.)
Natural Computing in Computational Finance, 2009
ISBN 978-3-540-95973-1

Vol. 186. Chi-Keong Goh and Kay Chen Tan
Evolutionary Multi-objective Optimization in Uncertain Environments, 2009
ISBN 978-3-540-95975-5

Vol. 187. Mitsuo Gen, David Green, Osamu Katai, Bob McKay, Akira Namatame, Ruhul A. Sarker and Byoung-Tak Zhang (Eds.)
Intelligent and Evolutionary Systems, 2009
ISBN 978-3-540-95977-9

Vol. 188. Agustín Gutierrez and Santiago Marco (Eds.)
Biologically Inspired Signal Processing for Chemical Sensing, 2009
ISBN 978-3-642-00175-8

Vol. 189. Sally McClean, Peter Millard, Elia El-Darzi and Chris Nugent (Eds.)
Intelligent Patient Management, 2009
ISBN 978-3-642-00178-9

Vol. 190. K.R. Venugopal, K.G. Srinivasa and L.M. Patnaik
Soft Computing for Data Mining Applications, 2009
ISBN 978-3-642-00192-5

Vol. 191. Zong Woo Geem (Ed.)
Music-Inspired Harmony Search Algorithm, 2009
ISBN 978-3-642-00184-0

Vol. 192. Agus Budiyono, Bambang Riyanto and Endra Joelianto (Eds.)
Intelligent Unmanned Systems: Theory and Applications, 2009
ISBN 978-3-642-00263-2

Vol. 193. Raymond Chiong (Ed.)
Nature-Inspired Algorithms for Optimisation, 2009
ISBN 978-3-642-00266-3

Vol. 194. Ian Dempsey, Michael O’Neill and Anthony Brabazon (Eds.)
Foundations in Grammatical Evolution for Dynamic Environments, 2009
ISBN 978-3-642-00313-4

Vol. 195. Vivek Bannore and Leszek Swierkowski
Iterative-Interpolation Super-Resolution Image Reconstruction: A Computationally Efficient Technique, 2009
ISBN 978-3-642-00384-4

Vol. 196. Valentina Emilia Balas, Janos Fodor and Annamária R. Varkonyi-Koczy (Eds.)
Soft Computing Based Modeling in Intelligent Systems, 2009
ISBN 978-3-642-00447-6

Vol. 197. Mauro Birattari
Tuning Metaheuristics, 2009
ISBN 978-3-642-00482-7

Vol. 198. Efren Mezura-Montes (Ed.)
Constraint-Handling in Evolutionary Optimization, 2009
ISBN 978-3-642-00618-0

Vol. 199. Kazumi Nakamatsu, Gloria Phillips-Wren, Lakhmi C. Jain, and Robert J. Howlett (Eds.)
New Advances in Intelligent Decision Technologies, 2009
ISBN 978-3-642-00908-2

Vol. 200. Dimitri Plemenos and Georgios Miaoulis
Visual Complexity and Intelligent Computer Graphics Techniques Enhancements, 2009
ISBN 978-3-642-01258-7

Vol. 201. Aboul-Ella Hassanien, Ajith Abraham, Athanasios V. Vasilakos, and Witold Pedrycz (Eds.)
Foundations of Computational Intelligence Volume 1, 2009
ISBN 978-3-642-01081-1

Vol. 202. Aboul-Ella Hassanien, Ajith Abraham, and Francisco Herrera (Eds.)
Foundations of Computational Intelligence Volume 2, 2009
ISBN 978-3-642-01532-8

Aboul-Ella Hassanien, Ajith Abraham, and Francisco Herrera (Eds.)

Foundations of Computational Intelligence Volume 2

Approximate Reasoning

Prof. Aboul-Ella Hassanien
Cairo University
Faculty of Computers and Information
Information Technology Department
5 Ahmed Zewal St.
Orman, Giza
E-mail: [email protected]://www.fci.cu.edu.eg/abo/

Prof. Ajith Abraham
Machine Intelligence Research Labs (MIR Labs)
Scientific Network for Innovation and Research Excellence
P.O. Box 2259
Auburn, Washington 98071-2259
USA
E-mail: [email protected]

Prof. Francisco Herrera
Soft Computing and Intelligent Information Systems
Department of Computer Science and Artificial Intelligence
ETS de Ingenierias Informática y de Telecomunicación
University of Granada
E-18071 Granada
Spain
E-mail: [email protected]

ISBN 978-3-642-01532-8 e-ISBN 978-3-642-01533-5

DOI 10.1007/978-3-642-01533-5

Studies in Computational Intelligence ISSN 1860-949X

Library of Congress Control Number: Applied for

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed in acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

Preface

Foundations of Computational Intelligence Volume 2: Approximation Reasoning: Theoretical Foundations and Applications

Human reasoning is usually very approximate and involves various types of uncertainties. Approximate reasoning is the computational modelling of any part of the process used by humans to reason about natural phenomena or to solve real world problems. The scope of this book includes fuzzy sets, Dempster-Shafer theory, multi-valued logic, probability, random sets, rough sets, near sets and hybrid intelligent systems. Besides research articles and expository papers on the theory and algorithms of approximation reasoning, papers on numerical experiments and real world applications were also encouraged. This volume comprises 12 chapters, including an overview chapter, providing up-to-date, state-of-the-art research on the applications of Computational Intelligence techniques for approximation reasoning. The volume is divided into 2 parts:

Part-I: Approximate Reasoning – Theoretical Foundations
Part-II: Approximate Reasoning – Success Stories and Real World Applications

Part I on Approximate Reasoning – Theoretical Foundations contains four chapters that describe several approaches of fuzzy and paraconsistent annotated logic approximation reasoning.

Chapter 1, “Fuzzy Sets, Near Sets, and Rough Sets for Your Computational Intelligence Toolbox” by Peters, considers how a user might utilize fuzzy sets, near sets, and rough sets, taken separately or taken together in hybridizations as part of a computational intelligence toolbox.

In multi-criteria decision making, it is necessary to aggregate (combine) utility values corresponding to several criteria (parameters). The simplest way to combine these values is to use linear aggregation. In many practical situations, however, linear aggregation does not adequately describe the actual decision making process, so non-linear aggregation is needed. From the purely mathematical viewpoint, the next natural step after linear functions is the use of quadratic functions. However, in decision making, a different type of non-linearity is usually more adequate than the quadratic one: fuzzy-type non-linearities like OWA or the Choquet integral that use min and max in addition to linear combinations. In Chapter 2, “Fuzzy Without Fuzzy: Why Fuzzy-Related Aggregation Techniques Are Often Better Even in Situations Without True Fuzziness”, Nguyen et al. give a mathematical explanation for this empirical phenomenon. Specifically, the authors show that approximation by using fuzzy methodology is indeed the best (in some reasonable sense).

In Chapter 3, “Intermediate Degrees Are Needed for the World to Be Cognizable: Towards a New Justification for Fuzzy Logic Ideas”, Nguyen et al. prove that intermediate degrees are needed to describe real-world processes. This provides an additional explanation for the success of fuzzy techniques (and other techniques which use intermediate degrees), which often goes beyond situations in which the intermediate degrees are needed to describe the experts’ uncertainty.

Chapter 4, “Paraconsistent Annotated Logic Program Before-after EVALPSN and Its Application” by Nakamatsu, proposes a paraconsistent annotated logic program called EVALPSN. In EVALPSN, an annotation called an extended vector annotation is attached to each literal. In addition, the author introduces the bf-EVALPSN and its application to real-time process order control and its safety verification with simple examples.

Part II on Approximate Reasoning – Success Stories and Real World Applications contains eight chapters that describe several success stories and real world applications of approximation reasoning.

In Chapter 5, “A Fuzzy Set Approach to Software Reliability Modeling”, Zeephongsekul provides a discussion of a fuzzy set approach, which is used to extend the notion of software debugging from a 0-1 (perfect/imperfect) crisp approach to one which incorporates some fuzzy sets ideas.

In Chapter 6, “Computational Methods for Investment Portfolio: the Use of Fuzzy Measures and Constraint Programming for Risk Management”, Magoc et al. present the state of the art in computational techniques for portfolio management, that is, how to optimize a portfolio selection process, and propose a novel approach involving a utility-based multi-criteria decision making setting and fuzzy integration over intervals.

In Chapter 7, “A Bayesian Solution to the Modifiable Areal Unit Problem” Hui explores how the Modifiable Areal Unit Problem (MAUP) can be described and potentially solved by the Bayesian estimation (BYE). Specifically, the scale and the aggregation problems are analyzed using simulated data from an individual-based model.

In Chapter 8, “Fuzzy Logic Control in Communication Networks”, Chrysostomou and Pitsillides discuss the difficulty of the congestion control problem and review the control approaches currently in use. The authors motivate the utility of Computational Intelligence based control and then, through a number of examples, illustrate congestion control methods based on fuzzy logic control.

In Chapter 9, “Adaptation in Classification Systems”, Bouchachia investigates adaptation issues in learning classification systems from different perspectives. Special attention is given to adaptive neural networks and the most visible incremental learning mechanisms. Adaptation is also incorporated in the combination of incremental classifiers in different ways so that adaptive ensemble learners are obtained. These issues are illustrated by means of a numerical simulation.

In Chapter 10, “Music Instrument Estimation in Polyphonic Sound Based on Short-Term Spectrum Match”, Jiang et al. provide a new solution to an important problem of instrument identification in polyphonic music: there is loss of information on non-dominant instruments during the sound separation process due to the overlapping of sound features. Experiments show that the sub-patterns detected from the power spectrum slices contain sufficient information for the multiple-timbre estimation tasks and improve the robustness of instrument identification.

In Chapter 11, “Ultrasound Biomicroscopy Glaucoma Images Analysis Based on Rough Set and Pulse Coupled Neural Network”, El-Dahshan et al. present a rough set and pulse coupled neural network scheme for Ultrasound Biomicroscopy (UBM) glaucoma image analysis. The Pulse Coupled Neural Network (PCNN) with a median filter was used to adjust the intensity of the UBM images. This is followed by applying the PCNN-based segmentation algorithm to detect the boundary of the interior chamber of the eye image. Then, glaucoma clinical parameters are calculated and normalized, followed by application of a rough set analysis to discover the dependency between the parameters and to generate a set of reducts that contain a minimal number of attributes.

In Chapter 12, “An Overview of Fuzzy C-Means Based Image Clustering Algorithms”, Huiyu Zhou and Gerald Schaefer provide an overview of several fuzzy c-means based image clustering concepts and their applications. In particular, they summarize the conventional fuzzy c-means (FCM) approaches as well as a number of its derivatives that aim at either speeding up the clustering process or at providing improved or more robust clustering performance.

We are very grateful to the authors of this volume and to the reviewers for their great efforts in reviewing and providing interesting feedback to the authors of the chapters. The editors would like to thank Dr. Thomas Ditzinger (Springer Engineering Inhouse Editor, Studies in Computational Intelligence Series), Professor Janusz Kacprzyk (Editor-in-Chief, Springer Studies in Computational Intelligence Series) and Ms. Heather King (Editorial Assistant, Springer Verlag, Heidelberg) for the editorial assistance and excellent cooperative collaboration to produce this important scientific work. We hope that the reader will share our joy and will find it useful!

December 2008
Aboul Ella Hassanien, Cairo, Egypt
Ajith Abraham, Trondheim, Norway
Francisco Herrera, Granada, Spain

Contents

Part I: Approximate Reasoning - Theoretical Foundations and Applications

Approximate Reasoning - Theoretical Foundations

Fuzzy Sets, Near Sets, and Rough Sets for Your Computational Intelligence Toolbox
James F. Peters

Fuzzy without Fuzzy: Why Fuzzy-Related Aggregation Techniques Are Often Better Even in Situations without True Fuzziness
Hung T. Nguyen, Vladik Kreinovich, Francois Modave, Martine Ceberio

Intermediate Degrees Are Needed for the World to Be Cognizable: Towards a New Justification for Fuzzy Logic Ideas
Hung T. Nguyen, Vladik Kreinovich, J. Esteban Gamez, Francois Modave, Olga Kosheleva

Paraconsistent Annotated Logic Program Before-after EVALPSN and Its Application
Kazumi Nakamatsu

Part II: Approximate Reasoning - Success Stories and Real World Applications

A Fuzzy Set Approach to Software Reliability Modeling
P. Zeephongsekul

Computational Methods for Investment Portfolio: The Use of Fuzzy Measures and Constraint Programming for Risk Management
Tanja Magoc, Francois Modave, Martine Ceberio, Vladik Kreinovich

A Bayesian Solution to the Modifiable Areal Unit Problem
C. Hui

Fuzzy Logic Control in Communication Networks
Chrysostomos Chrysostomou, Andreas Pitsillides

Adaptation in Classification Systems
Abdelhamid Bouchachia

Music Instrument Estimation in Polyphonic Sound Based on Short-Term Spectrum Match
Wenxin Jiang, Alicja Wieczorkowska, Zbigniew W. Ras

Ultrasound Biomicroscopy Glaucoma Images Analysis Based on Rough Set and Pulse Coupled Neural Network
El-Sayed A. El-Dahshan, Aboul Ella Hassanien, Amr Radi, Soumya Banerjee

An Overview of Fuzzy C-Means Based Image Clustering Algorithms
Huiyu Zhou, Gerald Schaefer

Author Index

Fuzzy Sets, Near Sets, and Rough Sets for Your Computational Intelligence Toolbox

James F. Peters*

Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, University of Manitoba, E1-390 Engineering Building, 75A Chancellor’s Circle, Winnipeg, Manitoba R3T 5V6
[email protected]

Summary. This chapter considers how one might utilize fuzzy sets, near sets, and rough sets, taken separately or taken together in hybridizations as part of a computational intelligence toolbox. These technologies offer set theoretic approaches to solving many types of problems where the discovery of similar perceptual granules and clusters of perceptual objects is important. Perceptual information systems (or, more concisely, perceptual systems) provide stepping stones leading to nearness relations and properties of near sets. This work has been motivated by an interest in finding a solution to the problem of discovering perceptual granules that are, in some sense, near each other. Fuzzy sets result from the introduction of a membership function that generalizes the traditional characteristic function. Near set theory provides a formal basis for observation, comparison and classification of perceptual granules. Near sets result from the introduction of a description-based approach to perceptual objects and a generalization of the traditional rough set approach to granulation that is independent of the notion of the boundary of a set approximation. Near set theory has strength by virtue of the strength it gains from rough set theory, starting with extensions of the traditional indiscernibility relation. This chapter has been written to establish a context for three forms of sets that are now part of the computational intelligence umbrella. By way of introduction to near sets, this chapter considers various nearness relations that define partitions of sets of perceptual objects that are near each other. Every perceptual granule is represented by a set of perceptual objects that have their origin in the physical world. Objects that have the same appearance are considered perceptually near each other, i.e., objects with matching descriptions. Pixels, pixel windows, and segmentations of digital images are given by way of illustration of sample near sets. This chapter also briefly considers fuzzy near sets and near fuzzy sets as well as rough sets that are near sets. The main contribution of this chapter is the introduction of a formal foundation for near sets considered in the context of fuzzy sets and rough sets.

Keywords: Description, fuzzy sets, near sets, perceptual granule, perceptual system, rough sets.

* This author gratefully acknowledges the insights and suggestions by Christopher Henry, Piotr Wasilewski and Andrzej Skowron concerning topics in this paper. This research has been supported by the Natural Sciences & Engineering Research Council of Canada (NSERC) grant 185986.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 3–25. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Near To
How near to the bark of a tree are drifting snowflakes,
swirling gently round, down from winter skies?
How near to the ground are icicles,
slowly forming on window ledges?

–Fragment of a Philosophical Poem [27].
–Z. Pawlak & J.F. Peters, 2002.

1 Introduction

This chapter considers how one might utilize fuzzy sets, near sets, and rough sets, considered separately and taken together as part of a computational intelligence toolbox. Near set theory provides a formal basis for observation, comparison and classification of perceptual granules. Near sets and the perception of nearness of objects were inspired by images in a philosophical poem written in 2002 [14]. Since that time, a considerable number of papers have been written about near set theory [21, 20, 29] and its applications [2, 4, 5, 3, 26, 25]. Near sets result from the introduction of a description-based approach [23] to the identification and analysis of perceptual objects and a generalization of the traditional rough set approach to granulation that is independent of the notion of the boundary of a set approximation.

Perceptual information systems (or, more concisely, perceptual systems) provide stepping stones leading to nearness relations and properties of near sets. This work has been motivated by an interest in finding a solution to the problem of discovering perceptual granules that are, in some sense, near each other. Near set theory provides a formal basis for observation, comparison and classification of perceptual granules. A perceptual granule is defined by a collection of objects that are graspable by the senses or by the mind. This is made clear in this article by considering various nearness relations that define partitions of sets of perceptual objects that are near each other. Every perceptual granule is represented by a set of perceptual objects that have their origin in the physical world. Objects that have the same appearance are considered perceptually near each other, i.e., objects with matching descriptions. Pixels, pixel windows, and segmentations of digital images are given by way of illustration of sample near sets. This chapter also briefly presents near sets arising from fuzzy sets and rough sets. The main contribution of this chapter is an overview of the basics of near sets considered separately and in the context of fuzzy sets and rough sets.

This chapter has the following organization. The basic notion of a perceptual system is presented in Sect. 2. Definitions and illustrations of indiscernibility, weak indiscernibility and weak tolerance relations are given in Sect. 3. Building on these relations, nearness relations are defined and illustrated in Sect. 4. Nearness relations lead to the introduction of perceptual near sets in Sect. 5. A short introduction to fuzzy near sets and near fuzzy sets is given in Sect. 6. The strong ties between near sets and rough sets are briefly discussed in Sect. 7.

2 Perceptual Systems

The word perception indicates a direction
rather than a primitive function. It is known that
the uniformity of apparent size of objects at different
distances, or of their colour in different lights,
is more perfect in children than in adults.

–Phenomenology of Perception.
–Maurice Merleau-Ponty, 1945.

Thus the near thing, as ‘the same’, appears now
from this ‘side’, now from that; and the ‘visual perspectives’
change–also, however, the other manners of appearance
(tactile, acoustic, and so forth), as we can observe
by turning our attention in the right direction.

–Cartesian Meditations.
–Edmund Husserl, 1929.

This section briefly presents the basis for perceptual systems that hearkens back to the original notion of a deterministic information system introduced by Zdzisław Pawlak [13] and elaborated in [11, 10].

2.1 Perceptual Object Descriptions

Perceptual objects are known by their descriptions. An object description is defined by means of a tuple of function values $\phi(x)$ associated with an object $x \in X$ (see Table 1). The important thing to notice is the choice of functions $\phi_i \in B$ used to describe an object of interest. Assume that $B \subseteq F$ (see Table 1) is a given set of functions representing features of sample objects $X \subseteq O$ and $F$ is finite. Let $\phi_i \in B$, where $\phi_i : O \longrightarrow \mathbb{R}$. In combination, the functions representing object features provide a basis for an object description $\phi : O \longrightarrow \mathbb{R}^l$, a vector containing measurements (returned values) associated with each functional value $\phi_i(x)$ for $x \in X$, where $|\phi| = l$, i.e. the description length is $l$.

Object Description: $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_i(x), \ldots, \phi_l(x))$.

The intuition underlying a description $\phi(x)$ is a recording of measurements from sensors, where each sensor is modelled by a function $\phi_i$.
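As a minimal sketch of an object description, the following Python fragment builds the tuple of probe-function values for one object. The probe names `grey_level` and `contrast`, the dictionary representation of an object, and the sample values are illustrative assumptions, not part of the chapter.

```python
from typing import Callable, Sequence, Tuple

# A probe function phi_i : O -> R, modelled here as a callable on a dict.
Probe = Callable[[dict], float]


def describe(x: dict, probes: Sequence[Probe]) -> Tuple[float, ...]:
    """Return the description (phi_1(x), ..., phi_l(x)) for the probes in B."""
    return tuple(float(phi(x)) for phi in probes)


# Hypothetical probes standing in for sensors (assumed for illustration).
def grey_level(x: dict) -> float:
    return x["grey"]


def contrast(x: dict) -> float:
    return x["contrast"]


B = [grey_level, contrast]                 # chosen probe functions, B subset of F
pixel = {"grey": 0.25, "contrast": 0.5}    # one sample object x
description = describe(pixel, B)
print(description, "length l =", len(description))
```

The description length equals the number of probes selected in `B`, so choosing a different `B` changes what it means for two objects to match.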

Table 1. Description Symbols

Symbol                      Interpretation
$\mathbb{R}$                Set of real numbers
$O$                         Set of perceptual objects
$X$                         $X \subseteq O$, set of sample objects
$x$                         $x \in O$, sample object
$F$                         A set of functions representing object features
$B$                         $B \subseteq F$
$\phi$                      $\phi : O \rightarrow \mathbb{R}^l$, object description
$l$                         $l$ is a description length
$i$                         $i \leq l$
$\phi_i$                    $\phi_i \in B$, where $\phi_i : O \longrightarrow \mathbb{R}$, probe function
$\phi(x)$                   $\phi(x) = (\phi_1(x), \ldots, \phi_i(x), \ldots, \phi_l(x))$, description
$\langle X, F \rangle$      $\phi(x_1), \ldots, \phi(x_{|X|})$, i.e., perceptual information system

Let $X, Y \subseteq O$ denote sets of perceptual objects. Sets $X, Y \subseteq O$ are considered near each other if the sets contain perceptual objects with at least partially matching descriptions. A perceptual object $x \in O$ is something presented to the senses or knowable by the mind [9]. In keeping with the approach to pattern recognition suggested by Pavel [12], the features of an object such as contour, colour, shape, texture, bilateral symmetry are represented by probe functions. A probe function can be thought of as a model for a sensor. A probe makes it possible to determine if two objects are associated with the same pattern without necessarily specifying which pattern (classification). A detailed explanation about probe functions vs. attributes in the classification of objects is given in [19].

2.2 Perceptual Systems: Specialized Deterministic Systems

For representing results of a perception, the notion of a perceptual system is briefly introduced in this section. In general, an information system is a triple $S = \langle Ob, At, \{Val_f\}_{f \in At} \rangle$ where $Ob$ is a set of objects, $At$ is a set of functions representing either object features or object attributes, and each $Val_f$ is a value domain of a function $f \in At$, where $f : Ob \longrightarrow \mathcal{P}(Val_f)$ ($\mathcal{P}(Val_f)$ is a power set of $Val_f$). If $f(x) \neq \emptyset$ for all $x \in Ob$ and $f \in At$, then $S$ is total. If $card(f(x)) = 1$ for every $x \in Ob$ and $f \in At$, then $S$ is deterministic. Otherwise $S$ is non-deterministic. In the case when $f(x) = \{v\}$, $\{v\}$ is identified with $v$. An information system $S$ is real valued iff $Val_f = \mathbb{R}$ for every $f \in At$. Very often a more concise notation is used: $\langle Ob, At \rangle$, especially when value domains are understood, as in the case of real valued information systems. Since we discuss results of perception, as objects we consider perceptual objects while $f \in At$ are interpreted as probe functions. Two examples of perceptual systems are given in Table 2.
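The total and deterministic conditions can be checked mechanically. The sketch below assumes an information system is stored as a nested dictionary `table[x][f]` holding the set $f(x)$; that representation, the function names, and the three toy systems are assumptions made for illustration.

```python
def is_total(table: dict) -> bool:
    """S is total when f(x) is non-empty for every object x and every f in At."""
    return all(len(values) > 0
               for row in table.values()
               for values in row.values())


def is_deterministic(table: dict) -> bool:
    """S is deterministic when card(f(x)) = 1 for every object x and every f in At."""
    return all(len(values) == 1
               for row in table.values()
               for values in row.values())


# table[x][f] is the set f(x), a subset of Val_f (toy data, assumed).
S_det = {"x1": {"f1": {0.0}, "f2": {1.0}},
         "x2": {"f1": {1.0}, "f2": {2.0}}}
S_nondet = {"x1": {"f1": {0.0, 1.0}, "f2": {1.0}}}   # f1(x1) has two values
S_partial = {"x1": {"f1": set(), "f2": {1.0}}}       # f1(x1) is empty

for name, S in [("S_det", S_det), ("S_nondet", S_nondet), ("S_partial", S_partial)]:
    print(name, "total:", is_total(S), "deterministic:", is_deterministic(S))
```

Every deterministic system is also total, since a set of cardinality one is non-empty; the converse fails, as `S_nondet` shows.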

Definition 1. Perceptual System
A perceptual system $\langle O, F \rangle$ is a real valued total deterministic information system where $O$ is a non-empty set of perceptual objects, while $F$ is a countable set of probe functions.

Table 2. Sample perceptual information systems

Sys. 1                                                    Sys. 2
X     $\phi_1$  $\phi_2$  $\phi_3$  $\phi_4$  d           Y     $\phi_1$  $\phi_2$  $\phi_3$  $\phi_4$  d
x1    0         1         0.1       0.75      1           y1    0         2         0.2       0.01      0
x2    0         1         0.1       0.75      0           y2    1         1         0.25      0.01      0
x3    1         2         0.05      0.1       0           y3    1         1         0.25      0.01      0
x4    1         3         0.054     0.1       1           y4    1         3         0.5       0.55      0
x5    0         1         0.03      0.75      1           y5    1         4         0.6       0.75      1
x6    0         2         0.02      0.75      0           y6    1         4         0.6       0.75      1
x7    1         2         0.01      0.9       1           y7    0         2         0.4       0.2       0
x8    1         3         0.01      0.1       0           y8    0         3         0.5       0.6       1
x9    0         1         0.5       0.1       1           y9    0         3         0.5       0.6       1
x10   1         1         0.5       0.25      1           y10   1         2         0.7       0.4       0
                                                          y11   1         4         0.6       0.8       1
                                                          y12   1         4         0.7       0.9       0
                                                          y13   1         1         0.25      0.01      0
                                                          y14   1         4         0.6       0.75      1

Fig. 1. Sample Percepts. Panels: 1.1 SCC leaf; 1.2 Shubert stomata 10x; 1.3 Pin leaf; 1.4 Pin stomata 100x.

The notion of a perceptual system admits a wide variety of different interpretations that result from the selection of sample perceptual objects contained in a particular sample space $O$. Perceptual objects are known by their descriptions.

2.3 Sample Perceptual System

By way of an illustration, let $\langle P, \phi \rangle$ denote a perceptual system where $P$ is a set of microscope images and $\phi$ is a probe function representing luminance contrast¹, respectively. A sample Shubert choke cherry leaf and Native Pin choke cherry leaf are shown in Figures 1.1 and 1.3. Leaf stomata (minute pores in the epidermis of a leaf) are shown in the microscope images magnified by 10x in Fig. 1.2 for the sample Shubert CC leaf and by 100x in Fig. 1.4 for the sample Native Pin CC leaf. Intuitively, if we compare image colours, luminance contrast or sub image shapes, the microscope leaf images are similar. By considering nearness relations in the context of a perceptual system, it is possible to classify sets of perceptual objects. A formal basis for the discovery of different forms of near sets is the focus of the remaining sections of this chapter.

¹ In digital images, luminance contrast can be controlled by converting irradiance (amount of light per unit area) into a grey value $g$ using a function $g(E) = E^{\gamma}$, where $E$ denotes irradiance level and luminance varies non-linearly with $\gamma$ (gamma) typically having a value of 0.4 [8].
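The grey-value mapping $g(E) = E^{\gamma}$ mentioned in the footnote to this section can be sketched in a few lines. The sketch assumes irradiance has already been normalised to the interval [0, 1], which the text does not state; the function name is hypothetical.

```python
def grey_value(irradiance: float, gamma: float = 0.4) -> float:
    """Map an irradiance level E to a grey value g(E) = E ** gamma.

    Assumes E is normalised to [0, 1]; gamma = 0.4 is the typical value
    quoted in the footnote.
    """
    if irradiance < 0:
        raise ValueError("irradiance must be non-negative")
    return irradiance ** gamma


for E in (0.0, 0.25, 0.5, 1.0):
    print(E, round(grey_value(E), 3))
```

With $\gamma < 1$ the mapping lifts dark values more than bright ones, which is the non-linear variation the footnote refers to.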

3 Relations and Classes

The basic idea in the near set approach to object recognition is to compare object descriptions. Sample perceptual objects $x, y \in O$, $x \neq y$, are near each other if, and only if, $x$ and $y$ have similar descriptions. Similarly, sets $X, Y$ are perceptually near each other in the case where there is at least one pair of objects $x \in X$, $y \in Y$ that have similar descriptions. In this section, two kinds of indiscernibility relations and a tolerance relation are briefly introduced. These relations make it possible to define various nearness relations and make it possible to provide a formal foundation for near sets. Because of the importance of image analysis as an application area for near sets, this section illustrates the relations both with images and with perceptual information tables. This practise is continued in the sequel to this section.
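The idea of set nearness stated in this paragraph can be sketched as an existence check over pairs. The sketch below reads "similar descriptions" as "agreeing on at least one probe function", which is an assumption made only for illustration (the chapter defines its nearness relations formally in Sect. 4); the objects and probes are toy data.

```python
def near_sets(X, Y, probes) -> bool:
    """True if some x in X and y in Y agree on at least one probe function.

    Interpreting 'similar descriptions' as a match on at least one probe
    is an assumption of this sketch.
    """
    return any(phi(x) == phi(y) for x in X for y in Y for phi in probes)


# Toy objects: (identifier, grey level). Probes read one component each.
def ident(o):
    return o[0]


def grey(o):
    return o[1]


X = [(0, 0.1), (1, 0.05)]
Y = [(2, 0.1), (3, 0.9)]
Z = [(5, 0.7)]

print(near_sets(X, Y, [ident, grey]))  # one pair shares grey level 0.1
print(near_sets(X, Z, [ident, grey]))  # no pair agrees on any probe
```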

3.1 Indiscernibility and Tolerance Relations

Recall that each $\phi$ defines the description of an object (see Table 1). To establish a nearness relation, we first consider the traditional indiscernibility relation. Let $B \subseteq F$ denote a set of functions representing perceptual object features. The indiscernibility relation $\sim_B$ introduced by Zdzisław Pawlak [13] is distinguished from weak indiscernibility $\simeq$ introduced by Ewa Orłowska [10]. In keeping with the original indiscernibility relation symbol $\sim_F$ [13], the symbol $\simeq$ is used to denote weak indiscernibility instead of the notation wind [10]. The pioneering work by Andrzej Skowron and Jarosław Stepaniuk on tolerance relations [32] plays an important role in providing a foundation for a near set approach to tolerance spaces.

Definition 2. Indiscernibility Relation
Let $\langle O, F \rangle$ be a perceptual system. For every $B \subseteq F$ the indiscernibility relation $\sim_B$ is defined as follows:

$$\sim_B = \{(x, y) \in O \times O \mid \forall \phi \in B, \ \| \phi(x) - \phi(y) \| = 0\},$$

where $\| \cdot \|$ represents the $l_2$ norm. If $B = \{\phi\}$ for some $\phi \in F$, for simplicity, instead of $\sim_{\{\phi\}}$, we write $\sim_{\phi}$.
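A direct, unoptimised reading of Definition 2 enumerates all pairs of objects and keeps those on which every probe in $B$ agrees. In the sketch below each probe is real valued, so the norm reduces to an absolute value; the object names and the `values` table are hypothetical.

```python
from itertools import product


def indiscernibility_relation(objects, B):
    """Return {(x, y) in O x O : |phi(x) - phi(y)| = 0 for every phi in B}."""
    return {(x, y)
            for x, y in product(objects, repeat=2)
            if all(abs(phi(x) - phi(y)) == 0 for phi in B)}


# Hypothetical probe values for three objects (assumed for illustration).
values = {"a": (0, 1.0), "b": (0, 1.0), "c": (1, 1.0)}


def phi1(x):
    return values[x][0]


def phi2(x):
    return values[x][1]


O = list(values)
rel_B = indiscernibility_relation(O, [phi1, phi2])
rel_phi2 = indiscernibility_relation(O, [phi2])
print(sorted(rel_B))
print(len(rel_phi2))
```

Dropping `phi1` from `B` makes all three objects indiscernible, which illustrates how the choice of probes controls the granularity of the relation.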

Example 1. Clustering Matching Leaf Pixels
For the two microscopic views of leaf stomata shown in Fig. 1.2 (Shubert choke cherry leaf stomata) and Fig. 1.4 (Native Pin cherry leaf stomata), consider the greyscale view of fragments of these leaves shown in Fig. 2.

Fig. 2. Near Leaf Fragments

There are areas of both leaf fragments that contain pixels with matching grey levels. This view of the two leaf fragments is expressed formally in the context of the indiscernibility relation. Let g denote a probe function that measures the grey level of a pixel and let x,y denote pixels in the leaf fragments shown in Fig. 2. Then the relation ∼g is defined in the following way:

x ∼g y ⇐⇒ g(x) = g(y).

Classes of pixels with the same grey level are denoted by x/∼g (Pin Cherry) and y/∼g (Shubert CC) in Fig. 2. In the case where two classes of pixels are in relation ∼g to each other, we obtain a cluster of pixels. For example, x ∼g y suggests the formation of a cluster of pixels denoted by

x/∼g ∪ y/∼g .

A sample cluster of pixels with matching grey levels is depicted in Fig. 2. The individual classes containing pixels with matching descriptions and the cluster of the two classes in Fig. 2 are examples of perceptual granules.

Let 〈ImPin,F〉 denote perceptual system Sys. L1 with

ImPin = {ImPin/∼g | ImPin/∼g = class of pixels in Pin cherry leaf},
FImPin = {g}.

Similarly, let 〈ImShubert,F〉 denote perceptual system Sys. L2 with

ImShubert = {ImShubert/∼g | ImShubert/∼g = class of pixels in Shubert CC leaf},
FImShubert = {g}.

An obvious extension (not shown here) is the partition of any leaf fragment into non-overlapping sets of pixels having matching descriptions, i.e., sets of pixels with matching grey levels. It is important to notice that each particular set of matching pixels may be in regions of the image that are not contiguous, i.e., pixels with matching grey levels can be located anywhere in an image.
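The partition of an image into classes of pixels with matching grey levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2×3 image and the name grey_level_classes are invented here and are not the leaf data of Fig. 2, and the probe function g is taken to be the stored grey value of each pixel.

```python
from collections import defaultdict

def grey_level_classes(image):
    """Group pixel coordinates (row, col) by grey level.

    Each returned list plays the role of one class x/~g: all pixels
    whose grey level matches, wherever they sit in the image.
    """
    classes = defaultdict(list)
    for r, row in enumerate(image):
        for c, grey in enumerate(row):
            classes[grey].append((r, c))
    return dict(classes)

# Hypothetical 2x3 greyscale image (values invented for illustration).
image = [
    [10, 20, 10],
    [30, 10, 20],
]
classes = grey_level_classes(image)
for grey, pixels in sorted(classes.items()):
    print(grey, pixels)
```

In this toy image the class for grey level 10 contains pixels that do not touch each other, which is the non-contiguity noted above.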

Example 2. Let 〈O1,F1〉 denote perceptual system Sys. 1 with O1 = {x1, ... ,x10}, F1 = {φ1,φ2,φ3,φ4}, where the values of probe functions from F1 are given in the left-hand side of Table 2. Similarly, let 〈O2,F2〉 denote perceptual system Sys. 2 with O2 = {y1, ... ,y14}, F2 = {φ1,φ2,φ3,φ4}, where the values of the probe functions from F2 are given in the right-hand side of Table 2. The perceptual systems 〈O1,F1〉, 〈O2,F2〉 have partitions (1) and (2) of the space of percepts defined by relations ∼F1 and ∼F2.

O1/∼F1 = {{x1,x2},{x3},{x4},{x5},{x6},{x7},{x8},{x9},{x10}}, (1)

O2/∼F2 = {{y1},{y2,y3,y13},{y4},{y5,y6},{y7},{y8,y9},{y10},{y11},{y12},{y14}}. (2)

If we consider only probe function φ3 relative to O1, then we obtain, e.g., several equivalence classes such as (3), each containing a pair of objects.

x1/∼φ3= {x1,x2}, (3)

x7/∼φ3= {x7,x8}, (4)

x9/∼φ3= {x9,x10}. (5)

Again, for example, if we probe O2 with φ3, we obtain, e.g., a number of multi-object classes such as the one in (6).

y2/∼φ3= {y2,y3,y13}, (6)

y4/∼φ3= {y4,y8,y9}, (7)

y5/∼φ3= {y5,y6,y11,y14}, (8)

y10/∼φ3= {y10,y12}. (9)

Definition 3. Weak Indiscernibility Relation
Let 〈O,F〉 be a perceptual system. For every B ⊆ F the weak indiscernibility relation ≃B is defined as follows:

≃B = {(x,y) ∈ O×O | ∃ φ ∈ B, ‖φ(x)−φ(y)‖ = 0}.

If B = {φ} for some φ ∈ F, instead of ≃{φ} we write ≃φ.

Example 3. Let 〈O1,F〉 denote perceptual system Sys. 1 with O1 = {x1, ... ,x10}, F = {φ1,φ2,φ3,φ4}, where the values of probe functions from F are given in the left-hand side of Table 2. Similarly, let 〈O2,F〉 denote perceptual system Sys. 2 with O2 = {y1, ... ,y14}, where the values of the probe functions from F are given in the right-hand side of Table 2. Let X ⊂ O1, X = {x1,x9,x10} and Y ⊂ O2, Y = {y1,y8,y10,y11,y12}. Consider partitions X/≃φ3 and Y/≃φ3 given in (10) and (11), respectively.

X/≃φ3 = {{x1},{x9,x10}}, (10)

Y/≃φ3 = {{y1},{y8},{y10},{y11},{y12}}. (11)

Remark 1. Notice that the class {x1} ∈ X/≃φ3 contains only a single object, since there is no other object x ∈ X such that φ3(x1) = φ3(x). Similarly, each of the classes in Y/≃φ3 contains only a single object.

Definition 4. Weak Tolerance Relation
Let 〈O,F〉 be a perceptual system and let ε ∈ ℜ (reals). For every B ⊆ F the weak tolerance relation ≅B,ε is defined as follows:

≅B,ε = {(x,y) ∈ O×O | ∃ φ ∈ B, ‖φ(x)−φ(y)‖ ≤ ε}.

That is, in general, the relation ≅B,ε is reflexive and symmetric but not transitive. This relation is very important in discovering near sets, since it defines tolerance classes relative to a threshold ε, rather than requiring strict equality of probe function values as in the case of the indiscernibility relations ∼B and ≃B (see, e.g., [24]).
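The three relations of Defs. 2 to 4 differ only in the quantifier over B and in the comparison used, which a short Python sketch makes explicit. The feature table below is hypothetical (it is not Table 2), B is represented as a list of column indices, and the function names are chosen here for illustration only.

```python
from math import sqrt

# Hypothetical table: object -> (phi1, phi2). Values invented for illustration.
TABLE = {
    "x1": (0.10, 0.75),
    "x2": (0.10, 0.75),
    "x3": (0.10, 0.20),
    "x4": (0.18, 0.90),
}

def l2(a, b):
    """l2 norm of the difference of two scalar probe-function values."""
    return sqrt((a - b) ** 2)

def indiscernible(x, y, B):
    """Def. 2: every probe function in B gives matching values."""
    return all(l2(TABLE[x][i], TABLE[y][i]) == 0 for i in B)

def weakly_indiscernible(x, y, B):
    """Def. 3: at least one probe function in B gives matching values."""
    return any(l2(TABLE[x][i], TABLE[y][i]) == 0 for i in B)

def weakly_tolerant(x, y, B, eps):
    """Def. 4: at least one probe function in B differs by at most eps."""
    return any(l2(TABLE[x][i], TABLE[y][i]) <= eps for i in B)

B = [0, 1]  # use both phi1 and phi2
print(indiscernible("x1", "x2", B))         # both features match
print(weakly_indiscernible("x1", "x3", B))  # only phi1 matches
print(weakly_tolerant("x1", "x4", B, 0.1))  # phi1 differs by 0.08
```

Replacing `all` by `any` is the whole difference between Def. 2 and Def. 3; replacing `== 0` by `<= eps` gives Def. 4.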

Remark 2. Notice that Def. 4 represents a special case. That is, in general, the sets X and Y represent sample sets of observations from distinct perceptual systems. Hence, it is possible to state a proposition to this effect.

Definition 5. Let P1 = 〈O1,F〉 denote perceptual system P1. Similarly, let P2 = 〈O2,F〉 denote a second, distinct perceptual system. Also, let ε ∈ ℜ. P1 has a weak tolerance relation to P2 if, and only if, O1 ≅F,ε O2.

Proposition 1. Let Sys1 = 〈O1,F〉 denote perceptual system Sys1. Similarly, let Sys2 = 〈O2,F〉 denote a second, distinct perceptual system with the same set of features F. Let B ⊆ F and choose ε. Then

Sys1 ≅B,ε Sys2 ⇐⇒ O1 ≅B,ε O2.

Example 4. Clusters of Similar Leaf Pixels
For the two microscopic views of leaf stomata shown in Fig. 1.2 (Shubert choke cherry leaf stomata) and Fig. 1.4 (Native Pin cherry leaf stomata), consider the greyscale view of fragments of these leaves shown in Fig. 2. Let g denote the grey level of a pixel and let x,y denote pixels in the leaf fragments shown in Fig. 3. Then the relation ≅{g},ε is defined in the following way:

x ≅{g},ε y ⇐⇒ ‖g(x)−g(y)‖ ≤ ε.

Classes of pixels with similar grey levels are denoted by x/≅{g},0.2 (Native Pin cherry) and y/≅{g},0.2 (Shubert choke cherry) in Fig. 3 for ε = 0.2. In the case where two classes of pixels are in relation ≅{g},ε to each other, we obtain a cluster of similar pixels. For example, x ≅{g},0.2 y suggests the formation of a cluster of pixels denoted by

x/≅{g},0.2 ∪ y/≅{g},0.2.

A sample cluster of similar pixels with the norm of grey level differences within ε = 0.2 of each other is depicted in Fig. 3.

Fig. 3. Near Leaf Pixels

Example 5. Let 〈O1,F〉 denote perceptual system Sys. 1 with O1 = {x1, ... ,x10}, F = {φ1,φ2,φ3,φ4}, where the values of probe functions from F are given in the left-hand side of Table 2. Similarly, let 〈O2,F〉 denote perceptual system Sys. 2 with O2 = {y1, ... ,y14}, where the values of the probe functions from F are given in the right-hand side of Table 2. Let ε = 0.1 for both perceptual systems. For example, let φ3 ∈ F. The perceptual system 〈O1,{φ3}〉 has tolerance classes (12), (13), (14) defined by relation ≅φ3,0.1.

x1/≅φ3,0.1 = {x1,x2,x5,x6,x7,x8}, (12)

x3/≅φ3,0.1 = {x3,x4}, (13)

x9/≅φ3,0.1 = {x9,x10}. (14)

For example, in x3/≅φ3,0.1, we have

|φ3(x3)−φ3(x4)| = |0.05−0.054| ≤ 0.1.

Similarly, the perceptual system 〈O2,{φ3}〉 has tolerance classes (15), (16), (17), (18) defined by relation ≅φ3,0.1.

y1/≅φ3,0.1 = {y1,y2,y3,y13}, (15)

y4/≅φ3,0.1 = {y4,y5,y6,y8,y9,y11,y14}, (16)

y7/≅φ3,0.1 = {y7,y4,y8,y9}, (17)

y10/≅φ3,0.1 = {y5,y6,y10,y11,y12,y14}. (18)

For example, in y7/≅φ3,0.1, we have

|φ3(y7)−φ3(y4)| = |0.4−0.5| ≤ 0.1,

|φ3(y7)−φ3(y8)| = |0.4−0.5| ≤ 0.1,

|φ3(y7)−φ3(y9)| = |0.4−0.5| ≤ 0.1,

|φ3(y8)−φ3(y9)| = |0.5−0.5| ≤ 0.1.
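A tolerance class of the kind listed in (12) to (18) can be computed by collecting every object whose probe value lies within ε of a chosen object. The following sketch uses three invented objects a, b, c with made-up values (Table 2 is not reproduced here), so it illustrates the mechanism only; it also shows why such classes may overlap.

```python
def tolerance_class(x, phi, eps):
    """Objects y with |phi(x) - phi(y)| <= eps, for a single probe function.

    `phi` is a dict mapping object names to probe-function values.
    """
    return {y for y in phi if abs(phi[x] - phi[y]) <= eps}

# Hypothetical probe values, chosen so that neighbouring objects differ by 0.08.
phi = {"a": 0.40, "b": 0.48, "c": 0.56}
eps = 0.1

class_a = tolerance_class("a", phi, eps)
class_b = tolerance_class("b", phi, eps)
class_c = tolerance_class("c", phi, eps)
print(sorted(class_a), sorted(class_b), sorted(class_c))
```

Here a is within ε of b, and b is within ε of c, yet a and c are 0.16 apart, so the relation is not transitive and the classes of a and c overlap in b without coinciding.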

4 Nearness Relations

Nearness. The state, fact or quality of being near. 1. Close kinship or relationship.

–Oxford English Dictionary, 1933.

Three basic nearness relations are briefly presented and illustrated in terms of visual as well as numerical perceptual systems in this section. In keeping with the intuitive notion of nearness enunciated in the Oxford English Dictionary, one establishes the nearness of perceptual granules (e.g., perceptual objects such as image pixels, classes of pixels with matching descriptions, and clusters of classes of pixels with similar descriptions) by discovering relations that underlie perceptions of ‘closeness’. Nearness relations themselves are defined relative to the fundamental relations coming from rough set theory [13] and near set theory [20, 30, 24].

Table 3. Relation and Partition Symbols

Symbol            Interpretation
B                 see Table 1
φ                 probe function in B
ε                 ε ∈ [0,1]
∼B                {(x,y) | f(x) = f(y) ∀ f ∈ B}, indiscernibility relation [13]
≃B                weak indiscernibility relation [10]
≅B,ε              weak tolerance relation
x/∼B              x/∼B = {y ∈ X | y ∼B x}, elementary set (class)
O/∼B              O/∼B = {x/∼B | x ∈ O}, quotient set
‖φ(x)−φ(y)‖       ‖·‖ = l2 norm
⋈F                nearness relation symbol
⋈̲F                weak nearness relation symbol
⋈̲F,ε              weak tolerance nearness relation symbol

Definition 6. Nearness Relation [30]
Let 〈O,F〉 be a perceptual system and let X,Y ⊆ O. The set X is perceptually near to the set Y (X ⋈F Y), if and only if there are x ∈ X and y ∈ Y such that x ∼F y (see Table 3).

Example 6. Perceptually Near Leaf Fragments
Consider the perceptual systems 〈ImPin,F〉, 〈ImShubert,F〉 from Example 1:

ImPin = {ImPin/∼g | ImPin/∼g = class of pixels in Pin cherry leaf},
FImPin = {g},

ImShubert = {ImShubert/∼g | ImShubert/∼g = class of pixels in Shubert CC leaf},
FImShubert = {g},

ImPin ⋈g ImShubert.

Example 7. Consider the perceptual systems 〈O1,F〉, 〈O2,F〉 given in Table 2. From Example 3, we obtain

B = {φ3}, where φ3 ∈ F,
Xnew = x9/∼φ3 = {x9,x10}, from Example 3,
Ynew = y8/∼φ3 = {y4,y8,y9},
Xnew ⋈φ3 Ynew, since φ3(x9) = φ3(y8) = 0.5.

Definition 7. Weak Nearness Relation [30]
Let 〈O,F〉 be a perceptual system and let X,Y ⊆ O. The set X is weakly near to the set Y within the perceptual system 〈O,F〉 (X ⋈̲F Y) iff there are x ∈ X and y ∈ Y and there is B ⊆ F such that x ≃B y. If a perceptual system is understood, then we say shortly that a set X is weakly near to set Y (see Table 3).

Example 8. Weakly Near Leaf Fragments
Let r,g,b,gr denote red, green, blue, grey in the RGB colour model². Consider the perceptual systems 〈ImPin,FImPin〉, 〈ImShubert,FImShubert〉 in Example 1, where FImPin, FImShubert contain more than one probe function.

ImPin = {ImPin/∼gr | ImPin/∼gr = class of pixels in Pin cherry leaf},
FImPin = {r,g,b,gr},

ImShubert = {ImShubert/∼gr | ImShubert/∼gr = class of pixels in Shubert CC leaf},
FImShubert = {r,g,b,gr},

ImPin ⋈̲gr ImShubert.

² e.g., r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B), gr = (R+G+B)/3, where R,G,B represent the amounts of red, green, blue used to form a particular colour, also known as the tristimulus values [8].

Example 9. Consider the perceptual systems 〈O1,F〉, 〈O2,F〉 given in Table 2.

B = {φ3}, where φ3 ∈ F,
X = {x1,x2,x7,x8,x9,x10},
Y = {y4,y5,y6,y8,y9,y11},
X ⋈̲φ3 Y, since we can find x ∈ X, y ∈ Y where x ≃φ3 y, e.g.,
φ3(x9) = φ3(y8) = 0.5.

Definition 8. Weak Tolerance Nearness Relation [24]
Let 〈O,F〉 be a perceptual system and let X,Y ⊆ O, ε ∈ [0,1]. The set X is perceptually near to the set Y within the perceptual system 〈O,F〉 (X ⋈̲F,ε Y) iff there exist x ∈ X, y ∈ Y and there is a φ ∈ F such that x ≅φ,ε y (see Table 3). If a perceptual system is understood, then we say shortly that a set X is perceptually near to a set Y in a weak tolerance sense of nearness.

Example 10. Weak Tolerance Near Leaf Fragments
From Example 8, consider

ε = 0.2,

ImPin = {ImPin/∼gr | ImPin/∼gr = class of pixels in Pin cherry leaf},
FImPin = {r,g,b,gr},

ImShubert = {ImShubert/∼gr | ImShubert/∼gr = class of pixels in Shubert CC leaf},
FImShubert = {r,g,b,gr},

ImPin ⋈̲gr,0.2 ImShubert.

An example of a cluster of weak tolerance near sets is shown in Fig. 3.

Example 11. Let 〈O1,F〉 denote perceptual system Sys. 1 with O1 = {x1, ... ,x10}, F = {φ1,φ2,φ3,φ4}, where the values of probe functions from F are given in the left-hand side of Table 2. Similarly, let 〈O2,F〉 denote perceptual system Sys. 2 with O2 = {y1, ... ,y14}, where the values of the probe functions from F are given in the right-hand side of Table 2. Now choose ε and arbitrary samples X1 and Y1 so that they are also weak tolerance near sets.

ε = 0.1,
B = {φ3}, where φ3 ∈ F,
X1 ⊆ O1, Y1 ⊆ O2,
X1 = {x1,x2,x7,x8,x9,x10},
Y1 = {y4,y5,y6,y8,y9,y11},
X1 ⋈̲φ3,ε Y1, since we can find x ∈ X1, y ∈ Y1 where x ≅φ3,ε y, e.g.,

|φ3(x9)−φ3(y8)| = |0.5−0.5| = 0 ≤ 0.1; again, e.g.,

|φ3(x10)−φ3(y11)| = |0.1−0.2| = 0.1 ≤ 0.1.

Remark 3. In Example 11, we know that X ⋈̲F,ε Y, since there exist x ∈ X, y ∈ Y (namely, x9,y8) such that

|φ3(x)−φ3(y)| ≤ ε.

We can generalize the result from Example 11 in Prop. 2 by extending the idea in Prop. 1.

Proposition 2. Let Sys1 = 〈O1,F〉 denote perceptual system Sys1. Similarly, let Sys2 = 〈O2,F〉 denote a second, distinct perceptual system. Then

Sys1 ⋈̲F,ε Sys2 ⇐⇒ O1 ⋈̲F,ε O2.
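Each nearness relation in Defs. 6 to 8 is an existence test over pairs drawn from the two sets. A minimal Python sketch follows, assuming two small hypothetical feature tables (not Table 2) that share the same probe functions, with B given as a list of column indices; the function names are illustrative.

```python
from itertools import product

# Hypothetical descriptions: object -> (phi1, phi2). Values are invented.
X_TABLE = {"x1": (0.5, 0.75), "x2": (0.1, 0.90)}
Y_TABLE = {"y1": (0.5, 0.60), "y2": (0.2, 0.30)}

def near(X, Y, B):
    """Def. 6: some pair (x, y) matches on every probe function in B."""
    return any(all(X[x][i] == Y[y][i] for i in B) for x, y in product(X, Y))

def weakly_near(X, Y, B):
    """Def. 7: some pair (x, y) matches on at least one probe function in B."""
    return any(any(X[x][i] == Y[y][i] for i in B) for x, y in product(X, Y))

def weak_tolerance_near(X, Y, B, eps):
    """Def. 8: some pair (x, y) is within eps on at least one probe function."""
    return any(
        any(abs(X[x][i] - Y[y][i]) <= eps for i in B) for x, y in product(X, Y)
    )

print(near(X_TABLE, Y_TABLE, [0, 1]))                      # no pair matches on both
print(weakly_near(X_TABLE, Y_TABLE, [0, 1]))               # x1, y1 match on phi1
print(weak_tolerance_near(X_TABLE, Y_TABLE, [1], 0.2))     # |0.75 - 0.60| <= 0.2
```

A single witnessing pair is enough in every case, which is why the examples above justify nearness by exhibiting one pair such as x9, y8.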

5 Near Sets

Object recognition problems, especially in images [4], and the problem of the nearness of objects have motivated the introduction of near sets (see, e.g., [21]). Since we are mainly interested in comparing perceptual granules for selected real-valued probe functions, only weakly near sets and weakly tolerant near sets are briefly considered in this section, based on the weak nearness relation [30, 24] ⋈̲F in Def. 7 and the weak tolerance nearness relation [24] ⋈̲F,ε in Def. 8. These two forms of near sets are especially useful in discovering near fuzzy and fuzzy near sets as well as rough near and near rough sets. Other forms of near sets are introduced in [21, 20, 30].

5.1 Weakly Near Sets

Definition 9. Weakly Near Sets
Let 〈O,F〉 be a perceptual system and let X,Y ⊆ O, X ≠ Y. Let F denote a non-empty set of probe functions representing features of objects in O. A set X is weakly near Y iff X ⋈̲F Y, i.e., there exist φ ∈ F, x ∈ X, y ∈ Y such that x ≃φ y.

Example 12. Weakly Near Sub-Images
Let 〈Im,F〉 be a perceptual information system for a sample space such that Im is an image, i.e., perceptual objects here mean pixels in Im. Let X,Y ⊆ Im, X ≠ ∅, Y ≠ ∅, i.e., X and Y are subimages of image Im. Let F denote a set of probe functions representing image features. Let gr : Im → R denote a function that maps each pixel to the grey level of the pixel. Assume that there are x ∈ X, y ∈ Y such that x ∼gr y, i.e., gr(x)−gr(y) = 0. Therefore, X ⋈̲F Y, i.e., image X is weakly near to Y. An illustration of weakly near images is given in Example 8.

Fig. 4. Motic Stereo Zoom SMZ140/143 Microscope

5.2 Tolerance Near Sets

Definition 10. Tolerance Near Sets
Let 〈O,F〉 be a perceptual system and let X ⊆ O. A set X is a tolerance near set iff there is Y ⊆ O such that X ⋈̲F,ε Y. The family of near sets of a perceptual system 〈O,F〉 is denoted by NearF(O).

In effect, tolerance perceptual near sets are those sets that are defined by the nearness relation ⋈̲F,ε.

Example 13. Tolerance Near Images 1 For an illustration of tolerance near images,see Example 10.

Example 14. Tolerance Near Images 2
This example illustrates the use of a Motic stereo zoom SMZ140/143 microscope in comparing images showing fossils contained in a sample piece of Dominican amber. This particular microscope has a working distance of 80 mm and a magnification range between 2.5x and 120x. The setup shown in Fig. 4 was used to capture the amber images shown in Fig. 5. Amber is an ancient tree resin. At the time when the resin was sticky, insects and other small organisms became trapped on the resin surface and were gradually engulfed by the flowing resin [1]. It has also been pointed out that amber preserves organisms in finer fidelity than perhaps any other kind of fossil [1]. The particular amber examined in this example comes from the Dominican Republic. Dominican amber belongs to the mid-Miocene period, approximately 17-20 million years ago [7]. Microscopic images of insects fossilized in Miocene Dominican amber are shown on a 100 µm (micrometer) scale in Fig. 5.

Fig. 5. Sample Miocene Dominican amber (5.1: Miocene moth; 5.2: Miocene beetle)

Fig. 6. Amber segmentations, 5x5, ε = 0.1 (6.1: Moth sets; 6.2: Beetle sets)

The fragment of Dominican amber in Fig. 5.1 shows a fossil, probably an Acrolophus moth (see, e.g., a similar fossil also preserved in Dominican amber [1]). In another part of the same piece of amber shown in Fig. 5.2, there is another tiny fossil, probably a predatory larva of the cucujoid family Discolomidae (for a similar fossil, see, e.g., [1]). The scale for the two fossils shown in [1] is millimeters, not micrometers. These amber images are compared by segmenting them into 5x5 subimages (see Fig. 6) and 3x3 subimages (see Fig. 7). Subimages containing pixels with the same average grey level are masked (each with a colour representing a particular grey level). Let gr denote a function to compute the average grey level of a subimage, and let x,y denote n×n subimages. Then tolerance classes are identified using ≅gr,ε, where

x ∼=gr, ε y ⇐⇒ ‖ gr(x)− gr(y) ‖ ≤ ε.

This leads to a collection of tolerance classes for each image. For example, if we consider 5×5 subimages and let ε = 0.1, we obtain the tolerance classes shown in Fig. 6. In Fig. 6.1, the moth shows up as a collection of tolerance classes represented by a single colour, except for the boundary of the moth that is surrounded by other tolerance classes representing varying average grey levels in the subimages along the border of the fossil. A similar result is obtained in Fig. 6.2 for the fossilized beetle larva. We get a quite different result if we consider 3×3 subimages and let ε = 0.1. The smaller subimages lead to a more accurate representation of the perceptual granules in the original microscopic images.

Fig. 7. Amber segmentations, 3x3, ε = 0.01 (7.1: Moth sets; 7.2: Beetle sets)
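The segmentation step described above (average grey level per n×n subimage, then grouping subimages whose averages differ by at most ε) can be sketched as follows. The 4×4 image, the subimage size n = 2 and the helper names are assumptions made for illustration; they are not the amber images or the software used to produce Figs. 6 and 7.

```python
def block_means(image, n):
    """Average grey level gr of each non-overlapping n x n subimage (row-major)."""
    rows, cols = len(image), len(image[0])
    means = []
    for r in range(0, rows - rows % n, n):
        for c in range(0, cols - cols % n, n):
            block = [image[i][j] for i in range(r, r + n) for j in range(c, c + n)]
            means.append(sum(block) / len(block))
    return means

def tolerance_neighbourhoods(values, eps):
    """For each subimage index, the indices whose average is within eps of it."""
    return [
        {j for j, v in enumerate(values) if abs(u - v) <= eps} for u in values
    ]

# Hypothetical 4x4 image with grey levels scaled to [0, 1].
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.25, 0.25],
]
means = block_means(image, 2)
neighbourhoods = tolerance_neighbourhoods(means, 0.2)
print(means)
print(neighbourhoods)
```

Shrinking n gives more subimages and therefore finer classes, while shrinking ε splits classes apart; both knobs are varied between Fig. 6 and Fig. 7.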

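The goal in this work is to measure the degree of nearness of pairs of images as a means of discovering similar images. A tolerance relation-based nearness measure (19) has recently been introduced [6]. Let F denote a set of probe functions in a perceptual system 〈O1,F〉 and assume B ⊆ F. Then

\[
NM_{\cong_B} \;=\; \sum_{x/_{\cong_B} \in\, X/_{\cong_B}} \;\sum_{y/_{\cong_B} \in\, Y/_{\cong_B}} \frac{\xi\left(x/_{\cong_B},\, y/_{\cong_B}\right)}{\max\left(|x/_{\cong_B}|,\, |y/_{\cong_B}|\right)}, \tag{19}
\]

where

\[
\xi\left(x/_{\cong_B},\, y/_{\cong_B}\right) =
\begin{cases}
\min\left(|x/_{\cong_B}|,\, |y/_{\cong_B}|\right), & \text{if } \|\phi(x)-\phi(y)\| \le \varepsilon,\\
0, & \text{otherwise.}
\end{cases}
\]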

For simplicity, we write NM≅B instead of NM≅B,ε. For this example, assume O consists of n×n subimages. Then NM≅B = 0.0817 for 5×5 subimages with ε = 0.01. For the same ε, NM≅B = 0.460517 for 3×3 subimages shown in Fig. 7. This matches our intuition, since the fossils are similar in size but not in shape. In addition, notice that the penumbra region surrounding the border of the two fossils is more pronounced (evident) in the finer-grained tolerance classes in Fig. 7. NM≅B in (19) is an example of a characteristic function that defines a fuzzy set (we consider this in the sequel to this section).
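To make the arithmetic in (19) concrete, the sketch below reads the measure as a double sum, over pairs of tolerance classes, of ξ divided by the size of the larger class. That reading, the representation of each class as a (representative probe value, class size) pair, and the sample numbers are all assumptions made here; the sketch is not the implementation behind the values 0.0817 and 0.460517 reported above.

```python
def nearness_measure(classes_x, classes_y, eps):
    """Sum of xi / max(|class_x|, |class_y|) over all pairs of classes.

    Each class is given as (value, size), where `value` stands in for the
    probe value phi of a representative object and `size` for the class
    cardinality. xi is min(size_x, size_y) when the representatives are
    within eps of each other, and 0 otherwise.
    """
    total = 0.0
    for value_x, size_x in classes_x:
        for value_y, size_y in classes_y:
            if abs(value_x - value_y) <= eps:
                xi = min(size_x, size_y)
            else:
                xi = 0
            total += xi / max(size_x, size_y)
    return total

# Hypothetical tolerance classes for two images.
classes_x = [(0.20, 4), (0.80, 2)]
classes_y = [(0.25, 2), (0.50, 3)]
nm = nearness_measure(classes_x, classes_y, 0.1)
print(nm)
```

Only the pair (0.20, 0.25) falls within ε = 0.1, and it contributes min(4, 2)/max(4, 2), so pairs of similar classes of similar size push the measure up.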

Example 15. Let 〈O1,F〉 denote perceptual system Sys. 1 with O1 = {x1, ... ,x10}, F = {φ1,φ2,φ3,φ4}, where the values of probe functions from F are given in the left-hand side of Table 2. Similarly, let 〈O2,F〉 denote perceptual system Sys. 2 with O2 = {y1, ... ,y14}, where the values of the probe functions from F are given in the right-hand side of Table 2. Now choose samples X and Y that are also weak tolerance near sets. Sets X,Y in Example 11 are near sets, since X ⋈̲φ3,ε Y. Again, for example, consider the following near sets extracted from Table 2.

ε = 0.3,
B = {φ3},
X1 ⊆ O1, Y1 ⊆ O2,
X1 = {x1,x2,x5,x6,x7,x8,x9,x10},
Y1 = {y4,y5,y6,y8,y9,y10,y11,y12},
X1 ⋈̲φ3,ε Y1, since we can find x ∈ X1, y ∈ Y1, where x ≅φ3,0.3 y, e.g.,
x9 ≅φ3,0.3 y10, since |φ3(x9)−φ3(y10)| = |0.5−0.7| = 0.2 ≤ 0.3.

The basic idea here is to look for sets of objects containing at least one pair of objects that satisfy the weak tolerance relation. Consider, for example, sets X2 ⊆ O1, Y2 ⊆ O2 extracted from Table 2 in (23) and (24).

ε = 0.3, (20)

B = {φ4}, (21)

X2 ⊆ O1, Y2 ⊆ O2, (22)

X2 = {x1,x2,x5,x6,x7,x8,x9}, (23)

Y2 = {y5,y6,y8,y9,y10,y11,y12,y14}, (24)

X2 ⋈̲φ4,ε Y2, since we can find x ∈ X2, y ∈ Y2, where x ≅φ4,0.3 y, e.g., (25)

x1 ≅φ4,0.3 y8, since |φ4(x1)−φ4(y8)| = |0.75−0.6| = 0.15 ≤ 0.3; again, e.g.,

x7 ≅φ4,0.3 y11, since |φ4(x7)−φ4(y11)| = |0.9−0.8| = 0.1 ≤ 0.3.

6 Fuzzy Sets

Fuzzy sets were introduced by Lotfi A. Zadeh in 1965 [33], viewed as a generalization of traditional sets.

Definition 11. Fuzzy Set
Let X denote a set of objects and let A ⊆ X. A fuzzy set is a pair (A,μA) such that μA : A → [0,1].

The membership function μA generalizes the usual characteristic function ε : X → {0,1}.


Fig. 8. Sample Fuzzy Near Sets

6.1 Near Fuzzy Sets

Fuzzy sets A1 and A2 shown in Fig. 8 are also near sets inasmuch as each fuzzy set has a non-empty core. Let X be a problem domain for a fuzzy set A. By definition [18], the core of A is

core(A) = {x ∈ X | A(x) = 1} .

The core of A is an example of a probe function that defines the class

x/≃core(A) = {y ∈ X | core(A)(x) = core(A)(y) = 1}.

It can also be argued that 〈X,core(A)〉 is a perceptual system. In the case where a pair of fuzzy sets has non-empty cores, the fuzzy sets satisfy the condition for the weak nearness relation, i.e., we can find x ∈ X, y ∈ Y for (X,A1), (Y,A2) where x ≃core y. In effect,

A1 ⋈̲core A2.

Proposition 3. Fuzzy sets with non-empty cores are near sets.
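A small sketch may help to see why non-empty cores are enough. It assumes a discrete domain and two triangular membership functions invented for illustration (they are not the sets A1, A2 of Fig. 8), and it treats the indicator of the core as the probe function.

```python
def core(domain, mu):
    """core(A) = {x in domain | mu(x) = 1}."""
    return {x for x in domain if mu(x) == 1}

def mu1(x):
    """Hypothetical triangular membership function peaking at x = 3."""
    return max(0.0, 1 - abs(x - 3) / 3)

def mu2(x):
    """Hypothetical triangular membership function peaking at x = 7."""
    return max(0.0, 1 - abs(x - 7) / 2)

domain = range(0, 11)
core1 = core(domain, mu1)
core2 = core(domain, mu2)

# With the core indicator as probe function, any x in core1 and y in core2
# both have probe value 1, so the pair witnesses weak nearness.
weakly_near_by_core = bool(core1) and bool(core2)
print(core1, core2, weakly_near_by_core)
```

The two fuzzy sets here have disjoint supports around different peaks, yet they count as near because the probe function only records membership in the core.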

6.2 Fuzzy Near Sets

Fuzzy sets that are near each other can be partitioned into classes containing objects that have matching descriptions. For example, in Fig. 9, two fuzzy sets are shown:

(X,f1), where X = [0,1.5], f1(x) = |sin(x) sin(10x)|,

(Y,f2), where Y = [0.5,1.5], f2(y) = |sin(y) sin(10y)|.

Fig. 9. Sample Fuzzy Sets

Fig. 10. Sample Fuzzy Near Sets

Let x,x′ ∈ X and y,y′ ∈ Y. Then define tolerance classes for each fuzzy set.

x/≅f1,ε = {x′ | ‖f1(x)−f1(x′)‖ ≤ ε},

y/≅f2,ε = {y′ | ‖f2(y)−f2(y′)‖ ≤ ε}.

Example 16. Fuzzy Weakly Near Sets
Let 〈X,{f1}〉, 〈Y,{f2}〉 denote perceptual systems. Let ε = 0.1. Then ≅f1,0.1 defines a partition of the fuzzy set (X,f1) and ≅f2,0.1 defines a partition of the fuzzy set (Y,f2). For example, the partition of X is defined by

x/≅f1,0.1 = {x′ | ‖f1(x)−f1(x′)‖ ≤ 0.1}, x,x′ ∈ A in Fig. 10.

In addition, the two fuzzy sets are weakly near each other, since the objects in x/≅f1,0.1 and y/≅f2,0.1 have matching descriptions for each subinterval of the domains of the two functions, starting with A,A′ shown in Fig. 10. In effect,

X ⋈̲B,ε Y.

Proposition 4. Fuzzy sets (X,φ), (Y,φ) are weakly near sets if, and only if, there exists at least one tolerance class x/≅φ,ε in (X,φ) and y/≅φ,ε in (Y,φ) such that

x/≅φ,ε ⋈̲φ,ε y/≅φ,ε.
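The tolerance classes for the two fuzzy sets above can be approximated numerically by sampling the domains. The sketch below assumes a sampling step of 0.01 and ε = 0.1, and uses the membership function |sin(x) sin(10x)| given earlier for both sets; the helper names and the sampling are illustrative choices, not part of the original construction in Fig. 10.

```python
from math import sin

def f(x):
    """Membership function shared by both fuzzy sets: |sin(x) sin(10x)|."""
    return abs(sin(x) * sin(10 * x))

def sample(lo, hi, step=0.01):
    """Evenly spaced sample points of the interval [lo, hi]."""
    n = round((hi - lo) / step)
    return [lo + i * step for i in range(n + 1)]

def tolerance_class(x, points, eps):
    """Sampled points whose membership value is within eps of f(x)."""
    return [p for p in points if abs(f(x) - f(p)) <= eps]

X = sample(0.0, 1.5)   # domain of (X, f1)
Y = sample(0.5, 1.5)   # domain of (Y, f2)
eps = 0.1

class_of_zero = tolerance_class(0.0, X, eps)

# Weak tolerance nearness: at least one pair of sample points with
# membership values within eps of each other.
fuzzy_sets_near = any(abs(f(x) - f(y)) <= eps for x in X for y in Y)
print(len(class_of_zero), fuzzy_sets_near)
```

Because the same oscillating function is used on overlapping domains, matching membership values recur in every subinterval, which is the observation behind the claim that the two fuzzy sets are weakly near.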

7 Rough Near Sets

The germ of the idea for near sets first appeared in a poem by Zdzisław Pawlak and this author entitled Near To, written in 2002 and later published in English and Polish [27, 14]. In later years, the foundations for near sets grew out of a rough set approach to classifying images [28, 21, 4, 22]. It is fairly easy to show that every rough set is also a near set. This section briefly presents some fundamental notions in rough set theory resulting from the seminal work by Zdzisław Pawlak during the early 1980s [13] and elaborated in [17, 16, 15]. An overview of the mathematical foundations of rough sets is given by Lech Polkowski in [31].

Let 〈O,F〉 denote a perceptual system containing a set of perceptual objects O and a set of functions F representing features of the objects in O. Further, let O/∼B denote the set of all classes in the partition of O defined by ∼B for B ⊆ F. Recall that x/∼B denotes an equivalence class relative to x ∈ O. For X ⊆ O, B ⊆ F, a sample perceptual granule X can be approximated with a B-lower approximation B_*X and a B-upper approximation B^*X defined by

\[
B_{*}X = \bigcup_{x:\,[x]_B \subseteq X} [x]_B,
\qquad
B^{*}X = \bigcup_{x:\,[x]_B \cap X \neq \emptyset} [x]_B.
\]

Let BNDB(X) = B^*X − B_*X denote the approximation boundary. A set X is a rough set in the case where the boundary BNDB(X) is not empty, i.e., B^*X − B_*X ≠ ∅. That is, whenever B_*X is a proper subset of B^*X, the sample X has been classified imperfectly and X is considered a rough set. Notice, from Def. 6,

B_*X ⋈B X, and

B^*X ⋈B X,

since the classes in an approximation of X contain objects with descriptions that match the description of at least one object in X. Hence, the pairs B_*X, X and B^*X, X are examples of near sets. In general,

Proposition 5. (Peters [20]) The pairs B_*X, X and B^*X, X are near sets.

Proposition 6. (Peters [20]) Any equivalence class x/∼B, |x/∼B| > 2 is a near set.
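The lower and upper approximations and the boundary can be computed directly from a partition. In the sketch below the objects 1 to 6 and the description x mod 3 are invented for illustration, and the function names are chosen here; only the definitions of the approximations follow the text above.

```python
def partition(objects, describe):
    """Equivalence classes of objects with matching descriptions."""
    classes = {}
    for x in objects:
        classes.setdefault(describe(x), set()).add(x)
    return list(classes.values())

def lower_approximation(classes, X):
    """Union of the classes that are entirely contained in X."""
    return set().union(*[c for c in classes if c <= X])

def upper_approximation(classes, X):
    """Union of the classes that have at least one object in X."""
    return set().union(*[c for c in classes if c & X])

# Hypothetical system: objects 1..6, described by their remainder mod 3,
# which gives the classes {1, 4}, {2, 5}, {3, 6}.
classes = partition(range(1, 7), lambda x: x % 3)
X = {1, 2, 4}

lower = lower_approximation(classes, X)
upper = upper_approximation(classes, X)
boundary = upper - lower
print(lower, upper, boundary)
```

The boundary {2, 5} is non-empty, so X is rough in this toy system, and each approximation shares a class (hence matching descriptions) with X.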


Conclusion

This chapter includes a tutorial on near sets as well as a consideration of hybrids found by considering combinations of near and fuzzy sets as well as near and rough sets. If nothing else, near sets provide a unifying influence in the study of pairs of sets that are either fuzzy or rough. Near sets have their origins in image analysis, especially if there is interest in comparing images. Fuzzy sets and rough sets are perhaps the most commonly used mathematical tools in a typical computational intelligence framework for applications such as control and classification. For this reason, there is considerable interest in finding connections between fuzzy sets and near sets as well as between rough sets and near sets. It has been shown that near sets are a generalization of rough sets. In fact, the formalization of near sets began in 2006 with a consideration of approximation spaces and the importance of the boundary of an approximation of a set in distinguishing between near sets and rough sets. It is a straightforward task to show that every rough set is a near set but not every near set is a rough set. This is an important observation, since the population of non-rough sets appears to be considerably larger than the population of rough sets, if one considers the fact that every class with 2 or more objects is a near set. Near sets have proven to be useful in machine learning (especially, biologically-inspired adaptive learning) and in image analysis.

References

1. Grimaldi, D., Engel, M.: Evolution of the Insects. Cambridge University Press, Cambridge (2005)

2. Gupta, S., Patnaik, K.: Enhancing performance of face recognition system by using near set approach for selecting facial features. Journal of Theoretical and Applied Information Technology 4(5), 433–441 (2008), http://www.jatit.org/volumes/fourth_volume_5_2008.php

3. Hassanien, A., Abraham, A., Peters, J., Schaefer, G.: Rough sets and near sets in medical imaging: A review. IEEE Trans. on Information Technology in Biomedicine (submitted) (2008)

4. Henry, C., Peters, J.: Image pattern recognition using approximation spaces and near sets. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds.) RSFDGrC 2007. LNCS (LNAI), vol. 4482, pp. 475–482. Springer, Heidelberg (2007)

5. Henry, C., Peters, J.: Near set image segmentation quality index. In: GEOBIA 2008 Pixels, Objects, Intelligence. GEOgraphic Object Based Image Analysis for the 21st Century, pp. 284–289. University of Calgary, Alberta (2008), http://www.ucalgary.ca/geobia/Publishing

6. Henry, C., Peters, J.: Perception based image classification. IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews (submitted) (2008)

7. Iturralde-Vinent, M., MacPhee, R.: Age and paleogeographical origin of Dominican amber. Science 273, 1850–1852 (1996)

8. Jahne, B.: Digital Image Processing, 6th edn. Springer, Heidelberg (2005)

9. Murray, J., Bradley, H., Craigie, W., Onions, C.: The Oxford English Dictionary. Oxford University Press, Oxford (1933)

10. Orłowska, E. (ed.): Incomplete Information: Rough Set Analysis. Studies in Fuzziness and Soft Computing, vol. 13. Physica-Verlag, Heidelberg (1998)

11. Orłowska, E., Pawlak, Z.: Representation of nondeterministic information. Theoretical Computer Science 29, 27–39 (1984)

12. Pavel, M.: Fundamentals of Pattern Recognition, 2nd edn. Marcel Dekker, Inc., N.Y. (1993)

13. Pawlak, Z.: Classification of objects by means of attributes. Polish Academy of Sciences 429 (1981)

14. Pawlak, Z., Peters, J.: Jak blisko. Systemy Wspomagania Decyzji I, 57 (2007)

15. Pawlak, Z., Skowron, A.: Rough sets and boolean reasoning. Information Sciences 177, 41–73 (2007)

16. Pawlak, Z., Skowron, A.: Rough sets: Some extensions. Information Sciences 177, 28–40 (2007)

17. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177, 3–27 (2007)

18. Pedrycz, W., Gomide, F.: An Introduction to Fuzzy Sets. Analysis and Design. MIT Press, Cambridge (1998)

19. Peters, J.: Classification of objects by means of features. In: Proc. IEEE Symposium Series on Foundations of Computational Intelligence (IEEE SCCI 2007), Honolulu, Hawaii, pp. 1–8 (2007)

20. Peters, J.: Near sets. General theory about nearness of objects. Applied Mathematical Sciences 1(53), 2609–2629 (2007)

21. Peters, J.: Near sets, special theory about nearness of objects. Fundamenta Informaticae 75(1-4), 407–433 (2007)

22. Peters, J.: Near sets. Toward approximation space-based object recognition. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Slezak, D. (eds.) RSKT 2007. LNCS (LNAI), vol. 4481, pp. 22–33. Springer, Heidelberg (2007)

23. Peters, J.: Classification of perceptual objects by means of features. Int. J. of Info. Technology & Intelligent Computing 3(2), 1–35 (2008)

24. Peters, J.: Discovery of perceptually near information granules. In: Yao, J. (ed.) Novel Developments in Granular Computing: Applications of Advanced Human Reasoning and Soft Computation. Information Science Reference, Hershey, N.Y., U.S.A. (to appear) (2008)

25. Peters, J., Ramanna, S.S.: Feature selection: A near set approach. In: ECML & PKDD Workshop on Mining Complex Data, Warsaw (2007)

26. Peters, J., Shahfar, S., Ramanna, S., Szturm, T.: Biologically-inspired adaptive learning: A near set approach. In: Frontiers in the Convergence of Bioscience and Information Technologies, Korea (2007)

27. Peters, J., Skowron, A.: Zdzisław Pawlak: Life and work. Transactions on Rough Sets V, 1–24 (2006)

28. Peters, J., Skowron, A., Stepaniuk, J.: Nearness in approximation spaces. In: Proc. Concurrency, Specification and Programming (CS&P 2006), Humboldt Universitat, pp. 435–445 (2006)

29. Peters, J., Skowron, A., Stepaniuk, J.: Nearness of objects: Extension of approximation space model. Fundamenta Informaticae 79(3-4), 497–512 (2007)

30. Peters, J., Wasilewski, P.: Foundations of near sets. Information Sciences (submitted) (2008)

31. Polkowski, L.: Rough Sets. Mathematical Foundations. Springer, Heidelberg (2002)

32. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)

33. Zadeh, L.: Fuzzy sets. Information and Control 8, 338–353 (1965)

Fuzzy without Fuzzy: Why Fuzzy-Related Aggregation Techniques Are Often Better Even in Situations without True Fuzziness

Hung T. Nguyen1, Vladik Kreinovich2, Francois Modave2, and Martine Ceberio2

1 Department of Mathematical Sciences, New Mexico State University, Las Cruces, NM 88003, [email protected]

2 Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, [email protected], [email protected], [email protected]

Summary. Fuzzy techniques were originally invented as a methodology that transforms the knowledge of experts, formulated in terms of natural language, into a precise computer-implementable form. There are many successful applications of this methodology to situations in which expert knowledge exists; the best known is the application to fuzzy control.

In some cases, fuzzy methodology is applied even when no expert knowledge exists: instead of trying to approximate the unknown control function by splines, polynomials, or any other traditional approximation technique, researchers try to approximate it by guessing and tuning the expert rules. Surprisingly, this approximation often works fine, especially in such application areas as control and multi-criteria decision making.

In this chapter, we give a mathematical explanation for this phenomenon.

1 Introduction

Fuzzy techniques: a brief reminder. Fuzzy techniques were originally invented as a methodology that transforms the knowledge of experts, formulated in terms of natural language, into a precise computer-implementable form. There are many successful applications of this methodology to situations in which expert knowledge exists; the best known is the application to fuzzy control; see, e.g., [6, 8, 18].

Universal approximation results. A guarantee of success comes from the fact that fuzzy systems are universal approximators in the sense that for every continuous function $f(x_1, \ldots, x_n)$ and for every $\varepsilon > 0$, there exists a set of rules for which the corresponding input-output function is $\varepsilon$-close to $f$; see, e.g., [1, 8, 9, 11, 12, 16, 18, 19, 22, 23, 25] and references therein.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 27–51. springerlink.com © Springer-Verlag Berlin Heidelberg 2009


Fuzzy methodology is sometimes successful without any fuzzy expert knowledge. In some cases, fuzzy methodology is applied even when no expert knowledge exists: instead of trying to approximate the unknown control function by splines, polynomials, or any other traditional approximation technique, researchers try to approximate it by guessing and tuning the expert rules. Surprisingly, this approximation often works fine.

Similarly, fuzzy-type aggregation functions like OWA or Choquet integrals often work better than quadratic functions in multi-criteria decision making.

What we plan to do. In this chapter, we give a mathematical explanation for these phenomena, and we show that approximation by using fuzzy methodology is indeed (in some reasonable sense) the best.

Comment. In this chapter, we build upon our preliminary results published in [13, 15, 17].

2 Use of Fuzzy Techniques in Non-fuzzy Control: A Justification

In many practical applications, data processing speed is important. We have mentioned that one of the main applications of fuzzy methodology is to intelligent control.

In applications to automatic control, the computer must constantly compute the current values of control. The value of the control depends on the state of the controlled object (called plant in control theory). So, to get high-quality control, we must measure as many characteristics $x_1, \ldots, x_n$ of the current state as we can. The more characteristics we measure, the more numbers we have to process, and so the more computation steps we must perform. The results of these computations must be ready almost immediately, before we start the next round of measurements. So, automatic control, especially high-quality automatic control, is a real-time computation problem with serious time pressure.

Parallel computing is an answer. A natural way to increase the speed of the computations is to perform computations in parallel on several processors. To make the computations really fast, we must divide the algorithm into parallelizable steps, each of which requires a small amount of time.

What are these steps?

The fewer variables, the faster. As we have already mentioned, the main reason why control algorithms are computationally complicated is that we must process many inputs. For example, controlling a car is easier than controlling a plane, because the plane (as a 3-D object) has more characteristics to take care of, more characteristics to measure and hence, more characteristics to process. Controlling a space shuttle, especially during lift-off and landing, is an even more complicated task, usually performed by several groups of people who control the trajectory, temperature, rotation, etc. In short, the more numbers we need to process, the more complicated the algorithm. Therefore, if we want to decompose our algorithm into the fastest possible modules, we must make each module process as few numbers as possible.

Functions of one variable are not sufficient. Ideally, we should only use modules that compute functions of one variable. However, if we only have functions of one variable (i.e., procedures with one input and one output), then, no matter how we combine them, we will always end up with functions of one variable. Since our ultimate goal is to compute the control function $u = f(x_1, \ldots, x_n)$ that depends on many variables $x_1, \ldots, x_n$, we must therefore enable our processors to compute at least one function of two or more variables.

What functions of two variables should we choose?

Choosing functions of two or more variables. Inside the computer, each function is represented as a sequence of hardware-implemented operations. The fastest functions are those that are computed by a single hardware operation. The basic hardware-supported operations are the arithmetic operations $a + b$, $a - b$, $a \cdot b$, $a/b$, and $\min(a, b)$ and $\max(a, b)$. The time required for each operation, crudely speaking, corresponds to the number of bit operations that have to be performed:

• Division is done by successive multiplication, comparison and subtraction (basically, in the same way as we do it manually), so it is a much slower operation than $-$.

• Multiplication is implemented as a sequence of additions (again, basically in the same manner as we do it manually), so it is much slower than $+$.

• $-$ and $+$ are usually implemented in the same way. To add two $n$-bit binary numbers, we need $n$ bit additions, and also, potentially, $n$ bit additions for carries. In total, we need about $2n$ bit operations.

• min of two $n$-bit binary numbers can be done in $n$ binary operations: we compare the bits from the highest to the lowest, and as soon as they differ, the number that has 0 as opposed to 1 is the desired minimum: e.g., the minimum of 0.10101 and 0.10011 is 0.10011, because in the third bit, this number has 0 as opposed to 1.

• Similarly, max is an $n$-bit operation.
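The bit-by-bit comparison described in the last two items can be sketched as follows. This is only an illustration, assuming that the two numbers are given as equal-length bit strings with the most significant bit first; the function names `bitwise_min` and `bitwise_max` are ours, not part of any hardware specification.

```python
def bitwise_min(a: str, b: str) -> str:
    """Smaller of two equal-length bit strings (most significant bit first)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of bits")
    # Scan from the highest bit to the lowest; stop at the first difference.
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:
            # The number that has 0 where the other has 1 is the minimum.
            return a if bit_a == "0" else b
    return a  # all bits coincide, so the numbers are equal


def bitwise_max(a: str, b: str) -> str:
    """Larger of two equal-length bit strings (most significant bit first)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of bits")
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:
            # The number that has 1 where the other has 0 is the maximum.
            return a if bit_a == "1" else b
    return a


# The example from the text: fractional bits of 0.10101 and 0.10011.
smaller = bitwise_min("10101", "10011")
```

Each call inspects at most $n$ bit pairs, which matches the count of $n$ binary operations given above.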

So, the fastest possible functions of two variables are min and max. Computing the minimum and maximum of several (more than two) real numbers is similarly fast. Therefore, we will choose these functions for our control-oriented computer.

Summarizing the above analysis, we can conclude that our computer will contain modules of two types:

• modules that compute functions of one variable;

• modules that compute min and max of two or several numbers.

How to combine these modules? We want to combine these modules in such a way that the resulting computations are as fast as possible. The time that is required for an algorithm is crudely proportional to the number of sequential steps that it takes. We can describe this number of steps in clear geometric terms:

• at the beginning, the input numbers are processed by some processors; these processors form the first layer of computations;

• the results of this processing may then go into different processors, which form the second layer;

• the results of the second layer of processing go into the third layer;

• etc.

In these terms, the fewer layers the computer has, the faster it is. So, we would like to combine the processors into the smallest possible number of layers.

Now, we are ready for the formal definitions.

Definition and the main result. Let us first give an inductive definition of what it means for a function to be computable by a $k$-layer computer.

Definition 1

• We say that a function $f(x_1, \ldots, x_n)$ is computable by a 1-layer computer if either $n = 1$, or the function $f$ coincides with min or with max.

• Let $k \ge 1$ be an integer. We say that a function $f(x_1, \ldots, x_n)$ is computable by a $(k+1)$-layer computer if one of the following three statements is true:

  ◦ $f = g(h(x_1, \ldots, x_n))$, where $g$ is a function of one variable, and $h(x_1, \ldots, x_n)$ is computable by a $k$-layer computer;

  ◦ $f = \min(g_1(x_1, \ldots, x_n), \ldots, g_m(x_1, \ldots, x_n))$, where all functions $g_i$ are computable by a $k$-layer computer;

  ◦ $f = \max(g_1(x_1, \ldots, x_n), \ldots, g_m(x_1, \ldots, x_n))$, where all functions $g_i$ are computable by a $k$-layer computer.
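To make the inductive definition concrete, here is a minimal sketch that represents a computation as an expression tree and counts the layers that this particular expression uses. The class names `Var`, `Unary`, `MinMax` and the helpers `layers` and `evaluate` are hypothetical names introduced only for this illustration; the count is an upper bound for the given expression, not a proof that fewer layers are impossible.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union


@dataclass(frozen=True)
class Var:
    """Input variable x_i (0-based index); it uses no layer by itself."""
    index: int


@dataclass(frozen=True)
class Unary:
    """A module computing a function of one variable."""
    g: Callable[[float], float]
    arg: "Node"


@dataclass(frozen=True)
class MinMax:
    """A module computing min or max of several sub-expressions."""
    op: str  # "min" or "max"
    args: Tuple["Node", ...]


Node = Union[Var, Unary, MinMax]


def layers(node: Node) -> int:
    """Number of layers used by this particular expression."""
    if isinstance(node, Var):
        return 0
    if isinstance(node, Unary):
        return 1 + layers(node.arg)
    return 1 + max(layers(a) for a in node.args)


def evaluate(node: Node, x: Sequence[float]) -> float:
    """Value of the expression at the input point x."""
    if isinstance(node, Var):
        return x[node.index]
    if isinstance(node, Unary):
        return node.g(evaluate(node.arg, x))
    values = [evaluate(a, x) for a in node.args]
    return min(values) if node.op == "min" else max(values)


# An expression of the form max(min(f11(x1), f12(x2)), min(f21(x1), f22(x2))),
# with arbitrary illustrative one-variable functions.
double = lambda t: 2 * t
negate = lambda t: -t
expr = MinMax("max", (
    MinMax("min", (Unary(double, Var(0)), Unary(double, Var(1)))),
    MinMax("min", (Unary(negate, Var(0)), Unary(negate, Var(1)))),
))
```

In this sketch, `layers(expr)` equals 3: one layer of one-variable functions, one layer of min, and one layer of max.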

Comment. A computer is a finite-precision machine, so the results of the computations are never absolutely precise. Also, a computer is limited in the size of its numbers. So, we can only compute a function approximately, and only on a limited range. Therefore, when we say that we can compute an arbitrary function, we simply mean that for an arbitrary range $T$, for an arbitrary continuous function $f : [-T, T]^n \to R$, and for an arbitrary accuracy $\varepsilon > 0$, we can compute a function that is $\varepsilon$-close to $f$ on the given range. In this sense, we will show that not every function can be computed on a 2-layer computer, but that 3 layers are already sufficient.

Proposition 1. There exist real numbers $T$ and $\varepsilon > 0$, and a continuous function $f : [-T, T]^n \to R$ such that no function $\varepsilon$-close to $f$ on $[-T, T]^n$ can be computed on a 2-layer computer.


Comment. To make the text more readable, we present both proofs at the end of this section. However, we will make one comment here. The function that will be proved to be not computable on a 2-layer computer is not exotic at all: it is $f(x_1, x_2) = x_1 + x_2$ on the domain $[-1, 1]^2$, and Proposition 1 holds for $\varepsilon = 0.4$.

Theorem 1. For all real numbers $T$ and $\varepsilon > 0$, and for every continuous function $f : [-T, T]^n \to R$, there exists a function $\tilde f$ that is $\varepsilon$-close to $f$ on $[-T, T]^n$ and that is computable on a 3-layer computer.

Comment. In other words, functions computed by a 3-layer computer are universal approximators.

Relation to fuzzy control. As we will see from the proof, the approximating function $\tilde f$ is of the type $\max(A_1, \ldots, A_m)$, where

$$A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n)).$$

These functions correspond to so-called fuzzy control [6, 8, 18]. Indeed, let us define

$$U = \max_{i,\, j,\, x_i \in [-T, T]} |f_{ji}(x_i)|$$

and

$$\mu_{ji}(x_i) = \frac{f_{ji}(x_i) - (-U)}{U - (-U)}.$$

Let us now assume that the rule base that describes the expert recommendations for control consists of exactly two rules:

• "if one of the conditions $C_j$ is true, then $u = U$";

• "else, $u = -U$",

where each condition $C_j$ means that the following $n$ conditions are satisfied:

• $x_1$ satisfies the property $C_{j1}$ (described by a membership function $\mu_{j1}(x_1)$);

• $x_2$ satisfies the property $C_{j2}$ (described by a membership function $\mu_{j2}(x_2)$);

• . . .

• $x_n$ satisfies the property $C_{jn}$ (described by a membership function $\mu_{jn}(x_n)$).

In logical terms, the condition $C$ for $u = U$ has the form

$$(C_{11} \,\&\, \ldots \,\&\, C_{1n}) \vee \ldots \vee (C_{k1} \,\&\, \ldots \,\&\, C_{kn}).$$

If we use min for $\&$ and max for $\vee$ (these are the simplest choices in fuzzy control methodology), then the degree $\mu_C$ with which we believe in a condition $C = C_1 \vee \ldots \vee C_k$ can be expressed as:

$$\mu_C = \max[\min(\mu_{11}(x_1), \ldots, \mu_{1n}(x_n)), \ldots, \min(\mu_{k1}(x_1), \ldots, \mu_{kn}(x_n))].$$

Correspondingly, the degree of belief in a condition for $u = -U$ is $1 - \mu_C$. According to fuzzy control methodology, we must use a defuzzification to determine the actual control, which in this case leads to the choice of

$$u = \frac{U \cdot \mu_C + (-U) \cdot (1 - \mu_C)}{\mu_C + (1 - \mu_C)}.$$

The denominator is equal to 1, so $u = U \cdot (2\mu_C - 1)$. Since the increasing linear transformation $f_{ji} \to \mu_{ji} = (f_{ji} + U)/(2U)$ commutes with min and max, we have $\mu_C = (\max(A_1, \ldots, A_m) + U)/(2U)$. Thus, because of our choice of $\mu_{ji}$, this expression coincides exactly with the function $\max(A_1, \ldots, A_m)$, where

$$A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n)).$$

So, we get exactly the expressions that stem from the fuzzy control methodology.
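The coincidence between the defuzzified two-rule control and the max-min expression can also be checked numerically. The sketch below is only an illustration: it works at a single fixed input point, the table `f_values[j][i]` stands for the values $f_{ji}(x_i)$ at that point and is filled with arbitrary random numbers, and `U = 3.0` is an assumed bound on their absolute values.

```python
import random


def max_min(f_values):
    """max(A_1, ..., A_m) with A_j = min_i f_ji(x_i)."""
    return max(min(row) for row in f_values)


def fuzzy_control(f_values, U):
    """Two-rule fuzzy control with min for 'and', max for 'or'."""
    # Membership degrees mu_ji = (f_ji - (-U)) / (U - (-U)).
    mu = [[(v - (-U)) / (U - (-U)) for v in row] for row in f_values]
    mu_c = max(min(row) for row in mu)
    # Defuzzification: weighted average of the two recommended controls.
    return (U * mu_c + (-U) * (1 - mu_c)) / (mu_c + (1 - mu_c))


rng = random.Random(1)
U = 3.0  # assumed bound on |f_ji(x_i)|
f_values = [[rng.uniform(-U, U) for _ in range(4)] for _ in range(5)]
difference = abs(fuzzy_control(f_values, U) - max_min(f_values))
```

Up to rounding errors, `difference` is zero for any table of values bounded by `U`.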

Conclusion. Since our 3-layer expression describes the fastest possible computation tool, we can conclude that for control problems, the fastest possible universal computation scheme corresponds to using fuzzy methodology.

This result explains why fuzzy methodology is sometimes used (and used successfully) without any expert knowledge being present, as an extrapolation tool for the (unknown) function.

Comment. We have considered digital parallel computers. If we use analog processors instead, then min and max stop being the simplest functions. Instead, the sum is the simplest: if we just join two wires together, then the resulting current is equal to the sum of the two input currents. In this case, if we use a sum (and, more generally, a linear combination) instead of min and max, 3-layer computers are also universal approximators; the corresponding computers correspond to neural networks [10].

Proof of Proposition 1

0°. Let us prove (by reduction to a contradiction) that if a function $\tilde f(x_1, x_2)$ is 0.4-close to $f(x_1, x_2) = x_1 + x_2$ on $[-1, 1]^2$, then $\tilde f$ cannot be computed on a 2-layer computer. Indeed, suppose that it can. Then, according to Definition 1, the function $\tilde f(x_1, x_2)$ is of one of the following three forms:

• $g(h(x_1, x_2))$, where $h$ is computable on a 1-layer computer;

• $\min(g_1(x_1, x_2), \ldots, g_m(x_1, x_2))$, where all the functions $g_i$ are computable on a 1-layer computer;

• $\max(g_1(x_1, x_2), \ldots, g_m(x_1, x_2))$, where all the functions $g_i$ are computable on a 1-layer computer.

Let us show case by case that all three cases are impossible.

1°. In the first case, $\tilde f(x_1, x_2) = g(h(x_1, x_2))$, where $h$ is computable on a 1-layer computer. By definition, this means that $h$ is either a function of one variable, or min, or max. Let us consider all three sub-cases.

1.1°. If $\tilde f(x_1, x_2) = g(h(x_1))$, then the function $\tilde f$ depends only on $x_1$. In particular,

$$\tilde f(0, -1) = \tilde f(0, 1). \qquad (1)$$

But since $\tilde f$ is $\varepsilon$-close to $f(x_1, x_2) = x_1 + x_2$, we get

$$\tilde f(0, -1) \le f(0, -1) + \varepsilon = -1 + 0.4 = -0.6$$

and

$$\tilde f(0, 1) \ge f(0, 1) - \varepsilon = 1 - 0.4 = 0.6 > -0.6.$$

So, $\tilde f(0, -1) \le -0.6 < \tilde f(0, 1)$, hence $\tilde f(0, -1) \ne \tilde f(0, 1)$, which contradicts (1). So, this sub-case is impossible. Similarly, it is impossible to have $h$ depending only on $x_2$.

1.2°. Let us consider the sub-case when $\tilde f(x_1, x_2) = g(\min(x_1, x_2))$. In this sub-case,

$$\tilde f(-1, -1) = g(\min(-1, -1)) = g(-1) = g(\min(-1, 1)) = \tilde f(-1, 1),$$

i.e.,

$$\tilde f(-1, -1) = \tilde f(-1, 1). \qquad (2)$$

But

$$\tilde f(-1, -1) \le f(-1, -1) + \varepsilon = -2 + 0.4 = -1.6$$

and

$$\tilde f(-1, 1) \ge f(-1, 1) - \varepsilon = 0 - 0.4 = -0.4 > -1.6,$$

so the equality (2) is also impossible.

1.3°. Let us now consider the sub-case $\tilde f(x_1, x_2) = g(\max(x_1, x_2))$. In this sub-case,

$$\tilde f(-1, 1) = g(\max(-1, 1)) = g(1) = g(\max(1, 1)) = \tilde f(1, 1),$$

i.e.,

$$\tilde f(-1, 1) = \tilde f(1, 1). \qquad (3)$$

But

$$\tilde f(-1, 1) \le f(-1, 1) + \varepsilon = 0 + 0.4 = 0.4$$

and

$$\tilde f(1, 1) \ge f(1, 1) - \varepsilon = 2 - 0.4 = 1.6 > 0.4,$$

so the equality (3) is also impossible.
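The three sub-cases of the first case can be illustrated numerically. A function of the form $g(h(x_1, x_2))$ takes a single value on each level set of $h$, so its error in approximating $x_1 + x_2$ is at least half the spread of $x_1 + x_2$ over that level set. The following sketch (the helper name `best_error` and the grid resolution are our own choices, used only for illustration) computes this lower bound on a grid over $[-1, 1]^2$.

```python
def best_error(h, steps=20):
    """Lower bound on max |g(h(x1, x2)) - (x1 + x2)| over all functions g.

    On each level set of h, the best single value of g is the midpoint of
    the range of x1 + x2, and its error is half the width of that range.
    """
    grid = [-1 + 2 * i / steps for i in range(steps + 1)]
    lowest, highest = {}, {}
    for x1 in grid:
        for x2 in grid:
            level = round(h(x1, x2), 9)  # group grid points by the value of h
            s = x1 + x2
            lowest[level] = min(lowest.get(level, s), s)
            highest[level] = max(highest.get(level, s), s)
    return max((highest[k] - lowest[k]) / 2 for k in lowest)


errors = {
    "h = x1": best_error(lambda x1, x2: x1),
    "h = min": best_error(lambda x1, x2: min(x1, x2)),
    "h = max": best_error(lambda x1, x2: max(x1, x2)),
}
```

In each of the three sub-cases, the bound is about 1, which exceeds $\varepsilon = 0.4$, in agreement with the contradictions derived above.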

2°. In the second case, $\tilde f(x_1, x_2) = \min(g_1(x_1, x_2), \ldots, g_m(x_1, x_2))$, where all the functions $g_i$ are computable on a 1-layer computer. For this case, the impossibility follows from the following sequence of steps:

2.1°. If one of the functions $g_i$ is of the type $\min(x_1, x_2)$, then we can rewrite

$$\min(g_1, \ldots, g_{i-1}, \min(x_1, x_2), g_{i+1}, \ldots, g_m)$$

as

$$\min(g_1, \ldots, g_{i-1}, g_i^{(1)}, g_i^{(2)}, g_{i+1}, \ldots, g_m),$$

where $g_i^{(l)}(x_1, x_2) = x_l$ ($l = 1, 2$) is a function that is clearly computable on a 1-layer computer. After we make such transformations, we get an expression for $\tilde f$ that only contains max and functions of one variable.

2.2°. Let us show that this expression cannot contain max. Indeed, if it does, then

$$\tilde f(x_1, x_2) = \min(\ldots, \max(x_1, x_2)) \le \max(x_1, x_2).$$

In particular, $\tilde f(1, 1) \le \max(1, 1) = 1$. But we must have

$$\tilde f(1, 1) \ge f(1, 1) - \varepsilon = 2 - 0.4 = 1.6 > 1.$$

The contradiction shows that max cannot be one of the functions $g_i$.

2.3°. So, each function $g_i$ depends only on one variable. If all of them depend on one and the same variable, say, $x_1$, then the entire function $\tilde f$ depends only on one variable, and we have already proved (in the proof of the first case) that this is impossible. So, some functions $g_i$ depend on $x_1$, and some of the functions $g_i$ depend on $x_2$. Let us denote by $h_1(x_1)$ the minimum of all functions $g_i$ that depend on $x_1$, and by $h_2(x_2)$ the minimum of all the functions $g_i$ that depend on $x_2$. Then, we can represent $\tilde f$ as $\tilde f(x_1, x_2) = \min(h_1(x_1), h_2(x_2))$.

2.4°. To get a contradiction, let us first take $x_1 = 1$ and $x_2 = 1$. Then,

$$\tilde f(1, 1) = \min(h_1(1), h_2(1)) \ge f(1, 1) - \varepsilon = 2 - 0.4 = 1.6.$$

Since the minimum of the two numbers is $\ge 1.6$, we can conclude that each of them is $\ge 1.6$, i.e., that $h_1(1) \ge 1.6$ and $h_2(1) \ge 1.6$. For $x_1 = 1$ and $x_2 = -1$, we have

$$\tilde f(1, -1) = \min(h_1(1), h_2(-1)) \le f(1, -1) + \varepsilon = 0.4.$$

Since $h_1(1) \ge 1.6 > 0.4$, the minimum is attained at the second argument, so we conclude that $\tilde f(1, -1) = h_2(-1)$. From

$$\tilde f(1, -1) \ge f(1, -1) - \varepsilon = -0.4,$$

we can now conclude that $h_2(-1) \ge -0.4$. Similarly, one can prove that $h_1(-1) \ge -0.4$. Hence,

$$\tilde f(-1, -1) = \min(h_1(-1), h_2(-1)) \ge -0.4. \qquad (4)$$

But

$$\tilde f(-1, -1) \le f(-1, -1) + \varepsilon = -2 + 0.4 = -1.6 < -0.4,$$

a contradiction with (4).

The contradiction shows that the second case is also impossible.


3°. In the third case, $\tilde f(x_1, x_2) = \max(g_1(x_1, x_2), \ldots, g_m(x_1, x_2))$, where all the functions $g_i$ are computable on a 1-layer computer. For this case, the impossibility (similarly to the second case) follows from the following sequence of steps:

3.1°. If one of the functions $g_i$ is of the type $\max(x_1, x_2)$, then we can rewrite

$$\max(g_1, \ldots, g_{i-1}, \max(x_1, x_2), g_{i+1}, \ldots, g_m)$$

as

$$\max(g_1, \ldots, g_{i-1}, g_i^{(1)}, g_i^{(2)}, g_{i+1}, \ldots, g_m),$$

where $g_i^{(l)}(x_1, x_2) = x_l$ ($l = 1, 2$) is a function that is clearly computable on a 1-layer computer. After we make such transformations, we get an expression for $\tilde f$ that only contains min and functions of one variable.

3.2°. Let us show that this expression cannot contain min. Indeed, if it does, then

$$\tilde f(x_1, x_2) = \max(\ldots, \min(x_1, x_2)) \ge \min(x_1, x_2).$$

In particular, $\tilde f(-1, -1) \ge \min(-1, -1) = -1$. But we must have

$$\tilde f(-1, -1) \le f(-1, -1) + \varepsilon = -2 + 0.4 = -1.6 < -1.$$

The contradiction shows that min cannot be one of the functions $g_i$.

3.3°. So, each function $g_i$ depends only on one variable. If all of them depend on one and the same variable, say, $x_1$, then the entire function $\tilde f$ depends only on one variable, and we have already proved (in the proof of the first case) that this is impossible. So, some functions $g_i$ depend on $x_1$, and some of the functions $g_i$ depend on $x_2$. Let us denote by $h_1(x_1)$ the maximum of all functions $g_i$ that depend on $x_1$, and by $h_2(x_2)$ the maximum of all the functions $g_i$ that depend on $x_2$. Then, we can represent $\tilde f$ as

$$\tilde f(x_1, x_2) = \max(h_1(x_1), h_2(x_2)).$$

3.4°. To get a contradiction, let us first take $x_1 = -1$ and $x_2 = -1$. Then,

$$\tilde f(-1, -1) = \max(h_1(-1), h_2(-1)) \le f(-1, -1) + \varepsilon = -2 + 0.4 = -1.6.$$

Since the maximum of the two numbers is $\le -1.6$, we can conclude that each of them is $\le -1.6$, i.e., that $h_1(-1) \le -1.6$ and $h_2(-1) \le -1.6$. For $x_1 = 1$ and $x_2 = -1$, we have

$$\tilde f(1, -1) = \max(h_1(1), h_2(-1)) \ge f(1, -1) - \varepsilon = -0.4.$$

Since $h_2(-1) \le -1.6 < -0.4$, the maximum is attained at the first argument, so we conclude that $\tilde f(1, -1) = h_1(1)$. From

$$\tilde f(1, -1) \le f(1, -1) + \varepsilon = 0.4,$$

we can now conclude that $h_1(1) \le 0.4$. Similarly, one can prove that $h_2(1) \le 0.4$. Hence,

$$\tilde f(1, 1) = \max(h_1(1), h_2(1)) \le 0.4. \qquad (5)$$

But

$$\tilde f(1, 1) \ge f(1, 1) - \varepsilon = 2 - 0.4 = 1.6 > 0.4,$$

which contradicts (5).

The contradiction shows that the third case is also impossible.

4°. In all three cases, we have shown that the assumption that $\tilde f$ can be computed on a 2-layer computer leads to a contradiction. So, $\tilde f$ cannot be thus computed. Q.E.D.

Proof of Theorem 1. Since the function $f$ is continuous on the compact set $[-T, T]^n$, it is uniformly continuous, so there exists a $\delta > 0$ such that if $|x_i - y_i| \le \delta$ for all $i$, then

$$|f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n)| \le \varepsilon.$$

Let us mark the grid points on the grid of size $\delta$, i.e., all the points for which each coordinate $x_1, \ldots, x_n$ has the form $q_i \cdot \delta$ for an integer $q_i$ (i.e., we mark the points with coordinates $0, \pm\delta, \pm 2\delta, \ldots, \pm T$).

On each coordinate, we thus mark $\approx 2T/\delta$ points. So, in total, we mark $\approx (2T/\delta)^n$ grid points. Let us denote the total number of grid points by $k$, and the points themselves by $P_j = (x_{j1}, \ldots, x_{jn})$, $1 \le j \le k$.

By $m_f$, let us denote the minimum of $f$:

$$m_f = \min_{x_1 \in [-T, T], \ldots, x_n \in [-T, T]} f(x_1, \ldots, x_n).$$

For each grid point $P_j$, we will form piece-wise linear functions $f_{ji}(x_i)$ as follows:

• if $|x_i - x_{ji}| \le 0.6 \cdot \delta$, then $f_{ji}(x_i) = f(P_j)$ $(\ge m_f)$;

• if $|x_i - x_{ji}| \ge 0.7 \cdot \delta$, then $f_{ji}(x_i) = m_f$;

• if $0.6 \cdot \delta \le |x_i - x_{ji}| \le 0.7 \cdot \delta$, then

$$f_{ji}(x_i) = m_f + (f(P_j) - m_f) \cdot \frac{0.7 \cdot \delta - |x_i - x_{ji}|}{0.7 \cdot \delta - 0.6 \cdot \delta}.$$

Let us show that for these functions $f_{ji}$, the function

$$\tilde f(x_1, \ldots, x_n) = \max(A_1, \ldots, A_m),$$

where $A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n))$, is $\varepsilon$-close to $f$.


To prove that, we will prove the following two inequalities:

• For all $x_1, \ldots, x_n$, we have $\tilde f(x_1, \ldots, x_n) \ge f(x_1, \ldots, x_n) - \varepsilon$.

• For all $x_1, \ldots, x_n$, we have $\tilde f(x_1, \ldots, x_n) \le f(x_1, \ldots, x_n) + \varepsilon$.

Let us first prove the first inequality. Assume that we have a point $(x_1, \ldots, x_n)$. For every $i = 1, \ldots, n$, by $q_i$ we will denote the integer that is the closest to $x_i/\delta$. Then,

$$|x_i - q_i \cdot \delta| \le 0.5 \cdot \delta.$$

These values $q_i$ determine a grid point $P_j = (x_{j1}, \ldots, x_{jn})$ with coordinates $x_{ji} = q_i \cdot \delta$. For this $j$, and for every $i$,

$$|x_i - x_{ji}| \le 0.5 \cdot \delta < 0.6 \cdot \delta;$$

therefore, by definition of $f_{ji}$, we have $f_{ji}(x_i) = f(P_j)$. Hence,

$$A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n)) = \min(f(P_j), \ldots, f(P_j)) = f(P_j).$$

Therefore,

$$\tilde f(x_1, \ldots, x_n) = \max(A_1, \ldots, A_m) \ge A_j = f(P_j).$$

But since $|x_{ji} - x_i| \le 0.5 \cdot \delta < \delta$, by the choice of $\delta$, we have

$$|f(x_1, \ldots, x_n) - f(P_j)| \le \varepsilon.$$

Therefore, $f(P_j) \ge f(x_1, \ldots, x_n) - \varepsilon$, and hence,

$$\tilde f(x_1, \ldots, x_n) \ge f(P_j) \ge f(x_1, \ldots, x_n) - \varepsilon.$$

Let us now prove the second inequality. According to our definition of $f_{ji}$, the value of $f_{ji}(x_i)$ is always between $m_f$ and $f(P_j)$, and this value is different from $m_f$ only for the grid points $P_j$ for which $|x_{ji} - x_i| \le 0.7 \cdot \delta$. The value

$$A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n))$$

is thus different from $m_f$ only if all the values $f_{ji}(x_i)$ are different from $m_f$, i.e., when $|x_{ji} - x_i| \le 0.7 \cdot \delta$ for all $i$. For such a grid point, $|x_{ji} - x_i| \le 0.7 \cdot \delta < \delta$; therefore,

$$|f(P_j) - f(x_1, \ldots, x_n)| \le \varepsilon$$

and hence, $f(P_j) \le f(x_1, \ldots, x_n) + \varepsilon$. By definition of $f_{ji}$, we have $f_{ji}(x_i) \le f(P_j)$. Since this is true for all $i$, we have

$$A_j = \min(f_{j1}(x_1), \ldots, f_{jn}(x_n)) \le f(P_j) \le f(x_1, \ldots, x_n) + \varepsilon.$$

For all other grid points $P_j$, we have $A_j(x_1, \ldots, x_n) = m_f$ for the given $(x_1, \ldots, x_n)$. Since $m_f$ has been defined as the minimum of $f$, we have

$$A_j = m_f \le f(x_1, \ldots, x_n) < f(x_1, \ldots, x_n) + \varepsilon.$$

So, for all grid points, we have

$$A_j \le f(x_1, \ldots, x_n) + \varepsilon,$$

and therefore,

$$\tilde f(x_1, \ldots, x_n) = \max(A_1, \ldots, A_m) \le f(x_1, \ldots, x_n) + \varepsilon.$$

The second inequality is also proven.

So, both inequalities are true, and hence, $\tilde f$ is $\varepsilon$-close to $f$. The theorem is proven.
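The construction from the proof can be tried out on the function $f(x_1, x_2) = x_1 + x_2$ from Proposition 1. The sketch below is an illustration under stated assumptions: the name `make_three_layer_approximator` is ours, the caller supplies the minimum $m_f$ and a suitable $\delta$, and $T$ is assumed to be a multiple of $\delta$. For the sum of two variables, $|x_i - y_i| \le \delta$ implies $|f(x) - f(y)| \le 2\delta$, so $\delta = 0.2$ suffices for $\varepsilon = 0.4$.

```python
import itertools
import random


def make_three_layer_approximator(f, n, T, delta, m_f):
    """Build max_j min_i f_ji(x_i) on the grid of size delta in [-T, T]^n."""
    q_max = round(T / delta)
    axis = [q * delta for q in range(-q_max, q_max + 1)]
    # Each grid point P_j together with the value f(P_j).
    grid = [(p, f(*p)) for p in itertools.product(axis, repeat=n)]

    def f_ji(value_at_pj, x_ji, x_i):
        d = abs(x_i - x_ji)
        if d <= 0.6 * delta:
            return value_at_pj
        if d >= 0.7 * delta:
            return m_f
        # Linear interpolation between f(P_j) and m_f.
        return m_f + (value_at_pj - m_f) * (0.7 * delta - d) / (0.1 * delta)

    def f_tilde(*x):
        return max(
            min(f_ji(value, p[i], x[i]) for i in range(n))
            for p, value in grid
        )

    return f_tilde


f = lambda x1, x2: x1 + x2
epsilon, delta = 0.4, 0.2
f_tilde = make_three_layer_approximator(f, n=2, T=1.0, delta=delta, m_f=-2.0)

# Largest observed error on a random sample of points from [-1, 1]^2.
rng = random.Random(0)
sample = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
worst = max(abs(f_tilde(a, b) - f(a, b)) for a, b in sample)
```

On such a sample, `worst` stays below `epsilon`, as the theorem guarantees; the price is the number of grid points, which grows as $(2T/\delta)^n$.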

3 Fuzzy-Type Aggregation in Multi-criteria Decision Making: A Problem

A similar situation occurs in multi-criteria decision making. To describe the problem, let us briefly explain what multi-criteria decision making is about.

One of the main purposes of Artificial Intelligence in general is to incorporate a large part of human intelligent reasoning and decision-making into computer-based systems, so that the resulting intelligent computer-based systems help users in making rational decisions. In particular, to help a user make a decision among a large number of alternatives, an intelligent decision-making system should select a small number of these alternatives: the ones which are of the most potential interest to the user.

For example, with so many possible houses on the market, it is not realistically possible to have a potential buyer inspect all the houses sold in a given city. Instead, a good realtor tries to find out the buyer's preferences and only shows him or her houses that more or less fit these preferences. It would be great to have an automated system for making similar pre-selections.

To be able to make this selection, we must elicit the information about the user preferences.

In principle, we can get a full picture of the user preferences by asking the user to compare and/or rank all possible alternatives. Such a complete description of user preferences may sometimes be useful, but in decision making applications, such extensive question-asking defeats the whole purpose of intelligent decision-making systems, which is to avoid requiring that the user make a large number of comparisons.

The existing approach to this problem is called multi-criteria decision making (MCDM). The main idea behind this approach is that each alternative is characterized by the values of different parameters. For example, the buyer's selection of a house depends on the house's size, on its age, on its geographical location, on the number of bedrooms and bathrooms, etc. The idea is to elicit preferences corresponding to each of these parameters, and then to combine these single-parameter preferences into a reasonable model for describing the user's choice.

In the standard decision making theory, preferences are characterized by assigning, to each alternative, a numerical value called its utility. In these terms, the multi-criteria decision making approach means that we try to combine single-variable utility values $u_1(x_1), \ldots, u_n(x_n)$ characterizing the user's preferences over individual parameters $x_1, \ldots, x_n$ into a utility value $u(x_1, \ldots, x_n)$ that characterizes the utility of an alternative described by the values $(x_1, \ldots, x_n)$.

In the first approximation, it makes sense simply to add the individual utility values with appropriate weights, i.e., to consider linear aggregation

$$u(x_1, \ldots, x_n) = w_1 \cdot u_1(x_1) + \ldots + w_n \cdot u_n(x_n).$$

In many practical situations, linear aggregation works well, but in some cases, it leads to counterintuitive conclusions. For example, when selecting a house, a user can assign certain weights to all the parameters characterizing different houses, but the user may also have absolute limitations: e.g., a user with kids may want a house with at least two bedrooms, and no advantages in location and price would entice her to buy a one-bedroom house. To describe such reasonable preferences, we must therefore go beyond linear aggregation functions.
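The house example can be illustrated with a few lines of code. All the numbers below (weights and single-parameter utilities) are hypothetical values chosen only to show the effect; they do not come from any actual study.

```python
def linear_utility(weights, utilities):
    """Linear aggregation u = w_1 * u_1 + ... + w_n * u_n."""
    return sum(w * u for w, u in zip(weights, utilities))


# Hypothetical utilities for (bedrooms, location, price), each in [0, 1].
weights = [0.3, 0.4, 0.3]
one_bedroom_house = [0.0, 1.0, 1.0]   # unacceptable size, perfect otherwise
family_house = [0.8, 0.5, 0.5]        # acceptable on every parameter

u_one_bedroom = linear_utility(weights, one_bedroom_house)
u_family = linear_utility(weights, family_house)

# For comparison, a min-based aggregation lets the worst parameter dominate.
worst_one_bedroom = min(one_bedroom_house)
worst_family = min(family_house)
```

With these illustrative numbers, linear aggregation ranks the one-bedroom house higher, although the user would never buy it, while the min-based value ranks it lower.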

From the purely mathematical viewpoint, the inadequacy of a linear model is a particular example of a very typical situation. Often, when we describe the actual dependence between quantities in physics, chemistry, engineering, etc., a linear expression $y = c_0 + c_1 \cdot x_1 + \ldots + c_n \cdot x_n$ is a very good first approximation (at least locally), but to get a more accurate approximation, we must take non-linearity into account. In mathematical applications to physics, engineering, etc., there is a standard way to take non-linearity into account: if a linear approximation is not accurate enough, a natural idea is to use a quadratic approximation

$$y \approx a_0 + \sum_{i=1}^{n} c_i \cdot x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \cdot x_i \cdot x_j;$$

if the quadratic approximation is not sufficiently accurate, we can use a cubic approximation, etc.; see, e.g., [4].

At first glance, it seems reasonable to apply a similar idea to multi-criteria decision making and consider quadratic aggregation functions

$$u \stackrel{\rm def}{=} u(x_1, \ldots, x_n) = u_0 + \sum_{i=1}^{n} w_i \cdot u_i(x_i) + \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot u_i(x_i) \cdot u_j(x_j).$$

Surprisingly, in contrast to physics and engineering applications, quadratic approximations do not work as well as approximations based on the use of piece-wise linear functions, such as the OWA operation $u = w_1 \cdot u_{(1)} + \ldots + w_n \cdot u_{(n)}$, where $u_{(1)} = \max(u_1(x_1), \ldots, u_n(x_n))$ is the largest of the $n$ utility values $u_i(x_i)$, $u_{(2)}$ is the second largest, . . . , and $u_{(n)} = \min(u_1(x_1), \ldots, u_n(x_n))$ is the smallest of the $n$ utility values; see, e.g., [24].
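The OWA operation defined above takes only a few lines to implement. The following is a minimal sketch (the function name `owa` and the sample utility values are ours): the weights are applied to the utility values sorted in decreasing order, not to specific criteria.

```python
def owa(weights, utilities):
    """Ordered weighted average: w_1 * u_(1) + ... + w_n * u_(n),
    where u_(1) >= u_(2) >= ... >= u_(n) are the sorted utility values."""
    if len(weights) != len(utilities):
        raise ValueError("need exactly one weight per utility value")
    ordered = sorted(utilities, reverse=True)
    return sum(w * u for w, u in zip(weights, ordered))


utilities = [0.2, 0.9, 0.5]          # illustrative values u_i(x_i)
as_max = owa([1, 0, 0], utilities)   # all weight on the largest value
as_min = owa([0, 0, 1], utilities)   # all weight on the smallest value
as_mean = owa([1 / 3, 1 / 3, 1 / 3], utilities)
```

With all weight on the first position, OWA reduces to max; with all weight on the last position, it reduces to min; with equal weights, it is the arithmetic mean. This is how a single piece-wise linear family covers both the "absolute limitation" behavior and ordinary averaging.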


In our own research, we have applied OWA and we have also applied similar piece-wise linear operations (based on the so-called Choquet integral [5]), and we also got good results, better than with quadratic approximations; see, e.g., [2] and references therein. Similar results have been obtained by others. For quite some time, why piece-wise approximations are better than quadratic ones remained a mystery to us, and to many other researchers whom we asked this question. Now, we finally have an answer to this question, and this answer is presented in the current chapter.

Thus, the chapter provides a new justification of the use of piece-wise aggregation operations in multi-criteria decision making: a justification that explains why these aggregation operations are better than the (seemingly more natural) quadratic ones.

4 Standard Decision Making Theory: A Brief Reminder

To explain our answer to the long-standing puzzle, we need to recall the properties of utility functions. The needed properties are described in this section. Readers who are already familiar with the standard decision making theory (and with the corresponding properties of utility functions) can skip this section and proceed directly to the next one.

To be able to describe decisions, we must have a numerical scale for describing preferences. The traditional decision making theory (see, e.g., [7, 14, 20]) starts with an observation that such a scale can be naturally obtained by using probabilities. Specifically, to design this scale, we select two alternatives:

• a very negative alternative $A_0$; e.g., an alternative in which the decision maker loses all his money (and/or loses his health as well), and

• a very positive alternative $A_1$; e.g., an alternative in which the decision maker wins several million dollars.

Based on these two alternatives, we can, for every value $p \in [0, 1]$, consider a randomized alternative $L(p)$ in which we get $A_1$ with probability $p$ and $A_0$ with probability $1 - p$.

(It should be mentioned that in the standard decision making theory, randomized alternatives like $L(p)$ are also (somewhat misleadingly) called lotteries. This name comes from the fact that a lottery is one of the few real-life examples of randomized outcomes with known probabilities.)

In the two extreme cases $p = 0$ and $p = 1$, the randomized alternative $L(p)$ turns into one of the original alternatives: when $p = 1$, we get the favorable alternative $A_1$ (with probability 1), and when $p = 0$, we get the unfavorable alternative $A_0$. In general, the larger the probability $p$ of the favorable alternative $A_1$, the more preferable is the corresponding randomized alternative $L(p)$. Thus, the corresponding randomized alternatives ("lotteries") $L(p)$ form a continuous 1-D scale ranging from the very negative alternative $A_0$ to the very positive alternative $A_1$.


So, it is reasonable to gauge the preference of an arbitrary alternative $A$ by comparing it to different alternatives $L(p)$ from this scale until we find $A$'s place on this scale, i.e., the value $p \in [0, 1]$ for which, to this decision maker, the alternative $A$ is equivalent to $L(p)$: $L(p) \sim A$. This value is called the utility $u(A)$ of the alternative $A$ in the standard decision making theory.
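Since $L(p)$ becomes more preferable as $p$ grows, the place of $A$ on the scale can be found by bisection. The sketch below is one possible way to organize the comparisons, not a procedure prescribed by the standard theory; `prefers_lottery` is a hypothetical callback that stands for asking the decision maker whether he or she prefers $L(p)$ to $A$.

```python
def elicit_utility(prefers_lottery, tolerance=1e-3):
    """Approximate u(A) by bisection over the lottery probability p.

    prefers_lottery(p) must return True when the decision maker prefers
    the lottery L(p) to the alternative A, and False otherwise.
    """
    low, high = 0.0, 1.0  # L(0) = A_0 is not preferred, L(1) = A_1 is
    while high - low > tolerance:
        p = (low + high) / 2
        if prefers_lottery(p):
            high = p   # the indifference point lies below p
        else:
            low = p    # the indifference point lies at or above p
    return (low + high) / 2


# Simulated decision maker whose (unknown to the procedure) utility is 0.37.
true_utility = 0.37
estimate = elicit_utility(lambda p: p > true_utility)
```

Each comparison halves the interval, so about ten questions are enough to locate the utility with accuracy $10^{-3}$.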

In our definition, the numerical value of the utility depends on the selection of the alternatives A0 and A1: e.g., A0 is the alternative whose utility is 0 and A1 is the alternative whose utility is 1. What if we use a different set of alternatives, e.g., A′0 < A0 and A′1 > A1?

Let A be an arbitrary alternative between A0 and A1, and let u(A) be its utility with respect to A0 and A1. In other words, we assume that A is equivalent to the randomized alternative in which:

• we have A1 with probability u(A), and
• we have A0 with probability 1 − u(A).

In the scale defined by the new alternatives A′0 and A′1, let u′(A0), u′(A1), and u′(A) denote the utilities of A0, A1, and A. This means, in particular:

• that A0 is equivalent to the randomized alternative in which we get A′1 with probability u′(A0) and A′0 with probability 1 − u′(A0); and

• that A1 is equivalent to the randomized alternative in which we get A′1 with probability u′(A1) and A′0 with probability 1 − u′(A1).

Thus, the alternative A is equivalent to the compound randomized alternative, in which

• first, we select A1 or A0 with probabilities u(A) and 1 − u(A), and then
• depending on the first selection, we select A′1 with probability u′(A1) or u′(A0) – and A′0 with the remaining probability.

As the result of this two-stage process, we get either A′0 or A′1. The probability p of getting A′1 in this two-stage process can be computed by using the formula of total probability

p = u(A) · u′(A1) + (1 − u(A)) · u′(A0) = u(A) · (u′(A1) − u′(A0)) + u′(A0).

So, the alternative A is equivalent to a randomized alternative in which we get A′1 with probability p and A′0 with the remaining probability 1 − p. By definition of utility, this means that the utility u′(A) of the alternative A in the scale defined by A′0 and A′1 is equal to this value p:

u′(A) = u(A) · (u′(A1) − u′(A0)) + u′(A0).

So, changing the scale means a linear re-scaling of the utility values:

u(A) → u′(A) = λ · u(A) + b

for λ = u′(A1) − u′(A0) > 0 and b = u′(A0).
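As a quick numerical check of this re-scaling, the two-stage probability can be compared with λ · u(A) + b. The values chosen for u′(A0) and u′(A1) below are made up for illustration, and the function name `rescale` is ours.

```python
def rescale(u: float, u_new_a0: float, u_new_a1: float) -> float:
    """Utility of A in the new scale, via the two-stage probability formula.

    u         -- utility u(A) in the scale defined by A0 and A1
    u_new_a0  -- u'(A0), utility of A0 in the scale defined by A'0 and A'1
    u_new_a1  -- u'(A1), utility of A1 in the scale defined by A'0 and A'1
    """
    return u * u_new_a1 + (1 - u) * u_new_a0


# Illustrative (assumed) values of u'(A0) and u'(A1).
u_new_a0, u_new_a1 = 0.2, 0.9
lam = u_new_a1 - u_new_a0  # the slope lambda
b = u_new_a0               # the shift b

for u in (0.0, 0.25, 0.5, 1.0):
    # The two-stage formula and the linear form agree up to rounding.
    assert abs(rescale(u, u_new_a0, u_new_a1) - (lam * u + b)) < 1e-12
```

In particular, u = 0 maps to u′(A0) and u = 1 maps to u′(A1), as expected from the definition.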


Vice versa, for every λ > 0 and b, one can find appropriate events A′0 and A′1 for which the re-scaling has exactly these values λ and b. In other words, utility is defined modulo an arbitrary (increasing) linear transformation.

The last important aspect of the standard decision making theory is its description of the results of different actions. Suppose that an action leads to alternatives a1, . . . , am with probabilities p1, . . . , pm. We can assume that we have already determined the utility ui = u(ai) of each of the alternatives a1, . . . , am. By definition of the utility, this means that for each i, the alternative ai is equivalent to the randomized alternative L(ui) in which we get A1 with probability ui and A0 with probability 1 − ui. Thus, the results of the action are equivalent to the two-stage process in which, with the probability pi, we select a randomized alternative L(ui). In this two-stage process, the results are either A1 or A0. The probability p of getting A1 in this two-stage process can be computed by using the formula of total probability: p = p1 · u1 + . . . + pm · um. Thus, the action is equivalent to a randomized alternative in which we get A1 with probability p and A0 with the remaining probability 1 − p. By definition of utility, this means that the utility u of the action in question is equal to

u = p1 · u1 + . . . + pm · um.

In statistics, the right-hand side of this formula is known as the expected value. Thus, we can conclude that the utility of each action with different possible alternatives is equal to the expected value of the utility.
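A minimal sketch of this expected-utility rule, used to compare two actions. The action names, probabilities, and utilities are invented for illustration only.

```python
def expected_utility(probs, utils):
    """Return p1*u1 + ... + pm*um for one action."""
    if len(probs) != len(utils):
        raise ValueError("probs and utils must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in zip(probs, utils))


# Hypothetical actions: (probabilities of outcomes, utilities of outcomes).
actions = {
    "safe": ([1.0], [0.6]),
    "risky": ([0.5, 0.5], [1.0, 0.1]),
}

# The preferred action is the one with the largest expected utility.
best = max(actions, key=lambda name: expected_utility(*actions[name]))
```

Here the risky action has expected utility 0.5 · 1.0 + 0.5 · 0.1 = 0.55, so the safe action with utility 0.6 is selected.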

5 Why Quadratic Aggregation Operations Are Less Adequate Than OWA and Choquet Operations: An Explanation

To adequately describe the decision maker’s preferences, we must be able, given an alternative characterized by n parameters x1, . . . , xn, to describe the utility u(x1, . . . , xn) of this alternative. To get a perfect description of the user’s preference, we must elicit such a utility value for all possible combinations of parameters. As we have mentioned in the Introduction, for practical values n, it is not realistic to elicit that many utility values from a user. So, instead, we elicit the user’s preference over each of the parameters xi, and then aggregate the resulting utility values ui(xi) into an approximation for u(x1, . . . , xn): u(x1, . . . , xn) ≈ f(u1, . . . , un), where $u_i \stackrel{\text{def}}{=} u_i(x_i)$.

We have also mentioned that in the first approximation, linear aggregation operations $f(u_1, \ldots, u_n) = a_0 + \sum_{i=1}^{n} w_i \cdot u_i$ work well, but to get a more adequate representation of the user’s preferences, we must go beyond linear functions. From the purely mathematical viewpoint, it may seem that quadratic functions f(u1, . . . , un) should provide a reasonable next approximation, but in practice, piece-wise linear aggregation operations such as OWA (or Choquet integral) provide a much more adequate description of expert preferences.

For example, for two parameters, the general OWA combination of two utility values has the form

f(u1, u2) = w1 · min(u1, u2) + w2 · max(u1, u2).

Similarly, the general OWA combination of three utility values has the form

f(u1, u2, u3) = w1 · min(u1, u2, u3) + w2 · max(min(u1, u2), min(u1, u3), min(u2, u3)) + w3 · max(u1, u2, u3).
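The two formulas above are instances of one rule: sort the utility values and take a weighted sum of the sorted list. The sketch below checks this for three values; the function names and the sample weights are ours, and the ascending order (w1 multiplies the smallest value) follows the convention of the two formulas above.

```python
def owa(weights, values):
    """Ordered weighted average; weights[0] multiplies the smallest value."""
    ordered = sorted(values)
    return sum(w * v for w, v in zip(weights, ordered))


def owa3_minmax(w1, w2, w3, u1, u2, u3):
    """The three-value OWA written explicitly with min and max."""
    # The middle value is the largest of the pairwise minima.
    middle = max(min(u1, u2), min(u1, u3), min(u2, u3))
    return w1 * min(u1, u2, u3) + w2 * middle + w3 * max(u1, u2, u3)


by_sorting = owa([0.2, 0.5, 0.3], [0.7, 0.1, 0.4])
by_minmax = owa3_minmax(0.2, 0.5, 0.3, 0.7, 0.1, 0.4)
```

Both expressions evaluate to 0.2 · 0.1 + 0.5 · 0.4 + 0.3 · 0.7 = 0.43 for these sample numbers.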

Let us show that this seemingly mysterious advantage of non-quadratic aggregation operations can be explained based on the main properties of the utility functions.

Indeed, as we have mentioned in Section 2, the utility is defined modulo two types of transformations: changing a starting point u → u + b and changing a scale u → λ · u for some λ > 0. It is therefore reasonable to require that the aggregation operation should not depend on which “unit” (i.e., which extreme event A1) we use to describe utility. Let us describe this requirement in precise terms.

In the original scale,

• we start with utility values u1, . . . , un;
• to these values, we apply the aggregation operation f(u1, . . . , un) and get the resulting overall utility u = f(u1, . . . , un).

On the other hand,

• we can express the same utility values in a new scale, as u′1 = λ · u1, . . . , u′n = λ · un;

• then, we use the same aggregation function to combine the new utility values; as a result, we get the resulting overall utility u′ = f(u′1, . . . , u′n).

Substituting the expressions u′i = λ · ui into this formula, we conclude that u′ = f(λ · u1, . . . , λ · un). We require that the utility

u′ = f(u′1, . . . , u′n) = f(λ · u1, . . . , λ · un)

reflect the same degree of preference as the utility u = f(u1, . . . , un) but in a different scale: u′ = λ · u, i.e.,

f(λ · u1, . . . , λ · un) = λ · f(u1, . . . , un).

It is worth mentioning that in mathematics, such functions are called homogeneous (of first degree). So, we arrive at the conclusion that an adequate aggregation operation should be homogeneous.

This conclusion explains the above mysterious fact: as we show later in this section, a generic quadratic operation is not homogeneous. On the other hand, one can show that linear aggregation operations and piece-wise linear aggregation operations like OWA are scale-invariant.


Let us start with a linear aggregation operation

f(u1, . . . , un) = w1 · u1 + . . . + wn · un.

For this operation, we get

f(λ · u1, . . . , λ · un) = w1 · (λ · u1) + . . . + wn · (λ · un) =

λ · (w1 · u1 + . . . + wn · un) = λ · f(u1, . . . , un).

Let us now consider the OWA aggregation operation

f(u1, . . . , un) = w1 · u(1) + . . . + wn · u(n),

where u(1) is the largest of n values u1, . . . , un, u(2) is the second largest, etc. If we multiply all the utility values ui by the same constant λ > 0, their order does not change. In particular, this means that the same value u(1) which was the largest in the original scale is the largest in the new scale as well. Thus, its numerical value u′(1) can be obtained by re-scaling u(1): u′(1) = λ · u(1). Similarly, the same value u(2) which was the second largest in the original scale is the second largest in the new scale as well. Thus, its numerical value u′(2) can be obtained by re-scaling u(2): u′(2) = λ · u(2), etc. So, we have u′(i) = λ · u(i) for all i. Thus, for the OWA aggregation operation, we have

f(λ · u1, . . . , λ · un) = w1 · u′(1) + . . . + wn · u′(n) = w1 · (λ · u(1)) + . . . + wn · (λ · u(n)) = λ · (w1 · u(1) + . . . + wn · u(n)) = λ · f(u1, . . . , un).
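The two derivations above can be spot-checked numerically. The sketch below is a randomized test, not a proof: the helper `looks_homogeneous` and the sample weights are ours, and passing the test only means that no counterexample was found among the sampled points.

```python
import random


def linear(weights, u):
    """Linear aggregation w1*u1 + ... + wn*un."""
    return sum(w * x for w, x in zip(weights, u))


def owa(weights, u):
    """OWA aggregation; weights[0] multiplies the largest value u(1)."""
    ordered = sorted(u, reverse=True)
    return sum(w * x for w, x in zip(weights, ordered))


def looks_homogeneous(f, n, trials=200, tol=1e-9, seed=0):
    """Check f(lam*u) == lam*f(u) on random inputs with lam > 0."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.uniform(0.0, 1.0) for _ in range(n)]
        lam = rng.uniform(0.1, 10.0)
        if abs(f([lam * x for x in u]) - lam * f(u)) > tol:
            return False
    return True


weights = [0.5, 0.3, 0.2]  # illustrative weights
linear_ok = looks_homogeneous(lambda u: linear(weights, u), 3)
owa_ok = looks_homogeneous(lambda u: owa(weights, u), 3)
```

Both checks pass for these weights, in line with the algebra above.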

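On the other hand, a generic quadratic operation is not homogeneous. Indeed, a general quadratic operation has the form

$$f(u_1, \ldots, u_n) = \sum_{i=1}^{n} w_i \cdot u_i + \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot u_i \cdot u_j.$$

Here,

$$f(\lambda u_1, \ldots, \lambda u_n) = \sum_{i=1}^{n} w_i \cdot (\lambda \cdot u_i) + \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (\lambda \cdot u_i) \cdot (\lambda \cdot u_j) = \lambda \cdot \sum_{i=1}^{n} w_i \cdot u_i + \lambda^2 \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot u_i \cdot u_j.$$

On the other hand,

$$\lambda \cdot f(u_1, \ldots, u_n) = \lambda \cdot \sum_{i=1}^{n} w_i \cdot u_i + \lambda \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot u_i \cdot u_j.$$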


The linear terms in the expressions f(λu1, . . . , λun) and λ · f(u1, . . . , un) coincide, but the quadratic terms differ: the quadratic term in f(λu1, . . . , λun) differs from the quadratic term in λ · f(u1, . . . , un) by a factor of λ. Thus, the only possibility to satisfy the scale-invariance (homogeneity) requirement for all λ is to have these differing quadratic terms equal to 0, i.e., to have wij = 0 – but in this case the aggregation operation is linear. So, quadratic operations are indeed not homogeneous – which explains why they are less adequate in describing user’s preferences than homogeneous operations like OWA or Choquet integral.
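The mismatch between the two quadratic terms can be seen on a small numerical example. The weights and utility values below are invented for illustration; the gap f(λu) − λ · f(u) should equal (λ² − λ) times the quadratic part.

```python
def quadratic(w, W, u):
    """Quadratic aggregation: sum_i w[i]*u[i] + sum_ij W[i][j]*u[i]*u[j]."""
    n = len(u)
    linear_part = sum(w[i] * u[i] for i in range(n))
    quadratic_part = sum(W[i][j] * u[i] * u[j]
                         for i in range(n) for j in range(n))
    return linear_part + quadratic_part


# Illustrative (assumed) weights and utilities.
w = [0.5, 0.5]
W = [[0.0, 0.1],
     [0.1, 0.0]]
u = [0.4, 0.8]
lam = 2.0

lhs = quadratic(w, W, [lam * x for x in u])  # f(lam * u)
rhs = lam * quadratic(w, W, u)               # lam * f(u)
gap = lhs - rhs                              # (lam**2 - lam) * quadratic part

# With all w_ij = 0 the operation is linear and the gap disappears.
zero_W = [[0.0, 0.0], [0.0, 0.0]]
gap_linear = (quadratic(w, zero_W, [lam * x for x in u])
              - lam * quadratic(w, zero_W, u))
```

For these numbers the quadratic part is 0.064, so the gap is (4 − 2) · 0.064 = 0.128, and it vanishes once the quadratic weights are set to zero.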

6 OWA and Choquet Operations Are, in Some Reasonable Sense, the Most General Ones: A New Result

In the previous section, we explained the empirical fact that in multi-criteria decision making, OWA and Choquet operations lead to more adequate results than seemingly natural quadratic aggregation operations. The explanation is that, due to the known properties of the utility, it is reasonable to require that aggregation operation be scale-invariant (homogeneous); OWA and Choquet operations are scale-invariant but quadratic operations are not.

However, in principle, OWA and Choquet operations are just a few examples of scale-invariant operations, so by itself, the above result does not explain why OWA and Choquet operations are so successful and not any other scale-invariant operation. In this section, we give such an explanation.

This explanation is based on the fact that OWA and Choquet operations are compositions of linear functions, min, and max. In this section, we prove that, crudely speaking, every scale-invariant operation can be composed of linear functions and min and max operations.

Definition 2. A function f(x1, . . . , xn) is called homogeneous if for every x1, . . . , xn and for every λ > 0, we have f(λ · x1, . . . , λ · xn) = λ · f(x1, . . . , xn).

Definition 3. By a basic function, we mean one of the following functions:

• a linear function f(x1, . . . , xn) = w1 · x1 + . . . + wn · xn;
• a minimum function f(x1, . . . , xn) = min(xi1, . . . , xim); and
• a maximum function f(x1, . . . , xn) = max(xi1, . . . , xim).

We also say that basic functions are 1-level compositions of basic functions. We say that a function f(x1, . . . , xn) is a k-level composition of basic functions if f(x1, . . . , xn) = g(h1(x1, . . . , xn), . . . , hm(x1, . . . , xn)), where g is a basic function, and the functions h1(x1, . . . , xn), . . . , hm(x1, . . . , xn) are (k − 1)-level compositions of basic functions.
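Definition 3 can be mirrored by a few small combinators. This is a sketch with names of our choosing (`linear`, `minimum`, `maximum`, `compose`); indices are 0-based, and a lower-level function is treated as usable at a higher level, on the assumption that it can be wrapped in a trivial one-argument min or max.

```python
from typing import Callable, Sequence

Func = Callable[[Sequence[float]], float]


def linear(weights: Sequence[float]) -> Func:
    """Basic function: w1*x1 + ... + wn*xn."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def minimum(indices: Sequence[int]) -> Func:
    """Basic function: min over the selected variables."""
    return lambda x: min(x[i] for i in indices)


def maximum(indices: Sequence[int]) -> Func:
    """Basic function: max over the selected variables."""
    return lambda x: max(x[i] for i in indices)


def compose(g: Func, hs: Sequence[Func]) -> Func:
    """Build g(h1(x), ..., hm(x)), adding one level of composition."""
    return lambda x: g([h(x) for h in hs])


# 2-level: OWA of two values is a linear function of min and max.
owa2 = compose(linear([0.3, 0.7]), [minimum([0, 1]), maximum([0, 1])])

# 2-level: the middle of three values is a max of pairwise minima.
middle3 = compose(maximum([0, 1, 2]),
                  [minimum([0, 1]), minimum([0, 2]), minimum([1, 2])])

# 3-level: OWA of three values is linear in (min, middle, max).
owa3 = compose(linear([0.2, 0.5, 0.3]),
               [minimum([0, 1, 2]), middle3, maximum([0, 1, 2])])
```

Because every building block is homogeneous and `compose` only nests them, each function built this way satisfies f(λ · x) = λ · f(x) for λ > 0; for instance, doubling the inputs of `owa3` doubles its output.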

By induction over k, one can easily prove that all compositions of basic functions are homogeneous. For example:

• a linear combination is a basic function;
• an OWA combination of two values is a 2-level composition of basic functions;
• a general OWA operation is a 3-level composition of basic functions.

It turns out that an arbitrary homogeneous function can be approximated by appropriate 3-level compositions.

Definition 4. Let k > 0 be a positive integer. We say that k-level compositions have a universal approximation property for homogeneous functions if for every continuous homogeneous function f(x1, . . . , xn), and for every two numbers ε > 0 and Δ > 0, there exists a function f̃(x1, . . . , xn) which is a k-level composition of basic functions and for which

|f(x1, . . . , xn) − f̃(x1, . . . , xn)| ≤ ε

for all x1, . . . , xn for which |xi| ≤ Δ for all i.

Theorem 2. 3-level compositions have a universal approximation property for homogeneous functions.

A natural question is: do we need that many levels of composition? What if we only use 1- or 2-level compositions? It turns out that in this case, we will not get the universal approximation property – and thus, the 3 levels of OWA operations is the smallest possible number.

Theorem 3

• 1-level compositions do not have a universal approximation property for homogeneous functions;

• 2-level compositions do not have a universal approximation property for homogeneous functions.

Comment. A natural question is: why should we select linear functions, min, and max as basic functions? One possible answer is that these operations are the fastest to compute, i.e., they require the smallest possible number of computational steps.

Indeed, the fastest computer operations are the ones which are hardware supported, i.e., the ones for which the hardware has been optimized. In modern computers, the hardware supported operations with numbers include elementary arithmetic operations (+, −, ·, /, etc.), and operations min and max.

In the standard (digital) computer (see, e.g., [3]):

• addition of two n-bit numbers requires, in the worst case, 2n bit operations: n to add corresponding digits, and n to add carries;

• multiplication, in the worst case, means n additions – by each bit of the second factor; so, we need O(n²) bit operations;

• division is usually performed by trying several multiplications, so it takes even longer than multiplication;

• finally, min and max can be performed bit-wise and thus, require only n bit operations.


Thus, the fastest elementary operations are indeed addition (or, more generally, linear combination), min, and max.

Proof of Theorems 2 and 3

1°. Before we start proving, let us notice that the values of the functions min(xi1, . . . , xim) and max(xi1, . . . , xim) depend on the order between the values x1, . . . , xn. There are n! possible orders, so we can divide the whole n-dimensional space of all possible tuples (x1, . . . , xn) into n! zones corresponding to these different orders.

2°. In each zone, a basic function is linear:

• a linear function is, of course, linear;
• a minimizing function min(xi1, . . . , xim) is simply equal to the variable xik which is the smallest in this zone and is, thus, linear;
• a maximizing function max(xi1, . . . , xim) is simply equal to the variable xik which is the largest in this zone and is, thus, also linear.
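The division into zones can be made concrete for n = 3. In this sketch (function name and sample point are ours, indices are 0-based), a zone is identified with the permutation that sorts the coordinates in ascending order; inside a zone, min and max each reduce to one fixed coordinate.

```python
from itertools import permutations


def zone(x):
    """Return the permutation of indices that sorts x in ascending order."""
    return tuple(sorted(range(len(x)), key=lambda i: x[i]))


n = 3
zones = list(permutations(range(n)))  # all n! = 6 possible orders

x = (0.7, 0.1, 0.4)
order = zone(x)  # the order x[1] < x[2] < x[0]

# Inside this zone, min and max are single coordinates, hence linear.
smallest = x[order[0]]
largest = x[order[-1]]
```

Any other point with the same ordering of coordinates falls into the same zone, so on that zone min(x1, x2, x3) is the same coordinate function throughout.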

3°. If a function f(x1, . . . , xn) can be approximated, with arbitrary accuracy, by functions from a certain class, this means that f(x1, . . . , xn) is a limit of functions from this class.

4°. Basic functions are linear in each zone; thus, their limits are also linear in each zone. Since some homogeneous functions are non-linear, we can thus conclude that basic functions do not have a universal approximation property for homogeneous functions.

5°. Let us now consider 2-level compositions of basic functions, i.e., functions of the type f(x1, . . . , xn) = g(h1(x1, . . . , xn), . . . , hm(x1, . . . , xn)), where g and hi are basic functions.

Since there are three types of basic functions, we have three options:

• it is possible that g(x1, . . . , xm) is a linear function;
• it is possible that g(x1, . . . , xm) is a minimizing function; and
• it is possible that g(x1, . . . , xm) is a maximizing function.

Let us consider these three options one by one.

5.1°. Let us start with the first option, when g(x1, . . . , xm) is a linear function. Since on each zone, each basic function hi is also linear, the composition f(x1, . . . , xn) is linear on each zone.

5.2°. If g(x1, . . . , xm) is a minimizing function, then on each zone, each hi is linear and thus, the composition f(x1, . . . , xn) is a minimum of linear functions. It is known that minima of linear functions are concave; see, e.g., [21]. So, within this option, the function f(x1, . . . , xn) is concave.

5.3°. If g(x1, . . . , xm) is a maximizing function, then on each zone, each hi is linear and thus, the composition f(x1, . . . , xn) is a maximum of linear functions. It is known that maxima of linear functions are convex; see, e.g., [21]. So, within this option, the function f(x1, . . . , xn) is convex.


6°. In each zone, 2-level compositions of basic functions are linear, concave, or convex. The class of all functions approximable by such 2-level compositions is the class of limits (closure) of the union of the corresponding three classes: of linear, concave, and convex functions. It is known that the closure of the finite union is the union of the corresponding closures. A limit of linear functions is always linear, a limit of concave functions is concave, and a limit of convex functions is convex. Thus, by using 2-level compositions, we can only approximate linear, concave, or convex functions. Since there exist homogeneous functions which are neither linear, nor concave, nor convex, we can thus conclude that 2-level compositions are not universal approximators for homogeneous functions.

7°. To complete the proof, we must show that 3-level compositions are universal approximators for homogeneous functions. There are two ways to prove it.

7.1°. First, we can use the known facts about concave and convex functions [21]:

• that every continuous function on a bounded area can be represented as a difference between two convex functions, and

• that every convex function can be represented as a maximum of linear functions – namely, all the linear functions which are smaller than this function.

These facts are true for general (not necessarily homogeneous) functions. For homogeneous functions f(x1, . . . , xn), one can easily modify the existing proofs and show:

• that every homogeneous continuous function on a bounded area can be represented as a difference between two convex homogeneous functions, and

• that every homogeneous convex function can be represented as a maximum of homogeneous linear functions – namely, all the homogeneous linear functions which are smaller than this function.

Thus, we can represent the desired function f(x1, . . . , xn) as the difference between two convex homogeneous functions f(x1, . . . , xn) = f1(x1, . . . , xn) − f2(x1, . . . , xn). Each of these convex functions can be approximated by maxima of linear functions and thus, by 2-level compositions. Subtraction f1 − f2 adds the third level, so f(x1, . . . , xn) can indeed be approximated by 3-level compositions.
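A small example of such a representation, chosen by us for illustration, is f(x1, x2) = |x1| − |x2|. Each absolute value is a maximum of two linear functions, so f is a difference of two convex homogeneous functions and hence a 3-level composition. The sketch also checks, at hand-picked points, that this f is itself neither convex nor concave.

```python
def f1(x):
    """|x1| written as a maximum of two linear functions."""
    return max(x[0], -x[0])


def f2(x):
    """|x2| written as a maximum of two linear functions."""
    return max(x[1], -x[1])


def f(x):
    """Difference of two convex homogeneous functions: |x1| - |x2|."""
    return f1(x) - f2(x)


def midpoint(a, b):
    return [(p + q) / 2 for p, q in zip(a, b)]


# Convexity would require f(midpoint) <= average of the endpoint values.
a, b = [0.0, 1.0], [0.0, -1.0]
not_convex = f(midpoint(a, b)) > (f(a) + f(b)) / 2

# Concavity would require f(midpoint) >= average of the endpoint values.
c, d = [1.0, 0.0], [-1.0, 0.0]
not_concave = f(midpoint(c, d)) < (f(c) + f(d)) / 2

# Homogeneity: scaling the input by lam > 0 scales the output by lam.
lam, x = 3.0, [0.5, -2.0]
homogeneous_here = abs(f([lam * v for v in x]) - lam * f(x)) < 1e-12
```

In the first pair the midpoint value 0 exceeds the average −1, and in the second pair the midpoint value 0 is below the average 1, so both flags come out true.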

To prove that a function f(x1, . . . , xn) can be represented as a difference between two convex functions, we can, e.g., first approximate it by a homogeneous function which is smooth on a unit sphere

{(x1, . . . , xn) : x1² + . . . + xn² = 1},

and then take f1(x1, . . . , xn) = k · √(x1² + . . . + xn²) for a large k. For smooth functions, convexity means that the Hessian matrix – consisting of its second derivatives ∂²f/∂xi∂xj – is positive semidefinite.

For sufficiently large k, the difference

f2(x1, . . . , xn) = f1(x1, . . . , xn) − f(x1, . . . , xn)

is also convex – since its second derivatives matrix is dominated by positive definite terms coming from f1. Thus, the difference f1 − f2 = f is indeed the desired difference.
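The domination argument can be made explicit for the specific choice of f1 as k times the Euclidean norm. The derivation below is a sketch added for illustration; the symbols r, δij, and v are our notation and do not come from the original proof.

```latex
% Notation (ours): r = \sqrt{x_1^2 + \ldots + x_n^2}, \delta_{ij} is the Kronecker delta.
\frac{\partial f_1}{\partial x_i} = k \cdot \frac{x_i}{r},
\qquad
\frac{\partial^2 f_1}{\partial x_i \, \partial x_j}
  = \frac{k}{r} \cdot \left( \delta_{ij} - \frac{x_i \cdot x_j}{r^2} \right).

% Quadratic form of the Hessian on an arbitrary vector v = (v_1, \ldots, v_n):
\sum_{i=1}^{n} \sum_{j=1}^{n}
  \frac{\partial^2 f_1}{\partial x_i \, \partial x_j} \cdot v_i \cdot v_j
  = \frac{k}{r} \cdot \left( \|v\|^2 - \frac{(x \cdot v)^2}{r^2} \right) \ge 0.
```

By the Cauchy-Schwarz inequality, this form vanishes only when v is parallel to x. In that radial direction the second derivatives of every smooth homogeneous function of first degree also vanish (differentiate Euler's identity x1 · ∂f/∂x1 + . . . + xn · ∂f/∂xn = f), so only directions orthogonal to x matter, and there the form equals (k/r) · ‖v‖², which grows without bound as k increases.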

7.2°. Another, more constructive proof, is, for some δ′ > 0, to select a finite δ′-dense set of points e = (e1, . . . , en) on a unit sphere. For each such point, we build a 2-level composition which coincides with f on the corresponding ray {λ · (e1, . . . , en) : λ > 0}. This function can be obtained, e.g., as a minimum of several linear functions which have the right value on this ray but change drastically immediately outside this ray.

For example, let f0(x) be an arbitrary homogeneous linear function which coincides with f(x) at the point e – and thus, on the whole ray. To construct the corresponding linear functions, we can expand the vector e to an orthonormal basis e, e′, e′′, etc., and take linear functions f0(x) + k · (e′ · x) and f0(x) − k · (e′ · x) for all such e′ (and for a large k > 0). Then, the minimum of all these functions is very small outside the ray.

We then take the maximum of all these minima – a 3-level composition.

The function f(x1, . . . , xn) is continuous on a unit sphere and thus, uniformly continuous on it, i.e., for every ε > 0, there is a δ such that δ-close points on the unit sphere lead to ε-close values of f. By selecting appropriate δ′ and k (depending on δ), we can show that the resulting maximum is indeed ε-close to f.

The theorem is proven.

7 Conclusions

Fuzzy techniques have been originally invented as a methodology that transforms the knowledge of experts (formulated in terms of natural language) into a precise computer-implementable form. There are many successful applications of this methodology to situations in which expert knowledge exists; the most well known (and most successful) are applications to fuzzy control.

In some cases, fuzzy methodology is applied even when no expert knowledge exists. In such cases, instead of trying to approximate the unknown control function by splines, polynomials, or by any other traditional approximation technique, researchers try to approximate it by guessing and tuning the expert rules. Surprisingly, this approximation often works fine.

Similarly, in multi-criteria decision making, it is necessary to aggregate (combine) utility values corresponding to several criteria (parameters). The simplest way to combine these values is to use linear aggregation. In many practical situations, however, linear aggregation does not fully adequately describe the actual decision making process, so non-linear aggregation is needed. From the purely mathematical viewpoint, the next natural step after linear functions is the use of quadratic functions. However, in decision making, a different type of non-linearity is usually more adequate than quadratic ones: fuzzy-type non-linearities like OWA or Choquet integral that use min and max in addition to linear combinations.

In this chapter, we give a mathematical explanation for this empirical phenomenon. Specifically, we show that approximation by using fuzzy methodology is indeed the best (in some reasonable sense).

Acknowledgments

This work was supported in part by NSF grants HRD-0734825, EAR-0225670, and EIA-0080940, by Texas Department of Transportation grant No. 0-5453, by the Japan Advanced Institute of Science and Technology (JAIST) International Joint Research Grant 2006-08, and by the Max Planck Institut für Mathematik.

References

1. Buckley, J.J.: Sugeno type controllers are universal controllers. Fuzzy Sets and Systems 53, 299–303 (1993)

2. Ceberio, M., Modave, F.: An interval-valued, 2-additive Choquet integral for multi-criteria decision making. In: Proceedings of the 10th Conf. on Information Processing and Management of Uncertainty in Knowledge-based Systems IPMU 2004, Perugia, Italy (July 2004)

3. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)

4. Feynman, R., Leighton, R., Sands, M.: Feynman Lectures on Physics. Addison Wesley, Reading (2005)

5. Grabisch, M., Murofushi, T., Sugeno, M. (eds.): Fuzzy Measures and Integrals. Physica-Verlag, Heidelberg (2000)

6. Kandel, A., Langholtz, G. (eds.): Fuzzy Control Systems. CRC Press, Boca Raton (1994)

7. Keeney, R.L., Raiffa, H.: Decisions with Multiple Objectives. John Wiley and Sons, New York (1976)

8. Klir, G., Yuan, B.: Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall, Upper Saddle River (1995)

9. Kosko, B.: Fuzzy systems as universal approximators. In: Proceedings of the 1st IEEE International Conference on Fuzzy Systems, San Diego, CA, pp. 1153–1162 (1992)

10. Kreinovich, V., Bernat, A.: Parallel algorithms for interval computations: an introduction. Interval Computations 3, 6–62 (1994)

11. Kreinovich, V., Mouzouris, G.C., Nguyen, H.T.: Fuzzy rule based modeling as a universal approximation tool. In: Nguyen, H.T., Sugeno, M. (eds.) Fuzzy Systems: Modeling and Control, pp. 135–195. Kluwer, Boston (1998)

12. Kreinovich, V., Nguyen, H.T., Yam, Y.: Fuzzy systems are universal approximators for a smooth function and its derivatives. International Journal of Intelligent Systems 15(6), 565–574 (2000)

13. Lea, R.N., Kreinovich, V.: Intelligent control makes sense even without expert knowledge: an explanation. In: Reliable Computing. Supplement, Extended Abstracts of APIC 1995: International Workshop on Applications of Interval Computations, El Paso, TX, February 23–25, pp. 140–145 (1995)

14. Luce, R.D., Raiffa, H.: Games and Decisions: Introduction and Critical Survey. Dover, New York (1989)

15. Modave, F., Ceberio, M., Kreinovich, V.: Choquet integrals and OWA criteria as a natural (and optimal) next step after linear aggregation: a new general justification. In: Proceedings of the 7th Mexican International Conference on Artificial Intelligence MICAI 2008, Mexico City, Mexico, October 27–31 (to appear) (2008)

16. Nguyen, H.T., Kreinovich, V.: On approximation of controls by fuzzy systems. In: Proceedings of the Fifth International Fuzzy Systems Association World Congress, Seoul, Korea, pp. 1414–1417 (July 1993)

17. Nguyen, H.T., Kreinovich, V.: Fuzzy aggregation techniques in situations without experts: towards a new justification. In: Proceedings of the IEEE Conference on Foundations of Computational Intelligence FOCI 2007, Hawaii, April 1–5, pp. 440–446 (2007)

18. Nguyen, H.T., Walker, E.A.: A first course in fuzzy logic. CRC Press, Boca Raton (2005)

19. Perfilieva, I., Kreinovich, V.: A new universal approximation result for fuzzy systems, which reflects CNF-DNF duality. International Journal of Intelligent Systems 17(12), 1121–1130 (2002)

20. Raiffa, H.: Decision Analysis. Addison-Wesley, Reading (1970)

21. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

22. Wang, L.-X.: Fuzzy systems are universal approximators. In: Proceedings of the IEEE International Conference on Fuzzy Systems, San Diego, CA, pp. 1163–1169 (1992)

23. Wang, L.-X., Mendel, J.: Generating Fuzzy Rules from Numerical Data, with Applications, University of Southern California, Signal and Image Processing Institute, Technical Report USC-SIPI # 169 (1991)

24. Yager, R.R., Kacprzyk, J. (eds.): The Ordered Weighted Averaging Operators: Theory and Applications. Kluwer, Norwell (1997)

25. Yager, R.R., Kreinovich, V.: Universal approximation theorem for uninorm-based fuzzy systems modeling. Fuzzy Sets and Systems 140(2), 331–339 (2003)


Intermediate Degrees Are Needed for the World to Be Cognizable: Towards a New Justification for Fuzzy Logic Ideas

Hung T. Nguyen1, Vladik Kreinovich2, J. Esteban Gamez2, Francois Modave2, and Olga Kosheleva3

1 Department of Mathematical Sciences, New Mexico State University,Las Cruces, NM 88003, [email protected]

2 Department of Computer Science, University of Texas at El Paso,El Paso, TX 79968, [email protected], [email protected], [email protected]

3 Department of Teacher Education, University of Texas at El Paso,El Paso, TX 79968, [email protected]

Summary. Most traditional examples of fuzziness come from the analysis of commonsense reasoning. When we reason, we use words from natural language like “young”, “well”. In many practical situations, these words do not have a precise true-or-false meaning, they are fuzzy. One may therefore be left with an impression that fuzziness is a subjective characteristic, that it is caused by the specific way our brains work. However, the fact that we are the result of billions of years of successful adjusting-to-the-environment evolution makes us conclude that everything about us humans is not accidental. In particular, the way we reason is not accidental; this way must reflect some real-life phenomena – otherwise, this feature of our reasoning would have been useless and would have been abandoned long ago. In other words, the fuzziness in our reasoning must have an objective explanation – in fuzziness of the real world. In this chapter, we first give examples of objective real-world fuzziness. After these examples, we provide an explanation of this fuzziness – in terms of cognizability of the world.

1 Introduction

One of the main ideas behind Zadeh’s fuzzy logic and its applications is that everything is a matter of degree. We are often accustomed to think that every statement about a physical world is true or false:

• that an object is either a particle or a wave,
• that a person is either young or not,
• that a person is either well or ill,

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 53–74. springerlink.com © Springer-Verlag Berlin Heidelberg 2009


but in reality, we sometimes encounter intermediate situations. That everything is a matter of degree is a convincing empirical fact, but a natural question is: why? How can we explain this fact?

This is what we will try to do in this chapter: come up with a theoretical explanation of this empirical fact. Most traditional examples of fuzziness come from the analysis of commonsense reasoning. When we reason, we use words from natural language like “young”, “well”. In many practical situations, these words do not have a precise true-or-false meaning, they are fuzzy. One may therefore be left with an impression that fuzziness is a subjective characteristic, that it is caused by the specific way our brains work. However, the fact that we are the result of billions of years of successful adjusting-to-the-environment evolution makes us conclude that everything about us humans is not accidental. In particular, the way we reason is not accidental; this way must reflect some real-life phenomena – otherwise, this feature of our reasoning would have been useless and would have been abandoned long ago. In other words, the fuzziness in our reasoning must have an objective explanation – in fuzziness of the real world.

In this chapter, we first give examples of objective real-world fuzziness. After these examples, we provide an explanation of this fuzziness – in terms of cognizability of the world. Some of our results first appeared in the conference papers [4, 16].

2 Examples of Objective “Fuzziness”

Fractals

The notion of dimension has existed for centuries. Already the ancient re-searchers made a clear distinction between 0-dimensional objects (points),1-dimensional objects (lines), 2-dimensional objects (surfaces), 3-dimensionalobjects (bodies), etc. In all these examples, dimension is a natural number:0, 1, 2, 3, . . .

Since the 19th century, mathematicians have provided a mathematical ex-tension of the notion of dimension that allowed them to classify some weirdmathematical sets as being of fractional (non-integer) dimension, but for along time, these weird sets remained anomalies. In the 1970s, B. Mandlebrotnoticed that actually, many real-life objects have fractional dimension, rang-ing from the shoreline of England to the shape of the clouds and mountains tonoises in electric circuits (to social phenomena such as stock prices). He calledsuch sets of fractional (non-integer) dimension fractals; see, e.g., [11, 12, 13].

It is now clear that fractals play an important role in nature. So, what we originally thought of as an integer-valued variable turned out to be real-valued.

Quantum physics

Until the 19th century, physical phenomena were described by classical physics. In classical physics, some variables are continuous, some are discrete. For example, the coordinates and velocities of particles usually take continuous values. However, if we are interested in stable states or periodic trajectories, we often end up with a discrete set of stable states. This discreteness underlies most engineering implementations of computers: to represent 0 or 1, we select an object with 2 possible states, and use one of these states to represent 0 and another to represent 1.

In the 20th century, however, it turned out that a more adequate description of the physical world comes from quantum physics. One of the peculiar features of quantum physics is the so-called superposition principle (see, e.g., [2]) according to which with every two states $\langle 0|$ and $\langle 1|$, it is also possible to have “intermediate” states (superpositions) $c_0 \cdot \langle 0| + c_1 \cdot \langle 1|$ for all complex values $c_0$ and $c_1$ for which $|c_0|^2 + |c_1|^2 = 1$.

So, what we originally thought of as an integer-valued variable turned out to be real-valued. It is worth mentioning that these quantum combinations of 0 and 1 states are not only happening in real life, but, as it was discovered in the 1990s, their use can drastically speed up computations. For example:

• we can search in an unsorted list of $n$ elements in time $\sqrt{n}$ – which is much faster than the time $n$ which is needed on non-quantum computers [6, 7, 18];

• we can factor a large integer in time which does not exceed a polynomial of the length of this integer – and thus, we can break most existing cryptographic codes like widely used RSA codes which are based on the difficulty of such a factorization on non-quantum computers [18, 21, 22].

These techniques form the basis of quantum computing; see, e.g., [18].

Fractional charges of quarks

In the late 19th century and early 20th century, it was experimentally confirmed that seemingly continuous matter is actually discrete: it consists of molecules, molecules consist of atoms, and atoms consist of elementary particles. A part of this confirmation came from an experimental discovery that all electric charges are proportional to a single charge – which was later revealed to be equal to the charge of an electron. Based on this proportionality, physicists concluded that many observed elementary particles, ranging from (relatively) stable particles such as protons and neutrons to numerous unstable ones – like many mesons and baryons discovered in super-colliders and in cosmic rays – cannot be further decomposed into “more elementary” objects.

In the 1960s, M. Gell-Mann [2, 5, 20] discovered that if we allow particles with fractional electric charge, then we can describe protons, neutrons, mesons, and baryons as composed of 3 (now more) even more elementary particles called quarks. At first, quarks were often viewed as purely mathematical constructions, but experiments with particle-particle collisions revealed that, within a proton, there are three areas (partons) where the reflection seems to be the largest – in perfect accordance with the fact that in the quark model, a proton consists of exactly three quarks.


So, what we originally thought of as an integer-valued variable turned out to be real-valued.

There exist other examples of objective “fuzziness”

In physics, there are many other examples when what we originally thought of as an integer-valued variable turned out to be real-valued. In this chapter, we just described the most well known ones.

3 Our Explanation of Why Physical Quantities Originally Thought to Be Integer-Valued Turned out to Be Real-Valued: Main Idea

In philosophical terms, what we are doing is “cognizing” the world, i.e., understanding how it works and trying to predict consequences of different actions – so that we will be able to select an action which is the most beneficial for us.

Of course, our knowledge is far from complete: there are many real-world phenomena which we have not cognized yet – and many philosophers believe that some of these phenomena are not cognizable at all.

If a phenomenon is not cognizable, there is nothing we can do about it. What we are interested in is phenomena which are cognizable. This is what we will base our explanation on – that in such phenomena, it is reasonable to expect continuous-valued variables, i.e., to expect that properties originally thought to be discrete are actually matters of degree.

4 First Explanation: Gödel’s Theorem vs. Tarski’s Algorithm

Gödel’s theorem: a brief reminder. Our first explanation of “objective fuzziness” is based on the historically first result in which something was actually proven to be not cognizable – the well-known 1931 Gödel’s theorem; see, e.g., [3].

This theorem can be formulated in terms of arithmetic. Specifically, we have variables which run over natural numbers 0, 1, 2, . . . A term is anything that can be obtained from these variables and natural-valued constants by using addition and multiplication, e.g., 2 · x · y + 3 · z (subtraction is also allowed).

Elementary formulas are defined as expressions of the type t = t′, t < t′, t > t′, t ≤ t′, t ≥ t′, and t ≠ t′ for some terms t and t′. Examples are 2 · x · y + 3 · z = 0 or x < y + z.

Finally, a formula is anything which is obtained from elementary formulas by using logical connectives “and” (&), “or” (∨), “implies” (→), “not” (¬), and quantifiers “for all x” (∀x) and “there exists x” (∃x). Example:

∀x∀y(x < y → ∃z(y = x + z)).

Many statements about the physical world can be formulated in terms of such formulas. Our objective is therefore to find out whether a given formula is true or false.

Gödel’s theorem states that no algorithm is possible that would, given a formula, check whether this formula is true or false. In other words, if we allow variables with discrete values, then it is not possible to have an algorithm which would solve all the problems.

Tarski’s result

In the 1940s, another well-known logician, Alfred Tarski, raised an interesting question: what if we only allow continuous variables? In other words, what if we consider the same formulas as Gödel considered, but we change their interpretation: now every variable can take arbitrary real values. It turns out that in this case, it is possible to have an algorithm that, given a formula, checks whether this formula is true or false [23].

Thus, in cognizable situations, we cannot have variables which only take discrete values – these variables must be able to take arbitrary real values [4, 16]. It is worth mentioning that the original Tarski’s algorithm required an unrealistically large amount of computation time; however, later, faster, practically useful algorithms have been invented; see, e.g., [1, 14].

5 Second Explanation: Efficient Algorithms vs. NP-Hardness

Not all algorithms are practical

Our first explanation of continuity (and “fuzziness”) was that with the discrete variables, we cannot have a deciding algorithm, but with continuous variables, we can.

The existence of an algorithm is necessary for cognition, but not sufficient. It is well known that some theoretical algorithms are not practical at all. For example, if an algorithm requires an exponential number of computational steps $2^n$ on an input of size $n$, this means that for inputs of a reasonable size $n \approx 300$–$400$, the required computation time exceeds the lifetime of the Universe.
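The size of this gap is easy to check numerically. The following is a rough sketch only: the machine speed of $10^{18}$ steps per second and the Universe age of about 13.8 billion years are illustrative assumptions, not values taken from the text.

```python
# Rough estimate of how long 2**n computational steps would take.
# Assumptions (not from the text): a machine performing 10**18 steps
# per second, and a Universe age of about 13.8 billion years.

STEPS_PER_SECOND = 10**18            # hypothetical, very optimistic speed
SECONDS_PER_YEAR = 365 * 24 * 3600
UNIVERSE_AGE_YEARS = 13.8e9          # approximate, assumed value


def years_needed(n: int) -> float:
    """Years required to execute 2**n steps at STEPS_PER_SECOND."""
    return 2**n / (STEPS_PER_SECOND * SECONDS_PER_YEAR)


if __name__ == "__main__":
    for n in (50, 100, 300, 400):
        ratio = years_needed(n) / UNIVERSE_AGE_YEARS
        print(f"n = {n}: about {ratio:.3g} Universe lifetimes")
```

Under these assumptions, an input of size 50 is handled in a fraction of a second, while sizes 300 and 400 exceed the assumed age of the Universe by many orders of magnitude.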

Feasible vs. non-feasible algorithms

There is still no perfect formalization of this difference between “practical” (feasible) and impractical (non-feasible) algorithms. Usually:


• algorithms for which the computation time $t_A(x)$ is bounded by some polynomial $P(n)$ of the length $n = \mathrm{len}(x)$ of the input (e.g., linear-time, quadratic-time, etc.) are practically useful, while

• for practically useless algorithms, the computation time grows with the size of the input much faster than a polynomial.

In view of this empirical fact, in theoretical computer science, algorithms are usually considered feasible if their running time is bounded by a polynomial of $n$. The class of problems which can be solved in polynomial time is usually denoted by P; see, e.g., [19].

Notion of NP-hardness

Not all practically useful problems can be solved in polynomial time. To describe such problems, researchers have defined several more general classes of problems. One of the most well known classes is the class NP. By definition, this class consists of all the problems which can be solved in non-deterministic polynomial time – meaning that if we have a guess, we can check, in polynomial time, whether this guess is a solution to our problem.

Most computer scientists believe that NP ≠ P, i.e., that some problems from the class NP cannot be solved in polynomial time. However, this inequality has not been proven; it is still an open problem. What is known is that some problems are NP-hard, i.e., any problem from the class NP can be reduced to each of these problems in polynomial time. One such NP-hard problem is the problem SAT of propositional satisfiability: given a propositional formula $F$, i.e., a formula obtained from Boolean (yes-no) variables $x_1, \ldots, x_n$ by using &, ∨, and ¬, check whether there exist values $x_1, \ldots, x_n$ which make this formula true.

NP-hardness of SAT means that if NP ≠ P (i.e., if at least one problem from the class NP cannot be solved in polynomial time), then SAT also cannot be solved in polynomial time. In other words, SAT is the hardest of the problems from this class. It is known that all the problems from the class NP can be solved in exponential time. Indeed, for a problem of size $n$, there are $\le a^n$ possible guesses, where $a$ is the size of the corresponding alphabet, so we can simply try all these guesses one by one.
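The “try all guesses” argument can be illustrated on SAT itself. The sketch below is only an illustration: representing a formula as a Python function on a tuple of Booleans, and the names `brute_force_sat` and `example_formula`, are assumptions made for this example. Checking one guess is fast, but the loop may run through all $2^n$ assignments.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

# A propositional formula is modeled (for illustration only) as a function
# that maps an assignment (x_1, ..., x_n) to True or False.
Formula = Callable[[Sequence[bool]], bool]


def brute_force_sat(formula: Formula, n: int) -> Optional[Tuple[bool, ...]]:
    """Try all 2**n assignments; return a satisfying one, or None."""
    for guess in product((False, True), repeat=n):
        if formula(guess):          # checking a single guess is cheap
            return guess
    return None


def example_formula(x: Sequence[bool]) -> bool:
    # (x1 or x2) & (not x1 or x3) & (not x2 or not x3)
    return (x[0] or x[1]) and ((not x[0]) or x[2]) and ((not x[1]) or (not x[2]))


def contradiction(x: Sequence[bool]) -> bool:
    # x1 & not x1: no assignment satisfies this formula
    return x[0] and not x[0]


if __name__ == "__main__":
    print(brute_force_sat(example_formula, 3))
    print(brute_force_sat(contradiction, 1))
```

The exhaustive loop is exactly why this approach takes exponential time in the worst case.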

Example: systems of linear equations

One of the simplest-to-solve numerical problems is the solution to a system of linear equations

$$a_{11} \cdot x_1 + \ldots + a_{1n} \cdot x_n = b_1;$$

$$\ldots$$

$$a_{m1} \cdot x_1 + \ldots + a_{mn} \cdot x_n = b_m.$$

In the situation when all the unknowns $x_i$ can take arbitrary real values, there exist efficient algorithms for solving such systems of equations – even the well-known Gauss elimination method, while not the fastest, is still feasible. However, as soon as we restrict ourselves to discrete (e.g., integer-valued) variables $x_i$, the solution of such a system becomes an NP-hard problem [19].

So, we end up with the same conclusion: that in cognizable situations, we cannot have variables which only take discrete values – these variables must be able to take arbitrary real values.
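For the continuous case, Gauss elimination can be sketched in a few lines. This is a minimal illustration under simplifying assumptions not made in the text: the system is square, the coefficients are integers, and exact rational arithmetic (`fractions.Fraction`) stands in for real numbers; the function name `gauss_solve` is hypothetical.

```python
from fractions import Fraction
from typing import List, Optional


def gauss_solve(a: List[List[int]], b: List[int]) -> Optional[List[Fraction]]:
    """Solve a square system a x = b exactly over the rationals.

    Returns the unique solution, or None if the matrix is singular.
    """
    n = len(a)
    # Augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        # Normalize the pivot row, then clear this column in all other rows.
        inv = 1 / m[col][col]
        m[col] = [v * inv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


if __name__ == "__main__":
    # 2*x1 + x2 = 3 and x1 + 3*x2 = 5
    print(gauss_solve([[2, 1], [1, 3]], [3, 5]))
```

For this small system, the only solution is $x_1 = 4/5$, $x_2 = 7/5$: it is found by a cubic number of arithmetic operations, and it is not integer-valued, so the same system has no solution at all once the unknowns are restricted to integers.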

6 Case Study: Selecting the Most Representative Sample

Introduction to the problem. In many practical situations, it is desirable to perform a statistical analysis of a certain population, but this population is so large that it is not practically possible to analyze every individual element from this population. In this case, we select a sample (subset) of the population, perform a statistical analysis on this sample, and use these results as an approximation to the desired statistical characteristics of the population as a whole.

For example, this is how polls work: instead of asking the opinion of all the people, pollsters ask a representative sample, and use the opinion of this sample as an approximation to the opinion of the whole population.

The more “representative” the sample, the larger our confidence that the statistical results obtained by using this sample are indeed a good approximation to the desired population statistics. Typically, we gauge the representativeness of a sample by how well its statistical characteristics reflect the statistical characteristics of the entire population. For example, in the sample of human voters, it is reasonable to require that in the selected sample, the average age, the average income, and other characteristics are the same as in the population as a whole. Of course, the representativeness of averages is not enough: e.g., the voting patterns of people whose salary is exactly the national average are not necessarily a good representation of how people will vote on average. For that, we need the sample to include both poorer and richer people – i.e., in general, to be representative not only in terms of averages but also in terms of, e.g., standard deviations (i.e., equivalently, in terms of variances).

In practice, many techniques are used to design a representative sample; see, e.g., [10]. In this section, we show that the corresponding exact optimization problem is computationally difficult (NP-hard).

How is this result related to fuzzy techniques? The main idea behind fuzzy techniques is that they formalize expert knowledge expressed by words from natural language. In this section, we show that if we do not use this knowledge, i.e., if we only use the data, then selecting the most representative sample becomes a computationally difficult (NP-hard) problem. Thus, the need to find such samples in reasonable time justifies the use of fuzzy techniques.


We have to note that similar results are known: for example, it is known that a similar problem of maximizing diversity is NP-hard; see, e.g., [9].

Towards formulation of the problem in exact terms. Let us assume that we have a population consisting of $N$ objects. For each of $N$ objects, we know the values of $k$ characteristics $x_1, x_2, \ldots, x_k$. The value of the first characteristic $x_1$ for the $i$-th object will be denoted by $x_{1,i}$, the value of the second characteristic $x_2$ for the $i$-th object will be denoted by $x_{2,i}$, . . . , and finally, the value of the characteristic $x_k$ for the $i$-th object will be denoted by $x_{k,i}$. As a result, we arrive at the following formal definition:

Definition 1. By a population, we mean a tuple

$$p \stackrel{\text{def}}{=} \langle N, k, \{x_{j,i}\}\rangle,$$

where:

• $N$ is an integer; this integer will be called the population size;
• $k$ is an integer; this integer is called the number of characteristics;
• $x_{j,i}$ ($1 \le j \le k$, $1 \le i \le N$) are real numbers; the real number $x_{j,i}$ will be called the value of the $j$-th characteristic for the $i$-th object.

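Based on these known values, we can compute the population means

$$E_1 = \frac{1}{N} \cdot \sum_{i=1}^{N} x_{1,i}, \quad E_2 = \frac{1}{N} \cdot \sum_{i=1}^{N} x_{2,i}, \quad \ldots,$$

and the population variances

$$V_1 = \frac{1}{N} \cdot \sum_{i=1}^{N} (x_{1,i} - E_1)^2, \quad V_2 = \frac{1}{N} \cdot \sum_{i=1}^{N} (x_{2,i} - E_2)^2, \quad \ldots$$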

We can also compute higher order central moments.

Definition 2. Let $p = \langle N, k, \{x_{j,i}\}\rangle$ be a population, and let $j$ be an integer from 1 to $k$.

• By the population mean $E_j$ of the $j$-th characteristic, we mean the value

$$E_j = \frac{1}{N} \cdot \sum_{i=1}^{N} x_{j,i}.$$

• By the population variance $V_j$ of the $j$-th characteristic, we mean the value

$$V_j = \frac{1}{N} \cdot \sum_{i=1}^{N} (x_{j,i} - E_j)^2.$$

• For every integer $d \ge 1$, by the even order population central moment $M^{(2d)}_j$ of order $2d$ of the $j$-th characteristic, we mean the value

$$M^{(2d)}_j = \frac{1}{N} \cdot \sum_{i=1}^{N} (x_{j,i} - E_j)^{2d}.$$

Comment. In particular, the population central moment $M^{(2)}_j$ of order 2 (corresponding to $d = 1$) is simply the population variance.

In addition to the values $x_{1,i}, x_{2,i}, \ldots$, we are given a size $n < N$ of the desirable sample. For each sample $I = \{i_1, \ldots, i_n\} \subseteq \{1, 2, \ldots, N\}$ of size $n$, we can compute the sample means

$$E_1(I) = \frac{1}{n} \sum_{i \in I} x_{1,i}, \quad E_2(I) = \frac{1}{n} \sum_{i \in I} x_{2,i}, \quad \ldots$$

and the sample variances

$$V_1(I) = \frac{1}{n} \sum_{i \in I} (x_{1,i} - E_1(I))^2, \quad V_2(I) = \frac{1}{n} \sum_{i \in I} (x_{2,i} - E_2(I))^2, \quad \ldots$$

Definition 3. Let $N$ be a population size.

• By a sample, we mean a non-empty subset $I \subseteq \{1, 2, \ldots, N\}$.
• For every sample $I$, by its size, we mean the number of elements in $I$.

Definition 4. Let $p = \langle N, k, \{x_{j,i}\}\rangle$ be a population, let $I$ be a sample of size $n$, and let $j$ be an integer from 1 to $k$.

• By the sample mean $E_j(I)$ of the $j$-th characteristic, we mean the value

$$E_j(I) = \frac{1}{n} \cdot \sum_{i \in I} x_{j,i}.$$

• By the sample variance $V_j(I)$ of the $j$-th characteristic, we mean the value

$$V_j(I) = \frac{1}{n} \cdot \sum_{i \in I} (x_{j,i} - E_j(I))^2.$$

• For every $d \ge 1$, by the sample central moment $M^{(2d)}_j(I)$ of order $2d$ of the $j$-th characteristic, we mean the value

$$M^{(2d)}_j(I) = \frac{1}{n} \cdot \sum_{i \in I} (x_{j,i} - E_j(I))^{2d}.$$


Comment. Similarly to the population case, the sample central moment $M^{(2)}_j(I)$ of order 2 (corresponding to $d = 1$) is simply the sample variance.

We want to select the most representative sample, i.e., the sample $I$ for which the sample statistics $E_1(I), E_2(I), \ldots, V_1(I), V_2(I), \ldots$ are the closest to the population statistics $E_1, E_2, \ldots, V_1, V_2, \ldots$
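The population and sample statistics defined above differ only in the set of objects over which we average, so a single helper can compute both. The following is a minimal sketch; the function names and the data layout (`x[j][i]` with 0-based indices, unlike the 1-based indices of the text) are assumptions made for this illustration.

```python
from typing import Iterable, List, Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of the given values."""
    return sum(values) / len(values)


def central_moment(values: Sequence[float], order: int) -> float:
    """Central moment of the given order (order 2 is the variance)."""
    e = mean(values)
    return sum((v - e) ** order for v in values) / len(values)


def statistics_tuple(x: Sequence[Sequence[float]],
                     indices: Iterable[int], d: int) -> List[float]:
    """Means followed by the central moments of orders 2, 4, ..., 2d.

    x[j][i] is the value of characteristic j for object i (0-based).
    Passing all indices gives population statistics; passing a subset
    gives the statistics of that sample. With d = 0, only means are returned.
    """
    idx = list(indices)
    columns = [[row[i] for i in idx] for row in x]
    result = [mean(c) for c in columns]
    for order in range(2, 2 * d + 1, 2):
        result.extend(central_moment(c, order) for c in columns)
    return result


if __name__ == "__main__":
    x = [[1, 2, 3, 4], [10, 20, 30, 40]]      # k = 2 characteristics, N = 4
    print(statistics_tuple(x, range(4), 1))   # population means and variances
    print(statistics_tuple(x, [0, 3], 1))     # sample consisting of objects 0 and 3
```

In this toy population, the sample {0, 3} reproduces both population means exactly, but its variances are larger than the population variances, so it is representative in terms of averages only.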

Definition 5. Let $p = \langle N, k, \{x_{j,i}\}\rangle$ be a population.

• By an $E$-statistics tuple corresponding to $p$, we mean a tuple

$$t^{(1)} \stackrel{\text{def}}{=} (E_1, \ldots, E_k).$$

• By an $(E, V)$-statistics tuple corresponding to $p$, we mean a tuple

$$t^{(2)} \stackrel{\text{def}}{=} (E_1, \ldots, E_k, V_1, \ldots, V_k).$$

• For every integer $d \ge 1$, by a statistics tuple of order $2d$ corresponding to $p$, we mean a tuple

$$t^{(2d)} \stackrel{\text{def}}{=} (E_1, \ldots, E_k, M^{(2)}_1, \ldots, M^{(2)}_k, M^{(4)}_1, \ldots, M^{(4)}_k, \ldots, M^{(2d)}_1, \ldots, M^{(2d)}_k).$$

Comment. In particular, the statistics tuple of order 2 is simply the $(E, V)$-statistics tuple.

Definition 6. Let $p = \langle N, k, \{x_{j,i}\}\rangle$ be a population, and let $I$ be a sample.

• By an $E$-statistics tuple corresponding to $I$, we mean a tuple

$$t^{(1)}(I) \stackrel{\text{def}}{=} (E_1(I), \ldots, E_k(I)).$$

• By an $(E, V)$-statistics tuple corresponding to $I$, we mean a tuple

$$t^{(2)}(I) \stackrel{\text{def}}{=} (E_1(I), \ldots, E_k(I), V_1(I), \ldots, V_k(I)).$$

• For every integer $d \ge 1$, by a statistics tuple of order $2d$ corresponding to $I$, we mean a tuple

$$t^{(2d)}(I) \stackrel{\text{def}}{=} (E_1(I), \ldots, E_k(I), M^{(2)}_1(I), \ldots, M^{(2)}_k(I), M^{(4)}_1(I), \ldots, M^{(4)}_k(I), \ldots, M^{(2d)}_1(I), \ldots, M^{(2d)}_k(I)).$$

Comment. In particular, the statistics tuple of order 2 corresponding to a sample $I$ is simply the $(E, V)$-statistics tuple corresponding to this same sample.

We will show that no matter how we define closeness, this problem is NP-hard (computationally difficult).


Let us describe the problem in precise terms. To describe which tuple

$$t(I) \stackrel{\text{def}}{=} (E_1(I), E_2(I), \ldots, V_1(I), V_2(I), \ldots)$$

is the closest to the original statistics tuple

$$t \stackrel{\text{def}}{=} (E_1, E_2, \ldots, V_1, V_2, \ldots),$$

we need to fix a distance function $\rho(t(I), t)$ describing how distant the two given tuples are. Similarly to the usual distance, we would like this distance function to be equal to 0 when the tuples coincide and to be positive when the tuples are different. So, we arrive at the following definitions.

Definition 7. By a distance function, we mean a mapping $\rho$ that maps every two real-valued tuples $t$ and $t'$ of the same size into a real value $\rho(t, t')$ in such a way that $\rho(t, t) = 0$ for all tuples $t$ and $\rho(t, t') > 0$ for all $t \ne t'$.

As an example, we can take Euclidean metric between the tuples $t = (t_1, t_2, \ldots)$ and $t' = (t'_1, t'_2, \ldots)$:

$$\rho(t, t') = \sqrt{\sum_j (t_j - t'_j)^2}.$$

Now, we are ready to formulate the problem.

Definition 8. Let $\rho$ be a distance function. By an $E$-sample selection problem corresponding to $\rho$, we mean the following problem. We are given:

• a population $p = \langle N, k, \{x_{j,i}\}\rangle$, and
• an integer $n < N$.

Among all samples $I \subseteq \{1, \ldots, N\}$ of size $n$, we must find the sample $I$ for which the distance $\rho(t^{(1)}(I), t^{(1)})$ between the corresponding $E$-statistics tuples is the smallest possible.

Definition 9. Let $\rho$ be a distance function. By an $(E, V)$-sample selection problem corresponding to $\rho$, we mean the following problem. We are given:

• a population $p = \langle N, k, \{x_{j,i}\}\rangle$, and
• an integer $n < N$.

Among all samples $I \subseteq \{1, \ldots, N\}$ of size $n$, we must find the sample $I$ for which the distance $\rho(t^{(2)}(I), t^{(2)})$ between the corresponding $(E, V)$-statistics tuples is the smallest possible.

Definition 10. Let $\rho$ be a distance function, and let $d \ge 1$ be an integer. By a $2d$-th order sample selection problem corresponding to $\rho$, we mean the following problem. We are given:

• a population $p = \langle N, k, \{x_{j,i}\}\rangle$, and
• an integer $n < N$.

Among all samples $I \subseteq \{1, \ldots, N\}$ of size $n$, we must find the sample $I$ for which the distance $\rho(t^{(2d)}(I), t^{(2d)})$ between the corresponding statistics tuples of order $2d$ is the smallest possible.
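To make the sample selection problems concrete, here is a sketch of the straightforward exhaustive approach to the $E$-sample selection problem, with the Euclidean metric as the distance function. The function names and the 0-based data layout `x[j][i]` are assumptions made for this illustration. The search goes over all $\binom{N}{n}$ samples, so it is usable only for very small populations.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, Iterable, List, Sequence, Tuple


def e_tuple(x: Sequence[Sequence[float]], indices: Iterable[int]) -> List[float]:
    """E-statistics tuple (means of all characteristics) over the given objects."""
    idx = list(indices)
    return [sum(row[i] for i in idx) / len(idx) for row in x]


def euclidean(t: Sequence[float], t_prime: Sequence[float]) -> float:
    """Euclidean metric between two tuples of the same size."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(t, t_prime)))


def best_e_sample(x: Sequence[Sequence[float]], n: int,
                  rho: Callable[[Sequence[float], Sequence[float]], float] = euclidean
                  ) -> Tuple[int, ...]:
    """Exhaustively search all samples of size n for the one whose
    E-statistics tuple is closest to the population E-statistics tuple."""
    population_size = len(x[0])
    target = e_tuple(x, range(population_size))
    return min(combinations(range(population_size), n),
               key=lambda sample: rho(e_tuple(x, sample), target))


if __name__ == "__main__":
    x = [[1, 2, 3, 4, 10]]          # one characteristic, N = 5, mean 4
    print(best_e_sample(x, 2))      # the pair of objects with mean closest to 4
```

In this example, the pair of objects with values 3 and 4 (0-based indices 2 and 3) has mean 3.5, which is closer to the population mean 4 than the mean of any other pair.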

Proposition 1. For every distance function $\rho$, the corresponding $E$-sample selection problem is NP-hard.

Proposition 2. For every distance function $\rho$, the corresponding $(E, V)$-sample selection problem is NP-hard.

Proposition 3. For every distance function $\rho$ and for every integer $d \ge 1$, the corresponding $(2d)$-th order sample selection problem is NP-hard.

What is NP-hardness: a brief informal reminder. In order to prove these results, let us recall what NP-hardness means. Informally, a problem $P_0$ is called NP-hard if it is at least as hard as all other problems from the class NP (a natural class of problems).

To be more precise, a problem $P_0$ is NP-hard if every problem $P$ from the class NP can be reduced to this problem $P_0$. A reduction means that to every instance $p$ of the problem $P$, we must be able to assign (in a feasible, i.e., polynomial-time way) an instance $p_0$ of our problem $P_0$ in such a way that the solution to the new instance $p_0$ will lead to the solution of the original instance $p$. For precise definitions, see, e.g., [19].

How NP-hardness is usually proved. The original proof of NP-hardness of certain problems $P_0$ is rather complex, because it is based on explicitly proving that every problem from the class NP can be reduced to the problem $P_0$. However, once we have proven NP-hardness of a problem $P_0$, the proof of NP-hardness of other problems $P_1$ is much easier.

Indeed, from the above description of a reduction, one can easily see that reduction is a transitive relation: if a problem $P$ can be reduced to a problem $P_0$, and the problem $P_0$ can be reduced to a problem $P_1$, then, by combining these two reductions, we can prove that $P$ can be reduced to $P_1$.

Thus, to prove that a new problem $P_1$ is NP-hard, it is sufficient to prove that one of the known NP-hard problems $P_0$ can be reduced to this problem $P_1$. Indeed, since $P_0$ is NP-hard, every other problem $P$ from the class NP can be reduced to this problem $P_0$. Since $P_0$ can be reduced to $P_1$, we can now conclude, by transitivity, that every problem $P$ from the class NP can be reduced to this problem $P_1$ – i.e., that the problem $P_1$ is indeed NP-hard.

Comment. As a consequence of the definition of NP-hardness, we can conclude that if a problem $P_0$ is NP-hard, then every more general problem $P_1$ is also NP-hard.

Indeed, the fact that $P_0$ is NP-hard means that every instance $p$ of every problem $P$ can be reduced to some instance $p_0$ of the problem $P_0$. Since the problem $P_1$ is more general than the problem $P_0$, every instance $p_0$ of the problem $P_0$ is also an instance of the more general problem $P_1$.

Thus, every instance $p$ of every problem $P$ can be reduced to some instance $p_0$ of the problem $P_1$ – i.e., the more general problem $P_1$ is indeed NP-hard.

Main idea of the proof: reduction to subset sum, a known NP-hard problem. We prove NP-hardness of our problem by reducing a known NP-hard problem to it: namely, a subset sum problem, in which we are given $m$ positive integers $s_1, \ldots, s_m$, and we must find the signs $\varepsilon_i \in \{-1, 1\}$ for which

$$\sum_{i=1}^{m} \varepsilon_i \cdot s_i = 0;$$

see, e.g., [19].

A reduction means that to every instance $s_1, \ldots, s_m$ of the subset sum problem, we must assign (in a feasible, i.e., polynomial-time way) an instance of our problem in such a way that the solution to the new instance will lead to the solution of the original instance.

Reduction: explicit description. Let us describe this reduction: we take $N = 2m$, $k = 2$, $n = m$, and we select the values $x_{j,i}$ as follows:

• $x_{1,i} = s_i$ and $x_{1,m+i} = -s_i$ for all $i = 1, \ldots, m$;
• $x_{2,i} = x_{2,m+i} = 2^i$ for all $i = 1, \ldots, m$.

We will show that for this new problem, the most representative sample $I$ has $\rho(t(I), t) = 0$ if and only if the original instance of the subset sum problem has a solution.
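The construction and the claimed equivalence can be checked by exhaustive search on tiny instances. The sketch below is only an illustration under stated assumptions: all function names are hypothetical, indices are 0-based, and only the means ($E$-statistics tuples) are compared, using exact fractions so that "distance zero" means exact equality. It supports the claim on small examples and does not replace the proof.

```python
from fractions import Fraction
from itertools import combinations, product
from typing import List, Sequence, Tuple


def reduction_instance(s: Sequence[int]) -> Tuple[List[List[int]], int]:
    """Build the population values x[j][i] and the sample size n from s_1..s_m."""
    m = len(s)
    first = list(s) + [-v for v in s]               # x_{1,i} = s_i, x_{1,m+i} = -s_i
    second = [2 ** (i + 1) for i in range(m)] * 2   # x_{2,i} = x_{2,m+i} = 2^i
    return [first, second], m


def means(x: Sequence[Sequence[int]], indices) -> List[Fraction]:
    """Exact means of all characteristics over the given objects."""
    idx = list(indices)
    return [Fraction(sum(row[i] for i in idx), len(idx)) for row in x]


def has_zero_distance_sample(x: Sequence[Sequence[int]], n: int) -> bool:
    """True if some sample of size n has exactly the population means."""
    population_size = len(x[0])
    target = means(x, range(population_size))
    return any(means(x, sample) == target
               for sample in combinations(range(population_size), n))


def has_sign_solution(s: Sequence[int]) -> bool:
    """True if some choice of signs eps_i in {-1, 1} gives sum eps_i * s_i = 0."""
    return any(sum(e * v for e, v in zip(signs, s)) == 0
               for signs in product((-1, 1), repeat=len(s)))


if __name__ == "__main__":
    for s in ([1, 2, 3], [1, 2, 4], [3, 1, 1, 1], [5]):
        x, n = reduction_instance(s)
        print(s, has_sign_solution(s), has_zero_distance_sample(x, n))
```

For each of the four small instances, the two Boolean answers coincide, as the statement above predicts.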

General analysis. Indeed, by definition of a distance function, the equality $\rho(t(I), t) = 0$ is equivalent to $t(I) = t$, i.e., to the requirement that for the sample $I$, means (and variances) within the sample are exactly the same as for the entire population.

Consequences for the second component. Let us start by analyzing the consequences of this requirement for the mean of the second component. For the entire population of size $N = 2m$, for each $i$ from 1 to $m$, we have two elements, $i$-th and $(m + i)$-th, with the value $x_{2,i} = x_{2,m+i} = 2^i$. Thus, for the population as a whole, this mean is equal to

$$E_2 = \frac{2 + 2^2 + \ldots + 2^m}{m}.$$

For the selected subset $I$ of size $m$, this mean should be exactly the same: $E_2(I) = E_2$. Thus, we must have

$$E_2(I) = \frac{2 + 2^2 + \ldots + 2^m}{m}.$$


By definition,

$$E_2(I) = \frac{1}{m} \cdot \sum_{i \in I} x_{2,i}.$$

Thus, we conclude that

$$S_2(I) \stackrel{\text{def}}{=} \sum_{i \in I} x_{2,i} = 2 + 2^2 + \ldots + 2^m.$$

What can we now conclude about the set $I$?

First of all, we can notice that in the sum $2 + 2^2 + \ldots + 2^m$, all the terms are divisible by 4 except for the first term 2. Thus, the sum itself is not divisible by 4.

In our population, we have exactly two elements, element 1 and element $m + 1$, for which $x_{2,1} = x_{2,m+1} = 2$. For every other element, we have $x_{2,i} = x_{2,m+i} = 2^i$ for $i \ge 2$ and therefore, the corresponding value is divisible by 4.

With regard to a selection $I$, there are exactly three possibilities:

• the set $I$ contains none of the two elements 1 and $m + 1$;
• the set $I$ contains both elements 1 and $m + 1$; and
• the set $I$ contains exactly one of the two elements 1 and $m + 1$.

In the first two cases, the contribution of these two elements to the sum $S_2(I)$ is divisible by 4 (it is 0 or 4). Since all other elements in the sum $S_2(I)$ are divisible by 4, we would thus conclude that the sum itself is divisible by 4 – which contradicts our conclusion that this sum is equal to $2 + 2^2 + \ldots + 2^m$ and is, therefore, not divisible by 4.

This contradiction shows that the set $I$ must contain exactly one of the two elements 1 and $m + 1$. Let us denote this element by $k_1$. For this element, $x_{2,k_1} = 2$. Subtracting $x_{2,k_1}$ and 2 from the two sides of the equality

$$S_2(I) = \sum_{i \in I} x_{2,i} = 2 + 2^2 + \ldots + 2^m,$$

we conclude that

$$S_2(I - \{k_1\}) = \sum_{i \in I - \{k_1\}} x_{2,i} = 2^2 + 2^3 + \ldots + 2^m.$$

In the new sum $2^2 + 2^3 + \ldots + 2^m$, all the terms are divisible by $2^3 = 8$ except for the first term $2^2$. Thus, the sum itself is not divisible by 8.

In our remaining population $\{2, \ldots, m, m + 2, \ldots, 2m\}$, we have exactly two elements, element 2 and element $m + 2$, for which $x_{2,2} = x_{2,m+2} = 2^2$. For every other element, we have $x_{2,i} = x_{2,m+i} = 2^i$ for $i \ge 3$ and therefore, the corresponding value is divisible by 8.


With regard to a selection $I$, there are exactly three possibilities:

• the set $I$ contains none of the two elements 2 and $m + 2$;
• the set $I$ contains both elements 2 and $m + 2$; and
• the set $I$ contains exactly one of the two elements 2 and $m + 2$.

In the first two cases, the contribution of these two elements to the sum $S_2(I - \{k_1\})$ is divisible by 8 (it is 0 or 8). Since all other elements in the sum $S_2(I - \{k_1\})$ are divisible by 8, we would thus conclude that the sum itself is divisible by 8 – which contradicts our conclusion that this sum is equal to $2^2 + 2^3 + \ldots + 2^m$ and is, therefore, not divisible by 8.

This contradiction shows that the set $I$ must contain exactly one of the two elements 2 and $m + 2$. Let us denote this element by $k_2$. For this element, $x_{2,k_2} = 2^2$. Subtracting $x_{2,k_2}$ and $2^2$ from the two sides of the equality

$$S_2(I - \{k_1\}) = \sum_{i \in I - \{k_1\}} x_{2,i} = 2^2 + 2^3 + \ldots + 2^m,$$

we conclude that

$$S_2(I - \{k_1, k_2\}) = \sum_{i \in I - \{k_1, k_2\}} x_{2,i} = 2^3 + 2^4 + \ldots + 2^m.$$

Now, we can similarly conclude that the set $I$ contains exactly one element from the pair $\{3, m + 3\}$, and in general, for every $i$ from 1 to $m$, we can conclude that the selection set $I$ contains exactly one element $k_i$ from the pair $\{i, m + i\}$.

Consequences for the first component. Let us now analyze the consequences of this requirement for the mean of the first component. For the entire population of size $N = 2m$, for each $i$ from 1 to $m$, we have two elements, $i$-th and $(m + i)$-th, with the opposite values $x_{1,i} = s_i$ and $x_{1,m+i} = -s_i$. Thus, for the population as a whole, this mean is equal to $E_1 = 0$.

For each $i$ from 1 to $m$, the selection set contains exactly one element of these two: $k_i = i$ or $k_i = m + i$. Thus, $E_1(I) = 0$ means that the corresponding sum is equal to 0: $\sum_{i=1}^{m} x_{1,k_i} = 0$. Here, $x_{1,k_i} = \varepsilon_i \cdot s_i$, where:

• $\varepsilon_i = 1$ if $k_i = i$, and
• $\varepsilon_i = -1$ if $k_i = m + i$.

Thus, we conclude that $\sum_{i=1}^{m} \varepsilon_i \cdot s_i = 0$ for some $\varepsilon_i \in \{-1, 1\}$, i.e., that the original instance of the subset sum problem has a solution.

Equivalence. Vice versa, if the original instance of the subset sum problem has a solution, i.e., if $\sum_{i=1}^{m} \varepsilon_i \cdot s_i = 0$ for some $\varepsilon_i \in \{-1, 1\}$, then we can select $I = \{k_1, \ldots, k_m\}$, where:


• $k_i = i$ when $\varepsilon_i = 1$, and
• $k_i = m + i$ when $\varepsilon_i = -1$.

One can easily check that in this case, we have $E_1(I) = E_1$, $E_2(I) = E_2$, $V_1(I) = V_1$, $V_2(I) = V_2$, and, in general, $M^{(2d)}_1(I) = M^{(2d)}_1$ and $M^{(2d)}_2(I) = M^{(2d)}_2$.
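This "easy check" can be carried out mechanically for a concrete instance. The sketch below uses the example $s = (1, 2, 3)$ with signs $(1, 1, -1)$, chosen only for illustration; the helper names and the 0-based indexing are assumptions of this example. Exact fractions are used so that equality of moments is tested without rounding.

```python
from fractions import Fraction
from typing import List, Sequence


def moments(values: Sequence[int], d_max: int) -> List[Fraction]:
    """Return [mean, M^(2), M^(4), ..., M^(2*d_max)] as exact fractions."""
    n = len(values)
    e = Fraction(sum(values), n)
    return [e] + [sum((v - e) ** (2 * d) for v in values) / n
                  for d in range(1, d_max + 1)]


def sample_from_signs(signs: Sequence[int]) -> List[int]:
    """0-based version of k_i: object i if eps_i = 1, object m + i otherwise."""
    m = len(signs)
    return [i if eps == 1 else m + i for i, eps in enumerate(signs)]


s = [1, 2, 3]
signs = [1, 1, -1]                                # 1 + 2 - 3 = 0
m = len(s)
first = s + [-v for v in s]                       # x_{1,i} = s_i, x_{1,m+i} = -s_i
second = [2 ** (i + 1) for i in range(m)] * 2     # x_{2,i} = x_{2,m+i} = 2^i
sample = sample_from_signs(signs)

if __name__ == "__main__":
    for row in (first, second):
        sample_values = [row[i] for i in sample]
        # Means and central moments of orders 2, 4, 6 coincide.
        print(moments(sample_values, 3) == moments(row, 3))
```

The reason is visible in the data: for the first characteristic, the sample values $\varepsilon_i \cdot s_i$ have mean 0 and the same even powers as $\pm s_i$, and for the second characteristic, the sample contains each value $2^i$ exactly once while the population contains it exactly twice.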

Conclusion. The reduction is proven, so the problem of finding the most representative sample is indeed NP-hard.

Discussion. In the definitions of sample selection problem $P_1$ (Definitions 8–10), the objective is to find the sample $I$ of given size $n$ (which is smaller than $N$, the size of the population) such that the distance $\rho(t(I), t)$ is the smallest possible.

In the above text, we have proved, in effect, that the selection of a sample $I$ of a given size $n$ ($< N$), such that the distance $\rho(t(I), t) = 0$, is NP-hard.

The distance is always non-negative. Thus, when the smallest possible distance is 0, finding the sample $I$ for which the distance $\rho(t(I), t)$ is the smallest possible is equivalent to finding the sample for which this distance is zero. In general, the smallest possible distance is not necessarily equal to 0. Thus, the sample selection problem $P_1$ is more general than the auxiliary “zero-distance” problem $P_0$ for which we have proven NP-hardness.

We have already mentioned earlier that if a problem $P_0$ is NP-hard, then a more general problem $P_1$ is NP-hard as well. Thus, we have indeed proved that the (more general) sample selection problem is NP-hard.

Towards auxiliary results. In our proofs, we considered the case when the desired sample contains half of the original population. In practice, however, samples form a much smaller portion of the population. A natural question is: what if we fix a large even number $2P \gg 2$, and look for samples which constitute the $(2P)$-th part of the original population? It turns out that the resulting problem of selecting the most representative sample is still NP-hard.

Definition 11. Let $\rho$ be a distance function, and let $2P$ be a positive even integer. By a problem of selecting an $E$-sample of relative size $\frac{1}{2P}$, we mean the following problem:

• We are given a population $p = \langle N, k, \{x_{j,i}\}\rangle$.
• Among all samples $I \subseteq \{1, \ldots, N\}$ of size $n = \frac{N}{2P}$, we must find the sample $I$ for which the distance $\rho(t^{(1)}(I), t^{(1)})$ between the corresponding $E$-statistics tuples is the smallest possible.

Definition 12. Let ρ be a distance function, and let 2P be a positive even integer. By a problem of selecting an (E, V)-sample of relative size $\frac{1}{2P}$, we mean the following problem:

• We are given a population p = 〈N, k, {xj,i}〉.
• Among all samples I ⊆ {1, . . . , N} of size $n = \frac{N}{2P}$, we must find the sample I for which the distance $\rho(t^{(2)}(I), t^{(2)})$ between the corresponding (E, V)-statistical tuples is the smallest possible.

Definition 13. Let ρ be a distance function, let d ≥ 1 be an integer, and let 2P be a positive even integer. By a problem of selecting a (2d)-th order sample of relative size $\frac{1}{2P}$, we mean the following problem:

• We are given a population p = 〈N, k, {xj,i}〉.
• Among all samples I ⊆ {1, . . . , N} of size $n = \frac{N}{2P}$, we must find the sample I for which the distance $\rho(t^{(2d)}(I), t^{(2d)})$ between the corresponding statistical tuples of order 2d is the smallest possible.

Proposition 4. For every distance function ρ and for every even integer 2P, the corresponding problem of selecting an E-sample of relative size $\frac{1}{2P}$ is NP-hard.

Proposition 5. For every distance function ρ and for every even integer 2P, the corresponding problem of selecting an (E, V)-sample of relative size $\frac{1}{2P}$ is NP-hard.

Proposition 6. For every distance function ρ, for every integer d ≥ 1, and for every even integer 2P, the corresponding problem of selecting a (2d)-th order sample of relative size $\frac{1}{2P}$ is NP-hard.

Proof of Propositions 4–6. The proof is similar to the proofs of Propositions 1–3.

The main difference is that for each i from 1 to m, we now have not two but 2P different objects

i, m + i, 2m + i, . . . , k · m + i, . . . , (2P − 1) · m + i

with the same value

$x_{2,i} = x_{2,m+i} = \ldots = x_{2,k \cdot m+i} = \ldots = x_{2,(2P-1) \cdot m+i} = (2P)^i.$

(This common value also differs from the one used in the earlier proofs.)

Among these 2P objects with the same value of the second characteristic $x_{2,\cdot}$, for the first half, we have $x_{1,\cdot} = s_i$ and for the second half, we have $x_{1,\cdot} = -s_i$, i.e.:

$x_{1,i} = x_{1,m+i} = \ldots = x_{1,(P-1) \cdot m+i} = s_i;$

$x_{1,P \cdot m+i} = x_{1,(P+1) \cdot m+i} = \ldots = x_{1,(2P-1) \cdot m+i} = -s_i.$


By using divisibility by $(2P)^2$ (instead of divisibility by $2^2$), we conclude that the best fitting sample is the one which has exactly one element of each group. Thus, from E1(I) = E1, we similarly conclude that the original instance of the subset problem has a solution – and hence that the new problems are indeed NP-hard.
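The construction used in this proof can be made concrete with a short sketch. The function name `build_population` and the input list `s` (standing for the values s1, ..., sm of the original instance) are illustrative assumptions and do not come from the chapter; the code only builds the two characteristics described above and does not attempt to solve the selection problem.

```python
from fractions import Fraction


def build_population(s, P):
    """Build the two characteristics x1, x2 for N = 2P * m objects.

    Hypothetical helper: for each i = 1..m there are 2P objects
    i, m + i, ..., (2P - 1) * m + i sharing x2 = (2P) ** i; the first
    P of them get x1 = s_i and the last P get x1 = -s_i.
    """
    m = len(s)
    N = 2 * P * m
    x1 = [0] * N
    x2 = [0] * N
    for block in range(2 * P):           # block index (written k in the text)
        for i in range(1, m + 1):        # i = 1, ..., m
            idx = block * m + i - 1      # 0-based position of object block*m + i
            x2[idx] = (2 * P) ** i
            x1[idx] = s[i - 1] if block < P else -s[i - 1]
    return x1, x2


def mean(values):
    """Exact population mean."""
    return Fraction(sum(values), len(values))


x1, x2 = build_population([3, 5, 8], P=2)
print(len(x1), mean(x1))   # population size and mean of the first characteristic
```

By construction the values s_i and −s_i cancel, so the population mean of the first characteristic is 0, and a sample with one element of each group has size m = N/(2P).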

7 Symmetry: Another Fundamental Reason for Continuity (“Fuzziness”)

Case study: benzene. To explain why symmetry leads to continuity, let us start with a chemical example. In traditional chemistry, a molecule is composed of atoms that exchange electrons with each other. If an atom borrows one electron from another atom, we say that they have a connection of valence 1; if two electrons, there is a connection of valence 2, etc.

From the analysis of benzene, it has been clear that it consists of six carbon and six hydrogen atoms, i.e., that its chemical formula is C6H6. However, for a long time, it was not clear how exactly they are connected to each other. The solution came in the 19th century to the chemist August Kekule in a dream. He dreamed of six monkeys that form a circle in which each monkey holds on to the previous monkey’s tail. According to this solution, the six C atoms form a circle. To each of these atoms, an H atom is attached. Each C atom has a connection of valence 1 to H, a connection of valence 1 to one of its neighbors, and a connection of valence 2 to the other neighbor.

The resulting chemical structure is still routinely described in chemical textbooks – because a benzene loop is a basis of organic chemistry and life. However, now we understand that this formula is not fully adequate. Indeed, according to this formula, the connections between C atoms are of two different types: of valence 1 and of valence 2. In reality, the benzene molecule is completely symmetric: there is no difference between the strengths of different connections.

It is not possible to have a symmetric configuration if we require that valencies are integers. To equally split the remaining valence of 3 (1 is taken for H) between the two neighbors, we need a valence of 3/2. This is not possible in classical chemistry – but this is possible, in some sense, in quantum chemistry where, as we have mentioned, we have a continuum of intermediate states; see, e.g., [2].

Fuzzy logic itself is such an example. Fuzzy logic itself can be viewed as an example where symmetries lead to values intermediate between the original discrete values.

Indeed, in traditional logic, we have two possible truth values: 1 (“true”) and 0 (“false”). How can we use this logic to describe the absence of knowledge? If we do not know whether a given statement A is true or not, this means that we have exactly the same degree of belief in the statement A as we have in its negation ¬A. In traditional logic, neither of the two truth values is symmetric (invariant) under the transformation A → ¬A. Thus, to adequately describe this situation, we need to also consider additional (intermediate) truth values.

And indeed, in fuzzy logic with the set of truth values [0, 1] and the negation operation f¬(x) = 1 − x, there is a value which is invariant under the operation A → ¬A: the value 0.5.
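The invariance argument can be checked numerically. The sketch below is only an illustration: the helper names `fuzzy_not` and `invariant_values` and the choice of a grid with step 0.1 are assumptions made here, not part of the chapter.

```python
from fractions import Fraction


def fuzzy_not(x):
    """Negation operation f(x) = 1 - x on truth values in [0, 1]."""
    return 1 - x


def invariant_values(candidates):
    """Return the truth values that negation maps to themselves."""
    return [x for x in candidates if fuzzy_not(x) == x]


two_valued = [Fraction(0), Fraction(1)]             # traditional logic
grid = [Fraction(k, 10) for k in range(11)]         # sampled values from [0, 1]

print(invariant_values(two_valued))   # no invariant value among {0, 1}
print(invariant_values(grid))         # only 1/2 is invariant
```

Exact fractions are used instead of floats so that the equality test 1 − x = x is not affected by rounding.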

8 Case Study: Territory Division

Formulation of the problem. In many conflict situations, several participants want to divide a territory between themselves. It may be a farmer’s children dividing his farm, or it may be countries dividing a disputed territory.

Traditional (non-fuzzy) formalization of the problem. Let us follow [15] and describe a traditional (non-fuzzy) formalization of this problem. Let us denote the disputed territory (i.e., to be more precise, the set of all the points in this territory) by T. Our objective is to divide this territory between n participants, i.e., to select a division of the set T into the sets T1, T2, . . . , Tn for which Ti ∩ Tj = ∅ for i ≠ j and

T1 ∪ T2 ∪ . . . ∪ Tn = T.

It is reasonable to assume that the utility ui of the i-th participant in acquiring the territory Ti is linear in Ti, i.e., has the form

$u_i(T_i) = \int_{T_i} U_i(x) \, dx$

for some appropriate function Ui(x). As we mentioned in [15], it is reasonable to use Nash’s criterion to select the optimal division, i.e., to select the division for which the product

$u \stackrel{\text{def}}{=} u_1(T_1) \cdot u_2(T_2) \cdot \ldots \cdot u_n(T_n)$

attains the largest possible value. According to [15], in the optimal solution, for every participant i, there is a weight ci such that each point x is assigned to the participant with the largest weighted utility ci · Ui(x).

In particular, for two participants, there is a threshold c such that all the points x for which U1(x)/U2(x) > c go to the first participant, and all the points x for which U1(x)/U2(x) < c go to the second participant.
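As a rough illustration of the threshold rule for two participants, the sketch below discretizes the territory into unit-area cells and scans candidate thresholds. Everything here is an assumption made for illustration: the function names, the discretization, the requirement U2(x) > 0, and the tie-breaking choice (cells with U1(x)/U2(x) = c go to the second participant, which the text leaves open). On a finite grid the best threshold division need not be the best among all partitions; the result quoted from [15] concerns the continuous problem.

```python
from fractions import Fraction


def threshold_division(U1, U2, c):
    """Cells with U1/U2 > c go to participant 1, the rest to participant 2."""
    cells = range(len(U1))
    T1 = [x for x in cells if U1[x] > c * U2[x]]
    T2 = [x for x in cells if U1[x] <= c * U2[x]]
    return T1, T2


def nash_product(U1, U2, T1, T2):
    """Product u1(T1) * u2(T2) for unit-area cells."""
    return sum(U1[x] for x in T1) * sum(U2[x] for x in T2)


def best_threshold(U1, U2):
    """Scan the thresholds at which the division changes; keep the best one."""
    ratios = sorted({Fraction(U1[x], U2[x]) for x in range(len(U1))})
    candidates = [Fraction(0)] + ratios
    return max(
        candidates,
        key=lambda c: nash_product(U1, U2, *threshold_division(U1, U2, c)),
    )


U1 = [4, 3, 2, 1]   # hypothetical utility densities of participant 1
U2 = [1, 2, 3, 4]   # hypothetical utility densities of participant 2
c = best_threshold(U1, U2)
print(c, threshold_division(U1, U2, c))
```

For these sample values the scan selects c = 2/3, giving the first two cells to participant 1 and the last two to participant 2, with product 7 · 7 = 49.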

Possibility of a “fuzzy” solution. From the commonsense viewpoint, why do we necessarily have to divide all the disputed territory? Why can we not control some parts of it together? In other words, instead of dividing the set T into subsets Ti, why can we not assign, to every point x ∈ T and to every i, the degree di(x) to which the i-th participant will control the neighborhood of this point, in such a way that for every point x,

d1(x) + . . . + dn(x) = 1.

In other words, instead of a crisp partition we have a fuzzy partition. In this setting, the utility ui of the i-th participant has the form

$u_i(d_i) = \int U_i(x) \cdot d_i(x) \, dx,$

and our objective is to find a fuzzy partition for which the product

$u \stackrel{\text{def}}{=} u_1(d_1) \cdot u_2(d_2) \cdot \ldots \cdot u_n(d_n)$

attains the largest possible value.

Observation: the above “fuzzy” problem always has a crisp optimal solution. The derivation from [15] was based on the idea that if we attain a maximum, then a small change of assignment in the vicinity of each point will only decrease (or not change) the desired product. For the fuzzy problem, a similar argument shows that there are weights ci such that in the optimal solution, every point x for which the weighted utility ci · Ui(x) of the i-th participant is larger than the weighted utility of all other participants is assigned to this i-th participant.

The only points about which we cannot make a definite assignment are the ones in which two or more participants have exactly the same weighted utility. How we divide these points between these participants does not matter – as long as the overall degree of all the points assigned to each of these participants remains the same. In particular, this means that it is always possible to have a crisp division with the optimal value of the desired product.

So, we arrive at a somewhat paradoxical situation: even when we allow “fuzzy” divisions, the corresponding optimization problem always has a crisp solution. So, at first glance, it may seem that fuzzy solutions are not needed at all.

As we will see, the situation changes if we consider symmetry.

Symmetry leads to fuzziness. For the territory division problem, a symmetry means a transformation f : T → T that preserves the area of each (crisp) subset and that preserves the utility of each subarea to each participant. Preserving area means that f has to be a measure-preserving transformation. Preserving utility means that we must have Ui(x) = Ui(f(x)) for all x.

It is reasonable to require that if the original situation allows a symmetry, then the desired division should be invariant with respect to this symmetry. Let us show that this requirement leads to a fuzzy solution.

Indeed, let us consider the simplest situation in which we have only two participants, and both assign equal value to all the points: U1(x) = U2(x) = 1. In this case, the utility of each set Ti is simply equal to its area Ai, so the optimization problem takes the form

A1 · A2 → max.

Since the sum A1 + A2 is equal to the area A of the original territory T, this problem takes the form

A1 · (A − A1) → max.

One can easily check that the optimal crisp solution means that A1 = A/2, i.e., that we divide the area T into two equal halves.
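The check is a one-line calculus argument. The following worked derivation is added here for convenience, under the stated assumption U1(x) = U2(x) = 1:

```latex
\[
\frac{d}{dA_1}\bigl[A_1 \cdot (A - A_1)\bigr] = A - 2A_1 = 0
\quad\Longrightarrow\quad
A_1 = \frac{A}{2},
\qquad
\frac{d^2}{dA_1^2}\bigl[A_1 \cdot (A - A_1)\bigr] = -2 < 0,
\]
\[
\text{so the maximum value of the product is }
\frac{A}{2} \cdot \Bigl(A - \frac{A}{2}\Bigr) = \frac{A^2}{4}.
\]
```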

This solution is optimal but it is not symmetric. Indeed, in this case, symmetries are simply area-preserving transformations. Symmetry of the division means that f(T1) = T1 for all such transformations f. However, for every two points x, y ∈ T, we can have an area-preserving transformation f that maps x into y: f(x) = y. In particular, we can have such a transformation for x ∈ T1 and y ∈ T2, in which case f(T1) ≠ T1. Thus, a crisp symmetric solution is impossible.

In contrast, a fuzzy symmetric solution is quite possible – and uniquely determined: we simply assign to each point x equal degrees d1(x) = d2(x) = 1/2. Then, f(d1) = d1 and f(d2) = d2 for all area-preserving transformations f.
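A small discrete model makes this contrast tangible. In the sketch below, the territory is assumed to consist of four unit-area cells, so that permutations of the cells play the role of area-preserving transformations; this discretization and the helper names `utilities` and `is_symmetric` are illustrative assumptions, not the chapter's construction.

```python
from fractions import Fraction
from itertools import permutations


def utilities(d1, U1, U2):
    """Utilities (u1, u2) of a two-participant fuzzy partition, d2 = 1 - d1."""
    cells = range(len(d1))
    u1 = sum(U1[x] * d1[x] for x in cells)
    u2 = sum(U2[x] * (1 - d1[x]) for x in cells)
    return u1, u2


def is_symmetric(d1):
    """True if d1 is unchanged by every permutation f of the cells."""
    n = len(d1)
    return all(
        [d1[f[x]] for x in range(n)] == list(d1)
        for f in permutations(range(n))
    )


U = [1, 1, 1, 1]                    # both participants value every cell equally
crisp = [1, 1, 0, 0]                # participant 1 owns the first two cells
fuzzy = [Fraction(1, 2)] * 4        # joint control with degree 1/2 everywhere

print(utilities(crisp, U, U), is_symmetric(crisp))
print(utilities(fuzzy, U, U), is_symmetric(fuzzy))
```

Both divisions give each participant utility 2 (half of the total area 4), so both attain the optimal product, but only the fuzzy division is invariant under all permutations of the cells.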

In general, we always have an optimal symmetric solution: in this solution, equally desired points – for which ci · Ui(x) = cj · Uj(x) – are all assigned a joint control with the same degree of ownership depending only on i and j.

9 Conclusion

In this chapter, we have proven that from the natural assumption that the world is cognizable, we can conclude that intermediate degrees are needed to describe real-world processes. This conclusion provides an additional explanation for the success of fuzzy techniques (and other techniques which use intermediate degrees) – success which often goes beyond situations in which the intermediate degrees are needed to describe the experts’ uncertainty.

Acknowledgments

This work was supported in part by NSF grants HRD-0734825, EAR-0225670, and EIA-0080940, by Texas Department of Transportation grant No. 0-5453, by the Japan Advanced Institute of Science and Technology (JAIST) International Joint Research Grant 2006-08, and by the Max Planck Institut für Mathematik.

References

1. Basu, S., Pollack, R., Roy, M.-F.: Algorithms in Real Algebraic Geometry. Springer, Berlin (2006)

2. Feynman, R.P., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Reading (2005)

3. Franzen, T.: Gödel’s Theorem: An Incomplete Guide to its Use and Abuse. A. K. Peters (2005)

4. Gamez, J.E., Modave, F., Kosheleva, O.: Selecting the Most Representative Sample is NP-Hard: Need for Expert (Fuzzy) Knowledge. In: Proceedings of the IEEE World Congress on Computational Intelligence WCCI 2008, Hong Kong, China, June 1–6, pp. 1069–1074 (2008)

5. Gell-Mann, M.: The Quark and the Jaguar. Owl Books (1995)

6. Grover, L.K.: A fast quantum mechanical algorithm for database search. In: Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, p. 212 (May 1996)

7. Grover, L.K.: From Schrödinger’s equation to quantum search algorithm. American Journal of Physics 69(7), 769–777 (2001)

8. Klir, G., Yuan, B.: Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall, Upper Saddle River (1995)

9. Kuo, C.C., Glover, F., Dhir, K.S.: Analyzing and modeling the maximum diversity problem by zero-one programming. Decision Sciences 24(6), 1171–1185 (1993)

10. Lohr, H.: Sampling: Design and Analysis. Duxbury Press (1999)

11. Mandelbrot, B.: Fractals: Form, Chance and Dimension. W. H. Freeman and Co., New York (1977)

12. Mandelbrot, B.: The Fractal Geometry of Nature. W. H. Freeman & Co., New York (1982)

13. Mandelbrot, B., Hudson, R.L.: The (Mis)Behavior of Markets: A Fractal View of Risk, Ruin, and Reward. Basic Books (2004)

14. Mishra, B.: Computational real algebraic geometry. In: Handbook on Discrete and Computational Geometry. CRC Press, Boca Raton (1997)

15. Nguyen, H.T., Kreinovich, V.: How to Divide a Territory? A New Simple Differential Formalism for Optimization of Set Functions. International Journal of Intelligent Systems 14(3), 223–251 (1999)

16. Nguyen, H.T., Kreinovich, V.: Everything Is a Matter of Degree: A New Theoretical Justification of Zadeh’s Principle. In: Proceedings of the 27th International Conference of the North American Fuzzy Information Processing Society NAFIPS 2008, New York, May 19–22 (2008)

17. Nguyen, H.T., Walker, E.A.: A first course in fuzzy logic. CRC Press, Boca Raton (2005)

18. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2000)

19. Papadimitriou, C.H.: Computational Complexity. Addison Wesley, San Diego (1994)

20. Povh, B.: Particles and Nuclei: An Introduction to the Physical Concepts. Springer, Heidelberg (1995)

21. Shor, P.: Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. In: Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, NM, November 20–22 (1994)

22. Shor, P.: Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Sci. Statist. Comput. 26, 1484 (1997)

23. Tarski, A.: A Decision Method for Elementary Algebra and Geometry, 2nd edn., 63 p. Berkeley and Los Angeles (1951)


Paraconsistent Annotated Logic Program Before-after EVALPSN and Its Application

Kazumi Nakamatsu

School of Human Science and Environment, University of Hyogo, 1-1-12 Shinzaike, HIMEJI

Summary. We have already proposed a paraconsistent annotated logic program called EVALPSN. In EVALPSN, an annotation called an extended vector annotation is attached to each literal. In order to deal with the before-after relation between two time intervals, we introduce a new interpretation for extended vector annotations in EVALPSN, which is named Before-after (bf) EVALPSN.

In this chapter, we introduce bf-EVALPSN and its application to real-time process order control and its safety verification with simple examples. First, the background and overview of EVALPSN are introduced, and paraconsistent annotated logic as the formal background of EVALPSN and EVALPSN itself are recapitulated with simple examples. Then, after bf-EVALPSN is formally defined, we describe how to implement and apply bf-EVALPSN to real-time intelligent process order control and its safety verification with simple practical examples. Last, unique and useful features of bf-EVALPSN are introduced, and conclusions and remarks are provided.

1 Introduction and Background

One of the main purposes of paraconsistent logic is to deal with inconsistency in a framework of consistent logical systems. It has been almost six decades since the first paraconsistent logical system was proposed by S. Jaskowski [12]. It was four decades later that a family of paraconsistent logics called “annotated logic” was proposed by da Costa et al. [8, 49], which can deal with inconsistency by introducing many truth values called “annotations” into their syntax as attached information to formulas.

The paraconsistent annotated logic by da Costa et al. was developed from the viewpoint of logic programming by Subrahmanian et al. [7, 13, 48]. Furthermore, in order to deal with inconsistency and non-monotonic reasoning in a framework of annotated logic programming, ALPSN (Annotated Logic Program with Strong Negation) and its stable model semantics were developed by Nakamatsu and Suzuki [17]. It has been shown that ALPSN can deal with some kinds of non-monotonic reasoning such as default logic [46], autoepistemic logic [15] and a non-monotonic Assumption Based Truth Maintenance System (ATMS) [9] in a framework of annotated logic programming [18, 36, 37]. Even though ALPSN can deal with non-monotonic reasoning such as default reasoning, and conflicts can be represented as paraconsistent knowledge in it, it is difficult and complicated to deal with reasoning to resolve conflicts in ALPSN. On the other hand, it is known that defeasible logic can deal with conflict resolving in a logical way [5, 41, 42], although defeasible logic cannot deal with inconsistency in its syntax and its inference rules are too complicated to implement easily. In order to deal with conflict resolving and inconsistency in a framework of annotated logic programming, a new version of ALPSN, VALPSN (Vector Annotated Logic Program with Strong Negation), which can deal with defeasible reasoning and inconsistency, was also developed by Nakamatsu et al. [22]. Moreover, it has been shown that VALPSN can be applied to conflict resolving in various systems [19, 20, 21]. It has also been shown that VALPSN provides a computational model of defeasible logic [5, 6]. Later, VALPSN was extended to EVALPSN (Extended VALPSN) by Nakamatsu et al. [23, 24] to deal with deontic notions (obligation, permission, forbiddance, etc.) and defeasible deontic reasoning [43, 44]. Recently, EVALPSN has been applied to various kinds of safety verification and intelligent control, for example, railway interlocking safety verification [27], robot action control [25, 28, 29, 39], safety verification for air traffic control [26], traffic signal control [30], discrete event control [31, 32, 33] and pipeline valve control [34, 35].

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 75–108. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Considering the safety verification for process control, there are occasions in which the safety verification for process order control is significant. For example, suppose a pipeline network in which two kinds of liquids, nitric acid and caustic soda, are used for cleaning the pipelines. If those liquids are processed continuously and mixed in the same pipeline by accident, an explosion caused by neutralization could occur. In order to avoid such a dangerous accident, the safety of process order control should be strictly verified in a formal way such as EVALPSN. However, it seems to be a little difficult to utilize EVALPSN for verifying process order control as well as the safety verification for each process in process control. We have already proposed a new EVALPSN, bf (before-after)-EVALPSN, that can deal with before-after relations between two time intervals [40].

This chapter mainly focuses on introducing bf-EVALPSN and its application to real-time process order control and its safety verification with simple process order control examples. As far as we know, there seems to be no other efficient computational tool that can deal with the real-time safety verification for process order control than bf-EVALPSN.

This chapter is organized as follows: firstly, in Section 1, the background and overview of the paraconsistent annotated logic program EVALPSN are introduced; in Section 2, paraconsistent annotated logic as the background knowledge of EVALPSN and EVALPSN itself are formally recapitulated with simple examples; in Section 3, after bf-EVALPSN is formally defined, how to implement and apply bf-EVALPSN to real-time safety verification for process order control is described with simple practical examples; in Section 4, unique and useful features of bf-EVALPSN are introduced with examples; lastly, conclusions and remarks are provided.

2 Paraconsistent Annotated Logic Program

This section is devoted to clarifying the formal background of the paraconsistent annotated logic program EVALPSN. More details of EVALPSN are given in [40]. We assume that the reader is familiar with the basic knowledge of classical logic and logic programming [14]. In order to understand EVALPSN and its reasoning, we introduce paraconsistent annotated logics PT [8] in the following subsection.

2.1 Paraconsistent Annotated Logic PT

Here we briefly recapitulate the syntax and semantics for propositional paraconsistent annotated logics proposed by da Costa et al. [8].

Generally, a truth value called an annotation is attached to each atomic formula explicitly in paraconsistent annotated logic, and the set of annotations constitutes a complete lattice. We introduce a paraconsistent annotated logic PT with the four-valued complete lattice T.

Definition 2.1
The primitive symbols of PT are:

1. propositional symbols p, q, · · · , pi, qi, · · · ;
2. each member of T is an annotation constant (we may call it simply an annotation);
3. the connectives and parentheses ∧, ∨, →, ¬, (, ).

Formulas are defined recursively as follows:

1. if p is a propositional symbol and μ ∈ T is an annotation constant, then p : μ is an annotated atomic formula (atom);
2. if F, F1, F2 are formulas, then ¬F, F1 ∧ F2, F1 ∨ F2, F1 → F2 are formulas.

We suppose that the four-valued lattice in Fig. 1 is the complete lattice T, where annotation t may be intuitively regarded as the truth value true and annotation f as the truth value false. Annotations ⊥, t, f and ⊤ correspond to the truth values ∗, T, F and TF in Visser [50] and None, T, F, and Both in Belnap [4], respectively. Moreover, the complete lattice T can be viewed as a bi-lattice in which the vertical direction $\overrightarrow{\bot\top}$ indicates the knowledge amount ordering and the horizontal direction $\overrightarrow{ft}$ indicates the truth ordering [10]. We use the symbol ≤ to denote the ordering in terms of knowledge amount (the vertical direction $\overrightarrow{\bot\top}$) over the complete lattice T, and the symbols ⊥ and ⊤ are used to denote the bottom and top elements, respectively. In the paraconsistent annotated logic PT, each annotated atomic formula can be interpreted epistemically; for example, p : t may be interpreted epistemically as “the proposition p is known to be true”.

Fig. 1. The 4-valued Complete Lattice T (⊤ (inconsistent) at the top, f (false) and t (true) in the middle, ⊥ (unknown) at the bottom)

There are two kinds of negation in the paraconsistent annotated logic PT. One of them, represented by the symbol ¬ in Definition 2.1, is called epistemic negation, and the epistemic negation in PT followed by an annotated atomic formula is defined as a mapping between elements of the complete lattice T as follows:

¬(⊥) = ⊥, ¬(t) = f, ¬(f) = t, ¬(⊤) = ⊤.

This definition shows that the epistemic negation maps annotations to annotations without changing their knowledge amounts, and the epistemic negation followed by an annotated atomic formula can be eliminated by a syntactical operation. For example, the knowledge amount of annotation t is the same as that of annotation f, as shown in the complete lattice T, and we have the epistemic negation ¬(p : t) = p : ¬(t) = p : f (see footnote 1), which shows that the knowledge amount in terms of the proposition p cannot be changed by the epistemic negation. There is another negation called ontological (strong) negation that is defined by the epistemic negation.

Definition 2.2 (Strong Negation)
Let F be any formula,

∼F =def F → ((F → F) ∧ ¬(F → F)).

The epistemic negation in the above definition is not interpreted as a mapping between annotations since it is not followed by an annotated atomic formula. Therefore, the strongly negated formula ∼F is intuitively interpreted so that if the formula F exists, the contradiction ((F → F) ∧ ¬(F → F)) is implied. Usually, strong negation is used for denying the existence of the formula following it.

The semantics for the paraconsistent annotated logics PT is defined as follows.

Footnote 1. An expression ¬p : μ is conveniently used for expressing a negative annotated literal instead of ¬(p : μ) or p : ¬(μ).


Definition 2.3
Let ν be the set of all propositional symbols and F be the set of all formulas. An interpretation I is a function,

I : ν −→ T.

To each interpretation I, we can associate the valuation function

vI : F −→ {0, 1},

which is defined as:

1. let p be a propositional symbol and μ an annotation,

vI(p : μ) = 1 iff μ ≤ I(p),
vI(p : μ) = 0 iff μ ≰ I(p);

2. let A and B be any formulas, and A not an annotated atom,

vI(¬A) = 1 iff vI(A) = 0,
vI(∼B) = 1 iff vI(B) = 0;

other formulas A → B, A ∧ B, A ∨ B are valuated as usual.

We provide an intuitive interpretation for a strongly negated annotated atom with the complete lattice T. For example, the strongly negated literal ∼(p : t) implies the knowledge “p is false (f) or unknown (⊥)” since it denies the existence of the knowledge that “p is true (t)”. This intuitive interpretation is justified by Definition 2.3 as follows: if vI(∼(p : t)) = 1, we have vI(p : t) = 0, i.e., t ≰ I(p); among the annotations μ ∈ {⊥, f, t, ⊤}, the ones with t ≤ μ are exactly t and ⊤, therefore we obtain that I(p) = f or I(p) = ⊥.
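The valuation of annotated atoms in Definition 2.3 can be sketched in a few lines. The string names `"bot"`, `"t"`, `"f"`, `"top"` for the four annotations and all function names are assumptions made for this illustration; only annotated atoms, their epistemic negation and their strong negation are covered.

```python
# Knowledge ordering <= on the four-valued lattice T (reflexive pairs included).
LEQ = {
    ("bot", "bot"), ("bot", "t"), ("bot", "f"), ("bot", "top"),
    ("t", "t"), ("t", "top"),
    ("f", "f"), ("f", "top"),
    ("top", "top"),
}

# Epistemic negation as a mapping between annotations.
NEG = {"bot": "bot", "t": "f", "f": "t", "top": "top"}

ANNOTATIONS = ["bot", "t", "f", "top"]


def leq(mu1, mu2):
    """mu1 <= mu2 in the knowledge ordering."""
    return (mu1, mu2) in LEQ


def v_atom(I, p, mu):
    """v_I(p : mu) = 1 iff mu <= I(p)."""
    return 1 if leq(mu, I[p]) else 0


def v_neg_atom(I, p, mu):
    """Epistemic negation of an atom: not(p : mu) is p : not(mu)."""
    return v_atom(I, p, NEG[mu])


def v_strong_neg_atom(I, p, mu):
    """Strong negation: v_I(~(p : mu)) = 1 iff v_I(p : mu) = 0."""
    return 1 - v_atom(I, p, mu)


# Interpretations of p under which ~(p : t) holds.
allowed = [a for a in ANNOTATIONS if v_strong_neg_atom({"p": a}, "p", "t") == 1]
print(allowed)
```

The list contains exactly the annotations for ⊥ and f, in line with the intuitive reading “p is false or unknown”.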

2.2 EVALPSN (Extended Vector Annotated Logic Program with Strong Negation)

Generally, an annotation is explicitly attached to each literal in paraconsistent annotated logic programs as well as in the paraconsistent annotated logic PT. For example, let p be a literal and μ an annotation; then p : μ is called an annotated literal. The set of annotations constitutes a complete lattice.

An annotation in EVALPSN has the form [(i, j), μ] and is called an extended vector annotation. The first component (i, j) is called a vector annotation and the set of vector annotations constitutes a complete lattice,

Tv(n) = { (x, y)|0 ≤ x ≤ n, 0 ≤ y ≤ n, x, y and n are integers }

shown by the Hasse diagram for n = 2 in Fig. 2. The ordering ($\preceq_v$) of the complete lattice Tv(n) is defined as follows: let (x1, y1), (x2, y2) ∈ Tv(n),

$(x_1, y_1) \preceq_v (x_2, y_2)$ iff $x_1 \le x_2$ and $y_1 \le y_2$.

Fig. 2. Lattice Tv(2) and Lattice Td (Tv(2) has the nodes (0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2), (2, 1), (1, 2), (2, 2); Td has the nodes ⊥, α, β, γ, ∗1, ∗2, ∗3, ⊤)
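The lattice Tv(n) and its ordering translate directly into code. The following sketch uses the hypothetical names `T_v` and `leq_v`; it enumerates Tv(2) and checks a few comparisons.

```python
def T_v(n):
    """All vector annotations (x, y) with integer 0 <= x, y <= n."""
    return [(x, y) for x in range(n + 1) for y in range(n + 1)]


def leq_v(a, b):
    """(x1, y1) precedes (x2, y2) iff x1 <= x2 and y1 <= y2."""
    return a[0] <= b[0] and a[1] <= b[1]


lattice = T_v(2)
print(len(lattice))                    # number of vector annotations in T_v(2)
print(leq_v((1, 0), (2, 1)))           # comparable pair
print(leq_v((2, 0), (0, 2)), leq_v((0, 2), (2, 0)))   # incomparable pair
```

The pair (2, 0) and (0, 2) is incomparable in both directions, which is why the ordering is a lattice ordering rather than a linear one; (0, 0) is the bottom element and (n, n) the top element.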

For each extended vector annotated literal p : [(i, j), μ], the integer i denotes the amount of positive information to support the literal p and the integer j denotes that of negative information. The second component μ is an index of fact and deontic notions such as obligation, and the set of the second components constitutes the following complete lattice,

Td = {⊥, α, β, γ, ∗1, ∗2, ∗3, ⊤}.

The ordering ($\preceq_d$) of the complete lattice Td is described by the Hasse diagram in Fig. 2. The intuitive meanings of all members in Td are

⊥ (unknown),
α (fact),
β (obligation),
γ (non-obligation),
∗1 (fact and obligation),
∗2 (obligation and non-obligation),
∗3 (fact and non-obligation), and
⊤ (inconsistency).

The complete lattice Td is a quatro-lattice in which the direction $\overrightarrow{\bot\top}$ measures knowledge amount, the direction $\overrightarrow{\gamma\beta}$ deontic truth, the direction $\overrightarrow{\bot *_2}$ deontic knowledge amount and the direction $\overrightarrow{\bot\alpha}$ factuality. For example, the annotation β (obligation) can be intuitively interpreted to be more obligatory than the annotation γ (non-obligation), and the annotations ⊥ (no knowledge) and ∗2 (obligation and non-obligation) are deontically neutral, that is to say, it cannot be said whether they represent obligation or non-obligation.

A complete lattice Te(n) of extended vector annotations is defined as the product,

Tv(n) × Td.

The ordering ($\preceq_e$) of the complete lattice Te(n) is also defined as follows: let [(i1, j1), μ1], [(i2, j2), μ2] ∈ Te(n),

$[(i_1, j_1), \mu_1] \preceq_e [(i_2, j_2), \mu_2]$ iff $(i_1, j_1) \preceq_v (i_2, j_2)$ and $\mu_1 \preceq_d \mu_2$.
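A minimal sketch of the product ordering follows. Note the assumption: the Hasse diagram of Td is not reproduced here, so the covering relations in `COVERS_D` are inferred from the intuitive meanings of the annotations (for instance, ∗1 = “fact and obligation” is taken to lie directly above α and β). They should be checked against Fig. 2 of the original before being relied on. All identifiers are hypothetical.

```python
# Assumed covering relations of T_d (each key lies directly below its values).
COVERS_D = {
    "bot": {"alpha", "beta", "gamma"},
    "alpha": {"*1", "*3"},
    "beta": {"*1", "*2"},
    "gamma": {"*2", "*3"},
    "*1": {"top"},
    "*2": {"top"},
    "*3": {"top"},
    "top": set(),
}


def leq_d(m1, m2):
    """m1 precedes m2 in T_d: reflexive-transitive closure of COVERS_D."""
    if m1 == m2:
        return True
    return any(leq_d(up, m2) for up in COVERS_D[m1])


def leq_v(a, b):
    """Componentwise ordering on vector annotations."""
    return a[0] <= b[0] and a[1] <= b[1]


def leq_e(a1, a2):
    """Product ordering on extended vector annotations [(i, j), mu]."""
    (v1, m1), (v2, m2) = a1, a2
    return leq_v(v1, v2) and leq_d(m1, m2)


print(leq_e(((1, 0), "alpha"), ((2, 0), "*1")))
print(leq_e(((1, 0), "beta"), ((2, 0), "gamma")))
```

Under these assumptions the first comparison holds (both components increase) and the second fails, because β and γ are incomparable in Td.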


There are two kinds of epistemic negation (¬1 and ¬2) in EVALPSN, which are defined as mappings over the complete lattices Tv(n) and Td, respectively.

Definition 2.4 (epistemic negations ¬1 and ¬2 in EVALPSN)

¬1([(i, j), μ]) = [(j, i), μ], ∀μ ∈ Td,

¬2([(i, j), ⊥]) = [(i, j), ⊥], ¬2([(i, j), α]) = [(i, j), α],
¬2([(i, j), β]) = [(i, j), γ], ¬2([(i, j), γ]) = [(i, j), β],
¬2([(i, j), ∗1]) = [(i, j), ∗3], ¬2([(i, j), ∗2]) = [(i, j), ∗2],
¬2([(i, j), ∗3]) = [(i, j), ∗1], ¬2([(i, j), ⊤]) = [(i, j), ⊤].

If we regard the epistemic negations in Definition 2.4 as syntactical operations, the epistemic negations followed by literals can be eliminated by the syntactical operations. For example, ¬1 p : [(2, 0), α] = p : [(0, 2), α] and ¬2 q : [(1, 0), β] = q : [(1, 0), γ]. The strong negation (∼) in EVALPSN is defined in the same way as in the paraconsistent annotated logic PT.
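The two epistemic negations are simple mappings and can be sketched as follows. The representation of an extended vector annotation as a Python pair `((i, j), mu)` and the string names of the Td elements are assumptions made for this example.

```python
# Mapping of the second component under the epistemic negation "not_2".
NEG2 = {
    "bot": "bot", "alpha": "alpha",
    "beta": "gamma", "gamma": "beta",
    "*1": "*3", "*2": "*2", "*3": "*1",
    "top": "top",
}


def neg1(annotation):
    """not_1 swaps the positive and negative amounts: [(i, j), mu] -> [(j, i), mu]."""
    (i, j), mu = annotation
    return ((j, i), mu)


def neg2(annotation):
    """not_2 maps the deontic component and keeps the vector annotation."""
    (i, j), mu = annotation
    return ((i, j), NEG2[mu])


print(neg1(((2, 0), "alpha")))   # annotation of p after eliminating not_1
print(neg2(((1, 0), "beta")))    # annotation of q after eliminating not_2
```

Both mappings are involutions: applying either negation twice returns the original annotation.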

Definition 2.5 (well extended vector annotated literal)
Let p be a literal. p : [(i, 0), μ] and p : [(0, j), μ] are called weva (well extended vector annotated)-literals, where i, j ∈ {1, 2, · · · , n}, and μ ∈ {α, β, γ}.

Definition 2.6 (EVALPSN)
If L0, · · · , Ln are weva-literals,

L0 ← L1 ∧ · · · ∧ Li ∧ ∼Li+1 ∧ · · · ∧ ∼Ln

is called an EVALPSN clause. An EVALPSN is a finite set of EVALPSN clauses.

Fact and the deontic notions “obligation”, “forbiddance” and “permission” are represented by the extended vector annotations

[(m, 0), α], [(m, 0), β], [(0, m), β], and [(0, m), γ],

respectively, where m is a positive integer. For example,

p : [(2, 0), α] is intuitively interpreted as “it is known to be true of strength 2 that p is a fact”;

p : [(1, 0), β] is interpreted as “it is known to be true of strength 1 that p is obligatory”;

p : [(0, 2), β] is interpreted as “it is known to be false of strength 2 that p is obligatory”, that is to say, “it is known to be true of strength 2 that p is forbidden”;

p : [(0, 1), γ] is interpreted as “it is known to be false of strength 1 that p is not obligatory”, that is to say, “it is known to be true of strength 1 that p is permitted”.
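The correspondence between annotations and deontic notions can be summarized by a small lookup function. The function name `deontic_reading` and the returned labels are illustrative assumptions; annotations outside the four patterns listed above are simply reported as `None`.

```python
def deontic_reading(annotation):
    """Return (notion, strength) for the four annotation patterns in the text."""
    (i, j), mu = annotation
    if i > 0 and j == 0 and mu == "alpha":
        return ("fact", i)
    if i > 0 and j == 0 and mu == "beta":
        return ("obligation", i)
    if i == 0 and j > 0 and mu == "beta":
        return ("forbiddance", j)
    if i == 0 and j > 0 and mu == "gamma":
        return ("permission", j)
    return None   # not one of the four patterns


for ann in [((2, 0), "alpha"), ((1, 0), "beta"), ((0, 2), "beta"), ((0, 1), "gamma")]:
    print(ann, deontic_reading(ann))
```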

Generally, if an EVALPSN contains the strong negation ∼, it has stable model semantics [40] as well as other ordinary logic programs with strong negation. However, the stable model semantics may have the problem that some programs have two or more stable models and others have no stable model. Moreover, computing stable models takes a long time compared to usual logic programming such as PROLOG programming. Therefore, it does not seem to be appropriate for practical applications such as real-time processing in general. Fortunately, there are cases in which EVALPSN can be implemented practically: if an EVALPSN is a stratified program, it has a tractable model called a perfect model [45] and the strong negation in the EVALPSN can be treated as Negation as Failure in logic programming with no strong negation. The details of stratified programs and some tractable models for normal logic programs can be found in [3, 11, 45, 47]; furthermore, the details of the stratified EVALPSN are described in [40]. Therefore, inefficient EVALPSN stable model computation does not have to be taken into account in practice, since all EVALPSNs that appear in the subsequent sections are stratified.

3 Before-after EVALPSN

In this section, we define bf (before-after)-EVALPSN formally and show how to implement it, aiming at its application to real-time safety verification for process order control.

3.1 Before-after Relation in EVALPSN

First of all, we introduce a special literal R(pi, pj, t) whose vector annotation represents the before-after relation between processes Pri(pi) and Prj(pj) at time t, where the processes may in general be regarded as time intervals. The literal R(pi, pj, t) is called a bf-literal (see footnote 2).

Definition 3.1 (bf-EVALPSN)
An extended vector annotated literal R(pi, pj, t) : [μ1, μ2] is called a bf-EVALP literal, where μ1 is a vector annotation and μ2 ∈ {α, β, γ}. If an EVALPSN clause contains bf-EVALP literals, it is called a bf-EVALPSN clause, or just a bf-EVALP clause if it contains no strong negation. A bf-EVALPSN is a finite set of bf-EVALPSN clauses.

We provide some paraconsistent interpretations of vector annotations for representing bf-relations, which are called bf-annotations. Strictly speaking, bf-relations between time intervals are classified into 15 kinds according to the bf-relations between the start/finish times of the two time intervals. We define the 15 kinds of bf-relations in bf-EVALPSN, regarding processes as time intervals.

Suppose that there are two processes, Pri with start/finish times xs and xf , and Prj with start/finish times ys and yf .

2 Hereafter, the expression “before-after” is abbreviated as “bf” in this chapter.

[Fig. 3. Bf-relations, Before/After (timing diagram: Pri starts at xs, Prj starts at ys)]

[Fig. 4. Bf-relations, Disjoint Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

Before (be)/After (af)
First, we define the most basic bf-relations, before/after, according to the bf-relation between the start times of the two processes; they are represented by the bf-annotations be/af, respectively. If one process has started before/after another one started, then the bf-relations between those processes are defined as “before (be)/after (af)”, respectively. The bf-relations are also described in Fig.3 under the condition that process Pri has started before process Prj starts. The bf-relation between their start times is denoted by the inequality {xs < ys} (see footnote 3). For example, a fact at time t, “process Pri has started before process Prj started”, can be represented by a bf-EVALP clause

R(pi, pj, t) : [be, α].

The bf-relations before/after do not care when the two processes finish.

Disjoint Before (db)/After (da)
The bf-relations disjoint before/after between processes Pri and Prj are represented by the bf-annotations db/da, respectively. The expressions “disjoint before/after” imply that there is a time lag between the finish of the earlier process and the start of the later one. They are also described in Fig.4 under the condition that process Pri has finished before process Prj starts. The bf-relation between their start/finish times is denoted by the inequality {xf < ys}. For example, an obligation at time t, “process Pri must start after process Prj finishes”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [da, β].

3 If time t1 is earlier than time t2, we conveniently denote the relation by the inequality t1 < t2.

[Fig. 5. Bf-relations, Immediate Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

[Fig. 6. Bf-relations, Joint Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

Immediate Before (mb)/After (ma)
The bf-relations immediate before/after between processes Pri and Prj are represented by the bf-annotations mb/ma, respectively. The expressions “immediate before/after” imply that there is no time lag between the finish time of the earlier process and the start time of the later one. The bf-relations are also described in Fig.5 under the condition that process Pri has finished immediately before process Prj starts. The bf-relation between their start/finish times is denoted by the equality {xf = ys}. For example, a fact at time t, “process Pri has finished immediately before process Prj starts”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [mb, α].

Joint Before (jb)/After (ja)
The bf-relations joint before/after between processes Pri and Prj are represented by the bf-annotations jb/ja, respectively. The expressions “joint before/after” imply that the two processes overlap and the earlier process has finished before the later one finishes. The bf-relations are also described in Fig.6 under the condition that process Pri has started before process Prj starts and process Pri has finished before process Prj finishes. The bf-relation between their start/finish times is denoted by the inequalities {xs < ys < xf < yf}. For example, a fact at time t, “process Pri has started before process Prj starts and finished before process Prj finishes”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [jb, α].

S-included Before (sb)/After (sa)
The bf-relations s-included before/after between processes Pri and Prj are represented by the bf-annotations sb/sa, respectively. The expressions “s-included before/after” imply that one process has started before another one starts and they have finished at the same time. The bf-relations are also described in Fig.7 under the condition that process Pri has started before process Prj starts and they have finished at the same time. The bf-relation between their start/finish times is denoted by the equality and inequalities {xs < ys < xf = yf}. For example, a fact at time t, “process Pri has started before process Prj starts and they have finished at the same time”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [sb, α].

[Fig. 7. Bf-relations, S-included Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

[Fig. 8. Bf-relations, Included Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

Included Before (ib)/After (ia)
The bf-relations included before/after between processes Pri and Prj are represented by the bf-annotations ib/ia, respectively. The expressions “included before/after” imply that one process has started before another one starts and finished after that one finished. The bf-relations are also described in Fig.8 under the condition that process Pri has started before process Prj starts and finished after process Prj finished. The bf-relation between their start/finish times is denoted by the inequalities {xs < ys, yf < xf}. For example, an obligation at time t, “process Pri must start before process Prj starts and finish after process Prj finishes”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [ib, β].

F-included Before (fb)/After (fa)
The bf-relations f-included before/after between processes Pri and Prj are represented by the bf-annotations fb/fa, respectively. The expressions “f-included before/after” imply that the two processes have started at the same time and one process has finished before the other one finishes. The bf-relations are also described in Fig.9 under the condition that processes Pri and Prj have started at the same time and process Pri has finished after process Prj finished. The bf-relation between their start/finish times is denoted by the equality and inequality {xs = ys, yf < xf}. For example, a fact at time t, “processes Pri and Prj have started at the same time and process Pri has finished after process Prj finished”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [fa, α].

[Fig. 9. Bf-relations, F-included Before/After (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

[Fig. 10. Bf-relation, Paraconsistent Before-after (timing diagram: Pri runs from xs to xf, Prj from ys to yf)]

Paraconsistent Before-after (pba)
The bf-relation paraconsistent before-after between processes Pri and Prj is represented by the bf-annotation pba. The expression “paraconsistent before-after” implies that the two processes have started at the same time and also finished at the same time. The bf-relation is also described in Fig.10 under the condition that processes Pri and Prj have not only started but also finished at the same time. The bf-relation between their start/finish times is denoted by the equalities {xs = ys, yf = xf}. For example, an obligation at time t, “processes Pri and Prj must not only start but also finish at the same time”, can be represented by a bf-EVALP clause,

R(pi, pj, t) : [pba, β].

Here we define the epistemic negation ¬1, which maps the set of bf-annotations to itself in bf-EVALPSN.

Definition 3.2 (Epistemic Negation ¬1 for Bf-annotations)
The epistemic negation ¬1 over the set of bf-annotations,

{be, af, da, db, ma, mb, ja,jb, sa, sb, ia, ib, fa, fb, pba}

is obviously defined as the following mapping :

¬1(af) = be, ¬1(be) = af,

¬1(da) = db, ¬1(db) = da,

¬1(ma) = mb, ¬1(mb) = ma,

¬1(ja) = jb, ¬1(jb) = ja,

¬1(sa) = sb, ¬1(sb) = sa,

¬1(ia) = ib, ¬1(ib) = ia,

¬1(fa) = fb, ¬1(fb) = fa,

¬1(pba) = pba.
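The thirteen interval-level bf-relations and the negation of Definition 3.2 can be illustrated with a short Python sketch. The function name classify and the numeric encoding of times are assumptions of this sketch; be and af are not returned because they only constrain start times and are refined by the other relations once both processes have finished.

```python
# Epistemic negation over bf-annotations, transcribed from Definition 3.2.
NEG1 = {
    "be": "af", "af": "be", "db": "da", "da": "db", "mb": "ma", "ma": "mb",
    "jb": "ja", "ja": "jb", "sb": "sa", "sa": "sb", "ib": "ia", "ia": "ib",
    "fb": "fa", "fa": "fb", "pba": "pba",
}


def classify(xs, xf, ys, yf):
    """Final bf-annotation of R(pi, pj) from the start/finish times.

    Pri runs from xs to xf and Prj from ys to yf (comparable numbers).
    """
    if not (xs < xf and ys < yf):
        raise ValueError("each process must start before it finishes")
    if xs == ys:                      # same start: fb / pba / fa
        if xf < yf:
            return "fb"
        if xf == yf:
            return "pba"
        return "fa"
    if xs > ys:                       # Prj started first: mirror and negate
        return NEG1[classify(ys, yf, xs, xf)]
    if xf < ys:                       # from here on xs < ys
        return "db"
    if xf == ys:
        return "mb"
    if xf < yf:
        return "jb"
    if xf == yf:
        return "sb"
    return "ib"


if __name__ == "__main__":
    print(classify(0, 3, 1, 2))
```

Swapping the roles of the two processes and applying ¬1 gives the same result as classifying directly, which is why only the cases with xs ≤ ys need to be written out.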

If we consider the before-after measure over the 15 bf-annotations, there obviously exists a partial order (<h) based on the before-after measure, where μ1 <h μ2 is intuitively interpreted as meaning that the bf-annotation μ1 denotes a more “before” degree than the bf-annotation μ2, and μ1, μ2 ∈ {be, af, db, da, mb, ma, jb, ja, ib, ia, sb, sa, fb, fa, pba}. If μ1 <h μ2 and μ2 <h μ1, we denote it by μ1 ≡h μ2. Then we have the following ordering:

db <h mb <h jb <h sb <h ib <h fb <h pba <h ia <h ja <h ma <h da

and

sb ≡h be <h af ≡h sa.

On the other hand, if we take the amount of before-after knowledge (information) of each bf-relation into account as another measure, there obviously also exists another partial order (<v) in terms of the knowledge amount, where μ1 <v μ2 is intuitively interpreted as meaning that the bf-annotation μ1 carries less knowledge about the bf-relation than the bf-annotation μ2. If μ1 <v μ2 and μ2 <v μ1, we denote it by μ1 ≡v μ2. Then we have the following ordering:

be <v μ1, μ1 ∈ { db, mb, jb, sb, ib },
af <v μ2, μ2 ∈ { da, ma, ja, sa, ia },

db ≡v mb ≡v jb ≡v sb ≡v ib ≡v fb ≡v pba ≡v fa ≡v ia ≡v sa ≡v ja ≡v ma ≡v da

and

be ≡v af.

If we take the before-after measure as the horizontal one and the before-after knowledge amount as the vertical one, we obtain the complete bi-lattice Tv(12)bf of vector annotations including the 15 bf-annotations.

Tv(12)bf = { ⊥12(0, 0), · · · , be(0, 8), · · · , db(0, 12), · · · , mb(1, 11), · · · ,
jb(2, 10), · · · , sb(3, 9), · · · , ib(4, 8), · · · , fb(5, 7), · · · ,
pba(6, 6), · · · , fa(7, 5), · · · , af(8, 0), · · · , ia(8, 4), · · · ,
sa(9, 3), · · · , ja(10, 2), · · · , ma(11, 1), · · · , da(12, 0), · · · ,
⊤12(12, 12) },

which is described as a Hasse diagram in Fig.11. We note that a bf-EVALP literal

R(pi, pj, t) : [μ1(m, n), μ2],
where μ2 ∈ {α, β, γ} and μ1 ∈ {be, db, mb, jb, sb, ib, fb, pba, fa, ia, sa, ja, ma, da, af},

is not well annotated if m ≠ 0 and n ≠ 0; however, such a bf-EVALP literal is equivalent to the following two well annotated bf-EVALP literals:

R(pi, pj, t) : [(m, 0), μ2] and R(pi, pj, t) : [(0, n), μ2].

Therefore, such non-well annotated bf-EVALP literals can be regarded as the conjunction of two well annotated EVALP literals. For example, suppose that there is a non-well annotated bf-EVALP clause,

[Fig. 11. The Complete Bi-lattice Tv(12)bf of Bf-annotations. Hasse diagram with horizontal axis before/after and vertical axis knowledge; labelled nodes: ⊥12, ⊤12, be, af, db, da, mb, ma, jb, ja, sb, sa, ib, ia, fb, fa, pba]

R(pi, pj, t1) : [(k, l), μ1] ← R(pi, pj, t2) : [(m, n), μ2],

where k ≠ 0, l ≠ 0, m ≠ 0 and n ≠ 0. It can be equivalently transformed into the following two well annotated bf-EVALP clauses,

R(pi, pj, t1) : [(k, 0), μ1] ← R(pi, pj, t2) : [(m, 0), μ2] ∧ R(pi, pj, t2) : [(0, n), μ2],
R(pi, pj, t1) : [(0, l), μ1] ← R(pi, pj, t2) : [(m, 0), μ2] ∧ R(pi, pj, t2) : [(0, n), μ2].
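The Tv(12)bf annotations and the splitting of a non-well annotated literal can be checked with a small Python sketch. The dictionary names and the function split_annotation are hypothetical; the vector values are transcribed from the set Tv(12)bf above, and the negation table from Definition 3.2.

```python
# Vector annotations of the 15 bf-annotations in Tv(12)bf (from the text).
TV12 = {
    "be": (0, 8), "db": (0, 12), "mb": (1, 11), "jb": (2, 10), "sb": (3, 9),
    "ib": (4, 8), "fb": (5, 7), "pba": (6, 6), "fa": (7, 5), "af": (8, 0),
    "ia": (8, 4), "sa": (9, 3), "ja": (10, 2), "ma": (11, 1), "da": (12, 0),
}

# Epistemic negation over bf-annotations (Definition 3.2).
NEG1 = {
    "be": "af", "af": "be", "db": "da", "da": "db", "mb": "ma", "ma": "mb",
    "jb": "ja", "ja": "jb", "sb": "sa", "sa": "sb", "ib": "ia", "ia": "ib",
    "fb": "fa", "fa": "fb", "pba": "pba",
}


def split_annotation(vector, mu):
    """Split [(m, n), mu] into well annotated parts [(m, 0), mu], [(0, n), mu]."""
    m, n = vector
    if m == 0 or n == 0:
        return [(vector, mu)]        # nothing to split
    return [((m, 0), mu), ((0, n), mu)]


if __name__ == "__main__":
    # On these values, negating a bf-annotation swaps its vector components.
    for name, (i, j) in TV12.items():
        assert TV12[NEG1[name]] == (j, i)
    print(split_annotation(TV12["jb"], "alpha"))
```

A literal annotated jb(2, 10) is thus read as the conjunction of the two literals annotated (2, 0) and (0, 10).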

3.2 Implementation of Bf-EVALPSN

We now show how to implement bf-EVALPSN process order safety verification systems with a simple example. For simplicity, we do not consider cases in which two processes start at the same time or finish at the same time; however, the process order control system can deal with immediate before/after relations, which means that we consider the case in which two processes are processed in sequence. Then, we do not have to take the bf-annotations sb/sa, fb/fa and pba into account, and we take the following ten bf-annotations with new vector annotations into account:

before (be)/after (af), (0, 4)/(4, 0),
disjoint before (db)/after (da), (0, 7)/(7, 0),
immediate before (mb)/after (ma), (1, 6)/(6, 1),
joint before (jb)/after (ja), (2, 5)/(5, 2),
included before (ib)/after (ia), (3, 4)/(4, 3).

The complete bi-lattice Tv(7)bf including the ten bf-annotations is described as a Hasse diagram in Fig.12.

[Fig. 12. The Complete Bi-lattice Tv(7)bf of Bf-annotations. Hasse diagram with horizontal axis before/after and vertical axis knowledge; labelled nodes: ⊥7, ⊤7, be, af, db, da, mb, ma, jb, ja, ib, ia]

Now we show an example of implementing a real-time process order safety verification system in bf-EVALPSN.

Example 1
Suppose that three processes Pr0 (id p0), Pr1 (id p1) and Pr2 (id p2) appear in Fig.13, and that the next process Pr3 (id p3) does not appear in it. These processes are supposed to be processed according to the processing schedule in Fig.13. Then, we consider three bf-relations represented by the following bf-EVALP clauses (1), (2) and (3):

R(p0, p1, ti) : [(i1, j1), α], (1)
R(p1, p2, ti) : [(i2, j2), α], (2)
R(p2, p3, ti) : [(i3, j3), α], (3)

which will be inferred from the start/finish information of each process at time ti (i = 0, 1, 2, . . . , 7).

[Fig. 13. Process Timing Chart. Processes Pr0, Pr1 and Pr2 plotted against time t0, t1, . . . , t7]

At time t0 no process has started yet. Thus, we have no knowledge in terms of any bf-relation. Therefore, we have the bf-EVALP clauses,

R(p0, p1, t0) : [(0, 0), α],
R(p1, p2, t0) : [(0, 0), α],
R(p2, p3, t0) : [(0, 0), α].

At time t1 only process Pr0 has started, before process Pr1 starts. Then, any of the bf-annotations db(0, 7), mb(1, 6), jb(2, 5) or ib(3, 4) could be the final bf-annotation representing the bf-relation between processes Pr0 and Pr1; thus, the greatest lower bound be(0, 4) of the set of vector annotations {(0, 7), (1, 6), (2, 5), (3, 4)} becomes the vector annotation of bf-literal R(p0, p1, t1). The other bf-literals have the bottom vector annotation (0, 0). Therefore, we have the bf-EVALP clauses,

R(p0, p1, t1) : [(0, 4), α],
R(p1, p2, t1) : [(0, 0), α],
R(p2, p3, t1) : [(0, 0), α].

At time t2 the second process Pr1 has also started, before process Pr0 finishes. Then, one of the two bf-annotations jb(2, 5) or ib(3, 4) could be the final bf-annotation representing the bf-relation between processes Pr0 and Pr1. Thus, the greatest lower bound (2, 4) of the set of vector annotations {(2, 5), (3, 4)} has to be the vector annotation of bf-literal R(p0, p1, t2). In addition, bf-literal R(p1, p2, t2) has the bf-annotation be(0, 4), as bf-literal R(p0, p1, t1) did, since process Pr1 has also started before process Pr2 starts. On the other hand, bf-literal R(p2, p3, t2) has the bottom vector annotation (0, 0), since neither process Pr2 nor process Pr3 has started yet. Therefore, we have the bf-EVALP clauses,

R(p0, p1, t2) : [(2, 4), α],
R(p1, p2, t2) : [(0, 4), α],
R(p2, p3, t2) : [(0, 0), α].

At time t3 process Pr2 has started before both processes Pr0 and Pr1 finish. Then, both bf-literals R(p0, p1, t3) and R(p1, p2, t3) have the same vector annotation (2, 4) as bf-literal R(p0, p1, t2). Moreover, bf-literal R(p2, p3, t3) has the bf-annotation be(0, 4), as bf-literal R(p0, p1, t1) did. Therefore, we have the bf-EVALP clauses,

R(p0, p1, t3) : [(2, 4), α],
R(p1, p2, t3) : [(2, 4), α],
R(p2, p3, t3) : [(0, 4), α].

At time t4 process Pr2 has finished before both processes Pr0 and Pr1 finish. Then, bf-literal R(p0, p1, t4) still has the same vector annotation (2, 4) as at the previous time t3. In addition, bf-literal R(p1, p2, t4) has its final bf-annotation ib(3, 4). For the final bf-relation between processes Pr2 and Pr3 there are still two alternatives: (1) if process Pr3 starts immediately after process Pr2 finishes, bf-literal R(p2, p3, t4) has its final bf-annotation mb(1, 6); (2) if process Pr3 does not start immediately after process Pr2 finishes, bf-literal R(p2, p3, t4) has its final bf-annotation db(0, 7). Either way, we at least have the knowledge that process Pr2 has just finished at time t4, which can be represented by the vector annotation (0, 6), the greatest lower bound of the set of vector annotations {(1, 6), (0, 7)}. Therefore, we have the bf-EVALP clauses,

R(p0, p1, t4) : [(2, 4), α],
R(p1, p2, t4) : [(3, 4), α],
R(p2, p3, t4) : [(0, 6), α].

At time t5 process Pr0 has finished before process Pr1 finishes. Then, bf-literal R(p0, p1, t5) has its final bf-annotation jb(2, 5), and bf-literal R(p2, p3, t5) also has its final bf-annotation db(0, 7) because process Pr3 has not started yet. Therefore, we have the bf-EVALP clauses,

R(p0, p1, t5) : [jb(2, 5), α],
R(p1, p2, t5) : [ib(3, 4), α],
R(p2, p3, t5) : [db(0, 7), α],

and all the bf-relations have been determined at time t5, before process Pr1 finishes and process Pr3 starts.
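The updates in Example 1 all follow one pattern: collect the bf-annotations that are still possible and take their greatest lower bound. A minimal Python sketch is given below. It assumes the componentwise ordering on vector annotations, under which the greatest lower bound is the componentwise minimum; this agrees with every value computed in the example. The names TV7 and annotation_for are hypothetical.

```python
# Vector annotations of the ten bf-annotations in Tv(7)bf (from the text).
TV7 = {
    "be": (0, 4), "af": (4, 0), "db": (0, 7), "da": (7, 0),
    "mb": (1, 6), "ma": (6, 1), "jb": (2, 5), "ja": (5, 2),
    "ib": (3, 4), "ia": (4, 3),
}


def glb(vectors):
    """Greatest lower bound under the assumed componentwise ordering."""
    vectors = list(vectors)
    return (min(v[0] for v in vectors), min(v[1] for v in vectors))


def annotation_for(candidates):
    """Vector annotation of a bf-literal whose final relation is still open."""
    return glb(TV7[name] for name in candidates)


if __name__ == "__main__":
    print(annotation_for(["db", "mb", "jb", "ib"]))  # time t1, R(p0, p1)
    print(annotation_for(["jb", "ib"]))              # time t2, R(p0, p1)
    print(annotation_for(["mb", "db"]))              # time t4, R(p2, p3)
```

When nothing is known, every bf-annotation is still a candidate, and the greatest lower bound of all ten is the bottom annotation (0, 0), matching the clauses at time t0.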

In Example 1, we have shown how the vector annotations of bf-literals are updated in real time according to the start/finish information of processes. We will introduce real-time safety verification for process order control based on bf-EVALPSN with examples in the subsequent section.

[Fig. 14. Bf-EVALPSN Safety Verification Example (timing chart of the repeated processes Pr0 and Pr1)]

3.3 Safety Verification in Bf-EVALPSN

First of all we introduce the basic idea of bf-EVALPSN safety verification for process order with a simple example.

Suppose that two processes Pr0 and Pr1 are processed repeatedly, and process Pr1 must be processed immediately before process Pr0 starts, as shown in Fig.14. In bf-EVALPSN process order safety verification systems, the safety of the process order is generally verified against safety properties that must be assured. In order to verify the safety of the process order, we assume two safety properties SP-0 and SP-1 for processes Pr0 and Pr1 as follows:

SP-0 process Pr0 must start immediately after process Pr1 finishes,
SP-1 process Pr1 must start some time after (disjoint after) process Pr0 finishes.

Then, safety properties SP-0 and SP-1 should be verified immediately before processes Pr0 and Pr1 start, respectively.

In order to verify the bf-relation “immediate after” with safety property SP-0, it should be verified whether process Pr1 has finished immediately before process Pr0 starts or not, and the safety verification should be carried out immediately after process Pr1 finishes. Then bf-literal R(p0, p1, t) must have vector annotation (6, 0), which means that process Pr1 has finished but process Pr0 has not started yet. Therefore, safety property SP-0 is translated to the bf-EVALPSN clauses,

SP-0
Start(p0, t) : [(0, 1), γ] ← R(p0, p1, t) : [(6, 0), α] ∧ ∼ R(p0, p1, t) : [(7, 0), α], (4)
Start(p0, t) : [(0, 1), β] ← ∼ Start(p0, t) : [(0, 1), γ], (5)

where literal Start(pi, t) represents “process Pri starts at time t” and the set of its vector annotations constitutes the complete lattice Tv(1) = {⊥(0, 0), (0, 1), (1, 0), ⊤(1, 1)}.

On the other hand, in order to verify the bf-relation “disjoint after” with safety property SP-1, it should be verified whether there is a time lag between the finish time of process Pr0 and the start time of process Pr1 or not. Then, bf-literal R(p1, p0, t) must have bf-annotation da(7, 0). Therefore, safety property SP-1 is translated to the bf-EVALPSN clauses,

SP-1
Start(p1, t) : [(0, 1), γ] ← R(p1, p0, t) : [(7, 0), α], (6)
Start(p1, t) : [(0, 1), β] ← ∼ Start(p1, t) : [(0, 1), γ]. (7)

Now, we will describe how to verify the process order safety by safety properties SP-0 and SP-1 in bf-EVALPSN. In order to verify the process order safety, the following safety verification cycle is applied repeatedly.

Safety Verification Cycle

1st Step (safety verification for starting process Pr1)
Suppose that process Pr1 has not started yet at time t1. If process Pr0 has already finished at time t1, we have the bf-EVALP clause,

R(p1, p0, t1) : [(7, 0), α]. (8)

On the other hand, if process Pr0 has just finished at time t1, we have the bf-EVALP clause,

R(p1, p0, t1) : [(6, 0), α]. (9)

If the bf-EVALP clause (8) is input to safety property SP-1 {(6), (7)}, we obtain the EVALP clause,

Start(p1, t1) : [(0, 1), γ]

and the safety for starting process Pr1 is assured. On the other hand, if the bf-EVALP clause (9) is input to the same safety property SP-1, we obtain the EVALP clause

Start(p1, t1) : [(0, 1), β],

then the safety for starting process Pr1 is not assured.

2nd Step (safety verification for starting process Pr0)
Suppose that process Pr0 has not started yet at time t2. If process Pr1 has just finished at time t2, we have the bf-EVALP clause,

R(p0, p1, t2) : [(6, 0), α]. (10)

On the other hand, if process Pr1 has not finished yet at time t2, we have the bf-EVALP clause,

R(p0, p1, t2) : [(4, 0), α]. (11)

If the bf-EVALP clause (10) is input to safety property SP-0 {(4), (5)}, we obtain the EVALP clause,

Start(p0, t2) : [(0, 1), γ],

and the safety for starting process Pr0 is assured. On the other hand, if the bf-EVALP clause (11) is input to the same safety property SP-0, we obtain the EVALP clause,

Start(p0, t2) : [(0, 1), β],

then the safety for starting process Pr0 is not assured.
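The safety verification cycle can be mimicked in a few lines of Python. This is a sketch under two assumptions: a body literal with annotation (k, l) is taken to be satisfied by a fact annotated (i, j) when i ≥ k and j ≥ l, and the strong negation ∼ is evaluated as negation as failure, which the chapter allows for stratified programs. The function names sp0, sp1 and entails are hypothetical.

```python
PERMITTED = ((0, 1), "gamma")   # Start(p, t) : [(0, 1), γ]
FORBIDDEN = ((0, 1), "beta")    # Start(p, t) : [(0, 1), β]


def entails(actual, required):
    """Assumed satisfaction test: componentwise >= on vector annotations."""
    return actual[0] >= required[0] and actual[1] >= required[1]


def sp0(r_p0_p1):
    """SP-0: clause (4) derives permission, clause (5) forbids otherwise."""
    permitted = entails(r_p0_p1, (6, 0)) and not entails(r_p0_p1, (7, 0))
    return PERMITTED if permitted else FORBIDDEN


def sp1(r_p1_p0):
    """SP-1: clause (6) derives permission, clause (7) forbids otherwise."""
    return PERMITTED if entails(r_p1_p0, (7, 0)) else FORBIDDEN


if __name__ == "__main__":
    print(sp1((7, 0)), sp1((6, 0)))   # 1st step, clauses (8) and (9)
    print(sp0((6, 0)), sp0((4, 0)))   # 2nd step, clauses (10) and (11)
```

The negated literal in clause (4) is what separates “just finished” (6, 0) from “finished some time ago” (7, 0): both satisfy the first body literal, but only the former passes the negation.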

Example 2
In this example we provide a more practical bf-EVALPSN safety verification for process order control, using a simple brewery pipeline process order control. The brewery pipeline network consists of four tanks {T0, T1, T2, T3}, five pipelines {Pi0, Pi1, Pi2, Pi3, Pi4}, and two valves {V0, V1}, as shown in Fig.15. We assume that four pipeline processes:

process Pr0, a brewery process using line-1, tank T0 −→ valve V0 −→ tank T1 ;
process Pr1, a cleaning process by nitric acid using line-2, tank T3 −→ valve V1 −→ valve V0 −→ tank T2 ;
process Pr2, a cleaning process by water in line-1 ;
process Pr3, a brewery process using both line-1 and line-2 with mixing at valve V0 ;

are processed according to the processing schedule in Fig.15. We also assume four safety properties:

safety property SP-2, process Pr0 must start before any other processes start ;
safety property SP-3, process Pr1 must start immediately after process Pr0 starts ;
safety property SP-4, process Pr2 must start immediately after process Pr1 finishes ;
safety property SP-5, process Pr3 must start immediately after both processes Pr0 and Pr2 finish.

Safety property SP-2 is translated to the bf-EVALPSN clauses,

[Fig. 15. Brewery Pipeline and Processing Schedule. Pipeline network: tanks T0 to T3, pipelines Pi0 to Pi4, valves V0 and V1; schedule: BREWERY row with Pr0 and Pr3, CLEANING row with Pr1 and Pr2]

SP-2
Start(p0, t) : [(0, 1), γ] ← ∼ R(p0, p1, t) : [(4, 0), α] ∧
                             ∼ R(p0, p2, t) : [(4, 0), α] ∧
                             ∼ R(p0, p3, t) : [(4, 0), α],
Start(p0, t) : [(0, 1), β] ← ∼ Start(p0, t) : [(0, 1), γ]. (12)

Like safety property SP-2, the other safety properties SP-3, SP-4 and SP-5 are also translated to bf-EVALPSN clauses,

SP-3
Start(p1, t) : [(0, 1), γ] ← R(p1, p0, t) : [(4, 0), α],
Start(p1, t) : [(0, 1), β] ← ∼ Start(p1, t) : [(0, 1), γ], (13)

SP-4
Start(p2, t) : [(0, 1), γ] ← R(p2, p1, t) : [(6, 0), α] ∧
                             ∼ R(p2, p1, t) : [(7, 0), α],
Start(p2, t) : [(0, 1), β] ← ∼ Start(p2, t) : [(0, 1), γ], (14)

SP-5
Start(p3, t) : [(0, 1), γ] ← R(p3, p0, t) : [(6, 0), α] ∧
                             R(p3, p2, t) : [(6, 0), α] ∧
                             ∼ R(p3, p2, t) : [(7, 0), α],
Start(p3, t) : [(0, 1), γ] ← R(p3, p0, t) : [(6, 0), α] ∧
                             R(p3, p2, t) : [(6, 0), α] ∧
                             ∼ R(p3, p0, t) : [(7, 0), α],
Start(p3, t) : [(0, 1), β] ← ∼ Start(p3, t) : [(0, 1), γ]. (15)

Now, we will describe the safety verification process for the process order in Fig.15.

Initial Stage (t0). No process has started at time t0, so we have no information about the bf-relations between the processes Pr0, Pr1, Pr2 and Pr3; thus, we have the bf-EVALP clauses,

R(p0, p1, t0) : [(0, 0), α], (16)
R(p0, p2, t0) : [(0, 0), α], (17)
R(p0, p3, t0) : [(0, 0), α]. (18)

In order to verify the safety for starting the first process Pr0, the bf-EVALP clauses (16), (17) and (18) are input to safety property SP-2 (12). Then, we obtain the EVALP clause,

Start(p0, t0) : [(0, 1), γ],

which expresses permission for starting process Pr0, and its safety is assured at time t0. Otherwise, it is not assured.

2nd Stage (t1). Suppose that only process Pr0 has already started at time t1. Then, we have the bf-EVALP clause,

R(p1, p0, t1) : [(4, 0), α]. (19)

In order to verify the safety for starting the second process Pr1, the bf-EVALP clause (19) is input to safety property SP-3 (13). Then, we obtain the EVALP clause,

Start(p1, t1) : [(0, 1), γ],

and the safety for starting process Pr1 is assured at time t1. Otherwise, it is not assured.

3rd Stage (t2). Suppose that processes Pr0 and Pr1 have already started, and neither of them has finished yet at time t2. Then, we have the bf-EVALP clauses,

R(p2, p0, t2) : [(4, 0), α], (20)
R(p2, p1, t2) : [(4, 0), α]. (21)

In order to verify the safety for starting the third process Pr2, if the bf-EVALP clause (21) is input to safety property SP-4 (14), then we obtain the EVALP clause,

Start(p2, t2) : [(0, 1), β],

and the safety for starting process Pr2 is not assured at time t2. On the other hand, if process Pr1 has just finished at time t2, then we have the bf-EVALP clause,

R(p2, p1, t2) : [(6, 0), α]. (22)

If the bf-EVALP clause (22) is input to safety property SP-4 (14), then we obtain the EVALP clause,

Start(p2, t2) : [(0, 1), γ],

and the safety for starting process Pr2 is assured.

4th Stage (t3). Suppose that processes Pr0, Pr1 and Pr2 have already started, processes Pr0 and Pr1 have already finished, and only process Pr3 has not started yet at time t3. Then, we have the bf-EVALP clauses,

R(p3, p0, t3) : [(7, 0), α], (23)
R(p3, p1, t3) : [(7, 0), α], (24)
R(p3, p2, t3) : [(4, 0), α]. (25)

In order to verify the safety for starting the last process Pr3, if the bf-EVALP clauses (23) and (25) are input to safety property SP-5 (15), then we obtain the EVALP clause,

Start(p3, t3) : [(0, 1), β],

and the safety for starting process Pr3 is not assured at time t3. On the other hand, if process Pr2 has just finished at time t3, then we have the bf-EVALP clause,

R(p3, p2, t3) : [(6, 0), α]. (26)

If bf-EVALP clause (26) is input to safety property SP-5 (15), then we obtain the EVALP clause,

Start(p3, t3) : [(0, 1), γ],

and the safety for starting process Pr3 is assured.
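The four stages above can be replayed with a small Python sketch of safety properties SP-2 to SP-5. As before, it is assumed that a body literal annotated (k, l) is satisfied by a fact annotated (i, j) when i ≥ k and j ≥ l, that ∼ is evaluated as negation as failure, and that a missing fact counts as the bottom annotation (0, 0). The function may_start and the dictionary encoding of the facts are hypothetical; the function returns True for permission [(0, 1), γ] and False for forbiddance [(0, 1), β].

```python
BOTTOM = (0, 0)


def holds(facts, pi, pj, required):
    """True if the fact R(pi, pj, t) satisfies the required annotation."""
    i, j = facts.get((pi, pj), BOTTOM)
    return i >= required[0] and j >= required[1]


def may_start(process, facts):
    """Evaluate SP-2 .. SP-5 for the given process id."""
    if process == "p0":   # SP-2 (12): no other process has started
        return not any(holds(facts, "p0", q, (4, 0)) for q in ("p1", "p2", "p3"))
    if process == "p1":   # SP-3 (13)
        return holds(facts, "p1", "p0", (4, 0))
    if process == "p2":   # SP-4 (14)
        return (holds(facts, "p2", "p1", (6, 0))
                and not holds(facts, "p2", "p1", (7, 0)))
    if process == "p3":   # SP-5 (15): the two permission clauses combined
        finished = (holds(facts, "p3", "p0", (6, 0))
                    and holds(facts, "p3", "p2", (6, 0)))
        just_now = (not holds(facts, "p3", "p2", (7, 0))
                    or not holds(facts, "p3", "p0", (7, 0)))
        return finished and just_now
    raise ValueError("unknown process id")


if __name__ == "__main__":
    stage4 = {("p3", "p0"): (7, 0), ("p3", "p1"): (7, 0), ("p3", "p2"): (4, 0)}
    print(may_start("p3", stage4))            # clauses (23) and (25)
    stage4[("p3", "p2")] = (6, 0)             # clause (26)
    print(may_start("p3", stage4))
```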

Table 1. Table of Vector Annotations

bf-relations           t0      t1      t2      t3      t4       t5       t6       t7

R(p0(a1), p1(a1), t)   (0, 0)  (0, 8)  (2, 8)  (2, 8)  (2, 8)   (2, 10)  (2, 10)  (2, 10)
R(p1(a1), p2(a1), t)   (0, 0)  (0, 0)  (0, 8)  (0, 8)  (0, 8)   (0, 8)   (0, 12)  (0, 12)

R(p0(a2), p1(a2), t)   (0, 0)  (0, 8)  (0, 8)  (2, 8)  (4, 8)   (4, 8)   (4, 8)   (4, 8)
R(p1(a2), p2(a2), t)   (0, 0)  (0, 0)  (0, 0)  (0, 8)  (0, 12)  (0, 12)  (0, 12)  (0, 12)

R(p0(a3), p1(a3), t)   (0, 0)  (0, 0)  (0, 8)  (0, 8)  (0, 8)   (0, 8)   (0, 12)  (0, 12)

R(p0(as), p1(as), t)   (0, 0)  (5, 5)  (5, 5)  (5, 5)  (5, 5)   (6, 6)   (6, 6)   (6, 6)
R(p1(as), p2(as), t)   (0, 0)  (0, 8)  (2, 8)  (2, 8)  (2, 8)   (2, 10)  (2, 10)  (2, 10)
R(p2(as), p3(as), t)   (0, 0)  (0, 0)  (5, 5)  (5, 5)  (5, 5)   (5, 5)   (6, 6)   (6, 6)
R(p3(as), p4(as), t)   (0, 0)  (0, 0)  (0, 8)  (2, 8)  (4, 8)   (4, 8)   (4, 8)   (4, 8)
R(p4(as), p5(as), t)   (0, 0)  (0, 0)  (0, 0)  (0, 8)  (0, 12)  (0, 12)  (0, 12)  (0, 12)

4 Reasoning in Bf-EVALPSN

In this section, we introduce two useful reasonings based on unique featuresof vector annotations in bf-EVALPSN with complete lattice Tv(12).

4.1 Reasoning under Incomplete Information in Bf-EVALPSN

We introduce a reasoning system for bf-relations under contradictory and in-complete information in taking a multi-agent system example, which implieshow to utilize paraconsistent knowledge in vector annotations.

Example 3Suppose a multi-agent system consisting of three autonomous agents 1,2 and3 (ids : a1,a2 and a3, resp) who can detect start/finish times of three processesPr0, Pr1 and Pr2 in Fig.13, and their supervisor (id : as) who can infer thecorrect bf-relations under incomplete or contradictory bf-relation informationdetected by each agent at each time ti (i = 0, 1, 2, . . . , 7). We assume that :

agent 1 (a1) fails to detect process Pr2 start/finish times,agent 2 (a2) fails to detect process Pr1 start/finish times, andagent 3 (a3) can detect only process Pr1 start/finish times.

Let Pri(aj) be the i-th process identified by agent aj and pi(aj) its process id, where i ∈ {0, 1, 2, ...} and j ∈ {1, 2, 3, s (supervisor)}. Table 1 then shows how the vector annotations describing the bf-relations inferred by each agent and the supervisor vary over time. It also shows that the supervisor identifies five processes from the bf-relations detected by the agents and eventually infers that there are just three different processes. We now describe the reasoning by which the supervisor obtains the correct bf-relations under incomplete or contradictory information from the agents.


At time t1, agents 1 and 2 have detected that their first processes, Pr0(a1) and Pr0(a2), respectively, have started, whereas agent 3 could not detect its first process. Therefore, they have the bf-EVALP clauses,

R(p0(a1), p1(a1), t1) : [(0, 8), α],
R(p0(a2), p1(a2), t1) : [(0, 8), α],
R(p0(a3), p1(a3), t1) : [(0, 0), α];

the supervisor obtains the above bf-EVALP clauses from each agent and recognizes the two processes Pr0(a1) and Pr0(a2), which have started at the same time t1, as the different processes Pr0(as) and Pr1(as) for the supervisor, respectively; therefore, it has the bf-EVALP clauses,

R(p0(as), p1(as), t1) : [(5, 5), α],
R(p0(as), p2(as), t1) : [(0, 8), α],
R(p1(as), p3(as), t1) : [(0, 8), α];

where processes

Pr2(as) = Pr1(a1) and Pr3(as) = Pr1(a2),

although they have not started yet.

At time t2, agents 1 and 3 have simultaneously detected their second and first processes, Pr1(a1) and Pr0(a3), respectively; on the other hand, agent 2 could not detect anything, that is to say, the knowledge of agent 2 is the same as at time t1. Therefore, agents 1 and 3 have the bf-EVALP clauses,

R(p0(a1), p1(a1), t2) : [(2, 8), α],
R(p1(a1), p2(a1), t2) : [(0, 8), α],
R(p0(a3), p1(a3), t2) : [(0, 8), α];

the supervisor obtains the above bf-EVALP clauses from each agent and recognizes the two processes Pr1(a1) and Pr0(a3), which have started at the same time t2, as the different processes Pr2(as) and Pr4(as) for the supervisor, respectively; thus, it has the bf-EVALP clauses,

R(p0(as), p2(as), t2) : [(2, 8), α],
R(p2(as), p5(as), t2) : [(0, 8), α],
R(p4(as), p6(as), t2) : [(0, 8), α],
R(p2(as), p4(as), t2) : [(5, 5), α],

where processes

Pr5(as) = Pr2(a1) and Pr6(as) = Pr1(a3),

although they have not started yet.


At time t3, only agent 2 has detected that its second process Pr1(a2) has started, whereas agents 1 and 3 could not detect anything. Therefore, agent 2 has the bf-EVALP clauses,

R(p0(a2), p1(a2), t3) : [(2, 8), α], R(p1(a2), p2(a2), t3) : [(0, 8), α];

the supervisor obtains the above bf-EVALP clauses from agent 2 and recognizes process Pr1(a2), which has started at time t3, as process Pr3(as) for the supervisor; thus, it has the bf-EVALP clauses,

R(p1(as), p3(as), t3) : [(2, 8), α], R(p3(as), p7(as), t3) : [(0, 8), α],

where process Pr7(as) = Pr2(a2), although it has not started yet.

At time t4, only agent 2 has detected that its second process Pr1(a2) has finished, whereas agents 1 and 3 could not detect anything. Therefore, agent 2 has the bf-EVALP clauses,

R(p0(a2), p1(a2), t4) : [(4, 8), α], R(p1(a2), p2(a2), t4) : [(0, 12), α];

the supervisor obtains the above bf-EVALP clauses from agent 2 and recognizes that process Pr1(a2), that is, process Pr3(as), has finished at time t4; therefore, it has the bf-EVALP clauses,

R(p1(as), p3(as), t4) : [ib(4, 8), α], R(p3(as), p7(as), t4) : [db(0, 12), α].

At time t5, agents 1 and 2 have detected that both processes Pr0(a1) and Pr0(a2) have finished, whereas agent 3 could not detect anything. Therefore, they have the bf-EVALP clauses,

R(p0(a1), p1(a1), t5) : [(2, 10), α], R(p0(a2), p1(a2), t5) : [(4, 8), α];

the supervisor obtains the above bf-relations from agents 1 and 2 and recognizes that the two processes Pr0(a1) and Pr0(a2), regarded as the different processes Pr0(as) and Pr1(as) by the supervisor, respectively, have finished at the same time t5; therefore, it has the bf-EVALP clauses,

R(p0(as), p1(as), t5) : [pba(6, 6), α], R(p1(as), p2(as), t5) : [jb(2, 10), α].

At time t6, agents 1 and 3 have detected that their second and first processes, Pr1(a1) and Pr0(a3), respectively, have finished, whereas agent 2 could not detect anything. Therefore, they have the bf-EVALP clauses,

R(p1(a1), p2(a1), t6) : [(0, 12), α], R(p0(a3), p1(a3), t6) : [(0, 12), α];

the supervisor obtains the above bf-relations from agents 1 and 3 and recognizes that the two processes Pr1(a1) and Pr0(a3), regarded as the different processes Pr2(as) and Pr4(as) by the supervisor, respectively, have finished at the same time t6; thus, it has the bf-EVALP clause,

R(p2(as), p4(as), t6) : [pba(6, 6), α].


Since the supervisor has the bf-EVALP clauses

R(p0(as), p1(as), t6) : [pba(6, 6), α] and
R(p2(as), p4(as), t6) : [pba(6, 6), α],

it eventually obtains the following correspondence between the processes:

Pr0(a1) = Pr0(a2) = Pr0(as) = Pr1(as),
Pr1(a1) = Pr0(a3) = Pr2(as) = Pr4(as),
Pr1(a2) = Pr3(as),
Pr2(a1) = Pr5(as),
Pr1(a3) = Pr6(as),
Pr2(a2) = Pr7(as),

where the underlined processes have actually appeared. Thus, the supervisor obtains the bf-EVALP clauses,

R(p0(as), p2(as), t6) : [jb(2, 10), α],
R(p2(as), p3(as), t6) : [ib(4, 8), α],
R(p3(as), p7(as), t6) : [db(0, 12), α],

as the correct bf-relations, which say that three processes have been detected and have finished: the bf-relation between the first and second processes Pr0 and Pr1 is ‘joint before/after’, the bf-relation between the second and third processes Pr1 and Pr2 is ‘include before/after’, and the bf-relation between the third and fourth processes Pr2 and Pr3 is ‘disjoint before/after’, since the fourth process Pr3 (Pr7(as)) has not started yet.
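The identification step at the end of Example 3 can be illustrated with a small sketch. It is not the chapter's bf-EVALPSN inference itself, only a plain Python illustration under stated assumptions: the supervisor's clauses are stored in a hypothetical dictionary `clauses` keyed by pairs of supervisor process ids, the annotation names are limited to those mentioned in this section, and two processes are treated as one whenever their bf-relation carries the annotation pba(6, 6).

```python
# Names of the complete vector annotations mentioned in this section.
BF_NAMES = {(0, 12): "db", (1, 11): "mb", (2, 10): "jb",
            (3, 9): "sb", (4, 8): "ib", (6, 6): "pba"}


def merge_identical(clauses):
    """Group process ids whose bf-relation is annotated pba(6, 6)."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (p, q), annotation in clauses.items():
        root_p, root_q = find(p), find(q)
        if BF_NAMES.get(annotation) == "pba" and root_p != root_q:
            # Keep the smaller id as the representative of the merged group.
            parent[max(root_p, root_q)] = min(root_p, root_q)

    groups = {}
    for p in list(parent):
        groups.setdefault(find(p), set()).add(p)
    return groups


# Supervisor clauses quoted in the example (hypothetical encoding).
clauses = {
    ("p0", "p1"): (6, 6),   # R(p0(as), p1(as), t6) : pba(6, 6)
    ("p2", "p4"): (6, 6),   # R(p2(as), p4(as), t6) : pba(6, 6)
    ("p1", "p2"): (2, 10),  # R(p1(as), p2(as), t5) : jb(2, 10)
    ("p1", "p3"): (4, 8),   # R(p1(as), p3(as), t4) : ib(4, 8)
    ("p3", "p7"): (0, 12),  # R(p3(as), p7(as), t4) : db(0, 12)
}
groups = merge_identical(clauses)
```

With these clauses, `groups` collects p0 with p1 and p2 with p4, matching the correspondences Pr0(as) = Pr1(as) and Pr2(as) = Pr4(as) stated above.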

4.2 Transitive Reasoning for Bf-Relations

Suppose that a bf-EVALPSN process order control system has to deal with ten processes. If it deals with all the bf-relations between the ten processes, forty-five bf-relations have to be considered, which may incur a high computation cost. In order to reduce this cost, we consider inference rules that derive, in real time, the bf-relation between processes Pri and Prk from the bf-relations between processes Pri and Prj and between processes Prj and Prk in bf-EVALPSN; these are called bf-relation inference rules, or bf-inf rules for short. We introduce how to derive some of the bf-inf rules and how to apply them to real-time process order control.
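As a quick check of the count quoted above, the number of unordered pairs among n processes (one bf-relation per pair) is the binomial coefficient; the symbol n is introduced here only for illustration.

```latex
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
\binom{10}{2} = \frac{10 \cdot 9}{2} = 45 .
```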

Suppose that three processes Pr0, Pr1 and Pr2 are processed according to the process schedule (Fig. 16), in which only the start time of process Pr2 varies between time t3 and time t5 and no bf-relation among the processes varies. The vector annotations of bf-literals R(p0, p1, t), R(p1, p2, t) and R(p0, p2, t) at each time ti (i = 1, ..., 7) are shown by the three charts


[Figure: three timing charts, each showing processes Pr0, Pr1 and Pr2 on a time axis labelled t0 to t7]

Fig. 16. Process Timing Chart 1 (top left), 2 (top right), 3 (bottom left)

Table 2. Vector Annotations of Process Time Chart 1,2,3

process time chart 1   t0      t1      t2      t3      t4       t5       t6       t7

R(p0, p1, t)           (0, 0)  (0, 8)  (2, 8)  (2, 8)  (2, 10)  (2, 10)  (2, 10)  (2, 10)
R(p1, p2, t)           (0, 0)  (0, 0)  (0, 8)  (2, 8)  (2, 8)   (2, 8)   (4, 8)   (4, 8)
R(p0, p2, t)           (0, 0)  (0, 8)  (0, 8)  (2, 8)  (2, 10)  (2, 10)  (2, 10)  (2, 10)

process time chart 2   t0      t1      t2      t3      t4       t5       t6       t7

R(p0, p1, t)           (0, 0)  (0, 8)  (2, 8)  (2, 8)  (2, 10)  (2, 10)  (2, 10)  (2, 10)
R(p1, p2, t)           (0, 0)  (0, 0)  (0, 8)  (0, 8)  (2, 8)   (2, 8)   (4, 8)   (4, 8)
R(p0, p2, t)           (0, 0)  (0, 8)  (0, 8)  (0, 8)  (1, 11)  (1, 11)  (1, 11)  (1, 11)

process time chart 3   t0      t1      t2      t3      t4       t5       t6       t7

R(p0, p1, t)           (0, 0)  (0, 8)  (2, 8)  (2, 8)  (2, 10)  (2, 10)  (2, 10)  (2, 10)
R(p1, p2, t)           (0, 0)  (0, 0)  (0, 8)  (0, 8)  (0, 8)   (2, 10)  (4, 8)   (4, 8)
R(p0, p2, t)           (0, 0)  (0, 8)  (0, 8)  (0, 8)  (0, 12)  (0, 12)  (0, 12)  (0, 12)

in Table 2. For each chart, if we focus on the vector annotations at time t1 and time t2, the following bf-inf rule can be derived:

rule-1

R(p0, p2, t) : [(0, 8), α] ← R(p0, p1, t) : [(0, 8), α] ∧ R(p1, p2, t) : [(0, 0), α],

which is reduced to

R(p0, p2, t) : [(0, 8), α] ← R(p0, p1, t) : [(0, 8), α]. (27)

Furthermore, if we also focus on the vector annotations at time t3 and time t4 in Table 2, the following two bf-inf rules can also be derived:

rule-2

R(p0, p2, t) : [(2, 8), α] ← R(p0, p1, t) : [(2, 8), α] ∧ R(p1, p2, t) : [(2, 8), α], (28)


rule-3

R(p0, p2, t) : [(2, 10), α] ← R(p0, p1, t) : [(2, 10), α] ∧ R(p1, p2, t) : [(2, 8), α]. (29)

As with bf-inf rules rule-2 and rule-3, the following two bf-inf rules can also be derived by focusing on the variations of the vector annotations at time t4.

rule-4

R(p0, p2, t) : [(1, 11), α] ← R(p0, p1, t) : [(2, 10), α] ∧ R(p1, p2, t) : [(2, 8), α],(30)(31)

rule-5

R(p0, p2, t) : [(0, 12), α] ← R(p0, p1, t) : [(2, 10), α] ∧ R(p1, p2, t) : [(0, 8), α].(32)

Among all the bf-inf rules, only rule-3 and rule-4 have the same precedent (body)

R(p0, p1, t) : [(2, 10), α] ∧ R(p1, p2, t) : [(2, 8), α],

and different consequents (heads)

R(p0, p2, t) : [(2, 10), α] and R(p0, p2, t) : [(1, 11), α].

Having the same precedent may cause duplicate application of the bf-inf rules: on the basis of their bodies alone, rule-3 and rule-4 obviously cannot be applied uniquely. In order to avoid such duplicate application, we consider all the correct applicable orders, order-1 (33), order-2 (34) and order-3 (35), for the bf-inf rules rule-1, ..., rule-5.

order-1 rule-1 −→ rule-2 −→ rule-3 (33)
order-2 rule-1 −→ rule-4 (34)
order-3 rule-1 −→ rule-5 (35)

As indicated in the above orders, bf-inf rule rule-3 should be applied immediately after bf-inf rule rule-2, whereas bf-inf rule rule-4 should be applied immediately after bf-inf rule rule-1. Thus, if we take the applicable orders (33), (34) and (35) into account, such confusion can be avoided. In fact, the bf-inf rules are not complete; that is to say, there exist some cases in which bf-relations cannot be uniquely determined by bf-inf rules alone, although we will not address this topic in detail in this chapter.

We show a real-time application of the bf-inf rules by taking process timing chart 3 in Fig. 16 as an example.

At time t1, bf-inf rule rule-1 is applied and we have the bf-EVALPSN clause,

R(p0, p2, t1) : [(0, 8), α].


Fig. 17. Anticipation of bf-relation

At time t2 and time t3, no rule can be applied and we still have the same vector annotation (0, 8) of bf-literal R(p0, p2, t3).

At time t4, only bf-inf rule rule-5 can be applied and we obtain the bf-EVALP clause,

R(p0, p2, t4) : [(0, 12), α]

and the bf-relation between processes Pr0 and Pr2 has been inferred according to process order order-3 (35).
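The real-time application just described can be mimicked with a short sketch. It is an illustration under simplifying assumptions rather than a bf-EVALPSN interpreter: rule bodies are matched by exact equality of vector annotations, the applicable orders (33) to (35) are encoded as a "rule that must have fired immediately before" field, and the names `RULES` and `infer_p0_p2` are hypothetical. The input rows are copied from Table 2.

```python
# (name, annotation of R(p0,p1), annotation of R(p1,p2),
#  derived annotation of R(p0,p2), rule required to have fired just before)
RULES = [
    ("rule-1", (0, 8), (0, 0), (0, 8), None),
    ("rule-2", (2, 8), (2, 8), (2, 8), "rule-1"),
    ("rule-3", (2, 10), (2, 8), (2, 10), "rule-2"),
    ("rule-4", (2, 10), (2, 8), (1, 11), "rule-1"),
    ("rule-5", (2, 10), (0, 8), (0, 12), "rule-1"),
]


def infer_p0_p2(r01, r12):
    """Return the inferred annotation of R(p0, p2, t) at each time step."""
    current, last_rule, trace = (0, 0), None, []
    for a01, a12 in zip(r01, r12):
        for name, body01, body12, head, previous in RULES:
            # A rule fires only if its body matches and its order is respected.
            if (a01, a12) == (body01, body12) and previous == last_rule:
                current, last_rule = head, name
                break
        trace.append(current)
    return trace


# Rows of Table 2 for times t0, ..., t7.
r01 = [(0, 0), (0, 8), (2, 8), (2, 8), (2, 10), (2, 10), (2, 10), (2, 10)]
r12_chart1 = [(0, 0), (0, 0), (0, 8), (2, 8), (2, 8), (2, 8), (4, 8), (4, 8)]
r12_chart2 = [(0, 0), (0, 0), (0, 8), (0, 8), (2, 8), (2, 8), (4, 8), (4, 8)]
r12_chart3 = [(0, 0), (0, 0), (0, 8), (0, 8), (0, 8), (2, 10), (4, 8), (4, 8)]

trace1 = infer_p0_p2(r01, r12_chart1)
trace2 = infer_p0_p2(r01, r12_chart2)
trace3 = infer_p0_p2(r01, r12_chart3)
```

Under these assumptions each trace coincides with the corresponding R(p0, p2, t) row of Table 2; in particular, the order field is what separates rule-3 (chart 1) from rule-4 (chart 2) when both bodies match at time t4.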

Although we could not introduce all the bf-inf rules, there are certainly many cases in which bf-inf rules reduce the cost of computing bf-relations in bf-EVALPSN process order control. In real-time process control systems, such a reduction of computing cost is required and significant in practice.

As another topic, we briefly introduce the anticipation of bf-relations in bf-EVALPSN. For example, suppose that three processes Pr0, Pr1 and Pr2 have started in this order, and only process Pr1 has finished at time t, as shown in Fig. 17. Then, the two bf-relations between processes Pr0 and Pr1 and between processes Pr1 and Pr2 have already been determined, and we have two bf-EVALP clauses with complete bf-annotations,

R(p0, p1, t) : [ib(4, 8), α] and R(p1, p2, t) : [mb(1, 11), α]. (36)

On the other hand, the bf-relation between processes Pr0 and Pr2 cannot be determined yet. However, if we use the bf-inf rule,

rule-6

R(p0, p2, t) : [(2, 8), α] ← R(p0, p1, t) : [(4, 8), α] ∧ R(p1, p2, t) : [(2, 10), α], (37)

we obtain vector annotation (2, 8) as the incomplete bf-annotation of bf-literal R(p0, p2, t). Moreover, it is logically anticipated that the bf-relation between processes Pr0 and Pr2 will finally be represented by one of three bf-annotations (vector annotations), jb(2, 10), sb(3, 9) and ib(4, 8), since the vector annotation (2, 8) is the greatest lower bound of the set of vector annotations {(2, 10), (3, 9), (4, 8)}. As mentioned above, we can systematically anticipate complete bf-annotations from incomplete bf-annotations in bf-EVALPSN. This remarkable anticipatory feature of bf-EVALPSN could be applied to some kinds of safety verification and control that require such logical anticipation.
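The greatest-lower-bound argument can be checked with a minimal sketch. It assumes that vector annotations are ordered componentwise, and it lists only the complete bf-annotations named in this section (the full lattice Tv(12) contains more); the names `COMPLETE`, `leq`, `glb` and `anticipate` are hypothetical.

```python
# Complete bf-annotations named in this section (not the full set in Tv(12)).
COMPLETE = {"db": (0, 12), "mb": (1, 11), "jb": (2, 10),
            "sb": (3, 9), "ib": (4, 8), "pba": (6, 6)}


def leq(a, b):
    """Componentwise order on vector annotations (assumed here)."""
    return a[0] <= b[0] and a[1] <= b[1]


def glb(annotations):
    """Greatest lower bound under the componentwise order."""
    xs, ys = zip(*annotations)
    return (min(xs), min(ys))


def anticipate(incomplete):
    """Names of the listed complete annotations lying above `incomplete`."""
    return sorted(name for name, ann in COMPLETE.items() if leq(incomplete, ann))


bound = glb([(2, 10), (3, 9), (4, 8)])
candidates = anticipate((2, 8))
```

Here `bound` evaluates to (2, 8), and among the listed annotations only jb, sb and ib lie above (2, 8), in agreement with the three anticipated bf-relations.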


5 Conclusions and Remarks

In this chapter, we have introduced bf-EVALPSN and its application to real-time process order control and its safety verification. The process order control method based on bf-EVALPSN safety verification can be applied to various process order control systems requiring real-time processing.

An interval temporal logic has been proposed by Allen et al. for knowledge representation of properties, actions and events [1, 2]. In the interval temporal logic, predicates such as Meets(m, n) are used for representing primitive before-after relations between time intervals m and n, and other before-after relations are represented by six predicates such as Before, Overlaps, etc. It is well known that the interval temporal logic is a logically sophisticated tool for developing practical planning or natural language understanding systems [1, 2]. However, it does not seem to be so suitable for practical real-time processing, because before-after relations between two processes cannot be determined until both of them finish. On the other hand, in bf-EVALPSN, bf-relations are represented more minutely in paraconsistent vector annotations and can be determined according to the start/finish information of two processes in real time. Moreover, EVALPSN can be implemented on microchips as electronic circuits, although this has not been introduced in this chapter. We have already shown in [29, 39] that some EVALPSN based control systems can be implemented on microchips. Therefore, bf-EVALPSN is a more practical tool for dealing with real-time process order control and its safety verification.

In addition to its suitability for real-time processing, bf-EVALPSN can deal with incomplete and paracomplete knowledge of before-after relations in vector annotations, although the treatment of paracomplete knowledge has not been discussed in this chapter. Furthermore, bf-EVALPSN has inference rules for transitive reasoning about before-after relations, as briefly described above. Therefore, if EVALPSN and bf-EVALPSN are applied appropriately, various systems could be made more intelligent.

References

1. Allen, J.F.: Towards a General Theory of Action and Time. Artificial Intelligence 23, 123–154 (1984)

2. Allen, J.F., Ferguson, G.: Actions and Events in Interval Temporal Logic. J. Logic and Computation 4, 531–579 (1994)

3. Apt, K.R., Blair, H.A., Walker, A.: Towards a theory of declarative knowledge. In: Minker, J. (ed.) Foundation of Deductive Database and Logic Programs, pp. 89–148. Morgan Kaufmann, San Francisco (1989)

4. Belnap, N.D.: A useful four valued logic. In: Dunn, M., Epstein, G. (eds.) Modern Uses of Multiple-Valued Logic, pp. 8–37. D. Reidel Publishing, Netherlands (1977)

5. Billington, D.: Defeasible logic is stable. J. Logic and Computation 3, 379–400 (1993)


6. Billington, D.: Conflicting literals and defeasible logic. In: Nayak, A., Pagnucco, M. (eds.) Proc. 2nd Australian Workshop Commonsense Reasoning, December 1, pp. 1–15. Australian Artificial Intelligence Institute, Australia (1997)

7. Blair, H.A., Subrahmanian, V.S.: Paraconsistent logic programming. Theoretical Computer Science 68, 135–154 (1989)

8. da Costa, N.C.A., Subrahmanian, V.S., Vago, C.: The paraconsistent logics PT. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik 37, 139–148 (1989)

9. Dressler, O.: An extended basic ATMS. In: Reinfrank, M., Ginsberg, M.L., de Kleer, J., Sandewall, E., et al. (eds.) Non-Monotonic Reasoning 1988. LNCS, vol. 346, pp. 143–163. Springer, Heidelberg (1988)

10. Fitting, M.: Bilattice and the semantics of logic programming. J. Logic Programming 11, 91–116 (1991)

11. Gelder, A.V., Ross, K.A., Schlipf, J.S.: The well-founded semantics for general logic programs. J. ACM 38, 620–650 (1991)

12. Jaskowski, S.: Propositional calculus for contradictory deductive system (English translation of the original Polish paper). Studia Logica 24, 143–157 (1948)

13. Kifer, M., Subrahmanian, V.S.: Theory of generalized annotated logic programming and its applications. J. Logic Programming 12, 335–368 (1992)

14. Lloyd, J.W.: Foundations of Logic Programming, 2nd edn. Springer, Berlin (1987)

15. Moore, R.: Semantical considerations on non-monotonic logic. Artificial Intelligence 25, 75–94 (1985)

16. Morley, J.M.: Safety Assurance in Interlocking Design. Ph.D. Thesis, School of Informatics, University of Edinburgh (1996)

17. Nakamatsu, K., Suzuki, A.: Annotated semantics for default reasoning. In: Dai, R. (ed.) Proc. 3rd Pacific Rim Intl. Conf. Artificial Intelligence (PRICAI 1994), Beijing, China, August 15–18, pp. 180–186. International Academic Publishers, China (1994)

18. Nakamatsu, K., Suzuki, A.: A nonmonotonic ATMS based on annotated logic programs. In: Wobcke, W., Pagnucco, M., Zhang, C., et al. (eds.) Agents and Multi-Agent Systems Formalisms, Methodologies, and Applications. LNCS (LNAI), vol. 1441, pp. 79–93. Springer, Heidelberg (1998)

19. Nakamatsu, K., Abe, J.M.: Reasonings based on vector annotated logic programs. In: Mohammadian, M. (ed.) Computational Intelligence for Modelling, Control & Automation (CIMCA 1999). Concurrent Systems Engineering Series, vol. 55, pp. 396–403. IOS Press, Netherlands (1999)

20. Nakamatsu, K., Abe, J.M., Suzuki, A.: Defeasible reasoning between conflicting agents based on VALPSN. In: Tessier, C., Chaudron, L. (eds.) Proc. AAAI Workshop Agents' Conflicts, Orlando, FL, July 18, pp. 20–27. AAAI Press, Menlo Park (1999)

21. Nakamatsu, K., Abe, J.M., Suzuki, A.: Defeasible reasoning based on VALPSN and its application. In: Nayak, A., Pagnucco, M. (eds.) Proc. The Third Australian Commonsense Reasoning Workshop, Sydney, Australia, December 7, pp. 114–130. University of Newcastle, Sydney (1999)

22. Nakamatsu, K.: On the relation between vector annotated logic programs and defeasible theories. Logic and Logical Philosophy 8, 181–205 (2000)


23. Nakamatsu, K., Abe, J.M., Suzuki, A.: A defeasible deontic reasoning system based on annotated logic programming. In: Dubois, D.M. (ed.) Proc. 4th Intl. Conf. Computing Anticipatory Systems (CASYS 2000), Liege, Belgium, August 7–12. AIP Conference Proceedings, vol. 573, pp. 609–620. American Institute of Physics, New York (2000)

24. Nakamatsu, K., Abe, J.M., Suzuki, A.: Annotated semantics for defeasible deontic reasoning. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 432–440. Springer, Heidelberg (2001)

25. Nakamatsu, K., Abe, J.M., Suzuki, A.: Extended vector annotated logic program and its application to robot action control and safety verification. In: Abraham, A., et al. (eds.) Hybrid Information Systems. Advances in Soft Computing Series, pp. 665–680. Physica-Verlag, Heidelberg (2002)

26. Nakamatsu, K., Suito, H., Abe, J.M., Suzuki, A.: Paraconsistent logic program based safety verification for air traffic control. In: El Kamel, A., et al. (eds.) Proc. IEEE Intl. Conf. System, Man and Cybernetics 2002 (SMC 2002), Hammamet, Tunisia, October 6–9, IEEE SMC, Los Alamitos (2002)

27. Nakamatsu, K., Abe, J.M., Suzuki, A.: A railway interlocking safety verification system based on abductive paraconsistent logic programming. In: Abraham, A., et al. (eds.) Soft Computing Systems (HIS 2002). Frontiers in Artificial Intelligence and Applications, vol. 87, pp. 775–784. IOS Press, Netherlands (2002)

28. Nakamatsu, K., Abe, J.M., Suzuki, A.: Defeasible deontic robot control based on extended vector annotated logic programming. In: Dubois, D.M. (ed.) Proc. 5th Intl. Conf. Computing Anticipatory Systems (CASYS 2001), Liege, Belgium, August 13–18. AIP Conference Proceedings, vol. 627, pp. 490–500. American Institute of Physics, New York (2002)

29. Nakamatsu, K., Mita, Y., Shibata, T.: Defeasible deontic action control based on paraconsistent logic program and its hardware application. In: Mohammadian, M. (ed.) Proc. Intl. Conf. Computational Intelligence for Modelling Control and Automation 2003 (CIMCA 2003), February 12–14, IOS Press, Netherlands (2003)

30. Nakamatsu, K., Seno, T., Abe, J.M., Suzuki, A.: Intelligent real-time traffic signal control based on a paraconsistent logic program EVALPSN. In: Wang, G., Liu, Q., Yao, Y., Skowron, A., et al. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 719–723. Springer, Heidelberg (2003)

31. Nakamatsu, K., Komaba, H., Suzuki, A.: Defeasible deontic control for discrete events based on EVALPSN. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W., et al. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 310–315. Springer, Heidelberg (2004)

32. Nakamatsu, K., Ishikawa, R., Suzuki, A.: A paraconsistent based control for a discrete event cat and mouse. In: Negoita, M.G., Howlett, R.J., Jain, L.C., et al. (eds.) KES 2004. LNCS (LNAI), vol. 3214, pp. 954–960. Springer, Heidelberg (2004)

33. Nakamatsu, K., Chung, S.-L., Komaba, H., Suzuki, A.: A discrete event control based on EVALPSN stable model. In: Slezak, D., Wang, G., Szczuka, M.S., Duntsch, I., Yao, Y., et al. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 671–681. Springer, Heidelberg (2005)

34. Nakamatsu, K., Abe, J.M., Akama, S.: An intelligent safety verification based on a paraconsistent logic program. In: Khosla, R., Howlett, R.J., Jain, L.C., et al. (eds.) KES 2005. LNCS (LNAI), vol. 3682, pp. 708–715. Springer, Heidelberg (2005)


35. Nakamatsu, K., Kawasumi, K., Suzuki, A.: Intelligent verification for pipeline based on EVALPSN. In: Nakamatsu, K., Abe, J.M. (eds.) Advances in Logic Based Intelligent Systems. Frontiers in Artificial Intelligence and Applications, vol. 132, pp. 63–70. IOS Press, Netherlands (2005)

36. Nakamatsu, K., Suzuki, A.: Autoepistemic theory and paraconsistent logic program. In: Nakamatsu, K., Abe, J.M. (eds.) Advances in Logic Based Intelligent Systems. Frontiers in Artificial Intelligence and Applications, vol. 132, pp. 177–184. IOS Press, Netherlands (2005)

37. Nakamatsu, K., Suzuki, A.: Annotated semantics for non-monotonic reasonings in artificial intelligence – I, II, III, IV. In: Nakamatsu, K., Abe, J.M. (eds.) Advances in Logic Based Intelligent Systems. Frontiers in Artificial Intelligence and Applications, vol. 132, pp. 185–215. IOS Press, Netherlands (2005)

38. Nakamatsu, K.: Pipeline valve control based on EVALPSN safety verification. J. Advanced Computational Intelligence and Intelligent Informatics 10, 647–656 (2006)

39. Nakamatsu, K., Mita, Y., Shibata, T.: An intelligent action control system based on extended vector annotated logic program and its hardware implementation. J. Intelligent Automation and Soft Computing 13, 289–304 (2007)

40. Nakamatsu, K.: Paraconsistent Annotated Logic Program EVALPSN and its Application. In: Fulcher, J., Jain, C.L. (eds.) Computational Intelligence: A Compendium. Studies in Computational Intelligence, vol. 115, pp. 233–306. Springer, Germany (2008)

41. Nute, D.: Defeasible reasoning. In: Stohr, E.A., et al. (eds.) Proc. 20th Hawaii Intl. Conf. System Science (HICSS 1987), Kailua-Kona, Hawaii, January 6–9, pp. 470–477. University of Hawaii, Hawaii (1987)

42. Nute, D.: Basic defeasible logics. In: del Cerro, L.F., Penttonen, M. (eds.) Intensional Logics for Programming, pp. 125–154. Oxford University Press, Oxford (1992)

43. Nute, D.: Defeasible logic. In: Gabbay, D.M., et al. (eds.) Handbook of Logic in Artificial Intelligence and Logic Programming, vol. 3, pp. 353–396. Oxford University Press, Oxford (1994)

44. Nute, D.: Apparent obligatory. In: Nute, D. (ed.) Defeasible Deontic Logic. Synthese Library, vol. 263, pp. 287–316. Kluwer Academic Publisher, Netherlands (1997)

45. Przymusinski, T.C.: On the declarative semantics of deductive databases and logic programs. In: Minker, J. (ed.) Foundation of Deductive Database and Logic Programs, pp. 193–216. Morgan Kaufmann, New York (1988)

46. Reiter, R.: A logic for default reasoning. Artificial Intelligence 13, 81–123 (1980)

47. Shepherdson, J.C.: Negation as failure, completion and stratification. In: Gabbay, D.M., et al. (eds.) Handbook of Logic in Artificial Intelligence and Logic Programming, vol. 5, pp. 356–419. Oxford University Press, Oxford (1998)

48. Subrahmanian, V.S.: Amalgamating knowledge bases. ACM Trans. Database Systems 19, 291–331 (1994)

49. Subrahmanian, V.S.: On the semantics of qualitative logic programs. In: Proc. the 1987 Symp. Logic Programming (SLP 1987), August 31–September 4, pp. 173–182. IEEE Computer Society Press, San Francisco (1987)

50. Visser, A.: Four valued semantics and the liar. J. Philosophical Logic 13, 99–112 (1987)


A Fuzzy Set Approach to Software Reliability Modeling

P. Zeephongsekul

RMIT University, Melbourne, Victoria 3000, [email protected]

Summary. This chapter discusses a fuzzy set approach that extends the notion of software debugging from a 0-1 (perfect/imperfect) crisp approach to one which incorporates some fuzzy set ideas. The main objective of this extension is to make current software reliability models more realistic. The theory underlying this approach, and hence its key modeling tool, is the theory of random point processes with fuzzy marks. The relevance of this theory to software debugging arises from the fact that it incorporates the randomness due to the locations of the software faults and the fuzziness bestowed by the imprecision of the debugging effort. Through several examples, we also demonstrate that this theory provides the natural vehicle for an investigation into the properties and efficacy of fuzzy debugging of software programs and is therefore a contribution to computational intelligence.

1 Introduction

Almost all complex systems on which our modern scientific and technological world depends are run by computers. Personnel operating these computers would expect that the software systems running them are reliable in that they will fulfil their assigned tasks for a set of input cases with few, if any, errors. A software failure occurs if there is a departure of the program from its intended performance. This could be minor or very costly, depending on the environment in which the software is used and the severity of the consequences in the event of a failure. For example, if failures were to occur in software running a nuclear plant or medical facility, the results could be disastrous. Some recent examples of catastrophic disasters caused by software errors are catalogued and described in the book by R.L. Glass [9].

Software reliability, defined as the probability of failure-free operation of a program for a specified time in a specified environment [16], differs from hardware reliability in that the shortfall of the former is mainly due to design faults attributable to human error, and not due to mechanical or electrical wear and tear. Also, once all the errors resident in the software are removed, it will continue to be reliable, unlike hardware, whose reliability tends to

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 111–131. springerlink.com © Springer-Verlag Berlin Heidelberg 2009


decline with age, even with constant repair. Thus, achievement of software reliability would be expected to follow a different path to that of hardware reliability.

The chapter is organized in the following manner. We provide a brief literature review and discuss related work in Section 2. In Section 3, we give some preliminaries on the concepts of fuzzy random variables, their mean and variance, and define point processes with random fuzzy marks. We then introduce a reliability measure called gauge measure in Section 4. The mean and variance of a gauge measure are derived and their properties discussed. Finally, we apply the results to several numerical examples which involve fuzzy debugging of software programs.

2 Literature Review and Related Work

Currently, three methods are often employed to improve software reliability. The first method consists of designing software programs using structured programming, formal specification languages and related methods so that the probability of introducing an error into the program is reduced [21]. The second method, favored in hardware reliability, provides redundancy in the program or adds error-recovery procedures so that the software is able to perform satisfactorily even in the presence of errors. The third method, perhaps the most commonly used, consists of putting the software through different development life cycles including an intensive regime of testing, debugging and evaluation. The objective is to remove as many errors as possible before the software is released to users. In the process, an error detected in the program is either removed or not removed with certainty, i.e. software debugging is regarded as a 0-1, perfect-imperfect removal process.

A simple method of assessing the reliability of the software is to plot the cumulative number of failures detected (and removed) versus the elapsed time t (execution or calendar time) or the number of test cases conducted during the testing phase. These plots are called Software Reliability Growth Curves. Since the 1970s, many stochastic models, collectively called Software Reliability Growth Models (SRGMs), have been proposed to fit many of these reliability growth curves using empirical data (see for example, [17], [22], [27], [28], [29] and [30]). SRGMs are used extensively to predict reliability features such as the rate of occurrence of failures, the mean or median time to failure and the residual number of errors in the software at the conclusion of testing. They are also used to determine the optimal time to release software [20]. Figure 1 is an example of a reliability growth curve using data taken from [25], which contain the cumulative number of failures collected each day over 111 days. The program that was tested consists of 200 modules and about 1000 lines of high-level language code per module.
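As a small illustration of how the points of such a growth curve are obtained, the following sketch accumulates daily failure counts into cumulative counts. The daily counts are hypothetical and are not the data set of [25]; the variable names are likewise only illustrative.

```python
from itertools import accumulate

# Hypothetical number of failures detected on each test day (not the data of [25]).
daily_failures = [5, 3, 4, 2, 2, 1, 0, 1]

# Cumulative number of failures detected up to and including each day.
cumulative = list(accumulate(daily_failures))

# Points (day, cumulative failures) of the reliability growth curve.
curve = list(enumerate(cumulative, start=1))
```

Plotting `curve` with any plotting tool gives a non-decreasing curve whose flattening over time is what an SRGM attempts to fit.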

The application of fuzzy set ideas to reliability engineering is fairly recent, and can be found in the monograph [2]. In [3] and [4], the authors also argued


Fig. 1. An example of a software reliability growth curve

that software reliability behavior, due to the many subjective factors whichimpinge on it, should be regarded as fuzzy rather than stochastic. There-fore, an alternative approach, which recognizes the inexactitude of eventsinvolved in the process of software testing, is deemed necessary. It also ap-pears that the increase in complexity of some proposed stochastic SRGMshave not been universally successful in predicting software reliability nor arethey very tractable for users. Whilst it is not the objective of this chapterto enter into an argument over the superiority of a fuzzy approach versus astochastic approach, we will demonstrate in this chapter that the fuzzy de-bugging approach that was adopted in [32] and [33] provides an alternative tothe standard 0-1 perfect-imperfect debugging approach used in conventionalSRGM modeling.

Fuzzy debugging can be modeled by marked point processes; that is, stochastic processes consisting of an underlying point process and an auxiliary variable called a mark associated with each point. These types of processes have applications in many scientific and engineering disciplines (refer to [23] for many of these applications). In this chapter we will consider point processes with random fuzzy marks. These marks are fuzzy random variables as defined in Puri and Ralescu [18] and also by others, e.g. [13, 12, 24, 14]. For a recent review of the concept of fuzzy random variables, the reader is referred to Gil et al. [8]. Fuzzy random variables incorporate both randomness due to sampling variability and imprecision due to the fuzzy perception of their outcomes. Their expected values and variances will be used to assess the mean and consistency of fuzzy debugging.

3 Preliminaries

In this section and for the rest of the chapter, the symbol I_A(x) denotes the indicator function of the set A, R is the set of real numbers, R+ = [0,∞) the non-negative half line, Z+ the set of non-negative integers and R^n the n-dimensional Euclidean space. The bounded Borel subsets of R+ will be denoted by B. Finally, (Ω, A, P) denotes a probability space where Ω is the underlying sample space of events, A a σ-algebra of subsets of Ω and P a probability measure on A.

3.1 Fuzzy Sets

A fuzzy set on R^n is synonymous with its membership function μ : R^n → I = [0, 1], which is assumed to satisfy the following properties:

(i) μ(x) = 1 for some x ∈ R^n;
(ii) μ is upper semicontinuous;
(iii) for each 0 < α ≤ 1, the α-level set

\[ \mu_\alpha = \{x \in \mathbb{R}^n : \mu(x) \ge \alpha\} \tag{1} \]

is convex;
(iv) the support of μ, i.e.

\[ \operatorname{supp} \mu = \overline{\bigcup_{\alpha \in I} \mu_\alpha} \tag{2} \]

is bounded. The symbol \overline{A} refers to the closure of the set A.

The space of fuzzy sets which satisfy the above properties will be denoted by F(R^n). By properties (ii), (iii) and (iv), μ_α is a compact and convex subset of R^n for all α ∈ (0, 1].

Two operations, the Minkowski sum ⊕ and the Minkowski (scalar) product, can be defined for any two nonempty subsets A and B of R^n:

\[ A \oplus B = \{a + b : a \in A,\ b \in B\}, \qquad cA = \{ca : a \in A\} \]

where c is any scalar quantity. It will be convenient to use the notation ⊕_{i≥1} c_i A_i ≡ c_1 A_1 ⊕ c_2 A_2 ⊕ ... ⊕ c_k A_k ⊕ ... for any sequence of nonempty sets A_i and scalars c_i.

The sum of two fuzzy sets μ + ν and the scalar multiplication cμ can be defined using the identities

\[ (\mu + \nu)_\alpha = \mu_\alpha \oplus \nu_\alpha \quad \text{and} \quad (c\mu)_\alpha = c\,\mu_\alpha \tag{3} \]

(cf. [7]) and the observation that a fuzzy set is related to its α-level sets by the following relationship:

\[ \mu(x) = \sup\{\alpha \in (0, 1] : x \in \mu_\alpha\}. \tag{4} \]
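To make (3) and (4) concrete in R^1, the sketch below adds two triangular fuzzy numbers level by level and then recovers a membership value of the sum from its α-level sets. Triangular fuzzy numbers, the helper names and the grid search over α are assumptions made for illustration; they are not part of the chapter.

```python
def alpha_cut(tri, alpha):
    """Alpha-level set [lo, hi] of a triangular fuzzy number tri = (a, b, c) with peak b."""
    a, b, c = tri
    return (a + alpha * (b - a), c - alpha * (c - b))

def minkowski_sum(I, J):
    """Minkowski sum of two closed intervals."""
    return (I[0] + J[0], I[1] + J[1])

def scale(c, I):
    """Scalar multiple c*I of a closed interval (endpoints swap when c < 0)."""
    lo, hi = c * I[0], c * I[1]
    return (min(lo, hi), max(lo, hi))

def membership_from_cuts(cut, x, steps=1000):
    """Approximate mu(x) = sup{alpha in (0, 1] : x in mu_alpha}, as in (4), on a grid."""
    best = 0.0
    for k in range(1, steps + 1):
        alpha = k / steps
        lo, hi = cut(alpha)
        if lo <= x <= hi:
            best = alpha  # level sets are nested, so the last hit is the largest
    return best

mu = (0.0, 1.0, 2.0)   # hypothetical fuzzy number "about 1"
nu = (1.0, 2.0, 4.0)   # hypothetical fuzzy number "about 2"

def sum_cut(alpha):
    """(mu + nu)_alpha computed via identity (3)."""
    return minkowski_sum(alpha_cut(mu, alpha), alpha_cut(nu, alpha))

print(sum_cut(0.5))
print(membership_from_cuts(sum_cut, 2.0))
```

Working with level sets rather than membership functions is what keeps the arithmetic simple: each operation reduces to interval arithmetic at a fixed α.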

By associativity, (3) can be extended to finite sums of fuzzy sets through the identity

\[ \Bigl(c \sum_{i=1}^{n} \mu_i\Bigr)_\alpha = \bigoplus_{i=1}^{n} c\,\mu_{i\alpha}. \tag{5} \]

A powerful concept in convex analysis is that of the support function of a compact and convex subset K in R^n. This is defined by

\[ s(K, u) = \sup\{\langle u, x\rangle : x \in K\}, \quad u \in S^{n-1} \tag{6} \]

where ⟨·,·⟩ is the inner product and S^{n−1} is the (n−1)-dimensional unit sphere of R^n. The utility of this concept is that K can be identified with its support function and therefore any operation involving closed convex sets can be replaced by the more accessible operation between support functions.

We complete the discussion on the space of fuzzy sets by defining the following L2 norm ‖·‖_2 on F(R^n):

\[ \|\mu\|_2 = \Bigl( n \int_0^1 \int_{S^{n-1}} |s(\mu_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr)^{1/2} < \infty \tag{7} \]

where λ is the Lebesgue measure on the unit sphere. Based on this norm, the following L2 metric d_2(μ, ν) between any two fuzzy sets μ and ν can be constructed:

\[ d_2(\mu, \nu) = \|\mu - \nu\|_2 = \Bigl( n \int_0^1 \int_{S^{n-1}} |s(\mu_\alpha, u) - s(\nu_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr)^{1/2}. \tag{8} \]

The space (F(R^n), d_2) is a complete separable metric space (refer to [6]) and we let E denote the family of Borel subsets of F(R^n).
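For fuzzy sets on the real line the metric (8) can be evaluated numerically, because each α-level set is an interval [lo, hi] with s([lo, hi], 1) = hi and s([lo, hi], −1) = −lo. The sketch below assumes n = 1 and the normalization in which λ gives weight 1/2 to each of the two points of S^0, which is the convention the chapter uses in its examples; the midpoint rule and the function names are illustrative choices.

```python
import math

def support(interval, u):
    """Support function s(K, u) of K = [lo, hi] in R^1 for u in {+1, -1}."""
    lo, hi = interval
    return hi if u == 1 else -lo

def d2(cut_mu, cut_nu, steps=2000):
    """L2 distance (8) for n = 1, using a midpoint rule in alpha."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        A, B = cut_mu(alpha), cut_nu(alpha)
        # lambda puts mass 1/2 on u = +1 and on u = -1
        total += 0.5 * sum((support(A, u) - support(B, u)) ** 2 for u in (1, -1))
    return math.sqrt(total / steps)

# Hypothetical triangular fuzzy numbers given by their alpha-level sets.
tri = lambda alpha: (alpha, 2.0 - alpha)            # peak at 1, support [0, 2]
shifted = lambda alpha: (1.0 + alpha, 3.0 - alpha)  # same shape moved right by 1
zero = lambda alpha: (0.0, 0.0)                     # the crisp set {0}

print(d2(tri, shifted))   # distance between a fuzzy number and its translate
print(d2(tri, zero))      # the norm (7) of the fuzzy number
```

Translating a fuzzy set by a constant moves both support-function values by that constant, so the distance between the two triangles equals the size of the shift.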

3.2 Fuzzy Random Variables

A fuzzy random variable (frv) X is a measurable mapping from a probability space (Ω, A, P) into (F(R^n), d_2) such that, for each 0 ≤ α ≤ 1, the α-level mapping X_α from Ω into the space of convex and compact sets defined by

\[ X_\alpha(\omega) = \{x \in \mathbb{R}^n : X(\omega)(x) \ge \alpha\} \tag{9} \]

is a random set (see footnote 1). This is analogous to the well-known definition of an ordinary random variable, which is usually defined as a measurable mapping from a probability space to R^n.

Since the concept of a frv depends intrinsically on that of a random set through (9), we next give a formal definition of random sets. Let Q(R^n) denote the family of compact and convex subsets of R^n, with a linear structure induced by ⊕ and scalar multiplication. We define the Hausdorff metric for elements in Q(R^n) by

\[ d_H(A, B) = \max\{\rho(A, B), \rho(B, A)\} \tag{10} \]

where ρ(A, B) = sup_{a∈A} inf_{b∈B} ‖a − b‖ and ‖·‖ denotes the Euclidean norm in R^n. The space (Q(R^n), d_H) is a complete and separable metric space.

1 This definition is due to Puri and Ralescu [18].

A measurable mapping Y (see footnote 2) from a probability space (Ω, A, P) into (Q(R^n), d_H) is a random set. Let L1(P) denote the space of P-integrable functions f : Ω → R^n and S(Y) the set of all L1(P) selections of Y, that is, S(Y) = {f ∈ L1(P) : f(ω) ∈ Y(ω) a.s.}. The Aumann integral of Y with respect to the measure P (see footnote 3) is defined by

\[ (A)\!\int Y \, dP = \Bigl\{ \int_\Omega f \, dP : f \in S(Y) \Bigr\}. \tag{11} \]

By a result due to Richter [19], the Aumann integral is a convex subset of R^n. In addition, if Y(ω) is closed for every ω ∈ Ω and Y is integrably bounded, i.e., there exists a function g : Ω → R, g ∈ L1(P), such that ‖x‖ ≤ g(ω) for all x and ω with x ∈ Y(ω), then the Aumann integral is compact in R^n.

The Aumann integral defined by (11) tends to camouflage the fact that it is simply a formal definition of the expected value of a random set. This can be seen clearly once we note that, in the case of a random set Y taking values K_i ∈ Q(R^n) with corresponding probabilities p_i, i = 1, 2, ..., n, the Aumann integral of Y is equal to

\[ (A)\!\int Y \, dP = (p_1 K_1) \oplus (p_2 K_2) \oplus \cdots \oplus (p_n K_n) \tag{12} \]

which is reminiscent of the definition of an ordinary expected value.
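For a random interval in R^1 with finitely many values, (12) reduces to probability-weighted averages of the endpoints. The following sketch, with made-up intervals and probabilities and a hypothetical function name, shows the computation with exact rational arithmetic.

```python
from fractions import Fraction as F

def expected_interval(values, probs):
    """Aumann expectation (12) of a random interval taking values[i] with probability probs[i]."""
    lo = sum(p * a for (a, _), p in zip(values, probs))
    hi = sum(p * b for (_, b), p in zip(values, probs))
    return (lo, hi)

# Hypothetical random interval: [0, 1] with probability 1/4, [2, 4] with probability 3/4.
K = [(F(0), F(1)), (F(2), F(4))]
p = [F(1, 4), F(3, 4)]
print(expected_interval(K, p))
```

Because the probabilities are non-negative, scaling an interval by p_i never swaps its endpoints, which is why the two endpoints can be averaged separately.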

3.3 Expected Value of a Fuzzy Random Variable

The expected value E{X} of an integrably bounded frv X, i.e. X_α is integrably bounded for all α ∈ (0, 1], is defined as the unique fuzzy set whose α-levels E{X}_α are Aumann integrals of the random sets (9), i.e.

\[ E\{X\}_\alpha = (A)\!\int X_\alpha \, dP = \Bigl\{ \int f \, dP : f \in S(X_\alpha) \Bigr\} \tag{13} \]

([18] and [32]). Another characterization of E{X} can be expressed through the support function of X_α (cf. [26]):

\[ s(E\{X\}_\alpha, u) = E\{s(X_\alpha, u)\}, \quad u \in S^{n-1}. \tag{14} \]

2 By measurability, it is meant that sets of the form {(ω, x) ∈ Ω × R^n : x ∈ Y(ω)} belong to A × B, where B is the family of Borel subsets of R^n.

3 This definition is due to Aumann [1].

The computation of E{X} is relatively straightforward if the frv is finite and separably-valued, i.e. its range consists of a finite number of fuzzy sets. This is due to the fact that (12) can then be used to calculate its α-level sets and (4) to establish the fuzzy set. Formally, let Ω_1, ..., Ω_m be a finite measurable partition of Ω, let A be the σ-algebra generated by this partition with P(Ω_j) = p_j, and let the fuzzy random variable X map each member Ω_j of the partition into the fuzzy set μ_j, 1 ≤ j ≤ m. Then, according to (12), for each 0 ≤ α ≤ 1,

\[ E\{X\}_\alpha = \bigoplus_{j=1}^{m} p_j\, \mu_{j\alpha} \tag{15} \]

where μ_{jα} is the α-level set of μ_j. Also, in this case it follows directly from (14) that

\[ s(E\{X\}_\alpha, u) = \sum_{j=1}^{m} p_j\, s(\mu_{j\alpha}, u), \quad u \in S^{n-1}. \tag{16} \]

The next example illustrates the calculation of E{X} for another large class of fuzzy random variables, the class of LR fuzzy random variables X_LR in R^1 ([7, 11]).

Example 1. An LR fuzzy set has the form

\[ \mu_{LR}(x) = \begin{cases}
0 & \text{if } x < m - s - l, \\
L\bigl(\frac{m - s - x}{l}\bigr) & \text{if } m - s - l \le x < m - s, \\
1 & \text{if } m - s \le x < m + s, \\
R\bigl(\frac{x - s - m}{r}\bigr) & \text{if } m + s \le x < m + s + r, \\
0 & \text{if } x \ge m + s + r,
\end{cases} \]

where L, R : [0, 1] → [0, 1] are fixed continuous and decreasing functions such that L(0) = R(0) = 1 and L(1) = R(1) = 0. Here, m ∈ R^1 and s, l, r are three positive real numbers. X_LR is an LR fuzzy random variable when m, s, l and r are independent square-integrable real random variables.

It is easy to see that (μ_LR)_α = [m − s − l L^{-1}(α), m + s + r R^{-1}(α)], hence it follows that

\[ E\{X_{LR}\}_\alpha = [E(m) - E(s) - E(l)L^{-1}(\alpha),\ E(m) + E(s) + E(r)R^{-1}(\alpha)]. \]

We can now use E{X_LR}_α to build up E{X_LR}, leading to

\[ E\{X_{LR}\}(x) = \begin{cases}
0 & \text{if } x < E(m) - E(s) - E(l), \\
L\bigl(\frac{E(m) - E(s) - x}{E(l)}\bigr) & \text{if } E(m) - E(s) - E(l) \le x < E(m) - E(s), \\
1 & \text{if } E(m) - E(s) \le x < E(m) + E(s), \\
R\bigl(\frac{x - E(s) - E(m)}{E(r)}\bigr) & \text{if } E(m) + E(s) \le x < E(m) + E(s) + E(r), \\
0 & \text{if } x \ge E(m) + E(s) + E(r).
\end{cases} \]

When L(x) = R(x) = 1 − x, m is uniformly distributed over (−1, 1) and s, l, r are uniformly distributed over (0, 1), then

\[ E\{X_{LR}\}(x) = \begin{cases}
0 & \text{if } x < -1, \\
2 + 2x & \text{if } -1 \le x < -\frac{1}{2}, \\
1 & \text{if } -\frac{1}{2} \le x < \frac{1}{2}, \\
2 - 2x & \text{if } \frac{1}{2} \le x < 1, \\
0 & \text{if } x \ge 1.
\end{cases} \]

The graph of E{X_LR} is displayed in Fig. 2.

Fig. 2. Fuzzy set of E{X_LR} [Figure: membership plotted against x, with x-axis ticks at −1, −1/2, 0, 1/2, 1 and a vertical-axis tick at 1]
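The closed form in Example 1 can be checked by coding the general expression for E{X_LR}(x) and substituting the means of the uniform distributions, E(m) = 0 and E(s) = E(l) = E(r) = 1/2. The function and argument names below are illustrative assumptions.

```python
def expected_lr_membership(x, Em, Es, El, Er, L=lambda t: 1 - t, R=lambda t: 1 - t):
    """Membership function of E{X_LR}, built from the means of m, s, l and r (Example 1)."""
    if x < Em - Es - El:
        return 0.0
    if x < Em - Es:
        return L((Em - Es - x) / El)
    if x < Em + Es:
        return 1.0
    if x < Em + Es + Er:
        return R((x - Es - Em) / Er)
    return 0.0

# Uniform case of Example 1: m ~ U(-1, 1), and s, l, r ~ U(0, 1).
for x in (-1.0, -0.75, 0.0, 0.75, 1.0):
    print(x, expected_lr_membership(x, Em=0.0, Es=0.5, El=0.5, Er=0.5))
```

On the left slope the values agree with 2 + 2x and on the right slope with 2 − 2x, as in the piecewise formula above.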

3.4 Variance of a Fuzzy Random Variable

The following definition of the variance of a frv, Var{X}, is given by Körner [11]:

\[ Var\{X\} = E\{d_2^2(X, E\{X\})\} = E\Bigl\{ n \int_0^1 \int_{S^{n-1}} |s(X_\alpha, u) - s(E\{X\}_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr\}. \tag{17} \]

This definition generalizes the concept of the variance of convex random sets introduced in [15] to frvs. Unlike E{X}, which is a fuzzy set, the variance of a frv is a non-negative real number and satisfies the natural condition:

\[ Var\{X\} = \inf_{\mu \in \mathcal{F}(\mathbb{R}^n)} E\{d_2^2(X, \mu)\}. \tag{18} \]

Note that if the interchange of the expected value operator and the integrals in (17) is permissible, then using (14), it follows that

\[ Var\{X\} = n \int_0^1 \int_{S^{n-1}} Var\{s(X_\alpha, u)\} \, d\lambda(u)\, d\alpha. \tag{19} \]

Similarly, the covariance between two fuzzy random variables X and Y is defined as

\[ cov\{X, Y\} = E\Bigl\{ n \int_0^1 \int_{S^{n-1}} [s(X_\alpha, u) - s(E\{X\}_\alpha, u)]\,[s(Y_\alpha, u) - s(E\{Y\}_\alpha, u)] \, d\lambda(u)\, d\alpha \Bigr\}. \tag{20} \]

It follows from (14) that cov{X, Y} = 0 if X and Y are independent. The following properties of Var{X}, listed in [11], can easily be obtained using definitions (17) and (20):

Proposition 1. Let X and Y be two fuzzy random variables, x a nonnegative square-integrable random variable, μ ∈ F(R^n) and c a nonnegative scalar. Then the following identities hold:

(i) Var{X} = E‖X‖_2^2 − ‖E{X}‖_2^2
(ii) Var{cX} = c^2 Var{X}
(iii) Var{xμ} = ‖μ‖_2^2 Var{x}
(iv) Var{μ + X} = Var{X}
(v) Var{xX} = E^2{x} Var{X} + E‖X‖_2^2 Var{x} if X and x are independent
(vi) Var{X + Y} = Var{X} + Var{Y} + 2 cov{X, Y}.

Notice how similar the above properties are to some of the properties of the variance of an ordinary random variable. For computational purposes later on, we also note that (7) gives

\[ E\|X\|_2^2 = E\Bigl( n \int_0^1 \int_{S^{n-1}} |s(X_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr) \tag{21} \]

and

\[ \|E\{X\}\|_2^2 = n \int_0^1 \int_{S^{n-1}} |s(E\{X\}_\alpha, u)|^2 \, d\lambda(u)\, d\alpha. \tag{22} \]

Example 2. In this example, we will derive the variance of the LR frv considered in Example 1. This result was first obtained by Körner [11]. Note that when n = 1, S^0 = {+1, −1} and λ{1} = λ{−1} = 1/2. Since

\[ s(X_\alpha, 1) = m + s + r R^{-1}(\alpha) \quad \text{and} \quad s(X_\alpha, -1) = -m + s + l L^{-1}(\alpha) \]

we obtain by independence of the random variables m, s, l and r that

\[ Var\{s(X_\alpha, 1)\} = Var(m) + Var(s) + (R^{-1}(\alpha))^2\, Var(r) \quad \text{and} \quad Var\{s(X_\alpha, -1)\} = Var(m) + Var(s) + (L^{-1}(\alpha))^2\, Var(l). \]

Applying (19) therefore gives

\[ Var\{X_{LR}\} = Var(m) + Var(s) + \frac{Var(r)}{2} \int_0^1 (R^{-1}(\alpha))^2 \, d\alpha + \frac{Var(l)}{2} \int_0^1 (L^{-1}(\alpha))^2 \, d\alpha. \]

For the specific case discussed in Example 1, Var(m) = 1/3, Var(s) = Var(r) = Var(l) = 1/12 and R^{-1}(α) = L^{-1}(α) = 1 − α, therefore Var{X_LR} = 0.444 using the previous equation.
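The arithmetic behind the value 0.444 can be reproduced exactly. The sketch below evaluates the closed-form expression for Var{X_LR} with rational numbers, using the variances of the uniform distributions from Example 1 and the integral of (1 − α)^2 over [0, 1], which equals 1/3. The function name and argument names are assumptions.

```python
from fractions import Fraction as F

def var_lr(var_m, var_s, var_l, var_r, int_Rinv_sq, int_Linv_sq):
    """Var{X_LR} from the formula of Example 2; the last two arguments are the
    integrals of (R^-1(alpha))^2 and (L^-1(alpha))^2 over [0, 1]."""
    return var_m + var_s + var_r / 2 * int_Rinv_sq + var_l / 2 * int_Linv_sq

# U(-1, 1) has variance 4/12 = 1/3, U(0, 1) has variance 1/12.
v = var_lr(F(1, 3), F(1, 12), F(1, 12), F(1, 12), F(1, 3), F(1, 3))
print(v, float(v))
```

The exact value is 4/9, which rounds to the 0.444 quoted in the example.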


4 Point Processes with Random Fuzzy Marks

In this section, we briefly discuss the concept of point processes with random fuzzy marks. These processes were first introduced in [34] and noted for their broad utility in several reliability problems. A point process in R^n is a stochastic process which models the distribution of points in space or time. A mark is an auxiliary variable, which could be either deterministic or random, associated with each point. For example, in a queue with batch arrivals, the point process represents the entry times into the service facility and the corresponding mark is the size of each batch. Marked point processes are applied extensively in practice and good references, such as [23, 10], exist where the concept and its many applications are methodically discussed. In this chapter, we will restrict our point process to the non-negative half line R+ = [0,∞) and assume that the mark associated with each point is a fuzzy random variable. The significance of this novel approach will be apparent when we apply it to fuzzy debugging in Section 6.

Let (Ω, A, P) denote a probability space and N_s the space consisting of counting measures ω_s = Σ_{n≥1} δ_{t_n}, where δ_x is the Dirac delta function concentrated at the point x and 0 = t_0 < t_1 < t_2 < ... < t_n < ... are the times of occurrence of an event (these times are also known as the atoms of the counting measure). The function δ_x attaches mass 1 to the point x and 0 elsewhere. We will assume ω_s(B) < ∞ for all ω_s ∈ N_s and B ∈ B. On N_s, let 𝒩_s be the σ-algebra generated by the sets {ω_s : ω_s(B) = j} where B ∈ B and j ∈ Z+, the set of non-negative integers.

A simple point process (spp) is a measurable mapping ξ_s from (Ω, A, P) into (N_s, 𝒩_s). A distribution P_s on (N_s, 𝒩_s) can be defined through ξ_s by P_s(A) = P ξ_s^{-1}(A), A ∈ 𝒩_s. Since (N_s, 𝒩_s) is fixed, a spp ξ_s can be identified by its induced measure P_s. Any member ω_s ∈ N_s is an outcome of P_s. Alternatively, an outcome of a spp can be represented by a sequence of events

\[ 0 \le T_1 < T_2 < \cdots < T_k < \cdots \]

with associated counting process

\[ N_t = N([0, t]) = \sum_{j=1}^{\infty} I_{[0,t]}(T_j) \tag{23} \]

where {T_k ≤ t} = {N_t ≥ k}.
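The counting process (23) and the duality {T_k ≤ t} = {N_t ≥ k} are easy to check on a finite list of event times. The event times below are made up, and `counting_process` is a hypothetical helper name.

```python
from bisect import bisect_right

def counting_process(event_times, t):
    """N_t: the number of events T_j with T_j <= t, for event_times sorted increasingly."""
    return bisect_right(event_times, t)

# Hypothetical event times T_1 < T_2 < T_3.
T = [0.5, 1.2, 3.0]
for t in (0.4, 1.2, 2.9, 10.0):
    print(t, counting_process(T, t))

# Duality check: T_k <= t holds exactly when N_t >= k.
for k, T_k in enumerate(T, start=1):
    for t in (0.4, 1.2, 2.9, 10.0):
        assert (T_k <= t) == (counting_process(T, t) >= k)
```

Using `bisect_right` counts an event that occurs exactly at time t, which matches the closed interval [0, t] in (23).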

The mark space is the space of fuzzy sets F(R^n) which we discussed in Section 3.1. To extend the concept of a spp to a marked point process (mpp) with random fuzzy marks, we let N_m be the space consisting of counting measures ω_m = Σ_{n≥1} δ_{(t_n, z_n)} where each z_n assumes values in F(R^n). Thus, for each fixed atom t of a spp ω_s, the random mark z : R+ → F(R^n) is a measurable mapping from t to a member of the fuzzy sets. We will assume ω_m(B × F(R^n)) < ∞ for all ω_m ∈ N_m and B ∈ B but ω_m(R+ × F(R^n)) = ∞ almost surely. On N_m, let 𝒩_m be the σ-algebra generated by the sets {ω_m : ω_m(B × E) = j} where B ∈ B, E ∈ E and j ∈ Z+. (Recall that E refers to the Borel subsets of F(R^n) induced by the metric d_2(·,·) defined in (8).) A random fuzzy marked point process (rfmpp) is a measurable mapping ξ_m from (Ω, A, P) into (N_m, 𝒩_m) and, as in the case of a spp, we will identify a rfmpp through the measure P_m = P ξ_m^{-1} induced by ξ_m.

The distribution P_m can be constructed based on the sample path evolution of a rfmpp (cf. [10]). The evolution can be described by the following steps:

1. Initialization: T_1 = t_1 and, given this information, determine the distribution of Z_1 = z_1.

2. For k ≥ 1: given T_1 = t_1, Z_1 = z_1, T_2 = t_2, Z_2 = z_2, ..., T_k = t_k, Z_k = z_k, determine the distribution of T_{k+1} = t_{k+1};

3. given T_1 = t_1, Z_1 = z_1, T_2 = t_2, Z_2 = z_2, ..., T_k = t_k, Z_k = z_k, T_{k+1} = t_{k+1}, determine the distribution of Z_{k+1} = z_{k+1}.

Initially, the time of the first jump of the underlying spp, T_1 = t_1, is recorded. Then successively, for each k ≥ 1, knowledge of T_1 = t_1, Z_1 = z_1, T_2 = t_2, Z_2 = z_2, ..., T_k = t_k, Z_k = z_k determines T_{k+1} = t_{k+1}, and incorporating this into the known sample then provides information concerning Z_{k+1} = z_{k+1}, and so on.
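The alternating structure of these steps (next time given the history, then next mark given the history and the new time) can be sketched as a small simulation loop. Everything concrete below is an assumption chosen for illustration: exponential gaps with rate 2, marks drawn independently from three labels standing in for fuzzy sets, and the function names. In this special case neither conditional distribution actually depends on the history, although the interface allows it.

```python
import random

def simulate_rfmpp(horizon, next_gap, draw_mark, rng):
    """Generate (t_k, z_k) pairs on [0, horizon] by alternating the time and mark steps."""
    history = []                          # y_k = (t_1, z_1, ..., t_k, z_k)
    t = 0.0
    while True:
        t += next_gap(history, rng)       # distribution of T_{k+1} given the history
        if t > horizon:
            return history
        z = draw_mark(history, t, rng)    # distribution of Z_{k+1} given history and t_{k+1}
        history.append((t, z))

# Hypothetical choices: exponential gaps and i.i.d. marks labelled by name only.
def gap(history, rng):
    return rng.expovariate(2.0)

def mark(history, t, rng):
    return rng.choices(["TF", "LP", "P"], weights=[0.05, 0.10, 0.85])[0]

path = simulate_rfmpp(10.0, gap, mark, random.Random(0))
print(len(path), path[:3])
```

Passing the history to both callbacks is the design choice that mirrors Steps 2 and 3: a dependent model would simply inspect `history` before sampling.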

Let P^{(0)} on R+ denote the distribution of T_1 and π^{(0)}(·, t_1) be the conditional distribution of z_1 given t_1. Furthermore, let P^{(k)} and π^{(k)} be the conditional distributions outlined above in Step 2 and Step 3 respectively. That is,

1. for each y_k = (t_1, z_1, t_2, z_2, ..., t_k, z_k), P^{(k)}(·, y_k) is a distribution on R+;
2. for each t ∈ R+, P^{(k)}(t, ·) is a measurable function on (R+ × F(R^n))^k, the k-fold product space of R+ × F(R^n).

Similarly,

1. for each (y_k, t_{k+1}), π^{(k)}(·, (y_k, t_{k+1})) is a distribution on F(R^n);
2. for each z ∈ F(R^n), π^{(k)}(z, ·) is a measurable function on (R+ × F(R^n))^k × R+.

The distribution P_m is constructed using the above distributions; for example, the probability increment dP_m(ω_m), where ω_m = Σ_{j=1}^{n} δ_{(t_j, z_j)}, can be expressed somewhat loosely as

\[ dP_m(\omega_m) = P^{(0)}(dt_1)\,\pi^{(0)}(dz_1, t_1)\,P^{(1)}(dt_2, y_1)\,\pi^{(1)}(dz_2, (y_1, t_2)) \cdots P^{(n-1)}(dt_n, y_{n-1})\,\pi^{(n-1)}(dz_n, (y_{n-1}, t_n)). \tag{24} \]

In general, if Q(ω_s, ·) is the distribution of marks corresponding to the atoms of ω_s = Σ_{n≥1} δ_{t_n} ∈ N_s, then we may represent P_m in terms of Q(·) and P_s as follows:

\[ P_m(A) = \int_{N_s} Q(\omega_s, A_{\omega_s}) \, dP_s(\omega_s), \quad A \in \mathcal{N}_m \tag{25} \]

where A_{ω_s} = {z_i ∈ F(R^n) : ω_m = Σ_{n≥1} δ_{(t_n, z_n)} ∈ A} is the ω_s-section of the set A. Note that Q(·) is related to the distributions π^{(k)}(·) described above.

5 A Measure of Reliability Based on Point Processes with Random Fuzzy Marks

We define a fuzzy gauge measure as the measurable function G : N_m → F(R^n) where

\[ G(\omega_m)(B) = \sum_{n \ge 1} z_n I_B(t_n), \tag{26} \]

B ∈ B and ω_m = Σ_{n≥1} δ_{(t_n, z_n)}. For typographic convenience, we shall suppress the dependence of G(ω_m)(B) on the outcome ω_m in the sequel.

Note that G(·) is not a measure in the strict sense of the word, since (26) is a sum of fuzzy random variables and is therefore a fuzzy random variable; however, it satisfies the axiom of countable additivity, i.e. G(A ∪ B) = G(A) + G(B) for disjoint sets A and B. If ω_s = Σ_{n≥1} δ_{t_n} is the spp corresponding to ω_m, then defining

\[ z(t, \omega_s) = \begin{cases} z_n & \text{if } t = t_n, \\ 0 & \text{otherwise,} \end{cases} \]

G(B) can also be expressed in integral form as

\[ G(B) = \int_B z(t) \, d\omega_m(t) \tag{27} \]

as suggested by (26).

Since gauge measures are fuzzy random variables, we can define their means and variances using concepts that were discussed in Section 3. The next set of propositions was first enunciated in [34]. For completeness, we will present the proofs here as well.

Let the expected value of a fuzzy gauge with respect to an mpp P_m be denoted by η_{P_m}, i.e.

\[ \eta_{P_m}(B) = \int G(B) \, dP_m(\omega_m), \quad B \in \mathcal{B}. \tag{28} \]

Proposition 2

\[ s((\eta_{P_m}(B))_\alpha, u) = E_{P_m}\Bigl\{ \sum_{n \ge 1} s(z_{n\alpha}, u)\, I(t_n \in B) \Bigr\} \tag{29} \]

where B ∈ B, u ∈ S^{n−1} and z_{nα} refers to the α-level set of the fuzzy variable z_n, n ≥ 1.

Proof. Using (3), (G(B))_α = ⊕_{n≥1} z_{nα} I_B(t_n) and therefore (13) implies

\[ (\eta_{P_m}(B))_\alpha = \Bigl\{ \int f \, dP_m : f \in S\Bigl( \bigoplus_{n \ge 1} z_{n\alpha}\, I(t_n \in B) \Bigr) \Bigr\}. \tag{30} \]

Next, applying (14) and using the fact that the support function preserves set addition,

\[ \begin{aligned}
s((\eta_{P_m}(B))_\alpha, u) &= E_{P_m}\{ s((G(B))_\alpha, u) \} \\
&= E_{P_m}\Bigl\{ s\Bigl( \bigoplus_{n \ge 1} z_{n\alpha}\, I(t_n \in B), u \Bigr) \Bigr\} \\
&= E_{P_m}\Bigl\{ \sum_{n \ge 1} s(z_{n\alpha}, u)\, I(t_n \in B) \Bigr\}
\end{aligned} \]

and (29) follows.

If the fuzzy marks are independent and identically distributed (iid) with z_n ≡ z, then the random variables s(z_{nα}, u) are also iid with s(z_{nα}, u) ≡ s(z_α, u). If, in addition, the sequence (z_n) is independent of the underlying spp (t_n), then applying Wald's Theorem, (29) simplifies to

\[ s((\eta_P(B))_\alpha, u) = E_{P_s}\{N(B)\}\, E_\pi\{s(z_\alpha, u)\} \tag{31} \]

where π(·) is the common distribution of the fuzzy marks and N(B) = Σ_{n≥1} I(t_n ∈ B).

Guided by (17), the following is a very reasonable definition for the variance of G(B):

Definition 1. The variance of the fuzzy gauge measure G(B) is defined as

\[ Var\{G(B)\} = E_{P_m}\Bigl\{ n \int_0^1 \int_{S^{n-1}} |s((G(B))_\alpha, u) - s((\eta_P(B))_\alpha, u)|^2 \, d\lambda\, d\alpha \Bigr\}. \tag{32} \]

From (26), we note that G(B) is equal to the sum of N(B) fuzzy random variables

\[ S_{N(B)} = z_1 + z_2 + \cdots + z_{N(B)}. \]

Define

\[ E_{P_s}\{Var\{S_{N(B)}\}\} = E_{P_s} E_Q\Bigl\{ n \int_0^1 \int_{S^{n-1}} |s((G(B))_\alpha, u) - s((E_Q\{S_{N(B)}\})_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr\} \]

where the iterated expectation is first taken with respect to Q and then P_s. Applying Proposition 1 (vi), it follows that

\[ E_{P_s}\{Var\{S_{N(B)}\}\} = E_{P_s}\Bigl\{ \sum_{i=1}^{N(B)} Var\{z_i\} + 2 \sum_{1 \le i < j \le N(B)} cov\{z_i, z_j\} \Bigr\}. \tag{33} \]

Also, let

\[ Var\{E\{S_{N(B)}\}\} = E_{P_s}\Bigl\{ n \int_0^1 \int_{S^{n-1}} |s((E_Q\{S_{N(B)}\})_\alpha, u) - s((\eta_P(B))_\alpha, u)|^2 \, d\lambda\, d\alpha \Bigr\}; \tag{34} \]

then the following identity holds:

Theorem 1

\[ Var\{G(B)\} = E_{P_s}\{Var\{S_{N(B)}\}\} + Var\{E\{S_{N(B)}\}\}. \tag{35} \]

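Proof. We first expand (32) in the following manner:

\[ \begin{aligned}
Var\{G(B)\} &= E_{P_s} E_Q\Bigl\{ n \int_0^1 \int_{S^{n-1}} |s((G(B))_\alpha, u) - s((E_Q\{S_{N(B)}\})_\alpha, u) \\
&\qquad\qquad + s((E_Q\{S_{N(B)}\})_\alpha, u) - s((\eta_P(B))_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr\} \\
&= E_{P_s}\{Var\{S_{N(B)}\}\} + Var\{E\{S_{N(B)}\}\} \\
&\quad + 2 E_{P_s} E_Q\Bigl\{ n \int_0^1 \int_{S^{n-1}} [s((G(B))_\alpha, u) - s((E_Q\{S_{N(B)}\})_\alpha, u)] \\
&\qquad\qquad \times [s((E_Q\{S_{N(B)}\})_\alpha, u) - s((\eta_P(B))_\alpha, u)] \, d\lambda(u)\, d\alpha \Bigr\}.
\end{aligned} \tag{36} \]

For fixed N(B),

\[ \begin{aligned}
E_Q\{s((G(B))_\alpha, u)\} &= E_Q\Bigl\{ s\Bigl( \bigoplus_{n \ge 1} z_{n\alpha}\, I(t_n \in B), u \Bigr) \Bigr\} \\
&= E_Q\Bigl\{ \sum_{n \ge 1} s(z_{n\alpha}, u)\, I(t_n \in B) \Bigr\} \\
&= \sum_{n \ge 1} s((E_Q\{z_n\})_\alpha, u)\, I(t_n \in B) \\
&= s\Bigl( \bigoplus_{n \ge 1} (E_Q\{z_n\})_\alpha\, I(t_n \in B), u \Bigr) \\
&= s\Bigl( \Bigl( \sum_{n \ge 1} E_Q\{z_n\}\, I(t_n \in B) \Bigr)_\alpha, u \Bigr) \\
&= s\Bigl( \Bigl( E_Q\Bigl\{ \sum_{n \ge 1} z_n\, I(t_n \in B) \Bigr\} \Bigr)_\alpha, u \Bigr) \\
&= s((E_Q\{S_{N(B)}\})_\alpha, u),
\end{aligned} \]

therefore the cross product term in (36) equals zero and the proof is complete.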


Corollary 1. If the fuzzy marks are i.i.d. with z_n ≡ z, then

\[ Var\{G(B)\} = E\{N(B)\}\, Var\{z\} + Var\{N(B)\}\, \|E\{z\}\|_2^2. \tag{37} \]

Proof. Using the i.i.d. assumption and referring to (33),

\[ E_{P_s}\{Var\{S_{N(B)}\}\} = E\{N(B)\}\, Var\{z\} \]

by an application of Wald's Theorem. Also, (34) and (14) give

\[ \begin{aligned}
Var\{E\{S_{N(B)}\}\} &= E_{P_s}\Bigl\{ n \int_0^1 \int_{S^{n-1}} |N(B)\, s((E\{z\})_\alpha, u) - E\{N(B)\}\, E\{s(z_\alpha, u)\}|^2 \, d\lambda(u)\, d\alpha \Bigr\} \\
&= E_{P_s}\Bigl\{ n \int_0^1 \int_{S^{n-1}} |N(B) - E\{N(B)\}|^2\, |s((E\{z\})_\alpha, u)|^2 \, d\lambda(u)\, d\alpha \Bigr\} \\
&= Var\{N(B)\}\, \|E\{z\}\|_2^2
\end{aligned} \]

and the proof is complete on referring to (35).

Note that a similar result to (37) holds for the variance of ordinary random variables.
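Identity (37) can be verified exactly in a small special case. The sketch below assumes n = 1, marks that are crisp intervals [a, b] (so every α-level set equals the interval itself), and the normalization with weight 1/2 on each of u = +1 and u = −1. The count N and the marks take finitely many values, so both sides of (37) can be enumerated with rational arithmetic. The distributions and helper names are invented for illustration.

```python
from fractions import Fraction as F
from itertools import product

def sq_norm(I):
    """Squared L2 norm (7) of a crisp interval [a, b] in R^1: (a^2 + b^2) / 2."""
    return (I[0] ** 2 + I[1] ** 2) / 2

def sq_dist(I, J):
    """Squared L2 distance (8) between two crisp intervals in R^1."""
    return ((I[0] - J[0]) ** 2 + (I[1] - J[1]) ** 2) / 2

def mean_interval(dist):
    """Aumann expectation of a finite random interval given as [(interval, prob), ...]."""
    return (sum(p * I[0] for I, p in dist), sum(p * I[1] for I, p in dist))

def variance(dist):
    """Variance (17) of a finite random interval."""
    m = mean_interval(dist)
    return sum(p * sq_dist(I, m) for I, p in dist)

def gauge_distribution(n_pmf, mark_dist):
    """Enumerate G = z_1 + ... + z_N for N independent of the i.i.d. marks."""
    out = []
    for n, pn in n_pmf.items():
        for combo in product(mark_dist, repeat=n):
            lo = sum((I[0] for I, _ in combo), F(0))
            hi = sum((I[1] for I, _ in combo), F(0))
            prob = pn
            for _, p in combo:
                prob *= p
            out.append(((lo, hi), prob))
    return out

# Hypothetical distributions: N in {0, 1, 2}; marks are two crisp intervals.
n_pmf = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
marks = [((F(0), F(1, 5)), F(1, 3)), ((F(4, 5), F(1)), F(2, 3))]

EN = sum(n * p for n, p in n_pmf.items())
VarN = sum(p * (n - EN) ** 2 for n, p in n_pmf.items())
lhs = variance(gauge_distribution(n_pmf, marks))
rhs = EN * variance(marks) + VarN * sq_norm(mean_interval(marks))
print(lhs, rhs)
```

The two printed values coincide, which is the crisp-interval instance of the decomposition into a within-sum term and a between-count term.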

6 Application to Software Reliability

As discussed in the Introduction, software testing plays an integral part in the improvement of software reliability. During the testing process, detected faults, revealed by failures in the operation of the program, are removed from the program. The failures that occur during this period of inspection can be modeled by a stochastic point process. When a failure occurs, a debugging effort takes place immediately to fix the faults which caused the failure. Instead of regarding the fault removal process as crisp, i.e. successful with a complete removal (binary digit 1) or unsuccessful with the faults remaining in the program (binary digit 0), it is often more realistic to regard the result of each debugging effort as a fuzzy random variable z, i.e. as a graded effort with value in the closed interval [0, 1]. This is because the success or failure of debugging an error is often a subjective assessment and should not therefore be regarded as a crisp phenomenon.

If z_i is the result of applying fuzzy debugging to the ith failure, the cumulative effect of the debugging effort at time t is equal to the sum of N(t) ≡ N[0, t) fuzzy random variables S_{N(t)} = z_1 + z_2 + ... + z_{N(t)}. Thus, the reliability of the debugging process can be assessed by the fuzzy gauge measure G(t) ≡ G([0, t)). In this sense, G(t) is the fuzzy analogue of a software reliability growth curve, which we discussed in the Introduction with an example depicted in Figure 1. However, the nomenclature "growth curve" is not appropriate here as G(t) is a fuzzy random variable, hence we have opted to adopt the name gauge measure. Note that the mean of a gauge measure is the fuzzy analogue of the mean value function which is used to model most so-called Fault-Count SRGMs (see e.g. [17] and [27]). Its variance (32) can therefore be used to measure the consistency of the fuzzy debugging process. This approach differs from conventional SRGM modeling, where consistency is evaluated by computing the mean squared error between the SRGM growth curve and the empirical data.

We illustrate the preceding theory with a numerical example using LR fuzzy sets (refer to Example 1). Assume that errors in the program occur as a random point process over time (calendar or execution). The result of the debugging effort at each point is a fuzzy random variable which takes on the following fuzzy sets with prior probability q = (q_1, q_2, q_3):

• total failure (TF)

\[ \mu_1(x) = \begin{cases} R_1\bigl(\frac{x}{0.2}\bigr) & \text{if } 0 \le x \le 0.2, \\ 0 & \text{otherwise.} \end{cases} \]

• less than perfect (LP)

\[ \mu_2(x) = \begin{cases}
0 & \text{if } x < 0.2, \\
L_2\bigl(\frac{0.3 - x}{0.1}\bigr) & \text{if } 0.2 \le x < 0.3, \\
1 & \text{if } 0.3 \le x < 0.7, \\
R_2\bigl(\frac{x - 0.7}{0.1}\bigr) & \text{if } 0.7 \le x < 0.8, \\
0 & \text{if } x \ge 0.8.
\end{cases} \]

• perfect (P)

\[ \mu_3(x) = \begin{cases} L_3\bigl(\frac{1 - x}{0.2}\bigr) & \text{if } 0.8 \le x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

Assume that the fuzzy marks are independent and identically distributed at each point and also independent of the underlying point process. We also let R_1(x) = L_2(x) = R_2(x) = L_3(x) = 1 − x^β for some β > 0. Figure 3 depicts these fuzzy sets in the case of β = 1. Note that β ≠ 1 produces non-linear curves on the sides of each of the fuzzy sets. These are above and below the straight lines when β > 1 and β < 1 respectively.
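The three membership functions are simple to code, which makes the effect of β easy to inspect. The following is a minimal sketch; the function names are illustrative, and including the right endpoint x = 1 in the P set is an assumption made so that the membership value 1 is attained.

```python
def shape(t, beta):
    """Common shape function 1 - t**beta, used for R1, L2, R2 and L3."""
    return 1.0 - t ** beta

def mu_tf(x, beta):
    """Total failure (TF)."""
    return shape(x / 0.2, beta) if 0 <= x <= 0.2 else 0.0

def mu_lp(x, beta):
    """Less than perfect (LP)."""
    if 0.2 <= x < 0.3:
        return shape((0.3 - x) / 0.1, beta)
    if 0.3 <= x < 0.7:
        return 1.0
    if 0.7 <= x < 0.8:
        return shape((x - 0.7) / 0.1, beta)
    return 0.0

def mu_p(x, beta):
    """Perfect (P). Assumption: x = 1 is included so that the maximum 1 is attained."""
    return shape((1 - x) / 0.2, beta) if 0.8 <= x <= 1 else 0.0

# Compare the side of TF at x = 0.1 for beta below, at and above 1.
for beta in (0.5, 1.0, 2.0):
    print(beta, mu_tf(0.1, beta))
```

At x = 0.1 the straight-line case β = 1 gives 0.5, while β = 2 gives the larger value 0.75, in line with the remark that the sides lie above the straight lines when β > 1.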

Define

\[ b_t = \frac{Var\{G(t)\}}{E(N_t)} \tag{38} \]

and

\[ d_t = \frac{Var\{N_t\}}{E(N_t)}, \tag{39} \]

and note that b_t and d_t measure the dispersion of G(t) and N_t, respectively, with respect to the mean value of N_t. From equation (37), it follows that

\[ b_t = Var\{z\} + d_t\, \|E\{z\}\|_2^2. \tag{40} \]

The consistency of the fuzzy debugging effort is measured by b_t and, from (40), it is clear that b_t depends on q, β and d_t implicitly through the underlying spp and fuzzy random variable z. How these parameters affect b_t will be explored next. But first, we give a numerical example of the calculation of b_t.

Fig. 3. The three levels of debugging efforts [Figure: membership functions TF, LP and P plotted over the interval from 0 to 1, with the vertical axis running from 0.0 to 1.0]

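Example 3. Let the fuzzy debugging at each point take on the above LR fuzzy sets with q = (0.05, 0.10, 0.85) and β = d_t = 2. It is easy to show that the α-level sets of the above fuzzy sets are the closed intervals

\[ \mu_{1\alpha} = \Bigl[0, \frac{\sqrt{1 - \alpha}}{5}\Bigr], \quad \mu_{2\alpha} = \Bigl[\frac{3 - \sqrt{1 - \alpha}}{10}, \frac{7 + \sqrt{1 - \alpha}}{10}\Bigr] \quad \text{and} \quad \mu_{3\alpha} = \Bigl[\frac{7 + \sqrt{1 - \alpha}}{10}, 1\Bigr] \]

respectively. Since we are dealing in real time, our fuzzy random variables will take values in the real line R+. We note that in R^1, S^0 = {+1, −1} and the measure λ(·) allocates equal weight of 1/2 to each member of S^0. Using the definition of the support function given in (6), we obtain

\[ s(z_\alpha, 1) = \begin{cases}
\frac{\sqrt{1 - \alpha}}{5} & \text{with probability } 0.05, \\
\frac{7 + \sqrt{1 - \alpha}}{10} & \text{with probability } 0.10, \\
1 & \text{with probability } 0.85
\end{cases} \]

and

\[ s(z_\alpha, -1) = \begin{cases}
0 & \text{with probability } 0.05, \\
\frac{-3 + \sqrt{1 - \alpha}}{10} & \text{with probability } 0.10, \\
\frac{-7 - \sqrt{1 - \alpha}}{10} & \text{with probability } 0.85,
\end{cases} \]

hence applying (21) we obtain

\[ E\|z\|_2^2 = 0.05 \times \int_0^1 \frac{(\sqrt{\alpha})^2}{50} \, d\alpha + 0.10 \times \int_0^1 \frac{(7 + \sqrt{\alpha})^2 + (-3 + \sqrt{\alpha})^2}{200} \, d\alpha + 0.85 \times \int_0^1 \frac{100 + (7 + \sqrt{\alpha})^2}{200} \, d\alpha = 0.678. \]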

Next, using (15),

\[ E\{X\}_\alpha = [0.625 + 0.075\sqrt{1 - \alpha},\ 0.92 + 0.02\sqrt{1 - \alpha}] \]

therefore s(E{X}_α, 1) = 0.92 + 0.02√(1 − α) and s(E{X}_α, −1) = −0.625 − 0.075√(1 − α), and applying (22) we obtain

\[ \|E\{z\}\|_2^2 = \int_0^1 \int_{S^0} |s(E\{X\}_\alpha, u)|^2 \, d\lambda(u)\, d\alpha = \frac{1}{2} \int_0^1 (0.92 + 0.02\sqrt{\alpha})^2 \, d\alpha + \frac{1}{2} \int_0^1 (0.625 + 0.075\sqrt{\alpha})^2 \, d\alpha = 0.664 \]

which gives Var{z} = 0.014 by Proposition 1 (i). Finally, using (40), we obtain b_t = 1.342.
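The expected level set and the squared norm ‖E{z}‖_2^2 in Example 3 can be reproduced numerically. The sketch below takes the three α-level intervals exactly as stated in the example, combines them through (15), and integrates (22) with a midpoint rule for n = 1 and weight 1/2 on each of u = +1 and u = −1. The function names and the number of integration steps are illustrative choices.

```python
import math

Q = (0.05, 0.10, 0.85)   # probabilities of TF, LP and P from Example 3

def level_sets(alpha):
    """Alpha-level intervals of the three marks for beta = 2, as stated in Example 3."""
    r = math.sqrt(1.0 - alpha)
    return [(0.0, r / 5), ((3 - r) / 10, (7 + r) / 10), ((7 + r) / 10, 1.0)]

def expected_level_set(alpha):
    """E{z}_alpha via (15): the probability-weighted Minkowski sum of the level sets."""
    cuts = level_sets(alpha)
    lo = sum(q * c[0] for q, c in zip(Q, cuts))
    hi = sum(q * c[1] for q, c in zip(Q, cuts))
    return lo, hi

def sq_norm_of_mean(steps=20000):
    """||E{z}||_2^2 via (22) for n = 1, using a midpoint rule in alpha."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        lo, hi = expected_level_set(alpha)
        total += 0.5 * (hi ** 2 + lo ** 2)   # s(., 1) = hi and s(., -1) = -lo
    return total / steps

print(expected_level_set(0.0))
print(round(sq_norm_of_mean(), 3))
```

At α = 0 the expected level set is approximately [0.7, 0.94], matching the endpoints 0.625 + 0.075 and 0.92 + 0.02 of the formula above, and the numerical integral agrees with the reported value 0.664 to three decimals.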

Values of b_t are obtained for various q, β and d_t values. In Figure 4 we display the plots of b_t against various q_1, q_2 and q_3 values. These plots show that b_t declines with increasing probabilities of TF and LP level fuzzy debugging but increases with increasing probability of P level fuzzy debugging. This can be explained by the fact that, with successful debugging, software defects are uncovered at an increasing rate, leading to higher variability of the gauge measure. The same phenomenon occurs in probabilistic SRGMs, where the number of software defects uncovered increases rapidly and then flattens out after a period of testing.

In Figure 5, we display the boxplots of b_t for various dispersions of the underlying spp as quantified by d_t. A boxplot is a well-known graphical display which gives a good description of the spread of the data (cf. [5]). The larger the width of these boxes, the higher the variability. From these boxplots, it is apparent that the variability of G(t) increases when the variability of the underlying spp increases, which is quite reasonable since one would expect higher dispersion in N(t) to lead to higher variability in G(t).

Fig. 4. Plots of b_t against q_1, q_2 and q_3

Fig. 5. Boxplots of b_t for d_t = 0.5, 1, 2, 3, 4

Fig. 6. Boxplots of b_t for β = 0.1, 0.33, 0.5, 1, 2

Finally, comparing the boxplots of b_t for various values of β (Figure 6) indicates very high variability for low β values compared to higher β values. As mentioned, deviation from β = 1 (as in Figure 3) produces non-linear curves on the sides of the fuzzy sets which are above and below the straight lines when β > 1 and β < 1 respectively. Thus, low β values indicate a less accurate debugging process, thereby tending to increase its variability.

7 Conclusion

In this chapter, we present a generic model on fuzzy debugging which is based on point processes with random fuzzy marks. This provides an alternative to the standard 0-1 perfect-imperfect debugging approach used in conventional SRGM modeling and is suitable in cases where a considerable amount of subjectivity enters into the debugging process.

A large part of this chapter is devoted to introducing fairly abstract concepts such as fuzzy random variables, Aumann integrals, point process theory and other related topics, since it is our belief that a rigorous study of these topics is warranted given their wide range of applications, not least in the subject matter of this chapter. Nevertheless, numerical examples are presented showing how these concepts can be used in a concrete manner to demonstrate the way in which fuzzy debugging can be assessed.

Finally, there is much scope for extending the present work. For example, we have devoted our coverage to strictly independent fuzzy random variables; it will be interesting to compare results when there are dependencies between the variables. Also, statistical theory such as estimation and inference can be brought to bear on the distribution of gauge measures. This will certainly extend the range of application of this measure into areas other than software reliability.

References

1. Aumann, R.: J. Math. Anal. Appl. 12, 1–12 (1965)
2. Cai, K.Y.: Introduction to Fuzzy Reliability. Kluwer Academic Publishers, Boston (1996)
3. Cai, K.Y., Wen, C.Y., Zhang, M.I.: Reliab. Engrg. Syst. Safety 32, 357–371 (1991)
4. Cai, K.Y., Wen, C.Y., Zhang, M.I.: Microelectron Reliab. 33, 2265–2267 (1993)
5. Devore, J.L.: Probability and Statistics for Engineering and the Sciences, 6th edn. Brooks/Cole, Pacific Grove (2004)
6. Diamond, P., Kloeden, P.: Metric Spaces of Fuzzy Sets. World Scientific, Singapore (1994)
7. Dubois, D., Prade, H.: Int. J. Systems Sci. 9, 613–626 (1978)
8. Gil, M.A., López-Díaz, M., Ralescu, D.A.: Fuzzy Sets and Systems 157, 2546–2557 (2006)
9. Glass, R.L.: Software Runaways: Monumental Software Disasters. Prentice Hall Computer Science Series, New Jersey (1998)
10. Jacobsen, M.: Point Process Theory and Application. Birkhäuser, Basel (2006)
11. Körner, R.: Fuzzy Sets and Systems 92, 83–93 (1997)
12. Kruse, R., Meyer, K.D.: Statistics with Vague Data. Reidel, Dordrecht (1987)
13. Kwakernaak, K.D.: Inform. Sci. 15, 1–29 (1978)
14. Lushu, L., Zhaohan, S.: Fuzzy Sets and Systems 97, 203–209 (1998)
15. Lyashenko, N.: J. Soviet Math. 20, 2187–2196 (1982)
16. Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Prediction, Application. McGraw-Hill, New York (1990)
17. Pham, H.: Software Reliability. Springer, Berlin (2000)
18. Puri, M.L., Ralescu, D.A.: J. Math. Anal. Appl. 64, 409–422 (1978)
19. Richter, H.: Math. Ann. 150, 85–90 (1963)
20. Ross, S.M.: IEEE Trans. Software Engng. 11, 1472–1476 (1985)
21. Shooman, M.: Software Engineering: Design, Reliability and Management. McGraw-Hill, New York (1972)
22. Singpurwalla, N.D., Wilson, S.: International Statistical Review 62, 289–317 (1994)
23. Snyder, D.L., Miller, M.I.: Random Point Processes in Time and Space, 2nd edn. Springer, New York (1991)
24. Stein, W., Talati, K.: Fuzzy Sets and Systems 6, 271–287 (1981)
25. Tohma, Y., Yamano, H., Ohba, M., Jacoby, R.: IEEE Trans. Software Engng. 17, 483–489 (1991)
26. Vitale, R.: J. Microscopy 151, 197–204 (1988)
27. Xie, M.: Software Reliability Modelling. World Scientific, Singapore (1991)
28. Zeephongsekul, P., Xia, G., Kumar, S.: IEEE Transactions on Reliability 43, 408–413 (1994)
29. Zeephongsekul, P., Xia, G., Kumar, S.: International Journal of Systems Sci. 25, 737–751 (1994)
30. Zeephongsekul, P., Bodhisuwan, W.: International Journal of Reliability, Quality and Safety Engineering 6, 19–30 (1999)
31. Zeephongsekul, P., Bodhisuwan, W.: International Journal of Reliability, Quality and Safety Engineering 6, 19–30 (1999)
32. Zeephongsekul, P., Xia, G.: Fuzzy Sets and Systems 83, 239–247 (1996)
33. Zeephongsekul, P.: Fuzzy Sets and Systems 123, 29–38 (2001)
34. Zeephongsekul, P.: International Journal of Reliability, Quality and Safety Engineering 13, 237–255 (2006)

Page 135: Foundations of Computational Intelligence

Computational Methods for Investment Portfolio: The Use of Fuzzy Measures and Constraint Programming for Risk Management

Tanja Magoc, Francois Modave, Martine Ceberio, and Vladik Kreinovich

University of Texas at El Paso
Computer Science Department
500 West University Avenue
El Paso, Texas
[email protected], [email protected], [email protected], [email protected]

Summary. Computational intelligence techniques are very useful tools for solving problems that involve understanding, modeling, and analysis of large data sets. One of the numerous fields where computational intelligence has found an extremely important role is finance. More precisely, optimization issues of one’s financial investments, to guarantee a given return at a minimal risk, have been solved using intelligent techniques such as genetic algorithms, rule-based expert systems, neural networks, and support vector machines. Even though these methods provide a good and usually fast approximation of the best investment strategy, they suffer from some common drawbacks, including the neglect of the dependence among the criteria characterizing investment assets (i.e., return, risk, etc.) and the assumption that all available data are precise and certain. To address these weaknesses, we propose a novel approach involving a utility-based multi-criteria decision-making setting and fuzzy integration over intervals.

1 Introduction

Given the pervasive nature of computer science, virtually all areas have had to deal with enormous amounts of data. These data alone do not provide much information if they cannot be analyzed, understood, and used to extend knowledge. The strength of computational intelligence is that it offers a wide variety of techniques that can be used to process, model, and understand these data sets. One of the fields where computational intelligence has been extremely useful is finance. To simplify, we can consider two types of problems of interest for the computational intelligence community: pricing and portfolio management. The former deals with how to assign a price to a derivative instrument in such a way that arbitrage is not possible, whereas the latter deals with the optimization of one’s financial investments to guarantee a given return at a minimal risk. Pricing theory is tackled mostly from a stochastic perspective, using models such as Black-Scholes [16]. On the other hand, portfolio management is a natural area of application for computational intelligence. The problem we are interested in is the selection of an optimal portfolio: a distribution of wealth over several investment assets in order to diversify risk and obtain a maximal return for the given acceptable level of risk. Typically, the higher the expected return of an asset, the higher the risk associated with it. Besides the return and the risk, other factors, such as time to maturity and transaction cost, influence the decision of how much money to invest in each asset under consideration.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 133–173. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Depending on the needs of the investor, different goals could be sought. The two most commonly considered problems in portfolio selection are maximization of wealth and minimization of risk to acquire a required level of wealth.

Various investment strategies have been examined to identify an optimal portfolio in different settings, including simple return-based strategies that do not consider other characteristics of assets, strategies that use stochastic processes to model the behavior of assets and portfolios, and strategies that involve intelligent systems. The aim of this chapter is two-fold. First, we present an extensive selection of computational intelligence techniques, such as genetic algorithms, neural networks, support vector machines, and expert systems, that are used to select optimal portfolio management strategies. Then, we propose a novel approach based on a utility-based multi-criteria decision-making setting and fuzzy integration over intervals as a natural framework for portfolio management.

2 Mathematical Background

As we have mentioned in the Introduction, the two most commonly considered problems in portfolio selection are maximization of wealth and minimization of risk to acquire a required level of wealth.

The simplest way of representing these problems is as an optimization problem subject to constraints, which maximizes or minimizes a simple objective function

$$\text{maximize (or minimize)} \quad \sum_{i=1}^{m} w_i x_i, \tag{1}$$

subject to constraints such as

$$w_i \ge 0 \quad \forall i \in \{1, 2, \ldots, m\}, \tag{2}$$

$$\sum_{i=1}^{m} w_i r_i \le \mathit{risk}, \tag{3}$$

$$\sum_{i=1}^{m} w_i R_i \ge \mathit{return}, \tag{4}$$

$$\sum_{i=1}^{m} w_i = 1, \tag{5}$$

where $x_i$ is either the return (in maximization problems) or the risk (in minimization problems), $m$ is the number of investment assets, $w_i$ is the proportion of wealth invested in the asset $i$, $R_i$ is the return rate of the asset $i$, $r_i$ is the risk of the asset $i$, $\mathit{return}$ is the required level of return, and $\mathit{risk}$ is the level of risk acceptable to the investor. As presented above, selecting the optimal portfolio seems to be a straightforward linear programming problem. However, this representation is just the simplest problem that we can face when looking for the optimal portfolio. In general, many more constraints are imposed on the solution. Moreover, the objective function and the constraints are usually much more complex if more information, such as transaction cost, time period, preferred portfolio structure, and relationships between characteristics of assets, is taken into consideration. A real-life portfolio selection problem is most commonly a non-linear constrained optimization problem that is usually not (easily) solvable using general constraint solving techniques, because these techniques often fail to find a global optimum.

Regardless of the setting and its complexity, in the portfolio optimization problem we aim at finding the vector of weights (i.e., the proportion of wealth allocated to each asset), $w = (w_1, w_2, \ldots, w_m)$, given all the other parameters. However, the risk and the return of each asset are predicted rather than certain values, so the uncertainty of these values further complicates the process of selecting the optimal portfolio.
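To make the formulation (1)-(5) concrete, the following minimal sketch checks the constraints for a candidate weight vector and searches a coarse grid of weights for the feasible vector with the highest return. The asset figures, the function names, and the grid search itself are illustrative assumptions; a practical implementation would use a linear programming solver rather than enumeration.

```python
from itertools import product

def evaluate(weights, returns, risks, max_risk, min_return=None):
    """Return the portfolio return if constraints (2)-(5) hold, else None."""
    if any(w < 0 for w in weights):                      # constraint (2)
        return None
    if abs(sum(weights) - 1.0) > 1e-9:                   # constraint (5)
        return None
    total_risk = sum(w * r for w, r in zip(weights, risks))
    total_return = sum(w * R for w, R in zip(weights, returns))
    if total_risk > max_risk + 1e-9:                     # constraint (3)
        return None
    if min_return is not None and total_return < min_return - 1e-9:
        return None                                      # constraint (4)
    return total_return                                  # objective (1)

def grid_search(returns, risks, max_risk, steps=20):
    """Enumerate weight vectors on a grid with spacing 1/steps."""
    m = len(returns)
    best_w, best_val = None, float("-inf")
    for combo in product(range(steps + 1), repeat=m - 1):
        if sum(combo) > steps:
            continue
        # The last weight is whatever remains, so the weights sum to 1.
        w = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        val = evaluate(w, returns, risks, max_risk)
        if val is not None and val > best_val:
            best_w, best_val = w, val
    return best_w, best_val

# Hypothetical data: three assets with made-up return and risk figures.
returns = [0.04, 0.08, 0.12]
risks = [0.02, 0.06, 0.15]
weights, value = grid_search(returns, risks, max_risk=0.08)
```

Enumeration grows exponentially with the number of assets, which is one reason the chapter turns to heuristic methods for realistic problem sizes.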

3 Genetic Algorithms

In this section, we first give a general description of genetic algorithms, and then explain how these algorithms work in a portfolio management framework. We also compare how genetic algorithm approaches perform versus other approaches, e.g., greedy algorithms.

3.1 Theoretical Background

A genetic algorithm (GA) is an optimization method (see, e.g., [11], [14], [24], [25]) that imitates the biological process of natural survival of the fittest individuals in a population. Each individual is characterized by a sequence of genes, which constitute a chromosome. The fittest individuals are selected for mating. Through exchange of chromosomal material between selected pairs and through mutations, the new generation is produced. Thus, generating a new population follows a three-step process: selection, crossover (the exchange of genetic material between two individuals to produce one or more offspring), and mutation of genes. A genetic algorithm simulates all three steps of the natural evolution process.

A genetic algorithm starts by defining its optimization variables and the fitness function. Each variable represents a gene, and all genes of an individual constitute a chromosome:

Definition 2.1. (Chromosome) A chromosome representing an individual $i$ is the vector of all genes of this individual:

$$\mathit{chromosome}_i = [p_1, \ldots, p_N], \tag{6}$$

where $N$ is the number of genes (variables), and $p_j$, $j = 1, \ldots, N$, are the genes (i.e., values of the variables) of the individual $i$.

Since the variables could include qualitative as well as quantitative values of different ranges, each of them needs to be encoded into a finite set of distinct values, usually represented using binary digits.

The next step is to define the fitness function used in the algorithm.

Definition 2.2. (Fitness function) The fitness function

$$f(\mathit{chromosome}) = f(p_1, \ldots, p_N) \tag{7}$$

represents an optimization criterion that defines the fitness of each individual.

Usually, the fitness is to be maximized, so that the fittest individuals are selected for the next step. However, the fitness function could be defined as a cost function, in which case the fittest individuals are the ones with the lowest cost.

After defining the variables and the fitness function, the initial population is generated either by a random number generator or by encoding values of variables for specific individuals. Initializing the population ends the preparation part of the genetic algorithm and marks the beginning of the iterative steps. The first of the three iterative steps is selection. A proportion of the population is selected to proceed to the next step and the remainder of the population is discarded. Most commonly, the generational gap, that is, the percentage of the population selected to continue the process, is 50%, but any other percentage could be used. The selection process is based on the fitness level of individuals and is performed mainly in one of two ways. The first method ranks all individuals based on their fitness level and selects the top-ranked individuals. The second method relies on random selection in which a higher probability of selection is given to fitter individuals.

The selected individuals proceed to the crossover step, which chooses two individuals for mating in order to produce one or two offspring. The most commonly used method for mating is the one-point crossover technique, which picks a random point $r$ between the first and the last position in a chromosome, called the crossover point, and produces two offspring in the following way. The first offspring copies the genes $1$ to $r$ from the first parent and the genes $r+1$ to $N$ from the second parent, while the second offspring is produced by swapping the roles of the parents. Crossover with different pairs of parents continues until the number of individuals is restored to the original size of the population. Note, however, that there are some variations in how the crossover step is performed.

The crossover step is followed by mutation. A proportion of genes is chosen randomly for mutation, and each selected gene takes a random value from the domain of its variable. The mutation process is very important since it prevents the population from converging too quickly within a small search area. It also allows the search to escape from a local optimum that is not a global optimum. However, it is desirable that the current best individual is not mutated, so that the current solution is not lost; many GAs therefore apply an elitist strategy that protects the individual with the highest fitness from being mutated.

Finally, the fitness of each individual in the population is calculated again and the convergence criterion is checked. Ideally, at the end, all individuals in the population have the same genes, representing the optimal solution. However, a genetic algorithm is usually stopped after a predefined number of iterations, which results in a set of good candidate solutions rather than just a single solution, a characteristic that suits portfolio selection problems very well.
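The iterative steps described above (rank-based selection with a 50% generational gap, one-point crossover, and mutation with an elitist strategy) can be put together in a short sketch. The toy fitness function (the number of ones in a binary chromosome), the parameter values, and all function names are assumptions chosen for illustration only.

```python
import random

def one_point_crossover(p1, p2, rng):
    """Swap the tails of two parents after a random crossover point r."""
    r = rng.randint(1, len(p1) - 1)
    return p1[:r] + p2[r:], p2[:r] + p1[r:]

def genetic_algorithm(fitness, n_genes, pop_size=20, generations=40,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Selection: rank by fitness and keep the top 50%.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        # Crossover: refill the population to its original size.
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.extend(one_point_crossover(p1, p2, rng))
        pop = (survivors + children)[:pop_size]
        # Mutation: flip random bits, sparing the elite individual pop[0].
        for individual in pop[1:]:
            for j in range(n_genes):
                if rng.random() < mutation_rate:
                    individual[j] = 1 - individual[j]
        history.append(max(fitness(ind) for ind in pop))
    return max(pop, key=fitness), history

# Toy problem: maximize the number of ones in a 16-bit chromosome.
best, history = genetic_algorithm(sum, n_genes=16)
```

Because the fittest individual is never mutated, the best fitness recorded in `history` cannot decrease from one generation to the next.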

3.2 Applications to Portfolio Management

As a computational intelligence technique, GAs have found different applications in portfolio management. In [18], the authors developed a two-stage algorithm to allocate wealth among numerous investment assets to reach an investor’s goal. The first stage, first described in [32], uses a GA to select the highest-performing assets among thousands of available assets, while the second stage uses another GA to find an optimal wealth distribution among the chosen assets.

The choice of the assets that proceed to the second stage of the algorithm is based purely on the return of the assets. Each asset is represented as a chromosome containing four genes. Each of the four genes represents one financial indicator used as an input variable. The four variables are:

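• Return on capital employed: $\mathrm{ROCE} = \dfrac{\text{profit}}{\text{shareholder's equity}} \cdot 100\%$.
• Price/earnings ratio: $\mathrm{P/E} = \dfrac{\text{profit}}{\text{earnings per share}} \cdot 100\%$.
• Earnings per share: $\mathrm{EPS} = \dfrac{\text{net income}}{\text{the number of ordinary shares}}$.
• Liquidity ratio: $\dfrac{\text{current assets}}{\text{current liabilities}} \cdot 100\%$.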

Each financial indicator is rated and takes one of eight values (0–7), where 0 stands for a poor performance of the asset and 7 represents a very good performance. These values are encoded as binary numbers so that each gene is a three-digit binary number.
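The encoding of the four rated indicators into three-bit genes can be illustrated as follows. The function names and the example ratings are hypothetical; only the 0–7 scale and the three-bit gene width come from the description above.

```python
def encode(ratings):
    """Encode ratings (each 0-7) as a bit string, three bits per gene."""
    if any(not 0 <= r <= 7 for r in ratings):
        raise ValueError("each rating must lie in 0..7")
    return "".join(format(r, "03b") for r in ratings)

def decode(chromosome):
    """Recover the list of ratings from a bit string of three-bit genes."""
    return [int(chromosome[k:k + 3], 2)
            for k in range(0, len(chromosome), 3)]

# Hypothetical ratings for ROCE, P/E, earnings per share, liquidity ratio.
chromosome = encode([7, 0, 5, 2])   # "111000101010"
```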

Next, the “fitness” of each asset is determined. To find the fitness of an asset, all assets are ranked based on the annual price return (APR):

$$\mathrm{APR}_n = \frac{\mathrm{ASP}_n - \mathrm{ASP}_{n-1}}{\mathrm{ASP}_{n-1}}, \tag{8}$$

where $\mathrm{APR}_n$ is the annual price return for the year $n$ and $\mathrm{ASP}_n$ is the annual stock price for the year $n$. Assets with a high APR are considered good assets. Thus, all the assets are ranked from 1 to $N$, where the asset with the highest APR is ranked 1 and the asset with the lowest APR is ranked $N$. The asset’s ranking, $r$, is then mapped into the range 0–7 using the linear mapping $R_{actual} = 7 \cdot \frac{N - r}{N - 1}$, where $N$ is the number of assets. Finally, a fitness function, which determines the optimization criterion, is designed. The most commonly used fitness function is the root mean square error between the estimated ranking and next year’s actual ranking:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(R_{derived} - R_{actual}\right)^2}. \tag{9}$$

The goal is to minimize the value of RMSE.

After defining the variables and the fitness function, the selection step of the GA is performed. Chromosomes are selected randomly for crossover, with a higher probability of selection given to chromosomes with a higher fitness. The one-point crossover technique, which is used to combine two parents to produce two offspring, picks a position in a chromosome and interchanges the values of the two parents at this position. Finally, a random mutation changes each gene from 0 to 1 or vice versa with a probability equal to 0.005.

The generation produced by this method is either accepted as the final population or another iteration of selection, crossover, and mutation is performed. The process stops when one of the following three conditions is satisfied:
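The ranking step and the error criterion of Eqs. (8) and (9) can be sketched in a few lines. The function names and the sample figures are illustrative assumptions, and ties between equal APR values are broken arbitrarily here, which the source does not specify.

```python
from math import sqrt

def annual_price_return(prices):
    """Eq. (8): APR_n = (ASP_n - ASP_{n-1}) / ASP_{n-1} for consecutive years."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def actual_ratings(aprs):
    """Rank assets by APR (rank 1 = highest) and map rank r to 7*(N-r)/(N-1)."""
    n = len(aprs)
    order = sorted(range(n), key=lambda i: aprs[i], reverse=True)
    ratings = [0.0] * n
    for rank, asset in enumerate(order, start=1):
        ratings[asset] = 7 * (n - rank) / (n - 1)
    return ratings

def rmse(derived, actual):
    """Eq. (9): root mean square error between derived and actual ratings."""
    return sqrt(sum((d - a) ** 2 for d, a in zip(derived, actual)) / len(actual))

# Hypothetical APR values for four assets.
ratings = actual_ratings([0.10, -0.05, 0.30, 0.02])
```

With these sample values, the asset with the highest APR receives the rating 7 and the asset with the lowest APR receives 0, as the linear mapping requires.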

• A predefined number of iterations is reached.
• A defined fitness is reached.
• A convergence criterion of the population is reached. In an ideal case, all the chromosomes of the final generation have the same genes, representing the optimal solution.

At the end of the first stage of the two-stage portfolio optimization algorithm, the assets are ranked based on their return and the best $m$ assets are considered for investment. The second stage of the algorithm determines the wealth distribution among these $m$ assets. It takes into consideration the risk as well as the return, with the goal of minimizing the risk for the expected level of return.

This stage of the algorithm is based on yet another genetic algorithm. Before applying the second GA, the expected return of each asset and the covariance between each pair of assets are calculated. The expected return of the asset $i$ after $n$ time intervals is calculated as

$$E(R_i) = \sum_{t=1}^{n} \frac{R_{it}}{n}, \tag{10}$$

where

$$R_{it} = \frac{\mathrm{SCP}_{it} - \mathrm{SCP}_{i(t-1)}}{\mathrm{SCP}_{i(t-1)}} \tag{11}$$

is the return of the asset $i$ for time $t$ and $\mathrm{SCP}_{it}$ is the closing price of the asset $i$ at time $t$. The covariance between assets $i$ and $j$ is given by

$$\sigma_{ij} = \frac{1}{n} \sum_{t=1}^{n} (R_{it} - E(R_i)) \cdot (R_{jt} - E(R_j)). \tag{12}$$
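Equations (10)-(12) translate directly into code. The sketch below assumes that each asset is given as a list of closing prices; the function names and the sample prices are hypothetical.

```python
def period_returns(closing_prices):
    """Eq. (11): R_it = (SCP_it - SCP_i(t-1)) / SCP_i(t-1)."""
    return [(b - a) / a for a, b in zip(closing_prices, closing_prices[1:])]

def expected_return(returns):
    """Eq. (10): the mean of the n period returns."""
    return sum(returns) / len(returns)

def covariance(returns_i, returns_j):
    """Eq. (12): covariance of two return series, dividing by n."""
    n = len(returns_i)
    mean_i = expected_return(returns_i)
    mean_j = expected_return(returns_j)
    return sum((a - mean_i) * (b - mean_j)
               for a, b in zip(returns_i, returns_j)) / n

# Hypothetical closing prices of two assets over four periods.
r1 = period_returns([10.0, 11.0, 10.5, 11.5])
r2 = period_returns([20.0, 19.0, 19.5, 21.0])
sigma_12 = covariance(r1, r2)
```

Note that Eq. (12) divides by $n$ rather than $n - 1$, and the sketch follows the equation as stated.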

The algorithm designs chromosomes using the binary representation of the asset’s weight, $w_i$, which is the amount of wealth allocated to the asset $i$. The weight of the asset $i$ is normalized by

$$x_i = \frac{w_i}{\sum_{j=1}^{m} w_j} \tag{13}$$

to fit into the 8 bits allocated for the representation of each chromosome. The weights are adjusted by the GA until the optimal weights are achieved.

Next, a fitness function, defined by

$$\mathrm{Fitness} = \sum_{i=1}^{m} \sum_{j=1}^{m} \sigma_{ij} x_i x_j + \left( \sum_{i=1}^{m} E(R_i) x_i - R_p^* \right)^2, \tag{14}$$

is designed to take into consideration the tradeoff between the risk and the return. The optimal solution is obtained by minimizing this function. The first term of the fitness function represents the risk, which is defined as the volatility of the assets included in the portfolio, while the second term penalizes the difference between the portfolio’s overall return and the pre-defined required return, $R_p^*$. The value of the fitness function for each chromosome determines which chromosomes are chosen for selection, crossover, and mutation, which are performed similarly to the processes in the first GA. The results of these processes determine the generation for the next iteration. The final generation determines the distribution of wealth among the chosen assets.

The algorithm was tested on data obtained from the Shanghai Stock Exchange for the period from January 2001 to December 2004. The test used monthly and yearly available information. After the first stage of the algorithm, the top 10, 20, and 30 stocks were selected for three different experiments. The results showed that the greater the number of stocks included in the portfolio, the lower the return of the portfolio. The portfolio with 10 stocks produced by the genetic algorithm was also tested against the equally weighted portfolio, which allocates an equal amount of money to each of the 10 stocks. The investment portfolio that resulted from the GA consistently outperformed the equally weighted portfolio.
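As an illustration of Eqs. (13) and (14), the following sketch normalizes raw weights and evaluates the fitness of one candidate portfolio. The function names, the two-asset data, and the use of plain nested lists for the covariance matrix are assumptions made for this example.

```python
def normalize(weights):
    """Eq. (13): x_i = w_i / sum_j w_j."""
    total = sum(weights)
    return [w / total for w in weights]

def portfolio_fitness(weights, expected_returns, cov, required_return):
    """Eq. (14): portfolio variance plus squared deviation from R_p*."""
    x = normalize(weights)
    m = len(x)
    risk = sum(cov[i][j] * x[i] * x[j] for i in range(m) for j in range(m))
    deviation = sum(expected_returns[i] * x[i] for i in range(m)) - required_return
    return risk + deviation ** 2

# Hypothetical two-asset example with raw 8-bit weights.
cov = [[0.01, 0.0],
       [0.0, 0.04]]
fit = portfolio_fitness([128, 128], [0.02, 0.04], cov, required_return=0.03)
```

In this example the normalized weights are 0.5 each, the portfolio return equals the required return, and the fitness therefore reduces to the risk term alone.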


A similar two-stage genetic algorithm was built in [19]. The only differences are the details of asset representation and of the selection, crossover, and mutation processes. An asset is again represented as a chromosome, which is an $n$-dimensional vector consisting of $n$ parameters called genes. If the initial population contains $m$ assets, the selection process picks exactly $\left[\frac{m}{2}\right]$ assets with the highest fitness and discards all other assets. For the crossover stage, a random positive integer, $r \le n$, is selected and two offspring are produced by the following procedure. The first offspring copies the first $r$ genes from the first parent and the last $n - r$ genes from the second parent, while the second offspring is created by copying the first $r$ genes from the second parent and the last $n - r$ genes from the first parent. Formally, two parents $P_1$ and $P_2$ yield two offspring $O_1$ and $O_2$ by the following rules:

$$O_1 = \{ g_i \mid g_i \in P_1 \text{ if } i \le r, \text{ else } g_i \in P_2 \} \tag{15}$$

and

$$O_2 = \{ g_i \mid g_i \in P_2 \text{ if } i \le r, \text{ else } g_i \in P_1 \}, \tag{16}$$

where $g_i$ represents the $i$th gene.

Finally, the mutation is performed by randomly selecting another positive integer $r$, $r \le n$. All the genes except the $r$th gene are copied, and gene $r$ takes a random value that represents a possible mutation.

The algorithm was tested on data obtained from the Australian Stock Exchange. The results were compared against a greedy algorithm, and the comparison showed that the genetic algorithm performed only slightly worse than the greedy algorithm but ran much faster.
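The three operators of this variant can be written with the position $r$ passed in explicitly, which makes the rules (15) and (16) easy to check by hand. The function names are hypothetical, and the bracket $[m/2]$ is read here as the integer part of $m/2$, which is an assumption about the notation.

```python
def select(population, fitness):
    """Keep the [m/2] fittest chromosomes (integer part of m/2 assumed)."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: len(population) // 2]

def crossover(p1, p2, r):
    """Eqs. (15)-(16): first r genes from one parent, the rest from the other."""
    return p1[:r] + p2[r:], p2[:r] + p1[r:]

def mutate(chromosome, r, new_value):
    """Copy every gene except gene r (1-based), which takes new_value."""
    child = list(chromosome)
    child[r - 1] = new_value
    return child

o1, o2 = crossover([1, 2, 3, 4], [5, 6, 7, 8], r=1)
# o1 == [1, 6, 7, 8] and o2 == [5, 2, 3, 4]
```

In an actual run, $r$ and the replacement value would be drawn at random, as described above.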

4 Rule-Based Expert Systems

Even though genetic algorithms have shown good results when applied to portfolio management, other intelligent systems have also been used to optimize the distribution of wealth among assets. Rule-based expert systems are one of these techniques, so we review the basics of expert systems and then describe their application to portfolio selection.

4.1 Theoretical Background

A rule-based expert system simulates the decision-making ability of a human expert in a field of interest (see, e.g., [23]). The system is designed to allow “communication” between a user and itself in order to obtain information that is necessary for solving a problem. This is done through a user interface, which includes a pseudo-natural language processing component that allows interaction between the user and the system using a limited form of natural language. Another role of the user interface is to display the solution of the problem being considered to the user, along with a possible explanation of the decision made.


The “brain” of the expert system consists of two parts: the knowledge base and the inference engine. The knowledge base contains the facts and the rules of the subject at hand. The rules are usually rules of predicate calculus. The inference engine consists of processes that manipulate the knowledge base to make inferences.

The rules are usually entered directly into the system’s knowledge base. However, sometimes the rules are inferred from training samples. The process of building an expert system this way usually iterates through many cycles until human experts are satisfied with its performance. Test cases are run on the system to ensure that it provides the same results as a human expert in the field would.

There are several methods for making inferences from the given rules, but forward chaining and backward chaining are the most commonly used ones [10]. Forward chaining, as the name suggests, starts with the facts and deduces the conclusion by applying rules to the facts. Backward chaining, on the other hand, reasons in the opposite direction: it starts with a hypothesis and looks for facts that support it.

An expert system is the simplest example of a rule-based system that has been applied to the selection of an optimal portfolio [6]. However, portfolio management involves numerous tasks that, in real life, would not be performed by a single expert. To better simulate the behavior of human experts, single expert systems have been used as a base for the development of multi-agent systems. Multi-agent systems simulate the tasks of several experts and combine their expertise to make a final decision [29]. This kind of system allows communication among temporally and spatially separated experts, which is why such systems have found application in many different areas.
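The difference between the two inference directions can be seen in a minimal sketch. It uses simple propositional rules of the form (premises, conclusion) rather than full predicate calculus, and the rule contents and function names are invented for illustration.

```python
def forward_chain(facts, rules):
    """Start from the facts and apply rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

def backward_chain(goal, facts, rules, seen=frozenset()):
    """Start from a hypothesis and look for facts that support it."""
    if goal in facts:
        return True
    if goal in seen:                       # avoid circular reasoning
        return False
    return any(
        all(backward_chain(p, facts, rules, seen | {goal}) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )

# Hypothetical investment rules and facts.
rules = [
    (("low_risk_tolerance", "short_horizon"), "prefer_bonds"),
    (("prefer_bonds",), "reduce_equity_share"),
]
facts = {"low_risk_tolerance", "short_horizon"}
derived = forward_chain(facts, rules)
supported = backward_chain("reduce_equity_share", facts, rules)
```

Forward chaining derives every conclusion that follows from the facts, whereas backward chaining only explores the rules relevant to the hypothesis being tested.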

4.2 Applications to Portfolio Management

The first attempt to design an expert system to assist portfolio managers is described in [6]. The basic idea of this system, called Folio, is to interview a user and use an expert’s knowledge to first determine the user’s investment goals and then build the portfolio that best suits the situation. The algorithm consists of three steps: the interview of the investor, the inference of the goals of the investor, and the optimization of the distribution of wealth to reach the goals of the investor.

The interview contains a set of questions that help the expert to derive the correct goal of the investor. The simple questions determine, among others, the level of risk the user accepts, the desired return, and the user’s tax bracket. Based on the obtained answers, the algorithm infers the rules of a user’s goal, and the rules are used to determine the goals of the investor.

Folio recognizes 14 different goals for investment, including acquiring a required level of return, reducing risk by investing in different assets, and minimizing risk while attaining the desired return. Each goal is characterized by five parameters: a target value, a penalty for exceeding the target value, a penalty for falling short of the target value, a lower bound under which the penalty becomes infinite, and an upper bound above which the penalty becomes infinite. These parameters are established from the inferred rules. About 50 rules (derived from the interview) are used to infer one or more parameters of the goals. Sometimes, a parameter has more than one possible value, in which case a heuristic is used to determine the most certain value.

When the goal and its parameters are specified, Folio uses a goal programming algorithm to determine the distribution of wealth among assets that best fits the goal parameters. The goal programming algorithm used by Folio is a linear programming solver whose objective function is set to calculate the differences between the user’s target values and the obtained values for each of the parameters. The algorithm minimizes the sum of the deviations of the obtained values from the desired values, and the optimal wealth distribution among classes of assets is suggested. The algorithm considers nine classes of assets, which range from low-risk to high-risk assets. However, the distribution of wealth within each class is not given by this algorithm, for several reasons. First, this approach does not require Folio to consider the thousands of investment assets that exist in the market and therefore reduces the complexity of the algorithm. Second, Folio requires only aggregate knowledge about the properties of each asset class and not knowledge of individual assets. Moreover, the aggregate values change more slowly than the characteristics of individual assets, so Folio stays current for a longer time period. Finally, picking the exact assets for the investment is the responsibility of an investment advisor and not of Folio.
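The five goal parameters and the sum of deviations minimized by goal programming can be illustrated with a small sketch. The class and function names are hypothetical, and the linear form of the penalties is an assumption suggested by the use of a linear programming solver; the source does not give Folio’s exact penalty functions.

```python
from dataclasses import dataclass
import math

@dataclass
class Goal:
    target: float
    over_penalty: float    # assumed: penalty per unit above the target
    under_penalty: float   # assumed: penalty per unit below the target
    lower_bound: float     # below this value the penalty is infinite
    upper_bound: float     # above this value the penalty is infinite

def penalty(goal, value):
    """Weighted deviation of an obtained value from the goal's target."""
    if value < goal.lower_bound or value > goal.upper_bound:
        return math.inf
    if value > goal.target:
        return goal.over_penalty * (value - goal.target)
    return goal.under_penalty * (goal.target - value)

def total_penalty(goals, values):
    """Objective to minimize: the sum of deviations over all goals."""
    return sum(penalty(g, v) for g, v in zip(goals, values))

# Hypothetical goal: a return of 8%, where a shortfall is costly
# and exceeding the target is not penalized.
return_goal = Goal(target=0.08, over_penalty=0.0, under_penalty=10.0,
                   lower_bound=0.0, upper_bound=1.0)
```

A candidate allocation would be scored by evaluating each goal on the value the allocation achieves and summing the penalties.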

Even though the performance of Folio has not been tested on real data, this algorithm is the foundation for the further development of expert systems, which evolved into multi-agent systems for portfolio management. The advantages of multi-agent systems (MAS) over single-agent systems include [29]:

• The ability to avoid performance bottlenecks due to one stage in the multi-stage process.
• The possibility of interconnection and interoperation of multiple systems.
• A natural distribution of tasks among different agents.
• The possibility of connecting spatially and temporally distributed experts.
• Enhancement of overall system performance, including reliability, computational efficiency, maintainability, flexibility, and reuse, among others.

A multi-agent system for portfolio monitoring, called Warren, was developed in [30] and further improved and implemented in [29]. Warren was designed to monitor a portfolio rather than manage it. Portfolio monitoring provides a continuous picture of an existing portfolio, which helps to determine whether reallocation of assets is necessary, but does not suggest how to redistribute the wealth. The goal of Warren is to provide an overall picture of the existing portfolio based on the numerous data available from different sources.


Warren is composed of several types of agents: interface, task, middle, and information agents. The interface agent, Warren Interface, communicates with the investor. This type of agent interviews the user and collects all the data necessary to determine the goals of the investor. It also displays a comprehensive summary of the user’s current portfolio and allows the user to buy and sell assets. The middle agent, MatchMaker, helps match agents that request services with agents that provide those services.

The task agents, RiskCritical and Comptroller, perform tasks by formulating problem-solving plans and carrying them out in collaboration with other agents. The RiskCritical agent calculates the risk of the portfolio, while the Comptroller agent is in charge of buying and selling assets.

The information agents monitor and collect financial information about companies of interest when requested by a middle agent. Warren contains four information agents: FdsHistory, iYahooStocks, iEdgar, and TextMiner. FdsHistory provides a historical view of financial data summaries, iYahooStocks provides stock prices, iEdgar provides financial data summaries from the SEC’s Edgar web site, and TextMiner provides financial news analysis. FdsHistory, iYahooStocks, and iEdgar provide quantitative data about companies of interest, while TextMiner provides qualitative data available from numerous news agencies.

TextMiner is designed as a text classification agent that sorts the data available from a high volume of news reports about financial assets, since only the useful details should be considered when monitoring a portfolio. TextMiner sorts the news from Reuters, CNN, Business Wire, etc. by first separating financial from non-financial news articles. Financial news covers reports on a company’s earnings, movements of stock prices, revenues, and similar information, while news about corporate control and legal and regulatory issues is considered non-financial. To separate financial from non-financial data, TextMiner was trained on a set of 1,239 news articles, which were labeled manually. The selection process is based on the weighting of commonly used terms (words or phrases) in the following way. First, each news article is represented in a high-dimensional space, where each dimension corresponds to a term. Then, the set of news articles is represented by the term-by-document matrix $M$ of size $T \times N$, where $T$ is the number of terms and $N$ is the number of articles. The set of terms $T = \{t_1, \ldots, t_t, \ldots, t_T\}$ is constructed by eliminating the words whose frequency is higher than a frequent threshold (words that are considered to be just features) and the words whose frequency is lower than an infrequent threshold. Each term has a weight $w_t$, which indicates how important the term is for the given document. All the weights are scaled from 0 to 1, with a higher weight given to terms that appear often in one article but less frequently in other documents. Precisely, the weight of a word is determined by

$$w_t = \frac{(1 + \log(f_{it})) \cdot \log\frac{N}{d_t}}{\sqrt{\sum_{s \ne t} (\log(f_{is}) + 1)^2}}, \tag{17}$$


where $f_{it}$ is the number of times the term $t$ occurs in the document $i$, and $d_t$ is the number of documents in which the term $t$ occurs. The weight is normalized by the document's length. After the weights for each term are determined, the article $d$ is compared to the classes in $C = \{\text{financial}, \text{non-financial}\}$. A class is represented by the mean vector of all documents in the class,

$$ c = \frac{1}{|c|} \sum_{d \in c} d, \tag{18} $$

and the similarity is measured by the cosine of the angle between the class vector and the document vector:

$$ s(d_i, c_j) = \arg\max_{c_j \in C} \frac{d_i \cdot c_j}{\|d_i\| \cdot \|c_j\|}. \tag{19} $$
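As an illustration of Eqs. (18) and (19), the following Python sketch assigns a document to the class whose mean vector has the largest cosine similarity. The three-term weight vectors and the helper names (`centroid`, `cosine`, `classify`) are hypothetical and are not taken from TextMiner.

```python
import math

def centroid(docs):
    """Mean vector of all documents in a class, Eq. (18)."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(doc, class_docs):
    """Return the class whose centroid is most similar to doc, Eq. (19)."""
    centroids = {name: centroid(docs) for name, docs in class_docs.items()}
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))

# Hypothetical term-weight vectors over three terms.
training = {
    "financial": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "non-financial": [[0.1, 0.7, 0.6], [0.0, 0.9, 0.5]],
}
label = classify([0.7, 0.3, 0.1], training)
print(label)
```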

Once financial news is separated from non-financial news, it is classified into one of five groups: good, good–uncertain, neutral, bad–uncertain, and bad. Here, 'good' news clearly shows a company's good financial standing, whereas 'bad' news clearly shows the bad financial standing of a given company. 'Neutral' news mentions financial facts but does not give enough information to determine whether the facts indicate a good or a bad financial state of the company. The two 'uncertain' categories refer to predictions of the future behavior of the company. The classification into one of the five classes is performed by co-locating phrases, that is, by looking for words in the article that usually appear in the same order in a sentence but not necessarily next to each other, such as 'earning' and 'increased'. The selection of useful co-located phrases is based on the training data.

Finally, a step-by-step description of the operation of Warren follows. First, the MatchMaker initializes the virtual work-space for agent naming and resources for Warren, and all the other agents register their services with MatchMaker. The Warren Interface displays the current portfolio of the investor and allows the user to buy and sell assets. If the investor requests financial information about a particular company, the interface agent sends the request to the middle agent, and the middle agent invokes information agents to provide the requested information. The information agents look for the information on the requested company and provide it to the interface agent. The Warren Interface displays the gathered information, and the RiskCritical agent calculates the risk of the new portfolio. Finally, the Comptroller agent updates the investor's portfolio if he or she decides to buy or sell an asset.

Even though the entire model has not been tested on real-life data, TextMiner showed great results when compared to traditional Bayesian approaches to classifying articles. With this in mind, Warren is a promising tool for portfolio monitoring.

5 Neural Networks

Neural networks provide yet another approach to identifying an optimal portfolio. We therefore review the basics of neural networks that are necessary for understanding several examples that use this type of computational intelligence technique in the portfolio selection process.

5.1 Theoretical Background

An artificial neural network, or just neural network (NN), is designed to imitate the actions of the human neural system, which consists of neurons and axons (the links between neurons) (see e.g., [31]). Similarly, a neural network consists of nodes and directed links between nodes. An NN relies on its ability to learn from training data sets in order to perform accurately on real data.

Several types of neural networks exist, the simplest one being the perceptron.

Definition 4.1 (Perceptron): The perceptron is a neural network that consists of two layers of nodes: input and output layers. The input nodes, $x = (x_1, \ldots, x_n)$, represent the input values, and the output nodes, $y = (y_1, \ldots, y_m)$, carry out mathematical calculations and output the results.

The function used to calculate the outputs is called the activation function. The most commonly used activation function in a perceptron is the sign function:

$$ y = \operatorname{sign}\left( \sum_{i=1}^{n} w_i x_i \right) = \operatorname{sign}(w \cdot x), \tag{20} $$

where $w = (w_1, \ldots, w_n)$ is the vector of weights assigned to the links from input to output nodes. The weights represent the strength of the connection between the nodes and are determined by a learning process using a training data set for which the expected outcome is known. The weights are updated after each training example by

$$ w_j^{(k+1)} = w_j^{(k)} + \lambda \left( y_i - \hat{y}_i^{(k)} \right) x_{ij}, \tag{21} $$

where $w_j^{(k)}$ is the weight of the $j$th input link after the $k$th iteration, $y_i$ and $\hat{y}_i^{(k)}$ are the expected and the computed output for the training example $x_i$, $x_{ij}$ is the value of the $j$th attribute of $x_i$, and $\lambda$ is the learning rate, which is determined by the user. The value of $\lambda$ belongs to the interval $[0, 1]$ and controls the amount of adjustment after each training sample. The learning rate is either a constant that stays small throughout the entire training process, to avoid overfitting to a specific training data element, or it is adaptive, in which case it starts with a large value that gets smaller during the training process.
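The update rule of Eq. (21) can be sketched in a few lines of Python. The data set, the constant bias input, the learning rate of 0.1, and the convention sign(0) = +1 are assumptions made only for this illustration.

```python
def sign(value):
    """Sign activation of Eq. (20); zero is mapped to +1 by convention here."""
    return 1 if value >= 0 else -1

def predict(w, x):
    return sign(sum(wj * xj for wj, xj in zip(w, x)))

def train_perceptron(samples, labels, rate=0.1, epochs=20):
    """Apply the update of Eq. (21) after every training example."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            y_hat = predict(w, x)
            # Weights change only when the example is misclassified (y != y_hat).
            w = [wj + rate * (y - y_hat) * xj for wj, xj in zip(w, x)]
    return w

# Hypothetical linearly separable data; the leading 1.0 acts as a bias input.
X = [[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]]
y = [1, 1, -1, -1]
w = train_perceptron(X, y)
predictions = [predict(w, x) for x in X]
print(w, predictions)
```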

The perceptron model is the simplest kind of neural network and is used only for classification purposes. However, more complex multilayer networks are much more powerful and applicable to other types of problems.

Definition 4.2 (Multilayer network): A multilayer network is a neural network that contains one or more hidden layers of nodes that perform calculations and enable more accurate weight adjustments.

In a multilayer network, the links between nodes can go only from a lower layer to a higher layer (the input being the lowest layer and the output the highest layer), which is the case in feed-forward networks, or the links can connect nodes in the same layer or even be directed towards the previous layers, which is the case in recurrent networks. Multilayer networks can use different activation functions, such as the linear, sigmoid, and hyperbolic tangent functions, among others. These functions allow more complex situations to be modeled by multilayer networks.

A neural network learning algorithm works by minimizing the sum of squared errors:

$$ \mathrm{Err}(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \tag{22} $$

where $\hat{y}$ depends on $w$. If $\hat{y}$ is replaced by $w \cdot x$, then the error function becomes quadratic in its parameters, and a global minimum can be easily found. However, if a non-linear function is used as an activation function, the hidden and output layers produce non-linear outputs, so finding the solution for $w$ is harder. Usually this problem is solved using a gradient descent method, which adjusts the weights in a direction that reduces the overall error function:

$$ w_j^{(k+1)} = w_j^{(k)} - \lambda \frac{\partial \mathrm{Err}(w)}{\partial w_j}, \tag{23} $$

where $\lambda$ is the learning rate. This method can be successfully used to learn the weights for the output layer. However, the computation is not as easy for the hidden layers, since their error term, $\partial \mathrm{Err} / \partial w_j$, cannot be determined without knowing what their output values should be. To solve this problem, the backpropagation algorithm is used, which consists of two phases in each iteration: the forward and the backward phase. In the forward phase, the weights computed in the previous iteration are used to compute the outputs of each node in the network, and the computations proceed in the forward direction. In the backward phase, the weights are updated in the reverse order: the weight update formula is applied to the last layer first, and then to each previous layer one by one going towards the first layer, which allows the output at the next layer to be used to compute the error at the previous layer.
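The two phases can be sketched for a small network with one tanh hidden layer and a single linear output node. The network size, the data, the initial weights, and the learning rate are assumptions chosen for illustration; the update itself is the batch form of Eq. (23) applied to the error of Eq. (22).

```python
import math

def forward(x, W1, W2):
    """Forward phase: a tanh hidden layer followed by one linear output node."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return hidden, sum(w * h for w, h in zip(W2, hidden))

def sum_squared_error(data, W1, W2):
    """Err(w) of Eq. (22)."""
    return 0.5 * sum((y - forward(x, W1, W2)[1]) ** 2 for x, y in data)

def backprop_step(data, W1, W2, rate):
    """One batch update of Eq. (23) with gradients obtained by backpropagation."""
    grad1 = [[0.0] * len(row) for row in W1]
    grad2 = [0.0] * len(W2)
    for x, y in data:
        hidden, out = forward(x, W1, W2)           # forward phase
        delta_out = out - y                        # error term of the output node
        for j, h in enumerate(hidden):
            grad2[j] += delta_out * h
            # Backward phase: the output error yields the hidden node's error term.
            delta_hidden = delta_out * W2[j] * (1.0 - h * h)
            for i, xi in enumerate(x):
                grad1[j][i] += delta_hidden * xi
    new_W1 = [[w - rate * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, grad1)]
    new_W2 = [w - rate * g for w, g in zip(W2, grad2)]
    return new_W1, new_W2

# Hypothetical data; the first input component is a constant bias term.
data = [([1.0, -1.0], -0.5), ([1.0, 0.0], 0.0), ([1.0, 1.0], 0.5)]
W1 = [[0.1, 0.2], [-0.1, 0.3]]
W2 = [0.2, -0.1]
error_before = sum_squared_error(data, W1, W2)
for _ in range(500):
    W1, W2 = backprop_step(data, W1, W2, rate=0.05)
error_after = sum_squared_error(data, W1, W2)
print(error_before, error_after)
```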

5.2 Applications to Portfolio Management

Neural networks have found several applications in portfolio management ([2],[4],[20],[33],[34]), ranging from forecasting the behavior of investment assets to optimizing the distribution of wealth among assets.

Lowe [20] used an analog NN to find the optimal distribution of wealth among investment assets. The optimal portfolio is constructed by minimizing the risk function defined by

$$ \mathrm{risk} = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left[ y(t) - \sum_{i=1}^{m} w_i x_i(t) \right]^2 }, \tag{24} $$

where $m$ is the number of assets, $T$ is the number of time periods, $y(t)$ is the market portfolio return, $x_i(t)$ is the return of the asset $i$ at time $t$, and $w_i$ is the proportion of wealth invested into the asset $i$. The risk function is subject to the non-negativity constraint $w_i \geq 0$ for every asset $i$ and the normalization constraint $\sum_{i=1}^{m} w_i = 1$. This linearly constrained optimization problem could be transformed into a nonlinear unconstrained optimization problem that minimizes the cost function

$$ E = \frac{1}{T} \sum_{t=1}^{T} \left[ y(t) - \sum_{i=1}^{m} w_i x_i(t) \right]^2 + \lambda \left[ \sum_{i=1}^{m} w_i - 1 \right]^2 + \mu \sum_{i=1}^{m} \frac{1}{1 + e^{\beta w_i}}. \tag{25} $$

The first term of this equation corresponds to minimizing the risk; the second term replaces the normalization constraint; and the third term transforms the non-negativity constraint into a barrier potential term, which has the form of the logistic function used in multilayer perceptron studies. The parameters $\lambda$ and $\mu$ are penalties for breaking the constraints and could be adjusted based on the investor's preferences.

The minimization of the cost function could be performed by using any standard gradient-based method, one of them being the Runge-Kutta integration algorithm with the possibility to adapt the step size based on the results from the previous iteration. The performance of the analog neural network in portfolio management was tested on seven stocks in the electricity sector of the UK market for 160 consecutive days starting on 26 September 1989.
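A minimal sketch of the cost function of Eq. (25) and its gradient follows. It uses plain fixed-step gradient descent instead of the Runge-Kutta integration mentioned above, and the returns, the penalties (λ = 1, μ = 0.1, β = 5), and the step size are hypothetical values chosen only for illustration.

```python
import math

def cost(w, x, y, lam, mu, beta):
    """Cost E of Eq. (25); x[t][i] is the return of asset i at time t."""
    T = len(y)
    tracking = sum((y[t] - sum(wi * xi for wi, xi in zip(w, x[t]))) ** 2
                   for t in range(T)) / T
    budget = lam * (sum(w) - 1.0) ** 2
    barrier = mu * sum(1.0 / (1.0 + math.exp(beta * wi)) for wi in w)
    return tracking + budget + barrier

def gradient(w, x, y, lam, mu, beta):
    """Analytic gradient of E with respect to each weight w_k."""
    T = len(y)
    resid = [y[t] - sum(wi * xi for wi, xi in zip(w, x[t])) for t in range(T)]
    grad = []
    for k, wk in enumerate(w):
        s = 1.0 / (1.0 + math.exp(beta * wk))       # logistic barrier value
        grad.append(-2.0 / T * sum(resid[t] * x[t][k] for t in range(T))
                    + 2.0 * lam * (sum(w) - 1.0)
                    - mu * beta * s * (1.0 - s))
    return grad

# Hypothetical returns of two assets over three periods and of the market.
asset_returns = [[0.01, 0.02], [0.03, -0.01], [-0.02, 0.01]]
market_returns = [0.015, 0.010, -0.005]
lam, mu, beta = 1.0, 0.1, 5.0

w = [0.2, 0.2]
cost_start = cost(w, asset_returns, market_returns, lam, mu, beta)
for _ in range(200):
    g = gradient(w, asset_returns, market_returns, lam, mu, beta)
    w = [wi - 0.05 * gi for wi, gi in zip(w, g)]
cost_end = cost(w, asset_returns, market_returns, lam, mu, beta)
print(w, cost_start, cost_end)
```

Because the constraints enter only as penalties, the resulting weights satisfy them approximately rather than exactly; larger values of λ and μ tighten the approximation.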

Another application of neural networks in portfolio management was described by Casas [2] to predict which of three considered classes of assets will outperform the other two. The three classes under consideration are stocks, bonds, and money markets. The idea is to invest all wealth into one class of assets for a given time interval, and then re-evaluate the performance of the asset classes and make a new decision for the next time interval. This approach does not diversify the portfolio to reduce risk, and it is based purely on the return of the three classes of assets rather than on the performance of individual assets.

A neural network that uses fundamental factors such as earnings, price-to-earnings ratios, interest rates, and inflation as input values is trained with the backpropagation algorithm to predict the behavior of the three classes of assets. The relative performance of the classes of assets is measured by the risk premium. The risk premium between two asset classes $A$ and $B$ is calculated as

$$ \Gamma_{AB} = E(A) - E(B), \tag{26} $$

where $E(x)$ is the expected return of the class $x$. Assuming that the risk premium follows a normal distribution, the probability that class $A$ outperforms class $B$ is given by

$$ P(A > B) = \mathrm{CND}(\Gamma_{AB}, \mu_{AB}, \sigma_{AB}^2), \tag{27} $$

where CND is the cumulative normal distribution function, $\mu_{AB}$ is the mean risk premium, and $\sigma_{AB}$ is the standard deviation of the risk premium (so that $\sigma_{AB}^2$ is its variance). The algorithm calculates the probabilities that stocks will outperform bonds, bonds will outperform money markets, and stocks will outperform money markets.
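Eqs. (26) and (27) can be evaluated with the error function from the Python standard library. The sketch below reads Eq. (27) literally, as the cumulative normal distribution evaluated at the predicted premium; the expected returns, mean, and standard deviation are hypothetical numbers.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def outperformance_probability(expected_a, expected_b, mean_premium, std_premium):
    """P(A > B) of Eq. (27), using the risk premium of Eq. (26)."""
    premium = expected_a - expected_b
    return normal_cdf(premium, mean_premium, std_premium)

# Hypothetical inputs: stocks expected at 8%, bonds at 5%,
# with a mean risk premium of 2% and a standard deviation of 2%.
p_stocks_over_bonds = outperformance_probability(0.08, 0.05, 0.02, 0.02)
print(round(p_stocks_over_bonds, 4))
```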

The performance of this algorithm was tested against a buy-and-hold strategy that buys and holds the S&P500 Index for the entire time period under consideration, which was the 12 months of the year 1999 in this case study. The NN approach outperformed the buy-and-hold strategy at the end of the 12 months. Moreover, it predicted correctly 92% of the time which asset class would outperform the other two classes.

Another example of the forecasting ability of NNs was tested in [34]. In this paper, the authors presented a portfolio management algorithm that consists of three parts. The first part uses an error correction neural network (ECNN) to forecast the behavior of assets. The second part uses a higher-level feedforward network to compute the excess return of one asset over another asset. Finally, the third part determines the optimal wealth distribution based on the excess returns.

Forecasting the behavior of each asset in the future is based on the expected return of the asset, which depends on the previous state of the asset, $s_t$, external influences, $u_t$, and the difference between the predicted output, $y_t$, and the observed output, $y_t^d$, at the previous iteration. Thus,

$$ s_{t+1} = f(s_t, u_t, y_t - y_t^d), \tag{28} $$

where $y_t = g(s_t)$ is determined based on the current state. In the suggested model, the expected return is predicted by an error correction neural network, which uses weight matrices of appropriate dimensions, $A$, $B$, $C$, and $D$, to transform the problem into the following set of equations:

$$ s_{t+1} = \tanh(A s_t + B u_t + D \tanh(C s_t - y_t^d)) \tag{29} $$

$$ y_t = C s_t. \tag{30} $$

The optimization of the parameters is obtained by finite unfolding in time using shared weights, which solves

$$ \min_{A, B, C, D} \frac{1}{T} \sum_{t=1}^{T} (y_t - y_t^d)^2. \tag{31} $$

After the parameters are established by an ECNN, the expected return, $f_i$, is calculated for each asset. Next, the difference between the expected returns of two assets is calculated for each pair of assets, $e_{ij} = f_i - f_j$. Finally, the cumulative excess return of each asset is calculated as the weighted sum of excess returns,

$$ e_i = \sum_{j=1}^{k} w_{ij} e_{ij}, \tag{32} $$

where $w_{ij} \geq 0$ is the weight assigned to the pair $(i, j)$ of assets. Based on the cumulative excess returns, the proportion of wealth that should be invested into the asset $i$ is calculated by

$$ a_i = \frac{\exp(e_i)}{\sum_{j=1}^{k} \exp(e_j)} = a_i(w, f_1, \ldots, f_k). \tag{33} $$

This form guarantees that exactly all wealth is distributed ($\sum_i a_i = 1$) and that the proportions of investment are non-negative ($a_i \geq 0$).
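The allocation rule of Eqs. (32) and (33) is a softmax over the cumulative excess returns. The sketch below uses hypothetical forecasts and uniform pair weights; subtracting the maximum before exponentiating is a standard numerical precaution that leaves the ratios unchanged.

```python
import math

def cumulative_excess_returns(forecasts, pair_weights):
    """e_i of Eq. (32), with e_ij = f_i - f_j."""
    k = len(forecasts)
    return [sum(pair_weights[i][j] * (forecasts[i] - forecasts[j])
                for j in range(k))
            for i in range(k)]

def allocations(excess):
    """a_i of Eq. (33): a softmax over the cumulative excess returns."""
    top = max(excess)                        # shift for numerical stability
    expo = [math.exp(e - top) for e in excess]
    total = sum(expo)
    return [v / total for v in expo]

forecasts = [0.02, 0.01, 0.03]               # hypothetical expected returns f_i
pair_weights = [[1.0] * 3 for _ in range(3)] # hypothetical weights w_ij
a = allocations(cumulative_excess_returns(forecasts, pair_weights))
print(a, sum(a))
```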

However, there are sometimes market share constraints given by the asset manager, and they are usually given in the form of an interval with a lower bound and an upper bound. If the mean of the available allocation for asset $i$ is denoted by $m_i$, and the admissible spread is given by $\Delta_i$, then the proportion $a_i$ should fall into the interval $[m_i - \Delta_i, m_i + \Delta_i]$. The vector of means, $m = (m_1, \ldots, m_k)$, is used as a benchmark distribution. To comply with the requirements of the manager, the excess return is adjusted by a bias vector $v$ so that

$$ e_i = v_i + \sum_{j=1}^{k} w_{ij} (f_i - f_j), \tag{34} $$

where the vector $v = (v_1, \ldots, v_k)$ could be determined in advance by setting the excess returns to zero and solving the system of nonlinear equations

$$ \begin{aligned} m_1 &= a_1(v_1, \ldots, v_k) \\ &\;\;\vdots \\ m_k &= a_k(v_1, \ldots, v_k). \end{aligned} \tag{35} $$

The non-unique solution of the form

$$ v_i = \ln(m_i) + c \tag{36} $$

could be simplified by setting $c = 0$.

Since the interval $[m_i - \Delta_i, m_i + \Delta_i]$ represents a constraint for the parameters $w_{i1}, \ldots, w_{ik}$, the optimal portfolio selection defined as the return maximization problem can be solved by solving a penalized maximization problem

$$ \max_{w} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{k} \left[ r_{it} \, a_i(f_{1t}, \ldots, f_{kt}, w) - \lambda \| a_i - m_i \|_{\Delta_i} \right], \tag{37} $$

where $r_{it}$ is the actual return of the asset $i$ at time $t$ and

$$ \| x \|_{\Delta} = \begin{cases} 0 & \text{if } |x| \leq \Delta \\ |x| - \Delta & \text{otherwise.} \end{cases} \tag{38} $$
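The role of the bias vector and of the tolerance norm can be checked numerically: with zero excess returns, the bias $v_i = \ln(m_i)$ of Eq. (36) reproduces the benchmark allocation, and deviations within the admissible spread cost nothing under Eq. (38). The benchmark, the spreads, and the helper names below are hypothetical.

```python
import math

def softmax(e):
    """Allocation rule of Eq. (33) applied to the values e_i."""
    top = max(e)
    expo = [math.exp(v - top) for v in e]
    total = sum(expo)
    return [v / total for v in expo]

def tolerance_norm(x, delta):
    """||x||_Delta of Eq. (38): zero inside the admissible spread."""
    return max(abs(x) - delta, 0.0)

def penalty(allocation, benchmark, spreads, lam):
    """Penalty part of Eq. (37) for one time step."""
    return lam * sum(tolerance_norm(a - m, d)
                     for a, m, d in zip(allocation, benchmark, spreads))

benchmark = [0.5, 0.3, 0.2]                  # hypothetical means m_i
bias = [math.log(m) for m in benchmark]      # Eq. (36) with c = 0
neutral = softmax(bias)                      # excess returns set to zero
print(neutral)
```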

The proposed model was tested on monthly data from the markets of the G7 countries. The data from September 1979 to June 1993 were used to train the network, and, based on the resulting coefficients, the model was tested on the period from July 1993 to May 1995. The results showed that the neural network based model outperformed the benchmark model by almost 10%.

A modification of the asset allocation step of this algorithm is presented in [33] and shows how to incorporate the risk of investing into the selected assets rather than determining the optimal portfolio based only on the return. The authors use an ECNN (developed in [34]) to forecast the return of assets, $r_i$, which is used to calculate the risk-adjusted expected excess return rather than the expected return that does not consider the risk related to the assets. The risk-adjusted excess return is defined by

$$ \rho_i = \sum_{t} \frac{r_{it} - r_f}{|r_{it} - r_{it}^d|}, \tag{39} $$

where $r_f$ is the risk-free asset return and $r_{it}^d$ is the actual return at time $t$.

Based on the risk-adjusted excess returns, the assets are ranked from the highest to the lowest, and all assets whose risk-adjusted excess return is higher than a pre-defined threshold value $\rho^*$ are selected to be included into the portfolio.

If we denote the set of assets included in the portfolio by $A$, the proportion of wealth invested in each of the selected assets is determined by

$$ w_i = \frac{\rho_i}{\sum_{j \in A} \rho_j}. \tag{40} $$
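The selection and weighting steps of Eqs. (39) and (40) can be sketched as follows. The forecasts, realised returns, risk-free rate, and threshold are hypothetical; the sketch assumes that a forecast never coincides exactly with the actual return (which would make a denominator in Eq. (39) zero) and that at least one asset passes the threshold.

```python
def risk_adjusted_return(forecast, actual, risk_free):
    """rho_i of Eq. (39), summed over the time steps."""
    return sum((r - risk_free) / abs(r - rd) for r, rd in zip(forecast, actual))

def select_and_weight(rho, threshold):
    """Keep assets with rho_i above the threshold and weight them by Eq. (40)."""
    chosen = {name: value for name, value in rho.items() if value > threshold}
    total = sum(chosen.values())
    return {name: value / total for name, value in chosen.items()}

# Hypothetical two-period forecasts and realised returns for three assets.
risk_free = 0.01
data = {
    "A": ([0.03, 0.02], [0.02, 0.03]),
    "B": ([0.02, 0.02], [0.04, 0.00]),
    "C": ([0.005, 0.0], [0.015, 0.01]),
}
rho = {name: risk_adjusted_return(f, a, risk_free) for name, (f, a) in data.items()}
weights = select_and_weight(rho, threshold=0.5)
print(rho, weights)
```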

This model was tested on the German stock market. Weekly data from 68 stocks over the period from November 1994 to June 1999 were used to train the neural network, and the algorithm's performance was tested on data from July 1999 to June 2000. Four portfolios were built with different numbers of stocks included: 5, 10, 15, and 20 stocks. The results of the algorithm were compared to the performance of the benchmark, which included all 68 stocks with weights chosen based on the shares of the stocks in the market. All portfolios produced by the NN algorithm outperformed the benchmark portfolio. Among the four derived portfolios, the portfolio with the smallest number of assets performed better than all the other portfolios.

Finally, [4] shows another application of NNs as a forecast model as well as a decision model for wealth allocation. A multilayer perceptron (MLP) with one hidden tanh layer (with $H$ hidden units) and a linear output layer is considered. The function represented by the MLP is given by

$$ f(x; \theta) = A_2 \tanh(A_1 x + b_1) + b_2, \tag{41} $$

where $x$ is the current distribution of wealth among assets, $A_1$ is an $H \times M$ matrix (with $M$ being the dimension of the input vector $x$), $A_2$ is an $N \times H$ matrix (with $N$ being the dimension of the output vector), $b_1$ is an $H$-element vector, $b_2$ is an $N$-element vector, and $\theta = (A_1, A_2, b_1, b_2)$ is the vector of parameters. The parameters represented by the vector $\theta$ are found by training the network to minimize a cost function; the cost function differs for the two types of model, the forecast model and the decision model. The optimization is performed by using a conjugate gradient descent algorithm. The gradient of the cost function with respect to the parameters is computed using the backpropagation algorithm for the MLP.

In the forecast model, a neural network is used to predict the returns of assets in the next time period, $\mu_{t+1|t}$, given explanatory variables $u_t$, which belong to the set of the available information, $I_t$. The network is trained to minimize the prediction error of the returns of assets in the next time period by using a quadratic loss function

$$ C_F(\theta) = \frac{1}{T} \sum_{t=1}^{T} \| f(u_t; \theta) - r_{t+1} \|^2 + C_{WD}(\theta) + C_{ID}(\theta), \tag{42} $$

where $\| \cdot \|$ is the Euclidean distance, $f(\cdot; \theta)$ is the function computed by the MLP, and $C_{WD}(\theta)$ and $C_{ID}(\theta)$ are terms used for regularization purposes. The regularization is needed to prevent overfitting by specifying a priori preferences on the weights in the neural network. $C_{WD}(\theta)$ is the weight decay. It tries to reduce the magnitude of the weights in the network by setting a penalty on the squared norm of all network weights. On the other hand, $C_{ID}(\theta)$ is the input decay. It tries to utilize useful inputs to train the network by penalizing the inputs that turn out to be unimportant.
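The MLP of Eq. (41) and the data-fit and weight-decay parts of Eq. (42) can be sketched as follows. The input-decay term is omitted because its exact form is not given here, and the decay coefficient, the toy parameters, and the function names are assumptions for illustration.

```python
import math

def mlp(x, A1, b1, A2, b2):
    """f(x; theta) of Eq. (41): one tanh hidden layer and a linear output layer."""
    hidden = [math.tanh(sum(a * xi for a, xi in zip(row, x)) + b)
              for row, b in zip(A1, b1)]
    return [sum(a * h for a, h in zip(row, hidden)) + b
            for row, b in zip(A2, b2)]

def forecast_cost(inputs, targets, params, decay):
    """Quadratic loss of Eq. (42) plus a weight-decay penalty (input decay omitted)."""
    A1, b1, A2, b2 = params
    T = len(inputs)
    fit = sum(sum((f - r) ** 2 for f, r in zip(mlp(u, A1, b1, A2, b2), r_next))
              for u, r_next in zip(inputs, targets)) / T
    # Penalty on the squared norm of all network weights (biases excluded here).
    weight_decay = decay * sum(a * a for M in (A1, A2) for row in M for a in row)
    return fit + weight_decay

# Hypothetical network with M = 2 inputs, H = 2 hidden units, and N = 1 output.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[0.0, 0.0]], [0.5])
value = forecast_cost([[0.1, 0.2]], [[0.5]], params, decay=0.1)
print(value)
```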

The neural network decision model uses the NN to directly determine the distribution $y_t$ of wealth among assets based on the explanatory variables $u_t$. The NN is trained to minimize the negative of the financial performance evaluation criterion

$$ C_D(\theta) = -\frac{1}{T} \sum_{t=1}^{T} W_t + C_{WD}(\theta) + C_{ID}(\theta) + C_{norm}, \tag{43} $$

where $C_{norm}$ is a preferred norm of the neural network. The preferred norm is important since, without it, two vector solutions that differ only by a constant multiple would be considered different solutions. The result would be that, for each vector $\theta$, there would be a direction with (almost) zero gradient, so there would be no local minimum. The preferred norm variable, which is given by the user, re-scales the parameters so that the norm constraint is achieved.

Training the MLP for the decision problem is more complex than for the forecast model. It includes a feedback loop, which induces a recurrence by inputting the distribution $y_{t-1}$ to determine the output $y_t$. Also, the backpropagation through time algorithm is used to compute the gradient by going back in time, starting from the last time step until the first one.

Sometimes, the user has an idea of the optimal portfolio or has a priori preferences on the portfolio structure (e.g., the proportion of wealth invested into stocks versus the proportion invested into bonds). In this case, instead of the preferred norm, the preferred portfolio is considered. Deviation from the preferred portfolio is penalized by

$$ C_{ref.port.} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{penalty}_{ref.port.}(y_t), \tag{44} $$

where the penalty is calculated as the squared distance between the network output and the reference portfolio.

For testing purposes, Toronto Stock Exchange market data from January 1971 to July 1996 were used, and the results outperformed the benchmark algorithms. It was also shown that the decision model is preferred to the forecast model as it relies on fewer assumptions.

6 Support Vector Machines

Finally, we give a general description of support vector machines and their application to portfolio selection.

6.1 Theoretical Background

The support vector machine (SVM) is one of the most commonly used classification techniques (see e.g., [31]). It classifies data into one of two groups by constructing a hyperplane that separates these two groups. The simplest situation is when the data are linearly separable. In this case, usually more than one hyperplane could be constructed to represent the boundary between the two classes with zero error. However, instead of minimizing the empirical error (the error produced on the training data), the best hyperplane should minimize the generalization error, that is, the error that could result from classifying real data based on the model developed from the training set. To explain how to minimize the generalization error, we first define the margin hyperplanes. We consider a hyperplane $b$ and create two other hyperplanes, $b_1$ and $b_2$, such that they are parallel to $b$ and as far as possible from $b$ (going in opposite directions from $b$) so that they do not touch any training data element. The distance between the hyperplanes $b_1$ and $b_2$ is called the margin of the hyperplane. Since several non-parallel hyperplanes usually exist in the linearly separable case, we select the pair of parallel hyperplanes that yields the largest margin. The decision boundary is represented by the hyperplane going straight through the middle between the two selected margin hyperplanes.

To formally define the best hyperplane, we consider a set of $N$ training examples, each of them denoted by $(x_i, y_i)$, where $x_i = (x_{i1}, \ldots, x_{id})$ corresponds to the attribute set for the $i$th training example and $y_i \in \{-1, 1\}$ is the class label. Given this notation, the decision boundary is given by $w \cdot x + b = 0$, where $w$ and $b$ are the parameters of the model, which are determined through training. Based on the calculated parameters, the decision for a new data sample $z$, which is not in the training set, is determined by

$$ y = \begin{cases} 1 & \text{if } w \cdot z + b > 0 \\ -1 & \text{if } w \cdot z + b < 0. \end{cases} \tag{45} $$

Since the margin hyperplanes are defined as

$$ w \cdot x + b = \pm 1, \tag{46} $$

each training data sample satisfies the conditions

$$ w \cdot x_i + b \geq 1 \quad \text{if } y_i = 1 \tag{47} $$

and

$$ w \cdot x_i + b \leq -1 \quad \text{if } y_i = -1. \tag{48} $$

These two conditions could be simplified to

$$ y_i (w \cdot x_i + b) \geq 1. \tag{49} $$

Furthermore, we denote the margin hyperplanes by $w \cdot x + b = +1$ and $w \cdot x + b = -1$, which implies that the margin, $d$, of the decision hyperplane is $d = 2 / \|w\|$. To simplify the calculations necessary to find the best hyperplane, $\|w\|$ is usually replaced by $\|w\|^2$. Thus, maximizing the margin is equivalent to minimizing

$$ f(w) = \frac{\|w\|^2}{2}. \tag{50} $$
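For a concrete picture of Eqs. (45) to (50), consider two training points, $(2, 2)$ with label $+1$ and $(0, 0)$ with label $-1$. The hyperplane $w = (0.5, 0.5)$, $b = -1$ places both points exactly on the margin hyperplanes, and its margin equals the distance between the points. The sketch below checks this; the data and function names are hypothetical, and points lying exactly on the boundary are assigned to class $-1$ by convention.

```python
import math

def classify(w, b, z):
    """Decision rule of Eq. (45); a point on the boundary is labeled -1 here."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else -1

def margin(w):
    """Margin d = 2 / ||w|| of the decision hyperplane."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def is_feasible(w, b, X, y, tol=1e-12):
    """Check the constraints y_i (w . x_i + b) >= 1 of Eq. (49)."""
    return all(yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0 - tol
               for x, yi in zip(X, y))

X = [[2.0, 2.0], [0.0, 0.0]]
y = [1, -1]
w, b = [0.5, 0.5], -1.0
print(is_feasible(w, b, X, y), margin(w), classify(w, b, [3.0, 3.0]))
```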

We can formally define the objective of the learning process in the SVM training phase as follows:

Definition 5.1. (Linear SVM: separable case): The learning task in SVM can be formalized as the following constrained optimization problem:

$$ \min_{w} \frac{\|w\|^2}{2} \tag{51} $$

$$ \text{subject to } y_i (w \cdot x_i + b) \geq 1 \quad \forall i = 1, 2, \ldots, N. \tag{52} $$

This problem of solving for $w$ and $b$ is a convex optimization problem (since the objective function is quadratic and the constraints are linear) that could be solved by using the standard Lagrange multiplier method, which rewrites the objective function in terms of the Lagrangian

$$ L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \lambda_i \left( y_i (w \cdot x_i + b) - 1 \right), \tag{53} $$

where the parameters $\lambda_i$ are called the Lagrange multipliers. The first term tries to minimize the objective function, while the second term replaces the constraint and must be minimized in order to reduce the penalty of not satisfying the constraint. When solving for the Lagrange multipliers, many of them turn out to be equal to zero. The few Lagrange multipliers that are non-zero correspond to the training examples that lie exactly on one of the margin hyperplanes and thus represent support vectors, which are used to find the values of $w$.

The Lagrangian problem could be transformed into a dual problem that involves finding only the Lagrange multipliers. The problem maximizes the dual Lagrangian

$$ L_D = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, x_i \cdot x_j, \tag{54} $$

where the Lagrange multipliers must be non-negative and satisfy $\sum_{i=1}^{N} \lambda_i y_i = 0$.

The solution to this problem can be found using numerical techniques such as quadratic programming. The solution for $w$ is calculated by

$$ w = \sum_{i=1}^{N} \lambda_i y_i x_i \tag{55} $$

and $b$ is obtained by solving

$$ \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0. \tag{56} $$

The decision boundary can be expressed as

$$ \left( \sum_{i=1}^{N} \lambda_i y_i \, x_i \cdot x \right) + b = 0. \tag{57} $$
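As a small worked case, take the two training points $(2, 2)$ with label $+1$ and $(0, 0)$ with label $-1$. With equal multipliers $\lambda_1 = \lambda_2 = \lambda$, the dual Lagrangian reduces to $2\lambda - 4\lambda^2$, which is maximal at $\lambda = 1/4$. The sketch below recovers $w$ from Eq. (55) and $b$ from a support vector, for which $y_i (w \cdot x_i + b) = 1$; the data and function names are hypothetical, and the multipliers are supplied by hand rather than computed by a quadratic programming solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(multipliers, X, y):
    """L_D of Eq. (54)."""
    n = len(X)
    quad = sum(multipliers[i] * multipliers[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(multipliers) - 0.5 * quad

def recover_hyperplane(multipliers, X, y, tol=1e-12):
    """Compute w by Eq. (55) and b from the first support vector."""
    dim = len(X[0])
    w = [sum(l * yi * x[d] for l, yi, x in zip(multipliers, y, X))
         for d in range(dim)]
    sv = next(i for i, l in enumerate(multipliers) if l > tol)
    # For a support vector y_i (w . x_i + b) = 1, and y_i is +1 or -1,
    # so b = y_i - w . x_i.
    b = y[sv] - dot(w, X[sv])
    return w, b

X = [[2.0, 2.0], [0.0, 0.0]]
y = [1, -1]
lambdas = [0.25, 0.25]
w, b = recover_hyperplane(lambdas, X, y)
print(w, b, dual_objective(lambdas, X, y))
```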

The procedure described above for finding the optimal decision boundary works well if the training data are linearly separable. However, this is not always the case. Very often, any decision boundary would misclassify some training examples. The problem could be approached by introducing positive slack variables $\xi_i$ that represent the error of the decision boundary for the training sample $i$ [7]. Thus, the new objective function

$$ f(w) = \frac{\|w\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right) \tag{58} $$

tends to minimize the error in addition to minimizing the original objective function. Here $C$ represents the penalty for misclassification and is determined by the user. The new objective function and the inequality constraints

$$ \begin{aligned} w \cdot x_i + b &\geq 1 - \xi_i && \text{if } y_i = 1, \\ w \cdot x_i + b &\leq -1 + \xi_i && \text{if } y_i = -1, \end{aligned} \tag{59} $$

could be easily transformed into the Lagrangian, where each Lagrange multiplier is bounded above by the value of the parameter $C$:

$$ 0 \leq \lambda_i \leq C. \tag{60} $$

This problem could be approached by using quadratic programming.

In some instances, however, a better solution exists than reducing the misclassification. A non-linear decision boundary might exist that correctly classifies training data that are not separable by a linear method. The idea is to transform the original coordinates of the training sample $x$ into a new space $\Phi(x)$ so that a linear decision boundary can be used to correctly separate the data in the new space. The difficulty with this approach is to determine the mapping that will lead to the desired results. Now, the problem of learning from training data becomes:

Definition 5.2. (Nonlinear SVM): The learning task in a non-linear SVM can be formalized as the following constrained optimization problem:

$$ \min_{w} \frac{\|w\|^2}{2} \tag{61} $$

$$ \text{subject to } y_i (w \cdot \Phi(x_i) + b) \geq 1 \quad \forall i = 1, 2, \ldots, N. \tag{62} $$

Solving this problem by transforming it into a Lagrangian is usually not easy because it requires calculating the dot product between the transformed vectors $\Phi(x_i)$ and $\Phi(x_j)$, which might be very complicated. However, since the dot product is a measure of similarity between two instances $x_i$ and $x_j$, we can solve this problem by applying the kernel trick, which computes the similarity between two instances in the transformed space by using the original attribute set [1]. The kernel function

$$ K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \tag{63} $$

is the function that calculates the similarity of the instances $x_i$ and $x_j$ by using the attributes in the original space, which simplifies the computation of the dot product. The use of the kernel trick also does not require knowledge of the exact transformation $\Phi$, because the kernel functions used in a non-linear SVM must satisfy Mercer's theorem, which guarantees that a corresponding transformation exists:

Theorem 5.3. (Mercer's theorem): A kernel function $K$ can be expressed as $K(u, v) = \Phi(u) \cdot \Phi(v)$ if and only if, for any function $g(x)$ such that $\int g(x)^2 \, dx$ is finite, $\int\!\!\int K(x, y) \, g(x) \, g(y) \, dx \, dy \geq 0$.
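The kernel trick can be verified on a simple case that is not taken from the text: for two-dimensional inputs, the polynomial kernel $K(u, v) = (u \cdot v)^2$ equals the dot product under the explicit map $\Phi(u) = (u_1^2, \sqrt{2}\, u_1 u_2, u_2^2)$, so the similarity in the transformed space is obtained without ever constructing $\Phi$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_map(u):
    """Explicit map Phi for two-dimensional inputs."""
    return [u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2]

def poly_kernel(u, v):
    """K(u, v) = (u . v)^2, computed entirely in the original space."""
    return dot(u, v) ** 2

u, v = [1.0, 2.0], [3.0, 4.0]
explicit = dot(feature_map(u), feature_map(v))   # dot product in the new space
implicit = poly_kernel(u, v)                     # kernel trick
print(explicit, implicit)
```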

The support vector machine is one of the most commonly used classifiers and has been applied in different fields, one of them being portfolio management.

6.2 Applications to Portfolio Management

The support vector machine (SVM) has been applied to the classification of stocks into one of two classes: the stocks with exceptionally high returns (Class+1) and the stocks with unexceptional returns (Class-1) [9]. SVM tries to minimize a bound on the generalization error rather than the empirical error, as many other approaches do. It uses several financial indicators to determine the performance of each asset. The $n$ financial indicators of the asset $i$ are represented as a vector $x_i = (x_{i1}, \ldots, x_{in})$. The expected future return of the stock is a binary dependent variable $y_i = \pm 1$, where $+1$ represents a Class+1 asset and $-1$ represents a Class-1 asset. Thus, the training set of $m$ companies consists of pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \subset \mathbb{R}^n \times \{\pm 1\}$. The classifier (SVM) tries to learn from the training set, and it behaves as a function that maps the input variables $x$ into an output value $y$. The misclassification is reduced by adjusting the parameters.

SVM is a classifier that tries to construct an optimal separating hyperplane between two classes by minimizing the bound on the misclassification risk. For linearly separable patterns, the traditional approach using quadratic programming is utilized to maximize the dual Lagrangian.

In the case of non-separable patterns, different kernel functions could be used. In the test case, the radial basis kernel, $K(x, y) = \exp(-\|x - y\|^2)$, was used.

The method was tested with Australian Stock Exchange data between 1992 and 2000. The data from three years were used for training and validation of the estimated parameters, which were then used to predict the performance of stocks in the next year. Eight groups of financial indicators were used to calculate the performance of stocks: Return on Capital, Profitability, Leverage, Investment, Growth, Short-term Liquidity, Return on Investment, and Risk. The data for each stock were converted into an eight-element input vector. For the training samples, the output of each stock was determined by its actual annual performance, with the top 25% of stocks being selected into Class+1 and the remaining stocks being assigned to Class-1. For testing purposes, the stocks selected into the Class+1 group were given equal weights in the portfolio. The created portfolio's return outperformed the equally weighted portfolio consisting of all available stocks, which was used as the benchmark portfolio.
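The labeling rule for the training samples and the equal weighting of the selected stocks can be sketched as follows. The tickers and returns are hypothetical, and in this sketch the training labels stand in for the classifier's Class+1 predictions, which in the study would come from the trained SVM.

```python
def label_stocks(annual_returns, top_fraction=0.25):
    """Assign +1 to the top fraction of stocks by annual return and -1 to the rest."""
    ranked = sorted(annual_returns, key=annual_returns.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    return {name: (1 if name in top else -1) for name in annual_returns}

def equal_weight_portfolio(labels):
    """Give every Class+1 stock the same weight in the portfolio."""
    chosen = [name for name, label in labels.items() if label == 1]
    return {name: 1.0 / len(chosen) for name in chosen}

# Hypothetical annual returns of eight stocks.
annual_returns = {"S1": 0.12, "S2": -0.03, "S3": 0.25, "S4": 0.07,
                  "S5": 0.01, "S6": 0.18, "S7": -0.10, "S8": 0.04}
labels = label_stocks(annual_returns)
portfolio = equal_weight_portfolio(labels)
print(portfolio)
```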

Summary of approaches. We have presented several attempts to use in-telligence systems in portfolio management. Genetic algorithms, rule-basedexpert systems, neural networks, and support vector machines have all con-tributed towards finding the optimal distribution of wealth among availableassets. With the exception of genetic algorithm, all other methods are basedon the ability to learn from examples and approximation of algorithm’s pa-rameters due to training samples. This could lead to overfitting of the pa-rameters to specific type of data or a specific sample, which might not beapplicable in other situations.

Moreover, all of the approaches do not consider the relationship betweenthe characteristics of an asset. For example, return and risk are known tousually move in the same direction, that is higher the return of the asset,higher the risk of that asset. However, the presented approaches do not takeinto consideration this and many other existing relationships.

Furthermore, the return, the risk, and other characteristics of an asset are assumed to be precisely known for each asset in consideration. In reality, this is not always the case as the best we can do is to predict the future return and risk. Sometimes, these predictions are not correct, but all the presented techniques rely on precise knowledge of these values.

To address the drawbacks of the presented approaches, we propose a new approach to portfolio optimization. The novel approach is based on multi-criteria decision making and fuzzy integration over intervals.

7 Fuzzy Integration and Decision Making

Before going into the details of the portfolio optimization problem, we review the basics of fuzzy measures, fuzzy integration, and multi-criteria decision making.

7.1 Theoretical Background

A multi-criteria decision making (MCDM) problem seeks the optimal choice among a (finite) set of alternatives. It can formally be defined as a triple $(X, I, (\succeq_i)_{i \in I})$ where

• $X \subset X_1 \times \cdots \times X_n$ is the set of alternatives, with each set $X_i$ representing the set of values of attribute $i$.

• $I$ is the (finite) set of criteria (or attributes).

• $\forall i \in I$, $\succeq_i$ is a preference relation (a weak order) over $X_i$.

The next task is to “combine” the preference relations $\succeq_i$ of an alternative into a global value for the alternative such that the final order of the alternatives is in agreement with the decision maker’s partial preferences. The natural way to construct a global preference is to use a utility function for each attribute to reflect the partial preferences of the decision-maker, and then to combine these monodimensional utilities into a global utility function using an aggregation operator. The utility functions $u_i : X_i \to \mathbb{R}$, such that for all $x_i, y_i \in X_i$, $u_i(x_i) \ge u_i(y_i)$ if and only if $x_i \succeq_i y_i$, scale the values of all attributes onto a common scale. The existence of monodimensional utility functions is guaranteed under relatively loose hypotheses by the work presented in [17].

Numerous aggregation operators could be used to combine monodimensional utilities into a single number that represents the value of an alternative. Two simple approaches that correspond to optimistic and pessimistic behavior of the decision maker are the maximax and maximin strategies, respectively, assuming that the goal of the decision-maker is to maximize the utility. The maximax method compares the utilities of all attributes of an alternative and chooses the highest utility value, $\max_i u_i(x_i)$, to represent the global utility of the alternative $x = (x_1, \ldots, x_n)$. This approach reflects the optimistic behavior of the decision-maker since he/she is concerned only with the attribute that has the highest utility for the given alternative. The maximax method tries to maximize the best criterion:

\[
\max_{x \in X} \Bigl( \max_{i \in \{1,\ldots,n\}} u_i(x_i) \Bigr). \tag{64}
\]

On the contrary, the maximin method reflects the pessimistic behavior of the decision-maker, as the decision-maker is concerned only with the attribute that could result in the worst value. This method compares the utilities of all attributes of an alternative and chooses the lowest utility value, $\min_i u_i(x_i)$, to represent the global utility of the alternative $x$. The decision-maker tries to maximize the value of the worst case scenario:

\[
\max_{x \in X} \Bigl( \min_{i \in \{1,\ldots,n\}} u_i(x_i) \Bigr). \tag{65}
\]

To allow for a position between these extremes when making a decision, a simple combination of the maximax and maximin approaches is achieved by a weighted aggregation operator

\[
\max_{\text{all alternatives}} \Bigl( \alpha \max_i u_i(x_i) + \beta \min_i u_i(x_i) \Bigr), \tag{66}
\]

with $\alpha + \beta = 1$, where $\alpha \ge 0$ and $\beta \ge 0$ are weights given by the decision maker. This approach simplifies to the optimistic case if $\alpha = 1$ and to the pessimistic case if $\beta = 1$.
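The three strategies in (64) to (66) can be illustrated with a minimal Python sketch. The function names and the utility values below are hypothetical and serve only to show how the choice of operator changes the selected alternative.

```python
def maximax(utilities):
    """Optimistic global utility of one alternative: its best criterion."""
    return max(utilities)


def maximin(utilities):
    """Pessimistic global utility of one alternative: its worst criterion."""
    return min(utilities)


def weighted_max_min(utilities, alpha):
    """Weighted combination alpha*max + beta*min with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * max(utilities) + (1.0 - alpha) * min(utilities)


def best_alternative(alternatives, score):
    """Name of the alternative whose utility vector maximizes `score`."""
    return max(alternatives, key=lambda name: score(alternatives[name]))


# Hypothetical utilities, already scaled to [0, 1], for three alternatives.
alternatives = {
    "bond": [0.4, 0.5, 0.6],
    "stock": [0.9, 0.1, 0.5],
    "fund": [0.8, 0.3, 0.7],
}

optimistic = best_alternative(alternatives, maximax)
pessimistic = best_alternative(alternatives, maximin)
balanced = best_alternative(alternatives, lambda u: weighted_max_min(u, 0.5))
```

With these made-up numbers the optimistic rule picks the alternative with the single best criterion, the pessimistic rule picks the one with the least bad worst criterion, and the equally weighted rule picks a third alternative.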

These simple approaches are very tempting to use for quick decisions. However, they focus only on a few criteria and ignore the impact of the other characteristics of the alternatives, which often does not suit the situation. Thus, we usually need to consider more complex aggregation operators that take all attributes into consideration. The simplest and most natural of them is the weighted sum approach, in which the decision-maker is asked to provide weights, $w_i$, that reflect the importance of each criterion. Thus, the global utility of alternative $x = (x_1, \ldots, x_n) \in X$ is given by

\[
u(x) = \sum_{i=1}^{n} w_i u_i(x_i). \tag{67}
\]

The best alternative is the one that maximizes this value. Even though this approach is attractive due to its low complexity, it can be shown that using an additive aggregation operator, such as the weighted sum, is equivalent to assuming that all the attributes are independent [22]. In practice, this is usually not realistic and therefore we need to turn to non-additive approaches, that is, to aggregation operators that are not linear combinations of partial preferences.

Before approaching non-additive methods, we give the definition of a non-additive measure, a tool for building non-additive aggregation operators.

Definition 6.1. (Non-additive measure): Let $I$ be the set of attributes and $\mathcal{P}(I)$ the power set of $I$. A set function $\mu : \mathcal{P}(I) \to [0, 1]$ is called a non-additive measure (or a fuzzy measure) if it satisfies the following three axioms:

(1) $\mu(\emptyset) = 0$: the empty set has no importance.
(2) $\mu(I) = 1$: the maximal set has maximal importance.
(3) $\mu(B) \le \mu(C)$ if $B, C \in \mathcal{P}(I)$ and $B \subset C$: a new criterion added cannot make the importance of a coalition (a set of criteria) diminish.

Of course, any probability measure is also a non-additive measure. Therefore non-additive measure theory is an extension of traditional measure theory. Moreover, a notion of integral can also be defined over such measures.

A non-additive integral, such as the Choquet integral [5], is a type of general averaging operator that can model the behavior of a decision maker. The decision-maker provides a set of values of importance, this set being the values of the non-additive measure with respect to which the non-additive integral is computed. Formally, the Choquet integral is defined as follows:

Definition 6.2. (Choquet integral): Let $\mu$ be a non-additive measure on $(I, \mathcal{P}(I))$ and $f : I \to \mathbb{R}^{+}$ an application. The Choquet integral of $f$ w.r.t. $\mu$ is defined by:

\[
(C)\int_I f \, d\mu = \sum_{i=1}^{n} \bigl( f(\sigma(i)) - f(\sigma(i-1)) \bigr) \, \mu(A_{(i)}),
\]

where $\sigma$ is a permutation of the indices such that $f(\sigma(1)) \le \cdots \le f(\sigma(n))$, $A_{(i)} = \{\sigma(i), \ldots, \sigma(n)\}$, and $f(\sigma(0)) = 0$, by convention.
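As a rough illustration of Definition 6.2, the following Python sketch evaluates the discrete Choquet integral for a measure stored as a dictionary keyed by `frozenset`. The representation, the function name, and the numbers are assumptions made for this example only.

```python
def choquet_integral(f, mu):
    """Discrete Choquet integral of f (criterion -> value >= 0) w.r.t. mu.

    mu maps every coalition (a frozenset of criteria) to its measure.
    """
    order = sorted(f, key=f.get)           # sigma: criteria by ascending value
    total, previous = 0.0, 0.0             # previous plays the role of f(sigma(0)) = 0
    for k, criterion in enumerate(order):
        coalition = frozenset(order[k:])   # A_(i) = {sigma(i), ..., sigma(n)}
        total += (f[criterion] - previous) * mu[coalition]
        previous = f[criterion]
    return total


# Hypothetical non-additive measure on two criteria.
mu = {
    frozenset(): 0.0,
    frozenset({"return"}): 0.3,
    frozenset({"risk"}): 0.5,
    frozenset({"return", "risk"}): 1.0,
}
f = {"return": 0.8, "risk": 0.4}
value = choquet_integral(f, mu)            # 0.4 * 1.0 + (0.8 - 0.4) * 0.3

# With an additive measure the integral reduces to a weighted sum.
mu_additive = {
    frozenset(): 0.0,
    frozenset({"return"}): 0.4,
    frozenset({"risk"}): 0.6,
    frozenset({"return", "risk"}): 1.0,
}
additive_value = choquet_integral(f, mu_additive)
```

The second call shows the link to the weighted sum (67): when the measure is additive, the result equals the sum of the values weighted by the singleton measures.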

It can be shown that many aggregation operators can be represented by Choquet integrals with respect to some fuzzy measure. However, note that there are other non-additive approaches to decision making besides the Choquet integral, one of them being the Sugeno integral [27]:

Definition 6.3. (Sugeno integral): Let $\mu$ be a fuzzy measure on $(I, \mathcal{P}(I))$ and $f : I \to [0, +\infty]$ an application. The Sugeno integral of $f$ w.r.t. $\mu$ is defined by:

\[
(S)\int f \circ \mu = \bigvee_{i=1}^{n} \bigl( f(\sigma(i)) \wedge \mu(A_{(i)}) \bigr),
\]

where $\vee$ is the supremum, $\wedge$ is the infimum, and $\sigma$ and $A_{(i)}$ are as in Definition 6.2.
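For comparison, Definition 6.3 can be sketched in the same style. The dictionary representation and the numbers are again assumptions of this example, and the ordering of the criteria follows the ascending convention of Definition 6.2.

```python
def sugeno_integral(f, mu):
    """Discrete Sugeno integral: max over i of min(f(sigma(i)), mu(A_(i)))."""
    order = sorted(f, key=f.get)           # criteria by ascending value
    return max(
        min(f[criterion], mu[frozenset(order[k:])])
        for k, criterion in enumerate(order)
    )


# Hypothetical fuzzy measure on two criteria.
mu = {
    frozenset(): 0.0,
    frozenset({"return"}): 0.3,
    frozenset({"risk"}): 0.5,
    frozenset({"return", "risk"}): 1.0,
}
f = {"return": 0.8, "risk": 0.4}
value = sugeno_integral(f, mu)             # max(min(0.4, 1.0), min(0.8, 0.3))
```

Only `min` and `max` are used, so the result is always one of the input values or one of the measure values, which is why this integral suits ordinal (qualitative) scales.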

Even though the Choquet and the Sugeno integrals are structurally similar, their applications are very different. The Choquet integral is generally used in quantitative measurements, while the Sugeno integral has found more applications in qualitative approaches. However, we restrict ourselves to quantitative approaches.

Although the Choquet integral is well suited for quantitative measurements, it has a major drawback. The decision maker needs to input a value of importance for each subset of attributes, that is, a total of $2^n$ values. More precisely, since the values of the empty set and of the entire set are fixed by the definition of a fuzzy measure, the exact number of values required from the decision-maker is $2^n - 2$. This still leads to exponential complexity and is therefore intractable. However, we can overcome intractability by using a 2-additive measure to limit the complexity to $O(n^2)$ (as shown in [3]) and still get accurate results.

Before giving the definition of a 2-additive measure, we need to define the notion of interaction indices of orders 1 and 2. The importance of an attribute (or the interaction index of degree 1) is best described as the value this attribute brings to each coalition it does not belong to. It is given by the Shapley value [26]:

Definition 6.4. (Shapley value): Let $\mu$ be a non-additive measure over $I$. The Shapley value of index $i$ is defined by:

\[
v(i) = \sum_{B \subset I \setminus \{i\}} \gamma_I(B) \, \bigl[ \mu(B \cup \{i\}) - \mu(B) \bigr] \tag{68}
\]

with

\[
\gamma_I(B) = \frac{(|I| - |B| - 1)! \cdot |B|!}{|I|!} \tag{69}
\]

and $|B|$ denoting the cardinality of $B$.

While the Shapley value gives the importance of a single attribute to the entire set of attributes, the interaction index of degree 2 represents the interaction between two attributes, and is defined by ([8],[13]):

Definition 6.5. (Interaction index of degree 2): Let $\mu$ be a non-additive measure over $I$. The interaction index between $i$ and $j$ is defined by:

\[
I(i, j) = \sum_{B \subset I \setminus \{i,j\}} \xi_I(B) \cdot \bigl( \mu(B \cup \{i, j\}) - \mu(B \cup \{i\}) - \mu(B \cup \{j\}) + \mu(B) \bigr)
\]

with $\xi_I(B) = \dfrac{(|I| - |B| - 2)! \cdot |B|!}{(|I| - 1)!}$.

The interaction indices belong to the interval $[-1, +1]$ and

• $I(i, j) > 0$ if the attributes $i$ and $j$ are complementary;
• $I(i, j) < 0$ if the attributes $i$ and $j$ are redundant;
• $I(i, j) = 0$ if the attributes $i$ and $j$ are independent.
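Definitions 6.4 and 6.5 can be evaluated directly by enumerating coalitions. The sketch below assumes the measure is stored as a dictionary keyed by `frozenset`; the three-criterion measure is a made-up example, and the enumeration is exponential in the number of criteria, so it is only meant for small sets.

```python
from itertools import combinations
from math import factorial


def shapley_value(i, criteria, mu):
    """Shapley value of criterion i, following (68) and (69)."""
    others = [c for c in criteria if c != i]
    n = len(criteria)
    total = 0.0
    for size in range(len(others) + 1):
        gamma = factorial(n - size - 1) * factorial(size) / factorial(n)
        for subset in combinations(others, size):
            b = frozenset(subset)
            total += gamma * (mu[b | {i}] - mu[b])
    return total


def interaction_index(i, j, criteria, mu):
    """Interaction index of degree 2 between criteria i and j."""
    others = [c for c in criteria if c not in (i, j)]
    n = len(criteria)
    total = 0.0
    for size in range(len(others) + 1):
        xi = factorial(n - size - 2) * factorial(size) / factorial(n - 1)
        for subset in combinations(others, size):
            b = frozenset(subset)
            total += xi * (mu[b | {i, j}] - mu[b | {i}] - mu[b | {j}] + mu[b])
    return total


# Hypothetical monotone measure on three criteria.
criteria = ["return", "risk", "maturity"]
mu = {
    frozenset(): 0.0,
    frozenset({"return"}): 0.2,
    frozenset({"risk"}): 0.3,
    frozenset({"maturity"}): 0.1,
    frozenset({"return", "risk"}): 0.7,
    frozenset({"return", "maturity"}): 0.4,
    frozenset({"risk", "maturity"}): 0.5,
    frozenset({"return", "risk", "maturity"}): 1.0,
}

importance = {c: shapley_value(c, criteria, mu) for c in criteria}
i_return_risk = interaction_index("return", "risk", criteria, mu)
```

For this measure the Shapley values sum to one, and the interaction between return and risk is positive, which the list above reads as complementary criteria.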

Even though we can define interaction indices of any order, defining the importance of attributes and the interaction indices between two attributes is generally enough in MCDM problems. Thus, 2-additive measures constitute a feasible and accurate tool in this setting. The formal definition of a 2-additive measure follows [8]:

Definition 6.6. (2-additive measure): A non-additive measure $\mu$ is called 2-additive if all its interaction indices of order equal to or larger than 3 are null and at least one interaction index of degree two is not null.

We can also show [12] that the Shapley values and the interaction indices of order two offer us an elegant way to represent a Choquet integral. Therefore, in a decision-making problem, we can ask the decision maker to give the Shapley values, $I_i$, and the interaction indices, $I_{ij}$, and then use the Choquet integral w.r.t. a 2-additive measure, $\mu$, to obtain the aggregation operator:

\[
(C)\int_I f \, d\mu = \sum_{I_{ij} > 0} \bigl( f(i) \wedge f(j) \bigr) I_{ij}
+ \sum_{I_{ij} < 0} \bigl( f(i) \vee f(j) \bigr) |I_{ij}|
+ \sum_{i=1}^{n} f(i) \Bigl( I_i - \frac{1}{2} \sum_{j \neq i} |I_{ij}| \Bigr).
\]

This form of the Choquet integral is an accurate and practical approach to many situations, one of them being portfolio management.
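The 2-additive form of the Choquet integral needs only the Shapley values and the pairwise interaction indices. A minimal Python sketch follows, with hypothetical names and numbers; interaction indices are stored once per unordered pair, and missing pairs are treated as zero, which is an assumption of this example.

```python
from itertools import combinations


def choquet_2additive(f, shapley, interaction):
    """Choquet integral w.r.t. a 2-additive measure.

    f           : criterion -> utility value
    shapley     : criterion -> Shapley value I_i
    interaction : frozenset({i, j}) -> interaction index I_ij
    """
    total = 0.0
    for i, j in combinations(sorted(f), 2):
        i_ij = interaction.get(frozenset((i, j)), 0.0)
        if i_ij > 0:                       # complementary pair: conjunctive term
            total += min(f[i], f[j]) * i_ij
        elif i_ij < 0:                     # redundant pair: disjunctive term
            total += max(f[i], f[j]) * abs(i_ij)
    for i in f:
        half_interactions = 0.5 * sum(
            abs(interaction.get(frozenset((i, j)), 0.0)) for j in f if j != i
        )
        total += f[i] * (shapley[i] - half_interactions)
    return total


# Hypothetical inputs from a decision maker.
f = {"return": 0.8, "risk": 0.4, "maturity": 0.6}
shapley = {"return": 0.35, "risk": 0.45, "maturity": 0.20}
interaction = {
    frozenset(("return", "risk")): 0.2,
    frozenset(("return", "maturity")): 0.1,
    frozenset(("risk", "maturity")): 0.1,
}
value = choquet_2additive(f, shapley, interaction)

# A two-criterion case with a redundant (negative) interaction.
redundant = choquet_2additive(
    {"a": 0.8, "b": 0.4},
    {"a": 0.5, "b": 0.5},
    {frozenset(("a", "b")): -0.4},
)
```

The cost is quadratic in the number of criteria, since only pairs are visited, in contrast to the exponential number of coalition values needed by the general definition.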

7.2 Application to Portfolio Management

We propose two different algorithms that make use of the multi-criteria decision making approach to find the optimal portfolio allocation. The first, a two-stage algorithm, uses a multi-criteria decision making setting to rank all assets. Based on the rank, good assets are selected among the thousands of assets that exist in the market, and wealth is invested in these selected assets only. The second stage of the algorithm utilizes another MCDM setting to determine the exact wealth allocation among the assets to best suit the goals of the investor.

The second algorithm utilizes similar multi-criteria decision making settings but starts by clustering all assets into three groups based on their risk. Based on the investor’s acceptable level of risk, the distribution of wealth among the three groups of assets is determined, and an MCDM setting is created to determine the exact allocation of wealth within each cluster.

We first define a multi-criteria decision making problem by considering the set of all assets as the set of alternatives. We determine a finite set of criteria that characterize investment assets: return (R), risk (r), time to maturity (t), transaction cost (c), etc., and define the utility functions for each of them. The simplest method to choose rational utility functions is to provide mappings from the values of an alternative onto the interval [0, 1], $f : X_i \to [0, 1]$. For the return of an asset, this could mean that the highest realistic return is mapped to 1, the lowest return to 0, and the other returns are proportionally mapped to values between 0 and 1. The utility of risk could be defined in a similar fashion, but taking the reciprocal of the risk, since a high value of risk is less desired than a low value of risk. Similar arguments hold for time to maturity and transaction cost. Once the utility function for each criterion is defined, we proceed to the calculation of the value of each asset.

If the decision maker (the investor) is concerned only with the return or only with the level of risk, then the maximax strategy could be used to rank all the assets, with a high importance given to return in the first case and to risk in the second case, and low importance given to all the other attributes. However, usually an investor wants to maximize the return for a given level of risk or minimize the risk while attaining the required return level in a certain time period. Thus, all the criteria have some influence on the decision. The decision-maker is asked to input the Shapley value of each criterion, that is, the importance of each criterion relative to the other criteria. Since the attributes are mutually dependent (e.g., higher return usually implies higher risk, longer time to maturity usually means higher return, etc.), the weighted sum approach does not promise to give accurate results. However, we can approximate the interaction indices for each pair of attributes by estimating the correlation between their values, and use the Choquet integral with respect to a 2-additive measure, defined by the Shapley values and interaction indices of order 2, to calculate the global value of an asset. The Choquet integral values are used to order the assets, giving higher rank to the assets with the higher value of the Choquet integral.

The top $n$ assets are chosen to proceed to the second stage of the algorithm. The number $n$ is either pre-defined by the investor, or all the assets with a Choquet value above a threshold specified by the investor are selected. We denote the set of all assets that are used to create the portfolio by $A$. The second stage of the algorithm aims to find the optimal distribution of wealth among the $n$ selected assets, $w = (w_1, \ldots, w_n)$, by considering another multi-criteria decision making setting. The set of alternatives is defined as the set of all possible portfolios using only the assets selected based on their rank. The set of criteria is unchanged from the first stage of the algorithm. However, the values of the criteria are defined in terms of their values for each asset in the portfolio as follows:

• The return of the portfolio is

\[
R(w) = \sum_{i=1}^{n} R_i w_i. \tag{70}
\]

• The risk of the portfolio is

\[
r(w) = \sum_{i=1}^{n} r_i w_i. \tag{71}
\]

• Time to maturity of the portfolio, however, is not the weighted sum of the individual assets’ maturity times. It is the maximum time to maturity of all assets included in the portfolio:

\[
t(w) = \max_j t_j, \quad \text{where } j \text{ is such that } x_j \in A. \tag{72}
\]

Note that if all assets are included in every portfolio, then the time to maturity will be the same for all portfolios.

• The transaction cost of the portfolio is

\[
c(w) = \sum_{i=1}^{n} c_i v_i, \tag{73}
\]

where $v_i = w_i$ if the transaction cost of asset $i$ is a proportion of the wealth invested into the asset, and $v_i = s$ (a constant) if the transaction cost of asset $i$ is equal to $s$ for any amount invested.

• Similarly, the values of other attributes characterizing a portfolio could be defined in terms of the values of the individual assets included in the portfolio.
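The portfolio-level criteria (70) to (73) can be computed from per-asset data as in the sketch below. All names and numbers are hypothetical. Two points are assumptions of this example rather than statements of the text: an asset counts as "included" when its weight is positive, and the fixed-cost case follows (73) literally by setting $v_i = s$.

```python
def portfolio_criteria(weights, returns, risks, maturities, costs,
                       proportional, s=1.0):
    """Return, risk, time to maturity, and transaction cost of a portfolio.

    proportional[i] is True when the cost of asset i is a proportion of the
    wealth invested in it (v_i = w_i), and False for a fixed cost (v_i = s).
    """
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    total_return = sum(R * w for R, w in zip(returns, weights))          # (70)
    total_risk = sum(r * w for r, w in zip(risks, weights))              # (71)
    # (72): maximum maturity over the assets held (assumed: weight > 0).
    time_to_maturity = max(t for t, w in zip(maturities, weights) if w > 0)
    cost = sum(                                                          # (73)
        c * (w if prop else s)
        for c, w, prop in zip(costs, weights, proportional)
    )
    return {
        "return": total_return,
        "risk": total_risk,
        "maturity": time_to_maturity,
        "cost": cost,
    }


# Hypothetical data for three assets; the third one is not held.
criteria = portfolio_criteria(
    weights=[0.5, 0.5, 0.0],
    returns=[0.10, 0.06, 0.20],
    risks=[0.20, 0.10, 0.50],
    maturities=[5, 2, 10],
    costs=[0.01, 0.02, 0.03],
    proportional=[True, True, True],
)
```

These raw values would still have to be passed through the utility functions before being aggregated by the Choquet integral.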

Keeping the same Shapley values for all attributes and interaction indices of degree 2 for each pair of attributes as given in the first stage of the algorithm, we maximize the Choquet integral of the alternatives. Thus, this stage of the algorithm reduces to a constrained optimization problem that finds the vector $w = (w_1, \ldots, w_n)$ that maximizes the objective function

\[
\max_{w} \; \sum_{I_{ij} > 0} I_{ij} \, [u_i(x_i) \wedge u_j(x_j)]
+ \sum_{I_{ij} < 0} |I_{ij}| \cdot [u_i(x_i) \vee u_j(x_j)]
+ \sum_{i=1}^{n} u_i(x_i) \Bigl( I_i - \frac{1}{2} \sum_{j \neq i} |I_{ij}| \Bigr). \tag{74}
\]

Here, $x_i$ and $x_j$ represent the values of criteria of the portfolio (e.g., risk, return, time to maturity, etc.), which are defined in terms of the weights $w_i$ and the corresponding characteristics of the individual assets ($r_i$, $R_i$, $t_i$, $c_i$, or others).

The maximization problem is subject to the following constraints:

\[
\sum_{i=1}^{n} w_i R_i \ge R \quad \text{or} \quad \sum_{i=1}^{n} w_i r_i \le r \quad \text{(portfolio satisfies the main goal of the investor);} \tag{75}
\]

\[
\sum_{i=1}^{n} w_i = 1 \quad \text{(exactly all wealth is invested);} \tag{76}
\]

\[
w_i \ge 0 \quad \forall i = 1, \ldots, n \quad \text{(money cannot be borrowed to be invested in an asset).} \tag{77}
\]

This constrained problem could be solved using standard optimization techniques. Since all constraints are linear, the choice of the optimization technique depends on the form of the objective function. Using the simple utility functions described in this section, the objective function is linear as well, which allows us to use the simplex method to determine the optimal solution. However, if more complex utility functions are used in the multi-criteria decision process, the objective function might not be linear, and other methods, such as those based on the Karush-Kuhn-Tucker or Fritz John conditions, must be used to find the solution. Since these methods guarantee finding a local optimum but not the global one, we can iterate the algorithm several times with different starting points to find a better solution.

To reduce the complexity of the algorithm, we developed another algorithm that utilizes the MCDM setting in portfolio selection. It starts by ordering all assets based only on their risk, and using this ranking it clusters all the assets into three groups: high, middle, and low risk assets. The clustering is performed such that the third of assets with the highest risk constitutes group 1, group 2 contains the middle risk assets, and group 3 is the third of assets with the lowest risk. Next, we calculate the Choquet integral of each asset following the same MCDM setting as in the first algorithm. We select the top $n_1 > 0$, $n_2 > 0$, and $n_3 > 0$ assets respectively from the high, middle, and low risk clusters to be included in the portfolio. The values of $n_1$, $n_2$, and $n_3$ are either all equal and predetermined, or they are such that the values of the assets selected from each cluster are higher than a predefined threshold value.
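The clustering and selection steps of this second algorithm can be sketched as follows. The asset names, risks, and scores are hypothetical; the scores stand in for Choquet integral values computed elsewhere, and putting any remainder of the division by three into the middle group is an assumption of this example.

```python
def cluster_by_risk(risk):
    """Split assets (name -> risk) into high, middle, and low risk groups."""
    ranked = sorted(risk, key=risk.get, reverse=True)   # riskiest first
    third = len(ranked) // 3
    high = ranked[:third]
    low = ranked[len(ranked) - third:]
    middle = ranked[third:len(ranked) - third]          # absorbs any remainder
    return high, middle, low


def top_n(cluster, score, n):
    """The n assets of a cluster with the highest score."""
    return sorted(cluster, key=score.get, reverse=True)[:n]


# Hypothetical risks and scores (e.g., Choquet integral values) per asset.
risk = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4, "e": 0.2, "f": 0.1}
score = {"a": 0.3, "b": 0.7, "c": 0.6, "d": 0.2, "e": 0.5, "f": 0.9}

high, middle, low = cluster_by_risk(risk)
selected = [top_n(group, score, 1) for group in (high, middle, low)]
```

The threshold-based variant mentioned above would replace `top_n` with a filter that keeps every asset of a cluster whose score exceeds the investor's threshold.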

Based on the investor’s level of risk aversion, the proportion of wealth invested in each cluster is determined and denoted by $p_1$, $p_2$, and $p_3$ respectively for groups 1, 2, and 3. If the decision-maker is highly risk-averse, $p_1$ will be much smaller than $p_2$ and $p_3$, while for a risk-prone individual, $p_3$ will be smaller than $p_1$ and $p_2$. However, none of the numbers will be equal to zero in order to diversify the portfolio, which is necessary to reduce unsystematic risk, the risk that depends on the company.

Finally, the wealth allocated to each cluster is distributed among the assets that belong to the group, so that the optimal portfolio is selected. Each cluster is considered separately from the other two, and the best distribution of wealth is determined by maximizing the Choquet integral of the portfolios built from the selected assets in each group

\[
\max_{w} \; \sum_{I_{ij} > 0} I_{ij} \, [u_i(x_i) \wedge u_j(x_j)]
+ \sum_{I_{ij} < 0} |I_{ij}| \, [u_i(x_i) \vee u_j(x_j)]
+ \sum_{i=1}^{n} u_i(x_i) \Bigl( I_i - \frac{1}{2} \sum_{j \neq i} |I_{ij}| \Bigr), \tag{78}
\]

subject to

\[
\sum_{i=1}^{n} w_i = 1 \quad \text{(exactly all wealth is invested);} \tag{79}
\]

\[
w_i \ge 0 \quad \forall i = 1, \ldots, n \quad \text{(money cannot be borrowed to be invested in an asset).} \tag{80}
\]

Note that this algorithm does not explicitly require satisfaction of the main goal of the investor (e.g., required return level, maximum risk rate, etc.), but this requirement is implicitly accounted for in the distribution of wealth among the three clusters. We can again apply one of the standard optimization techniques to solve this problem.

Even though the utility-based multi-criteria decision making setting and its solution by use of the Choquet integral with respect to a 2-additive measure is a feasible and accurate solution for values given by the decision maker, this approach faces another problem. We cannot expect a decision maker to give precise values for the importance and interaction indices. To overcome this hurdle, it was shown in [3] that intervals provide a suitable solution in MCDM problems.

7.3 Intervals

Interval Arithmetic (IA) is an arithmetic over sets of real numbers called intervals. Its development started in the 1950s in order to model uncertainty and to tackle rounding errors of numerical computations. For a complete presentation of interval arithmetic, we refer the reader to [15].

Definition 6.7. (Interval): A closed real interval is a closed and connected set of real numbers. The set of closed real intervals is denoted by $\mathbb{IR}$. Every $x \in \mathbb{IR}$ is denoted by

\[
[\underline{x}, \overline{x}], \tag{81}
\]

where its bounds are defined by $\underline{x} = \inf x$ and $\overline{x} = \sup x$. For every $a \in \mathbb{R}$, the interval point $[a, a]$ is also denoted by $a$.

In the following, elements of $\mathbb{IR}$ are simply called real intervals or intervals. The width of a real interval $x$ is the real number $w(x) = \overline{x} - \underline{x}$. Given two real intervals $x$ and $y$, $x$ is said to be tighter than $y$ if $w(x) \le w(y)$.

Interval Arithmetic operations are set theoretic extensions of the corresponding real operations. Due to properties of monotonicity, these operations can be implemented by real computations over the bounds of intervals. Given two intervals $x = [a, b]$ and $y = [c, d]$, we have for instance:

\[
\begin{aligned}
x + y &= [a + c,\; b + d] \\
x - y &= [a - d,\; b - c] \\
x \times y &= [\min\{ac, ad, bc, bd\},\; \max\{ac, ad, bc, bd\}] \\
x^n &=
\begin{cases}
[a^n, b^n] & \text{if } n \text{ is an odd natural number} \\
[0, \max\{|a|, |b|\}^n] & \text{if } n \text{ is even and } 0 \in [a, b] \\
[\min\{|a|, |b|\}^n, \max\{|a|, |b|\}^n] & \text{if } n \text{ is even and } 0 \notin [a, b]
\end{cases}
\end{aligned}
\]

The associative law and the commutative law are preserved over $\mathbb{IR}$. However, the distributive law does not hold. In general, only a weaker law is verified, called sub-distributivity. For all $x, y, z \in \mathbb{IR}$, we have:

associativity: $(x + y) + z = x + (y + z)$, $\; (xy)z = x(yz)$

commutativity: $x + y = y + x$, $\; xy = yx$

sub-distributivity: $x \times (y + z) \subseteq x \times y + x \times z$
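The interval operations and the sub-distributivity law can be checked with a small Python class. This is a minimal sketch with hypothetical names; it ignores the outward rounding that a rigorous interval library would apply to floating-point bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    def width(self):
        return self.hi - self.lo

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __pow__(self, n):
        if n < 1:
            raise ValueError("n must be a natural number >= 1")
        if n % 2 == 1:                                  # odd power is monotone
            return Interval(self.lo ** n, self.hi ** n)
        small, large = sorted((abs(self.lo), abs(self.hi)))
        if self.lo <= 0 <= self.hi:                     # even power, 0 inside
            return Interval(0, large ** n)
        return Interval(small ** n, large ** n)         # even power, 0 outside

    def contains(self, other):
        return self.lo <= other.lo and other.hi <= self.hi


# Sub-distributivity: x*(y + z) is contained in x*y + x*z, possibly strictly.
x, y, z = Interval(1, 2), Interval(1, 1), Interval(-1, -1)
left = x * (y + z)          # x * [0, 0] = [0, 0]
right = x * y + x * z       # [1, 2] + [-2, -1] = [-1, 1]
```

The strict inclusion in this example arises because `x` occurs twice on the right-hand side and interval arithmetic treats the two occurrences as independent.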

Intervals of preferences

As mentioned earlier, to define preferences over multi-dimensional alternatives, the user is required to provide importance and interaction indices, but is more likely to provide intervals of values $I_i$ and $I_{ij}$, $\forall i, j \in \{1, \ldots, n\}$, which leads to the evaluation of a Choquet integral over intervals using IA:

\[
(CI)\int_I f \, d\mu = \sum_{I_{ij} > 0} \Bigl( (f(i) \wedge f(j)) - \frac{1}{2}(f(i) + f(j)) \Bigr) I_{ij}
+ \sum_{I_{ij} < 0} \Bigl( (f(i) \vee f(j)) - \frac{1}{2}(f(i) + f(j)) \Bigr) |I_{ij}|
+ \sum_{i=1}^{n} f(i) I_i
\]

where the annotation $(CI)$ means that the interpretation of this formula is performed using IA. As a consequence, the value of the integral is an interval.

Back to the Portfolio Optimization Problem

Intervals allow the problem of portfolio management to be presented more realistically, as the investor is asked to provide ranges of values for the importance and interaction indices of order 2 instead of exact values. It is reasonable to believe that an investor can determine whether, for example, minimization of risk is more important than the return from an asset, or whether the time period in which an amount can be obtained is more important than risk. However, it is more realistic that the investor can give an interval for how much more important one criterion is than another, rather than exact values of the relative importance among criteria. Thus, intervals provide a rational way to solve the portfolio optimization problem by following the same procedures as the non-interval based methods, evaluating the Choquet integral over intervals and extending the optimization techniques to intervals as well. However, a new issue arises when using intervals to evaluate alternatives: the result of the Choquet integral evaluated over intervals is an interval, and intervals are not as easy to compare as real numbers.

Strategies of preference

When comparing intervals, the ideal case is when the intervals of preferences do not intersect. In this case, if alternative I is evaluated with values that are all better than those of alternative J, the preference is clearly given to alternative I.

However, the above case is very specific and unfortunately does not happen often. More commonly, two intervals intersect and we need to choose the better of two overlapping intervals. The strategies to make decisions in such cases are described below.

A simple naive strategy offers a straightforward solution that compares only the upper bounds and gives the preference to the interval with the highest upper bound (which corresponds to an optimistic behavior of a decision-maker, as he/she is only interested in the highest potential values rather than all the values that could be reached), or compares the lower bounds and gives preference to the highest lower bound (which corresponds to a pessimistic behavior).

However, many alternatives between the very optimistic case and the very pessimistic case are possible. They require us to look simultaneously at the upper and lower bounds as well as the width of the intervals, which highlights the degree of uncertainty of the alternative’s value. To combine these variables, a degree of preference was introduced [3]. A degree of preference, $d(I, J)$, is intended to express the extent to which a better value of the Choquet integral is likely to lie in interval $I$, rather than in interval $J$.

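It is defined as a function $d : \mathbb{IR}^2 \to [0, 1]$, where:

\[
d(I, J) =
\begin{cases}
\dfrac{\overline{I} - \overline{J}}{\left| (\overline{I} - \overline{J}) + (\underline{J} - \underline{I}) \right|} & \text{if } \overline{I} > \overline{J} \text{ and } \underline{J} \ge \underline{I} \\[2ex]
1 & \text{if } \overline{I} = \overline{J} \text{ and } \underline{I} > \underline{J} \\
  & \text{or: if } \overline{I} > \overline{J} \text{ and } \underline{I} \ge \underline{J} \\
  & \text{or: if } \overline{I} = \overline{J} \text{ and } \underline{I} = \underline{J} \\[1ex]
1 - d(J, I) & \text{otherwise}
\end{cases}
\tag{82}
\]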

The higher the value of the degree of preference, the greater the chance that the optimal interval is the interval $I$, while a lower value of the degree of preference implies that the interval $J$ would more likely contain the higher value of the Choquet integral.
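A possible Python reading of the degree of preference (82) is sketched below. It assumes that the first case applies when $I$ has the larger upper bound and $J$ the larger or equal lower bound, and that it divides the excess of $I$ above $J$ by the total excess of $I$ at both ends. Intervals are passed as hypothetical `(lower, upper)` tuples.

```python
def degree_of_preference(I, J):
    """Degree to which interval I is preferred to interval J, in [0, 1]."""
    (i_lo, i_hi), (j_lo, j_hi) = I, J
    if i_hi > j_hi and j_lo >= i_lo:
        # J is nested in I: compare the excess of I above J with the
        # total excess of I above and below J.
        return (i_hi - j_hi) / abs((i_hi - j_hi) + (j_lo - i_lo))
    if i_hi == j_hi and i_lo > j_lo:
        return 1.0
    if i_hi > j_hi and i_lo >= j_lo:
        return 1.0
    if i_hi == j_hi and i_lo == j_lo:
        return 1.0
    return 1.0 - degree_of_preference(J, I)


# Hypothetical interval values of the Choquet integral.
nested = degree_of_preference((0.0, 10.0), (2.0, 6.0))      # 4 / (4 + 2)
reverse = degree_of_preference((2.0, 6.0), (0.0, 10.0))     # 1 - 4/6
dominant = degree_of_preference((3.0, 8.0), (1.0, 5.0))     # both bounds higher
```

The recursion in the last line terminates after one step, because swapping the arguments of any pair that reaches it always lands in one of the explicit cases.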

The degree of preference, as described above, assumes that the decision-maker is risk-neutral, that is, the person is neither willing to take an extreme risk nor believes that there is a reason to be overly careful. However, sometimes a person exhibits a risk-prone attitude and leans towards optimistic behavior, or, on the other hand, the decision-maker could be more risk-averse, especially if there is a reason to expect pessimistic results. If the decision-maker can provide the level of risk that he/she is willing to take in order to maximize the utility, then we can modify the degree of preference to include this fact.

Let us assume that the level of risk a person wants to take is expressed by a real value in the interval [0, 1], where naturally, values close to 0 represent pessimistic situations, and values closer to 1 mean more optimistic expectations. Now, we can tighten the considered interval to better suit this level of risk. The shrinking of the interval $[\underline{X}, \overline{X}]$ based on the risk level $r \in [0, 1]$ is done in the following way [21]:

• First, calculate the proportion of the interval that is considered important by the decision-maker:

\[
p = 2 \cdot \min\{r - 0, 1 - r\}. \tag{83}
\]

• Next, calculate the size of the interval that corresponds to the given proportion:

\[
\text{size} = p \cdot (\overline{X} - \underline{X}). \tag{84}
\]

• Finally, calculate the interval of importance, $[\underline{N}, \overline{N}]$:

\[
[\underline{N}, \overline{N}] =
\begin{cases}
[\underline{X}, \underline{X} + \text{size}] & \text{if } r \le 0.5 \\
[\overline{X} - \text{size}, \overline{X}] & \text{otherwise}
\end{cases}
\tag{85}
\]

This approach clearly returns a single point instead of an interval in cases when the level of risk is at the extreme points, i.e., the interval of importance is the upper bound of the original interval when the risk level is 1, and the lower bound when the risk level is 0. In both cases, the problem is reduced to the comparison of single (extreme) points rather than intervals, a situation that corresponds to the naive strategy.

Once we have tightened the intervals to reflect the level of risk the decision-maker is willing to take, we apply equation (82) to the new intervals of importance to calculate the degree of preference, which determines the better of two intervals.
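The three shrinking steps (83) to (85) translate into a few lines of Python. The function name and the test interval are hypothetical; the bounds are passed as plain numbers.

```python
def interval_of_importance(lower, upper, r):
    """Shrink [lower, upper] according to the risk level r in [0, 1]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("risk level must lie in [0, 1]")
    p = 2 * min(r - 0, 1 - r)          # (83): proportion considered important
    size = p * (upper - lower)         # (84): corresponding width
    if r <= 0.5:                       # (85): pessimistic side keeps the lower end
        return (lower, lower + size)
    return (upper - size, upper)       # optimistic side keeps the upper end


# Hypothetical interval [0, 10] at several risk levels.
pessimist = interval_of_importance(0.0, 10.0, 0.0)     # collapses to the lower bound
cautious = interval_of_importance(0.0, 10.0, 0.25)
neutral = interval_of_importance(0.0, 10.0, 0.5)       # keeps the whole interval
bold = interval_of_importance(0.0, 10.0, 0.75)
optimist = interval_of_importance(0.0, 10.0, 1.0)      # collapses to the upper bound
```

The two extreme risk levels return degenerate intervals, which matches the reduction to the naive strategy described above.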

The presented approach to determine the better of two intervals given the level of risk works well if the decision-maker can provide the exact degree of risk he/she is willing to take. However, in reality it is hard to describe the level of risk by a single number [21]. It is more probable that a person could define the level of risk by an interval $r = [\underline{r}, \overline{r}]$. In this case, the calculation of the interval of importance that accounts for the optimism/pessimism of a person is a bit more complicated. Instead of a precise interval, the result is an interval whose bounds are themselves intervals (a 2nd order interval), and therefore the degree of preference would result in an interval, $d(I, J) = [\underline{d}, \overline{d}]$, rather than a single number. Three situations could occur:

• $\overline{d} < 0.5$, in which case the preferable choice is interval $J$.
• $\underline{d} > 0.5$, in which case the preferable choice is interval $I$.
• $0.5 \in [\underline{d}, \overline{d}]$, in which case the preferable choice is

(1) interval $I$ if $(\overline{d} - 0.5) \ge (0.5 - \underline{d})$

(2) interval $J$ otherwise.
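The three situations above amount to a short decision rule. In the sketch below, the bounds of the interval-valued degree of preference are passed as two numbers, and the return values `"I"` and `"J"` are placeholders chosen for this example.

```python
def preferred_interval(d_lower, d_upper):
    """Choose between I and J from an interval-valued degree of preference."""
    if d_lower > d_upper:
        raise ValueError("lower bound exceeds upper bound")
    if d_upper < 0.5:                  # the whole degree lies below 0.5
        return "J"
    if d_lower > 0.5:                  # the whole degree lies above 0.5
        return "I"
    # 0.5 lies inside [d_lower, d_upper]: compare the parts on either side.
    return "I" if (d_upper - 0.5) >= (0.5 - d_lower) else "J"


choices = [
    preferred_interval(0.1, 0.4),
    preferred_interval(0.6, 0.9),
    preferred_interval(0.3, 0.8),
    preferred_interval(0.2, 0.6),
]
```

In the straddling case the rule simply favors the side of 0.5 on which the larger part of the interval lies, with ties going to `I`.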

All of the above rankings of intervals suppose a uniform probability distribution, which is a reasonable assumption if no additional information is available. However, sometimes more information is accessible and a more accurate probability distribution over an interval could be considered. Typically, if the width of the interval is not limited, it is common that a decision-maker would give an interval bigger than what he/she really believes the interval should be, in order to cover any possible extreme value, even though the extreme values would very rarely happen. Thus, it is not uncommon that the values within an interval would not follow a uniform distribution, but rather a form of Gaussian distribution (possibly skewed). In this situation, it is reasonable to assume that the interval Choquet integral would also not follow a uniform distribution but would rather have a higher probability of values in the interior of the interval than of those close to the bounds.

In the case when more information is available about the probability distribution over an interval, we can slightly modify the approach of calculating the degree of preference [21]. As before, we start by tightening the given interval based on the level of risk, $r$, that a person is willing to take. Thus, we need to determine the value, $s$, within the given interval $[\underline{X}, \overline{X}]$ such that

\[
s =
\begin{cases}
P(x \le 2r) & \text{if } r < 0.5 \\
P(x \ge (2r - 1)) & \text{if } r > 0.5
\end{cases}
\tag{86}
\]

So the interval of importance is:

\[
[\underline{N}, \overline{N}] =
\begin{cases}
[\underline{X}, \underline{X} + s] & \text{if } r \le 0.5 \\
[\overline{X} - s, \overline{X}] & \text{otherwise}
\end{cases}
\tag{87}
\]

Note that the above formula, when applied to the uniform distribution, leads exactly to equation (85), with $s$ replacing the variable size.

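The next step is to calculate the degree of preference between two intervals given their new bounds. Taking the probability distribution into account, the degree of preference is given by:

\[
d(I, J) =
\begin{cases}
\dfrac{P(\overline{J} \le x \le \overline{I})}{P(\overline{J} \le x \le \overline{I}) + P(\underline{I} \le x \le \underline{J})} & \text{if } \overline{I} > \overline{J} \text{ and } \underline{J} \ge \underline{I} \\[2ex]
1 & \text{if } \overline{I} = \overline{J} \text{ and } \underline{I} > \underline{J} \\
  & \text{or: if } \overline{I} > \overline{J} \text{ and } \underline{I} \ge \underline{J} \\
  & \text{or: if } \overline{I} = \overline{J} \text{ and } \underline{I} = \underline{J} \\[1ex]
1 - d(J, I) & \text{otherwise}
\end{cases}
\tag{88}
\]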

When applied to the uniform distribution, this equation simplifies to equation (82).

Back to the Portfolio Optimization Problem

Even though the presented utility-based approach to decision making has not yet been tested on real data, the theoretical framework is sound and general enough to be applied to real finance data sets. Its performance could be tested against the performance of benchmark portfolios, and we expect the approach to compare favorably with them, since fuzzy integration over intervals addresses the drawbacks of other commonly used intelligent techniques, which themselves outperform benchmark portfolios. Thus, this method offers a natural and logical framework to optimize portfolio selection.


8 Conclusion

Computational intelligence techniques are very useful for solving problems involving the understanding, modeling and analysis of large data sets. Very often, trying to take all variables into consideration, along with variable dependencies, is not practical, as this approach rapidly becomes intractable, even when using distributed computing techniques. On the other hand, humans are very efficient at identifying what matters, and discarding what does not matter, in a given situation. Computational intelligence techniques replicate precisely this process of eliminating the 'noise' and focusing on the data that matter.

We have seen that finance is an area that is well suited to computational intelligence approaches. We have presented the state of the art in computational techniques for portfolio management, that is, how to optimize a portfolio selection process. More specifically, we have shown that genetic algorithms, rule-based systems, neural networks, and support vector machines offer some advantages over benchmark portfolio management schemes, be it in complexity or in accuracy. We then proceeded to show that a utility-based approach to decision making offers a natural and logical framework to optimize portfolio selection, and have shown how the Choquet integral (which generalizes a large class of aggregation operators in multi-criteria decision making), constraint programming, and interval computation can be used to solve such problems and allow us to deal with both uncertain and imprecise data. This approach has not yet been tested on real data; however, the theoretical framework is sound and general enough to be applied successfully to real finance data sets. Moreover, it addresses some of the issues pertaining to other computational approaches, such as overfitting of parameters, description of the dependencies between characteristics of an asset, imprecise data, etc.

Last, although we have focused on portfolio management problems, it is quite possible to apply similar computational approaches to pricing problems, as an alternative to the more traditional stochastic differential equations and stochastic integration approaches.

References

1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal classifiers. In: Haussler, D. (ed.) 5th Annual ACM Workshop on COLT, pp. 144–152. ACM Press, Pittsburgh (1992)

2. Casas, C.A.: Tactical asset allocation: an artificial neural network based model. In: Proceedings of International Joint Conference on Neural Networks, vol. 3, pp. 1811–1816 (2001)

3. Ceberio, M., Modave, F.: Interval-based multicriteria decision making. In: Bouchon-Meunier, B., Coletti, G., Yager, R.R. (eds.) Modern Information Processing: from Theory to Applications. Elsevier, Amsterdam (2006)

4. Chapados, N., Bengio, Y.: Cost functions and model combination for VaR-based asset allocation using neural networks. IEEE Transactions on Neural Networks 12, 890–906 (2001)


5. Choquet, G.: Theory of capacities. Annales de l'Institut Fourier 5 (1953)

6. Cohen, P.R., Lieberman, M.D.: A report on FOLIO: an expert assistant for portfolio managers. Investment Management: Decision Support and Expert Systems, 135–139 (1990)

7. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20 (1995)

8. Denneberg, D., Grabisch, M.: Shapley value and interaction index. Mathematics of interaction index (1996)

9. Fan, A., Palaniswami, M.: Stock selection using support vector machines. In: Proceedings of International Joint Conference on Neural Networks, vol. 3, pp. 1793–1798 (2001)

10. Giarratano, J., Riley, G.: Expert systems: principles and programming. PWS Publishing Company, Boston (1994)

11. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, MA (1989)

12. Grabisch, M.: The interaction and Mobius representation of fuzzy measures on finite spaces. In: Grabisch, M., Murofushi, T., Sugeno, M. (eds.) k-additive measures: a survey. Fuzzy measures and integrals: Theory and applications. Physica Verlag (2000)

13. Grabisch, M., Roubens, M.: Application of the Choquet integral in multicriteria decision making. In: Grabisch, M., Murofushi, T., Sugeno, M. (eds.) Fuzzy Measures and Integrals: Theory and Applications. Physica Verlag (2000)

14. Haupt, R.L., Haupt, S.E.: Practical Genetic Algorithms, 2nd edn. Wiley, New York (2004)

15. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis, with Examples in Parameter and State Estimation. In: Robust Control and Robotics. Springer, London (2001)

16. Joshi, M.: The concepts and practice of mathematical finance, Cambridge (2003)

17. Krantz, D., Luce, R., Suppes, P., Tversky, A.: Foundations of measurement. Academic Press, London (1971)

18. Lai, K.K., Yu, L., Wang, S., Zhou, C.: A double-stage genetic optimization algorithm for portfolio selection. In: King, I., Wang, J., Chan, L.-W., Wang, D. (eds.) ICONIP 2006, Part III. LNCS, vol. 4234, pp. 928–937. Springer, Heidelberg (2006)

19. Lin, L., Cao, L., Wang, J., Zhang, C.: The applications of genetic algorithms in stock market data mining optimization. In: Zanasi, A., Ebecken, N.F.F. (eds.) Data Mining V. WIT Press (2004)

20. Lowe, D.: Novel exploitation of neural network methods in financial markets. In: IEEE International Conference on Neural Networks, vol. 6, pp. 3623–3628 (1994)

21. Magoc, T., Ceberio, M., Modave, F.: Interval-based multi-criteria decision making: strategies to order intervals. In: Proceedings of North American Fuzzy Information Processing Society (2008)

22. Modave, F., Grabisch, M.: Preferential independence and the Choquet integral. In: 8th International Conference on the Foundations and Applications of Decision Under Risk and Uncertainty, Mons, Belgium (1997)

23. Nilsson, N.J.: Artificial intelligence: a new synthesis. Morgan Kaufmann, San Francisco (1998)

24. Prugel-Bennett, A., Shapiro, J.L.: An analysis of genetic algorithms using statistical mechanics. Physical Review Letters 72(9), 1305–1309 (1994)


25. Reeves, C.R., Rowe, J.E.: Genetic algorithms-Principles and Perspective: A Guide to GA Theory. Kluwer Academic Publisher, London (2003)

26. Shapley, L.S.: A value for n-person games. In: Kuhn, H.W., Tucker, A.W. (eds.) Contributions to the Theory of Games, vol. 2, pp. 307–317. Princeton University Press, Princeton (1953)

27. Sugeno, M.: Theory of fuzzy integrals and its applications. PhD Thesis, Tokyo Institute of Technology (1974)

28. Schölkopf, B., Burges, C., Vapnik, V.: Extracting support data for a given task. In: Advances in Neural Information Processing Systems (1995)

29. Seo, Y., Giampapa, J., Sycara, K.: Financial news analysis for intelligent portfolio management. Tech. report CMU-RI-TR-04-03, Robotics Institute, Carnegie Mellon University (2004)

30. Sycara, K., Decker, K., Zeng, D.: Intelligent agents in portfolio management. In: Jennings, N., Wooldridge, M. (eds.) Agent Technology: Foundations, Applications, and Markets. Springer, Heidelberg (1998)

31. Tan, P., Steinbach, M., Kumar, V.: Introduction to data mining. Addison-Wesley, Reading (2006)

32. Zhou, C., Yu, L., Huang, T., Wang, S., Lai, K.K.: Selecting valuable stocks using genetic algorithm. In: Wang, T.-D., Li, X.-D., Chen, S.-H., Wang, X., Abbass, H.A., Iba, H., Chen, G.-L., Yao, X. (eds.) SEAL 2006. LNCS, vol. 4247, pp. 688–694. Springer, Heidelberg (2006)

33. Zimmermann, H.G., Grothmann, R.: Optimal asset allocation for a large number of investment opportunities. Intelligent Systems in Accounting, Finance, and Management 13, 33–40 (2005)

34. Zimmermann, H.G., Neuneier, R., Grothmann, R.: Active portfolio management based on error correction neural networks. In: Proceedings of Neural Network Information Processing Systems, Vancouver, Canada (2002)


A Bayesian Solution to the Modifiable Areal Unit Problem

C. Hui

Centre of Excellence for Invasion Biology, Department of Botany and Zoology, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa
[email protected]

Summary. The Modifiable Areal Unit Problem (MAUP) prevails in the analysis of spatially aggregated data and influences pattern recognition. It describes the sensitivity of the measurement of spatial phenomena to the size (the scale problem) and the shape (the aggregation problem) of the mapping unit. It has received much attention from fields as diverse as statistical physics, image processing, human geography, landscape ecology, and biodiversity conservation. Recently, in the field of spatial ecology, a Bayesian estimation was proposed to grasp how our description of species distribution (described by range size and spatial autocorrelation) changes with the size and the shape of the grain. This Bayesian estimation (BYE), called the scaling pattern of occupancy, is derived from the comparison of pair approximation (in the spatial analysis of cellular automata) and join-count statistics (in spatial autocorrelation analysis) and has been tested using various sources of data. This chapter explores how the MAUP can be described and potentially solved by the BYE. Specifically, the scale and the aggregation problems are analyzed using simulated data from an individual-based model. The BYE will thus help to finalize a comprehensive solution to the MAUP.

1 Introduction

Spatial patterns in the natural world are hardly random, from the cloud of atoms to the distribution of species. Relating such patterns across scales has been argued to be the central problem in all of science [1]. In physics, Bose-Einstein condensation tries to describe how numerous atoms are distributed in lattices with changing sizes [2]. In ecology, alpha and beta diversity show the scale sensitivity of species richness [3]. The species-area relationship [4, 5] and the scaling pattern of species distribution (measured by occupancy or occurrence) [6] attract much attention from ecologists. In the analysis of those patterns across scales (called scaling patterns), a severe statistical problem emerges, i.e. how the measurements of those spatial patterns depend on the scales in question, named the Modifiable Areal Unit Problem (hereafter MAUP) [7].

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 175–196. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

The MAUP is a potential source of error that can affect spatial studies which utilize aggregate data sources [8]. To describe a spatial phenomenon, we normally transform it into a raster format, like a digital picture, with each minimal unit called a grain [9]. As the grain changes, so does the measurement of the spatial pattern (e.g. spatial variance [10]). The MAUP consists of two parts: the scale problem and the aggregation problem (also called the zoning problem), which are caused by changes in the size and the shape of the grain, respectively [7]. Although there is hitherto no complete solution to the MAUP [11], studies on the scale problem prevail in the spatial analysis of ecology. The non-random distribution of species in their spatial habitat has long been studied (e.g. [12]). This non-randomness arises from the spatial heterogeneity of the habitat, the nonlinear nature of species life history (e.g. dispersal strategy and density dependence), as well as interspecific interactions. Because most species' distributions are not random but aggregated (or over-dispersed, clustered, clumpy, autocorrelated, contagious, patchy, etc.), finding a consistent description of the spatial structure of species distributions has become a priority in ecological research. On the other hand, the aggregation (zoning) problem has received little attention in ecology, and only a little in human geography [8].

Since the MAUP only affects non-random spatial data [7], different methods and aggregation indices have been invented to grasp the characteristics of this non-randomness in nature. According to Li and Reynolds' [13] introduction, Wiens [12] categorised spatial heterogeneity into four forms: spatial variance, patterned variance, compositional variance and locational variance. Based on the standard geographic system [14], Perry et al. [15] divided the types of spatial data into point- and area-referenced data. Different spatial statistics focus on different types of spatial data [16]. Clarification of the concept of aggregation, as well as of the scaling pattern of the aggregation indices, plays an important role in the unification of the theories in macroecology and spatial ecology.

In the development of spatially semi-explicit indices, belonging to the Local Indicators of Spatial Autocorrelation (LISA) statistics [17], progress was made through linking pair approximation and join-count statistics [6, 18]. Pair approximation is a moment approximation method for describing spatial patterns, introduced from statistical physics [19], and has been widely employed in the analysis of spatial patterns from cellular automata (e.g. [20]). Join-count statistics is the first step in describing real spatial patterns by giving the status of a focal sample's neighbors (adjacent patches) [21, 22]. It is possible that the combination of these two methods can address a simplified MAUP (i.e. in presence-absence format), and eventually provide a comprehensive framework for solving the MAUP in the future.


2 Models

2.1 Modifiable Areal Unit Problem

Literature Review

The MAUP originated from the analysis of scale effects in census data [23], and was formally demonstrated in the 1980s [7]. Jelinski and Wu [24] indicated that the MAUP arises from the fact that areal units are usually arbitrarily determined and "modifiable", in the sense that they can be aggregated to form units of different sizes or spatial arrangements. Openshaw [7] divided the MAUP into two sub-problems (scale and aggregation). The scale problem is "the variation in results that may be obtained when the same areal data are combined into sets of increasingly larger areal units of analysis"; the aggregation (zoning, zonation) problem is "any variations in results due to alternative units of analysis where the number of units is constant".

Research on the MAUP has tended to focus on the measurement of spatial structure using different indices and statistics (e.g. [25]). Dorling [26] highlighted difficulties such as the MAUP in preventing the development of visual palliatives and pointed out the necessity of using cartograms in place of conventional choropleth maps for aggregated distributions. Amrhein [27] concluded that the effects of the MAUP vary with the statistic calculated. Even though means and variances are resistant to aggregation effects, regression coefficients and correlation statistics exhibit dramatic effects. Amrhein [27] reminded us that the world of spatial analysis is not entirely well-behaved due to the MAUP. With the increased awareness of the MAUP in Geographic Information Systems (GIS) based spatial analysis in physical geography [28], landscape ecology [24] and social and urban management [29, 30], case studies on scaling patterns flourished. To list a few, Lery [31] performed a risk analysis for foster care entry at three spatial scales. Sexton et al. [32] examined spatial effects on analytical results related to causal inference and disease clustering. These studies recall Levin's [1] emphasis on the problem of scale in science. In the meantime, traditional indices and statistics, as well as a wide range of new methods, have been tested for their robustness and sensitivity towards the MAUP (e.g. [33, 34]). As the MAUP was only formally termed after Openshaw in the 1980s, the studies on the MAUP, as reviewed here, were largely descriptive. A full understanding even of a simplified MAUP will help to steer these efforts in the right research direction, which is the focus of this work.

Presence-Absence MAUP

In spatial analysis, one often encounters raster data, e.g. atlases and digital pictures, in which the presence-absence (binary) format is prevalent. For example, the Atlas of Southern African Birds [35] presents, for each species, a presence-absence map with a square 15' × 15' grain, showing the distribution of a particular species during the sampling period. The prevalence of presence-absence data is due to the fact that it is the easiest and quickest representation, which can be provided for a large-scale region at relatively low cost [36]. Spatial niche modelling (such as the climatic envelope; [37]) can also provide a quick, large-scale presence-absence map for a species distribution in landscape ecology. Lattice models also provide a huge amount of binary data from theoretical and experimental studies (e.g. [38, 39]). However, presence-absence data are susceptible to the MAUP, and a framework for overcoming this shortcoming is proposed in the pages to follow.

Fig. 1. A simplified illustration of the scale and aggregation problems in the modifiable areal unit problem (MAUP). The original 4×4 map (rows 0100, 1100, 0010, 0001) has an occupancy of 5/16. The Scale Problem: merging cells into 2×2 square grains gives 8/16. The Aggregation Problem: at the same grain size, rectangular grains give 16/16 and L-shaped grains give 12/16.

To clearly understand the MAUP, an illustrative example is given in figure 1 using such binary data. In figure 1, the original grain is 1/16 and the occupancy (i.e. the proportion of occupied cells) is 5/16. If we keep the shape of the grain unchanged but combine the four nearest neighbors into a new grain, we have a new grain size of 4/16. Now the MAUP occurs. We notice that the occupancy is not 5/16 anymore, but 8/16. This deviation between 5/16 and 8/16 is caused by the MAUP and is termed the scale problem. If we let the grain size remain 4/16, but change the shape of the grain (as shown in figure 1) from square to rectangular (with a width to length ratio of 1:4) or L-shaped, we find that the occupancy also changes (16/16 for the rectangular grain and 12/16 for the L-shaped grain). Such a difference is caused by the shape of the grain, and is called the aggregation problem (also called the zoning problem). A solution to this simplified MAUP (i.e. in a presence-absence version) should indicate how the occupancy will change with grain size as well as shape.
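The regrouping in this example can be reproduced in a few lines. The sketch below is illustrative only: it assumes the 4×4 map read off figure 1, the helper name `coarse_occupancy` is hypothetical, and a merged grain is treated as occupied when any of its cells is occupied. The L-shaped grain is left out because its exact tiling is not given in the text.

```python
import numpy as np

# Fine-grain presence-absence map transcribed from figure 1 (1 = present).
fine = np.array([
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])


def coarse_occupancy(grid, block_rows, block_cols):
    """Occupancy after merging block_rows x block_cols groups of cells.

    A merged grain counts as occupied if any of its cells is occupied.
    """
    rows, cols = grid.shape
    if rows % block_rows or cols % block_cols:
        raise ValueError("block shape must tile the grid exactly")
    # Axes 1 and 3 index the cells inside each merged grain.
    blocks = grid.reshape(rows // block_rows, block_rows,
                          cols // block_cols, block_cols)
    occupied = blocks.any(axis=(1, 3))
    return occupied.mean()


original = coarse_occupancy(fine, 1, 1)   # grain 1/16
square = coarse_occupancy(fine, 2, 2)     # grain 4/16, square (scale problem)
transect = coarse_occupancy(fine, 1, 4)   # grain 4/16, 1:4 rectangle (aggregation problem)
print(original, square, transect)
```

The two 4/16 grains have the same area but give different occupancies, which is the aggregation problem in its simplest form.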

2.2 Individual Based Model

A standard presence-absence data set for a plant population was generated from a spatially explicit individual-based model (IBM). The IBM is a typical model in ecological research and has been widely used in simulations of mixed ecosystems, fish, mammals, birds, insects, marine invertebrates, arachnids and bacteria, as well as in non-species-specific models (e.g. ATLSS by Donald DeAngelis et al. http://atlss.org/).

An individual in this simulation can be seen as a hermaphrodite perennial bush. A number of n seeds (n = 10) are dispersed around the parent at a distance that obeys an exponential distribution $p(x) = \lambda e^{-\lambda x}$ (x is the distance between the seed and the parent; 1/λ gives the mean distance of seeds to the parent, 0.01 ≤ λ ≤ 5.5; here we chose λ = 1), in a randomly chosen direction θ (0° ≤ θ ≤ 360°). The individuals that produce seeds are randomly chosen with probability c (c = 0.25) from among the mature adults (those older than one year). A seed only has a chance to grow into a seedling if there are no other plants within a certain distance d (d = 0.2), owing to overcrowding or other density-dependent mechanisms. During each time step, individuals also suffer a probability of death (mortality) e (e = 0.1).

The simulation was run in a 50×50 extent of two-dimensional homogeneous space (figure 2A) with a periodic boundary (to exclude edge effects). Given the exclusion distance d, the maximal number of individuals is 19,894 (the area of the extent, 2500, divided by the area πd² of the exclusion circle). This model generated results similar to those of other individual-based models, such as the forest growth simulators JABOWA, FORET, and SORTIE [40], but it is distinguished from lattice simulations (cellular automata; e.g. [41, 42]) by the fact that the individuals were not constrained to a grid (i.e. discrete in space); rather, each one had its own coordinates. The two-dimensional space was then divided into a patch network (figure 2B). A patch with at least one individual was marked as a presence (black in figure 2), while an empty one was marked as an absence (white). The occupancies of figure 2B, C and D are 0.2794, 0.6325 and 0.65, respectively. Specifically, we want a formula to predict the occupancies in figure 2C and D using the occupancy of figure 2B. This formula should capture not only the difference in occupancies across scales (the scale problem) but also the difference for the same grain size (the aggregation problem).
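A minimal sketch of such a simulation is given below, using the parameter values quoted above. Several details are assumptions, since the text does not specify them: the run starts from a single mature plant, mortality and ageing are applied before reproduction within a step, and a seed is also blocked by seedlings established earlier in the same step. All function names are hypothetical, and the sketch is not expected to reproduce the 541 individuals of figure 2A.

```python
import math
import random

EXTENT = 50.0       # side length of the periodic square arena
N_SEEDS = 10        # n, seeds per reproducing adult
RATE = 1.0          # lambda of the kernel p(x) = lambda * exp(-lambda * x)
P_REPRODUCE = 0.25  # c, chance that a mature adult reproduces in a step
MIN_DIST = 0.2      # d, a seed fails if any plant is closer than this
MORTALITY = 0.1     # e, per-step death probability


def torus_distance(x1, y1, x2, y2):
    """Shortest distance between two points on the periodic arena."""
    dx = abs(x1 - x2)
    dy = abs(y1 - y2)
    return math.hypot(min(dx, EXTENT - dx), min(dy, EXTENT - dy))


def step(plants, rng):
    """Advance one time step; plants is a list of (x, y, age) tuples."""
    # Assumed ordering: mortality and ageing first, then reproduction.
    plants = [(x, y, age + 1) for x, y, age in plants
              if rng.random() >= MORTALITY]
    parents = [p for p in plants if p[2] > 1 and rng.random() < P_REPRODUCE]
    for px, py, _age in parents:
        for _seed in range(N_SEEDS):
            dist = rng.expovariate(RATE)
            theta = rng.uniform(0.0, 2.0 * math.pi)
            sx = (px + dist * math.cos(theta)) % EXTENT
            sy = (py + dist * math.sin(theta)) % EXTENT
            crowded = any(torus_distance(sx, sy, x, y) < MIN_DIST
                          for x, y, _ in plants)
            if not crowded:
                plants.append((sx, sy, 0))
    return plants


def presence_absence(plants, n_cols, n_rows):
    """Mark each grain of an n_rows x n_cols partition as occupied (1) or empty (0)."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _ in plants:
        col = min(int(x / EXTENT * n_cols), n_cols - 1)
        row = min(int(y / EXTENT * n_rows), n_rows - 1)
        grid[row][col] = 1
    return grid


def occupancy(grid):
    """Proportion of occupied grains."""
    cells = [cell for row in grid for cell in row]
    return sum(cells) / len(cells)


rng = random.Random(1)
plants = [(EXTENT / 2, EXTENT / 2, 2)]  # assumed start: one mature plant
for _ in range(8):
    plants = step(plants, rng)
fine = presence_absence(plants, 40, 40)    # resolution of figure 2B
coarse = presence_absence(plants, 20, 20)  # resolution of figure 2C
print(len(plants), occupancy(fine), occupancy(coarse))
```

Because every 20×20 grain is the union of four 40×40 grains, the coarse occupancy can never be smaller than the fine one, whatever the random seed.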

2.3 Local Indicators of Spatial Autocorrelation

A specific field exists in ecology, namely spatial analysis, that focuses on the non-random, aggregated (contagious) spatial distributions of species [43], i.e. the spatial heterogeneity of species distributions. Two factors can cause such non-randomness in nature: spatial heterogeneity in habitat [44] and the nonlinearity of biological processes, such as density-dependent growth and dispersal rates [45]. The spatial heterogeneity produced by the IBM above belongs to the latter scenario. In the past several decades, different aggregation indices and measures have been developed, focusing on different types of data in spatial analysis [15, 16]. Depending on the degree of spatial information incorporated in the measure, three categories of spatial statistics can be distinguished: spatially implicit, semi-explicit and explicit. Spatially implicit indices measure statistical heterogeneity based on the mean and variance of the number of individuals in samples (cells) and were widely used in the early literature; examples are the coefficient of diffusion [46], Morisita's $I_M$ [47], Lloyd's $I_L$ [48], the clumping parameter k in the negative binomial distribution [49] and the exponent b in Taylor's power law [10]. With the focus of research on species distributions shifting from statistical heterogeneity to real spatial structure, spatially semi-explicit (spatial autocorrelation) indices were developed. These spatially semi-explicit measures describe patterned variance [12] and are formally known as local indicators of spatial autocorrelation (LISA) statistics [16]. Indices, to list a few, include Ripley's K function [50], Moran's I [51] and correlograms [52]. Most recently, spatially explicit indices have also been developed, based on the distance that individuals would have to be moved to achieve a random distribution. One commonly used measure is Perry's [53, 54] spatial analysis by distance indices (SADIE).

Fig. 2. The spatial pattern generated from the individual-based model. A, a total of 541 individuals were generated after 20 time steps. B, the presence-absence map at a resolution of 40×40 grains. C, the map for 20×20 grains. D, the map for 40×10 grains. See text.

The reasons for choosing a specific LISA index in this chapter are threefold. First, spatially implicit indices in fact do not depict the spatial structure of species distribution, and thus cannot be used in the spatial scaling analysis that is related to the MAUP. Second, spatially explicit indices still rely largely on computer programming and thus make mathematical analysis difficult or impossible. Finally, the local spatial structures that are described by the LISA statistics reflect distance (or scale) dependent patterns, which are analogous to the scaling issue of the MAUP. It is exactly for this reason that LISA statistics are suitable here. Normally, a LISA index has two versions: a global and a local version. For example, the global index of the widely used Moran's I [43, 51] is,

$$
I(d) = \frac{\dfrac{1}{W(d)} \sum_{i=1, i \ne j}^{n} \sum_{j=1, j \ne i}^{n} w_{ij}(d)\,(x_i - \bar{x})(x_j - \bar{x})}{\dfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\tag{1}
$$

where $w_{ij}(d)$ (= 1 or 0) is the distance class d-connectivity between cells (samples) i and j; $x_i$ and $x_j$ are the values (here the number of individuals) at locations i and j, and $\bar{x}$ is their mean over all n cells; $W(d)$ is the sum of $w_{ij}(d)$. The local Moran's $I_i$ is simply the same index but for a given cell i,

$$
I_i(d) = \frac{x_i - \bar{x}}{\dfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{j=1, j \ne i}^{n} w_{ij}(d)\,(x_j - \bar{x}).
\tag{2}
$$

Of course, Moran's I is still too complicated to yield any analytic results. For this reason, we choose a simplified global LISA index, the join-count statistics, in the following analysis (see below for details), which allows us to reach a reasonable approximation of, and good analytic power for, the scaling pattern of species distributions, as well as the MAUP.
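For readers who want to see the global index of equation (1) in action, the sketch below evaluates it on a gridded map. It is a simplified illustration under stated assumptions: the connectivity $w_{ij}(d)$ is taken to be 1 for cells sharing an edge (rook neighbours, no wrap-around) and 0 otherwise, and the function name `global_morans_i` is hypothetical.

```python
import numpy as np


def global_morans_i(values):
    """Global Moran's I on a 2-D grid with rook (edge-sharing) neighbours."""
    x = np.asarray(values, dtype=float)
    dev = x - x.mean()
    # Sum of dev_i * dev_j over ordered neighbour pairs: every horizontal or
    # vertical pair is counted twice (i -> j and j -> i).
    cross = 2.0 * ((dev[:, :-1] * dev[:, 1:]).sum()
                   + (dev[:-1, :] * dev[1:, :]).sum())
    n_pairs = 2.0 * (dev[:, :-1].size + dev[:-1, :].size)  # W(d)
    variance = (dev ** 2).mean()                           # (1/n) * sum (x_i - mean)^2
    return (cross / n_pairs) / variance


# A checkerboard is perfectly over-dispersed; two solid halves are clustered.
checker = np.indices((4, 4)).sum(axis=0) % 2
halves = np.array([[1, 1, 0, 0]] * 4)
print(global_morans_i(checker), global_morans_i(halves))
```

A map with no variation (all cells equal) makes the denominator zero, so the index is undefined there and the function returns `nan`.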

2.4 Bayesian Estimation

Pair Approximation and Join-Count Statistics

A closely related field in spatial ecology is metapopulation ecology, which studies the spatiotemporal dynamics of binary (presence-absence) maps based on Levins' patch occupancy model [55, 56]. The Levins model assumes not only an infinite number of habitat patches but also that colonization is insensitive to distance. Since movements of most organisms are restricted in space, and hence all patches in a large network are not likely to be equally accessible from a given patch, these assumptions seem contradictory [57]. However, what really underlies these assumptions is that the model assumes all patches are equally connected to other patches, which is called the mean-field assumption. Although this assumption is at the heart of many ecological theories, it ignores much of what is important about the dynamics of ecological interactions. Ecological interactions such as predation, resource competition, parasitism, epidemic transmission, and reproduction often occur at spatial scales much smaller than that of the whole population [58]. The dispersal and colonization of migrants in metapopulations are certainly local processes in space, and hence the distribution cannot be described by the mean-field approximation.

The most powerful approaches to modeling spatially structured population dynamics and local processes in ecology are the lattice or cellular automaton models, which have been widely applied to research on metapopulation dynamics and more general questions of spatial ecology [59]. These spatially explicit simulation models can be analyzed by a useful approach, called pair approximation, introduced to ecological research by Matsuda et al. [60]. This approximation, originally from statistical physics [61, 62], has been applied to many models of the population dynamics of plants [63, 64, 65]. These two approaches are thus especially suited to analyzing spatial patterns such as those in figure 2B, C and D.

In the spatial autocorrelation analysis by experimental and landscape ecologists, as mentioned above, within the group of LISA indices [17], the simplest index to describe the spatial pattern is the join-count statistics [21], which is also conceptually and mathematically similar to the pair-approximation (or moment approximation) approach [19, 58].

If we only consider two states, presence and absence, instead of the exact number of individuals in a cell, we have four states with regard to the focal cell and its randomly chosen neighbor: an occupied cell with a neighbor that is also occupied, $q_{+/+}$; an occupied cell with an empty neighboring patch, $q_{0/+}$; an empty cell with an occupied neighbor, $q_{+/0}$; an empty cell with an empty neighbor, $q_{0/0}$. In fact we only need two variables ($p_+$, $q_{+/+}$) to express all the other join-count statistics:

$$ p_0 = 1 - p_+ \tag{3} $$

$$ q_{0/+} = 1 - q_{+/+} \tag{4} $$

$$ q_{+/0} = \frac{(1 - q_{+/+})\,p_+}{1 - p_+} \tag{5} $$

$$ q_{0/0} = \frac{1 - 2p_+ + q_{+/+}\,p_+}{1 - p_+} \tag{6} $$

There is still an inequality controlling the balance between $p_+$ and $q_{+/+}$: $0 \le p_+ \le 1$ and $2 - 1/p_+ \le q_{+/+} \le 1$ [20]. These are the join-count statistics, the simplest indices for describing the observed spatial pattern. If $q_{+/+} > p_+$, we have a spatially autocorrelated (aggregated) population. For the spatial pattern in figure 2B, we have $q_{+/+} = 0.3837 > p_+ = 0.2794$; for figure 2C, $q_{+/+} = 0.7589 > p_+ = 0.6325$. If $q_{+/+} = p_+$, we have a spatially random one. The ratio $q_{+/+}/p_+$ or the difference $q_{+/+} - p_+$ gives the degree of spatial clustering [21, 22]. Of course, LISA statistics can analyze spatial patterns in more than presence-absence data, and can therefore provide more information on spatial autocorrelation; for example, Moran's I index can describe the clustering of abundance in samples [66]. The following sections show how we predict the occupancy in figure 2C and D with only the knowledge from figure 2B using a Bayesian estimation.

Fig. 3. The occupancy and the spatial correlation by scaling-up
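The relations (3) to (6) and the aggregation criterion can be checked numerically. The sketch below is illustrative: the function names are hypothetical, and the grid estimator is one simple assumed choice (pooling all edge-sharing neighbour pairs, without wrap-around) rather than necessarily the procedure used for figure 2.

```python
import numpy as np


def estimate_from_grid(grid):
    """Estimate (p+, q+/+) from a binary map using rook neighbours, no wrap-around."""
    g = np.asarray(grid, dtype=bool)
    p_plus = g.mean()
    # Ordered neighbour pairs (focal -> neighbour) in which both cells are occupied.
    both = 2 * ((g[:, :-1] & g[:, 1:]).sum() + (g[:-1, :] & g[1:, :]).sum())
    # Ordered neighbour pairs whose focal cell is occupied.
    focal = (g[:, :-1].sum() + g[:, 1:].sum()
             + g[:-1, :].sum() + g[1:, :].sum())
    return float(p_plus), float(both / focal)


def join_counts(p_plus, q_pp):
    """Equations (3)-(6): the remaining join-count statistics from p+ and q+/+."""
    if not 0.0 < p_plus < 1.0:
        raise ValueError("p+ must lie strictly between 0 and 1")
    if q_pp < 2.0 - 1.0 / p_plus or q_pp > 1.0:
        raise ValueError("q+/+ violates 2 - 1/p+ <= q+/+ <= 1")
    return {
        "p0": 1.0 - p_plus,
        "q0/+": 1.0 - q_pp,
        "q+/0": (1.0 - q_pp) * p_plus / (1.0 - p_plus),
        "q0/0": (1.0 - 2.0 * p_plus + q_pp * p_plus) / (1.0 - p_plus),
    }


def pattern(p_plus, q_pp, tol=1e-9):
    """Classify a map as aggregated, segregated or random."""
    if q_pp > p_plus + tol:
        return "aggregated"
    if q_pp < p_plus - tol:
        return "segregated"
    return "random"


fig2b = join_counts(0.2794, 0.3837)   # values quoted for figure 2B
print(fig2b, pattern(0.2794, 0.3837))
```

The two conditional probabilities for an empty focal cell, $q_{+/0}$ and $q_{0/0}$, sum to one, which is a quick consistency check on equations (5) and (6).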

The Scale Problem

Using the join-count statistics, we can scale up the sample size (grain) [6]. For a cell in figure 2C, it is difficult to calculate the probability of presence directly, since the cell could hold anywhere from 1 to 541 individuals, which leads to a superimposed distribution of multiple binomial distributions [67, 68]. Since the presence probability is equal to one minus the absence probability, it is much easier to calculate the absence probability first. The absence probability $p_0$ and the correlation of two adjacent empty patches $q_{0/0}$ after scaling up (grain from a to 4a) will be,

$$ p_0(4a) = p_0(a) \times q_{0/0}(a)^2 \times b_0(a) \tag{7} $$

$$ q_{0/0}(4a) = q_{0/0}(a)^2 \times b_0(a)^2 \tag{8} $$

where a indicates the grain size (e.g. if the grain size of figure 2B is a, then that of figure 2C is 4a); $b_0(a)$ is the probability that a sample patch with two empty neighboring patches is itself empty. If we choose a Bayesian estimation (BYE) for $b_0(a)$, it will be,

$$
b_0(a) = \frac{q_{0/0}(a)^2 \times p_0(a)}{q_{0/0}(a)^2 \times p_0(a) + q_{0/+}(a)^2 \times p_+(a)}
\tag{9}
$$

According to probability rules, $p_+ = 1 - p_0$ and $q_{+/+} = 1 - (1 - q_{0/0})\,p_0/p_+$. In 2006, Hui, McGeoch and Warren [6] presented a formula governing the pattern of occupancy and spatial correlation with increasing scale,

$$ p_+(4a) = 1 - \frac{\Omega^4}{\mho} \tag{10} $$

$$ q_{+/+}(4a) = \frac{\Omega^{10} - 2\,\Omega^4 \mho^2 + \mho^3}{\mho^2 (\mho - \Omega^4)} \tag{11} $$

where $\Omega = p_0(a) - q_{0/+}(a)\,p_+(a)$ and $\mho = p_0(a)\,[1 - p_+(a)^2 (2 q_{+/+}(a) - 3) + p_+(a)(q_{+/+}(a)^2 - 3)]$.

An important result here is that the occupancy $p_+(a)$ and the spatial correlation $q_{+/+}(a)$ both tend to 1 as the grain increases (figure 3) [18], which means that the spatial distribution of species changes from aggregation to randomness with scaling-up ($p_+(a) = q_{+/+}(a)$). Using equations (10) and (11), as well as $p_+(a) = 0.2794$ and $q_{+/+}(a) = 0.3837$ for figure 2B, we obtain the occupancy and spatial correlation for the 20×20 resolution (figure 2C), $p_+(4a) = 0.6672$ and $q_{+/+}(4a) = 0.6849$, close to the observation ($p_+(4a) = 0.6325$ and $q_{+/+}(4a) = 0.7589$). The accuracy for occupancy (defined as 1 − Abs(Predicted − Observed)/Predicted, [6]) is 95% and for spatial correlation is 89%. Furthermore, equations (10) and (11) only have four variables: the occupancies and spatial correlations at grains a and 4a.
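The scale-up prediction can be sketched directly from equations (7) to (9), with the closed form of equations (10) and (11) as a cross-check. The function names below are hypothetical and the code is only an illustration of the formulas as stated here, using the figure 2B and 2C values quoted in the text.

```python
def scale_up_square(p_plus, q_pp):
    """Predict (p+(4a), q+/+(4a)) for a 2x2 merge from p+(a) and q+/+(a)."""
    p0 = 1.0 - p_plus
    q_0p = 1.0 - q_pp
    q_00 = (1.0 - 2.0 * p_plus + q_pp * p_plus) / p0
    # Equation (9): Bayesian estimate that a cell with two empty neighbours is empty.
    b0 = q_00**2 * p0 / (q_00**2 * p0 + q_0p**2 * p_plus)
    p0_4a = p0 * q_00**2 * b0      # equation (7)
    q00_4a = q_00**2 * b0**2       # equation (8)
    p_plus_4a = 1.0 - p0_4a
    q_pp_4a = 1.0 - (1.0 - q00_4a) * p0_4a / p_plus_4a
    return p_plus_4a, q_pp_4a


def scale_up_closed_form(p_plus, q_pp):
    """The same prediction through the closed form of equations (10) and (11)."""
    p0 = 1.0 - p_plus
    omega = p0 - (1.0 - q_pp) * p_plus
    mho = p0 * (1.0 - p_plus**2 * (2.0 * q_pp - 3.0) + p_plus * (q_pp**2 - 3.0))
    p_plus_4a = 1.0 - omega**4 / mho
    q_pp_4a = ((omega**10 - 2.0 * omega**4 * mho**2 + mho**3)
               / (mho**2 * (mho - omega**4)))
    return p_plus_4a, q_pp_4a


def accuracy(predicted, observed):
    """Accuracy as defined in the text: 1 - |predicted - observed| / predicted."""
    return 1.0 - abs(predicted - observed) / predicted


p_pred, q_pred = scale_up_square(0.2794, 0.3837)  # figure 2B inputs
print(p_pred, q_pred)
print(accuracy(p_pred, 0.6325), accuracy(q_pred, 0.7589))  # against figure 2C
```

Both routes give the same numbers up to rounding error, which is a useful check that the closed form and the stepwise equations agree. Inputs with $p_+$ equal to 0 or 1 are not handled and would divide by zero.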

What makes the BYE more powerful is that it can also predict the occupancy and the spatial correlation at fine scales from coarse-scale data. For example, under the condition of an aggregated spatial distribution (q+/+ > p+), we can solve for p+(a) and q+/+(a) given p+(4a) and q+/+(4a); the prediction for figure 2B is p+(a) = 0.4129 and q+/+(a) = 0.7418. Because fine-scale binary (presence-absence) data are more information-rich than coarse-scale data [69], scaling-up predictions of occupancy are bound to be more accurate than those obtained by scaling-down. Other tests using Drosophilidae species across the mesocosm arena also confirm the predictive capacity of equations (10) and (11) [6]. Thus, the BYE provides a preliminary solution to the scale problem in the MAUP.
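As a consistency check on the scaling-down prediction quoted above, the predicted fine-scale values can be mapped back up to the coarse grain. The sketch below does this through the Bayesian estimate of equation (9), assuming the 2×2 identities p0(4a) = p0 q0/0² b0 and q0/0(4a) = q0/0² b0²; the function name `scale_up_via_b0` is hypothetical. It is a forward check only, not a solver for the inverse problem.

```python
def scale_up_via_b0(p_plus, q_pp):
    """Map (p+, q+/+) at grain a to grain 4a via the Bayesian estimate b0."""
    p0 = 1.0 - p_plus
    q_0p = 1.0 - q_pp                  # q0/+: empty neighbor of an occupied cell
    q_00 = 1.0 - q_0p * p_plus / p0    # from q+/+ = 1 - (1 - q0/0) p0 / p+
    b0 = q_00**2 * p0 / (q_00**2 * p0 + q_0p**2 * p_plus)  # equation (9)
    p0_4a = p0 * q_00**2 * b0          # all four cells of the 2x2 block empty
    q00_4a = q_00**2 * b0**2           # neighboring 2x2 block empty as well
    p_plus_4a = 1.0 - p0_4a
    q_pp_4a = 1.0 - (1.0 - q00_4a) * p0_4a / p_plus_4a
    return p_plus_4a, q_pp_4a


# Fine-scale values predicted by scaling down (quoted for figure 2B).
p4, q4 = scale_up_via_b0(0.4129, 0.7418)
print(round(p4, 4), round(q4, 4))
```

The result should land close to the coarse-scale observations used as input to the scaling-down (p+(4a) = 0.6325 and q+/+(4a) = 0.7589).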

The Aggregation Problem

For the transect-shaped grain, as in figure 1C and figure 2D, we can calculate the probability of absence p0(4a) and the conditional probability q0/0(4a) according to the BYE [6],

$$ p_0(4a) = p_0(a)\, q_{0/0}(a)^3 \tag{12} $$

$$ q_{0/0}(4a) = q_{0/0}(a)\, b_0(a)^3 \tag{13} $$

$$ q_{0/0}(4a) = q_{0/0}(a)^4 \tag{14} $$

where equation (13) gives the conditional probability for a neighbor adjacent to the long edge and equation (14) for a neighbor adjacent to the short edge. Using a similar procedure, we have

[Figure 4: contour plot with regions labelled Unfeasible region, Aggregation, Segregation and Random.]

Fig. 4. Contour plot of the difference between the occupancy with the transect grain (as in figure 1C) and that with the square grain (as in figure 1B), p+(4a)transect − p+(4a)square. The unfeasible region indicates 2 − 1/p+ > q+/+; the aggregation region indicates p+ < q+/+; the segregation region indicates p+ > q+/+; and the line labelled random indicates p+ = q+/+.


$$ p_+(4a) = 1 - \frac{(1 - 2p_+(a) + p_+(a)\, q_{+/+}(a))^3}{(1 - p_+(a))^2} \tag{15} $$

The formula for q+/+(4a) is not shown here for conciseness but can easily be obtained using equations (13) and (14). According to equation (15), the occupancy for figure 2D is p+(4a) = 0.6823, which is comparable to the observation (p+(4a) = 0.65).

A quick test of the spatial distribution pattern can be carried out by comparing the occupancy in a chessboard map (as in figure 1B) with that in a transect grain map (as in figure 1C). For samples of similar area (grain), the occupancy will be larger in the longer-perimeter sample if the spatial distribution is aggregated, but smaller if it is segregated (see figure 4; [6]). In a real case, De Grave and Casey [70] reported variability in density estimates of intertidal, infaunal benthos due to the shape of the sampling device. They found that the density of most intertidal macrofauna is lower in rectangular samples than in square samples of similar area, but that the situation is reversed for Bathyporeia guilliamsoniana Bate. According to the above analysis, the reason might be that most species are aggregated in space, while B. guilliamsoniana has a segregated distribution owing to its high mobility. Therefore, the BYE further provides a solution to the aggregation (zoning) problem in the MAUP.
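The shape comparison described above can be sketched numerically. The code below evaluates the transect occupancy of equation (15) against the square-grain occupancy of equation (10) and checks the sign of their difference, which is the quantity contoured in figure 4. The function names are illustrative, and the segregated and random test inputs are made-up values chosen inside the feasible region rather than data from the chapter.

```python
def occupancy_transect(p_plus, q_pp):
    """Occupancy at grain 4a for a 1x4 transect grain, equation (15)."""
    return 1.0 - (1.0 - 2.0 * p_plus + p_plus * q_pp) ** 3 / (1.0 - p_plus) ** 2


def occupancy_square(p_plus, q_pp):
    """Occupancy at grain 4a for a 2x2 square grain, equation (10)."""
    p0 = 1.0 - p_plus
    omega = p0 - (1.0 - q_pp) * p_plus
    mho = p0 * (1.0 - p_plus**2 * (2.0 * q_pp - 3.0) + p_plus * (q_pp**2 - 3.0))
    return 1.0 - omega**4 / mho


def shape_effect(p_plus, q_pp):
    """Transect minus square occupancy at grain 4a."""
    return occupancy_transect(p_plus, q_pp) - occupancy_square(p_plus, q_pp)


print(round(occupancy_transect(0.2793, 0.3837), 4))  # figure 2B inputs
print(shape_effect(0.2793, 0.3837) > 0)  # aggregated: q+/+ > p+
print(shape_effect(0.5, 0.3) < 0)        # segregated: q+/+ < p+ (illustrative)
print(abs(shape_effect(0.3, 0.3)) < 1e-12)  # random: q+/+ = p+ (illustrative)
```

Under these assumptions the transect value for figure 2B should come out near the 0.6823 quoted in the text, and the sign of the difference should follow the aggregated, segregated and random cases as stated.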

2.5 Scaling Patterns

As demonstrated above, the iterative formula of the Bayesian solution calculates the species occupancy and spatial correlation after combining four neighboring cells into one new, larger grain, and thus fits the understanding of the species scaling pattern as a percolation process [71, 72]. Criticism of the formula mainly concerns the difficulty of calculation [73], as well as its restriction to linking species distributions at a grain of a only to those at a grain of 4a, i.e. its discrete nature. This section is intended to advance the understanding of the scaling pattern of species distribution based on the same rationale as the Bayesian estimate. This is achieved by deriving, through induction, new formulae for species occupancy and spatial correlation that rely on scale (grain size) as the only independent variable. Intriguingly, these formulae not only simplify the Bayesian estimate, but also generate shapes of species scaling patterns that are consistent with those from the intraspecific occupancy-abundance relationship (OAR) and the area-of-occupancy (AOO) models. The OAR, which describes the positive correlation between the abundance and occupancy of a species over time or across regions, is one of the most widely used descriptions of species distribution patterns [74, 72]. The AOO has been found to obey a log-log linear relationship across sample scales (grain sizes) [72]. Such a log-log linear AOO provides an instant link to the box-counting fractal dimension [72], and has therefore been argued to reveal the scale-invariant nature of the species distribution.


Fig. 5. The procedure for calculating the probability of absence in a (2 × 2)-cell (left block), p0(2 × 2), and the conditional probability that a randomly chosen neighbor of this (2 × 2)-cell is also absent, q0/0(2 × 2). Black arrows indicate conditional probabilities; white arrows indicate Bayesian estimates.

A spatial (presence-absence) map can be expressed as a binary matrix, M = <ρi,j>m×m, with the element ρi,j being either + or 0, indicating that cell (i, j) is occupied or empty, respectively. The dimensionality of the matrix, m×m, can normally be considered infinite. If we now combine n×n cells to form a new cell, the binary matrix becomes M = <ρi,j>m/n×m/n, with ρi,j = + indicating that at least one previous cell (or sub-cell) is occupied and ρi,j = 0 indicating that all n×n previous cells are empty. A solution to the spatial scaling of species distribution is to calculate the global density (occupancy) and the local density (spatial correlation) for various grain sizes. This can be achieved by first calculating the probabilities of absence p0(n × n) and q0/0(n × n), and then using the relationships between those probabilities, p+ = 1 − p0 and q+/+ = 1 − (1 − q0/0)p0/(1 − p0), to calculate p+(n × n) and q+/+(n × n).
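The coarse-graining step described above can be illustrated on a small map. The following sketch encodes the occupied state + as 1, merges n×n blocks into new cells, and measures occupancy and spatial correlation directly from the grid. The use of the four nearest (rook) neighbors and of non-periodic boundaries is an assumption made for this example, as are the function names and the toy 4×4 map.

```python
def coarse_grain(grid, n):
    """Merge n x n blocks of a square 0/1 grid; a block is occupied if any sub-cell is."""
    size = len(grid) // n
    return [
        [int(any(grid[i * n + di][j * n + dj] for di in range(n) for dj in range(n)))
         for j in range(size)]
        for i in range(size)
    ]


def occupancy_and_correlation(grid):
    """Return (p+, q+/+) for a square 0/1 grid using rook (4-cell) neighbors."""
    m = len(grid)
    occupied = sum(sum(row) for row in grid)
    pairs = 0  # ordered pairs (occupied cell, in-grid neighbor)
    joint = 0  # those pairs whose neighbor is occupied too
    for i in range(m):
        for j in range(m):
            if grid[i][j]:
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    a, b = i + di, j + dj
                    if 0 <= a < m and 0 <= b < m:
                        pairs += 1
                        joint += grid[a][b]
    p_plus = occupied / (m * m)
    q_pp = joint / pairs if pairs else float("nan")
    return p_plus, q_pp


# Toy presence-absence map (hypothetical data).
fine = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
coarse = coarse_grain(fine, 2)
print(coarse)
print(occupancy_and_correlation(fine))
print(occupancy_and_correlation(coarse))
```

Measured values of this kind are what the Bayesian estimate is compared against when its predictions are tested across grains.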

For n = 2, let the probability that a randomly selected cell is empty (ρi,j = 0) be p0, the probability of finding two empty neighbors (ρi,j+1 = 0 and ρi+1,j = 0) be $q_{0/0}^2$, and the probability that a shared neighbor of these two absent cells is also empty (ρi+1,j+1 = 0) be b0. We then easily obtain $p_0(2 \times 2) = p_0 \, q_{0/0}^2 \, b_0$ and $q_{0/0}(2 \times 2) = q_{0/0}^2 \, b_0^2$ according to the diagram in figure 5.

Fig. 6. The procedures for calculating the probability of absence in a (3 × 3)-cell. Black arrows indicate the order of adding sub-cells using A, a spiral procedure; B, an alternative procedure; and C, a zigzag procedure. See text for details.

For n ≥ 3, the probabilities p0(n × n) and q0/0(n × n) can be calculated according to different procedures of combining n × n cells into a new, larger grain. This is essentially similar to the mathematical problem of the seven bridges of Königsberg. For example, when n = 3, it is possible to calculate p0(3 × 3) and q0/0(3 × 3) using a spiral method of adding neighbors together to form a larger grain (figure 6A). Accordingly, we have $p_0(3 \times 3) = p_0 \, q_{0/0}^6 \, b_0 \, g_0$ and $q_{0/0}(3 \times 3) = q_{0/0}^6 \, b_0 \, k_0 \, g_0$, where g0 is the probability that a shared neighbor of four absent cells is empty, and k0 is the probability that a shared neighbor of three absent cells is empty. These two extra probabilities, g0 and k0, also need to be estimated in this spiral procedure of scaling-up. Notably, this is not the only procedure for combining nine cells into a larger grain. Figure 6B presents another procedure for calculating p0(3 × 3) and q0/0(3 × 3), which gives $p_0(3 \times 3) = p_0 \, q_{0/0}^4 \, b_0^4$ and $q_{0/0}(3 \times 3) = q_{0/0}^4 \, b_0^4 \, k_0$. The number of procedures for calculating p0(n × n) and q0/0(n × n) becomes extremely large for n >> 3 (similar to the number of Euler walks). To solve this problem, and also to avoid calculating extra probabilities, a zigzag method of adding cells together is used here for calculating p0(n × n) and q0/0(n × n). For example, following this zigzag procedure (figure 6C), we have $p_0(3 \times 3) = p_0 \, q_{0/0}^4 \, b_0^4$ and $q_{0/0}(3 \times 3) = q_{0/0}^3 \, b_0^6$. After examining several more cases (e.g. n = 4 and 5), formulae can easily be induced for the general case:

$$ p_0(n \times n) = p_0 \, q_{0/0}^{2(n-1)} \, b_0^{(n-1)^2}, \tag{16} $$

$$ q_{0/0}(n \times n) = q_{0/0}^{n} \, b_0^{n(n-1)}. \tag{17} $$
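The induced formulae can be checked against the small cases worked out in the text. The sketch below implements equations (16) and (17) directly; the function name and the numerical inputs are illustrative values, not data from the chapter.

```python
def absence_nxn(p0, q00, b0, n):
    """Equations (16) and (17): absence probabilities for an n x n grain."""
    p0_n = p0 * q00 ** (2 * (n - 1)) * b0 ** ((n - 1) ** 2)
    q00_n = q00 ** n * b0 ** (n * (n - 1))
    return p0_n, q00_n


# Illustrative fine-scale values (hypothetical).
p0, q00, b0 = 0.7, 0.8, 0.9
for n in (1, 2, 3):
    print(n, absence_nxn(p0, q00, b0, n))
```

For n = 1 the function returns the fine-scale values unchanged, for n = 2 it reduces to the expressions obtained from figure 5, and for n = 3 it matches the zigzag procedure of figure 6C.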

Since n × n indicates the size of the new grain (denoted as a), we have the scaling pattern of absence,

$$ p_0(a) = b_0^{a} \left( \frac{p_0 \, b_0}{q_{0/0}^2} \right) \left( \frac{q_{0/0}}{b_0} \right)^{2a^{1/2}}, \tag{18} $$

$$ q_{0/0}(a) = b_0^{a} \left( \frac{q_{0/0}}{b_0} \right)^{a^{1/2}}. \tag{19} $$

Accordingly, we obtain the following scaling patterns of occupancy and spatial correlation,

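$$ p_+(a) = 1 - \theta \, \beta^{2a^{1/2}} \, \delta^{a}, \tag{20} $$

$$ q_{+/+}(a) = p_+(a) + \frac{(\theta^{-1} \beta^{-a^{1/2}} - 1)(1 - p_+(a))^2}{p_+(a)}, \tag{21} $$

where $\theta = p_0 \, b_0 / q_{0/0}^2$, $\beta = q_{0/0}/b_0$ and $\delta = b_0$ are model parameters.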

This scaling pattern describes species occupancy and spatial correlation as a function of the spatial scale a, and elucidates the percolation process of non-random structure when scaling up.
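A short numerical sketch can make the scaling pattern concrete. The code below evaluates equations (20) and (21) for a few grain sizes. The function name and the parameter values are illustrative assumptions, and the random-structure case is included because it reduces to p+(a) = 1 − δ^a.

```python
import math


def scaling_pattern(p0, q00, b0, a):
    """Equations (20) and (21): occupancy and spatial correlation at grain a."""
    theta = p0 * b0 / q00**2
    beta = q00 / b0
    delta = b0
    root_a = math.sqrt(a)
    p_plus = 1.0 - theta * beta ** (2.0 * root_a) * delta**a
    q_pp = p_plus + (1.0 / (theta * beta**root_a) - 1.0) * (1.0 - p_plus) ** 2 / p_plus
    return p_plus, q_pp


# Illustrative fine-scale absence probabilities (hypothetical).
p0, q00, b0 = 0.7, 0.8, 0.9
for a in (1, 4, 9, 16):
    print(a, scaling_pattern(p0, q00, b0, a))

# Random structure: q0/0 = b0 = p0, so theta = beta = 1 and p+(a) = 1 - delta**a.
p_rand, q_rand = scaling_pattern(0.9, 0.9, 0.9, 4)
print(p_rand, 1.0 - math.exp(math.log(0.9) * 4))
```

At a = 1 the function returns the fine-scale occupancy and spatial correlation, and at a = 4 it agrees with the 2×2 expressions, which is a quick way to confirm that the continuous form is consistent with the discrete derivation.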

The scaling patterns of species distribution are illustrated in figure 7. First, if we classify the spatial structure into three categories, aggregation (q+/+ > p+), randomness (q+/+ = p+) and segregation (q+/+ < p+) [6, 20], the spatial structure remains in the same category as the scale changes (figure 7A). With the increase of grain a, occupancy p+(a) increases monotonically, whereas the spatial correlation q+/+(a) decreases at first but ultimately increases at the same rate as p+(a), i.e. it converges to randomness. The watershed threshold at which q+/+(a) changes from declining to increasing (the dashed line in figure 7A) cannot be solved analytically, but an approximate numerical solution can be obtained as an ellipse, $(p_+ - 1)^2/1.14^2 + q_{+/+}^2/0.89^2 = 1$, after a rotation of π/4. This threshold is also insensitive to the initial values of p0 and q0/0. It is worth noting that the spatial correlation is not perfectly synchronized with the rate of change of occupancy, dp+(a)/da; generally speaking, however, a lower spatial correlation does correspond to a higher rate of change of occupancy, i.e. the scale-dependence of occupancy becomes strong when the spatial correlation is weak [6]. Second, the overall saturating shape and the S-shape (especially obvious for highly aggregated patterns, q+/+ >> p+) of occupancy scaling are consistent with the results from OAR models [72]. Furthermore, if the spatial structure is random, it is easy to obtain $p_+(a) = 1 - e^{-\lambda a}$ (where λ = −ln(δ)), which is the Poisson OAR for randomness (e.g. [74, 72]). Finally, a quasi-power law holds for the scaling pattern of occupancy over about two orders of magnitude of scale (figure 7B), which is consistent with the log-log linear AOO reported in the literature [75]. However, such a quasi-power law does not reflect a strictly self-similar nature of species distributions. A further test of the spatial structure reveals the scale-dependence of spatial correlation (figure 7C). A power-law form of AOO does not explain how the occupied cells are distributed and, therefore, does not necessarily imply scale-invariance (self-similarity) of the spatial distribution; i.e. fractal objects always lead to power-law scaling, but not vice versa. The approach presented here facilitates accurate, cost-efficient estimation of occupancy, and provides a comprehensive approach towards modelling species distributions.


Fig. 7. Scaling patterns of species occupancy and spatial correlation. A, Several trajectories of p+(a) and q+/+(a) in a parametric plot, which all lead to the top right with the increase of grain a. The dashed line indicates the watershed threshold of q+/+(a), where it changes from declining to increasing with the increase of grain a. B, Scaling patterns of occupancy, with p+ = 0.01 and q+/+ = 0, 0.1, 0.2, ..., 0.9 (from top to bottom curves). C, Scaling patterns of spatial correlation, with the same initial values as in B.

3 Conclusion

From the above analysis, the following propositions are brought forward. (1: For the scale problem) With the increase of grain size, the occupancy p+ and the spatial correlation q+/+ both increase. The BYE can largely explain the trajectories of these two variables across scales. (2: For the scale problem) The accuracy of prediction of the BYE for scaling-up is higher than for scaling-down; the accuracy of prediction of the BYE for occupancy, p+ (a first-order description of the spatial distribution), is higher than for the spatial correlation, q+/+ (a second-order description of the spatial distribution). (3: For the aggregation/zoning problem) Under the same grain size, the occupancy prediction is sensitive to the irregularity (or length of the perimeter) of the grain as well as to the spatial structure (the intensity of aggregation, randomness and segregation; e.g. as described by join-count statistics). If the spatial structure is aggregated (p+ < q+/+), occupancy estimates with longer-perimeter grains will have higher values. If the spatial structure is random (p+ = q+/+), occupancy estimates will not be affected by the shape of the grain. If the spatial structure is segregated (p+ > q+/+), occupancy estimates with longer-perimeter grains will have lower values than those with shorter-perimeter grains. Thus, the BYE provides a preliminary solution to the MAUP for presence-absence data.

The above analysis shows that it will be possible in the near future to provide a comprehensive solution to the MAUP. Studies on the scaling pattern of species distribution present a framework for the scale problem in the MAUP. Hui and McGeoch [72] have reviewed different occupancy-abundance models as well as the scaling pattern of occupancy. Because of the essential links between presence-absence data and the abundance (or intensity) of spatial patterns, slotting those more general models into the BYE framework will eventually resolve the scale problem of the MAUP. Moreover, the solution to the scale problem will also benefit conservation science. Kunin [75] gave a log-log linear area-of-occupancy (i.e. the relationship between occupancy and grain size), which has been used to predict biodiversity trends [76]. Hartley and Kunin [77] suggested that the solution to the scale problem can also be used to estimate the abundance of focal species based only on presence-absence data, or even on occupancy alone. Such techniques will surely improve the efficiency of conservation management. Furthermore, He and Hubbell [71] have explored the effect of the perimeter length of the grain on occupancy and abundance estimates. Their findings are consistent with the analysis of the aggregation problem in the MAUP presented here. The solution to the aggregation problem will help us to understand ecological fallacy and sampling artifacts. In conclusion, the BYE, together with those advances, will contribute to building a comprehensive solution to the Modifiable Areal Unit Problem.

4 Future Directions

Research should focus on the following with regard to solving the MAUP in spatial analysis. First, a scale-free index for the description of spatial patterns is urgently needed. Up to now, as outlined in this chapter, spatial statistics and indices are largely scale-dependent. Even though a few indices, such as Shannon’s information and entropy index, have been claimed to be scale-free, this is questionable under stringent testing. Furthermore, although specific scale-dependent statistics and indices might be especially valuable for case studies, they contribute little to solving the scale problem as a whole. Second, more effort should be put into finding proper measurements of irregular shapes (e.g. the width-to-length ratio here). Studies have revealed a strong correlation between grain irregularity and the intensity of the zoning effect [6, 71]. Fractality might shed light on this issue, as its original purpose is to describe the irregularity of fractal, self-similar objects. Finally, the MAUP is not simply a problem but an impetus for unveiling the spatial character of natural systems. Fotheringham [78] has suggested shifting spatial analysis towards relationships that focus on rates of change (see also [24]): “Can we acquire information on the rate of change in variables and relationships of interest with respect to scale?” This is exactly the focus of this study, i.e. seeking the amount of change when the grain shifts from a to 4a. Furthermore, Hui and McGeoch [72] indicated that the scale problem is ultimately a percolation process. Linking with knowledge from other nonlinear sciences will surely help to overcome the MAUP in spatial analysis.

Surprisingly, the analysis in this chapter can also be used in the analysis of co-variance and association, which is not part of the MAUP. Species association belongs to the compositional variance, measuring the degree of co-occurrence and co-distribution of two species in samples [3, 5], the scaling pattern of which is closely linked with the mechanism behind beta diversity and the species-area relationship [3, 4, 5, 79]. For instance, considering two species, there exist four scenarios for a randomly chosen cell: species A and B coexist, $P_{A \cap B}(a)$ (also called the joint occupancy); only species A occurs, $P_{A \cap \bar{B}}(a)$; only species B occurs, $P_{\bar{A} \cap B}(a)$; neither exists, $P_{\bar{A} \cap \bar{B}}(a)$. Note that $P_A(a)$ and $P_B(a)$ have the same meanings as $p_+(a)_A$ and $p_+(a)_B$, respectively. Similarly, we can define a positive association of species A and B as $P_{A \cap B}(a) > P_A(a) \times P_B(a)$, a negative association as $P_{A \cap B}(a) < P_A(a) \times P_B(a)$, and the independence of the distributions of these two species as $P_{A \cap B}(a) = P_A(a) \times P_B(a)$. This definition is also consistent with the study of null models for species co-occurrence [80, 81]. Therefore, the same framework developed in this chapter can also be used to calculate the scaling pattern of association and co-variance. Further exploration of the inclusion of spatial autocorrelation structure in spatial models should be productive. Indeed, the fields of spatial analysis [12, 43] and modelling [58] have developed largely independently to date. Integration of these two approaches, as demonstrated in this chapter, is likely to result in significant advances towards the development of a general spatial framework for understanding non-random phenomena in nature, and surely warrants further attention.
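The association criterion defined above is straightforward to evaluate on a pair of presence-absence maps. The following sketch compares the joint occupancy with the product of the two individual occupancies; the function name, the 0/1 encoding and the toy maps are assumptions made for illustration only.

```python
import math


def association(grid_a, grid_b):
    """Classify the association of two species from same-shaped 0/1 grids."""
    cells = [(x, y) for row_a, row_b in zip(grid_a, grid_b) for x, y in zip(row_a, row_b)]
    total = len(cells)
    p_a = sum(x for x, _ in cells) / total
    p_b = sum(y for _, y in cells) / total
    p_ab = sum(1 for x, y in cells if x and y) / total  # joint occupancy
    if math.isclose(p_ab, p_a * p_b):
        return "independent"
    return "positive" if p_ab > p_a * p_b else "negative"


# Toy maps (hypothetical data).
top_rows = [[1, 1], [0, 0]]
bottom_rows = [[0, 0], [1, 1]]
left_column = [[1, 0], [1, 0]]
print(association(top_rows, top_rows))
print(association(top_rows, bottom_rows))
print(association(top_rows, left_column))
```

Repeating the classification after coarse-graining both maps would give one way to trace how association changes with grain size, in the spirit of the scaling analysis above.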

References

1. Levin, S.A.: The problem of pattern and scale in ecology. Ecology 73, 1943–1967 (1992)

2. Plimak, L.I., Walls, D.F.: Nonclassical spatial and momentum distributions in a Bose-condensed gas. Phys. Rev. A 54, 652–655 (1996)

3. Whittaker, R.H.: Evolution and measurement of species diversity. Taxon 21, 213–251 (1972)

4. Hui, C.: On species-area and species accumulation curves: a comment on Chong and Stohlgren’s index. Biol. Indic. 8, 327–329 (2008)


5. Hui, C., McGeoch, M.A.: Does the self-similar species distribution model lead to unrealistic predictions? Ecology 89, 2946–2952 (2008)

6. Hui, C., McGeoch, M.A., Warren, M.: A spatially explicit approach to estimating species occupancy and spatial correlation. J. Anim. Ecol. 75, 140–147 (2006)

7. Openshaw, S.: The modifiable areal unit problem. Geo Books, Norwich (1984)

8. Unwin, D.J.: GIS, spatial analysis and spatial statistics. Prog. Human Geogr. 20, 540–551 (1996)

9. Burger, O., Todd, L.: Grain, extent, and intensity: the components of scale in archaeological survey. In: Lock, G., Molyneaux, B.L. (eds.) Confronting scale in archaeology: issues of theory and practice, pp. 235–255. Springer, New York (2006)

10. Taylor, L.R.: Aggregation, variance and the mean. Nature 189, 732–735 (1961)

11. Ratcliffe, J.H., McCullagh, M.J.: Hotbeds of crime and the search for spatial accuracy. Geogr. Sys. 1, 385–395 (1999)

12. Wiens, J.A.: Ecological heterogeneity: ontogeny of concepts and approaches. In: Hutchings, M.J., Jones, E.A., Stewart, A.J.A. (eds.) The ecological consequences of environmental heterogeneity, pp. 9–31. Blackwell Science, Oxford (2000)

13. Li, H., Reynolds, J.F.: On definition and quantification of heterogeneity. Oikos 73, 280–284 (1995)

14. Burrough, P.A., McDonnell, R.A.: Principles of geographical information systems. Oxford Univ. Press, Oxford (1998)

15. Perry, J.N., Liebhold, A.M., Rosenberg, M.S., Dungan, J.L., Miriti, M., Jakomulska, A., Citron-Pousty, S.: Illustrations and guidelines for selecting statistical methods for quantifying spatial pattern in ecological data. Ecography 25, 578–600 (2002)

16. Dungan, J.L., Perry, J.N., Dale, M.R.T., Legendre, P., Citron-Pousty, S., Fortin, M.J., Jakomulska, A., Miriti, M., Rosenberg, M.S.: A balanced view of scale in spatial statistical analysis. Ecography 25, 626–640 (2002)

17. Anselin, L.: Local indicators of spatial association. Geogr Analysis 27, 93–116 (1995)

18. Hui, C.: Crossing the borders of spatial analysis and modelling: a rethink. In: Kelly, J.T. (ed.) Progress in Mathematical Biology Research, pp. 170–197. Nova Science, Hauppauge (2008)

19. Sato, K., Iwasa, Y.: Pair approximation for lattice-based ecological models. In: Dieckmann, U., Law, R., Metz, J.A.J. (eds.) The geometry of ecological interactions: simplifying spatial complexity, pp. 341–359. Cambridge Univ. Press, Cambridge (2000)

20. Hui, C., Li, Z.: Distribution patterns of metapopulation determined by Allee effects. Popul. Ecol. 46, 55–63 (2004)

21. Fortin, M.J., Dale, M.R.T., ver Hoef, J.: Spatial analysis in ecology. In: El-Shaarawi, A.H., Piegorsch, W.W. (eds.) Encyclopedia of environmetrics, pp. 2051–2058. Wiley and Sons, New York (2002)

22. Hui, C., McGeoch, M.A.: Spatial patterns of prisoner’s dilemma game in metapopulations. Bull. Math. Biol. 69, 659–676 (2007)

23. Gehlke, C., Biehl, K.: Certain effects of grouping upon the size of the correlation coefficient in census tract material. J. Am. Stat. Assoc. 29, 169–170 (1934)

24. Jelinski, D.E., Wu, J.: The modifiable areal unit problem and implications for landscape ecology. Land Ecol. 11, 129–140 (1996)


25. Fotheringham, A.S., Wong, D.W.S.: The modifiable areal unit problem in multivariate statistical-analysis. Environ. Plan A 23, 1025–1044 (1991)

26. Dorling, D.: The visualization of local urban change across Britain. Environ. Plan B 22, 269–290 (1995)

27. Amrhein, C.G.: Searching for the elusive aggregation effect - Evidence from statistical simulations. Environ. Plan A 27, 105–119 (1995)

28. Dark, S.J., Bram, D.: The modifiable areal unit problem (MAUP) in physical geography. Prog. Phys. Geogr. 31, 471–479 (2007)

29. Downey, L.: Using geographic information systems to reconceptualize spatial relationships and ecological context. Am. J. Soc. 112, 567–612 (2006)

30. Flowerdew, R., Manley, D., Steel, D.: Scales, levels and processes: Studying spatial patterns of British census variables. Comp. Environ. Urban. Sys. 30, 2143–2160 (2006)

31. Lery, B.: A comparison of foster care entry risk at three spatial scales. Subs Use Misuse 43, 223–237 (2008)

32. Sexton, K., Waller, L.A., McMaster, R.B., Maldonado, G., Adgate, J.L.: The importance of spatial effects for environmental health policy and research. Human Ecol. Risk Ass. 8, 109–125 (2002)

33. Lembo, A.J., Lew, M.Y., Laba, M., Baveye, P.: Use of spatial SQL to assess the practical significance of the modifiable areal unit problem. Comp. Geosci. 32, 270–274 (2006)

34. Wong, D.W.S.: Spatial decomposition of segregation indices: A framework toward measuring segregation at multiple levels. Geogra. Anal. 35, 179–194 (2003)

35. Harrison, J.A., Allan, D.G., Underhill, L.G., Herremans, M., Tree, A.J., Parker, V., Brown, C.J.: The atlas of Southern African birds. BirdLife South Africa, Johannesburg (1997)

36. Fielding, A.H., Bell, J.F.: A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Cons. 24, 38–49 (1997)

37. Kadmon, R., Farber, O., Danin, A.: A systematic analysis of factors affecting the performance of climate envelope models. Ecol. Appl. 13, 853–867 (2003)

38. Hui, C., Li, Z.: Dynamical complexity and metapopulation persistence. Ecol. Model 164, 201–209 (2003)

39. Hui, C., Yue, D.: Niche construction and polymorphism maintenance in metapopulations. Ecol. Res. 20, 115–119 (2005)

40. Levin, S.A., Grenfell, B., Hastings, A., Perelson, A.S.: Mathematical and computational challenges in population biology and ecosystem science. Science 275, 334–343 (1997)

41. Hui, C., McGeoch, M.A.: Evolution of body size, range size, and food composition in a predator-prey metapopulation. Ecol. Complex 3, 148–159 (2006)

42. Hui, C., Li, Z., Yue, D.X.: Metapopulation dynamics and distribution, and environmental heterogeneity induced by niche construction. Ecol. Model 177, 107–118 (2005)

43. Fortin, M.J., Dale, M.R.T.: Spatial analysis: a guide for ecologists. Cambridge Univ. Press, Cambridge (2005)

44. Fahrig, L., Nuttle, W.K.: Population ecology in spatial heterogeneous environments. In: Lovett, G.M., Jones, C.G., Turner, M.G., Weathers, K.C. (eds.) Ecosystem function in heterogeneous landscapes, pp. 95–118. Springer, Berlin (2005)


45. Pacala, S.W., Levin, S.A.: Biologically generated spatial pattern and the coexistence of competing species. In: Tilman, D., Kareiva, P. (eds.) Spatial ecology: the role of space in population dynamics and interspecific interactions, pp. 204–232. Princeton Univ. Press, Princeton (1997)

46. Downing, J.A.: Biological heterogeneity in aquatic ecosystems. In: Kolasa, J., Pickett, S.T.A. (eds.) Ecological heterogeneity, pp. 160–180. Springer, Berlin (1991)

47. Morisita, M.: Iδ-index, a measure of dispersion of individuals. Res. Popul. Ecol. 4, 1–7 (1962)

48. Lloyd, M.: Mean crowding. J. Anim. Ecol. 36, 1–30 (1967)

49. Bliss, C.I., Fisher, R.A.: Fitting the negative binomial distribution to biological data. Biometrics 9, 176–200 (1953)

50. Ripley, B.D.: Spatial statistics. Wiley, New York (1981)

51. Moran, P.A.P.: Notes on continuous stochastic phenomena. Biometrika 37, 17–23 (1950)

52. Geary, R.C.: The contiguity ratio and statistical mapping. Incorp. Stat. 5, 115–145 (1954)

53. Perry, J.N.: Spatial analysis by distance indices. J. Anim. Ecol. 64, 303–314 (1995)

54. Perry, J.N.: Measures of spatial pattern for counts. Ecology 79, 1008–1017 (1998)

55. Levins, R.: Some demographic and genetic consequences of environmental heterogeneity for biological control. Bull. Entomol. Soc. Am. 15, 237–240 (1969)

56. Hanski, I.: Metapopulation dynamics. Nature 396, 41–49 (1998)

57. Hanski, I.: Metapopulation ecology. Oxford Univ. Press, Oxford (1999)

58. Dieckmann, U., Law, R., Metz, J.A.J.: The geometry of ecological interactions: simplifying spatial complexity. Cambridge Univ. Press, Cambridge (2000)

59. Tilman, D., Kareiva, P.: Spatial ecology: the role of space in population dynamics and interspecific interactions. Princeton Univ. Press, Princeton (1997)

60. Matsuda, H., Ogita, A., Sasaki, A., Sato, K.: Statistical mechanics of population: the lattice Lotka-Volterra model. Prog. Theor. Phys. 88, 1035–1049 (1992)

61. Katori, M., Konno, N.: Upper bounds for survival probability of the contact process. J. Stat. Phys. 63, 115–130 (1991)

62. Tainaka, K.: Paradoxical effect in a three-candidate voter model. Phys. Lett. A 176, 303–306 (1993)

63. Iwasa, Y., Sato, K., Nakashima, S.: Dynamic modeling of wave regeneration (Shimagare) in subalpine Abies forests. J. Theor. Biol. 152, 143–158 (1991)

64. Harada, Y., Ezoe, H., Iwasa, Y., Matsuda, H., Sato, K.: Population persistence and spatially limited social interaction. Theor. Popul. Biol. 48, 65–91 (1994)

65. Harada, Y., Iwasa, Y.: Lattice population dynamics for plants with dispersing seeds and vegetative propagation. Res. Popul. Ecol. 36, 237–249 (1994)

66. Moran, P.A.P.: Notes on continuous stochastic phenomena. Biometrika 37, 17–23 (1950)

67. Hui, C., McGeoch, M.A.: A self-similarity model for the occupancy frequency distribution. Theor. Popul. Biol. 71, 61–70 (2007)

68. Hui, C., McGeoch, M.A.: Modeling species distributions by breaking the assumption of self-similarity. Oikos 116, 2097–2107 (2007)

69. McGeoch, M.A., Gaston, K.J.: Occupancy frequency distributions: patterns, artefacts and mechanisms. Biol. Rev. 77, 311–331 (2002)


70. De Grave, S., Casey, D.: Influence of sample shape and orientation on density estimates on intertidal macrofauna. J. Marine Biol. Assoc. UK 80, 1125–1126 (2000)

71. He, F., Hubbell, S.P.: Percolation theory for the distribution and abundance of species. Phys. Rev. Lett. 91, 198103 (2003)

72. Hui, C., McGeoch, M.A.: Capturing the “droopy-tail” in the occupancy-abundance relationship. Ecoscience 14, 103–108 (2007)

73. Meynard, C.N., Quinn, J.F.: Predicting species distributions: a critical comparison of the most common statistical models using artificial species. J. Biogeogr. 34, 1455–1469 (2007)

74. Holt, A.R., Gaston, K.J., He, F.: Occupancy-abundance relationships and spatial distribution: a review. Basic Appl. Ecol. 3, 1–13 (2002)

75. Kunin, W.E.: Extrapolating species abundance across spatial scales. Science 281, 1513–1515 (1998)

76. Wilson, R.J., Thomas, C.D., Fox, R., Roy, D.B., Kunin, W.E.: Spatial patterns in species distributions reveal biodiversity change. Nature 432, 393–396 (2004)

77. Hartley, S., Kunin, W.E.: Scale dependency of rarity, extinction risk, and conservation priority. Cons. Biol. 17, 1559–1570 (2003)

78. Fotheringham, A.S.: Scale-independent spatial analysis. In: Goodchild, M.F., Gopal, S. (eds.) Accuracy of spatial databases, pp. 221–228. Taylor and Francis, London (1989)

79. Scheiner, S.M.: Six types of species-area curves. Global Ecol. Biogeogr. 12, 441–447 (2003)

80. Bell, G.: The co-distribution of species in relation to the neutral theory of community ecology. Ecology 86, 1757–1770 (2005)

81. Gotelli, N.J., Graves, G.R.: Null models in ecology. Smithsonian Institution Press, Washington (1996)


Fuzzy Logic Control in Communication Networks

Chrysostomos Chrysostomou and Andreas Pitsillides

Department of Computer Science, University of Cyprus, 75 Kallipoleos Street, P.O. Box 20537, 1678 Nicosia

Summary. The problem of network congestion control remains a critical issue and a high priority, especially given the increased demand to use the Internet for time/delay-sensitive applications with differing Quality of Service (QoS) requirements (e.g. Voice over IP, video streaming, Peer-to-Peer, interactive games). Despite the many years of research effort and the large number of different control schemes proposed, there are still no universally acceptable congestion control solutions. Thus, even the classical control system techniques used by various researchers still do not perform sufficiently well to control the dynamics and the nonlinearities of TCP/IP networks, and thus to meet the diverse needs of today’s Internet. Given the need to capture such important attributes of the controlled system, the design of robust, intelligent control methodologies is required. Consequently, a number of researchers are looking at alternative non-analytical control system design and modeling schemes that have the ability to cope with these difficulties in order to devise effective, robust congestion control techniques as an alternative (or supplement) to traditional control approaches. These schemes employ fuzzy logic control (a well-known Computational Intelligence technique). In this chapter, we first discuss the difficulty of the congestion control problem and review control approaches currently in use, before we motivate the utility of Computational Intelligence based control. Then, through a number of examples, we illustrate congestion control methods based on fuzzy logic control. Finally, some concluding remarks and suggestions for further work are given.

1 Introduction

It is generally accepted that network congestion control remains a critical issue and a high priority. Despite many years of research and the large number of control schemes proposed, there are still no universally acceptable congestion control solutions. Network congestion control is a complex problem, which is becoming even more difficult given the increased demand to use the Internet for high-speed, delay-sensitive applications with differing Quality of Service (QoS) requirements.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 197–236. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Designing effective congestion control strategies is known to be difficult because of the complexity of the network structure and the variety of dynamic network parameters involved. In addition, the uncertainties involved in identifying the network parameters make it difficult to obtain realistic, cost-effective analytical models of the network. Thus, even the classical control system techniques used by various researchers do not perform well enough to control the dynamics and nonlinearities of TCP/IP networks, and thus to meet the diverse needs of today's Internet.

Consequently, a number of researchers are looking at alternative non-analytical control system design and modelling schemes that can cope with these difficulties, in order to devise effective, robust congestion control techniques as an alternative (or supplement) to traditional control approaches. These schemes employ fuzzy logic control (a well-known Computational Intelligence technique).

In this chapter, we first define network congestion and discuss the difficulty of the congestion control problem. We then review current approaches to congestion control in the Internet. We propose that fuzzy logic control should have an essential role to play in designing this challenging control system. Finally, we present illustrative examples, based on documented published studies, of the successful application of fuzzy logic to congestion control, and conclude with some suggestions.

2 Congestion Control in Internet Protocol Networks

Congestion control is a critical issue in Internet Protocol (IP) networks. Many research proposals for avoiding and/or controlling congestion can be found in the literature, and the fundamental principles of congestion, as well as the different approaches to avoiding and/or controlling it, are widely discussed. In this Section, the main functionalities of the standard Transmission Control Protocol (TCP) congestion control mechanisms are explained. In addition, many variants of TCP that have been proposed to meet the needs of today's Internet are briefly described. As there is a strong trend to progressively move the controls inside the network, closer to where congestion can be sensed, we discuss the use of router support for congestion control, with either explicit single-bit feedback or multi-bit feedback. Due to its current practical significance, in this chapter we focus on explicit single-bit feedback.

2.1 Defining Congestion

Congestion is a complex process to define. Despite many years of research in congestion control, there is currently no agreed definition. One may refer to the ongoing discussion among active members of the networking community on the right definition of congestion [1].

Two perspectives on network congestion are the user perspective and the network perspective.

Keshav [2] states that "Network congestion is a state of degraded performance from the perspective of a particular user. A network is said to be congested from the perspective of a user if that user's utility has decreased due to an increase in network load". The user experiences long delays in the delivery of data, perhaps with heavy losses caused by buffer overflows. Thus, the quality of the delivered service degrades, and packets must be retransmitted (for services intolerant to loss). Retransmissions reduce the throughput, and network throughput eventually collapses when a substantial part of the carried traffic is due to retransmissions (in that state not much useful traffic is carried). In the region of congestion, queue lengths, and hence queuing delays, grow rapidly, much faster than when the network is not heavily loaded.

Yang and Reddy [3] give a network-centric definition of congestion, as a network state in which performance degrades due to the saturation of network resources, such as communication links, processor cycles, and memory buffers. For example, if a communication link delivers packets to a queue at a higher rate than the service rate of the queue, then the size of the queue will grow. If the queue space is finite, then in addition to the delay experienced by the packets until service, losses will also occur. Observe that congestion is not a static resource shortage problem, but rather a dynamic resource allocation problem [4]. Networks need to serve all user requests, which may be unpredictable and bursty in their behaviour. However, network resources are finite and must be shared among the competing users. Congestion will occur if the resources are not managed effectively. The optimal control of networks of queues is a well-known, much studied, and notoriously difficult problem, even for the simplest of cases (e.g., [5], [6]).
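The queue example above can be made concrete with a small sketch. It is not taken from the chapter: it assumes a discrete-time fluid model, and the function name and all numeric values are hypothetical, chosen only to show the queue growing and then overflowing once the arrival rate exceeds the service rate.

```python
def simulate_queue(arrival_rate, service_rate, buffer_size, steps):
    """Discrete-time fluid sketch of a single finite queue (illustrative only).

    arrival_rate and service_rate are in packets per step, buffer_size in
    packets. Returns (final queue length, total packets lost).
    """
    queue = 0.0
    lost = 0.0
    for _ in range(steps):
        queue += arrival_rate                 # packets delivered by the link
        queue -= min(queue, service_rate)     # packets served during this step
        if queue > buffer_size:               # finite buffer: the excess is lost
            lost += queue - buffer_size
            queue = buffer_size
    return queue, lost


# Arrival rate below the service rate: no backlog builds up and nothing is lost.
light = simulate_queue(arrival_rate=8, service_rate=10, buffer_size=50, steps=100)
# Arrival rate above the service rate: the queue fills, then losses begin.
heavy = simulate_queue(arrival_rate=12, service_rate=10, buffer_size=50, steps=100)
print(light, heavy)
```

Under these assumed numbers the overloaded queue gains two packets per step until the buffer is full, after which the same two packets per step are lost.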

Figure 1(a) shows the throughput-load relationship in a packet-switching network [7], [14]. This plot shows the effect of excessive loading on the network throughput for three cases: no control, ideally controlled, and practically controlled. In the case of ideal control, the throughput increases linearly until saturation of resources, where it flattens off and remains constant, irrespective of any increase in loading beyond the capacity of the system. Obviously, this type of control is impossible in practice. Hence, for the practically controlled case, we observe some loss of throughput, as there is some communication overhead associated with the controls, possibly some inaccuracy in the feedback state information, as well as some time delay in its delivery. Finally, for the uncontrolled case, congestion collapse may occur, whereby the network throughput collapses as the network is increasingly overloaded, i.e. very little useful network traffic is carried, due to retransmissions or deadlock situations. Figure 1(b) shows the corresponding delay-load relationship. The delay (response time) plot follows a similar pattern to the throughput plot. At first, the delay rises slowly with the offered load, even for fast increments of the throughput. Then, after the knee point is reached (i.e., the queues start building), the delay curve jumps significantly while the throughput stays flat. Finally, the delay grows indefinitely when the network becomes congested (i.e., the queues start overflowing).

Fig. 1. Network throughput (a) and delay (b) vs offered load [4]

2.2 Congestion Control Principles

Chiu and Jain [8] classify most congestion control approaches into two categories: approaches for congestion avoidance and approaches for congestion recovery. Congestion avoidance mechanisms allow a network to operate in the optimal region of low delay and high throughput, thus preventing the network from becoming congested. In contrast, congestion recovery mechanisms allow the network to recover from the congested state of high delay and losses, and low throughput. Even if a network adopts a strategy of congestion avoidance, congestion recovery schemes would still be required to retain throughput in the case of abrupt changes in the network that may cause congestion.

Both types of approaches are basically resource management problems. They can be formulated as system control problems, in which the system senses its state and feeds this back to its users, who adjust their controls [8]. This simple classification provides only a very general picture of the properties shared by separate groups of approaches.

A number of taxonomies of congestion control have been, or could be, considered. A detailed taxonomy for congestion control algorithms is proposed by Yang and Reddy [3], which focuses on the decision-making process of individual congestion control algorithms. The main categories introduced by the Yang and Reddy taxonomy are [3]:

• Open loop: These are the mechanisms in which the control decisions of algorithms do not depend on any sort of feedback information from the congested spots in the network, that is, they do not monitor the state of the network dynamically.

• Closed loop: These are the mechanisms that make their control decisions based on some sort of feedback information sent to the sources. With the provision of feedback, these mechanisms are able to monitor the network performance dynamically. The feedback involved may be implicit or explicit. In an explicit feedback scheme, the feedback has to be sent explicitly, as separate packets or piggybacked on other packets (e.g. [9]). If there is no need to send the feedback explicitly, the scheme is said to be an implicit feedback scheme. Some examples of such implicit feedback are time delays of acknowledgments or timeouts, and packet loss (e.g. [10], an implicit binary feedback scheme).
  – The feedback can be further categorized into binary and "full" feedback. A single bit in the packet header is used as a binary feedback mechanism (e.g. [9], an explicit binary feedback scheme). "Full" feedback uses more than one bit in the packet header to send complete (i.e. "full") information about the status of the network, such as the exact sending rate, the round-trip time, etc. (e.g. [11], an explicit multi-bit ("full") feedback scheme).

A congestion control system should be preventive, if possible. Otherwise, it should react quickly and minimise the spread of congestion and its duration. Good engineering practice is to design the system in such a way as to avoid congestion. But taken to the extreme (i.e. guaranteeing zero losses and zero queuing delay), this would not be economical. For example, assuring zero waiting at a buffer implies increasing the service rate, in the limit, to infinity. A good compromise would be to allow for some deterioration of performance, but never allow it to become intolerable (congested). The challenge is to keep the deterioration within limits acceptable to the users. Note the fuzziness present in defining when congestion is actually experienced.

The difficulty of the congestion control problem has caused a lot of debate as to which techniques are appropriate for the control of congestion. Depending on one's point of view, many different schools of thought have been followed, with many published ideas and control techniques.

2.3 Internet Congestion Control

The Internet Protocol (IP) architecture is based on a connectionless end-to-end packet service. The Transmission Control Protocol (TCP) is an end-to-end transport protocol that provides reliable, in-order service. Congestion control is implemented via a reactive, closed-loop, dynamic window control scheme [10]. This window-based scheme operates in the hosts to cause TCP connections to "back off" during congestion. That is, TCP flows are responsive to congestion signals (i.e. dropped packets indicated by a timeout or a triple duplicate acknowledgment) from the network. It is primarily these TCP congestion avoidance algorithms that prevent the congestion collapse of today's Internet.

A fundamental aspect of TCP is that it obeys a "conservation of packets" principle, where a new segment is not sent into the network until an old segment has left. TCP implements this strategy via a self-clocking mechanism (acknowledgements received by the sender are used to trigger the transmission of new segments). This self-clocking property is the key to TCP's congestion control strategy. Other elements of TCP's congestion control include the congestion avoidance algorithm, the congestion recovery algorithm (i.e. slow-start), and the fast retransmit/recovery algorithms [12].

A TCP sender additively increases its rate when it perceives that the end-to-end path is congestion-free, and multiplicatively decreases its rate when it detects (via a loss event) that the path is congested. Thus, in such situations, TCP congestion control deploys the so-called additive-increase, multiplicative-decrease (AIMD) algorithm. The linear increase phase of TCP's congestion control protocol is known as congestion avoidance. The value of the congestion window repeatedly goes through cycles during which it increases linearly and then suddenly drops to half its current value (when a loss event occurs, in particular a triple duplicate acknowledgment), giving rise to a saw-toothed pattern in long-lived TCP connections [13].

During the initial phase of TCP's congestion control, which is called slow-start, the TCP sender begins by transmitting at a slow rate but increases its sending rate exponentially until a slow-start threshold is reached, at which point the congestion avoidance phase begins. In the case of a loss event, the AIMD saw-toothed pattern begins.

Fig. 2. Responsiveness and smoothness of the control [8]

TCP congestion control reacts differently to a loss event detected via a timeout than to a loss event detected via receipt of a triple duplicate acknowledgment (ACK). After a triple duplicate ACK, the congestion window is cut in half and then increases linearly (i.e. AIMD). However, after a timeout event, a TCP sender enters a slow-start phase, where the congestion window is set to 1 and then grows exponentially until it reaches one half of the value it had before the timeout event. At that point, TCP enters the congestion avoidance phase.
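The window dynamics described above (slow-start, congestion avoidance, and the two loss reactions) can be sketched as a per-round-trip update. This is a simplified illustration rather than a TCP implementation: the window is counted in whole segments, one update is applied per round-trip time, and the function name and event labels are hypothetical.

```python
def next_cwnd(cwnd, ssthresh, event):
    """Return (cwnd, ssthresh) after one round-trip time, in segments.

    event is "ack" (no loss seen), "triple_dup_ack" or "timeout".
    Simplified sketch: real stacks count bytes and apply further rules.
    """
    if event == "triple_dup_ack":
        ssthresh = max(cwnd / 2, 1)
        return ssthresh, ssthresh             # halve the window, continue with AIMD
    if event == "timeout":
        ssthresh = max(cwnd / 2, 1)
        return 1, ssthresh                    # restart from slow-start
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh   # slow-start: exponential growth
    return cwnd + 1, ssthresh                 # congestion avoidance: additive increase


cwnd, ssthresh = 1, 8                         # assumed initial values
events = ["ack"] * 5 + ["triple_dup_ack"] + ["ack"] * 2 + ["timeout"] + ["ack"] * 3
trace = []
for event in events:
    cwnd, ssthresh = next_cwnd(cwnd, ssthresh, event)
    trace.append(cwnd)
print(trace)   # saw-tooth after the duplicate ACKs, restart from 1 after the timeout
```

The trace shows the window doubling up to the threshold, growing by one segment per round trip afterwards, halving on the triple duplicate ACK, and collapsing to one segment on the timeout before climbing back to half its previous value.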

TCP Feedback Signalling Scheme

The TCP feedback signalling scheme in the current Internet is binary and implicit (i.e., network congestion is detected at the sources by loss events). Due to the binary nature of the feedback, and consequently the AIMD saw-toothed pattern, the system does not generally converge to a single steady state. Instead, the system reaches an "equilibrium" in which it oscillates around the optimal state [8] (see Fig. 2). The time taken to reach the equilibrium (which determines the responsiveness of the control) and the size of the oscillations (which determines the smoothness of the control) jointly determine the convergence. Ideally, we would like both the time and the oscillations to be small. Therefore, controls with a smaller time and a smaller amplitude of oscillations are called more responsive and smoother, respectively.

TCP Evolution

The congestion control mechanisms continue to be enhanced as TCP/IP evolves to meet new and more demanding requirements.

The early version of TCP, known as TCP Tahoe, enters the slow-start phase irrespective of the type of loss event. The newer version, TCP Reno, skips the slow-start phase after a triple duplicate ACK (this is called fast recovery) and resends the lost packet without waiting for a timeout event (that is, a fast retransmit occurs).

TCP NewReno [15] improves the fast recovery mechanism of the Reno implementation. The aim of TCP NewReno is to prevent a TCP sender from reducing its congestion window multiple times when several packets are dropped from a single window of data (a problem Reno has). NewReno remains in fast recovery until all of the data outstanding at the time fast recovery was initiated has been acknowledged. NewReno can retransmit one lost packet per round trip time (RTT), until all the lost packets from a particular window of data have been retransmitted. Thus NewReno avoids multiple reductions of the congestion window, as well as unnecessary retransmit timeouts with slow-start invocation.

Another proposed modification to TCP, TCP Selective Acknowledgement (SACK) [16], allows a TCP receiver to acknowledge out-of-order packets selectively, rather than just cumulatively acknowledging the last correctly received, in-order packet. Thus TCP SACK may recover multiple lost packets in a window of data in a single RTT.

The behaviour of TCP/IP congestion control remains a matter of continuous research interest in the TCP/IP world (highlighted by the frequent Internet Engineering Task Force (IETF) Requests for Comments (RFCs) and the many papers published in various journals and conferences, proposing fixes or new solutions). There is ongoing research on enhancing the TCP congestion control mechanisms so that TCP can fully exploit the capacity of fast, long-distance networks (i.e. high-speed networks operating at 622 Mbps, 2.5 Gbps, or 10 Gbps, which have a high bandwidth-delay product).

Network-Assisted Congestion Control

With network-assisted congestion control, routers provide explicit feedback to the sender regarding the congestion state of the network. This feedback may be as simple as a single bit indicating congestion at a link, or as complex as multi-bit feedback giving the source "full" information about the network state (e.g., the exact sending rate). Congestion information is typically conveyed by each router on the path from the sender to the receiver, by marking/updating a field in a packet's header to indicate congestion, and is then fed back from the receiver to the sender as a form of notification.

It has become clear [17] that the existing TCP congestion avoidance/control mechanisms and their variants, while necessary and powerful, are not sufficient to provide good service in all circumstances. Basically, there is a limit to how much control can be accomplished from the edges of the network. Some additional mechanisms are needed in the routers to complement the endpoint congestion avoidance/control methods, as suggested by various researchers ([9], [17], [18]). Note that the need for router control was realised early; see, e.g., [10], where the router side is advocated as necessary future work. A clear trend is observed: to progressively move the controls inside the network, closer to where congestion can be sensed.

By using network-assisted congestion control, TCP does not need to await a loss event (due to buffer overflow) to detect congestion and slow down properly. Instead, it is informed by the intermediate nodes (routers) when incipient congestion starts, and reacts accordingly.

Binary Feedback

The simplest method of assisting TCP from the network point of view is to provide binary feedback to the source about the network state. The use of Explicit Congestion Notification (ECN) was proposed [9] in order to provide TCP with an alternative to packet drops as a mechanism for detecting incipient congestion in the network. The ECN proposal works together with the addition of active queue management (AQM) to the Internet infrastructure, where routers detect congestion before the queue overflows (AQM is discussed in detail in Sect. 2.4).

The ECN scheme requires both end-to-end and network support. An ECN-enabled router can mark a packet by setting a bit in the packet's header, if the transport protocol is capable of reacting to ECN. Specifically, the ECN proposal requires specific flags in both the IP and TCP headers. Two bits are used in each header for proper signalling among the sender, routers, and the receiver.

In the IP header, the two bits (the ECN field) result in four ECN codepoints (see Table 1). The ECN-Capable Transport (ECT) codepoints '10' and '01' are set by the data sender to indicate that the end-points of the transport protocol are ECN-capable. Routers treat both codepoints as equivalent; senders are free to use either of the two to indicate ECT. The not-ECT codepoint '00' indicates a packet that is not using ECN. The ECN codepoint '11' is set by a router to indicate congestion to the end nodes (i.e. it marks the packet); this is called the Congestion Experienced (CE) codepoint. Upon receipt of a single CE packet by an ECN-capable transport, the congestion control algorithms followed at the end nodes must be essentially the same as the congestion control response to a single dropped packet.

In the TCP header, two new flags are introduced. The ECN-Echo (ECE) flag is used by the receiver to inform the sender that a CE packet has been received. This is done in the ACK packet sent. Similarly, the sender uses the Congestion Window Reduced (CWR) flag to announce to the receiver that its congestion window has been reduced as a consequence of receiving the ECE ACK.

Table 1. The ECN field in IP

ECN FIELD   Codepoint
0 0         Not-ECT
0 1         ECT(1)
1 0         ECT(0)
1 1         CE
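As a minimal sketch of how the codepoints in Table 1 might be handled at a router, the following encodes the two-bit ECN field and a marking decision. The helper name `router_signal` and the `congested` flag are hypothetical; the flag stands for whatever decision the router's queue management has already made for the packet.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """The four codepoints of the two-bit ECN field in the IP header."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def router_signal(ecn, congested):
    """Return (ecn, dropped) for a packet leaving an ECN-enabled router.

    Illustrative only: 'congested' is an assumed input from the queue
    management algorithm, not something computed here.
    """
    if not congested:
        return ecn, False
    if ecn == Ecn.NOT_ECT:
        return ecn, True          # transport cannot react to ECN: drop instead
    return Ecn.CE, False          # ECN-capable packet: mark it, do not drop it


print(router_signal(Ecn.ECT_0, congested=True))
print(router_signal(Ecn.NOT_ECT, congested=True))
```

Both ECT codepoints are treated identically here, in line with the statement that routers regard them as equivalent.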

The use of ECN to notify the end-nodes of congestion generally prevents unnecessary packet drops, and is thus appealing for use in the Internet.

2.4 Active Queue Management in TCP/IP Networks

The TCP congestion avoidance/congestion control mechanisms have been very successful, as the Internet has evolved from a small-scale research network to today's millions of interconnected networks. However, the increased demand to use the Internet for time/delay-sensitive applications with differing QoS requirements calls into question the efficiency and feasibility of such end-to-end, implicit-feedback-based congestion control. Thus, the need for a controller robust enough to capture the dynamics, the highly bursty network traffic, and the nonlinearities of the controlled system has led to the introduction of Active Queue Management (AQM) mechanisms to assist TCP congestion control in obtaining satisfactory performance. AQM mechanisms are proposed to provide high link utilization with low loss rate and queuing delay, while responding quickly to load changes.

To adhere to current Internet standards, we focus next on AQM mechanisms that either drop or mark packets to indicate congestion, and that keep TCP's window increase and decrease mechanism at the sources unchanged.

Active Queue Management Principles

The AQM approach can be contrasted with the Drop Tail (DT) queue management approach, employed by common Internet routers, where arriving packets are discarded only when the output port buffer overflows. Contrary to DT, AQM mechanisms [17] start dropping or marking packets earlier in order to notify traffic sources about the incipient stages of congestion (TCP interprets dropped packets as congestion). AQM allows the router to separate the policies for dropping packets from the policies for indicating congestion. In the case of packet dropping, the TCP congestion controller relies on the implicit feedback signal (generated by the lost packet, as a timeout) to reduce the TCP congestion window. In the case of packet marking, packets are not dropped; rather, a bit is set in their header indicating congestion (hence the term Explicit Congestion Notification, ECN [9]), and this indication is returned via the destination to the source.

The main AQM performance characteristics include [20]:

• Efficient queue utilization: the queue should avoid both overflow, which results in lost packets and undesired retransmissions, and emptiness, which results in link underutilization.

• Queuing delay: it is desirable to keep both the queuing delay and its variations small.

• Robustness: an AQM scheme needs to maintain robust behaviour in spite of varying network conditions, such as variations in the number of TCP sessions, in the propagation delay, and in the link capacity.

Examples of Active Queue Management Mechanisms

Several schemes have been proposed to provide congestion control in TCP/IP networks (e.g., [18], [19], [20], [21], [22], [63], [64], [65], [66]). Below we briefly review some of the prominent proposals and their limitations.

Random Early Detection (RED) [18] was the first AQM algorithm proposed. It sets minimum and maximum drop/mark thresholds in the router queues. When the average queue size exceeds the minimum threshold, RED starts randomly dropping/marking packets, based on a linear heuristic-based control law, with a drop/mark probability that depends on the average queue length; if the average queue size exceeds the maximum threshold, every packet is dropped.
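The linear control law described above can be sketched as follows. The threshold and probability values are hypothetical, and the exponentially weighted moving average used for the average queue size is an assumption about how that average is obtained, since the text does not specify it.

```python
def update_average(avg_queue, queue, weight=0.002):
    """Assumed averaging step: exponentially weighted moving average."""
    return (1 - weight) * avg_queue + weight * queue


def red_probability(avg_queue, min_th, max_th, max_p):
    """Drop/mark probability as a function of the average queue size."""
    if avg_queue < min_th:
        return 0.0                      # below the minimum threshold: no action
    if avg_queue >= max_th:
        return 1.0                      # above the maximum threshold: every packet
    # Linear ramp from 0 at min_th up to max_p at max_th.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical configuration: thresholds in packets, max_p as a probability.
for avg in (10, 40, 60):
    print(avg, red_probability(avg, min_th=20, max_th=60, max_p=0.1))
```

A router would draw a random number per arriving packet and drop or mark the packet when the draw falls below the returned probability.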

Adaptive-RED (A-RED) [19], proposed by one of the authors of RED, attempts to solve the problem of the need for continuously (re)tuning the RED parameters. In particular, A-RED adjusts the value of the maximum drop/mark probability to keep the average queue size within a target range halfway between the minimum and maximum thresholds. Thus, A-RED aims to maintain a desired average target queue length (TQL) of twice the minimum threshold (if the maximum threshold is kept at three times the minimum threshold). The maximum drop/mark probability is adjusted by a fixed additive increase step when the average queue length exceeds the desired average value, and by a fixed multiplicative decrease step when the average queue length goes below the desired average value, following a linear AIMD approach.
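A hedged sketch of this adaptation step is given below. The width of the target band (40% to 60% of the way between the thresholds), the step sizes, and the bounds on the maximum probability are assumed values chosen for illustration; the text only states that the adjustment is additive-increase, multiplicative-decrease around a target halfway between the thresholds.

```python
def adapt_max_p(max_p, avg_queue, min_th, max_th, increase=0.01, decrease=0.9):
    """One periodic A-RED style adjustment of the maximum drop/mark probability.

    All constants are illustrative assumptions, not values from the chapter.
    """
    span = max_th - min_th
    target_low = min_th + 0.4 * span      # assumed lower edge of the target band
    target_high = min_th + 0.6 * span     # assumed upper edge of the target band
    if avg_queue > target_high and max_p <= 0.5:
        return max_p + increase           # additive increase: be more aggressive
    if avg_queue < target_low and max_p >= 0.01:
        return max_p * decrease           # multiplicative decrease: back off
    return max_p                          # inside the band: leave unchanged


# With max_th = 3 * min_th the band is centred on 2 * min_th = 40 packets.
print(adapt_max_p(0.1, avg_queue=50, min_th=20, max_th=60))
print(adapt_max_p(0.1, avg_queue=30, min_th=20, max_th=60))
print(adapt_max_p(0.1, avg_queue=40, min_th=20, max_th=60))
```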

The Proportional-Integral (PI) controller proposed in [20] is based on linear control theory. Three key network parameters (the number of TCP sessions, the link capacity, and the round-trip time (RTT)) are related to the underlying feedback control system. The key feature is that PI control is based on formal feedback-based linear control theory, drawing on its vast experience in controlling systems. It allows one to set the network queuing delay explicitly, by introducing a desired queue length so as to stabilize the router queue length around this value.
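One common way to realise a PI controller in a router is as a discrete-time difference equation on the queue error, sampled at a fixed interval. The sketch below assumes that form; the class name and the coefficient values are placeholders, and in a real design the coefficients would be derived from the number of TCP sessions, the link capacity, and the RTT.

```python
class PiAqm:
    """Discrete-time PI sketch: p(k) = p(k-1) + a*e(k) - b*e(k-1).

    e(k) is the queue length minus the desired queue length q_ref.
    The coefficients a and b are illustrative placeholders.
    """

    def __init__(self, q_ref, a=1.822e-5, b=1.816e-5):
        self.q_ref = q_ref
        self.a = a
        self.b = b
        self.p = 0.0
        self.prev_error = 0.0

    def update(self, queue):
        """Sample the queue once and return the new drop/mark probability."""
        error = queue - self.q_ref
        self.p += self.a * error - self.b * self.prev_error
        self.p = min(max(self.p, 0.0), 1.0)   # keep it a valid probability
        self.prev_error = error
        return self.p


controller = PiAqm(q_ref=50)
p1 = controller.update(150)   # queue well above the target
p2 = controller.update(150)   # persistent error keeps raising the probability
print(p1, p2)
```

The integral action is what drives the queue toward the desired length: as long as the error persists, the probability keeps changing.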

In [21] a new AQM scheme was proposed, namely Random Exponential Marking (REM). The key idea behind this AQM design is to stabilize both the input rate around the link capacity and the queue length around a small target. The mark probability is calculated based on an exponential law.
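A sketch of these two ingredients follows, assuming the commonly cited form in which a link "price" accumulates the rate and queue mismatch and the mark probability is one minus an exponential of that price. The function names and the constants `gamma`, `alpha`, and `phi` are illustrative assumptions.

```python
def rem_update(price, queue, q_ref, input_rate, capacity, gamma=0.001, alpha=0.1):
    """Periodic price update (constants are illustrative assumptions).

    The price rises when the queue exceeds its target or the input rate
    exceeds the link capacity, and falls otherwise; it never goes negative.
    """
    mismatch = alpha * (queue - q_ref) + (input_rate - capacity)
    return max(0.0, price + gamma * mismatch)


def rem_mark_probability(price, phi=1.001):
    """Exponential marking law: probability grows with the price."""
    return 1.0 - phi ** (-price)


price = 0.0
for _ in range(100):                      # sustained overload in this toy run
    price = rem_update(price, queue=120, q_ref=20, input_rate=110.0, capacity=100.0)
print(price, rem_mark_probability(price))
```

When both the queue and the input rate sit at their targets the mismatch is zero, so the price, and with it the mark probability, stays where it is.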

Further, in [22] an Adaptive Virtual Queue (AVQ) based dropping/marking scheme for AQM was proposed. AVQ uses a modified token bucket model as a virtual queue to regulate link utilization, rather than the actual queue length. The AVQ scheme detects congestion solely based on the arrival rate of the packets at the link.
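The virtual queue idea can be sketched as a per-arrival update: a fictitious queue is drained at an adaptive "virtual capacity", and a packet is dropped or marked in the real queue whenever it would overflow the virtual one. The class below is an illustrative reading of that idea; its name, the desired utilization `gamma`, the smoothing constant `alpha`, and the units are all assumptions.

```python
class Avq:
    """Adaptive virtual queue sketch (names and constants are assumptions)."""

    def __init__(self, capacity, buffer_size, gamma=0.98, alpha=0.15):
        self.capacity = capacity            # real link capacity, bytes per second
        self.buffer_size = buffer_size      # buffer size, bytes
        self.gamma = gamma                  # desired link utilization
        self.alpha = alpha                  # adaptation (smoothing) constant
        self.virtual_capacity = capacity
        self.virtual_queue = 0.0
        self.last_arrival = 0.0

    def on_arrival(self, now, size):
        """Process one arrival; return True if the packet should be dropped/marked."""
        elapsed = now - self.last_arrival
        # Drain the virtual queue at the virtual capacity since the last arrival.
        self.virtual_queue = max(self.virtual_queue - self.virtual_capacity * elapsed, 0.0)
        congested = self.virtual_queue + size > self.buffer_size
        if not congested:
            self.virtual_queue += size
        # Token-bucket style update: credit accrues at gamma * capacity,
        # and every arriving packet consumes some of it.
        credited = min(self.virtual_capacity + self.alpha * self.gamma * self.capacity * elapsed,
                       self.capacity)
        self.virtual_capacity = max(credited - self.alpha * size, 0.0)
        self.last_arrival = now
        return congested


avq = Avq(capacity=1000.0, buffer_size=3000.0)
burst = [avq.on_arrival(0.0, 1000.0) for _ in range(4)]   # four back-to-back packets
print(burst)
```

Because the decision depends only on arrivals and the virtual capacity, the real queue length never enters the computation, which matches the rate-based character of the scheme described above.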

Limitations of Existing AQM Mechanisms

The properties of RED have been extensively studied in the past few years. Issues of concern include: problems with the performance of RED under different scenarios of operation and loading conditions; the correct tuning of RED parameters, which implies a global parameterization that is very difficult, if not impossible, to achieve (some researchers have advocated against using RED, in part because of this tuning difficulty [23]); the sensitivity to the averaging of the queue size [20]; and the linearity of the dropping function, which has been questioned by a number of researchers (e.g., [24]).

As the RED-based algorithms control the macroscopic behaviour of the queue length (looking at the average), they often cause sluggish response and fluctuation in the instantaneous queue length [25]. As a result, a large variation in end-to-end delays is observed. Further, the linear drop/mark probability of RED itself is not robust enough for highly bursty network traffic. The motivation should be to find a proper nonlinear function, rather than to find RED parameters appropriately tuned for a specific operating point of the original linear RED function.

A-RED attempts to tune the RED parameters for robust behavior but fails to do so in various dynamic cases [25], because A-RED retains RED's basic linear structure. Thus, fine-tuning the RED parameters is not sufficient to cope with the undesired RED behavior.

The PI controller behaves in a similar way, exhibiting sluggish response to varying network conditions. This is because the fixed/static PI parameters depend on network parameters, such as the number of flows and the RTT, and thus it is difficult to obtain stable operation over a broad range of dynamically varying traffic conditions. An illustrative example of how the PI AQM mechanism requires careful configuration of non-intuitive control parameters can be found in [26], where the PI controller shows weaknesses in detecting and controlling congestion under dynamic traffic changes, and a slow response in regulating queues.

Similarly, the AVQ control parameters depend on network parameters, such as the round-trip delay and the number of flows. Thus, it is difficult to obtain stable operation, as stated above.

The REM controller follows a function equivalent to the price control function of the PI controller, and thus it is also found to exhibit sluggish response to varying network conditions. The correct configuration of the REM control parameters is still an issue for further investigation, given the dynamic, time-varying nature of TCP/IP networks.

In general, the existing AQM mechanisms still require careful configuration of non-intuitive control parameters. The dynamics of TCP/AQM models are mostly studied with the aid of linearization around equilibrium points of the nonlinear model developed, in order to study TCP/AQM stability around equilibrium. However, linearization fails to track the system trajectories across the different regions dictated by the nonlinear equations derived. As stated in [27], linearization "assumes, and hence requires that the system always stays within a certain operating regime". Moreover, the equations modeled depend on various network parameters, such as the number of flows and the round-trip delays, which vary substantially in today's Internet. Therefore, the linearization around a specific operating point and the dependence on varying network parameters make it difficult to obtain stable and robust operation in TCP/IP networks with dynamic load and delay changes.

Hence, a major weakness of the proposed models is that the control parameters are configured for a specific operating point, for which various system parameters are assumed to be known and certain important dynamics are ignored. As stated in [24], even if the assumptions regarding the input parameters fit the specific scenario, the applicability of the AQM algorithm would be restricted to a small range of the assumed values only. Therefore, the configured parameter set and stability conditions introduced by the proposed models lack applicability to all possible real scenarios with varying dynamics of network conditions. In addition, even if the linearized system is made stable at equilibrium, there is no guarantee that the nonlinear system will remain stable [24], especially if the deviations from the equilibrium are at times large.

2.5 Differentiated Services Congestion Control

The Differentiated Services (Diff-Serv) approach [28] proposes a scalable means to deliver IP QoS based on the handling of traffic aggregates. It operates on the premise that complicated functionality should be moved toward the edge of the network, with very simple functionality at the core. The Diff-Serv framework enables QoS provisioning within a network domain by applying rules at the edges to create traffic aggregates, and coupling each of these with a specific forwarding path treatment in the domain through the use of a codepoint in the IP header. The Diff-Serv Working Group (WG) of the IETF (Internet Engineering Task Force) has defined the general architecture for differentiated services and has focused on the forwarding path behaviour required in routers. The WG has also discussed the functionality required at Diff-Serv domain edges to select and condition traffic according to a set of rules. The Diff-Serv architecture aims to provide aggregated QoS. Our focus in this chapter is on the development of differential dropping/marking algorithms for network core routers to support this objective. Further, we believe that AQM mechanisms can be usefully employed at the core of the Diff-Serv domain to provide bandwidth assurance, with low loss and bounded delay, to various (aggregated) service classes. Currently, there is ongoing work among members of the networking community on creating recommended configuration guidelines for Diff-Serv Service Classes [29].

In this chapter, we concentrate on managing the Assured Forwarding Per-Hop Behavior (AF PHB) [30], which specifies a forwarding behavior in which packets are expected to see a very small amount of loss. The AF PHB group is a means to offer different levels of forwarding assurances for IP packets, and it provides delivery of IP packets in four independently forwarded AF classes (AF1, AF2, AF3, and AF4). In each Diff-Serv node, each AF class is allocated a certain amount of forwarding resources (buffer space and bandwidth), and should be serviced to achieve the configured service rate (bandwidth). Within each AF class, IP packets are marked with one of three possible drop precedence values (e.g., AF11, AF12, AF13). In case of congestion, the drop precedence of a packet determines the relative importance of the packet within the AF class. A congested Diff-Serv node tries to protect packets with a lower drop precedence value from being lost by preferentially discarding packets with a higher drop precedence value; thus it differentiates flows with different drop precedence levels.
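As a concrete illustration of the AFxy naming used above, the following minimal sketch computes the recommended codepoint for each AF class and drop precedence, using the bit layout of the AF PHB specification (class in the three high-order bits of the DSCP, drop precedence in the next two). The function name `af_dscp` is hypothetical and chosen only for this example.

```python
def af_dscp(af_class: int, drop_precedence: int) -> int:
    """Recommended 6-bit DSCP for AF<class><precedence> (bit layout cccdd0)."""
    if af_class not in (1, 2, 3, 4):
        raise ValueError("AF class must be 1..4")
    if drop_precedence not in (1, 2, 3):
        raise ValueError("drop precedence must be 1..3")
    # Class goes in the three high-order bits, drop precedence in the next two.
    return (af_class << 3) | (drop_precedence << 1)


if __name__ == "__main__":
    for af_class in range(1, 5):
        row = [f"AF{af_class}{dp}={af_dscp(af_class, dp)}" for dp in (1, 2, 3)]
        print("  ".join(row))
```

Within a class, a larger codepoint therefore corresponds to a higher drop precedence, i.e., a packet that is discarded earlier under congestion.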

The most popular algorithm used for Diff-Serv implementation is based on RED [18]. The RED implementation for Diff-Serv, called RED In/Out (RIO) [31], uses different thresholds for different drop precedence levels. "In" and "Out" mean that packets are in or out of the connection conformance agreement. RIO uses the same mechanism as RED, but it is configured with two different sets of parameters, one for "In" (high priority, low drop precedence level) packets, and one for "Out" (low priority, high drop precedence level) packets. Upon each packet arrival at the router, RIO checks whether the packet is tagged as "In" or "Out". If it is an "In" packet, RIO calculates the average queue length of "In" packets only; otherwise (i.e., the packet is tagged as "Out") RIO calculates the total average queue length (i.e., of both "In" and "Out" arriving packets). The probability of dropping/marking an "In" packet depends on the average queue length of "In" packets, whereas the probability of dropping/marking an "Out" packet depends on the total average queue length. The discrimination against "Out" packets is created by carefully choosing the parameters of minimum and maximum thresholds, and maximum drop/mark probability. RIO drops "Out" packets much earlier than "In" packets; this is achieved by choosing the minimum threshold for "Out" packets smaller than the minimum threshold for "In" packets. It also drops/marks "Out" packets with a higher probability, by setting the maximum drop/mark probability for "Out" packets higher than the one for "In" packets.
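The RIO decision described above can be sketched as follows. This is a simplified illustration, not the implementation of [31]: the average queue lengths are taken as given inputs (the averaging itself is omitted), the probability is set to 1 above the maximum threshold, and all threshold and probability values are hypothetical, chosen only so that "Out" packets have a smaller minimum threshold and a larger maximum probability than "In" packets.

```python
from dataclasses import dataclass


@dataclass
class RedParams:
    min_th: float  # minimum average-queue threshold (packets)
    max_th: float  # maximum average-queue threshold (packets)
    max_p: float   # drop/mark probability reached at max_th


def red_probability(avg_queue: float, params: RedParams) -> float:
    """Linear RED drop/mark probability between the two thresholds."""
    if avg_queue < params.min_th:
        return 0.0
    if avg_queue >= params.max_th:
        return 1.0
    span = params.max_th - params.min_th
    return params.max_p * (avg_queue - params.min_th) / span


# Hypothetical parameter sets: "Out" is penalised earlier and harder than "In".
IN_PARAMS = RedParams(min_th=40.0, max_th=70.0, max_p=0.02)
OUT_PARAMS = RedParams(min_th=10.0, max_th=30.0, max_p=0.10)


def rio_probability(tag: str, avg_in: float, avg_total: float) -> float:
    """'In' packets see the average 'In' queue; 'Out' packets see the total."""
    if tag == "In":
        return red_probability(avg_in, IN_PARAMS)
    return red_probability(avg_total, OUT_PARAMS)
```

With these assumed values, an average queue of 20 packets already yields a non-zero probability for "Out" packets, while "In" packets are left untouched.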

However, as RIO is the implementation of RED for Diff-Serv, it still suffers from the undesired RED behaviour discussed in Sect. 2.4. In [32], based on analytic evaluation of the loss probability, it is concluded that the "choice of different RIO parameter values can have a major impact on performance". RIO also retains RED's basic linear structure (between minimum and maximum average queue threshold values) that itself is not robust enough for the bursty network traffic, and cannot capture the dynamics and nonlinearities of TCP/IP networks. Furthermore, RIO's decision for dropping/marking a packet of any level of drop precedence is not based on the total buffer occupancy; this may be a drawback if we want to have a bounded delay for the queue as a whole, and under any congestion level.

Beyond RIO, another popular algorithm proposed for Diff-Serv AQM-based congestion control is based on the standard PI AQM [20]. In particular, a two-level AQM controller (TL-PI) is proposed in [33] for providing differential marking probabilities at the Diff-Serv core. The PI AQM scheme proposed in [20] is used to preferentially drop/mark packets of high drop precedence, rather than low, by introducing two set points (TQLs) for the core queue, which correspond to the two levels of drop precedence used. The drop/mark probability for both levels is computed by two PI AQM controllers, using the same parameter values, except for the TQL. In order to preferentially drop/mark packets of high drop precedence during congestion, the TQL of the low level of drop precedence is set higher than the TQL of the high level of drop precedence. However, as the two-level PI controller is actually the PI implementation for Diff-Serv congestion control, it still suffers from the undesired PI behavior discussed in Sect. 2.4 (e.g., the dependency of PI control parameters on dynamic network parameters, like the number of flows and the round trip propagation delays, and the linearity of the control law).
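The effect of the two set points can be illustrated with a small sketch. It assumes the commonly used discrete form of a PI controller on the queue length, p(k) = p(k-1) + a(q(k) - TQL) - b(q(k-1) - TQL), clamped to [0, 1]; the coefficients `a` and `b`, the two TQL values, and the class name `PiAqm` are illustrative assumptions and are not taken from [20] or [33].

```python
class PiAqm:
    """Discrete PI AQM: p(k) = p(k-1) + a*(q(k) - tql) - b*(q(k-1) - tql)."""

    def __init__(self, tql: float, a: float = 1.822e-5, b: float = 1.816e-5):
        self.tql = tql        # target queue length (set point)
        self.a = a            # illustrative coefficient, assumed
        self.b = b            # illustrative coefficient, assumed
        self.p = 0.0          # current drop/mark probability
        self.prev_dev = 0.0   # q(k-1) - tql

    def update(self, queue_len: float) -> float:
        dev = queue_len - self.tql
        p = self.p + self.a * dev - self.b * self.prev_dev
        self.p = min(1.0, max(0.0, p))  # keep a valid probability
        self.prev_dev = dev
        return self.p


# Same coefficients for both controllers; only the set point differs.
low_precedence = PiAqm(tql=200.0)   # higher TQL: marked later
high_precedence = PiAqm(tql=100.0)  # lower TQL: marked earlier

for _ in range(10):
    p_low = low_precedence.update(150.0)
    p_high = high_precedence.update(150.0)
```

With the queue held between the two set points, the controller for the high drop precedence produces a positive probability while the one for the low drop precedence stays at zero, which is the intended differentiation.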

2.6 The Need for the Alternative

As discussed in Sect. 2.3, the current Internet feedback mechanism for congestion control is binary and implicit, and the network provides a best effort service. However, the existing TCP congestion avoidance/control mechanisms and their variants, while necessary and powerful, are not sufficient to provide good service in all circumstances [17]. Therefore, network-assisted mechanisms have been introduced (e.g., ECN) to provide a more responsive feedback mechanism. The pressing need to better capture the dynamics, the highly bursty network traffic, and the nonlinearities of TCP has led to the design of AQM mechanisms as router support to the TCP congestion control.

While many AQM mechanisms (see Sect. 2.4) have recently been proposed in the best effort TCP/IP environment, these require careful configuration of non-intuitive control parameters that are dependent on network/traffic parameters, show weaknesses in detecting and controlling congestion under dynamic traffic changes, and exhibit a slow response in regulating queues [26].

Based on the limitations identified above, it is evident [26] that by using a nonlinear drop/mark probability function, which does not require knowledge of dynamic system/network parameters, an effective and robust AQM system can be designed to drive the controlled system quickly into the steady state. This should be contrasted with the linear drop/mark probability function that itself is not robust enough for the highly bursty network traffic and cannot capture the dynamics and nonlinearities of TCP/IP networks. For example, during high load conditions a disproportionately higher drop/mark probability is required than in a low load condition, in order to keep the queue length in the same range, a requirement met only by a nonlinear drop/mark function.

Thus the complexity of these problems and the difficulties in implementing conventional controllers to eliminate those problems, as identified in Sect. 2.4, motivate the need to investigate intelligent control techniques, such as fuzzy logic, as a solution to controlling systems in which dynamics and nonlinearities need to be addressed. This work supplements the standard TCP to obtain satisfactory performance in a best-effort environment. Fuzzy logic control [38] has been widely applied to control nonlinear, time-varying systems, for which it can provide simple and effective solutions. The capability to qualitatively capture the attributes of a control system based on observable phenomena is a main feature of fuzzy logic control and has been demonstrated in various research literature and commercial products. The main idea is that if the fuzzy logic control is designed with a good (intuitive) understanding of the system to be controlled, the limitations that the complexity of the system's parameters introduces on a mathematical model can be avoided. A common approach in the networking literature is either to ignore such complex parameters in the mathematical model (e.g., ignoring the slow-start phase in the nonlinear model derived in [34]), or to simplify the model (e.g., ignoring the timeout mechanism and linearization of the model derived in [35]) to such an extent (in order to obtain a tractable model for controller design and/or stability results) that the designed controllers and their derived stability bounds are rendered overly conservative.

3 Fuzzy Logic Control

Fuzzy logic is a logical system that is an extension and generalization of multivalued logic systems. It is one of the family of tools of what is commonly known as Computational Intelligence (CI). CI is an area of fundamental and applied research involving numerical information processing. While these techniques are not a panacea (and it is important to view them as supplementing proven traditional techniques), there is a lot of interest not only from the academic research community (e.g. [4], [36]) but also from industry, including the telecommunications industry (e.g. [37]), due to their successful deployment in controlling difficult systems.

Fuzzy Logic Control (FLC) [38] denotes the field in which fuzzy set theory [39] and fuzzy inference are used to derive control laws. A fuzzy set is defined by a membership function whose value can be any real number in the interval [0, 1], expressing the grade of membership with which an element belongs to that fuzzy set. The concept of fuzzy sets enables the use of fuzzy inference, which in turn uses the knowledge of an expert in a field of application to construct a set of "IF-THEN" rules. Fuzzy logic becomes especially useful in capturing a human expert's or operator's qualitative control experience into the control algorithm, using linguistic rules.
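To make the notion of a membership grade concrete, the sketch below evaluates a triangular membership function, one common shape among many. The fuzzy set used in the assertions (a "zero" set centred at 0 on a normalized axis, with feet at -0.5 and 0.5) is a hypothetical example, not a set taken from the cited works.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Grade of membership of x in a triangular fuzzy set (value in [0, 1])."""
    if x <= left or x >= right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)    # rising edge
    return (right - x) / (right - peak)      # falling edge


if __name__ == "__main__":
    for x in (-0.5, -0.25, 0.0, 0.25, 0.5):
        print(x, triangular(x, -0.5, 0.0, 0.5))
```

An element can thus belong to a fuzzy set only partially: here 0.25 is "zero" to degree 0.5, whereas a crisp set would force a yes/no decision.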

The idea of FLC was initially introduced by Zadeh [40] and first applied by Mamdani [41] in an attempt to control systems that are difficult to model mathematically and hence to design controllers for. FLC may be viewed as a way of designing feedback controllers in situations where rigorous control theoretic approaches are too difficult and time consuming to use, due to difficulties in obtaining a formal analytical model, while at the same time some intuitive understanding of the process is available. Fuzzy logic control has strengths in controlling highly nonlinear, complex systems, which are commonly encountered in product design, manufacturing and control. Fuzzy logic provides a set of mathematical methods for representing information in a way that resembles natural human communication, and for handling this information in a way that is similar to human reasoning. The control algorithm is encapsulated as a set of linguistic rules, leading to algorithms describing what action should be taken based on system behaviour observations. FLC has been applied successfully for controlling numerous systems in which analytical models are not easily obtainable or the model itself, if available, is too complex and possibly highly nonlinear (e.g. [42]).

Therefore, FLC concentrates on attaining an intuitive understanding of the way to control the process, incorporating human reasoning in the control algorithm. A fuzzy logic controller can be conceived as a nonlinear controller whose input-output relationship is described in linguistic terms that can be better understood and easily modified (tuned). It is independent of mathematical models of the system to be controlled. It achieves inherent robustness and reduces design complexity. This is in contrast with conventional control approaches that concentrate on constructing a controller with the aid of an analytical system model that in many cases is overly complex, uncertain, and sensitive to noise.

3.1 Application of Fuzzy Logic in Networks

Fuzzy Logic Control (FLC) has been successfully used in a wide variety of applications in engineering, science, business, medicine and other fields. A number of research papers using fuzzy logic to investigate solutions to congestion control issues in networking, especially in Asynchronous Transfer Mode (ATM) networks, have been published. Given the complexity of ATM networks, the rich variety of traffic sources that operate on them, and the difficulty of obtaining formal models for in-depth analysis, it is not surprising that FLC was adopted by many researchers. For example, [4], [43], [44] and [45] have, since the early 1990s, successfully used the concept of FLC for congestion control in ATM, as an alternative to conventional counterparts. A survey of some of these techniques is given in [36].

Based on the vast experience of successful implementations of FLC in the design of control algorithms, as indicated above, and the reported strength of fuzzy logic in controlling complex and highly nonlinear systems, FLC was also used in the IP world. To the best of our knowledge, fuzzy logic, in the context of AQM congestion control in TCP/IP networks, was introduced in the early 2000s by Pitsillides, Rossides, Chrysostomou, et al. (e.g. [46], [47], and [48]). Their proposed scheme concerned a fuzzy-based RED variant illustrated in a Diff-Serv environment, with the error on the queue length and the rate of change of the error as input variables, and the packet drop probability as output. This earlier research demonstrated that the application of fuzzy control techniques to the problem of congestion control in TCP/IP networks is worthy of further investigation.

Lately, we are witnessing an increase in research papers focusing on the use of fuzzy logic in various fields of the IP world. Fengyuan, Yong, and Xiuming [49] have proposed a fuzzy controller for AQM in best effort IP networks, with the error on the queue length and the rate of change of the error as input variables, and the increment step of the packet drop/mark probability as output. They clearly presented guidelines for the design of the FLC, implemented it in the NS-2 simulator [58], and compared its performance under various scenarios with the PI controller [20]. The proposed FLC has superior steady and transient performance, and provides robustness against noise and disturbance, as well as adaptability to the variances of link delay and capacity.

Wang, Li, Sohraby, and Peng [50] proposed a fuzzy controller for best effort AQM, with only one input, the queue length, while the output is the drop probability. This scheme is implemented by keeping the semantics of the RED algorithm [18] (they use the same threshold-based method as that in RED, i.e., when the queue length is less than a minimum threshold the probability is zero; when the queue length is between the minimum and maximum threshold the drop probability is computed; and when the queue length is greater than the maximum threshold the same gentle mechanism as in RED is used). The important feature of their proposed scheme is that it also designs an adaptive mechanism to dynamically readjust the fuzzy rule so as to make the scheme itself extensively stable for many dynamic environments. Simulation results [58] show that their scheme can effectively and stably control the queue length to around the expected value. Compared with the RED [18] and PI [20] algorithms, they obtain higher goodput and a more stable queue length, even with the introduction of UDP flows.

Aul, Nafaa, Negru, and Mehaoua [51] proposed a fuzzy controller for best effort AQM, with the error on the queue length and the rate of change of the error as input variables, and the drop probability as output. Compared with traditional AQM algorithms (e.g., RED [18]), their proposal avoids buffer overflows/underflows, and minimizes packet dropping. Further, an on-line adaptation mechanism is proposed that captures fluctuating network conditions, while classical AQM algorithms require static tuning.

Di Fatta, Hoffmann, Lo Re, and Urso [52] proposed a fuzzy PI-controller for AQM, where the gains are tuned by a genetic algorithm with respect to optimal disturbance rejection. The analytical design of the proposed controller is carried out in analogy with a PI [20] controller. The main objectives of the controller design method are fast response to high load variations and disturbance rejection in steady-state behavior. The experimental results demonstrate that the proposed controller outperforms the other AQM policies ([18], [20]) under various operating conditions, especially for traffic that exceeds the nominal bandwidth, causing severe overload on the node. The improvement in terms of response time and link utilization is due to the fact that the nonlinear fuzzy controller has a variable gain that allows the AQM to recover faster from large variations in traffic loads.

Chrysostomou et al. [59], [60], [25], [26] have proposed a generic AQM control methodology in TCP/IP networks, based on FLC. A simple, effective and efficient nonlinear control law is built using a linguistic model of the system, rather than a traditional mathematical model, and is easily adapted to different network environments (e.g. Best-Effort and Differentiated-Services architectures). It is demonstrated, via extensive simulative evaluation [26], that the proposed fuzzy control methodology offers inherent robustness with effective control of the system under widely differing operating conditions, without the need to (re)tune the settings for two different architectures (Best-Effort and Differentiated-Services). As demonstrated, this is in contrast with the well-known conventional counterparts of A-RED [19], PI [20], REM [21], AVQ [22] for Best-Effort, and two-level PI [33], RIO [31] for Diff-Serv based networks, where the proposed approach outperforms all tested counterparts in each different architecture. A detailed overview is presented in Sect. 3.2.

Fuzzy logic control has also been used, besides AQM, in other fields concerning today's Internet. Siripongwutikorn, Banerjee, and Tipper [53] have proposed an adaptive bandwidth control algorithm based on fuzzy control to maintain the aggregate loss QoS. Habetha and Walke [54] developed a new clustering scheme concerning mobility and load management, based on fuzzy logic. Wang et al. [55] presented a fuzzy-based dynamic channel-borrowing scheme to maximize the number of served calls in a distributed wireless cellular network. Savoric [56] proposed a fuzzy explicit window adaptation scheme that can decrease the advertised receiver window in TCP acknowledgements if necessary in order to avoid congestion and packet losses. Oliveira and Braun [57] proposed a technique for packet loss discrimination using fuzzy logic over multihop wireless networks.

3.2 An Illustrative Example: A Generic Fuzzy AQM Control Methodology in TCP/IP Networks

In this section, the operation of the unified fuzzy congestion controller ([59], [60], [25], [26]) for best effort and Diff-Serv networks is summarized.

Fuzzy Explicit Marking (FEM) Control System Design

The nonlinear fuzzy logic-based control system (FLCS) is designed to operate in TCP/IP Best-Effort networks, and specifically in the IP routers' output port buffer. However, the aim is also to design a generic control methodology that can be easily adopted in other network environments as well, for example in TCP/IP Diff-Serv.

The proposed FLCS [25] (see Fig. 3; details are discussed below) is based on an AQM approach, which implements a drop probability function, and supports ECN in order to mark packets, instead of dropping them. It uses feedback from the instantaneous queue length, sampled frequently, and is driven by the calculated errors between a given queue reference and the queue length for the present and previous periods. The end-to-end behaviour of TCP is retained, with the TCP increase and decrease algorithm responding to ECN marked packets.

Fig. 3. Fuzzy logic based AQM (FEM) system model [25]

The principal aim of the proposed nonlinear FLCS is to achieve the following goals:

• Dynamic and effective fast system response with robustness to the time-varying, dynamic nature of the controlled system, under differing operating conditions, without the need for (re)tuning, thus providing acceptable QoS.
• High link utilization (based on the useful throughput).
• Minimal packet losses.
• Bounded-regulated queue fluctuations and delays (mean and variation).

The bounded mean queuing delay and delay variation can be achieved by regulating the queues of the output port buffers of IP routers at predefined levels. This will, as a consequence, yield low losses and maintain high utilization as well. By having a nonlinear control law, based on fuzzy logic, the aim is to effectively deal with the high variability appearing in the network, and thus exhibit fast system response and robust behavior in spite of varying network conditions.

The proposed FLCS in TCP/IP Best-Effort networks, called the Fuzzy Explicit Marking (FEM) controller, provides a new nonlinear probability function that marks packets at IP routers in order to regulate queues at predefined levels, by achieving a specified target queue length (TQL).

In order to design the FEM controller the following standard steps have been followed by the authors:

• Identify the inputs and their ranges (universe of discourse)
• Identify the output and its range
• Construct the rule base (knowledge) that the system will operate under
• Create the degree of fuzzy membership function for each input and output
• Decide how the action will be executed for each rule
• Combine the rules and defuzzify the output

There is no systematic procedure to design the fuzzy controller [38]. The most commonly used approach is to define membership functions of the inputs and output based on a qualitative understanding of the system, together with a rule base, and to test the controller by trial and error until satisfactory performance is achieved. The authors rely on heuristic expertise and study of the plant dynamics to decide how best to configure the control law. The focus is on the achievement of the design goals indicated above, whilst keeping the design of the controller as simple and generic as possible. As noted by the authors, as the fuzzy controller is nonlinear, it is very difficult to examine analytically the influence of certain parameters. Usually, extensive simulation and experimentation are used to investigate its behaviour.

The authors' aim is to ensure that the controller will have the proper information available to be able to make good decisions, and will have proper control inputs to be able to steer the controlled system in the directions needed, so that it achieves a high-performance operation, as pointed out above. Some of the design choices are briefly described below.

Input-Output Selection and Scaling

Since multiple inputs are usually used to capture the dynamic state of the controlled system more accurately, and also to offer better ability to linguistically describe the system dynamics [38], the authors utilize a two-input, single-output fuzzy controller (the simplest of the Multiple Input Single Output (MISO) model based controllers) on the buffer of each output port of a router in TCP/IP networks.

There is a need to choose the right inputs and output with a generic normalized universe of discourse, applicable in any network/traffic environment. Thus, the decision made by the authors is to use the error on the instantaneous queue length from a target value for two consecutive sampling intervals. Sampling at every packet arrival, just as RED [18] does, is in the authors' opinion overkill and provides no perceptible benefit. By measuring the queue at two consecutive sampling intervals (the current and the past), an attempt is made to estimate the future behavior of the queue.

It is well known that the difference between the input arrival rate at a buffer and the link capacity at discrete time intervals can be approximated and visualized as the rate at which the queue length grows when the buffer is non-empty. Thus, as it is usually easier to sample queue length than rate in practice, the change of the queue length for two consecutive discrete time intervals is tracked. The system converges only when both sampled queue lengths reach the TQL (i.e. the errors on the queue length go to zero). The errors converging to zero imply that the input rate has been matched to the link capacity, and there is no growth or drain in the router queue level. This has the effect of decoupling the congestion measure from the performance measure by keeping as congestion indices the queue length and the input rate (which is approximated with queue growth rate, as discussed above).
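The approximation used in this argument can be written out explicitly. As a sketch, let $\lambda(kT)$ denote the aggregate input rate over the sampling period and $C$ the link capacity (both symbols are introduced here for illustration only), and assume the buffer neither empties nor overflows during the period. With the error on the queue length defined as the target minus the sampled queue length, $e(kT) = q_{des} - q(kT)$:

$$
q(kT) - q(kT - T) \approx \bigl(\lambda(kT) - C\bigr)\,T
$$

$$
e(kT - T) - e(kT) = q(kT) - q(kT - T) \approx \bigl(\lambda(kT) - C\bigr)\,T
$$

Hence the pair $\bigl(e(kT), e(kT - T)\bigr)$ carries both the queue length and, through its difference, an estimate of the rate mismatch. When both errors are zero, $q(kT) = q(kT - T) = q_{des}$ and therefore $\lambda(kT) \approx C$.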

Further, the output of the controller is selected as a nonlinear mark probability that is given as input to the controlled system in order to decide whether to mark a particular packet.

After all the inputs and the output are defined for the FEM controller, the fuzzy control system shown in Fig. 3 is specified, where all quantities are considered at the discrete instant $kT$:

• $T$ is the sampling period.
• $e(kT)$ is the error on the controlled variable, the queue length $q(kT)$, from a specified TQL ($q_{des}$), at each sampling period $kT$, defined in (1).

$$
e(kT) = q_{des} - q(kT) \tag{1}
$$

• $e(kT - T)$ is the error on queue length with a delay $T$ (at the previous sampling period).
• $p(kT)$ is the calculated packet mark probability.
• $q_{des}$ is the specified desired TQL.
• $SG_{i1,2}(kT)$ and $SG_o(kT)$ are the input and output scaling gains, respectively.

In fuzzy control theory, the range of values of inputs or outputs for a given controller is usually called the "universe of discourse". Often, for greater flexibility in fuzzy controller implementation, the universe of discourse for each process input is "normalized" to the interval $[-1, +1]$ by means of constant scaling factors [38]. For FEM controller design, the scaling gains $SG_{i1}(kT)$, $SG_{i2}(kT)$ and $SG_o(kT)$, shown in Fig. 3, are employed to normalize the universe of discourse for the controller input errors $e(kT)$ and $e(kT - T)$, and for the controller output $p(kT)$, respectively. The input gains $SG_{i1,2}(kT)$ are chosen so that the ranges of values of $SG_{i1}(kT) \times e(kT)$ and $SG_{i2}(kT) \times e(kT - T)$ lie on $[-1, 1]$, and $SG_o(kT)$ is chosen by using the allowed range of inputs to the plant in a similar way. The range of values of the controller's output lies between 0 and 1 (i.e., $p(kT) \in [0, 1]$).

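In order to achieve a normalized range of the FEM input variables from −1 to 1, the input scaling gain $SG_{i1}(kT)$ is set as shown in (2).

$$
SG_{i1}(kT) =
\begin{cases}
-\dfrac{1}{q_{des} - \mathit{BufferSize}} & \text{if } e(kT) < 0 \\[6pt]
\dfrac{1}{q_{des}} & \text{otherwise}
\end{cases}
\tag{2}
$$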

The $SG_{i1}(kT)$ values are obtained by considering the lower and upper bounds of the queue length. When the instantaneous queue length takes its maximum value (i.e., is equal to the buffer size), the error on the queue length $e(kT)$ (see (1)) has its minimum value of $q_{des} - \mathit{BufferSize}$. On the other hand, when the instantaneous queue length takes its minimum value, that is, zero, the error on the queue length has its maximum value, equal to $q_{des}$. $SG_{i2}(kT)$ is set similarly, with $e(kT - T)$ used in place of $e(kT)$.
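The input normalization in (1) and (2) can be checked numerically with a short sketch. The function names and the example values (a target of 100 packets in a 500-packet buffer) are hypothetical and serve only to show that the scaled error spans exactly the interval from −1 to 1.

```python
def input_scaling_gain(error: float, q_des: float, buffer_size: float) -> float:
    """Input scaling gain of (2); the branch depends on the sign of the error."""
    if error < 0:
        # Queue above target: the most negative error is q_des - buffer_size.
        return -1.0 / (q_des - buffer_size)
    # Queue at or below target: the largest error is q_des (empty queue).
    return 1.0 / q_des


def normalized_error(queue_len: float, q_des: float, buffer_size: float) -> float:
    """Scaled error SG_i * e, with e = q_des - q as in (1)."""
    error = q_des - queue_len
    return input_scaling_gain(error, q_des, buffer_size) * error


if __name__ == "__main__":
    for q in (0, 50, 100, 300, 500):
        print(q, normalized_error(q, q_des=100, buffer_size=500))
```

A full buffer maps to −1, an empty buffer to +1, and a queue at the target to 0, regardless of the absolute buffer size, which is what makes the controller inputs generic.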

The output scaling gain $SG_o(kT)$ is determined so that the range of possible outputs is the maximum, while ensuring that the input to the plant will not saturate around the maximum. Following the approach in [19], $SG_o(kT)$ is dynamically set to a value indicating the maximum mark probability (initially set to, e.g., a value of 0.1) in response to changes of the instantaneous queue length, $q(kT)$, as shown in (3).

$$
SG_o(kT) =
\begin{cases}
SG_o(kT - T) + 0.01 & \text{if } q(kT) > 1.1\,TQL \text{ and } SG_o(kT) < 0.5 \\
SG_o(kT - T) \times 0.01 & \text{if } q(kT) < 0.9\,TQL \text{ and } SG_o(kT) > 0.01 \\
SG_o(kT - T) & \text{otherwise}
\end{cases}
\tag{3}
$$
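A direct transcription of (3) into code is sketched below. Two points are assumptions of this sketch rather than statements of the source: the bounds in the conditions are tested against the previous gain value, and the constants 0.01, 0.5, 1.1 and 0.9 are copied as printed in (3). The function name is hypothetical.

```python
def update_output_gain(prev_gain: float, queue_len: float, tql: float) -> float:
    """One update step of the output scaling gain, transcribed from (3)."""
    if queue_len > 1.1 * tql and prev_gain < 0.5:
        # Queue well above target: raise the maximum mark probability.
        return prev_gain + 0.01
    if queue_len < 0.9 * tql and prev_gain > 0.01:
        # Queue well below target: scale it down (factor as printed in (3)).
        return prev_gain * 0.01
    # Queue within 10% of the target, or a bound reached: keep the gain.
    return prev_gain


if __name__ == "__main__":
    gain = 0.1  # initial value suggested in the text
    for q in (120, 130, 100, 80):
        gain = update_output_gain(gain, q, tql=100)
        print(q, gain)
```

The gain therefore only moves when the queue leaves a band of 10% around the TQL, so small fluctuations around the target do not change the controller's output range.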

Rule Base Selection

The multi-input fuzzy inference engine uses linguistic rules to calculate the mark probability. These linguistic rules form the control knowledge (rule base) of the controller and provide a description of how best to control the system, under differing operating conditions. Hence, linguistic expressions are needed for the inputs and the output, and for the characteristics of the inputs and the output. Linguistic variables (that is, symbolic descriptions of what are in general time-varying quantities) are used to describe fuzzy system inputs and output. The linguistic variables take on "linguistic values" that change dynamically over time and are used to describe specific characteristics of the variables. Linguistic values are generally descriptive terms such as "positive-big", "zero" and "negative-small".

The linguistic variables and values provide a language to express the authors' ideas about the control decision-making process in the context of the framework established by the authors' choice of FEM controller inputs and output. In order to determine the linguistic values of the input and output variables, the authors define partitions over the input and output space so that they will adequately represent the linguistic variables. Since the inputs of the FEM controller deal with the queue evolution, which is dynamic and time-varying in nature, the authors need to have as "many" operating regions (state partitions) as possible, in order to capture as much detail of the dynamics and the nonlinearities of the TCP/IP plant as possible. The authors also need to keep the controller as simple as possible by not increasing the number of linguistic values (state partitions) beyond a number past which the improvement in plant performance is insignificant. The same applies for the output of the FEM controller, the mark probability.

The model of the FEM control system, comprising the control rules and the values of the linguistic variables, is obtained through an offline intuitive tuning process that starts from a set of initial insight considerations and progressively modifies the number of linguistic values of the system until it reaches a level of adequate performance. The design objective is to keep the controller as simple as possible to start with, and only increase complexity, by adding more linguistic values, if required. An adequate number of linguistic values is needed to describe the nonlinear behavior of the system accurately enough. Adding more rules, as expected, increases the accuracy of the approximation, which yields an improved control performance. But beyond a certain point the improvement is marginal.

By choosing the simplest MISO controller, the authors have avoided the exponential increase of the rule base, and the subsequent increase in the complexity of the controller, that occurs when the number of input variables increases. A careful design of the rule base is done based on two goals:

• Completeness: all kinds of situations of system behavior are taken into consideration, i.e., all kinds of combinations of input variables result in an appropriate output value.
• Consistency: the rule base does not contain any contradiction. A set of rules is inconsistent if there are at least two rules with the same antecedent part and different consequent parts.

The philosophy behind the knowledge base of the FEM scheme is that of being aggressive when the queue length deviates from the TQL (where congestion starts to set in and quick relief is required), while being able to respond smoothly when the queue length is around the TQL. All other rules can represent intermediate situations, thus providing the control mechanism with a highly dynamic action.

A convenient way to list all possible IF-THEN control rules is to use a tabular representation (see Table 2). These rules reflect the particular view and experiences of the designer, and are easy to relate to human reasoning processes and gathered experiences. Note that the actual number of rules implemented in FEM is reduced, since when the current error on queue length is negative-very-big, the output control signal is always huge, irrespective of the status of the past error on queue length. The same applies when the current error on queue length is positive-very-big: the output control signal is then always zero.

Table 2. FEM Linguistic rules - Rule base [25]

Table notations: negative/positive very big (NVB/PVB), negative/positive big (NB/PB), negative/positive small (NS/PS), zero (Z), huge (H), very big (VB), big (B), small (S), very small (VS), tiny (T)

p(kT)                      Qerror(kT-T)
Qerror(kT)    NVB   NB    NS    Z     PS    PB    PVB
NVB           H     H     H     H     H     H     H
NB            B     B     B     VB    VB    H     H
NS            T     VS    S     S     B     VB    VB
Z             Z     Z     Z     T     VS    S     B
PS            Z     Z     Z     Z     T     T     VS
PB            Z     Z     Z     Z     Z     Z     T
PVB           Z     Z     Z     Z     Z     Z     Z

Fig. 4. Membership functions of the linguistic values representing the input variables "normalized error on queue length for two consecutive sample periods", and the output variable "mark probability": (a) linguistic input variables; (b) linguistic output variable [25]
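The two design goals for the rule base can be checked mechanically once Table 2 is written down as data. The following is a minimal sketch, not the authors' implementation: the names `ROWS`, `RULES`, `is_complete`, and `reduced_rule_count` are hypothetical, and the way collapsed rows are counted is an assumption about what "reduced" means here.

```python
# Linguistic values of the two inputs (error on queue length), in table order.
INPUTS = ["NVB", "NB", "NS", "Z", "PS", "PB", "PVB"]

# Table 2 transcribed row by row: rows are Qerror(kT), columns are Qerror(kT-T).
ROWS = {
    "NVB": "H H H H H H H",
    "NB":  "B B B VB VB H H",
    "NS":  "T VS S S B VB VB",
    "Z":   "Z Z Z T VS S B",
    "PS":  "Z Z Z Z T T VS",
    "PB":  "Z Z Z Z Z Z T",
    "PVB": "Z Z Z Z Z Z Z",
}

# (current error, previous error) -> linguistic value of the mark probability.
# A dict holds one consequent per antecedent pair, so it is consistent by construction.
RULES = {
    (cur, prev): out
    for cur, row in ROWS.items()
    for prev, out in zip(INPUTS, row.split())
}

def is_complete(rules):
    """Completeness: every combination of input values has a consequent."""
    return all((cur, prev) in rules for cur in INPUTS for prev in INPUTS)

def reduced_rule_count(rules):
    """Count rules when a row with a single consequent collapses to one rule."""
    count = 0
    for cur in INPUTS:
        outputs = {rules[(cur, prev)] for prev in INPUTS}
        count += 1 if len(outputs) == 1 else len(INPUTS)
    return count

print(is_complete(RULES), len(RULES), reduced_rule_count(RULES))
```

Under this counting, only the NVB and PVB rows collapse, which matches the remark that the output is always huge or always zero in those two cases.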

Membership Functions Selection

The membership functions of the linguistic variables are determined by using an intuitive and pragmatic choice and not an analytic approach (this is one of the reported advantages of fuzzy logic controllers compared to the conventional counterparts). The choice of membership function shape is open. Many shapes are often found in studies (see, e.g., [38]). Due to computational simplicity, the authors select triangular or trapezoidal shaped membership functions in the FEM control system. These types of shapes are a standard choice used in many industry applications due to the mathematical simplicity of the expressions representing them. The selected membership functions representing the linguistic values for both the inputs and the output of the FEM controller are shown in Fig. 4.

The amount of overlapping between the membership functions' areas is significant. The left and right halves of the triangular membership function for each linguistic value are chosen to provide membership overlap with adjacent membership functions. The chosen method is simple in that symmetric and equally spaced membership functions are used, where the sum of the grades of membership of an input value, concerning the linguistic values of a specific input variable, is always one (see (4)).

\[ \sum_{k=1}^{m} \mu_k(x_i) = 1 \qquad (4) \]

where $\mu_k(x_i)$ is the membership value of the input value $x_i$ taken from the membership function of the linguistic value $k$ ($1 \le k \le m$, where $m$ is the number of linguistic values of a linguistic variable) of the input variable of concern.

This results in having at most two membership functions overlapping at any input value. Since the controller has two inputs, each belonging to at most two linguistic values, no more than 2 × 2 = 4 rules are activated at a given time. This offers computational simplicity in the implementation of the FEM controller, a design objective. The overlapping of the fuzzy regions, representing the continuous domain of each control variable, contributes to a well-behaved and predictable system operation; thus the fuzzy system can be very robust [25].
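The partition property in (4) is easy to see numerically. The sketch below assumes seven triangular functions equally spaced over a normalized error range of [-1, 1]; the exact break points of Fig. 4 are not reproduced here, so `CENTERS` and the function name `triangular_partition` are illustrative assumptions.

```python
def triangular_partition(x, centers):
    """Grades of membership of x in symmetric, equally spaced triangles.

    Each triangle peaks at its own center and reaches zero at the
    neighbouring centers; inputs outside the range saturate at the ends.
    """
    width = centers[1] - centers[0]
    x = min(max(x, centers[0]), centers[-1])
    return [max(0.0, 1.0 - abs(x - c) / width) for c in centers]

# Assumed universe of discourse: normalized error in [-1, 1], seven values.
LABELS = ["NVB", "NB", "NS", "Z", "PS", "PB", "PVB"]
CENTERS = [-1 + i / 3 for i in range(7)]

grades = triangular_partition(0.1, CENTERS)
active = {lab: round(g, 3) for lab, g in zip(LABELS, grades) if g > 1e-9}
print(active, round(sum(grades), 6))
```

For an input between two peaks, only the two neighbouring linguistic values have a nonzero grade and the grades add up to one, which is what bounds the number of simultaneously active rules.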

The FEM controller is a Mamdani-based model. Mamdani’s fuzzy inference method is the most commonly used fuzzy methodology [61]. This approach is adopted due to its simplicity and effectiveness. The calculated output control signal of the nonlinear fuzzy controller, shown in (5), uses the center of gravity, the most common defuzzification method [38], of the aggregated fuzzy output set C.

\[ p_k = \frac{\int y \, \mu_C(y) \, dy}{\int \mu_C(y) \, dy} \qquad (5) \]

where $\mu_C(y) = \max(\mu_1(y), \mu_2(y), \ldots, \mu_N(y))$ is the membership degree of $y$ in the aggregated fuzzy set C (which is found using the max-operation over all N implicated output fuzzy sets), and N is the number of linguistic rules.

The limits of integration correspond to the entire universe of discourse Y of output mark probability values, to which y belongs. To reduce computations, the output universe of discourse Y is discretized to m values, $Y = \{y_1, y_2, \ldots, y_m\}$, which gives the discrete fuzzy centroid (see (6)).

\[ p_k = \frac{\sum_{j=1}^{m} y_j \, \mu_C(y_j)}{\sum_{j=1}^{m} \mu_C(y_j)} \qquad (6) \]

Fig. 5. Control-decision surface of the fuzzy inference engine of the FEM controller. The nonlinear control surface is shaped by the rule base and the linguistic values of the linguistic variables [25].

Note that the use of symmetric triangular and trapezoidal membership functions makes the computation of (6) easy.
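A small sketch of the discrete centroid in (6) may help. It assumes min-clipping as the implication operator (a common Mamdani choice that the text does not spell out), a universe of 101 evenly spaced mark probabilities, and two hypothetical active rules; none of these values are taken from [25].

```python
def tri(y, center, half_width):
    """Symmetric triangular membership function."""
    return max(0.0, 1.0 - abs(y - center) / half_width)

def discrete_centroid(fired, samples):
    """Discrete centroid of the max-aggregated, min-clipped output sets.

    fired   -- list of (firing_strength, center, half_width), one per active rule
    samples -- the discretized universe of discourse {y_1, ..., y_m}
    """
    num = den = 0.0
    for y in samples:
        # mu_C(y): max over rules of each output set clipped at its firing strength.
        mu_c = max(min(w, tri(y, c, hw)) for w, c, hw in fired)
        num += y * mu_c
        den += mu_c
    return num / den if den > 0 else 0.0

# Assumed discretization: 101 evenly spaced mark probabilities in [0, 1].
Y = [j / 100 for j in range(101)]

# Two hypothetical active rules: one output set fired at 0.7, its neighbour at 0.3.
p = discrete_centroid([(0.7, 0.4, 0.2), (0.3, 0.6, 0.2)], Y)
print(round(p, 3))
```

The result lies between the two peaks and closer to the more strongly fired set, which is the interpolating behaviour the center-of-gravity method is chosen for.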

The nonlinear control-decision surface implemented by the FEM controller is shaped by the constructed rule base and the linguistic values of the input and output variables (see Fig. 5). This surface represents in a compact way all the information in the fuzzy controller. An inspection of this nonlinear control surface and the linguistic rules shown in Table 2 provides hints on the operation of FEM. The mark probability behaviour in the region of equilibrium (i.e., where the error on the queue length is close to zero) is smoothly calculated. On the other hand, the rules are aggressive, increasing the probability of packet marking sharply in the region beyond the equilibrium point, where congestion starts to set in and quick relief is required. Thus the inference process of the FEM controller dynamically calculates the mark probability behaviour based on the two inputs. The dynamic way of calculating the mark probability comes from the fact that, depending on the error on queue length for two consecutive sample periods, a different set of fuzzy rules, and so a different inference, applies. Based on these rules and inferences, the mark probability is expected to be more responsive than that of other AQM approaches (e.g., [18], [19], [20], [21], and [22]) due to the human reasoning and the inbuilt nonlinearity.
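To make the shape of such a control surface concrete, the sketch below evaluates a Mamdani controller built from the rule base of Table 2 along the diagonal where both errors are equal. It is an illustration under stated assumptions, not the FEM controller of [25]: the membership functions are assumed to be equally spaced triangles (the real ones are in Fig. 4), min is assumed for both the AND of the antecedents and the implication, and the names `grades` and `mark_probability` are hypothetical.

```python
OUT_LABELS = ["Z", "T", "VS", "S", "B", "VB", "H"]

# Table 2: rows are Qerror(kT), columns are Qerror(kT-T), both ordered NVB..PVB.
TABLE = [
    "H H H H H H H",
    "B B B VB VB H H",
    "T VS S S B VB VB",
    "Z Z Z T VS S B",
    "Z Z Z Z T T VS",
    "Z Z Z Z Z Z T",
    "Z Z Z Z Z Z Z",
]
RULES = [row.split() for row in TABLE]

def grades(x, lo, hi, n):
    """Membership grades of x in n equally spaced triangles on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    x = min(max(x, lo), hi)
    return [max(0.0, 1.0 - abs(x - (lo + i * step)) / step) for i in range(n)]

def mark_probability(err_now, err_prev, samples=201):
    """Mamdani inference: min for AND and implication, max aggregation, centroid."""
    g_now = grades(err_now, -1.0, 1.0, 7)
    g_prev = grades(err_prev, -1.0, 1.0, 7)
    # Firing strength per output label (max over rules sharing a consequent).
    strength = {label: 0.0 for label in OUT_LABELS}
    for i, gi in enumerate(g_now):
        for j, gj in enumerate(g_prev):
            w = min(gi, gj)
            if w > 0.0:
                strength[RULES[i][j]] = max(strength[RULES[i][j]], w)
    num = den = 0.0
    for k in range(samples):
        y = k / (samples - 1)
        out = grades(y, 0.0, 1.0, 7)
        mu = max(min(strength[lab], out[m]) for m, lab in enumerate(OUT_LABELS))
        num += y * mu
        den += mu
    return num / den

for e in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"error={e:+.1f}  p={mark_probability(e, e):.3f}")
```

Even with these stand-in membership functions, the printed values rise steeply once the error turns negative (queue above the TQL) and stay low and flat for positive errors, which is the asymmetry described above.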

Fuzzy Explicit Marking In/Out (FIO) Control System Design for Diff-Serv

Congestion control at the core of a Diff-Serv network benefits from AQM schemes to preferentially drop/mark packets based on the level of precedence they belong to, by giving priority to low drop precedence against high drop precedence traffic. At the same time, it is important to meet adequate QoS-centric performance objectives, in terms of bounded delays, with high link utilization and minimal losses overall. As discussed in Sect. 2.5, the existing AQM schemes for Diff-Serv congestion control show weaknesses in responding to such objectives. The authors build on FEM, the fuzzy controller designed for Best-Effort service, and investigate its extension and suitability to provide effective congestion control for Diff-Serv environments. The goals are to achieve differentiated treatment of traffic aggregates, ensuring at the same time bounded queuing delays, low losses, and high link utilization overall, hence offering (differentiated) QoS to traffic aggregates. In accomplishing these goals, fast system response and robustness to the time-varying dynamic nature of the controlled system play a significant role and hence are important design requirements. Furthermore, low complexity is also sought.

Fig. 6. FIO system model [25]

A two-level precedence FEM controller structure, called Fuzzy Explicit Marking In/Out (FIO), is formulated (see Fig. 6). It is designed to operate on the buffer queue of the core routers. The terms "In" and "Out" distinguish packets that are classified into different precedence traffic aggregates according to the drop/mark precedence level they belong to. "In" packets belong to the low drop/mark precedence (i.e., high-priority traffic), while "Out" packets belong to the high drop/mark precedence (i.e., low-priority traffic).

Both high- and low-priority traffic aggregates share a FIO queue. FIO comprises two identical FEM controllers, one for each traffic aggregate, and two different TQLs on the total queue length are introduced (TQLhigh and TQLlow), one for each FEM controller. The TQL for low-priority traffic (TQLlow) is lower than the one for high-priority traffic (TQLhigh). Therefore, low-priority packets are more likely to be marked than high-priority ones.


The idea behind this is to regulate the queue at the lower TQL. In this case, the mark probability of the high-priority traffic is closer to zero, as the TQL is set higher and thus it is less likely that high-priority packets will be marked. In the presence of a small amount of high-priority packets, the queue would be mostly regulated at the lower TQL and thus marking of high-priority packets would be less likely. If, however, the high-priority traffic is very high in comparison to the low-priority traffic, then the queue is regulated at the higher TQL (as there is not enough low-priority traffic to ensure the lower TQL is maintained). In this case the mark probability for low-priority traffic is closer to one. In either case, the lower-priority traffic is marked at a higher rate. Therefore, both differentiation and a bounded delay - by regulating the queue between the two TQLs, depending on the dynamic network traffic conditions - can be accomplished. It is therefore expected that FIO can achieve an adequate differentiation between the two precedence traffic aggregates in the presence of congestion, by preferentially marking the lowest-priority packets and giving priority/preference to high-priority-tagged traffic, while controlling the queue at the predefined levels, and thus providing QoS assurances for delay, loss, and link utilization.

It is noted by the authors that even though FIO is introduced with two drop/mark precedence levels, it is easy to extend the fuzzy logic control methodology to multiple drop/mark precedence levels, due to the generic control methodology adopted.
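The structure of FIO (two identical controllers, one shared queue, one TQL per precedence level) can be sketched as follows. Everything here is hypothetical scaffolding: the error normalization `(TQL - q) / TQL`, the TQLhigh value of 200 packets, and especially `stand_in_controller`, which is a simple decreasing function used in place of a FEM inference engine so that the example stays short.

```python
def normalized_error(queue_len, tql):
    """Assumed normalization: (TQL - q) / TQL, saturated to [-1, 1]."""
    return max(-1.0, min(1.0, (tql - queue_len) / tql))

def stand_in_controller(err_now, err_prev):
    """Placeholder for a FEM controller (NOT the fuzzy inference engine).

    It only mimics the direction of FEM: the more negative the error
    (queue above target), the higher the mark probability.
    """
    e = 0.5 * (err_now + err_prev)
    return (0.5 - 0.5 * e) ** 2

class FIO:
    """Two identical controllers sharing one queue, one TQL per precedence."""

    def __init__(self, tql_low, tql_high, controller):
        if not 0 < tql_low < tql_high:
            raise ValueError("TQLlow must be positive and below TQLhigh")
        # "out" = high drop/mark precedence (low priority), "in" = the opposite.
        self.tql = {"out": tql_low, "in": tql_high}
        self.prev_err = {"out": 0.0, "in": 0.0}
        self.controller = controller

    def sample(self, total_queue_len):
        """Run once per sampling period; returns a mark probability per class."""
        probs = {}
        for cls, tql in self.tql.items():
            err = normalized_error(total_queue_len, tql)
            probs[cls] = self.controller(err, self.prev_err[cls])
            self.prev_err[cls] = err
        return probs

fio = FIO(tql_low=100, tql_high=200, controller=stand_in_controller)
for q in (50, 150, 300):
    p = fio.sample(q)
    print(q, round(p["in"], 3), round(p["out"], 3))
```

Because both controllers see the same total queue length but the "Out" controller compares it with the lower target, its error is never larger, so "Out" packets are marked at least as often as "In" packets. Adding further precedence levels would amount to adding entries to the `tql` dictionary.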

Performance Evaluation

The authors have done extensive simulations to demonstrate the effectiveness and robustness of the AQM-based nonlinear fuzzy logic control methodology implemented in both Best-Effort and Diff-Serv TCP/IP environments. A comparison is also made with other published results, selecting some representative well-known AQM schemes: A-RED [19], PI [20], REM [21], and AVQ [22] for TCP/IP Best-Effort networks, and RIO [31] and TL-PI [33] in the case of TCP/IP Diff-Serv networks. The performance of the AQM schemes is evaluated using the most widely used network simulator, the Network Simulator NS-2 [58].

A number of scenarios are selected to be as realistic and practical as possible, with the aim of stressing the tested approaches. The authors use both single- and multiple-congested (tandem) link (bottleneck) network environments (including topologies with congestion at peripheral links), and also widely differing operating conditions, in order to examine the following effects on the AQM schemes:

• dynamic traffic changes - time-varying dynamics

• traffic load factor

• heterogeneous propagation delays

• different propagation delays at bottleneck links

• different link capacities

• introduction of noise-disturbance (background traffic) to the network (e.g. short-lived TCP connections)

• introduction of reverse-path traffic

• different types of data streams, like TCP/FTP and TCP/Web-like traffic, as well as unresponsive traffic (UDP-like).

The performance metrics that are used for evaluating the performance of the tested AQM schemes are:

• Bottleneck link utilization (based on the useful throughput, also commonly called goodput)

• Loss rate

• Mean queuing delay and its standard deviation.
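The metrics listed above can be computed from simple per-link counters. The sketch below uses textbook definitions, which may differ in detail from how the figures in [25] and [26] were produced; the function name `aqm_metrics`, its parameters, and the sample numbers are all made up for illustration.

```python
from statistics import mean, pstdev

def aqm_metrics(goodput_bits, capacity_bps, duration_s,
                packets_dropped, packets_arrived, queuing_delays_s):
    """Summarize one bottleneck link over one simulation run."""
    return {
        # Useful throughput as a share of what the link could carry.
        "utilization_pct": 100.0 * goodput_bits / (capacity_bps * duration_s),
        "loss_rate_pct": 100.0 * packets_dropped / packets_arrived,
        "mean_delay_ms": 1000.0 * mean(queuing_delays_s),
        "delay_std_ms": 1000.0 * pstdev(queuing_delays_s),
    }

# Illustrative numbers only (not simulation results): a 15 Mbps link over 100 s.
m = aqm_metrics(goodput_bits=1.44e9, capacity_bps=15e6, duration_s=100,
                packets_dropped=2_000, packets_arrived=200_000,
                queuing_delays_s=[0.050, 0.055, 0.060])
print(m)
```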

Due to lack of space, what follows are some indicative results. More comprehensive results can be found in [25] and [26].

Best Effort Scenario

This scenario [25] provides a realistic network topology, where multiple-congested (tandem) links occur. The multiple-bottleneck network topology is shown in Fig. 7.

AQM is used in the queues of all core links from router-A to router-F. All other links (access links) have a simple Drop Tail queue. The link capacities and propagation delays are set as follows: (C1, d1) = (C8, d8) = (C9, d9) = (100 Mbps, 5 ms), (C2, d2) = (C4, d4) = (C6, d6) = (15 Mbps, 10 ms), (C3, d3) = (15 Mbps, 60 ms), (C5, d5) = (15 Mbps, 30 ms), and (C7, d7) = (C10, d10) = (C11, d11) = (200 Mbps, 5 ms). N1 flows end up at destination 1, whereas N2 flows end up at destination 2 and N3 flows end up at destination 3, creating cross traffic. The results in [25] and [26] show that both bottleneck links where the cross traffic exists (i.e., between router-B and router-C, and between router-D and router-E) exhibit similar behaviour, as far as the performance comparison is concerned. Therefore, the bottleneck link between router-D and router-E is chosen to show the results obtained.

Fig. 7. Multiple-bottleneck network topology [25]

Fig. 8. Best Effort Scenario: Queue lengths for bottleneck prop. delay = 200 ms [25]. Panels: (a) FEM, (b) PI, (c) A-RED, (d) REM, (e) AVQ. Each panel plots buffer size (packets) against time (seconds).

The performance of the AQM schemes is investigated under variation of the bottleneck link propagation delay. Specifically, the effect of the RTT is examined by increasing the propagation delay between router-D and router-E from 30 ms to 120 ms and 200 ms, thus increasing the RTT from 260 ms to 600 ms. The number of long-lived flows is N1 = 500, N2 = 100, and N3 = 200. Time-varying dynamics are also introduced on the network by stopping half of all the flows at time t = 40 s and resuming transmission at t = 70 s. Figure 8 shows the queue length evolution for the case of 200 ms bottleneck propagation delay. The results show the superior steady performance of FEM, with stable queue dynamics and graceful performance degradation as the bottleneck propagation delay increases up to 200 ms (note that this gives a total round-trip propagation delay of 600 ms). FEM has the highest utilization, the lowest losses, and the shortest delay variation (even though, for 200 ms, FEM exhibits larger variation around the TQL than in the previous situations reported in [25] and [26], it still behaves much better than the other schemes). PI, REM, A-RED, and AVQ exhibit large queue fluctuations and are slow to react to dynamic changes, resulting in degraded utilization and high variance of queuing delay. Thus, these mechanisms are shown to be sensitive to variations of RTT within the range of interest. This is clearly illustrated in Fig. 9, which shows the loss rate as the propagation delay increases. FEM shows robustness by having minimal losses, the lowest among all the schemes. Figure 10 shows the utilization with respect to the queuing delay variation. FEM outperforms the other AQMs, managing to achieve high utilization while at the same time regulating the queue and thus providing bounded mean delay and delay variation.

Fig. 9. Best Effort Scenario: Loss Rate vs Propagation Delay (bottleneck prop. delay of 30, 120, 200 ms) [25]. Loss rate (%) against propagation delay (msec) for FEM, PI, A-RED, REM, and AVQ.

Fig. 10. Best Effort Scenario: Utilization vs Delay Variation (bottleneck prop. delay of 30, 120, 200 ms) [25]. Utilization (%) against queuing delay variation (msec) for FEM, PI, A-RED, REM, and AVQ.

Fig. 11. Diff-Serv Scenario: Queue lengths (bottleneck prop. delay = 120 ms - high-priority traffic consists of 1.33% of the total traffic passing through the bottleneck link) [25]. Panels: (a) FIO, (b) TL-PI, (c) RIO. Each panel plots buffer size (packets) against time (seconds).
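The RTT values quoted for the Best Effort scenario can be checked from the link propagation delays given with the topology. The sketch assumes that a flow traverses links 1 to 7 in sequence, that the return path has the same delays, and that queuing and transmission times are ignored; the names `BASE_DELAYS_MS` and `round_trip_ms` are hypothetical.

```python
# One-way propagation delays (ms) of links 1..7, as listed for the topology of Fig. 7.
BASE_DELAYS_MS = {1: 5, 2: 10, 3: 60, 4: 10, 5: 30, 6: 10, 7: 5}

def round_trip_ms(overrides=None):
    """Round-trip propagation delay along links 1..7, with optional per-link overrides."""
    delays = {**BASE_DELAYS_MS, **(overrides or {})}
    return 2 * sum(delays.values())

# Link 5 (router-D to router-E) is the bottleneck whose delay is varied.
for d5 in (30, 120, 200):
    print(f"d5 = {d5:3d} ms -> RTT = {round_trip_ms({5: d5})} ms")
```

With these assumptions, the 30 ms and 200 ms settings give 260 ms and 600 ms, in agreement with the figures stated in the text.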

Diff-Serv Scenario

The same network topology of multiple-bottleneck links as the one used in the Best Effort Scenario, with the same network parameters, is used (see Fig. 7). The bottleneck link between router-B and router-C is chosen to show the results obtained [25].

The performance of the AQM schemes under variation of the bottleneck link propagation delay is investigated. The effect of the RTT is examined by varying the bottleneck propagation delay between router-B and router-C from 30 ms to 60 ms and 120 ms, thus increasing the RTT from 200 ms to 380 ms. All sources (N1 = 100, N2 = 50, and N3 = 100 flows) are greedy sustained File Transfer Protocol applications. A limited number of flows tagged as high-priority traffic is considered (2% of the N1 flows, whereas the remaining 98% are tagged as low-priority). N2 and N3 flows are considered as being of low priority (i.e., about 1.33% of the traffic passing through the bottleneck link under consideration belongs to the high-priority level). Figure 11 shows the queue length evolution for the case of 120 ms bottleneck propagation delay. From the results, the superior steady performance of FIO can be observed, with stable queue dynamics irrespective of the increase of RTT. FIO has the highest utilization, the lowest losses, and the shortest delay variation ([25], [26]). On the other hand, RIO exhibits larger queue fluctuations as the RTT increases, which result in degraded utilization and high variance of queuing delay [25], [26]. Also, TL-PI suffers from a slow response in regulating the queue, and has higher delay variation than FIO [25], [26]. Thus, these mechanisms are shown to be sensitive to variations of RTT. Figure 12 shows the utilization of the bottleneck link regarding the high-priority traffic with respect to the mean queuing delay. Despite the fact that the high-priority traffic consists of only 1.33% of the total traffic passing through the particular link, FIO outperforms the other AQMs, managing to achieve a considerable amount of utilization for the high-priority traffic; thus it achieves a much higher differentiation between the two drop precedence levels compared with the other schemes, while at the same time regulating the queue and thus providing bounded mean delay and delay variation [25], [26]. On the other hand, the other schemes exhibit larger delays and provide no differentiation between high- and low-priority traffic.

Fig. 12. Diff-Serv Scenario: Utilization of high-priority traffic vs mean queuing delay (bottleneck propagation delay of 30, 60, and 120 ms - high-priority traffic consists of 1.33% of the total traffic passing through the bottleneck link). The TQLlow value of 100 packets is equivalent to 53.33 ms [25]. High-priority utilization (%) against mean queuing delay (msec) for FIO, TL-PI, and RIO.
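Two of the numbers attached to this scenario follow from short arithmetic. The sketch below assumes 1000-byte packets (a value chosen because it reproduces the 53.33 ms quoted with Fig. 12 for a 15 Mbps link) and assumes that only the N1 and N2 flows cross the link between router-B and router-C; both are inferences, not statements from [25].

```python
def queuing_delay_ms(queue_packets, packet_bytes, capacity_bps):
    """Time to drain a queue of the given length at the link rate."""
    return 1000.0 * queue_packets * packet_bytes * 8 / capacity_bps

# TQLlow = 100 packets on a 15 Mbps bottleneck, assuming 1000-byte packets.
tql_low_delay = queuing_delay_ms(100, 1000, 15e6)
print(round(tql_low_delay, 2), "ms")

# Share of high-priority flows on the bottleneck: 2% of N1, out of N1 + N2 flows.
n1, n2 = 100, 50
high_priority_share_pct = 100.0 * (0.02 * n1) / (n1 + n2)
print(round(high_priority_share_pct, 2), "%")
```

Read this way, regulating the queue at TQLlow corresponds to a queuing delay of roughly 53 ms, and two high-priority flows out of 150 give the 1.33% share mentioned in the text.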

4 Conclusions

Network management and control is a complex problem, which is becoming even more difficult with the increased demand to use the Internet for time/delay-sensitive applications with differing QoS requirements (e.g. Voice over IP, video streaming, Peer-to-Peer, interactive games). The existing TCP congestion avoidance/control mechanisms, while necessary and powerful, are not sufficient to provide good service in all circumstances. The insufficiencies of the implicit end-to-end feedback adopted by the TCP paradigm necessitate the design and utilization of new effective congestion control algorithms to supplement the standard TCP-based congestion control, since the replacement of the current TCP congestion control algorithm does not appear to be realistic at this point in time. Further, given the need for providing adequate QoS, new network architectures have been proposed, such as the Differentiated Services architecture, to deliver aggregated QoS in IP networks.

Basically, there is a limit to how much control can be accomplished from the edges of the network by such end-to-end, implicit-feedback-based congestion control. Some additional mechanisms are needed, particularly in the routers, to complement the endpoint congestion control methods. Thus the need for router control has recently led to the concept of active queue management (AQM).

The problem of network congestion control remains a critical issue and a high priority; despite the many years of research efforts and the large number of different control schemes proposed, there are still no universally acceptable congestion control solutions. Current AQM mechanisms, introduced to assist the TCP congestion control, are ineffective in meeting the diverse needs of today’s Internet, due to the dynamic, time-varying nature of TCP/IP networks. It is widely accepted that they have serious limitations and drawbacks.

Thus, despite the classical control system techniques used by various researchers, these still do not perform sufficiently well to control the dynamics and the nonlinearities of TCP/IP networks. Given the need to capture such important attributes of the controlled system, the design of robust, intelligent control methodologies is required.

Hence, given the need for such a control methodology - to capture the dynamics, the highly bursty network traffic, and the nonlinearities of the TCP/IP system, under widely differing operating conditions - we show the usefulness of fuzzy logic control to meet such objectives. Fuzzy Logic Control can be considered a suitable candidate for an AQM-based control mechanism due to its reported strength in controlling nonlinear systems using linguistic information.

The capability to qualitatively capture the attributes of a control system based on observable phenomena is a main feature of fuzzy logic control and has been demonstrated in various places in the research literature as well as in commercial products. The main idea is that if the fuzzy logic control is designed with a good (intuitive) understanding of the system to be controlled, the limitations due to the complexity that the system’s parameters introduce into a mathematical model can be avoided. Therefore, the application of fuzzy control techniques to the problem of congestion control in TCP/IP networks is worthy of investigation, due to the difficulties in obtaining a precise enough mathematical model (amenable to analysis) using conventional analytical methods, while some intuitive understanding of congestion control is available.

We have presented illustrative examples of using fuzzy logic to control congestion. These, and the literature we review on fuzzy logic methods applied to networks, show that fuzzy logic can be effective in the control of congestion. There is no doubt that we will see more and more use of these techniques, including in new challenging networking areas (e.g., sensor networks, 3G and beyond mobile networks, etc.). We also expect that, as in other commercial products, fuzzy logic techniques will finally make it into real products in this area, and we expect this to happen with tremendous success.

Of course, many challenges to the control of congestion using fuzzy logic remain unresolved. Much work remains for the analytical study of fuzzy logic, particularly in the area of stability and performance analysis. Most fuzzy logic controllers proposed in the literature do not have any stability analysis because of the difficulty of such analysis. This is mainly due to the nonlinearity in the control structure, which usually makes it difficult to conduct a theoretical analysis that explains why fuzzy logic controllers in many instances achieve better performance than their conventional counterparts, especially for highly nonlinear processes. However, as elegantly pointed out by Mamdani [62], overstressing the necessity of mathematically derived performance evaluations may be counterproductive and contrary to the normal industry approach (e.g. prototype testing may suffice for accepting the controlled system’s performance). Nevertheless, a certain degree of safety concerning fuzzy logic applied in networks can be examined.

References

1. ICCRG, Internet Congestion Control Research Group (2006), http://oakham.cs.ucl.ac.uk/mailman/listinfo/iccrg

2. Keshav, S.: Congestion Control in Computer Networks. Ph.D. Thesis, University of California Berkeley (1991)

3. Yang, C.Q., Reddy, A.V.S.: A taxonomy for congestion control algorithms in packet switching networks. IEEE Network Magazine 9(4), 34–45 (1995)

4. Pitsillides, A., Sekercioglu, A.: Congestion Control. In: Pedrycz, W., Vasilakos, A. (eds.) Computational Intelligence in Telecommunications Networks, pp. 109–158. CRC Press, Boca Raton (2000)

5. Hassan, M., Sirisena, H.: Optimal control of queues in computer networks. In: IEEE International Conference on Communications, vol. 2, pp. 637–641 (2001)

6. Andrews, M., Slivkins, A.: Oscillations with TCP-like flow control in networks of queues. In: IEEE Infocom 2006, pp. 1–12 (2006)

7. Schwartz, M.: Telecommunication networks: Protocols, modelling, analysis. Addison-Wesley, Reading (1988)

8. Chiu, D.M., Jain, R.: Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. Computer Networks and ISDN Systems 17, 1–14 (1989)


9. Ramakrishnan, K., Floyd, S., Black, D.: The addition of explicit congestion notification (ECN) to IP. Request for Comments RFC 3168, Internet Engineering Task Force (2001)

10. Jacobson, V.: Congestion avoidance and control. In: ACM SIGCOMM 1988, pp. 314–329 (1988)

11. Katabi, D., Handley, M., Rohrs, C.: Congestion control for high bandwidth-delay product networks. In: ACM SIGCOMM 2002, vol. 32(4), pp. 89–102 (2002)

12. Stevens, W.: TCP slow start, congestion avoidance, fast retransmit, and fast recovery algorithms. Request for Comments RFC 2001, Internet Engineering Task Force (1997)

13. Lakshman, T.V., Madhow, U.: The performance of TCP/IP for networks with high bandwidth delay products and random loss. IEEE/ACM Transactions on Networking 5, 336–350 (1997)

14. Kurose, J.F., Ross, K.W.: Computer networking: a top-down approach featuring the Internet. Addison-Wesley, Reading (2005)

15. Floyd, S., Henderson, T., Gurtov, E.A.: The NewReno modification to TCP’s fast recovery algorithm. Request for Comments RFC 3782, Internet Engineering Task Force (2004)

16. Mathis, M., Mahdavi, J., Floyd, S., Romanow, A.: TCP Selective Acknowledgement options. Request for Comments RFC 2018, Internet Engineering Task Force (1996)

17. Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., Zhang, L.: Recommendations on queue management and congestion avoidance in the Internet. Request for Comments RFC 2309, Internet Engineering Task Force (1998)

18. Floyd, S., Jacobson, V.: Random early detection gateways for congestion avoidance. IEEE/ACM Trans on Networking 1(4), 397–413 (1993)

19. Floyd, S., Gummadi, R., Shenker, S.: Adaptive RED: An Algorithm for Increasing the Robustness of RED’s Active Queue Management. Technical report, ICSI (2001)

20. Hollot, C.V., Misra, V., Towsley, D., Gong, W.B.: Analysis and Design of Controllers for AQM Routers Supporting TCP Flows. IEEE Transactions on Automatic Control 47(6), 945–959 (2002)

21. Athuraliya, S., Li, V.H., Low, S.H., Yin, Q.: REM: Active Queue Management. IEEE Network Magazine 15(3), 48–53 (2001)

22. Kunniyur, S., Srikant, R.: An adaptive virtual queue (AVQ) algorithm for active queue management. IEEE/ACM Transactions on Networking 12(2), 286–299 (2004)

23. May, M., Bolot, J., Diot, C., Lyles, B.: Reasons Not to Deploy RED. In: 7th International Workshop on Quality of Service, pp. 260–262 (1999)

24. Plasser, E., Ziegler, T.: A RED Function Design Targeting Link Utilization and Stable Queue Size Behaviour. Computer Networks Journal 44, 383–410 (2004)

25. Chrysostomou, C., Pitsillides, A., Sekercioglu, A.: Fuzzy Explicit Marking: A Unified Congestion Controller for Best Effort and Diff-Serv Networks. Computer Networks Journal (accepted for publication) (2008)

26. Chrysostomou, C.: Fuzzy Logic Based AQM Congestion Control in TCP/IP Networks. PhD Thesis, University of Cyprus (2006), http://www.netrl.cs.ucy.ac.cy/images/thesis/chrysostomou-phd-thesis-sep06.pdf


27. Guirguis, M., Bestavros, A., Matta, I.: Exogenous-Loss Awareness in Queue Management - Towards Global Fairness. Technical Report, Computer Science Department, Boston University (2003)

28. Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., Weiss, W.: An architecture for Differentiated Services. Request For Comments RFC 2475, Internet Engineering Task Force (1998)

29. Babiarz, J., Chan, K., Baker, F.: Configuration guidelines for DiffServ service classes. Request for Comments RFC 4594, Internet Engineering Task Force (2006)

30. Heinanen, J., Baker, F., Weiss, W., Wroclawski: Assured Forwarding PHB Group. Request for Comments RFC 2597, Internet Engineering Task Force (1999)

31. Clark, D., Fang, W.: Explicit Allocation of Best Effort Packet Delivery Service. IEEE/ACM Transactions on Networking 6(4), 362–373 (1998)

32. May, M., Bolot, J.C., Jean-Marie, A., Diot, C.: Simple performance models of differentiated services schemes for the Internet. In: IEEE INFOCOM 1999, New York, pp. 1385–1394 (1999)

33. Chait, Y., Hollot, C.V., Misra, V., Towsley, D., Zhang, H., Lui, C.S.: Providing throughput differentiation for TCP flows using adaptive two-color marking and two-level AQM. In: IEEE INFOCOM 2002, New York, vol. 2, pp. 837–844 (2002)

34. Misra, V., Gong, W.B., Towsley, D.: Fluid-based Analysis of a Network of AQM Routers Supporting TCP Flows with an Application to RED. In: ACM SIGCOMM 2000, pp. 151–160 (2000)

35. Hollot, C.V., Misra, V., Towsley, D., Gong, W.B.: A control theoretic analysis of RED. In: IEEE Infocom 2001, vol. 3, pp. 1510–1519 (2001)

36. Sekercioglu, A., Pitsillides, A., Vasilakos, A.: Computational intelligence in management of ATM networks. Soft Computing Journal 5(4), 257–263 (2001)

37. Azvine, B., Vasilakos, A.: Application of soft computing techniques to the telecommunication domain. In: Tselentis, G. (ed.) ERUDIT Roadmap, pp. 89–110 (2000)

38. Passino, K., Yurkovich, M.: Fuzzy Control. Prentice Hall, Englewood Cliffs (1998)

39. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)

40. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics 3(1), 28–44 (1973)

41. Mamdani, E.H.: Applications of fuzzy algorithms for simple dynamic plant. Proceedings of IEE 121(12), 1585–1588 (1974)

42. Morales, E., Polycarpou, M., Hemasilpin, N., Bissler, J.: Hierarchical Adaptive and Supervisory Control of Continuous Venovenous Hemofiltration. IEEE Transactions on Control Systems Technology 9(3), 445–457 (2001)

43. Sekercioglou, A., Pitsillides, A., Egan, G.K.: Study of an adaptive fuzzy controller based on the adaptation of relative rule weights. In: Proceedings of ANZIIS 1994, Brisbane, Queensland, Australia, pp. 204–208 (1994)

44. Pitsillides, A., Sekercioglou, A., Ramamurthy, G.: Effective Control of Traffic Flow in ATM Networks Using Fuzzy Explicit Rate Marking (FERM). IEEE Journal on Selected Areas in Communications 15(2), 209–225 (1997)

45. Douligeris, C., Develekos, G.: A fuzzy logic approach to congestion control in ATM networks. In: IEEE ICC 1995, Washington, USA, pp. 1969–1973 (1995)

Page 236: Foundations of Computational Intelligence

Fuzzy Logic Control in Communication Networks 235

46. Rossides, L., Sekercioglu, A., Kohler, S., Pitsillides, A., Phuoc, T.G., Vassilakos,A.: Fuzzy Logic Controlled RED: Congestion Control for TCP/IP Diff-ServArchitecture. In: 8th European Congress on Intelligent Techniques and SoftComputing, Aachen, Germany, pp. 263–269 (2000)

47. Rossides, L., Chrysostomou, C., Pitsillides, A., Sekercioglu, A.: Overview ofFuzzy-RED in Diff-Serv Networks. In: Bustard, D.W., Liu, W., Sterritt, R.(eds.) Soft-Ware 2002. LNCS, vol. 2311, pp. 1–13. Springer, Heidelberg (2002)

48. Chrysostomou, C., Pitsillides, A., Rossides, L., Polycarpou, M., Sekercioglu,A.: Congestion Control in Differentiated Services Networks using Fuzzy-RED.IFAC Control Engineering Practice (CEP) Journal 11(19), 1153–1170 (2003);special Issue on Control Methods for Telecommunication Networks

49. Fengyuan, R., Yong, R., Xiuming, S.: Design of a fuzzy controller for activequeue management. Computer Commmunications 25, 874–883 (2002)

50. Wang, C., Li, B., Sohraby, K., Peng, Y.: AFRED: An adaptive fuzzy-basedcontrol algorithm for active queue management. In: 28th IEEE InternationalConference on Local Computer Networks (LCN 2003), pp. 12–20 (2003)

51. Aul, Y.H., Nafaa, A., Negru, D., Mehaoua, A.: FAFC: Fast adaptive fuzzyAQM controller for TCP/IP networks. In: IEEE Globecom 2004, vol. 3, pp.1319–1323 (2004)

52. Di Fatta, G., Hoffmann, F., Lo Re, G., Urso, A.: A Genetic Algorithm for theDesign of a Fuzzy Controller for Active Queue Management. IEEE Transactionson Systems, Man, and Cybernetics, Special Issue on Computational Intelligencein Telecommunications Networks and Internet Services: Part I 33(3), 313–324(2003)

53. Siripongwutikorn, P., Banerjee, S., Tipper, D.: Adaptive bandwidth controlfor efficient aggregate QoS provisioning. In: IEEE Globecom 2002, vol. 3, pp.2435–2439 (2002)

54. Habetha, J., Walke, B.: Fuzzy rule-based mobility and load management forself-organizing wireless networks. International journal of wireless informationnetworks 9(2), 119–140 (2002)

55. Wang, C., Li, B., Hou, Y.T., Sohraby, K., Lin, Y.: LRED: A Robust ActiveQueue Management Scheme Based on Packet Loss Ratio. In: IEEE Infocom2004, vol. 1, pp. 1–12 (2004)

56. Savoric, M.: Fuzzy explicit window adaptation: a method to further enhanceTCP performance. Technical Report TKN-03-010, Telecommunication Net-works Group, Technical University Berlin (2003)

57. Oliveira, R., Braun, T.: A delay-based approach using fuzzy logic to improveTCP error detection in ad hoc networks. In: IEEE Wireless Communicationsand Networking conference, Atlanta, USA, vol. 3, pp. 1666–1671 (2004)

58. Network Simulator (1989), http://nsnam.isi.edu/nsnam/59. Chrysostomou, C., Pitsillides, A., Hadjipollas, G., Polycarpou, M., Sekercioglu,

A.: Fuzzy Logic Control for Active Queue Management in TCP/IP Networks.In: 12th IEEE Mediterranean Conference on Control and Automation Ku-sadasi, Aydin, Turkey, 6 pages (2004) (CD ROM Proceedings)

60. Chrysostomou, C., Pitsillides, A., Hadjipollas, G., Polycarpou, M., Sekercioglu,A.: Congestion Control in Differentiated Services Networks using Fuzzy Logic.In: 43rd IEEE Conference on Decision and Control, Bahamas, pp. 549–556(2004) (CD ROM Proceedings - ISBN: 0-7803-8683-3, IEEE Catalog Number:04CH37601C)

Page 237: Foundations of Computational Intelligence

236 C. Chrysostomou and A. Pitsillides

61. Mamdani, E.H., Assilian, S.: An experiment in linguistic synthesis with a fuzzylogic controller. International Journal of Man-Machine Studies 7(1), 1–13 (1975)

62. Mamdani, E.H.: Twenty years of fuzzy logic: experiences gained and lessonslearned. In: IEEE International conference on fuzzy systems, San Franscisco,pp. 339–344 (1975)

63. Andrew, L.H., Hanly, S.V., Chan, S., Cui, T.: Adaptive Deterministic PacketMarking. IEEE Comm. Letters 10(11), 790–792 (2006)

64. Thommes, R.W., Coates, M.J.: Deterministic packet marking for time-varyingcongestion price estimation. IEEE/ACM Transactions on Networking 14(3),592–602 (2006)

65. Liu, S., Basar, T., Srikant, R.: Exponential-RED: A Stabilizing AQM Schemefor Low- and High-Speed TCP Protocols. IEEE/ACM Transactions on Net-working 13(5), 1068–1081 (2005)

66. Ariba, Y., Labit, Y., Gouaisbaut, F.: Design and Performance Evaluation ofa State-Space Based AQM. In: International Conference on CommunicationTheory, Reliability, and Quality of Service, pp. 89–94 (2008)

Page 238: Foundations of Computational Intelligence

Adaptation in Classification Systems

Abdelhamid Bouchachia

Group of Software Engineering & Soft Computing, Dept. of Informatics, University of Klagenfurt, [email protected]

Summary. The persistence and evolution of systems essentially depend on their ability to self-adapt to new situations. As an expression of intelligence, adaptation is a distinguishing quality of any system that is able to learn and to adjust itself in a flexible manner to new environmental conditions. Such an ability ensures self-correction over time as new events happen, new input becomes available, or new operational conditions occur. This requires self-monitoring of the performance in an ever-changing environment. The relevance of adaptation is established in numerous domains and by versatile real-world applications.

The primary goal of this contribution is to investigate adaptation issues in learning classification systems from different perspectives. Being a scheme of adaptation, lifelong incremental learning will be examined. Special attention will be given to adaptive neural networks, and the most visible incremental learning algorithms (fuzzy ARTMAP, nearest generalized exemplar, growing neural gas, generalized fuzzy min-max neural network, IL based on function decomposition) and their adaptation mechanisms will be discussed. Adaptation can also be incorporated in the combination of such incremental classifiers in different ways so that adaptive ensemble learners can be obtained too. These issues and others pertaining to drift will be investigated and illustrated by means of a numerical simulation.

1 Introduction

The continuity of systems rests on their ability to adapt to new situations. In real life, such an ability is one of the key features of any living organism and can be seen as an expression of intelligence. Undoubtedly, adaptation should be a feature of any system that is able to adjust itself in a flexible manner to new environmental conditions through self-correction over time as new events happen, new input becomes available, or new operational conditions occur. This implies a continuous improvement, or at least non-degradation, of the system performance in an ever-changing environment. Hence, building adaptive systems that are able to deal with nonstandard settings of learning and that are flexible in interacting with their environment at any time in an open-ended cycle of learning is an eminent research issue.

Adaptation is particularly manifest in intelligent applications where learning from data is at the heart of system modeling and identification. The goal is to cope with non-stationary changing situations by employing adaptive mechanisms to accommodate changes in the data. This becomes more important when storage capacities (memory) are very limited and when data arrives over long periods of time. In such situations, the system should adapt itself to the new data samples, which may convey a changing situation, and at the same time should keep in memory relevant information that had been learned in the remote past.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 237–258. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Fig. 1. Steps of the adaptive incremental learning

In this contribution, we aim at studying one of the fundamental aspects of adaptation, that is, adaptive incremental learning (AIL), which seeks to deal with data arriving over time or with (static) huge amounts of data that exceed the storage capacities, so that processing the data all at once is not feasible. Most of the available literature on machine learning reports on learning models that are one-shot experiences and, therefore, lack adaptation. Learning algorithms with an adaptive incremental learning ability are thus of increasing importance in many of today's online data stream and time series applications (e.g., text streams, video streams, stock market indexes, user profile learning, computer intrusion, etc.) but also in discrete data-starved applications where the acquisition of training data is costly and requires much time. An illustrative scenario is that of running expensive chemical or physical experiments that may take long periods of time in order to collect a training sample.

The concept of AIL we are interested in pertains to classification and clustering, though the focus is more on the first aspect. In such a context, the issue bears on devising adaptive learning mechanisms to induce new knowledge without 'catastrophic forgetting' and/or to refine the existing knowledge. This raises the question of how to adaptively accommodate new data in an incremental way while keeping the system in use. Figure 1 illustrates the type of frameworks considered in this work.

In particular, in this paper we explore, among others, the efficiency of some neural learning systems that are implicitly based on neural constructivism ideas [32][37]. Investigations based on these ideas have made it possible to understand how the lateral connectivity in the cortex has emerged and, more importantly, how the cortex can be seen as a continuously adapting dynamic system formed by competitive and cooperative lateral interactions. In other words, the cortex is shaped through the dynamic and adaptive interaction between the neural progressive evolution and the neural activity owed to the environment.

Following these cognitive and biological motivations, the neural learning systems discussed here are classification-oriented architectures that suggest an adaptive incrementality based on algorithmic tuning. From a classification and clustering perspective, it is worth defining the notion of incrementality, since the literature review shows that this term is often used in many different contexts and is subject to confusion. A classification learning system is said to be incremental if it has the following characteristics:

• Ability of on-line (life-long) learning,

• Ability to deal with the problem of plasticity and stability,

• Once processed, there is no capacity to save individual historical data points in subsequent stages,

• No prior knowledge about the topological structure of the neural network is needed,

• Ability to incrementally tune the network's structure,

• No prior knowledge about the data and its statistical properties is needed,

• No prior knowledge about the number of classes and prototypes per class is required, and

• No particular prototype initialization is required.

All incremental learning algorithms are confronted with the plasticity-stability dilemma. This dilemma establishes the tradeoff between avoiding catastrophic interference (or forgetting) on the one hand and the ability to incrementally and continually accommodate new knowledge in the future whenever new data samples become available on the other. The former aspect is referred to as stability, while the latter is referred to as plasticity. In a nutshell, the stability-plasticity dilemma is concerned with learning new knowledge without forgetting the previously learned one. This problem has been thoroughly studied by many researchers [10][14][29][33][36].

From another perspective, incrementality assumes phenomena that evolve over time and change their known evolution schemes. This refers in the first place to the problem of concept drift. To deal with such a problem, dedicated techniques for drift detection and handling most often use either full memory (i.e., the system has a memory and therefore has access to data already seen in the past) or partial memory (e.g., a temporal window of data). It seems, however, quite appealing to investigate the problem of concept drift with no memory (i.e., data is processed online without any full or temporal storage). This paper aims at looking closely at this approach.

Moreover, as we are interested in studying a collection of incremental learning algorithms, it is legitimate to observe adaptation from the perspective of ensemble learning. In fact, by considering such a line of investigation, the aim is to achieve a 3-level adaptation mechanism:

• Adaptation due to the nature of the classifiers: they are self-adaptive by construction,

• Adaptation due to the proportional (weighted) contribution of each classifier in the ensemble decision, and

• Adaptation due to the structural update (dynamically changing structure) of the ensemble.

Before delving into the details of each adaptation level, we highlight the structure of the paper. Section 2 describes the incremental classifiers used, their differences and similarities. Section 3 looks at the problem of ensemble classifiers before discussing the problem of concept drift and adaptation consequences in Section 4. Section 5 describes an approach that unifies ensemble learning and drift handling from the perspective of adaptation. Section 6 provides an evaluation of the various adaptation levels mentioned earlier.

2 Roadmap through AIL Algorithms

There exist a number of incremental learning algorithms that are known to be lifelong learning algorithms. For the sake of exhaustiveness, we select the five most illustrative algorithms: fuzzy ARTMAP (FAM) [7][14], nearest generalized exemplar (NGE) [35], generalized fuzzy min-max neural networks (GFMMNN) [13], growing neural gas (GNG) [12][28], and incremental learning based on function decomposition (ILFD) [3]. These incremental algorithms are chosen for their characteristics, including different types of prototypes, generation mechanisms, operations on the prototypes (shrinking, deletion, growing, overlap control), noise resistance, and data normalization requirement. It is, however, important to recall that some of the algorithms require recycling over the data to achieve more stability. This will be avoided in this study so that the spirit of incremental learning as defined earlier is preserved.

Table 1. Characteristics of AIL algorithms

                             Hyperbox-based            Point-based
Characteristics              FAM    GFMMNN   NGE       GNG    ILFD
Online learning              y      y        y         y      y
Type of prototypes           hbox   hbox     hbox      node   center
Generation control           y      y        y         y      y
Shrinking                    n      y        y         u      u
Deletion                     n      n        n         y      y
Overlap                      y      n        n         u      u
Growing                      y      y        y         u      u
Noise resistance             u      y        u         u      u
Sensitivity to data order    y      y        y         y      y
Normalization                y      y        y/n       n      y/n

Legend: y: yes, n: no, u: unknown


Table 1 shows some of the characteristics of the studied algorithms. Each of these algorithms is capable of online learning and produces a set of prototypes per class (in the case of classification). The algorithmic steps in all these algorithms are theoretically the same. A prototype is generated when the incoming data point is sufficiently dissimilar to the existing prototypes. Otherwise, an adjustment of some of the existing prototypes is conducted. The first characteristic that distinguishes these algorithms is the type of prototypes. In fact, we propose to categorize them into 2 classes: hyperbox-based algorithms (HBAs) and point-based algorithms (PAs). The HBAs class includes FAM, NGE and GFMMNN. Many variations of these algorithms exist. For instance, there exist some attempts to generalize the FAM categories to different shapes. The PAs class includes GNG and ILFD. While prototypes in GNG are nodes of a graph, in ILFD they are cluster centers in the sense of radial basis function neural networks and can take different geometrical shapes depending on the type of distance used.
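The shared skeleton described above (create a prototype when the input is too dissimilar, otherwise adjust an existing one) can be sketched as follows. This is a minimal illustration and not any of the five algorithms: the Euclidean distance, the `threshold` and `rate` parameters and the function name `incremental_step` are assumptions made for the example.

```python
import math

def incremental_step(prototypes, x, label, threshold=1.0, rate=0.1):
    """Process one labelled sample without storing it.

    prototypes: list of (weight_vector, label) pairs, modified in place.
    threshold and rate are hypothetical control parameters.
    Returns the index of the prototype that was created or adjusted.
    """
    best, best_dist = None, math.inf
    for idx, (w, lab) in enumerate(prototypes):
        if lab != label:
            continue  # only prototypes of the same class may absorb x
        d = math.dist(w, x)
        if d < best_dist:
            best, best_dist = idx, d
    if best is None or best_dist > threshold:
        # x is too dissimilar to every prototype of its class: create one
        prototypes.append((list(x), label))
        return len(prototypes) - 1
    # otherwise move the nearest prototype a small step toward x
    w, lab = prototypes[best]
    prototypes[best] = ([wi + rate * (xi - wi) for wi, xi in zip(w, x)], lab)
    return best
```

A larger `threshold` yields fewer, coarser prototypes, while a smaller one favours plasticity at the risk of prototype proliferation.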

Usually, in these IL algorithms the decision of assigning new data points to either an existing prototype or to a new prototype is based on some control parameters (or procedures) like the size of hyperboxes, the error threshold and the similarity threshold. A further observation is that all HBAs allow the hyperboxes to grow. Moreover, they all, except FAM, allow the hyperboxes to shrink. Quite appealing is the fact that the growth of the hyperboxes is controlled in GFMMNN, but not in the other HBAs, in order to avoid any overlap of hyperboxes with different labels (classes). Note also that, while PAs are equipped with prototype deletion mechanisms, HBAs are not. It is worth pointing out that all algorithms suffer from the problem of sensitivity to the presentation order of data and that not all algorithms require standardization of data. FAM and GFMMNN require data to be normalized to fall in [0,1], while ILFD works better with normalized data, in contrast to the other algorithms, which do not require any standardization.

On the other hand, the desired properties of any AIL algorithm are stability, plasticity and the ability to track the evolution [4]. Plasticity is already fulfilled by all HBAs and PAs since they are able to accommodate new data points. However, the way to quantify it has not yet been discussed in the relevant literature. Its importance pertains to the complexity of the model. Usually too much plasticity leads to the problem of prototype proliferation. As noted earlier, generation is controlled by some parameter(s), and the question that arises is: what is the appropriate value of such parameters? Another related issue is: how can the model distinguish between rare events and noisy data? These questions remain open issues.

Stability, in turn, aims at equipping the AIL algorithms with the ability to preserve the prototypes already learned. Usually the trend of the data changes and the system has to adapt in order to be consistent with the current configuration of the environment. However, this adaptation may result in forgetting some of the learned knowledge. Quantifying stability can be done by measuring the amount of forgetting. This has not been clearly studied in any of the existing AIL algorithms.

The third aspect that is of relevance in AIL is the ability to track changes. This is related to the problem of plasticity in the sense of distinguishing between rare events and noise. The challenge lies in scenarios where the system at a time t makes a prediction p given a sample x, gets adapted afterwards several times, and then at time t′, due to some external feedback as portrayed in Fig. 2, it turns out that p was a wrong decision. The question is then how AIL systems can reconsider their wrong decisions by undoing adaptations that took place between times t and t′. Current algorithms are not able to adjust the systems by re-examining old decisions. To deal with this problem, and as a minimum requirement, AIL systems have to be able to retrieve some of their past decisions.

Fig. 2. Logical structure of an online learning system

The present paper does not deal explicitly with such issues. However, for the sake of completeness, they are mentioned here with the aim of opening new research perspectives.

2.1 Fuzzy ARTMAP

Fuzzy ARTMAP (FAM) is one of many adaptive resonance network models introduced by Grossberg and Carpenter [7][14]. FAM has some interesting properties such as fast convergence and the capacity for incremental learning. It is a supervised network composed of two Fuzzy ARTs, identified as ARTa and ARTb. ARTa consists of two fully interconnected layers of neurons, the input layer F1 and the output layer F2. A neuron of F2 represents one prototype formed by the network and is characterized by its weight vector. The network learns by placing hyperboxes in the 2*M-dimensional space, where M is the size of layer F1. Each prototype is defined by a box. The first M positions of a weight vector (prototype) represent one corner of the box (the closest to the origin) and the other M positions represent the opposite one. When an input is presented to the network, it goes through the following steps. First, the smallest box including the input is selected using a choice function, which checks if the box is a fuzzy subset of the input. If no such box exists, either the one that needs the least expansion to enclose the point is selected or a new one is created. Once a neuron is selected, a vigilance criterion is checked. It serves for choosing another box if the one selected is already too large compared to a vigilance parameter ρ. If the criterion is satisfied, the network learns the input; otherwise it selects the neuron with the next highest choice function value and reevaluates the vigilance criterion. These two steps are repeated until the vigilance criterion is met. The network is then said to be in resonance, and all neurons of F2 are set to 0 except the winning neuron, which is set to 1. Then, ARTb compares the mapped class of the input with the actual input's label. If the labels are different, ARTa creates a new prototype and connects it to the input's label.
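The choice and vigilance steps can be illustrated with a small sketch of the unsupervised ARTa part only. The formulas used here (complement coding, the choice function |I ∧ w| / (α + |w|), the match ratio |I ∧ w| / |I| and the learning rule with rate β) are the commonly published fuzzy ART equations and are an assumption with respect to this chapter, which does not spell them out. The map field to ARTb and label checking are omitted, and the names `art_step`, `alpha` and `beta` are illustrative.

```python
def complement_code(a):
    """Append 1 - a to a, so a box corner pair fits in one weight vector."""
    return list(a) + [1.0 - v for v in a]

def fuzzy_and(u, v):
    """Component-wise minimum (fuzzy AND)."""
    return [min(p, q) for p, q in zip(u, v)]

def art_step(weights, a, rho=0.75, alpha=0.001, beta=1.0):
    """Present input a (features in [0, 1]) to the categories in weights.

    Returns the index of the resonating (or newly created) category.
    """
    I = complement_code(a)
    norm_I = sum(I)

    def choice(j):
        return sum(fuzzy_and(I, weights[j])) / (alpha + sum(weights[j]))

    # visit categories from the highest to the lowest choice value
    for j in sorted(range(len(weights)), key=choice, reverse=True):
        overlap = fuzzy_and(I, weights[j])
        if sum(overlap) / norm_I >= rho:  # vigilance criterion met
            weights[j] = [beta * o + (1.0 - beta) * w
                          for o, w in zip(overlap, weights[j])]
            return j
    weights.append(I)  # no category passed the vigilance test
    return len(weights) - 1
```

With `beta = 1.0` (fast learning), the first half of a weight vector holds the lower corner of the box and one minus the second half holds the upper corner, so learning can only enlarge a box.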


2.2 Nearest Generalized Exemplar

Similar to ART, NGE uses class exemplars in the form of hyperboxes to perform classification. During training, hyperboxes are incrementally either created, shrunk or expanded along one or more dimensions. Specifically, the new input is first matched against each of the existing hyperboxes. This is done as follows. If the sample lies outside of all existing hyperboxes, a weighted Euclidean distance is computed. In this distance measure, both features and hyperboxes are weighted. Initially the weight, wh, of each hyperbox, h, is set to 1 and is updated incrementally. At each step, wh is set to A/B, where A is the number of times h has been used and B is the number of times h has made a correct classification. Likewise, each feature is weighted, with its weight initially set to 1. This weight can be either increased or decreased depending on the feature's contribution to the correct classification of samples. If the sample falls inside a hyperbox, its distance to that hyperbox is simply zero. If the sample is equidistant to several hyperboxes, the smallest of these is chosen. Once the nearest hyperbox is found, its label is checked against that of the new input. If they have the same label, the hyperbox is extended to include the input sample. If they have distinct labels, the second nearest hyperbox is searched and its label is again compared with that of the input sample. If they match, the hyperbox is updated. Should both the first and second nearest hyperboxes have labels different from that of the new sample, a new hyperbox is created that consists solely of the input sample. Note that in [42], a data normalization step is executed ahead of the training.
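The matching and extension steps can be sketched as below. The exact form of the weighted distance is not given in the text, so the formula here (hyperbox weight times the Euclidean norm of feature-weighted per-dimension gaps) is an assumption, as are the names `hyperbox_distance` and `extend_hyperbox`.

```python
import math

def hyperbox_distance(x, lower, upper, w_h=1.0, w_f=None):
    """Weighted distance from point x to the box [lower, upper].

    w_h is the hyperbox weight (A/B in the text), w_f the feature weights.
    Returns 0.0 when x lies inside the box.
    """
    w_f = w_f or [1.0] * len(x)
    total = 0.0
    for xi, lo, hi, wf in zip(x, lower, upper, w_f):
        if xi < lo:
            gap = lo - xi
        elif xi > hi:
            gap = xi - hi
        else:
            gap = 0.0  # inside the box along this dimension
        total += (wf * gap) ** 2
    return w_h * math.sqrt(total)

def extend_hyperbox(lower, upper, x):
    """Grow the box just enough to include x."""
    new_lower = [min(lo, xi) for lo, xi in zip(lower, x)]
    new_upper = [max(hi, xi) for hi, xi in zip(upper, x)]
    return new_lower, new_upper
```

Because A/B grows when a hyperbox is used often but is rarely correct, unreliable hyperboxes appear farther away and are selected less frequently.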

2.3 Growing Neural Gas

Growing neural gas (GNG) [12] generates prototypes within the data space by building a graph that consists of a set of nodes and a set of connections. Each connection is characterized by an age and each node is characterized by a local error variable.

Given a new input x, the first winner, n1, and the second winner, n2, are determined. If there is no connection between these two nodes, then it is created. The connection age is set to 0. The squared distance between the first winner and the input is added to the local error variable of the first winner. The weight vectors of the first winner and its direct topological neighbors are adapted by fractions η1 and ηi of the distance to x. The ages of all connections leading from the winner neuron are increased by 1. Connections with an age greater than a given maximum are then removed. If some neurons become disconnected, then they are also removed. If the number of processed inputs has reached an integer multiple of a given parameter λ, a new neuron q is inserted between the neuron n with the largest error and the neuron m with the largest error among the neighbors of n. The weight of q is set to the mean value of the n and m weights. A connection between q and each of n and m is set, and that between n and m is removed. The errors of n and m are decreased by a given fraction, while that of the new node is set to the mean value of the errors of n and m. Then, the error of all neurons is decreased by a given fraction. The process is repeated until a predefined termination condition is met, which can be the size of the network or any performance measure.
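One adaptation step, in the order given above, can be sketched as follows. The periodic node insertion and the global error decay are left out to keep the example short. The data layout (dictionaries keyed by node id, edges keyed by `frozenset` pairs), the default parameter values and the name `gng_adapt` are assumptions for illustration.

```python
def gng_adapt(nodes, errors, edges, x, eta1=0.05, etan=0.006, max_age=50):
    """Adapt a GNG graph to one input x (no node insertion).

    nodes:  dict node_id -> weight list
    errors: dict node_id -> accumulated local error
    edges:  dict frozenset({a, b}) -> age
    Returns the ids of the first and second winner.
    """
    dist2 = {n: sum((wi - xi) ** 2 for wi, xi in zip(w, x))
             for n, w in nodes.items()}
    n1, n2 = sorted(dist2, key=dist2.get)[:2]
    edges[frozenset((n1, n2))] = 0        # create or refresh the connection
    errors[n1] += dist2[n1]               # accumulate squared error
    neighbours = [next(iter(e - {n1})) for e in edges if n1 in e]
    nodes[n1] = [wi + eta1 * (xi - wi) for wi, xi in zip(nodes[n1], x)]
    for n in neighbours:
        nodes[n] = [wi + etan * (xi - wi) for wi, xi in zip(nodes[n], x)]
    for e in list(edges):                 # age the winner's connections
        if n1 in e:
            edges[e] += 1
            if edges[e] > max_age:
                del edges[e]
    connected = set().union(*edges)       # nodes that still have an edge
    for n in list(nodes):
        if n not in connected:
            del nodes[n]
            del errors[n]
    return n1, n2
```

The graph needs at least two nodes before the first call, since two winners are always selected.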


2.4 Generalized Fuzzy Min-Max Neural Network

The generalized fuzzy min-max neural network (GFMMNN) [13] is a classifier belonging to the class of HBAs. GFMMNN is a neural network that consists of 3 layers: F1, F2, and F3. The input layer, F1, consists of 2∗n processing nodes, twice as many as the dimensions of the input.

The hidden layer, F2, consists of nodes in the form of hyperbox fuzzy sets. These nodes are created during training. The connections between F1 and F2 represent the min-max points of hyperboxes, while the transfer function of the F2 nodes is the hyperbox membership function. The min and max points are stored in the matrices V and W respectively. The connections are adjusted using a learning algorithm. The layer F3 is the class layer.

Learning consists of three steps: hyperbox expansion, hyperbox overlap test, and hyperbox contraction. Given an input, x, expansion aims at identifying the hyperbox that can be expanded. If such a hyperbox does not exist, a new hyperbox consisting of x is created. An expandable hyperbox has to satisfy three conditions: (1) providing the highest degree of membership, (2) representing the same class as the actual input and (3) meeting an expansion criterion which is a function of the hyperbox size (that is, θ). The overlap test aims at checking the existence of overlap between hyperboxes from different classes. Contraction is executed if overlap exists. It aims at eliminating that overlap using an adjustment procedure.

2.5 Incremental Learning via Function Decomposition

ILFD [3] aims at enabling lifelong classification of data lying in different regions of the space, allowing the generation of non-convex and disconnected partitions. Each class is approximated by a certain number of clusters centered around their prototypes. ILFD tries to infer a classification rule that is obtained by composition of 2 functions. The first function, G, mapping the input X onto prototypes W, is called a clustering function. The second function, H, mapping the prototypes onto class labels Y, is called a labeling function. These functions are realized by a 3-layered neural network: input layer, prototype layer and class layer. The topology of the network is updated either by adding new prototype nodes, by adding new class nodes or by deleting redundant and dead units of L(2). A prototype node is redundant if it can be covered by other nearby nodes. A prototype node is dead if it remains inactive for a long period. Deletion is controlled by two checks called the dispersion test and the staleness test.

Given a new input x_k with a label y_i, if no class node of that label is available, then a new prototype node j and a corresponding new class node i are inserted in the net and the inter-layer weights are adjusted accordingly. If the label of the new input is known, this input is compared against all prototypes of the same label as x_k. If no prototype is sufficiently close to x_k, a new prototype P_j^i is created and the weights are adjusted accordingly. If there is a matching prototype node, the weights W_j are updated using the rival penalized competitive learning rule. The idea here is to move the competitor with a different label away from the region of the input vector so that the classes of the competing winners are kept as distinct as possible.


3 Combination of the AI Learners

Although many classifiers are proven to be universal non-linear function approximators (e.g., radial basis function, multi-layer perceptron, etc.), their performance may vary strongly due to the diversity and the definition range of their parameters. To alleviate the effect of parameter setting, it seems appealing to combine several classifiers in a symbiotic way. The idea is that even if the performance of one or a few neural networks is not fully satisfactory, the ensemble of the algorithms can still predict the correct output. Usually, when the task is relatively hard, multiple predictors are used following the divide-and-conquer principle [25, 45, 18, 17, 26, 20].

Fig. 3. Combining online learners

It is important to note that there are two combination schemes:

1. The individual classifiers (based on the same model) are trained on different, randomly generated data sets (re-sampling from a larger training set) before they are combined to perform the classification. These include stacking [44], bagging [5] and boosting [11].

2. The ensemble contains several classifiers trained on the same data but they are of different types (neural networks, decision trees, etc.), with different parameters (e.g. in multi-layer neural networks: different number of hidden layers, different number of hidden neurons, etc.), and trained using different initial conditions (e.g. weight initialization in neural networks, etc.) [25, 8].


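Table 2. Combination rules

Output type      Rule                      Expression
Class label      Majority voting           $D(x) = \arg\max_j \left[ \sum_{i=1}^{N} \delta(D_i(x), C_j) \right]$
Class label      Weighted maj. voting      $D(x) = \arg\max_j \left[ \sum_{i=1}^{N} w_i \, \delta(D_i(x), C_j) \right]$
Actual output    Product Rule              $O_j(x) = \frac{1}{N} \prod_{i=1}^{N} O_i^j(x)$
Actual output    Sum Rule                  $O_j(x) = \sum_{i=1}^{N} O_i^j(x)$
Actual output    Average Rule              $O_j(x) = \frac{1}{N} \sum_{i=1}^{N} O_i^j(x)$
Actual output    Generalized average       $O_j(x) = \left( \frac{1}{N} \sum_{i=1}^{N} \left( O_i^j(x) \right)^{\eta} \right)^{1/\eta}$
Actual output    Max Rule (optimistic)     $O_j(x) = \max_{i=1}^{N} O_i^j(x)$
Actual output    Min Rule (pessimistic)    $O_j(x) = \min_{i=1}^{N} O_i^j(x)$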

Both schemes seek to ensure a high diversity of the ensemble. Since the present paper is about different online algorithms, it is natural to focus on the second combination scheme. The ensemble consists of FAM, GFMMNN, NGE, GNG, and ILFD. The important issue is then how to combine such classifiers taking the notion of adaptation into account.

Basically there are two combination strategies as shown in Tab. 2:

1. Class labels: In this strategy, the combination rules use the individual decisions (assignments) of the N classifiers to generate the final decision of the ensemble. The most representative rule in this strategy is the voting rule and its weighted version. The class receiving the highest number of votes is retained as the decision of the ensemble. In the weighted version, the contribution of each classifier is represented by a weight, w_i, that is obtained based on background knowledge (e.g., performance of the classifier) or via training using a separate algorithm. Note that δ(.) in Tab. 2 indicates the Kronecker delta function. Moreover, there exist other class-label based combination strategies, such as BKS and Borda count [16], which are less popular.

2. Actual outputs: The decision of the ensemble can also be obtained by directly combining the outputs of each classifier rather than the labels corresponding to those outputs. The resulting outputs are then used to infer the final assignment decision (winning class). The widely used rules are those shown in Tab. 2. The most general one is the generalized mean rule, which subsumes the minimum rule (when η → −∞), the geometric mean (a variant of the product rule) (when η = 0), the average rule (η = 1), and the maximum rule (when η → ∞).

Because the output $O_i^j$ of a classifier i with respect to a class j corresponds to the estimate of the posterior probability $P(C_j | x)$ for j, it is important to normalize the outputs so that they sum up to 1. To do that, usually the softmax function is applied:

$$O_i^j = \frac{e^{O_i^j}}{\sum_{k=1}^{C} e^{O_i^k}}, \qquad j = 1, \ldots, C \qquad (1)$$
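The normalization of Eq. (1) and the generalized average of Tab. 2 can be illustrated with a minimal sketch. The function names `softmax_rows` and `generalized_mean`, the array layout (one row per classifier, one column per class) and the toy values are assumptions made for illustration only; they are not taken from the chapter.

```python
import numpy as np

def softmax_rows(outputs):
    """Apply Eq. (1) to each classifier (row) so that its C outputs sum to 1."""
    outputs = np.asarray(outputs, dtype=float)
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = outputs - outputs.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def generalized_mean(outputs, eta):
    """Combine an (N classifiers x C classes) array of positive outputs per class."""
    outputs = np.asarray(outputs, dtype=float)
    if eta == 0:                      # limit case: geometric mean
        return np.exp(np.log(outputs).mean(axis=0))
    if np.isposinf(eta):              # limit case: maximum rule
        return outputs.max(axis=0)
    if np.isneginf(eta):              # limit case: minimum rule
        return outputs.min(axis=0)
    return np.mean(outputs ** eta, axis=0) ** (1.0 / eta)

# Toy raw outputs of N = 3 classifiers for C = 3 classes (illustrative values).
raw = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2], [1.5, 1.4, 0.3]]
probs = softmax_rows(raw)
combined = generalized_mean(probs, eta=1)   # eta = 1 gives the average rule
winner = int(np.argmax(combined))           # index of the winning class
```

Changing `eta` switches between the rules subsumed by the generalized mean without touching the rest of the pipeline.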

There exist other combination rules that take the assignment decisions of the classifiers during training into account. These include decision templates and Dempster-Shafer [24], which are more demanding from the computational point of view.

Moreover, there have been several investigations to find the best combination rule [1, 20, 23, 40]. Although there is no general conclusion on this issue, it seems that the most widely accepted rules are the majority voting (for class labels) and the average (for continuous outputs), together with their weighted versions. These are preferred due to their advantageous tradeoff between accuracy and computational simplicity. Therefore, in the present study, the weighted majority voting rule will be applied to combine the diverse classifiers. In addition to the previously mentioned reasons, the choice of this combination rule is motivated by the aim of incorporating adaptation at this level too. The weights stand for a tuning mechanism that is not imposed but rather learned through time as data arrives and classifiers evolve. Hence, we see weighted majority voting as a second level of adaptation. It is the second element of the hybrid approach proposed in this paper, which consists of self-adaptive basic classifiers combined by a dynamic combination rule, where the structure of the ensemble is dynamic too. More details will follow in the next sections.
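As a minimal sketch of the (weighted) majority voting rule of Tab. 2, the following function accumulates the weight of every classifier on the label it predicts and returns the label with the largest total. The function name and the tie-breaking behaviour (the first label reaching the maximum wins) are assumptions of this sketch.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights=None):
    """Return the label with the largest accumulated weight.

    predictions: one predicted label per classifier.
    weights: one non-negative weight per classifier; equal weights
             (plain majority voting) are assumed when omitted.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight          # the Kronecker delta selects the predicted label
    return max(scores, key=scores.get)

plain = weighted_majority_vote(["a", "b", "b"])
weighted = weighted_majority_vote(["a", "b", "b"], [0.7, 0.2, 0.1])
```

With equal weights the two votes for "b" win, whereas a sufficiently large weight on the first classifier lets its single vote for "a" prevail.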

4 Concept Drift

In dynamic environments where data arrives over time, the data distribution very often drifts, leading to performance deterioration of the off-line trained system. In particular, the drift makes the model built using old data inconsistent with the new data. To tackle this problem, the classifier needs to be equipped with appropriate mechanisms to detect and handle concept drift. Such mechanisms help monitor the performance of the system efficiently by instantly updating the underlying model.

The state-of-the-art techniques in the context of concept drift are rather data driven, meaning that drift is handled only from the perspective of data. Several techniques based on this strategy exist: (i) instance weighting, (ii) instance selection, and (iii) ensemble learners. The idea underlying instance weighting [43] consists of decreasing the importance of samples with time. It is, however, hard to decide which instances should be assigned higher weights, though some heuristics using aging and typicality can be devised. Independently of drift handling, such an idea underlies some incremental classifiers discussed here. These include ILFD and GNG.

Instance selection [21] is the best known technique and includes two methodologies, fixed window and adaptive window, where the model is regenerated using the last data batches (the system possesses a memory). The challenge in the instance selection technique is finding the appropriate window. This can be seen from the perspective of forgetting, in that the classifier is adjusted over time by learning the new data and forgetting/unlearning outdated knowledge. As outlined in [43], the issue is quantifying the characteristics of forgetting, that is, the rate and the type of the drift (gradual, seasonal, abrupt). Often forgetting is simulated by re-training [34] the classifiers on a window of length n containing the new sample and the n − 1 old samples. However, the size of the window is a critical issue. A small window leads to low stability due to the few samples used to train the model, and a large window might lead to less responsiveness to change.
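Forgetting by re-training on a fixed window of length n can be sketched as follows. The wrapper class, the `fit`/`predict` interface of the underlying model and the toy `MajorityClass` model are hypothetical and serve only to make the sketch self-contained.

```python
from collections import Counter, deque

class MajorityClass:
    """Toy model (assumption): always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
    def predict(self, x):
        return self.label

class SlidingWindowLearner:
    """Re-trains a fresh model on the n most recent samples after each arrival."""
    def __init__(self, make_model, window_size):
        self.make_model = make_model              # zero-argument model factory
        self.window = deque(maxlen=window_size)   # older samples drop out automatically
        self.model = None

    def update(self, x, y):
        self.window.append((x, y))                # new sample plus the n - 1 old ones
        xs, ys = zip(*self.window)
        self.model = self.make_model()
        self.model.fit(list(xs), list(ys))

    def predict(self, x):
        return self.model.predict(x)

learner = SlidingWindowLearner(MajorityClass, window_size=3)
for x, y in [(0.1, 0), (0.2, 0), (0.3, 0), (0.9, 1), (1.0, 1)]:
    learner.update(x, y)
prediction = learner.predict(0.5)   # the window now holds labels 0, 1, 1
```

The `window_size` argument makes the stability/responsiveness tradeoff explicit: a larger value keeps more old samples, a smaller one reacts faster to change.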

Simulating forgetting can also be done using an adaptive window whose size changes over time. If a drift is detected, the window is downsized. But even with that, it is difficult to accurately modify the size of the window according to the pace of drift. The last known type of forgetting is density-based [38], where instances are deleted from the learning set if they are already reflected by the local regions (local models). This results in a constant update of the high-density regions, while not removing the data representing rare events.

More relevant to the present study is the ensemble learners¹ technique, especially incremental ensembles, according to which the idea of instance selection is generalized so that many classifiers vote. Their weights are changed so that the successful classifiers that detect drift are rewarded and/or those which do not detect drift are replaced by new ones trained on newly arriving data. Other ensemble-based approaches seek to learn a set of concept descriptions over different time intervals [22, 19, 39, 41]. Relying on weighted voting, the predictions of the learners are combined and the most relevant description is selected.

To simplify the categorization provided by Kuncheva [24], ensemble learning algorithms for changing environments can be classified into one of the following three types: (i) dynamic combination, where the base learners are trained in advance and dynamically combined to respond to the changes in the environment by changing the combination rule (Winnow and weighted majority are often used [27, 41, 43]); (ii) continuous update of the learners, such that the learners are either retrained in batch mode or updated online using the new data [6, 9, 30] (the combination rule may or may not change in the process); (iii) structural update, where new learners are added (or existing ones are activated if they were deactivated) and inefficient ones are removed (or deactivated) [22, 39] relying on accuracy criteria.

To fit the context of an open-ended learning cycle, we consider a hybrid approach that unifies these types: a dynamic combination rule for classifiers that learn online and that can be structurally updated over time as the accuracy of the ensemble starts to deteriorate. The adaptation of the ensemble structure is further described in the next section.

5 Online Mistake-Driven Ensemble Learning

Following the line of description in the previous section, the next goal is to define mechanisms that achieve the online learning of the individual classifiers while seeking a dynamic structural update of the ensemble that ensures adaptation in dynamically changing environments (e.g. in the presence of concept drift).

¹ "Ensemble learners" and "ensemble classifiers" are used interchangeably.

It is therefore important to apply appropriate combination techniques that fit the current incremental context. An early work by Littlestone [27] focused on the idea of online mistake-driven learning. While that work was originally proposed in the context of attribute weighting so that a more accurate classifier can be obtained, its adoption in ensemble learning is straightforward, as we will see shortly.

The basic Winnow mistake-driven algorithm for individual classifiers is given as:

• Initialize weights $w_1, \ldots, w_n$

• Given an input $x_t$ with its corresponding output $y_t$, get the predicted label from the classifier

• Update the classifier:

  ◦ If no mistake, then $w(t+1) = w(t)$ // no modification

  ◦ Otherwise set $w(t+1) = w(t)\, e^{\eta y_t x_t} / Z_t$ // $Z_t$ is a normalization factor and η is a learning rate
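The mistake-driven update above can be sketched in a few lines. Several details are not specified in the text and are assumptions of this sketch: labels are taken to be in {−1, +1}, the prediction is a linear threshold on the weighted input with a hypothetical `threshold` parameter, the exponential is applied component-wise, and $Z_t$ is taken to be the sum of the updated weights.

```python
import numpy as np

def winnow_step(w, x, y, eta=0.5, threshold=0.5):
    """One mistake-driven update; returns the new weights and the prediction."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    y_hat = 1 if w @ x >= threshold else -1   # assumed linear-threshold prediction
    if y_hat == y:
        return w, y_hat                       # no mistake: weights are left unchanged
    w_new = w * np.exp(eta * y * x)           # multiplicative (exponentiated) update
    return w_new / w_new.sum(), y_hat         # Z_t: renormalize so the weights sum to 1

w0 = np.full(4, 0.25)
w1, pred1 = winnow_step(w0, x=[1, 0, 0, 0], y=1)   # false negative: first weight grows
w2, pred2 = winnow_step(w0, x=[1, 1, 1, 0], y=1)   # correct prediction: no change
```

Only the weights of features that are active in a misclassified input change relative to the others, which is what makes the scheme cheap enough for online use.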

As in the Perceptron, the Winnow algorithm updates the weights only when an input is misclassified. Moreover, the Winnow algorithm can be seen as a weighted majority algorithm (i.e. a learner replaces the attribute in the original version of Winnow). However, the update rule changes [2]. This idea is portrayed in Alg. 1.

As illustrated, the weights of the individual learners are updated when the ensemble misclassifies the current input, following the original scheme of the Winnow algorithm. In doing so, individuals that produce the right predictions are promoted (by increasing their weights) and those that fail to predict the correct class of the current input are demoted (i.e. penalized) by decaying their weights. In the current paper, we apply the steps of Alg. 1. Note that n indicates the number of learners; its value is therefore set to 5 in our case. Recall, however, that all classifiers are trained while the ensemble is tuned, all online.

One aspect to be added to Alg. 1 in the context of an incremental open-ended learning cycle is the removal of inefficient learners and the addition of new ones. This corresponds to the third adaptation level, after self-adaptation and the weighted contribution of individual classifiers. As explained earlier, there exist some attempts in the context of online ensembles to change the structure of the ensemble through the addition and deletion of ensemble members [22, 39]. In this study, Alg. 1 is modified, resulting in the version shown in Alg. 2.

While the delete operation in Step 7 of Alg. 2 is well defined, the add operation in Step 9 needs full specification. Although the learners used in this study already ensure the desired high diversity of the ensemble, further reinforcement of this diversity can still be achieved. To do that, if a new learner is to be appended to the ensemble at some time, its type (i.e., FAM, NGE, GFMMN, GNG, ILFD) can be decided based on the current variability of the ensemble. For each type, a learner is virtually added and the variability of the ensemble is computed. The type that produces the highest variability is effectively added. Of course, one can consider another alternative, that is, the accuracy reflected by the associated weight of each type in the current configuration of the ensemble. Such a weight can be used as a factor for deciding which type of learner is to be instantiated. In the present paper, we focus on the latter alternative. The former one will be the subject of future work.

Algorithm 1. The Winnow algorithm as a weighted majority voting algorithm

1: Initialize weights $w_1, \ldots, w_n$ // n learners, the promotion parameter α (α > 1) and the demotion parameter β (0 < β < 1), the weights $w_i$ s.t. $\sum_{i=1}^{n} w_i = 1$
2: Present the current input $x_t$ to each learner (the corresponding output is $y_t$)
3: Get predictions from the learners $(p_1^t, \ldots, p_n^t)$
4: Compute the weighted majority vote (the decision of the ensemble)
     $\hat{y}_t = \arg\max_{c} \sum_{j=1}^{n} w_j \, [c = p_j^t]$
5: if $\hat{y}_t \neq y_t$ then
6:   for all learners j = 1, ..., n do
7:     if $p_j^t = y_t$ then
8:       $w_j^{(t+1)} = w_j^{(t)} \cdot \alpha$
9:     else
10:      $w_j^{(t+1)} = w_j^{(t)} \cdot \beta$
11:    end if
12:  end for
13:  Normalize the weights: $w_j = w_j^{(t+1)} / \sum_{i=1}^{n} w_i^{(t+1)}$
14: end if
15: Train each learner on the input $(x_t, y_t)$
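The steps of Alg. 1 can be sketched in Python as follows. The learner interface (`predict` and `learn` methods), the `ConstantLearner` stub and the default values of α and β are assumptions introduced only to make the sketch runnable; the actual base learners in the study are FAM, GFMMN, NGE, GNG and ILFD.

```python
class ConstantLearner:
    """Stub learner (assumption): always predicts a fixed label and learns nothing."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label
    def learn(self, x, y):
        pass

class WinnowEnsemble:
    def __init__(self, learners, alpha=2.0, beta=0.5):
        assert alpha > 1 and 0 < beta < 1
        self.learners = list(learners)
        self.alpha, self.beta = alpha, beta
        n = len(self.learners)
        self.weights = [1.0 / n] * n                     # step 1: weights sum to 1

    def vote(self, predictions):
        scores = {}
        for p, w in zip(predictions, self.weights):
            scores[p] = scores.get(p, 0.0) + w
        return max(scores, key=scores.get)               # step 4: weighted majority

    def step(self, x, y):
        preds = [l.predict(x) for l in self.learners]    # steps 2-3
        y_hat = self.vote(preds)
        if y_hat != y:                                   # step 5: ensemble mistake
            self.weights = [w * (self.alpha if p == y else self.beta)
                            for w, p in zip(self.weights, preds)]   # steps 6-12
            total = sum(self.weights)
            self.weights = [w / total for w in self.weights]        # step 13
        for l in self.learners:
            l.learn(x, y)                                # step 15: online training
        return y_hat

ensemble = WinnowEnsemble([ConstantLearner(0), ConstantLearner(0), ConstantLearner(1)])
first = ensemble.step(x=None, y=1)    # two of three learners vote 0: a mistake
second = ensemble.step(x=None, y=1)   # the promoted third learner now dominates
```

In this toy run a single ensemble mistake is enough to shift the decision to the learner that was right.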

The idea of adding and deleting learners aims at dealing with data drift. If the arriving data has a different probability distribution compared to that of the data already seen, the learners might not be able to handle such situations efficiently despite their adaptation capabilities. Therefore, the option of online creation of new learners and deletion of inefficient ones is highly desirable.
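The structural update of Alg. 2 (delete the weakest learner when the ensemble exceeds L members, then add a new one) might be sketched as follows. The learner interface, the `ConstantLearner` stub and the `factories` mapping are hypothetical. The type-selection rule used here, namely instantiating the type whose current members hold the largest total weight after promotion and demotion, is only one possible reading of the weight-based criterion and is an assumption of this sketch.

```python
class ConstantLearner:
    """Stub learner (assumption): always predicts a fixed label."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label
    def learn(self, x, y):
        pass

class AdaptiveWinnowEnsemble:
    def __init__(self, factories, max_learners, alpha=2.0, beta=0.5):
        self.factories = dict(factories)          # type name -> zero-argument constructor
        self.max_learners = max_learners          # L in Alg. 2
        self.alpha, self.beta = alpha, beta
        self.types = list(self.factories)         # start with one learner per type
        self.learners = [self.factories[t]() for t in self.types]
        self.weights = [1.0 / len(self.learners)] * len(self.learners)

    def _vote(self, preds):
        scores = {}
        for p, w in zip(preds, self.weights):
            scores[p] = scores.get(p, 0.0) + w
        return max(scores, key=scores.get)

    def _choose_type(self):
        totals = {}                               # assumed criterion: heaviest type wins
        for t, w in zip(self.types, self.weights):
            totals[t] = totals.get(t, 0.0) + w
        return max(totals, key=totals.get)

    def step(self, x, y):
        preds = [l.predict(x) for l in self.learners]
        y_hat = self._vote(preds)
        if y_hat != y:
            if len(self.learners) > self.max_learners:            # steps 6-8: delete
                worst = min(range(len(self.weights)), key=self.weights.__getitem__)
                for seq in (self.learners, self.weights, self.types, preds):
                    del seq[worst]
            # steps 10-16: promote/demote the existing learners only
            self.weights = [w * (self.alpha if p == y else self.beta)
                            for w, p in zip(self.weights, preds)]
            new_type = self._choose_type()
            self.learners.append(self.factories[new_type]())      # step 9: add
            self.types.append(new_type)
            self.weights.append(1.0)                              # w_n = 1 before normalizing
            total = sum(self.weights)
            self.weights = [w / total for w in self.weights]      # step 17
        for l in self.learners:
            l.learn(x, y)                                         # step 19
        return y_hat

factories = {"zero": lambda: ConstantLearner(0), "one": lambda: ConstantLearner(1)}
ens = AdaptiveWinnowEnsemble(factories, max_learners=2)
first = ens.step(None, 1)     # tie resolved in favour of label 0: a mistake, a learner is added
size_after_first = len(ens.learners)
second = ens.step(None, 1)
```

Because the deletion test is `n > L` and a learner is added on every ensemble mistake, the ensemble in this sketch can hold up to L + 1 members between mistakes.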

6 Numerical Evaluation

For the sake of illustration, three real-world data sets pertaining to classification problems are used in this study; their characteristics are shown in Tab. 3. These sets are: the defect data set, which describes software defect detection [31], and the breast cancer and spam data sets from the UCI repository [15].

In order to simulate drift in the real-world data sets, which do not originally contain drift, we proceed by first sorting the data according to a certain feature and then deleting that feature from the data set. This is an easy way of adding drift to a given data set and of making that drift somewhat implicit. In the case of the defect data set, the attribute representing the total McCabe's line count of code was used to generate the drift, while in the case of spam and cancer, "capital run length total" (representing the total number of capital letters in the e-mail) and "radius" (representing the radius of the cell nucleus, that is, the mean of distances from the center to points on the perimeter) are respectively used to generate the drift.

Algorithm 2. The Winnow algorithm as a weighted majority voting algorithm

1: Initialize weights $w_1, \ldots, w_n$ // n learners, the promotion parameter α (α > 1) and the demotion parameter β (0 < β < 1), the weights $w_i^1$ s.t. $\sum_{i=1}^{n} w_i^1 = 1$, and the maximum number of learners L
2: Present the current input $x_t$ to each learner (the corresponding output is $y_t$)
3: Get predictions from the learners $(p_1^t, \ldots, p_n^t)$
4: Compute the weighted majority vote (the decision of the ensemble)
     $\hat{y}_t = \arg\max_{c} \sum_{j=1}^{n} w_j^t \, [c = p_j^t]$
5: if $\hat{y}_t \neq y_t$ then
6:   if n > L then
7:     Delete learner j with weight $w_j^t = \min[w_i^t]_{i=1,\ldots,n}$, and set n = n − 1
8:   end if
9:   Add a learner of a particular type (after some criterion) and set n = n + 1, $w_n = 1$
10:  for all learners j = 1, ..., n − 1 do // ignoring the newly added learner
11:    if $p_j^t = y_t$ then
12:      $w_j^{(t+1)} = w_j^{(t)} \cdot \alpha$
13:    else
14:      $w_j^{(t+1)} = w_j^{(t)} \cdot \beta$
15:    end if
16:  end for
17:  Normalize the weights: $w_j = w_j^{(t+1)} / \sum_{i=1}^{n} w_i^{(t+1)}$
18: end if
19: Train each learner on the input $(x_t, y_t)$

Table 3. Characteristics of the data

| Data | Size | # Classes | # Features |
|---|---|---|---|
| Cancer | 683 | 2 | 9 |
| Spam | 4601 | 2 | 57 |
| Defect | 2109 | 2 | 22 |

Note that in previous studies [34], we generated drifting data using some controlled formulas.
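The sort-then-delete procedure used here to simulate drift can be sketched with NumPy. The function name, the array layout (samples in rows, features in columns) and the toy data are assumptions for illustration.

```python
import numpy as np

def simulate_drift(X, y, feature_index):
    """Order the samples by one feature, then remove that feature from the data."""
    X = np.asarray(X)
    y = np.asarray(y)
    order = np.argsort(X[:, feature_index], kind="stable")   # arrival order of the stream
    X_sorted = np.delete(X[order], feature_index, axis=1)     # the drift becomes implicit
    return X_sorted, y[order]

# Toy data: feature 0 drives the ordering and is then dropped.
X = [[3, 10], [1, 20], [2, 30]]
y = [0, 1, 2]
X_drift, y_drift = simulate_drift(X, y, feature_index=0)
```

Presenting `X_drift` row by row to the learners then yields a stream whose distribution changes gradually along the hidden feature.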

In this numerical evaluation, we intend to study (1) the algorithms in an incremental setting (data arrives over time and the algorithms can see the data only once), (2) their combination using the ensemble learning algorithm described by Alg. 1 (without dynamic update of the number of basic classifiers), (3) their combination using Alg. 2 with dynamic update of the number of basic classifiers, (4) their combination in the presence of drift without update, and (5) their combination with update in the presence of drift. Note that all results are averaged over 10 runs so that initial conditions have less effect on the general conclusions.


Table 4. Parameter settings of the networks

| Network | Parameter | Value |
|---|---|---|
| FAM | Baseline vigilance (ρ_a) | 0.8 |
| FAM | Vigilance of Map Field (ρ_ab) | 0.3 |
| FAM | Choice parameter (α) | 0.01 |
| GFMMN | Hyperbox size (θ) | 0.1 |
| GFMMN | Sensitivity (γ) | 0.05 |
| GNG | Learning rate - winner (e_w) | 0.9 |
| GNG | Learning rate - neighbors (e_n) | 0.0001 |
| GNG | Learning rate - output (η) | 0.08 |
| GNG | Adaptation (insertion) (λ) | 100 |
| GNG | Error decrease (all nodes) (α) | 0.1 |
| GNG | Error decrease (insertion) (β) | 0.5 |
| GNG | Maximal age (a_max) | 300 |
| ILFD | Learning rate - winner (e_w) | 0.4 |
| ILFD | Learning rate - rival (e_n) | 0.02 |
| ILFD | Confidence (R) | 0.91 |
| ILFD | Confusion (M) | 0.01 |
| ILFD | Staleness (γ) | 500 |

After initial experiments, we found the parameter values shown in Tab. 4 to be the most suitable, providing the highest classification accuracy. Note that NGE does not have any tuning parameter.

6.1 Performance of the Base Learners

Recall that training of the classifiers is done incrementally in an adaptive online manner. Hence, each of the classifiers sees a particular sample only once (i.e., classifiers have no data storage memory). Moreover, we use the accuracy ratio (the ratio of the correctly classified samples to the size of the testing set) to measure the performance of the classifiers.

Considering the pre-specified parameter settings, the accuracy of the individual classifiers is computed on the testing set, providing the results shown in Tab. 5. It is worth noticing that the classifiers perform differently from one data set to another. The simplest data set is the cancer data, while the most difficult on average is the defect data. In terms of performance, FAM seems to perform better than the other base classifiers on average. Occasionally, however, the other classifiers perform better than FAM.

6.2 Performance of the Ensemble Classifiers

When training an ensemble classifier consisting of the five basic incremental classifiers relying on Alg. 1, where the combination rule is the weighted majority voting, we obtain the results shown in Tab. 6, which accompanies Tab. 5. The weight represents the contribution of each individual in making the ensemble's decision. The higher the weight, the more importance the individual gets in the voting process.

Having obtained the weights during the training phase, the ensemble can be evaluated on the testing data. The classification results (last column) reflect the general performance of the individual classifiers. Moreover, as one can notice from both Tabs. 5 and 6, the weights have a clear effect. In the case of the cancer data set, the contribution of the combination is apparent, while in the other two cases, there are individuals that outperform the ensemble. The reason for that is obviously the weights, which have been adapted on the training data. In all cases, however, the ensemble outperforms the majority of the individuals. Therefore the ensemble can be considered as the most reliable.

Table 5. Performance of individual classifiers

| Data | FAM | GFMMN | NGE | GNG | ILFD |
|---|---|---|---|---|---|
| Cancer | 89.89 ± 0.63 | 94.81 ± 1.36 | 72.593 ± 1.72 | 86.67 ± 0.01 | 93.33 ± 1.02 |
| Spam | 86.96 ± 1.54 | 62.53 ± 1.92 | 70.91 ± 2.04 | 61.80 ± 0.03 | 72.67 ± 1.94 |
| Defect | 66.67 ± 0.63 | 68.57 ± 1.118 | 72.38 ± 1.02 | 79.05 ± 1.54 | 75.24 ± 2.66 |

Table 6. Performance of the ensemble classifiers (Winnow 1) - Weights and accuracy

| Data | FAM | GFMMN | NGE | GNG | ILFD | Ensemble Accuracy |
|---|---|---|---|---|---|---|
| Cancer | 0.0564 | 0.1696 | 0.0015 | 0.0016 | 0.7710 | 97.04 ± 1.31 |
| Spam | 0.2991 | 0.2181 | 0.1963 | 0.0684 | 0.2181 | 82.16 ± 0.78 |
| Defect | 0.1527 | 0.0902 | 0.3192 | 0.1001 | 0.3379 | 72.38 ± 0.83 |

Table 7. Performance of the adaptive ensemble classifiers - Winnow 2

| Alg. | Cancer: Type | Cancer: Weight | Cancer: Accur. | Spam: Type | Spam: Weight | Spam: Accur. | Defect: Type | Defect: Weight | Defect: Accur. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FAM | 0.0647 | 95.56 ± 0.93 | NGE | 0.0342 | 82.32 ± 1.23 | FAM | 0.0928 | 54.43 ± 3.03 |
| 2 | GFMMN | 0.0647 | 78.52 ± 1.24 | FAM | 0.0342 | 73.48 ± 1.92 | GFMMN | 0.0835 | 60.76 ± 1.05 |
| 3 | NGE | 0.0583 | 72.59 ± 1.72 | FAM | 0.0342 | 83.62 ± 2.04 | NGE | 0.0928 | 79.75 ± 1.00 |
| 4 | GNG | 0.0647 | 64.81 ± 1.13 | FAM | 0.0380 | 78.84 ± 1.03 | GNG | 0.0835 | 79.11 ± 1.65 |
| 5 | ILFD | 0.0647 | 91.85 ± 1.09 | FAM | 0.0380 | 84.93 ± 1.65 | ILFD | 0.0835 | 51.14 ± 4.11 |
| 6 | FAM | 0.0524 | 94.07 ± 1.31 | FAM | 0.0422 | 77.25 ± 1.78 | FAM | 0.0928 | 53.80 ± 3.03 |
| 7 | FAM | 0.0719 | 78.52 ± 1.23 | FAM | 0.0422 | 75.65 ± 1.09 | FAM | 0.1146 | 79.11 ± 1.65 |
| 8 | FAM | 0.0719 | 91.85 ± 0.65 | FAM | 0.0469 | 76.67 ± 1.09 | FAM | 0.1146 | 79.11 ± 1.65 |
| 9 | FAM | 0.0799 | 81.48 ± 1.23 | FAM | 0.0469 | 85.51 ± 1.65 | FAM | 0.1146 | 64.05 ± 1.93 |
| 10 | FAM | 0.0888 | 88.89 ± 1.23 | FAM | 0.0469 | 84.64 ± 1.00 | FAM | 0.1273 | 79.11 ± 1.48 |
| 11 | FAM | 0.1096 | 84.81 ± 0.93 | FAM | 0.0469 | 85.36 ± 1.78 | | | |
| 12 | FAM | 0.0987 | 82.22 ± 0.93 | FAM | 0.0469 | 82.17 ± 1.78 | | | |
| 13 | FAM | 0.1096 | 87.04 ± 1.23 | FAM | 0.0469 | 82.61 ± 1.00 | | | |
| 14 | | | | FAM | 0.0579 | 80.58 ± 1.92 | | | |
| 15 | | | | FAM | 0.0644 | 65.36 ± 1.65 | | | |
| 16 | | | | FAM | 0.0794 | 63.33 ± 3.01 | | | |
| 17 | | | | FAM | 0.1090 | 72.61 ± 1.93 | | | |
| 18 | | | | GFMMN | 0.0422 | 65.36 ± 1.42 | | | |
| 19 | | | | FAM | 0.0342 | 60.58 ± 2.78 | | | |
| 20 | | | | GNG | 0.0342 | 57.54 ± 4.63 | | | |
| 21 | | | | ILFD | 0.0342 | 59.42 ± 3.11 | | | |
| Ensemble | | | 97.30 ± 1.09 | | | 83.91 ± 2.54 | | | 79.75 ± 3.11 |


Table 8. Performance of the ensemble classifiers in presence of drift - Winnow 1

| Data | FAM | GFMMN | NGE | GNG | ILFD | Ensemble |
|---|---|---|---|---|---|---|
| Cancer (accuracy) | 75.93 ± 1.23 | 74.70 ± 1.84 | 67.41 ± 1.32 | 97.04 ± 1.18 | 79.26 ± 2.39 | 85.93 ± 1.56 |
| Cancer (weight) | 0.1728 | 0.2370 | 0.2133 | 0.1399 | 0.2370 | |
| Spam (accuracy) | 77.39 ± 1.53 | 63.62 ± 2.32 | 69.71 ± 2.04 | 59.04 ± 0.93 | 56.23 ± 2.05 | 79.57 ± 1.84 |
| Spam (weight) | 0.2448 | 0.2203 | 0.2448 | 0.0454 | 0.2448 | |
| Defect (accuracy) | 66.33 ± 1.65 | 60.96 ± 1.13 | 67.59 ± 1.13 | 60.63 ± 1.63 | 67.00 ± 1.84 | 66.16 ± 1.04 |
| Defect (weight) | 0.1757 | 0.1757 | 0.2423 | 0.2306 | 0.1757 | |

6.3 Performance of Adaptive Ensemble Classifiers

While in the previous experiments the base classifiers do not change during training, using Alg. 2 we can enhance the adaptation of the ensemble by a self-tuning mechanism that allows classifiers to leave the ensemble and others to be created. Therefore, adaptation under this scenario is present in its three forms: self-adaptation of individual classifiers, weighting, and structural update. Once training is exhausted, the final configuration of the ensemble is obtained. This configuration is shown in Tab. 7, columns 2, 5 and 8, along with the weights shown in columns 3, 6 and 9. Note that in these experiments, we set the maximum number of individuals L (see Alg. 2) illustratively to 21.

Testing of the individual classifiers and the resulting ensemble produces the accuracy values shown in Tab. 7 for each of the data sets (columns 4, 7 and 10). Comparing the results obtained without structure adaptation (Tab. 6) against those of the adaptive ensemble, the performance of the ensemble (i.e., the last row) shows that adaptation has a clear contribution irrespective of the data set: the ensemble accuracy rises from 97.04 to 97.30 for cancer and from 82.16 to 83.91 for spam. In the case of the defect data, the contribution is even clearer (from 72.38 to 79.75).

6.4 Performance of Ensemble Classifiers in Presence of Drift

This experiment aims at examining the capability of the ensemble classifiers to deal with concept drift. After generating the drifting data according to the mechanism described in Sec. 6, we turn to the analysis of the performance of the incremental classifiers and their combination. Here we study the ensemble without adaptation. Table 8 shows the results obtained. Clearly, the drift has an effect on the classifiers. When comparing Tab. 6 against Tab. 8, the performance of the ensemble decreases, but that is to some extent expected, since drift often leads the prototypes generated by the networks to overlap. However, one can see clearly that the performance of the ensemble remains very acceptable in the presence of drift. It is also important to note that some of the classifiers, such as NGE and GNG, resist drift quite well.

6.5 Performance of Adaptive Ensemble Classifiers in Presence of Drift

Considering the adaptive ensemble in the presence of drift, the idea is to observe whether equipping the ensemble classifier with incremental structural adaptation has an effect on the performance. The results obtained on each of the data sets are portrayed in Tab. 9. As anticipated, drift may impact the accuracy of the ensemble if we compare the results of Tab. 7 against those of Tab. 9. However, the accuracy remains very competitive. On the other hand, the accuracy of the individual incremental classifiers slightly decreases. Therefore one can conclude that the ensemble approach is justifiably worth considering. Moreover, it is quite interesting to note that, when comparing the results of Tab. 9 against those of Tab. 8 (corresponding to the non-adaptive ensemble in the presence of drift), the adaptive ensemble reaches a higher accuracy on all three data sets, so that adaptation by means of structural update in the presence of drift is arguably justified.

Table 9. Performance of adaptive ensemble classifiers in presence of drift - Winnow 2

| Alg. | Cancer: Type | Cancer: Weight | Cancer: Accur. | Spam: Type | Spam: Weight | Spam: Accur. | Defect: Type | Defect: Weight | Defect: Accur. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FAM | 0.1090 | 95.56 ± 0.95 | NGE | 0.0313 | 76.52 ± 1.06 | FAM | 0.1243 | 65.19 ± 1.45 |
| 2 | GFMMN | 0.1211 | 81.48 ± 1.53 | FAM | 0.0347 | 76.96 ± 1.09 | GFMMN | 0.1119 | 60.63 ± 1.56 |
| 3 | NGE | 0.1211 | 97.04 ± 0.73 | FAM | 0.0476 | 69.28 ± 1.35 | NGE | 0.1119 | 60.63 ± 1.73 |
| 4 | GNG | 0.1211 | 96.30 ± 0.79 | FAM | 0.0386 | 71.01 ± 1.13 | GNG | 0.1119 | 57.47 ± 2.01 |
| 5 | ILFD | 0.1090 | 71.85 ± 1.58 | FAM | 0.0476 | 68.70 ± 1.54 | ILFD | 0.1243 | 59.37 ± 1.76 |
| 6 | FAM | 0.1346 | 76.30 ± 1.06 | FAM | 0.0476 | 63.77 ± 1.67 | FAM | 0.1243 | 62.53 ± 1.25 |
| 7 | FAM | 0.1346 | 64.81 ± 1.98 | FAM | 0.0476 | 68.99 ± 1.60 | FAM | 0.1381 | 61.90 ± 1.49 |
| 8 | FAM | 0.1495 | 86.67 ± 0.97 | FAM | 0.0476 | 69.42 ± 1.48 | FAM | 0.1534 | 60.00 ± 1.63 |
| 9 | | | | FAM | 0.0653 | 63.77 ± 1.63 | | | |
| 10 | | | | FAM | 0.0529 | 67.68 ± 1.29 | | | |
| 11 | | | | FAM | 0.0476 | 66.96 ± 1.52 | | | |
| 12 | | | | FAM | 0.0529 | 79.28 ± 0.97 | | | |
| 13 | | | | FAM | 0.0807 | 58.99 ± 1.85 | | | |
| 14 | | | | FAM | 0.0726 | 71.74 ± 1.15 | | | |
| 15 | | | | FAM | 0.0726 | 70.72 ± 1.21 | | | |
| 16 | | | | GFMMN | 0.0386 | 70.87 ± 1.23 | | | |
| 17 | | | | FAM | 0.0347 | 71.74 ± 1.09 | | | |
| 18 | | | | ILFD | 0.0386 | 65.36 ± 1.73 | | | |
| 19 | | | | FAM | 0.0347 | 61.74 ± 1.69 | | | |
| 20 | | | | FAM | 0.0313 | 61.74 ± 1.82 | | | |
| 21 | | | | GNG | 0.0347 | 68.55 ± 1.61 | | | |
| Ensemble | | | 94.81 | | | 81.45 | | | 69.82 |

7 Conclusions

In this paper, the problem of adaptation has been discussed from three perspectives. The first concerns the self-organizing nature of the classifiers studied; the second is about the proportional (weighted) contribution of the classifiers when incorporated into an incremental ensemble classifier; the third pertains to the incremental and dynamic update over time of the ensemble's structure (i.e., the ensemble can grow and shrink). Extensive experiments have been conducted to study each of these adaptation forms, showing in particular the rationale for considering ensemble classifiers as an approach to deal with various incrementality scenarios.

Further investigations are planned, such as equipping the individual classifiers and the ensemble with forgetting mechanisms, taking a closer look at the conditions under which new classifiers are added or deleted by considering various criteria such as diversity and accuracy performance, and comparing such incremental algorithms with known techniques, such as retraining, that require partial memory.

References

1. Battiti, R., Colla, A.: Democracy in neural nets: Voting schemes for classification. Neural Networks 7(4), 691–707 (1994)

2. Blum, A.: Empirical support for winnow and weighted-majority based algorithms: results on a calendar scheduling domain. Machine Learning 26, 5–23 (1997)

3. Bouchachia, A.: Incremental learning via function decomposition. In: Proc. of the Int. Conf. on machine learning and applications, pp. 63–68 (2006)

4. Bouchachia, A.: Incremental Learning. In: Encyclopedia of Data Warehousing and Mining, 2nd edn., Idea-Group (in press) (2008)

5. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)

6. Breiman, L.: Pasting small votes for classification in large databases and on-line. Machine Learning 36, 85–103 (1999)

7. Carpenter, G., Grossberg, S., Rosen, D.: Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System. Neural Networks 4(6), 759–771 (1991)

8. Dietrich, C., Palm, G., Schwenker, F.: Decision templates for the classification of bioacoustic time series. Information Fusion 4(2), 101–109 (2003)

9. Fern, A., Givan, R.: Machine learning. Machine Learning 53, 71–109 (2003)

10. French, R.: Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences 3(4), 128–135 (1999)

11. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Proceedings of the Second European Conference on Computational Learning Theory, pp. 23–37 (1995)

12. Fritzke, B.: A growing neural gas network learns topologies. In: Advances in neural information processing systems, pp. 625–632 (1995)

13. Gabrys, B., Bargiela, A.: General fuzzy min-max neural network for clustering and classification. IEEE Trans. on Neural Networks 11(3), 769–783 (2000)

14. Grossberg, S.: Nonlinear neural networks: principles, mechanism, and architectures. Neural Networks 1, 17–61 (1988)

15. Hettich, S., Blake, C., Merz, C.: UCI repository of machine learning databases (1998), www.ics.uci.edu/~mlearn/MLRepository.html

16. Huang, Y., Suen, C.: A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(1), 90–94 (1995)

17. Jacobs, R.: Methods of combining experts' probability assessments. Neural Computing 7, 865–888 (1995)

18. Jacobs, R., Jordan, M., Nowlan, S., Hinton, G.: Adaptive mixtures of local experts. Neural Computing 3, 79–87 (1991)

19. Stanley, K.: Learning concept drift with a committee of decision trees. Technical Report TR-03-302, Dept of Computer Science, University of Texas at Austin, USA (2003)

20. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)

21. Klinkenberg, R.: Learning drifting concepts: example selection vs. example weighting. Intelligent Data Analysis 8(3), 281–300 (2004)

22. Kolter, J., Maloof, M.: Dynamic weighted majority: a new ensemble method for tracking concept drift. In: Proceedings of the 3rd International Conference on Data Mining ICDM 2003, pp. 123–130. IEEE CS Press, Los Alamitos (2003)

23. Kuncheva, L.: A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(2), 281–286 (2002)

24. Kuncheva, L.: Classifier ensembles for changing environments. In: Proc. of the 5th international workshop on multiple classifier systems, pp. 1–15 (2004)

25. Kuncheva, L., Bezdek, J., Duin, R.: Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition 34(2), 299–314 (2001)

26. Lam, L., Suen, C.: Optimal combinations of pattern classifiers. Pattern Recognition Letters 16, 945–954 (1995)

27. Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear threshold algorithm. Machine Learning 2, 285–318 (1988)

28. Martinetz, T., Berkovich, S., Schulten, K.: Neural gas network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks 4(4), 558–569 (1993)

29. McCloskey, M., Cohen, N.: Catastrophic interference in connectionist networks: the sequential learning problem. The psychology of learning and motivation 24, 109–164 (1999)

30. Oza, N.: Online Ensemble Learning. Phd thesis, University of California, Berkeley (2001)

31. Promise. Software engineering repository (May 2008), http://promise.site.uottawa.ca/SERepository

32. Quartz, S., Sejnowski, T.: The neural basis of cognitive development: a constructivist manifesto. Behavioral and Brain Sciences 20(4), 537–556 (1997)

33. Ratcliff, R.: Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review 97, 285–308 (1990)

34. Sahel, Z., Bouchachia, A., Gabrys, B.: Adaptive mechanisms for classification problems with drifting data. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part II. LNCS, vol. 4693, pp. 419–426. Springer, Heidelberg (2007)

35. Salzberg, S.: A nearest hyperrectangle learning method. Machine learning 6, 277–309 (1991)

36. Sharkey, N., Sharkey, A.: Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. An analysis of catastrophic interference 7(3-4), 301–329 (1995)

37. Sirosh, J., Miikkulainen, R., Choe, Y. (eds.): Lateral Interactions in the Cortex: Structure and Function, The UTCS Neural Networks Research Group, Austin, TX, Electronic book (1996)

38. Slaganicoff, M.: Density-adaptive learning and forgetting. In: Proc. of the 10th Int. Conf. on Machine Learning, pp. 276–283 (1993)

39. Street, W., Kim, Y.: A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining KDDM 2001, pp. 377–382 (2001)

40. Tax, D., van Breukelen, M., Duin, R., Kittler, J.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9), 1475–1485 (2000)

41. Tsymbal, A., Pechenizkiy, M., Cunningham, P., Puuronen, S.: Handling local concept drift with dynamic integration of classifiers: Domain of antibiotic resistance in nosocomial infections. In: Proc. of the 19th IEEE Symposium on Computer-Based Medical Systems, pp. 679–684 (2006)

42. Wettschereck, D., Dietterich, T.: An experimental comparison of the nearest-neighbor and nearest-hyperrectangle algorithms. Machine Learning 19(1), 5–27 (1995)

43. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 69–101 (1996)

44. Wolpert, D.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)

45. Woods, K., Kegelmeyer, W., Bowyer, K.: Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 405–410 (1997)

Music Instrument Estimation in Polyphonic Sound Based on Short-Term Spectrum Match

Wenxin Jiang1, Alicja Wieczorkowska2, and Zbigniew W. Ras1,2

1 University of North Carolina,Department of Computer Science, Charlotte, NC 28223, USA

2 Polish-Japanese Institute of Information Technology,Koszykowa 86, 02-008 Warsaw, [email protected], [email protected], [email protected]

Summary. Recognition and separation of sounds played by various instruments is very useful in labeling audio files with semantic information. This is a non-trivial task requiring sound analysis, but the results can aid automatic indexing and browsing of music data when searching for melodies played by user-specified instruments. In this chapter, we describe all stages of this process, including sound parameterization, instrument identification, and separation of layered sounds. Parameterization in our case is based on the power amplitude spectrum, but we also perform comparative experiments with parameterization based mainly on spectrum-related sound attributes, including MFCC, parameters describing the shape of the power spectrum of the sound waveform, and time-domain parameters. Various classification algorithms have been applied, including k-nearest neighbor (KNN), which yielded good results. The experiments on polyphonic (polytimbral) recordings and the results discussed in this chapter allow us to draw conclusions regarding the directions of further experiments on this subject, which can be of interest to any user of music audio data sets.

1 Introduction

Recently, a number of acoustical features for the construction of a computational model for music timbre estimation have been investigated in the Music Information Retrieval (MIR) area. Timbre is a quality of sound that distinguishes one music instrument from another, since there is a wide variety of instrument families and individual categories. It is rather a subjective quality, defined by ANSI as the attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are different [1], [2]. Such a definition is clearly subjective and not of much use for automatic sound timbre classification, although the footnote to the definition gives hints towards a physical timbre description, stating that the timbre depends primarily upon the spectrum of the stimulus, but also upon the waveform, the sound pressure, the frequency location of the spectrum, and the temporal characteristics of the stimulus [2], [5]. Still, musical sounds must be very carefully parameterized to allow automatic timbre recognition.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 259–273. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

So far, there is no standard parameterization used as a classification basis. The sound descriptors applied are based on various methods of analysis in the time domain, spectrum domain, time-frequency domain and cepstrum, with the Discrete Fourier Transform (DFT), usually computed via the Fast Fourier Transform (FFT), being the most common tool for spectral analysis. Also, wavelet analysis is gaining increasing interest for sound analysis and representation, especially for musical sound.

Researchers have explored different statistical summations to describe signatures of music instruments based on vector-valued or matrix-valued features, such as Tristimulus parameters, brightness, irregularity of the spectrum, etc. [6], [14], [21]. Flattening these features for traditional classifiers increases the number of features. In [16], the authors used a new set of features jointly with other popular features used in music instrument identification. They built a database of music instrument sounds for training a number of classifiers. These classifiers are used by the MIRAI system to identify music instruments in polyphonic sounds.

MIRAI is designed as a web-based storage and retrieval system which can automatically index musical input (of polyphonic, polytimbral type), transforming it into a database, and answer queries requesting specific musical pieces, see http://www.mir.uncc.edu/. When MIRAI receives a musical waveform, it divides this waveform into segments of equal size, and then the classifiers incorporated into the system identify the most dominating musical instruments and emotions associated with each segment. A database of musical instrument sounds describing about 4,000 sound objects by more than 1,100 features is associated with MIRAI. Each sound object is represented as a temporal sequence of approximately 150-300 tuples, which gives a temporal database of more than 1,000,000 tuples, each one represented as a vector of about 1,100 features. This database is mainly used to train classifiers for automatic indexing of musical instrument sounds. It is semantically rich enough (in terms of successful sound separation and recognition) that the constructed classifiers have a high level of accuracy in recognizing the dominating musical instrument and/or its type when music is polyphonic. Unfortunately, the loss of information on non-dominant instruments by the sound separation algorithm, due to the overlap of sound features, may significantly lower the recognition confidence of the remaining instruments in a polyphonic sound. This chapter shows that by identifying a weighted set of dominating instruments in a sequence of overlapping frames and using a special voting strategy, we can improve the overall confidence of the indexing strategy for polyphonic music, and thereby improve the precision and recall of the MIRAI retrieval engine.

2 Sound Parameterization for Automatic Classification Purposes

A sound wave can be described as a function representing amplitude changes in time. For digitally recorded sound, this function is quantized in time and in amplitude. The sampling rate describes how many values are recorded per time unit, and the binary resolution in multi-bit recording describes how many bits are used to represent the quantized amplitude axis, for each channel. The standard setting for CD is a sampling rate of 44,100 samples per second, i.e. 44.1 kHz, with 16-bit resolution for amplitude, i.e. with 2^16 quantization levels. Obviously, such data are not well suited to automatic classification, so sound parameterization is usually performed before further experiments on audio databases.
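As a quick check of the figures quoted above, the following minimal Python sketch computes the number of quantization levels and the number of samples in a short slice of CD-quality audio. The function names are illustrative and do not come from the chapter, and the symmetric rounding used in `quantize` is only one simple convention, assumed here for illustration.

```python
SAMPLE_RATE_HZ = 44_100   # CD sampling rate quoted in the text
BITS_PER_SAMPLE = 16      # CD amplitude resolution quoted in the text


def quantization_levels(bits: int) -> int:
    """Number of distinct amplitude codes available with `bits` bits."""
    return 2 ** bits


def samples_in(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples recorded in `duration_s` seconds of one channel."""
    return round(duration_s * sample_rate_hz)


def quantize(x: float, bits: int = BITS_PER_SAMPLE) -> int:
    """Map an amplitude in [-1, 1] to a signed integer code.

    Assumption: symmetric scaling by the largest positive code; real
    converters may use a different rounding or clipping rule.
    """
    max_code = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, x))
    return round(clipped * max_code)


if __name__ == "__main__":
    print(quantization_levels(BITS_PER_SAMPLE))  # levels for 16-bit audio
    print(samples_in(0.04))                      # samples in a 0.04 s slice
    print(quantize(0.5))
```

Even a 0.04-second slice therefore holds well over a thousand raw values per channel, which is why a more compact or more structured representation is normally computed first.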

There are numerous ways in which sound can be parameterized. Audio data can be parameterized in the time domain and in the frequency domain, and time-frequency features can also be extracted. Parameterization can be based on Fourier analysis, particularly on the amplitude spectrum, on cepstral analysis, on wavelet analysis, and so on; various features can be extracted to describe the results of these analyses [3], [4], [7], [8], [9], [11], [13], [19], [22].

In our research, we decided to rely mainly on direct observation of the sound spectrum (see Section 4), but we also performed experiments based on the following sound features [8], [22]:

• AudioSpectrumBasis - MPEG-7 descriptor, representing low-dimensional projection of a high-dimensional spectral space, to aid compactness and recognition [8]. AudioSpectrumBasis is a matrix derived from the SVD (singular value decomposition) of a power spectrum in normalized dB scale, i.e. in log scale with the maximal value defining 0 dB. In our research, the frequency axis for AudioSpectrumBasis was divided into 32 bands, with 1/4-octave resolution for 8 octaves; octave distance means doubling the fundamental frequency, i.e. the pitch of the sound,

• AudioSpectrumProjection - projection of AudioSpectrumBasis [8],

• AudioSpectrumFlatness - MPEG-7 parameter, calculated in our research for spectrum divided into 32 frequency bands, i.e. with 1/4-octave resolution for 8 octaves, and the length of this 32-element vector is added as the 0th element of this 33-dimensional feature; if there is a high deviation from a flat spectral shape for a given band, it may signal the presence of tonal components [8],

• MFCC = {mfcc_n : 1 ≤ n ≤ 13} - cepstral coefficients in mel scale; feature originating from speech processing, but also used for music analysis [12], [17]. 13 coefficients were used (the 0th one and the next 12), for 24 mel frequency scale hearing filters, using Julius software [10],

• HamonicPeaks = {HamoPk_n : 1 ≤ n ≤ 28} - sequence of the first 28 local peaks of harmonics (in normalized dB scale) for a given frame,

• TemporalCentroid - time instant where the energy of the sound is focused, calculated as energy weighted mean of the sound duration,

• LogSpecCentroid - AudioSpectrumCentroid from MPEG-7 standard [8]; this parameter represents the gravity center of a log-frequency power spectrum,

• LogSpecSpread - AudioSpectrumSpread descriptor from MPEG-7 [8]; calculated as RMS (Root Mean Square) value of the deviation of the power spectrum in log frequency scale with respect to the gravity center in a frame,

• Energy - energy of spectrum, averaged through all frames of the sound,

• ZeroCrossings - zero-crossing rate, i.e. number of sign changes of the waveform in a frame, averaged through all frames of the sound,

• SpecCentroid - calculated as HarmonicSpectralCentroid from MPEG-7, representing power-weighted average of the frequency of the bins in the linear power spectrum, and averaged over all the frames of the steady state of the sound,

• SpecSpread - calculated as HarmonicSpectralSpread from MPEG-7, describing the amplitude-weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous HarmonicSpectralCentroid and averaged over all the frames of the steady state of the sound,

• RollOff - averaged (over all frames) frequency below which an experimentally chosen percentage of the accumulated magnitudes of the spectrum is concentrated,

• Flux - difference between the magnitude of the amplitude spectrum points in a given and successive frame, averaged through the entire sound,

• LogAttackTime - decimal logarithm of the sound duration from the time instant when the signal starts, to the time when it reaches its maximum value, or when it reaches its sustained part, whichever comes first.
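A few of the simpler descriptors in the list above can be sketched for a single frame as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the roll-off fraction of 0.85 stands in for the "experimentally chosen percentage" (the chapter gives no value), and flux is taken here as the sum of absolute bin-wise differences, which is one possible reading of the description. Averaging over frames, as the list describes, is left to the caller.

```python
from typing import Sequence


def zero_crossings(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def spectral_centroid(freqs: Sequence[float], power: Sequence[float]) -> float:
    """Power-weighted average frequency of the spectrum bins."""
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total


def roll_off(freqs: Sequence[float], magnitudes: Sequence[float],
             fraction: float = 0.85) -> float:
    """Frequency below which `fraction` of the accumulated magnitude lies.

    Assumption: 0.85 is a placeholder, not a value from the chapter.
    """
    threshold = fraction * sum(magnitudes)
    running = 0.0
    for f, m in zip(freqs, magnitudes):
        running += m
        if running >= threshold:
            return f
    return freqs[-1]


def flux(mag_current: Sequence[float], mag_next: Sequence[float]) -> float:
    """Bin-wise change between two successive magnitude spectra.

    Assumption: sum of absolute differences; other norms are also used.
    """
    return sum(abs(b - a) for a, b in zip(mag_current, mag_next))


if __name__ == "__main__":
    freqs = [100.0, 200.0, 300.0, 400.0]
    print(zero_crossings([0.5, -0.2, 0.1, -0.4]))
    print(spectral_centroid(freqs, [1.0, 1.0, 1.0, 1.0]))
    print(roll_off(freqs, [1.0, 1.0, 1.0, 1.0], fraction=0.5))
```

Each function reduces a whole frame to a single number, which is exactly the kind of compression that makes feature-based datasets compact but also discards detail.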

3 Polyphonic Sound Estimation Based on Sound Separation and Feature Extraction

The traditional way of pattern recognition in music sound is to extract features from raw signals in digital form, usually recorded as a sequence of integer samples representing quantized values of amplitude of a sound wave in consecutive time instants. Through feature extraction, acoustic characteristics such as pitch and timbre are described by a smaller and more structured dataset, which is then fed to traditional classifiers to perform estimation.

In the case of polyphonic sounds, sound separation can be applied to extract the signal that has been identified as one specific instrument in the timbre estimation process. Timbre estimation can then be applied again to the residue of the signal to obtain other timbre information. Fig. 1 shows the process of a music instrument recognition system based on feature extraction and sound separation.

However, there are two main problems with this method. First, overlapping of the features makes it difficult to perform timbre estimation and sound separation. Second, during the classification process, only one instrument is picked from all candidates, which makes the estimation inaccurate.

[Figure 1: flow chart; block and arrow labels: Polyphonic Sound, FFT, Power Spectrum, Pitch Estimation, Get pitch, Feature extraction, Classifier, Timbre Estimation, Get Instrument, Sound separation, Get spectrum, New spectrum]

Fig. 1. Flow chart of music instrument recognition with sound separation

3.1 Overlapping of the Features Both in Temporal and Spectral Space

Feature-based datasets are easier and more efficient for classifiers to work with; however, there is usually information loss during the feature extraction process. A feature is an abstract or compressed representation of the waveform or spectrum, such as harmonic peaks, MFCC (Mel Frequency Cepstral Coefficients), zero-crossing rate, and so on. In monophonic music sound estimation tasks, with only single non-layered sounds, the features can be easily extracted and identified. However, this is not the case for polyphonic, polytimbral sound. It is difficult, and often impossible, to extract distinct, clear features representing a single instrument from polyphonic sound because of the overlapping of the signals and their spectra, especially when the instruments have similar patterns in their feature space.

3.2 Classification with Single Instrument Estimation for Each Frame

The traditional classification process usually gives a single answer, representing one class; in our case, it would be the name of the instrument playing in an analyzed sample. When only the best answer is given, i.e. the name of one (the only or the dominating) instrument playing in each frame of music sound, information about other possibly contributing instruments is lost.

In fact, it is common for polyphonic music sound to have multiple instruments playing simultaneously, which means that in each frame there are representations of multiple timbres in the signal. Providing only one candidate yields the predominant timbre while ignoring other timbre information. Moreover, there may be no dominating timbre in a frame when all instruments play equally loud. This means that the classifier has to randomly choose one of the equally possible candidates. To solve this problem, the authors introduce the Top-N winner strategy, which gives multiple candidates for each evaluated frame.

4 Pattern Detection Directly from Power Spectrum

The fact that discriminating one instrument from another depends on more details from raw signals leads to another way of pattern recognition: directly detecting distinct patterns of instruments based on a lower-level representation of the signal, such as the power spectrum. Fig. 2 shows two different ways of pattern recognition.

Since the spectrum is very useful for timbre representation purposes, we propose a new strategy of instrument estimation based on short-term power spectrum match.

4.1 Sub-Pattern of Single Instrument in the Mixture Sound Segment

Figure 3 shows the power spectrum of trumpet, piano and the mixture of those two instruments. As we can see, the spectrum of the mixture preserves part of the pattern of each single instrument.

Fig. 2. Two different methods of pattern recognition

Fig. 3. Power spectrum of trumpet, piano and their mixture; frequency axis is in linear scale, whereas amplitude axis is in log [dB] scale

The same similarity of properties of the spectra is also observed e.g. for flute, trombone and their mixture, as Figure 4 shows.

In order to index the polyphonic sound, we need to detect the instrument information in each small slice of music sound. Such detection is hardly feasible directly in the time domain. Therefore, in our experiments, we have observed the short-term spectrum space, calculated via the short-time Fourier transform (STFT). Figure 5 shows another example of the spectrum slice for flute, trombone, and their mixture. Each slice is 0.04 seconds long.

As Figure 5 shows, the power spectrum patterns of single flute and single trombone can still be identified in the mixture spectrum without blurring each other (as marked in the figure). Therefore, we do get a clear picture of the distinct pattern of each single instrument when we observe each spectrum slice of the polyphonic sound wave.
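The power spectrum of one short slice, as discussed above, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: it applies a Hamming window and a naive discrete Fourier transform written with the standard library only (a real system would normally call an FFT routine), and the function names and the -120 dB floor are choices made here for illustration.

```python
import cmath
import math
from typing import List, Sequence


def hamming(n_samples: int) -> List[float]:
    """Symmetric Hamming window of length `n_samples`."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame.

    Uses a direct O(N^2) DFT so that the sketch has no dependencies.
    """
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def to_normalized_db(power: Sequence[float], floor_db: float = -120.0) -> List[float]:
    """Convert power to dB with the largest value mapped to 0 dB."""
    peak = max(power)
    if peak <= 0:
        return [floor_db] * len(power)
    return [max(floor_db, 10 * math.log10(p / peak)) if p > 0 else floor_db
            for p in power]


if __name__ == "__main__":
    # A pure tone that completes exactly 8 cycles in a 64-sample frame.
    frame = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
    ps = power_spectrum(frame)
    print(ps.index(max(ps)))           # index of the strongest bin
    print(to_normalized_db(ps)[:4])
```

For a mixture of two instruments, each instrument contributes its own set of peaks to the same vector, which is what makes sub-pattern matching on the raw spectrum possible.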

4.2 Classification Based on Power Spectrum Pattern Match

In order to represent accurately the short-term spectrum with high resolution in the frequency axis, allowing more precise pattern matching, a long analyzing frame with 8192 numeric samples was chosen. The Fourier transform performed on these frames describes the frequency space for each slice (or frame). Instead of parameterizing the spectrum (or time domain) and extracting a few dozen features to represent the sound, we decided to work directly on the power amplitude spectrum values (points). When a new sound is analyzed with the goal of finding which instrument or instruments contributed to this sound, even though their spectra overlap, we can still try to find the closest vectors from the training data set of singular sounds and discover which instrument sounds they represent.

Fig. 4. Power spectrum of flute, trombone and their mixture

The traditional classification models such as decision trees, Naive Bayesian classifiers, and neural networks do not perform well in this case. This is because there are too many attributes (8192 numeric attributes) for those classifiers to yield good classification models, and also because any classification model itself stands for some sort of abstraction, which is in conflict with any information-preserving strategy. However, one of the most fundamental and simple classification methods, the K Nearest Neighbor algorithm, needs no prior knowledge about the distribution of the data and seems to be an appropriate classifier for numeric spectrum vectors.
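A nearest-neighbor match on spectrum vectors that returns several ranked candidates can be sketched as below. The chapter does not state which distance or which confidence measure was used, so two assumptions are made here and should be read as such: Euclidean distance between vectors, and confidence defined as the fraction of the k nearest training frames that carry a given instrument label. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

TrainingItem = Tuple[Sequence[float], str]  # (spectrum vector, instrument label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally long vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_top_n(query: Sequence[float], training: List[TrainingItem],
              k: int = 5, n: int = 2) -> List[Tuple[str, float]]:
    """Return up to `n` (instrument, confidence) pairs for one query frame.

    Assumption: confidence is the share of the k nearest neighbours that
    belong to the instrument; the chapter does not define it.
    """
    neighbours = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return [(label, count / len(neighbours))
            for label, count in votes.most_common(n)]


if __name__ == "__main__":
    training = [
        ([0.0, 0.0], "flute"), ([0.0, 1.0], "flute"), ([1.0, 0.0], "flute"),
        ([5.0, 5.0], "trombone"), ([5.0, 6.0], "trombone"),
    ]
    print(knn_top_n([0.0, 0.2], training, k=5, n=2))
```

No model is fitted: every stored training frame stays available for comparison, which matches the information-preserving motivation given in the paragraph above, at the cost of a full scan of the training set for each query frame.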

Fig. 5. Sub-patterns of single instruments in the mixture sound slice for flute, trombone, and their mix

5 Top-N Winners from Classification Model

As opposed to the traditional pattern matching or classification process, which uses the classification model to choose the candidate with the highest confidence as the estimation result, we do not take a single "best answer" for an evaluated frame. Instead, we choose multiple candidates from the classification model according to the confidence measure. As already discussed, during the classification process the classifier compares the pattern from each frame with standard instrument patterns in a training database. Since there are several different sub-patterns in the spectrum, the classifier will assign a confidence to each recognized pattern. This way, we may identify which N matches have the highest confidence levels. They are our top N winners.

Thus, at each frame we get n instruments I_i with confidence levels C_i and save them to the candidates pool for the voting process. After evaluating all the frames, we get weights for all the candidates in the candidates pool by adding up their confidences, and the final voting proceeds according to the weight W_j of each instrument. The following is the pseudo-code for the Top-N winners procedure:

    For each frame from the sound
        Get power spectrum by STFT
        For each candidate Xi from top-N winners of classifiers
            If Xi exists in candidates pool then
                Confidence[Xi] += Ci
            Else
                Add Xi into candidates pool
                Confidence[Xi] = Ci
            End If
        End For
    End For
    Select Top m candidates from candidates pool
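The voting part of the procedure above translates almost line for line into Python. The sketch below assumes that the per-frame candidates have already been produced by some classifier as (instrument, confidence) pairs; the function name and the data layout are illustrative choices, not part of the chapter.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Candidate = Tuple[str, float]  # (instrument name, confidence for one frame)


def vote_top_m(frame_candidates: Iterable[List[Candidate]], m: int) -> List[Candidate]:
    """Accumulate per-frame confidences and return the m heaviest instruments."""
    pool: Dict[str, float] = defaultdict(float)
    for candidates in frame_candidates:
        for instrument, confidence in candidates:
            # A new instrument starts at 0.0, so the If/Else branches of the
            # pseudo-code collapse into a single accumulation step.
            pool[instrument] += confidence
    ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]


if __name__ == "__main__":
    frames = [
        [("flute", 0.6), ("trombone", 0.4)],
        [("flute", 0.5), ("piano", 0.5)],
        [("trombone", 0.75), ("flute", 0.25)],
    ]
    for name, weight in vote_top_m(frames, m=2):
        print(name, round(weight, 2))
```

In this toy input the piano appears in only one frame and is outvoted, which illustrates how isolated single-frame errors lose weight once the whole sound is considered.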

Noise arising from errors in single-frame estimation can be reduced by taking the whole music context into account. By keeping the original acoustical information of the music sound, we get a much higher recognition rate for multiple instruments in polyphonic sound.

Here are the steps of pattern matching process:

1. Use STFT and a Hamming window to extract the power spectrum for each 0.04 s frame for all the standard single instrument sounds.

2. Save these spectra in a training database; since the Hamming-windowed frames overlap by 2/3 of the frame length, the number of items in the dataset almost triples for each sound.

3. During the estimation process, use KNN to measure the vector distance (8192 points) and decide which frame in the training dataset is the most similar to the unknown sound frame; when multiple matches are returned, the multiple instrument candidates are saved for the overall weights calculation.
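The effect of the 2/3 overlap mentioned in step 2 can be checked with a small framing sketch. The frame length is left as a parameter because the text quotes both a 0.04 s slice and an 8192-sample analysis frame; the value 1764 used in the example is simply 0.04 s at 44.1 kHz and is an assumption for illustration, as are the function names.

```python
from typing import List, Sequence


def hop_for_overlap(frame_len: int, num: int = 2, den: int = 3) -> int:
    """Hop size such that consecutive frames share num/den of their length."""
    return frame_len - (frame_len * num) // den


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Cut `samples` into full-length frames starting every `hop` samples."""
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


if __name__ == "__main__":
    one_second = [0.0] * 44_100
    frame_len = 1764                      # assumed: 0.04 s at 44.1 kHz
    hop = hop_for_overlap(frame_len)      # one third of the frame length
    print(len(split_frames(one_second, frame_len, frame_len)))  # no overlap
    print(len(split_frames(one_second, frame_len, hop)))        # 2/3 overlap
```

With a hop of one third of the frame, a sound yields nearly three times as many frames as without overlap, which is the growth of the training set that step 2 refers to.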

Fig. 6 shows the new music instrument recognition system which has been developed with the strategy of Top-N winners based on short-term spectrum matching.

[Figure 6: flow chart; block and arrow labels: Polyphonic Sound, Get frame, FFT, Power Spectrum, Classifier, Training data, Timbre Estimation, Get Instrument Candidates, Finish all the Frames estimation, Voting process, Get Final winners]

Fig. 6. Flow chart of music instrument recognition system with new strategy

6 Experiment of Top-N-Winners Strategy Based on Short Term Spectrum Matching

To simplify the problem, we only performed tests on middle C instrument sounds, i.e. for pitch equal to C4 in MIDI notation, of frequency 261.6 Hz (for A4 tuned to 440 Hz). A training subset of 3323 objects has been selected from the entire training database. Each object is represented by the frame-wise (0.04 seconds) power spectrum extracted by the short-time Fourier transform from the following 26 single instrument sounds:

Electric Guitar, Bassoon, Oboe, B-flat Clarinet, Marimba, C Trumpet, E-flat Clarinet, Tenor Trombone, French Horn, Flute, Viola, Violin, English Horn, Vibraphone, Accordion, Electric Bass, Cello, Tenor Saxophone, B-Flat Trumpet, Bass Flute, Double Bass, Alto Flute, Piano, Bach Trumpet, Tuba, and Bass Clarinet.
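The frequency quoted for middle C follows from the equal-temperament relation between a note number and the reference pitch A4 = 440 Hz. The short sketch below reproduces that arithmetic; it assumes the common convention in which MIDI note 69 is A4 and note 60 is middle C (labelled C4 here, as in the text), and the function name is illustrative.

```python
A4_MIDI_NOTE = 69      # assumed convention: MIDI note 69 is A4
A4_FREQUENCY_HZ = 440.0


def midi_to_hz(note: int, a4_hz: float = A4_FREQUENCY_HZ) -> float:
    """Equal-temperament frequency of a MIDI note number.

    Each semitone multiplies the frequency by 2 ** (1 / 12), so an octave
    (12 semitones) doubles it.
    """
    return a4_hz * 2.0 ** ((note - A4_MIDI_NOTE) / 12.0)


if __name__ == "__main__":
    print(round(midi_to_hz(60), 1))   # middle C, nine semitones below A4
    print(midi_to_hz(69))             # the reference pitch itself
```

Restricting the experiment to one pitch means that all training and test spectra share the same fundamental, so differences between vectors reflect timbre rather than pitch.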

To compare the results with the traditional feature-based classification strategy, we have also extracted the following 5 groups of both temporal and spectral features (calculated for spectrum divided into 33 frequency bands), mainly originating from the MPEG-7 standard [8], [20], [22], and used a decision tree classifier to perform the timbre estimation:

Group1: BandsCoefficient = {bandsCoef_n : 1 ≤ n ≤ 33} - coefficients for 33 AudioSpectrumFlatness bands,

Group2: Projections = {prj_n : 1 ≤ n ≤ 33} - AudioSpectrumProjection from MPEG-7,

Group3: MFCC = {mfcc_n : 1 ≤ n ≤ 13},

Group4: HamonicPeaks = {HamoPk_n : 1 ≤ n ≤ 28},

Group5: Other Features:

• TemporalCentroid,
• LogSpecCentroid,
• LogSpecSpread,
• Energy,
• ZeroCrossings,
• SpecCentroid,
• SpecSpread,
• RollOff,
• Flux,
• bandsCoefSum - AudioSpectrumFlatness bands coefficients sum,
• prjmin, prjmax, prjsum, prjdis, prjstd - minimum, maximum, sum, distance, and standard deviation of AudioSpectrumProjection calculated for AudioSpectrumBasis. Distance represents a dissimilarity measure: distance for a matrix is calculated as sum of absolute values of differences between elements of each row and column. Distance for a vector is calculated as the sum of dissimilarity (absolute difference of values) of every pair of coordinates in the vector,
• LogAttackTime.
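The summary statistics prjmin, prjmax, prjsum, prjdis and prjstd can be sketched for the vector case as follows. This is an illustration under stated assumptions only: the matrix form of the distance is not covered because its description leaves the exact pairing of elements open, the population standard deviation is used because the chapter does not say which variant was applied, and the function names are hypothetical.

```python
import statistics
from itertools import combinations
from typing import Dict, Sequence


def vector_distance(values: Sequence[float]) -> float:
    """Sum of absolute differences over every pair of coordinates."""
    return sum(abs(a - b) for a, b in combinations(values, 2))


def projection_summary(prj: Sequence[float]) -> Dict[str, float]:
    """Summary statistics of one projection vector.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return {
        "prjmin": min(prj),
        "prjmax": max(prj),
        "prjsum": sum(prj),
        "prjdis": vector_distance(prj),
        "prjstd": statistics.pstdev(prj),
    }


if __name__ == "__main__":
    summary = projection_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    for name, value in summary.items():
        print(name, value)
```

These five numbers compress a 33-element projection vector into a handful of attributes, which is the kind of flattening that the feature-based experiments rely on.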

52 polyphonic audio files have been mixed (using the Sound Forge sound editor [18]), each from the sounds of 2 of those 26 instruments. These mixture audio files have been used as test files.

The system uses an MS SQL Server 2005 database to store the training dataset and the K nearest neighbor algorithm as the classifier. When a polyphonic sound is submitted to the system, it provides several estimations as the final candidate answers. In our experiment, we gave 4 estimations for each submitted audio file.

The performance of our algorithm was measured using the recognition rate R, calculated as

R = P/A

where P is the positive response, i.e. the number of correct estimations, and A is the actual number of instruments existing in the polyphonic sound.

For comparison purposes, five experiments were performed independently. We applied a feature-based sound separation strategy and used a decision tree classifier in our first two experiments. In experiment 1, only one candidate was chosen by the classifier for each frame. In the first step of experiment 2, the top n candidates (with n = 2) were chosen by the classifier for each frame. In its second step, for each candidate, the confidences over all the frames were added to get the overall score used to identify the final n winners.
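The recognition rate R = P/A defined above can be computed as in the following sketch. How P and A were aggregated over the test files is not spelled out in the chapter, so the code assumes that correct estimations and actual instruments are summed over all files before dividing; the function name, the data layout and the example labels are illustrative only.

```python
from typing import Iterable, List, Tuple

FileResult = Tuple[List[str], List[str]]  # (estimated instruments, actual instruments)


def recognition_rate(results: Iterable[FileResult]) -> float:
    """R = P / A, with P and A summed over all test files (assumed aggregation).

    P counts estimated instruments that are really present; A counts the
    instruments actually playing.
    """
    correct = 0
    actual = 0
    for estimated, truth in results:
        correct += len(set(estimated) & set(truth))
        actual += len(set(truth))
    return correct / actual if actual else 0.0


if __name__ == "__main__":
    # Hypothetical outcomes for two mixtures, four estimations per file.
    results = [
        (["flute", "piano", "oboe", "viola"], ["flute", "trombone"]),
        (["cello", "tuba", "violin", "oboe"], ["cello", "tuba"]),
    ]
    print(recognition_rate(results))
```

Because the denominator is the number of instruments actually present, extra wrong estimations do not lower R directly; they only fail to raise it.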

In the remaining three experiments, we applied the new strategy of spectrum match based on the KNN classifier. In experiment 3, we used KNN (k = 1) to choose the top 2 candidates as the winners for each frame. In experiment 4, we increased k from 1 to 5. In experiment 5, we ruled out the percussion instrument objects from the testing audio files, since they have less clear patterns in the spectrum envelope.

Table 1. Recognition rate of music instrument estimation based on various strategies

Experiment #   Description                                                       Recognition rate
1              Feature-based and separation + Decision Tree (n=1)                36.49%
2              Feature-based and separation + Decision Tree (n=2)                48.65%
3              Spectrum Match + KNN (k=1; n=2)                                   79.41%
4              Spectrum Match + KNN (k=5; n=2)                                   82.43%
5              Spectrum Match + KNN (k=5; n=2) without percussion instruments    87.1%

From the results shown in Table 1, we draw the following conclusions:

1. Using multiple candidates for each frame yields better results than using the single-winner strategy.

2. Spectrum-based KNN classification improves the recognition rate of polyphonic sounds significantly.

3. Some percussion instruments (such as vibraphone, marimba) are not suitable for spectrum-based classification, but most instruments generating harmonic sounds work well with this new strategy.

7 Conclusion

We have provided a new solution to an important problem of instrument identification in polyphonic music: the loss of information on non-dominant instruments during the sound separation process due to the overlapping of sound features. The new strategy is to directly detect sub-patterns from the short-term power spectrum, which is a relatively lower-level and at the same time more efficient representation of the raw signals, instead of the few dozen (or at most hundreds of) features most often used for instrument recognition purposes. Next, we choose multiple candidates from each frame during the frame-wise classification based on similarity of the spectrum, and weight them based on their accumulated confidence over the whole sound period to get a more accurate estimation of the multiple instruments playing simultaneously in the music piece. This approach also avoids extracting more compact feature patterns of multiple instruments from polyphonic sounds, which is difficult and inaccurate because of the information loss during the abstraction process. Our experiments show that the sub-patterns detected from the power spectrum slices contain sufficient information for multiple-timbre estimation tasks and improve the robustness of instrument identification as well.

Acknowledgments

This work was supported by the National Science Foundation under grant IIS-0414815, and also by the Research Center of PJIIT, supported by the Polish National Committee for Scientific Research (KBN).

We are grateful to Dr. Xin Zhang for many helpful discussions we had with her and for the comments she made which improved the quality and readability of the chapter.

References

1. Agostini, G., Longari, M., Pollastri, E.: Content-Based Classification of Musical Instrument Timbres. In: International Workshop on Content-Based Multimedia Indexing (2001)

2. American National Standards Institute, American national standard: Psychoacoustical terminology. ANSI S3.20-1973 (1973)

3. Aniola, P., Lukasik, E.: JAVA Library for Automatic Musical Instruments Recognition. AES 122 Convention, Vienna, Austria (2007)

4. Brown, J.C.: Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. J. Acoust. Soc. Am. 105, 1933–1941 (1999)

5. Fitzgerald, R., Lindsay, A.: Tying semantic labels to computational descriptors of similar timbres. In: Sound and Music Computing 2004 (2004)

6. Fujinaga, I., McMillan, K.: Real Time Recognition of Orchestral Instruments. In: International Computer Music Conference (2000)

7. Herrera, P., Amatriain, X., Batlle, E., Serra, X.: Towards instrument segmentation for music content description: a critical review of instrument classification techniques. In: International Symposium on Music Information Retrieval ISMIR (2000)

8. ISO/IEC JTC1/SC29/WG11, MPEG-7 Overview (2004), http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

9. Kaminskyj, I.: Multi-feature Musical Instrument Sound Classifier w/user determined generalisation performance. In: Proceedings of the Australasian Computer Music Association Conference ACMC, pp. 53–62 (2002)

10. Kawahara, T., Lee, A., Kobayashi, T., Takeda, K., Minematsu, N., Sagayama, S., Itou, K., Ito, A., Yamamoto, M., Yamada, A., Utsuro, T., Shikano, K.: Free software toolkit for Japanese large vocabulary continuous speech recognition. In: Proc. Int'l Conf. on Spoken Language Processing (ICSLP), vol. 4, pp. 476–479 (2000)

11. Kitahara, T., Goto, M., Okuno, H.G.: Pitch-Dependent Identification of Musical Instrument Sounds. Applied Intelligence 23, 267–275 (2005)

12. Logan, B.: Mel Frequency Cepstral Coefficients for Music Modeling. In: Proceedings of the First International Symposium on Music Information Retrieval ISMIR 2000 (2000)

13. Martin, K.D., Kim, Y.E.: Musical instrument identification: A pattern-recognition approach. In: 136-th meeting of the Acoustical Society of America, Norfolk, VA (1998)

14. Pollard, H.F., Jansson, E.V.: A Tristimulus Method for the Specification of Musical Timbre. Acustica 51, 162–171 (1982)

15. Ras, Z., Wieczorkowska, A., Lewis, R., Marasek, K., Zhang, C., Cohen, A., Kolczynska, E., Jiang, M.: Automatic Indexing of Audio With Timbre Information for Musical Instruments of Definite Pitch (2008), http://www.mir.uncc.edu/

16. Ras, Z., Zhang, X., Lewis, R.: MIRAI: Multi-hierarchical, FS-tree based Music Information Retrieval System (Invited Paper). In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 80–89. Springer, Heidelberg (2007)

17. Saha, G., Yadhunandan, U.: Modified Mel-Frequency Cepstral Coefficient. In: Proceedings of the IASTED 2004 (2004)

18. Sonic Foundry, Sound Forge. Software (2003)

19. Wieczorkowska, A.: Towards Musical Data Classification via Wavelet Analysis. In: Ohsuga, S., Ras, Z.W. (eds.) ISMIS 2000. LNCS (LNAI), vol. 1932, pp. 292–300. Springer, Heidelberg (2000)

20. Wieczorkowska, A., Ras, Z., Zhang, X., Lewis, R.: Multi-way Hierarchic Classification of Musical Instrument Sounds. In: Kim, S., Park, J., Pissinou, N., Kim, T., Fang, W., Slezak, D., Arabnia, H., Howard, D. (eds.) International Conference on Multimedia and Ubiquitous Engineering MUE 2007, Seoul, Korea. IEEE Computer Society, Los Alamitos (2007)

21. Wold, E., Blum, T., Keislar, D., Wheaten, J.: Content-based classification, search, and retrieval of audio. IEEE Multimedia 3(3), 27–36 (1996)

22. Zhang, X.: Cooperative Music Retrieval Based on Automatic Indexing of Music by Instruments and Their Types. PhD dissertation, The University of North Carolina at Charlotte, Charlotte (2007)


Ultrasound Biomicroscopy Glaucoma Images Analysis Based on Rough Set and Pulse Coupled Neural Network

El-Sayed A. El-Dahshan1, Aboul Ella Hassanien2, Amr Radi1, and Soumya Banerjee3

1 Physics Department, Faculty of Science, Ain Shams University, Abbassia, Cairo 11566, Egypt

2 Information Technology Department, FCI, Cairo University, 5 Ahamed Zewal Street, Orman, Giza

3 Department of Computer Science, Birla Institute of Technology, International Center

Summary. The objective of this book chapter is to present a rough sets and pulse coupled neural network scheme for the analysis of Ultrasound Biomicroscopy glaucoma images. To increase the efficiency of the introduced scheme, an intensity adjustment process is applied first using the Pulse Coupled Neural Network (PCNN) with a median filter. This is followed by applying the PCNN-based segmentation algorithm to detect the boundary of the anterior chamber in the eye image. Then, glaucoma clinical parameters are calculated and normalized, followed by application of a rough set analysis to discover the dependency between the parameters and to generate a set of reducts that contain a minimal number of attributes. Finally, a rough confusion matrix is designed to discriminate between normal and glaucomatous eyes. Experimental results show that the introduced scheme is very successful and has high detection accuracy.

1 Introduction

Glaucoma is a disease that can cause a severe impairment of visual function and leads to irreversible blindness if untreated. About 60 million people worldwide will have glaucoma by 2010, and the number will increase to nearly 80 million by 2020, according to a recent study in the British Journal of Ophthalmology [1]. It has been estimated that one-half of glaucoma patients are affected by angle closure glaucoma [2]. Angle closure glaucoma (ACG) has been called the most common form of glaucoma worldwide and the leading cause of bilateral blindness [2, 3, 4]. If the disease is detected in its early stages, damage can be minimized and the long-term prognosis for the patient is improved.

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 275–293. springerlink.com © Springer-Verlag Berlin Heidelberg 2009


Fig. 1. Schematic diagram for normal and glaucomatous eye

Fig. 2. Healthy and glaucomatous eye

Often the diagnosis of glaucoma depends on the type of glaucoma. There are two main types of glaucoma: open angle and angle closure. The angle refers to the drainage area where the clear protective lining of the front chamber of the eye, the cornea, and the iris, the colored portion of the eye, join. If this area is closed or narrow, one could develop angle closure glaucoma. If this area is physically open and the individual has glaucoma, it is termed open angle [5]. Gonioscopy, Ultrasound Biomicroscopy (UBM), and Optical Coherence Tomography (OCT) are potentially important tools for diagnosing angle closure glaucoma (ACG) [6, 7]. Figure 1 shows a schematic diagram of a normal and a glaucomatous eye. Figure 2 shows two cases: the left image is a healthy eye with a wide anterior chamber angle and the right image is a glaucomatous eye.

UBM images can help the clinician visualize structures behind the iris. It is also of benefit when the anterior chamber structures cannot be clearly seen, such as through a cloudy cornea. The diagnostic utility of ultrasound biomicroscopy has been reported for anterior segment disorders such as glaucoma, iris tumor, corneal diseases, and ocular trauma. UBM can also detect small foreign bodies, including those missed by computed tomography and B-scan ultrasound [7]. The UBM has enabled clinicians to quantitatively assess the iris curvature and degree of angle opening, since it images a cross-section of angle structures similar to that of a low power microscope section. One can determine the state of closure of the entire angle, even when it cannot be visualized by gonioscopy.

UBM operates at a frequency of 50 to 100 MHz with 20 to 60 μm resolution and 4 mm penetration [8, 9]. It produces high-resolution images of the anterior part of the eye, from which a qualitative and quantitative evaluation of structures and their relations can be made [7]. In spite of recent advances in ultrasonic imaging, manual assessment of glaucoma clinical parameters on UBM images by physicians is still a challenging task due to poor contrast, missing boundaries, low signal-to-noise ratio (SNR), speckle noise and refraction artifacts in the images. Besides, manual identification of glaucoma clinical parameters is tedious and sensitive to observer bias and experience. Thus, semi-automatic or automatic methods for measuring angle closure glaucoma clinical parameters provide robust results with a certain degree of accuracy and can remove the physical weaknesses of observer interpretation of ultrasound images [10, 11]. This is essential for the early detection and treatment of glaucoma.

Over the past two decades, several traditional multivariate statistical classification approaches, such as linear discriminant analysis and quadratic discriminant analysis, have been developed to address the classification problem. More advanced and intelligent techniques have been used in medical data analysis, such as neural networks, Bayesian classifiers, genetic algorithms, decision trees [12], fuzzy theory and rough sets. Fuzzy sets [13] provide a natural framework for dealing with uncertainty. They offer a problem-solving tool between the precision of classical mathematics and the inherent imprecision of the real world. Neural networks [14, 15] provide a robust approach to approximating real-valued, discrete-valued and vector-valued functions. The well-known backpropagation algorithm, which uses gradient descent to tune network parameters to best fit a training set of input-output pairs, has been applied as a learning technique for neural networks. Other approaches such as case-based reasoning and decision trees [12] are also widely used to solve data analysis problems. Each of these techniques has its own properties and features, including the ability to find important rules and information that could be useful for the medical domain, and each contributes a distinct methodology for addressing problems in its domain. Rough set theory [16, 17, 18] is a fairly new intelligent technique that has been applied to the medical domain. It is used to discover data dependencies, evaluate the importance of attributes, discover patterns in data, remove redundant objects and attributes, and seek the minimum subset of attributes. Moreover, it is used for the extraction of rules from databases. One advantage of rough sets is the creation of readable if-then rules. Such rules have the potential to reveal new patterns in the data material. This chapter introduces a rough set scheme for the analysis of Ultrasound Biomicroscopy glaucoma images in conjunction with a pulse coupled neural network.

This chapter is organized as follows: Section 2 gives a brief mathematical background on pulse coupled neural networks and rough sets. Section 3 discusses the proposed rough set data analysis scheme in detail. Section 4 discusses the rough sets prediction model. Experimental analysis and discussion of the results are described in Section 5. Finally, conclusions and future work are presented in Section 6.

2 Mathematics Background

2.1 Pulse Coupled Neural Network

Pulse coupled neural networks (PCNNs) [19, 20] are neural networks that are based on the cat’s visual cortex and were developed for high-performance biomimetic image processing. Eckhorn et al. [21, 22, 23] introduced a neural model to emulate the mechanism of the cat’s visual cortex. The Eckhorn model provided a simple and effective tool for studying the visual cortex of small mammals and was soon recognized as having significant application potential in image processing. In 1994, the Eckhorn model was adapted into an image processing algorithm by Johnson, who termed this algorithm PCNN.

A PCNN is a two-dimensional neural network. Each neuron in the network corresponds to one pixel in an input image and receives the corresponding pixel’s color information (e.g. intensity) as an external stimulus. Each neuron also connects with its neighboring neurons, receiving local stimuli from them. The external and local stimuli are combined in an internal activation system, which accumulates the stimuli until it exceeds a dynamic threshold, resulting in a pulse output. Through iterative computation, PCNN neurons produce temporal series of pulse outputs. These temporal series contain information about the input images and can be utilized for various image processing applications, such as image enhancement and segmentation [19, 20]. The pulse coupled neural network model comprises four parts that form the basis of the neuron. The first part is the feeding receptive field, which receives the feeding inputs (i.e., image pixel values). The second part is the linking receptive field, which receives the linking inputs from the neighboring neurons. The third part is the modulation field, in which a constant positive bias is added to the linking input and the result is multiplied by the feeding input. The last part is a pulse generator that consists of an output pulse generator and a threshold spike generator. When a PCNN is applied to image processing, one neuron corresponds to one pixel. Figure 3 depicts the layout structure of the PCNN and its components.


Fig. 3. The layout structure of PCNN and its components
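The four-part neuron described above can be sketched in a few lines of Python. The update equations follow the commonly used discrete PCNN formulation (feeding F, linking L, internal activity U, pulse Y, dynamic threshold Θ); the chapter does not list them explicitly, so they are an assumption here. The 3 × 3 weight kernel shared by M and W and the helper names `neighbor_sum` and `pcnn` are also assumptions. The default parameter values are those of Table 1, reading N as the number of iterations.

```python
import numpy as np


def neighbor_sum(y, kernel):
    """Weighted sum of each pixel's neighbours (zero padding at the border)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(y, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(y, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + y.shape[0], j:j + y.shape[1]]
    return out


def pcnn(stimulus, n_iter=5, beta=0.2, a_f=0.001, a_l=1.0, a_t=10.0,
         v_f=0.01, v_l=0.01, v_t=2.0):
    """Run a basic PCNN and return the binary pulse image of every iteration."""
    s = np.asarray(stimulus, dtype=float)
    # Assumed interconnection weights, used for both M (feeding) and W (linking).
    kernel = np.array([[0.5, 1.0, 0.5],
                       [1.0, 0.0, 1.0],
                       [0.5, 1.0, 0.5]])
    f = np.zeros_like(s)       # feeding input
    l = np.zeros_like(s)       # linking input
    theta = np.zeros_like(s)   # dynamic threshold, initially zero
    y = np.zeros_like(s)       # pulse output
    pulses = []
    for _ in range(n_iter):
        k = neighbor_sum(y, kernel)
        f = np.exp(-a_f) * f + v_f * k + s
        l = np.exp(-a_l) * l + v_l * k
        u = f * (1.0 + beta * l)                  # modulation field
        y = (u > theta).astype(float)             # pulse generator
        theta = np.exp(-a_t) * theta + v_t * y    # threshold jumps on firing, then decays
        pulses.append(y.copy())
    return pulses


image = np.full((4, 4), 0.5)   # toy normalized intensities
pulses = pcnn(image, n_iter=3)
print([int(p.sum()) for p in pulses])
```

With a zero initial threshold every neuron with a positive stimulus fires at the first iteration, which matches the behaviour described in the text.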

2.2 Rough Set Theory

Rough set theory [16, 17, 18] is a fairly new intelligent technique for managing uncertainty that has been applied to the medical domain. It is used to discover data dependencies, evaluate the importance of attributes, discover patterns in data, remove redundant objects and attributes, seek the minimum subset of attributes, and recognize and classify objects in image processing. Moreover, it is used for the extraction of rules from databases. Rough sets have proven useful for the representation of vague regions in spatial data. One advantage of rough sets is the creation of readable if-then rules. Such rules have the potential to reveal new patterns in the data; furthermore, they collectively function as a classifier for unseen data sets. Unlike other computational intelligence techniques, rough set analysis requires no external parameters and uses only the information present in the given data. One useful feature of rough set theory is that it can tell whether the data is complete or not based on the data itself. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, the rough set approach can determine whether the data contains redundant information and find the minimum data needed for the classification model. This property is very important for applications where domain knowledge is limited or data collection is expensive or laborious, because it ensures that the data collected is just sufficient to build a good classification model, without sacrificing the accuracy of the model or wasting time and effort gathering extra information about the objects [16, 17, 18].

In rough set theory, the data is collected in a table called a decision table. Rows of the decision table correspond to objects, and columns correspond to attributes. In the data set, we assume that a set of examples is given, each with a class label indicating the class to which the example belongs. We call the class label the decision attribute and the rest of the attributes the condition attributes. Rough set theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation and the boundary. The lower approximation contains all the objects that can be classified with certainty based on the data collected, the upper approximation contains all the objects that can possibly be classified, and the boundary is the difference between the upper approximation and the lower approximation. So, we can define a rough set as any set defined through its lower and upper approximations. The notion of indiscernibility is also fundamental to rough set theory. Informally, two objects in a decision table are indiscernible if one cannot distinguish between them on the basis of a given set of attributes. Hence, indiscernibility is a function of the set of attributes under consideration. For each set of attributes we can thus define a binary indiscernibility relation, which is a collection of pairs of objects that are indiscernible from each other. An indiscernibility relation partitions the set of cases or objects into a number of equivalence classes. The equivalence class of a particular object is simply the collection of objects that are indiscernible from the object in question. Here we have provided an informal explanation of the basic framework of rough set theory, along with some of the key definitions. Readers may consult [16, 17, 18] for more fundamental details on rough set theory and its applications.
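The definitions above translate directly into a few set operations. The following sketch is illustrative only: the toy decision table, its discretized attribute values and the function names are hypothetical and are not taken from the chapter's data.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Partition object ids by their values on the chosen attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())


def approximations(table, attributes, target):
    """Return (lower, upper, boundary) of the object set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes(table, attributes):
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper, upper - lower


# Hypothetical toy table: x1 and x2 are indiscernible on both attributes.
table = {
    "x1": {"TIA": "low", "AOD500": "low"},
    "x2": {"TIA": "low", "AOD500": "low"},
    "x3": {"TIA": "high", "AOD500": "high"},
    "x4": {"TIA": "high", "AOD500": "low"},
}
target = {"x1", "x4"}   # objects carrying some decision value
lower, upper, boundary = approximations(table, ["TIA", "AOD500"], target)
print(sorted(lower), sorted(upper), sorted(boundary))
```

Because x1 and x2 cannot be told apart but only x1 belongs to the target set, both end up in the boundary region, while x4 is in the lower approximation.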

3 Ultrasound Biomicroscopy Glaucoma Rough Sets Images Analysis Scheme

Figure 4 illustrates the overall steps in the proposed Ultrasound Biomicroscopy Glaucoma Rough Sets Images Analysis Scheme using a UML Activity Diagram, where a square or rectangle represents a data object, a rounded rectangle represents an activity, and solid and dashed directed lines indicate control flow and data object flow, respectively. Functionally, the scheme can be partitioned into three distinct phases.

3.1 Preprocessing Phase

In the first phase of the experiment, the UBM eye images are preprocessed to remove noise. Eye structures in UBM images are not very clear, which makes them very challenging to analyze, both for the naked human eye and for any automatic assessment algorithm. PCNN is a very powerful tool for enhancing the boundaries in ultrasound images.


Fig. 4. Ultrasound Biomicroscopy Glaucoma Rough Sets Images Analysis Scheme

PCNN with the median filter noise reduction algorithm

To increase the efficiency of automating the boundary detection process, a preprocessing step should be applied to enhance the quality of the eye images before detecting their boundaries. The median filter [19, 20] is used to reduce noise in an image. It operates on one pixel in the image at a time and looks at its closest neighbors to decide whether or not the pixel is representative of its surroundings. To begin with, one should decide the size of the window within which the filter operates on the image. The size could, for example, be set to three, which means that the filter operates on a centered pixel and its neighbors in a 3 × 3 window. The filter then sorts the pixels contained in the image area covered by the window, and the center pixel is replaced by the median, the middle value, of the sorted values. The advantage of the median filter, compared with other smoothing filters of similar size, is that it performs noise reduction with considerably less blurring. Thus, the filter also preserves the edges in an image very well. The median filter works especially well for random noise.

The algorithm works as follows: it first locates the noisy pixels according to the firing pattern and then removes the noise from the image with the median filter. Initially the threshold of all neurons is set to zero, so at the first iteration all neurons are activated, which means that all neurons receive the maximal linking input at the next iteration. With a proper setting of the PCNN’s parameters, the neurons corresponding to noisy pixels with high intensity fire before their neighbors at the second iteration, and the positions of the noisy pixels can be found from the current firing pattern. The noisy pixels can then be removed with a 3 × 3 median filter. Noisy pixels with low intensity are removed in the same way after the intensity is inverted. Because this algorithm locates the noisy pixels and applies the median operation only to the noisy regions, it preserves the details of the image well. For more details, the reader may consult [19, 20].
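The second step, applying the median only at pixels flagged as noise, can be sketched as follows. This is a minimal illustration, assuming the noise positions have already been obtained from the PCNN firing pattern and are passed in as a boolean mask. The function name `selective_median` and the edge-replication padding are choices made for this example, not part of the original method.

```python
import numpy as np


def selective_median(image, noise_mask, size=3):
    """Replace only the flagged pixels by the median of their size x size window."""
    img = np.asarray(image, dtype=float)
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")   # replicate border pixels
    out = img.copy()
    for r, c in zip(*np.nonzero(noise_mask)):
        window = padded[r:r + size, c:c + size]   # window centred on (r, c)
        out[r, c] = np.median(window)
    return out


# Toy image: uniform background with one bright noisy pixel in the centre.
image = np.full((5, 5), 0.2)
image[2, 2] = 1.0
mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True                 # position assumed to come from the firing pattern
cleaned = selective_median(image, mask)
print(cleaned[2, 2])
```

Pixels outside the mask are left untouched, which is why this approach blurs less than filtering the whole image.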

PCNN boundary detection algorithm

The success of the application of PCNNs to image segmentation depends on the proper setting of the various parameters of the network, such as the linking parameter β, the thresholds Θ, the decay time constants αΘ, and the interconnection matrices M and W. The image can be represented as an array of M × N normalized intensity values. The array is then fed in at the M × N inputs of the PCNN. If all neurons are initially set to 0, the input results in activation of all of the neurons at the first iteration. The threshold of each neuron, Θ, increases significantly when the neuron fires, and the threshold value then decays with time. When the threshold falls below the respective neuron’s potential (U), the neuron fires again, which again raises the threshold. The process continues, creating binary pulses for each neuron. We observe that the visible difference between the enhanced image and the original image is not too drastic. Segmentation without preprocessing results in a blank image, whereas segmentation with preliminary preprocessing does not. The PCNN parameter values used in this application are given in Table 1.

Table 1. PCNN parameter values

Parameter   Value
β           0.2
αF          0.001
αL          1
αΘ          10
VF          0.01
VL          0.01
VΘ          2
N           5

(a) anterior chamber angle (b) glaucoma parameters

Fig. 5. Relationship between the Angle TIA and AOD500

Table 2. Summary of the definition of variables in this study

Variable   Definition
TIA        Angle between the arms passing through a point on the trabecular meshwork 500 μm from the scleral spur and the point perpendicularly opposite on the iris
AOD500     Length of a perpendicular from the trabecular meshwork on the iris at a point 500 μm from the scleral spur
ARA500     The total area bounded by the iris and cornea at 500 μm from the scleral spur, calculated as the angle-recess area

3.2 Clinical Parameters Assessment Phase

The second phase of the experiment is the clinical parameters assessment. The degree of angle opening was measured using the following variables: trabecular-iris angle (TIA), the angle-opening distance (AOD) at 500 μm from the scleral spur (AOD500), and angle-recess area (ARA500), as described by Pavlin et al. [24, 25]. Figure 5(a) shows a UBM image of the anterior chamber angle demonstrating the angle-recess area, and Figure 5(b) illustrates the glaucoma parameters to be measured. A summary of the definitions of the variables is given in Table 2.

Clinical parameters assessment algorithm

We designed an algorithm to identify the scleral spur and then automatically calculate the distance along a perpendicular line drawn from the corneal endothelial surface to the iris at 500 μm, yielding the AOD500. The total area bounded by the iris and cornea at 500 μm from the scleral spur was calculated as the angle-recess area (ARA500). Also, the TIA was measured from the apex point. The measured TIA and AOD500 parameters are then fed to the classifier to classify the cases as normal or glaucomatous eyes. Figure 5 shows a schematic diagram for the calculation of the glaucoma clinical parameters, and the main steps of the clinical parameters assessment are given in Algorithm 1.

Algorithm 1. Clinical parameters assessment algorithm
Input: the enhanced UBM glaucoma image.
Output: normal or glaucomatous eye.

1: Draw the anterior chamber boundary
2: Locate the scleral spur point (apex point)
3: Draw a line of 25 pixels parallel to the x-axis and, at the end of this line, draw a perpendicular line to intersect the upper boundary and the lower boundary of the anterior chamber region, then calculate the distances d1 and d2
4: Calculate the distances xx and xxxx using the Euclidean rule
5: Calculate angle 1 and angle 2, then angle a = angle 1 + angle 2
6: From the apex point draw a line of 25 pixels on the upper boundary of the anterior chamber, then find the distance z = xx/cos(angle a), also find the distance y = sin(angle a)
7: Calculate the angle-recess area (ARA500) = 1/2(xy)
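As a rough illustration of the quantities involved, the sketch below computes TIA, AOD500 and a triangular approximation of the angle-recess area from three landmark points. It does not reproduce Algorithm 1 step by step. The function name, the landmark arguments and the triangle approximation (which ignores the curvature of the iris) are assumptions made for this example, and coordinates are taken to be already converted to micrometres.

```python
import math


def angle_parameters(apex, cornea_pt, iris_pt):
    """Return (TIA in degrees, AOD, triangular area) for three (x, y) landmarks.

    apex      -- scleral spur / angle apex
    cornea_pt -- point on the corneal side, 500 um from the spur
    iris_pt   -- point on the iris perpendicularly opposite cornea_pt
    """
    v1 = (cornea_pt[0] - apex[0], cornea_pt[1] - apex[1])
    v2 = (iris_pt[0] - apex[0], iris_pt[1] - apex[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    tia = math.degrees(math.atan2(abs(cross), dot))   # angle between the two arms
    aod = math.hypot(cornea_pt[0] - iris_pt[0], cornea_pt[1] - iris_pt[1])
    area = 0.5 * abs(cross)                           # area of the triangle only
    return tia, aod, area


# Hypothetical landmarks in micrometres.
tia, aod, area = angle_parameters((0.0, 0.0), (500.0, 0.0), (500.0, 500.0))
print(round(tia, 2), aod, area)
```

In practice the landmarks would come from the detected anterior chamber boundary, and the area would be integrated along the actual iris contour rather than approximated by a triangle.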

The angles of patients were categorized as Grade 0 to Grade 4, using Shaffer’s classification [3]. These angles were quantified by ultrasound biomicroscopy (UBM) using the following biometric characteristics: angle opening distance at 500 μm (AOD500) from the scleral spur and angle recess area (ARA) [3, 26]. The angles were further segregated into narrow angles (Shaffer’s Grade 2 or less) and open angles (Shaffer’s Grade 3 and 4).

3.3 Rough Set Data Analysis Phase

• Pre-processing stage (activities in dark gray). This stage includes tasks such as extra variables addition and computation, decision class assignment, data cleansing, completeness, correctness, attribute creation, attribute selection and discretization.

• Analysis and Rule Generating stage (activities in light gray). This stage includes the generation of preliminary knowledge, such as computation of object reducts from data, derivation of rules from reducts, rule evaluation and prediction processes.

• Classification and Prediction stage (activities in lighter gray). This stage utilizes the rules generated in the previous stage to predict whether an eye is normal or glaucomatous.


Rough set pre-processing stage

In this stage, the decision table required for rough set analysis is created. In doing so, a number of data preparation tasks are performed, such as data conversion, data cleansing, data completion checks, conditional attribute creation, decision attribute generation and discretization of attributes. Data splitting is also performed, creating two randomly generated subsets: one subset for analysis containing 75% of the objects in the data set and one for validation containing the remaining 25% of the objects. It must be emphasized that the data conversion performed on the initial data must generate a form to which specific rough set tools can be applied.
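The random 75%/25% split can be sketched as follows. The function name `split_objects` and the fixed seed are assumptions added for reproducibility of the example; the chapter does not say how the random subsets were drawn.

```python
import random


def split_objects(objects, analysis_fraction=0.75, seed=0):
    """Shuffle the objects and split them into analysis and validation subsets."""
    rng = random.Random(seed)          # local generator, so global state is untouched
    shuffled = list(objects)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * analysis_fraction)
    return shuffled[:cut], shuffled[cut:]


object_ids = list(range(20))           # e.g. twenty image records
analysis, validation = split_objects(object_ids)
print(len(analysis), len(validation))
```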

3.4 Data Completion Process

Often, real world data contain missing values. Since rough set classification involves mining for rules from the data, objects with missing values in the data set may have undesirable effects on the rules that are constructed. The aim of the data completion procedure is to remove all objects that have one or more missing values. Incomplete data or information systems are common in practical data analysis, and completing an incomplete information system through various completion methods in the preprocessing stage is normal practice in data mining and knowledge discovery. However, these methods may distort the original data and knowledge, and can even render the original data un-minable. To overcome these shortcomings inherent in the traditional methods, we used the decomposition approach for incomplete information systems (i.e., decision tables) proposed in [27].

3.5 Data Discretization Process: RSBR Algorithm

Attributes in concept classification and prediction may have varying importance in the problem domain being considered. Their importance can be pre-assumed using auxiliary knowledge about the problem and expressed by properly chosen weights. However, when the rough set approach is used for concept classification, it avoids any additional information aside from what is included in the information table itself. Basically, the rough set approach tries to determine from the available data in the decision table whether all the attributes are of the same strength and, if not, how they differ in respect of classification power. Therefore, some strategy for the discretization of real-valued attributes has to be used when we need to apply learning strategies for data classification with real-valued attributes (e.g. equal width and equal frequency intervals). It has been shown that the quality of a learning algorithm depends on this strategy. Discretization uses a data transformation procedure that involves finding cuts in the data sets that divide the data into intervals. Values lying within an interval are then mapped to the same value. Performing this process reduces the size of the attribute value set and ensures that the rules that are mined are not too specific. For the discretization of continuous-valued attributes, we adopt, in this chapter, the rough sets with boolean reasoning (RSBR) algorithm proposed by Zhong et al. [27]. The main advantage of RSBR is that it combines discretization of real-valued attributes and classification (for more details, refer to [13]).
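RSBR itself is beyond a short example, but the idea of cuts and the two baseline strategies mentioned above, equal width and equal frequency intervals, are easy to sketch. The function names are hypothetical, and the sample angle values (in degrees) are used purely for illustration.

```python
import numpy as np


def equal_width_cuts(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, n_bins + 1)[1:-1]


def equal_frequency_cuts(values, n_bins):
    """Interior cut points placed at quantiles, so bins hold similar counts."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, qs)


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return np.digitize(values, cuts)


tia = np.array([45.43, 24.8, 13.68, 13.6, 24.58, 56.4, 37.44])
width_cuts = equal_width_cuts(tia, 3)
print(width_cuts, discretize(tia, width_cuts))
print(equal_frequency_cuts(tia, 3))
```

All values that fall between two consecutive cuts receive the same code, which is what shrinks the attribute value set before rule mining.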

Analysis and Rule Generating Stage

As mentioned before, the Analysis and Rule Generating stage includes generating preliminary knowledge, such as computation of object reducts from data, derivation of rules from reducts, and prediction processes. These steps lead towards the final goal of generating rules from the information system or decision table.

Reduce Irrelevant and Redundant Attributes

In decision tables, there often exist conditional attributes that provide (almost) no additional information about the objects. These attributes need to be removed in order to reduce the complexity and cost of the decision process [17, 18, 28]. A decision table may have more than one reduct, and any of these reducts could be used to replace the original table. However, finding all the reducts of a decision table is NP-hard; fortunately, in applications it is usually not necessary to find all of them, and one or a few are sufficient. Selecting the best reduct is important. The selection depends on the optimality criterion associated with the attributes. If a cost function can be assigned to attributes, then the selection can be based on the combined minimum cost criterion. In the absence of such a cost function, the only source of information for selecting the reduct is the content of the table. In this chapter, we adopt the criterion that the best reducts are those with the minimal number of attributes and, if there are several such reducts, those with the least number of combinations of values of their attributes; cf. [28].

In general, rough set theory provides useful techniques for removing irrelevant and redundant attributes from a large database with many attributes. The dependency degree (also called approximation quality or classification quality) and the information entropy are the two most common attribute reduction measures in rough set theory. In this chapter, we use the dependency degree measure to compute the significant features and to measure the effect of removing a feature from the feature set.
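The dependency degree can be illustrated with a short sketch. It uses the standard definition, the fraction of objects whose condition-attribute equivalence class maps to a single decision value, and measures the significance of an attribute as the drop in that fraction when the attribute is removed. The toy table and function names are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency_degree(rows, condition_attrs, decision_attr):
    """Fraction of objects in the positive region of the decision."""
    positive = 0
    for block in partition(rows, condition_attrs):
        decisions = {rows[i][decision_attr] for i in block}
        if len(decisions) == 1:        # block is consistent
            positive += len(block)
    return positive / len(rows)


def significance(rows, condition_attrs, decision_attr, attr):
    """Drop in dependency degree caused by removing attr."""
    reduced = [a for a in condition_attrs if a != attr]
    return (dependency_degree(rows, condition_attrs, decision_attr)
            - dependency_degree(rows, reduced, decision_attr))


# Hypothetical discretized decision table.
rows = [
    {"TIA": "low", "AOD500": "low", "d": 0},
    {"TIA": "low", "AOD500": "low", "d": 0},
    {"TIA": "high", "AOD500": "high", "d": 1},
    {"TIA": "high", "AOD500": "low", "d": 0},
]
conds = ["TIA", "AOD500"]
print(dependency_degree(rows, conds, "d"))
print(significance(rows, conds, "d", "TIA"), significance(rows, conds, "d", "AOD500"))
```

In this toy table, removing TIA does not change the dependency degree, so {AOD500} alone already preserves the classification and would be a reduct.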

Computation of the Reducts

A reduced table can be seen as a rule set where each rule corresponds to one object of the table. The rule set can be generalized further by applying the rough set value reduction method. The main idea behind this method is to drop the redundant condition values of rules and to unite rules in the same class. Unlike most value reduction methods, which neglect the difference among the classification capabilities of condition attributes, we first remove the values of those attributes that have lower discrimination factors. Thus more redundant values can be removed from the decision table and more concise rules can be generated.

Rule Generation from a Reduced Table

The generated reducts are used to generate decision rules. A decision rule has, on its left side, a combination of values of attributes such that the set of (almost) all objects matching this combination has the decision value given on the rule’s right side. The rules derived from reducts can be used to classify the data. The set of rules is referred to as a classifier and can be used to classify new and unseen data. The quality of the rules is related to the corresponding reduct(s). We are especially interested in generating rules that cover the largest parts of the universe U. Covering U with more general rules implies a smaller rule set.

Classification and Prediction Stage

Classification and prediction is the last stage of our proposed approach. We present a classification and prediction scheme based on the methods and techniques described in the previous sections. To transform a reduct into a rule, one only has to bind the condition feature values of the object class from which the reduct originated to the corresponding features of the reduct. Then, to complete the rule, a decision part comprising the resulting part of the rule is added. This is done in the same way as for the condition features. To classify objects that have never been seen before, rules generated from a training set are used. These rules represent the actual classifier. This classifier is used to predict the classes to which new objects belong. The nearest matching rule is determined as the one whose condition part differs from the feature vector of the object by the minimum number of features. When there is more than one matching rule, we use a voting mechanism to choose the decision value. Every matched rule contributes votes to its decision value, equal to t times the number of objects matched by the rule. The votes are added and the decision with the largest number of votes is chosen as the correct class. Quality measures associated with decision rules can be used to eliminate some of the decision rules.
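The nearest-match and voting steps can be sketched as follows. This is a simplified illustration: each rule is represented as a hypothetical tuple of conditions, decision and support (the number of training objects it matches), and the factor t is taken to be 1, since its value is not specified.

```python
from collections import Counter


def classify(obj, rules):
    """Pick the decision by voting among the nearest matching rules.

    rules -- list of (conditions: dict, decision, support: int)
    """
    def mismatches(conditions):
        # Number of condition features on which the object differs from the rule.
        return sum(obj.get(a) != v for a, v in conditions.items())

    best = min(mismatches(c) for c, _, _ in rules)
    votes = Counter()
    for conditions, decision, support in rules:
        if mismatches(conditions) == best:
            votes[decision] += support      # t = 1 assumed
    return votes.most_common(1)[0][0]


# Hypothetical rule set over discretized attributes.
rules = [
    ({"TIA": "high"}, 1, 2),
    ({"AOD500": "low"}, 0, 3),
    ({"TIA": "low", "AOD500": "low"}, 0, 2),
]
print(classify({"TIA": "high", "AOD500": "low"}, rules))
print(classify({"TIA": "high", "AOD500": "high"}, rules))
```

In the first call two rules match exactly and disagree, so the rule with the larger support decides; in the second call only one rule matches.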

Rule Strength Measures

The global strength defined in [29] for rule negotiation is a rational number in [0, 1] representing the importance of the sets of decision rules relative to the considered tested object. Let us assume that $T = (U, A \cup \{d\})$ is a given decision table, $u_t$ is a test object, $Rul(X_j)$ is the set of all calculated basic decision rules for $T$ classifying objects to the decision class $X_j$ ($v_d^j = v_d$), and $MRul(X_j, u_t) \subseteq Rul(X_j)$ is the set of all decision rules from $Rul(X_j)$ matching the tested object $u_t$. The global strength of the decision rule set $MRul(X_j, u_t)$ is defined by the following form [29]:

\[
MRul(X_j, u_t) = \frac{\left| \bigcup_{r \in MRul(X_j, u_t)} |Pred(r)|_A \cap |d = v_d^j|_A \right|}{\left| |d = v_d^j|_A \right|} . \tag{1}
\]

The rule strength measure defined above is applied in constructing the classification algorithm. To classify a new case, the rules matching the new object are first selected. The strength of the selected rule set is calculated for each decision class, and the decision class with maximal strength is then selected, with the new object being classified to this class.
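A small sketch of this selection step is given below. It assumes each matching rule is represented simply by the set of training objects that satisfy its predecessor, which is a simplification made for illustration. The function names and the toy object ids are hypothetical.

```python
def global_strength(matching_rules, decision_class):
    """Share of the decision class covered by the predecessors of the matching rules.

    matching_rules -- list of sets; each set holds the training objects that
                      satisfy the predecessor of one matching rule
    decision_class -- set of training objects with the class's decision value
    """
    covered = set()
    for predecessor_objects in matching_rules:
        covered |= predecessor_objects & decision_class
    return len(covered) / len(decision_class)


def strongest_class(matching_rules_by_class, classes):
    """Return the decision value whose matching rule set has maximal strength."""
    return max(classes, key=lambda c: global_strength(matching_rules_by_class[c],
                                                      classes[c]))


# Hypothetical training objects 1..8 split into two decision classes.
classes = {0: {1, 2, 3, 4}, 1: {5, 6, 7, 8}}
matching = {0: [{1, 2}, {2, 3}], 1: [{5}]}
print(global_strength(matching[0], classes[0]), global_strength(matching[1], classes[1]))
print(strongest_class(matching, classes))
```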

4 Implementation and Results Evaluation

4.1 UBM Images Characteristic

The UBM images were from the New York Glaucoma Research Institute, obtained with the UBM Model 840 (Paradigm Medical Industries Inc.) with a 50 MHz transducer probe. The images have a lateral and axial physical resolution of approximately 50 μm and 25 μm respectively and a penetration depth of 4-5 mm, and typically cover 5 x 5 mm at a resolution of 440 x 240 pixels. Twenty images were used in the verification of the technique. The technique was implemented on a PC with a 3 GHz P4 processor using MATLAB 7.01.

4.2 PCNN: Chamber Boundary Detection Results

Preprocessing results: In the first phase of the experiment, the UBM eye images were preprocessed to remove noise. Eye structures in UBM images are not very clear, which makes them very challenging to analyse, both for the naked human eye and for any automatic assessment algorithm. With the preprocessing module, which removes image noise, smooths the images and enhances the image resolution, the performance of the segmentation module can be significantly improved. Table 3 shows the results of the developed PCNN enhancement and boundary detection technique on 2D UBM images. Table 3(a) is the original image. After noise removal and image enhancement by the preprocessing module, the output image is shown in Table 3(b). Table 3(c) shows the boundary of the anterior chamber on the original image. Table 3(d) shows the boundary of the anterior chamber alone.


Table 3. Determination of chamber boundaries

a) Original b) PCNN enhanced c) segmentation d) boundaries

Table 4. Chamber area decision table

Angle-TIA   AOD500   ARA      Decision class
45.43       28.161   63.04    1
24.8        11.78    150.17   0
13.68       6.13     77.66    0
13.6        6.05     75.89    0
24.58       11.52    145.03   0
56.4        48.19    771.28   1
37.44       20.61    277.53   1

4.3 Rough Sets Data Analysis Results

Table 4 represents the chamber area rough decision system. We reach the minimal number of reducts that contain a combination of attributes with the same discrimination factor. The final generated reduct set, which is used to generate the list of rules for the classification, is:

{TIA, with support 100%}


A natural use of a set of rules is to measure how well the ensemble of rules is able to classify new and unseen objects, that is, to assess how well the rules do in classifying new cases. We therefore apply the rules produced from the training set data to the test set data. The generated rules are presented here in a more readable format:

R1: IF TIA < 29.94 THEN Decision Class is 0.0
R2: IF TIA >= 29.94 THEN Decision Class is 1.0

Measuring the performance of the rules generated from the training data set in terms of their ability to classify new and unseen objects is also important. Our measuring criteria were Rule Strength and Rule Importance [30]. To check the performance of our method, we calculated the confusion matrix between the predicted classes and the actual classes, as shown in Table 5. The confusion matrix is a table summarising the number of true positives, true negatives, false positives, and false negatives obtained when using classifiers to classify the different test objects.
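As a small illustration of how rules R1 and R2 and a confusion matrix fit together, the following sketch applies the threshold rule to the TIA values listed in Table 4. Only the threshold 29.94 and the Table 4 values come from the text; the helper names are hypothetical, and the seven rows are used purely as an example (they are not the test set behind Table 5).

```python
# Threshold taken from rules R1 and R2; helper names are illustrative.
TIA_THRESHOLD = 29.94


def classify_tia(tia):
    """R1: TIA < 29.94 gives class 0; R2: TIA >= 29.94 gives class 1."""
    return 0 if tia < TIA_THRESHOLD else 1


def confusion_matrix(actual, predicted, n_classes=2):
    """matrix[a][p] counts objects of actual class a predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Share of objects on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Angle-TIA values and decision classes as listed in Table 4.
tia_values = [45.43, 24.8, 13.68, 13.6, 24.58, 56.4, 37.44]
actual = [1, 0, 0, 0, 0, 1, 1]

predicted = [classify_tia(v) for v in tia_values]
matrix = confusion_matrix(actual, predicted)
overall = accuracy(matrix)
```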

Several runs were conducted using different settings of the rule strength threshold. Table 6 shows the number of rules generated using rough sets; for the sake of comparison, we have also generated rules using a neural network. Table 6 indicates that the number of rules generated using neural networks is much larger than the number generated using rough sets.

Comparative analysis

To evaluate the efficiency of the developed classification method, we compared the results obtained using our classifier with those manually defined by an expert. For the most part, our measurements and the radiologist's measurements agree. Our analysis has also been compared with the analysis of the anterior chamber angle using the UBM Pro2000 software (Paradigm Medical Industries Inc, Salt Lake City, Utah) (see Fig. 6) [31, 32]. After the observer selects the scleral spur, the program automatically detects the border and calculates the angle recession area at 500 μm anterior to the scleral spur.

Table 5. Model Prediction Performance (Confusion Matrix)

Actual     Predict Class 0   Predict Class 1   Accuracy
Class 0    17                0                 1.0
Class 1    0                 32                1.0
           1.0               1.0               1.0


Table 6. Number of generated rules

Method            Generated rule numbers
Neural networks   37
Rough sets        2

Fig. 6. UBMPro2000 [31, 32]

5 Conclusion and Future Works

We have developed an advanced hybrid rough pulse coupled neural network scheme for Ultrasound Biomicroscopy glaucoma image analysis and provided a methodology for assessing the clinical parameters of angle closure glaucoma based on UBM images of the eye. To increase the efficiency of the introduced hybrid scheme, an intensity adjustment process is applied first, based on the Pulse Coupled Neural Network with a median filter. This is followed by applying the PCNN-based segmentation algorithm to detect the boundary of the anterior chamber in the image. Combining the adjustment and segmentation enables us to eliminate the PCNN sensitivity to the setting of the various PCNN parameters, whose optimal selection can be difficult and can vary even for the same problem. Then, chamber boundary features are extracted and normalized, followed by application of a rough set analysis to discover the dependency between the attributes and to generate a set of reducts that contains a minimal number of attributes.

Experimental results showed that the introduced scheme is very successful and has high detection accuracy. It is believed that the proposed automatic scheme for glaucoma clinical parameter assessment and classification for UBM images is a promising approach, which provides an efficient and robust assessment and diagnosis strategy and acts as a second opinion for the physician's interpretation of glaucoma diseases. In conclusion, the analysis of UBM images is a useful method for evaluating the chamber angle structure of the eye.


Acknowledgments

The authors wish to acknowledge the valuable feedback provided by Prof. Taymoor, The University of Ain Shams, during the development of the proposed algorithm. They would also like to acknowledge the help provided by Prof. Adel Abdel-Shafeek from the Faculty of Medicine, Ain Shams University, for the manual measurements of ultrasound images. They wish to thank Dr. Zaher Hussein (Ophthalmology-Glaucoma Specialist) from the New York Glaucoma Research Institute, New York Eye and Ear Infirmary, Glaucoma Service, for providing the UBM images and reports.

References

1. Quigley, H.A., Broman, A.T.: The number of people with glaucoma worldwide in 2010 and 2020. Br. J. Ophthalmol. 90(3), 262–267 (2006)

2. Razeghinejad, M.R., Kamali-Sarvestani, E.: The plateau iris component of primary angle closure glaucoma. Developmental or acquired. Medical Hypotheses 69, 95–98 (2007)

3. Kaushik, S., Jain, R., Pandav, S.S., Gupta, A.: Evaluation of the anterior chamber angle in Asian Indian eyes by ultrasound biomicroscopy and gonioscopy. Indian Journal of Ophthalmology 54(3), 159–163 (2006)

4. Quigley, H.A.: Number of people with glaucoma worldwide. Br. J. Ophthalmol. 80, 389–393 (1996)

5. Glaucoma, http://www.theeyecenter.com

6. Nishijima, K., Takahashi, K., Yamakawa, R.: Ultrasound biomicroscopy of the anterior segment after congenital cataract surgery. American Journal of Ophthalmology 130(4), 483–489 (2000)

7. Radhakrishnan, S., Goldsmith, J., Huang, D., Westphal, V., Dueker, D.K., Rollins, A.M., Izatt, J.A., Smith, S.D.: Comparison of optical coherence tomography and ultrasound biomicroscopy for detection of narrow anterior chamber angles. Arch. Ophthalmol. 123(8), 1053–1059 (2005)

8. Urbak, S.F.: Ultrasound Biomicroscopy. I. Precision of measurements. Acta Ophthalmol Scand 76(11), 447–455 (1998)

9. Deepak, B.: Ultrasound biomicroscopy "An introduction". Journal of the Bombay Ophthalmologists Association 12(1), 9–14 (2002)

10. Zhang, Y., Sankar, R., Qian, W.: Boundary delineation in transrectal ultrasound image for prostate cancer. Computers in Biology and Medicine 37(11), 1591–1599 (2007)

11. Youmaran, R., Dicorato, P., Munger, R., Hall, T., Adler, A.: Automatic detection of features in ultrasound images of the Eye. In: IMTC, Proceedings of the IEEE, Ottawa, Canada, May 16-19, 2005, vol. 3, pp. 1829–1834 (2005)

12. Hassanien, A.E.: Classification and feature selection of breast cancer data based on decision tree algorithm. International Journal of Studies in Informatics and Control Journal 12(1), 33–39 (2003)

13. Hassanien, A.E.: Fuzzy-rough hybrid scheme for breast cancer detection. Image and computer vision journal 25(2), 172–183 (2007)

14. Basheer, I.A., Hajmeer, M.: Artificial neural networks: fundamentals, computing, design, and Application. Journal of Microbiological Methods 43, 3–31 (2000)


15. Haykin, S.: Neural Networks: A Comprehensive Foundation. IEEE Press, Los Alamitos (1994)

16. Pal, S.K., Polkowski, S.K., Skowron, A. (eds.): Rough-Neuro Computing: Techniques for Computing with Words. Springer, Berlin (2002)

17. Pawlak, Z.: Rough Sets. Int. J. Computer and Information Sci. 11, 341–356 (1982)

18. Grzymala-Busse, J., Pawlak, Z., Slowinski, R., Ziarko, W.: Rough Sets. Communications of the ACM 38(11), 1–12 (1999)

19. El-dahshan, E., Redi, A., Hassanien, A.E., Xiao, K.: Accurate Detection of Prostate Boundary in Ultrasound Images Using Biologically inspired Spiking Neural Network. In: International Symposium on Intelligent Signal Processing and Communication Systems Proceeding, Xiamen, China, November 28-December 1, pp. 333–336 (2007)

20. Hassanien, A.E.: Pulse coupled Neural Network for Detection of Masses in Digital Mammogram. Neural Network World Journal 2(6), 129–141 (2006)

21. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., Reitboeck, H.J.: Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121–130 (1988)

22. Eckhorn, R., Reitboeck, H.J., Arndt, M.: Feature Linking via Synchronization among Distributed Assemblies: Simulations of Results from Cat Visual Cortex. Neural Comp. 2, 293–307 (1990)

23. Eckhorn, R.: Neural mechanisms from visual cortex suggest basic circuits for linking field models. IEEE Trans. Neural Networks 10, 464–479 (1999)

24. Pavlin, C.J., Harasiewicz, K., Foster, F.S.: Ultrasound biomicroscopy of anterior segment structures in normal and glaucomatous eyes. Am. J. Ophthalmol. 113, 381–389 (1992)

25. Hodge, A.C., Fenstera, A., Downey, D.B., Ladak, H.M.: Prostate boundary segmentation from ultrasound images using 2D active shape models: Optimisation and extension to 3D. Computer Methods and Programs in Biomedicine 8(4), 99–113 (2006)

26. Gohdo, T., Tsumura, T., Iijima, H., Kashiwagi, K., Tsukahara, S.: Ultrasound biomicroscopic study of ciliary body thickness in eyes with narrow angles. American Journal of Ophthalmology 129(3), 342–346 (2000)

27. Qizhong, Z.: An Approach to Rough Set Decomposition of Incomplete Information Systems. In: 2nd IEEE Conference on Industrial Electronics and Applications, ICIEA 2007, May 23-25, 2007, pp. 2455–2460 (2007)

28. Setiono, R.: Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine 18(3), 205–219 (2000)

29. Bazan, J., Nguyen, H.S., Nguyen, S.H., Synak, P., Wroblewski, J.: Rough Set Algorithms in Classification Problem. In: Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.) Rough Set Methods and Applications, pp. 49–88. Physica Verlag (2000)

30. Ning, S., Xiaohua, H., Ziarko, W., Cercone, N.: A Generalized Rough Sets Model. In: Proceedings of the 3rd Pacific Rim International Conference on Artificial Intelligence, vol. 431, pp. 437–443. Int. Acad. Publishers, Beijing (1994)

31. Sbeity, Z., Dorairaj, S.K., Reddy, S., Tello, C., Liebmann, J.M., Ritch, R.: Ultrasound biomicroscopy of zonular anatomy in clinically unilateral exfoliation syndrome. Acta Ophthalmol. 86(5), 565–568 (2008)

32. Dorairaj, S.K., Tello, C., Liebmann, J.M., Ritch, R.: Narrow Angles and Angle Closure: Anatomic Reasons for Earlier Closure of the Superior Portion of the Iridocorneal Angle. Acta Ophthalmol. 125, 734–739 (2007)


An Overview of Fuzzy C-Means Based Image Clustering Algorithms

Huiyu Zhou1 and Gerald Schaefer2

1 School of Engineering and Design, Brunel University, Uxbridge

2 Department of Computer Science, Loughborough University, Loughborough

Summary. Clustering is an important step in many imaging applications, with a variety of image clustering techniques having been introduced in the literature. In this chapter we provide an overview of several fuzzy c-means based image clustering concepts and their applications. In particular, we summarise the conventional fuzzy c-means (FCM) approaches as well as a number of its derivatives that aim at either speeding up the clustering process or at providing improved or more robust clustering performance.

1 Introduction

Image clustering is widely performed in a variety of applications such as computer vision, robotics, medical imaging and information retrieval. It can be seen as a process of grouping an image into non-overlapping homogeneous regions that hold consistent characteristics such as gray level, colour or texture. Fuzzy c-means (FCM) is one of the most popular methods for image clustering [3]. Compared to hard thresholding clustering methods, FCM is capable of reducing the uncertainty of pixels belonging to one class and therefore in general provides improved clustering outcomes. In addition, FCM enables multiple classes with varying degrees of membership to be continuously updated [4].

As an unsupervised method, FCM does not require a priori labeling of some patterns to categorise others or infer the cluster structure of the whole data. Apart from the original fuzzy c-means algorithm [3], a number of FCM derivatives have been introduced in the literature. These target either algorithmic speedup (e.g. fast FCM with random sampling [7] and fast generalized FCM [30]) or improvement of clustering performance with respect to noise or artefacts [4] (e.g. probabilistic clustering [24], fuzzy noise clustering [13], Lp norm clustering [19]).

A.-E. Hassanien et al. (Eds.): Foundations of Comput. Intel. Vol. 2, SCI 202, pp. 295–310. springerlink.com © Springer-Verlag Berlin Heidelberg 2009


Sato and Sato [29] presented a fuzzy clustering algorithm for interactive fuzzy vectors. Hathaway et al. [18] and Pedrycz et al. [28] proposed algorithms that can be used to convert parametric or non-parametric linguistic variables into generalised coordinates before performing fuzzy c-means clustering. Yang and Ko [35] proposed fuzzy c-numbers clustering procedures for segmenting data, while Yang and Liu [36] extended this to high-dimensional fuzzy vectors. Takata et al. [31] proposed a clustering method for data with uncertainties using the Hausdorff distance. They also suggested fuzzy clustering procedures for data with uncertainties using minimum and maximum distances based on the L1 metric [32]. Auephanwiriyakul and Keller [2] presented a linguistic version of the fuzzy c-means method, based on the extension principle and the decomposition theorem. Hung and Yang [21] suggested a fuzzy c-numbers clustering algorithm for LR-type fuzzy numbers based on an exponential-type distance measure. Novel clustering techniques for handling both symbolic and fuzzy data were proposed by Yang et al. [37]. Their fuzzy clustering algorithms, for mixed features of symbolic and fuzzy data, are obtained by modifying Gowda and Diday's dissimilarity measure for symbolic data [17] while changing the parametric approach for fuzzy data proposed in [18]. Recently, there has been increasing interest in developing fuzzy clustering models for three-way fuzzy data. Coppi and D'Urso presented a fuzzy c-means clustering approach for fuzzy time trajectories that form a geometrical representation of the fuzzy data time array [11].

2 Classical Fuzzy C-Means Clustering

Fuzzy c-means is based on the idea of finding cluster centres by iteratively adjusting their positions and evaluating an objective function which is typically defined as

$$E = \sum_{i=1}^{C} \sum_{j=1}^{N} \mu_{ij}^{k} \, \|x_j - c_i\|^2 \tag{1}$$

where $\mu_{ij}$ is the fuzzy membership of sample (or pixel) $x_j$ in the cluster identified by its centre $c_i$, $C$ is the number of clusters, $N$ is the number of samples, and $k$ is a constant that defines the fuzziness of the resulting partitions.

E can reach the global minimum when pixels near the centroid of their corresponding clusters are assigned higher membership values, while lower membership values are assigned to pixels far from the centroid [8]. Here, the membership is proportional to the probability that a pixel belongs to a specific cluster, where the probability depends only on the distance between the image pixel and each independent cluster centre. The membership functions and the cluster centres are updated by

$$\mu_{ij} = \frac{1}{\sum_{m=1}^{C} \left( \dfrac{\|x_j - c_i\|}{\|x_j - c_m\|} \right)^{2/(k-1)}} \tag{2}$$

and

$$c_i = \frac{\sum_{j=1}^{N} \mu_{ij}^{k} \, x_j}{\sum_{j=1}^{N} \mu_{ij}^{k}} \tag{3}$$

The steps involved in fuzzy c-means image clustering are [3]:

Step 1: Initialise the cluster centres ci and let t = 0.
Step 2: Initialise the fuzzy partition membership functions μij according to Equation (2).
Step 3: Let t = t + 1 and compute new cluster centres ci using Equation (3).
Step 4: Repeat Steps 2 to 3 until convergence.
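The four steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the convergence test on the centre shift, and the small constant `eps` that guards against a zero distance are all assumptions made for the example.

```python
import numpy as np


def update_memberships(x, centres, k=2.0, eps=1e-12):
    """Equation (2). x: (N, d) samples, centres: (C, d). Returns mu: (C, N)."""
    # dist[i, j] = ||x_j - c_i||; eps (an assumption) avoids division by zero.
    dist = np.linalg.norm(x[None, :, :] - centres[:, None, :], axis=2) + eps
    # ratio[i, m, j] = ||x_j - c_i|| / ||x_j - c_m||
    ratio = dist[:, None, :] / dist[None, :, :]
    return 1.0 / np.sum(ratio ** (2.0 / (k - 1.0)), axis=1)


def update_centres(x, mu, k=2.0):
    """Equation (3): membership-weighted means of the samples."""
    w = mu ** k
    return (w @ x) / w.sum(axis=1, keepdims=True)


def fcm(x, init_centres, k=2.0, tol=1e-6, max_iter=200):
    """Steps 1 to 4, stopping when the centres move by less than tol."""
    centres = np.asarray(init_centres, dtype=float)
    for _ in range(max_iter):
        mu = update_memberships(x, centres, k)        # Step 2
        new_centres = update_centres(x, mu, k)        # Step 3
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:                               # Step 4
            break
    return centres, update_memberships(x, centres, k)


# Two well-separated 1-D groups of gray values (toy data).
samples = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
centres, mu = fcm(samples, init_centres=[[1.0], [4.0]])
```

For an image, `samples` would hold one row per pixel (gray value or colour vector), and the hard segmentation is obtained by taking the cluster with the largest membership in each column of `mu`.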

An initial setting for each cluster centre is required, and FCM can be shown to converge to a local minimum. The efficiency of FCM has been investigated in [20]. To address the inefficiency of the algorithm, several variants of fuzzy c-means have been introduced.

3 Fast FCM Clustering with Random Sampling

To combat the computational complexity of FCM, Cheng et al. [7] proposed a multistage random sampling strategy. This method works on a lower number of feature vectors and also needs fewer iterations to converge. The basic idea is to randomly sample a small subset of the dataset in order to approximate the cluster centres of the full dataset. This approximation is then used to reduce the number of iterations. The random sampling FCM algorithm (RSFCM) consists of two phases. First, a multistage iterative process of a modified FCM is performed. Phase 2 is then a standard FCM with the cluster centres approximated by the final cluster centres from Phase 1.

Phase 1

Let X(Δ%) be a subset whose number of subsamples is Δ% of the N samples contained in the full dataset X, and denote the number of stages as ns. ε1 and ε2 are parameters used as stopping criteria. After the following steps the dataset (denoted as X(ns∗Δ%)) will include N ∗ Δ% samples:

Step 1: Select X(Δ%) from the set of the original feature vectors matrix (z = 1).
Step 2: Initialise the fuzzy membership functions μij using Equation (2) with X(z∗Δ%).
Step 3: Compute the stopping condition ε = ε1 − z∗((ε1 − ε2)/ns) and let j = 0.
Step 4: Set j = j + 1.
Step 5: Compute the cluster centres c(z∗Δ%) using Equation (3).
Step 6: Compute μ(z∗Δ%) using Equation (2).
Step 7: If $\|\mu^{j}_{(z*\Delta\%)} - \mu^{j-1}_{(z*\Delta\%)}\| \geq \varepsilon$, then go to Step 4.
Step 8: If z ≤ ns then select another X(Δ%), merge it with the current X(z∗Δ%) and set z = z + 1; otherwise move to Phase 2 of the algorithm.


Phase 2

Step 1: Initialise μij using the results from Phase 1, i.e. c(ns∗Δ%), with Equation (3) for the full data set.
Step 2: Go to Step 3 of the conventional FCM algorithm and iterate until the stopping criterion ε2 is met.

It has been shown that this improved algorithm is able to reduce the computation required by the classical FCM method. Other variants of this multistage random sampling FCM framework have also been developed and can be found e.g. in [14] and [23].

4 Fast Generalized FCM Clustering

Ahmed et al. [1] introduced an alternative to the classical FCM by adding a term that enables the labelling of a pixel to be associated with its neighbourhood. As a regulariser, the neighbourhood term can change the solution towards piecewise homogeneous labelling. As a further extension of this work, Szilagyi et al. [30] reported their EnFCM algorithm to speed up the clustering process for black-and-white images. In order to reduce the computational complexity, a linearly-weighted sum image g is formed from the original image and the local neighbour average image, evaluated as

$$g_m = \frac{1}{1 + \alpha} \left( x_m + \frac{\alpha}{N_R} \sum_{j \in N_r} x_j \right) \tag{4}$$

where $g_m$ denotes the gray value of the m-th pixel of the image g, $x_j$ represents the neighbours of $x_m$, $N_r$ represents the set of neighbours falling into a window around $x_m$, and $N_R$ is the cardinality of $N_r$.

The objective function used for clustering image g is defined as

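$$J = \sum_{i=1}^{C} \sum_{l=1}^{q_c} \gamma_l \, \mu_{il}^{m} \, (g_l - s_i)^2 \tag{5}$$

where $s_i$ is the centre of the i-th cluster, $q_c$ denotes the number of gray levels in the image, and $\gamma_l$ is the number of pixels having an intensity equal to $l$, with $l = 1, 2, \ldots, q_c$. Thus, $\sum_{l=1}^{q_c} \gamma_l = N$, under the constraint that $\sum_{i=1}^{C} \mu_{il} = 1$ for any $l$. Finally, we can obtain the following expressions for the membership functions and cluster centres [4]:

$$\mu_{il} = \frac{(g_l - s_i)^{-2/(m-1)}}{\sum_{j=1}^{C} (g_l - s_j)^{-2/(m-1)}} \tag{6}$$

and

$$s_i = \frac{\sum_{l=1}^{q_c} \gamma_l \, \mu_{il}^{m} \, g_l}{\sum_{l=1}^{q_c} \gamma_l \, \mu_{il}^{m}} \tag{7}$$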
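The following sketch illustrates Equations (4), (6) and (7) under a few assumptions that are not taken from [30]: the neighbourhood is a 3 x 3 window with the centre pixel excluded, image borders are handled by edge replication, and the clustering runs on a list of gray levels with their pixel counts. All function names are hypothetical.

```python
import numpy as np


def weighted_sum_image(img, alpha=1.0):
    """Equation (4) with an assumed 3x3 window (8 neighbours, edge padding)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    neighbour_sum = sum(
        padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    return (img + alpha * neighbour_sum / 8.0) / (1.0 + alpha)


def level_memberships(levels, centres, m, eps=1e-12):
    """Equation (6); eps (an assumption) avoids division by zero."""
    d2 = (levels[None, :] - centres[:, None]) ** 2 + eps   # (C, q_c)
    inv = d2 ** (-1.0 / (m - 1.0))          # equals |g_l - s_i|^(-2/(m-1))
    return inv / inv.sum(axis=0, keepdims=True)


def enfcm_histogram(levels, counts, init_centres, m=2.0, tol=1e-6, max_iter=100):
    """Alternate Equations (6) and (7) on gray levels weighted by counts."""
    levels = np.asarray(levels, dtype=float)
    counts = np.asarray(counts, dtype=float)
    s = np.asarray(init_centres, dtype=float)
    for _ in range(max_iter):
        w = counts[None, :] * level_memberships(levels, s, m) ** m
        s_new = (w @ levels) / w.sum(axis=1)                # Equation (7)
        converged = np.max(np.abs(s_new - s)) < tol
        s = s_new
        if converged:
            break
    return s, level_memberships(levels, s, m)


# Toy histogram: four gray levels, five pixels each.
centres, mu = enfcm_histogram([10, 12, 200, 202], [5, 5, 5, 5], [0.0, 255.0])
```

The saving comes from the loop running over the gray levels present in g rather than over all N pixels.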


EnFCM treats the number of pixels with similar intensities as a weight, which may accelerate the convergence of the search for global similarity. On the other hand, to avoid image blur during the clustering, which may lead to inaccurate clustering, Cai et al. [4] use a measure $S_{ij}$, which incorporates the local spatial relationship $S^{s}_{ij}$ and the local gray-level relationship $S^{g}_{ij}$, and is defined as

$$S_{ij} = \begin{cases} S^{s}_{ij} \times S^{g}_{ij}, & j \neq i \\ 0, & j = i \end{cases} \tag{8}$$

with

$$S^{s}_{ij} = \exp\left( \frac{-\max(|p_{cj} - p_{ci}|, |q_{cj} - q_{ci}|)}{\lambda_s} \right) \tag{9}$$

and

$$S^{g}_{ij} = \exp\left( \frac{-\|x_i - x_j\|^2}{\lambda_g \times \sigma_g^2} \right) \tag{10}$$

where $(p_{ci}, q_{ci})$ describe the co-ordinates of the i-th pixel, $\sigma_g$ is a global scale factor of the spread of $S^{s}_{ij}$, and $\lambda_s$ and $\lambda_g$ represent scaling factors. $S_{ij}$ replaces $\alpha$ in Equation (4). Hence, the newly generated image g is updated as

$$g_i = \frac{\sum_{j \in N_i} S_{ij} \, x_j}{\sum_{j \in N_i} S_{ij}} \tag{11}$$

and is restricted to [0, 255] due to the denominator.

Given a pre-defined number of clusters C and a threshold value ε > 0, the fast generalised FCM algorithm proceeds in the following steps:

Step 1: Initialise the clusters cj.
Step 2: Compute the local similarity measures Sij using Equation (8) for all neighbours and windows over the image.
Step 3: Compute the linearly-weighted summed image g using Equation (11).
Step 4: Update the membership partitions using Equation (6).
Step 5: Update the cluster centres ci using Equation (7).
Step 6: If $\sum_{i=1}^{C} \|c_i(\text{old}) - c_i(\text{new})\|^2 > \varepsilon$, go to Step 4.
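Steps 2 and 3 can be sketched as follows. The sketch assumes a square window of radius 1, treats σg as a user-supplied constant, and skips neighbours that fall outside the image; the function name and the default parameter values are illustrative assumptions rather than settings from [4].

```python
import numpy as np


def fgfcm_image(img, lambda_s=3.0, lambda_g=6.0, sigma_g=10.0, radius=1):
    """Build the image g of Equation (11) from the weights of Equations (8)-(10)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.empty_like(img)
    for p in range(h):
        for q in range(w):
            num = 0.0
            den = 0.0
            for dp in range(-radius, radius + 1):
                for dq in range(-radius, radius + 1):
                    pj, qj = p + dp, q + dq
                    outside = not (0 <= pj < h and 0 <= qj < w)
                    if (dp == 0 and dq == 0) or outside:
                        continue  # S_ii = 0 (Equation (8)); borders are skipped
                    s_spatial = np.exp(-max(abs(dp), abs(dq)) / lambda_s)   # Eq. (9)
                    diff2 = (img[p, q] - img[pj, qj]) ** 2
                    s_gray = np.exp(-diff2 / (lambda_g * sigma_g ** 2))     # Eq. (10)
                    s = s_spatial * s_gray
                    num += s * img[pj, qj]
                    den += s
            # Fall back to the original value if every weight underflows.
            out[p, q] = num / den if den > 0 else img[p, q]
    return out


# A flat patch with one impulse-noise pixel in the middle (toy data).
noisy = np.full((3, 3), 10.0)
noisy[1, 1] = 200.0
smoothed = fgfcm_image(noisy)
```

Since each output pixel is a weighted average of its neighbours, g cannot leave the intensity range of the input, which is the restriction to [0, 255] noted above for 8-bit images.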

Similar efforts to improve the computational efficiency and robustness have also been reported in [25] and [5].

5 Anisotropic Mean Shift Based FCM Clustering

An approach to fuzzy c-means clustering that utilises an anisotropic mean shift algorithm coupled with fuzzy clustering was recently introduced by Zhou et al. [41, 40]. Mean shift based techniques have been shown to be capable of estimating the local density gradients of similar pixels. These gradient estimates are iteratively performed so that all pixels can find similar pixels in the same image [9, 10]. A standard mean shift approach uses radially symmetric kernels. Unfortunately, the temporal coherence will be reduced in the presence of irregular structures and noise in the image. This reduced coherence may not be properly detected by radially symmetric kernels and thus an improved mean shift approach, namely anisotropic kernel mean shift [33], provides better performance.

In mean shift algorithms the image clusters are iteratively moved along the gradient of the density function before they become stationary. Those points gathering in an outlined area are treated as the members of the same cluster. A kernel density estimate is defined by

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} K(x - x_i), \tag{12}$$

with

$$K(x) = |H|^{-0.5} K(H^{-0.5} x), \tag{13}$$

where N is the number of samples, and $x_i$ stands for a sample from an unknown density function f. K(·) is the d-variate kernel function with compact support satisfying the regularity constraints, and H is a symmetric positive definite d×d bandwidth matrix. Usually, we have $K(x) = k_e(\varphi)$, where $k_e(\varphi)$ is a convex decreasing function, e.g. for a Gaussian kernel

$$k_e(\varphi) = c_t \, e^{-\varphi/2} \tag{14}$$

and for an Epanechnikov kernel,

$$k_e(\varphi) = c_t \max(1 - \varphi, 0) \tag{15}$$

where $c_t$ is a normalising constant.

If a single global spherical bandwidth is applied, $H = h^2 I$ (I is the identity matrix), then we have

$$f(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left( \frac{x - x_i}{h} \right) \tag{16}$$
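Equation (16) with the two kernel profiles can be written down directly. The sketch below assumes that φ is the squared norm of the scaled difference (x − x_i)/h and uses the standard normalising constants for each profile, so that the estimate integrates to one; the function names are hypothetical.

```python
import math

import numpy as np


def gaussian_profile(phi, d):
    """Equation (14), with c_t = (2*pi)^(-d/2)."""
    return (2.0 * math.pi) ** (-d / 2.0) * np.exp(-phi / 2.0)


def epanechnikov_profile(phi, d):
    """Equation (15), with c_t = (d + 2) / (2 * volume of the unit d-ball)."""
    unit_ball = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return (d + 2.0) / (2.0 * unit_ball) * np.maximum(1.0 - phi, 0.0)


def kde(x, samples, h, profile=gaussian_profile):
    """Equation (16) for a spherical bandwidth h.

    x: point of shape (d,); samples: array of shape (N, d).
    """
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    phi = np.sum(((x - samples) / h) ** 2, axis=1)   # squared scaled distances
    return profile(phi, d).sum() / (n * h ** d)


# Density of two 1-D samples evaluated halfway between them (toy data).
two_samples = np.array([[-1.0], [1.0]])
density_at_zero = kde(np.array([0.0]), two_samples, h=0.5)
```

The Epanechnikov profile has compact support, so samples further than h from x contribute nothing, whereas the Gaussian profile gives every sample a (possibly tiny) weight.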

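Since the kernel can be divided into two different radially symmetric kernels, we have the kernel density estimate as

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h^{\beta}(H_i^{\alpha})^{q}} \, k^{\alpha}\big(d(c_i^{\alpha}, x_i^{\alpha}, H_i^{\alpha})\big) \, k^{\beta}\left( \left\| \frac{c_i^{\beta} - x_i^{\beta}}{h^{\beta}(H_i^{\alpha})} \right\|^2 \right) \tag{17}$$

where α and β denote the spatial and temporal components respectively and $d(c_i^{\alpha}, x_i^{\alpha}, H_i^{\alpha})$ is the Mahalanobis metric, i.e.

$$d(c_i^{\alpha}, x_i^{\alpha}, H_i^{\alpha}) = (x_i^{\alpha} - c_i^{\alpha})^{T} (H_i^{\alpha})^{-1} (x_i^{\alpha} - c_i^{\alpha}). \tag{18}$$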


Anisotropic mean shift is intended to modulate the kernels during the mean shift procedure. The objective is to keep reducing the Mahalanobis distance so as to group similar samples as much as possible. First, the anisotropic bandwidth matrix $H_i^{\alpha}$ is estimated with the following constraints:

$$\begin{cases} k_e^{\alpha}\big(d(x, x_i, H_i^{\alpha})\big) < 1 \\ k_e^{\beta}\big(\|(x - x_i)/h^{\beta}(H_i^{\alpha})\|^2\big) < 1 \end{cases} \tag{19}$$

The bandwidth matrix can be decomposed to

$$H_i^{\alpha} = \lambda V A V^{T} \tag{20}$$

where λ is a scalar, V is a matrix of normalised eigenvectors, and A is a diagonal matrix of eigenvalues whose diagonal elements $a_i$ satisfy

$$\prod_{i=1}^{p} a_i = 1 \tag{21}$$

Fig. 1. Fuzzy clustering for segmentation of skin lesions: original image (upper left), ground truth (upper right), FCM (middle left), RSFCM (middle right), EnFCM (bottom left) and AMSFCM (bottom right)


The bandwidth matrix is updated by adding more and more points to the computational list: if these points are similar in intensity or colour, then the Mahalanobis distance will be consistently reduced. Otherwise, if the Mahalanobis distance is increased, these points will not be considered in the computation.

Anisotropic mean shift based FCM (AMSFCM) proceeds in the following steps:

Step 1: Initialise the cluster centres ci. Let j = 0.
Step 2: Initialise the fuzzy partitions μij using Equation (2).
Step 3: Set j = j + 1 and compute ci using Equation (3) for all clusters.
Step 4: Update μij using Equation (2).
Step 5: For each pixel xi determine the anisotropic kernel and related colour radius using Equations (17) and (20). Note that mean shift is applied to the outcome image of FCM.
Step 6: Calculate the mean shift vector and then iterate until the mean shift, $M^{+}(x_i) - M^{-}(x_i)$, is less than a pixel, considering the previous position and a normalised position change:

$$M^{+}(x_i) = \nu M^{-}(x_i) + (1 - \nu) \, \frac{\sum_{j=1}^{N} (x_j - M^{-}(x_i)) \, \|(M^{-}(x_i^{\beta}) - x_j^{\beta})/(h^{\beta} H_j^{\alpha})\|^2}{\sum_{j=1}^{N} \|(M^{-}(x_i^{\beta}) - x_j^{\beta})/(h^{\beta} H_j^{\alpha})\|^2}$$

with ν = 0.5.
Step 7: Merge pixels with similar colour.
Step 8: Repeat Steps 3 to 6 until convergence.

In Fig. 1 we show the application of various FCM algorithms to dermoscopic images of skin lesions [40]. It can be seen that fuzzy clustering methods are able to accurately segment the skin lesion.

6 RCFCM and S-FCM Clustering

Wei and Xie [34] addressed the low convergence speed of FCM using a competitive learning approach to develop their RCFCM algorithm. The main idea of their algorithm is to magnify the largest membership degree while suppressing the second largest membership degree. This scheme has been found to accelerate the clustering convergence.

The key step of RCFCM is to add the following process after conducting steps 1 to 2 of the classical FCM:

Step 2+: Modify the membership degree matrix U(k), where U(k) = {μij}. Consider a sample $x_j$: let its degree of membership in the p-th cluster be the largest over all clusters, with the value $\mu_{pj}$, and let its degree of membership in the q-th cluster be the second largest over all clusters, with the value $\mu_{qj}$. After being modified, the degrees of membership of $x_j$ are (0 ≤ α ≤ 1):

$$\mu_{pj} = \mu_{pj} + (1 - \alpha)\mu_{qj}, \qquad \mu_{qj} = \alpha \mu_{qj} \tag{22}$$


Although RCFCM has some advantages, some problems still appear. For example, values of $\mu_{ij}$ preserve the ranks, i.e. if $d_{ij} \leq d_{rj}$, then $\mu_{ij} \geq \mu_{rj}$. These ranks reflect the relation of the sample to each cluster centre. However, if α has not been properly defined, this can result in an even slower convergence of clustering. To avoid this, Equation (22) can be modified as

$$\mu_{pj} = 1 - \alpha \sum_{i \neq p} \mu_{ij} = 1 - \alpha + \alpha \mu_{pj}, \qquad \mu_{ij} = \alpha \mu_{ij}, \; i \neq p \tag{23}$$

as is done in the S-FCM algorithm [15], which also rewards the largest membership but suppresses the others. Interestingly, if α = 0, the proposed algorithm becomes the classical hard c-means (HCM) algorithm, while if α = 1 it takes on the form of FCM. Therefore, the algorithm holds a balanced point between HCM and FCM, while the determination of α will dominate the convergence of S-FCM.
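The suppression step of Equation (23) is a simple post-processing of the membership matrix. A minimal sketch follows, assuming memberships are stored as a (clusters x samples) array whose columns sum to one; the function name is hypothetical.

```python
import numpy as np


def suppress_memberships(mu, alpha):
    """Equation (23): reward the winning cluster and scale the others by alpha.

    mu: array of shape (C, N) with columns summing to 1; 0 <= alpha <= 1.
    """
    mu = np.asarray(mu, dtype=float)
    winners = mu.argmax(axis=0)              # index p for every sample j
    cols = np.arange(mu.shape[1])
    out = alpha * mu                         # mu_ij <- alpha * mu_ij, i != p
    out[winners, cols] = 1.0 - alpha + alpha * mu[winners, cols]
    return out


# Memberships of two samples in three clusters (toy data).
mu = np.array([[0.6, 0.2],
               [0.3, 0.5],
               [0.1, 0.3]])
suppressed = suppress_memberships(mu, alpha=0.5)
```

Setting `alpha=1.0` returns the memberships unchanged (FCM), while `alpha=0.0` yields a one-hot assignment (HCM), which matches the two limiting cases noted above.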

7 Spatially Weighted FCM Clustering

Yang et al. proposed a global image clustering algorithm called SWFCM [38]. This algorithm is formulated by incorporating spatial information into the classical FCM algorithm. The weight used in the algorithm, obtained from a k-nearest neighbour classifier, is modified so as to improve the thresholding performance. A gray level histogram is used to compute the parameters of the FCM. Considering the neighbouring pixels around the central pixel, the fuzzy membership function can be extended to be

$$\mu^{*}_{ik} = \mu_{ik} \, p_{ik} \tag{24}$$

where k = 1, 2, ..., n indexes the pixels, and $p_{ik}$ is the probability of data point k belonging to cluster i. Then, the degrees of membership $\mu^{*}_{ik}$ and the cluster centers $c_i$ are updated by

$$(\mu^{*}_{ik})_b = \frac{p_{ik}}{\sum_{j=1}^{J} (d_{ik}/d_{jk})^{2/(r-1)}} \tag{25}$$

$$(c_i)_{b+1} = \frac{\sum_{k=1}^{n} \big((\mu^{*}_{ik})_b\big)^{r} x_k}{\sum_{k=1}^{n} \big((\mu^{*}_{ik})_b\big)^{r}} \tag{26}$$

The key issue here is how to define the auxiliary weight variable $p_{ik}$. The k-nearest neighbor (k-NN) algorithm [12] is used:

$$p_{ik} = \frac{\sum_{x_n \in N_k^{i}} 1/d^2(x_n, k)}{\sum_{x_n \in N_k} 1/d^2(x_n, k)} \tag{27}$$


where $N_k$ is the data set of the nearest neighbours of the central pixel k, and $N_k^{i}$ is the subset of $N_k$ referring to the data belonging to class i. Given the potential function of each feature vector, we have

$$K(x, x_k) = \frac{1}{1 + \alpha \|x - x_k\|^2} \tag{28}$$

where α is a positive constant. Hence the weight value is defined as

$$p_{ik} = \frac{\sum_{x_n \in N_k^{i}} 1/(1 + \alpha d^2(x_n, c_i))}{\sum_{x_n \in N_k} 1/(1 + \alpha d^2(x_n, c_i))} \tag{29}$$
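To make the role of the weight concrete, the sketch below evaluates the k-NN form of Equation (27) for one central pixel. It assumes that the neighbours' current class labels and their squared distances to the central pixel are already available and strictly positive; the function name and the data layout are hypothetical.

```python
def knn_weight(neighbour_labels, neighbour_sq_dists, cluster):
    """Equation (27): inverse-squared-distance vote of the neighbours for a cluster.

    neighbour_labels: class label of every neighbour in N_k.
    neighbour_sq_dists: squared distance d^2(x_n, k) of every neighbour (> 0).
    """
    inverse = [1.0 / d2 for d2 in neighbour_sq_dists]
    votes_for_cluster = sum(
        w for w, label in zip(inverse, neighbour_labels) if label == cluster
    )
    return votes_for_cluster / sum(inverse)


# Three neighbours: two belong to cluster 0, one (further away) to cluster 1.
labels = [0, 0, 1]
sq_dists = [1.0, 1.0, 2.0]
p0 = knn_weight(labels, sq_dists, cluster=0)
p1 = knn_weight(labels, sq_dists, cluster=1)
```

The weights over all clusters sum to one, so multiplying them into the memberships as in Equation (24) favours the cluster that dominates the pixel's neighbourhood.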

To prevent SWFCM from getting trapped in a local minimum, it is initialised with a fast FCM algorithm. Once FCM stops, the SWFCM algorithm continues by updating the membership function.

In related work, Cheng et al. [6] introduced the concept of fuzziness into a maximum entropy thresholding technique. Zhao et al. [39] presented a direct solution to the search for fuzzy thresholding parameters by exploiting the relationship between the fuzzy c-partition and the probability partition.

8 Lp Norm FCM Clustering

It is well known that the quality of the computed cluster centres $c_i$ can be degraded due to the effects of noise or outliers in the data sets. This occurs because $d_{ij} = \|x_j - c_i\|_2^2$, which can lead to cluster prototypes being pulled away from the main distribution of the cluster. Kersten [22] and Miyamoto and Agusta [27] independently proposed replacing $\|x_j - c_i\|_2^2$ with $\|x_j - c_i\|_1^1 = \sum_{k=1}^{s} |x_{kj} - c_{ij}|$ in order to improve robustness against outlying data.

Lp norm FCM clustering [19] is based on this observation. The objective function employed is hence formulated as

$$F_{m,p}(U, c) = \sum_{i=1}^{I} \sum_{j=1}^{J} U_{ij}^{m} \|x_j - c_i\|_p^p = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{k_0} U_{ij}^{m} |x_{kj} - c_{ij}|^{p}, \quad m > 1. \tag{30}$$

Thus, the datum-to-prototype dissimilarities can be deduced as

$$d_{ij} = \sum_{k=1}^{s} |x_{kj} - c_{ij}|^{p}, \quad i = 1, \ldots, I;\ j = 1, \ldots, J. \tag{31}$$

To compute the prototype components $c_{ij}$, we can solve the following independent univariate minimisations:

$$f_{ij}(c_{ij}) = \sum_{k=1}^{s} U_{ki}^{m} |x_{kj} - c_{ij}|^{p}, \quad i = 1, \ldots, I;\ j = 1, \ldots, J. \tag{32}$$

For p > 1, the computed value of $c_{ij}$ is taken to be a numerical approximation to the unique zero of $f'_{ij} = -p \sum_{k=1}^{s} U_{ki}^{m} |x_{kj} - c_{ij}|^{p-1} \operatorname{sign}(x_{kj} - c_{ij})$.

Lp norm FCM clustering proceeds in the following steps:

Step 1: Initialise the cluster centres $c_i$. Let l = 1.
Step 2: Initialise the fuzzy partitions $\mu_{ij}$.
Step 3: Estimate $c_i$ using Equations (30), (31) and (32).
Step 4: Repeat the above process until $\|U_{new} - U_{old}\| < 0.00001$.
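The univariate minimisation of Equation (32) used in Step 3 can be sketched as follows. This is a minimal illustration, not the solver of [19]: it reads Equation (32) as a weighted sum over the data points of one feature, takes the weights to be the memberships raised to the power m, and locates the zero of the derivative by plain bisection. The function name `lp_prototype_coordinate` and its tolerance parameters are hypothetical.

```python
import numpy as np

def lp_prototype_coordinate(x, w, p, tol=1e-10, max_iter=200):
    """Minimise sum_k w[k] * |x[k] - c|**p over the scalar c, for p > 1.

    x: values of one feature over all data points,
    w: non-negative weights (memberships raised to the power m).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)

    def derivative(c):
        diff = x - c
        return -p * np.sum(w * np.abs(diff) ** (p - 1) * np.sign(diff))

    # The derivative is <= 0 at min(x), >= 0 at max(x), and increasing
    # in between, so bisection brackets its unique zero.
    lo, hi = x.min(), x.max()
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if derivative(mid) < 0:
            lo = mid          # the minimiser lies to the right of mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For p = 2 the result coincides with the weighted mean of `x`, which recovers the usual FCM centre update and is an easy way to check the routine.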

This approach was later extended to integrate the non-Euclidean relational FCM (NERFCM) proposed in [18]. Here, the relational data is represented as

$$R = [R_{ij}] = \left[\|x_i - x_j\|_p^p\right] = \left[\sum_{i=1}^{s} |x_{ij} - x_{ik}|^{p}\right]. \tag{33}$$
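As an illustration, the relational matrix can be built directly from the feature vectors. The sketch below reads Equation (33) as the matrix of pairwise Lp distances raised to the power p; the function name `lp_relational_matrix` and the row-per-object layout of `X` are assumptions.

```python
import numpy as np

def lp_relational_matrix(X, p):
    """R[i, j] = ||x_i - x_j||_p^p for the rows x_i of X (shape (n, s))."""
    X = np.asarray(X, dtype=float)
    diff = np.abs(X[:, None, :] - X[None, :, :])   # shape (n, n, s)
    return (diff ** p).sum(axis=2)
```

The result is symmetric with a zero diagonal, as expected of a dissimilarity relation.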

To handle this data a terminal partition matrix $U^{*}$ is sought with a newer expression as follows:

$$c^{*} = \arg\min_{c} F_{m,p}(U^{*}, c). \tag{34}$$

9 Probabilistic FCM and FGcM Clustering

The original FCM attempts to minimise the objective function as follows:

$$F(L, U) = \sum_{i=1}^{I} \sum_{j=1}^{J} (\mu_{ij})^{m} d_{ij}^{2} \quad \text{subject to} \quad \sum_{i=1}^{I} \mu_{ij} = 1 \ \ \forall j. \tag{35}$$

where $L = (\beta_1, \ldots, \beta_I)$ is an I-tuple of prototypes, I is the number of classes, J is the total number of feature vectors, and U is the fuzzy c-partition matrix. This objective function can be re-formulated to satisfy specific requirements:

$$F_m(L, U) = \sum_{i=1}^{I} \sum_{j=1}^{J} (\mu_{ij})^{m} d_{ij}^{2} + \sum_{i=1}^{I} \eta_i \sum_{j=1}^{J} (1 - \mu_{ij})^{m} \tag{36}$$

where $\eta_i$ are suitable positive numbers. Minimising the left-hand side of Equation (36) with respect to U is equivalent to minimising each of the following terms separately:

$$F_m^{ij}(\beta_i, \mu_{ij}) = \mu_{ij}^{m} d_{ij}^{2} + \eta_i (1 - \mu_{ij})^{m} \tag{37}$$

Differentiating Equation (37) with respect to μij and setting it to zero leads to

$$\mu_{ij} = \frac{1}{1 + (d_{ij}^{2}/\eta_i)^{1/(m-1)}} \tag{38}$$
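The intermediate steps between Equations (37) and (38) are short and can be spelled out as follows (a sketch, assuming m > 1 and $\mu_{ij} > 0$ so that the division is valid):

$$
\begin{aligned}
\frac{\partial F_m^{ij}}{\partial \mu_{ij}} &= m\,\mu_{ij}^{m-1} d_{ij}^{2} - m\,\eta_i (1 - \mu_{ij})^{m-1} = 0 \\
\Rightarrow\quad \left(\frac{1 - \mu_{ij}}{\mu_{ij}}\right)^{m-1} &= \frac{d_{ij}^{2}}{\eta_i} \\
\Rightarrow\quad \frac{1}{\mu_{ij}} - 1 &= \left(\frac{d_{ij}^{2}}{\eta_i}\right)^{1/(m-1)} \\
\Rightarrow\quad \mu_{ij} &= \frac{1}{1 + (d_{ij}^{2}/\eta_i)^{1/(m-1)}}
\end{aligned}
$$

In particular, $\mu_{ij} = 1/2$ exactly when $d_{ij}^{2} = \eta_i$, so $\eta_i$ sets the distance at which the membership drops to one half.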

In practice, the following expression is used so as to reach a good convergence:

$$\eta_i = K \, \frac{\sum_{j=1}^{J} \mu_{ij}^{m} d_{ij}^{2}}{\sum_{j=1}^{J} \mu_{ij}^{m}} \tag{39}$$

This makes $\eta_i$ proportional to the average fuzzy intra-cluster distance of cluster $\beta_i$. K is set to be 1. Therefore, the following rule is valid:

$$\eta_i = \frac{\sum_{x_j \in (\Pi_i)_\alpha} d_{ij}^{2}}{|(\Pi_i)_\alpha|} \tag{40}$$

where $(\Pi_i)_\alpha$ is an appropriate α-cut of $\Pi_i$. In this case, $\eta_i$ is the average intra-cluster distance for all of the “good” feature vectors.

The probabilistic FCM algorithm [24] proceeds in the following steps:

Step 1: Initialise the cluster centres $c_i$. Let l = 1.
Step 2: Initialise the fuzzy partitions $\mu_{ij}$.
Step 3: Estimate $\eta_i$ using Equation (39).
Step 4: Update the prototypes using $U^{l}$.
Step 5: Compute $U^{l+1}$ using Equation (38).
Step 6: Increment l.
Step 7: If $\|U^{l-1} - U^{l}\| \geq \varepsilon$, go to Step 4; otherwise continue with Step 8.
Step 8: Re-estimate $\eta_i$ using Equation (40).

Menard et al. [26] proposed a strategy that was inspired by the work of Frieden [16] in which a unifying principle of physics, namely the extreme physical information (EPI), was developed. EPI can provide a mechanism to find the constraint terms and search for an exact solution for the unknown distribution of the measurement scenario.

Consider the problem of estimating c. Any fluctuation $y_i - c_i = x_i$ will happen with probability

$$P_i(y_i \mid c_i) = P_i(x_i), \quad x_i = y_i - c_i. \tag{41}$$

Assuming $r_i = |y - c_i|$, we then have the Fisher information according to the EPI approach

$$f[q] = -4 \sum_{i=1}^{I} \int dr_i \left(\frac{dq_i}{dr_i}\right)^{2}, \quad P_i(r_i) = q_i^{2}(r_i), \tag{42}$$

where $q_i$ is the i-th component probability amplitude for the fluctuation in the measurement. Then we have a bound information functional $F[q_i]$ which obeys

$$K[q_i] = I[q_i] - F[q_i] \tag{43}$$

The bound information functional is denoted as follows:

$$F[q_i] = 4 \int dr_i \, f_i(q_i, r_i), \tag{44}$$

where

$$f_i(q_i, r_i) = q_i^{2k}(r_i) B_i(r_i), \tag{45}$$

for some functions $B_i(r_i)$.

The proposed FGcM algorithm employs an objective function given by

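$$F^{fcm}(U, c; Y) = \sum_{i=1}^{I} \sum_{j=1}^{J} \mu_{ij}^{q} d_{ij}^{2} + \frac{1}{\lambda(q-1)} \sum_{i=1}^{I} \sum_{j=1}^{J} \mu_{ij}^{q} - \frac{1}{\lambda} \sum_{j=1}^{J} \gamma_j \left(\sum_{i=1}^{I} \mu_{ij} - 1\right) = \min. \tag{46}$$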

The second term of the above equation defines the Tsallis entropy when $\sum_{i=1}^{I} \mu_{ij} = 1$. Once minimised, Equation (46) leads to

$$\mu_{ij} = \frac{1}{Z_q} \left[1 + \lambda(q-1)\, d^{2}(y_j, c_i)\right]^{-1/(q-1)}, \quad \forall j \in [1, J],\ \forall i \in [1, I],\ q > 1, \tag{47}$$

where $Z_q = \sum_{k=1}^{I} \left[1 + \lambda(q-1)\, d^{2}(y_j, c_k)\right]^{-1/(q-1)}$ normalises the memberships of $y_j$ over the clusters. The prototype update equation can be formed as

$$c_i = \frac{\sum_{j=1}^{J} \mu_{ij}^{q} y_j}{\sum_{j=1}^{J} \mu_{ij}^{q}}. \tag{48}$$

When q → 1, FGcM has the same objective function and algorithm as FCM with the regularization approach.
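Equations (47) and (48) translate into a compact update step. The following sketch assumes squared Euclidean distances and takes $Z_q$ to be the sum over the clusters, so that the memberships of each data point add up to one as required by the constraint in Equation (46). The function name `fgcm_update` and the default parameter values are illustrative only.

```python
import numpy as np

def fgcm_update(Y, centres, lam=1.0, q=2.0):
    """One FGcM-style iteration: Equation (47) followed by Equation (48).

    Y: (J, s) data, centres: (I, s) prototypes, lam: lambda > 0, q > 1.
    """
    # d2[i, j] = squared Euclidean distance between prototype i and datum j
    d2 = ((centres[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    g = (1.0 + lam * (q - 1.0) * d2) ** (-1.0 / (q - 1.0))
    U = g / g.sum(axis=0, keepdims=True)     # Z_q normalises over the clusters
    W = U ** q
    new_centres = (W @ Y) / W.sum(axis=1, keepdims=True)   # Eq. (48)
    return U, new_centres
```

Iterating this step until the prototypes stop moving gives a simple alternating optimisation in the spirit of the FCM loop.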

10 Discussion and Conclusions

In this chapter we provided an overview of image clustering strategies based on fuzzy c-means (FCM). The conventional FCM approach is similar to the hard c-means algorithm (HCM) in that it arrives at a solution through iterative refinement of cluster prototypes, yet in contrast to HCM it also allows partial membership to clusters, which in turn leads to an improved clustering performance. Many variants of FCM image clustering have been introduced in the literature and in this chapter we reviewed some of the more important ones that try to either speed up the clustering process or improve the quality of the resulting clusters.

Multistage random sampling FCM starts with random sampling of a small subset of the full image. This step is intended to explore the cluster centres of the entire image. Ideally, this sub-sampling scheme should maximally reduce the number of iterations needed for convergence if the sampled subset has characteristics similar to those of the entire image. However, this requirement cannot be guaranteed, especially in complex images. Consequently, the prior sub-sampling may result in an incomplete learning process, and in turn in slow clustering for the entire image.

EnFCM or fast generalised FCM techniques add a term to the original clustering function so as to associate a pixel with its neighbourhood. To consider the effect of the neighbours, a window needs to be defined beforehand. Determination of the window size is image-dependent and hence this parameter may affect the final outcomes of clustering efficiency and accuracy.

Comparatively, mean shift FCM leads to similar clustering outcomes but provides slightly faster clustering. The success of the new scheme is due to the fact that the anisotropic kernel used allows us to dynamically update the state parameters and achieve fast convolution by the anisotropic kernel function.

RCFCM and S-FCM add a stage before a classical FCM starts in order to magnify the largest membership degree while suppressing the second largest membership degree. By doing this, the clustering convergence can be further accelerated.

In SWFCM, spatial information is taken into account in the classical FCM algorithm. The weight used in the algorithm is determined using a k-nearest neighbour classifier. As a simple gray level histogram is utilised to compute the parameters, it provides fast convergence.

The Lp norm FCM algorithm allows the dissimilarity variables to be changed and hence has better performance in the presence of outliers.

Probabilistic FCM algorithms add a regularisation term to the objective function, which dynamically updates the entire function during the optimisation of FCM. Using the concept of extreme physical information, FGcM incorporates a mechanism to find the constraint terms and search for an exact solution for the unknown distribution of the measured scenario.

References

1. Ahmed, M., Yamany, S., Mohamed, N., Farag, A., Moriarty, T.: A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans. Medical Imaging 21, 193–199 (2002)

2. Auephanwiriyakul, S., Keller, J.M.: Analysis and efficient implementation of a linguistic fuzzy c-means. IEEE Trans. Fuzzy Systems, 563–581 (2002)

3. Bezdek, J.: A convergence theorem for the fuzzy isodata clustering algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 2, 1–8 (1980)

4. Cai, W., Chen, S., Zhang, D.: Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognition 40(3), 825–838 (2007)

5. Chen, S.C., Zhang, D.Q.: Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure. IEEE Trans. Systems, Man and Cybernetics - Part B: Cybernetics 34, 1907–1916 (2004)

6. Cheng, H.D., Chen, J., Li, J.: Thresholding selection based on fuzzy c-partition entropy approach. Pattern Recognition 31, 857–870 (1998)

7. Cheng, T., Goldgof, D., Hall, L.: Fast fuzzy clustering. Fuzzy Sets and Systems 93, 49–56 (1998)

8. Chuang, K., Tzeng, S., Chen, H., Wu, J., Chen, T.: Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics 30, 9–15 (2006)

9. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: 7th Int. Conference on Computer Vision, pp. 1197–1203 (1999)

10. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 24, 603–619 (2002)

11. Coppi, R., D’Urso, P.: Three-way fuzzy clustering models for LR fuzzy time trajectories. Computational Statistics & Data Analysis 43, 149–177 (2003)

12. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Information Theory 13, 21–27 (1967)

13. Dave, R.N., Krishnapuram, R.: Robust clustering methods: a unified view. IEEE Trans. Fuzzy Systems 5, 270–293 (1997)

14. Eschrich, S., Ke, J., Hall, L., Goldgof, D.: Fast accurate fuzzy clustering through data reduction. IEEE Trans. Fuzzy Systems 11, 262–270 (2003)

15. Fan, J.-L., Zhen, W.Z., Xie, W.X.: Suppressed fuzzy c-means clustering algorithm. Pattern Recognition Letters 24, 1607–1612 (2003)

16. Frieden, B.: Physics from Fisher information, A Unification. Cambridge University Press, Cambridge (1999)

17. Gowda, K.C., Diday, E.: Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24, 567–578 (1991)

18. Hathaway, R.J., Bezdek, J.C.: NERF c-means: Non-euclidean relational fuzzy clustering. Pattern Recognition 27, 429–437 (1994)

19. Hathaway, R.J., Bezdek, J.C., Hu, Y.: Generalised fuzzy c-means clustering strategies using lp norm distance. IEEE Trans. Fuzzy Systems 8, 576–582 (2000)

20. Hu, R., Hathaway, L.: On efficiency of optimization in fuzzy c-means. Neural, Parallel and Scientific Computation 10, 141–156 (2002)

21. Hung, W.-L., Yang, M.-S.: Similarity measures of intuitionistic fuzzy sets based on hausdorff distance. Pattern Recognition Letters 25(14), 1603–1611 (2004)

22. Kersten, P.R.: Implementing the fuzzy c-medians clustering algorithm. In: IEEE Conf. Fuzzy Syst., pp. 957–962 (1997)

23. Kolen, J., Hutcheson, T.: Reducing the time complexity of the fuzzy c-means algorithm. IEEE Trans. Fuzzy Systems 10(2), 263–267 (2002)

24. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy Systems 1, 98–110 (1993)

25. Leski, J.: Toward a robust fuzzy clustering. Fuzzy Sets and Systems 137, 215–233 (2003)

26. Menard, M., Courboulay, V., Dardignac, P.-A.: Possibilistic and probabilistic fuzzy clustering: unification within the framework of the non-extensive thermostatistics. Pattern Recognition 36(6), 1325–1342 (2003)

27. Miyamoto, S., Agusta, Y.: An efficient algorithm for l1 fuzzy c-means and its termination. Contr. Cybern. 25, 421–436 (1995)

28. Pedrycz, W., Bezdek, J.C., Hathaway, R.J., Rogers, G.W.: Two nonparametric models for fusing heterogeneous fuzzy data. IEEE Trans. Fuzzy Systems 6, 411–425 (1998)

29. Sato, M., Sato, Y.: Fuzzy clustering model for fuzzy data. In: IEEE Int. Conference on Fuzzy Systems, pp. 2123–2128 (1995)

30. Szilagyi, L., Benyo, Z., Szilagyi, S.M., Adam, H.S.: MR brain image segmentation using an enhanced fuzzy c-means algorithm. In: 25th IEEE Int. Conference on Engineering in Medicine and Biology, vol. 1, pp. 724–726 (2003)

31. Takata, O., Miyamoto, S., Umayahara, K.: Clustering of data with uncertainties using hausdorff distance. In: 2nd IEEE Int. Conference on Intelligence Processing Systems, pp. 67–71 (1998)

32. Takata, O., Miyamoto, S., Umayahara, K.: Fuzzy clustering of data with uncertainties using minimum and maximum distances based on l1 metric. In: Joint 9th IFSA World Congress and 20th NAFIPS International Conference, pp. 2511–2516 (2001)

33. Wang, J., Thiesson, B., Xu, Y., Cohen, M.: Image and video segmentation by anisotropic kernel mean shift. In: 8th European Conference on Computer Vision, pp. 238–249 (2004)

34. Wei, L.M., Xie, W.X.: Rival checked fuzzy c-means algorithm. Acta Electronica Sinica 28(7), 63–66 (2000)

35. Yang, M.-S., Ko, C.-H.: On a class of fuzzy c-numbers clustering procedures for fuzzy data. Fuzzy Sets and Systems 84(1), 49–60 (1996)

36. Yang, M.-S.: Fuzzy clustering procedures for conical fuzzy vector data. Fuzzy Sets and Systems 106(2), 189–200 (1999)

37. Yang, M.S., Hwang, P.Y., Chen, D.H.: Fuzzy clustering algorithms for mixed feature variables. Fuzzy Sets and Systems 141, 301–317 (2004)

38. Yong, Y., Chongxun, Z., Pan, L.: A novel fuzzy c-means clustering algorithm for image thresholding. Measurement Science Review 4, 11–19 (2004)

39. Zhao, M.S., Fu, A.M.N., Yan, H.: A technique of three level thresholding based on probability partition and fuzzy 3-partition. IEEE Trans. Fuzzy Systems 9, 469–479 (2001)

40. Zhou, H., Schaefer, G., Sadka, A., Celebi, M.E.: Anisotropic mean shift based fuzzy c-means segmentation of dermoscopy images. IEEE Journal of Selected Topics in Signal Processing 3(1), 26–34 (2009)

41. Zhou, H., Schaefer, G., Shi, C.: A mean shift based fuzzy c-means algorithm for image segmentation. In: 30th IEEE Int. Conference Engineering in Medicine and Biology, pp. 3091–3094 (2008)

Author Index

Banerjee, Soumya 275
Bouchachia, Abdelhamid 237

Ceberio, Martine 27, 133
Chrysostomou, Chrysostomos 197

El-Dahshan, El-Sayed A. 275

Gamez, J. Esteban 53

Hassanien, Aboul Ella 275
Hui, C. 175

Jiang, Wenxin 259

Kosheleva, Olga 53
Kreinovich, Vladik 27, 53, 133

Magoc, Tanja 133
Modave, Francois 27, 53, 133

Nakamatsu, Kazumi 75
Nguyen, Hung T. 27, 53

Peters, James F. 3
Pitsillides, Andreas 197

Ras, Zbigniew W. 259
Radi, Amr 275

Schaefer, Gerald 295

Wieczorkowska, Alicja 259

Zeephongsekul, P. 111
Zhou, Huiyu 295