
THE LAMSTAR NEURAL NETWORK: A BRIEF REVIEW

Daniel Graupe
Department of Electrical & Computer Engineering
University of Illinois, Chicago, IL 60607-7053

ABSTRACT

This paper reviews the principles and several different applications of the LAMSTAR (Large Memory Storage and Retrieval) Neural Network. The LAMSTAR was specifically developed for application to problems involving very large memory that relates to many different categories (attributes), where some of the data are exact while other data are fuzzy and where, for a given problem, some data categories may be totally missing. Consequently, the network has been successfully applied to many decision, diagnosis and recognition problems in various fields. The LAMSTAR network is very fast and can grow/shrink in dimensionality with no reprogramming. It is capable of forgetting, of interpolation/extrapolation and of changing its resolution. The network employs standard perceptron-like neurons that are arranged in many SOM (Self-Organizing Map) Kohonen modules (layers). Their SOM structure implies that their neurons are WTA (Winner-Take-All) neurons whose memory is stored in BAM fashion (Bidirectional Associative Memory). However, the LAMSTAR network differs from most neural networks in that a key feature of the LAMSTAR is its employment of link weights between the neurons of the various SOM modules and between these and the neurons of SOM-type output modules. Hence, learning takes place both in the setting of memory-storage weights and in the setting of link-weight matrices (inter-relation or correlation weights, or Verbindungen in Kantian terms). Decisions are therefore based not on the memory values alone, but on both the memory elements (stored values) and the connections (relations) between memory elements. Link weights are learnt by reinforcement, in a Hebbian manner. The resulting linkage maps between winning neurons in the various modules thus mirror the "understanding" of the data. The network's output modules can operate in closed loop, to intelligently request additional data and to sort it out when fed back into the network. The forgetting feature of the LAMSTAR plays an important role in maintaining the network's efficiency. Seven different applications to medical diagnosis, to college-course evaluation analysis, to control and to speech recognition are reviewed.

KEYWORDS

Neural networks, NN applications, medical diagnosis, industrial fault analysis, speech recognition, browsing, Kohonen layers, self-organizing maps, back-propagation, counter propagation, winner-take-all, link weights, Hebbian principle, large memory storage and retrieval NN

Address for communication: Daniel Graupe, University of Illinois, EECS Dept., 851 South Morgan St., Room 1117, Chicago, IL 60607-7053. PH: (312) 996-3085, FX: (312) 413-0024, EMAIL: [email protected]


I. Introduction

Research on Neural Networks has been ongoing since the 1940's [1], in order to model and grossly simulate the biological central nervous system (CNS) on the one hand, and in order to develop computational tools that can take advantage of the remarkable computational capabilities and efficiency of the CNS on the other hand. When observing that a simple house-fly, with only a few hundred neural cells, with signal propagation speeds averaging 3 meters/second and with bit rates of the order of 100 Hz, can compute flight trajectories to evade a human hand trying to catch it, one can understand the potential involved in imitating the biological computation system. This computational ability is achieved even though the average house-fly probably holds no Ph.D. in mathematics or in computer science. Indeed, the biological neural network is strikingly efficient in its recognition and retrieval capabilities. Its abilities of generalization, and of dealing with non-analytical, incomplete yet huge databases, within a virtually fixed architecture that involves no reprogramming when moving from one class of tasks to another, are well beyond those of any other computational architecture. The latter computational tasks, involving huge databases with partly missing data sets and where data are in part non-analytical and/or fuzzy and/or stochastic, are the main challenges for today's computer science. This is indeed the motivation for presenting the large-scale neural network of the present article.

The major principles of neural networks (NN's) are common to practically all NN approaches, including the LAMSTAR NN on which we focus below. First to be considered is the model of the elementary unit or cell (neuron) employed in all NN's. This is nothing but a simplified mathematical model of the biological neuron, which was first formulated by Rosenblatt [2]. Accordingly, if the N inputs into a given neuron (from other neurons, or from sensors or transducers at the input to the whole network or to part of it) are denoted as x(i); i = 1, 2, ..., N, and if the (single) output of that neuron is denoted as y, then

y = f\left[\sum_{i=1}^{N} w_i x_i\right]    (1)

where f[.] is a nonlinear function, denoted as the Activation Function, that can be considered as a (hard or soft) binary (or bipolar) switch [3]. The weights w(i) of eqn. (1) are the weights assigned to the neuron's inputs, and their setting is the learning action of the NN. The model of eqn. (1) is often known as the Perceptron model [2]. It is a simplified model of the biological neuron, where the inputs are trains of electrical impulses arriving at the neuron's dendrites, whereas the neuron's (single) output is an all-or-nothing output at the pre-synaptic region of the same neuron. The weights w(i) are input weights whose embodiment is in terms of the chemistry at the synaptic connection and over the very narrow gap that separates the dendrites from the pre-synaptic region of another neuron whose output forms the input in question. Each neuron thus consists of N inputs (N not being fixed over all neurons) and a single output which is transferred from the given neuron to many other neurons (not to all), thus forming the neural networking structure. The other main principles of networking are that neural firing (output production) is of an all-or-nothing nature, and that the weights w(i) often constitute the memory itself.
Hence, a vector x_j of memory (of input to be stored, say, for further computation) will be stored in the weights w(i,j) of vector w_j relating to the j'th neuron, if the distance satisfies:

d(j,j) = \|x_j - w_j\| \le \|x_j - w_k\| = d(j,k), \quad \forall k \ne j    (2)


Such storage is known as Associative Memory (AM) or BAM (Bidirectional Associative Memory) storage [4]. Also, a WTA (Winner-Take-All) principle is often employed [5], such that an output (firing) is produced only at the winning neuron (say, neuron j, satisfying eqn. (2) above), namely the neuron whose weights are closest to the input vector x_j when that vector is applied to several neurons during a memory search/retrieval task. Another principle, derivable from Hebb's Law [6], is that there are interconnecting weights (link weights) that adjust and serve to establish flow of neuronal signal traffic between groups of neurons, such that when a certain neuron fires very often in close time proximity (regarding a given situation/task), then the interconnecting link weights (not the memory-storage weights) relating to that traffic increase as compared to other interconnections [3,7]. The above principles of Associative Memory, of WTA and of link weights are observable in biological networks to at least some degree. However, not all NN architectures employ all of them. Of the classical NN's, we briefly comment on the following:

(1) The Back-Propagation NN (BP) is essentially a multi-layer Perceptron. It employs a Dynamic-Programming-based algorithm [8] for weight setting, which is not an AM or WTA setting, nor does it employ link weights. This NN is mathematically rigorous and well suited for well-formulated analytical problems. Like Dynamic Programming, it suffers from the curse of dimensionality and requires full data sets.

(2) Hopfield Nets (HN) [9] are fully-connected recurrent NN's, using AM (BAM), but no WTA, nor do they employ traffic-based link weights. They are also rigorous but suffer from the curse of dimensionality regarding the number of categories they can handle efficiently, and are sensitive to incomplete data sets.

(3) Counter Propagation NN's (CP) [10] are AM-based WTA networks, using a self-organizing map (SOM) layer. They are fast to compute. They employ no traffic-based weights but are Hebbian in their nature. They play an important role in the LAMSTAR neural network described below.

The LAMSTAR network discussed below, while using the neural element's structure common to all neural networks, and while employing SOM (as in CP) and the related WTA and AM principles (as in CP, HN and other networks) for storing memory elements, differs from the other networks by employing link weights that weight the interrelations (correlations) between neural cells. It uses for its decision and browsing not just the memory values but also their interrelations (Verbindungen, as Kant [11] called them). The LAMSTAR's understanding is thus based on memories and on the relations between them. These relations (link weights) are thus fundamental to its operation. They are very much Hebbian intersynaptic weights [6] as above, and fit recent results from brain research [12].
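To make eqns. (1) and (2) concrete, the following minimal sketch (Python/NumPy; purely illustrative, with all names and values assumed rather than taken from the references) computes a single perceptron output and selects a winner-take-all neuron by the distance criterion of eqn. (2):

```python
import numpy as np

def perceptron_output(x, w, f=np.sign):
    """Eqn (1): y = f[ sum_i w(i) x(i) ], here with a hard bipolar activation as f."""
    return f(np.dot(w, x))

def wta_winner(x, W):
    """Eqn (2): the winning neuron j is the one whose stored weight vector
    w(j) is closest (in Euclidean norm) to the input vector x."""
    distances = np.linalg.norm(W - x, axis=1)   # ||x - w(k)|| for every neuron k
    return int(np.argmin(distances))

# Illustrative use: 3 neurons, each storing a 4-dimensional memory vector.
W = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
x = np.array([0.9, 0.1, 0.1, 0.8])
print(perceptron_output(x, W[0]))   # output of a single neuron: sign(1.7) -> 1.0
print(wta_winner(x, W))             # index of the closest stored pattern -> 0
```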

II. Basic Principles of the LAMSTAR Neural Network

The present paper concentrates on a neural network specifically designed for application to retrieval, diagnosis, classification, prediction and decision problems which involve a very large number of categories. The resulting LAMSTAR (LArge Memory STorage And Retrieval) neural network [3,13,14,15] is designed to store and retrieve patterns in a computationally efficient manner, using tools of neural networks, especially SOM (Self-Organizing Map)-based network modules [5], combined with statistical decision tools. By its structure as described in Section III, the LAMSTAR network is uniquely suited to deal with analytical and non-analytical problems [13,14,15] where data are of many vastly different
categories and where some categories may be missing, where data are both exact and fuzzy, and where the vastness of the data requires very fast algorithms. These features are rarely found together in other neural networks. The network can be viewed as an intelligent expert system, where expert information is continuously being ranked for each case through learning and correlation. What is unique about the LAMSTAR network is its capability to deal with non-analytical data, which may be exact or fuzzy and where some categories may be missing. These characteristics are facilitated by the network's features of forgetting, interpolation and extrapolation. These allow the network to zoom out of stored information via forgetting, while still being able to approximate forgotten information by extrapolation or interpolation. We shall show below that the LAMSTAR is equally powerful in many other decision and recognition applications in a wide range of areas.

The basic storage modules of the LAMSTAR network are modified Kohonen SOM modules [5] that are BAM-based WTA, as discussed in Sect. I above. In the LAMSTAR network the information is stored and processed via correlation links between individual neurons in separate SOM modules. Its ability to deal with a large number of categories is partly due to its very simple calculation of link weights and to its features of forgetting and of recovery from forgetting. The link weights are the main engine of the network, connecting many layers of SOM modules such that the emphasis is on the (co)relation of link weights between atoms of memory, not on the memory atoms (BAM weights of the SOM modules) themselves. In this manner, the design becomes closer to knowledge processing in the biological central nervous system than is the practice in most conventional artificial neural networks. The forgetting feature, too, is a basic feature of biological networks whose efficiency depends on it, as is the ability to deal with incomplete data sets.

The input word is a coded real vector X given by:

X = [x_1^T, x_2^T, \ldots, x_N^T]^T    (3)

where T denotes transposition, the x_i being subvectors (subwords describing categories or attributes of the input word). In the training phase, the input word is augmented by a subset of subwords that represents the desired output of the network (diagnosis/decision). Each subword x_i is channeled to a corresponding i'th SOM module that stores data concerning the i'th category of the input word. The network is organized to find a neuron in a set of neurons of a class (namely, in one SOM module) that best matches (correlates with) the input pattern, in WTA manner.

III. An Outline of the LAMSTAR Network

III.A. Basic Structural Elements

The SOM structure employed in the LAMSTAR system adheres to the fundamentals of the SOM structure but differs in details. Whereas in Kohonen's networks [5] all neurons of an SOM module are checked, in the LAMSTAR network only a finite group of p neurons is checked at a time, due to the huge number of neurons involved (the large memory involved). The final set of p neurons is determined by the weights (N_i) as shown in Figures 1 and 2. A winning neuron is determined for each input based on the similarity between the input (vector X in Fig. 2) and a weight vector W (stored information). For an input subword x_i, the winning neuron is determined by minimization of a distance norm || * || given by:

\|x_i - w_{i,m}\| = \min_k \|x_i - w_{i,k}\|, \quad \forall k \in \{l, \ldots, l+p\}; \; l \sim N_{i,j}    (4)
where:
m : the winning unit (neuron) in the i'th SOM module (WTA);
N_{i,j} : the weights that determine the neighborhood of top priority in SOM module i;
l : the first neuron to be scanned (determined by the weights N_{i,j});
~ : denotes proportionality.

III.B. Adjustment of Resolution in SOM Modules

Eqn. (4), which serves to determine the winning neuron, does not deal effectively with the resolution of close clusters/patterns. This may lead to degraded accuracy in the decision-making process when the decision depends on local and closely related patterns/clusters which lead to different diagnoses/decisions. The local sensitivity of neurons in SOM modules can be adjusted by incorporating an adjustable maximal Hamming distance function d_max, as in eqn. (5):

d_{max} = \max[d(x_i, w_i)]    (5)

Consequently, if the number of subwords stored in a given neuron (of the appropriate module) exceeds a threshold value, then storage is divided into two adjacent storage neurons (i.e. a new-neighbor neuron is set) and d_max is reduced accordingly.

III.C. Links Between SOM Storage Modules (L-weights)

Information in the LAMSTAR system is encoded via correlation links L_{i,j} (Fig. 1, 2) between individual neurons in different SOM modules (denoted as internal links) for the storage of input subwords. The LAMSTAR system does not create neurons for an entire input word. Instead, only individual subwords are stored in BAM-like manner in SOM modules (W weights), and correlations between subwords are stored in terms of creating/adjusting L-links (L_{i,j} in Fig. 1, 2) that connect neurons in different SOM modules. This allows the LAMSTAR network to be trained with partially incomplete data sets. The L-links are fundamental in allowing interpolation and extrapolation of patterns (when a neuron in an SOM module does not correspond to an input subword but is highly linked to other modules, it serves as an interpolated estimate). When a new input word is presented to the system during the training phase, the LAMSTAR network inspects all weight vectors (w_i) in the SOM module i that corresponds to the input subword x_i that is to be stored. If any stored pattern matches the input subword x_i within a preset tolerance, the system updates the weights W according to the following procedure:

w_{i,m}(t+1) = w_{i,m}(t) + \alpha_i [x_i(t) - w_{i,m}(t)], \quad \text{for } m: \epsilon_m < \epsilon \text{ (const.)}    (6)

where:
w_{i,m}(t+1) : the modified weights in module i for neuron m;
\alpha_i : the learning coefficient for module i;
\epsilon_m : the minimum error over all weight vectors W_i in module i (eqn. 2);
t : the sequential number of the iteration (time equivalent).

If no match was found, the system creates a new pattern in the SOM module: it stores the input subword x_i as a new pattern w_{i,n}, where subscript n denotes the first unused neuron in the i'th SOM module. We repeat the above storage procedure for every input subword x_i to be stored. Link weight values L are then set such that, for a given input word, after determining a winning k'th neuron in module i and a winning m'th neuron in module j, the link weight L^{k,m}_{i,j} is
counted up by an increment ΔL, whereas all other links L^{s,v}_{i,j} are reduced by a very small forgetting increment (Fig. 2) [3, 7, 16]. The values of the L-link weights are modified according to:

L^{k,m}_{i,j}(t+1) = L^{k,m}_{i,j}(t) + \Delta L, \quad L^{k,m}_{i,j} \le L_{max}    (7a)

L^{s,v}_{i,j}(t+1) = L^{s,v}_{i,j}(t) - f(t), \quad L^{s,v}_{i,j} \ge 0    (7b)

where:
L^{k,m}_{i,j} : the link between winning neuron i in the k'th module and winning neuron j in the m'th module;
ΔL : the increment value;
L_max : the maximal link value;
f(t) : some low increment value that determines the forgetting rate as a function of time.

The link weights then serve as address correlations [7] to evaluate traffic rates between neurons [3,16]. See Fig. 2. The L link weights above thus serve to guide the storage process and to speed it up in problems involving very many subwords (patterns) and huge memory in each such pattern. They also serve to exclude patterns that totally overlap, such that one (or more) of them is redundant and need be omitted.

III.D. The Forgetting Feature

Noting the learning formulae of eqns. (7a) and (7b), link weights L_{i,j} decay over time. Hence, if not chosen successfully, the appropriate L_{i,j} will drop towards zero. Therefore, correlation links L which do not participate in successful diagnoses/decisions over time, or which lead to an incorrect diagnosis/decision, are gradually forgotten. The forgetting feature allows the network to rapidly retrieve very recent information. Since the value of these links decreases only gradually and does not drop immediately to zero, the network can re-retrieve information associated with those links. The forgetting feature of the LAMSTAR network helps to avoid the need to consider a very large number of links, thus contributing to the network's efficiency. The forgetting feature requires storage of link weights and numbering of input words. Hence, in the simplest application of forgetting, old link weights are forgotten (subtracted from their current value) after, say, every M input words. The forgetting can be applied gradually rather than stepwise.

III.E. Retrieval of Information in the LAMSTAR Network

III.E.1. Input Word for Training and Information Retrieval

In applications such as medical diagnosis, the LAMSTAR system is trained by entering the symptoms/diagnosis pairs (or diagnosis/medication pairs). The training input vectors X are of the following form:

X = [x_1^T, x_2^T, \ldots, x_n^T, d_1^T, \ldots, d_k^T]^T    (8)

where x_i are input subwords and d_i are subwords representing the output of the network (diagnosis/decision). In the processing of data (storage and retrieval), the diagnosis subwords (d in eqn. 8) are processed in the same manner as other subwords, namely, all punishment/reward feedbacks also apply to the diagnosis subwords. Therefore, one or more SOM modules serve as output modules to output the LAMSTAR's decision/diagnosis. The input word of eqns. (3) and (8) is set to be a coded word (Section III.A), comprising coded vector subwords (x_i) that relate to various categories (input dimensions). Also, each SOM module
of the LAMSTAR network corresponds to one of the categories of x_i, such that the number of SOM modules equals the number of subvectors (subwords) x_n and d in X defined by eqn. (8).

III.F. Links for Output Modules for Determination of the Winning Decision

The diagnosis/decision at the output SOM modules is found by analyzing the correlation links L between the diagnosis/decision neurons in the output SOM modules and the neurons in all input SOM modules selected and accepted by the process outlined in Section III.E. The winning neuron (diagnosis/decision) in an output SOM module is the neuron with the highest cumulative value of the links L connecting to the selected (winning) input neurons in the input modules. The diagnosis/detection formula for output SOM module i is given by:

\sum_{k_w}^{M} L^{i,n}_{k_w} \ge \sum_{k_w}^{M} L^{i,j}_{k_w}, \quad \forall j \ne n    (9)

where:
i : the i'th output module;
n : the winning neuron in the i'th output module;
k_w : the winning neuron in the k'th input module;
M : the number of input modules;
L^{i,j}_{k_w} : the link weight between the winning neuron in input module k and neuron j in the i'th output module.

The L weights above are derived in the same manner as the L weights between the input SOM modules, while success/failure is now being trained (and updated) by the task-evaluation unit.

III.G. LAMSTAR Processing Algorithm for Data Analysis

Since all information in the LAMSTAR network is encoded in the correlation links, the LAMSTAR can be utilized as a data analysis tool. In this case the system provides analysis of the input data, such as evaluating the importance of input subwords, the strengths of correlation between categories, or the strengths of correlation between individual neurons. The system's analysis of the input data involves two phases: (1) training of the system (as outlined in Section III), and (2) analysis of the values of the correlation links created during the training. Since the correlation links connecting clusters (patterns) among categories are modified (increased/decreased) in the training phase, it is possible to single out the links with the highest values. Therefore, the clusters connected by the links with the highest values determine the trends in the input data. In contrast to data-averaging methods, isolated cases of the input data will not affect the LAMSTAR results, noting its forgetting feature. Furthermore, the LAMSTAR structure makes it very robust to missing input subwords. After the training phase is completed, the LAMSTAR system finds the highest correlation links and reports messages associated with the clusters in SOM modules connected by these links. The links can be chosen by two methods: (1) links with values exceeding a pre-defined threshold, or (2) a pre-defined number of links with the highest values. An example of this analysis capability is given in Section IV-F (College-Course Evaluation Application).
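To tie eqns. (4), (6), (7a), (7b) and (9) together, the sketch below gives one compact, purely illustrative training/retrieval step. It is not the authors' implementation: all class names, sizes, tolerances and rates are assumptions, and the restricted p-neuron neighborhood search of eqn. (4), the resolution adjustment of eqn. (5) and the correlation layers are omitted for brevity.

```python
import numpy as np

ALPHA, TOL = 0.2, 0.3                  # learning coefficient alpha_i, storage tolerance
DELTA_L, L_MAX, FORGET = 0.05, 1.0, 0.001

class Module:
    """One SOM storage module: each row of W is a stored subword pattern."""
    def __init__(self, dim):
        self.W = np.empty((0, dim))

    def winner(self, x):
        """Eqn (4): winning neuron = stored pattern closest to subword x.
        If the match is within tolerance, update it per eqn (6); otherwise
        store x in the first unused neuron."""
        if len(self.W):
            m = int(np.argmin(np.linalg.norm(self.W - x, axis=1)))
            if np.linalg.norm(self.W[m] - x) < TOL:
                self.W[m] += ALPHA * (x - self.W[m])           # eqn (6)
                return m
        self.W = np.vstack([self.W, x])                         # new neuron
        return len(self.W) - 1

def train_step(modules, links, subwords, desired_out):
    """links[k][i, j]: link from neuron i of input module k to output neuron j."""
    winners = [mod.winner(x) for mod, x in zip(modules, subwords)]
    for k, Lk in enumerate(links):
        w = winners[k]
        rewarded = Lk[w, desired_out] + DELTA_L                 # eqn (7a): reward winner
        Lk -= FORGET                                            # eqn (7b): forget the rest
        Lk[w, desired_out] = rewarded
        np.clip(Lk, 0.0, L_MAX, out=Lk)                         # keep links in [0, L_max]
    return winners

def decide(modules, links, subwords):
    """Eqn (9): output neuron with the largest summed links from the winning input neurons."""
    winners = [mod.winner(x) for mod, x in zip(modules, subwords)]
    return int(np.argmax(sum(links[k][winners[k], :] for k in range(len(links)))))

# Illustrative use: 2 input modules (3-dimensional subwords), 2 output decisions.
modules = [Module(3), Module(3)]
links = [np.zeros((8, 2)), np.zeros((8, 2))]   # link arrays pre-sized for up to 8 neurons
word = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
train_step(modules, links, word, desired_out=1)
print(decide(modules, links, word))            # -> 1
```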


III.G.1. Feature Extraction and Reduction

Features can be extracted and reduced in the LAMSTAR network according to the Theorems below.

DEFINITION: A feature can be extracted by the matrix A(i,j), where i denotes a winning neuron in SOM storage module j. All winning entries are 1 while the rest are 0. Furthermore, A(i,j) can be reduced via Theorems II, III, IV, V below.

THEOREM I: The most (least) significant subword (winning memory neuron) {i} over all SOM modules (i.e., over the whole NN) with respect to a given output decision {d_k} and over all input words, denoted as [i*, s*/d_k], is given by:

[i*, s*/d_k]: L(i,s/d_k) ≥ L(j,p/d_k) for any winning neuron {j} in any module {p},    (10)

where p is not equal to s, L(j,p/d_k) denoting the link weight between the j'th (winning) neuron in layer p and the winning output-layer neuron d_k. Note that for determining the least significant neuron, the inequality above is reversed.

THEOREM II: The most (least) significant SOM module {s**} per a given winning output decision {d_k}, over all input words, is given by:

s**(d_k): \sum_i L(i,s/d_k) \ge \sum_j L(j,p/d_k) for any module p    (11)

Note that for determining the least significant module, the inequality above is reversed.

THEOREM III: The neuron {i**(d_k)} that is most (least) significant in a particular SOM module (s) per a given winning output decision (d_k), over all input words per a given class of problems, is given by i*(s,d_k) such that:

L(i,s/d_k) ≥ L(j,s/d_k) for any neuron (j) in the same module (s)    (12)

Note that for determining the least significant neuron in module (s), the inequality above is reversed.

THEOREM IV (Redundancy Theorem via Internal Links): If the link weight L(p,a/q,b) from any neuron {p} in layer {a} to some neuron {q} in layer {b} is very high, WHILE it is (near) zero to EVERY OTHER neuron in layer {b}, we denote the neuron {q} in layer {b} as q(p). Now, IF this holds for ALL neurons {p} in layer {a} which were ever selected (declared winners), THEN layer {b} is REDUNDANT, as long as the number of neurons {p} is larger than or equal to the number of {q(p)}, AND layer {b} should be removed.

COROLLARY: If the number of {q(p)} neurons is less than the number of {p} neurons, then layer {b} is called an INFERIOR LAYER to {a}. Also see Theorem IX below on redundancy determination via correlation-layers.

THEOREM V (Zero-Information Redundancy): If only one neuron is ALWAYS the winner in layer (k), regardless of the output decision, then the layer contains no information and is redundant.
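As a rough illustration of how Theorems I-III can be evaluated once training is complete (the array layout below is an assumption made for illustration, not the paper's data structure):

```python
import numpy as np

# L_out[s][i] : assumed layout holding the link weight L(i, s / d_k) from neuron i
# of SOM module s to the winning output-decision neuron d_k.
L_out = [np.array([0.1, 0.7, 0.0]),        # module 0
         np.array([0.2, 0.1, 0.4, 0.3]),   # module 1
         np.array([0.9, 0.05])]            # module 2

# Theorem I: most significant (neuron, module) pair over the whole network.
i_star, s_star = max(((int(np.argmax(Ls)), s) for s, Ls in enumerate(L_out)),
                     key=lambda pair: L_out[pair[1]][pair[0]])

# Theorem II: most significant module = largest sum of its links to d_k.
s_most = int(np.argmax([Ls.sum() for Ls in L_out]))

# Theorem III: most significant neuron within one particular module (here s = 1).
i_most_in_s1 = int(np.argmax(L_out[1]))

print(i_star, s_star, s_most, i_most_in_s1)   # -> 0 2 1 2
# For the "least significant" versions of the theorems, replace argmax/max by argmin/min.
```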


The above theorems can serve to reduce the number of features or memories by considering only a reduced number of most-significant modules or memories, or by eliminating the least significant ones.

III.G.2. Correlation, Interpolation and Extrapolation

Consider the (m) most significant layers (modules) with respect to output decision (d_k) and the (n) most significant neurons in each of these (m) layers, with respect to the same output decision. (Example: Let m = n = 4.)

THEOREM VI (Correlation-Layer Theorem): Establish additional SOM layers denoted as CORRELATION-LAYERS λ(p/q, d_k), such that the number of these additional correlation-layers per output decision d_k is:

\sum_{i=1}^{m-1} i

(Example: The correlation-layers for the case of n = m = 4 are: λ(1/2, d_k); λ(1/3, d_k); λ(1/4, d_k); λ(2/3, d_k); λ(2/4, d_k); λ(3/4, d_k).)

Subsequently, WHENEVER neurons N(i,p) and N(j,q) are simultaneously (namely, for the same given input word) winners at layers (p) and (q) respectively, and both these neurons also belong to the subset of 'most significant' neurons in 'most significant' layers (such that p and q are 'most significant' layers), THEN we declare a neuron N(i,p/j,q) in Correlation-Layer λ(p/q, d_k) to be the winning neuron in that correlation-layer and we reward/punish its output link weight L(i,p/j,q - d_k) as need be, as for any winning neuron in any other input SOM layer.

(Example: The neurons in correlation-layer λ(p/q) are: N(1,p/1,q); N(1,p/2,q); N(1,p/3,q); N(1,p/4,q); N(2,p/1,q); ... N(2,p/4,q); N(3,p/1,q); ... N(4,p/1,q); ... N(4,p/4,q), totaling m×m neurons in the correlation-layer.)

THEOREM VII (Interpolation/Extrapolation Theorem via Internal Links): For a given input word that relates to output decision d_k, if no input subword exists that relates to layer (p), then the neuron N(i,p) which has the highest summed correlation link (internal link weights) with the winning neurons (for the same input word) in other layers v will be considered the interpolation/extrapolation neuron in layer p for that input word. However, no rewards/punishments will be applied to that neuron while it is an interpolation/extrapolation neuron.

THEOREM VIII (Interpolation/Extrapolation Theorem via Correlation Layers): Let p be a 'most significant' layer and let i be a 'most significant' neuron with respect to output decision d_k in layer p, where no input subword exists in a given input word
relating to layer p. Thus, neuron N(i,p) is considered as the interpolation/extrapolation neuron for layer p if it satisfies:

\sum_q L(i,p/w,q - d_k) \ge \sum_q L(v,p/w,q - d_k)    (13)

where v differs from i and where L(i,p/j,q - d_k) denotes the link weight from correlation-layer λ(p/q). Note that in every layer q there is only one winning neuron for the given input word, denoted as N(w,q), whichever w may be at any q'th layer.

(Example: Let p = 3. Thus consider correlation-layers λ(1/3, d_k); λ(2/3, d_k); λ(3/4, d_k), such that q = 1, 2, 4.)

THEOREM IX (Redundancy Theorem via Correlation-Layers): Let p be a 'most significant' layer and let i be a 'most significant' neuron in that layer. Layer p is redundant if, for all input words, there is another 'most significant' layer q such that, for any output decision and for any neuron N(i,p), only one correlation neuron (i,p/j,q) (i.e., only one j per each such i,p) has non-zero output-link weights to any output decision d_k, such that every neuron N(i,p) is always associated with only one neuron N(j,q) in layer q.

(Example: Neuron N(1,p) is always associated with neuron N(3,q) and never with N(1,q) or N(2,q) or N(4,q), while neuron N(2,p) is always associated with N(4,q) and never with other neurons in layer q.) Also, see Theorem IV above.

III.H. Innovation Detection

THEOREM X: If the link weights from a given input SOM layer to the output layer change considerably and repeatedly (beyond a threshold level) within a certain time interval (a certain specified number of successive input words that are being applied), relative to the link weights from other input SOM layers, then innovation is detected with respect to that input layer (category).

COROLLARY: Innovation is also detected if the weights between neurons from one input SOM layer to another input SOM layer similarly change.
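A hedged sketch of how Theorem X might be monitored in practice; the window length, the threshold and the dominance test below are all assumptions, since the text does not specify them:

```python
import numpy as np

THRESHOLD = 0.2   # assumed threshold on accumulated link-weight change
WINDOW = 50       # assumed number of successive input words considered

def detect_innovation(link_history):
    """Theorem X, roughly: link_history[t][s] holds the link-weight vector from
    input SOM layer s to the output layer after the t'th input word.  A layer is
    flagged if its accumulated change over the window exceeds the threshold and
    clearly dominates the change seen in the other layers."""
    recent = link_history[-WINDOW:]
    n_layers = len(recent[0])
    change = [sum(np.abs(recent[t][s] - recent[t - 1][s]).sum()
                  for t in range(1, len(recent)))
              for s in range(n_layers)]
    flagged = []
    for s, c in enumerate(change):
        others = [change[v] for v in range(n_layers) if v != s]
        if c > THRESHOLD and c > 2.0 * (np.mean(others) + 1e-12):
            flagged.append(s)
    return flagged

# Illustrative use: layer 1 drifts steadily while layer 0 stays static.
hist = [[np.zeros(4), np.full(4, 0.01 * t)] for t in range(60)]
print(detect_innovation(hist))   # -> [1]
```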


IV. Examples of Applications of the LAMSTAR Network

IV.A. General Introduction

The decisions of the LAMSTAR neural network are based on many categories of data, where often some categories are fuzzy while some are exact, and often categories are missing (incomplete data sets). As mentioned in Section III.C, the LAMSTAR network can be trained with incomplete data or category sets. Therefore, due to its features, the LAMSTAR neural network is a very effective tool in just such situations. As input, the system accepts data defined by the user, such as the system state, system parameters, or very specific data as shown in the application examples presented below. Then, the system builds a model (based on data from past experience and training) and searches the stored knowledge to find the best approximation/description of the features/parameters given as input data. The input data could be sent automatically through an interface to the LAMSTAR's input from sensors in the system to be diagnosed, say, an aircraft into which the network is built. The LAMSTAR system can be utilized as:

- a computer-based medical diagnosis system.
- a teaching aid.
- a tool for financial evaluations.
- a tool for industrial maintenance and fault diagnosis.
- a tool for data analysis, classification, browsing, and prediction.
- a browser tool and search engine for huge memories.

The LAMSTAR network can provide multidimensional analysis of input variables that can, for example, assign different weights (importance) to the items of data, find correlations among input variables, or perform identification, recognition and clustering of patterns. Since it is a neural network algorithm, the LAMSTAR system can do all this without re-programming for each diagnostic problem. In the sub-sections below we summarize examples of application of the LAMSTAR network to various problems and compare its performance with other neural networks applied to the same problems, using the same data. The examples considered below are: (1) patient diagnosis after removal of kidney stones, (2) renal cancer diagnosis, (3) diagnosis of drug abuse in an emergency-room situation (unconscious patient), (4) assessment of fetal well-being, (5) college-course evaluation analysis, (6) load balancing in distributed computations, and (7) speech recognition. The examples presented below illustrate the scope of applications of the LAMSTAR network.

IV.B. Application to the ESWL Medical Diagnosis Problem

In this application, the LAMSTAR network serves to aid in a typical urological diagnosis problem that is, in fact, a prediction problem [14, 15]. The network evaluates a patient's condition and provides long-term forecasting after removal of renal stones via Extracorporeal Shock Wave Lithotripsy (denoted as ESWL). The ESWL procedure breaks very large renal stones into small pieces that are then naturally removed from the kidney with the urine. Unfortunately, large kidney stones reappear in 10% to 50% of patients (1-4 years post surgery). It is difficult to predict with reasonable accuracy (more than 50%) whether the surgery was a success or a failure, due to the large number of analyzed variables. In this particular example, the input data (denoted as a "word" for each analyzed case, namely, for each patient) are divided into 16 subwords (categories). The length in bytes of each subword in this example varies from 1 to 6 bytes. The subwords describe the patient's physical and physiological characteristics, such as patient demographics, the stone's chemical composition, stone location, laboratory assays, follow-up, re-treatments, medical therapy, etc. Table 1 compares results for the LAMSTAR network and for a Back-Propagation (BP) neural network [17], as applied to exactly the same training and test data sets [15]. While both networks
model the problems with high accuracy, the results show that the LAMSTAR network is over 1000 times faster in this case. The difference in training time is due to the incorporation of an unsupervised learning scheme in the LAMSTAR network, while the BP network training is based on error minimization in a 37-dimensional space (when counting elements of subword vectors), which requires over 1000 iterations. Both networks were used to perform the Wilks' Lambda test [18, 19], which serves to determine which input variables are meaningful with regard to system performance. In clinical settings, the test is used to determine the importance of specific parameters in order to limit the number of patient examination procedures.

Table 1. Performance comparison of the LAMSTAR network and the BP network for the renal cancer and the ESWL diagnosis.

                               Renal Cancer Diagnosis         ESWL Diagnosis
                               LAMSTAR        BP              LAMSTAR        BP
Training Time                  0.08 sec       65 sec          0.15 sec       177 sec
Test Accuracy                  83.15 %        89.23 %         85.6 %         78.79 %
Negative Specificity           0.818          0.909           0.53           0.68
Positive Predictive Value      0.95           0.85            1              0.65
Negative Predictive Value      0.714          0.81            0.82           0.86
Positive Specificity           0.95           0.85            1              0.83
Wilks' Test Computation Time   < 15 mins      weeks           < 15 mins      weeks

Comments: Positive/Negative Predictive Values – ratio of the positive/negative cases that are correctly diagnosed to the positive/negative cases diagnosed as negative/positive. Positive/Negative Specificity – the ratio of the positive/negative cases that are correctly diagnosed to the negative/positive cases that are incorrectly diagnosed as positive/negative.

IV.C. Study of Renal Cancer Diagnosis Problem

This application illustrates how the LAMSTAR serves to predict whether patients will develop a metastatic disease after surgery for removal of renal-cell tumors. The input variables were grouped into subwords describing the patient's demographics, bone metastases, histologic subtype, tumor characteristics, and tumor stage [15]. In this case study we used 232 data sets (patient records), 100 sets for training and 132 for testing. The performance comparison of the LAMSTAR network versus the BP network is also summarized in Table 1. As we observe, the LAMSTAR network is not only much faster to train (over 1000 times), but clearly gives better prediction accuracy (85% as compared to 78% for BP networks) with less sensitivity.


IV.D. Application to Diagnosis of Drug Abuse for Emergency Cases

In this application, the LAMSTAR network is used as a decision support system to identify the type of drug used by an unconscious patient who is brought to an emergency room (data obtained from Maha Noujeime, University of Illinois at Chicago [20, 21]). A correct and very rapid identification of the drug type will provide the emergency-room physician with the immediate treatment required under critical conditions, whereas a wrong or delayed identification may prove fatal; no time can be lost while the patient is unconscious and cannot help in identifying the drug. The LAMSTAR system can distinguish between five groups of drugs: alcohol, cannabis (marijuana), opiates (heroin, morphine, etc.), hallucinogens (LSD), and CNS stimulants (cocaine) [20]. In the drug abuse identification problem, the diagnosis cannot be based on one or two symptoms, since in most cases the symptoms overlap. Drug abuse identification is a very complex problem, since most of the drugs can cause opposite symptoms depending on additional factors such as regular/periodic use, high/low dose, and time of intake [20]. The diagnosis is based on a complex relation between 21 input variables arranged in 4 categories (subword vectors) representing drug abuse symptoms. Most of these variables are easily detectable in an emergency-room setting by simple evaluation (Table 2). The large number of variables often makes it difficult for a doctor to properly interrelate them under emergency-room conditions for a correct diagnosis. An incorrect diagnosis, and subsequent incorrect treatment, may be lethal to the patient. For example, while cannabis and cocaine require different treatments, when analyzing only the mental state of the patient, both cannabis and large doses of cocaine can result in the same mental state, classified as mild panic and paranoia. Furthermore, often not all variables can be evaluated for a given patient. In an emergency-room setting it is impossible to determine all 21 symptoms, and there is no time for a urine test or other drug tests. The LAMSTAR network was trained with 300 sets of simulated input data of the kind considered in actual emergency-room situations [15]. The testing of the network was performed with 300 data sets (patient cases), some of which had incomplete data (in an emergency-room setting there is no time for urine or other drug tests). Because of the specific requirements of the drug abuse identification problem (abuse of cannabis should never be mistakenly identified as any other drug), the training of the system consisted of two phases. In the first phase, 200 training sets were used for unsupervised training, followed by a second phase where 100 training sets were used in on-line supervised training, with the punishment coefficients of eqns. (7a), (7b) and (9) increased whenever cannabis was incorrectly identified. The LAMSTAR network successfully recognized 100% of cannabis cases, 97% of CNS stimulants and hallucinogens (in all incorrect identification cases both drugs were mistaken for alcohol), 98% of alcohol abuse (2% incorrectly recognized as opiates), and 96% of opiates (4% incorrectly recognized as alcohol).

Table 2. Symptoms divided into four categories for the drug abuse diagnosis problem.

CATEGORY 1            CATEGORY 2     CATEGORY 3          CATEGORY 4
Respiration           Pulse          Euphoria            Physical Dependence
Temperature           Appetite       Conscious Level     Psychological Dependence
Cardiac Arrhythmia    Vision         Activity Status     Duration of Action
Reflexes              Hearing        Violent Behavior    Method of Administration
Saliva Secretion      Constipation   Convulsions         Urine Drug Screen
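The two-phase training described above biases the reward/punishment scheme so that cannabis is never misidentified. A minimal, purely illustrative sketch of such a class-weighted punishment follows; the coefficient values and the function itself are assumptions, not taken from [20, 21]:

```python
import numpy as np

PUNISH = {"cannabis": 0.20, "default": 0.05}   # punish missed cannabis much harder
REWARD = 0.05

def supervised_update(links, winners, predicted, desired, classes):
    """Second-phase (supervised) correction: reward the links to the desired
    diagnosis and punish the links to a wrong one, punishing harder whenever a
    cannabis case was misidentified."""
    penalty = PUNISH["cannabis"] if classes[desired] == "cannabis" else PUNISH["default"]
    for k, Lk in enumerate(links):                 # one link array per input module
        Lk[winners[k], desired] += REWARD
        if predicted != desired:
            Lk[winners[k], predicted] = max(0.0, Lk[winners[k], predicted] - penalty)

# Illustrative use: 3 input modules, 5 diagnosis classes, a missed cannabis case.
classes = ["alcohol", "cannabis", "opiates", "hallucinogens", "CNS stimulants"]
links = [np.full((4, 5), 0.1) for _ in range(3)]
supervised_update(links, winners=[0, 2, 1], predicted=0, desired=1, classes=classes)
print(round(links[0][0, 0], 2), round(links[0][0, 1], 2))   # -> 0.0 0.15
```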


IV.E. Assessment of Fetal Well-Being

This application [22] is to determine neurological and cardiologic risk to a fetus prior to delivery. It concerns situations where, in the hours before delivery, the expectant mother is connected to standard monitors of fetal heart rate and of maternal uterine activity. Also available are maternal and other related clinical records. However, unexpected events that may endanger the fetus, while recorded, can reveal themselves over several seconds in one monitor and are not conclusive unless considered in the framework of the data in another monitor and of other clinical data. Furthermore, no expert physician is available to constantly read any such data, even from a single monitor, during the several hours prior to delivery. This causes undue emergencies and possible neurological damage or death in approximately 2% of deliveries. In [22] preliminary results are given where all the data above are fed to a LAMSTAR neural network, in terms of 126 features, including 20 maternal history features, 9 maternal condition data at the time of test (body temperature, number of contractions, dilation measurements, etc.) and 48 items from preprocessed but automatically accessed instrument data (including fetal heart rate, fetal movements, uterine activity and cross-correlations between the above). This study on real data involved 37 cases used for training the LAMSTAR NN and 36 for actual testing. The 36 test cases involved 18 positives and 18 negatives. Only one of the positives (namely, indicating fetal distress) was missed by the NN, to yield a 94.44% sensitivity. There were 7 false alarms, which is explained by the small set of training cases. However, in a matter of fetal endangerment, one obviously must bias the NN to minimize misses at the cost of a higher rate of false alarms. Computation time is such that decisions can be made in almost real time if the NN and the preprocessors involved are directly connected to the instrumentation considered. In the literature, several other applications of NN's to this problem were reported, using other neural networks [22]. Of these, results were obtained in [23] where the miss percentage (for the best of several NN's discussed in that study) was reported as 26.4%, despite using 3 times as many cases for NN training. A study in [24], where a Back-Propagation NN was employed using only 8 parameters, reported an accuracy of 86.3% for 29 cases (10,000 iterations); the miss rate for that study is only indirectly computable, at 20%. Another BP-NN study [25], using 631 training cases, achieved a miss rate of 11.1% on a test set of 319 cases after 15,000 iterations.

IV.F. College-Course Evaluation Analysis

In this application, the LAMSTAR system is utilized not as a diagnostic tool, but as a tool for multidimensional analysis of input variables. The prototype system for course evaluation is implemented at the Knowledge Systems Institute - Graduate School at Skokie, IL [26]. The results generated by the system identify the strengths and weaknesses of the course. This further assists the dean or the administration personnel in evaluating the performance of the faculty members in an objective manner. The components of the entire evaluation system are: data entry forms, the LAMSTAR network, and evaluation results forms. The system is implemented in a secure environment through Internet Web pages. The input data were grouped into pre-defined categories, such as: course/teacher, general communication, quizzes/exams/homework, textbook/handouts. These categories serve as subwords in the LAMSTAR network. All the entries are stored in a database for easy retrieval and analysis of data. The results generated by
LAMSTAR are: (1) a list of categories with strengths/weaknesses, and (2) a numerical score for each category. These results are subsequently mapped into pre-defined sentences to be included in the evaluation letters to the faculty. The processing algorithm of the LAMSTAR network utilized in this example is outlined in Section III.E.

IV.G. Load Balancing in Distributed Computing

In this application the LAMSTAR system is employed to balance a distributed network [27]. The LAMSTAR network controls a system with N computers, where each computer can be a member of a distributed system or a networked computer (as shown in Figure 3), interchanging services or data. The LAMSTAR system is used to redirect services from client computers to servers that provide the fastest and most reliable services. The on-line training of the system is based on the following criteria: (1) finding the nearest server to the client, (2) finding a server that will not be overloaded while providing the service, and (3) finding an appropriate server while keeping the cost of communication between computers to a minimum. The input data to the system consist of the type of service requested (data, or application type) along with information about the requested/projected computational load. The LAMSTAR system is trained by providing information about the available servers, such as: database, type/speed of the connection, operating system, processor type/speed, and available services.

IV.H. Speech Recognition

In this application the LAMSTAR system is utilized as a limited-dictionary word recognition system. The system's categories (subwords) represent the frequency bins of the FFT of the analyzed words (with cut-off at 3.5 kHz). The output of the network is an integer index describing which word was detected. After the system is trained, the correlation links (as described in Section II) define the relationship between the frequency bins for each word. The network recognizes close to 99% of words for speaker-dependent word recognition, and 83% for speaker-independent recognition. In both cases noise was added to the input. The results for recognizing 10 words, for a case involving speech control of electrical stimulation for paraplegics, yielded 98.7% correct recognition using the LAMSTAR, against 92% with a Counter Propagation network [10, 28], when employing exactly the same pre-processing of the speech signals in both cases and the same amount of training. Figure 4 shows the LAMSTAR neural network used for speech recognition.
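As an illustration of how a spoken word could be converted into the frequency-bin subwords of the input word of eqn. (3), the following sketch is offered; the sample rate, window length and number of bands are assumptions, since [28] is not quoted in that detail here:

```python
import numpy as np

SAMPLE_RATE = 8000    # Hz (assumed)
CUTOFF_HZ = 3500      # analysis cut-off, per the 3.5 kHz figure in the text
N_SUBWORDS = 16       # assumed number of frequency-band categories

def word_to_subwords(samples):
    """Split the magnitude spectrum of a recorded word (below 3.5 kHz) into
    N_SUBWORDS frequency bands; each normalized band becomes one subword x_i
    of the input word X, feeding its own SOM module."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLE_RATE)
    spectrum = spectrum[freqs <= CUTOFF_HZ]
    bands = np.array_split(spectrum, N_SUBWORDS)
    return [band / (np.linalg.norm(band) + 1e-12) for band in bands]

# Illustrative use: a 0.5 s synthetic "word" made of two tones.
t = np.arange(0, 0.5, 1.0 / SAMPLE_RATE)
samples = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
subwords = word_to_subwords(samples)
print(len(subwords), subwords[0].shape)   # -> 16 subword vectors
```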

V. Conclusions

This paper reviewed the principles of the LAMSTAR neural network and several of its applications to problems involving a large number of features (categories) that are exact or fuzzy and where the problems involved are non-analytical. The LAMSTAR NN was shown to be very fast in its computation due to its employment of link weights combined with winner-take-all associative memory, in a many-layer structure. This structure allows the incorporation of the forgetting feature, so important for adaptation to new situations/environments. The network's arrays of link weights also allow fast extraction and assessment of the most important features relative to a decision problem, so that the network can reduce its own dimensionality when desired. Similarly, this same capability allows entering features at will, even if unimportant or redundant, as the network can easily detect and hence eliminate such (often hidden) redundancies, especially in problems
involving a huge number of features. Furthermore, consistent and significant (above-threshold) changes in link weights over a given number of successive input words allow innovation detection. The applications reviewed are an example of the LAMSTAR's range of applications. Five of the seven applications presented above are to medical diagnosis or medical decision problems. In all cases the network is fast in reaching its decisions. Comparative performance relative to other neural networks is also discussed. The LAMSTAR is, for a given performance, always (much) faster and requires less computation power (see Table 1 above). Its performance is equal or superior to that of other networks in the cases discussed, even with considerably less training (see Section IV-E). The LAMSTAR requires relatively little initial training. Still, it never stops learning (training), since every new decision, during normal operation, still modifies its weights. Its forgetting feature prevents it from ignoring new trends.


Figure 1. General Block Diagram - LAMSTAR Network. Task Evaluation unit provides highest hierarchy of control - to modify tolerances and thresholds. Stochastic Modulation Unit introduces modulation noise to all settings of weights.


Figure 2. Details of Figure 1. Top: Links between SOM modules. Bottom: Low-Hierarchy feedbacks from neurons that control weights N, V and L used in the LAMSTAR.


Figure 4. The LAMSTAR neural network used for the speech recognition problem.

REFERENCES

[1] McCulloch, W.S. and Pitts, W. (1943), A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys., 5, 115-133.
[2] Rosenblatt, F. (1958), The Perceptron, a probabilistic model for information storage and organization in the brain, Psychol. Rev., 65, 386-408.
[3] Graupe, D. (1997), Principles of Artificial Neural Networks, World Scientific Publishing Co., Singapore and River Edge, N.J. (especially chapter 13 thereof).
[4] Longuet-Higgins, H.C. (1968), Holographic model of temporal recall, Nature, 217, 104.
[5] Kohonen, T. (1988), Self-Organization and Associative Memory, 2nd Edition, Springer Verlag, N.Y.
[6] Hebb, D.O. (1949), The Organization of Behavior, J. Wiley, New York.
[7] Graupe, D. and Lynn, W.J. (1970), "Some aspects regarding mechanistic modeling of recognition and memory", Cybernetica, Vol. 3, pp. 119-141.
[8] Bellman, R. (1961), Dynamic Programming, Princeton Univ. Press, Princeton, N.J.


[9] Hopfield, J.J. (1982), Neural networks and physical systems with emergent collective computational capabilities, Proc. National Acad. Sci., 79, 2554-2558.
[10] Hecht-Nielsen, R. (1987), Counter propagation networks, Appl. Opt., 26, 4979-4984.
[11] Ewing, A.C. (1938), A Short Commentary on Kant's Critique of Pure Reason, Univ. of Chicago Press.
[12] Levitan, I.B. and Kaczmarek, L.K. (1997), The Neuron, 2nd Ed., Oxford Univ. Press.
[13] Graupe, D. and Kordylewski, H. (1998), A Large Memory Storage and Retrieval Neural Network for Adaptive Retrieval and Diagnosis, Internat. J. Software Eng. and Knowledge Eng., Vol. 8, No. 1, pp. 115-138.
[14] Kordylewski, H. and Graupe, D. (1997), Applications of the LAMSTAR Neural Network to Medical and Engineering Diagnosis/Fault Detection, Proc. 7th ANNIE Conf., St. Louis, MO.
[15] Kordylewski, H., Graupe, D. and Liu, K. (1999), Medical Diagnosis Applications of the LAMSTAR Neural Network, Proc. of the Biol. Signal Interpretation Conf. (BSI-99), Chicago, IL.
[16] Minsky, M.L. (1980), K-Lines: A Theory of Memory, Cognitive Sci., Vol. 4, pp. 117-133.
[17] Niederberger, C.S., et al. (1996), A neural computational model of stone recurrence after ESWL, Internat. Conf. on Eng. Appl. of Neural Networks (EANN '96), pp. 423-426.
[18] Morrison, D.F. (1996), Multivariate Statistical Methods, McGraw-Hill, p. 222.
[19] Wilks, S. (1938), "The large sample distribution of the likelihood ratio for testing composite hypotheses", Ann. Math. Stat., Vol. 9, pp. 2-60.
[20] Bierut, L.J., et al. (1998), "Familial transmission of substance dependence: alcohol, marijuana, cocaine, and habitual smoking", Arch. Gen. Psychiatry, 55(11), pp. 982-988.
[21] Noujeime, M. (1997), Primary Diagnosis of Drug Abuse for Emergency Case, Project Report, EECS Dept., Univ. of Illinois, Chicago.
[22] Scarpazza, D.P., Graupe, M.H., Graupe, D. and Hubel, C.J. (2002), Assessment of Fetal Well-Being Via a Novel Neural Network, Proc. IASTED International Conf. on Signal Processing, Pattern Recognition and Applications, Heraklion, Greece, pp. 119-124.
[23] Rosen, B.E., Bylander, T. and Schifrin, B. (1997), Automated diagnosis of fetal outcome from cardiotocograms, Intelligent Eng. Systems Through Artificial Neural Networks, ASME Press, NY, 7, 683-689.
[24] Maeda, K., Utsu, M., Makio, A., Serizawa, M., Noguchi, Y., Hamada, T., Mariko, K. and Matsumo, F. (1998), Neural Network Computer Analysis of Fetal Heart Rate, J. Maternal-Fetal Investigation, 8, 163-171.
[25] Kol, S., Thaler, I., Paz, N. and Shmueli, O. (1995), Interpretation of Nonstress Tests by an Artificial NN, Amer. J. Obstetrics & Gynecol., 172(5), 1372-1379.
[26] Kordylewski, H. (1998), A Large Memory Storage and Retrieval Neural Network for Medical and Industrial Diagnosis, Ph.D. Thesis, EECS Dept., Univ. of Illinois, Chicago.
[27] Todorovic, V. (1998), Load Balancing in Distributed Computing, Project Report, EECS Dept., Univ. of Illinois, Chicago.
[28] Patel, T.S. (2000), LAMSTAR NN for Real Time Speech Recognition to Control Functional Electrical Stimulation for Ambulation by Paraplegics, MS Project Report, EECS Dept., Univ. of Illinois, Chicago.