
Improving diagnostic accuracy using a hierarchical neural network to model decision subtasks




International Journal of Medical Informatics 57 (2000) 41–55

Improving diagnostic accuracy using a hierarchical neural network to model decision subtasks

David West a,*, Vivian West b

a Department of Decision Sciences, College of Business Administration, East Carolina University, Greenville, NC 27836, USA

b East Carolina University Center for Health Sciences Communication, Greenville, NC 27836, USA and School of Nursing, University of North Carolina, Chapel Hill, NC 27599, USA

Received 26 March 1999; received in revised form 30 September 1999; accepted 30 September 1999

Abstract

A number of quantitative models, including linear discriminant analysis, logistic regression, k nearest neighbor, kernel density, recursive partitioning, and neural networks, are being used in medical diagnostic support systems to assist human decision-makers in disease diagnosis. This research investigates the decision accuracy of neural network models for the differential diagnosis of six erythemato-squamous diseases. Conditions where a hierarchical neural network model can increase diagnostic accuracy by partitioning the decision domain into subtasks that are easier to learn are specifically addressed. Self-organizing maps (SOM) are used to portray the 34 feature variables in a two-dimensional plot that maintains topological ordering. The SOM identifies five inconsistent cases that are likely sources of error for the quantitative decision models; the lower bound for the diagnostic decision error based on five errors is 0.0140. The traditional application of the quantitative models cited above results in diagnostic error levels substantially greater than this target level. A two-stage hierarchical neural network is designed by combining a multilayer perceptron first stage and a mixture-of-experts second stage. The second-stage mixture-of-experts neural network learns a subtask of the diagnostic decision, the discrimination between seborrheic dermatitis and pityriasis rosea. The diagnostic accuracy of the two-stage neural network approaches the target performance established from the SOM, with an error rate of 0.0159. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Computer-aided diagnosis; Diagnostic accuracy; Learning subtasks; Mixture-of-experts neural network; Self-organizing map

www.elsevier.com/locate/ijmedinf

1. Introduction

Medical diagnostic decision support systems (MDSS) have become an established component of medical technology, and their use will continue to grow, fueled by the electronic medical record and automatic data capture [27]. The purpose of an MDSS is to augment, not replace, the natural capabilities of human diagnosticians in the complex process of medical diagnosis. To date, the most successful MDSS applications are in limited, focused

* Corresponding author. Tel.: +1-252-328-6370. E-mail address: [email protected] (D. West)

1386-5056/00/$ - see front matter © 2000 Elsevier Science Ireland Ltd. All rights reserved.

PII: S1386-5056(99)00059-3


domains. The heart of this medical technology is an inductive engine that learns the decision characteristics of the diseases and can then be used to diagnose future patients with uncertain disease states. Several quantitative models have been implemented in diagnostic support systems: linear discriminant analysis, logistic regression, k nearest neighbor, kernel density, recursive partitioning, and more recently, neural networks [27]. The purpose of this research is to investigate the diagnostic accuracy of neural network decision support models for the differential diagnosis of six erythemato-squamous diseases. To provide a benchmark, the diagnostic accuracy for the quantitative models listed above is also reported. While most diagnostic support models have been designed for binary decisions that detect the presence or absence of one particular disease, this application is concerned with the diagnosis of six dermatology diseases. Because there are more than two diagnostic groups, logistic regression is not included in this study.

Salchenberger et al. [32] used a neural network MDSS to detect breast implant rupture and reported that the MDSS was more accurate than the radiologists. Neural networks have also been used to overcome the low specificity in the diagnosis of breast cancer from mammogram lesions [3]. The authors reported finding a striking variability in the interpretation of mammograms by radiologists with different levels of training and experience. Their neural network based MDSS produced more consistent diagnoses, achieving 100% sensitivity and increasing the positive predictive value from 38 to 58–66%. PAPNET is an FDA approved decision support system using neural network technology that increases the detection of cervical abnormalities by as much as 30% [25,26]. The false negative error rate for cervical smear tests traditionally has ranged from 5 to 50%, resulting in missed early treatment opportunities. Maclin et al. [22] developed a neural network decision support system for the early detection of five classifications of hepatic cancer masses. They found their MDSS to be more accurate than the average radiologist in training and state that it has value to physicians in remote areas who need a second opinion, as well as providing faster diagnoses with fewer costly specialists [22]. Tourassi et al. developed a neural network diagnostic tool for diagnosing pulmonary embolism from ventilation-perfusion lung scans and chest radiographs. They conclude that the neural network decision tool significantly outperformed the physicians in the study, with a two-tailed P value < 0.01 [34]. The diagnosis of cytomegalovirus retinopathy after renal transplantation was investigated by Sheppard et al. [33]. The authors conclude that neural network diagnostic models correctly diagnose disease presence and absence with accuracy well in excess of 80%, and that the neural network models are a considerable improvement over current methods of diagnosis. There have been a number of applications of neural network diagnostic systems [4–6,9,29] to increase diagnostic accuracy for the detection of acute myocardial infarction. Baxt [4] reports that a neural network MDSS correctly identifies 92% of patients with acute myocardial infarction and 96% of patients without infarction. The best previous method was an 88% detection rate and a false alarm rate of 26%. Fricker [9] and Josephson [17] conclude that a neural network MDSS can be more accurate than experienced cardiologists in diagnosing acute myocardial infarction, and that the MDSS is particularly useful when junior staff are on duty in the emergency room.

The six dermatology diseases investigated in this research share the clinical features of erythema and scaling, and show very little visual difference. Unfortunately, they may


also share the histopathological features from a biopsy as well. A nonlinear self-organizing map (SOM) of the dermatology data is employed to provide exploratory insight into the complexity of the differential diagnosis and to set optimal properties to guide the search for the best quantitative engine. A comprehensive set of experiments was then conducted to estimate the diagnostic accuracy of the quantitative models using a rigorous 10-fold crossvalidation methodology. In the next section of this paper the three neural network models used in this research are briefly discussed: multilayer perceptron, mixture-of-experts, and SOM. This is followed by a description of the experimental design used to estimate diagnostic accuracy for each of the quantitative models, a discussion of the results, and finally conclusions to guide the development of an MDSS.

2. Description of neural network models

2.1. The multilayer perceptron neural network (MLP)

The MLP [7,12,13,28,30,31] shown in Fig. 1 is the architecture most frequently used in MDSS applications. The MLP used in this research consists of three layers: an input layer that presents the 34 input values, a hidden layer that extracts features from the input values, and an output layer that calculates the equivalent of a posterior probability for each of the diseases. While not shown in Fig. 1, each layer is fully connected; that is, each input value is transmitted to each of the hidden layer neurons, and each hidden layer neuron transmits its calculated output value to each of the output neurons. A variable weight value is associated with each of these connection paths. As will be seen, the weights are determined during a learning phase that

encodes knowledge of the diagnostic problem. If one assumes for the moment that the network weights have been determined, the feed-forward activity in the network that results when values are applied to the input layer variables can be traced. First, the input values are transmitted directly to each of the hidden layer neurons. Each hidden layer neuron calculates a weighted sum of the input values multiplied by the respective weight for each connection. A transfer function g(x) is then applied to this weighted sum to determine the hidden neuron output. The inset in Fig. 1 depicts the properties of a single neuron. The transfer function used in this paper is the hyperbolic tangent function defined in Eq. (1). An advantage of the antisymmetric hyperbolic tangent is that it allows the neuron activation to assume negative values and therefore facilitates faster learning [13].

g(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad (1)

The output value Z_h for hidden neuron h can be expressed as a function of the 34 input values I_i as follows:

Z_h = g\left( \sum_{i=1}^{34} W_{hi} I_i + W_{hb} \right) \qquad (2)

where i indexes the input neurons, h indexes the hidden layer neurons, and b indexes the respective bias value. A diagnosis is determined by transmitting the hidden layer values to the output layer and calculating values for each of the six output neurons representing the dermatology diseases. Using the expression of Eq. (2), the output value Z_o for any disease o is expressed as a function of the dermatology input values and network weights in Eq. (3).

Z_o = \sum_{h=1}^{6} W_{oh}\, g\left( \sum_{i=1}^{34} W_{hi} I_i + W_{hb} \right) + W_{ob} \qquad (3)


where o indexes the output layer. A softmax function can be applied to the outputs to ensure values between 0 and 1, which can then be interpreted as posterior probabilities for the presence of the disease.
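The feed-forward pass of Eqs. (1)–(3), followed by a softmax output stage, can be sketched as follows. NumPy is assumed, and the weights and input vector are illustrative placeholders rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 34-6-6 topology; real weight values come from training.
W_hidden = rng.normal(scale=0.1, size=(6, 34))  # W_hi
b_hidden = np.zeros(6)                          # W_hb
W_out = rng.normal(scale=0.1, size=(6, 6))      # W_oh
b_out = np.zeros(6)                             # W_ob

def forward(x):
    """Feed-forward pass: Eq. (2), then Eq. (3), then a softmax."""
    z_hidden = np.tanh(W_hidden @ x + b_hidden)  # Eq. (2), g = tanh from Eq. (1)
    z_out = W_out @ z_hidden + b_out             # Eq. (3)
    p = np.exp(z_out - z_out.max())              # softmax for posterior-like values
    return p / p.sum()

x = rng.integers(0, 4, size=34).astype(float)    # ordinal features scored 0-3
posteriors = forward(x)                          # six values summing to one
```

The six outputs behave like posterior probabilities for the six diseases: nonnegative and summing to one.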

The specific weight values are determined during an iterative learning process, the back-propagation algorithm. This learning process starts by assigning small random values to all weights. Examples from a training set are randomly selected, applied to the network inputs, and a corresponding network output is calculated from Eq. (3). The calculated output is compared to the desired output, d_o, for the training example, and a network error is determined.

Fig. 1. Multilayer perceptron diagnostic network.


E = 0.5 \sum_{o=1}^{6} (d_o - Z_o)^2 \qquad (4)

A gradient descent method is then used to change each weight in a direction that decreases the global network error function, E. This is accomplished by propagating errors backwards through the network from the output layer to the input layer and adjusting each weight by Eq. (5)

\Delta w_{mn}^{[s]} = -\alpha \left( \frac{\partial E}{\partial w_{mn}^{[s]}} \right) \qquad (5)

where \alpha is a learning coefficient and s indexes the layer in the MLP network. The gradient in parentheses in Eq. (5) can be evaluated for each layer by using the chain rule [13].
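A single gradient-descent update following Eqs. (4) and (5) might look like the sketch below. The network state is an illustrative placeholder, and only the output-layer term of the chain rule is shown; a full back-propagation pass would continue into the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.3  # learning rate used in this study

# Illustrative network state: 34 inputs, 6 hidden neurons, 6 outputs.
W_hidden = rng.normal(scale=0.1, size=(6, 34))
W_out = rng.normal(scale=0.1, size=(6, 6))

x = rng.integers(0, 4, size=34).astype(float)
d = np.zeros(6)
d[2] = 1.0                                # desired output, 1-of-6 coding

z_h = np.tanh(W_hidden @ x)               # hidden activations, Eq. (2)
z_o = W_out @ z_h                         # network outputs, Eq. (3)
E = 0.5 * np.sum((d - z_o) ** 2)          # network error, Eq. (4)

# Chain rule for the output-layer weights: dE/dW_oh = -(d_o - z_o) * z_h
grad = -np.outer(d - z_o, z_h)
W_out = W_out - alpha * grad              # weight update, Eq. (5)

E_new = 0.5 * np.sum((d - W_out @ z_h) ** 2)
print(E_new < E)  # True: the step decreases the error on this example
```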

2.2. Mixture-of-experts neural architecture

When a set of training cases can be naturally partitioned into subsets, network learning effectiveness may be improved with a system of local expert networks and a gating network that decides which of these experts to use for a given input. This concept is referred to as an adaptive mixture of local experts (MOE) [2,13,15,16]. When diagnosing two diseases that overlap in feature space, a local expert can be dedicated to learning each of the diseases. The advantage of the MOE is that during the learning process, weight changes are localized to the respective expert network and the gating network. Weights of one expert are de-coupled from weights of other experts, minimizing the possibility of learning interference between experts. A second advantage is that each local expert must learn only a smaller local region of input space. By contrast, when a single MLP network is trained by back-propagation of error, there may be strong interference effects leading to slow learning and poor generalization.

The MOE network is used in this research to diagnose two overlapping diseases, seborrheic dermatitis and pityriasis rosea. The architecture of the MOE is shown in Fig. 2, where the MLP is used for the first-stage diagnosis. Each local expert mirrors the MLP architecture with 34 input neurons, six hidden layer neurons, and two output neurons. A gating network then controls the competition between these two experts and learns to assign a disease to one of the local expert networks. The final output of the MOE network (y_1, y_2) is a weighted average of the respective local expert outputs defined in Eq. (6)

y_1 = \sum_{o=1}^{2} g_o y_{o1} \quad \text{and} \quad y_2 = \sum_{o=1}^{2} g_o y_{o2} \qquad (6)

where

g_1 = \frac{e^{s_1}}{\sum_{l=1}^{2} e^{s_l}} \quad \text{and} \quad g_2 = \frac{e^{s_2}}{\sum_{l=1}^{2} e^{s_l}} \qquad (7)

In Eq. (7), s_1 and s_2 represent the activation levels of the respective gating network output nodes. It is important for the MOE network error function to promote localization of weights during the learning process. Jacobs et al. [15] found that the objective function J, defined in Eq. (8), gives the best performance for the MOE architecture.

J = -\sum_{o=1}^{2} g_o\, e^{-0.5 (d - y_o)^{T} (d - y_o)} \qquad (8)

The vector d is the desired network output vector (0, 1 or 1, 0), and y_o (o = 1, 2) is the activation vector for the respective local expert network. If both the gating network and the local experts are trained by gradient descent methods that maximize the objective function J, the MOE tends to devote a single expert to each training case. The MOE differs from the back-propagation MLP in that training involves determining weight values for both the gating network and the local experts. The contribution of the gating network to the objective function J is determined by taking the partial derivative of


Fig. 2. Hierarchical two-stage neural diagnostic network.

the network error with respect to the summation vectors of the gating network. The proper value to back-propagate in the gating network is the term h_o − g_o, where h_o is given below.

h_o = \frac{g_o\, e^{-0.5 (d - y_o)^{T} (d - y_o)}}{\sum_{o=1}^{2} g_o\, e^{-0.5 (d - y_o)^{T} (d - y_o)}} \qquad (9)

Similarly, the value to back-propagate to the local expert networks can be calculated by Eq. (10) below.

\frac{\partial J}{\partial I_o} = h_o (d - y_o) \frac{\partial y_o}{\partial I_o} \qquad (10)
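The MOE forward pass and objective of Eqs. (6)–(9) can be sketched for two illustrative, untrained local experts and a linear gating network; all weight values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def expert(x, W1, W2):
    """A local expert mirroring the MLP: 34 inputs, 6 hidden, 2 outputs."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

# Illustrative weights for the two experts and the gating network.
experts = [(rng.normal(scale=0.1, size=(6, 34)),
            rng.normal(scale=0.1, size=(2, 6))) for _ in range(2)]
W_gate = rng.normal(scale=0.1, size=(2, 34))

x = rng.integers(0, 4, size=34).astype(float)
d = np.array([1.0, 0.0])                    # desired output vector

s = W_gate @ x                              # gating activations s_1, s_2
g = np.exp(s) / np.exp(s).sum()             # gating outputs, Eq. (7)

y_local = [expert(x, W1, W2) for W1, W2 in experts]
y = g[0] * y_local[0] + g[1] * y_local[1]   # blended MOE output, Eq. (6)

lik = np.array([np.exp(-0.5 * (d - yo) @ (d - yo)) for yo in y_local])
J = -(g * lik).sum()                        # objective function, Eq. (8)
h = g * lik / (g * lik).sum()               # responsibilities h_o, Eq. (9)
```

The responsibilities h_o sum to one; during training, the gating signal h_o − g_o pushes the gate toward the expert whose output best matches d.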

2.3. SOM neural networks

The SOM neural network consists of two layers, an input layer and the Kohonen layer [18,19]. The SOM input layer of neurons is fully connected to the Kohonen layer. The Kohonen layer is usually designed as a two-dimensional arrangement of neurons that maps N-dimensional input space to two dimensions, preserving topological order. The Kohonen layer computes the Euclidean distance between the weight vector for each of the Kohonen neurons and the input pattern. The Kohonen neuron that is closest (i.e. minimum distance) is the winner, with an activation value of one, while all other neurons have an activation of zero.

Self-organization [8,35,37] is an unsupervised learning process; there are no examples of the correct response presented to the network to guide the learning process. To map the dermatology data, a 10×10 array of Kohonen neurons is used, each of which receives the same input vector X. The index i measures the dimensionality of the input vector X such that i = 1, 2, …, 34. Let the Kohonen layer neurons be indexed by the numbers j = 1, 2, …, 100. Any particular Kohonen neuron j, n_j, has an input weight vector W_j. The neuron c, n_c, is the Kohonen neuron with weight vector W_c that is closest to the input signal vector X, with distance measured as follows.

\| X - W_c \| = \min_j \| X - W_j \| \quad \forall j \qquad (11)

N_c is defined as the subset of neurons that includes n_c and its adjacent neighbors. The process of self-organization is accomplished as follows:

\frac{dW_j}{dt} = \alpha(t) (X - W_j) \quad \text{for } j \in N_c \qquad (12)

\frac{dW_j}{dt} = 0 \quad \text{for } j \notin N_c \qquad (13)

where 0 < \alpha < 1. The magnitude of the learning coefficient \alpha(t) determines how rapidly the system adjusts over time; typically \alpha is decreased as learning proceeds. The neighborhood function that defines N_c starts with a large area and decreases over time. The self-organization process begins with all network weights (w_{ij}) initialized to small random values. Training proceeds by repeatedly exposing the network to the entire set of input vectors. For each input X, the neurons compete for the right to respond. The neuron with the weight vector W_j that is a minimum distance from the input vector X, Eq. (11), is the winner. The weights of this winning neuron are adjusted in the direction of the input vector, as are the weights of the neurons included in the set N_c defined by the neighborhood function. The weight adjustments for all j neurons are calculated according to Eqs. (12) and (13). The result of a single competitive learning step is that the neighborhood of neurons surrounding the winner moves towards the input vector. The learning process continues with the presentation of input vectors in random order until the Kohonen weight vectors stabilize.
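A single competitive-learning step of Eqs. (11)–(13) can be sketched as follows. The square neighborhood of adjacent neurons and the fixed learning coefficient are simplifying assumptions; as described above, both shrink over the course of training.

```python
import numpy as np

rng = np.random.default_rng(3)

grid = 10                                 # 10x10 Kohonen layer
W = rng.random((grid, grid, 34))          # one 34-element weight vector per neuron
W_before = W.copy()                       # snapshot for comparison
alpha = 0.5                               # learning coefficient, 0 < alpha < 1

def train_step(W, x, alpha, radius=1):
    # Winner: the neuron whose weight vector is closest to x, Eq. (11)
    dists = np.linalg.norm(W - x, axis=2)
    ci, cj = np.unravel_index(np.argmin(dists), dists.shape)
    # Neighborhood N_c: the winner plus adjacent neurons within `radius`
    for i in range(max(0, ci - radius), min(grid, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(grid, cj + radius + 1)):
            W[i, j] += alpha * (x - W[i, j])  # Eq. (12); all others unchanged, Eq. (13)
    return ci, cj

x = rng.integers(0, 4, size=34).astype(float)
winner = train_step(W, x, alpha)
# After the step, the winner and its neighbors have moved toward the input x.
```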

3. Methods and experimental design

The dermatology database investigated in this paper consists of 358 cases of erythemato-squamous diseases compiled by Dr Nilsel Ilter and Dr H. Altay Guvenir at the Gazi School of Medicine, Ankara, Turkey [11]. Table 1 lists the six classes of erythemato-squamous diseases. Dermatology patients were first assessed to obtain information about the 12 features labeled clinical/sociological attributes in Table 2. After the clinical examination, skin samples were taken for evaluation of 22 histopathological features. With the exception of age and family history, four-value ordinal scales measure all variables, with 0 representing the absence of the attribute and 3 representing the largest possible presence; values of 1 and 2 represent intermediate levels of the variables. Differential diagnosis of these six diseases is confounded by the fact that they share many of the clinical and histopathological features.

Table 1
Dermatology patients by disease

  Dermatology diagnosis       Number of patients
  Psoriasis                   111
  Seborrheic dermatitis        60
  Lichen planus                71
  Pityriasis rosea             48
  Chronic dermatitis           48
  Pityriasis rubra pilaris     20
  Total                       358

Table 2
Dermatology variables

                                             Min  Max   Mean   S.D.
  Clinical/sociological attributes
  Erythema                                    0    3    2.07   0.66
  Scaling                                     0    3    1.80   0.70
  Definite borders                            0    3    1.55   0.91
  Itching                                     0    3    1.37   1.14
  Koebner phenomenon                          0    3    0.63   0.91
  Polygonal papules                           0    3    0.45   0.96
  Follicular papules                          0    3    0.17   0.57
  Oral mucosal involvement                    0    3    0.38   0.83
  Knee and elbow involvement                  0    3    0.61   0.98
  Scalp involvement                           0    3    0.52   0.91
  Family history                              0    1    0.13   0.33
  Age                                         7   75    36.4   15.2

  Histopathological attributes
  Melanin incontinence                        0    3    0.40   0.87
  Eosinophils in the infiltrate               0    2    0.14   0.41
  Fibrosis of the papillary dermis            0    3    0.34   0.85
  Exocytosis                                  0    3    1.37   1.10
  Acanthosis                                  0    3    1.96   0.71
  Hyperkeratosis                              0    3    0.53   0.76
  Parakeratosis                               0    3    1.29   0.92
  Clubbing of the rete ridges                 0    3    0.66   1.06
  Elongation of the rete ridges               0    3    0.99   1.16
  Thinning of the suprapapillary epidermis    0    3    0.63   1.03
  Spongiform pustule                          0    3    0.30   0.67
  Munro microabscess                          0    3    0.36   0.76
  Focal hypergranulosis                       0    3    0.39   0.85
  Disappearance of the granular layer         0    3    0.46   0.86
  Vacuolization and damage of basal layer     0    3    0.46   0.95
  Spongiosis                                  0    3    0.95   1.13
  Saw tooth appearance of retes               0    3    0.45   0.95
  Follicular horn plug                        0    3    0.10   0.45
  Perifollicular parakeratosis                0    3    0.11   0.49
  Inflammatory mononuclear infiltrate         0    3    1.87   0.73
  Band-like infiltrate                        0    3    0.55   1.11

To estimate the predictive accuracy of the diagnostic models, the data set must be split into a training set and a test set. The training set is used to establish the diagnostic model's parameters, while the independent holdout sample is used to test the generalization capability of the model. Ten-fold cross validation is used in this research to minimize the impact of data dependency on the results and to improve the reliability of the resulting predictive estimates. The choice of ten partitions is somewhat arbitrary; ideally, a leave-one-out strategy [12] would maximize the training set size, but it creates prohibitive time requirements for neural network training. Using ten partitions is a reasonable compromise that has been used in several experimental designs [1,10,20,22,36]. Each of the ten random partitions of the data serves as a test set for the diagnostic model trained with the remaining nine partitions. The overall diagnostic accuracy reported is an average across all ten training set partitions. An advantage of cross validation is that the diagnostic model is trained with a large proportion of the available data (90% in this case) and that all of the data is used to test the resulting models.

Several key design decisions involving the topology and the learning process are required to define the MLP neural network model. Topology decisions establish the network architecture and include the number of hidden layers and the number of neurons in each layer. The number of neurons in the input layer of the neural models is simply the number of variables in the data set. For the neural output layer, 1-of-6 coding is used, with an output neuron dedicated to each of the erythemato-squamous disease outcomes. The hidden layer is more difficult to define. A relatively large hidden layer creates a more flexible diagnostic model; the diagnostic error for such a model will tend to have a low bias component with a large variance caused by the tendency of the model to over-fit the training data. A relatively small hidden layer results in a model with a higher error bias and a lower variance. The design of the hidden layer therefore involves a tradeoff between error components. In this research, the number of neurons in the MLP hidden layer is determined using a cascade learning process. The cascade learning process is constructive, starting with an empty hidden layer and adding neurons to this layer one at a time; the addition of hidden neurons continues until there is no further improvement in network performance. Results suggest using six hidden layer nodes for a 34×6×6 network architecture. Diagnostic accuracy is also dependent on the dynamics of the network learning process. The most accurate diagnostic results typically do not coincide with network error convergence for the training data. Initial experiments with training lengths varying from 30 000 to 300 000 iterations were conducted. Based on these experiments, the following training guidelines are used in this research. Network weights are updated after each learning epoch, defined as one cycle through the training data set. The network learning parameters (the learning rate and learning momentum) are decreased every ten learning epochs, and training is terminated after fifty learning epochs. It is our experience that setting relatively low values for the network learning rate and momentum increases decision accuracy. We therefore use a learning rate of 0.3 and a momentum of 0.4; these values are consistent with parameters chosen by other researchers [13,21].
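The ten-fold cross validation procedure used in this section can be sketched as follows; `train_mlp` and `error_rate` are hypothetical stand-ins for the model-specific training and scoring steps, so the loop body only records fold sizes here.

```python
import numpy as np

rng = np.random.default_rng(4)

n_cases, folds = 358, 10
# Randomly assign each case to one of ten partitions of near-equal size.
labels = np.repeat(np.arange(folds), int(np.ceil(n_cases / folds)))[:n_cases]
fold_of = rng.permutation(labels)

fold_sizes = []
for k in range(folds):
    test_idx = np.flatnonzero(fold_of == k)    # held-out test partition
    train_idx = np.flatnonzero(fold_of != k)   # remaining nine partitions (~90%)
    # model = train_mlp(X[train_idx], y[train_idx])        # hypothetical calls
    # errors = error_rate(model, X[test_idx], y[test_idx])
    fold_sizes.append(len(test_idx))

# Every case is tested exactly once across the ten folds.
print(sum(fold_sizes))  # 358
```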


4. Results and discussion

4.1. Exploratory data analysis

The following exploratory data analysis provides an understanding of the complexity of the differential diagnosis of the six dermatological diseases. The dermatology data set, which consists of 34 clinical/sociological and histopathological variables that define six disease indications, is difficult to visualize using conventional statistical methods. In this research, a SOM is used to reduce the 34-dimensional data to a two-dimensional map that preserves topological ordering. The property of topological ordering ensures that patients with similar test results in the original 34-dimensional space remain neighbors in the two-dimensional map [14,19,23]. Fig. 3 depicts the SOM map produced for the 358 cases investigated in this research. Four of the dermatology diseases shown on the SOM map form distinct clusters with no overlap. This suggests that the MDSS should be able to diagnose psoriasis, lichen planus, chronic dermatitis, and pityriasis rubra pilaris to a high degree of accuracy. Approximate group boundaries are drawn on the SOM map in Fig. 3 where classification decision boundaries between groups may lie. The reader is cautioned that these group boundaries are based on limited data and that the positions are not exact. Fig. 3 also provides evidence of overlap between instances of seborrheic dermatitis and pityriasis rosea. Fig. 4 provides a higher resolution map of only the two overlapping diseases, seborrheic dermatitis and pityriasis rosea, providing more detail of the boundaries and the extent of overlap. The five circled observations in Fig. 4 identify cases that clearly reside within the decision boundaries of the wrong disease. The SOM exploratory analysis presents an opportunity for the investigators collecting data to ensure the validity of specific observations. It is possible that some of these instances are caused by human error in diagnosing the disease. A second possibility is that these observations are inconsistent cases and will be a source of error and confusion for any MDSS model. In this paper we assume that the five observations are inconsistent cases, and use this information to define performance targets for the MDSS.

Fig. 3. Self-organizing map (SOM) of dermatology diseases.

Fig. 4. Self-organizing map (SOM) of two dermatology diseases.

Insights from the exploratory SOM analysis allow us to establish performance targets for the MDSS models investigated. Ideally, one might expect perfect diagnostic accuracy for the four distinct disease groups and five instances of confusion between the two overlapping groups. Under these assumptions, a lower bound for the MDSS overall error rate is 5/358 = 0.0140, with a corresponding accuracy of 98.6%.

4.2. Base performance of MDSS models

Table 3 reports the diagnostic error for the standard application of the MLP network and the four statistical quantitative models using all 34 variables in the data set. These results are averages of errors determined experimentally for each of the ten data set partitions used in our cross validation methodology. Since the training of the MLP network is a stochastic process, the neural network error determined for each data set partition is itself an average of an ensemble of ten MLP networks initialized with random weights and trained independently. The overall MLP error is therefore an average of 100 partition-ensemble trials. For the 'average case' MLP results reported in Table 3, there is no attempt to edit and eliminate situations where high diagnostic errors are caused by MLP


Table 3
Dermatology diagnostic errors based on 34 variables (a)

  Method                       Group 2 diagnosed  Group 4 diagnosed  All other     Overall
                               as Group 4         as Group 2         errors        error
  MLP network (average case)   3.6 (0.0600)       5.2 (0.1083)       0.5 (0.0020)  9.3 (0.0260)
  MLP network (best case)      2.8 (0.0466)       4.1 (0.0854)       0.5 (0.0020)  7.4 (0.0206)
  Discriminant analysis        6 (0.1000)         3 (0.0625)         4 (0.0160)    13 (0.0363)
  K nearest neighbor           8 (0.1333)         4 (0.0833)         1 (0.0040)    13 (0.0363)
  Kernel density               10 (0.1666)        4 (0.0833)         3 (0.0120)    17 (0.0475)
  CART                         6 (0.1000)         5 (0.1042)         10 (0.0400)   21 (0.0586)

(a) Group 2 is seborrheic dermatitis; Group 4 is pityriasis rosea; ( ) denotes estimated error rate.

network convergence problems during the training phase. These results do contain some instances of poorly trained networks. MLP network ‘best case’ results are also reported. These results are based on the five most accurate MLP results of the ensemble of ten MLP networks. The diagnostic results are consistent with the exploratory SOM analysis from the previous subsection. The largest source of error is the diagnosis of seborrheic dermatitis as pityriasis rosea. The second most common error is the inverse, diagnosing pityriasis rosea as seborrheic dermatitis. The results of Table 3 are presented as a reduced confusion matrix with an explicit error reported for the two predominant groups that are misdiagnosed, an all other error category, and an overall error. Table 3 reports both the actual number of classification errors and the classification error rate, shown in parentheses. The error rate is the number of errors divided by the number of opportunities for an error. The overall error for all six disease states is the total number of diagnostic errors divided by the 358 cases in the data set.
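As a concrete illustration of these definitions, the overall error rate for the MLP ‘average case’ follows from its three Table 3 error counts (the dictionary keys below are our own naming, not the paper's):

```python
def overall_error(error_counts, n_cases=358):
    """Overall error rate: total misdiagnoses divided by all cases."""
    return sum(error_counts.values()) / n_cases

# MLP 'average case' error counts from Table 3.
mlp_average = {
    "group2_as_group4": 3.6,  # seborrheic dermatitis called pityriasis rosea
    "group4_as_group2": 5.2,  # pityriasis rosea called seborrheic dermatitis
    "all_other": 0.5,
}

print(round(overall_error(mlp_average), 4))  # 0.026
```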

The most accurate diagnostic model is the MLP neural network with 9.3 errors and an error rate of 0.0260 for the ‘average case’, and 7.4 errors (0.0206) for the ‘best case’. The fractional number of errors reported for the MLP network is caused by the 100 partition-ensemble trials used to establish average outcomes. The average and best case MLP also have the lowest error in diagnosing seborrheic dermatitis as pityriasis rosea, with 3.6 errors (2.8 errors for best case) and a 0.0590 error rate (0.0466 for best case). The one weakness of the MLP network is in diagnosing pityriasis rosea as seborrheic dermatitis. Here there are 5.2 misdiagnoses with an error rate of 0.1061, the highest of the five models investigated. The simple parametric discriminant analysis and the k nearest neighbor are the second most accurate models after the MLP, both with 13 overall errors and an error rate of 0.0363. Next is kernel density with 17 overall classification errors and an error rate of 0.0475, while the recursive-partitioning (CART) model is last with 21 errors and an error rate of 0.0586. It is also interesting to note that the composition of the MLP error is almost exclusively confusion between the seborrheic dermatitis and pityriasis rosea diseases. These two groups represent 8.8 errors out of the total 9.3 errors. This suggests an opportunity for partitioning the decision to improve diagnostic accuracy by constructing a two-stage neural network model with the second stage trained exclusively on seborrheic dermatitis and pityriasis rosea data.
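The partition-ensemble averaging that produces these fractional error counts can be sketched as follows. The trial function here is a hypothetical stand-in for training one randomly initialized MLP on one cross validation partition; nothing below is the authors' code:

```python
import random
from statistics import mean

def partition_ensemble_error(trial, n_partitions=10, ensemble_size=10):
    """Average the error of ensemble_size independently seeded training
    runs on each of n_partitions cross validation partitions -- 100
    trials in total, mirroring the paper's methodology."""
    results = []
    for partition in range(n_partitions):
        for member in range(ensemble_size):
            seed = partition * ensemble_size + member
            results.append(trial(partition, seed))
    return mean(results)

# Hypothetical trial: returns a simulated error rate near 0.0260.
def simulated_trial(partition, seed):
    rng = random.Random(seed)
    return 0.026 + rng.uniform(-0.005, 0.005)

print(round(partition_ensemble_error(simulated_trial), 3))
```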


Table 4
Dermatology diagnostic errors based on reduced data set and two stage hierarchical neural network (a)

Method                      | Group 2 diagnosed as Group 4 | Group 4 diagnosed as Group 2 | All other errors | Overall error
MLP-MOE two stage network   | 2.1 (0.0350)                 | 3.1 (0.0645)                 | 0.5 (0.0020)     | 5.7 (0.0159)
Discriminant analysis       | 3 (0.0500)                   | 2 (0.0417)                   | 3 (0.0120)       | 8 (0.0223)
K nearest neighbor          | 8 (0.1333)                   | 3 (0.0625)                   | 0 (0.0000)       | 11 (0.0307)
Kernel density              | 7 (0.1166)                   | 3 (0.0625)                   | 2 (0.0080)       | 12 (0.0336)
CART                        | 6 (0.1000)                   | 5 (0.1042)                   | 10 (0.0400)      | 21 (0.0586)

(a) Group 2 is seborrheic dermatitis; Group 4 is pityriasis rosea; ( ) denotes estimated error rate.

4.3. Data reduction and decision partitioning

Data reduction is accomplished by using the SAS stepwise discriminant analysis procedure to identify variables that do not contribute to the task of distinguishing between dermatology diseases. The significance level used to retain variables is set at P < 0.1500. The stepwise discriminant analysis indicates that the nine variables identified in Table 5 can be removed from the dermatology data set. The cross validation diagnosis is repeated for all five MDSS models using the reduced data set and the results reported in Table 4. With fewer variables, the accuracy of the discriminant analysis model increases significantly to a total of eight errors (from 13 previously), with an overall error rate of 0.0223. The accuracy of both nonparametric methods also increases; k nearest neighbor has 11 errors (0.0307) and kernel density has 12 errors (0.0336). The CART model has the ability to ignore irrelevant variables, so there is no improvement in accuracy using the smaller data set. Interestingly, the MLP model is less accurate with the reduced data set described above.
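The paper uses SAS PROC STEPDISC; a rough, greedy analogue of the retention rule (keep a variable only while its partial significance satisfies P < 0.15) can be sketched in Python. The variable names and p-values below are illustrative stand-ins, not results computed from the dermatology data:

```python
def backward_eliminate(variables, partial_p_value, alpha=0.15):
    """Greedy backward elimination: repeatedly drop the variable with the
    weakest partial significance until every survivor has p < alpha.
    partial_p_value(var, others) must return the p-value for var given
    the other retained variables (supplied by the caller)."""
    kept = list(variables)
    while kept:
        p_values = {v: partial_p_value(v, [k for k in kept if k != v])
                    for v in kept}
        worst, p = max(p_values.items(), key=lambda item: item[1])
        if p < alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
    return kept

# Illustrative fixed p-values (NOT the paper's actual statistics).
fake_p = {"erythema": 0.40, "age": 0.25, "itching": 0.01, "koebner": 0.02}
print(sorted(backward_eliminate(fake_p, lambda v, _: fake_p[v])))
# ['itching', 'koebner']
```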

Table 5
Dermatology variables removed

Clinical/sociological attributes
  Erythema
  Definite borders
  Knee and elbow involvement
  Scalp involvement
  Age

Histopathological attributes
  Acanthosis
  Spongiform pustule
  Vacuolization and damage of basal layer
  Follicular horn plug

The two-stage neural network depicted in Fig. 2 is designed to simplify the diagnostic decision between the two overlapping disease states. The first stage is the original MLP network trained for the six dermatological diseases. Any case diagnosed by the first-stage MLP as seborrheic dermatitis or pityriasis rosea is evaluated in a second-stage mixture-of-experts neural network trained exclusively with seborrheic dermatitis and pityriasis rosea data. The second-stage MOE employs two local experts (one for each disease state) and a gating network that decides which of the experts to use for a given input. The MOE architecture was chosen for the second stage based on a recent study indicating that it is more accurate than the MLP for many small two-group decision applications [24]. This accuracy advantage is traced to the ability of the MOE to partition the input space in a manner that reduces the frequency of local minima. The cross validation results for the two-stage hierarchical neural network produce a diagnostic error close to the target of five errors. The instances of seborrheic dermatitis diagnosed as pityriasis rosea are reduced from 3.6 to 2.1, for an error rate of 0.0350. The errors diagnosing pityriasis rosea as seborrheic dermatitis are reduced from 5.2 to 3.1, for an error rate of 0.0645. The overall error for the two-stage MLP-MOE network is 5.7 errors, with a diagnostic error rate of 0.0159.
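The hierarchical decision rule itself is compact. The sketch below is our own illustration; the predictor callables stand in for the trained networks, which are not published with the paper. Any first-stage diagnosis of either overlapping disease is routed to the second-stage specialist:

```python
OVERLAP = {"seborrheic dermatitis", "pityriasis rosea"}

def two_stage_diagnose(case, mlp_predict, moe_predict):
    """First-stage MLP diagnoses all six diseases; cases landing in the
    two overlapping classes are re-evaluated by the second-stage
    mixture-of-experts trained only on those two diseases."""
    diagnosis = mlp_predict(case)
    if diagnosis in OVERLAP:
        return moe_predict(case)  # specialist resolves the hard subtask
    return diagnosis

# Stub predictors standing in for the trained networks.
def stub_mlp(case):
    return "psoriasis" if case == "easy" else "pityriasis rosea"

def stub_moe(case):
    return "seborrheic dermatitis"

print(two_stage_diagnose("easy", stub_mlp, stub_moe))  # psoriasis
print(two_stage_diagnose("hard", stub_mlp, stub_moe))  # seborrheic dermatitis
```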

5. Conclusion

Using the SOM as an exploratory data analysis tool, we identify an optimal target value for our MDSS model to be five diagnostic errors. The exploratory analysis also reveals that four of the six diseases are distinct and that most of the diagnostic error will result from confusion in the diagnosis of seborrheic dermatitis and pityriasis rosea. The baseline performance of the MDSS models tested with all 34 variables yields diagnostic errors significantly higher than the target of five errors. The MLP network ‘best case’ is 7.4 errors and the ‘average case’ 9.3 errors. The four other MDSS quantitative models range from 13 to 21 errors. Improvements to the baseline performance require additional modeling skills, namely variable reduction and decision partitioning. The latter is accomplished by constructing a two-stage MLP-MOE neural network to take advantage of its near perfect diagnosis of the four distinct diseases. After these efforts, it was found that two models approach the ideal of five errors; the two-stage MLP-MOE trained with all 34 variables achieves 5.7 errors (0.0159), and linear discriminant analysis trained on the reduced set of 25 variables achieves eight errors (0.0223). The advantage of linear discriminant analysis is that it is the simplest of all the models, while the two-stage MLP-MOE has the advantage of being more accurate. The two nonparametric methods, k nearest neighbor and kernel density, suffer from the curse of dimensionality and are relatively inaccurate for this application. The dimensionality problem for these two models results from the data conditions. There are 25 relevant variables and only 358 observations; the data is too sparse for accurate nonparametric data density estimation. Their performance might improve in a data rich environment, for example if 10 000 or more cases were available. Unfortunately, abundant data is rare. The recursive partitioning method is conceptually appealing because its structure provides diagnostic decision rules, but it is the least accurate of the five MDSS models investigated.

While we feel we have structured our experiments to yield an unbiased evaluation of all five MDSS models, we acknowledge it may be possible to improve the accuracy of any model by fine-tuning, which includes activities like data transformation and an exhaustive investigation of model parameters. We also recognize that this research is conducted on a single population of 358 patients and the results may not generalize to other dermatology populations.

References

[1] S.S. Anand, A.E. Smith, P.W. Hamilton, J.S. Anand, J.G. Hughes, P.H. Bartels, An evaluation of intelligent prognostic systems for colorectal cancer, Artif. Intell. Med. 15 (1999) 193–214.

[2] G. Auda, M. Kamel, CMNN: cooperative modular neural networks for pattern recognition, Pattern Recognit. Lett. 18 (1997) 1391–1398.

[3] J.A. Baker, P.J. Kornguth, J.Y. Lo, C.E. Floyd, Artificial neural network: improving the quality of breast biopsy recommendations, Radiology 198 (1996) 131–135.

[4] W.G. Baxt, Use of an artificial neural network for data analysis in clinical decision-making: the diagnosis of acute coronary occlusion, Neural Comput. 2 (1990) 480–489.

[5] W.G. Baxt, Use of an artificial neural network for the diagnosis of myocardial infarction, Ann. Intern. Med. 115 (1991) 843–848.


[6] W.G. Baxt, A neural network trained to identify the presence of myocardial infarction bases some decisions on clinical associations that differ from accepted clinical teaching, Med. Decis. Making 14 (1994) 217–222.

[7] C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.

[8] J.A. Flanagan, Self-organization in Kohonen’s SOM, Neural Netw. 9 (7) (1996) 1185–1197.

[9] J. Fricker, Artificial neural networks improve diagnosis of acute myocardial infarction, Lancet 350 (1997) 935.

[10] E. Gilpin, R. Olshen, H. Henning, J. Ross Jr, Risk prediction after myocardial infarction, Cardiology 70 (1983) 73–84.

[11] H.A. Guvenir, G. Demiroz, N. Ilter, Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals, Artif. Intell. Med. 13 (1998) 147–165.

[12] D.J. Hand, Construction and Assessment of Classification Rules, Wiley, New York, 1997.

[13] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, New York, 1994.

[14] D.B. Henson, S.E. Spenceley, D.R. Bull, Artificial neural network analysis of noisy visual field data in glaucoma, Artif. Intell. Med. 10 (1997) 99–113.

[15] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.

[16] R.A. Jacobs, M.I. Jordan, A.G. Barto, Task decomposition through competition in a modular connectionist architecture: the what and where vision tasks, Cogn. Sci. 15 (1991) 219–250.

[17] D. Josefson, Computers beat doctors in interpreting ECGs, Br. Med. J. 315 (1997) 764–765.

[18] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990) 1464–1480.

[19] T. Kohonen, Self-Organizing Maps, Springer, New York, 1997.

[20] P. Lapuerta, G.J. L’Italien, S. Paul, R.C. Hendel, J.A. Leppo, L.A. Fleisher, M.C. Cohen, K.A. Eagle, R.P. Giugliano, Neural network assessment of perioperative cardiac risk in vascular surgery patients, Med. Decis. Making 18 (1) (1998) 70–75.

[21] D.A. Linkens, L. Vefghi, Recognition of patient anaesthetic levels: neural network systems, principal component analysis, and canonical discriminant variates, Artif. Intell. Med. 11 (1997) 155–173.

[22] P.S. Maclin, J. Dempsey, How to improve a neural network for early detection of hepatic cancer, Cancer Lett. 77 (1994) 95–101.

[23] P. Mangiameli, D. West, An improved neural classification network for the two group problem, Comput. Operations Res. 26 (1999) 443–460.

[24] P. Mangiameli, S.K. Chen, D. West, A comparison of SOM neural network and hierarchical clustering methods, Eur. J. Operational Res. 93 (1996) 402–417.

[25] L.J. Mango, Computer-assisted cervical cancer screening using neural networks, Cancer Lett. 77 (1994) 155–162.

[26] L.J. Mango, Reducing false negatives in clinical practice: the role of neural network technology, Am. J. Obstet. Gynecol. 175 (4) (1996) 1114–1119.

[27] R.A. Miller, Medical diagnostic decision support systems-past, present, and future: a threaded bibliography and brief commentary, J. Am. Med. Inform. Assoc. 1 (1) (1994) 8–27.

[28] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.

[29] C. Rosenberg, J. Erel, H. Atlan, A neural network that learns to interpret myocardial planar thallium scintigrams, Neural Comput. 5 (1993) 492–502.

[30] D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, MA, 1986.

[31] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.

[32] L. Salchenberger, E.R. Venta, L.A. Venta, Using neural networks to aid the diagnosis of breast implant rupture, Comput. Operations Res. 24 (5) (1997) 435–444.

[33] D. Sheppard, D. McPhee, C. Darke, B. Shrethra, R. Moore, A. Jurewits, A. Gray, Predicting cytomegalovirus disease after renal transplantation: an artificial neural network approach, Int. J. Med. Inform. 54 (1) (1999) 55–71.

[34] G.D. Tourassi, C.E. Floyd, H.D. Sostman, R.E. Coleman, Acute pulmonary embolism: artificial neural network approach for diagnosis, Radiology 189 (1993) 555–558.

[35] T. Villmann, R. Der, M. Herrmann, T.M. Martinetz, Topology preservation in self-organizing feature maps: exact definition and measurement, IEEE Trans. Neural Netw. 8 (2) (1997) 256–266.

[36] P. Wilding, M.A. Morgan, A.E. Grygotis, M.A. Shoffner, E.R. Rosato, Application of backpropagation neural networks to diagnosis of breast and ovarian cancer, Cancer Lett. 77 (1994) 145–153.

[37] H. Yin, N.M. Allinson, On the distribution and convergence of feature space in self-organizing maps, Neural Comput. 7 (1995) 1178–1187.