
Hirsutism: A multivariate approach of feature selection and classification


Chemometrics and Intelligent Laboratory Systems, 5 (1989) 335-341

Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

Original Research Paper

Hirsutism: A Multivariate Approach of Feature Selection and Classification

C. ARMANINO *, S. LANTERI and M. FORINA

Istituto di Analisi e Tecnologie Farmaceutiche ed Alimentari, via Brigata Salerno, I-16147 Genova (Italy)

A. BALSAMO, M. MIGLIARDI and G. CENDERELLI

Divisione di Endocrinologia, Ospedale Mauriziano ‘Umberto I’, Corso Turati 46, I-10128 Torino (Italy)

(Received 2 February 1988; accepted 11 November 1988)

ABSTRACT

Armanino, C., Lanteri, S., Forina, M., Balsamo, A., Migliardi, M. and Cenderelli, G., 1989. Hirsutism: a multivariate approach of feature selection and classification. Chemometrics and Intelligent Laboratory Systems, 5: 335-341.

Supervised pattern recognition methods were applied to the results of seven hormonal tests from a population of twenty-six healthy subjects and one hundred and seven women affected by hirsutism, in order to study the discriminant information from analytical data.

Eigenvector projection and raw Varimax rotation, a stepwise multivariate method of feature selection based on quadratic discriminant analysis, the classification methods of k-nearest neighbours and quadratic discriminant analysis were applied. The prediction ability of the multivariate normal models, built by five selected variables (testosterone-estradiol binding globulin, dehydroepiandrosterone sulphate, estrone, salivary testosterone, 17β-estradiol) was 87.5%.

Hierarchical clustering was carried out on the analytical data from the group of hirsute patients: two principal clusters and one singleton were identified.

0169-7439/89/$03.50 © 1989 Elsevier Science Publishers B.V.

INTRODUCTION

Hirsutism may be defined as the appearance of excessive hair in normal and abnormal sites in the female. To most females, hirsutism is extremely disturbing and a threat to their sense of femininity. The difficult problem for the endocrinologist, when approaching women who complain of excess hair, is knowing which patient to evaluate and which just to reassure [1].

The causes of hirsutism are multiple: it may be inherited or acquired, or it may be secondary to disorders of the hypothalamus, pituitary, thyroid, ovary or adrenal cortex, or it may be idiopathic. The sebaceous glands and sexual hair follicles together form a functional unit, the activity of which is governed by the inherently cyclic nature of the hair follicle and its sex hormone dependence. Human skin and hair follicles are endowed with specific androgen and estrogen receptors. The pilosebaceous apparatus is highly reflective of androgen activities in the skin and plays an important function in androgen metabolism. In fact, it converts steroid precursors to compounds having androgen activity. In hirsute women these metabolic conversions appear to be enhanced [2].

Androgens play a determinant role in the pathogenesis of hirsutism, but it is difficult to verify whether this result is due to greater ovarian and/or adrenal production, whether neoplastic or not, or if it is due to reduced plasma concentration of binding proteins, or to enhanced peripheral conversions of hormonal precursors, as well as to an excessive sensitivity of target organs to normal hormonal levels [3,4].

The evaluation of the different stages (androgen production, transport and metabolism), a basic point for correct diagnosis, is not easy to do in practice. An attempt has been made to obtain sufficient information by determining, on the basis of an analysis of peripheral blood, a great number of hormonal parameters, some of which in hirsute patients have different levels from the normal female population. Some of these tests contain diagnostically useful information, whereas others are irrelevant, containing the same, duplicate, information and/or noise. In a preliminary study [5], in which our principal aim was to omit tests of low discriminatory power, we verified by a univariate criterion of feature selection (Fisher weights) that the determinations of two hormones (17α-hydroxyprogesterone and androstenedione) were useless in discriminating between the groups of normal and hirsute women under study.

In this study we have applied multivariate methods of feature selection and data analysis to the results of the hormonal tests that are usually carried out in our laboratory to formulate a diagnosis of hirsutism, in order to discover the subset of the measured variables by which the optimum boundaries between women affected by hirsutism and normal control subjects could be detected.

Cluster analysis was then applied to the class of hirsute women, to study this class in detail and, possibly, to define groups having different degrees of hirsutism.

EXPERIMENTAL

Subjects

One hundred and seven female patients having various degrees of hirsutism were studied. These patients were clinically examined and, in each case, their menstrual cycle characteristics, ovarian morphology (evaluated by ultrasonography), age at onset and manner of development of hirsutism were evaluated. Twenty-six normal and healthy women were also studied as a control group: their age and menarchal characteristics were similar to those of the group of hirsute patients.

Test procedures

Samples: plasma samples were obtained from venous peripheral blood, collected in the early follicular phase, at least one hour after waking. Simultaneously, mixed and unstimulated saliva samples were collected from each subject.

Hormonal assays. The plasma concentrations of the following steroid hormones were measured by radioimmunoassay (RIA): (a) testosterone (T), 17β-estradiol (E2) and estrone (E1), after ether extraction; (b) dehydroepiandrosterone sulphate (DHEA-S), after sample dilution with Emagel (Behring Institute).

Salivary testosterone concentrations (ST) were determined by radioimmunoassay, after ether extraction.

Testosterone-estradiol binding globulin (TeBG) plasma levels were measured by saturation analysis with 3H-dihydrotestosterone.

The free (unbound) testosterone plasma quota (fT) was calculated from the plasma concentrations of T and TeBG.

Packages and statistical methods

Data analysis and graphics were carried out by the PARVUS package [6]. We applied one non-parametric and one parametric classification method: k-nearest neighbour (kNN) and quadratic discriminant analysis (QDA) [7,8]. The results are presented according to their classification and prediction abilities: randomly subdividing the data matrix into a training and a prediction set, the training set objects are used to find the mathematical rules separating the classes, and the classification ability is computed as the percentage of the training set samples that are correctly classified. Prediction ability is computed by the percentage of the prediction set objects that are correctly classified by the classification rule previously developed.

The kNN classification rule is based on the computation of the interobject distance matrix. This technique does not require training (recognition ability is, by definition, 100%) and one object at a time (prediction set) is classified.

The simple rule of kNN is very useful in many practical problems, where the assumption that the forms for the underlying density functions are multinormal is very often not verified, so it can be used to validate the application of parametric techniques when the assumption of multinormality is doubtful. We used k = 5 and Euclidean distance.
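The kNN rule with one object classified at a time can be sketched as follows. This is a minimal illustration in plain Python, not the PARVUS implementation: the function name `knn_leave_one_out`, the majority-vote tie handling and the toy data are assumptions made for the example. The default is k = 5 as in the text; the toy call uses k = 3 only because the example has four objects per class.

```python
import math
from collections import Counter

def knn_leave_one_out(X, y, k=5):
    """Fraction of objects classified correctly when each object is
    assigned the majority class of its k nearest neighbours among all
    the other objects (Euclidean distance)."""
    correct = 0
    for i, xi in enumerate(X):
        # Distances from object i to every other object, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), y[j]) for j, xj in enumerate(X) if j != i
        )
        votes = Counter(label for _, label in neighbours[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == y[i]
    return correct / len(X)

# Hypothetical autoscaled data: two well-separated groups of four objects.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (3.0, 3.1), (3.2, 2.9), (2.9, 3.3), (3.1, 3.0)]
y = [1, 1, 1, 1, 2, 2, 2, 2]
accuracy = knn_leave_one_out(X, y, k=3)
```

Because no model is fitted, the only quantity of interest is this prediction ability; every object serves once as the prediction set.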

The QDA implemented on PARVUS is a classification and modelling technique, based on the hypothesis of normal multivariate distribution and the Bayes theorem. In this study the a priori probabilities and the loss factors of Bayes’ theorem were assumed to be equal to one since the actual numbers were not known. This technique, stepwise implemented [9], was used for multivariate feature selection. Briefly, the data set is subdivided into training and evaluation (25% of objects) sets and the statistical parameters are computed using only the training set. At each step each non-selected variable is added to the previously selected ones, QDA is carried out and, for each non-selected variable, classification plus prediction errors are computed; the variable giving the minimum number of classification plus prediction errors is selected. The selection procedure continues until, by adding one of the remaining variables, the number of errors does not decrease.
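The stepwise selection loop described above can be sketched independently of the classifier. In this minimal illustration the QDA step is abstracted as a callable `count_errors(subset)` that is assumed to return the number of classification plus prediction errors for a given variable subset; QDA itself is not implemented here, and the function names, variable names and error counts in the toy example are hypothetical.

```python
def forward_select(candidates, count_errors):
    """Greedy forward selection: at each step add the variable that gives
    the fewest errors; stop when no addition lowers the error count."""
    selected = []
    best_errors = None
    remaining = list(candidates)
    while remaining:
        # Score every not-yet-selected variable added to the current subset.
        scored = [(count_errors(selected + [v]), v) for v in remaining]
        errors, variable = min(scored, key=lambda t: t[0])
        if best_errors is not None and errors >= best_errors:
            break  # no remaining variable decreases the number of errors
        selected.append(variable)
        remaining.remove(variable)
        best_errors = errors
    return selected, best_errors

def toy_errors(subset):
    # Made-up error counts: variables "b", "c", "d" each remove some
    # errors, variable "a" removes none.
    gain = {"b": 12, "c": 5, "d": 8}
    return 40 - sum(gain.get(v, 0) for v in subset)

selected, errors = forward_select(["a", "b", "c", "d"], toy_errors)
```

In the toy run the variables are picked in order of decreasing gain and the loop stops when the uninformative variable is the only one left.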

RESULTS AND DISCUSSION

Pretreatments and feature selection

The data matrix (Table 1) was formed by 133 objects and 7 variables, and it was subdivided into two categories: Hirsutism (category 1) and Controls (category 2).

Within each category the distributions of the variables were examined by means of histograms: in the Hirsutism category DHEA-S, T, E1 and E2 showed an approximately log-normal distribution, while SHBG, ST and fT had a bimodal distribution. Within the Controls category all the variables had a normal distribution: the normality hypothesis of the Lilliefors test [10] was passed, at significance level > 15%, for each variable. Therefore, to preserve the normality of the Controls category, we chose not to apply any variable transformations.

TABLE 1

Data matrix

Categories

Index  Name       Object number
1      Hirsutism  107
2      Controls   26

Variables

Index  Name
1      Testosterone (T)
2      Salivary testosterone (ST)
3      Testosterone-estradiol binding globulin (TeBG)
4      Estrone (E1)
5      17β-Estradiol (E2)
6      Dehydroepiandrosterone sulphate (DHEA-S)
7      Free testosterone (fT)

The raw data were standardized by autoscaling (column centering plus column standardization). The correlation coefficient matrix was computed to study the relationships among variables: the three determinations of testosterone were significantly correlated (T and ST: r = 0.52, T and fT: r = 0.84, ST and fT: r = 0.64), a negative correlation (r = -0.53) was found between fT and TeBG, and variables E1 and E2 were also found to be significantly correlated (r = 0.65).

Five variables were selected by multivariate feature selection: TeBG, DHEA-S, E1, ST, E2. The program was forced to select ST instead of T or fT, since the number of classification and prediction errors was practically equal when adding each one of them (15.0% of classification + prediction errors when adding T, as against 15.8% when adding ST or fT). ST was preferred for some practical considerations on the test: in fact, the saliva samples are easily collected by a non-invasive technique, and ST is more easily and quickly determined than fT. Besides, several studies [11] have shown that the salivary concentration of T is a good index of the free (unbound) plasma quota.
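The autoscaling and correlation steps described in this subsection can be sketched with the standard library alone. The sketch assumes the sample standard deviation (divisor n - 1); the text does not state which divisor PARVUS uses. The function names and the short columns of numbers are illustrative and are not data from the study.

```python
from statistics import mean, stdev

def autoscale(column):
    """Centre a column on its mean and divide by its standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

def correlation(a, b):
    """Pearson correlation of two columns, computed from autoscaled values."""
    za, zb = autoscale(a), autoscale(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

# Hypothetical measurements of three variables on four objects.
v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [2.0, 4.0, 6.0, 8.0]   # proportional to v1
v3 = [4.0, 3.0, 2.0, 1.0]   # decreasing with v1

r_12 = correlation(v1, v2)
r_13 = correlation(v1, v3)
scaled = autoscale(v1)
```

After autoscaling every variable has zero mean and unit standard deviation, so variables measured in different units contribute on the same scale to distances and eigenvectors.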

The eigenvectors of the generalized covariance matrix were computed on the reduced data matrix: 133 objects, 5 selected variables. The eigenvector projection in Fig. 1 retains 69.7% of total variance (41.7% on eigenvector 1) and it displays the Controls category grouped in a smaller area of the plot than the Hirsutism class; the plot also shows an overlap between the scores for Controls and Hirsutism. On the same figure the loadings of the selected variables on eigenvectors 1 and 2 have been drawn by the variable indices.
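The percentage of variance retained by the first eigenvectors is the ratio of the corresponding eigenvalues to the trace of the matrix. The sketch below illustrates this with power iteration and deflation on a small symmetric matrix. It assumes that the matrix is the correlation matrix of the autoscaled variables; the helper names and the 3 x 3 matrix `R` are hypothetical, and this is not the routine used in PARVUS.

```python
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def leading_eigenpair(A, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric,
    positive semi-definite matrix."""
    v = [float(i + 1) for i in range(len(A))]  # arbitrary start vector
    for _ in range(iterations):
        w = mat_vec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(A, v)))
    return eigenvalue, v

def variance_retained(R, n_components=2):
    """Fraction of total variance carried by each leading eigenvector."""
    total = sum(R[i][i] for i in range(len(R)))  # trace = total variance
    A = [row[:] for row in R]
    fractions = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(A)
        fractions.append(lam / total)
        # Deflate: subtract the component just found.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(len(A))]
             for i in range(len(A))]
    return fractions

# Hypothetical correlation matrix with eigenvalues 1.6, 1.0 and 0.4.
R = [[1.0, 0.6, 0.0],
     [0.6, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
fractions = variance_retained(R, n_components=2)
```

For this toy matrix the first two eigenvectors retain 1.6/3 and 1.0/3 of the total variance. A production analysis would normally use a library eigensolver rather than power iteration.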

Fig. 1. Eigenvector projection of the samples (1: Hirsutism, 2: Controls). The variable loadings are reported by indices: 1 = salivary testosterone, 2 = TeBG, 3 = estrone, 4 = 17β-estradiol, 5 = DHEA-S.

Fig. 2. Scatter plot of scores and loadings of the variables of Fig. 1 after raw Varimax rotation of loadings.

After a raw Varimax rotation of the loadings [12] of the variables on eigenvectors 1 and 2 (Fig. 2), the main direction of variation is nearly coincident with the first varivector, which is almost entirely formed by androgens (ST, TeBG and DHEA-S), while varivector 2 is formed mainly by the two estrogens.
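For two eigenvectors, a raw Varimax rotation reduces to choosing one rotation angle that maximises the variance of the squared loadings, summed over the two rotated axes, without row normalisation. The sketch below finds that angle by a simple grid search rather than by the closed-form solution usually given in textbooks. The function names and the loadings are hypothetical and are not those of the study.

```python
import math

def varimax_criterion(loadings):
    """Raw varimax criterion: sum over factors of the variance of the
    squared loadings (no Kaiser normalisation)."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        sq = [row[j] ** 2 for row in loadings]
        m = sum(sq) / p
        total += sum((s - m) ** 2 for s in sq) / p
    return total

def rotate(loadings, angle):
    """Express two-column loadings on axes rotated by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c + y * s, -x * s + y * c) for x, y in loadings]

def raw_varimax_2d(loadings, steps=9000):
    """Grid search over [0, pi/2); the criterion repeats every 90 degrees."""
    best = max((i * (math.pi / 2) / steps for i in range(steps)),
               key=lambda a: varimax_criterion(rotate(loadings, a)))
    return best, rotate(loadings, best)

# Hypothetical loadings: a simple structure (each variable on one axis)
# seen from axes turned by 30 degrees.
simple = [(1.0, 0.0), (0.9, 0.0), (0.0, 1.0), (0.0, 0.8)]
loadings = rotate(simple, -math.pi / 6)
angle, rotated = raw_varimax_2d(loadings)
```

In this toy case the search recovers the 30 degree rotation, after which each variable loads on a single axis, which is the situation described for the androgen and estrogen varivectors.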

Classification methods

By the kNN classification rule, a prediction ability of 87.2% (116 correct classifications on 133 objects) was obtained.

By QDA, a classification ability of 88.7% and a prediction ability of 84.2% were obtained in the first cycle. One object of the Controls class and four objects of the Hirsutism class were identified as outliers (confidence level > 99.5%) and discarded. In the second cycle, the classification ability was 92.3% and the prediction ability was 87.5%. Prediction abilities were computed by the leave-one-out technique [13].

Fig. 3. The homothetic ellipses (at 70%, 80%, 90% confidence level) of the QDA models of Hirsutism (1) and Controls (2) groups are projected on the 1-2 eigenvector plot.

TABLE 2

Results of kNN and of the first and second cycles of quadratic discriminant analysis

The percentage classification (c) and prediction (p) abilities within categories Hirsutism and Controls are reported.

            Quadratic discriminant analysis      kNN
            cycle I          cycle II
            c       p        c       p
Hirsutism   86.0    86.0     90.3    89.3        88.8
Controls    100.0   76.9     100.0   80.0        80.8

In Fig. 3 the homothetic ellipses of QDA class models (first cycle) are displayed by projecting them on the 1-2 eigenvector plot.

Table 2 shows the percentage classification and prediction abilities of kNN and QDA within the two categories. The two methods produce comparable results; the high prediction ability of QDA confirms the effectiveness of the classification rule.
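The per-category percentages of Table 2 can be checked against the overall abilities quoted in the text by converting each percentage back to a whole number of objects. The short sketch below does this arithmetic. It assumes that the Controls prediction ability in the first QDA cycle is 76.9% (20 of 26 objects), the value consistent with the overall 84.2%, and that the second cycle contains 103 hirsute and 25 control objects after the five outliers are discarded. The function name is illustrative.

```python
def overall_rate(per_class_percent, class_sizes):
    """Combine per-category percentages into an overall percentage,
    rounding each category to a whole number of correct objects."""
    correct = sum(round(p / 100 * n)
                  for p, n in zip(per_class_percent, class_sizes))
    return correct, 100 * correct / sum(class_sizes)

# kNN prediction: Hirsutism 88.8% of 107, Controls 80.8% of 26.
knn = overall_rate([88.8, 80.8], [107, 26])
# QDA cycle I prediction: 86.0% of 107 and 76.9% of 26 (assumed value).
qda1_pred = overall_rate([86.0, 76.9], [107, 26])
# QDA cycle II prediction: 89.3% of 103 and 80.0% of 25.
qda2_pred = overall_rate([89.3, 80.0], [103, 25])
```

The three results reproduce the 116 correct classifications out of 133 (87.2%) reported for kNN and the QDA prediction abilities of 84.2% and 87.5%.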

The results obtained on our data set show that the discrimination problem can be reduced to five variables. Indeed, using the information held by the five hormonal tests - TeBG, DHEA-S, E1, ST, E2 - and multivariate methods of classification, it is possible to discriminate between normal status and hirsutism with a mean prediction ability of about 85%.

Unsupervised pattern recognition

Within the class of hirsute patients, unsupervised pattern recognition was carried out to search for groupings of samples that were ‘similar’ in the levels of the measured hormonal tests, and, possibly, with some common clinical characteristics: e.g., we wished to verify whether or not high androgen levels correspond to excessive hair presence.

No relationship was shown by the display method (eigenvector plots); then, considering the whole of the original information, cluster analysis was applied to the seven original autoscaled variables of the 107 hirsute patients.

Hierarchical clustering [14] was used; the similarity matrix among objects was computed (the 107 * 107/2 interobject Euclidean distances), then the objects were agglomerated step-by-step by merging the two most similar objects or clusters. The average linkage procedure (weighted pair group) was the criterion of chaining objects or clusters together.
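The agglomeration procedure can be sketched in a few lines. In the weighted pair group (weighted average linkage) variant, the distance from a newly merged cluster to any other cluster is the plain mean of the distances of its two parents to that cluster. The function name and the five toy points below are hypothetical; the sketch favours clarity over speed and is not the PARVUS routine.

```python
import math

def wpgma(points):
    """Agglomerative clustering with weighted average linkage.
    Returns the list of merges as (members_a, members_b, distance)."""
    clusters = {i: (i,) for i in range(len(points))}
    # Interobject Euclidean distances, stored once per unordered pair.
    dist = {frozenset((i, j)): math.dist(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)   # two most similar clusters
        a, b = sorted(pair)
        d_ab = dist.pop(pair)
        merges.append((clusters[a], clusters[b], d_ab))
        # Weighted average update: mean of the two parent distances.
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((next_id, c))] = (
                    dist.pop(frozenset((a, c))) +
                    dist.pop(frozenset((b, c)))) / 2
        clusters[next_id] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        next_id += 1
    return merges

# Hypothetical autoscaled objects: two tight pairs and one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
merges = wpgma(points)
```

In the toy data the distant object joins the hierarchy only at the last and largest merge distance, which is how a singleton appears in a dendrogram.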

The output list of linkages and the similarity values at which the linkages occurred were represented by the dendrogram of Fig. 4: two principal clusters and one singleton were identified. Cluster 1, formed by 43 objects and shown on the left in Fig. 4, was almost homogeneous, while cluster 2, 63 objects, was subdivided into more clusters (one of which was formed by four patients, objects 89, 37, 57, with high estrogen levels). The singleton, object 58, was a patient for whom an ovarian cystoma had recently been diagnosed; the cystoma was not detectable at the sample time, one year previously. In any case, this patient was one of the five outliers detected by QDA and she does not really belong to the Hirsutism category.

Fig. 4. Clustering of the 107 hirsute patients according to their levels of the 7 hormonal variables. Dendrogram obtained by the weighted average linkage method.

Fig. 5. Eigenvector plot of the hirsute patients of cluster 1 and cluster 2.

The same grouping of clusters was obtained by using the agglomerative procedure of complete linkage, instead of average linkage. None of the clinical characteristics evaluated (degree of hirsutism, menstrual cycle characteristics, ovarian morphology, family history, age at onset and manner of development of hirsutism) was found to be at the basis of this subdivision; nevertheless, the objects of cluster 1 have lower values of scores for eigenvector 1 (Fig. 5) and, moreover, they have a normal distribution (the values of all the variables in cluster 1, except E2, passed the Lilliefors test for normality). In short, the statistical behaviour of the patients in cluster 1 is similar to that of the Control subjects, in spite of their true hirsutism.

By contrast, the objects of cluster 2 are spread at a high value of eigenvector 1 and they have a non-normal distribution in the hyperspace of variables, which is a characteristic of most of the pathological categories: nevertheless, some subjects in this cluster have a low degree of hirsutism, in spite of their high androgen levels.

CONCLUSIONS

The methods of multivariate data analysis applied allowed us to discriminate, on the basis of only five hormonal tests, between normal status and hirsutism and, moreover, to define the androgenization degree of the patients examined. This result is useful and reassuring for the physician who has to justify objectively complex treatments given for a long time, beyond the patient’s request, often motivated by aesthetic reasons.

The clinical significance of the observed clustering of the hirsute patients will be the subject of a further study, in which the abovementioned clinical characteristics will be quantified as non-parametric variables and evaluated together with the analytical data.

On the other hand, we have to add two observations about the activity of the pilosebaceous unit. In fact, there are known autonomous metabolic processes in the skin [15]; therefore the individual differences in the activity of 5α-reductase, the enzyme that converts testosterone to 5α-dihydrotestosterone, may partly explain the results of the clustering of hirsute patients in this study.

Finally, non-steroid growth factors could also interfere with the complex development of the pilosebaceous apparatus [16,17].

ACKNOWLEDGEMENTS

This work received financial support from the Education Department (MPI, 40% and 60%) and from Regione Piemonte (Grant No. 42/84). The paper was presented in part at the SCA - Scientific Computing and Automation, Amsterdam, The Netherlands, in May 1987.

REFERENCES

1 J.J. Gold, Hirsutism and virilism, in J.J. Gold (Editor), Gynecologic Endocrinology, Harper & Row, Hagerstown, MD, 1975, p. 448.

2 V.A. Randall and F.J. Ebling, Is the metabolism of testosterone to 5α-dihydrotestosterone required for androgen action in the skin? 2nd CIRD Symposium: The Role of Receptors in the Skin, Sophia Antipolis, October 1981, British Journal of Dermatology, 107 (Suppl. 23) (1982) 47-53.

3 L. Moltz and U. Schwartz, Gonadal and adrenal androgen secretion in hirsute females, in R. Horton and R. Lobo (Editors), Clinical Endocrine Metabolism, W.B. Saunders Company, London, 1986, Vol. 15, No. 2, pp. 229-305.

4 G.B. Maroulis, Evaluation of hirsutism and hyperandrogenemia, Fertility and Sterility, 36 (1981) 273-305.

5 A. Balsamo, G. Cenderelli, M. Mezzanotte, M. Migliardi and V. De Filippis, Proposta di un modello di elaborazione dati quale supporto matematico alla diagnosi di irsutismo, Quaderni di Ligand Quarterly, 1 (III) (1984) 247.

6 M. Forina, R. Leardi, C. Armanino, S. Lanteri, P. Conti and P. Princi, Parvus: an Extendable Package of Programs for Data Exploration, Classification and Correlation, Elsevier Scientific Software, Amsterdam, 1988.

7 D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman, Chemometrics: a Textbook, Elsevier, Amsterdam, 1988, pp. 395-397.

8 R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, New York, 1973, pp. 10-39.

9 M. Forina, S. Lanteri and R. Leardi, Feature selection by stepwise Bayesian analysis, COBAC IV, Graz, September 15-19, 1986, Abstracts.

10 M.A. Stephens, EDF statistics of goodness of fit and some comparisons, Journal of the American Statistical Association, 69 (374) (1974) 730.

11 D. Riad-Fahmy, G.F. Read, R.F. Walker and K. Griffiths, Steroids in saliva for assessing endocrine function, Endocrine Review, 3 (1982) 367-395.

12 R.J. Rummel, Applied Factor Analysis, Northwestern University Press, Evanston, 1970, pp. 391-393.

13 P.A. Lachenbruch, An almost unbiased method for obtaining confidence intervals for the probability of misclassification in discriminant analysis, Biometrics, 23 (1967) 639-645.

14 D.L. Massart and L. Kaufman, Hierarchical clustering methods, in P.J. Elving and J.D. Winefordner (Editors), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983, pp. 90-92.

15 P. Mauvais-Jarvis, Regulation of androgen receptor and 5α-reductase in the skin of normal and hirsute women, in R. Horton and R. Lobo (Editors), Clinical Endocrine Metabolisms, W.B. Saunders Company, London, 1986, Vol. 15, No. 2, pp. 307-317.

16 A.L. Lorincz and G. Lancaster, Anterior pituitary preparation with tropic activity of sebaceous, preputial and harderian glands, Science, 126 (1957) 124-125.

17 L.P. Woodbury, A.L. Lorincz and P. Ortega, Studies on pituitary sebotropic activity. II. Further purification of a pituitary preparation with sebotropic activity, Journal of Investigative Dermatology, 45 (1965) 364-367.