

STATISTICS IN MEDICINE, VOL. 15, 851-871 (1996)

DISEASE CLUSTERS IN STRUCTURED ENVIRONMENTS

ROGER C. GRIMSON Department of Preventive Medicine, HSC, SUNY at Stony Brook, Stony Brook, NY 11794, U.S.A.

AND

NEAL ODEN The EMMES Corporation, 11325 Seven Loch Road, Potomac, MD 20854, U.S.A.

SUMMARY This paper discusses the distributions of several cluster statistics and illustrates their utility in the analysis of disease incidence or prevalence data arising in organized or structured environments. Such environments involve primary inhabitant locations in office buildings, manufacturing plants, apartment complexes and communities of homes, for example. The methods, which include some generalizations of some existing methods as well as some new ones, will be used to test for household clustering of Trypanosoma cruzi seropositivity data.

INTRODUCTION

A continuing topic of interest in epidemiology is the ascertainment and investigation of patterns of morbidity within environments comprising small distinct groups of people or other organisms. A community of households, such as the one described in Figure 1, exemplifies such an environment. The data summarized in Figure 1 will be discussed extensively later but for present purposes it suffices to note that the 40 columns of cells indicate 40 households (including one single-cell household). Cells of the columns indicate individuals who reside in the respective households and the 'points' represent those individuals who were found to be seropositive for an infectious disease. Other examples of distinct groups wherein patterns of morbidity may be of interest are sets of sibships, litters, genetically related families, wards of patients, classrooms of students, classifications of people by job category, classifications of people by locations within a building, pens of animals, and sled rows or plots of plants. Each of these is characterized by some form of organization, and the characterizations may be made by arrays of cells such as Figure 1 wherein points represent incidence cases or prevalence cases of disease.

In an instructive illustration of studying epidemicity in a structured environment, Klauber and Angulo¹ described patterns of the infectious disease variola minor among elementary school students of two schools by expressing the school environment as a composition of rows and aisles of classroom seats, files (left, centre, right) of seats, classrooms, floor levels, shifts (distinct groups of students taught at two different periods of the day) and school. This illustrates the fact that the general type of environment described in the above paragraph can be extended to a hierarchy of configurations. A statistical search for a pattern of a disease within an environment, which involves specifics of the configuration of the physical environment, may reveal conduits of epidemicity that pertain to environmental structure; in fact, this was the case in the Klauber and Angulo study.

CCC 0277-6715/96/080851-21 © 1996 by John Wiley & Sons, Ltd.

Figure 1. The T. cruzi map based on households. Cells denote the population, points denote cases and columns represent households

The above examples emphasize the importance placed here on physical configuration. Analysis also involves the locations of people (or other organisms) within the structure. Moreover, density of people in a given physical structure is an important environmental factor. People may represent part of the geographic profile, and, in this view, they may be stationary with respect to primary location (for example, home, work place, work station, hospital room or bed). Referring to Figure 1, again, recall that the cells represent people and columns represent households. Columns with the same number of cells appear connected in these drawings. In reality, the columns are not physically connected. The order of the cells in a column is irrelevant in this configuration (Figure 1) of people in spatial structures. The randomization process introduced in this paper by which to test for clustering does not depend on rank order within columns. However, additional descriptive information may be obtained by expressing the cells (people) in a spatial unit (column) in an order according to age, or some other variable. For descriptive purposes the individuals in the columns of Figure 1 are arranged in increasing order of age, the younger people appearing above the older ones in each column.

Individuals who are not at risk for various reasons, such as immunity or (for gender specific diseases) gender, would not be represented in a configuration. The remaining individuals comprise the configured population of study, where, as mentioned, the 'points' in some cells (Figure 1) represent cases of a particular disease (or injury, etc.) of interest occurring within a specified time frame. Thus, we define a diagram such as Figure 1 as a map of cases among individuals comprising the configured population at risk. In the next section we identify finite sample spaces of maps for applicable statistical distributions considered in this paper.

This paper provides statistical methods for analysis of maps of patterns of epidemiologic events within structured environments. On the most general level we recognize that the combinatorial characteristics of epidemicity within these structured environments, succinctly represented as configured populations at risk, are appropriately modelled in accordance with the 'parent' Fermi-Dirac model (versus the Bose-Einstein and Maxwell-Boltzmann models, which are useful in epidemiology in other contexts).²,³ Briefly, the Fermi-Dirac model is an occupancy model of distinguishable cells and distinct items where items occupy cells under the condition that no more than one item can occupy any cell. This model generalizes (0, 1) matrix (matrix occupancy) models; the map need not be rectangular. Such models involving cells which can contain no more than one item are also termed binary occupancy models.

We find that the ascertainment of the ‘fixed-properties enumerator’, defined in the next section, allows for simple derivations of combinatorial distributions on many Fermi-Dirac models of interest, although the formulas have the appearance of being complicated. Thus, many scenarios involving finite organized populations experiencing a disease process can be expressed by and investigated through combinatorial models by way of the fixed-properties enumerator.

Although this paper pertains to household aggregation of disease, the methods have a wider range of applicability. We extend some results of several earlier papers and we introduce some new results.

Early examination of the distribution of disease within households was made by Pearson⁴ and Greenwood and Yule.⁵ Later, a paper by Mathen and Chakraborty⁶ provided a rigorous treatment of the topic in terms of an occupancy model. Walter⁷ introduced a test based on the number of pairs of cases within households, and applied this test to data obtained in a survey of filariasis. Smith and Pike⁸ compared and generalized some of these methods. They applied their methodology to data obtained from an epidemiologic study of the household distribution of Trypanosoma cruzi (T. cruzi),⁸,⁹ displayed in Figure 1.

We consider a collection of c geographic units (for example, households) wherein the kth unit comprises $r_k$ individuals (for example, $r_k$ members of the kth household). An individual may be assigned a value of 1 or 0 depending on whether or not the event under investigation has occurred for that individual or, equivalently, as in the maps displayed here, we simply represent a case by a point. We let T denote the total number of cases.

This comprises the map of c columns where column k has $r_k$ cells and where a total of T points appear. Patterns of events, such as the aggregation of disease incidence within households, are reflected by properties of the columns; one form of clustering, for example, is reflected by significantly few columns containing most of the T events.

In general, a map is represented by the c-tuple $(r_1, r_2, \ldots, r_c) = \mathcal{R}$. Under random assignments of T events among cells of $\mathcal{R}$, with no more than one event allowed per cell, we present null distributions for the following random variables:

E = the number of empty columns;
U = the number of columns containing a single event;
N = the number of events that are in columns containing strictly more than one event;
S = the number of columns that contain some (one or more) events;
M = the number of dense columns ('dense' is defined later).
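As a small illustration of these definitions, the sketch below counts E, U, N and S from a list giving the number of cases in each column. The function name and the toy counts are hypothetical, and M is left out because it depends on the density criterion that is only defined later.

```python
def column_statistics(cases_per_column):
    """Return E, U, N and S for a list of per-column case counts."""
    E = sum(1 for t in cases_per_column if t == 0)   # empty columns
    U = sum(1 for t in cases_per_column if t == 1)   # columns with a single case
    N = sum(t for t in cases_per_column if t > 1)    # cases in columns with 2+ cases
    S = sum(1 for t in cases_per_column if t >= 1)   # columns with at least one case
    return {"E": E, "U": U, "N": N, "S": S}


# Hypothetical toy map: five columns holding 0, 1, 3, 0 and 2 cases.
stats = column_statistics([0, 1, 3, 0, 2])
```

In this toy map the number of columns equals S + E and the number of cases equals U + N, which is the bookkeeping that the tests in the paper rely on.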

Thus, the sample space may be characterized by the set $\{\mathcal{R}, T\}$ and each of the above random variables may be viewed as a partition of this space into mutually exclusive classes of statistical events of interest. ('Statistical event' is thus different from 'epidemiologic event.') The map obtained in a specific application is the observed member of $\{\mathcal{R}, T\}$.

Mathen and Chakraborty⁶ studied the distribution of S for the case $r_k = r$; here S is interpreted as the number of households affected. Smith and Pike⁸ point out that a test based on S is equivalent to a test based on E. Clearly

$$c = S + E. \tag{1}$$

A large observed value of E (or a small observed value of S) indicates that relatively few columns contain relatively many of the events. For the purpose of depicting an epidemic, S is appealing. The first four ordinary moments of S have been given by Mathen and Chakraborty for $r_k = r$. For the more general case where column sizes may differ, we provide a formula for the gth factorial moment, more specific formulas for the first four moments, and explicit formulas for exact and approximate p-values. Our formulations are simpler than those of Mathen and Chakraborty.⁶ Smith and Pike⁸ give the mean and variance for S on $\{\mathcal{R}, T\}$, where it is not necessary that $r_k = r$. We derive an expression for the gth factorial moment for S and for the other random variables defined above. Also we indicate procedures for hypothesis testing for small and large samples, and apply them to the T. cruzi data considered by Smith and Pike⁸ (displayed in Figure 1).

Next, note that

$$T = U + N. \tag{2}$$

We will construct the distribution of U but remind users that results based on U may be expressed in terms of N. Smith and Pike state that they have derived formulas for the mean and variance of N (therefore U) but the formulas are not explicitly expressed. We shall present all exact moments of U, an exact formula for the p-value and small and large sample size approaches to data analysis.

Referring to S, Mathen and Chakraborty state that '... the study of the degree of concentration of cases in households will be important in assessing the importance of household or family common factors as well as family contact in the causation of diseases, even though it does not distinguish the latter from the former ...'.⁶ The test based on S is not sensitive to distinctions between point-source (for example, radon, contaminated food/water, asbestos) exposures and contagious processes. It seems unlikely that any test based on such simple models and with such minor conditions will be found to be especially sensitive to one particular mode of epidemicity. However, for diseases with unknown aetiology or for diseases that can be propagated by either common-source or contagious mechanisms, such sensitivity is clearly desirable. (A good example of this occurred in the early months of the AIDS epidemic. It was not clear whether or not AIDS, seen at that time as the expression of Kaposi's sarcoma in a new risk group, was acquired through point-source exposures or by personal contact. There, statistical discrimination between the two was far more important than the question of the existence of clustering, for which the answer was obvious.)¹⁰

For structured environments represented by columns of binary events, statistic U offers an especially direct interpretation of data. If the number of columns that contain one and only one event is small, then the columns tend to contain either no event or several events. If U is significantly small, then columns that contain events tend to contain more than one event. Such an occurrence explicitly reflects within-household clustering of events. This nature of clustering is consistent with what one may imagine as the result of a contagious process, but, as in the case before, it could reflect other sources of exposure as well (for example, few out of many households exposed to a contaminated food product).


Pike and Smith derived the expectation and variance for a statistic for the number of patients involved in 'links' with other patients.¹¹ Later, Smith and Pike state that the statistic for linked patients is equivalent to N (therefore U), which is the number of cases (not households) that are in households containing more than one case of the disease.⁸ In this regard, the reader may consider Walter's pairs test⁷ and a subsequent discussion of it.⁸

One can imagine other test statistics such as the number of columns that contain ≥ 2 events or the number of columns that are more than half full of events. These prove to be ill-defined in the context of column-wise clustering for varying $r_k$. However, in a subsequent section we define a 'dense' column in a meaningful way and give the distribution of the number of dense columns. This results in a powerful test for disease clustering in structured environments.

STATISTICAL PRELIMINARIES

Consider a finite sample space of A simple events. In what follows, the simple statistical events are combinatorial objects of a specified type. By way of illustration, types of combinatorial objects are combinations, permutations, binary matrices and occupancy (urn) models, to list a few. The maps that we described in the previous section may be rendered as a type of combinatorial object in order to develop an analytical methodology for studying epidemic processes, and we will do this.

For example, Figure 1 portrays a combinatorial object; it is composed of one isolated cell, 8 columns of 2 cells each, 13 columns of 3 cells each, ..., 1 column of 13 cells, for a total of 40 columns and 162 cells, and it contains 57 points (epidemiologic events). The finite sample space comprises all such pictures obtained by placing the 57 points in all possible ways on the map (configuration of cells), with no more than one point occupying any cell. The number of ways of doing this, which is the number, A, of objects in the sample space, is the binomial coefficient $\binom{162}{57}$. If the analysis excludes isolated cells, as will be discussed, then A is the analogous binomial coefficient on the remaining 161 cells.

Next, suppose that each object of the sample space possesses some combination of the properties $P_1, P_2, \ldots, P_n$ where the properties are to be given a meaningful definition. For example, if all columns of Figure 1 are labelled from 1 to 40, then interest in column emptiness engenders formulation of emptiness as the basis of property. $P_i$ would denote the property of column i being empty.

Let N(x) denote the number of objects that possess exactly x of the properties and let $N_{i_1, i_2, \ldots, i_k}$ denote the number of objects that possess the specific properties $P_{i_1}, P_{i_2}, \ldots, P_{i_k}$, together perhaps with other properties. $N_{i_1, i_2, \ldots, i_k}$ is referred to as the fixed-properties enumerator in this paper. The principle of inclusion and exclusion¹²,¹³ states that for $x \ge 1$

$$N(x) = \sum_{k \ge x} (-1)^{k-x} \binom{k}{x} \Sigma_{(k)} N_{i_1, i_2, \ldots, i_k}$$

where $\Sigma_{(k)}$ is the sum over all combinations of k of the n i's. Also

$$N(0) = A + \sum_{k \ge 1} (-1)^{k} \Sigma_{(k)} N_{i_1, i_2, \ldots, i_k}.$$

$\Sigma_{(k)} N_{i_1, i_2, \ldots, i_k}$ is referred to as the 'combinatorial sum'. For a sample space of equally likely objects, the gth factorial moment of the distribution of the random variable X, representing the number of properties, may be expressed in terms of $N_{i_1, i_2, \ldots, i_g}$:

$$E((X)_g) = A^{-1} g! \, \Sigma_{(g)} N_{i_1, i_2, \ldots, i_g}. \tag{3}$$


Derivation of (3) is not difficult using properties of Stirling numbers;¹³ although expressions equivalent to (3) are known,¹⁴ these authors feel that (3) is not widely known for its usefulness in Fermi-Dirac models.
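A brute-force check can make (3) concrete. The sketch below takes the property "column i is empty" on a hypothetical toy map, for which the fixed-properties enumerator is the number of ways to place T points in the cells outside the chosen columns, and compares the right side of (3) with the factorial moment obtained by enumerating every map. The map, the value of T and all function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial

sizes = (1, 2, 2, 3)          # hypothetical toy map: cells per column
T = 3                         # number of points to place
W = sum(sizes)
# cells[i] is the column that cell i belongs to
cells = [k for k, r in enumerate(sizes) for _ in range(r)]


def falling(x, g):
    """Falling factorial x (x - 1) ... (x - g + 1)."""
    out = 1
    for j in range(g):
        out *= x - j
    return out


def brute_moment(g):
    """g-th factorial moment of the number of empty columns, by enumeration."""
    total = 0
    for occupied in combinations(range(W), T):
        hit = {cells[i] for i in occupied}
        total += falling(len(sizes) - len(hit), g)
    return Fraction(total, comb(W, T))


def enumerator_moment(g):
    """Right side of (3): g! times the combinatorial sum, divided by A."""
    comb_sum = sum(
        comb(W - sum(sizes[i] for i in idx), T)   # points avoid the chosen columns
        for idx in combinations(range(len(sizes)), g)
    )
    return Fraction(factorial(g) * comb_sum, comb(W, T))
```

Exact rational arithmetic is used so that the two sides can be compared for equality rather than up to rounding.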

The principle of inclusion and exclusion implies that the probability of observing x or more properties in a randomly selected object from a sample space of equally likely objects may be expressed by

$$P(X \ge x) = A^{-1} \sum_{k \ge x} (-1)^{k-x} \binom{k-1}{x-1} \Sigma_{(k)} N_{i_1, i_2, \ldots, i_k} \tag{4}$$

for $x \ge 1$. For the case x = 0, we note that $P(X \ge 0) = 1$.

In this paper a map is based on a pattern of cells arranged in c columns where the ith column has $r_i$ cells. Many applications may be based on rectangular arrays (all $r_i = r$), in which case the distributions of the random variables of interest have much simpler forms. Mielke and Siddiqui¹⁵ and Grimson¹⁶ considered the rectangular model under the condition that the number of epidemiologic events in each row is fixed.

For easy reference the parameters of the models are listed as follows:

T = number of events (for example, cases);
c = number of columns (for example, households);
$r_i$ = number of cells (for example, people) in column i;
$c_j$ = number of columns with j cells;
W = total number of cells (for example, population at risk).

THE DISTRIBUTIONS OF E AND S

This section extends the Mathen-Chakraborty statistic,⁶ E, and its complementary statistic, S, by presenting analytical formulas for situations in which column sizes differ. All moments are given and formulas for exact and large sample approximate p-values are given.

By (1), E and S have essentially the same distribution. We shall discuss the more or less cosmetic differences after we derive the factorial moments of E. Recall that E( ) denotes expectation so $E((E)_g)$ and $E((S)_g)$ are the gth factorial moments of the two random variables.

The number of configurations in $\{\mathcal{R}, T\}$ that have the property of columns $i_1, \ldots, i_k$ being empty is

$$N_{i_1, \ldots, i_k} = \binom{W - r_{i_1} - \cdots - r_{i_k}}{T}, \tag{5}$$

so that, by (3), $E((E)_g) = g! \binom{W}{T}^{-1} \Sigma_{(g)} \binom{W - r_{i_1} - \cdots - r_{i_g}}{T}$.

The observed environment is represented by one of the configurations in $\{\mathcal{R}, T\}$. Suppose that the observed number of empty columns is e. Then under the condition of equally likely configurations, the p-value for testing the null hypothesis of no disease clustering within households, as reflected by e, is given by the formula obtained by substituting (5) into (4):

$$P(E \ge e) = \binom{W}{T}^{-1} \sum_{k \ge e} (-1)^{k-e} \binom{k-1}{e-1} \Sigma_{(k)} \binom{W - r_{i_1} - \cdots - r_{i_k}}{T}. \tag{6}$$
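The exact p-value for E can be sketched in code. Summing over every set of k columns directly becomes expensive for a map with 40 columns, so this illustration first counts, by a small dynamic program, how many k-subsets of columns contain s cells in total; that grouping is a computational convenience assumed here, not a step taken from the paper. The function name and its arguments are hypothetical.

```python
from fractions import Fraction
from math import comb


def p_value_empty(sizes, T, e):
    """P(E >= e) under random allocation of T points, by inclusion-exclusion."""
    c, W = len(sizes), sum(sizes)
    if e <= 0:
        return Fraction(1)
    # ways[k][s] = number of k-subsets of columns whose sizes add up to s
    ways = [[0] * (W + 1) for _ in range(c + 1)]
    ways[0][0] = 1
    for r in sizes:
        for k in range(c - 1, -1, -1):        # descending k: each column used once
            for s in range(W - r, -1, -1):
                if ways[k][s]:
                    ways[k + 1][s + r] += ways[k][s]
    total = 0
    for k in range(e, c + 1):
        # combinatorial sum: all k chosen columns empty, points placed elsewhere
        comb_sum = sum(ways[k][s] * comb(W - s, T) for s in range(W + 1))
        total += (-1) ** (k - e) * comb(k - 1, e - 1) * comb_sum
    return Fraction(total, comb(W, T))
```

A call such as `p_value_empty(sizes, T, e)` takes the list of column sizes, the number of cases and the observed number of empty columns, and returns an exact fraction.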


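The expected number of empty columns is

$$E(E) = E((E)_1) = (W)_T^{-1} \sum_{i=1}^{c} (W - r_i)_T = (W)_T^{-1} \sum_{i \ge 1} c_i (W - i)_T. \tag{7}$$

The last step accords with the definition of $c_i$ given at the end of the previous section. Also, from (5)

$$E(E) = \sum_{i=1}^{c} \frac{(W - T)_{r_i}}{(W)_{r_i}} = \sum_{i \ge 1} c_i \frac{(W - T)_i}{(W)_i}. \tag{8}$$

The second factorial moment of E is

$$E((E)_2) = 2 \, (W)_T^{-1} \sum_{i < j} (W - r_i - r_j)_T. \tag{9}$$

Alternatively, again from (5)

$$E((E)_2) = 2 \sum_{i < j} \frac{(W - T)_{r_i + r_j}}{(W)_{r_i + r_j}}. \tag{10}$$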

The difference in appearance between (7) and (8) and between (9) and (10) arises from the identity obtained from interchanging each occurrence of T and $(r_1 + \cdots + r_g)$ in (5). (Such reciprocity holds for the other distributions considered on $\{\mathcal{R}, T\}$ as well.) Where $2r_i$ is less than T, as typically would be the case, less computation is involved in computing the first two moments using (8) and (10) rather than (7) and (9), especially for large T, because in (8) and (10) the falling factorials contain fewer factors.
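A minimal sketch of how the first two moments of E might be computed for unequal column sizes. It uses ratios of binomial coefficients, which equal the falling-factorial ratios in the text, and combines them into the variance through the mean and the second factorial moment. The function name and the toy inputs in the usage line are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def mean_var_empty(sizes, T):
    """Exact mean and variance of the number of empty columns E."""
    W = sum(sizes)
    A = comb(W, T)
    # P(column i is empty) summed over columns
    mean = Fraction(sum(comb(W - r, T) for r in sizes), A)
    # second factorial moment: twice the sum over unordered pairs of columns
    second = Fraction(
        2 * sum(comb(W - ri - rj, T) for ri, rj in combinations(sizes, 2)), A
    )
    variance = mean * (1 - mean) + second
    return mean, variance


# Hypothetical usage on a toy map with four columns and three cases.
mean_E, var_E = mean_var_empty((1, 2, 2, 3), 3)
```

A Z-score for an observed e could then be formed as (e - mean) / sqrt(variance), bearing in mind the later caution that the normal approximation can be poor for rare events.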

Formula (7) above is equivalent to formula (13) of Smith and Pike.⁸ The variance of E

$$\mathrm{var}(E) = E(E)(1 - E(E)) + E((E)_2) \tag{11}$$

where E(E) and $E((E)_2)$ are given by (8) and (10), respectively, is equivalent to expression (14) in their paper. Smith and Pike provide complicated expressions for the cases g = 3 and g = 4 which require complex programming for application.⁸ Expression (5) is evidently the more tractable.

The randomization scheme of Smith and Pike incorporates strata in order to control for confounding effects of covariates such as age. Suppose there are $W_j$ people in the jth stratum and suppose that $T_j$ of them are cases ($W = \sum_j W_j$ and $T = \sum_j T_j$). These authors form the null distribution of E by randomly assigning $T_j$ of the $W_j$ people to be cases in the jth stratum. In effect, one value of E is calculated across all strata for each randomization trial of the data. The distribution of E is then obtained on the basis of all possible equally likely combinations of assignments.
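The stratified scheme just described can be approximated by simulation. The following is a rough Monte Carlo sketch, not Smith and Pike's own program: within each stratum the observed number of cases is reassigned at random, and E is recomputed across all households for each trial. The data layout (three parallel lists), the function name and the default number of trials are assumptions.

```python
import random


def stratified_null_E(household, stratum, is_case, n_trials=2000, seed=0):
    """Simulated null values of E when case labels are permuted within strata."""
    rng = random.Random(seed)
    n_households = len(set(household))
    # group person indices by stratum
    by_stratum = {}
    for idx, s in enumerate(stratum):
        by_stratum.setdefault(s, []).append(idx)
    results = []
    for _ in range(n_trials):
        affected = set()
        for members in by_stratum.values():
            t_j = sum(is_case[i] for i in members)     # cases stay fixed per stratum
            for i in rng.sample(members, t_j):
                affected.add(household[i])
        results.append(n_households - len(affected))   # empty households this trial
    return results
```

An empirical p-value would be the fraction of simulated values that are at least as large as the observed number of empty households.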

The issue of whether or not controlling for size is necessary depends on properties of the epidemiologic 'event' (for example, disease), such as whether or not incidence is dependent on column (for example, household) size. As a rough guide, if a Z-score obtained by the stratified approach differs 'much' from that of the unstratified approach, then the stratified analysis probably has controlled, to some extent, for effects of confounding.


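In the case of equal-sized columns (for example, households), where $r_1 = \cdots = r_c = r$, (5) yields the following simple result for r × c (0, 1) matrices:

$$E((E)_g) = (c)_g \binom{rc - rg}{T} \binom{rc}{T}^{-1}. \tag{12}$$

Mathen and Chakraborty⁶ derived the first four moments of the distribution of the complementary statistic, S, but (12) with g = 1, 2, 3 and 4 is much simpler.

Therefore, the mean, variance and reverse cumulative distribution (exact p-value) for the distribution of empty columns, E, for random allocations of T like objects in an r × c matrix of cells with a limit of no more than one object per cell are, respectively,

$$E(E) = c \binom{rc - r}{T} \binom{rc}{T}^{-1}, \tag{13}$$

$$\mathrm{var}(E) = E(E)(1 - E(E)) + c(c - 1) \binom{rc - 2r}{T} \binom{rc}{T}^{-1}, \tag{14}$$

$$P(E \ge e) = \binom{rc}{T}^{-1} \sum_{k=e}^{c} (-1)^{k-e} \binom{k-1}{e-1} \binom{c}{k} \binom{rc - kr}{T}. \tag{15}$$

Derivation of (15) is a straightforward application of (4) on (12) where e is an observed value of E.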

Mathen and Chakraborty argue that the distribution of S (hence E) tends to normality with increasing c and that this convergence to normality is at its slowest when T is very small or very large. Thus for rare events (small T ) the normal approximation may be inadequate even for a fairly large number c of geographic units.

With care, analysis may be simplified by a binomial approximation based on replacement sampling. Here the probability of an event occurring in a given cell is T/rc. Then the parameter (probability) of the statistical event of interest (an empty column) is $(1 - T/rc)^r$, so a binomial approximation of (15) is

$$P(E \ge e) \approx \sum_{i=e}^{c} \binom{c}{i} \left(1 - \frac{T}{rc}\right)^{ri} \left(1 - \left(1 - \frac{T}{rc}\right)^{r}\right)^{c-i}. \tag{16}$$

The gth factorial moment for this binomial distribution is

$$E((E)_g) \approx (c)_g \left(1 - \frac{T}{rc}\right)^{rg}, \tag{17}$$

the replacement analogue of (12).

Generally, for small values of the parameters r, c and T, approximations (16) and (17) are poor formulas for data analysis; results tend to be very conservative.

Using (1), properties of the distribution of S are directly found from the above discussion of E.
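To see how the exact and the replacement-based calculations compare on a given rectangular map, both can be evaluated side by side. The sketch below assumes equal column sizes r, c columns and T cases; the exact tail follows the inclusion-exclusion argument for empty columns and the approximate tail is an ordinary binomial tail with success probability (1 - T/rc)^r. Function names are illustrative.

```python
from math import comb


def exact_p(r, c, T, e):
    """Exact P(E >= e) for an r x c map with T points, at most one per cell."""
    if e <= 0:
        return 1.0
    total = sum(
        (-1) ** (k - e) * comb(k - 1, e - 1) * comb(c, k) * comb(r * c - k * r, T)
        for k in range(e, c + 1)
    )
    return total / comb(r * c, T)


def binomial_p(r, c, T, e):
    """Binomial (sampling with replacement) approximation to P(E >= e)."""
    p_empty = (1 - T / (r * c)) ** r          # probability that a column is empty
    return sum(
        comb(c, i) * p_empty ** i * (1 - p_empty) ** (c - i)
        for i in range(e, c + 1)
    )
```

Evaluating the two functions over a range of e for small r, c and T gives a direct view of how far the approximation departs from the exact tail.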

However, some useful formulas regarding the distribution of S derived from (3) and (4) are simply listed as follows:

$$E(S) = c - \frac{1}{(W)_T} \sum_{i \ge 1} c_i (W - i)_T$$


Alternatively,

Mathen and Chakraborty⁶ found the first four factorial moments of S where $r_i = r$, but (21) provides simpler expressions of these moments:

Recall that S is the number of columns containing one or more events. Recognizing that small values of S are in the direction of clustering, (15) may be used to obtain the exact cumulative distribution:

$$P(S \le s) = P(E \ge e) \quad \text{where } e = c - s. \tag{26}$$

THE DISTRIBUTIONS OF U AND N

Smith and Pike⁸ suggest U as a test statistic for use in detecting aggregation of disease in households, and they observe the relation (2). Significantly small values of U indicate clustering of cases within households; households would then tend to have either no case or more than one case. This is equivalent to observing a significantly large number, N, of cases in households that contain two or more cases. These authors state that U is equivalent to another test statistic whose distribution has been explored by Pike and Smith.¹¹ This statistic is the number of patients involved in 'links' with other patients, that is, the number of pairs of cases within households. The pair statistic is the basis of Walter's test but Walter⁷ considered a different randomization scheme; see Smith and Pike.⁸

The expected number of links and its variance are obtained from formulas that Pike and Smith produce in accordance with a randomization scheme whereby fixed numbers of persons from each of several strata are randomly selected, originating a null distribution of the number of patients involved in links among and within strata.¹¹


However, the formulas for the mean and variance are extraordinarily complicated; at the time of their paper, Pike and Smith made available upon request FORTRAN IV subroutines for the mean and variance for restricted special circumstances and they also offered a simulation program. They suggest that the null distribution for significance testing be approximated by simulation or by a chi-square test of one degree of freedom based on the mean and variance; they offer evidence that the number of persons involved in links is normally distributed.

This section provides the distribution of U (gth moment and p-value), hence of N . (As in the case of S and E the factorial moments differ for U and N, although measures of skewness and kurtosis, based on factorial moments, are the same except for the sign of skewness.)

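The number of arrangements of T cases in $\mathcal{R}$ such that columns $i_1, \ldots, i_k$ contain precisely one case is

$$N_{i_1, \ldots, i_k} = r_{i_1} \cdots r_{i_k} \binom{W - r_{i_1} - \cdots - r_{i_k}}{T - k}.$$

Therefore

$$E((U)_g) = g! \, (T)_g \, \Sigma_{(g)} \, r_1 r_2 \cdots r_g \frac{(W - T)_{r_1 + \cdots + r_g - g}}{(W)_{r_1 + \cdots + r_g}}. \tag{27}$$

Also

$$E((U)_g) = g! \, \frac{(T)_g}{(W)_T} \, \Sigma_{(g)} \, r_1 r_2 \cdots r_g \, (W - r_1 - \cdots - r_g)_{T - g}$$

but (27) usually offers simpler computation.

A significantly small observed value of U, say u, suggests clustering. The cumulative distribution of U, which can be used to test for small values, may be obtained by applying (4) to (27) on the right side of

$$P(U \le u) = 1 - P(U \ge u + 1). \tag{28}$$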

This gives

and, alternatively,

From (27)


and

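From (30) or (31) the variance of U may be computed:

$$\mathrm{var}(U) = E(U)(1 - E(U)) + E((U)_2).$$

For the case of equal column sizes ($r_i = r$), (27) becomes

$$E((U)_g) = (c)_g \, r^g \, (T)_g \frac{(W - T)_{g(r-1)}}{(W)_{gr}} \tag{32}$$

which is, notably, a simple expression for the gth moment. Specifically the mean and variance of U are

$$E(U) = c \, r \, T \frac{(W - T)_{r-1}}{(W)_r} \tag{33}$$

and

$$\mathrm{var}(U) = E(U)(1 - E(U)) + c(c - 1) \, r^2 \, T(T - 1) \frac{(W - T)_{2r-2}}{(W)_{2r}}. \tag{34}$$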

Using (4), (28) and (32), an expression for the exact p-value may be obtained for the case r = ri :

A simpler expression is based on the binomial distribution obtained from sampling with replacement:

We express here the same concerns mentioned regarding the analogous binomial approximation, (16); analysis will give conservative results for small samples. However, we have noted that the conservatism here appears not as serious as for the case of E (and hence S).

From (2), some of the above expressions for the distribution of U are easily used for N:

$$E(N) = T - E(U)$$

$$\mathrm{var}(N) = \mathrm{var}(U).$$

The exact or approximate expression for testing for large values of n (that is, clustering) may be obtained from (29), (35) or (36) by noting that

$$P(N \ge n) = P(U \le u), \quad \text{where } u = T - n. \tag{37}$$
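For maps of moderate size the null distribution of U, and hence of N = T - U, can also be tabulated directly. The sketch below does this with a dynamic program over columns rather than with the inclusion-exclusion formulas of this section; it is offered only as an independent numerical check, and the function name and state layout are assumptions.

```python
from fractions import Fraction
from math import comb


def single_case_distribution(sizes, T):
    """Exact null distribution {u: P(U = u)} for a map with the given column sizes."""
    # state[(t, u)] = number of ways to fill the columns processed so far
    # with t cases in total, u of those columns holding exactly one case
    state = {(0, 0): 1}
    for r in sizes:
        nxt = {}
        for (t, u), ways in state.items():
            for j in range(min(r, T - t) + 1):         # j cases placed in this column
                key = (t + j, u + (1 if j == 1 else 0))
                nxt[key] = nxt.get(key, 0) + ways * comb(r, j)
        state = nxt
    A = comb(sum(sizes), T)
    dist = {}
    for (t, u), ways in state.items():
        if t == T:                                      # keep only complete maps
            dist[u] = dist.get(u, 0) + Fraction(ways, A)
    return dist


def lower_tail(dist, u):
    """P(U <= u), which equals P(N >= T - u)."""
    return sum(p for value, p in dist.items() if value <= u)
```

The lower tail of U at u is then the p-value for observing N at least as large as T - u.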

DENSE COLUMNS

Under the null hypothesis based on random allocation of incidence events to arrays of unequal column sizes, a large column is expected to receive more events than a small column is expected to receive. This indicates that interpretations of analytical results obtained from tests based on E, U, N and S may be enhanced by inspecting the sizes of columns that the statistics E, U, N or S do and do not enumerate. For example, consider the application of E to a map of c/2 columns of 3 cells and c/2 columns of 10 cells. One would expect most of the empty columns to occur among the 3-cell columns. If an observed value of E is significantly large mainly due to a large number of empty 10-cell columns, the test results may be regarded as 'conservative' in the special sense invoked by comparison with the opposite situation where, for the same observed value of E, a larger proportion of the 3-cell columns are empty. The latter would be the case if, for instance, under the alternative hypothesis large unit density or size contributes to incidence. However, distributions of E, U, N and S do incorporate (account for) the distributions of column sizes, $c_j$, and the test statistics have meaningful interpretations.

Nevertheless, varying column sizes cease to be an issue in interpretation if an integer $d_h$ can be found for a column of h cells (h = 2, 3, ...) such that if the number of events in a column of h cells is $\ge d_h$, then that column is accepted as 'dense'. Furthermore, a large number of dense columns must logically accord with a notion of clustering. Criteria for rationally defining 'dense' as a function of h will be proposed in this section.

A dense column is not necessarily a cluster; clustering occurs in an environment if the number of dense columns is significantly larger than expected under randomization. Caveats and benefits of this notion of clustering will be articulated after definitions of $d_h$ are proposed.

The number of maps in the sample space such that columns $i_1, \ldots, i_k$ have $\ge u_{i_1}, \ldots, \ge u_{i_k}$ events, respectively, is given by expression (38).

We then specify that

$$u_{i_j} = d_h \quad \text{if column } i_j \text{ has } h \text{ cells} \tag{39}$$

where $d_h$ is the density criterion to be derived later by expressions (44a) or (44b). Also, as before, we define

$$c_k = \text{number of columns with } k \text{ cells}. \tag{40}$$

Then, letting M denote the random variable for the number of dense columns, we have from (3) and (38)

Therefore, the expected value of M is given by (41) with g = 1:


where in the last step we use (39) and (40). Similarly, (41) with g = 2 is the second factorial moment which simplifies to

Next, we consider the specification of $d_h$. The (replacement) probability that both cells of a two-cell column receive events is $(T/W)^2$. For a 2 × c array (h = 2 for all columns), a significantly large number of occurrences of such full columns would indicate clustering of events within columns. Without equivocation, this implies that $d_2$ be set equal to 2. In the general case the objective is to define values of $d_h$ (h = 2, 3, ...) for which all columns have, as nearly as possible, the same probability of being dense. This is based on the conceptually simple two-cell column. We accept as a criterion that a column of h cells is dense if it contains $\ge d_h$ events, where $d_h$ is that integer satisfying expression (44a).

The interpretation is that a column with h > 2 cells has approximately the same probability of realizing $\ge d_h$ events as a two-cell column has of being full.

However, integer values of $d_h$ that satisfy (44a) for various specifications of the parameters h, T and W seldom exist. Suppose the left side of (44a) is $> (T/W)^2$ for $d_h = d$, and it is $< (T/W)^2$ for $d_h = d + 1$, where d is an integer. Then if d is used as the criterion for density, columns of h cells have a higher probability of being dense than the standard two-cell column has of being full; if d + 1 is used, then columns of h cells have a lower probability of being dense than the two-cell column has of being full. For the various values of h in an application, one may consider defining $d_h$ as d or d + 1, depending on which of these two integers renders the left side of (44a) nearest to $(T/W)^2$. The left side of (44a) may be replaced by the incomplete beta function, for which $d_h$ may be explicitly derived as a rational number; still, $d_h$ must be rounded to its nearest integer in applications because it specifies a column property by a number of cells. Computer programs for generating binomial distributions, or incomplete beta functions, can be used to easily derive values of $d_h$.
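The search for $d_h$ described above can be sketched in a few lines. This is a minimal sketch, not the authors' program. It assumes that the left side of (44a) is the upper binomial tail P(X ≥ $d_h$) for X ~ Binomial(h, T/W), which is what the 'replacement' probability and the mention of binomial and incomplete beta programs suggest, and it picks the integer whose tail lies nearest to $(T/W)^2$. The names `binom_tail`, `dense_threshold` and `count_dense` are hypothetical, as is the restriction of the search to 2 ≤ d ≤ h.

```python
from math import comb


def binom_tail(h, d, p):
    """P(X >= d) for X ~ Binomial(h, p); assumed form of the left side of (44a)."""
    return sum(comb(h, j) * p**j * (1 - p) ** (h - j) for j in range(d, h + 1))


def dense_threshold(h, T, W):
    """Integer d in 2..h whose binomial tail is nearest to (T/W)**2.

    The lower bound of 2 is an assumption of this sketch: it keeps a
    column holding a single event from ever being called dense.
    """
    p = T / W
    target = p**2
    return min(range(2, h + 1), key=lambda d: abs(binom_tail(h, d, p) - target))


def count_dense(sizes, events, T, W):
    """Observed M: number of multi-cell columns whose event count reaches d_h."""
    return sum(
        1
        for h, k in zip(sizes, events)
        if h >= 2 and k >= dense_threshold(h, T, W)
    )


if __name__ == "__main__":
    T, W = 57, 161  # illustrative totals: events and cells in a map
    for h in range(2, 10):
        print(h, dense_threshold(h, T, W))
```

For a two-cell column the tail at d = 2 equals $(T/W)^2$ exactly, so the sketch returns $d_2 = 2$ for any T and W, in line with the text.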

Note that $(T/W)^a$, where a > 2, may be used instead of $(T/W)^2$ on the right side of (44a) in order to derive a different set of defining values for $d_h$. However, in that case columns of < a cells could never be dense, except in rare instances depending on values of a, T and W. For instance, full two-cell columns would not contribute to clustering, a matter which is worse than simply a loss of information; bits of information that are reasonably regarded as contributory to the evidence of clustering would by default count against clustering.

Possibly, expressions other than $(T/W)^2$ may be entertained as defining criteria for density, but these authors have not found one that possesses more satisfactory characteristics than $(T/W)^2$.

One alternative to (44a) is the definition that a column is dense if it is full. Then the test statistic M is the number of full columns. We use the notation F instead of M in this case. Sometimes this approach is reasonable, especially if the number of cells in each column is small. In this case, we have

$$\text{alternative definition: } d_h = h. \tag{44b}$$

Then by (39) and (44b)

so the gth moment for the distribution of the number, F, of full columns, where the columns may be of unequal sizes, is by (41)

Then (42) and (43) become

and

respectively. Using (4) and (45), we have

Furthermore, if all of the columns are the same size, $r_i = r$ (as would arise, for instance, in an analysis based on stratification by column size), then W = rc and (45) becomes

This expression is what we would expect on the basis of (12). The factorial moments of the distribution of the number of empty columns become, upon interchanging the events (which may be represented by 1's) and the non-events (which may be represented by 0's), the factorial moments of the distribution of full columns.
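The reciprocity between empty and full columns is easy to check numerically. The sketch below assumes that the null distribution is uniform over all C(W, T) placements of T events in W cells, at most one event per cell, so that a given column of h cells is empty with probability C(W - h, T)/C(W, T) and full with probability C(W - h, T - h)/C(W, T). These are standard hypergeometric probabilities used for illustration; they are not copied from the numbered expressions, and the function names are hypothetical. The column-size counts $c_h$ are an assumed reading of the 39 multi-person households in the T. cruzi application.

```python
from math import comb


def prob_empty(h, W, T):
    """P(a given column of h cells receives no events)."""
    return comb(W - h, T) / comb(W, T)


def prob_full(h, W, T):
    """P(a given column of h cells receives h events)."""
    return comb(W - h, T - h) / comb(W, T) if h <= T else 0.0


def expected_counts(col_counts, T):
    """Expected numbers of empty and full columns, given c_h = col_counts[h]."""
    W = sum(h * c for h, c in col_counts.items())
    exp_empty = sum(c * prob_empty(h, W, T) for h, c in col_counts.items())
    exp_full = sum(c * prob_full(h, W, T) for h, c in col_counts.items())
    return exp_empty, exp_full


# c_h for the multi-person households (assumed reading of the data table)
col_counts = {2: 8, 3: 13, 4: 5, 5: 5, 6: 4, 7: 1, 8: 1, 9: 1, 13: 1}
T = 57
W = sum(h * c for h, c in col_counts.items())

exp_empty, exp_full = expected_counts(col_counts, T)
print(round(exp_empty, 3), round(exp_full, 3))

# Reciprocity: swapping events and non-events turns "full" into "empty".
assert abs(prob_full(3, W, T) - prob_empty(3, W, W - T)) < 1e-12
```

Under these assumptions the two expectations come out close to the 'All ages' expected values reported for E and F in Table III (8.557 and 1.654).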

From (41) and (4) we find that the distribution for M may be expressed by

where $u_i$ are given by (39). Expression (50) simplifies considerably for special cases. In the case (a) that the map is rectangular ($r_i = r$), and (b) that the column density requirement is that the column be full ($d_h = r$; see (44b)), the probability of observing $\ge m$ full columns is, by (50)


Just as (49) is related by reciprocity to (12), in view of interchanging the 1's and 0's, (51) is so related to (15).

Density criterion (44b) seems best suited for rectangular maps with small r, although we do not wish to exclude the possibility that (44b) may be useful in some other special setting as well. In the remainder of this section and in the application presented later, we consider the distribution of M based on (44a).

If m (using (44a)) is significantly large, then investigators have statistical evidence that several columns are dense with events, compared with the number of relatively sparse columns. Another notion of clustering is the occurrence of a significantly large number of events in a column or in each of a few columns. In the first case, M is a count of columns, and in the second case the focus is the number of events in a few well-packed columns. Thus, it appears that situations can arise where M may not be significantly large (because too few dense columns exist) yet one or two columns are almost full and the others are quite sparse. However, this appears almost never to be the case. In several statistical simulations wherein artificial data were arranged to reflect the latter scenario, M was significantly large. The test for density appears to be robust in those situations where the number of events in a few columns exceeds $d_h$ by a large amount. However, this is merely a convenient coincidence, in that the test pertains to the number of dense columns.

Continuing in the direction implied by the above paragraph, in the circumstance that the number of events in one column is especially large, the following section is of interest.

DISTRIBUTION OF THE MAXIMUM COLUMN FREQUENCY AND SOME REMARKS ABOUT COMBINATORIAL MODELS

Here we consider (41) with $r = r_i$ for all i. Define $u_i$ as the number of events contained in the column with the most events, symbolized by $u_i = \max$. This specification of $u_i$ is not appropriate as an alternative way of defining density, but it can be a very good test statistic for clustering on its own merits.

Representing these conditions in (41) and then applying (4) to (41) (thereby getting the distribution, $P(M \ge m)$, of the number, M, of columns that have $\ge \max$ events) and finally specifying that m = 1, we get the probability of observing one or more columns with $\ge \max$ events; but this is the probability of getting $\ge \max$ events in a column, for the same sample space. Denoting the random variable by 'MAX', the result is expressed by

A few brief remarks about (52) follow in this section. Expression (52) may be viewed as the 'matrix occupancy' analogue of the distribution of the maximum cell frequency for the usual occupancy model (that is, urn model, multinomial model).
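For small rectangular maps the distribution of MAX can be checked by brute force. The following is only an illustrative enumeration, not expression (52): it lists every placement of T events in an r × c array of 0-1 cells, treats all placements as equally likely, and tallies the largest column count. The function name `max_tail` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def max_tail(r, c, T, m):
    """Exact P(MAX >= m) for T events placed in an r x c array of 0-1 cells.

    Cells are numbered 0..r*c-1 and cell k belongs to column k // r.
    Every set of T occupied cells is assumed equally likely.
    """
    hits = total = 0
    for occupied in combinations(range(r * c), T):
        counts = [0] * c
        for cell in occupied:
            counts[cell // r] += 1
        total += 1
        if max(counts) >= m:
            hits += 1
    return Fraction(hits, total)


if __name__ == "__main__":
    # Two columns of two cells, two events: 2 of the 6 maps fill a column.
    print(max_tail(2, 2, 2, 2))  # 1/3
```

Enumeration grows as C(rc, T), so this is practical only for small maps; it is meant as a way to sanity-check a closed-form or simulated tail probability.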

The introduction section briefly explains that, in the parlance of combinatorial occupancy models, the usual occupancy model corresponds to the Maxwell-Boltzmann model and a 0-1 cell model (matrix occupancy) gives rise to the Fermi-Dirac model. Thus, (52) represents an extreme value distribution for a certain Fermi-Dirac configuration (rectangular, where columns are important). Grimson discusses MAX in connection with the Maxwell-Boltzmann model. The third of the trio of general combinatorial models of the occupancy kind is the Bose-Einstein model. The distribution of maximum cell frequency for this model (which is based on the negative binomial distribution) has been explicitly formulated and applied in an interesting manner to epidemiologic time series.

Table I. Summary: reference numbers for applicable formulas*

                                      Random variable
                               E     S     U     N     M     F
For the general case:
  gth factorial moment         5     ‡    27     ‡    41    45
  1st moment                   8    19    30     ‡    42    46
  2nd moment†                 10    20    31     ‡    43    47
  Exact p-value§               6    26    29    37    50    48
For the case all ri = r:
  gth factorial moment        12    21    32     ‡    41    49
  1st moment                  13    22    33     ‡    42    49
  2nd moment†                 14    23    32     ‡    43    49
  Exact p-value               15    26    35    37    50    51

*For several of these expressions, alternate forms and approximations appear near them in the text. Also see expression (52).
†The variance can be obtained by using the same formula used for E, expression (11). Sometimes an explicit expression is given near the second moment in the text.
‡Re N, see the text preceding expression (42). Re S, see (5) and (1).
§In the applications, these values were obtained by simulation.

A DIRECTORY OF FORMULAS AND SOME COMMENTS ABOUT SHAPES OF DISTRIBUTIONS

For convenience, Table I provides the index numbers of the applicable formulas for each statistic of the paper.

The formulas for the exact p-values for the most general cases will be programmed in the course of ongoing work in this area. In the applications that follow, p-values that would be given by these formulas have been estimated by simulation, but the moments have been calculated from their exact formulas. Generally, the formulas required for calculating means and variances of E, S, U, N, M and F can be applied fairly simply with a calculator that has combinatorial functions.

Smith and Pike conjecture that for E the assumption of normality is reasonable. Simulation suggests to us that this may hold for large samples. For small samples the distributions of all of the statistics are skewed. Generally U and S are negatively skewed. Interestingly, on the basis of many simulations, we have observed that when the z-score for E and N is $\ge 2$ (or $< -2$ for U and S), the associated p-value is < 0.05.

ANALYSIS OF TRYPANOSOMA CRUZI DATA

Smith and Pike⁸ examined serologically positive Trypanosoma cruzi (T. cruzi) prevalence data gathered from 40 households in northeast Brazil. (T. cruzi seropositivity is associated with Chagas disease, or Chagas cardiomyopathy, which leads to shortened life expectancy.) These data are summarized in Table II and mapped in Figure 1.

Table II. T. cruzi seropositivity prevalence in 40 households

T. cruzi prevalence    3  2  0  1  0  1  0  1  2  0  0  1  2  0  6  2  1  0  2  2  1  0  1  3  0  4  5  0  2  0  1  1  1  2  1  5  1  2  1  0
Number of residents    8  3  4  3  2  5  3  3  6  3  3  3  3  1  6  4  3  3  2 13  2  5  9  5  4  6  5  5  2  3  3  2  2  3  4  6  2  4  7  2

Total prevalence = 57; total persons = 162

Table III. Statistics from the T. cruzi cluster analysis by age partition and for the entire environment*

Partition        Statistic       Obs      Exp      Std       Z
Ages < 20        E                13     9.538    1.243     2.79
                 S                 7    10.462    1.243    -2.79
                 U                 2     7.595    2.082    -2.69
                 N                12     6.405    2.082     2.69
                 M                 4     0.750    0.742     4.38
                 F                 4     0.508    0.654     5.345
Ages ≥ 20        E                 9     7.453    1.620     0.95
                 S                23    24.547    1.620    -0.95
                 U                12    14.434    2.768    -0.88
                 N                25    22.566    2.768     0.88
                 M                 9     6.533    1.492     1.65
                 F                 9     5.358    1.547     2.356
Age adjusted†    E                12     8.341    1.859     1.97
                 S, U, N, M, F     -       -        -        -
All ages         E                11     8.557    1.951     1.25
                 S                28    30.444    1.951    -1.25
                 U                13    14.339    2.882    -0.46
                 N                44    42.661    2.882     0.46
                 M                 6     3.177    1.401     2.01
                 F                 4     1.654    1.149     2.042

*Single cells are not included in analysis.
†Obtained from Smith and Pike's results.⁸ Smith and Pike incorporated the single cell column in their analysis, which is reasonable for statistic E, and which in this case provides a larger Z score than that which would be obtained if this cell were excluded.


Smith and Pike observed 12 households in which no individual was seropositive; E = 12 in our notation. Stratifying by age, their methodology yields an adjusted (for age) expected value and variance of 8.3407 and 3.4555, respectively. These values result in a Z-score of 1.9685 or a continuity-corrected chi-square value of 2.8885, with one degree of freedom.⁸ Smith and Pike indicate that in these data the assumption of normality is reasonable.
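The two summary values quoted above can be reproduced from the reported mean and variance. The short check below is only arithmetic on the figures in the text; it assumes the usual continuity correction of 0.5 applied to the absolute deviation, and the variable names are illustrative.

```python
from math import sqrt

observed = 12      # households with no seropositive member (E)
expected = 8.3407  # age-adjusted expected value quoted in the text
variance = 3.4555  # age-adjusted variance quoted in the text

# Z-score: standardized deviation of the observed count
z_score = (observed - expected) / sqrt(variance)

# Chi-square on one degree of freedom with a 0.5 continuity correction
chi_square_cc = (abs(observed - expected) - 0.5) ** 2 / variance

print(f"Z = {z_score:.4f}, continuity-corrected chi-square = {chi_square_cc:.4f}")
```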

Figures 2 and 3 depict the partitioning of Figure 1 into two mutually exclusive age groups. This partitioning does not represent the usual mode of stratification based on a person's age. The unit of observation is the household, not the person. Therefore, partitions are different from strata. Such partitioning represents a division of a map into two (or perhaps more) complementary sub-maps.

Figure 2 represents the map of the younger cases wherein older people have been removed. Certainly, the epidemic process involves the older people but it is of interest to look at the resulting pattern of cases of the younger people within their environment. Complementary remarks pertain to Figure 3. Benefits of partitioning will be seen shortly.

We do not suggest that calculating age adjusted statistics by customary stratified analysis procedures is appropriate for such partitioned data. Smith and Pike have presented a method for deriving age adjusted values of E and its first few moments, and they applied their method to the T. cruzi data.⁸ Our intent in this section is to illustrate use of the distributions of E, S, U, N, M and F on these data and on the age specific partitions of these data, and to interpret the results in light of this approach.

We have chosen to remove from analysis all single person columns. Each such column consists of only one cell which represents one person. Whether or not that person is a case, inclusion of that person arguably does not contribute information regarding within-household clustering. In fact, the null hypothesis, which is that disease does not cluster within households, carries the implication that inferences apply only to multiple person households. Also, the definition of each test statistic imparts meaning only for multiple cell columns. We note, however, that single cell columns are part of the complete environment, that the formulas or simulation can easily incorporate them, and that circumstances of other applications may suggest that inclusion of single cells is reasonable.

Tables III and IV summarize the results of analysis. Table III shows the results, summarized by the Z-score (Z = (obs - exp)/(std)), from applications of the formulas in this paper. Table IV provides the respective p-values, which in this case are obtained by simulation (10,000 iterations each).
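The simulation step can be sketched as follows. This is a hedged illustration rather than the authors' program: it assumes that each iteration places T events at random in the W cells, at most one per cell, and that the p-value is the one-sided proportion of simulated values at least as large as the observed one. The household sizes are an assumed reading of the 39 multi-person columns of Table II; the names `simulate_counts` and `upper_p` and the seed are arbitrary.

```python
import random

# Sizes of the 39 multi-person households (assumed reading of Table II)
SIZES = [8, 3, 4, 3, 2, 5, 3, 3, 6, 3, 3, 3, 3, 6, 4, 3, 3, 2, 13, 2,
         5, 9, 5, 4, 6, 5, 5, 2, 3, 3, 2, 2, 3, 4, 6, 2, 4, 7, 2]
T = 57                       # seropositive persons in those households
OBS_EMPTY, OBS_FULL = 11, 4  # observed E and F for all ages (Table III)


def simulate_counts(sizes, n_events, n_iter, seed=0):
    """Return (empty, full) column counts for n_iter random maps."""
    rng = random.Random(seed)
    # column_of_cell[k] is the column that cell k belongs to
    column_of_cell = [j for j, r in enumerate(sizes) for _ in range(r)]
    n_cells = len(column_of_cell)
    out = []
    for _ in range(n_iter):
        counts = [0] * len(sizes)
        # sample without replacement: each cell holds at most one event
        for cell in rng.sample(range(n_cells), n_events):
            counts[column_of_cell[cell]] += 1
        empty = sum(k == 0 for k in counts)
        full = sum(k == r for k, r in zip(counts, sizes))
        out.append((empty, full))
    return out


def upper_p(values, observed):
    """One-sided p-value: share of simulated values >= observed."""
    return sum(v >= observed for v in values) / len(values)


draws = simulate_counts(SIZES, T, n_iter=10_000)
p_empty = upper_p([e for e, _ in draws], OBS_EMPTY)
p_full = upper_p([f for _, f in draws], OBS_FULL)
print(p_empty, p_full)
```

Because the estimates depend on the random seed, they should be compared with Table IV only up to simulation error.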

The T. cruzi data may be stratified by household size, for r = 2, 3, 4, 5, 6 (see Figure 1). However, there is quite a small amount of data within each stratum and the within-stratum analysis does not appear to reveal much information. With the exception of stratum r = 5, the p-values for the stratum specific statistics are large. For larger sample sizes, stratification by household size should lead to results about the effects of column (household) size on rates and on within-household clustering.

From Tables III and IV we note that the statistics M and F, which refer to columns that have 'several' cases, are accompanied by smaller p-values than those of the other statistics applied to these data.

Evidently, seropositivity to T. cruzi tends to cluster within certain households. We note in particular that the pattern is striking for the partitioned map of younger people, although data are scarce. All of the statistics have low p-values for the partition of younger people. By way of descriptive summary, we note that in Figure 2 (younger people), 85 per cent of the columns with 2 or more cells are either empty or full, whereas in Figure 3, 56 per cent of the columns with 2 or more cells are either empty or full.

Figure 2. Map of T. cruzi cases < 20 years of age

Figure 3. Map of T. cruzi cases ≥ 20 years of age

We note that this cluster pattern is consistent with the tentative evidence forwarded by Mott et al.⁹ that (1) seropositivity prevalence is high among children in households where both parents are seropositive, and (2) it is higher among children in households where the mother is seropositive and the father is seronegative than in households where the father is seropositive and the mother is seronegative. Interestingly, in view of our emphasis on organized environments, and in addition to the previous results regarding the association between seropositivity, age and family structure, investigators also have noted that properties of house construction are associated with the household clustering.⁹,²⁰ We do not have access to data which would permit us to elaborate on this point, but we note that our finding that T. cruzi clusters within households is consistent with the observation that house construction is involved.

Table IV. P-values from the T. cruzi cluster analysis by age partition and for the entire environment

             E and S   U and N   Dense (M)   Full (F)
Ages < 20    0.0083    0.0057    0.0007      0.0001
Ages ≥ 20    0.2593    0.2370    0.0945      0.0222
All ages     0.1598    0.3958    0.0477      0.0655

DISCUSSION

Epidemiologists often use rates or expected-to-observed ratios to compare characteristics of epidemicity of a study area to those of a comparison area. This paper provides ways of comparing observed patterns of epidemicity within a study area to patterns expected within the area on the basis of randomization; the study area serves as its own control. Thus, one approach is a comparison among areas and the other is strictly internal. The two are not alternative approaches to each other; they provide different kinds of information. Investigators can easily construct observed-to-expected ratios of the test statistics discussed here, but a more rigorous approach emphasizes the calculation of p-values. Furthermore, geographical configurations of buildings accord with small numbers of cases; therefore, exact (combinatorial) methods are used.

The notion of map, based on Fermi-Dirac (binary occupancy) models, provides a convenient framework in which to study epidemicity within structured environments. Random maps directly comprise the finite sample space from which a useful family of distributions is derived. These maps are direct abstractions of the more encumbered real physical environments and they reflect important aspects of physical structures.

The simple environments may possess hierarchical structure. For example, in the T. cruzi data, the sub-environment of the younger (< 20 years) cases and non-cases comprises one part of a partitioning of the larger environment. Such partitioning, which is not stratification, can be quite useful in determining features of epidemicity, as illustrated in the T. cruzi analysis. To be sure, individuals in the entire environment contribute to epidemicity in any one partition. For example, it is noted that within-household clustering pertains mainly to the environment of the younger group, so it would be of interest, say, to compare the young clustered group with mothers' seroreactivity. This illustrates a reason for partitioning.

Smith and Pike⁸ stratified on age. However, note that the unit of observation is the column (for example, household) rather than people. Typically, stratification pertains to factors that the units of observation (usually people) have. The Smith and Pike stratification procedure is useful, but the statistical formulas are quite complicated. Although partitioning and stratification are different concepts, each pertains to a way of assessing roles of personal factors, so partitioning may be viewed as a useful alternative to stratification.


The application in this paper pertains to prevalence of seropositivity, but, more generally, each epidemiologic study or investigation will have special requirements regarding ascertainment and collection of data. Considerable care is required in using these methods for incidence data in dynamic environments in which the disease has long latency periods. Yet these methods can be useful in such settings and they are expected to possess maximal power, especially for small amounts of data.

Previously, no comprehensive strategy has been available for studying or identifying patterns of disease in small organized environments which accommodate people for much of their time, as, for example, homes, work places, hospitals, schools and other institutions. The combinatorial methods of this paper, indexed in Table I, and the useful notions of map (of structured environments), density and partition are offered as a constructive operational basis for such a strategy. Applications permit rigorous analysis of epidemiologic investigations in organized environments such as buildings wherein the primary locations of the individuals at risk are identified.

REFERENCES

1. Klauber, M. R. and Angulo, J. J. 'Variola minor in Braganca Paulista County, 1956: space-time interaction among variola minor cases in two elementary schools', American Journal of Epidemiology, 99, 65-74 (1974).

2. Johnson, N. L. and Kotz, S. Urn Models and Their Application, Wiley, New York.

3. Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd edn, Wiley, New York, 1968.

4. Pearson, K. 'Multiple cases of disease in the same house', Biometrika, 9, 28 (1913).

5. Greenwood, M. and Yule, G. U. 'An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or repeated accidents', Journal of the Royal Statistical Society, 83, 255-279 (1920).

6. Mathen, K. K. and Chakraborty, P. N. 'A statistical study on multiple cases of disease in households', Sankhya: The Indian Journal of Statistics, 10, 387-392 (1950).

7. Walter, S. D. 'On the detection of household aggregation of disease', Biometrics, 30, 525-538 (1974).

8. Smith, P. G. and Pike, M. C. 'Generalization of two tests for the detection of household aggregation of disease', Biometrics, 32, 817-828 (1976).

9. Mott, K. E., Lehman, J. S., Hoff, R., Morrow, R. H., Muniz, T. M., Sherlock, I., Draper, C. C., Pugliese, C. and Guimaraes, A. C. 'The epidemiology and household distribution of seroreactivity to Trypanosoma cruzi in a rural community in northeast Brazil', American Journal of Tropical Medicine and Hygiene, 25, 552-561 (1976).

10. Grimson, R. C. and Groshen, S. 'A statistical test for contact among individuals in an epidemic and a pattern among cases of acquired immune deficiency syndrome', Statistics in Medicine, 5, 271-279 (1986).

11. Pike, M. C. and Smith, P. G. 'A case-control approach to examine diseases for evidence of contagion, including diseases with long latent periods', Biometrics, 30, 263-279 (1974).

12. Hall, M. Combinatorial Theory, Chapter 2, Blaisdell, Waltham, MA, 1967.

13. Riordan, J. An Introduction to Combinatorial Analysis, Wiley, New York, 1958.

14. David, F. N. and Barton, D. E. Combinatorial Chance, Charles Griffin and Co., London, 1962.

15. Mielke, P. W. and Siddiqui, M. D. 'A combinatorial test for independence of dichotomous responses', Journal of the American Statistical Association, 60, 437-441 (1965).

16. Grimson, R. C. 'Combinatorial tests for concordance and a comparison of series of asthma attacks', Statistics in Medicine, submitted.

17. Grimson, R. C. 'Disease clusters, exact distribution of maxima and p-values', Statistics in Medicine, 12, 1773-1794 (1993).

18. Grimson, R. C. 'Tests based on the maximum occupancy frequency', Proceedings of the 1994 American Statistical Association, Section on Epidemiology, 64-69.

19. Grimson, R. C. 'The clustering of disease', Mathematical Biosciences, 46, 257-278 (1979).

20. Marsden, P. D., Mott, K. E. and Prata, A. 'The prevalence of Trypanosoma cruzi parasitemia in 8 families in an endemic area', Gazeta Medica da Bahia, 69, 65-69 (1969).