
J-Shaped Distributions Robert M. Hayes 2004

Overview
Viewing Distributions of Use
Descriptive rather than Analytical
Effect of Uncertainties
Contexts for Library Application
The Types of Distributions

Viewing Distributions of Use There are two ways to view distributions and related J-shaped curves:
(1) in sequence of increasing frequency of use, in which the number of items is the dependent variable
(2) in sequence of increasing numbers of items, in which the frequency of use is the dependent variable

Sequence of Increasing Uses To illustrate, consider the following distribution: Note that the listing is in increasing order of the frequency of use. For example, there are 500 items that are used only once and 1 item that is used 10 times. The graph of this distribution looks as follows:

Sequence of Increasing Uses

Example of Increasing Frequency This way of viewing might typically be used when looking at statistics on circulation of library materials in which the number of items circulating once would be followed by the number circulating two times, etc.

Sequence of Increasing Items The alternative picture, for the same data: Effectively, the data are now arranged in order of decreasing frequency of use. The graphical picture is quite different:

Sequence of Increasing Items

Example of Decreasing Frequency This means of viewing the data is typically used in applying laws such as Zipf's law, in which words are listed in decreasing order of frequency of use. Similarly, in the original formulation of Bradford's law, journals are sequenced in order of decreasing productivity for a subject field and then grouped into zones of equal productivity (the zones containing successively greater numbers of journals).
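The two views described above can be derived from the same counts. The following is a minimal sketch, assuming a hypothetical distribution: only the two endpoints (500 items used once, 1 item used 10 times) come from the text, and every value in between, like the function names, is invented for illustration.

```python
# Hypothetical counts: uses -> number of items with exactly that many uses.
# Only the endpoints (500 items used once, 1 item used 10 times) come from
# the text; the values in between are invented for illustration.
items_by_uses = {1: 500, 2: 250, 3: 120, 4: 60, 5: 30,
                 6: 15, 7: 8, 8: 4, 9: 2, 10: 1}


def increasing_uses_view(items_by_uses):
    """View (1): (uses, number of items), in increasing order of uses."""
    return sorted(items_by_uses.items())


def increasing_items_view(items_by_uses):
    """View (2): (cumulative number of items, uses), most heavily used first."""
    view, rank = [], 0
    for uses, n_items in sorted(items_by_uses.items(), reverse=True):
        rank += n_items
        view.append((rank, uses))
    return view


print(increasing_uses_view(items_by_uses)[:3])
print(increasing_items_view(items_by_uses)[:3])
```

In the first view the number of items is read off as a function of use; in the second the same data are re-sorted so that use is read off as a function of item rank.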

Distributions of Use Each of the distributions that will be presented here is intended to represent situations in which a few items (journals, scientists, users, volumes, etc.) account for the many (articles, citations, uses, circulations, etc.). These models have value as means for assessing the effects of the patterns upon various kinds of decisions. In the library, those decisions might relate to acquisitions, to alternative means for storage of holdings, or to staffing for services.

Descriptive rather than Analytical It is important to recognize that, with one exception, these models are essentially descriptive of empirical data. That is, most of them do not provide explanation for the behavior exhibited in the data; they merely represent that behavior in a mathematical form. Furthermore, they do not represent cause and effect relationships. The one exception is the "mixture of Poissons" which does provide an explanation for the behavior, deriving it from the assumption of a heterogeneous (i.e., mixed) population and random processes around the average for each of the components of the population.

Effect of Uncertainties It is also important to recognize that the empirical data, in any real situation, are themselves uncertain, subject to variation as a result of many factors: errors in observation, changes from one time period to another, changes in the mixtures of populations involved, changes in the context of observation. As a result, whatever model may best fit needs in analysis is the one to be used, since any of the models is likely to be as accurate as any other. Furthermore, unlike physical phenomena, patterns of usage reflect not underlying laws of nature but the effects of individual decisions or, in many cases, large scale policy decisions.

Contexts for Library Applications For the library, there are four contexts in which these kinds of distributions seem relevant:
(1) the context of the users,
(2) the context of the use of materials,
(3) the interaction between users and materials,
(4) the context of bibliometric analysis.
The first usually shows a distribution of uses across the set of users that exhibits a J-shaped curve. The second usually shows a distribution of uses across the set of materials that also exhibits a J-shaped curve. The third helps to identify the nature of uses. The fourth helps in assessing contributions of journals to publication and use of the articles they contain.

Library User Patterns Library users differ in their relative frequency of use. For example, in academic libraries, faculty will, on average, use the library much more frequently than will students. And among students, graduate students will, on average, do so more frequently than undergraduates. The following shows relative use at the UCLA library:

Library Collection Use Patterns Turning to the second context, the use of materials, again the evidence is that in the library the extent to which materials are used varies greatly. Considering circulation data as a measure of use, some library materials are heavily circulated each year and some are virtually never circulated. Leaving aside for the moment differences for specific items, there are categories of items that almost by definition will vary in their circulation. There are materials that are put "on reserve" precisely because they are expected to be heavily circulated. There are rare books that will never be circulated and even will rarely be used at all. There are current "best sellers" that will be heavily used, and there are "dusty old volumes" that will almost never be used.

Library Collection Use Patterns Beyond that, though, are the differences among items, independent of identified categories. Some of those differences relate to date of publication or acquisition, some to the subject matter, some to the changeable role as assigned readings. Despite excellent efforts (thinking especially of that by Fussler and Simon) to identify reasons for such differences, there are no easy criteria for a priori identification of which items will be heavily used and which rarely so. The differences therefore usually need to be identified from actual experience, as exemplified in circulation records.

Relationship between Users & Materials Beyond the separate distributions for users and materials, there are also important distributions that reflect the relationships between the two. To illustrate, the following shows the relative use of two categories of materials (items with one use and all other items) by categories of users at UCLA: Note the relatively greater use by faculty of One Use items, especially in comparison with All Others.

Bibliometric Patterns A number of models (such as Bradford's law) are used to describe characteristics of the literature. For example, how is the literature on a particular subject scattered or distributed in the journals? For libraries, the significance lies in the fact that, in a bibliography on any subject, there is always a small group of core journals that account for a substantial percentage (say 1/3) of the articles on that subject. Then there is a second, larger group of journals that account for another third, while a much larger group of journals picks up the last third.

Bibliometric Patterns Distribution frequencies of the Bradford type are also evident in other bibliometric phenomena. Lotka's law, for example, describes the productivity of scientists within a given population. Productivity is defined here as the number of papers a scientist publishes within a given time. Underlying the J-shaped curve is an assumption that, if an individual (scientist or journal) is successful (writes or publishes an article) on one attempt, the probability of success on subsequent attempts increases. This has been called "cumulative advantage", equivalent to "success-breeds-success".

The Types of Distributions
Negative Exponential Distributions
    Bradford's Law - 1
Negative Power or Harmonic Distributions
    Zipf's Law
    Bradford's Law - 2
    Lotka's Law
    Pareto Law
    Cumulative Advantage Processes
Mixture of Poisson Distributions
Negative Binomial Distributions
Logistic Distributions
Linear Distributions

Negative Exponential Distributions The negative exponential is represented by the equation
$$F(k) = N \cdot 2^{-A k}$$
There are two characterizing parameters: N and A. The base for the exponential can be other than 2. It could be e or 10 or any other positive number other than 1. The choice of base simply affects the value of A. For example, if the base were 10,
$$F(k) = N \cdot 2^{-A k} = N \cdot 10^{\log_{10}(2) \cdot (-A k)} = N \cdot 10^{-(A \log_{10} 2) \cdot k}$$
so A is replaced by $A \log_{10}(2)$.
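The change of base can be checked numerically. This is a small sketch, not part of the original slides; the function name `neg_exp` and its `base` argument are illustrative choices, and N = 1000 and A = 1 are the values the slides use for their graph.

```python
import math


def neg_exp(k, N, A, base=2.0):
    """F(k) = N * base**(-A*k)."""
    return N * base ** (-A * k)


N, A = 1000, 1                 # values used for the graph in the slides
A10 = A * math.log10(2)        # equivalent parameter when the base is 10

for k in range(5):
    # The two columns agree up to floating-point rounding.
    print(k, neg_exp(k, N, A), neg_exp(k, N, A10, base=10.0))
```

Both columns describe the same curve; only the numerical value of the rate parameter changes with the base.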

Negative Exponential Distributions Graphically, it looks as follows, for N = 1000 and A = 1:

Bradford's Law - 1 Samuel C. Bradford first formulated his law in 1934, but it did not receive wide attention until publication of his book, Documentation, in 1948. Bradford called it the law of scattering, since it describes how the literature on a particular subject is scattered or distributed in the journals. In information science, Bradford's law is perhaps the best known of all the bibliometric laws. A huge body of literature has been written on it. Bradford's law, as originally defined, is a negative exponential distribution.

Initiation of Bradford's Law In Documentation, Bradford analyzed a four-year bibliography of references to articles in applied geophysics. He listed journals containing references to that field in descending order of productivity. He then divided the list into three zones, each containing roughly the same number of references. Bradford observed that the number of journals contributing references to each zone increased by a multiple of about five. Specifically, the first zone contained nine journals which contributed 429 references. The second contained 59 journals producing 499 references. In the third zone 258 journals provided 404 references.

Bradford, S. C. Documentation. Washington, DC: Public Affairs Press, 1950.

Qualitative Form of Bradford's Law On the basis of these observations, Bradford wrote, "the numbers of periodicals in the nucleus and succeeding zones will be as $1, n, n^2, \ldots$" (p. 116). For applied geophysics, then, the number of journals in each zone was proportionate to $1, 5, 25, \ldots$ Given that, the average frequency of use for journals across the zones is represented by a negative exponential distribution: $1, 1/n, 1/n^2, \ldots$ Later, we will derive this negative exponential distribution from an underlying negative power distribution.

Log-linear Form Given that the negative exponential is represented by the equation
$$F(k) = N \cdot 2^{-A k}$$
note that
$$\log_2 F(k) = \log_2 N - A k$$
This log-linear form is useful for plotting the values of $\log_2 F(k)$ as a function of k or for estimating the values for N and A by regression. The following graph shows the log-linear form.

Log-linear Form Graphically the log-linear form looks as follows:
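As one way to carry out the regression mentioned for the log-linear form, the sketch below fits a straight line to log2 F(k) by ordinary least squares. The function name `fit_neg_exp` and the synthetic data are assumptions made for illustration; real circulation counts would not fit exactly.

```python
import math


def fit_neg_exp(ks, Fs):
    """Estimate N and A in F(k) = N * 2**(-A*k) by least squares on
    log2(F(k)) = log2(N) - A*k. Requires every F(k) > 0."""
    ys = [math.log2(F) for F in Fs]
    n = len(ks)
    k_mean = sum(ks) / n
    y_mean = sum(ys) / n
    sxx = sum((k - k_mean) ** 2 for k in ks)
    sxy = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * k_mean
    return 2 ** intercept, -slope      # (N, A)


# Synthetic data generated from the model itself with N = 1000, A = 1.
ks = list(range(8))
Fs = [1000 * 2 ** (-1 * k) for k in ks]
N_hat, A_hat = fit_neg_exp(ks, Fs)
print(round(N_hat, 6), round(A_hat, 6))
```

Because the fit is done on logarithms, any k with F(k) = 0 has to be dropped or pooled before fitting.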

Negative Power or Harmonic Distributions The negative power or harmonic distributions derive from the harmonic series: $1, 1/2, 1/3, \ldots, 1/n, \ldots$ That basic series is augmented with two parameters, A and B, in the following formula:
$$P(x) = (A/x) \cdot (B/x)^{A}$$
defined over the interval $0 < B < x$. Note that P(x) is expressed as a negative power of the value x. Hence, negative power distribution.

Harmonic or Negative Power Distributions Graphically it looks as follows, for A = 1.2 and B = 0.7:

Zipf's Law In his book Human Behavior and the Principle of Least Effort, George K. Zipf treated the frequency with which words occur in a given piece of literature. Zipf arranged the 29,899 different words found in Joyce's Ulysses in descending order of their frequency of occurrence. Then to each word he assigned a rank, from r = 1 (most frequently occurring word) to r = 29,899 (least frequently occurring). He found that by multiplying the numerical value of each rank r by its corresponding frequency F, he obtained a product, C, which was approximately constant throughout the entire list of words. The formula for Zipf's law is thus $F(r) = C/r$, so it is a harmonic distribution.
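The rank-times-frequency check that Zipf performed can be sketched as follows. The function name `rank_frequency_table` and the toy word list are hypothetical; the list is constructed so that F(r) = 12/r holds exactly, which real text would only approximate.

```python
from collections import Counter


def rank_frequency_table(words):
    """Return (rank, word, frequency, rank*frequency), most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]


# Hypothetical toy corpus built so that F(r) = 12/r exactly for r = 1..4.
words = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for row in rank_frequency_table(words):
    print(row)
```

The last column is the product C; for a corpus that follows Zipf's law it stays roughly level as the rank increases.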

Bradford's Law - 2 At about the same time that Zipf published his book, Bradford wrote Documentation. We have already discussed the negative exponential distribution represented by the original formulation of Bradford's law. We will now look at the underlying harmonic distribution and derive the exponential one from it.

Frequency of Use of Journals Underlying Bradford's law is the frequency of use of journals, as exemplified by their occurrence in a bibliography for a subject field. Let P(n) be the frequency of use of journal n, listed in decreasing order of that frequency of use, so that P(n) > P(n+1). The empirical facts appear to be that, overall and with varying degrees of accuracy, the frequency of use of journals fits a harmonic distribution. Thus, $P(n) = A/n$ (more or less).

Frequency for Groups of Journals Suppose that we now group the journals, the first group containing the most frequently used journals, the second the next most frequently used, and so on. Let $G_k$ be the number of journals in group k. Consider the frequency of use of the journals in each of the several groups (taking A = 1, since a constant factor does not affect the argument):
$$F(1) = P(1) + P(2) + \cdots + P(G_1) = 1/1 + 1/2 + \cdots + 1/G_1$$
$$F(2) = P(G_1+1) + P(G_1+2) + \cdots + P(G_1+G_2) = 1/(G_1+1) + \cdots + 1/(G_1+G_2)$$
$$F(3) = P(G_1+G_2+1) + \cdots + P(G_1+G_2+G_3) = 1/(G_1+G_2+1) + \cdots + 1/(G_1+G_2+G_3)$$
and so on.

Sums of the Harmonic Series There is no closed form for evaluating the several sums of a harmonic series, but we can compare the total for the areas of the rectangles with the integral of the function 1/x, shown in red in the following graph:

Harmonic Series & Natural Logarithm The sum from 1/(A+1) to 1/B can be approximated by the integral of 1/x from (A + 1 - 0.5) to (B + 0.5). The integral of 1/x is ln(x), so the sum from 1/(A+1) to 1/B would be approximately
$$\ln\big((B + 0.5)/(A + 0.5)\big)$$
Use that approximation and let $T_k = \sum_{i=1}^{k} G_i$, $T_0 = 0$, so that
$$F(k+1) = \ln\big((T_{k+1} + 0.5)/(T_k + 0.5)\big)$$
In Bradford's description of the law, the successive groups of journals were chosen to have about the same number of citations, so F(1) = F(2) = F(3) = F(4), etc. Hence,
$$(T_{k+1} + 0.5)/(T_k + 0.5) = (T_k + 0.5)/(T_{k-1} + 0.5)$$
$$(T_k + 0.5)^2 = (T_{k+1} + 0.5)(T_{k-1} + 0.5) = (T_k + 0.5 + G_{k+1})(T_k + 0.5 - G_k)$$
$$= (T_k + 0.5)^2 + G_{k+1}(T_{k-1} + 0.5) - G_k(T_k + 0.5)$$

Harmonic Series & Natural Logarithm From that equation,
$$G_{k+1} = G_k (T_k + 0.5)/(T_{k-1} + 0.5)$$
For k = 1,
$$G_2 = G_1 (G_1 + 0.5)/(0 + 0.5) = G_1 (2G_1 + 1)$$
By induction, we prove that $G_{k+1} = G_1 (2G_1 + 1)^k$. First, if $G_i = G_1 (2G_1 + 1)^{i-1}$ for all $i < k + 1$, then
$$T_k = G_1 \sum_{i=0}^{k-1} (2G_1 + 1)^i = G_1 \big((2G_1 + 1)^k - 1\big)/(2G_1 + 1 - 1) = \big((2G_1 + 1)^k - 1\big)/2 = (2G_1 + 1)^k/2 - 0.5$$
Hence, $T_k + 0.5 = (2G_1 + 1)^k/2$ and $T_{k-1} + 0.5 = (2G_1 + 1)^{k-1}/2$. Hence, since $G_{k+1} = G_k (T_k + 0.5)/(T_{k-1} + 0.5)$,
$$G_{k+1} = G_k \big((2G_1 + 1)^k/2\big)/\big((2G_1 + 1)^{k-1}/2\big) = G_k (2G_1 + 1)$$
$$G_{k+1} = G_1 (2G_1 + 1)^k$$
Q.E.D.
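The result of the induction can be checked against exact harmonic sums. The sketch below assumes a hypothetical first group of G_1 = 2 journals and the pure harmonic frequencies 1/n; it builds the group sizes from the formula and compares each group's exact sum with ln(2*G_1 + 1), the common value that the logarithmic approximation assigns to every group. The function names are illustrative.

```python
import math


def group_sizes(G1, n_groups):
    """Group sizes G_1 * (2*G_1 + 1)**k for k = 0 .. n_groups - 1."""
    return [G1 * (2 * G1 + 1) ** k for k in range(n_groups)]


def harmonic_group_sums(sizes):
    """Exact sums of 1/n over consecutive groups of the given sizes."""
    sums, start = [], 1
    for G in sizes:
        sums.append(sum(1.0 / n for n in range(start, start + G)))
        start += G
    return sums


G1 = 2                                  # hypothetical size of the first group
sizes = group_sizes(G1, 4)              # [2, 10, 50, 250]
sums = harmonic_group_sums(sizes)
for G, s in zip(sizes, sums):
    print(G, round(s, 4), round(math.log(2 * G1 + 1), 4))
```

With these sizes the later groups carry nearly identical totals, while the first group falls short of the approximated value.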

Harmonic Series & Natural Logarithm As a result, the number of journals in group k is an exponential function of k. Given the equal number of citations for each group, the frequency distribution is negative exponential. However, it is important to note that the approximation of the summation of 1/n by ln is significantly in error at the start. Specifically, 1/1 = 1 but ln(1.5/0.5) = ln 3 ≈ 1.1. The result is an overestimate at the start of about 10%. This is at least a partial explanation of the difference between empirical data and the exponential model in the region called the core journals, which will be illustrated next.
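The size of the error in the logarithmic approximation can be tabulated directly. This is a minimal sketch; the ranges chosen are arbitrary illustrations, and the parameters `a` and `b` play the role of A and B in the summation from 1/(A+1) to 1/B.

```python
import math


def harmonic_sum(a, b):
    """Exact sum 1/(a+1) + ... + 1/b."""
    return sum(1.0 / n for n in range(a + 1, b + 1))


def log_approx(a, b):
    """The approximation ln((b + 0.5)/(a + 0.5)) used in the derivation."""
    return math.log((b + 0.5) / (a + 0.5))


for a, b in [(0, 1), (0, 5), (5, 25), (25, 125)]:
    exact, approx = harmonic_sum(a, b), log_approx(a, b)
    rel_err = (approx - exact) / exact
    print(a, b, round(exact, 4), round(approx, 4), f"{rel_err:+.2%}")
```

The relative error is largest when the sum starts at the very first term and shrinks quickly for ranges further along the series.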

Graphical Form of Bradford's Law The following graph illustrates Bradford's law for articles on tropical and subtropical agriculture found in Tropical Abstracts during 1970. Note that the x-axis is the logarithm of the number of journals and the y-axis is the number of citations. The data-points on the graph are equally spaced on the y-axis and logarithmically spaced on the x-axis. In preparing graphs related to the prior discussion, the x and y axes would be reversed so as to represent the log (number of journals) as a function of k, the number of groups of journals.

Lawani, S. M. Bradford's law and the literature of agriculture. Int. Lib. Rev. 5:341-50, 1973.

Anomalies in Bradford's Law Notice that the empirical data initially appear as an upward curve before becoming linear. This is typical of Bradford graphs. The area represented by the curving line is usually regarded as the nuclear zone, or journal core. Notice also that the empirical data begin to droop at about the 250th journal. The droop consistently appears among many different sets of empirical data. One theory is that including more journals would maintain the linearity. Another theory is that the droop is an integral part of article scatter. Later, when we look at the logistic distribution, we will consider a third explanation.

An Interpretation of Bradford's Law In 1967, Ferdinand F. Leimkuhler, Purdue University, proposed an equation for representing Bradford's law: $F(x) = \ln(1 + bx)/\ln(1 + b)$, where x denotes the fraction of documents in a collection which are most productive, 0 < x

Pareto Distribution The Pareto distribution is represented by the formula:
$$f(x) = (a/b)\,[b/(b + x)]^{a-1}, \quad x > b$$
The Pareto distribution is named after Vilfredo Pareto, an Italian economist, who around 1900 determined that the majority of the world's wealth was held by a minority of the people. This is not news to us today, but was a revelation then. The format in which Pareto presented his data was a bar graph sequenced in descending order of wealth, and it was a J-shaped distribution.

Pareto Distribution In the 1920's, the Pareto distribution was applied to quality control to show the frequency with which each cause of problems had occurred. The result again was a J-shaped curve, implying that most problems in quality result from a small number of causes. It is valuable as a tool in determining the most frequent causes of a particular problem and deciding where to focus efforts for maximum effectiveness.

Pareto Distribution The Pareto distribution became known as the 80-20 rule: 80% of whatever may be involved is related to 20% of the potential sources. In practice, the percentages may not be always exactly 80/20, but there usually are "the vital few and the trivial many." A Pareto chart combines a bar graph with a cumulative line graph. The bar graph shows the values in the descending order from left to right, with bar height reflecting the frequency or impact of problems. The cumulative sum line shows the percent contribution of all preceding bars.
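The data behind a Pareto chart are simply the sorted counts and their running percentage. The following sketch assumes hypothetical problem counts by cause (the labels and numbers are invented) and includes the current bar in each cumulative figure, which is one common convention.

```python
def pareto_table(counts):
    """Sort categories by descending count and attach cumulative percentages."""
    total = sum(counts.values())
    rows, running = [], 0
    for cause, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        rows.append((cause, n, 100.0 * running / total))
    return rows


# Hypothetical counts of problems by cause, invented for illustration.
counts = {"C": 10, "A": 50, "E": 4, "B": 30, "D": 6}
for cause, n, cum_pct in pareto_table(counts):
    print(f"{cause}: {n:3d}  cumulative {cum_pct:5.1f}%")
```

In this invented example the two largest of five causes account for 80% of the problems; the bars would be drawn from the second column and the cumulative line from the third.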

Pareto Distribution Pareto analysis is typically carried out by starting with a high-level overview and identifying the aspects that have the greatest effect, then analyzing those into root causes and, if necessary, applying Pareto analysis again to the sub-causes. This approach is necessary when dealing with complex processes, enabling you to properly prioritize and focus on the right issues.

Mixture of Poisson Distributions A "mixture of Poisson distributions" is characterized by a set of parameters {n(i), m(i)}, where n(i) is the number of items in component(i), and m(i) is the "a priori expected frequency" with which an item in component(i) will circulate during a given time period. This mixture leads to a frequency distribution based on the following formulation:
$$F_j(k) = n(j)\, e^{-m(j)}\, m(j)^k / k!, \quad k = 0, 1, 2, \ldots$$
$$F(k) = \sum_{j=0}^{P} F_j(k)$$
where P+1 is the number of components, and F(k) is the number of volumes that the model predicts will circulate exactly k times in the given time period.

Mixture of Poisson Distributions An equivalent formulation is
$$F_j(0) = n(j)\, e^{-m(j)}$$
$$F_j(k) = F_j(k-1)\, m(j)/k, \quad k = 1, 2, \ldots$$
$$F(k) = \sum_{j=0}^{P} F_j(k)$$
which is useful since it avoids the problems of factorials for large values of k.
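The recurrence lends itself to a short loop. The sketch below is an illustration under assumed values: the two components (900 rarely used items with mean 0.2, 100 popular items with mean 5.0) and the function name `mixture_frequencies` are hypothetical.

```python
import math


def mixture_frequencies(n, m, k_max):
    """F(k) for k = 0..k_max, where component j has n[j] items and mean m[j].
    Uses F_j(0) = n[j]*exp(-m[j]) and F_j(k) = F_j(k-1)*m[j]/k."""
    F = [0.0] * (k_max + 1)
    for n_j, m_j in zip(n, m):
        f = n_j * math.exp(-m_j)       # F_j(0)
        F[0] += f
        for k in range(1, k_max + 1):
            f *= m_j / k               # F_j(k) from F_j(k-1), no factorials
            F[k] += f
    return F


# Hypothetical two-component collection, invented for illustration.
n = [900, 100]
m = [0.2, 5.0]
F = mixture_frequencies(n, m, k_max=40)
print([round(x, 1) for x in F[:6]])
```

Summing F(k) over a sufficiently long range of k recovers the total number of items, which is a convenient sanity check on an implementation.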

Illustration of a Mixture of Poissons

Algorithm for Estimating a Mixture Let D(k) be a distribution, where D(k) is the number of items, out of a total of N, that occur exactly k times, k varying from 0 to L. Let {N(j), M(j)} be a mixture of Poisson distributions, j = 0 to P, with M(j) < M(j+1). That is,
$$F_j(0) = N(j)\, e^{-M(j)}$$
$$F_j(k) = F_j(k-1)\, M(j)/k, \quad k = 1, 2, \ldots$$
$$F(k) = \sum_{j=0}^{P} F_j(k)$$

Algorithm for Estimating a Mixture Calculate, summing over all k for which D(k) is not suppressed (D(k) being suppressed when it is unknown or, perhaps, when k = 0):
$$N_j' = N_j + \sum_{k=0}^{L} F_j(k)\,\frac{D(k) - F(k)}{D(k)}$$
$$R_j' = R_j + \sum_{k=0}^{L} F_j(k)\, k\, \frac{D(k) - F(k)}{D(k)}$$
$$M_j' = R_j'/N_j'$$
Iterate, replacing $\{N_j, M_j\}$ by $\{N_j', M_j'\}$, until a desired degree of convergence is reached.
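A single pass of the update can be sketched as below. This is an interpretation, not a definitive implementation: the slides do not define R_j, so the sketch assumes R_j = N_j * M_j (the expected total uses in component j), treats a `None` or zero D(k) as suppressed, and uses hypothetical function names. Whether repeated passes converge on real data is not demonstrated here.

```python
import math


def component_frequencies(N, M, L):
    """F_j(k) for each component j and k = 0..L, via the recurrence."""
    table = []
    for N_j, M_j in zip(N, M):
        row = [N_j * math.exp(-M_j)]
        for k in range(1, L + 1):
            row.append(row[-1] * M_j / k)
        table.append(row)
    return table


def update_step(D, N, M):
    """One pass of the update. D[k] is None (or 0) where suppressed.
    ASSUMPTION: R_j starts as N_j * M_j; the source does not define R_j."""
    L = len(D) - 1
    Fj = component_frequencies(N, M, L)
    F = [sum(row[k] for row in Fj) for k in range(L + 1)]
    new_N, new_M = [], []
    for j, (N_j, M_j) in enumerate(zip(N, M)):
        R_j = N_j * M_j
        dN = dR = 0.0
        for k in range(L + 1):
            if not D[k]:               # skip suppressed or zero counts
                continue
            w = Fj[j][k] * (D[k] - F[k]) / D[k]
            dN += w
            dR += w * k
        new_N.append(N_j + dN)
        new_M.append((R_j + dR) / (N_j + dN))
    return new_N, new_M


# Data generated by the model itself: the parameters should not move.
N, M = [900.0, 100.0], [0.2, 5.0]      # hypothetical components
Fj = component_frequencies(N, M, 30)
D = [sum(row[k] for row in Fj) for k in range(31)]
new_N, new_M = update_step(D, N, M)
print(new_N, new_M)
```

The usage lines only confirm that parameters which already reproduce the data are a fixed point of the update; fitting observed circulation counts would call `update_step` repeatedly from a starting guess.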

Negative Binomial Distribution The negative binomial distribution is represented by the following equation:
$$P(k) = \frac{(k-1)!}{(r-1)!\,(k-r)!}\; p^{r} (1-p)^{k-r}$$
The following is a graphical illustration:
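The equation can be evaluated with a binomial coefficient, since (k-1)!/((r-1)!(k-r)!) = C(k-1, r-1). In this form P(k) is the probability that the r-th success occurs on trial k, so k runs from r upward. The parameter values below are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def neg_binomial_pmf(k, r, p):
    """P(k) = C(k-1, r-1) * p**r * (1-p)**(k-r), for k = r, r+1, ..."""
    if k < r:
        return 0.0
    return comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)


r, p = 3, 0.4                          # hypothetical parameters
probs = [neg_binomial_pmf(k, r, p) for k in range(r, 200)]
print(round(probs[0], 4), round(sum(probs), 6))
```

The first value is p**r, the chance that the first r trials all succeed, and the probabilities sum to 1 up to a negligible tail.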

Logistic Distribution The behavior exhibited by empirical data when applying the Bradford distribution (with exponential growth at the beginning, followed by linear growth, followed by a leveling-off) strongly suggests that the logistic distribution might be applicable. The logistic distribution arises when there is basic exponential growth which eventually is inhibited by the effects of an upper limit. Such an upper limit clearly is present for journals in the fact that there is a limit to the number of journals that are published.

Derivation of Closed Form - 2 The standard closed form for the logistic equation in the continuous case is the following:
$$P(t) = K/(1 + e^{a + b t})$$
The following is an illustrative graph for the logistic distribution:

Illustrative Logistic Difference Growth As this shows, the curve produced by the logistic difference equation is S-shaped. Initially there is an exponential growth phase, but as growth gets closer to the carrying capacity (more or less at time step 37 in this case), the growth slows down and the population asymptotically approaches capacity.

Linear Growth - 1 Note that, qualitatively, there are three main sections of the logistic curve. The first has exponential growth and the third has asymptotic growth to the limit. But between those two is the second segment, in which the growth is virtually linear.
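The three sections can be seen numerically by evaluating the closed form P(t) = K/(1 + e^(a + b*t)). The parameter values below are hypothetical; note that b must be negative for P(t) to grow toward K, and that the midpoint K/2 is reached at t = -a/b.

```python
import math


def logistic(t, K, a, b):
    """P(t) = K / (1 + exp(a + b*t)); b < 0 gives growth toward K."""
    return K / (1.0 + math.exp(a + b * t))


K, a, b = 1000.0, 5.0, -0.25           # hypothetical parameters
values = [logistic(t, K, a, b) for t in range(0, 41, 5)]
steps = [later - earlier for earlier, later in zip(values, values[1:])]
for t, v in zip(range(0, 41, 5), values):
    print(t, round(v, 1))
print([round(s, 1) for s in steps])    # increments between successive rows
```

The increments are small at first, largest and nearly equal around the midpoint (here t = 20), and small again as the curve approaches K.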

Linear Growth - 2 Unlimited linear growth is represented by the equation
$$p_{t+1} = p_t + C$$
but that, like the exponential model, grows to exceed any identifiable limits. It is therefore valuable to consider a limited linear growth represented, perhaps, by the equation
$$p_{t+1} = p_t + \big((K - p_t)/K\big) \cdot C$$
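The limited linear growth equation is easy to iterate. In this sketch the limit K, the increment C, the starting value, and the function name are all hypothetical choices for illustration.

```python
def limited_linear_growth(p0, K, C, steps):
    """Iterate p_{t+1} = p_t + ((K - p_t) / K) * C and return the whole path."""
    path = [p0]
    for _ in range(steps):
        p = path[-1]
        path.append(p + (K - p) / K * C)
    return path


K, C = 1000.0, 50.0                    # hypothetical limit and initial increment
path = limited_linear_growth(0.0, K, C, steps=100)
print([round(p, 3) for p in path[:4]], round(path[-1], 1))
```

Each step closes a fixed fraction C/K of the remaining gap to K, so growth starts out close to linear with increment C and then flattens as p approaches the limit without exceeding it.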

THE END
