Appendix ANALYZING AND REPORTING BIOLOGICAL DATA

General Botany & Ecology Laboratory Cedarville University

This Appendix summarizes major considerations that are important in collecting, analyzing, and reporting data from biological investigations. PART A explains some of the concepts related to statistical sampling and introduces the "t-test", one statistical test of difference between samples. PART B explains how to perform the t-test with a computer spreadsheet program. PART C gives a suggested format for presenting t-values. PART D explains how to compute confidence intervals for means. PART E suggests how to present data in graphs and data tables.

PART A -- Comparing the Means of Two Statistical Samples

Biological experiments dealing with living organisms or environmental parameters must contend with considerable variation among data. Thus, it is unwise to base one's conclusions upon only one or two data points. Instead, the biologist must use a sufficient number of replicates to establish repeatable cause and effect relationships. Because it is usually impossible to include all possible measurements as replicates in the experiment, the data collected by the biologist will be a subset out of a whole array of values. Thus, through careful experimental design, one can strive to acquire a representative sample, the statistical sample, out of the whole array of possible data that are "out there" -- i.e. the statistical population.

Statistics provides methods of summarizing and analyzing statistical samples, and determining the probability that one sample is different from another. In other words, does one sample represent a different normal distribution of values from the other sample? The t-test is used to compare means of two normal distributions. First, one must calculate the mean, variance, and standard deviation of each sample. It is helpful to state a null hypothesis, a statement predicting that there is no significant difference between the means of the samples. If the t-test does suggest a high probability of there being a difference between means, the null hypothesis is rejected.

Before proceeding, you should clearly understand the significance of mean, variance, and standard deviation. The mean is equal to the sum of all the measured values of the statistical sample divided by the sample size, n. As noted above, the statistical population mean (µ) can seldom be determined. Instead, it is estimated by calculating the statistical sample mean, $\bar{X}$. Thus,

$\bar{X} = \frac{\sum X}{n}$    (EQ-1)

The mean is considered a measure of "central tendency" for a given data set. Two samples may have the same sample mean, but one may have a larger variation among individual values. The sample variance is a measure of departure from the mean. It is calculated by subtracting the mean ($\bar{X}$) from each sample value (X), then squaring each difference (d). Thus, $d^2 = (X - \bar{X})^2$. The sum of squares (SS) equals $\sum (X - \bar{X})^2$, or $\sum d^2$. Finally, the sum of squares is divided by the sample size minus 1, (n - 1). These mathematical relationships are expressed as

variance: $s^2 = \frac{SS}{n - 1} = \frac{\sum d^2}{n - 1} = \frac{\sum (X - \bar{X})^2}{n - 1}$    (EQ-2)

Finally, the standard deviation can be calculated by taking the square root of the variance.

standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}}$    (EQ-3)

Consider the following experimental data which we shall use to illustrate the computation of mean, variance, and standard deviation. Scientific calculators and microcomputer programs can perform these operations in seconds, but your purpose here should be to understand the principles behind the calculations.

A biologist plants 50 corn seeds and allows them to grow for a two-week period, after which he measures the height of each plant in inches. The following data were recorded:

3.7  3.4  4.3  5.2  4.9  3.7  5.5  4.0
5.5  4.3  5.2  4.6  4.9  3.4  5.8  4.3
6.1  5.5  3.7  4.0  3.7  4.0  5.8  4.6
5.2  3.7  5.5  4.3  5.5  5.2  4.0  5.8
4.0  5.5  4.0  5.8  4.0  5.8  4.0  5.5
3.7  6.1  4.3  5.2  3.7  5.5  5.2  6.1
4.3  4.9

The investigator decides to determine the frequency distribution of the values and develops the following graph:

FREQUENCY (f)
  8 |            X                        X
  7 |       X    X                        X
  6 |       X    X    X              X    X
  5 |       X    X    X              X    X    X
  4 |       X    X    X              X    X    X
  3 |       X    X    X         X    X    X    X    X
  2 |  X    X    X    X    X    X    X    X    X    X
  1 |  X    X    X    X    X    X    X    X    X    X
    +--------------------------------------------------
      3.4  3.7  4.0  4.3  4.6  4.9  5.2  5.5  5.8  6.1

                  PLANT HEIGHT (INCHES)

Note that a graphic display of frequency distribution enhances the ability to interpret the data. Often, when a data set is plotted in this way, frequencies of values are distributed in a symmetrical fashion in which the histogram (bar graph) forms a bell shape. This is referred to as a normal distribution. The value with the highest frequency is called the mode. Note that it appears that there may not be just one normal distribution, but two, each with a separate mode. This is called a bimodal distribution. The question you may now have is, "Are there possibly two statistical populations represented here, each with a separate mean?" Statistical analysis allows our biologist to determine the probability that this statistical sample of corn plants is displaying a range of plant heights that follow two separate normal distributions, each distributed around one mean ($\bar{X}$) that is significantly different from the other.

The biologist decides to divide the corn seedlings into two groups, one with heights from 3.4 to 4.6 inches, and the other from 4.9 to 6.1 inches. This results in a separation of the 50 seedlings into two groups of 25 plants each. The next step is to determine the statistical parameters separately for each of the two groups. Even though two groups have been created, the biologist decides to base his statistical analysis on a null hypothesis which may be stated as follows:

There is no difference in plant height between the two groups.

That is, the difference between the mean of group one ($\bar{X}_1$) and that of group two ($\bar{X}_2$) equals zero. The question is whether or not these two statistical sample means are each positioned within separate normal distributions, which in turn center around two distinct statistical population means, µ1 and µ2. The null hypothesis states that there is no difference between the two means. This serves as a basis for the statistical calculations, which will take into account the apparent difference between means of the two groups and the variance of the data around the means.

In calculating statistics without a microcomputer, it is helpful to use a data table such as the one below to reduce chances of error. Note the order of mathematical operations from left to right as performed on the first group of data -- group #1. Note that $\sum f = n$.

If you use a hand calculator or computer program to determine the sum of squares (SS), variances, and standard deviations, the so-called "machine formula" for SS is simpler. This relationship is as follows:

$SS = \sum X^2 - \frac{(\sum X)^2}{n}$

The SS value is then used to determine variance and standard deviation as in EQ-2 and EQ-3. The spreadsheet program, BIOSTATS, which is explained in Part B, uses the machine formula.

Table 1. Computation of mean and standard deviation -- Group 1.
----------------------------------------------------------------------
sample   frequency               deviation
  X          f         fX     $d = X - \bar{X}$    $d^2$    $fd^2$
------   ---------   ------   -----------------   ------   -------
 3.4         2         6.8          -0.6           0.36      0.72
 3.7         7        25.9          -0.3           0.09      0.63
 4.0         8        32.0           0.0           0.00      0.00
 4.3         6        25.8           0.3           0.09      0.54
 4.6         2         9.2           0.6           0.36      0.72
         ---------   ------                                -------
Total       25        99.7                         SS1 =     2.61
----------------------------------------------------------------------

Mean: $\bar{X} = \frac{\sum fX}{\sum f} = \frac{99.7}{25} = 4.0$

Variance: $s^2 = \frac{SS_1}{n - 1} = \frac{2.61}{24} = 0.11$

Standard deviation: $s = \sqrt{s^2} = \sqrt{0.11} = 0.33$
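The arithmetic in Table 1 can be checked with a short script. The following is a minimal sketch in Python and is not part of BIOSTATS; the variable names are illustrative. It computes SS both from the deviations (as in EQ-2) and from the machine formula, which should agree. It uses the unrounded mean (3.988), whereas Table 1 rounds the mean to 4.0 before taking deviations, so small differences in the third decimal place are expected.

```python
import math

# Group 1 frequency table from Table 1: height (inches) -> frequency
group1 = {3.4: 2, 3.7: 7, 4.0: 8, 4.3: 6, 4.6: 2}

n = sum(group1.values())                          # sum of f equals n
sum_fx = sum(x * f for x, f in group1.items())    # sum of fX
mean = sum_fx / n

# Sum of squares from deviations: sum of f * (X - mean)^2
ss_deviation = sum(f * (x - mean) ** 2 for x, f in group1.items())

# Machine formula: SS = sum(X^2) - (sum X)^2 / n
sum_fx2 = sum(f * x ** 2 for x, f in group1.items())
ss_machine = sum_fx2 - sum_fx ** 2 / n

variance = ss_machine / (n - 1)     # EQ-2
std_dev = math.sqrt(variance)       # EQ-3

print(f"n = {n}, mean = {mean:.2f}")
print(f"SS (deviations) = {ss_deviation:.2f}, SS (machine) = {ss_machine:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

Both routes give the same SS; the machine formula simply avoids computing each deviation separately.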


Next, the same computations should be performed on the other sample of 25 measurements, group #2. For practice in working through the calculations, complete the table below:

Table 2. Computation of mean and standard deviation -- Group 2.
----------------------------------------------------------------------
sample   frequency               deviation
  X          f         fX     $d = X - \bar{X}$    $d^2$    $fd^2$
------   ---------   ------   -----------------   ------   -------
 4.9
 5.2
 5.5
 5.8
 6.1
         ---------   ------                                -------
Total                                              SS2 =
----------------------------------------------------------------------

Statistics [check your calculations]: $\bar{X}$ = 5.5, $s^2$ = 0.13, s = 0.36
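As a cross-check on the hand calculations for both groups, the sketch below works from the 50 raw heights rather than from the frequency tables. It is only an illustration, assuming Python's standard `statistics` module (whose `variance` and `stdev` divide by n - 1, as in EQ-2); the helper name `summarize` is hypothetical.

```python
import statistics

# The 50 plant heights (inches) recorded in Part A
heights = [
    3.7, 3.4, 4.3, 5.2, 4.9, 3.7, 5.5, 4.0,
    5.5, 4.3, 5.2, 4.6, 4.9, 3.4, 5.8, 4.3,
    6.1, 5.5, 3.7, 4.0, 3.7, 4.0, 5.8, 4.6,
    5.2, 3.7, 5.5, 4.3, 5.5, 5.2, 4.0, 5.8,
    4.0, 5.5, 4.0, 5.8, 4.0, 5.8, 4.0, 5.5,
    3.7, 6.1, 4.3, 5.2, 3.7, 5.5, 5.2, 6.1,
    4.3, 4.9,
]

# Split at the gap between 4.6 and 4.9 inches, as the biologist did
group1 = [h for h in heights if h <= 4.6]
group2 = [h for h in heights if h >= 4.9]

def summarize(values):
    """Return (n, mean, variance, standard deviation), dividing SS by n - 1."""
    return (len(values), statistics.mean(values),
            statistics.variance(values), statistics.stdev(values))

for label, group in (("Group 1", group1), ("Group 2", group2)):
    n, mean, var, sd = summarize(group)
    print(f"{label}: n = {n}, mean = {mean:.1f}, s2 = {var:.2f}, s = {sd:.2f}")
```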

When the statistics for the two means, $\bar{X}_1$ and $\bar{X}_2$, have been computed, we are ready to determine whether there is a statistically significant difference between the two statistical populations. Every time the word "significant" is used in statistics, there must always be a level of significance attached to it. The significance level is expressed as a percentage. Statistics will not tell us whether or not the null hypothesis is true. However, we may choose a level of significance such as 5%, which represents a 5% chance of rejecting the null hypothesis when it is in fact true. Many biologists regard significant difference at the 5% level as satisfactory for their purposes. This is a good reminder of the tentative nature that is a part of the biological sciences.

So, our question now becomes, "Is there a statistically significant difference at the 5% probability level between the two means calculated above?" To determine the answer, one must consider two factors that will influence whether or not the two means are really different: (1) the difference between the means [$\bar{X}_1 - \bar{X}_2$] -- i.e. the deviation of the difference of the means from zero; and (2) the standard deviation of each group of data. Remember that the null hypothesis says $\bar{X}_1 = \bar{X}_2$. Therefore, the greater the deviation of the difference of the means from zero, the greater the probability that the two means are significantly different. Then, for a given difference between the means, a larger and larger variance (hence standard deviation) would make it more and more difficult to reject the null hypothesis which states that no difference exists between the means. Because both (1) and (2) above influence whether or not the means are really different, the quotient of (1) divided by (2) can be used as an indication of whether the means are part of one statistical population or two separate ones. This quotient is the t-value, and it may be calculated as follows:

t = (deviation of the difference of the means from zero) / (standard deviation of the difference of means)

or,

$t = \frac{|\bar{X}_1 - \bar{X}_2|}{s_D}$    (EQ-4)


The standard deviation of the difference of means, $s_D$, is the square root of the sum of the two variances, each first divided by its sample size. That is,

$s_D = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$    (EQ-5)

One can use the distribution of t-values in a table of probabilities to determine the point at which there is a significant difference between two means. The t-values represent the difference between two means expressed in standard deviation units. As the t-value becomes greater, it suggests that the difference between means is becoming greater per unit standard deviation. Consult PART B for instructions on how to compute t-values using a spreadsheet program, BIOSTATS.
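EQ-4 and EQ-5 translate directly into a few lines of code. The sketch below is illustrative only (the function name `t_value` is an assumption, not something BIOSTATS provides). It applies the two equations to the rounded statistics of the two corn seedling groups: means of 4.0 and 5.5, variances of 0.11 and 0.13, and n = 25 in each group.

```python
import math

def t_value(mean1, var1, n1, mean2, var2, n2):
    """t-value from summary statistics, following EQ-4 and EQ-5."""
    s_d = math.sqrt(var1 / n1 + var2 / n2)    # EQ-5
    return abs(mean1 - mean2) / s_d           # EQ-4

# Rounded statistics for the two groups of corn seedlings (Tables 1 and 2)
t_corn = t_value(4.0, 0.11, 25, 5.5, 0.13, 25)
df = 2 * (25 - 1)    # degrees of freedom for two samples of equal size

print(f"t = {t_corn:.2f}, df = {df}")    # t is about 15.3 with df = 48
```

Because the inputs are the rounded values from the tables, the result is approximate; working from unrounded means and variances shifts it only slightly.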

Having computed t, we are now ready to consult a table of t-values from the t-distribution as it relates to sample size and probability level. Sample size is taken into account through the component called degrees of freedom (df). For this t-test, with two samples of equal size n, df = 2(n - 1). Note that the table of values has columns representing different probability levels or levels of statistical significance. Horizontal rows represent degrees of freedom.

Statistical Table: Distribution of t

Degrees of                  Probability (p)
Freedom        0.1      0.05     0.02     0.01     0.001
----------   ------   ------   ------   ------   -------
    1         6.31    12.71    31.82    63.66    636.62
    2         2.92     4.30     6.97     9.93     31.60
    3         2.35     3.18     4.54     5.84     12.92
    4         2.13     2.78     3.75     4.60      8.61
    5         2.02     2.57     3.37     4.03      6.87

    6         1.94     2.45     3.14     3.71      5.96
    7         1.89     2.37     3.00     3.50      5.41
    8         1.86     2.31     2.90     3.36      5.04
    9         1.83     2.26     2.82     3.25      4.78
   10         1.81     2.23     2.76     3.17      4.59

   11         1.80     2.20     2.72     3.11      4.44
   12         1.78     2.18     2.68     3.06      4.32
   13         1.77     2.16     2.65     3.01      4.22
   14         1.76     2.14     2.62     2.98      4.14
   15         1.75     2.13     2.60     2.95      4.07

   20         1.72     2.09     2.53     2.85      3.85
   25         1.71     2.06     2.49     2.79      3.73
   30         1.70     2.04     2.46     2.75      3.65
   40         1.68     2.02     2.42     2.70      3.55
    ∞         1.65     1.96     2.33     2.58      3.29

Suppose our experiment involved 2 samples of ten readings in each sample -- i.e. n = 10, and df = 2(n - 1) = 2(10 - 1) = 18. If we select the 5% level of significance, then the critical t-value for column (.05) and df = 18 is t = 2.10, which lies between the tabulated entries for df = 15 (2.13) and df = 20 (2.09). This t-value is called the critical value.


If our calculated t-value falls below this critical value, then we can assume that the difference between the means per unit of deviation from the means is small enough that we can accept our null hypothesis, which states that there is no difference between the means. However, if our calculated t-value is greater than the critical value, the difference between the means is great enough (per unit of deviation from the mean) that we have cause to reject the null hypothesis and conclude that the means are significantly different at the 5% level. That is, if the null hypothesis were true, a difference this large would be expected fewer than 5 times out of 100. PART B presents an example in which 10 sample values for each of two treatments are compared and the means and t-values are computed (see Table 3). Note that the t-value computed in that example data set is 6.62 (see Table 3, bottom of Column B). In this case, we would conclude that the means of Sample 1 and 2 are significantly different because the computed t-value is larger than the critical value of 2.10.
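The decision rule can be written as a small lookup. The sketch below is an illustration only: it stores the 0.05 column of the t-table, adds the df = 18 value of 2.10 quoted in the text, and, for a df that is not tabulated, falls back to the nearest tabulated df below it. That fallback is an assumption made here for simplicity; it gives a slightly larger, more cautious critical value. The names `CRITICAL_T_05`, `critical_t`, and `reject_null` are hypothetical.

```python
import math

# 0.05 column of the t-table; the df = 18 entry is the value quoted in the text
CRITICAL_T_05 = {
    1: 12.71, 2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57,
    6: 2.45, 7: 2.37, 8: 2.31, 9: 2.26, 10: 2.23,
    11: 2.20, 12: 2.18, 13: 2.16, 14: 2.14, 15: 2.13,
    18: 2.10, 20: 2.09, 25: 2.06, 30: 2.04, 40: 2.02,
    math.inf: 1.96,
}

def critical_t(df):
    """Critical t at the 5% level, using the nearest tabulated df at or below df."""
    row = max(k for k in CRITICAL_T_05 if k <= df)
    return CRITICAL_T_05[row]

def reject_null(t, df):
    """True if the computed t exceeds the critical value (reject the null hypothesis)."""
    return t > critical_t(df)

# Part B example: computed t = 6.62 with df = 18 exceeds the critical value 2.10
print(reject_null(6.62, 18))
```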

Hopefully, this brief explanation of the t-test as a method of determining whether there is a significant difference between means of two samples will be helpful as you use these or other statistical approaches to interpret your data. Once you understand something of the significance of mean, variance, and standard deviation and the mathematical operations used to compute them, you may choose to use a spreadsheet program or scientific calculator to save time. See PART B for instructions.

PART B -- Performing the t-test by Computer Spreadsheet

1. Go to the “Start” menu, then "Science and Math", then "Biology", and click on the "Biostats Spreadsheet" icon. If necessary, click “Yes” on “Read Only File.” BIOSTATS will appear somewhat like Table 3, except that in Table 3, hypothetical data have been entered. Note that sample size = 10 (see row 3), with individual values entered beneath in columns B and C.

2. Study the spreadsheet and notice the designations of statistical parameters in column "A." These parameters will automatically be computed for you after you enter your data. It is assumed that you have read Part A of this document to learn the significance of mean, variance, standard deviation, etc. BIOSTATS can assist you in using the t-test to determine whether your null hypothesis regarding two sample means should be accepted or rejected. Columns B and C are ready to receive your experimental data.

3. Enter the values for sample size (n) for SAMPLE 1 and SAMPLE 2 in columns B and C. Then, enter your individual "measured values" in columns B and C, respectively. Leave unused cells blank.

4. As soon as your values are entered, you will see the results appear in the rows beneath. Record or print the spreadsheet for your records. SAVE your computations to a file (e.g. h:\docs\biostats\wb2).

5. Once you have computed the value of t, consult the “t-Table” to obtain the “critical t value.”

a. If the critical value of t, located at the intersection of df = 2(n - 1) and the 5% probability level, is GREATER THAN your computed t-value, you should ACCEPT the null hypothesis that $\bar{X}_1 = \bar{X}_2$. There is no significant difference between the means at the 5% probability level.

b. If the critical value of t, located at the intersection of df = 2(n - 1) and the 5% probability level, is LESS THAN your computed t-value, you should REJECT the null hypothesis that $\bar{X}_1 = \bar{X}_2$. There is sufficient deviation of the difference between means (per unit standard deviation) to conclude that $\bar{X}_1$ is different from $\bar{X}_2$ at the 5% probability level.


6. If your experiment has three or four means to compare, you may use the t-test on each possible pair of means. PART C explains how you can present your t-values in an easy-to-read table.

7. Use Microsoft Excel or another spreadsheet program to present your data in appropriate graphs or tables, identified with the respective experimental variables being studied. To conserve paper and ink, you need not print out all of the BIOSTATS spreadsheets; simply copy the means, t-values, and other parameters if you want them, or save each spreadsheet under an appropriate name for later consultation.

Table 3. BIOSTATS Spreadsheet with Hypothetical Data Entry

Column A                        B           C
                             SAMPLE 1    SAMPLE 2    SAMPLE 1    SAMPLE 2
                                                     SQUARES     SQUARES
SAMPLE SIZE (n) =               10          10
MEASURED VALUES:
  X1                           4.00        8.00       16.00       64.00
  X2                           5.00        9.00       25.00       81.00
  X3                           5.00       10.00       25.00      100.00
  X4                           6.00       10.00       36.00      100.00
  X5                           6.00       10.00       36.00      100.00
  X6                           6.00       11.00       36.00      121.00
  X7                           7.00       11.00       49.00      121.00
  X8                           7.00       11.00       49.00      121.00
  X9                           8.00       12.00       64.00      144.00
  X10                          9.00       12.00       81.00      144.00

Columns A-C (statistics computed from the measured values):
[SUM of X1...Xn] =            63.00      104.00
MEAN = (SUM of X1...Xn / n)    6.30       10.40
SQ. ROOT OF SAMP SIZE =        3.16        3.16
SAMP VARIANCE (SS/(n-1)) =     2.23        1.60
SAMPLE STANDARD DEV. =         1.49        1.26
STAND. ERROR OF MEAN =         0.47        0.40
COMPARING SAMPLES:
|DIFF OF MEANS| =              4.10
COMBINED VAR / n =             0.38
S.D. of DIFF of MEANS =        0.62
COMPUTED "t"-VALUE =           6.62

Squares columns (machine formula and pooled-variance computation):
[SUM X1...Xn] SQ. =         3969.00    10816.00
SUM (X-Squared) =            417.00     1096.00
SUM(X1...Xn) SQ. / n =       396.90     1081.60
SS (Sum of Squares)           20.10       14.40
VARIANCE (SS/(n-1)) =          2.23        1.60
SS1 + SS2                     34.50
DF1 + DF2                     18.00
POOLED VARIANCE                1.92
S.E. of DIFF of MEAN           0.62
"t"-VALUE =                    6.62
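The entries in Table 3 can be reproduced from the twenty measured values. The sketch below is not the BIOSTATS spreadsheet itself, only an illustration in Python of the same sequence of steps: the machine formula for SS, then EQ-2, EQ-3, EQ-5, and EQ-4. The helper name `describe` and the dictionary keys are assumptions made for this example.

```python
import math

# Hypothetical data from Table 3
sample1 = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
sample2 = [8, 9, 10, 10, 10, 11, 11, 11, 12, 12]

def describe(values):
    """Summary statistics for one sample, using the machine formula for SS."""
    n = len(values)
    total = sum(values)
    ss = sum(x * x for x in values) - total ** 2 / n   # machine formula
    variance = ss / (n - 1)                            # EQ-2
    sd = math.sqrt(variance)                           # EQ-3
    return {"n": n, "sum": total, "mean": total / n, "ss": ss,
            "variance": variance, "sd": sd, "sem": sd / math.sqrt(n)}

s1, s2 = describe(sample1), describe(sample2)

diff = abs(s1["mean"] - s2["mean"])
s_d = math.sqrt(s1["variance"] / s1["n"] + s2["variance"] / s2["n"])  # EQ-5
t = diff / s_d                                                        # EQ-4
df = (s1["n"] - 1) + (s2["n"] - 1)
pooled_variance = (s1["ss"] + s2["ss"]) / df

for label, s in (("Sample 1", s1), ("Sample 2", s2)):
    print(f"{label}: mean = {s['mean']:.2f}, SS = {s['ss']:.2f}, "
          f"variance = {s['variance']:.2f}, s = {s['sd']:.2f}, SE = {s['sem']:.2f}")
print(f"pooled variance = {pooled_variance:.2f}, s_D = {s_d:.2f}, "
      f"t = {t:.2f}, df = {df}")
```

With equal sample sizes, the pooled-variance route shown in the squares columns and the route through EQ-5 give the same standard error of the difference, which is why both columns of Table 3 end at the same t-value.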


PART C – A Suggested Format for Presenting t-Values from Statistical Analysis of Paired Means.

In cases where you have conducted t-tests on multiple pairs of means, it is helpful to present the t-values in a table such as Table 4. Note that example data are provided for photon flux density to illustrate the use of (*) to indicate the level of statistical significance (see footnotes).

Table 4. t-Values for Comparison of Means of Microclimate Parameters for Three Communities.

                         Deciduous      Deciduous      Coniferous
Comparison -->           versus         versus         versus
Parameter                Coniferous     Open Field     Open Field
-------------------      ----------     ----------     ----------
Photon Flux Density      2.01           2.89*          3.51**
Soil Temperature
Air Temperature
Wind Velocity
Relative Humidity

*Indicates paired means are significantly different at 0.05 probability level, df = 8; Critical t-value = 2.31.
**Indicates paired means are significantly different at 0.01 probability level, df = 8; Critical t-value = 3.36.
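When many pairs of means are compared, the asterisks can be assigned mechanically. The following minimal sketch assumes the critical values given in the footnotes of Table 4 (2.31 and 3.36 for df = 8); the function name `mark_significance` is hypothetical.

```python
def mark_significance(t, critical_05, critical_01):
    """Format a t-value with * (5% level) or ** (1% level), as in Table 4."""
    if t > critical_01:
        return f"{t:.2f}**"
    if t > critical_05:
        return f"{t:.2f}*"
    return f"{t:.2f}"

# Photon flux density row of Table 4; critical values for df = 8
comparisons = {
    "Deciduous versus Coniferous": 2.01,
    "Deciduous versus Open Field": 2.89,
    "Coniferous versus Open Field": 3.51,
}
for name, t in comparisons.items():
    print(f"{name}: {mark_significance(t, 2.31, 3.36)}")
```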

PART D -- Computing Confidence Intervals Using the Standard Error of Means

Consider the data reported by Asghar and DeMason (1990) as shown in Figure 1, in PART D. Note that they performed daily measurements of lupine cotyledons for 21 days. Each data point (box, circle, etc.) represents the sample mean ($\bar{X}$) of several replicate measurements of separate plants. Instead of using an analysis of variance or a t-test, they have chosen to report the degree of variability of each set of data using vertical bars extending at equal length above and below the mean for that set of data. The bar length extending above (or below) the mean equals the standard error of the mean. This statistic represents a standard deviation of means. That is, if they took repeated samples of data from the same population and calculated means of each sample, there would likely be a normal distribution of means in a bell-shaped curve which includes the population mean (µ). However, rather than using repeated samplings, researchers can calculate the standard error, $s_{\bar{X}}$, based upon the standard deviation, s, as follows:

$s_{\bar{X}} = \frac{s}{\sqrt{n}}$    (EQ-6)

where n is the sample size. Note that, for a given sample size, the standard error will be greater if the standard deviation, s, is greater. Hence, the more variable the data (greater variance), the wider will be the range above and below the mean represented by the standard error. It is also apparent that a larger sample size will tend to reduce $s_{\bar{X}}$, since n appears (as a square root) in the denominator.

Note how the presentation of means accompanied by standard errors in Figure 1 helps the reader in data interpretation. First, error bars reveal differences in variability around each mean. For example, note that the $s_{\bar{X}}$ for cotyledon thickness is much greater on day 9 after sowing than on neighboring days.

Second, note that the intervals included within the vertical bars for days 8 through 21 are overlapping, suggesting that the means for these days are not significantly different from one another, even though the means are not numerically identical. The interval represented by the vertical bars can be calculated as follows:

Lower Limit = $\bar{X} - s_{\bar{X}}$        Upper Limit = $\bar{X} + s_{\bar{X}}$

Many journal reports such as that of Asghar and DeMason go no further than to give the mean ± S.E. However, using $s_{\bar{X}}$, one can express a confidence interval, or range within which the statistical population mean, µ, may be said to exist, with a given level of significance. This interval is represented as follows:

Confidence Interval for µ = $\bar{X} \pm t \, s_{\bar{X}}$

To determine this interval, first calculate $s_{\bar{X}}$ using EQ-6, above, or see Part B. Then, decide what probability (confidence) level you wish to accept, and determine your degrees of freedom (df = n - 1). The 5% level is commonly chosen. With these values, go to the t-table and find t at the intersection of the probability level and df = n - 1. For example, if we choose the 5% level, and df = 7, then t = 2.36. The resulting 95% confidence interval will extend 2.36 $s_{\bar{X}}$ above and below the mean. We then assume that this confidence interval would include the population mean (µ) 95% of the time. Any confidence intervals that do not overlap can be assumed to represent means that are statistically different.
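As a worked illustration of these steps, the sketch below computes a 95% confidence interval for Sample 1 of Table 3 (n = 10, so df = 9, for which the t-table gives t = 2.26 at the 5% level). It is a minimal example using Python's standard `statistics` module; the function name `confidence_interval` is hypothetical.

```python
import math
import statistics

def confidence_interval(values, t_critical):
    """Return (lower, upper) limits: mean +/- t * standard error of the mean."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))   # EQ-6
    margin = t_critical * sem
    return mean - margin, mean + margin

sample1 = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]    # Sample 1 of Table 3

# n = 10, so df = 9; the t-table gives t = 2.26 at the 5% level
lower, upper = confidence_interval(sample1, 2.26)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

For these data the interval runs from about 5.23 to 7.37, compared with 5.83 to 6.77 for the mean plus or minus one standard error.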

PART E -- Presenting Data in Graphs and Tables

Graphs and data tables should be included in scientific reports to aid readers in understanding your experimental results. At a glance, readers can make visual comparisons of paired mean values and trends in values caused by experimental treatments or time. The following suggestions may guide you in the choice of formats:

Graphs should be used when it is important to show a pattern or trend in your data. The guidelines below should be followed as exemplified in the two accompanying graphs reprinted from two different scientific journal articles:

1. Number each graph with a sequence of "figure numbers" which are separate from "table numbers". The number sequence should follow the order in which you refer to them in your report.

2. For each graph, include a legend which explains the symbols used for data points and the statistical treatment of the data. The graph should be clear and easy to understand on its own.

3. Axes should be chosen to present the dependent variable on the y-axis. Label each axis with the name of the parameter and its units. Scale each axis numerically so that your data and resultant curves fill out the graph.

4. If data points are sample means, indicate standard errors by vertical bars, or confidence intervals based upon the t-test. See Asghar and DeMason (1990), Figure 1, next page.

5. Where the independent variable represents different treatments, and not a time course, you may use a histogram (bar graph), as illustrated by the graph of Silvius et al. (1979).

6. Where possible use a computer-generated graphing program which constructs graphs from input data.



Data Tables should be used when it is important to compare numerical values, and to summarize or support your verbal discussion. Consult the following guidelines and examples from scientific journals:

1. Number each table in sequence with the flow of your report, and include a legend to make the table understandable on its own.

2. Include independent variables (usually in the left column) and dependent variables along the top as column headings, with units (preferably metric) for each variable.

3. Specify your sample size and statistical information such as the type of statistical analysis and probability level. Indicate which means are significantly different from the other means. One convenient technique is to print different letters next to means that differ significantly.


REFERENCES

1. Asghar, R. and D.A. DeMason. 1990. Developmental changes in the cotyledons of Lupinus luteus L. during and after germination. Am. Jour. Botany 77(10): 1342-1353.

2. Bishop, O.N. 1983. Statistics for Biology: Microcomputer Edition. 4th ed. Longman. Essex, England.

3. Davis, T.M., L.J. Matthews, and W.R. Fagerberg. 1990. Comparison of tetraploid and single gene-induced gigas variants in chickpea (Cicer arietinum). I. Origin and genetic characterization. Am. Jour. Botany 77(3): 295-299.

4. McMillan, Victoria E. 1988. Writing Papers in the Biological Sciences. Bedford / St. Martin's. New York, NY.

5. Silvius, J.E., N.J. Chatterton, and D.F. Kremer. 1979. Photosynthate partitioning in soybean leaves at two irradiance levels. Plant Physiology 64: 872-875.

6. Sokal, R.R. and F.J. Rohlf. 1973. Introduction to Biostatistics. W.H. Freeman and Co., San Francisco, CA.
