
Advanced statistics for librarians

DESCRIPTION

Preconference workshop, Charleston Conference, November 7, 2006

Page 1: Advanced statistics for librarians

Advanced Statistics for Librarians

How to use and evaluate statistical information in library research

John McDonald, Acquisitions Librarian, Caltech

Jason Price, Science & Electronic Resources Librarian, Claremont Colleges

Page 2: Advanced statistics for librarians

Advanced Statistics

Part I: Research Design

Part II: Statistical Concepts

Part III: Evaluating Library Statistics

Page 3: Advanced statistics for librarians

Research Design

Validity: How accurately an indicator measures the concept being studied. Is the technique appropriate to measure the concept being studied?

Reliability: How consistent the measurement is. Does it yield the same results over repeated attempts and by different researchers? How certain are the results?

Generalizability: How well (or how likely) can the findings be applied to other situations?

Page 4: Advanced statistics for librarians

Research Design Steps

Research Question

Hypotheses

Data definitions

Data collection

Data analysis

Conclusions

Page 5: Advanced statistics for librarians

Research Question

What is the study designed to answer?

Why is the study important?

The more specific, the better!

Example: Should the library increase hours during finals week?

Page 6: Advanced statistics for librarians

Hypothesis

A statement about the expected results.

What you will test after collecting data.

Null hypothesis: there is no difference between Group 1 and Group 2, or between before and after. Notated H0: μ1 = μ2, where μ1 and μ2 are the two group means.

Alternate hypothesis: there is a difference, and what that difference will be. Notated Ha: μ1 ≠ μ2

Can also be directional if theory or prior research indicates: Ha: μ1 > μ2

Page 7: Advanced statistics for librarians

Data collection

Observation

Interviews

Focus Groups

Surveys

Transaction Logs

Others?

Page 8: Advanced statistics for librarians

Data Collection: Sampling

Necessary when it is impossible to study an entire population due to logistical, geographical, monetary, or time constraints.

A sample must be a good representation of the rest of the population.

The larger your sample, the more sure you can be that its answers truly reflect the population.

Accuracy also increases when responses are lopsided, i.e. when far more respondents pick one choice over another. E.g., an estimate is more accurate when 99% choose one presidential candidate than when the split is 50/50.

The larger your population size, the larger your sample needs to be, except if your population is very large (e.g., the U.S.) or very small (e.g., your household).

Page 9: Advanced statistics for librarians

Sampling Designs

Simple: assumes homogeneity

Stratified: assumes heterogeneity
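To make the two designs concrete, here is a minimal Python sketch. The patron categories, their sizes, and the function names are hypothetical and not from the slides; the stratified version uses proportional allocation, which is one common choice among several.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(strata, n, seed=0):
    """Draw from each stratum in proportion to its share of the population.

    strata maps a stratum label to the list of its members.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for label, members in strata.items():
        # Proportional allocation; rounding can shift the overall total slightly.
        k = round(n * len(members) / total)
        sample[label] = rng.sample(members, k)
    return sample

# Hypothetical patron list: 60 undergraduates, 30 graduate students, 10 faculty.
patrons = {
    "undergraduate": [f"U{i}" for i in range(60)],
    "graduate": [f"G{i}" for i in range(30)],
    "faculty": [f"F{i}" for i in range(10)],
}
everyone = [p for members in patrons.values() for p in members]

simple = simple_random_sample(everyone, 10)
stratified = stratified_sample(patrons, 10)
stratum_counts = {label: len(picked) for label, picked in stratified.items()}
print(simple)
print(stratum_counts)
```

A simple random sample of 10 may contain no faculty at all, while the stratified draw guarantees each group is represented in proportion to its size.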

Page 10: Advanced statistics for librarians

Calculating Sample Sizes

1) When you have a very large pop size: SS = Z^2 * p * (1 - p) / c^2

2) When you have a finite pop size: ss = SS / (1 + (SS - 1) / pop)

Z = Z value (e.g. 1.96 for 95% confidence level)

p = percentage picking a choice, expressed as decimal (e.g. .5 for 50%)

c = confidence interval, expressed as decimal (e.g., .04 = ±4%)

Sample size spreadsheet
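The two formulas can be checked with a short Python sketch. The function names and the population of 2,000 are illustrative assumptions; the Z, p, and c values are the examples given on the slide.

```python
def sample_size_large_population(z, p, c):
    """Formula 1: SS = Z^2 * p * (1 - p) / c^2."""
    return (z ** 2) * p * (1 - p) / (c ** 2)

def sample_size_finite_population(ss_large, pop):
    """Formula 2: ss = SS / (1 + (SS - 1) / pop)."""
    return ss_large / (1 + (ss_large - 1) / pop)

# 95% confidence (Z = 1.96), p = 0.5, margin of +/-4%.
ss_large = sample_size_large_population(z=1.96, p=0.5, c=0.04)

# Hypothetical finite population of 2,000 patrons.
ss_finite = sample_size_finite_population(ss_large, pop=2000)

# A lopsided split (p = 0.99) needs far fewer respondents for the same margin.
ss_lopsided = sample_size_large_population(z=1.96, p=0.99, c=0.04)

print(round(ss_large, 2), round(ss_finite, 2), round(ss_lopsided, 2))
```

In practice the result is rounded up to the next whole respondent.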

Page 11: Advanced statistics for librarians

M&M Sampling

• Research Question: What is the color distribution of M&Ms?

• Sample: What is the color distribution of a simple random sample of M&Ms?

• Test: Does my sample yield different results than what is reported by the company?

• Method: Packages of M&Ms distributed to each participant. Each package is a random sample from the company.

Page 12: Advanced statistics for librarians

M&M Sampling

Let’s look at the colors in individual samples of M&Ms.

M&M Data Collection & Testing

        Red   Blue   Green   Yellow   Orange   Brown
John    3     2      1       7        3        4

Page 13: Advanced statistics for librarians

Data Definitions

Data Scales: Nominal, Ordinal, Interval, Ratio

Frequency Distributions: Flat, Normal, Skewed

Variable Types: Dependent, Independent, Extraneous

Page 14: Advanced statistics for librarians

Data Scales

Nominal: scaled without order, indicating only that classifications are different. Example: Public & private institutions.

Ordinal: scaled with order, but without equal distance between values. Example: Carnegie classifications.

Interval: scaled with order and with numerically equal distances on the scale. Example: Grade level (freshman, sophomore, etc.).

Ratio: scaled with equal intervals and a zero starting point. Example: Fulltext downloads.

Nominal and ordinal variables are discrete, while interval and ratio variables are generally treated as continuous.

Page 15: Advanced statistics for librarians

Name that data type!

Salary

Author of a book

Hours spent in the library

Patron status

Publication year of a journal

Ranked journal lists

Test results on instruction classes

Number of articles read

FTE

Page 16: Advanced statistics for librarians

Data Distributions

Described by their kurtosis (how peaked or heavy-tailed the distribution is) and skew (how asymmetric it is).

Normal: bell-shaped curve with gradual slopes

Non-normal (skewed): extreme values with steep slopes
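As a rough illustration, skewness and kurtosis can be computed from the central moments of a data set. This sketch uses the usual moment-based definitions (skewness measures asymmetry; excess kurtosis measures tail weight relative to a normal curve); the function name and the sample values are hypothetical.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # one extreme high value

sym_skew, sym_kurt = shape_statistics(symmetric)
skw_skew, skw_kurt = shape_statistics(right_skewed)
print(sym_skew, sym_kurt)
print(skw_skew, skw_kurt)
```

A skewness near zero suggests a symmetric distribution; a clearly positive value signals a long right tail, which is a common pattern in library budget and usage data.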

Page 17: Advanced statistics for librarians

Fulltime Students at ARL Schools

[Histogram of total fulltime students, x-axis 0 to 60,000. N = 114; Mean = 22,171.55 (about 22K); Std. Dev. = 10,075.743 (about 10K)]

Page 18: Advanced statistics for librarians

Total Salaries & Wages at ARL Libraries

[Histogram of total salaries & wages in million dollars, x-axis 10 to 50. N = 114; Mean = 10,153,401.64 (about 10M); Std. Dev. = 6,534,491.237 (about 6.5M)]

Page 19: Advanced statistics for librarians

Variables

Dependent: the variable being measured, studied, and predicted.

Independent: variables that can be manipulated or are predictors of the dependent variable.

Extraneous: variables other than the independent variables that can influence the dependent variable.

Page 20: Advanced statistics for librarians

Data analysis

Descriptive statistics: Mean, Median, Mode, Standard Deviation

Correlational statistics: Correlation

Inferential statistics: T-test, Regression, Chi-square, ANOVA
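The descriptive statistics listed above are all available in Python's standard `statistics` module. A minimal sketch, using made-up download counts rather than any data from the workshop:

```python
import statistics

# Hypothetical monthly fulltext downloads for six journals.
downloads = [2, 4, 4, 5, 7, 8]

summary = {
    "mean": statistics.mean(downloads),
    "median": statistics.median(downloads),
    "mode": statistics.mode(downloads),
    "stdev": statistics.stdev(downloads),  # sample standard deviation (n - 1)
}
print(summary)
```

Note that `statistics.stdev` divides by n - 1 (sample standard deviation); `statistics.pstdev` is the population version.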

Page 21: Advanced statistics for librarians

Correlational Statistics

Correlation establishes that two measures have a relationship.

Indicates direction & strength, but not causation!

Allows researcher to consider other statistical tests with confidence.

Requirements: random sample, interval or ratio data, normal distribution, linear relationship

Page 22: Advanced statistics for librarians

Correlational Statistics

Direction

Positive: As one value increases, the other does as well.

Example: Age and height. Library: Enrollment & materials budget.

Negative: As one value increases, the other decreases.

Example: Car speed & time to destination. Library: Items purchased & shelf space.

Strength

Value between 1 (positive) and -1 (negative). The closer to either of those values, the stronger the relationship; the closer to 0, the weaker.

Page 23: Advanced statistics for librarians

Correlation
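A minimal sketch of the Pearson correlation coefficient in plain Python. The enrollment and budget figures are invented for illustration, and the function name is an assumption, not something from the workshop.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: enrollment (thousands) and materials budget (million dollars).
enrollment = [5, 10, 15, 20, 25]
budget = [1.0, 2.1, 2.9, 4.2, 5.0]

r = pearson_r(enrollment, budget)
print(round(r, 3))
```

A value close to 1 here would indicate a strong positive relationship, but it would still say nothing about whether enrollment causes budget growth.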

Page 24: Advanced statistics for librarians

Inferential Statistics

Parametric: assume that the dependent variable has a known underlying mathematical distribution (normal, binomial, Poisson, etc.) which serves as the basis for sample-to-population estimates. Parametric tests are robust and have great power efficiency.

Non-parametric: do not assume a normal distribution (distribution free) & require that the data meet fewer assumptions. Allow for the analysis of a mixture of data types.

Page 25: Advanced statistics for librarians

T-Test

Determine if there is a difference (in a characteristic) between two populations based on data from samples of those populations.

Requirements: random sample, interval or ratio data, normal distribution, equal standard deviations

Page 26: Advanced statistics for librarians

T-Test
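One way to sketch the two-sample t-test in Python, assuming equal standard deviations in the two groups (the pooled-variance form). The test scores and the function name are hypothetical; the resulting t value would still need to be compared against a t table for the given degrees of freedom.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for two independent samples,
    assuming both populations share the same variance."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, n1 + n2 - 2

# Hypothetical quiz scores for two instruction groups.
in_class = [1, 2, 3, 4, 5]
online = [2, 3, 4, 5, 6]

t_value, dof = pooled_t_statistic(in_class, online)
print(t_value, dof)
```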

Page 27: Advanced statistics for librarians

Regression

Predicts values of a dependent variable based on values of independent (predictor) variables

Requirements: interval or ratio data, normal distribution, correlated variables, linear relationship

Page 28: Advanced statistics for librarians

Regression
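A minimal sketch of simple (one-predictor) least-squares regression in Python. The data and the function name are made up for illustration; real analyses would normally use a statistics package and check the requirements listed for regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: instruction sessions attended (x) and articles cited (y).
sessions = [1, 2, 3, 4]
articles = [3, 5, 7, 9]

slope, intercept = fit_line(sessions, articles)

# Predict the dependent variable for a new value of the predictor.
predicted = intercept + slope * 5
print(slope, intercept, predicted)
```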

Page 29: Advanced statistics for librarians

ANOVA

Determine if there are differences between three or more sample means.

Test the significance and direction of the difference.

Requirements: normal distribution (in each cell), interval or ratio data, homogeneity of variance

Page 30: Advanced statistics for librarians

ANOVA
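The core of a one-way ANOVA is the F ratio of between-group to within-group variance. A minimal Python sketch, with invented scores for three groups and a hypothetical function name; the F value would then be compared against an F table for the two degrees-of-freedom values.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical satisfaction scores for three service points.
scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_ratio, df_b, df_w = one_way_anova_f(scores)
print(f_ratio, df_b, df_w)
```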

Page 31: Advanced statistics for librarians

Chi Square Test

Difference between expected and observed frequencies for nominal or ordinal data

Requirements: any type of data, large sample size (>50), similar distributions

Page 32: Advanced statistics for librarians

Chi Square Test

Pepsi Challenge

Observed: Pepsi 85, Coke 57, RC 78

Expected (equal) = 73.33

          O     E        O-E      (O-E)^2   (O-E)^2/E
Pepsi     85    73.33    11.67    136.19    1.86
Coke      57    73.33    -16.33   266.67    3.64
RC        78    73.33    4.67     21.81     0.30
Totals    220   219.99                      χ2 = 5.8

Degrees of freedom = rows - 1 = 3 - 1 = 2

Critical value of χ2 = 5.99 at alpha = 0.05

Observed value of χ2 = 5.8

Decision: Fail to reject H0
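The Pepsi Challenge arithmetic can be reproduced with a few lines of Python. The observed counts and the critical value of 5.99 come from the slide; the function and variable names are illustrative assumptions.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [85, 57, 78]  # Pepsi, Coke, RC
# Null hypothesis: all three brands are chosen equally often.
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

critical_value = 5.99  # from the slide: df = 2, alpha = 0.05
reject_null = chi2 > critical_value

print(round(chi2, 2), degrees_of_freedom, reject_null)
```

Because the statistic falls just under the critical value, the sketch reaches the same decision as the slide: fail to reject H0.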

Page 33: Advanced statistics for librarians

Inferential Statistics

Goal                                      Parametric                  Non-parametric
Compare sample to a hypothetical value    T-test                      Wilcoxon test; Chi-Square
Quantify association between variables    Pearson correlation         Spearman correlation; Kendall's tau
Predict value from measured variables     OLS Regression              Poisson regression; Negative Binomial reg.
Compare 2 unpaired groups                 Unpaired t-test             Mann-Whitney test; Fisher's test
Compare 2 paired groups                   Standard two-group t-test   Mann-Whitney; Kolmogorov-Smirnov
Compare 3+ unmatched groups               ANOVA                       Kruskal-Wallis test; Chi-square test

Page 34: Advanced statistics for librarians

Review: Research Design

Research Question: What will the study answer?

Hypotheses: What do you think the results will be?

Data definitions: What scales are the variables, what is the distribution, and what are the dependent, independent & extraneous variables?

Data collection: What is the best method for collecting the variables of interest?

Data analysis: What are the proper statistical tests to use on the data?

Conclusions: What does the data show us or indicate?

Page 35: Advanced statistics for librarians

Case Studies

Citation Analysis: Antelman, K. (2004). “Do Open-Access Articles Have a Greater Research Impact?” College & Research Libraries 65(5): 372-382.

Usage Analysis: Blecic, D.D. (1999). “Measurements of journal use: an analysis of the correlations between three methods.” Bull Med Libr Assoc 87(1): 20-25.

Service Analysis: Nichols, J.; Shaffer, B.; Shockey, K. (2003). “Changing the Face of Instruction: Is Online or In-class More Effective?” College & Research Libraries 64(5): 378-389.

Page 36: Advanced statistics for librarians

“Changing the Face of Instruction…”

Research Question

Is an online tutorial as effective in teaching library instruction as a classroom setting?

Hypotheses

H1. Students will have higher scores in information literacy tests after library instruction.

H2. Students will have the same or higher scores in info-lit tests after taking online tutorials as students taking traditional instruction.

H3. Students will report as much or more satisfaction with online instruction as students taking traditional instruction.

Page 37: Advanced statistics for librarians

“Changing the Face of Instruction…”

Variables & Data Collection

Variables: Test scores & survey results

Data Collection: Pretest/Posttest & Survey

Statistical Tests

Descriptive statistics incl. mean, standard deviation, standard error; t-tests (1 & 2 tailed)

Conclusions

Accept H1: Instruction improves literacy.

Accept H2 alternative hypothesis: Online has no significant difference from traditional.

Accept H3 alternative hypothesis: Student satisfaction is equal with both methods.

Page 38: Advanced statistics for librarians

“Do Open-Access Articles…”

Research Question

Hypothesis

Variables and Data Collection

Statistical Tests

Conclusions

Critical Questions

Page 39: Advanced statistics for librarians

“Do Open-Access Articles…”

Research Question

Do freely available articles have a greater research impact?

Measures

Research impact: citation rates

Open Access: freely available

Hypotheses

H1. Scholarly articles have a greater research impact if the articles are freely available online than if they are not.

Ho (null hypothesis): There is no difference between the mean citation rates: Ho: d1 = d0

Page 40: Advanced statistics for librarians

“Do Open-Access Articles…”

Variables & Data Collection

Variables: Mean citation rates

Data Collection: At least 50 articles from 10 leading journals in 4 disciplines.

Statistical Tests

Descriptive statistics incl. mean, standard deviation, standard error; Wilcoxon signed-rank test

Conclusions

Reject Ho: Open Access articles are cited more than those that are not OA.

Discussion

Validity? Reliability of Measures? Generalizability? Alternate hypotheses?

Page 41: Advanced statistics for librarians

My favorite statistic…

Baseball is 90% mental – the other half is physical.