SPSS and its usage
Dr. Bijay Lal Pradhan
http://bijaylalpradhan.com.np
2073/06/07 – 06/12
Copyright @ Dr Bijay Lal Pradhan
Objectives of Session I
• Define Statistics and SPSS
• Install SPSS 20 and crack
• Open and exit SPSS
• Importing and exporting data
• Different file formats
What is Statistics?
• Singular form: The process of collecting, organizing, presenting, analysing and interpreting numerical facts.
• Plural form: Aggregates of facts that have the following characteristics:
– Comparable
– Affected by numerous factors
– Numerically expressed
– Systematically collected
– Purposefully collected
– Reasonably accurate
Introduction: What is SPSS?
• Originally an acronym for Statistical Package for the Social Sciences; it now stands for Statistical Product and Service Solutions.
• One of the most popular statistical packages; it can perform highly complex data organization, presentation and analysis with simple instructions.
The Three Windows: Data Editor
• Data Editor
Spreadsheet-like system for defining, entering, editing, and displaying data. Saved files have the extension “.sav”.
The Three Windows: Output Viewer
• Output Viewer
Displays output and errors. Saved files have the extension “.spv”.
The Three Windows: Syntax editor
• Syntax Editor
Text editor for composing syntax. Saved files have the extension “.sps”.
Installation of SPSS 20.0
• You have software SPSS 20.0 in your computer
• There are two folders namely setup and crack
• Open setup folder and double click on
application file “setup”.
• Follow the instruction and install SPSS in your
computer.
• Don’t go for licensing process. Copy "lservrc"
from crack folder and paste it into the installed
directory (C:\Programme\
IBM\SPSS\Statistics\20)
Variable View
[Figure: Variable View window, annotated with the variable descriptions, drop-down menus, and action buttons]
Variable View window: Type
• Type
– Click on the ‘type’ box. The two basic types of variables that you will use are numeric and string. This column
enables you to specify the type of variable.
Variable View window: Width
• Width
– Width allows you to determine the number of characters SPSS will allow to be entered for the variable
Variable View window: Decimals
• Decimals
– Number of decimals
– It has to be less than or equal to 16
3.14159265
Variable View window: Label
• Label
– You can specify the details of the variable
– Labels may contain spaces and can be up to 256 characters long
Variable View window: Values
• Values
– This is used to specify which numbers represent which categories when the variable is categorical
Defining the value labels
• Click the cell in the values column as shown below
• Both the value and its label can be up to 60 characters long.
• After defining each value, click Add; when all values are defined, click OK.
Scales of Measure
• Nominal
– Basic characteristics: Numbers identify and classify objects
– Examples: Social Security numbers, numbering of football players, brand numbers, store types
– Permissible descriptive statistics: Percentages, mode
– Permissible inferential statistics: Chi-square, binomial test
• Ordinal
– Basic characteristics: Numbers indicate the relative positions of objects but not the magnitude of differences between them
– Examples: Quality rankings, rankings of teams in a tournament, preference rankings, market position, social class
– Permissible descriptive statistics: Percentile, median, quartile deviation
– Permissible inferential statistics: Rank-order correlation, Friedman ANOVA
• Scale
– Basic characteristics: Zero point is fixed; ratios of scale values can be compared
– Examples: Length, weight, age, sales, income, costs
– Permissible descriptive statistics: Arithmetic, geometric and harmonic mean; range, MD, SD
– Permissible inferential statistics: Z test, t-test, ANOVA, all other tests
SPSS output viewer
[Figure: SPSS Output Viewer, annotated with the drop-down menus, action buttons, and navigation window]
Practice 1
A study was conducted to learn the attitude of a bank’s customers towards the bank. The question asked of each customer was:
• “Do you feel safe in your transactions with the bank?”
• The respondents were to answer the question on a seven-point scale (1 = Strongly Disagree, 7 = Strongly Agree). Data were also collected on the other variables listed below.
Construct the following variables in the Variable View on the basis of the following information
• Strongly disagree 1
• Moderately disagree 2
• Little disagree 3
• No difference 4
• Little agree 5
• Moderately agree 6
• Strongly agree 7
Other variables
1. Sex of the respondent
Male - M Female - F
2. Marital status
Married - M Single - S
3. Income of the respondent (in rupees)
4. Age of the respondent (in years)
5. Educational background of the respondent
Below higher secondary - 1
Higher secondary - 2
Graduation - 3
Post graduation - 4
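The value labels of Practice 1 can also be written down as a small codebook before they are typed into the Values column. The following is only a sketch in Python; the variable names (safe, sex, marital, education) and the helper label_for are hypothetical and are not part of SPSS.

```python
# Hypothetical codebook for Practice 1: variable name -> {code: label}.
# The variable names are illustrative; only the codes and labels come from the slide.
VALUE_LABELS = {
    "safe": {
        1: "Strongly disagree",
        2: "Moderately disagree",
        3: "Little disagree",
        4: "No difference",
        5: "Little agree",
        6: "Moderately agree",
        7: "Strongly agree",
    },
    "sex": {"M": "Male", "F": "Female"},
    "marital": {"M": "Married", "S": "Single"},
    "education": {
        1: "Below higher secondary",
        2: "Higher secondary",
        3: "Graduation",
        4: "Post graduation",
    },
}


def label_for(variable, code):
    """Return the label for a coded value, or None if the code is not defined."""
    return VALUE_LABELS.get(variable, {}).get(code)
```

Income and age are left out of the codebook because they are measured values, not categories.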
Entering Data
• Data can be copied from Word and pasted into SPSS.
• Alternatively, paste the data into MS Excel first and then copy it into SPSS.
• Or save the data in Excel and import the file into SPSS.
• Or save the data in CSV format and import it into SPSS.
Variable/Case in and out
• Entering new variable
• Deleting the existing variable
• Entering new case
• Deleting the existing cases
Saving the data
• To save the data file you created, click ‘File’ and then ‘Save As’. You can save the file in different formats by choosing one under “Save as type”.
Sorting the data (cont’d)
• Double-click ‘Name of the students’. Then click OK.
Transforming data (cont’d)
• Example: Adding a new variable named ‘corrected_CI’, which is the corrected confidence interval
– Type ‘corrected_CI’ in the ‘Target Variable’ box. Then type ‘8-CI’ in the ‘Numeric Expression’ box. Click OK.
Transforming data (cont’d)
• In the same way, find the natural log of income.
• Type ‘ln_income’ in the ‘Target Variable’ box. Then type ‘LN(income)’ in the ‘Numeric Expression’ box. Click OK.
• In a similar manner, create a new variable named ‘sqrtage’, which is the square root of age.
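The three computed variables can be checked outside SPSS with a few lines of Python. This is a minimal sketch, assuming each case is a dictionary with the keys CI, income and age; the function name and the example case are hypothetical.

```python
import math


def add_transformed_columns(case):
    """Return a copy of one case with the three computed variables added."""
    out = dict(case)
    out["corrected_CI"] = 8 - case["CI"]            # numeric expression 8-CI
    out["ln_income"] = math.log(case["income"])     # natural log of income
    out["sqrtage"] = math.sqrt(case["age"])         # square root of age
    return out


# Hypothetical case, used only to illustrate the calculation.
example = add_transformed_columns({"CI": 3, "income": 20000, "age": 36})
```

Here example["corrected_CI"] is 5 and example["sqrtage"] is 6.0.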
Visual Binning
• Visual Binning is the process of grouping scale data into suitable classes so that the data can be tabulated and conclusions can be drawn from them.
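The idea behind binning can be sketched in Python as follows. The cut points and incomes are hypothetical, and the rule that a value equal to a cut point falls in the lower class is an assumption made for this sketch.

```python
def bin_value(value, cut_points):
    """Return the 1-based class index of a value.

    Class k holds values up to and including cut_points[k - 1];
    values above the last cut point fall in the final class.
    """
    for index, upper in enumerate(cut_points, start=1):
        if value <= upper:
            return index
    return len(cut_points) + 1


# Hypothetical cut points and incomes, for illustration only.
cut_points = [10000, 20000]
incomes = [5000, 10000, 15000, 25000]
classes = [bin_value(v, cut_points) for v in incomes]
```

The resulting class indices can then be tabulated like any other categorical variable.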
Scopes & Limitations of Data Analysis
If the data were collected from a random sample drawn from
a well-defined population in such a way that every unit in the
population has a known non-zero probability of being
included in the sample, then the information derived from
such sample can be generalized to the population (inferential
statistics). If the data were collected from a non-random
sample, then the information derived from sample cannot be
generalized (descriptive statistics).
If data and variables are not properly organized in a computer, then computer software fails to provide meaningful results.
Collection, Organization, Presentation, Analysis, Reporting
Condensation of Data
Summarizing data in tables and graphs (stem-and-leaf display, line graph, bar graph, pie chart and histogram) and in measures of central tendency and dispersion.
1. small tables (frequency tables)
2. graphs or diagrams (histogram, bar graph, pie chart etc.)
3. summary statistics (percentage, mean, standard deviation etc.)
The basic analyses in SPSS that will be introduced in this class
• Frequencies
– This analysis produces frequency tables showing frequency counts and percentages of the values of individual variables.
• Graphical Presentation
– Pie chart, Bar chart, Histogram, Area chart, Line chart, Scatter plot
• Descriptive Statistics
– This analysis shows the maximum, minimum, mean, and standard deviation of the variables
Descriptive & Inferential Statistics
• Statistics
– Descriptive: Tabular, Graphical, Numerical
– Inferential: Estimation (Point, Interval) and Hypothesis Testing (Parametric, Non-Parametric)
The methods of inferential statistics are applicable when results are obtained from a random sample.
Uncertainty always remains while generalizing results from a sample to a population. The degree of uncertainty is measured in terms of probability in inferential statistics.
Univariate Data Analysis
Analysis of data of a single variable at a time is univariate analysis. The suitable univariate data analysis methods by scale of variables are listed below.
What type of data?
Nominal or Ordinal
1. Prepare frequency table
2. Compute mode
3. Compute median (ordinal)
4. Draw graphs
• Bar diagram
• Pie-chart
5. Chi-square test
Scale data
1. Prepare frequency table (discrete)
2. Compute mean, median and mode
3. Compute positional statistics
4. Compute SD, range etc.
5. Draw graphs
• Stem-and-leaf plot (discrete)
• Box-Whisker plot
• Histogram (continuous)
• Bar diagram (discrete)
6. Z, t, F & χ² tests
7. Transform into categorical
Bivariate Data Analysis
Analysis of data of two variables at a time. The kinds of data analysis are listed below.
Nominal / Ordinal
1. Prepare two-way frequency tables
2. Compute row or column percentages
3. Draw charts and diagrams
4. Test hypotheses (chi-square test of independence)
Scale
1. Prepare two-way frequency tables
2. Draw scatter diagram
3. Test hypotheses (chi-square, z, t, F tests)
4. Carry out correlation & simple regression analysis
Frequency Distribution
Frequency distribution of nominal/ordinal data
Stem Leaf Display
Income of the Respondent Stem-and-Leaf Plot
Frequency Stem & Leaf
.00 0 .
28.00 0 . 5555566667777777889999999999
14.00 1 . 01122222333444
12.00 1 . 556667788899
12.00 2 . 011122233444
4.00 2 . 5577
Stem width: 10000
Each leaf: 1 case(s)
Diagrammatic Presentation
• Bar diagram
• Line graphs
• Pie diagram
• Scatter diagram
• Histogram
Descriptive measures
• Measure of Central Tendency
– Mean – Arithmetic, Geometric, Harmonic
– Median
– Mode
• Measure of dispersion
– range
– QD
– SD
Mean Value for Grouped Data
Height     Mid value   Frequency
146-150    148         3
151-155    153         10
156-160    158         21
161-165    163         29
166-170    168         25
171-175    173         10
176-180    178         2
Go to: Data > Weight Cases
Select ‘Weight cases by’, choose the frequency variable, and click OK
Then find the mean using the mid value as the variable
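Weighting cases by frequency amounts to computing a frequency-weighted mean of the mid values. The short Python sketch below reproduces that arithmetic for the table above; it is a hand check, not what SPSS runs internally.

```python
# Mid values and frequencies taken from the height table above.
mid_values = [148, 153, 158, 163, 168, 173, 178]
frequencies = [3, 10, 21, 29, 25, 10, 2]

total_cases = sum(frequencies)  # 100 cases in all
weighted_sum = sum(m * f for m, f in zip(mid_values, frequencies))
grouped_mean = weighted_sum / total_cases

print(grouped_mean)  # 163.05
```

The mean reported by SPSS for the weighted mid-value variable should agree with this figure.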
Skewness - Kurtosis
• Use Compare Means to find the skewness and kurtosis of the data
Estimation
• Point Estimation
• Interval estimation
– Confidence Interval
• (Analyse>descriptive statistics>explore>estimation)
Fundamental of Hypothesis Testing
• There are two types of statistical inference: Estimation and Hypothesis Testing
• Hypothesis Testing: A hypothesis is a claim (assumption) about one or more population parameters.
– Average price of a lunch in Hetauda is μ = Rs 200
– The population mean monthly cell phone bill of this city is: μ = Rs 125
– The average number of TV sets in homes is equal to three; μ = 3
• It is always about a population parameter, not about a sample statistic
• Sample evidence is used to assess the probability that the claim about the population parameter is true
A. It starts with the Null Hypothesis, H0
• We begin with the assumption that H0 is true and any difference between the sample statistic and the true population parameter is due to chance and not a real (systematic) difference.
• Always contains the “=”, “≤” or “≥” sign
• May or may not be rejected
H0: μ = 3 and X̄ = 2.79
B. Next we state the Alternative Hypothesis, H1
• Is the opposite of the null hypothesis
– e.g., The average number of TV sets in homes is not equal to 3 (H1: μ ≠ 3)
• Never contains the “=”, “≤” or “≥” sign
• May or may not be proven
• Is generally the hypothesis that the researcher is trying to prove. Evidence is always examined with respect to H1, never with respect to H0.
• We never “accept” H0; we either “reject” or “do not reject” it
A. Rejection Region Method:
• Divide the distribution into rejection and non-rejection
regions
• Defines the unlikely values of the sample statistic if the
null hypothesis is true, the critical value(s)
– Defines rejection region of the sampling distribution
• Rejection region(s) is designated by α (level of significance)
– Typical values are .01, .05, or .10
• α is selected by the researcher at the beginning
• α provides the critical value(s) of the test
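Rejection Region or Critical Value Approach:
Level of significance = α
• Lower-tail test: H0: μ ≥ 12, H1: μ < 12; rejection region of area α in the lower tail
• Upper-tail test: H0: μ ≤ 12, H1: μ > 12; rejection region of area α in the upper tail
• Two-tail test: H0: μ = 12, H1: μ ≠ 12; rejection regions of area α/2 in each tail
[Figure: sampling distributions with the rejection regions shaded, the critical values marked, and the non-rejection region between them]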
• P-Value Approach
• P-value = Max. probability of Type I error, calculated from the sample.
• Given the sample information, what is the size of the blue area?
• Lower-tail test: H0: μ ≥ 12, H1: μ < 12
• Upper-tail test: H0: μ ≤ 12, H1: μ > 12
• Two-tail test: H0: μ = 12, H1: μ ≠ 12
[Figure: sampling distributions for the three tests with the p-value shown as the blue area]
Type I and II Errors:
• The size of α, the rejection region, affects the risk of making different types of incorrect decisions.
Type I Error
– Rejecting a true null hypothesis, i.e. rejecting H0 when it should NOT be rejected
– Considered a serious type of error
– The probability of Type I Error is α
– It is also called the level of significance of the test
Type II Error
– Failing to reject a false null hypothesis that should have been rejected
– The probability of Type II Error is β
Truth
Decision      H0 true             H0 false
Retain H0     Correct retention   Type II error
Reject H0     Type I error        Correct rejection
α ≡ probability of a Type I error
β ≡ Probability of a Type II error
Two types of decision errors:
Type I error = erroneous rejection of true H0
Type II error = erroneous retention of false H0
• P-Value approach to Hypothesis Testing:
• That is to say that the P-value is the smallest value of α for which H0 can be rejected based on the sample information
• Convert Sample Statistic (e.g., sample mean) to Test Statistic (e.g., Z statistic)
• Obtain the p-value from a table or computer
• Compare the p-value with α
– If p-value < α, reject H0
– If p-value ≥ α, do not reject H0
P-value (Observed Significance Level)
• P-value - Measure of the strength of evidence the sample data provide against the null hypothesis:
P(Evidence this strong or stronger against H0 | H0 is true)
$$\text{P-val}: \; p = P(Z \ge z_{obs})$$
Test of Hypothesis for the Mean
σ known: the test statistic is
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
σ unknown: the test statistic is
$$t_{n-1} = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
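The two test statistics and the p-value decision rule can be tried out in Python. This is a sketch under stated assumptions: H0: μ = 3 and X̄ = 2.79 are taken from the earlier slide, while σ = 0.8, n = 100 and α = 0.05 are hypothetical values chosen only for illustration. The function names are also hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """Z statistic for a mean when the population SD (sigma) is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def t_statistic(sample_mean, mu0, s, n):
    """t statistic (n - 1 degrees of freedom) when only the sample SD (s) is known."""
    return (sample_mean - mu0) / (s / sqrt(n))


def two_tailed_p(z):
    """Two-tailed p-value of a Z statistic under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# sigma, n and alpha below are assumed values, not taken from the slides.
z = z_statistic(sample_mean=2.79, mu0=3, sigma=0.8, n=100)
p = two_tailed_p(z)
alpha = 0.05
decision = "reject H0" if p < alpha else "do not reject H0"
```

The p-value for the t statistic needs the t distribution, which the Python standard library does not provide, so only the statistic itself is computed here.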
Steps to Hypothesis Testing
1. State the H0 and H1 clearly
2. Identify the test statistic (two-tail, one-tail, and
type of test to be used)
3. Depending on the type of risk you are willing to take, specify the level of significance, α
4. Find the decision rule, critical values, and rejection regions. If –CV < actual value (test statistic) < +CV, then do not reject H0
5. Collect the data and calculate the actual value of the test statistic from the sample
Steps to Hypothesis testing, continued
Make statistical decision
• Do not reject H0: conclude H0 may be true
• Reject H0: conclude H1 is “true” (there is sufficient evidence of H1)
Make management/business/administrative decision
When do we use a two-tail test?
when do we use a one-tail test?
• The answer depends on the question you are trying to answer.
• A two-tail test is used when the researcher has no idea which direction the study will go and is interested in both directions.
• (Example: testing a new technique, a new product, or a new theory when we don’t know the direction)
• A new machine is producing 12-fluid-ounce cans of soft drink. The quality control manager is concerned with cans containing too much or too little. The test is then a two-tailed test; that is, the two rejection regions in the tails are most likely (higher probability) to provide evidence of H1.
H0: μ = 12 oz
H1: μ ≠ 12 oz
• A one-tail test is used when the researcher is interested in the direction.
• Example: The soft-drink company puts a label on cans claiming they contain 12 oz. A consumer advocate desires to test this statement. She would assume that each can contains at least 12 oz and tries to find evidence to the contrary. That is, she examines the evidence for less than 12 oz.
• What tail of the distribution is the most logical (higher probability) to find that evidence? The only way to reject the claim is to get evidence of less than 12 oz, i.e. in the left tail.
H0: μ ≥ 12 oz
H1: μ < 12 oz
Type of Hypothesis
• What to test
• Significance of means
– Single mean test
– Double mean test (dependent pairs, independent pairs)
– More than two mean test
How do we measure association
between two variables?
1. For ordinal and nominal variable
• Odds Ratio (OR)
• Chi square test of independence of
attributes
2. For scale variables
• Correlation Coefficient R
• Coefficient of Determination (R-Square)
Example
• A researcher believes that there is a linear relationship between BMI (Kg/m2) of pregnant mothers and the birth-weight (BW in Kg) of their newborn
• The following data set provide information on 15 pregnant mothers who were contacted for this study
BMI (Kg/m2)   Birth-weight (Kg)
20            2.7
30            2.9
50            3.4
45            3.0
10            2.2
30            3.1
40            3.3
25            2.3
50            3.5
20            2.5
10            1.5
55            3.8
60            3.7
50            3.1
35            2.8
Scatter Diagram
• Scatter diagram is a graphical method to display the relationship between two variables
• Scatter diagram plots pairs of bivariate observations (x, y) on the X-Y plane
• Y is called the dependent variable
• X is called an independent variable
[Figure: Scatter diagram of BMI and Birthweight; horizontal axis BMI from 0 to 70, vertical axis birth-weight from 0 to 4]
Is there a linear relationship
between BMI and BW?
• Scatter diagrams are important for
initial exploration of the relationship
between two quantitative variables
• In the above example, we may wish to
summarize this relationship by a
straight line drawn through the scatter
of points
Simple Linear Regression
• Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory.
• An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares.
• Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.
Least-squares or regression line
• These vertical distances, i.e., the distance
between y values and their corresponding
estimated values on the line are called
residuals
• The line which fits the best is called the
regression line or, sometimes, the least-squares line
• The line always passes through the point defined by the mean of X and the mean of Y
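The least-squares idea described above can be sketched in Python using the usual closed-form formulas for the slope and intercept. The function name fit_line and the four data points are hypothetical; the points were chosen to lie exactly on y = 1 + 2x so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # This choice of intercept forces the line through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)

# Residuals: observed y minus the value estimated by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

Because these points lie exactly on a line, every residual is zero; with real data the residuals are non-zero and the fitted line is the one that minimizes their sum of squares.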
Linear Regression Model
• The method of least-squares is available in most of the statistical packages (and also on some calculators) and is usually referred to as linear regression
• Y is also known as an outcome variable
• X is also called as a predictor
Estimated Regression Line
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 1.775351 + 0.0330187\,x$$
1.775351 is called the y-intercept
0.0330187 is called the slope
Application of Regression Line
This equation allows you to estimate the BW of other newborns when the BMI is given.
e.g., for a mother who has BMI = 40, i.e. X = 40, we predict BW to be
$$\hat{y} = 1.775351 + 0.0330187\,(40) = 3.096$$
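The prediction can be checked with a one-line function in Python. This sketch assumes the slope is 0.0330187, which is the value that reproduces the predicted birth-weight of 3.096 Kg stated on the slide; the function name is hypothetical.

```python
INTERCEPT = 1.775351   # y-intercept from the slide
SLOPE = 0.0330187      # slope, assumed from the stated prediction of 3.096


def predict_birth_weight(bmi):
    """Estimated birth-weight (Kg) for a given maternal BMI (Kg/m2)."""
    return INTERCEPT + SLOPE * bmi


predicted = predict_birth_weight(40)
```

Rounded to three decimals, predicted equals 3.096. Predictions are only reasonable for BMI values within the range covered by the data.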
Correlation Coefficient, R
• R is a measure of strength of the linear association between two variables, x and y.
• Most statistical packages and some hand calculators can calculate R
• For the data in our Example R=0.94
• R has some unique characteristics
Correlation Coefficient, R
• R takes values between -1 and +1
• R = 0 represents no linear relationship between the two variables
• R > 0 implies a direct linear relationship
• R < 0 implies an inverse linear relationship
• The closer R comes to either +1 or -1, the stronger is the linear relationship
Coefficient of Determination
• R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1)
• R² measures the proportion of the total variation in y which is explained by x
• For example, R² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
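R and R² can be computed by hand with the sketch below, which uses the standard Pearson formula. The function name pearson_r and the five data points are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
r_squared = r ** 2  # proportion of the variation in y explained by x
```

For these points r is about 0.77 and r_squared is 0.6, so 60% of the variation in y is explained by x.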
Difference between Correlation and
Regression
• Correlation Coefficient, R, measures
the strength of bivariate association
• The regression line is a prediction equation that estimates the values of
y for any given x
Limitations of the correlation
coefficient
• Though R measures how closely the two variables approximate a straight line, it is not a valid measure of the strength of a nonlinear relationship
• When the sample size, n, is small we also have to be careful about the reliability of the correlation
• Outliers could have a marked effect on R
• R does not by itself establish a causal linear relationship
Regression Analysis
• Click ‘Analyze,’ ‘Regression,’ then click
‘Linear’ from the main menu.
Regression Analysis
• For example, let’s analyze the model salbegin = β0 + β1·edu
• Put ‘Beginning Salary’ as Dependent and ‘Educational Level’ as Independent.
Plotting the regression line
• Click ‘Graphs,’ ‘Legacy Dialogs,’
‘Interactive,’ and ‘Scatterplot’ from the main menu.
Plotting the regression line
• Drag ‘Current Salary’ into the vertical axis box and ‘Beginning Salary’ into the horizontal axis box.
• Click the ‘Fit’ tab. Make sure the Method is set to Regression in the Fit box. Then click ‘OK’.
Is the model significant?
• r² is the proportion of the variance in y that is explained by our regression model
• SE is also another measure
• Check significance through the F-statistic:
$$F(df_{\hat{y}}, df_{er}) = \frac{s_{\hat{y}}^2}{s_{er}^2} = \dots = \frac{r^2 (n - 2)}{1 - r^2}$$
(after complicated rearranging)
And we should know the significance of the regression coefficient:
$$t = \frac{b_{yx}}{S.E.}$$
If all these are satisfied, then we can say the model is fit.
For further questions: http://bijaylalpradhan.com.np
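For simple linear regression, the F-statistic can be computed from r² and n with the rearranged form F = r²(n − 2) / (1 − r²). The sketch below applies it to the figures quoted earlier for the BMI example (r² = 0.8751, n = 15); the function name is hypothetical.

```python
def f_statistic(r_squared, n):
    """F statistic for a simple linear regression, with 1 and n - 2 degrees of freedom."""
    return r_squared * (n - 2) / (1 - r_squared)


# r-squared and n as quoted for the BMI and birth-weight example.
f_value = f_statistic(0.8751, 15)
```

Here f_value is roughly 91.1. The regression is judged significant when this value exceeds the critical F value for 1 and n − 2 degrees of freedom at the chosen α.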