Basics of writing SPSS syntax files
Vince Gray
DLI Boot Camp
June 3, 2014
Session goals
Introduction to the basic parts of an SPSS syntax file to read in data
◦Not intended to show how to analyze data, but how to make them available for analysis
Tips and tricks for preparing a syntax file
Cleaning up blatant problems with the data
A short exercise in coding an SPSS syntax file
Why know how to do this?
Older files may not have syntax available: the documentation may exist on paper only
SPSS is not Statistics Canada's specialty: they don't do much work with it, and that can show in what you receive from them
Faculty members may wish to deposit old data with you
Sample of a print-only codebook
Household Income, Facilities and Equipment Micro Data File, 1971 Income: Survey of Consumer Finances, 1972 and Survey of Household Facilities and Equipment, 1972
SPSS foundational concepts
SPSS is generally case insensitive
◦Commands and labels are capitalized for display purposes only
◦On Unix computers, file specification is case sensitive (/data/file.txt <> /Data/File.txt)
◦Operations on string variables are case sensitive
SPSS commands end with a period
Recommendation: edit syntax files using a fixed pitch font (e.g., Courier New)
SPSS foundational concepts
Comments (text that isn't a command) can be used to explain what you're doing
◦May be placed at the start of a line, beginning with either the word comment or an asterisk and ending with a period
◦May be placed within a command or at the end of a line, enclosed within /* comment */
Variable labels /* this is a fatuous comment */
var001 1-4 …
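The comment styles described above can be sketched as follows. This is a minimal illustration only: the variable names and label text are hypothetical and are not taken from any real file.

```spss
* This comment starts with an asterisk and ends with a period.
comment This one starts with the word comment and ends the same way.
variable labels
 var001 "Age of respondent" /* in-line comment within a command */
 var002 "Sex of respondent".
```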
Basics of SPSS syntax file
Where is the file; what are its attributes?
What are the variables and what format?
What are the variable labels?
What values do you want to label?
Are any of the values missing (i.e., should they be ignored during analysis)?
Do you need to repair data?
Where & how to save files?
Where is the file and what are its attributes?
Usually done with data list
data list file='drive:\directory\filename.ext' format records=# table / variable list / line 2 variable list ….
Need to define a file handle first for very large files (record length 8192+)
file handle myhandle / name='drive:\directory\filename.ext' /recform=?? /lrecl=####.
data list file=myhandle format records=# table /variable list.
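As a hedged illustration of the file handle pattern, the sketch below fills in the placeholders with invented values. The path, the record length of 10000, and the variable names are assumptions made for the example, and recform is left out because its value depends on how the file was written.

```spss
* Hypothetical path and record length: substitute your own.
file handle bigfile
  /name='c:\data\survey.dat'
  /lrecl=10000.
* Fixed-format read, one record per case, with the format table listed in the output.
data list file=bigfile fixed records=1 table
  /uniqueid 1-8
   lastvar 9996-10000.
```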
What are the variables and what format?
Each variable being read in from the file must be described
Must be assigned a variable name: see Variable Names in SPSS help (Syntax)
◦Cannot be a reserved word
◦May be up to 64 characters long: no spaces
◦Start with A-Z, @, # (scratch variables), or $ (system variables)
◦May contain A-Z 0-9 _ . $ # @
Thoughts on long variable names
Users of older (perpetual) versions of SPSS may not be able to use them
Variable names may wrap across lines
For the lazy among us, it's more typing
Can use rename variables syntax to retain long variable names
Recommendation: use 8 characters or fewer for variable names
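One way to follow both recommendations is to read the file with short names and rename afterwards. In this sketch, frmincdp is a hypothetical 8-character name chosen for illustration; it is assumed to have been defined already in a data list command.

```spss
* Restore a descriptive long name after reading the data with a short one.
rename variables (frmincdp = farm_income_dependence).
```

Users of versions that cannot handle long names can simply skip the rename step.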
Defining variable format
Multiple ways to do it
◦Specify columns and type
Uniqueid 1-8
Recwght 9-15 (3)
Cityname 16-45 (A)
var001 46-50 var002 51-55 var003 56-60
income gvttrans othrincm 61-87 (2) …
◦Use Fortran encoding
Uniqueid (F8.0)
Recwght (F7.3)
Cityname (A30)
var001 to var003 (3F5.0)
income gvttrans othrincm (3F9.2) …
Defining variable format (cont'd)
Can combine various formats in a data list command
* Here we will declare variables in the file.
data list file=oldfile records=1 table /
uniqueid 1-5
province 6-7
urbnrurl 8
farmflag 9
hhldwght 10-12
numprsns nmadults nmchlt06 nmch0615 nmch1617 nmch1824 13-24
hhldcomp (F1.0)
farm_income_dependence 56
mjsrcinc 57
nmearner nmpsninc (2F2.0)
earnrmbd invstmnt govttran miscincm ttlincom 62-91.
Note indentation of 1 space on each variable: used to be required, now more stylistic
Defining variable format (cont'd)
Don't define variables as strings unless the data contain non-numeric characters
◦Can lose ordinal variable relationships
◦This may mean revising StatCan syntax files, which have been known to define non-interval variables as string, regardless of the coding actually used for the variable
String variables (cont'd)
◦In the worst case (and at your discretion, based on comfort level), this means recoding variables (e.g., Discharge Abstract Database)
◦If you convert to Stata, value labels won't convert, since Stata can't assign them to string variables
◦Recommendation: if the string requires a value label to be meaningful, convert it to a coded numeric value (therefore, leave place names, census tract numbers, etc. as strings)
E.g., if "1" stands for "Male", read it as a non-string
What are the variable labels?
The purpose of variable labels is to give more descriptive information than the variable name can provide
◦Sex
Probably safe to guess that it is a gender variable
But not necessarily: Have you had sex in the past month?
Recorded for whom? Respondent/spouse/1st-born?
If any doubt might exist, try to remove it!
Do not use arbitrary contractions – especially if loading into a searchable metadata service
Variable labels (cont'd)
Sample code with arbitrary contractions
VARIABLE LABELS
 YEAR "Refyr - 1998"
 PUCPID26 "Cross-sect random pers ID - 1998"
 PUCHID25 "Cross-sect random hhld ID - 1998"
 D31CF26 "Census family ID - 1998"
 ICSWT26 "Int cross-sect weight - 1998"
 ECYOB26 "Ext YOB (cross-sect) - 1998"
 ECAGE26 "Ext age refyr (cross-sect) - 1998"
 ECSEX99 "Ext sex refyr (cross-sect) - 1998"
 MARST26 "Marital status refyr - 1998"
 MJACT26 "Major activity - 1998"
 MJIEH26 "Major inc earner Hhld - 1998"
 MJINE26 "Major inc earner EF - 1998"
 RMJIG26 "Rel maj inc earner grp EF - 1998"
 MJICE26 "Major inc earner CF - 1998"
Variable labels (cont'd)
Make sure that the label includes the most important information. In the variables below, the key information was omitted by StatCan: does it what, would you what, described as what?
HAL_Q150 "Does a physical condition or mental condition or health prob"
HAL_Q160 "Does a physical condition or mental condition or health prob"
HAL_Q170 "Does a physical condition or mental condition or health prob"
HAL_Q210 "Do you regularly have trouble going to sleep or staying asle"
MSS_Q110 "Thinking about the amount of stress in your life, would you"
MSS_Q120 "What is your main source of stress?"
HS_Q110 "Presently, would you describe yourself as:"
Variable labels (cont'd)
Meaningful labels
HAL_Q150 "Reduction of amount/kind of activity at home"
HAL_Q160 "Reduction of amount/kind of activity at work or school"
HAL_Q170 "Reduction of amount/kind of activity in other activities (transport/leisure)"
HAL_Q210 "Regularly have trouble going to sleep or staying asleep"
MSS_Q110 "Self-assessed amount of stress in respondent's life"
MSS_Q120 "What is your main source of stress"
HS_Q110 "Self-assessed happiness"
Variable labels (cont'd)
If labels are repeated, explain why (the variable names may not be intuitive):
SUDDLAI 'Any drug use (incl 1 time cann)'
SUDDLAE 'Any drug use (excl 1 time cann)'
SUDDLID 'Any drug use (excl cann) - life (D)'
SUDDYAI 'Any drug use (incl 1 time cann)'
SUDDYAE 'Any drug use (excl 1 time cann)'
is less useful than
SUDDLAI "Ever used drugs (including 1 time cannabis, derived)"
SUDDLAE "Ever used drugs (excluding 1 time cannabis, derived)"
SUDDLID "Ever used drugs (excluding cannabis, derived)"
SUDDYAI "Used any drugs in past 12 months (including 1 time cannabis, derived)"
SUDDYAE "Used any drugs in past 12 months (excluding 1 time cannabis, derived)"
Variable label formatting
Recommend placing all labels in double quotes rather than single quotes
noanswr1 "Didn't answer: wasn't at home"
rather than
noanswr1 'Didn''t answer: wasn''t at home'
◦Either works, but single quotes lead to more mistakes, because every apostrophe inside the label must be doubled and one is easily missed
Variable labels can be up to 255 characters long, though not all of them may be displayed (some procedures show only 40 characters)
What values do you want to label?
Nominal and ordinal variables are generally meaningless without value labels
◦Gender: is 1 male and 0 female, or vice versa?
◦Does a scale variable run worse to better or better to worse? (The value alone doesn't necessarily suffice to tell you this.)
◦What does value 3 in Agegroup represent?
Continuous variables may have key values
◦E.g., income or age may be capped or floored
Missing values need to be declared
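A sketch of what labelling a grouped variable and a capped continuous variable might look like. The age bands and the income cap shown here are invented for illustration; the real codes must come from the codebook.

```spss
* Hypothetical codes and cap: take the real ones from the codebook.
value labels agegroup
  1 "Under 25 years"
  2 "25 to 44 years"
  3 "45 to 64 years"
  4 "65 years and over"
  /
  income
  99999 "Capped: 99,999 or more".
```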
Value label formats
Do not use arbitrary contractions: up to 120 characters can be displayed
Recommend placing all labels in double quotes rather than single quotes
6 "Don't know"
rather than
6 'Don''t know'
String values must be enclosed in quotes (e.g., "B" "Boston lettuce")
◦but you won't be using string variables if you need value labels to make sense, right?
Value label formats (cont'd)
A single label declaration can be used for any and all variables using that coding, or separate declarations can be made
value labels
SUDDYO SUDDYOA SUDDYOD SUDFINT SUDFLAU SUDFLCA SUDFLCM SUDFLSU SUDFLTU
SUDFYCM SUDGLOTH SUD_87 SUI_01 SUI_02 SUI_03 TWD_1 TWD_3 TWD_5
1 "YES"
2 "NO"
6 "NOT APPLICABLE"
7 "DON'T KNOW"
8 "REFUSAL"
9 "NOT STATED"
/
Each declaration is separated from the previous with a /
Value label formats (cont'd)
Can explicitly identify variables to which no values are assigned
If consecutive variables use the same coding, use "to"
value labels
uniqueid hhldwght
/
SUDDYO to SUDGLOTH SUD_87 SUI_01 SUI_02 SUI_03 TWD_1 TWD_3 TWD_5
1 "YES"
2 "NO"
6 "NOT APPLICABLE"
7 "DON'T KNOW"
8 "REFUSAL"
9 "NOT STATED"
.
Repeated value labels for any variable are ignored: the first one found is used, and a warning is issued in the syntax window
Missing values
Missing values get omitted from analysis: if you are looking for the average income of spouses, you don't include households that don't have spouses
Statistics Canada normally uses values ending in 6/7/8/9 as missings (i.e., not applicable, don't know, refusal, not asked), but often defines only the values ending in 9 as missing values in SPSS: this varies by Division
Missing values (cont'd)
Other values may be missing as well
mthrplbr fthrplbr
1 "Born in Canada"
2 "Born outside of Canada - North America/Europe"
3 "Born outside of Canada - Other country"
4 "Country uncodeable"
8 "Not stated"
9 "Don't know"
/
◦ The value 4 might be considered missing – I would code it as missing!
◦ Check the codebook carefully!
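If you agree that the uncodeable value belongs with the other missing codes, the declaration could look like the sketch below. Whether 4 really should be treated as missing is a judgment call for the person preparing the file, not something the codebook settles.

```spss
* Treat "Country uncodeable" (4) as missing, along with 8 and 9.
missing values mthrplbr fthrplbr (4, 8, 9).
* Check the result: the three codes should now appear under Missing.
frequencies variables=mthrplbr fthrplbr.
```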
Missing values format
SPSS allows up to three discrete values to be defined as missing, or a range (using thru, which includes all values within the range), or one discrete value and a range.
May explicitly declare that no values are missing for a variable.
Missing values
uniqueid () /* Can explicitly show no missings */
var001 to var028 (6,7,9)
var029 var031 (6 thru 9)
var030 (-1, 6 thru highest).
Missing values format (cont'd)
String and non-string missing values can't be declared in the same missing values statement.
Missing values
uniqueid ()
var001 to var028 (6,7,9)
var029 var031 (6 thru 9)
var030 (-1, 6 thru highest).
Missing values
stringv1
("ZZZZZZZ", "-1 ").
Missing values are dealt with immediately: be aware of the order of operations
Do you need to repair data?
Does each record have a unique record identifier (used to match variables from different files or subsets)?
◦If not, create one:
compute uniqueid=($casenum).
variable labels uniqueid "Unique record identifier".
* The formats command will specify how many columns are reserved for the field: by default, new variables are created as F8.2. No decimals are needed for this variable. Length (#) is based on the number of records in the file.
formats uniqueid (F#.0).
Repairing data (cont'd)
If numerically coded variables are defined as string, change that to be non-string.
Data list …                  >>>  Data list …
 uniqueid 1-8                      uniqueid 1-8
 gender 9 (A)                      gender 9
 …                                 …
Value labels                      Value labels
 gender                            gender
 "1" "Male"                        1 "Male"
 "2" "Female"                      2 "Female"
 "9" "Not ascertained"             9 "Not ascertained"
 .                                 .
Repairing data (cont'd)
If string variables require value labels to be meaningful, create non-string versions: this is case sensitive!
Value labels gradelvl
 "H" "Top 10% of the class" "M" "Middle 80% of the class"
 "L" "Bottom 10% of the class" " " "Rank in class not known".
Missing values gradelvl (" ").
* Create a non-string version of the variable.
If (gradelvl="H") newgrdlv=1.
If (gradelvl="M") newgrdlv=2.
If (gradelvl="L") newgrdlv=3.
If (missing(gradelvl)) newgrdlv=9.
* Set the format once the new variable exists.
Formats newgrdlv (F1.0).
Value labels newgrdlv
 1 "Top 10% of the class" 2 "Middle 80% of the class"
 3 "Bottom 10% of the class" 9 "Rank in class not known".
Missing values newgrdlv (9).
Variable labels newgrdlv "Reformatted gradelvl: class placement".
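The same conversion can also be sketched with a single recode command instead of a series of If statements. This is an alternative, not the presenter's method. Note one assumption about behaviour: else sends every value not listed, including blanks and any lower-case "h", "m" or "l", to 9, so check the frequencies of gradelvl first.

```spss
* Alternative sketch: build the numeric version in one command.
recode gradelvl ("H"=1) ("M"=2) ("L"=3) (else=9) into newgrdlv.
formats newgrdlv (F1.0).
missing values newgrdlv (9).
execute.
```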
Repairing data (cont'd)
Repairing coding flaws is the most difficult, and possibly the most important, thing you can do for your users: do it if you're comfortable!
Solution to coding problem
* Find records where there is no wife.
* According to documentation, should use (hdmarsta=1)
or (hdmarsta=8) or (hdmarsta=9) or (hdmarsta=10).
* Doing that results in 17,129 valid (non-missing) records.
* Defining 0 as missing for age gives 14,352 valid records.
* Since 0 is defined as a missing code for wfagegrp,
you cannot use "wfagegrp=0" as the condition.
do if (missing(wfagegrp)).
* Reset values from 0 to a specified missing code.
+ compute wfincome=999999.
+ compute wfwkswrk=-1.
end if.
Try not to change the format of the variable when adding a value: wfincome has 6 columns, with valid entries from -99999 to +99999, so 999999 fits the existing format but lies outside the valid range. For wfwkswrk, we could have used 99 as the missing code (the valid range is 0 to 52).
Solution to coding problem (cont'd)
Value labels are needed:
Value labels wfincome
999999 "Not applicable - no wife"
/
wfincsrc
1 "No income" 2 "Wages and salaries"
3 "Military pay and allowances"
4 "Net income from self-employment"
5 "Net income from roomers and boarders"
6 "Government transfer payments"
7 "Net income from investment"
8 "Retirement pensions, superannuation and annuities"
9 "Other money income" 0 "Not applicable - no wife"
/
wfagegrp
76 "Age 76 and over" 0 "Not applicable - no wife"
/ …
Solution to coding problem (cont'd)
Missing value declarations are needed, to make having done this worthwhile
Missing values
wfincsrc wfagegrp
(0)
wfincome
(999999)
…
The ripple effect of the change isn’t necessarily as simple as changing one piece of code: you have to track down the rest of the effects of the change and document them.
Where & how to save files?
Write: creates ASCII file (for preservation)
◦Doesn't actually do anything until the program encounters an executable command
write outfile='drive:\directory\filename.dat' table /all.
◦The table parameter tells SPSS to include the format used in writing the ASCII file in the log file; /all indicates to write out all variables on the file.
◦Does not preserve variable/value labels or missing declarations in ASCII file – you need syntax to read the file created by write into SPSS.
Where & how to save files?
Export: creates portable file
◦No longer widely used: used to transport between platforms or programs
export outfile='drive:\directory\filename.por' /keep=? /drop=? /map.
◦Keep and drop allow you to include or exclude variables by naming them; map lists variable names and labels
◦Preserves variable/value labels and missing declarations: can be read back into SPSS
◦Long variable names truncate to 8 characters
◦ Is an executable command (will force Write)
Where & how to save files?
Save: creates system file
◦This is the native format of SPSS: files will load into SPSS and keep all variable/value labels, missing declarations and long variable names
save outfile='drive:\directory\filename.sav' /keep=? /drop=? /map.
◦Keep and drop allow you to include or exclude variables by naming them; map lists variable names and labels
◦Is an executable command (will force Write)
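As a hedged example of keep and map in use, the sketch below saves a three-variable subset. The output path is hypothetical; the variable names are borrowed from the data list example earlier in the deck and are assumed to exist in the active file.

```spss
* Save only the identifier, province and weight; map lists what was written.
save outfile='c:\data\hife1972_subset.sav'
  /keep=uniqueid province hhldwght
  /map.
```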
Where & how to save files?
Syntax for saving data & metadata:
write outfile='j:\presentations\hife1972.dat' table /all.
save outfile='j:\presentations\hife1972.sav' /map.
display dictionary.
Display dictionary
◦Writes information about the system file into the output: variable names, formats, labels, missing declarations, etc.
Save your output file, at least as a .spv file, or better, by exporting to text (because you can 'always' read it: preservation purposes!)
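Saving and exporting the output can be done from the menus, but it can also be scripted. The sketch below assumes a reasonably recent SPSS release in which the output save and output export commands are available; the file paths are hypothetical.

```spss
* Keep a native copy of the Viewer output.
output save outfile='c:\data\hife1972_dictionary.spv'.
* Also export the visible output as plain text for preservation.
output export
  /contents export=visible
  /text documentfile='c:\data\hife1972_dictionary.txt'.
```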
Exercise
Create syntax to read the 4 variables on the next page into SPSS, including:
◦A data list command (c:\data\192_1972.dat)
◦Variable labels
◦Value labels
◦Missing declarations
◦Comments for any "fixups" that need to be done: reflect any fixups in value labels and missing declarations
◦Saving your work
Exercise page
Good, better and horrible news
Good news
◦You're done!
Better news
◦You may never have to do this: ask on the DLI list if other DLI reps have a syntax file that they can provide you if you can't locate one on the EFT site!
Horrible news
◦If a faculty member shows up with a file that he or she collected, no one else will have syntax: someone may have to do this!