Creating Dictionaries
What is a Dictionary?
CSPro data files are text files with no metadata, only data
A dictionary is needed to describe the contents of the data file
CSPro dictionaries:
  End with the extension .dcf
  Are text files that can be edited manually, though that is inadvisable
  Are not dependent on the existence of a data entry application
Every CSPro application needs a dictionary
Multiple CSPro applications can share the same dictionary
CSPro Data Files
CSPro data files are:
  Flat files (all data in a single file)
  Text files (all data is stored in ANSI format and is human readable)
Items in the data file have a fixed length
Records in the data file are stored one per line
Data files have no specific file extension
An index is created for the data file to allow for quick access to specific cases (file extension: .idx)
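The idea behind an index can be sketched in Python. This is only a conceptual illustration: the actual layout of a CSPro .idx file is not described here, and the function name `build_case_index`, the sample lines, and the column positions are assumptions chosen for the example.

```python
import io

def build_case_index(stream, id_start, id_length):
    """Map each case ID to the byte offset of the first line of that case.

    `stream` is a binary file object; `id_start` is the 1-based column where
    the ID begins and `id_length` is its width (both hypothetical here).
    """
    index = {}
    offset = 0
    for raw_line in stream:
        case_id = raw_line[id_start - 1:id_start - 1 + id_length].decode("ascii")
        index.setdefault(case_id, offset)  # keep only the first line of each case
        offset += len(raw_line)
    return index

# Hypothetical data: 1-character record type, 4-character year ID, then a name.
data = io.BytesIO(
    b"11996William Jefferson Clinton\n"
    b"21996Robert Joseph Dole\n"
    b"12000George Walker Bush\n"
)
index = build_case_index(data, id_start=2, id_length=4)

# Jump straight to the first record of the 2000 case without scanning the file.
data.seek(index["2000"])
first_line = data.readline().decode("ascii").rstrip("\n")
```

Because every record sits on its own line and the ID always starts in the same column, a single pass over the file is enough to build such a lookup table.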
Identification Items
CSPro needs a way to differentiate between different cases (questionnaires)
Identification (ID) items uniquely identify all cases
Two cases in a single data file cannot have the same ID, but cases across data files can share IDs
Identification Items (continued)
Generally a questionnaire has geocodes or some other system of attributes that uniquely identifies each unit of enumeration
For censuses, these IDs are almost always geocodes
  Example: Province – District – Division – Location – Sublocation – Enumeration Area – Household Number
For surveys, these ID sections are often more condensed
  Example: Cluster – Household Number
Identification Items (continued)
It is common for the “identification section” of a questionnaire to have questions that do not help uniquely identify a household
Examples include:
  Enumerator number
  Household type
  Urban/rural status
Some people prefer to make the ID section as small as possible, picking the fewest items needed to ensure that each case is unique
Other people take a more liberal approach to ID fields, but CSPro does have a limit to how long the ID field can be (length: 127)
ID Examples
ID: Year
Item on Record: Winner of U.S. presidential election
1996  William Jefferson Clinton
2000  George Walker Bush
2004  George Walker Bush
2008  Barack Hussein Obama II

ID: State, county
Item on Record: County name
0101  Autauga [Alabama]
5123  Weston [Wyoming]
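As a rough Python sketch of the second example, the ID can be split off each data line and checked for uniqueness. It assumes a two-digit state code followed by a two-digit county code with the county name stored directly after it; those widths and the function names are illustrative assumptions.

```python
def split_county_line(line):
    """Split a data line into (state, county, name), assuming a 2+2 digit ID."""
    state = line[0:2]
    county = line[2:4]
    name = line[4:].rstrip()
    return state, county, name

def find_duplicate_ids(lines):
    """Return every (state, county) ID that appears more than once."""
    seen = set()
    duplicates = []
    for line in lines:
        state, county, _ = split_county_line(line)
        case_id = (state, county)
        if case_id in seen:
            duplicates.append(case_id)
        seen.add(case_id)
    return duplicates

lines = ["0101Autauga", "5123Weston"]
parsed = [split_county_line(line) for line in lines]
duplicates = find_duplicate_ids(lines)  # empty: both IDs are unique
```

Neither the state code nor the county code is unique on its own; only the combination identifies a case.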
Dictionary Fundamentals
Identification Items: value(s) to uniquely identify a case
Levels: a group of one or several records
Records: a group of one or several items
Items: a value, or variable, that is numeric or alphanumeric
Subitems: part of an item
Value Sets: a listing of valid values for an item
Dictionary Fundamentals (with a typical survey example)
Identification Items: value(s) to uniquely identify a case
  Cluster number, household number
Levels: a group of one or several records
  Household questionnaire, female questionnaires
Records: a group of one or several items
  Housing characteristics, household roster, fertility questions
Items: a value, or variable, that is numeric or alphanumeric
  Water access, roof type, …, sex, age, …, children ever born
Subitems: part of an item
  Date of birth broken down into year, month, day
Value Sets: a listing of valid values for an item
  Sex: Male (1), Female (2)
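The hierarchy above can be modeled with a few Python dataclasses. This is a sketch for illustration only and does not reflect the .dcf file format; all class names, field names, item names, and lengths are assumptions, as is attaching the ID items to the level.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ValueSet:
    label: str
    values: List[Tuple[str, int]]  # (value label, code)

@dataclass
class Item:
    name: str
    label: str
    length: int
    numeric: bool = True
    subitems: List["Item"] = field(default_factory=list)
    value_sets: List[ValueSet] = field(default_factory=list)

@dataclass
class Record:
    name: str
    label: str
    items: List[Item] = field(default_factory=list)

@dataclass
class Level:
    name: str
    label: str
    id_items: List[Item] = field(default_factory=list)  # assumption: IDs live on the level
    records: List[Record] = field(default_factory=list)

# Hypothetical survey dictionary following the example above.
sex = Item("SEX", "Sex", 1,
           value_sets=[ValueSet("Sex", [("Male", 1), ("Female", 2)])])
birth_date = Item("DOB", "Date of birth", 8, subitems=[
    Item("DOB_YEAR", "Year of birth", 4),
    Item("DOB_MONTH", "Month of birth", 2),
    Item("DOB_DAY", "Day of birth", 2),
])
household = Level(
    "HOUSEHOLD", "Household questionnaire",
    id_items=[Item("CLUSTER", "Cluster number", 3),
              Item("HH_NUMBER", "Household number", 3)],
    records=[
        Record("HOUSING", "Housing characteristics",
               [Item("WATER", "Water access", 1), Item("ROOF", "Roof type", 1)]),
        Record("ROSTER", "Household roster",
               [sex, Item("AGE", "Age", 2), birth_date]),
    ],
)

# The subitems together cover exactly the length of their parent item.
subitem_total = sum(sub.length for sub in birth_date.subitems)
```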
Naming Dictionary Elements
Every element of a dictionary has two attributes, a name and a label
Name
  You use the name to refer to the element while programming logic
  Can be up to 32 characters but must start with a letter
  Each dictionary element must have a unique name, and there are some names that are reserved for CSPro keywords
Label
  A more thorough description of the element
  Can be up to 255 characters and can contain punctuation and spacing
  Often labels are the only documentation that anyone sees, so be sure to take care when creating labels
Naming Dictionary Elements (continued)
If you plan on writing a lot of programming logic, consider how long you make the names for elements
Three common approaches exist for naming elements when the questionnaire has each question numbered
  Approach 1: P10_RELATIONSHIP, P11_SEX, P12_AGE
  Approach 2: RELATIONSHIP, SEX, AGE
  Approach 3: P10, P11, P12
Remember that each element has a name and a label, and that they do not have to be (and probably should not be) the same value
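The naming rules stated above (name up to 32 characters starting with a letter, unique, not reserved; label up to 255 characters) can be checked with a short Python sketch. The allowed character set (letters, digits, underscores) and the sample reserved words are assumptions, not taken from the CSPro documentation.

```python
import re

MAX_NAME_LENGTH = 32
MAX_LABEL_LENGTH = 255
# Assumption: names use letters, digits, and underscores after the first letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def check_element(name, label, existing_names, reserved=frozenset()):
    """Return a list of problems found with a dictionary element's name and label."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name is longer than 32 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must start with a letter")
    if name in existing_names:
        problems.append("name is already used by another element")
    if name in reserved:
        problems.append("name is a reserved word")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append("label is longer than 255 characters")
    return problems

existing = {"P10_RELATIONSHIP", "P11_SEX"}
ok = check_element("P12_AGE", "Age in completed years", existing)
bad = check_element("12_AGE", "Age", existing)
duplicate = check_element("P11_SEX", "Sex", existing)
```

Running such a check on names drafted outside CSPro can catch problems before they are typed into the dictionary editor.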
Levels
Applications can have one or two levels
Most applications are and should be one-level applications, though some applications are better designed as two-level applications
Each level usually has its own questionnaire associated with it
The top level can only have one questionnaire, while multiple questionnaires can exist at lower levels
Different sections on a questionnaire translate to multiple records, not multiple levels
How many levels do these questionnaires need?
  Household questions, population questions, agriculture questions
  Population questions, women of reproductive age questions
Records
Records are groupings of items, and generally translate to sections of a questionnaire
Examples of records in a census might be: housing record, population records, death records, emigrant records, agriculture record
A record can be optional, e.g., death records
A record can occur more than once per questionnaire, e.g., population records
When deciding how many times a record can occur, select the maximum possible reasonable value
Record Type
When a dictionary has more than one kind of record, each record must have a type value
The type value differentiates one record in a data file from the other records
You can specify particular values for the record types, or allow CSPro to assign these values automatically
If your dictionary has many records, you may need to increase the length of the record type (default length: 1)
Record Type in the Data File
This data file has two records: winner of the presidential election (1) and loser of the presidential election (2)
The ID item is the year of the election
RT ID   RECORD ITEMS
1  1996 William Jefferson Clinton
2  1996 Robert Joseph Dole
1  2000 George Walker Bush
2  2000 Albert Arnold Gore, Jr.
2  2004 John Forbes Kerry
1  2004 George Walker Bush
Note that the order of the different records does not matter
Multiply-Occurring Records in the Data File
This data file has two records: winner of the presidential election (1, singly-occurring) and losers of the presidential election (2, multiply-occurring)
The ID item is the year of the election
RT ID   RECORD ITEMS
1  1996 William Jefferson Clinton
2  1996 Robert Joseph Dole
2  1996 Henry Ross Perot
1  2000 George Walker Bush
2  2000 Albert Arnold Gore, Jr.
2  2000 Ralph Nader
2  2000 Patrick Joseph Buchanan
Note that the order of the multiply-occurring records DOES matter
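A minimal Python sketch of how such a file could be read into cases, assuming each line holds a 1-character record type, a 4-character year ID, and then the name with no separators. The function name `group_cases` and the field widths are assumptions made for the example.

```python
def group_cases(lines):
    """Group data lines into cases keyed by the year ID.

    Record type 1 (winner) occurs once per case; record type 2 (loser)
    may repeat, and its occurrences are kept in file order.
    """
    cases = {}
    for line in lines:
        record_type = line[0:1]
        year = line[1:5]
        name = line[5:].rstrip()
        case = cases.setdefault(year, {"winner": None, "losers": []})
        if record_type == "1":
            case["winner"] = name
        elif record_type == "2":
            case["losers"].append(name)
        else:
            raise ValueError(f"unknown record type {record_type!r}")
    return cases

lines = [
    "11996William Jefferson Clinton",
    "21996Robert Joseph Dole",
    "21996Henry Ross Perot",
    "12000George Walker Bush",
    "22000Albert Arnold Gore, Jr.",
    "22000Ralph Nader",
    "22000Patrick Joseph Buchanan",
]
cases = group_cases(lines)
```

Swapping the winner and loser lines within a year would produce the same case, but swapping two loser lines would change which occurrence comes first.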
Items
Items (variables) describe the data for each question on a census or survey
Items have several properties:
  Length: How many characters are needed to faithfully store all possible values for this question?
  Data Type: Will this item contain only numeric values, or will it also store words or sentences?
  Item Type: Is this a subitem? (use selectively)
  Occurrences: Does this item repeat several times? (use selectively)
Items (continued)
Items have several properties:
  Decimal: Will this item hold a decimal fraction? If so, how many digits are necessary to the right of the decimal point?
  Decimal Character: If the numeric item holds a decimal fraction, should the item be saved to the data file with a decimal point? (This is a purely cosmetic indicator, though it does have bearing on the length of the item.)
  Zero Fill: Do you want the unused spaces to the left of a number padded with zeroes?
Item Representations
This is the number 3.14 stored using various item attributes (stored values are shown in quotes so that blank padding is visible)
Numeric, Length: 4, Decimal: 2, Decimal Character: Yes, Zero Fill: Yes    "3.14"
Numeric, Length: 6, Decimal: 2, Decimal Character: Yes, Zero Fill: Yes    "003.14"
Numeric, Length: 6, Decimal: 2, Decimal Character: No, Zero Fill: Yes     "000314"
Numeric, Length: 6, Decimal: 2, Decimal Character: Yes, Zero Fill: No     "  3.14"
Numeric, Length: 6, Decimal: 3, Decimal Character: Yes, Zero Fill: No     " 3.140"
Alphanumeric, Length: 6                                                   "3.14  "
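These representations can be reproduced with a small Python sketch. It assumes numeric items are right-justified and alphanumeric items are left-justified within their fixed length, and it handles non-negative values only; the function names are hypothetical.

```python
def store_numeric(value, length, decimals=0, decimal_char=True, zero_fill=False):
    """Return the fixed-width text for a non-negative numeric item."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    text = f"{value:.{decimals}f}"
    if not decimal_char:
        text = text.replace(".", "")  # the decimal point is implied, not stored
    if len(text) > length:
        raise ValueError(f"{text!r} does not fit in {length} characters")
    pad = "0" if zero_fill else " "
    return text.rjust(length, pad)

def store_alpha(value, length):
    """Return the fixed-width text for an alphanumeric item."""
    if len(value) > length:
        raise ValueError(f"{value!r} does not fit in {length} characters")
    return value.ljust(length)

stored = [
    store_numeric(3.14, 4, decimals=2, decimal_char=True, zero_fill=True),
    store_numeric(3.14, 6, decimals=2, decimal_char=True, zero_fill=True),
    store_numeric(3.14, 6, decimals=2, decimal_char=False, zero_fill=True),
    store_numeric(3.14, 6, decimals=2, decimal_char=True, zero_fill=False),
    store_numeric(3.14, 6, decimals=3, decimal_char=True, zero_fill=False),
    store_alpha("3.14", 6),
]
```

Dropping the decimal character frees one column, which is why it has a bearing on the length of the item.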
Subitems
People tend to overuse subitems, but they are useful in situations in which you intend to process data that makes up a small part of a larger number
Using logic you can access parts of items without having to make them subitems, but subitems can simplify processing, as well as satisfy value set checking while on a form
Example:
  Item: Social Security Number, Length 11, comprised of three subitems:
    Area Number, digits 1-3
    Group Number, digits 5-6
    Serial Number, digits 8-11
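The subitem layout in this example amounts to slicing the parent item at fixed positions, as the following Python sketch shows. The subitem names and the sample number are hypothetical; positions are 1-based and inclusive, as in the example.

```python
# (start, end) positions within the 11-character item, 1-based and inclusive.
SSN_SUBITEMS = {
    "AREA_NUMBER": (1, 3),
    "GROUP_NUMBER": (5, 6),
    "SERIAL_NUMBER": (8, 11),
}

def extract_subitems(item_value, layout=SSN_SUBITEMS):
    """Return each subitem's text by slicing the parent item's value."""
    return {
        name: item_value[start - 1:end]
        for name, (start, end) in layout.items()
    }

parts = extract_subitems("123-45-6789")
```

Positions 4 and 7 hold the hyphens, which belong to the item but to none of its subitems.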
Value Sets
Value sets are optional and tell CSPro what values are considered acceptable for an item
If no value set is present, CSPro will accept all values for the item (within limits; e.g., numeric fields cannot contain letters)
If an item has multiple value sets, CSPro will use the first one to check the validity of keyed data
Using logic the programmer can change what value set is active for an item, and can even generate a value set dynamically
Value sets can contain discrete values, and for numeric items, value sets can contain ranges
Value set ranges can overlap; this is common for tabulation applications
If many items share the same possible values, you can link the value sets so that modifying the value set of one item alters the value set for linked items
Value Set Examples
Sex:
Label     From  To
Male      1
Female    2

Age:
Label     From  To
Minor     0     17
Teenager  13    19
Adult     18    99
Retiree   67    99
The from/to values of each value set are what is stored in the keyed data file, not the value set labels
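A short Python sketch of how these two value sets could be used to check a keyed value and to look up its labels. Representing a discrete value as a range whose from and to are equal is an assumption of this sketch, as are the function names.

```python
# Each entry is (label, from, to); a discrete value has from == to.
SEX_VALUE_SET = [("Male", 1, 1), ("Female", 2, 2)]
AGE_VALUE_SET = [
    ("Minor", 0, 17),
    ("Teenager", 13, 19),
    ("Adult", 18, 99),
    ("Retiree", 67, 99),
]

def is_valid(value, value_set):
    """A value is acceptable if it falls inside at least one entry."""
    return any(low <= value <= high for _, low, high in value_set)

def labels_for(value, value_set):
    """Return every label whose range contains the value (ranges may overlap)."""
    return [label for label, low, high in value_set if low <= value <= high]

age_15 = labels_for(15, AGE_VALUE_SET)
age_70 = labels_for(70, AGE_VALUE_SET)
sex_3_ok = is_valid(3, SEX_VALUE_SET)
```

An age of 15 matches both Minor and Teenager, which is the kind of overlap that tabulation value sets often rely on.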
Special Values
CSPro has three “special values” that describe certain kinds of data
Not Applicable: the item is blank (e.g., date of menarche would not be asked of men)
Missing: the codebook had a value for missing (or not stated) and you assign this value to be missing
Default: the item has an invalid value (e.g., your program logic assigned a three-digit value to a two-digit field)
By default CSPro ensures that keyed data fits in the value set and is not blank, but if desired CSPro can accept blank data or out of range data
Documenting Dictionary Elements
To the left of every element in the dictionary editor is a small gray box under the column heading N
Clicking on this box brings up a field in which you can write notes about the dictionary element
These notes are stored in the dictionary file but are not visible during data entry
Consider making use of these notes, especially when working with partners on an application
Relative Positioning
By default, CSPro will automatically assign the starting position (column number) of each item in your dictionary
When creating a new dictionary, it is best to let CSPro generate these values
Inserting an item in between other items, or modifying the length of an item, will cause all the other items’ starting positions to automatically change
There will be no gaps in the data file
The default order in the data file will be: record type, ID items, record items in the order they appear on the screen
Absolute Positioning
However, if you are creating a dictionary to match an existing data file, it may be necessary to select absolute positioning
With absolute positioning, you must specify the starting position (column number) of each item in your dictionary
It is your responsibility to make sure that items do not overlap
Gaps can exist in a data file
Relative vs. Absolute Example
Relative:
11996William Jefferson Clinton
21996Robert Joseph Dole
Absolute (one of many possibilities):
William Jefferson Clinton 1996 1
Robert Joseph Dole        1996 2
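The difference between the two schemes can be sketched in Python. With relative positioning the start columns are derived from the item lengths; with absolute positioning they are given explicitly and must be checked for overlaps. The item names, the 25-character name length, and the absolute start columns are assumptions chosen to match the example.

```python
def relative_positions(items):
    """Assign 1-based start columns in order, leaving no gaps."""
    positions = {}
    start = 1
    for name, length in items:
        positions[name] = (start, length)
        start += length
    return positions

def read_record(line, positions):
    """Slice a data line into item values using (start, length) positions."""
    return {
        name: line[start - 1:start - 1 + length].strip()
        for name, (start, length) in positions.items()
    }

def find_overlaps(positions):
    """Return pairs of items, neighboring by start column, whose columns overlap."""
    spans = sorted((start, start + length - 1, name)
                   for name, (start, length) in positions.items())
    return [(a[2], b[2]) for a, b in zip(spans, spans[1:]) if b[0] <= a[1]]

relative = relative_positions([("RECORD_TYPE", 1), ("YEAR", 4), ("NAME", 25)])
# Hypothetical absolute layout with one-column gaps between items.
absolute = {"NAME": (1, 25), "YEAR": (27, 4), "RECORD_TYPE": (32, 1)}

from_relative = read_record("11996William Jefferson Clinton", relative)
from_absolute = read_record("Robert Joseph Dole".ljust(25) + " 1996 2", absolute)
overlaps = find_overlaps(absolute)  # empty: no two items share a column
```

Both layouts yield the same item values; only the column bookkeeping differs.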
Modifying the Dictionary
Before a data entry operation begins, feel free to modify the dictionary
CSPro will detect changes between the dictionary and forms, so if you rename or delete a dictionary item, the field on the form will also be renamed, or will be removed from the form
However, once some data exists using a dictionary format, modifying the dictionary must be done with great care
In all cases, make backups of your dictionary before any modifications so that you always have a dictionary to read data that was entered at any time of the data entry operation
Adding Fields to the Dictionary
If fields must be added to the dictionary after the data entry process has begun, one option is to simply add them to the end of a record
The data that already exists will have blanks for these new values, but it does not have to be reformatted and can be read by the new dictionary
However, if adding the fields to the end of a record is not practical, you can insert them in the record, but then all existing data must be reformatted to the new dictionary format
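The reason appending is safe can be sketched in Python: a line written under the old layout is simply shorter, so the new item reads as blank. The layout, the appended item `PARTY`, and the function name are hypothetical and only illustrate the idea.

```python
# New layout: the old items keep their columns and PARTY is appended at the end.
NEW_LAYOUT = {
    "RECORD_TYPE": (1, 1),
    "YEAR": (2, 4),
    "NAME": (6, 25),
    "PARTY": (31, 1),  # hypothetical item added after data entry began
}
NEW_RECORD_LENGTH = 31

def read_with_layout(line, layout, record_length):
    """Read a line of any age: short (old) lines are padded with blanks."""
    padded = line.rstrip("\n").ljust(record_length)
    return {
        name: padded[start - 1:start - 1 + length]
        for name, (start, length) in layout.items()
    }

old_line = "11996William Jefferson Clinton"   # keyed before PARTY existed
new_line = "12000George Walker Bush" + " " * 7 + "R"

old_case = read_with_layout(old_line, NEW_LAYOUT, NEW_RECORD_LENGTH)
new_case = read_with_layout(new_line, NEW_LAYOUT, NEW_RECORD_LENGTH)
```

Had `PARTY` been inserted before `NAME` instead, every old line would be read with the name shifted by one column, which is why inserted fields require reformatting.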
Modifying Item Lengths
If, after the data entry process has begun, the length of some items will be increased, you must reformat the existing data files
However, if the length of some items will be decreased, it may be possible to use absolute positioning to make your old data files readable
Likewise, deleting an item from the dictionary can be done in a way that does not require reformatting, but again absolute positioning must be used
Dictionary Macros
By right-clicking on the dictionary name in the tree you can access the undocumented dictionary macros
Names and labels of dictionary items, or value sets, can be copied to Excel format, modified in Excel, and then pasted back to CSPro
This can be particularly useful if you want coworkers who do not know how to use CSPro to help with the creation of the dictionary, perhaps by adding values to the codebook (value sets)