

Hutcheson, G. D. (2011). Tutorial: Data Coding, Management and Manipulation. Journal of Modelling in Management. 6, 1: 123–132.

http://www.emeraldinsight.com/loi/jm2 http://dx.doi.org/10.1108/jm2.2011.29706aab.001

Journal of Modelling in Management

Graeme D. Hutcheson

Data Coding, Management and Manipulation

This tutorial discusses a very common problem in quantitative methods – the appropriate coding and management of data. Despite these being core skills for any analyst, there appears to be surprisingly little agreement on how data should be coded, recorded or managed, and relatively few books and articles deal specifically with this topic (see Spector 2008, Hutcheson 2011a, 2011b, 2011c, Muenchen 2009, Horton and Kleinman 2011). Many researchers use coding conventions that they have developed themselves over the years and save data using a variety of different formats. This lack of consistency has made it difficult for researchers to control their data and share it easily with other researchers and software packages. It can even be argued that the use of certain formats for saving data has limited the analysis options of researchers, particularly the many techniques available as part of the R statistical programme (see the tutorial in the Journal of Modelling in Management, 2010, vol. 4, No. 3). This tutorial introduces a system of data coding and management that alleviates many of these difficulties by using standard coding practices that encourage coding accuracy, data transparency and the ability to share data between software packages and researchers. It starts by describing a basic system for categorising information based on measurement theory and then goes on to look at how this information may be coded, saved, transformed and managed using standardised methods.

Measurement Scales

There are many different types of information and ways in which this information can be categorised and represented as data. Although a number of different schemes have been proposed that utilise a variety of categories and sub-divisions (see, for example, Agresti and Finlay, 1997, Barford, 1985 and Sarle, 1995), in this tutorial we distinguish between only three scales of measurement – numeric, ordered categorical and unordered categorical – based on a simplified version of Stevens' 1946 classification of measurement scales¹, a scheme also adopted by Siegel and Castellan in their 1988 book on non-parametric data analysis. It is important to identify these three measurement scales as they are qualitatively different from each other and should be coded differently within the data set. These three scales also allow a wide range of statistical analyses to be applied and are, perhaps, the minimum required to provide a general introduction to statistical modelling. A brief description of each measurement scale is provided below.

¹ Stevens' categories of interval and ratio scales of measurement are combined into a single numeric category, as both of these scales allow the same statistical methods to be applied and are coded identically.

Unordered categorical scale

An unordered categorical scale of measurement is achieved when the data are recorded as categories which have no meaningful order. The only information provided is the category identifier. This scale is also known as a classificatory scale or a labelling system. Table 1 shows a number of examples of variables that are typically regarded as unordered categorical.

Table 1. Examples of unordered categorical variables

    Car Manufacturer   Gender   Treatment Group   Subject
1   Ford               male     A                 subject 1
2   Citroen            male     A                 subject 2
3   Volvo              female   B                 subject 3
4   Volvo              male     B                 subject 4
5   Renault            female   C                 subject 5

For the data in Table 1, we are able to compare frequency counts of specific categories and conclude, for example, that a Volvo is the most frequently used car and that there are more males than females in the sample. Analytical techniques for unordered categorical data compare frequencies of individual categories and require techniques that only take into account the actual category membership (for example, multinomial logit models).

Ordered categorical scale

An ordered categorical scale of measurement is achieved when the data are recorded as categories that can be arranged in order according to some criteria. The only information provided is the category identifier from which an order can be established (and explicitly indicated using an appropriate coding scheme, see below). Table 2 shows a number of examples of variables that are typically regarded as ordered categorical.

Table 2. Examples of ordered categorical variables

    Highest Qualification   Agreement Rating   Examination Grade   Mental Health Rating
1   A-level                 strongly agree     B                   no symptoms
2   O-level                 disagree           A                   impaired functioning
3   Masters                 neither            A                   no symptoms
4   Degree                  disagree           C                   mild symptoms
5   Degree                  agree              D                   moderate symptoms

Variable order:
Highest Qualification: no qualification, O-level, A-level, degree, masters, doctorate.
Agreement Rating: strongly agree, agree, neither, disagree, strongly disagree.
Examination Grade: A, B, C, D, E.
Mental Health: no symptoms, mild symptoms, moderate symptoms, impaired functioning.

The data in Table 2 can be arranged in order and we may conclude, for example, that subject 1 has a higher qualification than subject 2 (an `A-level' compared to an `O-level'), and that subject 2 has a more severe mental health rating than subject 4 (`impaired functioning' compared to `mild symptoms'). We can also disregard information about order and simply interpret frequency counts to draw conclusions about individual categories, such as that the most common highest educational achievement in the observed data is a degree, or that the most common mental health rating is `no symptoms'. Ordered categorical data may be analysed using unordered categorical techniques, but may also be analysed using techniques that utilise information about order (for example, the proportional-odds logit model).

Numeric scale

A numeric scale of measurement is achieved when the recorded data can be considered to have a direct relationship with the variable being measured. At a very basic level, the data provide information about the actual magnitude (size, quantity, distance, etc.) of the attribute. In other words, the numbers representing the variable have substantive meaning. Table 3 shows a number of examples of variables that are typically regarded as numeric.

Table 3. Examples of numeric variables

    Temperature Difference   Daily Gas Consumption   Examination Result   Hourly Rate of Pay
1   10.3 °C                  69 m³                   62 %                 £5.80
2    9.8 °C                  74 m³                   73 %                 £7.98
3   11.4 °C                  82 m³                   58 %                 £15.65
4   11.5 °C                  81 m³                   37 %                 £5.80
5   17.8 °C                  67 m³                   62 %                 £5.80

The data in Table 3 represent actual quantities which can be added and subtracted to draw conclusions such as that subject 2 is paid £2.18 more per hour than subject 1 (7.98 − 5.80), that the total amount of gas used in the sample period was 373 m³ (69 + 74 + 82 + 81 + 67), and that the difference between the highest and lowest scores on the examination was 36% (73 − 37). It also makes sense to describe the data with respect to order and conclude, for example, that some subjects are paid less than others and that three students scored more than 60% in the examination. It also makes sense to disregard information about quantity and order entirely and simply interpret frequency counts to draw conclusions about individual observations such as `the most frequent wage is £5.80' (the current minimum wage legally payable to workers in England and Wales) and `the most common examination score is 62%' (even though this statement is unlikely to be of much practical use for describing this particular variable). Numeric data may be analysed using techniques that apply to categorical data (provided that this makes theoretical sense), but may also be analysed using techniques that make use of information about the magnitude of differences (for example, ordinary least-squares regression, t-tests, ANOVA and ANCOVA).
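These three measurement scales map naturally onto data types in R, the package used elsewhere in this tutorial. The following sketch, using invented values and hypothetical variable names, shows one way each scale might be represented:

```r
## Unordered categorical: a factor with no meaningful order.
manufacturer <- factor(c("Ford", "Citroen", "Volvo", "Volvo", "Renault"))

## Ordered categorical: an ordered factor, with the order stated explicitly.
qualification <- factor(c("A-level", "O-level", "Masters", "Degree", "Degree"),
                        levels  = c("O-level", "A-level", "Degree", "Masters"),
                        ordered = TRUE)

## Numeric: plain numbers, with no units stored in the data.
pay <- c(5.80, 7.98, 15.65, 5.80, 5.80)

## Order comparisons are only meaningful for the ordered factor:
qualification[1] > qualification[2]   # TRUE: an A-level ranks above an O-level
```

Coding the scales this way means the software itself records which analyses are appropriate for each variable.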

Coding Data

The three scales of measurement identified above allow different statistics to be applied, and it is very important to represent this information accurately in the coded data. A method of coding is described below that represents information in an obvious and unambiguous way. The major rationale behind the proposed coding is that data should...

• be coded clearly and unambiguously.

• accurately represent the measurement scales.

• not incorporate any extra `spurious' information in the code (for example, representing categorical data using numeric information).

• be able to code information using standard ASCII characters without the use of any `hidden' codes or labels.

• be of a form that can be easily imported into different software packages (the coded data should be transportable).

There are many coding `conventions' that may be used to represent information, and analysts tend to develop their own conventions and practices over time. This can cause confusion, particularly for inexperienced users and those using unfamiliar data sets or coming back to old data sets after some time. The present author has found it useful to standardise, at least to some extent, the coding of information so that it can be recorded efficiently. The following are a few general conventions that can be applied to coding data and that will standardise and simplify data sets. An example data set using these conventions is shown in Figure 1.

Conventions for coding data

• Variable names should be included in the first row.

• There should be NO EMPTY ROWS OR COLUMNS in the data file.

• Avoid spaces, commas, underscores, quotation marks or mathematical signs and other 'strange' characters whenever possible. For example, £ $ % ^ & * ? / \ | " ! ~ # + - _: ;

• Avoid using highlights, colours, lines or anything else in the data files.

• No formulas.

• Code the measurement scales appropriately:

• unordered: only use text category labels

• ordered: use text category labels preceded with a numbered order

• numeric: only use numbers (no units of measurement)

• Indicate all missing data using 'NA'.

• Code multiple categories of missing data when needed using a separate variable.
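As an illustration of these conventions, a small hypothetical .csv fragment might look as follows (the variable names and values are invented): unordered categories are plain text labels, ordered categories carry a numbered prefix that makes their order explicit, numeric variables are bare numbers without units, and missing data are coded NA:

```
Subject,Nationality,AgreeQ01,Age
subject1,English,1.str.agree,23
subject2,Welsh,4.disagree,NA
subject3,Scottish,3.neither,31
```

The numbered prefix on the ordered categories means any package that sorts the labels alphabetically will recover the intended order automatically.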

Insert Figure 1 about here

Saving Data

There are many software programs that can be used to generate and save data; for example, R and Rcmdr (see the tutorial from the last issue of this Journal), SPSS, Excel, Gnumeric, Stata, S-Plus, Dbase and SAS. Although these all have their own `specialised formats' for saving data, they are all able to load data saved in certain text-based formats. For true data transparency and data portability, it is worth using a text-based method for saving data. When saving data as a text file you will need to select an appropriate character to distinguish between variables (the `delimiter'). After a lot of experimentation I have settled on using comma-delimited data sets, as these have the advantage of an explicit delimiter character (tabs, which are also popular, are represented as a hidden character in data files and can be difficult to deal with) and also have a dedicated file extension (.csv) so that data files are obvious. Many packages use comma-separated-value files as a default data type, and these are also easy to read and save in spreadsheet packages and text editors. In order to save files in .csv format, consult your software manual.
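In R, for example, a comma-delimited file can be written and read back as follows (a sketch; the file name is hypothetical, and the calls mirror those used later in this tutorial):

```r
## Save a data frame as a comma-delimited .csv file.
write.table(TempData, "ExampleData.csv", sep=",",
            col.names=TRUE, row.names=FALSE, quote=FALSE, na="NA")

## Load it again, treating "NA" as the missing-data indicator.
TempData <- read.table("ExampleData.csv", header=TRUE, sep=",",
                       na.strings="NA", dec=".", strip.white=TRUE)
```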

Data Meta-files

It is generally good practice to have a description of the data that explains coding methods, data types and other information about the data set. This information is best contained in a data meta-file which is stored along with the actual data set. Although it is possible to include the meta-information in the actual data file, for simplicity I include this information in a separate text file stored along with the data using a .desc name tag. An example data meta-file is shown in Figure 2.

Insert Figure 2 about here

Data Management

Data are routinely manipulated and changed by, for example, recoding variables, combining and renaming categories, transforming observations, changing reference categories, dummy-coding variables and changing measurement scales (e.g., recoding ordered categorical data as numeric so that a factor analysis can be computed). Recoded data can be saved to the existing data set or into `new' data sets, both of which are problematic for a number of reasons...

• It is confusing when multiple variables represent a single attribute.

• It is confusing when multiple data sets represent an individual area of research.

• If data are amended or corrected, all data files and recoded copies of the data need amending along with the `original' variable.

• Amended data sets are rarely accompanied by accurate meta-files.

• It is more difficult to maintain data security for multiple data sets.

Data proliferation need not be problematic, however, so long as a suitable system of data management is used. One such system is to use a `master' data set that contains the most accurate and complete representation of the information available. This master data set is the only one that is saved to disk and is the data file that is accessed at the start of all analysis sessions. If changes are made to the data, they only need to be recorded for one data set.

Manipulating data in temporary data files:

A master data file will not be appropriate for all analyses and the variables it contains will often need to be changed (for example, recoded, renamed or transformed). In order to keep the integrity of the master data file, any changes to the data should be made on a temporary basis and not saved to the master data file unless absolutely necessary. It is good practice to run analyses on a temporary data set that has been derived from the master and then delete these data when the analyses have been completed. All data transformations and analyses can be saved to text command files and re-run later. Using this system the master data file is preserved and any analyses can be recreated easily if these data are amended.

As most recodes, transformations and renaming can be carried out as and when they are required, there is rarely any need to permanently save the recoded variable. All that needs to be saved are the commands required to transform the data. Most statistical software packages allow data transformations and recodes to be done easily and saved as syntax files. For example, Figure 3 shows how a comma-separated data set can be loaded into Rcmdr and a variable recoded (for this example, English, Welsh and Scottish are recoded as British for the Nationality variable). This can be achieved using the pull-down menus in Rcmdr (see www.Rcmdr.com) and saved as a command file that can be re-run later.

Insert Figure 3 about here

Figure 3 shows an example analysis where the master data set (ExampleData.csv) has been loaded into R and given the name TempData. For the planned analysis, the Nationality variable needs to be recoded, with the categories English, Scottish and Welsh recoded into a single category labelled British. This was completed in Rcmdr using the recode data menu and simply indicating which categories were to be changed. The resulting code is shown in the script window and can easily be copied and used again whenever the same recoding exercise is required. The command syntax is shown below:

    TempData <- read.table("/home/JM2/ExampleData.csv", header=TRUE, sep=",",
        na.strings="NA", dec=".", strip.white=TRUE)

    TempData$Nationality.recode <- recode(TempData$Nationality,
        '"English"="British"; "Welsh"="British"; "Scottish"="British"',
        as.factor.result=TRUE)

There is no need to save the recoded data as it can be recreated each time it is needed by the script. The script file is also useful as it shows exactly how the Nationality variable has been recoded. If more data are added to the master data file or errors corrected, this will be reflected in the recoded variable each time the script is run. Using scripts to recode the data enables a proper system of data management to be applied and helps protect the integrity and quality of the main data set.

SPSS data files

A number of researchers in Management use SPSS to analyse data and also save data files using the SPSS data format (.sav). This can cause difficulties, as SPSS requires a numeric code for its data, even if the variables themselves are not numeric. For example, the unordered categorical variable `Gender' is coded in SPSS using numbers, with an optional label often assigned to help with identification, for example, male=1 and female=2 (the numeric coding is often hidden as SPSS allows users to look at the labels in the data sets rather than the number codes). This causes two main issues, firstly with converting SPSS data files into .csv files and secondly with importing .csv files into SPSS. Solutions to these potential problems are dealt with below:

Transforming SPSS files into .csv files:

This is relatively easy, as SPSS files can be imported directly into Rcmdr (see the tutorial from the last issue of this Journal) and then saved as .csv files. In order to do this it is advisable to import the value labels as factors as this will import the names of the categorical variables (if they have been labelled) as categorical data rather than as numbers. It will also be necessary to check that missing data are appropriately identified as the importation of missing data codes can be difficult given the non-standard methods of indicating missing data in SPSS.

Importing .csv files into SPSS:

Importing .csv files can be problematic as accurately coded categorical data in .csv files will be imported as 'string' variables by SPSS. A solution is to first load the file in Rcmdr and then change the categorical variables into numeric variables using a simple script (this is really mis-coding the data, but is done here to allow SPSS to read the data files). For example, a script to load the example data into R, change the categorical variables to numeric and then save the resulting file as an SPSS-readable data set is shown below:

TempData <- read.table("/home/JM2/ExampleData.csv", header=TRUE, sep=",",
    na.strings="NA", dec=".", strip.white=TRUE)

TempData$Subject       <- as.numeric(TempData$Subject)
TempData$Nationality   <- as.numeric(TempData$Nationality)
TempData$EconStatus    <- as.numeric(TempData$EconStatus)
TempData$AgreeQ08      <- as.numeric(TempData$AgreeQ08)
TempData$ManGrade      <- as.numeric(TempData$ManGrade)
TempData$ManGrade.miss <- as.numeric(TempData$ManGrade.miss)

write.table(TempData, "/home/JM2/ExampleDataSPSS.csv", sep=",",
    col.names=TRUE, row.names=FALSE, quote=FALSE, na="NA")

These commands simply load the master data set (ExampleData.csv) and assign it to the name TempData. The categorical variables in TempData are then changed to numeric (using the `as.numeric' command). Once all the variables have been changed, the data set can be saved using the `write.table' function (in this example, the data are saved as ExampleDataSPSS.csv). The ExampleDataSPSS.csv data file is identical to the original data set except that the variables are now designated by numeric codes. The data in this file are shown in Figure 4.

Insert Figure 4 about here

This .csv file can now be loaded into SPSS (using SPSS's import data function). You will, however, have to appropriately identify 'NA' as the missing data indicator and add any labels if needed.

You should note that importing data from other formats rarely runs smoothly, and you must check your data and be prepared to edit it so that it fully conforms with the coding conventions outlined above (e.g., coding missing data, removing spaces, and providing ordered data with the proper order). Although this can be a lot of work, it is usually well worth the effort given the advantages of using a structured data-management system and correctly-coded data.

Conclusion

Although issues of data coding and data management are important for all researchers, there is surprisingly little standardisation. This tutorial has outlined a method for data identification, coding, storage and management that can alleviate many of the problems inherent with dealing with data and sharing it between software programmes and researchers.

One important principle is that ONLY information that is provided by the variable should be included in the coded data. If the attribute does not have a number naturally associated with it, the coded data should not have a number either. In particular, categorical variables should not be coded using numbers. Incorrectly coded data is, however, ubiquitous in the management field and does lead to confusion and errors in analysis; for example, the use of OLS regression and ANOVA to model ordered and unordered categorical data. Correctly coded data will not allow these analyses to be used.

It is also important to keep control of data and appropriately manage data sets in a way that is transparent and enables data to be amended safely. The use of master and temporary data sets and on-the-fly recoding for analytical purposes makes this process much easier. A considered data management system also helps to ensure data security.

Researchers often use the data format offered by their statistical analysis package to save data. These formats, however, often use hidden codes, non-standard characters and numbers to represent data, and it is not always simple to share data between packages. In practice, the use of such specialised formats ties researchers and collaborators into using a single analysis package. The use of considered data coding and a standard format for saving data will help give analysts the freedom to use more than just the analytical techniques offered by their own software.

In conclusion, data coding, saving and management are very important skills for all analysts and researchers and deserve to be given a high priority in any research.

References and further reading:

Agresti, A., and Finlay, B. (1997) Statistical methods for the social sciences, third edition, Prentice-Hall.

Barford, N. (1985) Experimental measurements. Precision, error and truth, second edition, John Wiley & Sons, Ltd.

Horton, N. J., and Kleinman, K. (2011) Using R for data management, statistical analysis, and graphics, CRC Press.

Hutcheson, G. D. (2011a) Coding conventions for representing information. In The Sage Dictionary of Quantitative Management Research, Eds: Moutinho, L., and Hutcheson, G. D., Sage publications, pp. 45–49.

Hutcheson, G. D. (2011b) Data set structure. In The Sage Dictionary of Quantitative Management Research, Eds: Moutinho, L., and Hutcheson, G. D., Sage publications, pp. 74–77.

Hutcheson, G. D. (2011c) Measurement scales. In The Sage Dictionary of Quantitative Management Research, Eds: Moutinho, L., and Hutcheson, G. D., Sage publications, pp. 184–187.

Muenchen, R. A. (2009) R for SAS and SPSS users, Springer.

Sarle, W. S. (1995) Measurement theory. Frequently asked questions, fourth edition, ACG Press.

Siegel, S., and Castellan, Jr., N. J. (1988) Nonparametric statistics for the behavioural sciences, second edition, McGraw-Hill.

Spector, P. (2008) Data manipulation with R, Springer.

Call for papers:

The Journal of Modelling in Management invites the submission of articles and examples that illustrate methodological and practical issues associated with data collection, recording, analysis, graphics and presentation. Articles of no more than 5000 words can be submitted via the journal website at http://www.emeraldinsight.com/jm2.htm.

Graeme Hutcheson.

Manchester University

    Data File Name:  MasterData.csv

              Date:  17.Nov.2011

         Variables:  Subject: unordered – individual subjects
                     Age: numeric, age to nearest year
                     Nationality: 8 unordered categories: English, German, Irish,
                         Japanese, Korean, Russian, Scottish, Welsh.
                     EconStatus: 5 ordered categories taken from UN SES data.
                     FactorSocial: numeric factor scores from a factor analysis on
                         social responsibility (see Gordon and Nisbitt, 2008)
                     AgreeQ08: 5 ordered categories ranging from str.agree to
                         str.disagree (see Gordon and Nisbitt, 2006; questionnaire)
                     ManGrade: 3 ordered categories showing management category:
                         junior, middle, upper.
                     ManGrade.miss: 4 categories distinguishing missing data from
                         ManGrade: answered, spoiled, not-applicable, not-answered.

     Number of obs:  250 cases

       Information:  These data are hypothetical and form an example data set for
                     demonstrating data coding.

Figure 2: An example meta-file

Figure 3: Recoding data using Rcmdr.

   Subject  Age  Nationality  EconStatus  FactorSocial  AgreeQ08  ManGrade  ManGrade.miss
         1   23            6           2            NA         1         1              1
         2   24            1           4        -1.231         3        NA              3
         3   31            9           1         0.821         2         1              1
         4   NA           NA          NA            NA         5         3              1
         5   43            3           2         0.076         2         2              1
         6   41            2           2            NA         4         2              1
         7   19            2           3         2.652         2         3              1
         8   38            5           3         1.611         3         2              1
         9   59            8          NA        -0.812        NA        NA              2
        10   24            7           2         3.769        NA        NA              2
        11   39            3           4         0.023         5         1              1
        12   22            3           3        -0.182         2        NA              3
        13   64            4           5         0.034         4         3              1

Figure 4: A .csv file transformed into numeric codes enabling it to be imported into SPSS.