
Basic Concepts of Quantitative Research
Dr. R. Ouyang

Results 

Data types and preparation for analysis

Different kinds of data represent different scales of measurement. There are four types of measurement scales, that is, four types of data we usually deal with: nominal, ordinal, interval, and ratio. It is important to know which type of scale or data you collect for the research and which statistics are appropriate for your data analysis.

Four scales of measurement (4 types of data) 

Nominal (categories): A nominal scale represents the lowest level of measurement. Such a scale classifies persons or objects into two or more categories. In other words, nominal data are based on classification and categorization. When a nominal scale is used, the data simply indicate how many subjects are in each category. Categories 4 and 1 differ only as labels, not in the numbers 4 and 1 themselves; 4 is not higher than, or more than, 1. Examples: IQ categories, types of school.

Ordinal (ranks): An ordinal scale puts the subjects in order from the highest to the lowest, from the most to the least. Although ordinal scales indicate that some subjects are higher, or better, than others, they do not indicate how much higher or better. Suppose subjects A, B, C, and D are measured as 4'5", 5'1", 6'2", and 5'6" in height. The rank order will be 1 for C, 2 for D, 3 for B, and 4 for A.
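A minimal sketch in Python of this ranking (heights converted to inches so they can be compared):

```python
# Subjects and heights (feet/inches) from the example above.
heights_in = {"A": 4 * 12 + 5, "B": 5 * 12 + 1, "C": 6 * 12 + 2, "D": 5 * 12 + 6}

# Rank 1 = tallest, as in the text.
ordered = sorted(heights_in, key=heights_in.get, reverse=True)
ranks = {subject: rank for rank, subject in enumerate(ordered, start=1)}
print(ranks)  # {'C': 1, 'D': 2, 'B': 3, 'A': 4}
```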

Interval (scores): An interval scale has all the characteristics of nominal and ordinal scales and, in addition, is based upon predetermined equal intervals. Most of the tests used in educational research, such as achievement tests, aptitude tests, and intelligence tests, represent interval scales. Interval scales, however, do not have a true zero point. Such scales typically have an arbitrary maximum score and an arbitrary minimum score, or zero point. If an IQ test produces scores ranging from 0 to 200, a score of 0 does not indicate the absence of intelligence, nor does a score of 200 indicate


possession of the ultimate intelligence. A score of 0 only indicates the lowest level of performance possible on that particular test and a score of 200 represents the highest level. We can say that an achievement test score of 90 is 45 points higher than a score of 45, but we cannot say that a person scoring 90 knows twice as much as a person scoring 45. Similarly, a person with a measured IQ of 140 is not necessarily twice as smart or twice as intelligent as a person with a measured IQ of 70.

Ratio: A ratio scale represents the highest, most precise, level of measurement. A ratio scale has all the advantages of the other types of scales and, in addition, has a meaningful, true zero point. Height, weight, time, distance, and speed are examples.

Process of coding data

Scoring procedure: All instruments administered should be scored accurately and consistently; each subject's test should be scored using the same procedures and criteria.

For a self-developed test, if items other than objective-type items (such as multiple-choice questions) are to be scored, it is advisable to have at least one other person score the tests as a reliability check.

For a standardized test, it is better to make sure all answer sheets are marked correctly and scored properly by the machine.

Coding data: Coding data consists of developing a system by which the data and identification information are specified and organized in preparation for the analysis.

If a large number of subjects are involved, coding of the data is especially important. Data for all variables and subjects are usually converted to numerical values when the data are entered into the database management program, since long entries take considerable space and contribute to typographical and spelling errors that complicate subsequent manipulations.

Steps of coding data: 1) give each subject an ID number; 2) decide how nonnumerical or categorical data will be coded; 3) prepare all data for analysis.
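A minimal sketch of these three steps using pandas; the variables, categories, and numerical codes are illustrative assumptions:

```python
import pandas as pd

# Step 1: give each subject an ID number.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "school_type": ["public", "private", "public", "public"],
    "score": [78, 85, 91, 66],
})
df.insert(0, "subject_id", list(range(1, len(df) + 1)))

# Step 2: decide how nonnumerical (categorical) data will be coded.
df["gender_code"] = df["gender"].map({"male": 1, "female": 2})
df["school_code"] = df["school_type"].map({"public": 1, "private": 2})

# Step 3: keep the numerical columns, ready for analysis.
analysis_ready = df[["subject_id", "gender_code", "school_code", "score"]]
print(analysis_ready)
```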


Statistical packages (SPSS, SAS, JMP-IN) include programs for many statistics, from the most basic to the most sophisticated, frequently used in research studies.

Types of data


There are four types of data that may be gathered in social research, each one adding more to the one before. Thus ordinal data are also nominal, and so on.

Nominal, Ordinal, Interval, Ratio (each level builds on the previous)

Nominal

The name 'Nominal' comes from the Latin nomen, meaning 'name', and nominal data are items which are differentiated by a simple naming system.

The only thing a nominal scale does is to say that the items being measured have something in common, although this may not be described.

Nominal items may have numbers assigned to them. This may appear ordinal but is not -- these numbers are used only to simplify capture and referencing.

Nominal items are usually categorical, in that they belong to a definable category, such as 'employees'.

Example

The number pinned on a sports person.

A set of countries.

Ordinal

Items on an ordinal scale are set into some kind of order by their position on the scale. This may indicate, for example, temporal position or superiority.


The order of items is often defined by assigning numbers to them to show their relative position. Letters or other sequential symbols may also be used as appropriate.

Ordinal items are usually categorical, in that they belong to a definable category, such as '1956 marathon runners'.

You cannot do arithmetic with ordinal numbers -- they show sequence only.

Example

The first, third and fifth person in a race.

Pay bands in an organization, as denoted by A, B, C and D.

Interval

Interval data (also sometimes called integer data) are measured along a scale in which each position is equidistant from the next. This allows the distances between pairs of values to be compared in a meaningful way.

This is often used in psychological experiments that measure attributes along an arbitrary scale between two extremes.

Interval data cannot be multiplied or divided.

Example

My level of happiness, rated from 1 to 10.

Temperature, in degrees Fahrenheit.

Ratio

In a ratio scale, numbers can be compared as multiples of one another. Thus one person can be twice as tall as another person. Importantly, the number zero also has meaning.

Thus the difference between a person aged 35 and a person aged 38 is the same as the difference between people aged 12 and 15. A person can also have an age of zero.

Ratio data can be multiplied and divided because not only is the difference between 1 and 2 the same as between 3 and 4, but 4 is also twice as much as 2.

Interval and ratio data measure quantities and hence are quantitative. Because they can be measured on a scale, they are also called scale data.

Example

A person's weight

The number of pizzas I can eat before fainting

Parametric vs. Non-parametric

Interval and ratio data are parametric, and are used with parametric tools in which distributions are predictable (and often normal).

Nominal and ordinal data are non-parametric, and do not assume any particular distribution. They are used with non-parametric tools such as the histogram.

Continuous and Discrete

Continuous measures are measured along a continuous scale which can be divided into fractions, such as temperature. Continuous variables allow for infinitely fine sub-division, which means that if you can measure sufficiently accurately, you can compare two items and determine the difference.


Discrete variables are measured across a set of fixed values, such as age in years (not microseconds). These are commonly used on arbitrary scales, such as scoring your level of happiness, although such scales can also be continuous.

See also

Variables in research 

What are Variables?

Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.

Correlational vs. Experimental Research

Most empirical research belongs clearly to one of these two general categories. In correlational research, we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables. For example, a researcher might artificially increase blood pressure and then record cholesterol level. Data analysis in experimental research also comes down to calculating "correlations" between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that "A influences B." Data from correlational research can only be "interpreted" in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.

Dependent vs. Independent Variables

Independent variables are those that are manipulated, whereas dependent variables are only measured or registered. This distinction appears terminologically confusing to many because, as some students say, "all variables depend on something." However, once you get used to this distinction, it becomes indispensable. The terms dependent and independent variable apply mostly to experimental research, where some variables are manipulated, and in this sense they are "independent" from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on the manipulation or experimental conditions. That is to say, they depend on "what the subject will do" in response. Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to "experimental groups" based on some pre-existing properties of the subjects. For example, if in an experiment, males are compared to females regarding their white cell count (WCC), Gender could be called the independent variable and WCC the dependent variable.


Measurement Scales

Variables differ in how well they can be measured, i.e., in how much measurable information their measurement scale can provide. There is obviously some measurement error involved in every measurement, which determines the amount of information that we can obtain. Another factor that determines the amount of information that can be provided by a variable is its type of measurement scale. Specifically, variables are classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.

1. Nominal variables allow for only qualitative classification. That is, they can be measured only in terms of whether the individual items belong to some distinctively different categories, but we cannot quantify or even rank order those categories. For example, all we can say is that two individuals are different in terms of variable A (e.g., they are of different race), but we cannot say which one "has more" of the quality represented by the variable. Typical examples of nominal variables are gender, race, color, city, etc.

2. Ordinal variables allow us to rank order the items we measure in terms of which has less and which has more of the quality represented by the variable, but still they do not allow us to say "how much more." A typical example of an ordinal variable is the socioeconomic status of families. For example, we know that upper-middle is higher than middle but we cannot say that it is, for example, 18% higher. Also, this very distinction between nominal, ordinal, and interval scales itself represents a good example of an ordinal variable. For example, we can say that nominal measurement provides less information than ordinal measurement, but we cannot say "how much less" or how this difference compares to the difference between ordinal and interval scales.

3. Interval variables allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of differences between them. For example, temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.

4. Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, thus, they allow for statements such as x is two times more than y. Typical examples of ratio scales are measures of time or space. For example, as the Kelvin temperature scale is a ratio scale, not only can we say that a temperature of 200 degrees is higher than one of 100 degrees, we can correctly state that it is twice as high. Interval scales do not have the ratio property. Most statistical data analysis procedures do not distinguish between the interval and ratio properties of the measurement scales.

Relations between Variables

Regardless of their type, two or more variables are related if, in a sample of observations, the values of those variables are distributed in a consistent manner. In other words, variables are related if their values systematically correspond to each other for these observations. For example, Gender and WCC would be considered to be related if most males had high WCC and most females low WCC, or vice versa; Height is related to Weight because, typically, tall individuals are heavier than short ones; IQ is related to the Number of Errors in a test if people with higher IQs make fewer errors.


Why Relations between Variables are Important

Generally speaking, the ultimate goal of every research or scientific analysis is to find relations between variables. The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities; either way involves relations between variables. Thus, the advancement of science must always involve finding new relations between variables. Correlational research involves measuring such relations in the most straightforward manner. However, experimental research is not any different in this respect. For example, the above mentioned experiment comparing WCC in males and females can be described as looking for a correlation between two variables: Gender and WCC. Statistics does nothing else but help us evaluate relations between variables. Actually, all of the hundreds of procedures that are described in this online textbook can be interpreted in terms of evaluating various kinds of inter-variable relations.

Two Basic Features of Every Relation between Variables

The two most elementary formal properties of every relation between variables are the relation's (a) magnitude (or "size") and (b) its reliability (or "truthfulness").

1. Magnitude (or "size"). The magnitude is much easier to understand and measure than the reliability. For example, if every male in our sample was found to have a higher WCC than any female in the sample, we could say that the magnitude of the relation between the two variables (Gender and WCC) is very high in our sample. In other words, we could predict one based on the other (at least among the members of our sample).

2. Reliability (or "truthfulness"). The reliability of a relation is a much less intuitive concept, but still extremely important. It pertains to the "representativeness" of the result found in our specific sample for the entire population. In other words, it says how probable it is that a similar relation would be found if the experiment was replicated with other samples drawn from the same population. Remember that we are almost never "ultimately" interested only in what is going on in our sample; we are interested in the sample only to the extent it can provide information about the population. If our study meets some specific criteria (to be mentioned later), then the reliability of a relation between variables observed in our sample can be quantitatively estimated and represented using a standard measure (technically called the p-value or statistical significance level; see the next paragraph).

What is "Statistical Significance" (p-value)?

The statistical significance of a result is the probability that the observed relationship (e.g., between variables) or a difference (e.g., between means) in a sample occurred by pure chance ("luck of the draw"), and that in the population from which the sample was drawn, no such relationship or differences exist. Using less technical terms, we could say that the statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population").

More technically, the value of the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation


between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, a p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments such as ours one after another, we could expect that approximately in every 20 replications of the experiment there would be one in which the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design. See also, Power Analysis.) In many areas of research, the p-value of .05 is customarily treated as a "border-line acceptable" error level.

How to Determine that a Result is "Really" Significant

There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really "significant." That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori or only found post hoc in the course of many analyses and comparisons performed on the data set, on the total amount of consistent supportive evidence in the entire data set, and on "traditions" existing in the particular area of research. Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically significant, but remember that this level of significance still involves a pretty high probability of error (5%). Results that are significant at the p ≤ .01 level are commonly considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called "highly" significant. But remember that these classifications represent nothing else but arbitrary conventions that are only informally based on general research experience.

Statistical Significance and the Number of Analyses Performed

Needless to say, the more analyses you perform on a data set, the more results will meet "by chance" the conventional significance level. For example, if you calculate correlations between ten variables (i.e., 45 different correlation coefficients), then you should expect to find by chance that about two (i.e., one in every 20) correlation coefficients are significant at the p ≤ .05 level, even if the values of the variables were totally random and those variables do not correlate in the population. Some statistical methods that involve many comparisons and, thus, a good chance for such errors include some "correction" or adjustment for the total number of comparisons. However, many statistical methods (especially simple exploratory data analyses) do not offer any straightforward remedies to this problem. Therefore, it is up to the researcher to carefully evaluate the reliability of unexpected findings. Many examples in this online textbook offer specific advice on how to do this; relevant information can also be found in most research methods textbooks.
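This arithmetic is easy to verify by simulation. The sketch below (invented random data, using numpy and scipy) tests all 45 pairwise correlations among ten unrelated variables and counts how many pass p ≤ .05 purely by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_vars = 100, 10
data = rng.normal(size=(n_subjects, n_vars))  # ten uncorrelated variables

# Test all 45 pairwise correlations at the p <= .05 level.
false_positives = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        false_positives += p <= 0.05

print(false_positives)  # on average about 45 * 0.05 = 2.25 "significant" results
```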


Strength vs. Reliability of a Relation between Variables

We said before that strength and reliability are two different features of relationships between variables. However, they are not totally independent. In general, in a sample of a particular size, the larger the magnitude of the relation between variables, the more reliable the relation (see the next paragraph).

Why Stronger Relations between Variables are More Significant

Assuming that there is no relation between the respective variables in the population, the most likely outcome would also be finding no relation between these variables in the research sample. Thus, the stronger the relation found in the sample, the less likely it is that there is no corresponding relation in the population. As you see, the magnitude and significance of a relation appear to be closely related, and we could calculate the significance from the magnitude and vice-versa; however, this is true only if the sample size is kept constant, because a relation of a given strength could be either highly significant or not significant at all, depending on the sample size (see the next paragraph).

Why Significance of a Relation between Variables Depends on the Size of the Sample

If there are very few observations, then there are also respectively few possible combinations of the values of the variables and, thus, the probability of obtaining by chance a combination of those values indicative of a strong relation is relatively high.

Consider the following illustration. If we are interested in two variables (Gender: male/female and WCC: high/low), and there are only four subjects in our sample (two males and two females), then the probability that we will find, purely by chance, a 100% relation between the two variables can be as high as one-eighth. Specifically, there is a one-in-eight chance that both males will have a high WCC and both females a low WCC, or vice versa.

Now consider the probability of obtaining such a perfect match by chance if our sample consisted of 100 subjects; the probability of obtaining such an outcome by chance would be practically zero.

Let's look at a more general example. Imagine a theoretical population in which the average value of WCC in males and females is exactly the same. Needless to say, if we start replicating a simple experiment by drawing pairs of samples (of males and females) of a particular size from this population and calculating the difference between the average WCC in each pair of samples, most of the experiments will yield results close to 0. However, from time to time, a pair of samples will be drawn where the difference between males and females will be quite different from 0. How often will it happen? The smaller the sample size in each experiment, the more likely it is that we will obtain such erroneous results, which in this case would be results indicative of the existence of a relation between Gender and WCC obtained from a population in which such a relation does not exist.
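A minimal simulation sketch of this idea; the WCC population (mean 100, standard deviation 10, identical for both sexes) is an invented assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_chance_difference(n_per_group, n_replications=10_000):
    """Draw male and female samples from the SAME population and report
    the largest male-female difference in mean WCC that arises by chance."""
    males = rng.normal(100, 10, size=(n_replications, n_per_group))
    females = rng.normal(100, 10, size=(n_replications, n_per_group))
    diffs = males.mean(axis=1) - females.mean(axis=1)
    return np.abs(diffs).max()

for n in (2, 10, 100):
    print(n, round(max_chance_difference(n), 2))
# The chance differences shrink markedly as the sample size grows.
```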


Example: Baby Boys to Baby Girls Ratio

Consider this example from research on statistical reasoning (Nisbett, et al., 1987). There are two hospitals: in the first one, 120 babies are born every day; in the other, only 12. On average, the ratio of baby boys to baby girls born every day in each hospital is 50/50. However, one day, in one of those hospitals, twice as many baby girls were born as baby boys. In which hospital was it more likely to happen? The answer is obvious for a statistician, but as research shows, not so obvious for a lay person: it is much more likely to happen in the small hospital. The reason for this is that, technically speaking, the probability of a random deviation of a particular size (from the population mean) decreases with the increase in the sample size.

Why Small Relations Can be Proven Significant Only in Large Samples

The examples in the previous paragraphs indicate that if a relationship between the variables in question is "objectively" (i.e., in the population) small, then there is no way to identify such a relation in a study unless the research sample is correspondingly large. Even if our sample is in fact "perfectly representative," the effect will not be statistically significant if the sample is small. Analogously, if a relation in question is "objectively" very large, then it can be found to be highly significant even in a study based on a very small sample.

Consider this additional illustration. If a coin is slightly asymmetrical and, when tossed, is somewhat more likely to produce heads than tails (e.g., 60% vs. 40%), then ten tosses would not be sufficient to convince anyone that the coin is asymmetrical, even if the outcome obtained (six heads and four tails) was perfectly representative of the bias of the coin. However, is it so that 10 tosses cannot prove anything? No; if the effect in question were large enough, then ten tosses could be quite enough. For instance, imagine now that the coin is so asymmetrical that no matter how you toss it, the outcome will be heads. If you tossed such a coin ten times and each toss produced heads, most people would consider it sufficient evidence that something is wrong with the coin. In other words, it would be considered convincing evidence that in the theoretical population of an infinite number of tosses of this coin, there would be more heads than tails. Thus, if a relation is large, then it can be found to be significant even in a small sample.
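The coin arithmetic can be checked against the binomial distribution; a minimal sketch using scipy:

```python
from scipy.stats import binom

# Six heads in ten tosses is weak evidence: even a fair coin does at
# least this well about 38% of the time.
print(1 - binom.cdf(5, n=10, p=0.5))  # ~0.377

# Ten heads in ten tosses is strong evidence: a fair coin does this
# only about once in 1024 runs.
print(binom.pmf(10, n=10, p=0.5))     # ~0.00098
```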

Can "No Relation" be a Significant Result?

The smaller the relation between variables, the larger the sample size that is necessary to prove it significant. For example, imagine how many tosses would be necessary to prove that a coin is asymmetrical if its bias were only .000001%! Thus, the necessary minimum sample size increases as the magnitude of the effect to be demonstrated decreases. When the magnitude of the effect approaches 0, the necessary sample size to conclusively prove it approaches infinity. That is to say, if there is almost no relation between two variables, then the sample size must be almost equal to the population size, which is assumed to be infinitely large. Statistical significance represents the probability that a similar outcome would be obtained if we tested the entire population. Thus, everything that would be found after testing the entire population would be, by definition, significant at the highest possible level, and this also includes all "no relation" results.


How to Measure the Magnitude (Strength) of Relations between Variables

Very many measures of the magnitude of relationships between variables have been developed by statisticians; the choice of a specific measure in given circumstances depends on the number of variables involved, the measurement scales used, the nature of the relations, etc. Almost all of them, however, follow one general principle: they attempt to evaluate the observed relation by comparing it to the "maximum imaginable relation" between those specific variables.

Technically speaking, a common way to perform such evaluations is to look at how differentiated the values of the variables are, and then calculate what part of this "overall available differentiation" is accounted for by instances when that differentiation is "common" to the two (or more) variables in question. Speaking less technically, we compare "what is common in those variables" to "what potentially could have been common if the variables were perfectly related."

Let's consider a simple illustration. Let's say that in our sample, the average index of WCC is 100 in males and 102 in females. Thus, we could say that, on average, the deviation of each individual score from the grand mean (101) contains a component due to the gender of the subject; the size of this component is 1. That value, in a sense, represents some measure of relation between Gender and WCC. However, this value is a very poor measure, because it does not tell us how relatively large this component is given the "overall differentiation" of WCC scores. Consider two extreme possibilities:

1. If all WCC scores of males were exactly equal to 100 and those of females equal to 102, then all deviations from the grand mean in our sample would be entirely accounted for by gender. We would say that in our sample, Gender is perfectly correlated with WCC, that is, 100% of the observed differences between subjects regarding their WCC is accounted for by their gender.

2. If WCC scores were in the range of 0-1000, the same difference (of 2) between the average WCC of males and females found in the study would account for such a small part of the overall differentiation of scores that most likely it would be considered negligible. For example, one more subject taken into account could change, or even reverse, the direction of the difference. Therefore, every good measure of relations between variables must take into account the overall differentiation of individual scores in the sample and evaluate the relation in terms of (relatively) how much of this differentiation is accounted for by the relation in question.

Common "General Format" of Most Statistical Tests

Because the ultimate goal of most statistical tests is to evaluate relations between variables, most statistical tests follow the general format that was explained in the previous paragraph. Technically speaking, they represent a ratio of some measure of the differentiation common in the variables in question to the overall differentiation of those variables. For example, they represent a ratio of the part of the overall differentiation of the WCC scores that can be accounted for by gender to the overall differentiation of the WCC scores. This ratio is usually called a ratio of explained variation to total variation. In statistics, the term explained variation does not necessarily imply that we "conceptually understand" it. It is used only to denote the common variation in the variables in question, that is, the


part of variation in one variable that is "explained" by the specific values of the other variable, and vice versa.

How the "Level of Statistical Significance" is Calculated

Let's assume that we have already calculated a measure of a relation between two variables (as explained above). The next question is "how significant is this relation?" For example, is 40% of the explained variance between the two variables enough to consider the relation significant? The answer is "it depends."

Specifically, the significance depends mostly on the sample size. As explained before, in very large samples, even very small relations between variables will be significant, whereas in very small samples even very large relations cannot be considered reliable (significant). Thus, in order to determine the level of statistical significance, we need a function that represents the relationship between the "magnitude" and the "significance" of relations between two variables, depending on the sample size. The function we need would tell us exactly "how likely it is to obtain a relation of a given magnitude (or larger) from a sample of a given size, assuming that there is no such relation between those variables in the population." In other words, that function would give us the significance (p) level, and it would tell us the probability of error involved in rejecting the idea that the relation in question does not exist in the population. This "alternative" hypothesis (that there is no relation in the population) is usually called the null hypothesis. It would be ideal if the probability function was linear and, for example, only had different slopes for different sample sizes. Unfortunately, the function is more complex and is not always exactly the same; however, in most cases we know its shape and can use it to determine the significance levels for our findings in samples of a particular size. Most of these functions are related to a general type of function, which is called normal.

Why the "Normal Distribution" is Important

The "normal distribution" is important because in most cases it well approximates the function that was introduced in the previous paragraph (for a detailed illustration, see Are All Test Statistics Normally Distributed?). The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the normal distribution represents one of the empirically verified elementary "truths about the general nature of reality," and its status can be compared to that of the fundamental laws of natural sciences. The exact shape of the normal distribution (the characteristic "bell curve") is defined by a function that has only two parameters: mean and standard deviation.

A characteristic property of the normal distribution is that 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and a range of ±2 standard deviations includes 95% of the scores. In other words, in a normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) If you have access to STATISTICA, you can explore the exact values of probability associated with different values in the normal distribution using the interactive Probability Calculator tool; for example, if you enter the Z value (i.e., standardized value) of 4, the associated probability computed by STATISTICA will be less than .0001, because in the normal distribution almost all observations (i.e.,


more than 99.99%) fall within the range of ±4 standard deviations.
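The same probabilities can be computed in any statistical package; a minimal sketch using scipy's normal distribution functions in place of the Probability Calculator mentioned above:

```python
from scipy.stats import norm

# Fraction of observations within +/- k standard deviations of the mean.
for k in (1, 2, 4):
    print(k, norm.cdf(k) - norm.cdf(-k))  # ~0.683, ~0.954, ~0.99994

# Two-tailed probability of a standardized value at least as extreme as Z = 4.
print(2 * norm.sf(4))  # < .0001, matching the text
```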

Illustration of How the Normal Distribution is Used in Statistical Reasoning (Induction)

Recall the example discussed above, where pairs of samples of males and females were drawn from a population in which the average value of WCC in males and females was exactly the same. Although the most likely outcome of such experiments (one pair of samples per experiment) was that the difference between the average WCC in males and females in each pair is close to zero, from time to time a pair of samples will be drawn where the difference between males and females is quite different from 0. How often does it happen? If the sample size is large enough, the results of such replications are "normally distributed" (this important principle is explained and illustrated in the next paragraph) and, thus, knowing the shape of the normal curve, we can precisely calculate the probability of obtaining "by chance" outcomes representing various levels of deviation from the hypothetical population mean of 0. If such a calculated probability is so low that it meets the previously accepted criterion of statistical significance, then we have only one choice: conclude that our result gives a better approximation of what is going on in the population than the "null hypothesis" (remember that the null hypothesis was considered only for "technical reasons" as a benchmark against which our empirical result was evaluated). Note that this entire reasoning is based on the assumption that the shape of the distribution of those "replications" (technically, the "sampling distribution") is normal. This assumption is discussed in the next paragraph.

Are All Test Statistics Normally Distributed?

Not all, but most of them are either based on the normal distribution directly or on distributions that are related to and can be derived from the normal, such as t, F, or Chi-square. Typically, these tests require that the variables analyzed are themselves normally distributed in the population, that is, they meet the so-called "normality assumption." Many observed variables actually are normally distributed, which is another reason why the normal distribution represents a "general feature" of empirical reality. The problem may occur when we try to use a normal distribution-based test to analyze data from variables that are themselves not normally distributed (see tests of normality in Nonparametrics or ANOVA/MANOVA). In such cases, we have two general choices. First, we can use some alternative "nonparametric" test (or so-called "distribution-free test"; see Nonparametrics); but this is often inconvenient because such tests are typically less powerful and less flexible in terms of the types of conclusions that they can provide. Alternatively, in many cases we can still use the normal distribution-based test if we only make sure that the size of our samples is large enough. The latter option is based on an extremely important principle that is largely responsible for the popularity of tests that are based on the normal function. Namely, as the sample size increases, the shape of the


sampling distribution (i.e., the distribution of a statistic from the sample; this term was first used by Fisher, 1928a) approaches normal shape, even if the distribution of the variable in question is not normal. This principle can be illustrated with a series of sampling distributions (created with gradually increasing sample sizes of 2, 5, 10, 15, and 30) using a variable that is clearly non-normal in the population, that is, the distribution of its values is clearly skewed. As the sample size (of the samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal; for n=30, the shape of that distribution is "almost" perfectly normal. This principle is called the central limit theorem (this term was first used by Pólya, 1920; German, "Zentraler Grenzwertsatz").
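A minimal simulation sketch of the central limit theorem, using an invented, clearly skewed (exponential) population; the skewness of the sampling distribution of the mean shrinks toward 0, the normal value, as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=1.0, size=1_000_000)  # clearly skewed

for n in (2, 5, 10, 15, 30):
    # Sampling distribution of the mean for samples of size n.
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    m, s = sample_means.mean(), sample_means.std()
    skewness = ((sample_means - m) ** 3).mean() / s**3
    print(n, round(skewness, 2))  # approaches 0 (normal) as n increases
```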

How Do We Know the Consequences of Violating the Normality Assumption?

Although many of the statements made in the preceding paragraphs can be proven mathematically, some of them do not have theoretical proof and can be demonstrated only empirically, via so-called Monte-Carlo experiments. In these experiments, large numbers of samples are generated by a computer following predesigned specifications, and the results from such samples are analyzed using a variety of tests. This way we can empirically evaluate the type and magnitude of errors or biases to which we are exposed when certain theoretical assumptions of the tests we are using are not met by our data. Specifically, Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research.


Chapter 3: Levels Of Measurement And Scaling

Chapter Objectives
Structure Of The Chapter
Levels of measurement
Nominal scales
Measurement scales
Comparative scales
Noncomparative scales
Chapter Summary
Key Terms
Review Questions
Chapter References

A common feature of marketing research is the attempt to have respondents communicate their feelings, attitudes, opinions, and evaluations in some measurable form. To this end, marketing researchers have developed a range of scales. Each of these has unique properties. What is important for the marketing analyst to realise is that they have widely differing measurement properties. Some scales are, at best, limited in their mathematical properties to the extent that they can only establish an association between variables. Other scales have more extensive mathematical properties, and some hold out the possibility of establishing cause and effect relationships between variables.

Chapter Objectives

This chapter will give the reader:

An understanding of the four levels at which measurements can be taken by researchers

The ability to distinguish between comparative and non-comparative measurement scales, and

A basic tool-kit of scales that can be used for the purposes of marketing research.

Structure Of The Chapter

All measurements must take one of four forms, and these are described in the opening section of the chapter. After the properties of the four categories of scale have been explained, various forms of comparative and non-comparative scales are illustrated. Some of these scales are numeric, others are semantic, and yet others take a graphical form. The marketing researcher who is familiar with the complete tool kit of scaling measurements is better equipped to understand markets.

Levels of measurement


Most texts on marketing research explain the four levels of measurement: nominal, ordinal, interval and ratio, and so the treatment given to them here will be brief. However, it is an important topic, since the type of scale used in taking measurements directly impinges on the statistical techniques which can legitimately be used in the analysis.

Nominal scales

This, the crudest of measurement scales, classifies individuals, companies, products, brands or other entities into categories where no order is implied. Indeed, it is often referred to as a categorical scale. It is a system of classification and does not place the entity along a continuum. It involves a simple count of the frequency of the cases assigned to the various categories, and, if desired, numbers can be nominally assigned to label each category, as in the example below:

Figure 3.1 An example of a nominal scale 

Which of the following food items do you tend to buy at least once per month? (Please tick)

Okra Palm Oil Milled Rice

Peppers Prawns Pasteurised milk

The numbers have no arithmetic properties and act only as labels. The only measure of average which can be used is the mode, because this is simply a set of frequency counts. Hypothesis tests can be carried out on data collected in the nominal form. The most likely would be the Chi-square test. However, it should be noted that the Chi-square is a test to determine whether two or more variables are associated and the strength of that relationship. It can tell nothing about the form of that relationship, where one exists, i.e. it is not capable of establishing cause and effect.
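A minimal sketch of such a Chi-square test of association, using scipy and invented counts:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: whether a food item was bought (yes/no), by region.
observed = [[30, 20],   # region 1: bought / not bought
            [45, 55]]   # region 2: bought / not bought
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)
# A small p suggests the variables are associated, but, as noted above,
# the test says nothing about cause and effect.
```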

Ordinal scales 

Ordinal scales involve the ranking of individuals, attitudes or items along the continuum of the characteristic being scaled. For example, if a researcher asked farmers to rank 5 brands of pesticide in order of preference, he/she might obtain responses like those in figure 3.2 below.

Figure 3.2 An example of an ordinal scale used to determine farmers' preferences among 5 brands of pesticide.

Order of preference  Brand
1  Rambo
2  R.I.P.
3  Killalot
4  D.O.A.
5  Bugdeath

From such a table the researcher knows the order of preference but nothing about how much more one brand is preferred to another, that is, there is no information about the


interval between any two brands. All of the information a nominal scale would have given is available from an ordinal scale. In addition, positional statistics such as the median, quartile and percentile can be determined.

It is possible to test for order correlation with ranked data. The two main methods are Spearman's Ranked Correlation Coefficient and Kendall's Coefficient of Concordance. Using either procedure, one can, for example, ascertain the degree to which two or more survey respondents agree in their ranking of a set of items. Consider again the ranking of pesticides example in figure 3.2. The researcher might wish to measure similarities and differences in the rankings of pesticide brands according to whether the respondents' farm enterprises were classified as "arable" or "mixed" (a combination of crops and livestock). The resultant coefficient takes a value in the range 0 to 1. A zero would mean that there was no agreement between the two groups, and 1 would indicate total agreement. It is more likely that an answer somewhere between these two extremes would be found.
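A minimal sketch of the Spearman calculation, using scipy and invented rankings for the two groups of farmers. (Note that Spearman's coefficient itself ranges from -1 to 1; the 0-to-1 range described above applies to Kendall's Coefficient of Concordance.)

```python
from scipy.stats import spearmanr

# Hypothetical rankings of the 5 pesticide brands (1 = most preferred),
# in the order: Rambo, R.I.P., Killalot, D.O.A., Bugdeath.
arable_ranks = [1, 2, 3, 4, 5]
mixed_ranks = [2, 1, 3, 5, 4]

rho, p = spearmanr(arable_ranks, mixed_ranks)
print(rho)  # 0.8: the two groups largely agree
```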

The only other permissible hypothesis testing procedures are the runs test and the sign test. The runs test (also known as the Wald-Wolfowitz test) is used to determine whether a sequence of binomial data - meaning it can take only one of two possible values, e.g. African/non-African, yes/no, male/female - is random or contains systematic 'runs' of one or other value. Sign tests are employed when the objective is to determine whether there is a significant difference between matched pairs of data. The sign test tells the analyst whether the number of positive differences in ranking is approximately equal to the number of negative rankings, in which case the distribution of rankings is random, i.e. apparent differences are not significant. The test takes into account only the direction of differences and ignores their magnitude, and hence it is compatible with ordinal data.

Interval scales 

It is only with interval scaled data that researchers can justify the use of the arithmetic mean as the measure of average. The interval or cardinal scale has equal units of measurement, thus making it possible to interpret not only the order of scale scores but also the distance between them. However, it must be recognised that the zero point on an interval scale is arbitrary and is not a true zero. This of course has implications for the type of data manipulation and analysis we can carry out on data collected in this form. It is possible to add or subtract a constant to all of the scale values without affecting the form of the scale, but one cannot multiply or divide the values. It can be said that two respondents with scale positions 1 and 2 are as far apart as two respondents with scale positions 4 and 5, but not that a person with score 10 feels twice as strongly as one with score 5. Temperature is interval scaled, being measured either in Centigrade or Fahrenheit. We cannot speak of 50°F being twice as hot as 25°F, since the corresponding temperatures on the centigrade scale, 10°C and -3.9°C, are not in the ratio 2:1.
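A minimal sketch checking the 2:1 claim:

```python
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

print(50 / 25)                  # 2.0 on the Fahrenheit scale
print(f_to_c(50) / f_to_c(25))  # about -2.57: the ratio is not preserved
```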

Interval scales may be either numeric or semantic. Study the examples below in figure 3.3.

Figure 3.3 Examples of interval scales in numeric and semantic formats 

Please indicate your views on Balkan Olives by scoring them on a scale of 5 down to 1 (i.e. 5 = Excellent; 1 = Poor) on each of the criteria listed


Balkan Olives are: Circle the appropriate score on each line

Succulence 5 4 3 2 1

Fresh tasting 5 4 3 2 1

Free of skin blemish 5 4 3 2 1

Good value 5 4 3 2 1

Attractively packaged 5 4 3 2 1

(a)

Please indicate your views on Balkan Olives by ticking the appropriate responses below:

Excellent Very Good Good Fair Poor

Succulent

Freshness

Freedom from skin blemish

Value for money

Attractiveness of packaging

(b) 

Most of the common statistical methods of analysis require only interval scales in order that they might be used. These are not recounted here because they are so common and can be found in virtually all basic texts on statistics.

Ratio scales 

The highest level of measurement is a ratio scale. This has the properties of an interval scale together with a fixed origin or zero point. Examples of variables which are ratio scaled include weights, lengths and times. Ratio scales permit the researcher to compare both differences in scores and the relative magnitude of scores. For instance, the difference between 5 and 10 minutes is the same as that between 10 and 15 minutes, and 10 minutes is twice as long as 5 minutes.

Given that sociological and management research seldom aspires beyond the interval level of measurement, it is not proposed that particular attention be given to this level of analysis. Suffice it to say that virtually all statistical operations can be performed on ratio scales.

Measurement scales

The various types of scales used in marketing research fall into two broad categories: comparative and non-comparative. In comparative scaling, the respondent is asked to compare one brand or product against another. With noncomparative scaling, respondents need only evaluate a single product or brand. Their evaluation is independent of the other products and/or brands which the marketing researcher is studying.

Noncomparative scaling is frequently referred to as monadic scaling, and this is the more widely used type of scale in commercial marketing research studies.

Comparative scales


Paired comparison: It is sometimes the case that marketing researchers wish to find out which are the most important factors in determining the demand for a product. Conversely, they may wish to know which are the most important factors acting to prevent the widespread adoption of a product. Take, for example, the very poor farmer response to the first design of an animal-drawn mould board plough. A combination of exploratory research and shrewd observation suggested that the following factors played a role in the shaping of the attitudes of those farmers who feel negatively towards the design:

Does not ridge

Does not work for inter-cropping

Far too expensive

New technology too risky

Too difficult to carry.

Suppose the organisation responsible wants to know which factor is foremost in the farmer's mind. It may well be the case that if those factors that are most important to the farmer are addressed, the others, being of a relatively minor nature, will cease to prevent widespread adoption. The alternatives are to abandon the product's re-development or to completely re-design it, which is not only expensive and time-consuming, but may well be subject to a new set of objections.

The process of rank ordering the objections from most to least important is best approached through the questioning technique known as 'paired comparison'. Each of the objections is paired by the researcher so that with 5 factors, as in this example, there are 10 pairs.

In 'paired comparisons', every factor has to be paired with every other factor in turn. However, only one pair is ever put to the farmer at any one time.

The question might be put as follows:

Which of the following was the more important in making you decide not to buy the plough?

The plough was too expensive

It proved too difficult to transport

In most cases the question, and the alternatives, would be put to the farmer verbally.

He/she then indicates which of the two was the more important and the researcher ticks the box on his questionnaire. The question is repeated with a second pair of factors and the appropriate box ticked again. This process continues until all possible combinations are exhausted, in this case 10 pairs. It is good practice to mix the pairs of factors so that there is no systematic bias. The researcher should try to ensure that any particular factor is sometimes the first of the pair to be mentioned and sometimes the second. The researcher would never, for example, take the first factor (on this occasion 'Does not ridge') and


systematically compare it to each of the others in succession. That is likely to cause systematic bias.

Below, labels have been given to the factors so that the worked example will be easier to understand. The letters A to E have been allocated as follows:

A = Does not ridge

B = Far too expensive

C = New technology too risky

D = Does not work for inter-cropping

E = Too difficult to carry.

The data is then arranged into a matrix. Assume that 200 farmers have been interviewed and their responses are arranged in the grid below. Further assume that the matrix is so arranged that we read column over row: the entry in a given row and column is the number of farmers who said the column factor was a more important deterrent than the row factor. This means, for example, that 164 out of 200 farmers said the fact that the plough was too expensive was a greater deterrent than the fact that it was not capable of ridging. Similarly, 174 farmers said that the plough's inability to inter-crop was more important than its inability to ridge when deciding not to buy the plough.

Figure 3.4 A preference matrix 

A  B  C  D  E 

A 100 164 120 174 180

B  36 100 160 176 166

C  80 40 100 168 124

D  26 24 32 100 102

E  20 34 76 98 100

If the grid is carefully read, it can be seen that the rank order of the factors is:

Most important E Too difficult to carry

D Does not inter crop

C New technology/high risk

B Too expensive

Least important A Does not ridge.

It can be seen that it is more important for designers to concentrate on improving the plough's transportability and, if possible, to give it an inter-cropping capability, rather than focusing on its ridging capabilities (remember that the example is entirely hypothetical).
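To make the tallying concrete, here is a minimal sketch (in Python; the code is not part of the original notes) of how the rank order can be recovered from the preference matrix by counting, for each factor, how many of the other factors it "beats" by a majority of the 200 respondents. The matrix values are those of Figure 3.4.

```python
# Figure 3.4 as a matrix: entry [i][j] is the number of farmers (out of 200)
# who said the column factor j was a more important deterrent than the row
# factor i.
factors = ["A", "B", "C", "D", "E"]
counts = [
    [100, 164, 120, 174, 180],  # row A: Does not ridge
    [ 36, 100, 160, 176, 166],  # row B: Far too expensive
    [ 80,  40, 100, 168, 124],  # row C: New technology too risky
    [ 26,  24,  32, 100, 102],  # row D: Does not work for inter-cropping
    [ 20,  34,  76,  98, 100],  # row E: Too difficult to carry
]

# Tally pairwise "wins": factor j beats factor i when a majority (more than
# 100 of the 200 farmers) rated j the more important deterrent.
wins = {f: 0 for f in factors}
for i in range(len(factors)):
    for j in range(len(factors)):
        if i != j and counts[i][j] > 100:
            wins[factors[j]] += 1

ranking = sorted(factors, key=lambda f: wins[f], reverse=True)
print(ranking)  # ['E', 'D', 'C', 'B', 'A'] -- matching the list above
```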

One major advantage of this type of questioning is that whilst it is possible to obtain a measure of the order of importance of five or more factors from the respondent, he is never asked to think about more than two factors at any one time. This is especially useful when dealing with illiterate farmers. Having said that, the researcher has to be careful not to present too many pairs of factors to the farmer during the interview. If he does, he will find


that the farmer will quickly get tired and/or bored. It is as well to remember the formula n(n - 1)/2. For ten factors, brands or product attributes this would give 45 pairs. Clearly the farmer should not be asked to subject himself to having the same question put to him 45 times. For practical purposes, six factors is possibly the limit, giving 15 pairs.

It should be clear from the procedures described in these notes that the paired comparison scale gives ordinal data.

Dollar metric comparisons: This type of scale is an extension of the paired comparison method in that it requires respondents to indicate both their preference and how much they are willing to pay for their preference. This scaling technique gives the marketing researcher an interval-scaled measurement. An example is given in figure 3.5.

Figure 3.5 An example of a dollar metric scale 

Which of the following types of fish do you prefer?   How much more, in cents, would you be prepared to pay for your preferred fish?

Fresh / Fresh (gutted)        $0.70
Fresh (gutted) / Smoked        0.50
Frozen / Smoked                0.60
Frozen / Fresh                 0.70
Smoked / Fresh                 0.20
Fresh (gutted) / Frozen        0.30

(The original form also marked which fish in each pair the respondent preferred; the premiums are credited to the preferred fish in the computation below.)

From the data above, the preference scores shown below can be computed:

Fresh fish: 0.70 + 0.70 + 0.20 = 1.60

Smoked fish: 0.60 + (-0.20) + (-0.50) = -0.10

Fresh fish (gutted): (-0.70) + 0.30 + 0.50 = 0.10

Frozen fish: (-0.60) + (-0.70) + (-0.30) = -1.60

(As a check, the four net scores sum to zero.)
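The same arithmetic can be expressed as a short hypothetical sketch (the code is not part of the original text). The original figure indicated which fish in each pair the respondent preferred; that marking has been reconstructed here from the computed totals, including the $0.30 premium in the final pair.

```python
# Each tuple: (preferred fish, other fish, premium the respondent would pay).
pairs = [
    ("Fresh", "Fresh (gutted)", 0.70),
    ("Fresh (gutted)", "Smoked", 0.50),
    ("Smoked", "Frozen", 0.60),
    ("Fresh", "Frozen", 0.70),
    ("Fresh", "Smoked", 0.20),
    ("Fresh (gutted)", "Frozen", 0.30),
]

# The preferred fish is credited the premium; the other is debited it.
scores = {}
for preferred, other, premium in pairs:
    scores[preferred] = scores.get(preferred, 0.0) + premium
    scores[other] = scores.get(other, 0.0) - premium

for fish, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fish}: {s:+.2f}")
# Fresh: +1.60, Fresh (gutted): +0.10, Smoked: -0.10, Frozen: -1.60
```

Because every premium is credited to one product and debited from the other, the net scores always sum to zero, which is a useful check on the arithmetic.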

The unity-sum-gain technique: A common problem with launching new products is that of reaching a decision as to what options, and how many options, one offers. Whilst a company may be anxious to meet the needs of as many market segments as possible, it has to ensure that each segment is large enough to enable it to make a profit. It is always easier to add products to the product line but much more difficult to decide which models should be deleted. One technique for evaluating the options which are likely to prove successful is the unity-sum-gain approach.

The procedure is to begin with a list of features which might possibly be offered as 'options' on the product; alongside each, its retail cost is listed. A third column is constructed, and this forms an index of the relative prices of each of the items. The table below will help clarify the procedure. For the purposes of this example the basic reaper is priced at $20,000 and some possible 'extras' are listed along with their prices.

The total value of these hypothetical 'extras' is $7,460, but the researcher tells the farmer he has an equally hypothetical $3,950 or similar sum. The important thing is that he should have considerably less hypothetical money to spend than the total value of the alternative product features. In this way the farmer is encouraged to reveal his preferences by allowing


researchers to observe how he trades one additional benefit off against another. For example, would he prefer a side rake attachment on a 3 metre head rather than have a transporter trolley on either a standard or 2.5m wide head? The farmer has to be told that any unspent money cannot be retained by him, so he should seek the best value-for-money he can get.

In cases where the researcher believes that mentioning specific prices might introduce some form of bias into the results, the index can be used instead. This is constructed by taking the price of each item over the total of $7,460 and multiplying by 100. Survey respondents might then be given a maximum of 60 points and, as before, asked how they would spend these 60 points. In this crude example the index numbers are not too easy to work with for most respondents, so one would round them, as has been done in the adjusted column. It is the relative and not the absolute value of the items which is important, so the precision of the rounding need not overly concern us (a small sketch of the index construction follows the table).

Figure 3.6 The unity-sum-gain technique 

Item                                      Additional Cost ($s)  Index  Adjusted Index

2.5m wide rather than standard 2m         2,000                 27     30

Self lubricating chain rather than belt 200 47 50

Side rake attachment 350 5 10

Polymer heads rather than steel 250 3 5

Double rather than single edged cutters 210 2.5 5

Transporter trolley for reaper attachment 650 9 10

Automatic levelling of table 300 4 5
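As a small illustration of the index rule described above (the price of each item over the total, times 100, then rounded, since only relative values matter), here is a hypothetical sketch; the item names and prices below are invented for the illustration and are not the figures from the table.

```python
# Hypothetical option prices for illustrating the index construction.
prices = {
    "Wider head": 2000,
    "Side rake attachment": 350,
    "Transporter trolley": 650,
}

total = sum(prices.values())
for item, price in prices.items():
    index = price / total * 100      # relative price as a percentage
    adjusted = round(index / 5) * 5  # round to the nearest 5 for ease of use
    print(f"{item}: index {index:.1f}, adjusted {adjusted}")
```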

The unity-sum-gain technique is useful for determining which product features are more important to farmers. The design of the final market version of the product can then reflect the farmers' needs and preferences. Practitioners treat data gathered by this method as ordinal.

Noncomparative scales

Continuous rating scales: The respondents are asked to give a rating by placing a mark at the appropriate position on a continuous line. The scale can be written on a card and shown to the respondent during the interview. Two versions of a continuous rating scale are depicted in figure 3.7.

Figure 3.7 Continuous rating scales 


When version B is used, the respondent's score is determined either by dividing the line into as many categories as desired and assigning the respondent a score based on the category into which his/her mark falls, or by measuring the distance, in millimetres or inches, from either end of the scale.

Whichever of these forms of the continuous scale is used, the results are normally analysed as interval scaled.
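For the measured-distance variant, here is a minimal sketch (a hypothetical helper, not from the text) of converting the position of a respondent's mark into a category score:

```python
def line_score(distance_mm: float, line_length_mm: float = 100.0,
               categories: int = 10) -> int:
    """Category (1..categories) into which a mark falls, measuring from the
    left-hand end of the line."""
    fraction = min(max(distance_mm / line_length_mm, 0.0), 1.0)
    # A mark on the extreme right-hand end falls in the top category.
    return min(int(fraction * categories) + 1, categories)

print(line_score(37.0))   # mark 37 mm along a 100 mm line -> category 4
print(line_score(100.0))  # mark at the very end -> category 10
```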

Line marking scale: The line marked scale is typically used to measure perceived similarity or difference between products, brands or other objects. Technically, such a scale is a form of what is termed a semantic differential scale, since each end of the scale is labelled with a word/phrase (or semantic) that is opposite in meaning to the other. Figure 3.8 provides an illustrative example of such a scale.

Consider the products below, which can be used when frying food. In the case of each pair, indicate how similar or different they are in the flavour which they impart to the food.

Figure 3.8 An example of a line marking scale 

For some types of respondent, the line scale is an easier format because they do not find that discrete numbers (e.g. 5, 4, 3, 2, 1) best reflect their attitudes/feelings. The line marking scale is a continuous scale.

Itemised rating scales: With an itemised scale, respondents are provided with a scale having numbers and/or brief descriptions associated with each category and are asked to select one of the limited number of categories, ordered in terms of scale position, that best describes the product, brand, company or product attribute being studied. Examples of the itemised rating scale are illustrated in figure 3.9.

Figure 3.9 Itemised rating scales 


Itemised rating scales can take a variety of innovative forms, as demonstrated by the two graphic scales illustrated in figure 3.10.

Figure 3.10 Graphic itemised scales 

Whichever form of itemised scale is applied, researchers usually treat the data as interval level.

Semantic scales: This type of scale makes extensive use of words rather than numbers. Respondents describe their feelings about the products or brands on scales with semantic labels. When bipolar adjectives are used at the end points of the scales, these are termed semantic differential scales. The semantic scale and the semantic differential scale are illustrated in figure 3.11.

Figure 3.11 Semantic and semantic differential scales 


Likert scales: A Likert scale is what is termed a summated instrument scale. This means that the items making up a Likert scale are summed to produce a total score. In fact, a Likert scale is a composite of itemised scales. Typically, each scale item will have 5 categories, with scale values ranging from -2 to +2 with 0 as the neutral response (or, equivalently, from 1 to 5, as in the example). This explanation may be clearer from the example in figure 3.12.

Figure 3.12 The Likert scale 

(Response categories: Strongly Agree = 1, Agree = 2, Neither = 3, Disagree = 4, Strongly Disagree = 5)

If the price of raw materials fell, firms would reduce the price of their food products.   1 2 3 4 5

Without government regulation the firms would exploit the consumer.   1 2 3 4 5

Most food companies are so concerned about making profits they do not care about quality.   1 2 3 4 5

The food industry spends a great deal of money making sure that its manufacturing is hygienic.   1 2 3 4 5

Food companies should charge the same price for their products throughout the country.   1 2 3 4 5


Likert scales are treated as yielding interval data by the majority of marketing researchers.
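Because a Likert scale is summated, scoring it is just a matter of adding the item codes, reverse-scoring any items worded in the opposite direction. A minimal sketch follows (hypothetical responses; the 1-to-5 coding matches Figure 3.12):

```python
def likert_total(responses, reverse_items=()):
    """Sum 1-5 item codes, reverse-scoring the given (0-based) item indices."""
    total = 0
    for i, r in enumerate(responses):
        total += (6 - r) if i in reverse_items else r
    return total

respondent = [2, 1, 2, 4, 3]     # one respondent's answers to the five items
print(likert_total(respondent))  # 12; subtracting 3 per item gives the
                                 # equivalent -2..+2 coding (total -3 here)
```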

The scales which have been described in this chapter are among the most commonly used in marketing research. Whilst there are a great many more forms which scales can take, if students are familiar with those described in this chapter they will be well equipped to deal with most types of survey problem.

Chapter Summary

There are four levels of measurement: nominal, ordinal, interval and ratio. These constitute a hierarchy where the lowest scale of measurement, nominal, has far fewer mathematical properties than those further up this hierarchy of scales. Nominal scales yield data on categories; ordinal scales give sequences; interval scales begin to reveal the magnitude between points on the scale; and ratio scales explain both order and the absolute distance between any two points on the scale.

The measurement scales commonly used in marketing research can be divided into two groups: comparative and noncomparative scales. Comparative scales involve the respondent in signalling where there is a difference between two or more products, services, brands or other stimuli. Examples of such scales include paired comparison, dollar metric, unity-sum-gain and line marking scales. Noncomparative scales, described in the textbook, are continuous rating scales, itemised rating scales, semantic differential scales and Likert scales.

Garbage in, garbage out (from Wikipedia, the free encyclopedia)


Garbage in, garbage out (abbreviated to GIGO, coined as a pun on the phrase first-in, first-out) is a phrase in the field of computer science or information and communication technology. It is used primarily to call attention to the fact that computers will unquestioningly process the most nonsensical of input data ("garbage in") and produce nonsensical output ("garbage out"). It was most popular in the early days of computing, but applies even more today, when powerful computers can spew out mountains of erroneous information in a short time.

The term was coined as a teaching mantra by George Fuechsel,[1] an IBM 305 RAMAC technician/instructor in New York. Early programmers were required to test virtually each program step and were cautioned not to expect that the resulting program would "do the right thing" when given imperfect input. The underlying principle was noted by the inventor of the first programmable computing device design:


On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

—Charles Babbage, Passages from the Life of a Philosopher [2]

It is also commonly used to describe failures in human decision-making due to faulty, incomplete, or imprecise data.

The term can also be used as an explanation for the poor quality of a digitized audio or video file. Although digitizing can be the first step in cleaning up a signal, it does not, by itself, improve the quality. Defects in the original analog signal will be faithfully recorded, but may be identified and removed by a subsequent step. (See Digital signal processing.)

Garbage in, gospel out is a more recent expansion of the acronym. It is a sardonic comment on the tendency to put excessive trust in "computerized" data, and on the propensity for individuals to blindly accept what the computer says. Because the data goes through the computer, people tend to believe it.

Decision-makers increasingly face computer-generated information and analyses that could be collected and analyzed in no other way. Precisely for that reason, going behind that output is out of the question, even if one has good cause to be suspicious. In short, the computer analysis becomes the gospel.[3]

Chapter 5

Standardized Measurement and Assessment

Defining Measurement

When we measure, we attempt to identify the dimensions, quantity, capacity, or degree of something.

- Measurement is formally defined as the act of measuring by assigning symbols or numbers to something according to a specific set of rules.

Measurement can be categorized by the type of information that is communicated by the symbols or numbers assigned to the variables of interest. In particular, there are four levels or types of information, which are discussed next in the chapter. They are called the four "scales of measurement."

Scales of Measurement 

1. Nominal Scale. This is a nonquantitative measurement scale.

- It is used to categorize, label, classify, name, or identify variables. It classifies groups or types.


- Numbers can be used to label the categories of a nominal variable, but the numbers serve only as markers, not as indicators of amount or quantity (e.g., if you wanted to, you could mark the categories of the variable called "gender" with 1 = female and 2 = male).

- Some examples of nominal level variables are the country you were born in, college major, personality type, and experimental group (e.g., experimental group or control group).

2. Ordinal Scale. This level of measurement enables one to make ordinal judgments (i.e., judgments about rank order).

- Any variable where the levels can be ranked (but you don't know if the distance between the levels is the same) is an ordinal variable.

- Some examples are order of finish position in a marathon, the Billboard Top 40, and rank in class.

3. Interval Scale. This scale or level of measurement has the characteristics of rank order and equal intervals (i.e., the distance between adjacent points is the same). It does not possess an absolute zero point.

- Some examples are Celsius temperature, Fahrenheit temperature, and IQ scores.

- Here is the idea of the lack of a true zero point: zero degrees Celsius does not mean no temperature at all; on the Fahrenheit scale it equals the freezing point of water, or 32 degrees. Zero degrees on these scales does not mean zero or no temperature.

4. Ratio Scale. This is a scale with a true zero point.

- It also has all of the "lower level" characteristics (i.e., the key characteristic of each of the lower level scales): equal intervals (interval scale), rank order (ordinal scale), and the ability to mark a value with a name (nominal scale).

- Some examples of ratio level scales are number correct, weight, height, response time, Kelvin temperature, and annual income.

- Here is an example of the presence of a true zero point: if your annual income is exactly zero dollars, then you earned no annual income at all. (You can buy absolutely nothing with zero dollars.) Zero means zero.

Assumptions Underlying Testing and Measurement 

Before listing the assumptions, note the difference between testing and assessment. According to the definitions that we use:

- Testing is the process of measuring variables by means of devices or procedures designed to obtain a sample of behavior, and

- Assessment is the gathering and integration of data for the purpose of making an educational evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatus and measurement procedures.

In this section of the text, we also list the twelve assumptions that Cohen et al. consider basic to testing and assessment:


1. Psychological traits and states exist.

- A trait is a relatively enduring (i.e., long lasting) characteristic on which people differ; a state is a less enduring or more transient characteristic on which people differ.

- Traits and states are actually social constructions, but they are real in the sense that they are useful for classifying and organizing the world, they can be used to understand and predict behavior, and they refer to something in the world that we can measure.

2. Psychological traits and states can be quantified and measured.

- For nominal scales, the number is used as a marker. For the other scales, the numbers become more and more quantitative as you move from ordinal scales (showing ranking only) to interval scales (showing amount, but lacking a true zero point) to ratio scales (showing amount or quantity as we usually understand these concepts in mathematics or everyday use).

- Most traits and states measured in education are taken to be at the interval level of measurement.

3. Various approaches to measuring aspects of the same thing can be useful.

- For example, different tests of intelligence tap into somewhat different aspects of the construct of intelligence.

4. Assessment can provide answers to some of life's most momentous questions.

- It is important that the users of assessment tools know when these tools will provide answers to their questions.

5. Assessment can pinpoint phenomena that require further attention or study.

- For example, assessment may identify someone as having dyslexia, having low self-esteem, or being at risk for drug use.

6. Various sources of data enrich and are part of the assessment process.

- Information from several sources usually should be obtained in order to make an accurate and informed decision. For example, the idea of portfolio assessment is useful.

7. Various sources of error are always part of the assessment process.

- There is no such thing as perfect measurement. All measurement has some error.

- We defined error as the difference between a person's true score and that person's observed score (written out as an equation below).

- The two main types of error are random error (e.g., error due to transient factors such as being sick or tired) and systematic error (e.g., error present every time the measurement instrument is used, such as an essay exam being graded by an overly easy grader). (Later, when we discuss reliability and validity, you might note that unreliability is due to random error and lack of validity is due to systematic error.)
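Written out, the error definition in assumption 7 is the basic equation of classical test theory. The variance decomposition on the second line holds under the standard assumption, not stated above, that random error is uncorrelated with true scores:

```latex
X = T + E
    \quad \text{(observed score = true score + error)} \\
\sigma^2_X = \sigma^2_T + \sigma^2_E
    \quad \text{(when } E \text{ is random and uncorrelated with } T\text{)}
```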

8. Tests and other measurement techniques have strengths and weaknesses.

- It is essential that users of tests understand this so that they can use them appropriately and intelligently.


- In this chapter, we will be talking about the two major characteristics: reliability and validity.

9. Test-related behavior predicts non-test-related behavior.

- The goal of testing usually is to predict behavior other than the exact behaviors required while the exam is being taken.

- For example, paper-and-pencil achievement tests given to children are used to say something about their level of achievement.

- Another paper-and-pencil test (also called a self-report test) that is popular in counseling is the MMPI (the Minnesota Multiphasic Personality Inventory). Clients' scores on this test are used as indicators of the presence or absence of various mental disorders.

- The point here is that the actual mechanics of measurement (e.g., self-reports, behavioral performance, projective techniques) can vary widely and still provide good measurement of educational, psychological, and other types of variables.

10. Present-day behavior sampling predicts future behavior.

- Perhaps the most important reason for giving tests is to predict future behavior.

- Tests provide a sample of present-day behavior. However, this "sample" is used to predict future behavior.

- For example, an employment test given by someone in a personnel office may be used as a predictor of future work behavior.

- Another example: the Beck Depression Inventory is used to measure depression and, importantly, to predict test takers' future behavior (e.g., are they a risk to themselves?).

11. Testing and assessment can be conducted in a fair and unbiased manner.

- This requires careful construction of test items and testing of the items on different types of people.

- Test makers always have to be on the alert to make sure tests are fair and unbiased.

- This assumption also requires that the test be administered to those types of people for whom it has been shown to operate properly.

12. Testing and assessment benefit society.

- Many critical decisions are made on the basis of tests (e.g., teacher competency, employability, presence of a psychological disorder, degree of teacher satisfaction, degree of student satisfaction, etc.).

- Without tests, the world would be much more unpredictable.

Identifying A Good Test or Assessment Procedure 

As mentioned earlier in the chapter, good measurement is fundamental for research. If we do not have good measurement then we cannot have good research. That's why it's so important to use testing and assessment procedures that are characterized by high reliability and high validity.

Overview of Reliability and Validity

As an introduction to reliability and validity and how they are related, note the following:


- Reliability refers to the consistency or stability of test scores.

- Validity refers to the accuracy of the inferences or interpretations we make from test scores.

- Reliability is a necessary but not sufficient condition for validity (i.e., if you are going to have validity, you must have reliability, but reliability in and of itself is not enough to ensure validity).

- Assume you weigh 125 pounds. If you weigh yourself five times and get 135, 134, 134, 135, 136, then your scales are reliable but not valid. The scores were consistent but wrong! Again, you want your scales to be both reliable and valid.

Reliability

Reliability refers to consistency or stability. In psychological and educational testing, it refers to the consistency or stability of the scores that we get from a test or assessment procedure.

- Reliability is usually determined using a correlation coefficient (called a reliability coefficient in this context).

- Remember (from chapter two) that a correlation coefficient is a measure of relationship that varies from -1 to 0 to +1, and the farther the number is from zero, the stronger the correlation. For example, minus one (-1.00) indicates a perfect negative correlation, zero indicates no correlation at all, and positive one (+1.00) indicates a perfect positive correlation. Regarding strength, -.85 is stronger than +.55, and +.75 is stronger than +.35. When you have a negative correlation, the variables move in opposite directions (e.g., poor diet and life expectancy); when you have a positive correlation, the variables move in the same direction (e.g., education and income).

- When looking at reliability coefficients, we are interested in values ranging from 0 to 1; that is, we are only interested in positive correlations. Note that zero means no reliability, and +1.00 means perfect reliability.

- Reliability coefficients of .70 or higher are generally considered acceptable for research purposes. Reliability coefficients of .90 or higher are needed to make decisions that have impacts on people's lives (e.g., the clinical uses of tests).

- Reliability is empirically determined; that is, we must check the reliability of test scores with specific sets of people and obtain the reliability coefficients of interest to us.

There are four primary ways to measure reliability. 

1. The first type of reliability is called test-retest reliability.

- This refers to the consistency of test scores over time.

- It is measured by correlating the test scores obtained at one point in time with the test scores obtained at a later point in time for a group of people (a small sketch follows this list).

- A primary issue is identifying the appropriate time interval between the two testing occasions.

- The longer the time interval between the two testing occasions, the lower the reliability coefficient tends to be.
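A minimal sketch of the computation, using hypothetical scores (numpy's corrcoef gives the Pearson correlation, which serves as the reliability coefficient here):

```python
import numpy as np

# Hypothetical scores for the same eight people at two points in time.
time1 = np.array([85, 92, 78, 95, 88, 73, 90, 81])
time2 = np.array([83, 94, 75, 96, 85, 70, 92, 84])

r = np.corrcoef(time1, time2)[0, 1]  # Pearson r = test-retest reliability
print(f"test-retest reliability = {r:.2f}")  # about .97 here, well above
                                             # the .70 rule of thumb
```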

2.  The second type of reliability is called equivalent forms reliability. 


- This refers to the consistency of test scores obtained on two equivalent forms of a test designed to measure the same thing.

- It is measured by correlating the scores obtained by giving two forms of the same test to a group of people.

- The success of this method hinges on the equivalence of the two forms of the test.

3. The third type of reliability is called internal consistency reliability.

- It refers to the consistency with which the items on a test measure a single construct.

- Internal consistency reliability only requires one administration of the test, which makes it a very convenient form of reliability.

- One type of internal consistency reliability is split-half reliability, which involves splitting a test into two equivalent halves and checking the consistency of the scores obtained from the two halves.

- The measure of internal consistency that we emphasize in the chapter is coefficient alpha. (It is also sometimes called Cronbach's alpha.) The beauty of coefficient alpha is that it is readily provided by statistical analysis packages and it can be used when test items are quantitative and when they are dichotomous (as in right or wrong). A sketch of the computation follows this list.

- Researchers use coefficient alpha when they want an estimate of the reliability of a homogeneous test (i.e., a test that measures only one construct or trait) or an estimate of the reliability of each dimension on a multidimensional test. You will see it commonly reported in empirical research articles.

- Coefficient alpha will be high (e.g., greater than .70) when the items on a test are correlated with one another. But note that the number of items also affects the strength of coefficient alpha (i.e., the more items you have on a test, the higher coefficient alpha will be). This latter point is important because it shows that it is possible to get a large alpha coefficient even when the items are not very homogeneous or internally consistent.
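Here is a minimal sketch of coefficient alpha computed directly from its definition, using hypothetical 1-to-5 item ratings; in practice one would normally rely on a statistics package's built-in routine, as the notes say:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

data = np.array([  # 5 hypothetical respondents x 4 items
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])
print(f"alpha = {cronbach_alpha(data):.2f}")  # about .94 for this toy data
```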

4. The fourth and last major type of reliability is called inter-scorer reliability.

- Inter-scorer reliability refers to the consistency or degree of agreement between two or more scorers, judges, or raters.

- You could have two judges rate one set of papers. Then you would just correlate their two sets of ratings to obtain the inter-scorer reliability coefficient, showing the consistency of the two judges' ratings.

Validity

Validity refers to the accuracy of the inferences, interpretations, or actions made on the basis of test scores.

- Technically speaking, it is incorrect to say that a test is valid or invalid. It is the interpretations and actions taken based on the test scores that are valid or invalid.

- All of the ways of collecting validity evidence are really forms of what used to be called construct validity. All that means is that in testing and assessment, we are always measuring something (e.g., IQ, gender, age, depression, self-efficacy).

Validation refers to gathering evidence supporting some inference made on the basis of test scores.


There are three main methods of collecting validity evidence.

1. Evidence Based on Content

Content-related evidence is based on a judgment of the degree to which the items, tasks, or questions on a test adequately represent the domain of interest. Expert judgment is used to provide evidence of content validity.

To make a decision about content-related evidence, you should try to answer these three questions:

- Do the items appear to represent the thing you are trying to measure?

- Does the set of items underrepresent the construct's content (i.e., have you excluded any important content areas or topics)?

- Do any of the items represent something other than what you are trying to measure (i.e., have you included any irrelevant items)?

2. Evidence Based on Internal Structure

Some tests are designed to measure one general construct, but other tests are designed to measure several components or dimensions of a construct. For example, the Rosenberg Self-Esteem Scale is a 10-item scale designed to measure the construct of global self-esteem. In contrast, the Harter Self-Esteem Scale is designed to measure global self-esteem as well as several separate dimensions of self-esteem.

- The use of the statistical technique called factor analysis tells you the number of dimensions (i.e., factors) that are present. That is, it tells you whether a test is unidimensional (measures just one factor) or multidimensional (i.e., measures two or more dimensions).

- When you examine the internal structure of a test, you can also obtain a measure of test homogeneity (i.e., how well the different items measure the construct or trait).

- The two primary indices of homogeneity are the item-to-total correlation (i.e., correlating each item with the total test score, as sketched below) and coefficient alpha (discussed earlier under reliability).
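A minimal sketch of item-to-total correlations, reusing the hypothetical data from the alpha example (some texts prefer the "corrected" version that excludes the item from the total before correlating):

```python
import numpy as np

data = np.array([  # 5 hypothetical respondents x 4 items
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])

totals = data.sum(axis=1)  # each respondent's total test score
for i in range(data.shape[1]):
    r = np.corrcoef(data[:, i], totals)[0, 1]
    print(f"item {i + 1}: item-to-total r = {r:.2f}")
```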

3. Evidence Based on Relations to Other Variables

This form of evidence is obtained by relating your test scores to one or more relevant criteria. A criterion is the standard or benchmark that you want to predict accurately on the basis of the test scores. Note that when using correlation coefficients for validity evidence we call them validity coefficients.

There are several different kinds of relevant validity evidence based on relations to other variables.

The first is called criterion-related evidence, which is validity evidence based on the extent to which scores from a test can be used to predict or infer performance on some criterion such as a test or future performance. Here are the two types of criterion-related evidence:

- Concurrent evidence: validity evidence based on the relationship between test scores and criterion scores obtained at the same time.


- Predictive evidence: validity evidence based on the relationship between test scores collected at one point in time and criterion scores obtained at a later time.

Here are three more types of validity evidence researchers should provide:

- Convergent evidence: validity evidence based on the relationship between the focal test scores and independent measures of the same construct. The idea is that you want the test you are trying to validate to correlate strongly with other measures of the same thing.

- Divergent evidence: evidence that the scores on your focal test are not highly related to the scores from other tests that are designed to measure theoretically different constructs. This kind of evidence shows that your test is not a measure of those other things (i.e., other constructs).

- Putting the ideas of convergent and divergent evidence together, the point is that to show that a new test measures what it is supposed to measure, you want it to correlate with other measures of that construct (convergent evidence) but you also want it NOT to correlate strongly with measures of other things (divergent evidence). You want your test to overlap with similar tests and to diverge from tests of different things. In short, both convergent and divergent evidence are desirable.

- Known groups evidence is also useful in demonstrating validity. This is evidence that groups that are known to differ on the construct do differ on the test in the hypothesized direction. For example, if you develop a test of gender roles, you would hypothesize that females will score higher on femininity and males will score higher on masculinity. Then you would test this hypothesis to see if you have evidence of validity (one way of doing so is sketched below).

Now, to summarize these three major methods for obtaining evidence of validity, look again at Table 5.6.

Please note that, if you think we have spent a lot of time on validity and measurement, the reason is that validity is so important in empirical research. Remember, without good measurement we end up with GIGO (garbage in, garbage out).


Using Reliability and Validity Information 

You must be careful when interpreting the reliability and validity evidence provided with standardized tests and in empirical research journal articles.

- With standardized tests, the reported validity and reliability data are typically based on a norming group (which is an actual group of people). If the people with whom you intend to use a test are very different from those in the norming group, then the validity and reliability evidence provided with the test becomes questionable. Remember that what you need to know is whether a test will work with the people in your classroom or in your research study.

- When reading journal articles, you should view an article positively to the degree that the researchers provide reliability and validity evidence for the measures that they use. Two related questions to ask when reading and evaluating an empirical research article are "Is this research study based on good measurement?" and "Do I believe that these researchers used good measures?" If the answers are yes, then give the article high marks for measurement. If the answers are no, then you should invoke the GIGO principle (garbage in, garbage out).


Educational and Psychological Tests 

Three primary types of educational and psychological tests are discussed in your textbook: intelligence tests, personality tests, and educational assessment tests.

1) Intelligence Tests

Intelligence has many definitions because a single prototype does not exist. Although far from being a perfect definition, here is our definition: intelligence is the ability to think abstractly and to learn readily from experience.

- Although the construct of intelligence is hard to define, it still has utility because it can be measured and it is related to many other constructs.


2) Personality Tests

Personality is a construct similar to intelligence in that a single prototype does not exist. Here is our definition: personality is the relatively permanent patterns that characterize and can be used to classify individuals.

- Most personality tests are self-report measures. A self-report measure is a test-taking method in which the participants check or rate the degree to which various characteristics are descriptive of themselves.

- Performance measures of personality are also used. A performance measure is a test-taking method in which the participants perform some real-life behavior that is observed by the researcher.

- Personality has also been measured with projective tests. A projective test is a test-taking method in which the participants provide responses to ambiguous stimuli. The test administrator searches for patterns in participants' responses. Projective tests tend to be quite difficult to interpret and are not commonly used in quantitative research.


3) Educational Assessment Tests

There are four subtypes of educational assessment tests:

- Preschool Assessment Tests. These are typically screening tests, because the predictive validity of many of these tests is weak.

- Achievement Tests. These are designed to measure the degree of learning that has taken place after a person has been exposed to a specific learning experience. They can be teacher-constructed or standardized tests.



- Aptitude Tests. These focus on information acquired through the informal learning that goes on in life. They are often used to predict future performance, whereas achievement tests are used to measure current performance.

- Diagnostic Tests. These tests are used to identify the locus of academic difficulties in students.

Sources of Information about Tests

The two most important sources of information about tests are the Mental Measurements Yearbook (MMY) and Tests in Print (TIP). Some additional sources are provided in Table 5.7, and Table 5.8 lists some useful internet links.
