
Operations analysis of behavioral observation procedures: a taxonomy for modeling in an expert training system

Roger D. Ray & Jessica M. Ray & David A. Eckerman & Laura M. Milkosky & LaDarien J. Gillins

Published online: 30 July 2011
© Psychonomic Society, Inc. 2011

Abstract  This article introduces a taxonomy based on a procedural operations analysis (Verplanck, 1996) of various method descriptions found in the behavior observation research literature. How these alternative procedures impact the recording and subsequent analysis of behavioral events on the basis of the type of time and behavior recordings made is also discussed. The taxonomy was generated as a foundation for the continuing development of an expert training system called Train-to-Code (TTC; J. M. Ray & Ray, Behavior Research Methods 40:673–693, 2008). Presently in its second version, TTC V2.0 is software designed for errorless training (Terrace, Journal of the Experimental Analysis of Behavior 6:1–27, 1963) of student accuracy and fluency in the direct observation and coding of behavioral or verbal events depicted via digital video. Two of 16 alternative procedures classified by the taxonomy are presently modeled in TTC's structural interface and functional services. These two models are presented as illustrations of how the taxonomy guides software user interface and algorithm development. The remaining 14 procedures are described in sufficient operational detail to allow similar model-oriented translation.

Keywords  Behavioral observation · Observation procedures · Observer training · Computer-based training · Observer agreement · Expert training systems

In most descriptive research literatures, observers use linguistic labels to record an account of ongoing behaviors, be they discerned by auditory, visual, or any other sensory channel (see Bakeman & Gottman, 1987, 1997; Martin & Bateson, 1993; Sharpe & Koperwas, 2003; Wallace & Ross, 2006). Observers are trained to classify, or code, behavior by appropriate application of a linguistic category, whether for data gathering, behavioral assessment, or intervention/treatment purposes (see Friman, 2009). The degree to which training has been successful is normally measured through statistical techniques for assessing interobserver agreement (IOA) or intraobserver agreement—techniques such as percentage of agreement, Cohen's kappa, correlations, and ratios of frequency or duration estimates, among others (see Bakeman & Gottman, 1997; Hartmann, 1977; Mitchell, 1979; Wallace & Ross, 2006).

There is large variation in experimental procedures for carrying out behavioral observation and recording, and a variety of descriptive summary labels for these procedures have been used in the literature. In this article, we reconstruct from published procedural descriptions each of the types of procedures used—an effort that W. S. Verplanck (1957, 1996) called operations analysis. In Verplanck's (1957) earliest definition of this approach, he empirically established the exact meaning of scientific terms used by various researchers and published his earliest version as A Glossary of Terms.1 We follow this tradition and offer a more consistent and procedurally based terminology, or verbal taxonomy, for classifying the various observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary for discussing widely varying systematic observation procedures, but also offers the prospect of specifying functional and structural requirements of computer interfaces and algorithms designed to aid in data collection using any given procedure. This computerized design process represents at least one form of modeling such procedures (i.e., constructing structural and functional exemplars for them).

Electronic supplementary material  The online version of this article (doi:10.3758/s13428-011-0140-6) contains supplementary material, which is available to authorized users.

R. D. Ray (*) · J. M. Ray · L. M. Milkosky · L. J. Gillins
Rollins College, Winter Park, FL, USA
e-mail: [email protected]

J. M. Ray
University of Central Florida, Orlando, FL, USA

D. A. Eckerman
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Behav Res (2011) 43:616–634
DOI 10.3758/s13428-011-0140-6

This article also provides operational specifications that would allow us to extend an expert software system, previously described in this journal (J. M. Ray & Ray, 2008), that models only two of the existing procedures reported in the behavior observation research literature. This system, called Train-to-Code2 (TTC), is a comprehensive program for training behavioral observation/recording repertoires using errorless training strategies (Terrace, 1963). A secondary purpose of this article is to present an introductory review of the issues underlying observation research in general. A systematic survey of the literature conducted for this effort found few discussions that have compared and contrasted the many problems, issues, decisions, and implications that should be considered before selecting an observational procedure. Thus, we provide a procedure-by-procedure guide through these considerations.

The first type of procedure modeled in TTC was a continuous coding procedure adequate for serving as an expert coding reference. In TTC, Continuous Coding provides a frame-by-frame classification of any incorporated video source and is thus sufficient to provide feedback and correction for any trainee learning to apply these codes (e.g., J. M. Ray & Ray, 2008). However, not all continuous observational procedures involve TTC's explicit coding using an exhaustive set of behavioral categories, thus resulting in continuous observation, but without a continuous record of all behaviors occurring during the entire time period. For example, a recent survey by Mudford, Taylor, and Martin (2009) of articles published in the Journal of Applied Behavior Analysis (JABA) from 1995 through 2005 shows a divergent trend that starts with roughly equivalent frequencies of use for continuous and discontinuous recording but subsequently favors the more frequent use of continuous observation methods (but not continuous coding—a distinction we will subsequently elaborate upon) over discontinuous methods. This trend of increasing use of continuous observation correlates quite positively with the increasing availability of miniaturized technologies for real-time data collection over this same period.

We conducted a broader survey of the literature in preparation for having the TTC system ultimately be capable of training observers to carry out any observational procedure used in either research or applied settings. In that broad survey, we focused on empirical features that distinguish each alternative behavioral coding procedure and, subsequently, generated an organizational scheme for systematically classifying, comparing, and contrasting those procedures. In that broad survey and in a subsequent targeted literature survey, we found that discontinuous methods have been the more commonly reported coding procedures. We also sought to specify each of these reported procedures with sufficient precision to guide user-interface development in TTC. The result is a taxonomy suitable for teaching aspiring scientists about alternative observational methods and their implications. In fact, while modeling the second procedure incorporated in TTC (Behavioral Frequency Count) using the procedural specifications in the present taxonomy, we learned so much that we are convinced that even experienced researchers will find issues worth further consideration within this review.

Classifying observation research procedures

Few articles have attempted to standardize names for the relevant methodological variations found in the literature on direct observation of behavior. Some articles have addressed specific topics, such as sampling strategies for behavior or individuals (e.g., Altmann, 1974; Friman, 2009) or alternative time sampling procedures and how they compare with actual continuous recording (Mann, Have, Plunkett, & Meisels, 1991). Some secondary sources (e.g., Bakeman & Gottman, 1997; Martin & Bateson, 1993; Sharpe & Koperwas, 2003; Wallace & Ross, 2006) have offered comprehensive overviews of procedural alternatives. However, as we attempted to perform an operations analysis of these procedures, we found that descriptions typically lacked sufficient clarity to replicate many of the procedures. Furthermore, authors often conflated factors that should, instead, be treated separately.

In conducting our operations analysis, we surveyed both primary and secondary sources, focusing on disciplines that utilize observational approaches extensively, such as applied behavior analysis, child development, and animal behavior. The present article presents an annotated and illustrated classification taxonomy that summarizes procedures found by our search. Although most of the studies used involved visual observations, we did not eliminate auditory or textual observation research. We also attempt to identify procedural variations for which we have yet to find exemplars.

1 After the 1957 publication of A Glossary of Terms, Verplanck published an Internet "Preface" statement on the approach taken (http://web.utk.edu/~wverplan/gt57/preface.html), and he spent the remainder of his life extending this early operations analysis approach for defining procedural, empirical, and theoretical terms in the clearest way possible. The greatly expanded result of his pursuit of the development of a common language for discussing behavior is available as an online Glossary and Thesaurus of Behavioral Terms at http://www.ai2inc.com/Products/GT.html.
2 The Train-to-Code software system is copyrighted and commercially sold and distributed by (AI)2, Inc. (www.ai2inc.com). Authors R. D. Ray and J. M. Ray are each stockholders in (AI)2, Inc.

However, before focusing on this taxonomy, four overarching issues need consideration: (1) structural versus functional descriptions of behavior; (2) event versus state perspectives on behavior; (3) mutually exclusive and exhaustive behavioral categories, along with the relevance of R. D. Ray and Delprato's (1989) concept of measurement domains as a classificatory means for parsing concurrently occurring behavioral events and their recording; and (4) methods of calculating IOA. Each issue represents a decision that is best made prior to selecting a procedure from our taxonomy.

Structural versus functional descriptions of behavior

R. D. Ray and Delprato (1989) have offered a detailed account of differences between relying upon structural criteria (topographical or kinesiological) versus functional criteria (consequential or effect defined) for defining behavioral categories. A rather detailed review of the alternative measurement criteria for behavior is available in Friman's (2009) chapter on behavioral assessment. Friman distinguished between primary (direct observation) measures, secondary (indirect) measures, and the use of products of behavior for assessing behavior. Bakeman and Gottman (1997) made a similar distinction in their discussion of a continuum of criteria that range from physical (e.g., postural, physiological measures), on one end, to socially defined codings (e.g., aggressive, playful), on the other. R. D. Ray and Delprato also discussed some of the problems that may arise when these two alternative approaches are conflated, or mixed into one intended taxonomy (see also Purton, 1978).

Clear examples of physically/structurally defined behaviors are gaits of movement or locomotion (see Collins & Stewart, 1993). Categories differentiating gaits might include mutually exclusive categories such as standing, walking, trotting, cantering, galloping, and so forth. Some gaits are so physically based that they may be analyzed and even computer modeled on the basis of physical variations within a category across individuals (see Cunado, Nixon, & Carter, 2003). Categories describing variations in movement-based gaits may be easily mixed with other nonmovement structural categories to define a singular behavioral taxonomy, such as lying down, sitting, rolling over, or jumping, but typically should not be conflated with more functionally or socially defined behaviors, such as playing or flirting. Playing and flirting are, in fact, apt examples of functional/socially defined categories that reflect the reactions of the observer.

Failing to distinguish between structure and function can lead to confusion as to when a correct coding can be made. Some procedures do not require timing accuracy, while others do. Structurally defined behavior categories can typically be coded very quickly after the initiation of a behavior. For example, a person does not have to remain standing very long after transitioning from a sitting position before an observer can reliably identify that standing is occurring. Early identification of a behavior category can be desirable when recording, since it avoids confusing what just happened with what is happening now in the record.

Functionally defined behavior categories, however, often require a judgment based on a completed behavioral episode and cannot be designated near the beginning of an episode (e.g., judging whether a person is making a pun or asking a rhetorical question). In such situations, it may be difficult to determine whether the code applies to the recently ended behavior or to the behavior currently in progress. Allowing some postbehavior coding tolerance, a time window within which the previous behavior must be coded, helps to resolve this confusion even though a new behavior may already be in progress. As we will see presently, this resolution is critical for assessing observational accuracy or reliability.

Methodological implications of events versus states

We often describe a behavior as though it did not extend in time but, rather, occurred at a moment in time (e.g., turned on the light, fired the gun, or closed the door). That is, we are describing this behavior as an event. We locate such events at a particular moment in time, rather than giving them temporal duration. Since behavior is actually a continuous process, these descriptions, of course, represent a convenient fiction. All behavior is extended in time. At the other extreme, we describe some behavior as a process spread out in time (e.g., mood, attitude, personality). That is, we describe this behavior as a state. We may try to digitize the flow of such behavior across time by locating the time when the state begins (e.g., falling asleep), but we consider that this behavior perdures—at least until some ending time (e.g., waking).

There is, of course, a continuum between describing behavior as event and describing behavior as state, and these actually represent potentially helpful perspectives, rather than facts about behavior. We expect personality to perdure and are surprised when people change (assumed: their behavior). We expect a meal to be a temporary state. We often extend a series of events to emphasize that they represent a bout of behavior (i.e., a temporary state). In this report, we merely mean to acknowledge this continuum, not to judge the value of event versus state descriptions. For some purposes, one will be best; for other purposes, the other. Yet there are methodological implications. If the observer chooses to record the frequency of the behavior, that emphasizes behavior as an event. If the observer chooses to record the duration of the behavior, that emphasizes behavior as a state. A continuous record of behavior is neutral with respect to event versus state and can be used to focus on either frequency or duration. However, one needs to avoid conflating event-based categories with state-based categories, because they typically cannot be compared.

Another issue of comparison legitimacy arises even if a researcher chooses to define all behaviors as state phenomena. If categories with radically different state durations, such as sneezing and sleeping, are incorporated into the same coding scheme, comparing such states on the basis of either frequency or duration can be highly problematic. R. D. Ray and Delprato (1989) have emphasized micro- versus macro-event resolution as a meaningful way of dealing with categorical disparities in such parametrics, and coding schemes will avoid numerous problems by adhering to some standard of relative homogeneity in their level of resolution for categories. Other recording strategies described below will emphasize one perspective or the other (i.e., event vs. state; micro vs. macro). Often the choice of procedural parameters (e.g., length of the observing period, length of the time sample within the observing period) will align well or badly with the chosen perspective.
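To make the event versus state distinction concrete, consider the following sketch (a hypothetical illustration in Python, not taken from TTC). It takes a single continuous record, a list of (code, start time) entries drawn from an exhaustive category set, and summarizes it from both perspectives: counting entries treats each code as an event, while accumulating the time from each entry to the next treats each code as a state.

```python
# Hypothetical continuous record: (code, start_time_in_seconds) pairs.
# The coding is exhaustive, so each state lasts until the next one starts.
record = [("sit", 0.0), ("stand", 4.5), ("walk", 6.0), ("sit", 12.0)]
session_end = 20.0  # end of the observation period

def event_frequencies(rec):
    """Event perspective: count how often each code was entered."""
    freq = {}
    for code, _ in rec:
        freq[code] = freq.get(code, 0) + 1
    return freq

def state_durations(rec, end):
    """State perspective: total time spent in each code."""
    dur = {}
    for (code, start), (_, nxt) in zip(rec, rec[1:] + [(None, end)]):
        dur[code] = dur.get(code, 0.0) + (nxt - start)
    return dur

print(event_frequencies(record))            # {'sit': 2, 'stand': 1, 'walk': 1}
print(state_durations(record, session_end)) # {'sit': 12.5, 'stand': 1.5, 'walk': 6.0}
```

The same record thus yields two quite different summaries: sit is the most frequent event and also the longest state here, but with other data the two rankings can easily diverge.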

Methodological implications of mutually exclusive and exhaustive taxonomies of behavior

Designing a behavioral taxonomy that provides the information you desire in the most efficient manner is a challenging task. Adding more categories will provide a more complete picture; reducing the number of categories may well reduce error. What is the right balance? Video and/or computer-aided observation often improves observer accuracy and precision, and thus their use will allow us to increase the number of acceptable categories. But is the added information overkill? Allowing for a process of trial and improvement as you design a taxonomy for your observation project is good advice. Two issues to consider in this development are the degree to which your behavioral categories are mutually exclusive and the degree to which they are exhaustive.

Two categories are mutually exclusive if the observed individual can be doing one (e.g., walk) or the other (e.g., run), but not both, at a given time. This example (walk vs. run), however, was chosen to indicate that work will be required to develop a definition that clearly distinguishes one from the other, that the definition should be one that others will understand, and that the definition will have important implications for what your observations will tell you. Should there be an intermediate category (e.g., jog)? Should a more extreme category be designated (e.g., sprint)?

Among the issues to be addressed in developing a behavioral taxonomy is what to do with categories of behavior that are not mutually exclusive (e.g., walk, chew gum). R. D. Ray and Delprato (1989) introduced the concept of descriptive measurement domains to separate categories into mutually exclusive sets. Any two categories that are not mutually exclusive may be placed into separate domains. Ideally, a domain involves a consistent focal point, such as postural versus oral activities in the walk/chew example above. Alternatively, one domain may focus on postural configurations, such as standing, walking, sitting, and so forth, while another focuses on verbal-social interaction events initiated by the same individual, such as instructing, flirting, kidding, threatening, and so forth.

We can, of course, always create combination categories and, thereby, force our independent categories to be mutually exclusive—for example, walk while chewing, walk while not chewing, not walk while chewing, not walk while not chewing. We have thereby created four mutually exclusive categories by combining the two independent categories. In the analysis phase, any multidomain comparison accomplishes exactly this kind of combinatorial description post hoc from concurrent or synchronized single-domain recordings (see Astley et al., 1991). The observer's task, however, has been made more difficult by requiring both a behavior category and a domain to be observed and recorded simultaneously, and thus, the coding has been made less reliable.
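The combinatorial description above can also be generated mechanically in the analysis phase. The following sketch (our illustration; the records are hypothetical) forms the post hoc cross product of two synchronized single-domain records, producing exactly the four mutually exclusive walk/chew combinations described in the text.

```python
from itertools import product

# Two measurement domains, each with mutually exclusive categories.
postural = ["walk", "not walk"]
oral = ["chew", "not chew"]

# Cross product: four mutually exclusive combination categories.
combined = [f"{p} while {o}" for p, o in product(postural, oral)]
print(combined)
# ['walk while chew', 'walk while not chew',
#  'not walk while chew', 'not walk while not chew']

# Post hoc combination of two synchronized single-domain records,
# one code per time unit in each domain:
postural_record = ["walk", "walk", "not walk"]
oral_record = ["chew", "not chew", "chew"]
combined_record = [f"{p} while {o}"
                   for p, o in zip(postural_record, oral_record)]
print(combined_record)
# ['walk while chew', 'walk while not chew', 'not walk while chew']
```

Doing this combination in software, after the fact, keeps each observer's task within a single domain, which is the point of the domain concept.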

A related issue is the extent to which the behavioral categories being used are exhaustive—that is, whether they will allow each and every moment in the behavior stream to be categorized. This issue arises only under conditions where a continuous record of all behaviors is the goal. For example, if our categories are stand versus walk, what code would we use for sit? There is always a default approach that makes your list exhaustive by adding a residual category, such as other. But such a category is defined only by exclusion and may well be a category that observers find difficult to use (see Wallace & Ross, 2006, for a far more elaborate discussion of this issue on the basis of what they call the "bucket" category). Nonexhaustive coding schemes decrease the difficulty of observation but disallow some analyses, as will be detailed below.


Issues attending interobserver agreement

Suen and Ary (1989) have presented a compelling case for applying much of psychometric theory, including reliability and validity, to observational data and analyses. On the other hand, Lee and Del Fabbro (2002), as well as Wallace and Ross (2006), have argued for a Bayesian approach and have detailed various concerns with the frequentist approach underlying psychometric strategies. Perhaps the most commonly used concepts underlying observational research are those of agreement, accuracy, and reliability. Suen and Ary argued that the theory of measurement is highly germane to understanding these issues. They reviewed mutually contradictory definitions for agreement versus reliability versus accuracy commonly found in the literature. For example, Bakeman and Gottman (1987) treated agreement as two observers agreeing between themselves and used reliability as a measure of how true a measure is. Alternatively, Martin and Bateson (1993) used reliability as an absence of random errors and accuracy as an absence of systematic errors. Hartmann and Wood (1992) based agreement on the degree to which raw scores match and reliability on the degree to which standard scores match. Their preferred measure was accuracy, which they asserted is a match between the observer and an external criterion. Finally, Cone (1982) aligned the term accuracy with validity but found differences among all three concepts of reliability, accuracy, and validity. Suen and Ary pointed out that most of these positions are mutually contradictory and, thus, reflect differences in theoretical origins.

It is far from our present purpose to offer a full discussion of, or resolution for, such discrepancies and differences of opinion. Neither is it our purpose to detail the many implied measurement tactics or their associated degrees of statistical confidence. These issues are far too complex to deal with here. Beyond that, new approaches and solutions continue to evolve. What the present section does offer is a simple classification of commonly used strategies and tactics, along with references to appropriate sources for elaborations of each. Our presentation of these issues is also an affirmation of Vollmer, Sloman, and St. Peter Pipkin's (2008) assertion that, especially in the monitoring of treatment integrity, there are critical and practical dimensions that make the use of some measure of IOA and/or data reliability truly imperative.

We will begin our overall referenced classification by adopting one convenient distinction offered by Suen and Ary (1989). They used IOA to label processes and indices for measuring the degree to which two or more observers agree with one another. They also suggested intraobserver reliability (IOR) as a term for estimating the consistency with which a single observer might record the same behavior in multiple instances. They offered arguments for markedly different measurement tactics for these two different issues and suggested that IOA is the less complicated reliability issue. Many, if not most, authors do not seem to agree, given their common application of the same indices for both IOA and IOR. The issue is sufficiently complex that Suen and Ary devoted a substantial portion of one chapter to detailing what they considered to be uniquely applicable techniques for assessing IOR.

IOA measures typically begin with code-matching techniques between observers. These can be as simple as Suen and Ary's (1989) smaller/larger index for comparing two frequency counts of the same behavioral category or may encompass a more broadly applicable percentage of agreement (Mitchell, 1979; Suen & Ary, 1989), sometimes called simple percent agreement (Bakeman & Gottman, 1997). Despite its common and continuing use (Mitchell, 1979), percent agreement has many critics (Bakeman & Gottman, 1997; Hartmann, 1977; Hartmann & Wood, 1992; Mitchell, 1979; Suen & Lee, 1985). On the other hand, it also has its defenders (Baer, 1977; House, Farber, & Nier, 1983). One commonly cited weakness is its failure to discount potential agreements that may result from chance (Bakeman & Gottman, 1997). The most common solution offered for this shortcoming is the use of Cohen's (1960) kappa coefficient (see also Bakeman & Gottman, 1997; Mitchell, 1979; Suen & Ary, 1989), although an approach based on occurrence and nonoccurrence agreement indices has also been used to adjust for chance agreements (Suen & Ary, 1989). There are also substantial arguments against using kappa because of its sensitivity to unequal base rates of the various categories (Grove, Andreasen, McDonald-Scott, Keller, & Shapiro, 1981; Guggenmoos-Holzmann, 1993; Spitznagel & Helzer, 1985).
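For readers who want the two most common indices in computable form, here is a minimal sketch of simple percent agreement and Cohen's (1960) kappa for two aligned records; the data are hypothetical, and the functions assume the two records have already been matched coding for coding.

```python
from collections import Counter

def percent_agreement(obs1, obs2):
    """Simple percent agreement over aligned coding-by-coding pairs."""
    agree = sum(a == b for a, b in zip(obs1, obs2))
    return agree / len(obs1)

def cohens_kappa(obs1, obs2):
    """Cohen's kappa: percent agreement corrected for chance agreement."""
    n = len(obs1)
    p_obs = percent_agreement(obs1, obs2)
    c1, c2 = Counter(obs1), Counter(obs2)
    # Chance agreement: sum over codes of the product of base rates.
    p_exp = sum((c1[code] / n) * (c2[code] / n)
                for code in c1.keys() | c2.keys())
    return (p_obs - p_exp) / (1 - p_exp)

obs1 = list("ABAABCAB")  # hypothetical observer 1
obs2 = list("ABABBCAB")  # hypothetical observer 2
print(percent_agreement(obs1, obs2))       # 0.875
print(round(cohens_kappa(obs1, obs2), 3))  # 0.795
```

The kappa value is always lower than raw percent agreement because the chance term p(exp) is subtracted from both numerator and denominator; kappa's sensitivity to unequal base rates, criticized in the citations above, enters through that same term.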

The many problems associated with code-by-code comparisons have led some authors to apply more classical psychometric approaches to measuring IOA. For example, Mitchell (1979) and Suen and Ary (1989) each reviewed the typical employment of various forms of correlational techniques, ranging from a simple Pearson r correlation applied to two observers to intraclass correlations, including split-half or alternate-form and even test–retest assessment techniques. Wallace and Ross (2006) argued that correlations are too generalized and suggested, instead, a category-by-category analysis based on the signal detection concepts of sensitivity versus specificity, as commonly used to describe receiver operating characteristic curves.

However, there is yet a third approach, which many assert is perhaps the best of all options—largely because it is so robust. It is called generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) and has been reviewed by Campbell (1976), Mitchell (1979), and Suen and Ary (1989). Bakeman and Gottman (1997) have called it a conceptual breakthrough, and Mitchell (1979) has suggested that the generalizability approach should be used to identify error sources in addition to those involving the observer. As Mitchell pointed out, Cronbach's approach is somewhat analogous to factorial analyses in experimental research, in that the analysis allows one to compute variance component contributions from multiple facets of the project. Thus, one might use generalizability coefficients to reflect variance contributions from subjects, categories, scorers, and even their interactions (Mitchell, 1979). This flexibility makes generalizability an appropriate means for assessing either IOA or IOR (Suen & Ary, 1989). Unfortunately, despite its power and appeal, it is beyond our present scope and purpose to detail the complexities of the generalizability approach, and interested readers should consult the reviews cited above for elaborations.

Hartmann (1977) has distinguished among agreement issues that involve concerns with consistencies in observation intervals, stability of IOA over time (i.e., observer drift), and stability across various situations and/or behavioral categories, as well as the reliability of the basic data themselves. Generalizability theory allows for the possibility of incorporating each of these issues as facets of a single analysis if one's research is designed to systematically account for each.

Beyond the alternatives

For our present purposes, the critical element in all of these alternatives is how the procedural taxonomy we are about to detail impacts one's options for applying any given technique, once one has decided which technique is most suited to the purpose of the research. In many cases, the observational procedure will preclude some analyses while supporting others, as we will discuss later. Most germane to the present discussion is the point made by Friman (2009) that an adequate observation procedure should allow for adequate assessment of at least some measure of IOA or IOR. This assertion implies that the selected procedure must provide the information necessary for that assessment. If all behavioral codes and their location in time are recorded (i.e., continuous coding), virtually any or all types of agreement calculation are possible. But most research is not able to accommodate this comprehensive procedure.

Consider a case where two observers record the following sequences (see Table 1) and we are to judge their agreement. In the table, observer 1 records A, A, B, A, C, B, A, B, C, A, and observer 2 records A, A, A, B, C, B, B, B, C, A. Without recording time, it will be difficult to resolve apparent discrepancies in alignment. Thus, for entry 3 in Table 1, where observer 1 records B and observer 2 records A, did observer 1 miss an observation (i.e., the third A recorded by observer 2)? And if the event was missed, how does such an omission error enter into the analysis, since no category is recorded? Alternatively, did observer 2 record a single event twice (thus counting an extra A event)? And if so, how is that coding handled, since it has nothing to be compared against? And if either of these is the case, how do we realign our remaining observations for comparison? If time is recorded, realignment may be easier to accomplish but will not resolve the code-matching problem. Thus, one must consider how contiguous in time and/or sequence one's observations must be to count them as being in agreement.
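The realignment ambiguity just described can be made visible with a sequence-alignment pass. The sketch below is purely illustrative: it uses Python's standard difflib to align the two records from Table 1, contrasting a naive position-by-position comparison (7/10 matches) with an alignment view that suggests where an omission or an extra coding may have occurred, rather than treating every downstream mismatch as an independent disagreement.

```python
from difflib import SequenceMatcher

obs1 = list("AABACBABCA")  # observer 1, Table 1
obs2 = list("AAABCBBBCA")  # observer 2, Table 1

# Naive position-by-position comparison.
matches = sum(a == b for a, b in zip(obs1, obs2))
print(f"positionwise agreement: {matches}/{len(obs1)}")  # 7/10

# Alignment view: where do the records insert, omit, or substitute codes?
for op, i1, i2, j1, j2 in SequenceMatcher(None, obs1, obs2).get_opcodes():
    print(op, "obs1:", "".join(obs1[i1:i2]), "obs2:", "".join(obs2[j1:j2]))
```

An alignment cannot decide which observer erred, but it does show that the disagreement structure depends on the alignment rule chosen, which is exactly why the matching rule must be reported.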

This last issue is rarely described in sufficient detail within research reports for readers to truly understand how researchers actually matched codings for the IOAs they report. Neither are most methodological textbooks or articles sufficiently clear in their procedural guidance to allow researchers to anticipate and solve these issues. Consider the following case in point, for which we have found only one reference (i.e., Bakeman & Gottman, 1997) that addresses the topic sufficiently to offer any understanding of the issues involved. Assume that a continuous observation procedure, soon to be detailed, is being used and that this procedure relies upon a mutually exclusive and exhaustive set of categories. Such a procedure would have each observer record both the category currently in progress and the start and stop times of that category. This, of course, preserves all possible information, including sequence and temporal contiguity, since the stop time for one category also defines the start time for the next.

Table 1  Hypothetical coding of the same behavioral events by two independent observers

  Behavioral Record
  Observer 1   Observer 2
  A            A
  A            A
  B            A
  A            B
  C            C
  B            B
  A            B
  B            B
  C            C
  A            A

But what if such a procedure uses video recordings for the coding of a given target participant, and the observer records start/stop times within the accuracy allowed by the frame rate of the video? Typically formatted NTSC video records/plays at approximately 30 frames per second (fps). Using frame-by-frame analysis, do we expect agreement to be exact to the actual frame (i.e., 1/30th of a second)? If not, what is our tolerance window that allows us to discount or ignore differences in frame specifications between two independent observer recordings? Bessel's famous personal equation defines individual differences as being unavoidable in detecting transitional events (Sanford, 1888). If we cannot expect an exact frame match for when each behavioral transition occurs, what might be a reasonable tolerance window within which any mismatch between observer A and observer B will be counted as a match?

To clarify just one of the implications of alternative answers to this question, consider Fig. 1a, b as alternative examples of continuous coding. These examples illustrate an event-defined (1a) versus state-defined (1b) approach to IOA measurement. The upper blue and aqua bars represent two different, temporally sequential actual behaviors being coded. The second bar is an observer's recording of these same events. Using a simple count of each behavioral event for measuring coding accuracy, and using actual versus coding as our determination of agreement (+) or disagreement (−), as illustrated in Fig. 1a, we would code/count as follows: blue versus blue (+), blue versus aqua (−), and aqua versus aqua (+). Using simple percent agreement, we have 2 agreements divided by 2 agreements plus 1 disagreement, or 3 total codings, for a ratio of 2/3, or 67% agreement on the behavioral counts being made.

But what if we wished to focus on the amount of time in agreement, as illustrated in Fig. 1b? Here, the actual agreement in behavioral states accounts for 14 of 17 total time units (30ths of a second). That calculation changes the agreement to 82% of the time. This major difference in percent agreement derives merely from a shift in comparison perspective. Which is correct? Furthermore, while we do not illustrate it in the example, this change in perspective impacts Cohen's kappa through kappa's correction for chance agreements (i.e., the subtracted chance, or p(exp), factor used in kappa calculations is .44 in the case of event count comparisons versus .49 in the case of time-in-agreement comparisons). Thus, perspective (event vs. time) matters.

Fig. 1  Two approaches to calculating agreement. Percentage of agreement between the actual occurrence of behavior and an observer's record of that behavior may be calculated using (a) a count of the number of instances of agreement and disagreement (showing 67% agreement) or (b) a measure of the amount of time when the observer's record and the actual behavior were in agreement (showing 82% of the 30-fps video frames as time in agreement)
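Both perspectives can be computed from the same pair of records. The following sketch uses hypothetical frame-by-frame data chosen to be consistent with the Fig. 1 example (17 frames, two behaviors, with the observer switching codes three frames early); the segment-overlap comparison reproduces the 2/3 event agreement, and the frame-by-frame comparison reproduces the 14/17 time in agreement.

```python
# Hypothetical per-frame records consistent with the Fig. 1 example:
# 17 video frames; the actual behavior switches from "blue" to "aqua"
# after frame 9, while the observer records the switch three frames early.
actual   = ["blue"] * 9 + ["aqua"] * 8
observed = ["blue"] * 6 + ["aqua"] * 11

# Time-in-agreement perspective: fraction of frames with matching codes.
time_agree = sum(a == o for a, o in zip(actual, observed)) / len(actual)
print(f"time in agreement: {time_agree:.0%}")  # 82%

def segments(frames):
    """Collapse a per-frame record into (code, start, end) runs."""
    runs, start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            runs.append((frames[start], start, i))
            start = i
    return runs

# Event perspective: every overlapping (actual segment, observed segment)
# pair is one comparison; it counts as an agreement if the codes match.
comparisons = [(a_code == o_code)
               for a_code, a0, a1 in segments(actual)
               for o_code, o0, o1 in segments(observed)
               if max(a0, o0) < min(a1, o1)]  # the segments overlap in time
print(f"event agreement: {sum(comparisons)}/{len(comparisons)}")  # 2/3
```

The same three frames of disagreement thus cost a full third of the event-based agreement but only 18% of the time-based agreement, which is the disparity the text describes.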

So which of these IOA calculation strategies should be used when a continuous coding procedure is being followed and two observer records are to be compared? And if we follow one rule in one procedure (e.g., simple frequency count) and another rule in an alternative procedure (e.g., continuous coding), what comparisons may be made between the procedural impacts on IOA? Unfortunately, while it may be relatively easy to discern whether the behavioral categories being used in any given research publication are treated as states or entities, it is nearly impossible to determine from the method sections of most publications whether behavioral counts or synchrony in total behavioral states are the basis for the reported IOA calculations!

We became acutely aware of the event-agreement versus time-in-agreement differences when programming these metrics for TTC. It is worth noting that, in most situations where one is training new observers, it is common to consider the developer who established a given taxonomy or coding scheme as the reference coder against whom trainees are compared for IOA. In such circumstances, the developer is considered to be an expert on the application of the taxonomy, and thus it is common to take his or her coding as "correct" and to define the trainee's agreement as accuracy (see Boykin & Nelson, 1981). Thus, we followed that tradition in TTC. This does not assert that the developer has established a gold standard for comparisons, but only that, in training, we must use some reference criterion. And in doing so, we deem coding accuracy the appropriate term for trainee agreement with an expert.

Interestingly, a study by Mudford, Martin, Hui, and Taylor (2009) was published well after our current TTC design decisions had been made and our considerations of time in agreement versus event agreement had been applied. Mudford and colleagues' study is one of the rare published comparisons of algorithmic impact on IOA, as well as agreement, of which we are aware. The authors compared 12 observers recording the same behavioral records using hand-held computer recorders. Low, medium, and high rate occurrences of both discrete-event and state-durational measures were compared for IOA, as calculated between observers, as well as for accuracy—that is, comparing against a criterion record. Three agreement/accuracy algorithms were compared: exact agreement, block-by-block agreement, and time window analysis. They also studied the relative impact of increasing tolerance windows, ranging from zero to ±5 s. The details of their findings are complex, but it suffices for our purpose to note that virtually all variables they investigated—type of behavioral definition, rates/durations, and algorithm used—had differential impacts on agreement/accuracy values!
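As one concrete reading of the time window idea, the sketch below (our illustration, not Mudford et al.'s published algorithm) matches two observers' event time stamps by greedily pairing each event from one record with the nearest unused event from the other and counting the pair as an agreement when the two stamps fall within the tolerance window.

```python
def time_window_agreements(times1, times2, tolerance):
    """Greedy one-to-one matching of event time stamps (in seconds):
    each event in times1 pairs with the nearest unused event in times2,
    and a pair counts as an agreement if it falls within the tolerance."""
    unused = sorted(times2)
    agreements = 0
    for t in sorted(times1):
        if not unused:
            break
        nearest = min(unused, key=lambda u: abs(u - t))
        if abs(nearest - t) <= tolerance:
            agreements += 1
            unused.remove(nearest)
    return agreements

# Hypothetical time stamps from two observers coding the same session.
obs1 = [2.1, 5.4, 9.0, 14.2, 20.5]
obs2 = [2.3, 5.2, 13.9, 20.9, 27.0]
for tol in (0.0, 0.5, 1.0, 5.0):
    n = time_window_agreements(obs1, obs2, tol)
    print(f"±{tol}s window: {n} agreements")
```

Running this shows agreement climbing as the window widens (here from 0 to 4 agreements), which makes tangible why the tolerance chosen must be reported alongside any time-window IOA.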

Perhaps the conclusion to draw for our present discussion is that an observation procedure should provide sufficient information to allow a convincing IOA to be performed according to extant standards. Unfortunately, sufficient information is not always supplied. For example, in their rather extensive analysis of the interobserver agreement calculation methods reported in 10 years of articles published in JABA, Mudford, Taylor, and Martin (2009) reported that there were sufficient differences in published descriptions of interobserver agreement algorithms as to require clarifying correspondence with many senior authors. It is important that sufficient detail be provided in publications to allow readers to determine their level of trust in the observations and to more accurately replicate the reported procedure. But that information clearly requires that researchers clarify the behavioral-unit versus temporal-unit foundations of their measurements. It is important that researchers be aware that standards and expectations change, both from journal to journal and across time. Thus, the present taxonomy, to which we now return, will attempt to identify not only observation and recording procedures, but the attendant IOA/IOR analysis potentials as well.

A taxonomy of procedures used for behavioral observation

Having reviewed significant issues that must be addressed regardless of the procedure used for observation, we remind the reader that we developed the present procedural taxonomy by selectively searching for exemplars for operations analysis. We actually conducted two literature reviews. The first was more expansive and focused exclusively on reported methods, where the only criterion for retention was the reported use of a direct observational procedure that added to our growing collection of such procedures. It was this collection that was operationally analyzed to build our procedural taxonomy (the references eventually used as examples of each distinct procedural operation are included in the Supplemental Materials for this article). The second literature review focused on a selected volume of two representative but disparate journals, to determine which, if any, of our predefined procedures were used within the publications contained in that volume. Those outcomes are reviewed in a subsequent section of this article (see "Is the Taxonomy Exhaustive"). As was noted, for our first review, we looked for as many variations of systematic coding procedures as we could find and inductively derived the two common dimensions (i.e., behavior and time) from all exemplars found. We then proceeded to define a taxonomy on the basis of intersections of the distinct subclassifications of those dimensions that emerged (i.e., cells in a procedural matrix). Behavior and time are therefore the two defining dimensions of the taxonomy illustrated in Table 2. It is worth noting that this taxonomy allows us to describe all studies using direct observation of behavior that we encountered in our literature search. Below, we describe each of these procedures and offer one or more exemplary studies for those we could document. Some cells reflect a use of behavior and time in a manner that is not a logically feasible procedure; these cells are marked with a "−" symbol in Table 2. Other cells, marked with a "+" symbol, reflect a use of behavior and time in a manner that is logically possible, but for which no exemplar studies were found in our operations analysis literature search.

Table 2 may be used as a specifications guide for programmers developing algorithms and interface designs for computer-assisted data gathering like those we designed into TTC. As such, each cell in Table 2 may be expanded in detail (see Supplemental Materials, Tables 1–16) to become a user's help system that clarifies for research administrators and/or trainees how each selected procedure works if implemented, or would work if not, within such a system. Thus, the specifics of how time and behavior are used to define each procedure represented by the cells in Table 2, along with acceptable behavior assessment measures, IOA implications, and cited research examples that illustrate research based on each procedure, are included in our Supplemental Materials tables (SMTs). Each SMT will be referenced where appropriate for readers interested in the corresponding operations analysis details and examples.

To also illustrate the subtle but important issues that arise in computer interface design, the two concrete in situ examples currently implemented within TTC are presented in a subsequent section. These examples are included because they give concrete illustrations of functional data collection algorithms and of interface design differences and their implications. Such a reference to software development emphasizes the requirement, imposed by all programming tasks, that one develop a complete verbal description of each procedure implemented prior to creating a software model of that procedure. Both interface and algorithmic expression provide important evaluative feedback on the precision and completeness of one's procedural descriptions. It was this requirement that guided us in assessing the completeness of our operations analysis of the literature and our development of the present taxonomy.


Approaches to behavioral recording

Each time a behavioral observation is recorded, a type of behavioral measurement and a type of time measurement are required. There are several alternative types of behavioral measurement, as illustrated by the columns of Table 2. The first type is a simple dichotomous judgment resulting in a one/zero or yes/no entry to indicate whether the behavior did or did not occur, or to make some social/functional judgment of its occurrence (e.g., playing, fighting, or correct/incorrect responding). The second type records only the dominant behavior that occurred within the complete observation period (i.e., the behavior that occupied the greatest portion of the observation period). The third type involves recording any behavior that occurred throughout the complete observation period. The fourth type designates the total number or total duration of behavioral events of each type that occur within a complete observation period, or the latencies until a behavior occurs following some environmental event. In the fifth type, the sequence of behaviors as they occur during a complete observation period is recorded.

Observational records must also use one of three alternative approaches to incorporating time into the record. Observers may record only the number of behavioral events (or the duration of these events) within the total observation period (Row 1 of Table 2), may segment the observation period and make a separate recording for each segment (Rows 2, 3, and 4 of Table 2), or may use continuous recording, noting the time of each behavioral event within the overall observation period (Row 5 of Table 2). We thus elaborate the options for using these types of time measurement and their impact on the behavioral observation record.

Table 2  Types of behavior and time recording that define various types of behavior observation procedures

Each row is a type of time recording; each row lists its cells under the five types of behavior recording (Yes/No; Dominant; Whole-Interval; Frequency or Duration; Sequence).

Row 1. During the observing session:
  Yes/No: 1. Session: Yes/no occurrence
  Dominant: 2. Session: Dominant behavior
  Whole-Interval: +
  Frequency or Duration: 3. Session: Behavior frequency/duration
  Sequence: 4. Session: Behavior sequence

Row 2. At the start or end of each observing interval/trial within the observing session:
  Yes/No: 5. Momentary time sampling: Y/N
  Dominant: +
  Whole-Interval: −
  Frequency or Duration: 6. Momentary time sampling: Frequency
  Sequence: −

Row 3. Within each observing interval/trial within the observing session:
  Yes/No: 7. Complete interval time sampling: Y/N (often called partial interval sampling)
  Dominant: 8. Complete interval time sampling: Dominant behavior
  Whole-Interval: 9. Complete interval time sampling: Whole-interval behavior
  Frequency or Duration: 10. Complete interval time sampling: Frequency/duration
  Sequence: +

Row 4. Within each observing interval/trial and recorded during the following recording interval:
  Yes/No: +
  Dominant: 11. Alternating interval time sampling: Dominant behavior
  Whole-Interval: +
  Frequency or Duration: 12. Alternating interval time sampling: Frequency
  Sequence: 13. Alternating interval time sampling: Sequence

Row 5. Time stamp recording of events within the observing session:
  Yes/No: 14. Event time stamp recording: Time of momentary event occurrence
  Dominant: −
  Whole-Interval: −
  Frequency or Duration: 15. Partial continuous coding: Start and end times for nonexhaustive behavioral events
  Sequence: 16. Continuous coding: Start times for exhaustive behavioral events (end times derived)

Note. Cells marked with a "−" symbol reflect a use of behavior and time in a manner that is not a logically feasible procedure. Cells marked with a "+" symbol reflect a use of behavior and time in a manner that is logically possible, but for which no exemplar studies were found in our literature search.
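Because Table 2 is offered as a specifications guide for programmers, it may help to see its two dimensions expressed as a data structure. The sketch below is our own hypothetical rendering, not TTC's actual implementation: each procedure is a cell keyed by a time-recording row and a behavior-recording column.

```python
from enum import Enum

class TimeRecording(Enum):      # rows of Table 2
    SESSION = 1
    MOMENTARY = 2
    COMPLETE_INTERVAL = 3
    ALTERNATING_INTERVAL = 4
    TIME_STAMP = 5

class BehaviorRecording(Enum):  # columns of Table 2
    YES_NO = 1
    DOMINANT = 2
    WHOLE_INTERVAL = 3
    FREQUENCY_OR_DURATION = 4
    SEQUENCE = 5

# A few cells from Table 2; a full mapping would enumerate all 16
# procedures and mark the remaining cells as "+" or "-".
PROCEDURES = {
    (TimeRecording.SESSION, BehaviorRecording.YES_NO):
        "1. Session: Yes/no occurrence",
    (TimeRecording.MOMENTARY, BehaviorRecording.FREQUENCY_OR_DURATION):
        "6. Momentary time sampling: Frequency",
    (TimeRecording.TIME_STAMP, BehaviorRecording.SEQUENCE):
        "16. Continuous coding",
}

print(PROCEDURES[(TimeRecording.TIME_STAMP, BehaviorRecording.SEQUENCE)])
```

Keying procedures by the (row, column) pair makes the taxonomy's claim concrete: choosing a time dimension and a behavior dimension fully determines which procedure, interface, and analysis options are available.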

Time of occurrence is ignored in the record. In the first option, time is not recorded but is used only to define the beginning and end of the total observation period. This is the simplest use of time, since it references only the session for observing and does so only by the frequency and duration of sessions (e.g., four 20-min sessions) or the start/end time of each. Within the family of procedures using only such a time reference are those that seek to record whether or not one or more behaviors of interest occurred at all during the session (see Fig. 2, Procedure 1, with more precisely defined operations and examples in SMT 1, Session: Yes/No Occurrence), which behavior dominated the session relative to all alternatives (see Fig. 2, Procedure 2, and expansions in SMT 2, Session: Dominant), the frequency or total time spent in one or more behaviors of interest during the session (see Fig. 2, Procedure 3, and expansions in SMT 3, Session: Behavior Frequency/Duration), or the exact sequence of behavior(s) during a given session (see Fig. 2, Procedure 4, and expansions in SMT 4, Session: Behavior Sequence). While it may be logically possible to record behaviors that fill the entire observation period (e.g., slept throughout the night), we did not find any examples using such a procedure and, thus, have not enumerated it.

Segmenting observation periods by time intervals or trials. Each observation period may be divided into equal time intervals (interval recording) or into separate trials (trial-by-trial recording). By creating a separate record for each interval or trial within the overall observation period, changes in behavior can be tracked across that period. A description of each approach may be helpful.

For interval recording, a signal marks a time for the observer to observe, record, or both. Thus, behavior may be observed at some marker separating intervals (momentary time sampling; Row 2 of Table 2), may be observed throughout each interval and recorded at the end (complete interval time sampling; Row 3 of Table 2), or the intervals may be further divided into alternating observation and recording subintervals (alternating interval time sampling, also known as intermittent interval time sampling; Row 4 of Table 2).

For trial-by-trial recording, a trial consists of a cycle of events defined by environmental conditions and the participant's behavior. For example, in a particular experiment, an intertrial interval is followed by presenting a question to the participant, who, in turn, produces or selects his or her answer to end the trial and, thereby, start the next intertrial interval. As with interval recording, in trial-by-trial procedures, behavior may be observed at the end of the trial or throughout the trial, or subintervals may be specified for observing within the trial. There seems to be no common vocabulary for labeling these distinctions.

The next three rows in Table 2 share the common feature of dividing the observation period into intervals or trials for successive recordings. What distinguishes each of these rows is whether observations are made of the behavior occurring at the beginning/end of each interval or during each interval and, if the latter, whether the observation interval is concurrent with recording or offers a separate interval for recording. In Row 2, the family of procedures includes only two found examples.

The first (see Fig. 2, Procedure 5, and expansions in SMT 5, Momentary Time Sampling: Y/N) focuses upon recording only whether or not behaviors of interest are occurring at the interval reference (start or end), while the second (see Fig. 2, Procedure 6, and expansions in SMT 6, Momentary Time Sampling: Frequency) records the number of instances, items, or individuals that meet the observation criterion at the interval reference.

Of the interval options, only complete interval time sampling (see Table 2, Row 3) allows for a full accounting of all the time within the overall observation period. This approach provides records of frequencies, durations, or cumulative durations within each interval. Thus, Row 3 depicts a family of procedures that share a use of intervals of time to prompt recording, but the behaviors that are recorded may occur at any time within the interval, and not necessarily only at its termination, as do those in Row 2. These procedures may differ only in whether they involve recording a yes/no occurrence for behaviors at any time in the interval (see Fig. 2, Procedure 7, and expansions in SMT 7, Complete Interval Time Sampling: Y/N), recording the behavior that meets criteria for dominance in the interval (see Fig. 2, Procedure 8, and expansions in SMT 8, Complete Interval Time Sampling: Dominant Behavior), recording behavior that consumed the entire interval (see Fig. 2, Procedure 9, and expansions in SMT 9, Complete Interval Time Sampling: Whole-Interval Behavior), or recording frequencies/durations for any behavior occurring within the interval (see Fig. 2, Procedure 10, and expansions in SMT 10, Complete Interval Time Sampling: Frequency/Duration).
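The relation among these interval procedures is easiest to see when each is derived from the same underlying record. The following sketch is a hypothetical illustration: given a per-second record of a single target behavior, it derives momentary time sampling (Row 2), partial-interval (yes/no) sampling (Procedure 7), and whole-interval sampling (Procedure 9) records for 5-s intervals.

```python
# Hypothetical per-second record of one target behavior:
# True = behavior occurring during that second.
stream = [True, True, False, False, False,   # interval 1
          False, True, True, True, True,     # interval 2
          True, True, True, True, True]      # interval 3
INTERVAL = 5  # seconds per observation interval

intervals = [stream[i:i + INTERVAL] for i in range(0, len(stream), INTERVAL)]

# Momentary time sampling (Row 2): is the behavior occurring at the
# final moment of each interval?
momentary = [chunk[-1] for chunk in intervals]

# Partial-interval sampling (Procedure 7): did the behavior occur at
# any time within the interval?
partial = [any(chunk) for chunk in intervals]

# Whole-interval sampling (Procedure 9): did the behavior fill the
# entire interval?
whole = [all(chunk) for chunk in intervals]

print(momentary)  # [False, True, True]
print(partial)    # [True, True, True]
print(whole)      # [False, False, True]
```

The three derived records disagree for the same underlying behavior, which previews why discontinuous procedures can over- or underestimate occurrence relative to continuous recording.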

As with the other families, there is a logical alternative procedure for which no example was found and which, thus, is not enumerated. This would involve recording, at the end of each interval, the behavior sequences that occurred across that interval (see Table 2, Row 3, Column 5). Technically, this would be very difficult to accomplish, since it implies observing an ongoing sequence while recording a previous sequence.

It should be noted that each measure described above hasimplications for data analysis possibilities. Frequencyrecording allows analysis of frequency, relative frequency,and rate (frequency/length-of-interval or frequency/dura-

Fig. 2 Distribution of use of each type of direct observationalprocedures published in a single volume of two research journals.All research reports in one volume of the Journal of Applied BehaviorAnalysis (Vol. 40, 2007) and of Child Development (Vol. 77, 2006)were reviewed. The ordinate shows the number of studies foundwithin these research articles that used each of the 16 proceduresoutlined in Table 2. Categorization of procedure was carried out by theauthors, and agreement was achieved through a procedures-operationsanalysis of stated methods and discussion. The number of proceduresused is greater than the number of articles in the volumes that usedbehavioral observation, since a single research report often containedmultiple studies. The number of studies utilizing behavioral observa-tion procedures that could not be categorized as one of the 16procedures in Table 2 is shown in the rightmost column. Those sixstudies whose methods could not be categorized reported examplephrases from field notes or post hoc content analysis of field notesand, thus, were considered not to be using true coding procedures. Allstudies that actually used systematic coding of observations could becategorized

Behav Res (2011) 43:616–634 625

Page 11: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

tion-of-trial). Recording each event’s duration by categoryallows an extrapolation of these same frequency analyses,as well as descriptive statistical summaries (e.g., mean andstandard deviations for each behavior’s temporal dynamics,including total time spent in each behavioral state).Cumulative durations allow only the latter analysis (i.e.,total time spent in each behavioral state). Since eachmeasure accounts for all time in each interval, intervalsmay be used as an integral part of the analysis of the entireobservation period. Thus, complete interval time samplingsallow researchers to use each interval or trial as a windowfor successive time series analyses within a total observa-tion period, as well as allowing an overall descriptiveaccounting of the total observation period.

Row 4 depicts a family of procedures that share the useof alternating successions of observation interval followedby recording interval. We found this use of time recordingto exist for only three types of behavioral recording. Oneprocedure involves recording, during the recording interval,any behavior meeting the criteria for dominance in theobservation interval (see Fig. 2, Procedure 11 and expan-sions in SMT 11, Alternating Interval Time Sampling:Dominant Behavior). The second procedure uses theseparate recording interval to record only the behavioralfrequencies occurring during the preceding observationinterval (see Fig. 2, Procedure 12 and expansions in SMT12, Alternating Interval Time Sampling: Frequency). Thethird procedure involves using alternating recording inter-vals to record the sequence of behaviors occurring duringthe immediately preceding observation interval (see Fig. 2,Procedure 13 and expansions in SMT 13, AlternatingInterval Time Sampling: Sequence). Other alternatinginterval procedures logically seem feasible (see Table 2,Row 4, Columns 1 and 3), but we found no examplepublications using either procedure.

All alternating interval procedures yield incompleteaccounts and do not allow measurement of true behavioralrates, durations, or accumulated times. However, by takinginto account the proportion of the session devoted toobserving, one can estimate true rates, durations, andcumulative times. Estimates of relative frequency, relativeduration, and relative cumulative duration by behavior arepossible even without considering the proportion of theinterval devoted to observing. Thus, if your taxonomyincludes several behavioral and/or environmental categoriesand/or if the recording is difficult to complete (e.g.,recording the sequence), intermittent interval time samplingmakes recordings possible in real-time circumstances andmay be preferred despite the shortcomings of imposedestimates. On the other hand, if the recording is quitesimple (one/zero, frequency with a small taxonomy,dominant behavior, whole-interval recording), there is lesspressure to use intermittent interval time sampling.

For trial-by-trial procedures, behavioral events at the endof the trial or within the trial are used to prompt recording.The same behavioral data described for interval proceduresmay be created. In trial-by-trial procedures, it is rare to finda recording of either the temporal latency of response or theduration of each trial. Thorndike’s (1898, 1899) use ofescape latency is an exception that exemplifies thetechnique in use. However, since the duration of trials istypically defined by a behavioral state (e.g., question hasbeen answered), rather than by the passage of time, theability to determine the true rate of a category of behaviortypically is lost in trial-by-trial approaches. When latencyor duration is used, it clearly incorporates time as afundamental dimension of the trial-by-trial procedure. Andwhen records do include durations, overall rate, andperhaps overall duration, measures may be available. Thus,if time and frequency are both recorded for each phase ofeach trial, rate of behavior may be determined (e.g., as isdone in the operant study of chained and concurrentchained schedules of reinforcement; see Kelleher & Gollub,1962, or Herrnstein, 1961).

Continuous observations When observing continuouslywithin any given observation session, time can be broughtinto the recording by time stamping each behavioraloccurrence that is observed across the entire observationperiod (Row 5 of Table 2.) We should emphasize that suchtime stamping may be done in one of two ways. In the firstoption, one time marker is made for each recorded eventdesignating the time when a behavioral event occurred (seeFig. 2, Procedure 14 and expansions in SMT 14, Procedure14—Event Time Stamp Recording). Alternatively, one canrecord both a start and a stop time for each behavioral state(see Fig. 2, Procedure 15 and expansions in SMT 15,Procedure 15, where the behavioral categories are notconsidered to be exhaustive), or only the time when abehavioral state started (see Fig. 2, Procedure 16 andexpansions in SMT 16, Procedure 16, Continuous Coding)and subsequently use the end of one behavior as alsoindicating the time when the next behavior starts, incircumstances where the behavioral categories are consid-ered to be exhaustive. Thus, within the alternatives forcontinuous observation, it is continuous coding that allowsthe most complete record of any of the procedures, so thatfrequency, relative frequency, rate, and sequence of behaviorare all directly available from the record, as is the timebetween each occurrence of a given category (interresponsetime). If both the start and end of each behavioral event arerecorded or derived (Procedures 15 and 16), the duration ofeach behavior is also available.

It should finally be noted that time stamp recordingapproaches are relatively challenging to the observer,although video and/or computer-aided observation tools

626 Behav Res (2011) 43:616–634

Page 12: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

often allow observers to meet this challenge. Such aidedobservation allows for more complex recordings, including,for example, the use of coding schemes with morenumerous behavioral categories, multiple behavioraldomains, and even the inclusion of environmental con-ditions—a topic to which we now turn.

Approaches for including the environment as partof the record

Most researchers would agree that knowing the environ-mental conditions concurrent with behavior provides acontext for the description of behavior that is as importantas knowing the behavior itself. Thus, the ideal behavioralrecord would include a concurrent description of theenvironmental conditions associated with behavior. Insocial situations, the behavior of others would be treatedas an environmental condition. In some cases, the environ-mental description can be viewed as being static and, thus,described for the whole observation period (i.e., iscontextual). In other cases, environmental events must beconsidered to be dynamically changing along with thebehavior (see Astley et al., 1991; Hawkins, Sharpe, & Ray,1994). The most complete record would include time-stamped environmental, as well as behavioral, events orstates. Optionally, if an interval or a trial-by-trial approachis being used, both environmental and behavioral eventsmay be recorded by interval or by trial to locate themwithin the observation period.

The behavior analysis tradition is one that stronglyfavors mixing environmental with behavioral descriptions(Skinner, 1938). This approach often takes what is called anABC approach, with A indicating immediately antecedentenvironmental conditions, B indicating the behavior beingobserved, and C indicating conditions that follow anddepend on the behavior (i.e., response-contingent orconsequent events). All of these three factors are consideredto be necessary elements in a behavioral description (calleda three term contingency in Holland & Skinner, 1961). Tobe complete, then, a continuous behavioral record wouldneed to categorize antecedent conditions, behavioral eventsor states, and consequent outcomes. And an interval or trial-by-trial record would need to categorize the antecedentconditions, the behavioral events or states, and theconsequent outcome for each recording (specifics of therecord would depend on which of the items noted abovewas being used).

A simple approach to environmental–behavioral analysishas been summarized by Suen and Ary (1989) in theirdiscussion of interaction matrices (pp. 25–28). Suchmatrices traditionally have been limited by the fact thatthey tend to include only consequences and, thus, ignore

antecedents. They also are not typically dynamic (i.e., donot account for changes in these conditions across time orcircumstance) but, rather, summarize events for an entirerecording session only. On the other hand, R. D. Ray andRay (1976) used a more comprehensive accounting thatthey referred to as contingency analysis. Observing childrenin the United States and the out-islands Bahamas, theycompared teacher–student interactions to describe differ-ences in contingency management between the twocultures. Their records included antecedent, behavior, andconsequence sequences, using interval sampling techni-ques. As such, their analyses were limited to staticcomparisons between cultures and across varying contexts,such as school room interactions, playground interactions,and home environment interactions. R. D. Ray (1977)added to this approach by using a continuous recording ofexperimenters’ use of subject-initiated trial-by-trial discrim-ination training procedures used by Soviet researchers. Rayalso added a new analysis that tracked the temporaldynamics of antecedent, behavior, and consequence cate-gories as they changed across time.

If environmental as well as behavioral events are to beincluded in the record, it will by definition be required thatthe chosen observation approach be able to include eventsthat are not mutually exclusive and, therefore, should beconsidered as being in different domains. Thus, the issuesraised in the earlier section on Methodological Implicationsof Mutually Exclusive and Exhaustive Taxonomies ofBehavior are highly relevant.

Is the taxonomy exhaustive?

As has been noted, we developed our taxonomy byselectively searching for published exemplars to guide ouroperations analysis. That is, we searched the literature foras many published variations of systematic coding proce-dures as we could practically find. We inductively derivedthe common dimensions (i.e., behavior and time) from themethod descriptions in all exemplars found and proceededto define a taxonomy based on intersections of distinctsubclassifications of those dimensions that emerged (i.e.,cells in the matrix). What remained to be accomplished wasa testing of the taxonomy with a sufficiently broad sampleto evaluate how robust the taxonomy might be. We wantedto confirm that, given a generally representative sample ofpublished research articles, we could categorize all methodsusing behavioral observation with our existing taxonomy. Asecondary value in such an effort is a description of howsuch procedures are relatively applied within such a sample.

For our sample, we chose to survey one full year ofpublications in two divergent journals that have a stronghistory of publishing observational research. Thus, we

Behav Res (2011) 43:616–634 627

Page 13: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

arbitrarily selected the most recent (at the time of ourresearch) volumes of Child Development (CD, 2006, Vol.77) and the Journal of Applied Behavior Analysis (JABA,2007, Vol. 40). Articles were first surveyed for discernableuse of direct observational methods for data collection.Observational operations reported in the method sections ofsuch articles were then analyzed and compared with thecriteria described within the procedural cells of Tables 1–16of the Supplemental Materials. All matches were thuscategorized by the appropriate matching procedure. As theheader of this section suggests, we were interested inverifying that our taxonomy was exhaustive in its catego-ries, but we were also interested in confirming thatindependent observers could reliably classify researchprocedures using our taxonomy’s operational definitions.As such, any and all questions as to appropriate categori-zation were resolved through discussion between theauthors. This ensured that we would determine anypotential need for expanding the taxonomy or clarifyingoperational definitions, but it also resulted in 100%interobserver agreement. Anyone seeking to verify ourclassification of the procedures in these volumes could usea similar approach.

All but 6 of the 37 observational procedures found forarticles in Vol. 77 (2006) of CD could be classified with thetaxonomy. These six exceptions reported only examplephrases from field notes or frequencies from post hoccontent analysis of field notes, which is a differentmethodology than systematic coding, and thus these studieswere excluded from our review. Studies using question-naires and surveys were also excluded from our review,since these also represent a different methodology based onself-reports. All 78 observational procedures found in Vol.40 (2007) of JABA could be readily classified according toour taxonomy of procedures summarized in Table 2. Thus,our taxonomy was sufficiently robust to account forvirtually all systematic (coding-based) observation studiesfound in our sample.

The utility of any taxonomy depends on the data itallows one to classify and describe. Ours is no exception.As such, our sample allows one not only to categorize anyand all direct observational procedures, but also todetermine the comparative or relative frequency of use forspecific procedures within the subdisciplines represented byour journal selections. For JABA, a high proportion of theempirical articles published in this volume used dataobtained from direct observations by researchers (91%; 68of 75). For the CD journal, a much smaller proportion ofempirical articles used observational procedures (27%; 34of 128). Note that the number of procedures found isgreater than the number of articles that used systematicobservation, since some articles reported use of more thanone procedure in a given study.

Our data from JABA compare with 76% of the 293empirical articles published in JABA from 1968 through1975 reported by Kelly (1977). Thus, observational datahave been used and continue to be used by the behavioranalysis research community, and their use appears to beincreasing. Our data from CD compare with 19% of 175empirical articles published in the 1976 volume of CD(Mitchell, 1979). Thus, child development researcherscontinue to use observational data in about a fifth to aquarter of the articles in this major journal for their field.

Figure 2 shows that the distribution of alternativeprocedures published in the two journals differ in distinctways. For JABA, Procedures 3, 5, 7, and 10 from summaryTable 2 account for most observational procedures reported.Thus Session: Behavior Frequency/Duration (Procedure 3),Momentary Time Sampling: Behavior Yes/No Occurrence(Procedure 5), and Complete Interval Time Sampling:Behavior Yes/No Occurrence (Procedure 7) or CompleteInterval Time Sampling: Behavior Frequency/Duration (Pro-cedure 10) account for approximately 91% (71 of 78) of thealternative methods reported in 68 different articles (i.e.,some articles do not report using observational methods,while others report multiple methods). For CD, however, themost common method was Complete Interval Time Sam-pling: Dominant Behavior (Procedure 8, which accounts for12 of 37 reported procedures, or approximately 32%). Thisprocedure had only sparse use in JABA (approximately 4%).Whole Session: Frequency/Duration (Procedure 3), Com-plete Interval Time Sampling: Behavior Yes/No Occurrence(Procedure 7), or Complete Interval Time Sampling: Behav-ior Frequency/Duration (Procedure 10) were also used fairlyfrequently in CD articles (combined, they account for 12 of37, or approximately 32%, but less frequently than in JABA,where they account for 53/78, or 68%). However, Momen-tary Time Sampling: Behavior Yes/No Occurrence (Proce-dure 5) had only one use in our CD sample.

In neither journal were any variations of the continuousrecording method (i.e, the variations of Procedures 14–16)much used—only 3 (4%) for JABA and 2 (5%) for CD.These procedures appear to preserve the most informationfrom the observational records and, therefore, seemdesirable. They also are the ones most aided by video/computer-assisted observational recording, so perhaps theincreasing use of technology will facilitate the use of theseprocedures in future research.

How the taxonomy guides the modeling of proceduresand designs for user interface in an expert trainingsystem

As was noted earlier, our taxonomy was primarily developedto guide the development of software models of commonly

628 Behav Res (2011) 43:616–634

Page 14: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

used procedures for incorporation into the TTC expert trainingsystem. On the basis of the assumptions that learning todiscriminate occurrences of specific behavior categories isubiquitous across all coding procedures, the first modeledprocedure incorporated for trainees in TTC was Procedure 3in Table 2, which involves counting each behavioralcategory’s frequency of occurrence within a given observa-tional session. In this section, we illustrate how the specificaspects of this procedure are modeled in TTC via input andother user interface characteristics, including data display. Wefollow with a similar detailing of TTC’s modeling of thecontinuous coding procedure (Procedure 16 in Table 2),since that is the procedure used by the experts who managethe development and administration of training programsdelivered through TTC.

Session: behavioral frequency implementation

The specific user interface for the collection of categoricalfrequency counts, as well as prompts and feedback used forvarious adaptive levels of training services within TTC, isillustrated in Fig. 3a–d. After being alerted by a prompting“New Coding Loaded” alert dialog that signals a session’sbeginning (Fig. 3a), a sampling of a trainee’s subsequentexperiences in TTC is illustrated by the sequence of screenillustrations in Fig. 3b, c, and d. These illustrate the mostheavily prompted training level—Coding Skill Level 1(CSL1)—of the six alternative training skill levels thatdefine the adaptive stages of TTC. Subsequent levelsprogressively fade the use of the illustrated prompts andfeedback until, at Coding Skill Level 6 (CSL6), no promptsappear at all, and the only feedback occurs when a codingerror has occurred and needs to be edited. Administratorshave the option of activating a 7th Coding Skill Level thatfunctions as a “probe” to test or certify the trainee’s codingskills without any feedback whatsoever. Upon satisfying anadjustable accuracy criterion for certification across somenumber of codings, training on that given program isannounced to be complete. Failure to meet such criteria canbring the trainee back to lower CSLs for further training asrequired.

Throughout the time that a countable/recordable behavioralevent is occurring at training CSL1, the video depicting thatevent in TTC is framed in yellow as a prompt to aid inteaching trainees that a recordable event is in progress (seeFig. 3b). As Fig. 3b also illustrates, the multifacetedprompting of CSL1 includes showing beneath the videoframe the codable event’s categorical abbreviation, theassociated keypad entry number, and the full name of thebehavioral category being viewed.

When a behavioral category is functionally defined and,thus, is best coded only after the conclusion of the codableevent, a short tolerance period begins, during which the

video may be paused to allow a coding to be entered.Administrators may set this tolerance window to anydesired duration (a duration that is recommended to becomeprogressively shorter as training prompts are faded acrosssuccessively higher adaptive skill levels in TTC as traineesdevelop more sophisticated coding skills). During post-category tolerance periods, CSL1 training shows the traineea green frame surrounding the video display to indicate thata coding period is still in progress (see Fig. 3c), andinstructions beneath the video remind them that they maypause the video play to code by using the space bar. InCSL1 training, if no coding entry has occurred by the timethe tolerance window elapses, a red frame appears aroundthe video display area, and video play automatically stopsto supply a feedback message. This message informs thetrainee that a coding error has occurred and must becorrected (along with instructions as to how such acorrection is made; see Fig. 3d).

Also illustrated in each graphic of Fig. 3 is the data entryand accumulating count panel placed just to the right of thevideo display. This panel indexes each frequency count anytime a category is mouse-clicked or a corresponding keypadentry (denoted by bracketed numbers) is made. Because thecoding procedure being trained in this expression of TTC isSession: Behavioral Frequency (Procedure 3 in Table 2),coding accuracy is determined only by event-based criteriafor all time series and composite total-session IOAcalculations, since time in agreement can be used onlywith state-based categorical coding.

Continuous coding implementation

As was just discussed regarding TTC’s modeling ofProcedure 3, our overall procedures taxonomy was devel-oped to guide the TTC modeling of any or all observation/recording procedures. In actuality, the very first proceduremodeled within TTC was Continuous Coding (Procedure16 in Table 2). This is because the trainer’s (as opposed tothe trainee’s) codings that generate all expert or referencefiles used for training in TTC rely upon mutually exclusiveand exhaustive behavioral taxonomies and recordings ofstart/stop times for all behavioral occurrences. This ensuresthat every frame of video used for training in TTC has oneand only one corresponding behavioral state available forreference against a trainee’s coding input, regardless ofwhen that input is recorded. Thus, in TTC, if training reliesupon nonexhaustive taxonomies, an other bucket categorymust be used in the expert’s coding to identify allnonrelevant events. Importantly, administrators of alltraining programs to be experienced by trainees in TTChave the option of limiting training exposures to anynonexhaustive subset of the parent (and exhaustive)reference taxonomy for actual training. Any disabled

Behav Res (2011) 43:616–634 629

Page 15: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

categories in the exhaustive parent list combine to become anot-to-be-coded category that is prompted for purposes oftrainee guidance, feedback, and accuracy calculations. Asthis label implies, trainees do not have to learn to activelycode such excluded noncoded events (but by implication,trainees are passively doing so eventually by not enteringentries for such events).

The specific interface for an expert’s Continuous Codingin TTC is instructive as to the alternative interface andalgorithm modeling implications of our alternative proce-dural categories in the taxonomy. This Continuous Codinginterface is illustrated in Fig. 4a–c. It includes the video

display, a coding panel allowing up to 10 categoryabbreviations and keypad equivalents, and a data table/field with three columns allowing for start time, stop time,and associated behavioral category recording accumula-tions. Data entries accumulate as successive lines withinthis table/field. Note that TTC records video times from anarbitrary start time designated as zero, rather than using anabsolute true time/date stamp, as might be encoded within avideo source. Thus, for training purposes, it is sufficientsimply to mark elapsed time from the start of the videobeing coded, given that coding of time-elapsed data is theonly goal of the TTC system.

Fig. 3 Example screens showing the Session: Behavioral Frequencycounting procedure (Procedure 3 in Table 2) as modeled in the Train-To-Code software training system. The user interface, prompts, andfeedback for this procedure are illustrated. a The "start coding" alertdialogue and general screen layout as seen by a trainee. b The yellowsurround added to the video window when one of the behaviors to becoded has begun. Prior coding counts by category that were enteredby the trainee are shown in the panel on the right of the video display.

The current time in the video is shown below the video window. c Thegreen surround that is added to the video window when the behaviorto be coded ends, but during the time a trainee may still stop the videoand enter a code (a "tolerance window" set by the trainer). d The redsurround of the video window signifying the error of allowing thetolerance window to elapse without coding. The video is stopped atthis point, and the trainee is informed that he or she must re-view andcode the past behavior before proceeding with further training

630 Behav Res (2011) 43:616–634

Page 16: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

Fig. 4 Example screens illustrat-ing the Continuous Coding pro-cedure (Procedure 16 in Table 2)as modeled by the Train-To-Codesoftware training system. Theuser interface and editing optionsused by an expert while they arecoding the video are shown. Theexample screens illustrate stepsfrom the editing of an expertcoding of a video that depicts atrial-by-trial teaching procedurehaving intertrial intervals codedas other. a Time during the videowhen the first trial is completedand will be coded as "Promptwith No Error" (PNE). b Aperson editing the already codedvideo data file has paused at43:10 in the video—the starttime for a trial that was coded as"No Prompt, No Error" (NPNE).The start and end times for thecoded trials and intertrial inter-vals are shown in the field to theupper right of the video display.The trial where the video hasbeen paused is highlighted inblue. c Editing options pop-upmenu available to the personediting the expert codings, in-cluding the insertion of com-ments on decision criteria for thisevent, inserting a new additionalbehavior coding with that start-ing time, deleting the existingcoded behavior with that startingtime, or simply deselecting thatselected behavior

Behav Res (2011) 43:616–634 631

Page 17: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

The practical result of a zero start time reference for alldata listings is that, unlike our table’s operationalizationcriteria for Continuous Coding procedures, in TTC onlystop time for a current category is actually registered by thecoding expert, and TTC auto-fills the derived start time forthe next to-be-coded event. Typically, accurate continuouscoding relies upon the unique characteristics of video-basedrecording and playback capabilities that include pausingvideo play and single-frame stepping forward or backwardwith respect to any video frames of special interest, such astransitions from one behavior to another or criteria-checking with regard to meeting standards required byoperational definitions for classification.

Anyone who has ever watched video replays used todetermine the accuracy of a referee’s call of an infraction infootball or another sport has had such a viewing experience.Thus, unlike the interface for trainees experiencing theBehavior Frequency counting procedure, expert codersestablishing reference codings in TTC are encouraged andallowed by the Continuous Coding interface design to relyupon such video controls for precision. Likewise, editing isa dramatically different algorithm and interface experiencefor Continuous Coding versus Behavior Frequency count-ing, as we will soon see.

When the video frame corresponding to a behavioralevent’s stop time is found, that time stamp is recorded byentering the behavioral code describing the event justending. As was previously noted, this data entry not onlyrecords the behavior and its corresponding stop time, butalso adds one time unit (typically, one frame, or 1/30th of asecond) as an auto-entered start time for the subsequentbehavioral event that is about to begin. In practical terms,this recording of a current behavior’s stop time while auto-producing the subsequent behavior’s start time procedure isactually preferred in all cases, because coders musttypically play through a transition from one behavior toanother to determine that a new category of behavior hasactually begun. As such, it is typically easier to step frame-by-frame backward to find the exact change-over, ortransition time, and then to record the behavior just ended(as opposed to the behavior just started, since that may yetto be determined by full event viewing, as with functionallydefined events).

As already noted, we also quickly discovered from ourmodeling effort for Continuous Coding procedures in TTCthat this interface and algorithm must also typicallyaccommodate corrections and/or editing of any inputs thata coder desires, regardless of the time or sequence in theircoding process that an edited entry was made. This includesediting start/stop times and/or behavioral category and,possibly, any data entry line. Importantly, behavior frequen-cy as a procedure cannot accommodate such edits, since asimple count of, say, 5 for any given category tells you

nothing about which of the five accumulated events one mightbe adding/subtracting. Thus, only the most recent (i.e.,current) entry can be edited in that procedure, even inpaper–pencil real-world recordings. Thus, it is not simply acomputer-modeling restriction.

As is illustrated in Figs. 4b, c, a mouse-click selection inTTC creates a focal-frame surrounding the item selected forediting. Selection of an item simultaneously sets that itemfor edit and presents access to a popup menu (see Fig. 4c),allowing new item insertion, current item deletion, or eventhe adding of comments that can be used in training, suchas the specific criteria used for resolving assignment of theselected video segment to the current category. Such editingalso allows the user to change the behavioral code or thestop/start time. This is accomplished by selecting theassociated item in a given data line. If a time is changed,the associated-by-inference start and/or end time is auto-matically corrected as well. Similarly, inserting or deleting acoded item adjusts the starting or ending time for the priorand subsequent codes as well.

Conclusion

An initial literature survey that was quite broadly focusedgenerated an operations analysis of as many alternativereported procedures for collecting data through systematicobservation as we could find. From this operations analysis,we generated an organizational scheme, or proceduraltaxonomy, for systematically classifying, comparing, andcontrasting those procedures. We subsequently conducted amore targeted survey of two disparate journal volumes toconfirm the inclusiveness of that taxonomy. We found that100% of published articles incorporating observationalmethods with systemic coding for data collection definedby the present taxonomy were classifiable. We found 6(5%) articles that used only field notes as sources, whichwe never intended to incorporate as coding procedures tobe modeled or taught. They used direct observation, but notsystematic coding schemes. We thus are encouraged thatour taxonomy is robust and will apply to the use andteaching of criteria for defining systematic data recordingprocedures likely to be used in direct observational researchsettings.

Concurrent with our survey was our development ofTTC as a means for automating the training of observerswho are typically tasked with collecting the data in suchobservational research. This project was yet anotherconvergent affirmation of our alternative procedural spec-ifications; that is, we found our procedure descriptionssufficiently specific to guide each stage of design decisionand algorithm development, not only for training, but alsofor the expert system specifications and for evaluating IOA

632 Behav Res (2011) 43:616–634

Page 18: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

between trainee and expert. Our taxonomy specificationsalso guided our alternative data analysis designs, both forerror analysis and for summarizing the video source events.Thus far, our development efforts on TTC have confirmedthe adequacy of all of our original specifications.

Perhaps more important, the discipline imposed bydeveloping computer software has forced us to systematizean important and complex family of methods that is in neardisarray. Incomplete reports of procedures used are com-mon, and when complete reports do exist, a consistentvocabulary for naming the procedures is lacking. We feelthat the present taxonomy brings a much needed order tobehavioral observation research. We also believe that wehave confirmed through our sample survey that behavioralobservation is a very important dimension of behavioralresearch methods.

References

Altmann, J. (1974). Observational study of behavior: Samplingmethods. Behaviour, 49, 227–267.

Astley, C. A., Smith, O. A., Ray, R. D., Golanov, E. V., Chesney, M.,Chalyan, V. G., … Bowden, D. M. (1991). Integrating behaviorand cardiovascular responses: The code. American Journal ofPhysiology: Regulatory, Integrative, and Comparative Physiology,261, R172–R181.

Baer, D. M. (1977). Perhaps it would be better not to knoweverything. Journal of Applied Behavior Analysis, 10, 167–172.

Bakeman, R., & Gottman, J. M. (1987). Applying observationalmethods: A systematic view. In J. D. Osofsky (Ed.), Handbook ofinfant development (pp. 818–854). New York: Wiley.

Bakeman, R., & Gottman, J. M. (1997). Observing interaction: Anintroduction to sequential analysis (2nd ed.). Cambridge:Cambridge University Press.

Boykin, R. A., & Nelson, R. O. (1981). The effect of instructions andcalculation procedures on observers’ accuracy, agreement, andcalculation correctness. Journal of Applied Behavior Analysis,14, 479–489.

Campbell, J. T. (1976). Psychometric theory. In M. D. Dunnette (Ed.),Handbook of industrial and organizational psychology (pp. 185–222). Chicago: Rand McdNally.

Cohen, J. (1960). A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20, 37–46.

Collins, J. J., & Stewart, I. N. (1993). Coupled nonlinear oscillatorsand the symmetries of animal gaits. Journal of NonlinearScience, 3, 349–392.

Cone, J. D. (1982). Validity of direct observation assessmentprocedures. In D. P. Hartmann (Ed.), Using observers to studybehavior (New Directions for Methodology of Social andBehavioral Science, Vol. 14, pp. 67–79). San Francisco: Jossey-Bass.

Cronbach, L. S., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972).The dependability of behavioral mesurements. New York: Wiley.

Cunado, D., Nixon, M. S., & Carter, J. N. (2003). Automaticextraction and description of human gait models for recognitionpurposes. Computer Vision and Image Understanding, 90, 1–41.

Friman, P. C. (2009). Behavior assessment. In D. H. Barlow, M. K.Nock, & M. Herson (Eds.), Single case experimental designs:Strategies for studying behavior change (3rd ed., pp. 99–134).Boston: Pearson.

Grove, W. M., Andreasen, N. C., McDonald-Scott, P., Keller, M. B., &Shapiro, R. W. (1981). Reliability studies of psychiatric diagno-sis. Archives of General Psychiatry, 38, 408–413.

Guggenmoos-Holzmann, I. (1993). How reliable are chance-correctedmeasures of agreement? Statistics in Medicine, 12, 2191–2205.

Hartmann, D. P. (1977). Considerations in the use of interobserverreliability estimates. Journal of Applied Behavior Analysis, 10,103–116.

Hartmann, D. P., & Wood, D. D. (1992). Observational methods. In A.S. Bellack, M. Hersen, & A. E. Akzdin (Eds.), Internationalhandbook of behavior modification and therapy (pp. 107–138).New York: Plenum.

Hawkins, A., Sharpe, T. L., & Ray, R. D. (1994). Toward instructionalprocess measurability: An interbehavioral field systems perspec-tive. In R. Gardner, D. Sainatok, T. Cooper, T. Heron, W. Heward,J. Eshleman, & T. Grossi (Eds.), Behavior analysis in education:Focus on measurably superior instruction. (pp. 241–255). PacificGrove, CA: Brooks/Cole.

Herrnstein, R. J. (1961). Relative and absolute strength of response asa function of frequency of reinforcement. Journal of theExperimental Analysis of Behavior, 4, 267–272.

Holland, J. G., & Skinner, B. F. (1961). The analysis of behavior: Aprogram for self-instruction. New York: McGraw-Hill.

House, A. E., Farber, J., & Nier, L. L. (1983). Differences incomputational accuracy and speed of calculation between threemeasures of interobserver agreement. Child Study Journal, 13,95–201.

Kelleher, R. T., & Gollub, L. R. (1962). A review of positiveconditioned reinforcement. Journal of the Experimental Analysisof Behavior, 5, 543–597.

Kelly, M. B. (1977). A review of the observational data-collection andreliability procedures reported in The Journal of AppliedBehavior Analysis. Journal of Applied Behavior Analysis, 10,97–101.

Lee, M. D., & Del Fabbro, P. H. (2002). A Bayesian coefficient ofagreement for binary decisions. Retrieved from http://www.psychology.adelaide.edu.au/members/staff/michaellee/homepage/bayeskappa.pdf.

Mann, J., Have, T. T., Plunkett, J. W., & Meisels, S. J. (1991). Timesampling: A methodological critique. Child Development, 62,227–241.

Martin, P., & Bateson, P. (1993). Measuring behavior: An introductoryguide. London: Cambridge University Press.

Mitchell, S. K. (1979). Interobserver agreement, reliability, andgeneralizability of data collected in observational studies.Psychological Bulletin, 86, 376–390.

Mudford, O. C., Martin, N. T., Hui, J. K. Y., & Taylor, S. A. (2009).Assessing observer accuracy in continuous recording of rate andduration: Three algorithms compared. Journal of AppliedBehavior Analysis, 42, 527–539.

Mudford, O. C., Taylor, S. A., & Martin, N. T. (2009). Continuousrecording and interobserver agreement algorithms reported in theJournal of Applied Behavior Analysis (1995–2005). Journal ofApplied Behavior Analysis, 42, 165–169.

Purton, A. C. (1978). Ethological categories of behavior and someconsequences of their conflation. Animal Behavior, 26, 653–670.

Ray, R. D. (1977). Physiological–behavioral coupling research in theSoviet Science of Higher Nervous Activity: A visitation report.Pavlovian Journal of Biological Sciences, 12, 41–50.

Ray, R. D., & Delprato, D. J. (1989). Behavioral systems analysis:Methodological strategies and tactics. Behavioral Science, 34,81–127.

Ray, R. D., & Ray, M. R. (1976). A systems approach to behavior II:The ecological description and analysis of human behaviordynamics. Psychological Record, 26, 147–180.

Behav Res (2011) 43:616–634 633

Page 19: Operations analysis of behavioral observation procedures ... · observation procedures to be found in the literature. This not only offers utility by establishing a common vocabulary

Ray, J. M., & Ray, R. D. (2008). Train-to-code: An adaptive expertsystem for training systematic observation and coding skills.Behavior Research Methods, 40, 673–693.

Sanford, E. C. (1888). Personal equation. American Journal ofPsychology, II, 3–38.

Sharpe, T., & Koperwas, J. (2003). Behavioral and sequentialanalysis: Principles and practice. Thousand Oaks, CA: Sage.

Skinner, B. F. (1938). The behavior of organisms. New York:Appleton-Century-Crofts.

Spitznagel, E. L., & Helzer, J. E. (1985). A proposed solution to thebase rate problem in the kappa statistic. Archives of GeneralPsychiatry, 42, 725–728.

Suen, H. K., & Ary, D. (1989). Analyzing quantitative behaviouralobservation data. Hillsdale, NJ: Erlbaum.

Suen, H. K., & Lee, P. S. C. (1985). Effects of the use of percentageagreement on behavioral observation reliabilities: A reassess-ment. Journal of Psychopathology and Behavioral Assessment, 7,221–234.

Terrace, H. S. (1963). Discrimination learning with and without “errors.”Journal of the Experimental Analysis of Behavior, 6, 1–27.

Thorndike, E. L. (1898). Animal intelligence: An experimental studyof the associative processes in animals. Psychological Mono-graphs, 2(4), Whole No. 8.

Thorndike, E. L. (1899). A reply to "The nature of animal intelligenceand the methods of investigating it." Psychological Review, 6,412–420.

Verplanck, W. S. (1957) A glossary of terms. Psychological Review, 64(Suppl.), 42 and i.

Verplanck, W. S. (1996). From 1924 to 1996 and into the future:Operation analytic behaviorism. Mexican Journal of BehaviorAnalysis, 22, 19–60.

Vollmer, T. R., Soloman, K. N., & St. Peter Pipkin, C. (2008). Practicalimplications of data reliability and treatment integrity monitoring.Behavior Analysis in Practice, 1, 4–11.

Wallace, B., & Ross, A. J. (2006). Beyond human error: Taxonomiesand safety science. Boca Raton, FL: CRC Press.

634 Behav Res (2011) 43:616–634