19 Elementary statistical testing with R

Embed Size (px)

Text of 19 Elementary statistical testing with R

  • 360 BASIC STATISTICAL A N ALY SI S

    We need to specify the probability P(u;iu) of an individual probability, given the value of u. We use:

    (9)

    wbcrc Cis a constaul. Tht: role of Cis to detem1ine how strongly the individual pammctcrs u, depend on the general probability u. We use C= I 0.

    From the posterior distribution P(uiD) we can further compute the interval (r, .1) of values such as to give a probability of at least 95% that the value of u is within that interval; such an interval is called a 95% posterior inlerral in Bayesian literature.

    19 Elementary statistical testing with R

    STEfAN TH. GR I ES

    Introduction

    In Briun Joseph's fi11al cdit.oriul us the editor of what many sec as the flagship journal of the disci pi ine, Language, he commented on recent develop-ments in the field. One of the tccent developments he has seen happening is the following:

    Linguistics hns ulways had a llltmcricnl and mathematical side . . . bm the use of quantitative mcthocb , anti, rcbttedly, formalization> and mudd ing, seems to be ever on the in~rca,c; rare is the paper that does not report on some ~tatistical analysis of relevalll data or otTer ~ome model of the problem at hand. (Joseph 200ll: 687)

    For several reasons, this appe:ll's to be a development for the better. First, it situates the field of linguistics more firmly in the domains of social sciences and cogt1itivc science to which, !think. it belongs. Other fields in the social sciem:es and in cognitive science - l>~ychology. sociology, computer science, to name but a few - have long rccogni1.ed rhe power of quantitative methods for their respective fields of study. and since linguists deal with phenomena just as multifactorial and interrelated as scholars in these disciplines, it was time we also began to use the tools that have been so uscf ul in neighhori ng disciplines.

    Second, the quantitative study of phenomena aiTords us with a higher degree of comparability, objectivity, and replicability. Consider the following slightly edited statement:

    there is :11 besr a very slight increase in the iirst foL

  • 362 fi,IS l C: ST,\TISTlCAl ANALYSJS ------------- ---------------------------------------

    large (because, while thi~ also involves an addition of 100% to the first number, it involves an addition of 160, not just 20) ... ; the division of the nine decades into (three) stage.~ i~ ~imply as.~umed but not justified on the basis of operationalizablc data; the lengths or the nine stages (first four decade!>, then three decades, then two decades) is simply stated hut not justified on the ha~is of dma.

    Third, there is increasing evidence that much of the cognitive and/or linguistic system is statistical or probabilistic in nature, as opposed to based on clear-cut categorical rules. Many scholars in flfSt language acquisition, cognitive linguis-tic,, soeiolinguistics, psycholinguistics etc. embrace, for instance, exemplar-based models of language acquisition, representation, processing, and change. Jt is ccrtainlv no coincidence that this theoretical development coincides with the method~logieal development pointed out by Joseph, and if one adopts a probabil-istic theoretical perspective, then the choice of probabilistic - i.e. statistical -tools is only natLU1ll; cf. Bod, Hay, and .fannccly (2003) for an exce.llent overview of probabiUstic linguistics.

    For obvious limitations of space, this chapter cannot provide a full-fledged introduction ro quanritarive methods (cf. the refen;nc~~ hdow). Howt:vt:r, tht: main body of this chapter explains how to set up different kinds of quantitative data for statistical analysis and exemplifies three tests for the three most cornrnon ~tatistical sct:narios: t11e study of frequencies, of

  • 364 llA~IC STATl~TICAL ANALYS I S

    The way that raw data tables normally have to be arranged for stmisti to save them so that they can be ea~i ly loaded into a sradstics application. To that end, you should save the data into a formal that makes them maximally readable by a wide variety of progrdlliS. The simplest way to do this is to save the data into a tab-separated file, i.e. a raw text fi le in which different columns arc separated from each other with tabs. In Libre-Office Calc, one first chooses File: Save As ... , then ~.:!looses 'Text CSV (.csv)' as the fi k typt;, and chooses (Tab I as th~ E_ie!d delimiter. 1

    Table 19.2. Reor[?anizacion of 'table 2 (11ppendix 2).from Hundt and Smilh (2009)

    TENSE YAIUETY TIME

    DrE 1%0> 4 I 95 more rows like the one immediately above

    prc.se.nt_perfcct BrE 1990s 4072 more row~ like the one immediately above

    prescnt_pcrrcct A.mE 1960s 3537 mort: row~ like the one immediately above

    pre.~ent_perfcct A mE 1990s 3498 more rows like the one immediately above

    simplc_past BrE 1960s 35820 more rows like the one immediately above

    simple_past BrE 1990s 35275 more rows like the one immediately above

    simple_pasl ArnE 1960s 37223 more rows like the one immediately above

    simple_past A mE 1990s 36249 more rows like the one immediately above

    To perform the third step, i.e. to load the data inr.o statisti l:a! software, you must l'irst decide on which software to use. From my point of view, the best statistical package curremly availahlt: is lhe open source software environ-ment R (cf. R Development Core Team 200fi- 2013). The basic software as

    1 I recommend using only .wrd chanlCters (lcncrs, number;. and under>Cl programs nowadays provide sophi>tic~red import iunclions or w~r.ard~ -11 is my cxperien

  • 366

    To check whether the data have been read in correctly, it is always u~cful to look at the structure of the imported data fi rst, using the function s l ~ , which provides all the column names together with some information on what the columns contain, namely their kind of data (integer numbers, charad cr strings as factors, etc.) as well a~ the tirst few values. lf you had read in a file of the kind 'howu in Table 19.2. then this is what the output would look like:

    str(data . tablel~ ' dat~.frame ': 159876obs . of3variables: $TENS!:. : ~actorw/Zl!'vels "presenl_perfect" ... : 1: Ill' 111: .. . $VARIETY : Factorw/2levels "AmE". "BrE ": 22???72222 ... $TIME : ractorw/ 2lcvels "1960s" ,"1990s" : 1111 l 1111 ...

    The simplest way to be able to access all values of a column att.he same lime by just using the column name involves the function attach, which requires the dm.a frame's name as its only argument ami typically does not retum any OLitput.:

    attach(data .tahl~l,

    While the above is the typical way to input data into R, when the data set. in question is small, another way is sometimes simpler, namely entering the data oneself. For example, if you have collected the lengths of five indirect nhjccts and of five direct objects in tenns of number of phonemes and want to quickly compare t]Jeir mean lengths, it is maybe not necessary to create a tab-delimited input file - you can just enter the data into Rand assign them to a data structun:, a so-called vector, using the function c, which concatenates the clements provided as arguments (numbers or character suings) into a vector.

    indir .objects

  • 36B .BASIC ST~ I'I STTC:,\1. ANALYSIS

    a tt r( data. matrix. "d i mnames "lected frequencies, and we also now sec that t.hc effects for the prcpo~itionless answers arc somewhat more pronounced.

  • 370 D A SI C S'l A I'I STICAL AI' AI. YSIS

    A graphical representation that makes this even more obvious is the so-called association plot, which is shown in Figure 19.1: blae chi-square to identify cffec.t si1.es or compare t11em ac.;ross different. studies. Instead, one can u~e a correlation coefficient called Cramer's V, which falls between 0 and I, and the larger the value, the stronger the correlation. Cramer 's V is computed as shown in (6).

    (6) Cramer's V =

    Elementary statistical testing with R

    Tf we compute Cramer's V for the present data set. we obtain the re.sult in (7).

    (7) Cramer's V

    In R:

    37.5569 - /37.5569- ~ -,(---,') - \1---w3

    \ 200 miu(2, 2)

    sqrt(data.~atrix.testtstat/(sum(data.matrixl*min(oim:data.

    matri xl 1)) l" X squared 0.43334C8

    Thus, the data show an effect such Lhat the type of question detennines the type of answer: prepositional and preposition less questions yield higher frequencies of occurrence of preposition:tl and prepositionlcss answers respectively; the effect is highly significant (p

  • 372 LI ASII ALYS I S

    A frequent scenario is, then, that one wants to compare the central tendencies of two vectors to see whether they are significantly diffl:n.:ul from each other. C:on~ider the ~;as~.: where you have collected the lengths of subordinate clauses in words in 10 samples of spoken and 10 samples of written data and obtained the following results:

    (&) spoken: II , 10, 9, 6. 8, 9 I I, 7. &. 6

    (9) written: 13. 14. 12, 13, II, 14, 7. 10. 12, 12

    Thc

  • 37L DASIC >TAT!STICAI. AIoxp- o L: dependent variable ~ independent variable;

    the nrgumem pn i r cd, whkh can be set to TRUe or - />.LSF, where TRU [ means the values ofthc two groups l'onn meaningful pairs, and F1\ l SC means the opposite_ Since in 1his case the lenglh of any one suhordinate clause in speaking is not related to that of any one subordinate clause from a written file, we set pa i o'ed=FALSE;

    the argument :or 'e::. which can be set to nuc: or ~.~LS[ dgxiUJing on whctht:r you want to apply a correction for continuity or not as is sometimes recommended for small sample sizes. In the interest of comparability of the test with most other stati~tical sol'! ware, we set tlois to r