ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES exploring frequencies in texts Bambang Kaswanti Purwo [email protected]


Adolphs, Svenja (2006) Ch. 3

role of frequency information in relation to

characterization of the whole texts or collections of texts

techniques and practices in data analysis

▪ quantitative exploration of texts and text collections
  ▫ different types of wordlists
  ▫ how the wordlists can be used for contrastive studies of different texts

▪ generating hypotheses: frequency lists to inform the generation of hypotheses and research questions

▪ testing hypotheses: electronic text analysis to test existing hypotheses in any area that deals with the use of language

▪ facilitating manual processes: from “manual” to “automated”, e.g. extraction of frequency info

not necessarily motivated by a particular research question

some of the software resources to facilitate the research process

▪ software packages to facilitate the manipulation and analysis of electronic texts
  ▫ the generation of frequency counts
  ▫ comparisons of frequency information in different texts
  ▫ different formats of concordance outputs [including Key Word In Context (KWIC)]
» [free of charge via internet]
  ◊ The Compleat Lexical Tutor (Tom Cobb)
  ◊ View Variation in English Words and Phrases (Mark Davies)
» [commercial]
  ◊ Wordsmith Tools (Mike Scott)

basic information about the text

most software packages
▪ allow textual data to be sorted into concordance outputs
▪ produce some basic information about the text or collection of texts
  ▫ average sentence length
  ▫ word length
  ▫ number of paragraphs
  ▫ number of individual running words (tokens)
  ▫ number of different words (types)
  ▫ number of lexical items and number of grammatical items (in tagged corpora)

» type-token ratio

some of the info can be expressed in terms of ratios, e.g. the ratio between grammatical and lexical items in the text (lexical density)

the type-token ratio
▪ to gain some basic understanding of the lexical variation within the text

tokens: the number of running words in a text
types: the number of different words

This chapter moves from the discussion of design and development of electronic text resources to techniques and practices in data analysis.

How many tokens? 21 How many types? 19

The type-token ratio: divide the number of tokens by the number of types: 21/19 ≈ 1.11

What is it for? to assess the level of complexity of a particular text or text collection (e.g. comparisons between documents for different types of audiences)

the higher the type-token ratio the less varied the text

watch out: the ratio depends on the overall size of the text(s) on which it is based; compare type-token ratios only for text(s) of similar length
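The token and type counts for the example sentence can be reproduced with a short script. This is a minimal sketch, assuming a deliberately simple tokenizer (lower-cased runs of letters); the function names are illustrative and not taken from any of the packages listed earlier. The ratio is computed as tokens divided by types, as on these slides; other sources often report the inverse (types divided by tokens).

```python
import re

def tokenize(text):
    # Assumption: a token is a lower-cased run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def type_token_counts(text):
    tokens = tokenize(text)
    types = set(tokens)          # each different word counted once
    return len(tokens), len(types)

sentence = ("This chapter moves from the discussion of design and development "
            "of electronic text resources to techniques and practices in data analysis.")

n_tokens, n_types = type_token_counts(sentence)
ratio = n_tokens / n_types       # tokens / types, following the slide's definition
print(n_tokens, n_types, round(ratio, 2))
```

Because the ratio changes with text length, such a function is best applied to samples of similar size.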

textual complexity
▪ sentence and word length
▪ linguistic analysis of grammatical structure
▪ semantic fields of the individual items

» word lists

● single words

frequency of a word or phrase in different text types is important for the description of the context of use (e.g. for English language teaching)

▪ various word lists exist in the ELT context, e.g. Academic Word List (Coxhead 2000)

▪ spoken vs. written discourse
▪ American vs. British English

word list

▪ frequency order
▪ alphabetical order
▪ lemmatized format
▪ grammatical tags
▪ other analytical tags
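The first two formats, frequency order and alphabetical order, can be sketched as follows. The sample text is invented for illustration, and the tokenizer is the same simplifying assumption (lower-cased runs of letters); dedicated packages apply more careful rules.

```python
import re
from collections import Counter

def word_list(text):
    # Count every token; Counter maps each type to its frequency.
    return Counter(re.findall(r"[a-z']+", text.lower()))

sample = "I think you know what I mean. Yeah, I know."  # invented sample text
counts = word_list(sample)

# Frequency order: most frequent first, ties broken alphabetically.
by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Alphabetical order.
alphabetical = sorted(counts.items())

for word, freq in by_frequency:
    print(f"{word}\t{freq}")
```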

word list to account for
▪ individual items
▪ recurrent sequences of two or more items

lemmatized frequency lists group together words from the same lemma (all grammatical inflections of a word: e.g. say, said, saying, says)

▪ often variations of meaning between different variants of the lemma (Stubbs 1996, Tognini-Bonelli 2001)

▪ [ELT] beneficial to teach all forms of one lemma together and give priority to the most frequently used form
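A lemmatized list can be sketched by grouping word forms under their lemma while keeping the count of each form, so that the most frequently used form stays visible. The lookup table `LEMMA_OF` is a hypothetical, hand-written mapping covering only the example lemma from the slide; a real analysis would rely on a lemmatizer or a tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical lookup table: word form -> lemma (only "say" is covered here).
LEMMA_OF = {"say": "say", "said": "say", "saying": "say", "says": "say"}

def lemmatized_list(tokens):
    # lemma -> Counter of the individual forms found for that lemma
    grouped = defaultdict(Counter)
    for tok in tokens:
        grouped[LEMMA_OF.get(tok, tok)][tok] += 1
    return grouped

tokens = "she said he says they say she said it".split()  # invented sample
grouped = lemmatized_list(tokens)

say_forms = grouped["say"]
total = sum(say_forms.values())                    # frequency of the lemma
most_common_form = say_forms.most_common(1)[0][0]  # form to prioritise in teaching
print(total, most_common_form, dict(say_forms))
```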

Table 3.1: basic information from a frequency list

ten most frequent items in the
▪ spoken CANCODE corpus
▪ written component (BNC)

some of the key differences between the two discourse modes are highlighted:

▪ both contain mainly grammatical items
▪ the spoken corpus includes the personal pronouns I and you (interactive nature of the spoken discourse)
▪ Yeah – listener response tokens in conversation

● recurrent continuous sequences

other terms: “lexical bundles” (Biber et al. 1999)

“clusters” (Scott 1996)

corpus research: a large proportion of language is phrasal in nature (observable tendency for particular items to co-occur in a non-random fashion)

collocation: attraction between two words (Ch. 4)

[overall length to be determined at the outset; e.g. Wordsmith Tools]

Table 3.2: ten most frequent two-word, three-word, and four-word recurrent sequences in the CANCODE corpus

most of the sequences are concerned with
▪ the management of discourse
▪ the deictics: you and I
▪ attempt to establish mutual understanding: know what I mean, I know, I think, do you think, etc.
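Recurrent continuous sequences of a fixed length can be counted by sliding a window of n tokens over the text. The sketch below is one simple way to do this and is not the procedure of any particular package; the function name and the token sample are invented for illustration, and the sequence length n is set at the outset.

```python
from collections import Counter

def recurrent_sequences(tokens, n):
    # zip over n staggered copies of the token list yields every n-token window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

tokens = "you know what i mean i know what i mean you know".split()  # invented sample

two_word = recurrent_sequences(tokens, 2)
three_word = recurrent_sequences(tokens, 3)
four_word = recurrent_sequences(tokens, 4)

# Keep only sequences that actually recur (frequency of 2 or more).
recurring_four = {seq: f for seq, f in four_word.items() if f >= 2}
print(recurring_four)
```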

● comparing frequencies in text collections of different sizes

How to compare the frequencies of individual items in two corpora of different sizes?

▪ represent them as a percentage of the overall number of words in the respective corpora
▪ use a norming technique of frequency counts
  ▫ divide the raw frequency of individual items by the total number of words in a text
  ▫ decide on an appropriate number of words which forms the basis of the norm
  ▫ multiply the result by this figure
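The norming steps amount to a single line of arithmetic. In the sketch below the corpus sizes match the two corpora discussed later in these slides (35,000 and five million words), but the raw counts are invented for illustration, and the basis of one million words is just one possible choice.

```python
def normalized_frequency(raw_count, corpus_size, basis=1_000_000):
    # raw frequency / total words in the corpus, scaled to the chosen basis
    return raw_count / corpus_size * basis

# Hypothetical raw counts of the same item in two corpora of different sizes.
small_raw, small_size = 350, 35_000
large_raw, large_size = 40_000, 5_000_000

small_norm = normalized_frequency(small_raw, small_size)
large_norm = normalized_frequency(large_raw, large_size)
print(f"{small_norm:.0f} vs {large_norm:.0f} per million words")
```

Although the raw count is far higher in the larger corpus, the normed figures show the item to be relatively more frequent in the smaller one.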

» keywords

◊ keywords = items that occur
▪ either with a significantly higher frequency (positive keywords)
▪ or with a significantly lower frequency (negative keywords)

in a text or collection of texts when compared to a larger reference corpus (Scott 1997)

◊ keywords are identified on the basis of

▪ statistical comparisons of word frequency lists derived from the target corpus and the reference corpus
▪ [via a chi-square or a log-likelihood analysis] each item in the target corpus is compared to its equivalent in the reference corpus and the statistical significance of the difference is calculated

to generate words that are characteristic or uncharacteristic in a given target corpus
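The log-likelihood comparison for a single item can be sketched as below. This uses a formulation of the statistic that is common in corpus linguistics (observed frequencies compared with the frequencies expected if the item were equally common in both corpora); it is not claimed to be the exact computation of any particular package. The function names and the counts are invented for illustration. A value above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the item had the same rate in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def keyness(freq_target, size_target, freq_ref, size_ref):
    ll = log_likelihood(freq_target, size_target, freq_ref, size_ref)
    rate_target = freq_target / size_target
    rate_ref = freq_ref / size_ref
    if rate_target > rate_ref:
        direction = "positive"   # relatively more frequent in the target corpus
    elif rate_target < rate_ref:
        direction = "negative"   # relatively less frequent in the target corpus
    else:
        direction = "neutral"
    return ll, direction

# Hypothetical counts: 35,000-word target corpus vs. 5,000,000-word reference corpus.
ll_pos, dir_pos = keyness(60, 35_000, 500, 5_000_000)
ll_neg, dir_neg = keyness(5, 35_000, 20_000, 5_000_000)
print(round(ll_pos, 1), dir_pos, round(ll_neg, 1), dir_neg)
```

Sorting all items in the target corpus by this score, separately for each direction, gives lists of positive and negative keywords.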

● single keywords

◊ on the basis of a 35,000 word corpus: the spoken language of health professionals

◊ five million word CANCODE corpus of general spoken English

a study of telephone calls made to the British advice helpline provided by the National Health Service (NHS Direct); the data from the medical consultations was recorded

▪ most frequent items in both corpora: grammatical items
▪ distribution of personal pronouns: Health Service “other-oriented”, you most frequent
▪ the reverse frequency order of you and I
▪ right in Health Service, yeah in CANCODE

both are listener response tokens
  ▫ right signals more transactional nature
  ▫ yeah signals interactional nature (encourages the speaker to continue with the turn)

▪ comparison of frequency lists can help in the characterization of different spoken genres

▪ keyword analysis (below), based on a log-likelihood calculation, is better suited to highlight the main elements that are characteristic of a particular text or collection of texts

Table 3.4 shows the top 10 positive keywords

the list gives a better idea of the content of the texts in the HP corpus

▪ reference to medication (antibiotics)
▪ ailments (diarrhoea)
▪ the nature of the discourse (information)
▪ the mode of the discourse (call)
▪ the medical context (NHS, Direct)

▪ the keywords that mark listener response in an advice-giving setting (ok, okay)
▪ patient-oriented nature (you, your)

Table 3.5 confirms the result of the analysis of positive keywords

▪ the discourse in the HP corpus is oriented towards the hearer who phones in with a health problem (you, your)

▪ third person pronouns are negative keywords (low in the HP corpus)

▪ the past tense verb was is also a negative keyword: in the HP corpus current medical concerns are reported in the present tense

▪ laughter ([laughs]) occurs significantly more in CANCODE, reflecting the relatively serious nature of medical consultation in the HP corpus

● key sequences

analysis of keywords can be extended to include recurrent sequences

Table 3.6 key sequences provides us with even stronger evidence of the particular domain of HP discourse

▪ quite a few of the recurrent sequences are “automated responses” marking the beginning of telephone interaction with NHS Direct
▪ other sequences relate to the gathering of basic information about the caller
▪ the most significant NEG key sequence in the HP corpus: I don’t know (professionals providing knowledge and advice)
