Corpus Linguistics Anca Dinu February, 2020



Course info

• Where to find materials: official personal page at unibuc: https://limbimoderne.lls.unibuc.ro/catedra/
• Attendance requirement: 60%
• Project: 50% of the final grade
• Oral examination: 50% of the final grade


Motivation

• Language in use is worthy of study;
• Large quantities of authentic language are needed for meaningful study;
• Context is important;
• General shift in social sciences to empiricism;
• Rise of technology;
• Data-based approach opens up research.


History of Corpus Linguistics

• The first quantitative approaches to style and authorship studies were done by hand (1901-1958).

• Augustus de Morgan, in a letter written in 1851, proposed a quantitative study of vocabulary as a means of investigating the authorship of the Pauline Epistles.

• T. C. Mendenhall, at the end of the nineteenth century, described his counting machine: two ladies counted the number of words of two letters, three letters, and so on in Shakespeare, Marlowe and Bacon, in an attempt to determine who wrote Shakespeare's texts (Mendenhall 1901).

• But the advent of computers made it possible to record word frequencies in much greater numbers and much more accurately than any human being can.

• In 1963, a Scottish clergyman, Andrew Morton, published an article in a British newspaper claiming that, according to the computer, St Paul only wrote four of his epistles.

• Morton based his claim on word counts of common words in the Greek text, plus some elementary statistics.

• The most influential computer-based authorship investigation dates back to 1964. This was the study by Mosteller and Wallace of the Federalist Papers, an attempt to identify the authorship of the twelve disputed papers.

• They were able to show that Madison was very likely to have been the author of the disputed papers. Their conclusions generally have been accepted, to the extent that the Federalist Papers have been used as a test for new methods of authorship discrimination.

• 1949 - Roberto Busa, an Italian Jesuit and theologian, approached Thomas J. Watson, founder of IBM, seeking help in indexing the works of Thomas Aquinas (totalling some 11 million words of medieval Latin).

• Busa wanted to produce a "lemma" concordance list where words are listed under their dictionary headings, not under their simple forms.

• In the process, Busa and Watson demonstrated that the storage, retrieval and search-and-sort functions of the computer were compelling tools.

• A landmark in modern corpus linguistics was the publication by Henry Kučera and W. Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus.

• A further key publication was Randolph Quirk's 'Towards a description of English Usage' (1960) in which he introduced The Survey of English Usage.

• Shortly after, Boston publisher Houghton-Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary, the first dictionary compiled using corpus linguistics.

• The AHD took the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used).

• 1985: The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English.

• The Survey of English Usage Corpus was used in the development of one of the most important corpus-based grammars, A Comprehensive Grammar of the English Language (Quirk et al. 1985).

Corpus Linguistics

Search on Google, February 2020:

• “corpus linguistics is”: 18,100,000 occurrences;
• “corpus linguistics is not”: 11,500,000 occurrences.

CL is the study of language data on a large scale.

CL is a method of carrying out linguistic analyses of naturally occurring language (real-life data) on the basis of computerized corpora, with specialized software.

• CL is a methodology, not a linguistic theory (such as structuralism, generativism, cognitive linguistics, sociolinguistics, etc.).

Corpus Linguistics      Linguistic theories
language in use         intuition, introspection
bottom-up               top-down
methodology             explanatory/predictive power

• It has become one of the most widespread methods of linguistic investigation.

• It can be used for the investigation of many kinds of linguistic questions.

• It has the potential to yield highly interesting, fundamental, and often surprising new insights about language.

Linguistic Data

• What data do linguists use to investigate linguistic phenomena?

• Roughly, four types of data can be distinguished:

1) data gained by intuition
   a) the researcher’s own intuition (“introspection”)
   b) other people’s intuition (accessed, for example, by elicitation tests)

2) naturally occurring language
   a) randomly collected texts or occurrences (“anecdotal evidence”)
   b) systematic collections of texts - CORPORA

Corpora

• A corpus is a systematic collection of naturally occurring texts (of both written and spoken language).

• Systematic means that the structure and contents of the corpus follow certain extralinguistic principles or criteria.

• For example, the texts or transcriptions of a corpus are often restricted to a certain time span, domain, genre, style, dialect, language, etc.

• If several of these subcategories are present in a corpus, these are often represented by the same amount of text (they are balanced) and separated as such in the corpus.
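
To make the idea of balance concrete, here is a minimal sketch that counts tokens per subcategory and reports each subcategory's share of the corpus. The mini-corpus, the "genre" field and the function names are hypothetical, and whitespace splitting is only a stand-in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus: each text carries a genre label (metadata) and its content.
corpus = [
    {"genre": "news", "text": "The council approved the new budget on Monday ."},
    {"genre": "news", "text": "Prices rose again last month ."},
    {"genre": "fiction", "text": "She opened the door and stepped into the dark hall ."},
    {"genre": "spoken", "text": "well I mean you know it was fine"},
]

def tokens_per_category(texts, key="genre"):
    """Count whitespace-separated tokens for each value of a metadata field."""
    counts = Counter()
    for item in texts:
        counts[item[key]] += len(item["text"].split())
    return counts

def shares(counts):
    """Turn raw token counts into each category's share of the whole corpus."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

counts = tokens_per_category(corpus)
for category, share in sorted(shares(counts).items()):
    print(f"{category:8s} {counts[category]:3d} tokens  {share:.1%}")
```

In a balanced corpus the shares would be roughly equal; here the "news" category is over-represented, which is the kind of imbalance a corpus compiler would correct by adding or sampling texts.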

What corpora are there?

• Depending on the type of text or transcript, corpora can be:

– general/reference corpora (vs. specialized corpora) (e.g. BNC = British National Corpus, or Bank of English), which aim at representing a language or variety as a whole (they contain both spoken and written language, different text types, etc.)

– historical corpora (vs. corpora of present-day language) (e.g. Helsinki Corpus, ARCHER), which aim at representing an earlier stage or earlier stages of a language.

– regional corpora (vs. corpora containing more than one variety) (e.g. WCNZE = Wellington Corpus of Written New Zealand English), which aim at representing one regional variety of a language.

– learner corpora (vs. native speaker corpora) (e.g. ICLE = International Corpus of Learner English), which aim at representing the language as produced by learners of this language.

– multilingual corpora (vs. one-language corpora), which aim at representing several, at least two, different languages, often with the same text types (for contrastive analyses).

– spoken corpora (vs. written vs. mixed corpora) (e.g. LLC = London-Lund Corpus of Spoken English), which aim at representing spoken language.

Use of corpora

• Different types of corpora are used for different kinds of analysis.

• In linguistics, the typical uses of corpora are:

– the (in)validation of linguistic hypotheses;

– statistical analysis of the linguistic data (Corpus Pattern Analysis, frequency lists, word co-occurrences, concordances, idioms, structures);

– the (semi-)automated extraction of data (such as argument structure and thematic roles) for the creation of electronic lexicons.
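
Two of the statistical analyses listed above, frequency lists and word co-occurrences, can be sketched in a few lines. This is an illustrative sketch only: the sample text, the naive tokenizer and the function names are assumptions, not the behaviour of any particular corpus tool.

```python
from collections import Counter
import re

# Hypothetical sample text standing in for a corpus.
text = "the cat sat on the mat . the dog sat on the rug ."

def tokenize(s):
    """Lowercase and keep alphabetic word tokens only (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", s.lower())

def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()

def cooccurrences(tokens, node, window=2):
    """Count words appearing within `window` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = tokenize(text)
freq = frequency_list(tokens)
cooc = cooccurrences(tokens, "sat", window=2)
print(freq[:3])
print(cooc.most_common(3))
```

Raw co-occurrence counts are dominated by frequent function words such as "the", which is why corpus tools usually rank collocates with an association measure rather than by raw count.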

Annotation

• Annotation of corpora means that some kind of linguistic analysis has been performed and marked on the texts.

• A corpus can be:

– un-annotated: orthographic, raw, with just meta-annotation;

– annotated: phonetic, lexical, syntactic, semantic, pragmatic annotations.

• Phonetic annotation: adding information about how a word in a spoken corpus was pronounced.

• Prosodic annotation: adding information about prosodic features such as stress, intonation and pauses, in a spoken corpus.

• Morphological and lexical annotation: adding the lemma of each word form in a text (its headword in a dictionary: lying has the lemma LIE).

• Syntactic and part-of-speech annotation: adding information about how a given sentence is parsed (trees, dependencies, etc.), in terms of syntactic analysis into phrases (NP, VP, etc.) and clauses.

• Semantic and discourse annotation: adding information about the semantic category of words, like human, artefact, etc., and about anaphoric links in a text, like pronouns and their antecedents.

• Pragmatic annotation: adding information about the kinds of speech act (or dialogue act) that occur in a spoken dialogue. For instance, okay on different occasions may be an acknowledgement, a request for feedback, an acceptance, or a pragmatic marker initiating a new phase of discussion.

• Stylistic annotation: adding information about speech and thought presentation (direct speech, indirect speech, free indirect thought, etc.).
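
As an illustration of the lexical and part-of-speech layers described above, the sketch below attaches a lemma and a tag to each word form. The lookup table, the tag names and the `Token` class are hypothetical; real corpora rely on trained taggers and lemmatizers rather than a hand-written lexicon.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon mapping word forms to (lemma, part-of-speech tag).
LEXICON = {
    "she": ("SHE", "PRON"),
    "was": ("BE", "VERB"),
    "lying": ("LIE", "VERB"),
    "on": ("ON", "ADP"),
    "the": ("THE", "DET"),
    "sofa": ("SOFA", "NOUN"),
}

@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary headword
    pos: str    # its part-of-speech tag

def annotate(sentence):
    """Attach a lemma and a POS tag to each whitespace-separated word form."""
    tokens = []
    for form in sentence.split():
        # Unknown forms fall back to the upper-cased form and the tag "UNK".
        lemma, pos = LEXICON.get(form.lower(), (form.upper(), "UNK"))
        tokens.append(Token(form, lemma, pos))
    return tokens

annotated = annotate("She was lying on the sofa")
for tok in annotated:
    print(f"{tok.form}\t{tok.lemma}\t{tok.pos}")
```

The one-token-per-line, tab-separated output mirrors the "vertical" layout commonly used for tagged corpora, although the exact columns vary from corpus to corpus.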

• Annotation schemata should focus on a single coherent theme: different linguistic phenomena should be annotated separately over the same corpus.

• Annotations must be consistent with each other: unification and merging of multiple annotations is necessary.
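
One common way to keep phenomena separate and still merge them is stand-off annotation: each layer stores labelled spans over the same token sequence. The sketch below assumes that design; the layer names, the span format (start, end, label) and the function name are illustrative choices, not a standard.

```python
tokens = ["The", "man", "fired", "the", "gun"]

# Each layer annotates one phenomenon as (start, end, label) spans over token
# indices, with `end` exclusive. The labels are illustrative.
roles = [(0, 2, "agent"), (3, 5, "patient")]
types = [(0, 2, "human"), (3, 5, "firearm")]

def merge_layers(n_tokens, **layers):
    """Merge separately produced layers into one dict of labels per token.

    Raises ValueError if a span falls outside the text or if two spans of the
    same layer overlap, i.e. if a layer is internally inconsistent.
    """
    merged = [dict() for _ in range(n_tokens)]
    for name, spans in layers.items():
        for start, end, label in spans:
            if not (0 <= start < end <= n_tokens):
                raise ValueError(f"{name}: span ({start}, {end}) is out of range")
            for i in range(start, end):
                if name in merged[i]:
                    raise ValueError(f"{name}: overlapping spans at token {i}")
                merged[i][name] = label
    return merged

merged = merge_layers(len(tokens), role=roles, type=types)
for token, labels in zip(tokens, merged):
    print(token, labels)
```

Because every layer refers to the same token indices, a new layer (for example anaphoric links) can be added later without touching the existing ones.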

Example of semantic annotation

• Predicators and their named arguments: [The man]_agent painted [the wall]_patient.

• Anaphors and their antecedents: [The protein] inhibits growth in yeast. [It] blocks production . . .

• Acronyms and their long forms: [Platelet-derived growth factor] (known as [pdgf]) impacts . . .

• Semantic typing of entities: [The man]_human fired [the gun]_firearm.

Annotation

• Corpus annotation is usually made in a standardized manner with:

• XML (eXtensible Markup Language), designed to be both human- and machine-readable, via intuitive tags.

• Or TEI (Text Encoding Initiative), a text-centric community of practice that defined guidelines for encoding texts in XML format.
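
To show what XML-encoded annotation can look like, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element names (`s` for sentence, `w` for word) and the `lemma` and `pos` attributes are assumptions chosen to resemble TEI-style word markup; this is not a complete or validated TEI document.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated words: (form, lemma, part-of-speech tag).
words = [("She", "SHE", "PRON"), ("was", "BE", "VERB"), ("lying", "LIE", "VERB")]

# Build one sentence element with one child element per word.
sentence = ET.Element("s", attrib={"n": "1"})
for form, lemma, pos in words:
    w = ET.SubElement(sentence, "w", attrib={"lemma": lemma, "pos": pos})
    w.text = form

# Serialize to a string, as it would be stored in a corpus file.
xml_string = ET.tostring(sentence, encoding="unicode")
print(xml_string)

# Read the markup back and recover the lemma layer.
parsed = ET.fromstring(xml_string)
lemmas = [w.get("lemma") for w in parsed.iter("w")]
print(lemmas)
```

Keeping the annotation in attributes leaves the running text recoverable by concatenating the element contents, which is one reason tag-based encoding suits both human readers and software.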

• The reference corpus of contemporary Romanian, CoRoLa, was released in 2018: http://corola.racai.ro/

Corpus Software

• Two types of software for corpus analysis:

– software that is tailored to one specific corpus (such as SARA and BNCweb for the BNC, or ICECUP for ICE-GB), and

– software that can be used with almost any kind of corpus (such as AntConc, MonoConc Pro and WordSmith Tools, which is probably the most widely used corpus software).

What can the software do?

• While there are many differences between the software packages designed for corpus analysis, certain basic functions can be performed by practically all the available software.

• For linguistic analyses, the most important function is the possibility of searching the corpus in question for the (co-)occurrence of certain strings (words or phrases).

• As output, the software usually gives information on:

– the number of these strings occurring in the corpus,

– the text in which they were found, and

– the so-called concordance lines, which show the string in question in context (with the search term(s) highlighted).
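
The three kinds of output listed above (number of hits, source text, concordance lines) can be sketched as a simple keyword-in-context search. The two sample texts and the function names are hypothetical, and the sketch does not reproduce how AntConc, WordSmith Tools or any other package works internally.

```python
def concordance(tokens, node, width=3):
    """Return (left context, match, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def search(texts, node, width=3):
    """Search every text and record which text each hit came from."""
    hits = []
    for name, text in texts.items():
        for left, match, right in concordance(text.split(), node, width):
            hits.append((name, left, match, right))
    return hits

# Hypothetical two-text corpus keyed by text name.
texts = {
    "text_a": "The corpus is a systematic collection of texts",
    "text_b": "A balanced corpus represents each genre equally",
}

hits = search(texts, "corpus")
print(f"{len(hits)} occurrence(s)")
for name, left, match, right in hits:
    # Right-align the left context so the search term lines up in one column.
    print(f"{name}: {left:>20} [{match}] {right}")
```

Aligning the search term in a single column is what makes patterns to its left and right easy to spot when many concordance lines are read together.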
