English Corpora and Language Learning 2
Outline
What is a Corpus?
Compiling a corpus
First generation of corpora: BROWN, LOB
The Age of Mega Corpora
British National Corpus
International Corpus of English
International Corpus of Learner English
The Web as a corpus?
Availability
Corpora?
(1) A collection of texts, especially if complete and self-contained; the corpus of Anglo-Saxon verse
(2) In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database
(The Oxford Companion to the English Language 1992)
A collection of naturally occurring language text chosen to characterize a state or variety of a language
(John Sinclair, Corpus, Concordance, Collocation, OUP 1991)
The pre-electronic era
Huge, painstaking manual effort
Covering a closed body of texts
Bible Concordance
Shakespeare Concordance
Attempt to capture the whole language
Compiling a corpus
Aim
provide solid empirical evidence about language
Design
geographical and chronological bounds
speakers, genres,
defined by future use
Representative corpora?
Annotation
Output
Corpus Linguistics: the early phase
Early Sixties
BROWN Corpus 500 texts of 2000 words each
LOB corpus British counterpart
Classic reference works
Part of speech tagged
Survey of English Usage
A major undertaking at UCL led by Sidney Greenbaum
1 m word compilation
very careful annotation
500,000 words of spoken material
LONDON-LUND Corpus
Structure of SEU
LOB corpus: a sample
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
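The sample above pairs each word with a part-of-speech tag in the form word_TAG. As a rough illustration of how such lines might be processed, here is a minimal Python sketch. It assumes that the first two fields are a text identifier and a line number, and that "^" marks the start of a sentence; the function name parse_line is made up for this example and does not belong to any LOB tool.

```python
import re
from collections import Counter

# Four lines copied from the sample above (slide bullets removed).
SAMPLE = r"""A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN"""


def parse_line(line):
    """Return (text_id, line_no, [(word, tag), ...]) for one tagged line."""
    fields = line.split()
    text_id, line_no = fields[0], int(fields[1])
    pairs = []
    for token in fields[2:]:
        # "^" is assumed to be a sentence-start marker, not a word.
        if token == "^" or "_" not in token:
            continue
        # Split on the LAST underscore so the tag is always the final part.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return text_id, line_no, pairs


# Count how often each tag occurs in the sample.
tag_counts = Counter(
    tag
    for line in SAMPLE.splitlines()
    for _, tag in parse_line(line)[2]
)

print(parse_line(SAMPLE.splitlines()[1]))
print(tag_counts.most_common(3))
```

Counts of this kind (tag frequencies, word and tag pairs) are the raw material for the sort of empirical evidence a tagged corpus is meant to provide.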
Concordance output
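A concordance lists every occurrence of a search word together with a few words of context on either side (keyword in context, KWIC). The following is a minimal Python sketch of the idea, not the output of any particular concordancer; the function name kwic, the context width of three words and the whitespace tokenisation are all assumptions made for illustration. The text is the LOB sentence from the earlier sample with its tags removed.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node word, right context) for each hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


TEXT = (
    "a move to stop Mr Gaitskell from nominating any more labour "
    "life peers is to be made at a meeting of labour MPs tomorrow"
)
tokens = TEXT.split()  # naive whitespace tokenisation

# Right-align the left context so the node words line up in a column.
for left, node, right in kwic(tokens, "labour"):
    print(f"{left:>25}  {node}  {right}")
```

Aligning the node word in a single column is what makes recurring patterns to its left and right easy to spot by eye.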
The age of Mega Corpora
COBUILD
John Sinclair at University of Birmingham
originally 20 m words
now the Bank of English, with over 300 m words
the more the better
no fixed size: the idea of a Monitor corpus
British National Corpus
A major undertaking in the mid-nineties
Birmingham, Lancaster – OUP, Longman, Chambers
100 m words carefully compiled
10 m words of spoken data!
up-to-date standard SGML encoding
still the paradigm example of a reference corpus
Accessing the BNC
BNC-Baby
Searching LOB/BROWN
International Corpus of English
A network of corpora covering regional varieties of English
Project organized by UCL London
Each containing c. 1 m words
GB, Hong Kong, Australia, East Africa; more in preparation
ICE-HK
ICE-GB: sociolinguistic variation
ICE-GB: syntactic annotation
Treebanks
Geoffrey Sampson
Meticulously hand-crafted syntactic annotation
SUSANNE
CHRISTINE
LUCY
Penn Treebank
University of Pennsylvania
Massive amounts of automatically annotated data aimed at natural language processing work
International Corpus of Learner English
International Centre of English Corpus Linguistics, Catholic University of Louvain, led by Sylviane Granger
collection of essays
student profiles
Hungarian-English in preparation
SUSANNE Corpus
Aims of the Scheme
comprehensive: covering all features of surface and logical English grammar that are definite enough to be susceptible of formal annotation, and including all phenomena that occur in practice in modern English
explicit: if two researchers at separate sites are given the same sample of English and asked to annotate it according to the SUSANNE standards, their annotations should be identical
nonpartisan: where aspects of grammar are the subject of theoretical controversy, the SUSANNE scheme aims to embody a neutral analysis which rival theoreticians can interpret in their own preferred terms
The Web as a corpus
Why sample when you can access the whole?
Huge and ever changing
The ultimate in authenticity?
Not necessarily …
The Webcorp project
http://devoted.to/corpora