
Corpus and bnc



BNC and its Online Use


Corpus and Famous Corpora

Presenter Memoona Butt

Roll No. 02


Corpus

• A corpus can be defined as a systematic collection of naturally occurring texts in electronic form.


Corpus linguistics

• Corpus linguistics is the study of language and linguistic phenomena through the analysis of data obtained from a corpus.
• Corpus linguistics is the analysis of text with the help of a computer, i.e. with specialized software.


• A corpus is always designed for a particular purpose, so the usefulness of a ready-made corpus must be judged with regard to the purpose to which a user intends to put it.


Famous corpora

• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus
• The London-Lund Corpus
• The British National Corpus


The Brown Corpus

• The Brown Corpus of Standard American English was the first of the modern, computer-readable, general corpora. The corpus consists of one million words of American English texts printed in 1961.


The Lancaster-Oslo/Bergen Corpus

• The Lancaster-Oslo/Bergen Corpus is a one-million-word collection of British English texts, compiled in the 1970s through collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Center for the Humanities, Bergen.


The London-Lund Corpus

• The London-Lund Corpus of English derives from two projects: the Survey of English Usage at University College London and the Survey of Spoken English, which was started at Lund University in 1975. The corpus consists of 500,000 words of spoken British English.


The British National Corpus

• The British National Corpus is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century.


Creation of BNC:

• The project was developed by a consortium called the BNC Consortium.
• It is an industrial/academic consortium led by Oxford University Press, whose other members include further dictionary publishers.


• The Consortium was formed in 1990 and started work in 1991 on the three-year task of producing a hundred-million-word corpus of modern British English for use in commercial and academic research. All major decisions regarding the BNC are still made by the Consortium.


• The BNC comprises approximately 100 million words of:
• Written texts (90%)
• Transcripts of speech (10%)


Why we use BNC

• The BNC can be used to discover aspects of a word that we did not know about and to check our intuitions about its meaning. Moreover, the corpus can help us find out what a word actually means in use, not just what we think it means. We can also use the BNC to check whether a word occurs in the corpus at all.


Properties of British National Corpus

Presented by: Hadia Tabassum


Features of British National Corpus

The BNC is a sample of 100 million words of spoken and written British English. It is a balanced, finite corpus that contains approximately 90% written data and 10% spoken data.


Spoken components/data in BNC:

Spoken components
• The conversational part
• The task-oriented part


The conversational part:

• This part is largely based on recordings of everyday conversational interaction engaged in by some 127 adults aged 15 and over. Some additional recordings of speakers under fifteen were included from COLT. The volunteers were selected by the demographic criteria of age, social group, and sex, with the aim of obtaining approximately equal numbers in each group. The conversational part makes up just over 40% of the spoken corpus.


Respondents in the "conversational part" were selected according to the following properties:

Age            Social group   Sex            Percentage
Under fifteen  Upper class    Male           41.14
15-24          Middle class   Female         58.47
25-34          Lower class    Unclassified   0.38


The task-oriented part:

The material in this part was intended to represent those types of task-oriented spoken activity that were unlikely to be recorded by the conversational volunteers during a typical day in their lives, e.g. lectures, consultations, sermons, and TV/radio broadcasts. This part makes up about 60% of the spoken corpus.


The written components:

Written components
• Imaginative: mostly fiction
• Informative: non-fiction


Continued...

Imaginative texts account for about 20% and informative texts for about 80% of the written component. The imaginative texts are further divided into categories such as prose and poetry. The informative texts, on the other hand, are subdivided into eight categories:

1. Arts
2. Natural sciences
3. Commerce
4. Applied sciences
5. Leisure
6. Social sciences
7. Belief and thought
8. World affairs
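The percentages quoted on these slides can be turned into rough word counts. The sketch below is only an illustration of that arithmetic: it treats "approximately 100 million", "about 20%", and "just over 40%" as exact round figures, and the function and key names are made up for this example.

```python
# Rough size of each BNC component, using the rounded shares from the slides.
# All figures are approximations; the names below are illustrative only.
TOTAL_WORDS = 100_000_000


def component_sizes(total=TOTAL_WORDS):
    """Return approximate word counts per component as a dictionary."""
    written = total * 90 // 100   # written texts: about 90%
    spoken = total * 10 // 100    # transcripts of speech: about 10%
    return {
        "written": written,
        "spoken": spoken,
        "written/imaginative": written * 20 // 100,   # about 20% of written
        "written/informative": written * 80 // 100,   # about 80% of written
        "spoken/conversational": spoken * 40 // 100,  # just over 40% of spoken
        "spoken/task-oriented": spoken * 60 // 100,   # about 60% of spoken
    }


sizes = component_sizes()
for name, words in sizes.items():
    print(f"{name:24s}{words:>12,d}")
```

Under these rounded shares the written component comes to about 90 million words and the spoken component to about 10 million; the real counts differ slightly because the slide percentages are themselves approximate.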


Abbreviations and acronyms:

The BNC records the same abbreviated sequence in several different written forms, such as P.C., PC, and P.C. Conversely, the same form may reflect different origins (police constable, postcard, personal computer).
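When counting such abbreviations, a user may want to group the spelling variants together. The following is a minimal sketch of one way to do that in Python; the function name and the variant list are assumptions made for this example and are not part of any BNC tool.

```python
import re
from collections import Counter


def normalize_abbreviation(token: str) -> str:
    """Collapse dotted or spaced variants such as 'P.C.' or 'p. c' to 'PC'."""
    return re.sub(r"[.\s]", "", token).upper()


# Hypothetical variants of one abbreviated sequence, as they might be typed.
variants = ["P.C", "PC", "P.C.", "p.c."]
counts = Counter(normalize_abbreviation(v) for v in variants)
print(counts)
```

Normalizing the spelling only groups the written forms. It does not tell us whether a given "PC" means police constable, postcard, or personal computer; that still has to be decided from the surrounding context.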


Monolingual:

Although the BNC includes many different styles, varieties, and genres, it deals only with modern British English and not with other languages used in Britain.


Synchronic:

The BNC covers British English of the late twentieth century, rather than the historical development which produced it. Revised editions of the corpus have, however, been released over time.


Editions of BNC

Presenter Kinza Asghar


First edition

• The first edition of the BNC was completed in 1994.
• The first general release of the corpus for European researchers was announced in February 1995.


BNC World

• BNC World, a slightly revised version, was made available in 2001. The name indicates that the corpus is now available under license worldwide.


The BNC is available in two flavors:

1. Under the single-user license (cost: 50 pounds) you can install the whole corpus and the SARA software on a single machine for personal use.
2. Alternatively, for the same price, you can install just the corpus itself and use whatever software you want.


BNC XML

• BNC XML is the latest version of the British National Corpus.
• XML stands for Extensible Markup Language.
• XML is a set of rules for encoding documents in machine-readable form.


• The main differences between this version and BNC World are:

1. Errors and inconsistencies have been removed.
2. Lemma information has been added.
3. Simplified part-of-speech information has been added.
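To make the lemma and simplified part-of-speech information concrete, here is a small Python sketch that reads them from an XML fragment. The fragment is written by hand for this example and is not copied from the corpus. The attribute names (hw for the lemma, pos for the simplified part of speech, c5 for the detailed tag) are given as we understand the BNC XML markup, so check them against the corpus documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of BNC XML; not real corpus data.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="couch" pos="SUBST">couch </w>'
    '<w c5="VBD" hw="be" pos="VERB">was </w>'
    '<w c5="AJ0" hw="old" pos="ADJ">old</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_words(xml_text):
    """Return (word form, lemma, simplified POS, detailed tag) per <w> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        rows.append((w.text.strip(), w.get("hw"), w.get("pos"), w.get("c5")))
    return rows


rows = read_words(SAMPLE)
for form, lemma, pos, c5 in rows:
    print(f"{form:8s}{lemma:8s}{pos:8s}{c5}")
```

In this fragment the word form "was" carries the lemma "be", which shows why lemma information is useful: a search for the lemma can find every inflected form of a word.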


• BNC XML can be accessed in three ways:

1. Online use.
2. Download the corpus and XAIRA.
3. Download just the corpus and use it with any software you want.


• Two subsets of the BNC have been produced separately:
• BNC Baby
• BNC Sampler


BNC Baby

• BNC Baby is a subset of the BNC. It consists of four one-million-word samples, each compiled as an example of a particular genre: fiction, newspapers, academic writing, and spoken conversation.


BNC Sampler

• The BNC Sampler is a subset of the full BNC. It comprises two samples, one of written and one of spoken material, of one million words each, compiled to mirror the composition of the full BNC as far as possible.

• The Sampler was first created at Lancaster University during the creation of the BNC.


Online use of BNC

• Go to the home page.
• Type the word into the search bar and then click on the search button.
• The results show the contexts in which the word is used.
• For instance, if we look for the word "couch", the corpus will show us its collocations, frequency, and KWIC (key word in context) lines.
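The three kinds of output mentioned above (frequency, KWIC lines, and collocations) can be imitated on a tiny text. This is only a sketch of the general idea in Python: the sample sentences are invented, the window of three words is an arbitrary choice, and the online BNC interface uses its own, more refined methods.

```python
import re
from collections import Counter

# Invented sample text; a real search would run over the corpus itself.
TEXT = ("She sat on the couch and read. The old couch creaked. "
        "He bought a leather couch for the flat.")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, window=3):
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


tokens = tokenize(TEXT)
frequency = tokens.count("couch")
lines = kwic(tokens, "couch")
colls = collocates(tokens, "couch")

print("frequency:", frequency)
for left, key, right in lines:
    print(f"{left:>20s} [{key}] {right}")
print("top collocates:", colls.most_common(3))
```

Raw neighbour counts like these are dominated by very common words such as "the". Corpus tools usually rank collocations with a statistical measure instead, which this sketch does not attempt.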
