Why We Need Corpora and the Sketch Engine

Adam Kilgarriff
Lexical Computing Ltd, UK
Universities of Leeds and Sussex

Madrid April 2010 Kilgarriff: Why corpora and how 2

Corpora show us the facts of the language

Exercise

Think about the word planet. What could you say about it if you were writing a dictionary entry? Write down three (or more) things.

The Sketch Engine: demo

http://www.sketchengine.co.uk

Dictionaries

How to decide what to say about the word?

• What the native speaker knows (introspection)
• What other dictionaries say
• Corpus

Four ages of corpus lexicography

Age 1: Pre-computer

Oxford English Dictionary: 20 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography

Age 2: limitations

As corpora get bigger: too much data

• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics

Problem: too much data. How to summarise?

Solution: a list of the words occurring in the neighbourhood of the headword, with frequencies, sorted by salience
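
The collocation listing described above could be sketched roughly as follows. This is an illustration only, not the Sketch Engine's implementation: the toy token list, the window size and the choice of logDice as the salience score are all assumptions, since the slides do not name a statistic.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """Return (word, hits, salience) for words found to the right of `node`.

    Salience here is logDice, an assumed stand-in for whatever
    statistic a real tool would use.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count every word in the window to the right of the node word.
            for w in tokens[i + 1 : i + 1 + window]:
                pair_freq[w] += 1

    rows = []
    for w, f_xy in pair_freq.items():
        if f_xy < min_hits:
            continue
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y))
        score = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[w]))
        rows.append((w, f_xy, score))
    # Most salient collocates first.
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Toy data, invented for the example.
tokens = "we save money and we save lives and the plan will save money".split()
result = collocates(tokens, "save", window=1)
for word, hits, score in result:
    print(word, hits, round(score, 2))
```

Sorting by a salience score rather than raw frequency is what keeps very common words from dominating the list.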

Collocation listing

Collocates of save (more than 5 hits), to the right of the node word:

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

Age-3 collocation statistics: limitations

• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions

What we really want:

• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch

• Large, well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc.
• One list for each grammatical relation
• Statistics to sort each list, as before
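
The idea of one list per grammatical relation can be illustrated with a short sketch. It assumes the parser's output is already available as (relation, headword, collocate) triples; that triple format, the function name and the toy data are hypothetical, and raw counts stand in for the salience statistic used to sort each list.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the collocates of `headword` by grammatical relation.

    Each list is sorted by raw count, a simple stand-in for a
    salience statistic.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Invented parser output, for illustration only.
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "plan"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]

sketch = word_sketch(triples, "save")
for relation, items in sketch.items():
    print(relation, items)
```

Because objects, subjects and modifiers each get their own list, the mixing of types seen in a plain collocation listing does not arise.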

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002, 2007

Demo part 2

Fruit task

• Choose fruit
• Concordance: lemma, noun, lower case
• Frequency: node forms
• Write down: plural freq (pl), singular freq (sing)
• Compute proportion: pl/(pl+sing)
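
The last step of the task is simple arithmetic and can be checked with a few lines of code. The function name is hypothetical and the counts below are placeholders, not real corpus frequencies.

```python
def plural_proportion(plural_freq, singular_freq):
    """Return pl / (pl + sing), the share of plural forms of a noun."""
    total = plural_freq + singular_freq
    if total == 0:
        raise ValueError("the word does not occur in the corpus")
    return plural_freq / total


# Placeholder counts; substitute the figures read off the node forms list.
proportion = plural_proportion(300, 700)
print(f"{proportion:.2f}")
```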

What is a corpus?

A collection of texts (as used for linguistic study)

Which texts? How many?

Which texts?

Written Spoken

Written

• Books: fiction, non-fiction, textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Spoken

• Must be transcribed, for text corpora
• Conversation: who? Region, class, age-group, situation…
• Lectures
• TV and radio
• Film transcripts
• Meetings, seminars
• …

Which texts?

Different purposes, different text types

Making dictionaries:

• Cover the whole language
• Some of everything

How much?

• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora

Zipf’s Law

Word (pos)       rank (r)   frequency (f)   r x f
the (det)               1         6187267   6187267
to (prep)              10          917579   9175790
as (adv)              100           91583   9158300
playing (vb)         1000            9738   9738000
paint (vb)           2000            4539   9078000
amateur (adj)      10,000             741   7410000
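
The pattern in the table, that rank times frequency stays roughly constant, can be recomputed from the figures given. The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# (word, rank, frequency) as given in the table above.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]

# Zipf's Law predicts that rank * frequency is roughly constant.
products = [rank * freq for _, rank, freq in rows]
for (word, rank, freq), product in zip(rows, products):
    print(f"{word:10} {rank:>6} {freq:>9} {product:>10}")

# How far apart are the largest and smallest products?
spread = max(products) / min(products)
print(f"max/min of r x f: {spread:.2f}")
```

Rank spans four orders of magnitude in this table, while the products differ by less than a factor of two.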

Zipf’s Law

• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare

Zipf’s Law

[Bar chart: % of all texts (0 to 100) accounted for by 'the', the 100 most frequent, the 3500 most frequent and the 7500 most frequent words]

Leading English Corpora: Size

[Chart: size of corpora (in words), from 10^6 to 10^9, by decade from the 1960s to the 2000s. Corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Good news

The web

Thank you

http://www.sketchengine.co.uk

