Why We Need Corpora and the Sketch Engine
Adam Kilgarriff
Lexical Computing Ltd, UK
Universities of Leeds and Sussex
Madrid April 2010 Kilgarriff: Why corpora and how 3
Exercise
Think about the word planet.
What could you say about it if you were writing a dictionary entry?
Write down three (or more) things.
The Sketch Engine: demo
http://www.sketchengine.co.uk
Dictionaries
How to decide what to say about the word?
• What the native speaker knows (introspection)
• What other dictionaries say
• The corpus
Age 1: Pre-computer
Oxford English Dictionary:
• 20 million index cards
Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography
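A KWIC (key word in context) concordance lists every occurrence of a node word with a few words of context on each side, aligned on the node. The following is only a minimal sketch of the idea: the function names, the whitespace tokenisation and the sample sentence are assumptions made for illustration, not how any particular concordancer works.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the node words line up in a column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {node}  {right}" for left, node, right in lines]


# Invented sample text, split on whitespace (a real corpus needs a proper tokeniser).
text = "we must save the forests and save money to save lives"
for line in format_kwic(kwic(text.split(), "save", width=3)):
    print(line)
```

Each output line is one occurrence, which is why the number of lines to read grows with the size of the corpus.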
Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no
Age 3: Collocation statistics
Problem: too much data. How to summarise?
Solution: list of words occurring in the neighbourhood of the headword, with frequencies, sorted by salience
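The slides do not say which salience statistic is used, so the sketch below assumes one association score, logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). The function name, the window size and the sample tokens are invented for illustration. It counts words in a window to the right of the node word and sorts them by that score rather than by raw frequency.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """List (collocate, co-occurrence count, logDice) for words right of `node`."""
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count every token in the window to the right of this occurrence.
            for w in tokens[i + 1:i + 1 + window]:
                pair[w] += 1
    rows = []
    for w, f_xy in pair.items():
        if f_xy >= min_hits:
            # logDice is an assumed choice of score. Window counts are only
            # approximate co-occurrence counts, so the usual ceiling of 14
            # is not strictly guaranteed here.
            score = 14 + math.log2(2 * f_xy / (freq[node] + freq[w]))
            rows.append((w, f_xy, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented sample data.
tokens = "save money now save lives now save money".split()
for word, hits, score in collocates(tokens, "save", window=1):
    print(f"{word:10} {hits:3d} {score:6.2f}")
```

The `min_hits` threshold plays the role of a frequency cut-off such as ">5 hits" in a collocation listing.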
Collocation listing
For collocates of save (>5 hits), to right of nodeword
word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your
Age 3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions
What we really want:
• Noise-free lists
• One list for each grammatical relation
Age 4: The word sketch
• Large, well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc.
• One list for each grammatical relation
• Statistics to sort each list, as before
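The step from a single collocate list to one list per grammatical relation can be sketched as a grouping operation over parser output. Everything below is assumed for illustration: the `(relation, headword, collocate)` triple format, the relation names and the data are made up, and plain frequency stands in for the salience statistic.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top=3):
    """Group the collocates of `headword` by grammatical relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Frequency ordering is a placeholder for sorting by a salience score.
    return {rel: counts.most_common(top) for rel, counts in by_relation.items()}


# Hypothetical parser output for the verb "save".
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]
for relation, items in word_sketch(triples, "save").items():
    print(relation, items)
```

Because each relation has its own list, objects such as "money" no longer compete with adverbs such as "annually" for a place in the listing.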
Macmillan English Dictionary for Advanced Learners
Ed.: Rundell, 2002, 2007
Fruit task
• Choose fruit
• Concordance: lemma, noun, lower case
• Frequency: node forms
• Write down: plural freq (pl), singular freq (sing)
• Compute proportion: pl/(pl+sing)
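The last step of the task is simple arithmetic. As a small sketch, with a hypothetical function name and made-up frequencies rather than real corpus counts:

```python
def plural_proportion(pl, sing):
    """Share of a noun's occurrences that are plural: pl / (pl + sing)."""
    total = pl + sing
    if total == 0:
        raise ValueError("the word does not occur in the corpus")
    return pl / total


# Made-up counts: 300 plural hits and 100 singular hits.
print(plural_proportion(300, 100))
```

A proportion near 1 means the noun is mostly used in the plural; a proportion near 0 means it is mostly used in the singular.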
What is a corpus?
A collection of texts (as used for linguistic study)
Which texts? How many?
Written
• Books: fiction, non-fiction, textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …
Spoken
• Must be transcribed, for text corpora
• Conversation: who? Region, class, age-group, situation…
• Lectures
• TV and radio
• Film transcripts
• Meetings, seminars
• …
Which texts?
Different purposes, different text types
Making dictionaries:
• Cover the whole language
• Some of everything
How much?
• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora
Zipf’s Law
Word (pos)       r        f         r x f
the (det)        1        6187267   6187267
to (prep)        10       917579    9175790
as (adv)         100      91583     9158300
playing (vb)     1000     9738      9738000
paint (vb)       2000     4539      9078000
amateur (adj)    10,000   741       7410000
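The point of the table is that rank times frequency stays roughly constant while rank itself ranges from 1 to 10,000. The short check below recomputes the r x f column from the table's own r and f values; the variable names are just illustrative.

```python
# (word, rank r, frequency f) copied from the table above.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]

products = {word: r * f for word, r, f in rows}
for word, value in products.items():
    print(f"{word:10} {value:>10}")

# Rank spans four orders of magnitude, but the products differ by less than 2x.
spread = max(products.values()) / min(products.values())
print(f"largest / smallest product: {spread:.2f}")
```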
Zipf’s Law
• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
Zipf’s Law
[Bar chart: % of all texts (axis 0 to 100) accounted for by 'the', the 100 most frequent, the 3500 most frequent and the 7500 most frequent words]
Leading English Corpora: Size
[Chart: size of corpora (in words), on a scale from 10^6 to 10^9, by decade from the 1960s to the 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]