Do we still need corpora (now that we have the Web)? Silvia Bernardini University of Bologna, Italy [email protected] Postgraduate Conference in Corpus linguistics 22 May 2008


Page 1: Do we still need corpora  (now that we have the Web)?

Do we still need corpora (now that we have the Web)?

Silvia Bernardini

University of Bologna, Italy

[email protected]

Postgraduate Conference in Corpus linguistics

22 May 2008

Page 2: Do we still need corpora  (now that we have the Web)?

The corpus

• A collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis. (Francis 1992(1982):17)

• A collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair 1991:171)

• A closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria. (Engwall 1992:167)

• Finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery and Wilson 1996:23)

• A collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety. (McEnery et al. 2006:5)

Page 3: Do we still need corpora  (now that we have the Web)?

The Web

• A mine of language data of unprecedented richness (Lüdeling et al 2007)

• A fabulous linguists’ playground (Kilgarriff and Grefenstette 2003)

• [a] cheerful anarchy (Sinclair 2004)

• A helluva lot of text, stored on computers… (Leech 1992:106)

Page 4: Do we still need corpora  (now that we have the Web)?

Is the Web a corpus? Yes!

The definition of corpus should be broad. We define a corpus simply as “a collection of texts”. If that seems too broad, the one qualification we allow relates to the domains and contexts in which the word is used […]: A corpus is a collection of texts when considered as an object of language or literary study. The answer to the question “Is the web a corpus?” is yes.

Kilgarriff and Grefenstette (2003:334)

Page 5: Do we still need corpora  (now that we have the Web)?

Is the Web a corpus? No!

The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder. […] users of a corpus assume that there is a consistency of selection, processing and management of the texts in the corpus.

Corpora should be designed and constructed exclusively on external criteria.

(Sinclair 2005)

Page 6: Do we still need corpora  (now that we have the Web)?

This talk

• The Web and the corpus
  – Disambiguating the WaC acronym
  – Where the Web wins out
  – Where the corpus holds its ground

• Web as Corpus initiatives @ Forlì
  – The BootCaT way
  – The WaCky! way

• Open issues and ways forward

Page 7: Do we still need corpora  (now that we have the Web)?

Web as Corpus?

• (The Web corpus “proper”)
• The Web as a corpus surrogate
• The Web as a corpus supermarket
• The mega-corpus (or mini-Web)

Page 8: Do we still need corpora  (now that we have the Web)?

The Web as a corpus surrogate

• Googleology…
• e.g.: Keller and Lapata (2003)
  – Predicate-argument bigrams
  – adj-noun, noun-noun, verb-noun
  – not attested in the BNC

“Web counts correlate reliably with [human plausibility] judgments, for all three types of predicate-argument bigrams tested, both seen and unseen. For the seen bigrams, […] the Web frequencies correlate better with judged plausibility than corpus frequencies” (ibid: 481).

• … is bad science

“Working with commercial search engines makes us develop workarounds. We become experts in the syntax and constraints of Google, Yahoo!, Altavista, and so on. We become ‘googleologists’” (Kilgarriff 2007:147)

Page 9: Do we still need corpora  (now that we have the Web)?

Google…

• Unreplicable
  – Véronis (2005): 5 billion "the" have disappeared overnight
  – Kilgarriff (2007:148): “queries are sent to different computers, at different points in the update cycle, and with different data in their caches”

• Uncontrollable
  – Asterisk treated as placeholder for 1 word or more than 1 word
  – Punctuation and capitalisation disregarded (even in phrases)
  – Search hits are per page
  – Ranking criteria and result sorting (popularity, geographic relevance, …)

• Linguistically naïve
  – No morphosyntactic annotation
    • 36 queries to extract fulfill + obligation (Keller and Lapata 2003)
    • Impossible to extract fulfill + NOUN
  – Unsophisticated query language
    • No sub-string matching
    • No span options
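
Why one word pair turns into dozens of queries can be sketched as follows. Without lemmatisation or span options, every inflected form and every intervening determiner has to be spelled out as a separate exact-phrase query. The forms and determiners below are illustrative assumptions, not the query set actually used by Keller and Lapata (2003), so the count differs from their 36.

```python
from itertools import product

def expand_queries(verb_forms, determiners, noun_forms):
    """Return one exact-phrase query per verb form x determiner x noun form."""
    queries = []
    for verb, det, noun in product(verb_forms, determiners, noun_forms):
        # An empty determiner means the noun directly follows the verb.
        words = [verb, det, noun] if det else [verb, noun]
        queries.append('"' + " ".join(words) + '"')
    return queries

# Hypothetical expansion lists, chosen only for illustration.
verb_forms = ["fulfill", "fulfills", "fulfilled", "fulfilling"]
determiners = ["", "the", "any"]
noun_forms = ["obligation", "obligations"]

queries = expand_queries(verb_forms, determiners, noun_forms)
print(len(queries), "queries, e.g.", queries[0])
```

The number of queries is the product of the list lengths, so each extra spelling variant or determiner multiplies the work. An open slot such as fulfill + NOUN cannot be enumerated this way at all.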

Page 10: Do we still need corpora  (now that we have the Web)?

SE post-processors?

• e.g. WebCorp, KWiCFinder
  – Wildcards and tamecards
  – Concordance output
  – Collocation

• Not a solution, really
  – Slow
  – Same limits as SE

Page 11: Do we still need corpora  (now that we have the Web)?

The Web as a corpus supermarket

• Selecting and downloading texts
  – General or specialized
  – Can be automatised (infra)

• e.g. (general):
  – Leeds Internet corpora (Sharoff 2006)
    • English, Chinese, Finnish, French, German, Italian, Japanese
    • Lemmatised and pos-tagged
    • Indexed with the CWB and searchable online (CQP)
  – Fletcher’s WaC (Fletcher 2007)
    • ~500M words of English (AU, CA, GB, IE, NZ, US)
    • will be pos-tagged

Page 12: Do we still need corpora  (now that we have the Web)?

Pros

• “Traditional” corpus =>
  – Replicable results
  – Control over corpus contents

• In principle
  – Control over search methods
  – Linguistically sophisticated searches supported

Page 13: Do we still need corpora  (now that we have the Web)?

BUT…

• Compromise between Web and corpus =>
  – Relying on SE (Google, LiveSearch)
  – Size
  – Up-to-dateness
  – Understanding of corpus contents/structure
  – Variety of corpus contents
  – Noise

Page 14: Do we still need corpora  (now that we have the Web)?

The mega-corpus/miniweb

• Baroni (2007): Effort spent by NLP community in developing Google-skills would be better spent building our own Google-sized corpora

• None available so far, but:
  – WebCorp (Renouf et al. 2007)
  – The WaCky! effort (infra)

• Ultimate objective: build a linguist’s search engine for the Web

Page 15: Do we still need corpora  (now that we have the Web)?

Where the Web wins out

• Up-to-dateness
• Size
• Convenience
  – Cost
  – Ease of collection
  – Under-resourced languages
• Web-specific genres
• Reference purposes

Page 16: Do we still need corpora  (now that we have the Web)?

Where the corpus holds its ground

• Selection on external criteria
  – Cf.: a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005)
• Register/genre control
• Representativeness and documentation
• Pre- or non-Web genres

Page 17: Do we still need corpora  (now that we have the Web)?

e.g.: McEnery et al 2007

• Collocation information for learners’ dictionaries
• “Help”: Full or bare infinitive?
  – Varieties of English, language change, syntactic environment
• Acquisition of grammatical morphemes
  – Learner language
• Swearing in modern British English
  – writing vs. speaking
  – sociolinguistic variables
• Conversation vs. formal speech in AmEng
• Aspect marking in English-Chinese translation
  – Parallel corpora
  – Cf. Resnik and Smith (2003)

Page 18: Do we still need corpora  (now that we have the Web)?

Two approaches to the Web as corpus

• The BootCaT way
  1. Select initial seeds (terms)
  2. Query SE for random seed combinations
  3. Retrieve pages and format as text (corpus)
  4. Extract new seeds via corpus comparison
  5. Iterate

• Designed for translation students
• Also used for reference corpus building
  – Leeds Internet Corpora
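
Step 4 of the procedure above (extracting new seeds via corpus comparison) can be sketched roughly as below. The slides do not say which comparison statistic BootCaT uses, so this sketch substitutes a smoothed relative-frequency ratio between the retrieved corpus and a reference corpus. The function name, parameters and toy texts are all hypothetical.

```python
from collections import Counter

def extract_new_seeds(spec_tokens, ref_tokens, known_seeds, top_n=5, min_freq=2):
    """Rank words that are relatively more frequent in the specialised corpus."""
    spec = Counter(spec_tokens)
    ref = Counter(ref_tokens)
    spec_total = sum(spec.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, freq in spec.items():
        if freq < min_freq or word in known_seeds:
            continue
        spec_rel = freq / spec_total
        # Add-one smoothing keeps words absent from the reference corpus finite.
        ref_rel = (ref[word] + 1) / (ref_total + 1)
        scores[word] = spec_rel / ref_rel
    # Highest ratio first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

# Toy data, invented for this example.
spec_tokens = ("the wine has a tannic finish and the tannic structure "
               "of the oaky wine with the oaky nose").split()
ref_tokens = "the cat sat on the mat and the dog had a nose for the wine".split()

new_seeds = extract_new_seeds(spec_tokens, ref_tokens, known_seeds={"wine"}, top_n=2)
print(new_seeds)
```

The new seeds would then feed the next round of query generation (step 5).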

Page 19: Do we still need corpora  (now that we have the Web)?

BootCaT pros…

• Implemented in Perl as a set of simple command-line scripts
• Freely available (http://sslmit.unibo.it/~baroni/bootcat.html)
• Documented
• Integrated into the Sketch Engine pipeline
• Community effort
  – WebBootCaT
  – JBootCaT

Page 20: Do we still need corpora  (now that we have the Web)?

An example: wine tasting

Automatic query generation

Seeds:
acetic
acid
acidity
aftertaste
aged
alcohol
appley
aroma
ascescence
astringent
…

Queries:
wine rich unfiltered attractive
wine stylish "malolactic fermentation" sour
wine meager harsh spritzy
wine dumb tobacco direct
wine watery grapey tears
wine hazy breed nouveau
wine spicy flat body
wine vinous spritzy unfined
wine fleshy cigarbox easy
wine puckery sharp nutty
…
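
Queries of this shape could be generated along the following lines. This is a minimal sketch, not BootCaT's own Perl code. The function name and parameters are hypothetical. The fixed leading term (`anchor`) and the tuple size of three are read off the example queries rather than documented in the slides.

```python
import math
import random

def make_queries(seeds, n_queries, tuple_size=3, anchor=None, rng=None):
    """Build n_queries distinct queries, each a random combination of seeds."""
    if n_queries > math.comb(len(seeds), tuple_size):
        raise ValueError("not enough seeds for that many distinct combinations")
    rng = rng or random.Random()
    seen = set()
    queries = []
    while len(queries) < n_queries:
        combo = rng.sample(seeds, tuple_size)
        key = frozenset(combo)
        if key in seen:  # same seeds in a different order: skip
            continue
        seen.add(key)
        # Multi-word seeds are quoted so that they are searched as phrases.
        terms = [f'"{s}"' if " " in s else s for s in combo]
        if anchor:
            terms.insert(0, anchor)
        queries.append(" ".join(terms))
    return queries

seeds = ["rich", "unfiltered", "attractive", "stylish", "malolactic fermentation",
         "sour", "meager", "harsh", "spritzy"]
queries = make_queries(seeds, 5, anchor="wine", rng=random.Random(0))
for q in queries:
    print(q)
```

Passing a seeded `random.Random` makes a run repeatable, which helps when documenting how a corpus was built.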

Page 21: Do we still need corpora  (now that we have the Web)?
Page 22: Do we still need corpora  (now that we have the Web)?

“vanilla” collocates (span=1R)

BootCaT wine tasting corpus (English, 1.5M words) vs. BNC
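
A span of 1R means that only the word immediately to the right of the node word is counted. The sketch below shows one way to compute such a list from a tokenised text. The function name and the toy sentence are invented for illustration and are not taken from either corpus.

```python
from collections import Counter

def right_collocates(tokens, node, span=1):
    """Count the words occurring within `span` positions to the right of `node`."""
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            for neighbour in tokens[i + 1 : i + 1 + span]:
                counts[neighbour.lower()] += 1
    return counts

# Toy text, invented for this example.
tokens = ("ripe fruit with vanilla oak and a hint of vanilla spice "
          "then vanilla oak again").split()
collocs = right_collocates(tokens, "vanilla")
print(collocs.most_common())
```

Running the same function over a specialised corpus and a general one gives the kind of side-by-side comparison this slide presents.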

Page 23: Do we still need corpora  (now that we have the Web)?

…and BootCaT cons

• Relies on SE => same limits (cf. supra)
  – …and Google no longer gives out API keys

• Not really an option for very large corpus building projects

Page 24: Do we still need corpora  (now that we have the Web)?

A more ambitious alternative: The Wacky way

• Aim: produce very large (~2bn words) web-derived corpora for several languages

• Collaborative effort, using existing open tools, making developed tools publicly available

• http://wacky.sslmit.unibo.it/
• Wacky corpora currently available:
  – deWaC, itWaC, ukWaC, frWaC

Page 25: Do we still need corpora  (now that we have the Web)?

The Wacky pipeline

• Submit random word combinations to Google and obtain list of URLs (seeding)
• Crawling (Heritrix)
• Code removal and boilerplate stripping
• Language filtering
• Near-duplicate detection
• Tokenization, POS-tagging and lemmatisation
• Indexing and querying
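
The near-duplicate detection step can be illustrated with word n-gram "shingles" and Jaccard overlap. The slides do not describe the method the WaCky pipeline actually uses, so this is only one common approach. The function names, the shingle size and the threshold are assumptions, and the pairwise comparison here would be too slow for a multi-million-document crawl.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Share of n-grams that two documents have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(docs, n=3, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return kept

# Toy documents, invented for this example.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river today",
    "corpus linguistics studies language through large collections of authentic text samples",
]
unique_docs = drop_near_duplicates(docs)
print(len(unique_docs))
```

Large-scale pipelines typically compare a sample or hash of the shingles rather than every pair of full sets.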

Page 26: Do we still need corpora  (now that we have the Web)?

An example: constructing ukWaC

• Seeding: mid-frequency content words (BNC); words from spoken text (BNC); vocabulary list for foreign learners
• Crawl limited to UK domain and HTML
• Processing
  – Only files between 5 and 200 KB kept
  – Perfect duplicates discarded
  – Code, boilerplate, files with unconnected text and pornographic pages removed
  – Near-duplicates removed
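
The first two processing filters (file size and perfect duplicates) are simple enough to sketch. The slide does not say whether the size limits are inclusive, whether "kb" means 1,000 or 1,024 bytes, or whether the first copy of a duplicated file is retained. The choices below (inclusive bounds, 1,024 bytes, first copy kept) are assumptions, as are the function name and the example URLs.

```python
import hashlib

MIN_BYTES = 5 * 1024      # assumed lower bound: 5 KB
MAX_BYTES = 200 * 1024    # assumed upper bound: 200 KB

def filter_documents(docs):
    """Keep (url, raw_bytes) pairs within the size limits, dropping exact duplicates."""
    seen_digests = set()
    kept = []
    for url, raw in docs:
        if not (MIN_BYTES <= len(raw) <= MAX_BYTES):
            continue
        # Byte-identical files share a digest, so only the first copy survives.
        digest = hashlib.sha1(raw).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append((url, raw))
    return kept

# Toy input, invented for this example.
docs = [
    ("http://example.org/a", b"x" * 6000),    # kept
    ("http://example.org/b", b"x" * 6000),    # perfect duplicate of a
    ("http://example.org/c", b"y" * 100),     # too small
    ("http://example.org/d", b"z" * 300000),  # too large
]
kept = filter_documents(docs)
print([url for url, _ in kept])
```

Filtering on size before any linguistic processing is cheap and removes most pages with little or no running text.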

Page 27: Do we still need corpora  (now that we have the Web)?

UkWaC: Details and size

• 2,000 seed word pairs
• 6,528 seed URLs
• 351 GB raw crawl size
• 19 GB after document filtering
• 5.69 M documents after filtering
• 12 GB after near-duplicate cleaning
• 2.69 M documents after near-duplicate cleaning
• 30 GB size with annotation
• 1,914,150,197 tokens
• 3,798,106 types
• Further info and availability: http://wacky.sslmit.unibo.it/
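
A few ratios derived from these figures show how much of the crawl survives each stage. The sketch below uses only the numbers on the slide. The variable names are made up, and the per-document and per-type averages are rough, since the slide gives document counts to three significant figures.

```python
raw_crawl_gb = 351
after_filter_gb = 19
after_dedup_gb = 12
docs_after_filter = 5.69e6
docs_after_dedup = 2.69e6
tokens = 1_914_150_197
types = 3_798_106

filter_share = after_filter_gb / raw_crawl_gb       # share of raw data left after filtering
dedup_share = after_dedup_gb / raw_crawl_gb         # share left after near-duplicate cleaning
docs_share = docs_after_dedup / docs_after_filter   # share of documents surviving de-duplication
tokens_per_doc = tokens / docs_after_dedup          # average document length in tokens
tokens_per_type = tokens / types                    # average occurrences per type

print(f"after filtering: {filter_share:.1%} of raw crawl")
print(f"after near-duplicate cleaning: {dedup_share:.1%} of raw crawl")
print(f"documents surviving de-duplication: {docs_share:.1%}")
print(f"tokens per document: {tokens_per_doc:.0f}")
print(f"tokens per type: {tokens_per_type:.0f}")
```

Roughly one twentieth of the raw crawl survives filtering, and a little under half of the filtered documents survive near-duplicate cleaning.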

Page 28: Do we still need corpora  (now that we have the Web)?

A wacky example

Results for wacky+NOUN (>2), Baroni et al. submitted

• BNC: 3 ideas, 2 roles, 2 photo, 2 items, 2 humour, 2 characters

• UkWaC: 71 world, 44 ideas, 43 wigglers, 42 wiggler, 28 characters, 27 sense, 22 comedy, 21 stuff, 21 races, 20 things, 19 idea, 15 humour, 13 games, 12 race, 11 backy, 10 baccy, 10 fun, 10 game, 10 inventions, 10 names, 10 uses
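
A word + NOUN search of this kind depends on POS annotation. The sketch below shows the idea on a list of (word, tag) pairs. It assumes Penn-style tags in which noun tags begin with `NN`, which may differ from the tagset used for ukWaC. The function name, the threshold parameter and the toy data are hypothetical.

```python
from collections import Counter

def following_nouns(tagged_tokens, node, min_freq=1):
    """Count nouns that immediately follow `node` in a POS-tagged token list."""
    counts = Counter()
    for (word, _tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        # Assumption: noun tags start with "NN" (NN, NNS, ...).
        if word.lower() == node and next_tag.startswith("NN"):
            counts[next_word.lower()] += 1
    return [(noun, c) for noun, c in counts.most_common() if c >= min_freq]

# Toy tagged text, invented for this example.
tagged = [
    ("wacky", "JJ"), ("ideas", "NNS"), ("and", "CC"),
    ("wacky", "JJ"), ("races", "NNS"), ("with", "IN"),
    ("wacky", "JJ"), ("ideas", "NNS"),
    ("wacky", "JJ"), ("enough", "RB"),
]
result = following_nouns(tagged, "wacky", min_freq=2)
print(result)
```

The same pattern covers the fulfill + NOUN case, which cannot be expressed as a search engine query.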

Page 29: Do we still need corpora  (now that we have the Web)?

WaC: What the future holds

• Have WaC replaced “traditional” corpora?
  – Not really…

• Challenges
  – Cleaning techniques
  – Web-tuned annotation tools
  – Indexing and querying systems
  – (Automatic) text classification

Page 30: Do we still need corpora  (now that we have the Web)?

Approaches to Web text classification

• Biber and Kurjan (2007)
  – Search engine categories not well defined for purposes of linguistic analysis
    • Google directory
  – Multidimensional analysis
    • text type approach
  – Register approach
    • future work

Page 31: Do we still need corpora  (now that we have the Web)?

Approaches to Web text classification

• Sharoff forthcoming
  – Genre typology based on EAGLES recommendations
    • “Communicative intentions”
    • Discussion, information, instruction, propaganda, recreation, regulations, reporting
  – SVMs to automatically categorise texts in Web corpus
  – Classifiers trained on manually-classified texts
    • BNC + subset of Web corpus

Page 32: Do we still need corpora  (now that we have the Web)?

WaC challenges

• Representativeness

Without representativeness, whatever is found to be true of a corpus, is simply true of that corpus – and cannot be extended to anything else

(Leech 2007:135)

Page 33: Do we still need corpora  (now that we have the Web)?

WaC challenges

• Documentation

Compilers make the best corpus they can in the circumstances, and their proper stance is to be detailed and honest about the contents. From their description of the corpus, the research community can judge how far to trust their results, and future users of the same corpus can estimate its reliability for their purposes.

(Sinclair 2005)

Page 34: Do we still need corpora  (now that we have the Web)?

Thank you

Silvia Bernardini

University of Bologna, Italy

[email protected]

Postgraduate Conference in Corpus linguistics

22 May 2008