Do we still need corpora (now that we have the Web)?


Silvia Bernardini
University of Bologna, Italy


Postgraduate Conference in Corpus linguistics

22 May 2008

The corpus

• A collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis. (Francis 1992 (1982):17)

• A collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair 1991:171)

• A closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria. (Engwall 1992:167)

• Finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery and Wilson 1996:23)

• A collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety. (McEnery et al. 2006:5)

The Web

• A mine of language data of unprecedented richness (Lüdeling et al 2007)

• A fabulous linguists’ playground (Kilgarriff and Grefenstette 2003)

• [a] cheerful anarchy (Sinclair 2004)

• A helluva lot of text, stored on computers… (Leech 1992:106)

Is the Web a corpus? Yes!

The definition of corpus should be broad. We define a corpus simply as “a collection of texts”. If that seems too broad, the one qualification we allow relates to the domains and contexts in which the word is used […]: A corpus is a collection of texts when considered as an object of language or literary study. The answer to the question “Is the web a corpus?” is yes.

Kilgarriff and Grefenstette (2003:334)

Is the Web a corpus? No!

The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder. […] users of a corpus assume that there is a consistency of selection, processing and management of the texts in the corpus.

Corpora should be designed and constructed exclusively on external criteria.

(Sinclair 2005)

This talk

• The Web and the corpus
  – Disambiguating the WaC acronym
  – Where the Web wins out
  – Where the corpus holds its ground

• Web as Corpus initiatives @ Forlì
  – The BootCaT way
  – The WaCky! way

• Open issues and ways forward

Web as Corpus?

• (The Web corpus “proper”)
• The Web as a corpus surrogate
• The Web as a corpus supermarket
• The mega-corpus (or mini-Web)

The Web as a corpus surrogate

• Googleology…
• e.g.: Keller and Lapata (2003)
  – Predicate-argument bigrams
  – adj-noun, noun-noun, verb-noun
  – not attested in the BNC

“Web counts correlate reliably with [human plausibility] judgments, for all three types of predicate-argument bigrams tested, both seen and unseen. For the seen bigrams, […] the Web frequencies correlate better with judged plausibility than corpus frequencies” (ibid: 481).

• … is bad science

“Working with commercial search engines makes us develop workarounds. We become experts in the syntax and constraints of Google, Yahoo!, Altavista, and so on. We become ‘googleologists’” (Kilgarriff 2007:147)

Google…

• Unreplicable
  – Véronis (2005): 5 billion "the" have disappeared overnight
  – Kilgarriff (2007:148): “queries are sent to different computers, at different points in the update cycle, and with different data in their caches”

• Uncontrollable
  – Asterisk treated as placeholder for 1 word or more than 1 word
  – Punctuation and capitalisation disregarded (even in phrases)
  – Search hits are per page
  – Ranking criteria and result sorting (popularity, geographic relevance, …)

• Linguistically naïve
  – No morphosyntactic annotation
    • 36 queries to extract fulfill + obligation (Keller and Lapata 2003)
    • Impossible to extract fulfill + NOUN
  – Unsophisticated query language
    • No sub-string matching
    • No span options

SE post-processors?

• e.g. WebCorp, KWiCFinder
  – Wildcards and tamecards
  – Concordance output
  – Collocation

• Not a solution, really
  – Slow
  – Same limits as SE

The Web as a corpus supermarket

• Selecting and downloading texts
  – General or specialized
  – Can be automatised (infra)

• e.g. (general):
  – Leeds Internet corpora (Sharoff 2006)
    • English, Chinese, Finnish, French, German, Italian, Japanese
    • Lemmatised and POS-tagged
    • Indexed with the CWB and searchable online (CQP)
  – Fletcher’s WaC (Fletcher 2007)
    • ~500M words of English (AU, CA, GB, IE, NZ, US)
    • will be POS-tagged

Pros

• “Traditional” corpus =>
  – Replicable results
  – Control over corpus contents

• In principle
  – Control over search methods
  – Linguistically sophisticated searches supported

BUT…

• Compromise between Web and corpus =>
  – Relying on SE (Google, LiveSearch)
  – Size
  – Up-to-dateness
  – Understanding of corpus contents/structure
  – Variety of corpus contents
  – Noise

The mega-corpus/miniweb

• Baroni (2007): Effort spent by NLP community in developing Google-skills would be better spent building our own Google-sized corpora

• None available so far, but:
  – WebCorp (Renouf et al. 2007)
  – The WaCky! effort (infra)

• Ultimate objective: build a linguist’s search engine for the Web

Where the Web wins out

• Up-to-dateness
• Size
• Convenience
  – Cost
  – Ease of collection
  – Under-resourced languages
• Web-specific genres
• Reference purposes

Where the corpus holds its ground

• Selection on external criteria
  – Cf.: a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005)

• Register/genre control
• Representativeness and documentation
• Pre- or non-Web genres

e.g.: McEnery et al 2007

• Collocation information for learners’ dictionaries
• “Help”: Full or bare infinitive?
  – Varieties of English, language change, syntactic environment
• Acquisition of grammatical morphemes
  – Learner language
• Swearing in modern British English
  – writing vs. speaking
  – sociolinguistic variables
• Conversation vs. formal speech in AmEng
• Aspect marking in English-Chinese translation
  – Parallel corpora
  – Cf. Resnik and Smith (2003)

Two approaches to the Web as corpus

• The BootCaT way
  1. Select initial seeds (terms)
  2. Query SE for random seed combinations
  3. Retrieve pages and format as text (corpus)
  4. Extract new seeds via corpus comparison
  5. Iterate

• Designed for translation students
• Also used for reference corpus building
  • Leeds Internet Corpora
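
Step 4 of the BootCaT loop (extracting new seeds via corpus comparison) can be sketched roughly as follows. The slide does not say which statistic BootCaT uses, so the smoothed log frequency ratio, the function name `extract_seeds` and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter
import math

def extract_seeds(target_tokens, reference_tokens, top_n=5, min_freq=2):
    """Rank words that are over-represented in the target corpus
    relative to a reference corpus (illustrative statistic only)."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scores = {}
    for word, freq in target.items():
        if freq < min_freq:
            continue  # skip words too rare to be reliable seeds
        # Add-one smoothing so words absent from the reference still get a score.
        p_target = (freq + 1) / (n_target + 1)
        p_reference = (reference[word] + 1) / (n_reference + 1)
        scores[word] = math.log(p_target / p_reference)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Toy data standing in for the retrieved corpus and a general reference corpus.
target = "the wine is oaky and the wine is tannic with oaky tannic notes".split()
reference = "the cat is on the mat and the dog is in the house with the cat".split()
seeds = extract_seeds(target, reference, top_n=3)
print(seeds)
```

The words returned would then feed step 2 of the next iteration as new query terms.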

BootCaT pros…

• Implemented in Perl as a set of simple command-line scripts

• Freely available (http://sslmit.unibo.it/~baroni/bootcat.html)

• Documented
• Integrated into the Sketch Engine pipeline
• Community effort
  – WebBootCaT
  – JBootCaT

http://sslmit.unibo.it/~baroni/Readme.BootCaT-0.1.2

http://www.sketchengine.co.uk/

http://www.andy-roberts.net/software/jbootcat/

An example: wine tasting

Automatic query generation

acetic, acid, acidity, aftertaste, aged, alcohol, appley, aroma, ascescence, astringent, …

wine rich unfiltered attractive
wine stylish "malolactic fermentation" sour
wine meager harsh spritzy
wine dumb tobacco direct
wine watery grapey tears
wine hazy breed nouveau
wine spicy flat body
wine vinous spritzy unfined
wine fleshy cigarbox easy
wine puckery sharp nutty
…
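
Queries of this shape can be produced by sampling random seed combinations. The following is a minimal sketch, not BootCaT's own code: the function name `make_queries`, the tuple size of three and the fixed "wine" prefix are assumptions read off the example queries above.

```python
import random

def make_queries(seeds, n_queries=10, tuple_size=3, anchor=None, rng=None):
    """Build search-engine queries from random combinations of seed terms."""
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        # Sample without replacement so a term is not repeated within one query.
        terms = rng.sample(seeds, tuple_size)
        # Quote multi-word terms so the search engine treats them as phrases.
        terms = [f'"{t}"' if " " in t else t for t in terms]
        if anchor:
            terms.insert(0, anchor)  # optional fixed term at the start of every query
        queries.append(" ".join(terms))
    return queries

seeds = ["acetic", "acid", "acidity", "aftertaste", "aged", "alcohol",
         "appley", "aroma", "astringent", "malolactic fermentation"]
# A seeded generator is used here only to make the run repeatable.
queries = make_queries(seeds, n_queries=5, anchor="wine", rng=random.Random(0))
for q in queries:
    print(q)
```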

“vanilla” collocates (span=1R)

[Table: BootCaT wine tasting corpus (English, 1.5M words) vs. BNC]
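
A span of 1R means counting only the word immediately to the right of the node word. A minimal sketch of that count, assuming whitespace tokenisation and an invented example sentence (the function name `right_collocates` is hypothetical):

```python
from collections import Counter

def right_collocates(tokens, node, top_n=5):
    """Count the words occurring immediately to the right of `node` (span = 1R)."""
    node = node.lower()
    counts = Counter(
        nxt.lower()
        for cur, nxt in zip(tokens, tokens[1:])  # each token paired with its right neighbour
        if cur.lower() == node
    )
    return counts.most_common(top_n)

text = "vanilla oak and vanilla notes with vanilla oak on the finish"
collocates = right_collocates(text.split(), "vanilla")
print(collocates)
```

Running the same function over a specialised corpus and a general one gives the kind of side-by-side comparison the slide refers to.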

…and BootCaT cons

• Relies on SE => same limits (cf. supra)
  – …and Google no longer gives out API keys

• Not really an option for very large corpus building projects

A more ambitious alternative: the WaCky way

• Aim: produce very large (~2bn words) web-derived corpora for several languages

• Collaborative effort, using existing open tools, making developed tools publicly available

• http://wacky.sslmit.unibo.it/
• WaCky corpora currently available:
  – deWaC, itWaC, ukWaC, frWaC

The Wacky pipeline

• Submit random word combinations to Google and obtain list of URLs (seeding)

• Crawling (Heritrix)
• Code removal and boilerplate stripping
• Language filtering
• Near-duplicate detection
• Tokenization, POS-tagging and lemmatisation
• Indexing and querying
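
To make the near-duplicate detection step concrete, here is a generic sketch based on word n-gram "shingles" and Jaccard similarity. The slide does not say which method the WaCky pipeline uses, so the technique, the function names, the n-gram size and the threshold are all assumptions chosen for illustration.

```python
def shingles(text, n=5):
    """Return the set of word n-grams ("shingles") in a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drop_near_duplicates(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river",
    "corpus linguistics studies language through large collections of text",
]
unique_docs = drop_near_duplicates(docs, n=3)
print(len(unique_docs))
```

This pairwise comparison is quadratic in the number of documents, so a crawl of millions of pages would need a sampled or hashed variant of the same idea.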

An example: constructing ukWaC

• Seeding: mid-frequency content words (BNC); words from spoken text (BNC); vocabulary list for foreign learners

• Crawl limited to UK domain and html
• Processing
  – Only files between 5 and 200 kB kept
  – Perfect duplicates discarded
  – Code, boilerplate, files with unconnected text and pornographic pages removed
  – Near-duplicates removed
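
The first two processing steps (size filter and perfect-duplicate removal) could look roughly like this. It is a sketch under stated assumptions: 1 kB is taken as 1024 bytes, the bounds are inclusive, and the first copy of each set of identical pages is kept; the slide does not say whether the original pipeline keeps one copy or none. The function name `filter_documents` is hypothetical.

```python
import hashlib

MIN_BYTES = 5 * 1024     # assumed lower bound (5 kB)
MAX_BYTES = 200 * 1024   # assumed upper bound (200 kB)

def filter_documents(pages):
    """Drop pages outside the size range, then drop exact duplicates."""
    seen = set()
    kept = []
    for page in pages:
        raw = page.encode("utf-8")
        if not (MIN_BYTES <= len(raw) <= MAX_BYTES):
            continue  # too small or too large
        digest = hashlib.sha1(raw).hexdigest()  # identical content gives an identical digest
        if digest in seen:
            continue  # exact copy of a page already kept
        seen.add(digest)
        kept.append(page)
    return kept

pages = ["a" * 6000, "a" * 6000, "b" * 100, "c" * 300_000, "d" * 7000]
kept = filter_documents(pages)
print(len(kept))
```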

ukWaC: Details and size

• 2,000 seed word pairs
• 6,528 seed URLs
• 351 GB raw crawl size
• 19 GB after document filtering
• 5.69 M documents after filtering
• 12 GB after near-duplicate cleaning
• 2.69 M documents after near-duplicate cleaning
• 30 GB size with annotation
• 1,914,150,197 tokens
• 3,798,106 types
• Further info and availability: http://wacky.sslmit.unibo.it/
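
The figures above show how much of the raw crawl survives cleaning. A short calculation makes the proportions explicit; the variable names are illustrative and the inputs are copied from the list.

```python
# Figures taken from the ukWaC list above.
raw_crawl_gb = 351
after_filtering_gb = 19
after_dedup_gb = 12
docs_after_filtering = 5.69e6
docs_after_dedup = 2.69e6
tokens = 1_914_150_197
types = 3_798_106

kept_after_filtering = after_filtering_gb / raw_crawl_gb   # share of raw data left after filtering
kept_after_dedup = after_dedup_gb / raw_crawl_gb           # share left after near-duplicate cleaning
docs_kept_by_dedup = docs_after_dedup / docs_after_filtering
tokens_per_document = tokens / docs_after_dedup
tokens_per_type = tokens / types

print(f"Data kept after filtering: {kept_after_filtering:.1%}")
print(f"Data kept after near-duplicate cleaning: {kept_after_dedup:.1%}")
print(f"Documents surviving near-duplicate cleaning: {docs_kept_by_dedup:.1%}")
print(f"Average tokens per document: {tokens_per_document:.0f}")
print(f"Average tokens per type: {tokens_per_type:.0f}")
```

Only a few percent of the raw crawl ends up in the corpus, and near-duplicate cleaning alone removes roughly half of the filtered documents.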




A wacky example

Results for wacky+NOUN (>2), Baroni et al. submitted

• BNC: 3 ideas, 2 roles, 2 photo, 2 items, 2 humour, 2 characters

• ukWaC: 71 world, 44 ideas, 43 wigglers, 42 wiggler, 28 characters, 27 sense, 22 comedy, 21 stuff, 21 races, 20 things, 19 idea, 15 humour, 13 games, 12 race, 11 backy, 10 baccy, 10 fun, 10 game, 10 inventions, 10 names, 10 uses
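
A search for wacky+NOUN is only possible on a POS-tagged corpus. The sketch below runs the same kind of search over a list of (word, tag) pairs; the Penn-style tags beginning with "NN", the toy data and the function name are assumptions. In CQP the equivalent query would be along the lines of `[word="wacky"] [pos="NN.*"]`, with attribute names depending on how the corpus was indexed.

```python
from collections import Counter

def adjective_noun_counts(tagged_tokens, adjective, min_freq=1):
    """Count nouns that immediately follow `adjective` in a POS-tagged token list."""
    counts = Counter(
        nxt_word.lower()
        for (word, _), (nxt_word, nxt_pos) in zip(tagged_tokens, tagged_tokens[1:])
        if word.lower() == adjective and nxt_pos.startswith("NN")  # assumed noun tags
    )
    return [(noun, freq) for noun, freq in counts.most_common() if freq >= min_freq]

# Invented toy data; a real corpus would supply these pairs from its annotation.
tagged = [
    ("wacky", "JJ"), ("ideas", "NNS"), ("and", "CC"),
    ("wacky", "JJ"), ("ideas", "NNS"),
    ("wacky", "JJ"), ("races", "NNS"),
    ("wacky", "JJ"), ("enough", "RB"),
]
results = adjective_noun_counts(tagged, "wacky")
print(results)
```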

WaC: What the future holds

• Have WaC replaced “traditional” corpora?
  – Not really…

• Challenges
  – Cleaning techniques
  – Web-tuned annotation tools
  – Indexing and querying systems
  – (Automatic) text classification

Approaches to Web text classification

• Biber and Kurjan (2007)
  – Search engine categories not well defined for purposes of linguistic analysis
    • Google directory
  – Multidimensional analysis
    • text type approach
  – Register approach
    • future work

• Sharoff forthcoming
  – Genre typology based on EAGLES recommendations
    • “Communicative intentions”
    • Discussion, information, instruction, propaganda, recreation, regulations, reporting
  – SVMs to automatically categorise texts in Web corpus
  – Classifiers trained on manually-classified texts
    • BNC + subset of Web corpus

WaC challenges

• Representativeness

Without representativeness, whatever is found to be true of a corpus, is simply true of that corpus – and cannot be extended to anything else

(Leech 2007:135)

Compilers make the best corpus they can in the circumstances, and their proper stance is to be detailed and honest about the contents. From their description of the corpus, the research community can judge how far to trust their results, and future users of the same corpus can estimate its reliability for their purposes.

(Sinclair 2005)

WaC challenges

• Documentation

Thank you
