How Useful Is the Web as a Linguistic Corpus?
William H. Fletcher
United States Naval Academy
2002 North American Symposium on Corpus Linguistics
American Association of Applied Corpus Linguistics
Indianapolis, IN, 1-3 November 2002


Page 1: How Useful Is the Web as a Linguistic Corpus?

How Useful Is the Web as a Linguistic Corpus?

William H. Fletcher
United States Naval Academy

2002 North American Symposium on Corpus Linguistics
American Association of Applied Corpus Linguistics

Indianapolis, IN, 1-3 November 2002

Page 2: How Useful Is the Web as a Linguistic Corpus?

Making the Web More Useful as a Corpus

Objective of this ongoing study
To develop and evaluate linguistic methods and PC tools to identify domain-relevant and linguistically representative documents more efficiently

Long-range goal
To establish the Web both as a "corpus of first resort" and as a supplementary corpus for language professionals and learners

Page 3: How Useful Is the Web as a Linguistic Corpus?

Advantages of Web

Virtually comprehensive coverage of major languages and language varieties, content domains and written text types

Ready availability and low cost throughout developed world

Freshness and topicality: emerging usage and current issues well documented

Easy to compile an ad-hoc corpus to answer a specific question or meet a specialized information need

User familiarity with Web and independent motivation to become more proficient in using it

Page 4: How Useful Is the Web as a Linguistic Corpus?

Disadvantages of Web

Generally unknown provenance and authorship, reliability and authoritativeness of texts, both for content and linguistic form

Predominance of certain text types among coherent texts, especially legal, journalistic, commercial and academic prose

Overall lower standards of form and content verification than printed sources

Systematically accessible only through commercial search engines, which support only very rough search criteria

Counts of a given linguistic feature give only a general numeric indication, not statistical proof

Page 5: How Useful Is the Web as a Linguistic Corpus?

“Noise” Filter for HRDs

Highly Repetitive Documents
• Discussion groups where replies incorporate original post
• Internal links
• Boilerplate
• Search engine spam

Strategy: identify documents with frequent n-grams
• 8-grams, 12-grams, 25-grams useful range
• Either eliminate document or eliminate redundant text
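
The n-gram strategy above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the study: the whitespace tokenizer, the function names and the 0.2 threshold are all assumptions made for the example. It measures what share of a document's 8-grams occur more than once in that document.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_share(text, n=8):
    """Share of n-gram occurrences whose n-gram appears more than once."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def is_highly_repetitive(text, n=8, threshold=0.2):
    """Flag a document as an HRD; the threshold is a hypothetical value."""
    return repeated_ngram_share(text, n) >= threshold
```

A document flagged this way could either be dropped or have its repeated stretches removed, as the slide suggests; longer n-grams (12, 25) make the test stricter.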

Page 6: How Useful Is the Web as a Linguistic Corpus?

“Noise” Filter for VIDs

Virtually Identical Documents
• Mirrored documents with slight differences
• News stories

Rank and absolute frequency of 3- to 5-grams alerts to VIDs
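
The slide does not say how the n-gram rankings are compared, so the following is only one plausible sketch. It assumes that two documents are suspect when their top-ranked 3-grams, taken together with their absolute counts, largely coincide; the function names and the values of `n` and `k` are hypothetical.

```python
from collections import Counter


def top_ngrams(text, n=3, k=20):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common(k)


def profile_overlap(text_a, text_b, n=3, k=20):
    """Share of top-ranked (ngram, count) pairs that two documents have in common."""
    a = set(top_ngrams(text_a, n, k))
    b = set(top_ngrams(text_b, n, k))
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

An overlap close to 1.0 would mark a pair for closer inspection; where to set the cut-off is left open here because the source gives no figure.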

Page 7: How Useful Is the Web as a Linguistic Corpus?

“Noise” Filter for IDs

(Fully) Identical Documents
• Mirrored documents
• Multiple URLs for same document
• Server-generated error messages

A message digest such as MD5 (Message Digest 5, 16-byte code) or SHA-1 (Secure Hash Algorithm, 20-byte code) reduces normalized text of any length to a short fixed-length code with high probability of uniqueness

MD5 codes from thousands of documents can be stored in a binary tree for efficient comparison and elimination of redundant documents
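
As a rough sketch of digest-based elimination, the example below hashes normalized text with Python's standard `hashlib.md5`, which returns a 16-byte digest. Two things are assumptions rather than details from the slides: the normalization (lowercasing and collapsing whitespace) and the use of a Python `set` in place of the binary tree mentioned above. The set is swapped in only because it is the simplest built-in structure for membership tests.

```python
import hashlib


def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def digest(text):
    """Return the MD5 digest (16 bytes) of the normalized text."""
    return hashlib.md5(normalize(text).encode("utf-8")).digest()


def drop_identical(documents):
    """Keep the first (url, text) pair for each distinct digest."""
    seen = set()  # stands in for the binary tree described on the slide
    unique = []
    for url, text in documents:
        code = digest(text)
        if code in seen:
            continue  # same normalized text already stored under another URL
        seen.add(code)
        unique.append((url, text))
    return unique
```

Because only the fixed-length codes are stored, memory use stays small even for thousands of documents.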

Page 8: How Useful Is the Web as a Linguistic Corpus?

? Unproven “Noise” Filters

Microsoft Word Spelling Checker to recognize, normalize ill-formed documents automatically
• Some success; deserves further attention
• Problem: large number of items (personal, commercial and place names, technological terms) not in default lexicon, so it rejects too many good documents.

Patterns of 1- and 2-grams to recognize PFDs (Primarily Fragmentary Documents)
• Some high-frequency types (articles, copula) rare in fragments, others (common prepositions) frequent
• Content words and special terms (see above) relatively prominent
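
The 1-gram observation about fragmentary documents could be explored along these lines. Everything specific in this sketch is an assumption: the two small word lists, the punctuation stripping and the 5% cut-off are illustrative only, and the slide itself labels the approach unproven.

```python
# Hypothetical word lists; a real study would use fuller inventories.
ARTICLES_AND_COPULA = {"a", "an", "the", "is", "are", "was", "were", "be", "been", "am"}
COMMON_PREPOSITIONS = {"of", "in", "for", "on", "with", "by", "at", "from"}


def unigram_shares(text):
    """Return the share of tokens that are articles/copula and prepositions."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return {"articles_copula": 0.0, "prepositions": 0.0}
    total = len(tokens)
    return {
        "articles_copula": sum(t in ARTICLES_AND_COPULA for t in tokens) / total,
        "prepositions": sum(t in COMMON_PREPOSITIONS for t in tokens) / total,
    }


def looks_fragmentary(text, min_articles_copula=0.05):
    """Flag text whose article/copula share falls below an assumed cut-off."""
    return unigram_shares(text)["articles_copula"] < min_articles_copula
```

A fuller version would also look at 2-gram patterns and at the prominence of content words, which this sketch leaves out.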

Page 9: How Useful Is the Web as a Linguistic Corpus?

Size as A Priori Filter

Webpages under 3 kB or over 150 kB have lower “signal to noise” ratio
• In these extreme ranges documents consist of coherent text less frequently or to a lesser degree
• Shorter files tend to have much lower ratio of text file size to HTML file size (49% vs. 64% overall)

Rule of thumb: download and process only pages larger than 5 kB and smaller than 200 kB (size before stripping HTML tags)
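
A size pre-filter of this kind might look like the sketch below, which applies the 5 kB and 200 kB bounds to the raw HTML size and also computes a text-to-HTML size ratio. Treating 1 kB as 1000 bytes, the function names, and the use of the standard `html.parser` module to strip tags are assumptions for illustration. Note that this simple extractor also counts script and style content as text.

```python
from html.parser import HTMLParser

MIN_BYTES = 5 * 1000    # assumption: 1 kB taken as 1000 bytes
MAX_BYTES = 200 * 1000


def worth_downloading(html_size_bytes):
    """Apply the rule of thumb to the page size before stripping HTML tags."""
    return MIN_BYTES < html_size_bytes < MAX_BYTES


class _TextExtractor(HTMLParser):
    """Collect character data found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_to_html_ratio(html):
    """Ratio of extracted text size to HTML size, both measured in UTF-8 bytes."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    html_bytes = len(html.encode("utf-8"))
    if html_bytes == 0:
        return 0.0
    return len("".join(parser.parts).encode("utf-8")) / html_bytes
```

Checking the size first avoids downloading and parsing pages that are unlikely to contain much coherent text.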

Page 10: How Useful Is the Web as a Linguistic Corpus?

My Web Corpus 1

Compiled one afternoon in October 2001 via KWiCFinder searches on the 20 most frequent words in English

Preliminary studies of 100 and 5859 webpages respectively revealed great bias towards commercial sites due to "paid positioning" on AltaVista; sites ranked highest for this reason were excluded from this study

Initially consisted of 11,201 online documents (OLDs)

Various "noise filters" were applied to make the results more useful

7294 survived automatic elimination of IDs and VIDs

256 HRDs were eliminated

Remaining documents were viewed individually and classified as
• Primarily useful text
• "Noisy" text
• Primarily non-text (link lists, fragments, headers / footers predominated...)
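
The counts on this slide can be turned into a small worked calculation. One assumption is made and should be read with care: the 256 HRDs are taken to have been removed from the 7294 documents that survived ID and VID elimination, which the order of the slide suggests but does not state outright.

```python
initial_documents = 11_201   # online documents retrieved at the start
after_id_vid = 7_294         # survivors of ID and VID elimination
hrds_removed = 256           # highly repetitive documents eliminated

removed_as_id_or_vid = initial_documents - after_id_vid   # 3907
left_for_manual_review = after_id_vid - hrds_removed      # 7038 (assumed order of steps)
share_surviving_id_vid = after_id_vid / initial_documents

print(f"Removed as ID/VID:      {removed_as_id_or_vid}")
print(f"Left for manual review: {left_for_manual_review}")
print(f"Share surviving ID/VID: {share_surviving_id_vid:.1%}")
```

On these figures roughly a third of the retrieved pages were identical or virtually identical to another page.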

Page 11: How Useful Is the Web as a Linguistic Corpus?

My Web Corpus 2

4949 unique documents passed all automatic tests and human classification

5.25 million tokens in 35 MB of files

Longer coherent texts from government, academic, legal, religious (Christian, Jewish, Muslim, Hindu), journalistic and commercial sources, plus many “hobbyist” pages on a wide range of topics

Compared to BNC as a standard reference corpus (see appendix with annotated comparison of n-gram frequencies).

Generally quite comparable, but important differences:
• UK vs. US bias in institutions, place names, spelling
• BNC: bias toward third person, past tense, narrative style
• WC: bias toward first (especially we) and second person, present tense, interactive style
• Words referring to Internet concepts and information missing or rare in BNC, highly prominent in WC (and in contemporary English)
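
The slides do not describe how the n-gram frequencies of the two corpora were compared, so the following is only one common way to do it. It is a sketch that assumes both corpora are non-empty and uses a smoothed log ratio of relative frequencies. The function name and the smoothing constant are hypothetical, and any counts fed to it in an example would be toy data, not figures from the study.

```python
import math
from collections import Counter


def log_ratios(counts_a, counts_b, smoothing=0.5):
    """log2 of (relative frequency in A / relative frequency in B) per item.

    Positive values mean the item is relatively more frequent in corpus A.
    The additive smoothing keeps items missing from one corpus finite.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    ratios = {}
    for item in set(counts_a) | set(counts_b):
        rel_a = (counts_a.get(item, 0) + smoothing) / total_a
        rel_b = (counts_b.get(item, 0) + smoothing) / total_b
        ratios[item] = math.log2(rel_a / rel_b)
    return ratios
```

Sorting the result by value would bring items typical of one corpus, such as first-person pronouns or Internet vocabulary, to one end of the list and items typical of the other corpus to the opposite end.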