Corpora

2 Body of language data Collected (or curated) for a particular purpose Various types of language Spoken Text Images Gestures Very valuable


Corpora


Corpus (pl. corpora)

- Body of language data
- Collected (or curated) for a particular purpose
- Various types of language
  - Spoken
  - Text
  - Images
  - Gestures
- Very valuable resource for linguist(ic)s and anyone else who is interested in language


Purposes for corpora

- Language instruction
- Task analysis
- Information access (search, indexing, etc.)
- Computer systems development
  - Training, testing/evaluating systems
- Knowledge source development (dictionaries, lexicons, etc.)


Types of corpora

- Text
- Speech
- Discourse
- Bitext
- Experimental transcripts
- Competition datasets
- Lyrics


Sources for text corpora

- Electronic text centers
- Digital libraries
  - Project Gutenberg
  - Bibliomania
- Corpus collections
- Wikipedia
- The web


Corpus distributors

- LDC
  - BYU has a membership
  - Catalog
  - Top 10 corpora
- ELRA: like LDC except based in Europe
- Government agencies (NIST, census, etc.)
- Companies (news agencies, etc.)
- Universities


Data formats

Text
- File formats: ASCII, EBCDIC, UNICODE, proprietary
- With or without markup (rtf, html, etc.)
- Application specific (doc, wpd, etc.)
- Can vary widely across languages

Speech
- Huge amount of variation across projects/hw/sw
- TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)

Binary/machine formats
- Sound/speech: MP3, AU, WAV, RA, …
- Graphical: GIF, JPEG, BMP, WMF, …

Knowledge of a scripting language (e.g. Perl) is invaluable!
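
As a rough illustration of why the file format matters, the sketch below encodes one sample string under several character encodings and compares the resulting bytes. Python stands in for the scripting language here; the sample string and the helper name `encode_all` are illustrative assumptions, and `cp037` is only one of several EBCDIC code pages.

```python
SAMPLE = "Check the oil filter"  # illustrative sample text

def encode_all(text, encodings=("ascii", "cp037", "utf-8", "utf-16-le")):
    """Encode the same text under several character encodings."""
    return {name: text.encode(name) for name in encodings}

encoded = encode_all(SAMPLE)
for name, data in encoded.items():
    # Show the byte length and the first few bytes in hex for comparison.
    print(f"{name:10} {len(data):3d} bytes  {data[:6].hex()}")

# Reading EBCDIC bytes as if they were Latin-1 raises no error;
# it just produces the wrong characters.
misread = encoded["cp037"].decode("latin-1")
print(repr(misread))
```

The point of the comparison is that a corpus tool has to know (or detect) the encoding before any counting begins, since a wrong guess can fail silently.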


Corpus metrics

- Size
  - Tokens: # of words, count ALL of them
  - Types: # of words, only count each once
- Term frequency
- Genre/topic
- Dispersion
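
The token/type distinction and term frequency can be sketched in a few lines. This is a minimal example under stated assumptions: the tokenizer is a deliberately crude regular expression, and the sample sentence and function names are made up for illustration. Genre and dispersion are not covered.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (crude on purpose)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def corpus_metrics(text):
    tokens = tokenize(text)
    freqs = Counter(tokens)        # term frequency: occurrences per type
    return {
        "tokens": len(tokens),     # every running word counts
        "types": len(freqs),       # each distinct word counts once
        "term_frequency": freqs,
    }

sample = "Check the oil filter. Replace the oil filter if the oil is dirty."
m = corpus_metrics(sample)
print(m["tokens"], m["types"])
print(m["term_frequency"].most_common(3))
```

On a real corpus the tokenizer choice changes both counts, so reported sizes are only comparable when the tokenization is the same.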


Corpora at BYU

Lots of corpora are available for BYU faculty/student use, including:

- corpus.byu.edu
- scriptures.byu.edu
- General Conference corpus


Sample job

Date: Thu, 21 Feb 2013 10:40:22
University or Organization: H5
Job Location: California, USA
Web Address: http://www.h5.com
Job Rank: Consultant
Specialty Areas: Discourse Analysis; Semantics; Syntax; Text/Corpus Linguistics

About H5:
H5 serves the needs of leading law firms and corporate clients, using powerful proprietary software to provide technology-assisted review and expert search consulting & research. H5’s document review and analytic services uniquely support our clients’ requirements for large-scale litigation, investigation, records retention, and regulatory compliance. H5’s "hybrid" approach to technology-assisted review combines patented information retrieval technology and expert professional services. Through this model, H5 has created a fully integrated document review system that is unparalleled in performance, as proven in independent, benchmarked studies. For more information, visit www.h5.com.

Overview:
The H5 Professional Services Group includes linguists, lawyers, researchers, statisticians, e-discovery and data modeling experts and project managers. Our multidisciplinary teams use H5’s proprietary software and a well-defined process to build linguistic models that classify electronic data and support strategic search for documents that help our clients win. H5 is seeking candidates with backgrounds in linguistics (or related fields of textual corpus analysis), an affinity for developing novel search strategies, and a desire to collaborate with professional teams and sophisticated search technologies.

Primary Responsibilities:
- Analyzing linguistic data;
- Researching large corpora for linguistic patterns;
- Creating search strategies based on linguistic patterns;
- Researching subject matter and factual issues in complex litigation;
- Rapidly developing an understanding of new subject matter;
- Reading a wide variety of documents, from e-mail to academic articles;
- Synthesizing large amounts of information from a variety of sources;
- Designing, building, and testing search models unique to each project.

Key Competencies:
- Understanding of syntax, semantics, and pragmatics, in written communication;
- Experience in corpus, text, or discourse analysis a plus;
- Experience in ethnography or anthropology can be helpful, particularly as it relates to an understanding of contextual cues in text-based communication;
- Leadership skills, personal incentive and a demonstrated ability to initiate, develop, and successfully conclude projects;
- A sharp eye for detail and precise thinking;
- The ability to make analytical judgments;
- A practiced sense of order and organization;
- Ability to work under pressure and meet deadlines, both autonomously and collaboratively;
- Strong interpersonal skills, flexibility, curiosity, creativity, and collaborative spirit;
- Strong computer and software competency in a PC/Windows environment, including Microsoft Office;
- Experience in a software development environment a plus.

Minimal Qualifications:
- Solid academic credentials: advanced-undergraduate and/or graduate-level coursework in linguistics, textual corpus analysis, or related field;
- Experience applying linguistic and search expertise to real language data;
- Experience in a professional or business environment;
- Mastery of the English language.


Purpose of standards

- Avoid duplication of effort
- Allow synergy, integration, exchange
- Specific goals
  - Reusable text and tagging formats
  - Representative of domain/discipline/genre
  - Copyright


Text markup standards

- SGML (ISO standard)
  - Standard Generalized Markup Language
  - DTD, XOM, etc.
- HTML (W3C standard)
  - Hypertext Markup Language
  - SGML with specific DTD
- XML (W3C standard)
  - Logical SGML subset
  - replacement (?) for HTML
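
To make the idea of reusable markup concrete, here is a small sketch that reads XML-annotated corpus text with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`text`, `s`, `w`, `pos`, `genre`) are hypothetical and are not taken from any particular corpus standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <s> marks a sentence, <w pos="..."> marks a word.
SAMPLE = """<text genre="manual">
  <s n="1"><w pos="NOUN">fuel</w> <w pos="AUX">will</w> <w pos="AUX">be</w> <w pos="VERB">shunted</w></s>
  <s n="2"><w pos="VERB">check</w> <w pos="DET">the</w> <w pos="NOUN">valve</w></s>
</text>"""

def read_sentences(xml_string):
    """Return one list of (word, pos) pairs per <s> element."""
    root = ET.fromstring(xml_string)
    return [[(w.text, w.get("pos")) for w in s.iter("w")]
            for s in root.iter("s")]

sentences = read_sentences(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Because the annotation lives in the markup rather than in an application-specific file format, any XML-aware tool can recover the same words and tags.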


Sample corpus analysis task

- ID terminology, collocations from previous publications
- Find most-used vocabulary
- Find inconsistencies, varied usages
- Get a handle on domains, topics, size of vocabulary
- Groundwork for tech writers, translators


Types of vocabulary lists

- Single-word term lists
- Collocations and compound lists
- KWIC listings
- Frequency lists
- Saliency lists
- Weirdness: typos, low-freq words, etc.
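
Of the list types above, a KWIC (keyword in context) listing is the easiest to sketch. The function below is a minimal illustration, assuming whitespace-tokenized input; the name `kwic`, the `width` parameter, and the sample text are made up for this example.

```python
def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

text = "after fitting the pipe into the basin check that the aft fitting is larger"
hits = kwic(text.split(), "fitting")
for line in hits:
    print(line)
```

Lining the keyword up in one column is what lets a reader scan for varied usages of the same word.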


Starting point

All English-language documentation ever published for which there was a machine-readable version (typesetting)

Several hundred documents of all kinds: repair manuals, warranty notices, user manuals, testing documents, etc.

Total number of files processed: 861


Canonicalizing the input

- Standardize character representation
- Tokenize punctuation
- Strip formatting codes
- Uncapitalize sentence-initial words
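
The four canonicalization steps might be sketched as follows. This is not the pipeline used in the project described here: the formatting codes are assumed to look like angle-bracket tags, the punctuation set is a guess, and the lowercasing rule is a simple heuristic that would also lowercase sentence-initial proper nouns.

```python
import re
import unicodedata

FORMAT_CODE = re.compile(r"<[^>]+>")          # assumed syntax for formatting codes
PUNCTUATION = re.compile(r'([.,;:!?()"])')    # assumed punctuation inventory
SENTENCE_END = {".", "!", "?"}

def canonicalize(text):
    """Return a list of canonicalized tokens."""
    text = unicodedata.normalize("NFKC", text)   # standardize character representation
    text = FORMAT_CODE.sub(" ", text)            # strip formatting codes
    text = PUNCTUATION.sub(r" \1 ", text)        # tokenize punctuation
    tokens = []
    sentence_start = True
    for tok in text.split():
        # Uncapitalize sentence-initial words, but leave all-caps tokens alone.
        if sentence_start and tok[:1].isupper() and not tok[1:].isupper():
            tok = tok.lower()
        tokens.append(tok)
        sentence_start = tok in SENTENCE_END
    return tokens

print(canonicalize("<b>Fuel</b> will be shunted. The shunted fuel is hot."))
```

A known limitation of this sketch is that it splits decimal numbers such as 3.5 at the period, which a production tokenizer would need to handle.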


ID, count single words

- De-inflect morphological variants (base-form reduction, lemmatization)
- -ing, -ed forms are problematic
  - After fitting the pipe into the basin …
  - The aft fitting is larger on the new…
  - The tightly fitting bracket should be…
  - Fuel will be shunted… / The shunted fuel…
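
The difficulty with -ing and -ed forms shows up as soon as suffixes are stripped without part-of-speech information. The sketch below is a deliberately naive suffix stripper (a hypothetical helper, not a real lemmatizer): it reduces every "fitting" to "fit", including the noun in "the aft fitting", which arguably should stay as it is.

```python
def naive_base_form(word):
    """Strip -ing/-ed with no part-of-speech information (deliberately naive)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            # Undo consonant doubling: fitting -> fitt -> fit
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

examples = [
    "after fitting the pipe into the basin",   # verb form
    "the aft fitting is larger",               # noun
    "the tightly fitting bracket",             # adjectival use
]
for sentence in examples:
    print([naive_base_form(w) for w in sentence.split()])
```

All three sentences come out with "fit", so the three uses can no longer be told apart. Deciding when to reduce a form requires knowing its category in context.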


Single-word statistics

Total number of single-word (sw) occurrences: 7,230,000

Total number of unique single words: 12,000


ID, count nominal compounds

Involve at least two of the following:
- Nouns
- Nominalized verb forms
- Some adjectives
- Any word whose category is not known

but not:
- Numbers, special characters, non-nouns
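
One way to approximate this definition is to collect maximal runs of two or more tokens whose categories are on the allowed list. The sketch below assumes the text has already been tagged; the tag names are hypothetical, and treating every adjective as allowed is a simplification of "some adjectives".

```python
from collections import Counter

ALLOWED = {"NOUN", "NOMVERB", "ADJ", "UNK"}   # hypothetical tag names

def nominal_compounds(tagged):
    """Yield maximal runs of two or more tokens whose tags are all in ALLOWED."""
    run = []
    for word, tag in list(tagged) + [("", "END")]:   # sentinel flushes the last run
        if tag in ALLOWED:
            run.append(word)
        else:
            if len(run) >= 2:
                yield " ".join(run)
            run = []

tagged = [
    ("drain", "VERB"), ("the", "DET"), ("hydraulic", "ADJ"), ("oil", "NOUN"),
    ("tank", "NOUN"), ("through", "PREP"), ("the", "DET"), ("drain", "NOUN"),
    ("plug", "NOUN"), ("2", "NUM"),
]
found = list(nominal_compounds(tagged))
counts = Counter(found)    # occurrences per compound
print(found)
```

Numbers and other excluded categories simply end the current run, which matches the "but not" clause above.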


Sample nominal compounds

- hub cap
- low amplitude
- boom foot pin assembly
- hydraulic oil tank drain plug
- card cage type regulator voltage adjustment controls

There are ambiguities:

- check valve
- testing equipment


Nominal Compound Statistics

Total number of nominal compounds: 1,034,861

Total number of unique nominal compounds: 110,298
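
A quick calculation puts these totals next to the single-word statistics given earlier. The figures are the ones reported in this document; the variable and function names are illustrative.

```python
# Totals as reported in the single-word and nominal-compound statistics.
single_word_tokens = 7_230_000
single_word_types = 12_000
compound_tokens = 1_034_861
compound_types = 110_298

def mean_occurrences(tokens, types):
    """Average number of occurrences per distinct item."""
    return tokens / types

sw_mean = mean_occurrences(single_word_tokens, single_word_types)
nc_mean = mean_occurrences(compound_tokens, compound_types)
print(round(sw_mean, 1), round(nc_mean, 2))
```

On these figures a distinct single word recurs about 600 times on average, while a distinct nominal compound recurs fewer than 10 times, so the compound inventory is far larger and far sparser than the single-word vocabulary.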


Sample long nominal compounds

off-highway truck final drive first reduction planetary assembly

parking brake/travel stop pilot control valve pressure switch

right front suspension cylinder pressure sensor circuit fault

fuel injection pump drive sprocket bearing lubrication line

track motor manifold valve high pressure relief setting

ground level right rear leg elevation control valve

axle wish bone ball joint flange mounting bolts

stick cylinder rod end check valve lines group

ground engaging tool bolt torques chart

scraper key start switch relay terminal


NC Frequency Distribution:

freq   # terms
--------------
   1     45877
   2     22207
   3      8277
   4      7026
   5      3554
   6      3441
   7      1902
   8      1891
   9      1367
  10      1169
  15       527
  20       355
  30       166
  50        66
  75        33
 100        17
 250         2
 501         1
1098         1
3410         1
3862         1
3966         1
4889         1
6092         1
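
A distribution of this kind (how many distinct terms occur exactly n times) can be derived from per-term counts in one step. The sketch below uses a handful of made-up occurrences, not the project's data.

```python
from collections import Counter

def frequency_distribution(term_counts):
    """Map each frequency to the number of distinct terms that occur that often."""
    return Counter(term_counts.values())

# Toy list of compound occurrences, for illustration only.
occurrences = ["oil filter", "oil filter", "seat belt",
               "coolant leak", "seat belt", "ball joint"]
term_counts = Counter(occurrences)
dist = frequency_distribution(term_counts)
for freq in sorted(dist):
    print(freq, dist[freq])
```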


NC Frequencies

6092 lb ft

4889 cooling system

3966 fuel injection

3862 parking brake

3410 relief valve

2789 control valve

2587 service hours

2421 hydraulic oil

2588 personal injury

2373 caterpillar dealer


NC Frequencies (cont.)

2037 lift truck

1432 oil filter

953 seat belt

488 master cylinder

205 directional control

109 petroleum jelly

64 ball joint

33 caterpillar service technology group

10 outlet water temperature regulators

5 coolant leak

1 conveyor drive pump electrical displacement controls


Term Length Distribution

Len   # of terms
----------------
  2        50894
  3        39043
  4        15189
  5         3951
  6          936
  7          207
  8           49
  9           10
 10            9
 11            2
 12            3
 13            2
 15            2
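
A length distribution like this one follows directly from a list of unique terms. The sketch below counts words per term, assuming terms are whitespace-separated; the helper name is illustrative and the input is a few sample compounds rather than the full term list.

```python
from collections import Counter

def length_distribution(terms):
    """Count how many terms have each length, measured in words."""
    return Counter(len(term.split()) for term in terms)

terms = [
    "hub cap",
    "low amplitude",
    "boom foot pin assembly",
    "hydraulic oil tank drain plug",
    "card cage type regulator voltage adjustment controls",
]
lengths = length_distribution(terms)
for n in sorted(lengths):
    print(n, lengths[n])
```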


Semantic Classes of NC’s

- parts and components
- conditions
- vehicles
- product offerings
- tools and hardware
- measurements
- humans and occupations
- corporate entities and procedures


Non-nominal Collocations

- hand tighten
- make sure
- air dry
- away from
- air to air aftercooler
- hydraulically released disc brakes


Prep/adv-based Ambiguity (technical vs. not)

Technical:
- down arrow keys
- inside cab light
- left camshaft oil gallery

Not technical:
- accelerator pedals down
- air inside
- bulldozer tilt left


Variation in NC’s

- Alternate spellings
- Typos
- Abbreviations
- Morphological variation ( & possessives)
- Word-boundary variation


Compositionality

((ground level)(front leg))
*(ground ((level front) leg))

BUT: hand fuel priming pump
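
To see how many structures a compound allows in principle, one can enumerate every binary bracketing of its words. The sketch below does that recursively; it only generates candidates and says nothing about which bracketing is the intended one. The function name is illustrative.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as nested strings."""
    if len(words) == 1:
        return [words[0]]
    results = []
    for split in range(1, len(words)):
        # Combine each bracketing of the left part with each one of the right part.
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"({left} {right})")
    return results

parses = bracketings("ground level front leg".split())
for p in parses:
    print(p)
```

A four-word compound has five binary bracketings, and the count grows quickly with length, which is why the long compounds in this corpus are hard to analyze automatically.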