randall-farmer
Corpora
Corpus (pl. corpora)
- Body of language data
- Collected (or curated) for a particular purpose
- Various types of language
  - Spoken
  - Text
  - Images
  - Gestures
- Very valuable resource for linguist(ic)s and anyone else who is interested in language
Purposes for corpora
- Language instruction
- Task analysis
- Information access (search, indexing, etc.)
- Computer systems development
  - Training, testing/evaluating systems
- Knowledge source development (dictionaries, lexicons, etc.)
Types of corpora
- Text
- Speech
- Discourse
- Bitext
- Experimental transcripts
- Competition datasets
- Lyrics
Sources for text corpora
- Electronic text centers
- Digital libraries
  - Project Gutenberg
  - Bibliomania
- Corpus collections
- Wikipedia
- The web
Corpus distributors
- LDC
  - BYU has a membership
  - Catalog
  - Top 10 corpora
- ELRA: like LDC except based in Europe
- Government agencies (NIST, census, etc.)
- Companies (news agencies, etc.)
- Universities
Data formats
- Text
  - File formats: ASCII, EBCDIC, UNICODE, proprietary
  - With or without markup (rtf, html, etc.)
  - Application specific (doc, wpd, etc.)
  - Can vary widely across languages
- Speech
  - Huge amount of variation across projects/hw/sw
  - TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
- Binary/machine formats
  - Sound/speech: MP3, AU, WAV, RA, …
  - Graphical: GIF, JPEG, BMP, WMF, …
- Knowledge of a scripting language (e.g. Perl) is invaluable!
Corpus metrics
- Size
  - Tokens: # of words, count ALL of them
  - Types: # of words, only count each once
- Term frequency
- Genre/topic
- Dispersion
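The token/type distinction and term frequency can be illustrated with a few lines of Python. This is only a sketch: the tokenization rule (lowercase, split on non-alphanumeric characters) and the sample sentence are assumptions made for the example, and a real corpus would need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def corpus_size(text):
    """Return (token count, type count, term-frequency table) for a text."""
    tokens = tokenize(text)
    freqs = Counter(tokens)          # term frequency: occurrences per distinct word
    return len(tokens), len(freqs), freqs

# Hypothetical sample text, not taken from any real corpus.
sample = "Check the oil filter. Replace the oil filter if the filter is damaged."
n_tokens, n_types, freqs = corpus_size(sample)
print("tokens:", n_tokens)           # every word occurrence is counted
print("types: ", n_types)            # each distinct word is counted once
print("type/token ratio:", round(n_types / n_tokens, 3))
```

The type/token ratio printed at the end is one common way to relate the two size measures; it drops as a corpus grows because new word types appear more and more slowly.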
Corpora at BYU
Lots of corpora listed here that are available for BYU faculty/student use.
- corpus.byu.edu
- scriptures.byu.edu
- General Conference corpus
Sample job
Date: Thu, 21 Feb 2013 10:40:22
University or Organization: H5
Job Location: California, USA
Web Address: http://www.h5.com
Job Rank: Consultant
Specialty Areas: Discourse Analysis; Semantics; Syntax; Text/Corpus Linguistics
About H5: H5 serves the needs of leading law firms and corporate clients, using powerful proprietary software to provide technology-assisted review and expert search consulting & research. H5’s document review and analytic services uniquely support our clients’ requirements for large-scale litigation, investigation, records retention, and regulatory compliance. H5’s "hybrid" approach to technology-assisted review combines patented information retrieval technology and expert professional services. Through this model, H5 has created a fully integrated document review system that is unparalleled in performance, as proven in independent, benchmarked studies. For more information, visit www.h5.com.
Overview: The H5 Professional Services Group includes linguists, lawyers, researchers, statisticians, e-discovery and data modeling experts and project managers. Our multidisciplinary teams use H5’s proprietary software and a well-defined process to build linguistic models that classify electronic data and support strategic search for documents that help our clients win. H5 is seeking candidates with backgrounds in linguistics (or related fields of textual corpus analysis), an affinity for developing novel search strategies, and a desire to collaborate with professional teams and sophisticated search technologies.
Primary Responsibilities:
- Analyzing linguistic data;
- Researching large corpora for linguistic patterns;
- Creating search strategies based on linguistic patterns;
- Researching subject matter and factual issues in complex litigation;
- Rapidly developing an understanding of new subject matter;
- Reading a wide variety of documents, from e-mail to academic articles;
- Synthesizing large amounts of information from a variety of sources;
- Designing, building, and testing search models unique to each project.
Key Competencies:
- Understanding of syntax, semantics, and pragmatics, in written communication;
- Experience in corpus, text, or discourse analysis a plus;
- Experience in ethnography or anthropology can be helpful, particularly as it relates to an understanding of contextual cues in text-based communication;
- Leadership skills, personal incentive and a demonstrated ability to initiate, develop, and successfully conclude projects;
- A sharp eye for detail and precise thinking;
- The ability to make analytical judgments;
- A practiced sense of order and organization;
- Ability to work under pressure and meet deadlines, both autonomously and collaboratively;
- Strong interpersonal skills, flexibility, curiosity, creativity, and collaborative spirit;
- Strong computer and software competency in a PC/Windows environment, including Microsoft Office;
- Experience in a software development environment a plus.
Minimal Qualifications:
- Solid academic credentials: advanced-undergraduate and/or graduate-level coursework in linguistics, textual corpus analysis, or related field;
- Experience applying linguistic and search expertise to real language data;
- Experience in a professional or business environment;
- Mastery of the English language.
Purpose of standards
- Avoid duplication of effort
- Allow synergy, integration, exchange
- Specific goals
  - Reusable text and tagging formats
  - Representative of domain/discipline/genre
  - Copyright
Text markup standards
- SGML (ISO standard)
  - Standard Generalized Markup Language
  - DTD, XOM, etc.
- HTML (W3C standard)
  - Hypertext Markup Language
  - SGML with specific DTD
- XML (W3C standard)
  - Logical SGML subset
  - replacement (?) for HTML
Sample corpus analysis task
- ID terminology, collocations from previous publications
- Find most-used vocabulary
- Find inconsistencies, varied usages
- Get a handle on domains, topics, size of vocabulary
- Groundwork for tech writers, translators
Types of vocabulary lists
- Single-word term lists
- Collocations and compound lists
- KWIC listings
- Frequency lists
- Saliency lists
- Weirdness: typos, low-freq words, etc.
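A KWIC (keyword in context) listing shows every occurrence of a term together with a few words on either side. The sketch below is one minimal way to produce such a listing; the function name `kwic`, the window width, and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) triples for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample sentence, already tokenized on whitespace.
tokens = "open the relief valve before you remove the control valve from the pump".split()
for left, key, right in kwic(tokens, "valve"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25} [{key}] {right}")
```

Aligning the keyword in a single column is what makes recurring neighbors (here "relief" and "control") easy to spot by eye.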
Starting point
All English-language documentation ever published for which there was a machine-readable version (typesetting)
Several hundred documents of all kinds: repair manuals, warranty notices, user manuals, testing documents, etc.
Total number of files processed: 861
Canonicalizing the input
- Standardize character representation
- Tokenize punctuation
- Strip formatting codes
- Uncapitalize sentence-initial words
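The four canonicalization steps might be sketched as follows. This is not the pipeline used in the project described here: the formatting codes are assumed to look like simple angle-bracket tags, the punctuation set is a guess, and the sentence-initial rule is a simple heuristic.

```python
import re
import unicodedata

def canonicalize(text):
    """Apply the four canonicalization steps to one string and return tokens."""
    # 1. Standardize character representation (Unicode NFKC, straight quotes).
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # 2. Strip formatting codes; assumed here to look like <b> or </b>.
    text = re.sub(r"</?\w+>", "", text)
    # 3. Tokenize punctuation: make each punctuation mark its own token.
    text = re.sub(r'([.,;:!?()"])', r" \1 ", text)
    tokens = text.split()
    # 4. Uncapitalize sentence-initial words (first token, and tokens after . ! ?).
    out, sentence_start = [], True
    for tok in tokens:
        # Leave all-caps tokens (likely acronyms) untouched.
        if sentence_start and tok[:1].isupper() and not tok[1:].isupper():
            tok = tok.lower()
        out.append(tok)
        sentence_start = tok in {".", "!", "?"}
    return out

# Hypothetical input line with a typeset tag and a curly apostrophe.
print(canonicalize("Remove the <b>drain plug</b>. Check the Caterpillar dealer\u2019s manual."))
```

Note the limitation of step 4 as written: a proper noun that happens to begin a sentence is lowercased too, so a production version would need a list of names to protect.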
ID, count single words
- De-inflect morphological variants (base-form reduction, lemmatization)
- -ing, -ed forms are problematic
  - After fitting the pipe into the basin …
  - The aft fitting is larger on the new…
  - The tightly fitting bracket should be…
  - Fuel will be shunted… / The shunted fuel…
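To see why -ing and -ed forms are problematic, consider a naive suffix stripper that ignores part of speech. The function below is a hypothetical illustration, not the de-inflection method used in the project: it reduces "fitting" to "fit" everywhere, although in the examples above the word appears to act as a verb form, a noun, and a modifier in turn, and only some of those should be reduced.

```python
def naive_base_form(token):
    """Strip -ing / -ed and undo consonant doubling; ignores part of speech."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: -len(suffix)]
            # Undo consonant doubling: "fitting" -> "fitt" -> "fit".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token

examples = [
    "after fitting the pipe into the basin",
    "the aft fitting is larger",
    "the tightly fitting bracket",
    "the shunted fuel",
]
for sentence in examples:
    # Every "fitting" collapses to "fit", whatever its role in the sentence.
    print(" ".join(naive_base_form(w) for w in sentence.split()))
```

A more careful approach would tag each token with its part of speech first and de-inflect only the verbal uses, leaving a noun such as "fitting" (a part) intact.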
Single-word statistics
Total number of single-word (sw) occurrences: 7,230,000
Total number of unique single words: 12,000
ID, count nominal compounds
- Involve at least two of the following:
  - Nouns
  - Nominalized verb forms
  - Some adjectives
  - Any word whose category is not known
- but not:
  - Numbers, special characters, non-nouns
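One way to read the rule above is: collect maximal runs of two or more "noun-like" tokens from part-of-speech-tagged text. The sketch below does that. The tag names (`NOUN`, `VERB_NOM`, `ADJ`, `UNK`) are hypothetical, and the tagger is assumed to assign `ADJ` only to the adjectives that are allowed inside compounds.

```python
# Hypothetical tag names for the admissible categories.
NOUN_LIKE = {"NOUN", "VERB_NOM", "ADJ", "UNK"}

def nominal_compounds(tagged):
    """Yield maximal runs of two or more noun-like tokens from (word, tag) pairs."""
    run = []
    # A sentinel with a non-noun tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in NOUN_LIKE:
            run.append(word)
        else:
            if len(run) >= 2:
                yield " ".join(run)
            run = []

# Hypothetical tagged sentence: "check the hydraulic oil tank drain plug and 2 bolts"
tagged = [
    ("check", "VERB"), ("the", "DET"), ("hydraulic", "ADJ"), ("oil", "NOUN"),
    ("tank", "NOUN"), ("drain", "NOUN"), ("plug", "NOUN"), ("and", "CONJ"),
    ("2", "NUM"), ("bolts", "NOUN"),
]
print(list(nominal_compounds(tagged)))
```

Single nouns such as "bolts" are skipped because a compound needs at least two members, and numbers break a run because they are excluded categories.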
Sample nominal compounds
- hub cap
- low amplitude
- boom foot pin assembly
- hydraulic oil tank drain plug
- card cage type regulator voltage adjustment controls
There are ambiguities:
- check valve
- testing equipment
Nominal Compound Statistics
Total number of nominal compounds: 1,034,861
Total number of unique nominal compounds: 110,298
Sample long nominal compounds
off-highway truck final drive first reduction planetary assembly
parking brake/travel stop pilot control valve pressure switch
right front suspension cylinder pressure sensor circuit fault
fuel injection pump drive sprocket bearing lubrication line
track motor manifold valve high pressure relief setting
ground level right rear leg elevation control valve
axle wish bone ball joint flange mounting bolts
stick cylinder rod end check valve lines group
ground engaging tool bolt torques chart
scraper key start switch relay terminal
NC Frequency Distribution:
 freq   # terms
-----------------
    1     45877
    2     22207
    3      8277
    4      7026
    5      3554
    6      3441
    7      1902
    8      1891
    9      1367
   10      1169
   15       527
   20       355
   30       166
   50        66
   75        33
  100        17
  250         2
  501         1
 1098         1
 3410         1
 3862         1
 3966         1
 4889         1
 6092         1
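A distribution like this one (how many distinct terms occur exactly n times) can be derived from a list of extracted compounds in two counting passes. The sketch below uses a tiny made-up list of terms; the function name `frequency_spectrum` is an assumption for the example.

```python
from collections import Counter

def frequency_spectrum(terms):
    """Map each frequency to the number of distinct terms having that frequency."""
    term_freq = Counter(terms)              # term -> number of occurrences
    return Counter(term_freq.values())      # frequency -> number of terms

# Hypothetical extracted compounds, one entry per occurrence.
terms = ["cooling system", "relief valve", "cooling system", "seat belt",
         "cooling system", "relief valve", "coolant leak"]
spectrum = frequency_spectrum(terms)
print(" freq   # terms")
print("-----------------")
for freq in sorted(spectrum):
    print(f"{freq:>5} {spectrum[freq]:>9}")
```

The heavy concentration of terms at frequency 1 in the table above is typical of such spectra: most distinct compounds are seen only once.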
NC Frequencies
6092 lb ft
4889 cooling system
3966 fuel injection
3862 parking brake
3410 relief valve
2789 control valve
2587 service hours
2421 hydraulic oil
2588 personal injury
2373 caterpillar dealer
NC Frequencies (cont.)
2037 lift truck
1432 oil filter
953 seat belt
488 master cylinder
205 directional control
109 petroleum jelly
64 ball joint
33 caterpillar service technology group
10 outlet water temperature regulators
5 coolant leak
1 conveyor drive pump electrical displacement controls
Term Length Distribution
 Len   # of terms
   2        50894
   3        39043
   4        15189
   5         3951
   6          936
   7          207
   8           49
   9           10
  10            9
  11            2
  12            3
  13            2
  15            2
Semantic Classes of NC’s
- parts and components
- conditions
- vehicles
- product offerings
- tools and hardware
- measurements
- humans and occupations
- corporate entities and procedures
Non-nominal Collocations
- hand tighten
- make sure
- air dry
- away from
- air to air aftercooler
- hydraulically released disc brakes
Prep/adv-based Ambiguity (technical vs. not)
- down arrow keys
- inside cab light
- left camshaft oil gallery
- accelerator pedals down
- air inside
- bulldozer tilt left
Variation in NC’s
- Alternate spellings
- Typos
- Abbreviations
- Morphological variation ( & possessives)
- Word-boundary variation
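Some of these variations can be detected mechanically by mapping each term to a normalized key and grouping terms that share a key. The sketch below handles only case, word-boundary variation (spaces, hyphens, slashes), and possessive 's; typos and abbreviations would need other techniques such as edit distance or an abbreviation table. The function names and sample terms are assumptions for the example.

```python
import re
from collections import defaultdict

def variant_key(term):
    """Collapse case, a possessive 's, and spaces/hyphens/slashes into one key."""
    term = term.lower().replace("\u2019", "'")
    term = re.sub(r"'s\b", "", term)
    return re.sub(r"[\s\-/]+", "", term)

def group_variants(terms):
    """Return only those keys that are realized by more than one surface form."""
    groups = defaultdict(set)
    for term in terms:
        groups[variant_key(term)].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Hypothetical term list containing word-boundary and case variants.
terms = ["wish bone", "wishbone", "wish-bone", "seat belt", "Seat Belt", "oil filter"]
for key, forms in group_variants(terms).items():
    print(key, "->", forms)
```

Each group is a candidate for review rather than an automatic merge, since two distinct terms can occasionally collapse to the same key.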
Compositionality
((ground level) (front leg))
*(ground ((level front) leg))
BUT: hand fuel priming pump
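The bracketing problem can be made concrete by enumerating every binary bracketing of a compound. The sketch below lists the candidates for a four-word compound (there are five); it does not choose among them, since picking the right structure requires knowledge of the domain or corpus evidence about which sub-sequences occur as terms on their own.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a nested string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, then combine each left parse with each right parse.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"({left} {right})")
    return results

for candidate in bracketings("hand fuel priming pump".split()):
    print(candidate)
```

The number of candidates grows quickly with compound length (it follows the Catalan numbers), which is one reason long nominal compounds are hard to analyze automatically.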