HG8003 Technologically Speaking: The intersection of language and technology.
Words, Lexicons and Ontologies
Francis Bond
Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/
Lecture 4
Location: LT8
HG8003 (2014)
Schedule
Lec.  Date   Topic
1     01-16  Introduction, Organization: Overview of NLP; Main Issues
2     01-23  Representing Language
3     02-06  Representing Meaning
4     02-13  Words, Lexicons and Ontologies
5     02-20  Text Mining and Knowledge Acquisition (Quiz)
6     02-27  Structured Text and the Semantic Web
             Recess
7     03-13  Citation, Reputation and PageRank
8     03-20  Introduction to MT, Empirical NLP
9     03-27  Analysis, Tagging, Parsing and Generation (Quiz)
10    Video  Statistical and Example-based MT
11    04-03  Transfer and Word Sense Disambiguation
12    04-10  Review and Conclusions
Exam  05-06  17:00
➣ Video week 10
Words, Lexicons and Ontologies 1
Review of Representing Meaning
➣ Three ways of defining meaning
➢ Attributional (Compositional)
➢ Relational
➢ Distributional
➣ the Syntax-Semantic Interface
➢ Usage ⇌ Meaning
Attributional Meaning
➣ Give a semantic description of word use in isolation of the categorisation of other lexical items
➢ definitions
➢ decompositional semantics (break down into primitives)
➣ Easy for humans to understand
➣ Hard to decide on sense boundaries (granularity: splitters vs. lumpers)
➣ Definitions are circular (the grounding problem)
➣ Hard to be consistent
Relational Meaning
➣ Capture correspondences between lexical items by way of a finite set of pre-defined semantic relations
➣ Methodologies:
➢ lexical relations
➢ constructional relations
➣ Captures many generalizations usefully
➣ Hard to make complete
➣ Leads to large, complex graphs
Distributional Meaning
➣ Capture word meanings as collections of contexts in which words appear
➢ n-grams
➢ syntactic relations
➢ sentences
➢ documents
➣ Good for synonymy, not so good for antonymy
➣ Computationally tractable
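A minimal sketch of the distributional idea, assuming a toy corpus and a two-word context window (both invented here for illustration): each word is represented by counts of its neighbouring words, and two words are compared by the cosine of their count vectors.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus; real systems use millions of sentences.
TOY_CORPUS = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate the food",
    "the dog ate the food",
    "she read the book",
]
vectors = context_vectors(TOY_CORPUS)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_book = cosine(vectors["cat"], vectors["book"])
```

In this toy data cat and dog occur in identical contexts, so they come out more similar than cat and book. Antonyms such as hot and cold also tend to share contexts, which is why this kind of model handles antonymy poorly.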
Why are dictionaries important?
➣ For humans
➢ find meaning of unknown words
➢ find more information about known words
➢ codify knowledge about word usage (glossaries)
➣ For machines
➢ store information about words
➢ link between text and knowledge
Introduction to Words, Lexicons and Ontologies
➣ Design and implementation
➢ Machine Readable Dictionaries
➢ Morphological lexicons
➢ Syntactic lexicons
➢ Semantic lexicons
➢ Ontologies
➣ Construction and Maintenance
➢ Construction from scratch
➢ Boot-strapping from existing resources
➢ Ensuring consistency
Machine Readable Dictionaries (MRDs)
➣ Human dictionaries made available on machine
➢ Electronic Dictionaries
➢ Dictionary Applications
∗ often with automatic word lookup
➢ On-line dictionaries
∗ Sometimes with glosses
A typical entry
definition (n) a concise explanation of the meaning of a word or phrase or symbol
➣ Headword: definition
➣ Part of Speech: n (noun)
➣ Definition:
➢ genus: explanation
➢ differentia: concise; of the meaning of a word or phrase or symbol
? Implied: countable (a), regular plural
Parts-of-Speech (POS)
➣ Traditional Grammar has eight:
Noun, Verb, Adjective, Adverb (open class)
Conjunction, Preposition, Pronoun, Interjection (closed class)
➣ In the US, the Penn Treebank POS set is the de facto standard:
➢ http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
➢ 45 tags (including punctuation)
➣ In Europe, CLAWS tagset is popular
➢ http://ucrel.lancs.ac.uk/claws7tags.html
➢ 137 tags (without punctuation)
Penn Treebank Examples (14/45)
Tag   Description              Tag  Description
NN    Noun, singular or mass   VB   Verb, base form
NNS   Noun, plural             VBD  Verb, past tense
NNP   Proper noun, singular    VBG  Verb, gerund or present participle
NNPS  Proper noun, plural      VBN  Verb, past participle
PRP   Personal pronoun         VBP  Verb, non-3rd person singular present
IN    Preposition              VBZ  Verb, 3rd person singular present
TO    to                       .    Sentence-final punctuation (. ? !)
➣ The tags include inflectional information
➢ If you know the tag, you can generally find the lemma
➣ Some tags are very specialized: I/PRP wanted/VBD to/TO go/VB ./.
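To illustrate how a tag helps recover the lemma, here is a rough sketch. The irregular-form table and suffix rules are hypothetical and far from complete; a real lemmatizer would rely on a morphological lexicon.

```python
# Hypothetical, tiny table of irregular forms keyed by (word, Penn Treebank tag).
IRREGULAR = {
    ("made", "VBD"): "make", ("made", "VBN"): "make",
    ("took", "VBD"): "take", ("taken", "VBN"): "take",
    ("children", "NNS"): "child",
}

# Crude suffix rules per tag: (suffix to strip, replacement).
SUFFIX_RULES = {
    "NNS": [("ies", "y"), ("s", "")],
    "VBZ": [("ies", "y"), ("s", "")],
    "VBD": [("ed", "")],
    "VBN": [("ed", "")],
    "VBG": [("ing", "")],
}

def lemma(word, tag):
    """Guess the lemma of `word` given its tag; uninflected tags return the word unchanged."""
    if (word.lower(), tag) in IRREGULAR:
        return IRREGULAR[(word.lower(), tag)]
    for suffix, replacement in SUFFIX_RULES.get(tag, []):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word

tagged = [("I", "PRP"), ("wanted", "VBD"), ("to", "TO"), ("go", "VB"), (".", ".")]
lemmas = [lemma(w, t) for w, t in tagged]
```

The rules are deliberately naive (for example, they would turn making/VBG into mak), which is one reason the inflectional class has to be stored per word.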
Good Definitions
➣ a definition should be simpler than the word being explained
➣ the definition should match the part of speech
definition (n) a concise explanation of the meaning of a word or phrase
define (v) – give a definition for the meaning of a word; “Define ‘sadness’”
➣ the definition should not be circular
➣ all words in the definition should be defined (somewhere)
➢ prefer small defining vocabulary
➢ only use metalanguage (NSM: Natural Semantic Metalanguage)
Circular definitions
beauty the state of being beautiful
beautiful full of beauty
bobcat a lynx
lynx a bobcat
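Circularity of this kind can be checked mechanically. The sketch below is an illustration only: it treats each definition as a bag of whitespace-separated tokens, follows tokens that are themselves headwords, and reports a chain that leads back to the starting word. The fifth entry is added so that there is one non-circular case.

```python
DEFINITIONS = {
    "beauty": "the state of being beautiful",
    "beautiful": "full of beauty",
    "bobcat": "a lynx",
    "lynx": "a bobcat",
    "definition": "a concise explanation of the meaning of a word",
}

def find_cycle(headword, definitions):
    """Return a chain of headwords leading back to `headword`, or None if there is none."""
    def visit(word, path):
        for token in definitions.get(word, "").split():
            if token == headword:
                return path + [token]
            if token in definitions and token not in path:
                found = visit(token, path + [token])
                if found:
                    return found
        return None
    return visit(headword, [headword])

cycle_beauty = find_cycle("beauty", DEFINITIONS)
cycle_bobcat = find_cycle("bobcat", DEFINITIONS)
cycle_definition = find_cycle("definition", DEFINITIONS)
```

With a full dictionary every word eventually leads into some cycle (the grounding problem), so in practice the interesting cases are short cycles like the two above.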
Other useful information
http://en.wiktionary.org/wiki/lynx
➣ Pronunciation
➣ Usage Examples
➣ Illustrations
➣ Etymology (history of the word)
➣ Links to other resources
Easier to do without the space restrictions of a paper dictionary.
Dictionaries for NLP
Minimize content in order to minimize acquisition problem.
Declarativity and human readability with compilation into a machine-friendly representation.
Modularity so components are reusable: e.g. distinct monolingual and transfer lexicons in an MT system.
Capture generalizations with inheritance (and lexical rules etc). This avoids errors and makes the lexicon easier to maintain and expand.
Underspecification to reduce disambiguation for a particular application.
(Copestake, 1992)
Morphological Analysis
森 永 前 日 銀 総 裁
rin ei zen hi gin sou sai
mori mae nichi
morinaga zennichi gin sousai
morinaga zen nichigin sousai
➣ 森永 前 日銀 総裁
Morinaga former Bank of Japan President
Morphological Lexicons
➣ Stem
➣ Inflectional Class
➣ Part of Speech (often 1-200)
➣ Arguments (?)
➣ For example
➢ Relations: 前 - 総裁
➢ Arguments: 前(総裁)
➢ Abstraction: 前(title); 総裁 ⊂ title
Morphological Lexicon
➣ I fabricate for a living
➣ I make things for a living
➣ I fabricated yesterday
➣ I made things for a living
➣ These are differences in the inflectional class
Inflection
➣ Inflection: In many languages, words appear in different forms to show small differences in meaning: for example number (dog/dogs; child/children) or tense/aspect (make/made/making/made; take/took/taking/taken)
➣ Many words pattern the same way; such a pattern is called an inflectional class (or paradigm). For example, one class of plurals in English is words that end in y: fly/flies; sky/skies.
➣ The inflectional class is normally not predictable from the meaning or syntax, and so must be stored for each word
➣ The root form (lemma) and the inflected form have the same meaning modulo the number/tense/. . . and the same basic part-of-speech
➣ Normally a word only undergoes one inflection
Derivation
➣ New words can also be created by changing the form. If the part of speech or meaning changes, we call it derivation: (happy/happiness; happy/unhappy; happy/happily)
➣ You can also get zero derivation, where the meaning changes without a change in form (I butter the bread/I like butter).
➣ The root form in derivation is called the stem and the process of stripping off derivational affixes is stemming
➣ You can have multiple derivations: anti-dis-establish-ment-arian-ism: the stem is establish
➣ Derivation is largely but not entirely productive: employer, teacher,*studier, actor, contractor
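Stemming can be sketched as repeatedly stripping known affixes. The affix lists below are a small hypothetical sample chosen to cover the examples on this slide; this is not a standard algorithm such as the Porter stemmer.

```python
# Hypothetical affix inventories, just large enough for the slide's examples.
PREFIXES = ["anti", "dis", "un"]
SUFFIXES = ["ism", "arian", "ment", "ness", "ly"]

def stem(word, min_length=3):
    """Strip derivational affixes until none applies, keeping at least `min_length` letters."""
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_length:
                word = word[len(prefix):]
                changed = True
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
                word = word[:-len(suffix)]
                changed = True
    return word

stem_long = stem("antidisestablishmentarianism")
stem_unhappy = stem("unhappy")
stem_happiness = stem("happiness")
```

The first two calls recover establish and happy. The third yields happi rather than happy, because pure affix stripping ignores spelling changes; stems produced this way are not always real words.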
Syntactic Lexicon
➣ I fabricated the results
➣ I made up the results
➣ = I made the results up
➣ I walked down the road
➣ ≠ I walked the road down
➣ These are differences in the syntactic lexical type
Differences in Argument Structure
➣ These are also differences in the syntactic lexical type
➢ I gave the book to him
➢ I gave him the book
➢ Cats eat mice
➢ Cats eat
➢ Cats devour mice
➢ *Cats devour
➣ The information about what arguments a verb can take is also called subcategorization, valence or argument frame
Semantic Lexicon
➣ I deposited the money in the bank (financial)
➣ The river overflowed its bank (riverside)
➣ I had lunch by the bank (???)
➣ These are differences in the semantic class
All the possibilities combine
➣ I saw her duck
➢ see/saw, saw/sawed
➢ duckN, duckV
➢ duckN:cloth, duckN:bird
➣ Still useful to keep separate
➢ inflectional paradigm
➢ arguments (subcategorization)
➢ semantics (selectional preferences)
Transfer Lexicons
➣ bank ↔ 銀行 ginkou
➣ bank ↔ 土手 dote
➣ 鼻 hana ↔ nose
➢ trunk [of elephant]
➢ muzzle [of horse]
➢ snout [of boar]
Dictionaries in Processing
➣ Lexical lookup is slow (disk-based)
➢ Compile dictionaries into compressed format
➢ Index
➢ Cache the index
  cache = load it into memory
➢ Cache already accessed entries
➢ Keep a list of frequent entries and cache them
  the most frequent words are very frequent
➣ Batch check for consistency off-line
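Caching already-accessed entries can be sketched with Python's standard `functools.lru_cache`. The in-memory `DISK_LEXICON` and the read counter are stand-ins invented for illustration; a real system would read from a compiled, indexed file.

```python
from functools import lru_cache

# Stand-in for a disk-based dictionary (hypothetical entries).
DISK_LEXICON = {"the": "DT", "bank": "NN", "deposit": "VB"}
disk_reads = 0

def slow_lookup(word):
    """Pretend to fetch an entry from disk, counting each access."""
    global disk_reads
    disk_reads += 1
    return DISK_LEXICON.get(word)

@lru_cache(maxsize=10_000)
def lookup(word):
    """Cached lookup: repeated words are served from memory."""
    return slow_lookup(word)

for token in "the bank by the river near the bank".split():
    lookup(token)
```

Eight tokens trigger only five disk reads here, because the and bank repeat. Since the most frequent words are very frequent, even a small cache absorbs a large share of lookups.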
Dictionaries and Intellectual Property Rights (IPR)
➣ Lexicography has a long tradition of extending others’ work
➢ Johnson, Murray, . . .
➢ Language itself should not be restricted
➢ Dictionaries describe language
➣ Restricted resources lose to open resources
➢ Work on restricted resources is wasted
➣ It is hard to fund maintenance
➢ Users of the lexicons are the best developers
Online Dictionaries
➣ Mouse-over lookup (http://www.polarcloud.com/rikaichan/)
➣ No space restrictions (JMdict)
➣ Collaborative Construction (Wiktionary)
➣ Easy cross-referencing (WordNet)
➣ Easy to link dictionaries (Open Multilingual WordNet)
http://compling.hss.ntu.edu.sg/omw/
What is an Ontology
➣ A set of statements in a formal language
that describes/conceptualizes knowledge in a given domain
➢ What kinds of entities exist (in that domain)
➢ What kinds of relationships hold among them
➣ Ontologies usually assume a particular level of granularity
➢ doesn’t capture all details
In Other Words
➣ In theory
“An ontology is a formal, explicit specification of a shared conceptualisation.” (Gruber, 1993)
➣ More generally
An ontology provides a shared vocabulary, which can be used to model a domain, that is, the type of objects and/or concepts that exist, and their properties and relations
Why use Ontologies?
➣ To make different markup terminologies transparent
➣ Placing the focus on the meaning of the markup (not its form)
➣ To make implicit knowledge explicit
➣ To ensure that the knowledge is consistent
or at least consistently formatted
What does an Ontology consist of?
Classes: abstract groups, sets, or collections of objects
  country, language
Individuals: actual things or concepts
  Japan, Japanese
Attributes: properties of classes or individuals
  Japan AREA 377,873 km²
Relations: relations between classes or individuals
  Japanese SPOKEN-IN Japan
Objects can be complex
➣ relation LOCATED-IN has the attribute TRANSITIVE
ADD3 LOCATED-IN Bangkok
Bangkok LOCATED-IN Thailand
⇒ ADD3 LOCATED-IN Thailand
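The inference licensed by TRANSITIVE can be sketched as a closure computation over the stated facts. This is an illustrative toy, not how any particular ontology reasoner is implemented.

```python
def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present, until nothing new appears."""
    closure = set(pairs)
    while True:
        inferred = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if inferred <= closure:
            return closure
        closure |= inferred

# The two stated LOCATED-IN facts from the slide.
LOCATED_IN = {("ADD3", "Bangkok"), ("Bangkok", "Thailand")}
facts = transitive_closure(LOCATED_IN)
```

After the call, `facts` also contains the inferred pair ("ADD3", "Thailand"), which was never stated explicitly.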
Some Examples
➣ Disease Ontology
http://diseaseontology.sourceforge.net
➣ Dublin Core
http://dublincore.org/
➣ GOLD (General Ontology for Linguistic Description)
http://www.linguistics-ontology.org/gold.html
➣ WordNet (Fellbaum, 1998)
http://wordnet.princeton.edu/
Disease Ontology
➣ A controlled medical vocabulary
➣ developed at the Bioinformatics Core Facility
➣ designed to map diseases to medical codes
such as ICD9CM, SNOMED and others
➣ an early version of the Disease Ontology
➢ doubled concept coverage
➢ reduced the overall misclassification error percentage
Ontology linked to Terminology
Concept  Term     Strings
C316301  T657210  bovine spongiform encephalopathy, BSE
         T657211  mad cow disease, Mad Cow Disease, MCD
         T734567  encephalopathy spongiforme bovine, ESB
         T734566  maladie de la vache folle, MVF
         T700345  encefalopatia espongiforma bovina, EEB
         T700346  enfermedad de la vaca loca, EVL
➣ Disease Ontology V2.1 2005
➢ diseases and injuries
∗ Unspecified infectious and parasitic diseases
· poliomyelitis and other non-arthropod-borne viral diseases
+ Unspecified slow virus infection of central nervous system
Disease Ontology
➣ A lightweight ontology with a direct application
➣ Maintained by cooperative editing
➣ One of several linked medical ontologies
➣ Successful Application
Dublin Core
➣ Goals
➢ Provide a semantic vocabulary for describing the “core” information properties of resources (electronic and “real” physical objects)
➢ Provide enough information to enable intelligent resource discovery systems
➣ History
➢ A collaborative effort started in 1995
➢ Initiated by people from computer science, librarianship, on-line information services, abstracting and indexing, imaging and geospatial data, museum and archive control.
http://www.tutorialsonline.info/Common/DublinCore.html
Dublin Core - 15 Elements
➣ Content (7)
➢ Title, Subject, Description, Type, Source, Relation and Coverage
➣ Intellectual property (4)
➢ Creator, Publisher, Contributor, Rights
➣ Instantiation (4)
➢ Date, Language, Format, Identifier
Dublin Core – discussion
➣ Widely used to catalog web data
➢ OLAC: Open Language Archives Community
➢ The MusicBrainz Project: http://www.musicbrainz.org
The MusicBrainz project is run by volunteers who are defining a metadata standard for music recordings. This metadata standard is an extension of the Dublin Core. The goal of the project is to define the metadata standard for music and to create a metadata catalog of all music recordings around the world.
➢ Australian Government Locator Service (AGLS) http://www.agls.gov.au/
AGLS was developed in late 1997 as the resource discovery metadata standard for Australian governments and was endorsed for use by all levels of government in Australia in November 1998.
➢ . . .
GOLD
➣ an upper ontology for descriptive linguistics
➣ providing a set of linguistic notions
➣ designed to link different grammatical descriptions
➣ PART-OF E-MELD ‘Electronic Metastructure for Endangered Languages Data’
➣ USED-IN ODIN http://www.csufresno.edu/odin/
Singular Number (concept)
Definition:
A value of numberFeature. Singular quantifies the denotation of the nominal element so that:
1. it specifies that there is exactly one. In this English example below, singularNumber is shown by both the noun and the verb in (1): See example (Corbett2000 : 5 )
2. additionally, but not necessarily, this value may be assigned on the basis of formal properties (e.g. singularia tantum), or (health / *healths).
3. if singularNumber functions as generalNumber, it may specify a lack of commitment with regard to quantification. In this Japanese (jpn) example below, ’inu’ (dog) is not specified for number: See example
Sg: Usage note
➣ On terminology: the term ’singulative’ is sometimes used for the concept singularNumber, especially when singularNumber is overtly expressed. ’Singulative’ has been used sometimes for singularNumber in systems where singularNumber is distinct from generalNumber.
➣ It is worth bearing in mind that the expression of number can differ cross-linguistically according to the animacy hierarchy. See numberAssignmentSystem.
➣ A note on minimal/augmented systems (and also minimal/unit-augmented/augmented). In some languages which have an inclusive/exclusive distinction in the first person, the firstPersonInclusive may use the morphology which otherwise expresses singularNumber, even though the semantics of firstPersonInclusive entail that it cannot be singular. There is an analysis of this in which the morphology is seen as representing the minimal number associated with the particular person value. Under such a system, the label ’minimal’ can be mapped onto the concept singularNumber, except if one is dealing with firstPersonInclusive minimal, which would be mapped onto the concept dualNumber. (Corbett2000 : 166-169 ) (Conklin1962 )
➣ There is an important theoretical question about whether minimal/unit-augmented/augmented should be considered separate concepts in the GOLD ontology. The main argument for this is that under such systems, the number values dual and trial are expressed only on the firstPersonInclusive by using the morphology otherwise associated with singular and dual respectively. However, as it is possible to specify a mapping from one system onto the other, we allow for a COPE to deal with this substantive issue while ensuring interoperability.
Sg: Cross references
Parent Number Feature
Siblings Dual Number, General Number, Greater Paucal Number, Greater Plural Number, Paucal Number, Plural Number, Trial Number
GOLD — Discussion
➣ Main emphasis on definitions and references
➣ Not yet widely used
➣ Linked to by some grammars (e.g. JACY, ERG)
➣ Ontology framework makes it easy to
➢ validate
➢ refer-to
WordNet
➣ Princeton WordNet® is a large lexical database of English.
➣ Nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
➣ Synsets interlinked
➢ hypernym/hyponym/instance (is-a)
➢ meronym/part (has-a)
➢ domain
➣ Free license
Multilingual WordNets
➣ Over twenty languages (Polish, Serbian, Croatian, Hindi, Telugu, Tail, Malayalam to come)
➣ Linking many different projects with a common core
➣ Created at NTU
http://casta-net.jp/~kuribayashi/multi/
Noun Relations (WordNet)
hypernyms: Y is a hypernym of X if every X is a (kind of) Y
hyponyms: Y is a hyponym of X if every Y is a (kind of) X
coordinate terms: Y is a coordinate term of X if X and Y share a hypernym
holonym: Y is a holonym of X if X is a part of Y
meronym: Y is a meronym of X if Y is a part of X
Example: driver
1. (17) driver – (the operator of a motor vehicle)
2. driver – (someone who drives animals that pull a vehicle)
3. driver – (a golfer who hits the golf ball with a driver)
4. driver, device driver – ((computer science) a program that determines how a computer will communicate with a peripheral device)
5. driver, number one wood – (a golf club (a wood) with a near vertical face that is used for hitting long shots from the tee)
Verb Relations (WordNet)
hypernym: the verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (travel to movement)
troponym: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (lisp to talk)
entailment: the verb Y is entailed by X if by doing X you must be doing Y(sleeping by snoring)
coordinate terms: those verbs sharing a common hypernym
Adjective and Adverb Relations (WN)
➣ Adjectives
➢ antonymy
➢ related nouns
➢ similar to
➢ participle of verb
➣ Adverbs
➢ root adjectives
Other Relations
domain:
driver#n#3 “device driver” DOMAIN computing#n#1
derivationally related form:
driver#n#3 “device driver” RELATED TO drive#n#20 “cause to function by supplying the force or power for or by controlling”
Usability and Accessibility
Usability :
➣ Originally designed for psycholinguistic experiments
➣ Widely used in NLP
➢ PP attachment
➢ WSD - senseval
Accessibility :
➣ downloadable
➣ redistributable
➣ actively maintained
There are many wordnets
➣ Because WordNet is both usable and accessible it has inspired the creation of wordnets in many languages.
➢ There are over 60 wordnet projects
http://globalwordnet.org/wordnets-in-the-world/
➢ Many have released open data (22 languages in 2013)
http://compling.hss.ntu.edu.sg/omw/
➣ Here at NTU we are building several
➢ Japanese Wordnet (Isahara et al., 2008)
http://nlpwww.nict.go.jp/wn-ja/index.en.html
➢ Wordnet Bahasa (Nurril Hirfana et al., 2011)
http://wn-msa.sourceforge.net/
➢ Chinese Open Wordnet (Wang & Bond, 2013)
http://compling.hss.ntu.edu.sg/cow/
SUMO
➣ Suggested Upper Merged Ontology (Niles & Pease, 2001; Pease, 2006)
http://www.ontologyportal.org/
➣ IEEE sponsored free ontology (with Domain Ontologies)
➢ 20,000 terms
➢ 70,000 axioms
➣ All WordNet synsets are mapped to the ontology
➣ Used as an upper ontology for many projects
Lexicon Construction and Maintenance
➣ Hand construction
➣ Reuse of existing resources
➣ Machine readable dictionaries
➣ Corpus-based approaches
Hand-coding
➣ Someone adds all the information about a word by hand
➣ Expensive (estimates from my time at NTT)
➢ 2,000 yen/verb
➢ 200 yen/noun
➣ The traditional technique for full-blown, linguistically-motivated systems.
➣ For applications, hand-coding should be corpus-driven.
➣ Techniques exist to help the grammar engineer construct the lexicon, or to (partially) allow non-expert users to build lexicons.
Reusing resources
The primary problem is building usable lexicons: if it’s usable then it’s reusable!
But there can be many problems:
➣ Lack of documentation, especially for semantics
➣ Cost/benefit - not worth reusing a 100 word lexicon
➣ Domain specificity
➣ Legal issues
There are many existing resources: wordlists (e.g. software companies), COMLEX, MRDs (machine readable dictionaries), Wordnet, EDR, . . .
Machine-readable dictionaries
➣ Long history of research
➣ Limited explicit syntactic information,
except in English learners’ dictionaries (OALD (Oxford), LDOCE (Longman), Cobuild, CIDE (Collaborative International Dictionary of English))
➣ Noun definitions can be analysed to derive taxonomies:
inheritance hierarchies for semantic information.
➣ Less published work on verbs
➣ Aquilex, MindNet, Lexeed
Deriving semantic information
Categorise lexical entries using predefined linguistically motivated classes. E.g. automatically derived taxonomies.
Sauternes a type of sweet gold-coloured French wine
1. find genus term (wine)
2. disambiguate genus term wrt MRD senses: wine1
3. (possibly) refine classification using differentia
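Step 1 can be approximated with a simple heuristic, sketched below under strong assumptions: strip a leading article and any "type/kind/sort of" wrapper, cut the definition at the first postmodifier-introducing word, and take the last remaining token as the genus. The stop-word list is hypothetical, and steps 2 and 3 (disambiguation and refinement) are not attempted.

```python
import re

# "a type of", "the kind of", ... wrap the real genus term.
CLASSIFIER = re.compile(r"^(?:an?|the)\s+(?:type|kind|sort)\s+of\s+", re.IGNORECASE)
ARTICLE = re.compile(r"^(?:an?|the)\s+", re.IGNORECASE)
# Hypothetical list of words that usually start the differentia.
STOP = {"of", "that", "which", "who", "with", "for", "in", "from", "used"}

def genus_term(definition):
    """Return a guess at the genus term (head noun) of a noun definition."""
    text = CLASSIFIER.sub("", definition.strip())
    text = ARTICLE.sub("", text)
    head_phrase = []
    for token in text.split():
        if token.lower() in STOP:
            break
        head_phrase.append(token)
    return head_phrase[-1].strip(".,;") if head_phrase else None

genus_sauternes = genus_term("a type of sweet gold-coloured French wine")
genus_definition = genus_term("a concise explanation of the meaning of a word or phrase or symbol")
genus_driver = genus_term("someone who drives animals that pull a vehicle")
```

Work on real MRDs typically parses the definition sentence rather than relying on string patterns like these, since definitions vary widely in style between dictionaries.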
MRDs: pros and cons
➣ Advantages:
➢ dictionaries are used for manual lexicon construction
➢ broad coverage
➢ a labour-saving resource to augment a manually constructed core lexicon
➣ Disadvantages:
➢ time-consuming to process initially
➢ highly variable quality
➢ different dictionaries require different strategies
➢ combining senses is non-trivial
➢ dictionaries inadequate even for human learners (ambiguities, little frequency information, obsolete or offensive terms, poor coverage of idioms)
➢ publishers’ IPR (intellectual property rights)
➢ restricted by printed medium
➣ Many problems partially solved by Wiktionary
➢ Collaborative Lexical Construction
Corpus-based acquisition techniques
➣ Test corpus essential for hand-coded lexicons
➣ Semi-automatic acquisition techniques:
➢ show examples in context (concordances)
➢ filter automatically built entries
➣ You can learn monolingual information (typically not 100
➢ part of speech of unknown words
➢ syntactic class (subcategorisation frames)
➣ Bilexicon acquisition from aligned corpora
➢ learn translations
More next week
Corpora: pros and cons
➣ Advantages:
➢ essential tool for human lexicographer
➢ domain-specific terminology and translations
➢ frequency information (some approaches)
➢ possibility of addressing ambiguity problem
➣ Disadvantages:
➢ automatically processing corpora is a research challenge in itself
➢ large scale corpus may not exist for particular domain or language
– parallel corpora especially difficult
➢ For MT: large scale results only demonstrated with one-to-one mappings; so far, automatic extraction unproven in full systems
➢ Also unclear how well results transfer to different classes of text
Lexicons: Ensuring Consistency
➣ Documentation is essential
➢ Lexicons/ontologies are built by groups of people
➢ Combine documentation with examples
∗ Automatically test examples (e.g. COMLEX: Thursday)
∗ Link documentation to annotated corpora
➣ Exploit redundancy rules
if there is a correlation between two different classes
➢ uncountable ⇒ singular (test it or enforce it)
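Testing such a rule off-line is straightforward once entries are structured. In the sketch below the entry format and the sample entries are invented for illustration, with one deliberate error for the check to catch.

```python
# Hypothetical lexicon entries; field names are assumptions, not a real lexicon format.
LEXICON = [
    {"lemma": "dog", "countability": "countable", "number": "singular"},
    {"lemma": "information", "countability": "uncountable", "number": "singular"},
    {"lemma": "furniture", "countability": "uncountable", "number": "plural"},  # deliberate error
]

def rule_violations(lexicon, condition, consequence):
    """Return lemmas of entries that satisfy `condition` but not `consequence`."""
    return [e["lemma"] for e in lexicon if condition(e) and not consequence(e)]

# Redundancy rule from the slide: uncountable ⇒ singular.
violations = rule_violations(
    LEXICON,
    condition=lambda e: e["countability"] == "uncountable",
    consequence=lambda e: e["number"] == "singular",
)
```

Here the check reports furniture. Enforcing the rule instead of testing it would mean filling in `number` automatically whenever `countability` is uncountable.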
➣ Allow different views
➢ All words with a certain property or properties
∗ POS
∗ Semantic Class
∗ Countability
➣ Add words class-by-class
so that the meta-data is constant (POS, syntactic/semantic class)
➢ add all soccer players’ names (Diego, Best, . . . )
➢ add all quotative verbs (say, tell, think, know, . . . )
➢ add all time expressions (today, this morning, yesterday morning, last night, tonight, tomorrow night, . . . )
➢ add all classifiers (匹,人,台,個,枚,本, . . . )
Some very concrete Examples
➣ Redefining the dictionary (by Erin McKean; TED Talk 2007) (http://blog.ted.com/2007/08/30/redefining_the/)
➣ Building a simple NLP Lexicon from an MRD
➣ Building a new bilingual dictionary
Case Study: NLP Lexicon from MRD
➣ Build an ontology of relations between word senses
➢ Information comes from machine readable dictionary
driver2: somebody who drives a car
➣ Extract genus term by parsing the definition sentence
➢ headword wh ⊂ genus term wg
〈HYPERNYM, somebody, driver2〉
➣ Evaluate by comparing to Goi-Taikei
Nichols et al. (2005)
A Sample Entry: Driver 1
Index ドライバー doraiba-
POS noun
Familiarity 6.5 [1–7]
Sense 1
Lexical-type noun-lex
Definition
S1 ねじ/まわし/。
screw turn (screwdriver)
S1′ ねじ/を/差し入れ/たり/、/抜き取っ/た/する/道具/。
a tool for inserting and removing screws .
Hypernym 道具1 equipment “tool”
Sem. Class 〈942:tool〉 (⊂ 893:equipment)
A Sample Entry: Driver 2,3
Sense 2
Definition
[
S1 自動車/を/運転/する/人/。
Someone who drives a car
]
Hypernym 人1 hito “person”
Sem. Class 〈292:driver〉 (⊂ 5:person)
Sense 3
Definition
S1 ゴルフ/で/、/遠/距離/用/の/クラブ/。
In golf, a long-distance club.
S2 一番/ウッド/。/
A number one wood .
Hypernym クラブ2 kurabu “club”
Sem. Class (〈921:leisure equipment〉 (⊂ 921))
Domain ゴルフ1 gorufu “golf”
Parse Results for Driver 2 (MRS)
〈h, x1 {h: prpstn_rel(h1), h1: hito(x1), h2: jidosha(x2), h3: unten(u1, x1, x2)}〉
〈h, x1 {h: prpstn_rel(h0), h0: person(x1), h1: some(x1, h0, h4), h2: car(x2), h3: drive(u1, x1, x2)}〉
「自動車を運転する人」 somebody who drives a car
Minimal Recursion Semantics (simplified)
➣ Generally language independent
➣ Genus term is normally the highest scoping word (x1):
doraiba2 ⊂ hito(x1) or driver2 ⊂ person(x1)
Extracting more from the MRS
Token         Reading        Gloss
ア:           a:             a:
アルプス      arupusu        alps
、            ,              ,
または        matawa         or
日本アルプス  nihon-arupusu  japan alps
の            no             ADN
略            ryaku          abbreviation
a: an abbreviation for the Alps or the Japanese Alps
➣ Sometimes highest scoping word is an explicit relation
e.g. 〈abbreviation, kind, name, general term〉
➣ Sometimes there is coordination
➢ 〈ABBREVIATION, ア “a”, アルプス “Alps”〉
➢ 〈ABBREVIATION, ア “a”, 日本アルプス “Japanese alps”〉
Example of class condiment
トマトケチャップ1 tomato ketchup
ホワイトソース1 white sauce
ミートソース1 meat sauce
ソース2 sauce
トマトソース1 tomato sauce
ケチャップ1 ketchup
調味料1 condiment
塩1 salt
カレー粉1 curry powder
カレー1 curry
香辛料1 spice
スパイス1 spice
You can only extract what is in the dictionary definitions
Case Study: Transfer Lexicons
➣ I (or my system) speak language S
➣ I (or my system) want to understand language T
Q: What do I do if I have no S ⇔ T lexicon?
A: Look up S ⇔ I ⇔ T
➢ How can I do this accurately, especially if I don’t understand T?
Bond & Ogura (2007)
Example
S: 印 in
I: mark, seal, stamp
T: anjing laut, mohor, tera
Figure 1: Matching through I
Our Specific Problem
➣ Make a bilingual Japanese → Malay lexicon
by crossing J → E and M → E
➢ Largest existing lexicons ≈ 7,000 words
➣ The resulting lexicon will be used by a Ja-Ms MT system
The lexicon should have:
➢ Appropriate Translation Equivalents
∗ Especially the first one
➢ Parts of Speech
➢ Semantic Classes (Semantic Transfer System)
Two Kinds of Sense Distinctions
➣ Homonyms (Must disambiguate)
➢ Clearly different meanings (different Semantic Classes)
seal ⇔ あざらし azarashi 〈animal〉 vs seal ⇔ 印 in 〈tool〉
➢ Distinguish using semantic classes
➣ Variants (near synonyms) (Should disambiguate)
➢ Finer grained differences (same Semantic Classes)
鳩 hato → doves or pigeons
➢ Distinguish using domains, collocations, n-grams, . . .
➢ As a fall-back, use ranked preferences
鳩 hato → (1) pigeon; (2) dove
Shared Translations
S: 印 in
I: mark, seal, stamp, imprint, gauge
T: mohor, tera, anjing laut
mohor        0.4   = 2/(3+2)
tera         0.286 = 2/(3+4)
anjing laut  0.25  = 1/(3+1)
Our Approach
➣ Lexicons
➢ Japanese-English Lexicon (Goi-Taikei)
➢ Malay-English-Chinese Lexicon (KAMI)
➢ Japanese-Chinese Lexicon (Ri-Zhong Cidian)
➣ Scoring
➢ Syntactic matching (POS)
➢ Shared Translations
➢ Semantic matching (Semantic Classes)
➢ Second-language matching (Chinese)
➣ Finally hand checking
Japanese-English lexicon
➣ 380,000 Japanese-English word pairs
➣ 3,000 semantic categories (human, inanimate, etc.)
➢ 2,710 common-noun classes
➢ 200 proper-noun classes
➢ 108 verb, event, and state classes
➣ Subcat frames and selectional restrictions for 15,000 verb senses
➣ Available as a book (five volumes) or CD-ROM
➢ Goi-Taikei — A Japanese Lexicon
Semantic Transfer Dictionary
➣
Japanese あざらし (azarashi)
English seal
POS noun
Sem Classes 〈animal〉
Rank 1
➣ In the noun dictionary:
➢ 63,926 Japanese index words
➢ 71,818 Japanese-English pairs
➢ 49,205 different English entries
∗ 90% with 1 translation, 6.5% with 2, 2% with 3
➢ average number of translations is 1.12
The Malay-English lexicon: KAMI
➣
Malay anjing laut
POS 〈noun〉
Classifier ekor (27%)
Sem Classes 〈animal〉 (30%)
English seal
Chinese 海豹 (hai3 bao4) (25%)
➢ 67,658 Malay index words
➢ 91,426 Malay-English word pairs
∗ 79% with 1 translation, 14% with 2, 4% with 3
➢ average number of translations is 1.35
Adding Semantic Classes to KAMI
1. Original syntactic-semantic codes → Goi-Taikei semantic classes (10,000)
e.g. noun.city → city
2. CICC Indonesian dictionary classes (found for 14,784 entries)
mapped to Goi-Taikei semantic classes (hand-made partial mapping)
3. Malay numeral classifiers (found for 18,303 nouns)
mapped onto Goi-Taikei semantic classes (hand-made partial mapping)
e.g. ekor → animal; orang → human
4. Known word lists (ISO 639 languages; ISO 4217 currencies)
ISO 4217 entry → currency
5. Manual addition
Ja-Cn lexicon: Ri-Zhong Cidian
➣ Example:
Japanese あざらし (azarashi)
Japanese Kanji 海豹
Chinese Hanzi 海豹
Pronunciation hai bao
➢ 83,000 Japanese-Chinese word pairs
Crossing
➣ For each pair in the Japanese-English lexicon
➢ Look up the Malay equivalent of the English
if an entry with the same coarse POS exists
∗ create a Japanese-Malay pair
∗ store the intermediate English
∗ calculate scores
· shared translations
· semantic matching
· second-language matching
➢ else mark the Japanese-English pair
➣ For each Japanese index: rank Ja-Ms then Ja-En
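The crossing loop can be sketched in Python as follows. This is a rough illustration, not the implementation used for the dictionary: the toy entries reuse words from the slides, but which word links to which English translation is assumed, and scoring is left out.

```python
from collections import defaultdict

# Toy lexicons as (word, coarse POS, English translation) triples.
# The individual links are assumptions made for illustration.
JA_EN = [("あざらし", "noun", "seal"), ("印", "noun", "seal"),
         ("印", "noun", "stamp"), ("印", "noun", "gauge")]
MS_EN = [("anjing laut", "noun", "seal"), ("mohor", "noun", "seal"),
         ("tera", "noun", "stamp")]


def cross(ja_en, ms_en):
    """Link Japanese and Malay words that share an English translation and POS."""
    # Index the Malay side by (English, POS) so each lookup is one dict access.
    by_english = defaultdict(set)
    for malay, pos, english in ms_en:
        by_english[(english, pos)].add(malay)

    pairs = defaultdict(set)  # (ja, ms) -> set of intermediate English words
    marked = []               # Ja-En pairs with no Malay equivalent
    for ja, pos, english in ja_en:
        malays = by_english.get((english, pos))
        if not malays:
            marked.append((ja, english))
            continue
        for malay in malays:
            pairs[(ja, malay)].add(english)
    return dict(pairs), marked


pairs, marked = cross(JA_EN, MS_EN)
for (ja, ms), english in sorted(pairs.items()):
    print(ja, ms, sorted(english))
print("unlinked:", marked)
```

Note that the toy data already links あざらし to mohor through the ambiguous pivot word "seal"; spurious pairs like this are what the scores are meant to rank down.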
Example
[Figure 2: Matching through I and Sem. Column S: 印 (in). Column I (English): mark, seal, stamp, imprint, gauge. Column T (Malay): mohor, tera, anjing laut. Semantic class labels shown: stationery, tool.]
Calculating the Scores
➣ Shared translation score for word pair J and M, where E(W) is the set of English translations of W:
\[ \text{shared translation score} = \frac{|E(J) \cap E(M)|}{|E(J)| + |E(M)|} \qquad (1) \]
➣ Semantic matching score for word pair J and M: the number of semantic classes of J which subsume or are subsumed by a semantic class of M
➣ Total score = 10× semantic score + shared translation score − rank
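A minimal Python sketch of the three scores, under stated assumptions: the toy class hierarchy in PARENT is hypothetical, and rank is taken to be the Rank field of the Japanese-English entry, which the slides do not state explicitly.

```python
# Hypothetical toy hierarchy: child class -> parent class.
PARENT = {"animal": "concrete", "tool": "concrete"}


def subsumes(a, b):
    """True if class a equals b or is an ancestor of b in PARENT."""
    while b is not None:
        if a == b:
            return True
        b = PARENT.get(b)
    return False


def shared_translation_score(e_j, e_m):
    """Equation (1): |E(J) & E(M)| / (|E(J)| + |E(M)|)."""
    total = len(e_j) + len(e_m)
    return len(e_j & e_m) / total if total else 0.0


def semantic_score(classes_j, classes_m):
    """Count classes of J that subsume, or are subsumed by, a class of M."""
    return sum(
        1 for cj in classes_j
        if any(subsumes(cj, cm) or subsumes(cm, cj) for cm in classes_m)
    )


def total_score(sem, shared, rank):
    """Total score = 10 x semantic score + shared translation score - rank."""
    return 10 * sem + shared - rank


# あざらし and anjing laut: both translate only to "seal", both are <animal>.
shared = shared_translation_score({"seal"}, {"seal"})
sem = semantic_score({"animal"}, {"animal"})
print(total_score(sem, shared, rank=1))
```

With the score defined as in equation (1), a pair whose translation sets are identical scores 0.5, so the factor of 10 lets a single semantic class match outweigh any amount of shared-translation evidence.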
Results
➣ Crossed the Japanese-English common nouns with the Malay-English nouns
➢ 22,658 out of 63,926 Japanese words linked
➢ 16,974 out of 67,658 Malay words
➢ 75,872 Japanese-Malay pairs
➢ Average number of translations was 3.4
➣ Tested 65 randomly selected linked Japanese index words (232 translations)
Results (2)
[Bar charts: translation quality, % of pairs in each category]
                    Good   OK     Error  Bad
All Pairs           40.1   25     12.1   22.8
First Ranked Pair   46.2   33.8   9.2    10.8
Matching through two languages
[Figure 3: Matching through two languages. Japanese: 印. English: mark, seal, stamp, imprint, gauge. Chinese: 印章. Malay: tera, mohor, anjing laut.]
Results (two pivots)
➣ 5,238 pairs matched both English and Chinese
➢ 97% good translations
➢ 8.1% of the original Japanese index words
➢ 1 in 4 matched Japanese index words
➣ High Precision/Low Recall
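Requiring a match through both pivots is a set intersection of the two candidate lists. A small Python sketch, assuming each lexicon is a dict from word to its set of pivot translations; the links for 印 and mohor are assumed for illustration, while the seal entries follow the example entries shown earlier.

```python
def pairs_via_pivot(src_to_pivot, tgt_to_pivot):
    """Return (source, target) pairs that share at least one pivot word."""
    return {
        (s, t)
        for s, s_piv in src_to_pivot.items()
        for t, t_piv in tgt_to_pivot.items()
        if s_piv & t_piv
    }


# Toy lexicons (word -> set of translations in the pivot language).
JA_EN = {"あざらし": {"seal"}, "印": {"seal", "stamp"}}
MS_EN = {"anjing laut": {"seal"}, "mohor": {"seal"}}
JA_ZH = {"あざらし": {"海豹"}, "印": {"印章"}}
MS_ZH = {"anjing laut": {"海豹"}, "mohor": {"印章"}}

via_english = pairs_via_pivot(JA_EN, MS_EN)
via_chinese = pairs_via_pivot(JA_ZH, MS_ZH)
both = via_english & via_chinese

print(len(via_english), "pairs through English alone")
print(sorted(both))
```

English "seal" is ambiguous and links every toy pair, whereas the Chinese pivot separates the animal from the stamp. The trade-off is coverage: a pair survives only if both dictionaries have entries on both pivots, which is consistent with the low recall reported.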
Further Work
➣ Use a British/American spelling filter
➢ Japanese-English dictionary uses American spelling
➢ Malay-English dictionary uses British spelling
➢ armor/armour don’t match
➣ Lemmatize more before matching
➢ especially singular/plural
➣ Use an English thesaurus to increase matches (Sanfilippo and Steinberger 1997)
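The first two ideas amount to normalizing the English pivot words before matching. A deliberately crude Python sketch: the spelling table and the plural rule are assumptions (only armour/armor comes from the slide), and a real system would use a proper lemmatizer and a full spelling list.

```python
# Tiny British -> American spelling table; "colour" is an added example.
BRITISH_TO_AMERICAN = {"armour": "armor", "colour": "color"}


def normalise(word):
    """Lower-case, strip a plural -s, then map British to American spelling."""
    word = word.lower().strip()
    # Crude singularisation: drop a final "s", but leave "-ss" words alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return BRITISH_TO_AMERICAN.get(word, word)


print(normalise("Armour") == normalise("armor"))
print(normalise("seals") == normalise("seal"))
```

The plural rule will mangle words such as "lens", so it is only a stand-in for real lemmatization.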
Conclusions - building a bilingual lexicon
➣ Number of shared translations tells you something
➣ Semantic classes are even more useful in linking bilingual dictionaries
➢ word pairs with matching semantic classes are better translations
➣ Using two (or more) pivot languages gives even higher accuracy
➣ More information gives higher precision
➢ Link through a pivot language (≈ 65% precision)
➢ Add in semantic links (≈ 80% precision)
➢ Link through two pivot languages (≈ 97% precision)
The Secret of Lexical Acquisition
➣ For a given word w_u, find the most similar known word w_k and describe it in the same way
➣ Similarity can be
➢ Distributional
➢ Translation Equivalence
➢ Semantic Class
➢ Burstiness (appears on the same date)
➢ Sub-morpheme (same character)
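As an illustration of the idea using distributional similarity, here is a small Python sketch. The context counts, the lexicon entries, and the choice of cosine similarity are all assumptions made for the example; any of the similarity measures listed above could be substituted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar_known(unknown_vec, known):
    """Return the known word whose context vector is closest to unknown_vec."""
    return max(known, key=lambda w: cosine(unknown_vec, known[w]))


# Hypothetical context counts for two known words.
KNOWN = {
    "seal": Counter({"swim": 3, "fish": 2, "ice": 1}),
    "stamp": Counter({"ink": 3, "paper": 2}),
}
# Hypothetical lexicon: known word -> semantic class.
LEXICON = {"seal": "animal", "stamp": "tool"}

# An unknown word described only by its contexts.
walrus = Counter({"swim": 2, "ice": 2, "tusk": 1})
nearest = most_similar_known(walrus, KNOWN)
print(nearest, LEXICON[nearest])
```

The unknown word inherits the description of its nearest neighbour, here the semantic class.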
How to build Resources?
➣ Bootstrap ontologies from machine-readable dictionaries (MRDs)
1. Parse definitions to find the genus
2. Take it as hypernym, or parse further if it is relational:
abbreviation [of x], nickname [for x], kind [of x], polite form [of x], . . .
➣ Bootstrap bilingual dictionaries from other bilingual dictionaries
➢ Link through a pivot language (≈ 65% precision)
➢ Add in semantic links (≈ 80% precision)
➢ Link through two pivot languages (≈ 97% precision)
➣ Find people to build it
➢ Wiktionary, lexicographers, fans, . . .
bootstrap – help oneself, often through improvised means
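The two-step genus extraction for ontologies can be approximated with pattern matching. The Python sketch below is a rough stand-in for real parsing: the relational heads are the four listed on the slide, while the stop-word heuristic and the sample definitions are assumptions made for illustration.

```python
import re

# Relational heads from the slide: the useful term is the x that follows.
RELATIONAL = re.compile(
    r"^(?:an?|the)?\s*(abbreviation of|nickname for|kind of|polite form of)\s+(.+)$"
)
ARTICLE = re.compile(r"^(?:an?|the)\s+")
# Assumed markers for the end of the first noun phrase.
STOP = {"that", "which", "who", "with", "of", "in", "for", "used"}


def genus(definition):
    """Return (relation, term) for a dictionary definition."""
    text = definition.lower().strip().rstrip(".")
    m = RELATIONAL.match(text)
    if m:
        # Relational genus: parse further and keep x with its relation.
        return m.group(1), ARTICLE.sub("", m.group(2))
    # Otherwise take the last word of the first noun phrase as the hypernym.
    head = []
    for word in ARTICLE.sub("", text).split():
        if word in STOP:
            break
        head.append(word)
    return "hypernym", head[-1] if head else ""


print(genus("A large marine animal that lives in the sea."))
print(genus("a kind of dog"))
print(genus("abbreviation of television"))
```

A real system would use a syntactic or semantic parser rather than a stop-word list, which fails on definitions whose head noun is not at the end of the first phrase.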
References
Bond, Francis & Kentaro Ogura. 2007. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation 42(2). 127–136. URL http://dx.doi.org/10.1007/s10579-007-9038-4. (Special issue on Asian language technology).
Copestake, Ann. 1992. The representation of lexical semantic information. Brighton: University of Sussex dissertation.
Fellbaum, Christine (ed.). 1998. WordNet: An electronic lexical database. MIT Press.
Gruber, Thomas R. 1993. A translation approach to portable ontology specifications. Knowledge Acquisition 5(2). 199–220.
Isahara, Hitoshi, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama & Kyoko Kanzaki. 2008. Development of the Japanese WordNet. In Sixth international conference on language resources and evaluation (LREC 2008), Marrakech.
Mohamed Noor, Nurril Hirfana, Suerya Sapuan & Francis Bond. 2011. Creating the open Wordnet Bahasa. In Proceedings of the 25th Pacific Asia conference on language, information and computation (PACLIC 25), 258–267. Singapore.
Nichols, Eric, Francis Bond & Daniel Flickinger. 2005. Robust ontology acquisition from machine-readable dictionaries. In Proceedings of the international joint conference on artificial intelligence IJCAI-2005, 1111–1116. Edinburgh.
Niles, Ian & Adam Pease. 2001. Towards a standard upper ontology. In Chris Welty & Barry Smith (eds.), Proceedings of the 2nd international conference on formal ontology in information systems (FOIS-2001), Maine.
Pease, Adam. 2006. Formal representation of concepts: The suggested upper merged ontology and its use in linguistics. In Andrea C Schalley & D. Zaefferer (eds.), Ontolinguistics. How ontological status shapes the linguistic coding of concepts, Mouton de Gruyter. URL http://www.adampease.org/Articulate/publications/Ontolinguist%ics04.pdf.
Wang, Shan & Francis Bond. 2013. Building the Chinese Open Wordnet (COW): Starting from core synsets. In Proceedings of the 11th workshop on Asian language resources, a workshop at IJCNLP-2013, 10–18. Nagoya.