26
Building a Dictionary from WWW Virach Sornlertlamvanich Information Research and Development Division National Electronics and Computer Technology Center Thailand virach @ nectec .or. th

Building a Dictionary from WWW

  • Upload
    casey

  • View
    50

  • Download
    0

Embed Size (px)

DESCRIPTION

Building a Dictionary from WWW. Virach Sornlertlamvanich Information Research and Development Division National Electronics and Computer Technology Center Thailand [email protected] Languages and Cultures of the East and West (LACEW) July 25-27, 2001, Tsukuba University. Motivations. - PowerPoint PPT Presentation

Citation preview

Page 1: Building a Dictionary from WWW

Building a Dictionary from WWW

Virach SornlertlamvanichInformation Research and Development Division

National Electronics and Computer Technology Center

[email protected]

Languages and Cultures of the East and West (LACEW)July 25-27, 2001, Tsukuba University

Page 2: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

2

Motivations

• WWW is an only one huge, up-to-date, language-and-area thoroughness online resource

• Lexicon & terminology database needed in – Electronic Dictionary– Machine Translation– Text Summarization, etc.

• Lack of open sharable resource– No standard formats– Legal issues

Collaborative Open Lexicon Development

Page 3: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

3

Concepts

• Open Format– XML-based

• Open Protocol– XML-based

request/response– DICT (RFC 2229)

• Open Participation– Data entry– Approver

• Open Source– Software Tools– Dictionary Content

• Corpus-based– To reflect the uses

and meanings of terms in real life

– To assist human thinking process

Page 4: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

4

Format : Standards Survey

• Models– ISO 12620:1999 : Terminology data categories– ISO 12200:1999 : Machine-Readable Terminology

Interchange Format (MARTIF)

• Variations– OTELO : OLIF (Open Lexicon Interchange Format)

A format for MT dictionary interchange– OSCAR (LISA) : TMX (Translation Memory eXchange

format)– SALT : Standards-based Access to multilingual Lexicons

and Terminologies XLT (XML representation of Lexicons and Terminologies)

Page 5: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

5

Development Procedure

Robot TextCorpus

WWW

SampleTexts

Terms

Ontology

Term-Concepts

AnnotatedConcepts

Term CandidateExtraction

ConceptClassification

Syntactic StructureAnalysis

Concept CorrelationDiscovery

• Document collection with robot (w/ language identification)• Term candidate extraction

(C4.5 on MI, Entropy, etc.)

• Syntactic structure extraction (POS tagger)

• Semantic correlation discovery (Ontology)

• Context-based concept classification (text classification)

Page 6: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

6

A Thai Running Text

สวั�สดี�ครั�บ ผมชื่ �อวั�รั�ชื่ ศรัเลิ�ศลิ���วั�ณิ�ชื่ ปั�จจ�บ�นเปั�นผ��อ��นวัยก�รัฝ่#�ยวั�จ�ยแลิะพั�ฒน�ส�ข�ส�รัสนเทศ ศ�นย*เทคโนโลิย�อ�เลิ,กทรัอน�กส*แลิะคอมพั�วัเตอรั*แห่/งชื่�ต� ผมเรั��มสนใจง�นวั�จ�ยในส�ข�ก�รัปัรัะมวัลิผลิภ�ษ�ธรัรัมชื่�ต�ต��งแต/ท��ไดี�ม�โอก�สเข��รั/วัมโครังก�รัวั�จ�ยแลิะพั�ฒน�รัะบบแปัลิภ�ษ�ในปั6 1989

สวั�สดี� ครั�บ ผม ชื่ �อ วั�รั�ชื่ ศรัเลิ�ศลิ���วั�ณิ�ชื่ ปั�จจ�บ�น เปั�น ผ��อ��นวัยก�รั ฝ่#�ย วั�จ�ย แลิะ พั�ฒน� ส�ข� ส�รัสนเทศ ศ�นย* เทคโนโลิย� อ�เลิ,กทรัอน�กส* แลิะ คอมพั�วัเตอรั* แห่/ง ชื่�ต� ผม เรั��ม สนใจ ง�น วั�จ�ย ใน ส�ข� ก�รั ปัรัะมวัลิผลิ ภ�ษ� ธรัรัมชื่�ต� ต��งแต/ ท�� ไดี� ม� โอก�ส เข��รั/วัม โครังก�รั วั�จ�ย แลิะ พั�ฒน� รัะบบ แปัลิภ�ษ� ใน ปั6 1989

Word/Sentence Segmentation

. ..

.

Page 7: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

7

Writing System

• 46 consonants; 18 vowels;4 tones; 9 symbols; 10 digitswritten 4 levels

• No punctuation• No word/sentence marker• No upper/lower case letter• No inflection

consonant

tone

vowel

vowel

Hard to identify (single/compound) word/phrase/sentence

baseline

Page 8: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

8

Sentence Extraction

Winnow(Feature-based ML)

Word segmentation andPOS tagging

Winnow

Input paragraph

Paragraph with sentence break

Word sequence with tagged POS

Trained network

Training POS tagged corpus

Page 9: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

9

Accuracy in Word/Sentence Segmentation

• Word Segmentation– Longest matching(92%)– Maximal matching (93%)– POS tri-gram (96%)– Machine learning (97%)

• Sentence Segmentation– POS tri-gram (85%)– Machine learning (89%)

Supervised approaches

Page 10: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

10

Term Candidate Extraction

• Virach Sornlertlamvanich et. al. (COLING 2000) :– Automatic Corpus-Based Thai Word

Extraction with the C4.5 Learning Algorithm– C4.5-trained decision tree for determining

potential word boundary from MI, Entropy, Linguistic information

– Capable of discovering new words in document without assistance from static dictionary

Page 11: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

11

Mutual Information

High mutual information implies that xy - z co occurs more than expected by chance. If xy z is a word, its Lm and Rm must be high.

…E function… and ...Function...

x y z

z

where x is the leftmost character of string xyzy is the middle substring of xyz z is the rightmost character of string xyzp( ) is the probability function.

x y

Page 12: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

12

Entropy

Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high.

...?function... , ...?unction...

where A is the set of charactersx is the leftmost character of string xyzy is the middle substring of xyz z is the rightmost character of string xyzp( ) is the probability function.

x y

zy

Page 13: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

13

Other Features

• FrequencyWords tend to be used more often than

non-word string sequences.

• LengthShort strings are likely to happen by chance.

The long and short strings should be treated differently.

• Functional WordsFunctional words are used mostly in phrases. They

are useful to disambiguate words and phrases.Result of subjective test :

Word precision 85%Word recall 56%

Page 14: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

14

Evaluation Result of Word Extraction

Extracted words

Existing RID Not existing in RID

Training set(2933)

1643 1028(65.9%)

561(34.1%)

Test set(2720)

1526 1046(68.5%)

480(31.5%)

RID : Royal Institute Dictionary (30,000 words of Thai-Thai dictionary)

Page 15: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

15

Concept Classification

• Word and their contexts in the corpora• Manual word-sense disambiguation• Unsupervised word sense disambiguation (Yarowsky

1995)เกาะ 1(sense : to attach)

… ม�น เกาะ ต�วัเองก�บก��งไม� … (It clings itself on a tree ) … ผ��โดียส�รัไม/จ��เปั�นต�องย น เกาะ ห่/วังอ�กต/อไปัแลิ�วั … (Passengers

don't have to hold peddles anymore.)เกาะ ( 2 : )sense an island …บ��นผมอย�/ท�� เกาะ สม�ย… (I live at the Samui island.)

…ญี่��ปั�#นปัรัะกอบดี�วัย เกาะ ให่ญี่/ 4 เก�ะ… (There are four big islands in Japan.)

Page 16: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

16

Concept Classification

Page 17: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

17

Syntactic Structure Analysis

• Sentence/word segmentation by POS trigram tagger– POS assignment– Word co-occurrence

• Parser– Pattern of usages

Page 18: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

18

Ontologies

• EDR– Approach: Word description as employed in

dictionaries– Problem: Ambiguities and incomputability

• Wordnet– Approach: Synonym set and simple semantic

relations to other words– Problem: Ambiguities

• UW– Approach: Headwords and semantic restrictions– Advantage: Computability and no ambiguity

Page 19: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

19

Ontologies

EDR Wordnet 1.5 UW

Representation of concept “tired” in different schemes

- having or displaying a need for rest- having lost of interest- lack of imagination

- A1 : tired (vs. rested)- 2A : bromidic, commonplace, hackneyed, …- V1 : tire, pall, grow weary, fatigue- 2V : tire, wear upon, fag out- 3V : run down, exhaust, sap, …- bbbbb bbbbb bbb4

- tired- tired(icl>physical)- tired(icl>mental)

Page 20: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

20

Universal Word (UW)

• UW format : <headword> ( <list of restrictions> ) e.g. book (icl > do, obj > room)

• Headword : An English word roughly describes the UW sense.

• Restrictions :– Inclusion (icl ) indicates the class of the sense

e.g. car ( icl > movable thing)

Page 21: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

21

Universal Word (UW)

• Restrictions (continued)– UNL semantic relations

e.g. eat ( agt > volitional thing, obj > food )The agent of this UW is restricted to be volitional thing.The object of this UW is restricted to be food.

UW Class Hierarchy

Page 22: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

22

Architecture

• Centralized activities– For data integrity & consistency

• Distributed sites– For open participation– For backing up

• Job-based– Jobs generated by corpus analysis tools– Participants download jobs to work off-line

and submit back when done

Page 23: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

23

Job-based Working

CentralDB

Job A Generator

Job A Acceptor

Job A ApprovalAcceptor

Participant

Approver

Job B Generator

Job B Acceptor

Job B ApprovalAcceptor

Participant

Approver

Job Billboard

Job A Pool

Job ASubmittal

Job AApproval Pool

Job AApproved Pool

Job B Pool

Job BSubmittal

Job BApproval Pool

Job BApproved Pool

Page 24: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

24

Network Connection

• Committee nodes– Replicate same

database– Closely synchronized– Provide service to

neighbor participants

• Agents (optional)– Propagate

communications between committee and participants

Committee

Committee

Committee

Committee

Participant

Participant

Participant Participant

Participant

Participant

Participant

Participant

Agent Agent

Participant Participant

Page 25: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

25

Contents to Develop

• Thai word list• Corpus-based Thai lexicon• Co-occurrence dictionary• Thai ontology

Page 26: Building a Dictionary from WWW

25-27/7/2001 National Electronics and Computer Technology Center

26

SNLP+O-COCOSDA

The Fifth Symposium on Natural Language Processing + Oriental COCOSDA Workshop 2002

9-11 May 2002Hua Hin, Prachuapkirikhan, Thailand

http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ orhttp://www.links.nectec.or.th/itech/snlp-o-cocosda2002/