4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
From Synergy to Knowledge: Integrating multiple language resources
Part II: Creating Synergy and Multi-functionality of Language Resources
Chu-Ren Huang
Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
Outline
o From Language Resources to Language Technology
o A word’s company
o Classical Paradigm of Language Resource Development
o A new paradigm: Integrating Multiple Language resources
o Introduction: CGW Corpus
o Chinese WordSketch: Integrating multiple resources
o Wen-Guo: Merging different resources to create new synergy
From Language Resources to Language Technology
Language Modeling and Knowledge Generation: How to acquire linguistic model and/or generalization from language resources?
Sharability: can two or more resources be combined to create bigger and better resources
Re-usability: Can a resource be used for a different purpose than what it is designed for?
A word’s company: Corpus KeyWord In Context (KWIC) and the color pen
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...
The coloured pens method, from Kilgarriff et al. 2005
1 arity, which will be used to take a party of under-privileged children to D
2 from outside. You are invited to a party and after a couple of drinks you d
3 tion, we believe politicians of all parties will listen to our views. &equo
4 ould be reaching agreement with all parties concerned, as to which events,
5 lack people. I have certainly been party to one or two discussions amongst
6 . These should be discussed by both parties before entering into the relatio
7 presents They had hosted a cocktail party at Kensington palace, for example
8 akes. By midnight the end-of-course party is in full swing, but most cadet
9 e should be a right for the injured party to terminate the contract. A mana
10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11 s. Ahead I could see the rest of my party plodding towards the final slope t
12 cial ethic. The two main political parties - the Tories and the Liberals -
13 ritish successes in Perth The small party of British players competing in th
14 to help control. One member of the party went to summon the rescue team and
15 rket society fashion magazine. The party was held at his flat which was a l
16 security and secrecy than any Tory Party Conference : it seems that bootleg
A Word’s Company Automatically Detected: WordSketch with BNC Data
Sketch Engine and Chinese WordSketch
Sketch Engine http://www.sketchengine.co.uk
Developed by a team led by Adam Kilgarriff
A new corpus viewing tool
Discovering grammatical information from a gigantic corpus
Chinese Wordsketch by Academia Sinica
http://www.ling.sinica.edu.tw/wordsketch (for Taiwan only)
Academia Sinica, Taiwan (Huang, Smith, Ma, Simon 黃居仁,史尚明,馬偉雲,石穆 )
Classical Paradigm of Language Resource Development
Data Collection and Preparation:
Design Criteria : by human
Data collection : executed or supervised by human
digitization : input and/or proofreading by human
Knowledge Enrichment: tagging and structural annotation
Knowledge source : by human
Representational standard and annotation : by human
The quality and speed of human labor become the bottleneck of language resource development
Current Challenges to Corpus and Language Resource Research
Corpus size is too small:
Disambiguation
Collocation
Grammatical functions and other dependencies
These usually require a corpus of 100 million words or more to yield significant distributional information.
Resources development is slow and tedious
Semantic Role Tagging
POS tagging post-processing
Estimating Corpus Scale for Automatic Extraction of Linguistic Knowledge
How many events do we need to establish a reliable description of a word from a corpus automatically?
Grammatical Information based on Word-word Collocation
V + N: 「開立」 (issue) + 「發票」 (invoice)
A + N: 「不實」 (false) + 「發票」 (invoice)
Collocational information between any given two mid-frequency words (frequency rank 10,000 or above)
That occur within a 10-word window of the keyword (5 before and 5 after)
Requires a corpus size of 1 billion words or above
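The 10-word window above can be sketched as follows. This is a minimal illustration, not the procedure used for the estimate on this slide; the function name `window_cooccurrences` and the toy token list are hypothetical, and the input is assumed to be already segmented.

```python
from collections import Counter


def window_cooccurrences(tokens, keyword, span=5):
    """Count the words found within `span` tokens before or after each hit of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        left = tokens[max(0, i - span):i]      # up to 5 tokens before the keyword
        right = tokens[i + 1:i + 1 + span]     # up to 5 tokens after the keyword
        counts.update(left + right)
    return counts


# Hypothetical pre-segmented sentence, not taken from any corpus.
tokens = ["公司", "開立", "發票", "給", "客戶", ",", "客戶", "收到", "發票"]
counts = window_cooccurrences(tokens, "發票")
for word, n in counts.most_common(3):
    print(word, n)
```

With only a handful of hits per keyword, counts like these are too sparse to rank collocates reliably, which is why the slide asks for corpora on the billion-word scale.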
Classical Chinese Corpora: Million-Word Scale
Corpus Name | Online Year | Size | Data Duration | Content
Sinica 4.0 (Taiwan) | 1996 | 5.2 M words, 7.9 M characters | 1990-1996 | Fully tagged
Sinica 5.0 (Taiwan) | 2006 | 10 M words | 1990-2004 | Fully tagged
Sinorama (Taiwan) | 2003 | 3.2 M English words, 5.3 M Chinese characters | 1976-2000 (1999-2000) | Aligned
CCL (Peking) | 2003 | 85 M simplified characters | 1919-2003 | Partially tagged (1 million)
M = million
A new paradigm: Integrating Multiple Language Resources
From Synergy to Knowledge
Integrate multiple existing (language) resources to create new resource
Allow resources to scale up beyond existing resources,
Generate new knowledge which does not exist in any individual resource
General methodology (without too much additional manual work):
merging existing, similarly annotated resources, or
creating an overall conceptual framework for different knowledge/language resources to be integrated
Automatically
From Synergy to Knowledge
When A and B have synergy, we say in Chinese that A and B bring out the advantages of each other
Knowledge is what we know about the world, either descriptive or explanatory
Knowledge cannot be created from nothing; it comes from
Keen observation of facts
Sharp reasoning when we put two or more facts together
Different language resources can be put together to
Facilitate observation of facts, and
Create an environment where different linguistic facts can be more easily associated (for knowledge discovery)
Synergy: Integrating different types of language resources
Research based on Chinese Gigaword Corpus
Chinese Gigaword Corpus: Introduction
Implementation of fully automatic corpus tagging
Word Sketch Engine: Introduction
Chinese Word Sketch
Integrating corpus program with
lexico-grammatical information
Introduction: CGW Corpus I
Chinese Gigaword Second Edition (2005)
Produced and released by Linguistic Data Consortium (LDC) in 2003 (first edition).
Newswire text data in Chinese.
Second edition contains additional data collected after the publication of the first edition.
Three distinct international sources :
Central News Agency of Taiwan
Xinhua News Agency of Beijing
Zaobao Newspaper of Singapore
Introduction: CGW Corpus II
Source | First Edition | New in Second Edition
CNA | 1991-2002 | Oct. 2002 - Dec. 2004
Xinhua | 1990-2002 | Jan. 2003 - Dec. 2004
Zaobao | none | Oct. 2000 - Sep. 2003
Table 1. Coverage of Chinese GigaWord Corpus
Introduction: CGW Corpus III
Markup Structure
All text data are presented in SGML form, using a very simple, minimal markup structure.
<DOC id="CNA19910101.0003" type="story">
<HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE>
<DATELINE>( 中央社台北一日電 )</DATELINE>
<TEXT>
<P>台北都會區捷運工程正處於積極趕工階段 ,…</P>
<P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P>
</TEXT>
</DOC>
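Because the markup is this minimal, a document can be read with a few regular expressions. The following is a rough sketch under that assumption only; `parse_docs` is a hypothetical helper, not a tool distributed with the corpus, and it handles just the five elements shown in the sample above.

```python
import re

# One <DOC> element: id attribute, type attribute, and everything up to </DOC>.
DOC_RE = re.compile(r'<DOC id="([^"]+)" type="([^"]+)">(.*?)</DOC>', re.S)


def parse_docs(sgml):
    """Yield one dict per <DOC> element: id, type, headline, dateline, paragraphs."""
    for doc_id, doc_type, body in DOC_RE.findall(sgml):
        headline = re.search(r"<HEADLINE>(.*?)</HEADLINE>", body, re.S)
        dateline = re.search(r"<DATELINE>(.*?)</DATELINE>", body, re.S)
        yield {
            "id": doc_id,
            "type": doc_type,
            "headline": headline.group(1).strip() if headline else None,
            "dateline": dateline.group(1).strip() if dateline else None,
            "paragraphs": [p.strip() for p in re.findall(r"<P>(.*?)</P>", body, re.S)],
        }


sample = """<DOC id="CNA19910101.0003" type="story">
<HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE>
<DATELINE>( 中央社台北一日電 )</DATELINE>
<TEXT>
<P>台北都會區捷運工程正處於積極趕工階段 ,…</P>
<P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P>
</TEXT>
</DOC>"""

docs = list(parse_docs(sample))
print(docs[0]["id"], len(docs[0]["paragraphs"]))
```

The first eight characters after the source prefix in the id (here 19910101) look like a date, but that reading is an inference from the sample rather than something stated on the slide.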
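Introduction: CGW Corpus IV
Statistics
Edition | Resource | Characters | Words | Documents
First Edition | CNA | 735 | 462 | 1,649
First Edition | Xinhua | 382 | 252 | 817
First Edition | TOTAL | 1,118 | 714 | 2,466
Second Edition | CNA | 792 | 497 | 1,769
Second Edition | Xinhua | 471 | 310 | 992
Second Edition | Zaobao | 28 | 18 | 41
Second Edition | TOTAL | 1,291 | 825 | 2,803
Table 2. Content of data from each source
Unit: Million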
CGW after fully automatic tagging
Word Type Word Token
CNA 1,917,093 496,465,879
XIN 1,409,747 305,595,420
ZBN 273,111 18,328,571
Total 2,999,590 820,389,870
II. 1. Corpus Preparation: (Almost) Fully Automatic Segmentation and Tagging
Strategy (Ma and Chen 2005): HMM method for POS tagging for words existing in basic lexicon and morpheme-analysis-based method (Tseng and Chen 2002) to predict POS’s for new words.
Integrating Language Resources
Sinica lexicon with 80,000 word entries.
A 50,000-words’ set collected from Sinica Corpus 3.0 (10 million words balanced corpus).
5,000 new words from Xinhua new-words dictionary.
Tagset : Adopting Sinica Tagset as a uniform tagging set.
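As a reminder of what HMM tagging for in-lexicon words involves, here is a generic first-order Viterbi decoder. It is a textbook sketch, not the CKIP implementation: the two-tag model, all probabilities, and the function name `viterbi` are made up for illustration, and the morpheme-analysis step for new words is not covered.

```python
import math


def viterbi(words, tags, start, trans, emit, floor=1e-12):
    """Return the most probable tag sequence for a non-empty `words` list."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[k][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (lp(start[t]) + lp(emit[t].get(words[0], 0.0)), None) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans[p][t])) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit[t].get(w, 0.0)), prev)
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy model with invented probabilities (N = noun, V = verb).
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"工程": 0.5, "進度": 0.4, "落後": 0.1},
    "V": {"工程": 0.05, "進度": 0.05, "落後": 0.9},
}
path = viterbi(["工程", "進度", "落後"], tags, start, trans, emit)
print(path)
```

In a real tagger the transition and emission tables would be estimated from a tagged corpus such as the Sinica Corpus, and the tag inventory would be the full Sinica tagset.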
Preparation: Implementation
Environment: 2 PCs (2.8 GHz CPU)
Time consumed: over 3 days
Output: 462 million words of CNA
252 million words of XIN
Ma and Huang 2006 (LREC 2006)
See http://ckipsvr.iis.sinica.edu.tw/ for demo of the CKIP Segmentation program
Preparation: Tagging
Segmented and Tagged Article
<DOC id="CNA19910101.0003" type="story">
<HEADLINE>捷運局 (Nc) 對 (P31) 工程 (Nac) 噪音 (Nad) 採 (VC2) 多 (Neqa) 項 (Nfa) 防治 (VC2) 措施(Nac)</HEADLINE>
<DATELINE>((PARENTHESISCATEGORY) 中央社 (Nca) 台北 (Nca) 一日 (Nd) 電 (VC2) )(PARENTHESISCATEGORY)</DATELINE>
<TEXT>
<P>台北 (Nca) 都會區 (Ncb) 捷運 (Nad) 工程 (Nac) 正 (Dd) 處於 (VJ3) 積極 (VH11) 趕工 (VA4) 階段 (Nac) , (COMMACATEGORY) …</P>
<P>淡水線 (Na) 工程 (Nac) 進度 (Nad) 百分之三十六點一九 (Neqa), (COMMACATEGORY)落後 (VJ1) 百分之二點六七 (Neqa) , (COMMACATEGORY)…</P>
</TEXT>
</DOC>
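The word (TAG) layout of the tagged article can be turned back into pairs with a single regular expression. This is a small sketch that assumes tags consist only of ASCII letters and digits, as in the sample; `word_tag_pairs` is a hypothetical helper name.

```python
import re

# A token is the shortest run of non-space characters followed by "(TAG)".
PAIR_RE = re.compile(r"(\S+?)\s*\(([A-Za-z0-9]+)\)")


def word_tag_pairs(segment):
    """Return the (word, tag) pairs found in one tagged text segment."""
    return PAIR_RE.findall(segment)


headline = "捷運局 (Nc) 對 (P31) 工程 (Nac) 噪音 (Nad) 採 (VC2) 多 (Neqa) 項 (Nfa) 防治 (VC2) 措施(Nac)"
pairs = word_tag_pairs(headline)
for word, tag in pairs[:3]:
    print(word, tag)
```

The lazy match also copes with punctuation tokens such as the opening parenthesis tagged PARENTHESISCATEGORY, because the token itself is taken before the tag's own bracket.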
Summary of Fully Tagged CGW Corpus
Fully segmented and tagged with Sinica tagset by Academia Sinica
Being processed by PKU with their tagset
Potentially the most important source for processing and comparative studies of Mandarin Chinese
Will be available from LDC in 2007.
CWS and Integration of Corpus Search Engine with Lexico-grammatical Information
Overview
A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behavior.
The Word Sketch Engine takes as input a corpus of any language and corresponding grammar patterns, and generates word sketches for the words of that language.
We synergize rich lexicon-based grammatical information (ICG, Chen and Huang 1992) with stochastic information.
Word Sketch Engine (Kilgarriff et al.)
Register for trial usage at http://www.sketchengine.co.uk
A Versatile Corpus Viewing and Searching Tool
The Word Sketch Engine takes as input a corpus of any language and corresponding grammar patterns, and generates word sketches for the words of that language.
Based on pre-defined context-free rules to identify grammatical functions (relations)
Ranked by Saliency: frequency adjusted MI (based on Dekang Lin’s definition of Pair-wise MI)
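A frequency-adjusted MI score of this kind can be sketched as below. The MI follows Lin's pairwise definition for a triple (w, r, w'), log of count(w,r,w') x count(*,r,*) over count(w,r,*) x count(*,r,w'). The frequency adjustment used here, multiplying by log(1 + frequency), is an assumption made for this sketch; the slide does not give the exact formula, and `salience_scores` and the toy triples are hypothetical.

```python
import math
from collections import Counter


def salience_scores(triples):
    """Score each (head, relation, collocate) triple by Lin-style MI times log(1 + frequency)."""
    joint = Counter(triples)
    head_rel = Counter((h, r) for h, r, _ in triples)
    rel_coll = Counter((r, c) for _, r, c in triples)
    rel = Counter(r for _, r, _ in triples)
    scores = {}
    for (h, r, c), n in joint.items():
        mi = math.log2(n * rel[r] / (head_rel[(h, r)] * rel_coll[(r, c)]))
        scores[(h, r, c)] = mi * math.log(1 + n)  # assumed form of the frequency adjustment
    return scores


# Invented toy triples, not counts from BNC or CGW.
triples = (
    [("吃", "obj", "飯")] * 3
    + [("吃", "obj", "虧")]
    + [("煮", "obj", "飯")] * 2
    + [("看", "obj", "書")] * 2
)
scores = salience_scores(triples)
for triple, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(triple, round(score, 3))
```

Plain MI favours rare, exclusive pairings; weighting by frequency pulls well-attested collocates back up the list, which is the motivation for adjusting MI at all.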
Design Criteria of Sketch Engine
Grammatical relations (GRs) are the information of most interest to both HLT and linguistic research
However, GRs can only be discovered from collocational data, which requires a very large corpus and high-quality annotation at the same time: a seemingly unsolvable dilemma
There is a solution when the corpus is big enough
Context-free patterns allow fairly reliable extraction of a substantial number of relations, if not all
When enough instances of relations are extracted, the saliency ranking correctly picks out the distributional tendencies and allows users to ignore idiosyncrasies/errors.
WordSketch’s Approach: From Lexical Types to Relation Types
BNC has 100,000,000 words, 939,028 word types
70,000,000 tuples (relations) Extracted
More than 70 relations per lemma
For CWS II and the CGW corpus (CNA data): 1,917,093 word types
59,183,238 tuples (<eat, obj, rice>)
More than 30 relations per lemma
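The per-lemma figures can be checked with simple division. The sketch below divides the tuple counts by the word-type counts quoted on this slide, on the assumption that word types stand in for lemmas; the dictionary layout is only for illustration.

```python
# Counts copied from the slide; the ratio approximates "relations per lemma".
corpora = {
    "BNC": {"word_types": 939_028, "tuples": 70_000_000},
    "CGW (CNA data)": {"word_types": 1_917_093, "tuples": 59_183_238},
}

ratios = {name: c["tuples"] / c["word_types"] for name, c in corpora.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tuples per word type")
```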
Chinese WordSketch: An Overview
Concordance
WordSketch
Sketch Difference
Thesaurus
CWS: SketchDiff
Comparing the behaviors of two words
CWS: Thesaurus of 快樂
Application to Chinese Corpus: Comparing Thesaurus
We shall know a word by the company it keeps
Context-free patterns: Does Quality of Grammatical Knowledge Matter?
The implementation of CWS I simply adopted English-like CF grammatical patterns (since Chinese and English supposedly share very similar PS rules)
However, the result was not very satisfactory
Missing a lot of relations, such as objects which do not appear right next to a verb
Mis-classifying topicalized objects as subjects
Missing objects in non-canonical positions
Linguistic Knowledge Should Solve the above Problems
Comprehensive Lexical Knowledge of Verb Frames exists
Information-based Case Grammar (ICG)
Encoded on over 40,000 verbs in Sinica Lexicon
ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[ 對 ]]<VI
EXPERIENCER<VI<<GOAL[PP[ 於 ]]
THEME<GOAL[PP{ 對、以 }]<VI
THEME<VI<<GOAL[PP[ 於 ]]
THEME<VI<<SOURCE[PP{ 自、於 }]
THEME< SOURCE[PP{ 歸、為 }]<VI
Comparing Lexical Knowledge Between CWS I and CWS II
CWS I: 11 definitions, 11 patterns
One single pattern for verb-object relation
CWS II: 32 definitions, 80 patterns
20 patterns for verb-object relation
59,183,238 tuples (<eat, obj, rice>)
from 496,465,879 words
English has 39 definitions, 40 patterns
Synergy among tagging, statistics, and linguistic knowledge
Collocations are identified with Context free rules in Word Sketch Engine
Collocating Pattern for Object from CSE I
1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2: "Na" [tag!= "Na"]
Challenge: Long-distance relations
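全穀麵包,吃了很健康。
quan.gu mian.bao, chi le hen jian.kang
"Whole-grain bread is healthy to eat."
有人嘗試要將這荷花分類,卻越分越累。
you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei
"Someone tried to classify these lotuses, but grew more tired the more they sorted."
他 只 吃了 一 口 飯 …
Ta zhi chi le yi kou fan
"He ate only one mouthful of rice ..."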
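To see why an adjacency-based pattern misses such objects, the CSE I object pattern can be emulated over a list of (word, tag) pairs. This is a rough emulation, not the Sketch Engine's own matcher: it assumes each quoted string is a regular expression that must match the whole tag, it uses shortened tags (VC, Di, Neu, Nf, Na) rather than the full Sinica tags, and the names `PATTERN`, `match_at` and `find_objects` are hypothetical.

```python
import re

# (tag regex, optional?, label) for each position of the CSE I object pattern.
PATTERN = [
    (r"V[BCJ]", False, "verb"),
    (r"Di", True, None),
    (r"N[abc]", True, None),
    (r"DE", True, None),
    (r"N[abc]", True, None),
    (r"Na", False, "object"),
    (r"(?!Na$).*", False, None),  # [tag != "Na"]: the next token must not be tagged Na
]


def match_at(tokens, i, p=0, found=None):
    """Try to match PATTERN[p:] starting at token i, backtracking over optional positions."""
    found = found or {}
    if p == len(PATTERN):
        return found
    regex, optional, label = PATTERN[p]
    if i < len(tokens) and re.fullmatch(regex, tokens[i][1]):
        extended = dict(found)
        if label:
            extended[label] = tokens[i][0]
        result = match_at(tokens, i + 1, p + 1, extended)
        if result is not None:
            return result
    # Either the token did not fit or the rest failed: skip this position if allowed.
    return match_at(tokens, i, p + 1, found) if optional else None


def find_objects(tokens):
    """Return (verb, object) pairs for every position where the pattern matches."""
    hits = []
    for i in range(len(tokens)):
        m = match_at(tokens, i)
        if m:
            hits.append((m["verb"], m["object"]))
    return hits


adjacent = [("我", "Nh"), ("吃", "VC"), ("了", "Di"), ("飯", "Na"), ("。", "PERIODCATEGORY")]
with_classifier = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
                   ("一", "Neu"), ("口", "Nf"), ("飯", "Na"), ("。", "PERIODCATEGORY")]

print(find_objects(adjacent))
print(find_objects(with_classifier))
```

In the second sentence the numeral and classifier sit between the verb and its object, so none of the pattern's optional slots can absorb them and the object goes unreported.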
Integrating prior Knowledge in Processing
Knowledge Source
Information-based Case Grammar (ICG, Chen and Huang 1992)
Encoded on over 40,000 verbs in Sinica Lexicon
ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[ 對 ]]<VI
EXPERIENCER<VI<<GOAL[PP[ 於 ]]
THEME<GOAL[PP{ 對、以 }]<VI
THEME<VI<<GOAL[PP[ 於 ]]
THEME<VI<<SOURCE[PP{ 自、於 }]
THEME< SOURCE[PP{ 歸、為 }]<VI
Integrating prior Knowledge in Processing: Examples
村莊 (object) 明天將 被 夷為平地 (VB11)
cunzhuang mingtian jiang bei yiweipingdi
"The village will be razed to the ground tomorrow."
begin time1 location time1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]
大量 的 遊客 破壞 (VC2) 公園 景觀 (object)
daliang de youke pohuai gongyuan jingguan
"Large numbers of tourists damage the park scenery."
1:"VC.*" (particle|prep)? NP not_noun
(NP is defined as “…noun_modifier{0,2} 2:noun…”.)
Integrating prior Knowledge in Processing: Partial Result
Object Recall Comparison
CSE I CSE II
hong2 (red) 0 0
pao3 (run) 0 8,704
kan4 (look) 32,350 64,096
da3 (hit) 26,016 47,182
song4 (give) 0 76,378
shuo1 (say) 0 20,350
xiang1xin4 (believe) 0 52,373
quan4 (persuade) 0 3,852
Integrating prior Knowledge in Processing: Partial Result II
Most salient objects for chi1 「吃」 (eat) in CSE II
Those among the top 20 salient objects from CSE I, but not CSE II
飯 fan4 rice 802 70.96 (4)
虧 kui disadvantage 329 59.24 (12)
苦頭 ku3tou2 suffering 194 58.71 (14)
Applications: Chinese WordSketch
Test version of Chinese Word Sketch is available
Permanent version of CWS will be available from Academia Sinica soon
http://wordsketch.ling.sinica.edu.tw
Application: Resolving Nominalization
[Formula: classifier score for the nominalized reading [nom] of verb v_i with tag t_i, built from log-probability terms of the form P(x | v_i t_i[nom]), P(x | t_i[nom]) and P(x | v_i[nom]), where x ranges over the neighbouring words and tags w_{i-1}, t_{i-1}, w_{i+1}, t_{i+1}]
Chinese verbs are nominalized without overt markup
Resolving Categorical ambiguity with distributional information only
Two Approaches: HMM and Bayesian Classifier
HMM: N-grams
Classifier: left and right contexts, plus own verb sub-class, weighted 0.5, 0.3, 0.2
Nominalization Results (Ma and Huang 2006)
[Figure: F-score (%), on a 0 to 100 axis, by topic (文學 literature, 生活 life, 社會 society, 科學 science, 哲學 philosophy, 藝術 arts, 綜合 general) for HMM-1, HMM-2, Classifier-1, Classifier-2 and Classifier-3]
Best overall HMM performance: 69%
Best Overall Bayesian classifier performance: 74%
Mining Cross-Strait Lexical Difference
Strategy: Using a pair of known contrasting words as seeds and looking them up in Sketch Difference
Clinton: 克林頓 (ke4) vs. 柯林頓 (ke1)
What is found: Other unique translations for either PRC or Taiwan
克林頓 (PRC) only and/or patterns (vs 柯林頓 only)
葉利欽 88 54.6 Yeltsin 葉爾勤 (3)
布什 65 49.7 Bush 布希 (4)
萊溫斯基 10 41.3 Lewinsky 呂茵斯基 / 呂女 (1)
戈爾 20 39.4 Gore 高爾 (2)
Adventures in Wen-Land: 文國尋寶記
http://www.sinica.edu.tw/Wen/
Integrating the following
Corpora: Sinica Corpus, Textbook Corpus (3 different editions), Tang poems, Dream of the Red Chamber, On the Water Margin…
Lexicon: General, Classifier, Idiom ( 成語 )
Linked with a corpus/lexicon interface
Developed by: Huang, Fengju Lo, Hui-chun Hsiao, and team of teachers
The Substantive Issues
Language Resources Used in WenGuo
Textual Databases (of classical texts)
Text Corpora
Linguistic and Philological Knowledge from previous research
LKB Extracted and composed from the above
Adventures in Wen-Land (2001)
Adventures in Wen-Land
What: Is a virtual theme park for on-line Chinese language learning and teaching.
How: Is the end product of a National Digital Museum Project sponsored by the National Science Council, ROC (A Linguistic and Literary KnowledgeNet for Elementary School Children)
When: Was completed in spring, 2001
Adventures in Wen-Land
Who: The team included
Chu-Ren Huang, a linguist
Feng-ju Lo, a literary scholar
Hui-chun Hsiao, a web-based art designer
Ching-Chun Hsieh, a computer scientist
Chi-chao Liao, Chiu-Jung Lu, Pei-chuan Wei...
Mei-ling Li, Hsiou-Hua Chiu, Shu-wen Huang, Cheng-chi Jiang, elementary school Chinese teachers
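An Adventure in Seven Parts
The Geography of Wen-Land
Scene (名稱) | Content (學習目標)
梁山 mountain | On the Water Margin: 天罡地煞梁山泊論英雄好漢 (the heroes of Liangshan Marsh)
大觀園 garden | The Dream of the Red Chamber: 大觀園一探紅樓兒女情懷 (the sentiments of the Red Chamber's young men and women)
西園 music hall | Song Poetry: 進入時光隧道,回味唐宋流行歌 (enter the time tunnel and savor the popular songs of Tang and Song)
倒影湖 lake | Games: 語文的無窮趣味,遊戲的新鮮挑戰 (the endless fun of language, the fresh challenge of games)
接龍瀑布 falls | Chinese Idiom Dict.: 出口成章,妙語串成珠璣 (speak eloquently and string fine phrases into pearls)
黑白宮 castle | Noun-class. Dict.: 名詞語量詞配出中文的特色 (nouns and classifiers pair up to show the character of Chinese)
學堂 colleges | Three versions of textbooks: 由教科書有限的字數裡找出豐富的知識與無窮的趣味 (find rich knowledge and endless fun within the limited vocabulary of textbooks)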
The Adventure’s Seven Guides
The Denizens of Wen-Land
Scene (名稱) | Featured Character (導覽人物)
梁山 mountain | 神行太保 The Chinese Mercury (one of the 108 heroes)
大觀園 garden | 鴛鴦 A Maid who knows the ins and outs
西園 music hall | 宋代少婦 Young Song Dynasty Woman
倒影湖 lake | 平平與明明 A Twin
接龍瀑布 falls | 哪吒 The mythical flying child acrobat
黑白宮 castle | 林三本 A medieval estate owner
學堂 colleges | 李小哲 a learned young scholar (a miniature version of Y.T Lee)
Designing Adventures: Threads that hold the KnowledgeNet Together
A Thread without a guiding needle goes nowhere
穿針引線 A Lexical Needle Picks Up & Connects
-Only the Textual Materials that it is allowed to go through
Pulling Through and Pulling Together: Lexical thread and hyperlink
Lexical KnowledgeBase (LKB) guides us through all language resources that use the same word
-In WenGuo, we assume users will be using textbook vocabulary to guide them
Pulling Through and Pulling Together: Lexical thread and Textual Filter
LKB provides the chronological (such as when a word is first taught/learned) and distributional (such as frequency) features of each word.
-In WenGuo, by knowing a user’s level at school, we can gauge/pace learning
Integrated Resources as Learning Background in Wenguo
Using LKB to Pace Linguistic Knowledge
A learner identifies his/her school year (3rd grade etc.) when logging in
-controls vocabulary level of learning activity
-paces/monitors development of linguistic skills
A user can also specify which textbook version to view
-allows cross-track comparison of linguistic development
-allows supplementation at corresponding learning level
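One way to picture this pacing is a lexical knowledge base that records, for each word, the school year in which it is first taught and its frequency. The sketch below is purely illustrative: the `LkbEntry` fields, the helper names, and the three entries are invented here and do not describe the WenGuo data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LkbEntry:
    word: str
    first_taught_grade: int  # school year in which the word is first taught
    frequency: int           # frequency in the underlying corpus


def words_for_level(lkb, grade):
    """Words a learner in `grade` is expected to know, most frequent first."""
    known = [e for e in lkb if e.first_taught_grade <= grade]
    return [e.word for e in sorted(known, key=lambda e: -e.frequency)]


def unfamiliar_words(text_tokens, lkb, grade):
    """Tokens of a text that lie beyond the learner's current vocabulary level."""
    known = set(words_for_level(lkb, grade))
    return [t for t in text_tokens if t not in known]


# Invented entries for illustration only.
lkb = [
    LkbEntry("雲", 1, 500),
    LkbEntry("海", 2, 400),
    LkbEntry("雲海", 4, 30),
]
print(words_for_level(lkb, 3))
print(unfamiliar_words(["雲海", "雲"], lkb, 3))
```

Keeping one such table per textbook version would allow the cross-version comparison mentioned above, since the same word may be introduced in different school years.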
Synergizing Archive-based LKB
LKBs based on classical or prototypical texts facilitate quick and accurate lexical comparison and allow immediate reference to the original text
-In WenGuo, users can easily find the literary references and citations in several classics and go immediately from vocabulary to text
Integrated Lexical KnowledgeBase entry of 雲海 yun2hai3
Citation of Classifier 個 ge5 in Three Textbooks
Collocating Nouns of Classifier 張 From Huang et al. 1997 國語日報量辭典
Concluding Remarks
-Corpus is a sample of the ‘real words in action’
-Corpora and other language resources can be combined to create powerful language teaching and learning tools
-The integration must be linked by lexical terms
-Corpora must be tagged with POS
-In practice, different editions of textbooks can be treated as different corpora
-And be linked for comparison or borrowing
-Corpus facilitates creation of synergy for learning and teaching
Other Useful Resources
Sinica Corpus (Academia Sinica Balanced Corpus of Modern Chinese, 中央研究院現代漢語平衡語料庫), first tagged corpus of Chinese, online since 1996
http://www.sinica.edu.tw/SinicaCorpus
SouWenJieZi - A Linguistic KnowledgeNet. August 1999.
http://words.sinica.edu.tw/
SINICA BOW 2002
http://bow.sinica.edu.tw
Chinese Wordnet 2005, >16,000 synsets
http://cwn.ling.sinica.edu.tw/
Conclusion
The synergy of different language resources creates Knowledge
生生不息 (growth and renewal without end)
Concluding Remarks
Other NLP Research Activities at Academia Sinica
Chinese Wordnet: ongoing, >10,000 synsets
http://cwn.ling.sinica.edu.tw
Bilingual Wordnet linked to SUMO ontology
http://bow.sinica.edu.tw
Fully Sense-tagged corpus: combining cwn and Sinica corpus with machine learning algorithm
Directed by Sue-Jin Ker of Soochow Univ.
Subset to be available soon
Asian lexicon standard: NEDO project
Tokunaga, Calzolari, Shirai, Virach, Prevot…
Reference
LDC CGW Corpus: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14
Chinese Word Sketch trial URL: http://corpora.fi.muni.cz/chinese_all/ (username: chinese, password: chinese)
Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. To be presented at the 5th International Conference on Language Resources and Evaluation (LREC2006). Genoa, Italy. 24-28 May, 2006.
CKIP (Chinese Knowledge Information Processing Group). (1995/1998). The Content and Illustration of Academia Sinica Corpus. (Technical Report no 95-02/98-04). Taipei: Academia Sinica
Huang Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, Zhao-Ming Gao and Kuang-Yu Chen. (2000). Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop pp. 29-37.
Kilgarriff, Adam, Chu-Ren Huang, Pavel Rychly, Simon Smith, and David Tugwell. (2005). Chinese Word Sketches. ASIALEX 2005: Words in Asian Cultural Context.
Reference (continued)
Ma, Wei-Yun and Keh-Jiann Chen, (2005). Design of CKIP Chinese Word Segmentation System, Chinese and Oriental Languages Information Processing Society, Vol 14. No. 3. pp. 235-249.
Tseng, H.H. & K.J. Chen, (2002). Design of Chinese Morphological Analyzer, Proceedings of SIGHAN Workshop on Chinese Language Processing, pp. 49-55.
Tsai Yu-Fang and Keh-Jiann Chen, 2003, "Reliable and Cost-Effective Pos-Tagging", Proceedings of ROCLING XV, pp161-174.
Tsai Yu-Fang and Keh-Jiann Chen, 2003, "Context-rule Model for POS Tagging", Proceedings of PACLIC 17, pp146-151.
Tsai Yu-Fang and Keh-Jiann Chen, 2004, "Reliable and Cost-Effective Pos-Tagging", International Journal of Computational Linguistics & Chinese Language Processing, Vol. 9 #1, pp83-96.