Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
National Cheng Kung University
2014/02/17
Multilingual and Crosslingual Information System
Contact Information
• Room: 4261, Monday 09:10 AM - 12:00 PM
• Instructor: Prof. Wen-Hsiang Lu (盧文祥)
– Office: 4216
– Office hours: Monday 12:10 - 2:10 PM
– Phone: 62545
– Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm
– Email: [email protected]
– Teaching assistant: 王廷軒
• Email: [email protected]
Course Grading
• Class participation/presentation: 30%
• Tests: 25%
• Project: 25%
• Homework: 20%
Source Textbooks
• Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999. (Chuan Hwa Book Co., 全華科技圖書: 02-23717725)
• Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000.
• James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co, 1995.
• Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998.
• Jean Véronis, Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer, 2000.
Other Useful Sources (1)
• Reference Books
– Charniak, E. Statistical Language Learning.
– Cover, T. M., Thomas, J. A. Elements of Information Theory.
– Jelinek, F. Statistical Methods for Speech Recognition.
• Major Conferences
– ACL (Association for Computational Linguistics)
– COLING (International Conference on Computational Linguistics)
– HLT (Human Language Technology Conference)
– IJCNLP (International Joint Conference on Natural Language Processing)
• Journals
– Computational Linguistics
– Natural Language Engineering
– TALIP (ACM Transactions on Asian Language Information Processing)
– TSLP (ACM Transactions on Speech and Language Processing)
Other Useful Sources (2)
• Resource URL
– http://www.aclclp.org.tw/res_other_c.php (Association for Computational Linguistics and Chinese Language Processing, ACLCLP)
– http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group)
– http://www.phontron.com/nlptools.php (Graham Neubig)
• Tools/Software
– Online Dictionary
• WordNet: http://wordnet.princeton.edu/
• HowNet: http://www.keenage.com/html/c_index.html
• The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/
CKIP (Chinese Knowledge and Information Processing Group, Academia Sinica; 中研院詞庫小組)
• Parser: http://140.109.19.112/main.exe?id=6833
• POS (part of speech) tagger: http://ckipsvr.iis.sinica.edu.tw/
Eric Brill's POS Tagger
• Website: http://cst.dk/online/pos_tagger/uk/
• Example output: This/DT is/VBZ a/DT book/NN ./.
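The tagged sentence above uses a word/TAG format, with tags such as DT, VBZ, and NN from the Penn Treebank tag set. A minimal sketch of how such output could be read back into (word, tag) pairs for the homework follows; the function name `parse_tagged` is hypothetical and not part of any tagger's API.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps it, and './.' becomes ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("This/DT is/VBZ a/DT book/NN ./.")
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Splitting on the last slash rather than the first is a design choice: the tag never contains a slash in this format, while the word occasionally does.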
Stanford Parser
• Website
– http://nlp.stanford.edu/software/lex-parser.shtml
• Tools
– Online version
• Stanford Parser version 1.5.1
• English & Chinese
• http://josie.stanford.edu:8080/parser/
[Homework 1]
• Use the CKIP POS (part-of-speech) tagger, Eric Brill's POS tagger, and the Stanford Parser to tag and parse at least three sentences.
Course Topics
• Probability and Information Theory
– basics: definitions, formulas, examples
• Language Modeling
– n-gram models, parameter estimation
– smoothing (EM algorithm)
• Some Linguistics
– phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
– word classes, mutual information, lexicography
Course Topics (cont.)
• Hidden Markov Models
– background, algorithms, parameter estimation
• Tagging: methods, algorithms, evaluation
– tag sets, HMM tagging, transformation-based, feature-based
• Grammars and Parsing: data, algorithms
– statistical parsing: algorithms, parameterization, evaluation
Course Topics (cont.)
• Applications
– Machine Translation (MT)
– Automatic Speech Recognition (ASR)
– Information Retrieval (IR)
– Cross-Language Information Retrieval (CLIR)
– Question Answering (QA)
– Cross-Language Question Answering (CLQA)
– Summarization
– Information Extraction
– …
Course Introduction
• Lecture 1: Introduction
• Lecture 2: Mathematical Foundations
• Lecture 3: Linguistics Essentials
• Lecture 4: Corpus-based Work
• Lecture 5: Collocations
• Lecture 6: Statistical Inference: n-gram Models over Sparse Data
• Lecture 7: Word Sense Disambiguation
• Lecture 8: Statistical Alignment and Machine Translation
• Lecture 9: Markov Models
• Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
• Lecture 11: Statistical/Probabilistic Models for Word Alignment & CLIR
• Lecture 12: Part-of-Speech Tagging
• Lecture 13: Probabilistic Context-Free Grammars
• Lecture 14: Question Answering
The Ultimate Research Goal in Natural Language Processing (NLP)
• To develop an automated language understanding system
• Why is this important?
– Language is easy for everyone to use
– Natural human interface for a variety of applications (e.g., database access, on-line tutor, robot control, etc.)
– Language seems fundamental for developing an intelligent system
• iPhone Siri
• IBM's DeepQA project
Aspects of Computational Linguistics
• Description of the Language: universals, cross-linguistic research
• Implementation of Computer Model: algorithms and data structures, formal models to represent knowledge, model of the reasoning process
• Psycho-Linguistic Aspect: humans are an existence proof of the computability of language comprehension; psychological research can be used to justify a computer model; obtain human processing parameters
NLP Issues
• Why is NLP difficult?
– Many "words", many "phenomena", many "rules"
• OED (Oxford English Dictionary): 400k words; Finnish lexicon (of forms): ~2 × 10^7
• sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
– Irregularity (exceptions, exceptions to the exceptions, ...)
• potato → potatoes (tomato, hero, ...); photo → photos; and even both: mango → mangos or mangoes
• Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower
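The plural examples above (a rule, its exceptions, and a word that allows both forms) can be sketched as a default rule plus an exception lexicon. This is only an illustration: the dictionary `IRREGULAR_PLURALS` and the function `pluralize` are hypothetical, and the lexicon contains just the words named on the slide.

```python
# Hypothetical exception lexicon holding only the slide's examples.
IRREGULAR_PLURALS = {
    "photo": ["photos"],             # exception to the "-o takes -es" rule
    "mango": ["mangos", "mangoes"],  # both forms are acceptable
}


def pluralize(noun):
    """Return the list of acceptable plural forms under this toy model."""
    if noun in IRREGULAR_PLURALS:    # listed exceptions win over rules
        return IRREGULAR_PLURALS[noun]
    if noun.endswith("o"):           # rule: potato -> potatoes
        return [noun + "es"]
    return [noun + "s"]              # default rule: book -> books


for noun in ["potato", "tomato", "hero", "photo", "mango", "book"]:
    print(noun, "->", " or ".join(pluralize(noun)))
```

Any noun in -o that is missing from the lexicon (for example "radio") would be pluralized incorrectly by the rule, which is exactly the "exceptions to the exceptions" problem the slide points at.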
Difficulties in NLP (cont.)
– Ambiguity
• books: NOUN or VERB?
– you need many books vs. she books her flights online
• Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus)
– Thank you for not eating without earphones??
– Thank you for drinking?? …
• Fred’s hat was blown off by the wind. He tried to catch it.
– ...catch the wind or ...catch the hat ?
Rules or Statistics?
• Preferences:
– context clues: she books → books is a verb
• rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun → the word is a verb
– pronoun reference:
• she/he/it often refers to the most recent noun or pronoun (but there are certainly exceptions)
– selectional restrictions:
• catching a hat is better than catching the wind (but not always)
– semantics:
• We thank people for doing helpful things or not doing annoying things
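The context-clue rule above (an ambiguous verb/nonverb word preceded by a matching personal pronoun is a verb) can be sketched as follows. The word lists are assumptions made for illustration: "matching" is taken to mean a third-person singular pronoun before an -s form, and anything the rule does not fire on falls back to NOUN.

```python
# Assumed pronouns that "match" an -s verb form (third person singular).
PRONOUNS_3SG = {"she", "he", "it"}
# Hypothetical list of forms that are ambiguous between noun and verb.
AMBIGUOUS = {"books", "flies", "runs"}


def tag_ambiguous(tokens):
    """Tag each ambiguous token as VERB or NOUN; other tokens get None."""
    tags = []
    for i, token in enumerate(tokens):
        if token.lower() not in AMBIGUOUS:
            tags.append(None)
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        # Rule from the slide: a matching pronoun right before -> verb.
        tags.append("VERB" if previous in PRONOUNS_3SG else "NOUN")
    return tags


print(tag_ambiguous("she books her flights online".split()))
print(tag_ambiguous("you need many books".split()))
```

A hand-written rule like this is brittle (an adverb between pronoun and verb already defeats it), which is the motivation for backing such preferences with statistics.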
Solutions
• Don’t guess if you know:
– morphology (inflections)
– lexicons (word information)
– unambiguous names
– perhaps some (really) fixed phrases
– syntactic rules?
• Use statistics (based on real-world data) for preferences (only?)
• No doubt: but this is an important question!
Types of Linguistic Knowledge
• Acoustic/Phonetic Knowledge: How words are related to their sounds. (transliteration)
– Ericsson <=> 易利信 (Yì-lì-xìn)
• Morphological Knowledge: How words are constructed out of basic meaning units.
– un + friend + ly → unfriendly
– love + past tense → loved
– object + oriented → object-oriented
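The three morphological examples above can be reproduced with a small concatenation sketch. It is a toy under stated assumptions: "past tense" is represented by the suffix "ed", the only spelling rule is dropping a doubled "e" at the boundary, and the names `attach` and `build` are hypothetical.

```python
def attach(stem, suffix):
    """Join stem and suffix, merging a doubled 'e' (love + ed -> loved)."""
    if stem.endswith("e") and suffix.startswith("e"):
        return stem + suffix[1:]
    return stem + suffix


def build(morphemes, joiner=""):
    """Combine a list of morphemes left to right into one word."""
    word = morphemes[0]
    for morpheme in morphemes[1:]:
        if joiner:
            word = word + joiner + morpheme   # compound, e.g. with a hyphen
        else:
            word = attach(word, morpheme)     # affixation with spelling rule
    return word


print(build(["un", "friend", "ly"]))            # unfriendly
print(build(["love", "ed"]))                    # loved
print(build(["object", "oriented"], joiner="-"))  # object-oriented
```

Irregular forms (go + past tense → went) and other spelling changes such as consonant doubling are outside this sketch and would need a lexicon.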
More Types of Linguistic Knowledge
• Lexical Knowledge (or Dictionary): This should include information on parts of speech, features (e.g., number, case), typical usage, and word meaning.
• Syntactic Knowledge: How words are put together to make legal sentences (or constituents of sentences).
More Types of Linguistic Knowledge
• Semantic Knowledge: Word meanings, and how words combine into sentence meaning.
– e.g., Fred tossed the ball. (semantic roles)
More Types of Linguistic Knowledge
• Pragmatic Knowledge: How context affects the interpretation of a sentence. Examples:
– Louise loves him.
• [Context 1:] Who loves Fred?
• [Context 2:] Louise has a cat.
– What time is it?
• [Context 1:] Fred is fidgeting (restless) and staring at his watch.
• [Context 2:] Louise has no watch.
More Types of Linguistic Knowledge
• World Knowledge: How other people's minds work, what a listener knows or believes, the etiquette (conventions) of language. Examples:
– Will you pass the salt?
– I read an article about the war in the paper.
– Fred saw the bird with his binoculars.
– Tim was invited to Tom's birthday party. He went to the store to buy him a present.
Multilingualism Issues in Web Age
• Language barrier
– There are about 6,700 languages listed in the Ethnologue (http://www.ethnologue.com/)
• Information overload
– Scaling up of language resources
• Webpages
• News
• Weblogs
• Microblogs
Real World Situation
• Use a statistical model based on REAL WORLD DATA and care about the best sentence only
• Imagine:
– Each sentence W = {w1, w2, ..., wn} gets a probability P(W|X) in a context X
– For every possible context X, sort all the imaginable sentences W according to P(W|X)
– Ideal situation: [Figure: P(W) plotted over the sorted sentences from Wbest to Wworst; the best sentence is the most probable in context X]
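The idea of sorting candidate sentences by probability and keeping only the best one can be sketched with a toy bigram model. Several things here are assumptions rather than the course's method: the context X is dropped so the score is just P(W), the three-sentence corpus is invented, and add-one smoothing stands in for whatever smoothing the course uses.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count histories and bigrams, with <s> and </s> as sentence markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def log_prob(sentence, histories, bigrams, vocab_size):
    """log P(W) under a bigram model with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
    return total


# Invented toy corpus, for illustration only.
corpus = ["she books her flights online", "you need many books", "she reads many books"]
histories, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

candidates = ["online flights her books she", "she books her flights online"]
ranked = sorted(candidates,
                key=lambda s: log_prob(s, histories, bigrams, vocab_size),
                reverse=True)
print("W_best: ", ranked[0])
print("W_worst:", ranked[-1])
```

Log probabilities are summed instead of multiplying raw probabilities so that long sentences do not underflow; the ordering of the candidates is the same either way.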