
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
National Cheng Kung University
2014/02/17

Multilingual and Crosslingual Information System

2

Contact Information

• Room: 4261, Monday 09:10 - 12:00
• Instructor: Prof. Wen-Hsiang Lu (盧文祥)
  – Office: 4216
  – Office hours: Monday 12:10 - 2:10 PM
  – Phone: 62545
  – Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm
  – Email: [email protected]
  – Teaching assistant: 王廷軒
    • Email: [email protected]

3

Course Grading

• Class participation/presentation: 30%
• Tests: 25%
• Project: 25%
• Homework: 20%

4

Source Textbooks

• Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999. (全華科技圖書: 02-23717725)

• Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000.

• James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co, 1995.

• Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998.

• Jean Véronis, Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer, 2000.

5

Other Useful Sources (1)

• Reference Books
  – Charniak, E. Statistical Language Learning.
  – Cover, T. M., Thomas, J. A. Elements of Information Theory.
  – Jelinek, F. Statistical Methods for Speech Recognition.
• Major Conferences
  – ACL (Association for Computational Linguistics)
  – COLING (International Conference on Computational Linguistics)
  – HLT (Human Language Technology Conference)
  – IJCNLP (International Joint Conference on Natural Language Processing)
• Journals
  – Computational Linguistics
  – Natural Language Engineering
  – TALIP (ACM Transactions on Asian Language Information Processing)
  – TSLP (ACM Transactions on Speech and Language Processing)

6

Other Useful Sources (2)

• Resource URLs
  – http://www.aclclp.org.tw/res_other_c.php (Association for Computational Linguistics and Chinese Language Processing, 中華民國計算語言學學會)
  – http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group)
  – http://www.phontron.com/nlptools.php (Graham Neubig)
• Tools/Software
  – Online Dictionary
    • WordNet: http://wordnet.princeton.edu/
    • HowNet: http://www.keenage.com/html/c_index.html
    • The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/

7

CKIP (Chinese Knowledge and Information Processing Group, Academia Sinica; 中研院詞庫小組)

• Parser: http://140.109.19.112/main.exe?id=6833

• POS (part of speech) tagger: http://ckipsvr.iis.sinica.edu.tw/

8

Eric Brill's POS Tagger

• Website: http://cst.dk/online/pos_tagger/uk/

This/DT is/VBZ a/DT book/NN ./.
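The tagger output above uses a `word/TAG` convention, where each token is followed by a slash and its part-of-speech tag. When comparing the output of several taggers for the homework, it can help to turn that string into (word, tag) pairs. The following is a minimal sketch; the function name `parse_tagged` is made up for illustration and is not part of any tagger's API.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so only the tag is cut off
        # even if the word itself contains a slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("This/DT is/VBZ a/DT book/NN ./.")
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Splitting on the last slash is a design choice for this sketch: the sentence-final token `./.` then correctly yields the word `.` with the tag `.`.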

9

Stanford Parser

• Website
  – http://nlp.stanford.edu/software/lex-parser.shtml
• Tools
  – Online version
    • Stanford Parser version 1.5.1
    • English & Chinese
    • http://josie.stanford.edu:8080/parser/

10

Stanford Parser

11

[Homework 1]

• Use the CKIP POS (part of speech) tagger, Eric Brill’s POS tagger, and the Stanford Parser to tag and parse at least three sentences.

12

Course Topics

• Probability and Information Theory
  – basics: definitions, formulas, examples
• Language Modeling
  – n-gram models, parameter estimation
  – smoothing (EM algorithm)
• Some Linguistics
  – phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
  – word classes, mutual information, lexicography

13

Course Topics (cont.)

• Hidden Markov Models
  – background, algorithms, parameter estimation
• Tagging: methods, algorithms, evaluation
  – tag sets, HMM tagging, transformation-based, feature-based
• Grammars and Parsing: data, algorithms
  – statistical parsing: algorithms, parameterization, evaluation

14

Course Topics (cont.)

• Applications
  – Machine Translation (MT)
  – Automatic Speech Recognition (ASR)
  – Information Retrieval (IR)
  – Cross-Language Information Retrieval (CLIR)
  – Question Answering (QA)
  – Cross-Language Question Answering (CLQA)
  – Summarization
  – Information Extraction
  – …

15

Course Introduction

• Lecture 1: Introduction
• Lecture 2: Mathematical Foundations
• Lecture 3: Linguistics Essentials
• Lecture 4: Corpus-based Work
• Lecture 5: Collocations
• Lecture 6: Statistical Inference: n-gram Models over Sparse Data
• Lecture 7: Word Sense Disambiguation
• Lecture 8: Statistical Alignment and Machine Translation
• Lecture 9: Markov Models
• Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
• Lecture 11: Statistical/Probabilistic Models for Word Alignment & CLIR
• Lecture 12: Part-of-Speech Tagging
• Lecture 13: Probabilistic Context Free Grammars
• Lecture 14: Question Answering

16

The Ultimate Research Goal in Natural Language Processing (NLP)

• To develop an automated language understanding system
• Why is this important?
  – Easy for everyone to use language
  – Natural human interface for a variety of applications (e.g., database access, on-line tutor, robot control, etc.)
  – Language seems fundamental for developing an intelligent system
    • iPhone Siri
    • IBM's DeepQA project

17

Natural Language is VERY Useful

18

OCR Problems

19

20

21

Aspects of Computational Linguistics

• Description of the Language: universals, cross-linguistic research

• Implementation of Computer Model: algorithms and data structures, formal models to represent knowledge, model of the reasoning process

• Psycho-Linguistic Aspect: humans are an existence proof of the computability of language comprehension; psychological research can be used to justify a computer model; obtain human processing parameters

22

NLP Issues

• Why is NLP difficult?
  – Many “words”, many “phenomena”, many “rules”
    • OED (Oxford English Dictionary): 400k words; Finnish lexicon (of forms): ~2 × 10^7
    • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
  – irregularity (exceptions, exceptions to the exceptions, ...)
    • potato → potatoes (tomato, hero, ...); photo → photos; and even both: mango → mangos or mangoes
    • Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower
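The pluralization examples above (a pattern, an exception, and a word that accepts both forms) can be sketched as a rule with exception lists. This is a toy illustration only: the sets `O_TAKES_S` and `O_TAKES_BOTH` and the function `pluralize` are hypothetical names, they contain only the words from the slide, and the "-o takes -es" default is the slide's simplification rather than a reliable rule of English.

```python
# Exception list: nouns ending in "o" that take "-s" instead of "-es".
O_TAKES_S = {"photo"}
# Exceptions to the exception: nouns for which both plurals are accepted.
O_TAKES_BOTH = {"mango"}


def pluralize(noun):
    """Return the acceptable plural forms of a noun under toy rules."""
    if noun in O_TAKES_BOTH:
        return [noun + "s", noun + "es"]
    if noun in O_TAKES_S:
        return [noun + "s"]
    if noun.endswith("o"):
        # Pattern from the slide: potato, tomato, hero.
        return [noun + "es"]
    # Fallback for everything else in this sketch.
    return [noun + "s"]


for noun in ["potato", "tomato", "hero", "photo", "mango"]:
    print(noun, "->", " or ".join(pluralize(noun)))
```

The order of the checks matters: the most specific lists are consulted before the general pattern, which is why every new irregular word needs another list entry.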

23

Difficulties in NLP (cont.)

  – Ambiguity
    • books: NOUN or VERB?
      – you need many books vs. she books her flights online
    • Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus)
      – Thank you for not eating without earphones??
      – Thank you for drinking?? …
    • Fred’s hat was blown off by the wind. He tried to catch it.
      – ...catch the wind or ...catch the hat?

24

Rules or Statistics?

• Preferences:
  – context clues: she books → books is a verb
    – rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun → the word is a verb
  – pronoun reference:
    – she/he/it often refers to the most recent noun or pronoun (but there are certainly exceptions)
  – selectional restrictions:
    – catching hat is better than catching wind (but not always)
  – semantics:
    – We thank people for doing helpful things or not doing annoying things
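The context-clue rule above (an ambiguous verb/nonverb word preceded by a matching personal pronoun is a verb) can be written down directly. The sketch below is a toy under stated assumptions: the lexicon `AMBIGUOUS`, the pronoun set `PRONOUNS_3SG`, and the fallback to NOUN when the rule does not fire are all invented for illustration and are not from the slides.

```python
# Hypothetical toy lexicon of words that may be a verb or a noun.
AMBIGUOUS = {"books"}
# Pronouns that "match" a third-person singular verb form such as "books".
PRONOUNS_3SG = {"he", "she", "it"}


def tag_ambiguous(tokens):
    """Tag each ambiguous token as VERB or NOUN; other tokens get None."""
    tags = []
    for i, word in enumerate(tokens):
        if word.lower() not in AMBIGUOUS:
            tags.append((word, None))
        elif i > 0 and tokens[i - 1].lower() in PRONOUNS_3SG:
            # The rule from the slide: pronoun + ambiguous word -> verb.
            tags.append((word, "VERB"))
        else:
            # Assumed default for this sketch when the rule does not fire.
            tags.append((word, "NOUN"))
    return tags


print(tag_ambiguous("she books her flights online".split()))
print(tag_ambiguous("you need many books".split()))
```

A rule like this only looks one word to the left, so an intervening adverb ("she often books ...") would already defeat it, which is one reason the slides ask whether statistics should handle such preferences.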

25

Solutions

• Don’t guess if you know:
  • morphology (inflections)
  • lexicons (word information)
  • unambiguous names
  • perhaps some (really) fixed phrases
  • syntactic rules?
• Use statistics (based on real-world data) for preferences (only?)
• No doubt: but this is an important question!

26

Types of Linguistic Knowledge

• Acoustic/Phonetic Knowledge: How words are related to their sounds (transliteration).
  – E ri c sson <=> 易利信
• Morphological Knowledge: How words are constructed out of basic meaning units.
  – un + friend + ly → unfriendly
  – love + past tense → loved
  – object + oriented → object-oriented
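The first morphological example, un + friend + ly, can be reproduced with a simple affix-stripping sketch. Everything here is an assumption made for illustration: the affix lists `PREFIXES` and `SUFFIXES`, the stem lexicon passed in by the caller, and the function name `segment`. Real morphological analyzers also handle spelling changes (as in love + past tense → loved), which this sketch does not.

```python
# Hypothetical affix inventories; a real system would have many more.
PREFIXES = ["un"]
SUFFIXES = ["ly"]


def segment(word, stems):
    """Split a word into prefix + stem + suffix if the stem is in the lexicon."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                if stem in stems:
                    # Drop empty strings for affixes that were not present.
                    return [part for part in (prefix, stem, suffix) if part]
    # No analysis found: return the word unsegmented.
    return [word]


lexicon = {"friend"}
print(" + ".join(segment("unfriendly", lexicon)))
```

Requiring the remaining stem to be in a lexicon keeps the sketch from splitting words such as "uncle" or "fly" just because they happen to start with "un" or end in "ly".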

27

More Types of Linguistic Knowledge

• Lexical Knowledge (or Dictionary): This should include information on parts of speech, features (e.g., number, case), typical usage, and word meaning.

• Syntactic Knowledge: How words are put together to make legal sentences (or constituents of sentences).

28

More Types of Linguistic Knowledge

• Semantic Knowledge: Word meanings and how words combine into sentence meaning.
  – e.g., Fred tossed the ball.

Semantic roles

29

More Types of Linguistic Knowledge

• Pragmatic Knowledge: How context affects the interpretation of a sentence. Examples:
  – Louise loves him.
    [Context 1:] Who loves Fred?
    [Context 2:] Louise has a cat.
  – What time is it?
    [Context 1:] Fred is fidgeting (坐立不安) and staring at his watch.
    [Context 2:] Louise has no watch.

30

More Types of Linguistic Knowledge

• World Knowledge: How other people's minds work, what a listener knows or believes, the etiquette (成規) of language. Examples:
  – Will you pass the salt?
  – I read an article about the war in the paper.
  – Fred saw the bird with his binoculars.
  – Tim was invited to Tom's birthday party. He went to the store to buy him a present.

31

Multilingualism Issues in Web Age

• Language barrier
  – There are about 6,700 languages listed in the Ethnologue (http://www.ethnologue.com/)
• Information overload
  – Scaling up of language resources
    • Webpages
    • News
    • Weblogs
    • Microblogs

32

Multilingual Understanding??

33

Multilingual Understanding??

34

Multilingual Understanding??

35

Real World Situation

• Use a statistical model based on REAL WORLD DATA and care about the best sentence only
• Imagine:
  – Each sentence W = {w1, w2, ..., wn} gets a probability P(W|X) in a context X
  – For every possible context X, sort all the imaginable sentences W according to P(W|X)
  – Ideal situation:

[Figure: P(W) plotted over sentences sorted from Wbest to Wworst; the best sentence is the most probable one in context X]
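The idea of sorting candidate sentences by probability and keeping only the best one can be illustrated with a small language model. The sketch below is one possible instance, not the course's method: it assumes a bigram model with add-one smoothing, ignores the context X (so it ranks by P(W) rather than P(W|X)), and uses a three-sentence toy corpus invented for the example. The names `train_bigram` and `log_prob` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count history words and bigrams over sentences padded with <s>, </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    # Vocabulary of predictable tokens: all words plus the end marker.
    vocab = {w for s in corpus for w in s.split()} | {"</s>"}
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Toy training data (an assumption for this sketch).
corpus = [
    "she books her flights online",
    "you need many books",
    "she needs many books",
]
model = train_bigram(corpus)

candidates = ["she books her flights online", "books she online flights her"]
ranked = sorted(candidates, key=lambda w: log_prob(w, model), reverse=True)
print("W_best: ", ranked[0])
print("W_worst:", ranked[-1])
```

Working in log probabilities avoids numerical underflow when many small factors are multiplied, and the smoothing keeps unseen word pairs from receiving probability zero, so every imaginable sentence still gets a place in the ranking.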

36

Real World Situation

• Unable to specify a set of grammatical sentences using fixed “categorical” rules

• (disregarding the “grammaticality” issue)

[Figure: P(W) plotted over sentences sorted from Wbest to Wworst; the best sentence is the most probable one in context X]