119
ISCSLP02 L. F. Chien Information Retrieval Techniques for Spoken Language Processing Lee-Feng Chien ( ) Institute of Information Science Institute of Information Science Academia Academia Sinica Sinica , Taiwan , Taiwan International Symposium on Chinese Spoken Language Processing (ISCSLP 2002) Taipei, Taiwan August 23Ć24, 2002 ISCA Archive http://www.iscaĆspeech.org/archive

Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Information Retrieval Techniques for Spoken Language Processing

Lee-Feng Chien (簡立峰)

Institute of Information Science Institute of Information Science Academia Academia SinicaSinica, Taiwan, Taiwan

�������������� ����� ���������������������������������������������

�������������

������� !�"�����

ISCA Archive����#$$���%����!������%���$�����&�

Page 2: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Outline

IR vs. SLPConventional IR TechniquesWeb IR Techniques Web Mining Techniques Term Clustering through Web MiningAnchor Text Mining

Page 3: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

I. IR vs. SLP

Page 4: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Information Retrieval

a research with a long-term research goal of exploration of information storage, classification, extraction, indexing and browsing techniques for the retrieval of non-structural databases such as textual documents

Page 5: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Different Research Aspects

Text IRWeb IRMultimedia IRIntelligent IR

Page 6: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Different Research Aspects

Text IRText indexing, searching, presentation, user study

Web IRCrawling, page ranking, distributed search, scalability, multi-lingual and multi-culture

Multimedia IRRetrieving multimedia contents such as speech, audio, music, image, video

Intelligent IRAdvanced language and information processing topics such as question answering, cross-language, information tracking, information extraction, summarization, speech interaction, etc.

Page 7: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Chinese Information Retrieval

Language issuesCulture/geographical issuesChinese people issues

Page 8: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Chinese Information Retrieval

Language issuesWord segmentation, term extraction, parsing, linguistic resources Font display, conversion between simplified Chinese and traditional Chinese, etc.

Culture/geographical issuesChinese pages from world wide, language identification required, preferred topics different, etc.

Chinese people issuesUser behaviors

Page 9: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Users and Pages (3 years ago)

Area Users Web Pages Time World-wide 150M 800M 7/99 China 4M 2.5~3M 7/99 Taiwan 4M 3M 7/99

Page 10: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Number of Chinese Web Pages

328,000,000 pages

Page 11: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Number of Chinese Web Pages

328,000,000 pages

Scalability Problem !

Page 12: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Number of Web Pages

The world’s largest search engine ?

2,073,418,204 pages (Google)2,095,568,809 pages (FAST)

Page 13: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Number of Web Pages (Cont.)

Page 14: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Why IR Useful for SLP ?

ASR or dictation machine: lexicon, corpus, and language model Voice portal: search via spoken queries Speech retrieval: indexing & searching Topic detection & tracking : document classification & clustering

Page 15: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Natural Language Processing

Speech Recognition

Information Retrieval Search

Engine

DictationMachine

Parser

IR Via Voice

IR vs. SLP

Q&A

Speech IR

Page 16: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Search Engine

DictationMachine

Parser

IR Via Voice

Research Paradigm

Web Mining

Page 17: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Anchor Text Mining for Query Translation (Lu, ICDM’01)

EnglishTraditional Chinese

Simplified ChineseJapanese

SonyNikeStanfordSydneyinternetnetworkhomepagecomputerdatabaseinformation

新力耐吉史丹佛雪梨網際網路網路首頁電腦資料庫資訊

索尼耐克斯坦福悉尼互联网网络主页计算机数据库信息

ソニーナイキスタンフォードシドニーインターネットネットワークホームページコンピューターデータベースインフォメーション

Page 18: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Cross-Language Web SearchA Web search service allows users to query in one language and search documents that are written or indexed in another language.

Page 19: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

II. Conventional IR Techniques

Page 20: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

The Vector Space Model

Measure closeness between query and document.Queries and documents represented as n dimensional vectors.Each dimension corresponds to a word/term. Advantages: Conceptual simplicity and use of spatial proximity for semantic proximity.

Page 21: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Vector Similarity

d = The man said that a space age man appeared d’= Those men appeared to say their age

Page 22: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Vector Similarity (Cont.)

cosine measure or normalized correlation coefficient

Euclidean Distance:

Page 23: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Weighting

Quantities used:tfi,j (Term frequency) : # of occurrences of wiin di

dfi (Document frequency) : # of documents that wi occurs incfi (Collection frequency) : total # of occurrences of wi in the collection

Page 24: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Inverted File for Keyword Matching

Google’s Index File Structure

Page 25: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Chinese IR & Indexing Unit Section

Character-based indexing and searchspeed/space problemincorrect matching due to free combination of characters, EX: 電腦科學

Word-based indexing and searchlexicon is a prerequisite and limitationunknown word identification, disambiguation of word segmentation

Csmart’s approach (Chien, SIGIR’95)signature-based, feature grouping (unigram, bigramcharacters)two-stage search, fuzzy search; suited for not too large files and demand of fuzzy search

Page 26: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Chinese Track in TREC’5

Berkeley (A. Chen, SIGIR’97)Indexing units

• use dictionary to segment texts• obtained from public domain with 91,000 words and phrases• stopword list with 444 entries

Searching algorithm, segmentation method• 0.461 average precision for manual queries, 0.32 for automatic

run

CUNY (Kwok, SIGIR’97)Indexing units

• Use 2,000 words to segment texts initially• use a learning strategy to extend the word entries to 15,000

finallySearching algorithm, segmentation method

• 0.40 in word-based; 0.42 in word and character-based

Page 27: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Text Indexing

Indexing Chinese texts Indexing Units

• Single character, Bi-character, word, term, stringNew word identification and word segmentation problems

Indexing English texts Indexing units

• Word, term, string (few)Problems

• Stemming, capitalization, hyphen, word sense disambiguation, typing errors

Structure for indexingInverted file, PAT array

Term extraction and term clustering techniques are the required key techniques

Page 28: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Extraction

Term is a meaningful and representative unit in terms of information retrieval, e.g., name, location, proper noun, topicTerms are derived and most excluded in common dictionariesTerm extraction can reduce word sense ambiguities in text retrieval and remedy weaknesses of word-based approaches, EX:

computer network (linking, net, mesh)Bank America, current theory

Term extraction is the first step toward concept-based IR

Page 29: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Chinese Term Extraction

• PAT-tree-based Approach • Poster Presentation Award by ACM SIGIR’98

Page 30: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Context Dependency

Association

Left Context Dependency (LCD) Right Context Dependency (RCD)

Lexical Pattern

usage of freedom

新加坡 國大政府

到達訪問

Page 31: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Example of the PAT tree (Chien, SIGIR’97)

Data stream: 個人電腦 , 人腦

0 2 4 6 9 11

Semi-infinite strings:

個人電腦 10101101 11010011 10100100 ...

bit: 1 9 17 25

人電腦

電腦

人腦

10100100 01001000 10111001 ...

10111001 01110001 00000000 ...

10111000 00000000 00000000 ...

10100100 01001000 00000000 ...

10111000 00000000 00000000 ...

(24,2,1)

0

4

2

9

6

(0,6,1)

(4,6,1)

(5,3,1) (8,3,2)

Data position

(comparison bit, # external nodes, frequency)

abcd, ed

Suffix Prefixabcd (a, ab, abc, abcd)bcd (b, bc, bcd)cd (c, cd)

d (d)ed (e, ed)d (d)

abcdcd

bcd

ed

d

Page 32: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Incremental Term Extraction

Term length

(character

N-gram)

Number of

extracted new

terms

Number of

documents

with new

terms

extracted

Average

number of

document

inputs can

find new

terms (A)

Average

frequency of

the extracted

new terms

Average

frequency as

the term can

be extracted

2 776 515 3.93 34.22 9.37

3 416 325 6.04 24.60 9.09

4 171 157 12.16 19.22 8.97

5 51 49 37.28 20.35 9.18

6 17 17 109.81 27.00 8.65

7 15 15 123.60 27.40 11.20

8 6 6 274.67 13.00 9.83

9 3 3 205.67 18.00 11.33

Total N-grams 1,455 814 2.41 28.95 9.25

Table 4. The detailed results for incremental term extraction when the threshold value was larger than 2 in the significance analysis;

the results were obtained from a total of 1,872 political news abstracts published in July 1997.

Page 33: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Incremental Term Extraction (Cont.)

S(Y) Total

Extracted

Terms(A)

No. of

Correct Terms

Extracted(B)

No. of

Correct Terms

Outside

Dictionary(C)

Precision

(B/A)

Recall

>1.5 2,291 1,374 297 0.60 0.53

>2 1,455 1,135 258 0.78 0.44

>2.5 723 593 172 0.82 0.23

>3 214 184 66 0.86 0.07

Table 3. The testing results for incremental term extraction using different threshold values in the significance analysis; the results

were obtained from a total of 1,872 political news abstracts published in July, 1997.

Page 34: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Incremental Term Extraction (Cont.)

S(Y) Total

Extracted

Terms(A)

No. of

Correct Terms

Extracted(B)

No. of

Correct Terms

Outside

Dictionary(C)

Precision

(B/A)

Recall

>1.5 2,291 1,374 297 0.60 0.53

>2 1,455 1,135 258 0.78 0.44

>2.5 723 593 172 0.82 0.23

>3 214 184 66 0.86 0.07

Table 3. The testing results for incremental term extraction using different threshold values in the significance analysis; the results

were obtained from a total of 1,872 political news abstracts published in July, 1997.

How to deal with low-frequency terms ?

Page 35: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

Word segmentation:•一台大電腦 (a set of large computer)

Page 36: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

Word segmentation:•一台大電腦 (a set of large computer)

Page 37: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

Word segmentation:•一台大電腦 (a set of large computer)

Page 38: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

Word segmentation:•一台大電腦 (National Taiwan University)

Page 39: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

Word segmentation:. 參加一台大電腦會議 (Attend a computer science meeting in National Taiwan University)

Page 40: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Characteristics of the Chinese

Language

Semantics

Form

Phonetics

New word identification:•台大電腦公司 (Taida Computer Inc.)

Page 41: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

III. Web IR Techniques

Page 42: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Spectrums of Web Search

Types of contentText, e.g. Web text、documents、newsAudio, e.g. music、speech、sounds、broadcast newsImage, e.g. pictures、photos、graphicsVideo, e.g. films、clipsFormatted Data, e.g. products

Scopes of contentGeneral or specific

Languages

ScalabilityPersonal、content site、intranet 、InternetThousands, millions or billions of (documents、users、queries)

InterfaceWeb-based、WAP-based、Voice-based

Page 43: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

IndexSpace

User Space

DocumentSpace

Information UseInformation Need

Seek

Use

Users Authors

Short QuerySubject TermsReal Names

X YX1,X2... Y1,Y2...

An Analytical Model

Page 44: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Different Web Search ModelsYahoo

manual recommendation in index space

Altavista、Inktomifull-text pattern matching in document space

Googlecitation information in document space

Realnamemanual real-name retrieval in user space

DirectHitcollaborative analysis in user space

AskJeevesQ&A (or FAQ search) in specific domains

Page 45: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hypertext on the Web

Internal Affairs

People

IIS

CS&IE, NTU

Institute of Information Science

http://www.iis.sinica.edu.tw

IISInstitute of

Information Science SE

Academia Sinica

Research Institutions

Hyperlink reference Sibling information

Web usage informationQuery & Click stream

Local content

Page 46: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Basic Architecture of a Spider-based Web Search Engine

SESE

SESE

SESE BrowserBrowserIndexIndexWeb

50M queries/day

Quality resultsLogLog

Spider

Spam

Freshness3B pages

Scalable, e.g., 20K PCs in Scalable, e.g., 20K PCs in GoogleGoogle

Page 47: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Crawling

Indexed Page

Out LinksDuplication Authority

Out link Traverse

Authorized Pages

Page 48: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Indexed Features & Page Ranking

Page Title: Academia Sinica

Indexed Page

Anchor Text:Highest Government Research Institution in Taiwan

abstractPopularity

Anchor Text:Chien’s Lab

Authority

Page 49: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Distributed Search

Query

Query Processor

SE

SE

SE

SEDocument Delivery

Page 50: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Facts and Problems I

Query short query problem50% are personal and company namesBoolean or natural language query is few

Browsingno more 2nd pageprecision is more important than recall

Robotlow coverage、deadlinks、garbage sites and pages

Page 51: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Facts and Problems II- Relevancy

Who judge the retrieval relevancyUsers

• What user want? What do they input?• Short query or NLQ?• HFQ、LFQ or MFQ?

Search engines• Technology limitations?• How many indexed pages? millions or billions

of pages ? • Ranking algorithm?

Page 52: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Quality vs. Quantity

Source: The Direct Hit Technology: A White Paper (http://system.directhit.com/whitepaper.html)

Page 53: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Facts and Problems III - Speed

What make the retrieval speed ?Users

• Where you are and how is the bandwidth?• Dial-up or T1? Cache or proxy?

• What is the query?• HFQ、LFQ or MFQ ?

Search Engines• How far and how good the infrastructure • Document delivery speed is the key

Page 54: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Important Issues

Web user studyUser behavior analysisQuery log mining

Content indexingLanguage identification Information conversionUnified indexing Term extraction Term clustering

Content searchingEnglish-Chinese bilingual searchConcept-based search Personalized searchCross-language search

Content presentation Concept-based term suggestion Concept-based search result clustering

Page 55: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Language Distribution (Pu’2000)

Table 3 Statistics concerning what language used in each search termAll Chinese All English Other

Dreamer 78.20% 19.18% 2.62%GAIS 78.22% 16.90% 4.88%

Page 56: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Length

Table 4 Statistics concerning the number of terms per queryin Chinese in English All

Dreamer 3.18 characters 1.10 words 6.31 bytesGAIS 3.55 characters 1.22 words 7.26 bytes

Page 57: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Information Needs by Subject Categories

Adult

Computer

EntertainmentChat

Life

Education

Travel

Game

Business

SocietyMedia

HumanitiesHealth

ScienceOther

Software

GraphSearch engine

Network

Company

HardwareBBS Other

-- Query categorization approach (Pu, JASIS’02). Obtained from analyzing about 2M queries

Page 58: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Conversion

Page 59: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

IV. Web Mining

Page 60: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Search

Weblogs, texts, images, …

Search Engine

Information Seeking

Millions of Users

Page 61: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Mining

Weblogs, texts, images, …

Search Engine

Knowledge Discovery

Millions of Users

Page 62: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Taxonomy of Web Mining (R. Cooley)

Web Mining

Web ContentMining

Web StructureMining

Web UsageMining

DM

Page 63: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Content Mining

Most focus on extraction of knowledge from the text of web pagesWeb Page Classification (Chuang & Chien’s IRWK’02)

Text Mining Web Information Extraction XML/Semantic Web MiningMessage Understanding (NLP viewpoint)

Multimedia Content MiningWeb Image Classification (Tseng’s IRWK’02)Speech Archive Mining (Chien’s ISCSLP’02)

Page 64: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Page Classification ApplicationsCMU Web KB Project (1998-2000) [Craven98]

Classifying Web pages is an essential step to constructWeb knowledge base

Page 65: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Usage Mining

Data Gathering Web server log, site description data, concept hierarchies

Data Preparation Distinguish among users, build sessions

Data MiningPattern discovery & analysis

Page 66: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Web Structure Mining

Google’s Page Rank

Document Citation (siteseer)

Page 67: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Semantic Web Mining

Current Web Most of Web content is designed for humans to read, not for machine to manipulate meaningfully

Semantic Web XML+RDF + Ontology + Agent

Semantic Web MiningAuto-construction of OntologyCase-based reasoning/inference RDF1

RDF2

Page 68: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

V. Term Clustering Trough Web Mining

Page 69: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Clustering (Chuang’02)

Hierarchical clustering

勞委會

職訓局

就業

青輔會

自傳

徵才

人力資源

104人力銀行

人力銀行

找工作

履歷表

求職

求才

占卜

塔羅牌

算命

紫微斗數

命理

姓名學

心理測驗

星座

愛情

eva長榮航空

長榮

航空公司

航空

華航

中華航空

補帖

大補帖

泡麵

dbt武俠

金庸

武俠小說

黃易

作家

武俠金庸武俠小說黃易作家

補帖大補帖泡麵dbt

eva長榮航空 (EVA airline)長榮 (EVA)航空公司 (airline)航空 (airway)華航 (China airline)中華航空 (China airline)

占卜塔羅牌算命紫微斗數命理姓名學心理測驗星座愛情

勞委會職訓局就業青輔會自傳徵才人力資源104人力銀行人力銀行找工作履歷表求職求才cut

1 2 3 4 5

1 23 4

5

Page 70: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Query Clustering

Page 71: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Virus

Page 72: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Virussynonyms

Page 73: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Security

Page 74: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Security

Screen

Microsoft

Page 75: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Search Result Clustering

GoogleTaiwan University

ConceptSearch

Query Taxonomy

……

… …

Taiwan University

Page 76: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

SE

DocumentSpace

UsersAuthors

Information UseInformation Need

DocumentTaxonomy

QuerySpace

QueryTaxonomy

Personalized Search

QuerySpace

….

Personal Directory Trees

QuerySpace

QueryTaxonomy

QuerySpace

QueryTaxonomy

Page 77: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Query Clustering

Page 78: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

The Steps

Feature ExtractionUse co-occurred seed terms extracted from retrieved top pages

Term VectorEach query term is assigned a term vector

• Record the co-occurred feature terms and their frequency values in the retrieved documents.

Term Similaritytf*idf-based Cosine measurement

Hierarchical Term ClusteringCluster popular query terms in the log into initial categoriesQuery terms with similar features are grouped into clusters.

Page 79: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Feature Extraction

Use co-occurred seed terms extracted from retrieved top pages

Creative Nude Photography Network -- Fine Art Nude and ...... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...

Nude Places... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...

A Brave Nude World... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...

nudeCo-occurred

feature terms

3/2erotic photography

1/1naked

………3/2art

2/2photography

tf/dfterm

Page 80: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Weighting

Page 81: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Extraction of Basic Feature TermsPerformance of different features: randomly selected, hi-frequency, and seed terms

Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc.More expressive and distinguishable in describing a particular categoryTwo logs compared and extracted 9,709 overlapping top query terms as feature terms

   G-1999D-1998

Top 1,000 terms top 20,000 terms ALL

Top 1,000 terms 583/58.30% 977/97.70% 992/99.20%Top 20,000 terms 914/91.40% 9,709/50.71% 14,721/76.89%

Page 82: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Query Clustering (Cont.)

Feature ExtractionUse co-occurred seed terms extracted from retrieved top pages

Term VectorEach query term is assigned a term vector

• Record the co-occurred feature terms and their frequency values in the retrieved documents.

Term SimilarityTF *IDF-based Cosine measurement

Hierarchical Term ClusteringCluster popular query terms in the log into initial categoriesQuery terms with similar features are grouped into clusters.

Page 83: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Term Similarity

Page 84: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Compute the similarity between all pairs of clusters• Estimate similarity between all pairs of composed terms• Use the lowest term similarity value as the cluster

similarity value

Merge the most similar (closest) two clusters• Complete linkage method

Update the cluster vector of the new cluster Repeat steps 2 and 3 until only a single cluster remains

Page 85: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Compute the similarity between all pairs of clusters• Estimate similarity between all pairs of composed terms

Term

Cluster

Page 86: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Compute the similarity between all pairs of clusters• Estimate similarity between all pairs of composed terms

Term

Cluster

Page 87: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Compute the similarity between all pairs of clusters• Estimate similarity between all pairs of composed terms• Use the lowest term similarity value as the cluster

similarity value

Term

Cluster

Page 88: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Merge the most similar (closest) two clusters• Complete-linkage method

0.3

0.10.5

Page 89: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Merge the most similar (closest) two clusters• Complete-linkage method

Page 90: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)Compute the similarity between all pairs of clusters• Estimate similarity between all pairs of composed terms• Use the lowest term similarity value as the cluster

similarity value

Merge the most similar (closest) two clusters• Complete linkage method

Update the cluster vector of the new cluster Repeat steps 2 and 3 until only a single cluster remains

Page 91: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

The Clustering Algorithm

Page 92: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Page 93: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Clustering Results

勞委會

職訓局

就業

青輔會

自傳

徵才

人力資源

104人力銀行

人力銀行

找工作

履歷表

求職

求才

占卜

塔羅牌

算命

紫微斗數

命理

姓名學

心理測驗

星座

愛情

eva長榮航空

長榮

航空公司

航空

華航

中華航空

補帖

大補帖

泡麵

dbt武俠

金庸

武俠小說

黃易

作家

武俠金庸武俠小說黃易作家

補帖大補帖泡麵dbt

eva長榮航空 (EVA airline)長榮 (EVA)航空公司 (airline)航空 (airway)華航 (China airline)中華航空 (China airline)

占卜塔羅牌算命紫微斗數命理姓名學心理測驗星座愛情

勞委會職訓局就業青輔會自傳徵才人力資源104人力銀行人力銀行找工作履歷表求職求才cut

1 2 3 4 5

1 23 4

5

Page 94: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Cluster Partition

Page 95: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Quality Function

Page 96: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Quality Function (Cont.)

Page 97: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Quality Function (Cont.)

Page 98: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

The Clustering Partition

Algorithm

Page 99: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Preliminary Experiment

Test queries • Two sets: top 1k queries and random 1k queries• Each of the test queries has been manually assigned

according classes

Evaluation metrics• F-Measure

Page 100: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Evaluation: F-Measure

Page 101: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Obtained F-Measures

Page 102: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Page 103: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Results of Hierarchical Structure Generation

Page 104: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

VI. Anchor Text Mining

Page 105: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Anchor Text Mining for Query Translation (Lu, ICDM’01)

EnglishTraditional Chinese

Simplified ChineseJapanese

SonyNikeStanfordSydneyinternetnetworkhomepagecomputerdatabaseinformation

新力耐吉史丹佛雪梨網際網路網路首頁電腦資料庫資訊

索尼耐克斯坦福悉尼互联网网络主页计算机数据库信息

ソニーナイキスタンフォードシドニーインターネットネットワークホームページコンピューターデータベースインフォメーション

Page 106: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Cross-Language Web SearchA Web search service allows users to query in one language and search documents that are written or indexed in another language.

Page 107: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

- in USATaiwan -

www.yahoo.comwww.yahoo.com.tw

Yahoo Yahoo

Source Query

Observation of Anchor Text

Page 108: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

- in USATaiwan - 台灣 - 搜尋引擎

www.yahoo.comwww.yahoo.com.tw

Yahoo 雅虎雅虎 Yahoo

Translation Candidates

Anchor-Text Set

Observation of Anchor Text

Page 109: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

……(#in-link= 187)

……(#in-link= 21)

- in USATaiwan - 台灣 - 搜尋引擎

www.yahoo.comwww.yahoo.com.tw

Yahoo 雅虎雅虎 Yahoo

Page Authority

Observation of Anchor Text

Co-occurrence

Page 110: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Asymmetric model:

Symmetric model with link information :)(

)()|(s

tsst

TPTTPTTP ∩

=

inktotal in-lUin-linkUP

UP)|U)P(T|UP(TUTPUTP

U)P|U)P(T|UP(T

UPUTTP

UPUTTP

TTPTTPTTP

ii

Uiitisitis

Uiitis

Uiits

Uiits

ts

tsts

i

i

i

i

# of #)( where

)(])|()|([

)(

)()|(

)()|(

)()()(

=

−+≈

∩=

∪∩

=↔

∑∑

∑∑

……

Probabilistic Inference Model

Page Authority

Co-occurrence

PageRank

Page 111: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Type of Model Top-1 Top-10

MA 41% 81% MAL 44% 83% MS 51% 84%

MSL* 53% 85%

* Training Data -- 109,416 anchor-text sets

from 1,980,816 pages* Test Query Set--622 English terms from 1,230 most popular English queries

* Training Data -- 109,416 anchor-text sets

from 1,980,816 pages* Test Query Set--622 English terms from 1,230 most popular English queries

Effects of Different Models

Using different modelsMA: Asymmetric modelMAL: Asymmetric model with link informationMS: Symmetric modelMSL: Symmetric model with link information

Page 112: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Top-n inclusion rates obtained with three different approaches.

ApproachesTop1Top2Top3Top4Top5

Anchor Text Mining

57.0%68.6%74.3%77.9%80.1%

Dictionary Lookup

30.5%30.5%30.5%30.5%30.5%

Combine Anchor Text Mining and Dictionary Lookup

74.0%82.9%86.7%88.9%90.2%

Performance

Back

Page 113: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

References

Overview Chien, L. F., Pu, H. T., “Important Issues on Chinese Information Retrieval;”, Computational Linguistics and Chinese Language Processing, August 1996, pp. 205-221.Lua, K. T., “Chinese Information Processing – Past, Present and Future”, IRAL’98.

Text Retrieval Chien, L. F., “Fast and Quasi-Natural Language Search for Gigabytes of Chinese Texts”, ACM SIGIR’95. Huang, X. and Robertson, “Okapi Chinese Text Retrieval Experiments at TREC-6. Nie, J., “On Chinese Text Retrieval”, SIGIR’96.Kwok, K. L., Comparing Representations in Chinese Information Retrieval, SIGIR’97.Chen, A. Chinese Text Retrieval without Using a Dictionary, SIGIR’97.

Text Classification Lam, W., “Using a Generalized Instance Set for Automatic Text Categorization”, SIGIR’98, pp. 81-89.

Natural Language ProcessingChen, K. J. et al., “Word Identification for Mandarin Chinese Sentences”, CONLING’92.Sproat, R. and Shih, C., “A Statistical Method for Finding Word Boundaries in Chinese Text”, CPCOL, 1989, pp. 240-271.Wu, Zimin and Tseng, G., “Chinese Text Segmentation for Text Retrieval: Achievements and Problems”, JASIS, 1994, pp. 532-542. Term ExtractionGao, J., et al., “Toward a Unified Approach to Statistical Language Modeling for Chinese”, ACM Trans. On Asian Language Information Processing, 2002, pp. 3-33.

Page 114: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

References (Cont.)

Term Extraction Chang, J. S. and Su, K. Y., “An Unsupervised iterative Method for Chinese New Lexicon Extraction”, International Journal of Computational Linguistics & Chinese Language Processing, August 1997, pp. 97-147.Chien, L. F., “ PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval”, SIGIR’97, pp. 50-58.Lai, Y. S. and Wu, C. H., “Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology”, ACM Trans. On Asian Language Information Processing, 2002, pp. 34-64.

Web IRPu, H. T., "Understanding Chinese Users' Information Behaviors through Analysis of Web Search Term Logs”, Journal of Computers (電腦學刊), December 2000, pp. 75-82.

Speech RetrievalWang, H. M., “Experiments in Syllable-based Retrieval of Broadcast News Speech in Mandarin Chinese”, Speech Communication, 2000, pp. 49-60. Li, X. and Crosft, B., Evaluating Question-Answering Techniques in Chinese, 2001.

Information Tracking and DetectionChen, H. H. and Ku, L. W., “An NLP & IR Approach to Topic Detection”, Topic Detection and Tracking: Event-based Information Organization, Chapter 12, ed. By Allan J., KluwerAcademic Publishers, 2002, pp. 243-264.

Cross-Language IRKwok, K. L., “Evaluation of an English-Chinese Cross-lingual Retrieval Experiment”, TREC’97.

Page 115: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

ReferencesWeb MiningKosala, R., & Blockheel, H. (2000). Web Mining Research: A Survey. SIGKDD Explorations, 2(1),1-15. PS PDFSrivastava,J. Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining:discovery and application of usage patterns from web data. SIGKDD Explorations,1, 12-23. PSJ. Sirvastava & R. Cooley, Mining web data for e-commerce: concepts & applications, PKDD’01 Conferences & Workshops KDD 2001, PKDD 2001, WebKDD 1999l, WebKDD 2000, WebKDD 2001Web Content MiningD. Mladenic et al., Text Mining: What if your data is made of words, PKDD’01M. Hearst, Untangling Text Data Mining, ACL’99. (Chang et al., 2001) (s.a.) Chapter 6 HandapparatChakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations 1(2), 1-11. PS PDFWeb Structure Mining (Chang et al., 2001) (s.a.) Chapter 7.3 Handapparat(Chakrabarti, 2000) s.a.Page, L., Brin, S., Motwani, R.,& Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. PS

Page 116: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

References (Cont.)

Web Usage Mining(Srivastava et al., 2000) s.a.Spiliopoulou, M. (2000). Web usage mining for site evaluation: Making a sitebetter fit its users. Special Section of the Communications of ACM on "Personalization Technologies with DataMining'', 43(8), 127-134. HandapparatACM Digital LibraryCooley, R. 2000. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. University of Minnesotal. PSBorges, J.L. (2000).A Data Mining Model to Capture User Web Navigation Patterns. Department of Computer Science, University College London, London University. PS PDFFor more references can refer at http://www.wiwi.hu-berlin.de/~berendt/lehre/2001w/wmi/literature.html

Page 117: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

References (Cont.)

Text and Web page categorizationS. Chakrabarti, B. Dorm, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD’98, pp. 307-318, 1998.J. M. Pierre, Practical issues for automated categorization of Web sites, ECDL 2000 Workshop on the Semantic Web, 2000.C.Y. Quek. Classification of World Wide Web Documents. Senior Honors Thesis, School of Computer Science, CMU, May 1997.Y. Yang and X. Liu. A re-examination of text categorization methods, SIGIR’99, pp. 42-49, 1999.

Web page classification applicationsC. Chekuri, M.H. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. WWW’97.M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. AAAI’98, pp. 509-516, 1998.M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, and M. Gori, Focused crawling using context graphs, VLDB2000, pp. 527-534, 2000.

Link and context analysisG. Attardi, A. Gulli, and F. Sebastiani. Automatic web page categorization by link and context analysis. Proceedings of THAI’99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105-119, 1999.S. Brin and L. Page. The anatomy of large-scale hypertextual web search engine, WWW’98.J. Dean and M. R. Henzinger. Finding related pages in the world wide web. WWW’99, pp. 389-401, 1999.J. Kleinberg. Authoritative sources in a hyperlinked environment. Proceedings of the 9th annual ACM SIAM Symposium on Discrete Algorithms, pp. 668-677, 1998.

Page 118: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

References (Works in Academia Sinica)

1. S. L. Chuang, L. F. Chien, “Automatic Subject Categorization of Query Terms for Web Information Retrieval”, accepted by Decision Support System, 2002.2. Lee-Feng Chien, et al., “Incremental Extraction of Domain-Specific Terms from Online Text Collections”, Recent Advances in Computational Terminology, ed. By D. Bourigault et al., 2001.3. Lee-Feng Chien, “PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval” , special issue on “Information Retrieval with Asian Languages”, Information Processing and Management ,Elsevier Press, 1999.4. W. H. Lu, L. F. Chien, H. J. Lee, “ Mining Anchor Texts for Translation of Web Queries”, accepted by ACM Trans on Asian Language Information Processing, 2002.5. W. H. Lu, L. F. Chien, S. J. Lee, “Web Anchor Text Mining for Translation of Web Queries”, IEEE Conference on Data Mining, Nov., San Jose, 2001. 6. C. K. Huang, L. F. Chien, Y. J. Oyang, “Interactive Web Multimedia Search Using Query-Session-Based Query Expansion”, The 2001 Pacific Conference on Multimedia (PCM2001), Oct., Beijing. 7. C. K. Huang, Y. J. Oyang, L. F. Chien, “A Contextual Term Suggestion Mechanism for Interactive Search”, The First Web Intelligence Conference(WI’2001), Japan.8. Lee-Feng Chien. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval, The 1997 ACM SIGIR Conference, Philadelphia, USA, 50-58 (SIGIR’97).

Page 119: Information Retrieval Techniques for Spoken Language ... · Information Retrieval Techniques for Spoken Language Processing ... ISCSLP’02 L. F. Chien Outline IR vs. SLP Conventional

ISCSLP’02 L. F. Chien

Q&A

Thanks!