Text Information Retrieval and Applications – Advanced Topics

By J. H. Wang
May 27, 2009

2

Outline

• Advanced Retrieval Technologies
– Cross-Language Information Retrieval
– Multimedia Information Retrieval
– Semantic Retrieval

• Applications to IR
– Advanced Google
– Meta Search
– Search Result Clustering

3

Advanced Retrieval Technologies

• Cross-Language Information Retrieval (CLIR)

• Multimedia IR (image, speech, music, video)

• Semantic retrieval (XML, Semantic Web)

4

Cross-Language Information Retrieval

• Cross Language Information Retrieval (CLIR) -- A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language

5

Cross Language Web Search

• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language

6

Why “Cross-Language”?

• Source: Global Reach (global-reach.biz/globstats)

7

Internet World Users by Language

8

Top Ten Languages Used in the Web (Number of Internet Users by Language)
Source: Internet World Stats (Mar. 31, 2009)

Language                Internet Users   Internet      Growth in      Internet Users   World Population
                        by Language      Penetration   Internet       % of Total       for this Language
                                         by Language   (2000-2008)                     (2008 Estimate)
English                 463,790,410      37.2 %        226.7 %        29.1 %           1,247,862,351
Chinese                 321,361,613      23.5 %        894.8 %        20.1 %           1,365,138,028
Spanish                 130,775,144      32.0 %        619.3 %        8.2 %            408,760,807
Japanese                94,000,000       73.8 %        99.7 %         5.9 %            127,288,419
French                  73,609,362       17.8 %        503.4 %        4.6 %            414,043,695
Portuguese              72,555,800       29.7 %        857.7 %        4.5 %            244,080,690
German                  65,243,673       67.7 %        135.5 %        4.1 %            96,402,666
Arabic                  41,396,600       14.2 %        1,545.2 %      2.6 %            291,073,346
Russian                 38,000,000       27.0 %        1,125.8 %      2.4 %            140,702,094
Korean                  36,794,800       51.9 %        93.3 %         2.3 %            70,944,739
TOP 10 LANGUAGES        1,337,527,402    30.4 %        329.2 %        83.8 %           4,406,296,835
Rest of the Languages   258,742,706      11.2 %        424.5 %        16.2 %           2,303,732,235
WORLD TOTAL             1,596,270,108    23.8 %        342.2 %        100.0 %          6,710,029,070

More and more non-English users!

9

Web Content

[Chart: Internet hosts (million, logarithmic axis from 0.1 to 100.0) by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]

Source: Network Wizards Internet Domain Survey (Jan 99)

More and more non-English pages

10

Chart of Web Content (by Language)

• Total Web pages: 313 B
– English 68.4%
– Japanese 5.9%
– German 5.8%
– Chinese 3.9%
– French 3.0%
– Spanish 2.4%
– Russian 1.9%
– Italian 1.6%
– Portuguese 1.4%
– Korean 1.3%
– Other 4.6%

[Source: Vilaweb.com, as quoted by eMarketer (Feb. 2001)]

11

Language Percent of Public Sites

• English 72%
• German 7%
• Japanese 6%
• Spanish 3%
• French 3%
• Italian 2%
• Dutch 2%
• Chinese 2%
• Korean 1%
• Portuguese 1%
• Russian 1%
• Polish 1%

[Source: OCLC, 2002]

12

Web Users and Pages (10 years ago)

Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99

Challenge of Scalability!

Total Users: 800M
Chinese Users: 110M

Including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), 1.5M (US), and others.

Source: Global Reach, 2004

13

Number of Chinese Web Pages

10,030,000,000 pages

Scalability Problem!

14

Number of Web Pages

The world’s largest search engine ?

Billions Of Textual Documents Indexed

December 1995-September 2003

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.

Source: Search Engine Watch (Nov. 2004)

Search Engine   Reported Size            Page Depth
Google          8.1 billion              101K
MSN             5.0 billion              150K
Yahoo           4.2 billion (estimate)   500K
Ask Jeeves      2.5 billion              101K+

15

Number of Web Pages

• Estimated size:
– Web pages in the world: 19.2 billion pages (indexed by Yahoo as of August 2005)
– Websites in the world: 70,392,567 websites (indexed by Netcraft as of August 2005)
– Web pages per website: 273 (19.2 billion / 70,392,567, rounded to the nearest whole number)

• Updated estimate:
– 231,510,169 distinct websites (as found by the Netcraft Web Server Survey in April 2009)
– 63.2 billion pages (231,510,169 websites × 273 pages per website) [Source: http://news.netcraft.com/archives/web_server_survey.html]

[Source: http://www.boutell.com/newfaq/misc/sizeofweb.html]
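The arithmetic behind these estimates can be checked in a few lines. This is only a sketch that reuses the figures quoted on the slide; the variable names are illustrative, and the extrapolation assumes the 2005 pages-per-website ratio still held in 2009.

```python
# Figures quoted on the slide (August 2005 and April 2009).
pages_2005 = 19_200_000_000   # pages indexed by Yahoo
sites_2005 = 70_392_567       # websites counted by Netcraft
sites_2009 = 231_510_169      # websites counted by Netcraft

# Average pages per website, rounded to the nearest whole number.
pages_per_site = round(pages_2005 / sites_2005)

# Extrapolate the 2009 page count, assuming the 2005 ratio is unchanged.
pages_2009 = sites_2009 * pages_per_site

print(pages_per_site)
print(f"{pages_2009 / 1e9:.1f} billion pages")
```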

16

Number of Web Pages

• 1 trillion unique URLs (We knew the web was big, by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, 25 July 2008)

• 19,200,000,000 pages (Mayer, Tim, 8 August 2005, Our Blog is Growing Up And So Has Our Index)

• 320,000,000 pages (World Wide Web is 320 million and growing, BBC News Sci/Tech, 3 April 1998.)

• 1,000,000,000 pages (Internet. How much information? 2000. Regents of the University of California.)

• 800,000,000 pages (Maran, Ruth, and Paul Whitehead. "Web Pages." Internet and World Wide Web Simplified, 3rd ed. Foster City: IDG Books Worldwide, 1999. )

• 8,034,000,000 pages (Miller, Colleen. web sites: number of pages. NEC Research, IDC.)

[Source: http://hypertextbook.com/facts/2007/LorantLee.shtml]

17

Challenge of Cross-Language Web Search

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• 81% of search terms could not be found in common English-Chinese translation dictionaries
– Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …

18

Challenge

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• 81% of search requests could not be found in common English-Chinese translation dictionaries

• How can effective translations be found automatically for query terms not included in a dictionary?
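The dictionary-lookup problem can be illustrated with a toy example. The sketch below is not taken from any real CLIR system: the dictionary entries, the function name `translate_query`, and the sample queries are all made up, and they only show how terms missing from the dictionary end up untranslated.

```python
# Toy bilingual dictionary (hypothetical entries, for illustration only).
DICTIONARY = {
    "porcelain": ["瓷器", "瓷", "陶瓷"],
    "museum": ["博物館"],
}

def translate_query(terms, dictionary):
    """Split query terms into dictionary translations and unknown terms."""
    translated, unknown = {}, []
    for term in terms:
        candidates = dictionary.get(term.lower())
        if candidates:
            translated[term] = candidates
        else:
            unknown.append(term)  # out-of-dictionary term
    return translated, unknown

queries = ["Porcelain", "Yahoo", "SARS", "NASA", "Museum"]
translated, unknown = translate_query(queries, DICTIONARY)
miss_rate = len(unknown) / len(queries)
print(translated)
print(unknown, f"{miss_rate:.0%} not covered")
```

Proper nouns and new terminology fall through the lookup, which is the gap that automatic translation extraction is meant to close.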

19

Query Translation & CLIR in DL

[Diagram: Chinese query (瓷器) → mono-lingual document search → Chinese digital libraries]

Possible global use

20

Query Translation & CLIR in DL

[Diagram: English query (Porcelain) → ? → Chinese query (瓷器) → mono-lingual document search → Chinese digital libraries]

Need for CLIR services

21

Query Translation & CLIR in DL

[Diagram: English query (Porcelain) → query translation (瓷器 / 瓷 / 陶瓷) → Chinese query → mono-lingual document search → Chinese digital libraries]

22

Query Translation & CLIR in DL

[Diagram: English query (Porcelain) → query translation (瓷器 / 瓷 / 陶瓷) → Chinese query → mono-lingual document search → Chinese digital libraries]

Cost-ineffective to construct translation dictionaries

23

Query Translation & CLIR in DL

[Diagram: English query (Porcelain) → query translation using the Web (瓷器 / 瓷 / 陶瓷) → Chinese query → mono-lingual document search → Chinese digital libraries]

Taking the Web as online corpus to deal with translation of unknown terms

24

Query Translation & CLIR in DL

[Diagram: English query (National Palace Museum) → query translation using the Web (故宮 / 故宮博物院) → Chinese query → mono-lingual document search → Chinese digital libraries]

Online Term Translation Suggestions

25

Query Translation & CLIR in DL

[Diagram: English/Japanese/Korean queries → query translation using the Web and auto-generated translation lexicons (瓷器 / 瓷 / 陶瓷) → Chinese query → mono-lingual document search → Chinese digital libraries]

Auto-generated Translation Lexicons

26

CLIR

• Conventional approach to query translation
– Parallel documents as the corpus
– Assume long queries

• Problems of CLIR in digital libraries
– No corpus for cross-lingual training
– Short queries
– “Out-of-dictionary” terms (e.g., proper nouns, new terminologies, …)

English Terminologies                   Chinese Translation
mechanical strain                       機械應變
viscous damping                         黏滯阻尼
Richard Feynman                         費曼
Hypoplastic Left Heart Syndrome         左心發育不全症候群
NII Japan                               國立情報學研究所
SARS                                    嚴重急性呼吸道症候群
Extracorporeal Shock Wave Lithotripsy   震波碎石
Davinci                                 達文西

27

Translation Lexicon Construction for CLIR

• To use the Web as the corpus for query translation
– Web mining techniques
  • Anchor-text-based [ACM TOIS ‘04, ACM TALIP ‘02]
  • Search-result-based [JCDL ‘04]

• To extract terms from real document collections as possible queries
– Term extraction method [SIGIR ‘97]

28

Web Mining Approach to Term Translation Extraction

• LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/

[Diagram: the source query (Academia Sinica) is sent to the LiveTrans Engine, which uses anchor texts and search results from the Web to produce the target translations (中央研究院 / 中研院)]

29

National Palace Museum vs. 故宮博物院: Search-Result Page

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

[Screenshot annotation: noise]
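One naive way to collect translation candidates from mixed-language result pages is to count the Chinese strings that appear in snippets mentioning the English query. The sketch below is an illustration under that assumption, not the LiveTrans method; the snippets are invented, and the regular expression is a rough stand-in for real Chinese term extraction.

```python
import re
from collections import Counter

# Runs of CJK ideographs; a rough stand-in for proper Chinese term extraction.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def translation_candidates(query, snippets):
    """Count Chinese strings in snippets that also mention the English query."""
    counts = Counter()
    for snippet in snippets:
        if query.lower() in snippet.lower():
            counts.update(CJK_RUN.findall(snippet))
    return counts.most_common()

# Hypothetical search-result snippets, invented for this example.
snippets = [
    "國立故宮博物院 National Palace Museum 全球資訊網",
    "故宮博物院 (National Palace Museum) 參觀資訊",
    "故宮博物院 National Palace Museum, Taipei",
    "台北旅遊 Taipei travel guide",
]
candidates = translation_candidates("National Palace Museum", snippets)
print(candidates)
```

Strings such as 全球資訊網 ("World Wide Web") also co-occur with the query, which is the noise problem the slide points out: frequency alone does not decide which candidate to choose.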

30

Yahoo vs. 雅虎 -- Anchor-Text Set

• Anchor text (link text)
– The descriptive text of a link on a Web page

• Anchor-text set
– A set of anchor texts pointing to the same page (URL)
– Multilingual translations
  − Yahoo/雅虎/야후
  − America/美国/アメリカ

• Anchor-text-set corpus
– A collection of anchor-text sets

[Diagram: pages from Taiwan, China, Japan, Korea, and America link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", "美国雅虎", "雅虎搜尋引擎", "Yahoo! America", "야후-USA", and "アメリカの Yahoo!"]
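An anchor-text-set corpus can be pictured as a mapping from each target URL to the set of link texts that point to it. The following is a minimal sketch under that assumption; the function name `build_anchor_text_sets` and the second URL are hypothetical, and a real crawler would first have to extract the (anchor text, URL) pairs from pages.

```python
from collections import defaultdict

def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to."""
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text.strip())
    return dict(sets)

# (anchor text, target URL) pairs as a crawler might collect them;
# the second URL is a made-up example.
links = [
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("雅虎搜尋引擎", "http://www.yahoo.com"),
    ("야후-USA", "http://www.yahoo.com"),
    ("アメリカの Yahoo!", "http://www.yahoo.com"),
    ("Example", "http://www.example.com"),
]
corpus = build_anchor_text_sets(links)
print(sorted(corpus["http://www.yahoo.com"]))
```

Because the texts in one set describe the same page, terms in different languages within a set are likely translations of each other.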

31

Term Translation Extraction from Different Resources

[Diagram: the source query (National Palace Museum) is sent to a search engine to obtain search-result pages, while a Web spider builds an anchor-text corpus; term extraction and similarity estimation over both resources produce the target translations (國立故宮博物院, 故宮, 故宮博物院)]

32

LiveTrans: Cross-language Web Search

33

More Examples

34

More Examples

35

Multimedia IR

• Different forms of information need
• Image retrieval
• Speech information retrieval
• Music information retrieval
• Video information retrieval

36

Image Retrieval

• Content-based
– Query by image content
  • Query by example (以圖找圖, finding images with an image)
– Similarity in visual features
  • Color, texture, shape, …
– Relevance feedback

• Text-based
– Annotation
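Similarity in color features is often illustrated with histogram intersection. The sketch below assumes that measure and uses tiny synthetic "images" given as lists of RGB pixels; the function names and pixel data are invented, and real systems also use texture and shape features.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram of an image given as (r, g, b) tuples in 0-255."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Tiny synthetic "images" (lists of pixels), invented for illustration.
snow = [(250, 250, 250)] * 90 + [(30, 30, 30)] * 10
ice = [(240, 245, 255)] * 80 + [(30, 30, 30)] * 20
forest = [(20, 120, 30)] * 100

query = color_histogram(snow)
ranked = sorted(
    {"ice": ice, "forest": forest}.items(),
    key=lambda item: histogram_intersection(query, color_histogram(item[1])),
    reverse=True,
)
print([name for name, _ in ranked])
```

Query by example then amounts to ranking the collection by this similarity to the example image.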

37

Content-Based Image Retrieval (CBIR)

• Example systems
– CIRES (Content-based Image Retrieval System): http://amazon.ece.utexas.edu/~qasim/research.htm
– SIMPLIcity: http://www-db.stanford.edu/IMAGE/
– National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid=-1
– …

38

Relevance Feedback (RF)

[Figure: a query image and its similar images (no RF)]

Source: Dr. Cheng

39

Similar Images Using Relevance Feedback

[Figure: a query image and its similar images using RF]
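The slides do not say which relevance feedback method produced these results. One classic formulation is the Rocchio update, sketched below as an assumption: the query's feature vector is moved toward images the user marked relevant and away from those marked non-relevant. The weights and the three-dimensional feature vectors are illustrative only.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant examples and away from non-relevant ones."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel, non = centroid(relevant), centroid(non_relevant)
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]

# Hypothetical 3-dimensional feature vectors (e.g. color, texture, shape scores).
query = [0.5, 0.5, 0.0]
relevant = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
non_relevant = [[0.0, 1.0, 0.0]]
updated = rocchio(query, relevant, non_relevant)
print(updated)
```

The updated vector is then used to re-rank the collection, which is why results after feedback can differ from the initial ones.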

40

Automatic Image Annotation

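[Figure: a query image with unknown keywords ("Keywords?") is compared by visual similarity against image banks with annotations; the annotated examples carry keywords such as "white bear snow tundra", "polar bears snow fight", and "polar bear ice snow"]

Problem 1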

41

Spoken Document Retrieval

• Spoken document retrieval
– Indexing speech messages using speech recognition
– Retrieving relevant messages for a text/speech query

• Techniques
– Document Processing: acoustic change detection, speech/non-speech detection, Mandarin/non-Mandarin detection, story segmentation, speaker recognition/clustering
– Speech Recognition
– Indexing/Retrieval

42

SoVideo

http://slam.iis.sinica.edu.tw/demo.htm

43

Music Information Retrieval

• Finding a song by similar melody
– Query by singing
– Query by humming

• Singer identification
– Background noise
– Singer voice model

• Demo:
– http://slam.iis.sinica.edu.tw/demo.htm

44

Video Information Retrieval

• Differences from CBIR
– Temporal information
– Structural organization
– Complexity of querying system

• Techniques
– Video segmentation
– Keyframe identification

45

Semantic Retrieval

• HTML vs. XML
• Semantic Web (Agent, Ontology, RDF)

46

Common Language of the Web

• HTML
– Link: Pi → Pj
– URL (URI), anchor text
  • Part-of

[Example: the anchor texts "NTU" and "National Taiwan University" link to http://www.ntu.edu.tw/]

47

Link Analysis – Hubs & Authorities in PageRank

[Figure: example link graph with nodes labeled by rank values 100, 9, 53, 50, 50, 50, 3, 3, 3]
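The idea that a page's rank is passed along its outgoing links can be sketched with a small power-iteration version of PageRank. This is a simplified illustration, not the production algorithm of any search engine; the damping factor of 0.85, the handling of pages without out-links, and the four-page graph are assumptions. It covers PageRank only, not hub and authority scores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank

# Hypothetical four-page graph; C is linked to by every other page.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Pages with many or highly ranked in-links accumulate more rank, which is the page-authority signal used in keyword-based Web search.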

48

Current Web Search

• Keyword-based search (e.g., Google)
– Full text indexing
– Page authority (link analysis)
– Page popularity (query log and user’s click)

• Problems
– Not specific
  • Data in pages have no semantic annotations
  • Yo-Yo Ma’s most recent CD
– No topic disambiguation
  • Documents with different topics mix together
  • Yo-Yo Ma’s CDs, concerts, biography, gossip, …

49

Search on Semantic Web

• Metadata search
– To increase precision and flexibility

• Topic-based search
– To help contextualize queries and overlay results in terms of a knowledge base

50

XML (Extensible Markup Language)

• More flexible tags
• DTD (Document Type Definition)
– Definition of the tags
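Custom tags make data in a page addressable by meaning, so a query such as "Yo-Yo Ma's most recent CD" can be answered from structure rather than keywords. The sketch below uses Python's standard `xml.etree.ElementTree`; the tag names (`catalog`, `cd`, `artist`, `title`, `year`) and the album entries are hypothetical and would normally be declared in a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; the tag names and entries are made up for this example.
CATALOG = """
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><year>2007</year></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><year>2008</year></cd>
  <cd><artist>Another Artist</artist><title>Album C</title><year>2009</year></cd>
</catalog>
"""

def most_recent_cd(xml_text, artist):
    """Return the title of the newest CD by the given artist, or None."""
    root = ET.fromstring(xml_text)
    cds = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not cds:
        return None
    newest = max(cds, key=lambda cd: int(cd.findtext("year")))
    return newest.findtext("title")

print(most_recent_cd(CATALOG, "Yo-Yo Ma"))
```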

51

XML Search

• XML Text Search Engines
– Amberfish (Etymon)
– X3 (X-cubed) (DocSoft)
– UltraSeek (Verity)

• XML Structured Query Engines
– Fxgrep
– Cheshire II (UC Berkeley)

• XML Query Languages
– XQuery (W3C XMLQuery)
– XQL
– XML-QL

52

Semantic Web

• "The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001

53

Semantic Web

[Diagram: several agents interacting through RDF and a shared ontology]

54

Semantic Web

• RDF (Resource Description Framework)
– Common language

• Ontology
– Knowledge representation

• Agent

55

Why Semantic Web?

• Standardizing knowledge sharing and reusability on the Web

• Interoperable (independent of devices and platforms)

• Machine readable—enabling intelligent processing of information

56

An Example of Semantic Relation

[Diagram: a work is "written by" an author; a publisher "publishes" the work]

57

What is a Software Agent?

• A paradigm shift of information utilization from direct manipulation to indirect access and delegation

• A kind of middleware between information demand (client) and information supply (server)

• Software that has autonomous, personalized, adaptive, mobile, communicative, social, and decision-making abilities

58

What is Ontology?

• An ontology is a formal, explicit specification of a shared conceptualization of a domain of interest (T. Gruber)
– Formal semantics
– Consensus of terms
– Machine readable and processable
– Model of real world
– Domain specific

59

What is Ontology? (2)

• Generalization of
– Entity relationship diagrams
– Object database schemas
– Taxonomies
– Thesauri

• Conceptualization contains phenomena like
– Concepts/classes/frames/entity types
– Constraints
– Axioms, rules

60

Agents and Ontology

• Agents must have domain knowledge to solve domain-specific problems

• Agents must have common sharable ontology to communicate and share knowledge with each other

• The common sharable ontology must be represented in a standard format so that all software agents can understand and communicate

61

Agents and Semantic Web

• Semantic Web provides the structure for meaningful content of Web pages, so that software agents roaming from page to page can carry out sophisticated tasks
– An agent coming to a clinic’s web page will know Dr. Henry works at the clinic on Monday, Wednesday and Friday without having the full intelligence to understand the text…
– The assumption is that Dr. Henry made the page using an off-the-shelf tool, as well as the resources listed on the Physical Therapy Association’s site

62

Knowledge Representation on the Web

• The challenge of the Web is to provide a language to express both data and rules for reasoning about the data [meta-data] that allows rules from any existing knowledge representation system to be exported onto the Web

• Adding logic to the Web means using rules to make inferences, choose actions and answer questions. The logic must be powerful enough to describe complex properties of objects, but not so powerful that agents can be tricked by being asked to consider a paradox

63

Language Layers on the Web

[Diagram: language layers on the Web, with the labels HTML, XML, XHTML, SMIL, RDF, PICS, DC, declarative languages (OIL, DAML+Ont), DAML-L (logic), and Trust]

Semantic web infrastructure is built on the RDF data model

64

Languages on the Web

• HTML+URL
• XML+DTD (Document Type Definition)
• RDF+RDF schema

65

Statements: RDF

• The basic structure of RDF is object-attribute-value

• In terms of labeled graph: [O]-A->[V]

[Diagram: node O connected to node V by an arc labeled A]
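The object-attribute-value structure can be sketched as plain tuples with a small pattern matcher. This is an illustration in ordinary Python, not an RDF library or the RDF syntax itself; the identifiers such as `book:1` and the attribute names are hypothetical.

```python
# Each statement is an (object, attribute, value) triple; the identifiers
# below are hypothetical.
TRIPLES = [
    ("book:1", "written_by", "author:A"),
    ("publisher:P", "publishes", "book:1"),
    ("book:2", "written_by", "author:A"),
]

def match(triples, obj=None, attr=None, value=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (o, a, v)
        for o, a, v in triples
        if (obj is None or o == obj)
        and (attr is None or a == attr)
        and (value is None or v == value)
    ]

# Which works were written by author:A?
works = match(TRIPLES, attr="written_by", value="author:A")
print(works)
```

Because every statement has the same three-part shape, statements from different sources can be merged into one labeled graph and queried uniformly.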

66

Semantic Web Search Engine

• Swoogle: http://swoogle.umbc.edu/ [CIKM 2004]

• SHOE (Simple HTML Ontology Extensions): http://www.cs.umd.edu/projects/plus/SHOE/search/

• SWSE: http://www.swse.org/
• http://www.semanticwebsearch.com/

67

Applications to IR

• Advanced Google
• Meta Search
• Search Result Clustering

68

What do Users Really Want?

• Topic-based vs. keyword-based
– “NTU”

• How to improve current search engines?

• Resources about Search Engines
– Search Engine Watch: http://searchenginewatch.com/
– Research Buzz: http://researchbuzz.com/

69

Advanced Google

• Is Google good enough?
– “NTU”
– “NTU university”
– “NTU university Singapore”

• More and more Services
– Google Web, Image, News, Video, Google Desktop Search, …
– Google Groups, Gmail, Google Talk, Google Calendar, …
– Google Mobile, Google SMS, Google Local, …
– Google Print (Book Search), Google Maps, Google Earth, …
– Google Scholar, Translate, Finance, Docs, Reader, …

• More about Google Services
– http://www.google.com/options/
– Google Labs: http://labs.google.com/

70

More Types of Document Search

• Google: Web, Image, News, Groups, Desktop (Office, mail), …
• Microsoft: +Lookout (mail)
• Yahoo: +Stata (mail), +Adobe (PDF)

71

Searching Different Media

• Multimedia Search: MP3, Blog, messenger, mobile, …
– Baidu.com: MP3, image, news, …
– Singingfish.com (AOL): audio/video, …
– GoFish.com: audio, video, mobile, games
– AllTheWeb.com: pictures, audio, video, …

• Blog search engines
– Daypop, Bloogz, Waypath, …

• A9.com (by Amazon)
– Books, movies, …
– Bookmark, history, discover, diary

• Mobissimo.com
– Airfare search, hotel search

• Yahoo-OCLC toolbar: library search
– Searching Open WorldCat (OCLC union catalog)

72

Different Forms of Presentation

• Clusty.com (by Vivisimo)
– Clustering engine

• Snap.com (by Idealab)
– Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …

• Alexa.com (by Amazon)
– Average user review ratings, …

• Visualization
– TouchGraph Google Browser: http://www.touchgraph.com/TGGoogleBrowser.html
– Kartoo.com: a visual meta search engine
– Girafa
– ConceptSpace
– LostGoggles (formerly MoreGoogle): thumbnail preview

73

Focused Search Engines

• Scirus: http://scirus.landingzone.nl
– For scientific information only

• Google Scholar: http://scholar.google.com/
– For scholarly literature

74

Some Google Hacks and Searching Tricks

• References:
– Tara Calishain and Rael Dornfest, “Google Hacks,” O’Reilly
– Kevin Hemenway and Tara Calishain, “Spidering Hacks,” O’Reilly
– http://douweosinga.com/projects/googlehacks
– Tara Calishain, “Web Search Garage,” Prentice Hall
– Chris Sherman, “Google Power: Unleash the Full Potential of Google,” McGraw Hill

75

Further Utilizing Google…

• Google API: http://www.google.com/apis/
– 1,000 automated queries per day

• Google Hacks
– Google Talk
– Word Color
– Google Battle
– Google Date
– Google Best Time to Visit
– Google Protocol

• …

76

Meta (Federated) Search

• To search several individual search engines and their databases of web pages simultaneously
– Ixquick, Metacrawler, Dogpile, …

• Clustering meta-searchers
– Vivisimo, KillerInfo, …

• Meta-search engines for deep digging
– SurfWax, Copernic Agent, …

77

Meta Search Engine

[Diagram: the user sends a query to the meta search engine, which forwards it to the search engines SE1, SE2, …, SEn that index the Web]
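After the individual engines respond, a meta search engine has to merge their ranked lists into one. The slides do not name a merging method; the sketch below uses reciprocal rank fusion as one possible choice, with invented result lists standing in for SE1, SE2, and SE3 and the constant `k=60` taken as an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists; URLs ranked high by many engines score best."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from three engines (SE1, SE2, SE3).
se1 = ["http://a.example", "http://b.example", "http://c.example"]
se2 = ["http://b.example", "http://a.example", "http://d.example"]
se3 = ["http://b.example", "http://c.example"]
merged = reciprocal_rank_fusion([se1, se2, se3])
print(merged)
```

The method needs only rank positions, which is convenient because different engines rarely expose comparable relevance scores.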

78

Search Result Clustering

• Why search result clustering?
• Why is SRC different from document clustering?
– In assessment of algorithm’s quality
– Precision, recall vs. user-oriented, subjective assessment

79

Example of Search Result Clustering

[Figure: results for the query "NTU?" grouped into clusters: National Taiwan University, NTU Hospital, and Nanyang Technological University, Singapore]
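Grouping results for an ambiguous query such as "NTU" can be sketched with a greedy single-pass clustering over snippet words. This is a toy illustration, not the algorithm of any engine named in these slides; the snippets, the Jaccard similarity measure, and the 0.2 threshold are assumptions.

```python
def tokens(snippet, stop_words):
    """Lowercased word set of a snippet, without the given stop words."""
    return {word for word in snippet.lower().split() if word not in stop_words}

def jaccard(a, b):
    """Overlap of two word sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_snippets(snippets, query, threshold=0.2):
    """Greedy single-pass clustering: join the first cluster that is similar enough."""
    stop_words = {query.lower()}  # the query term appears everywhere, so ignore it
    clusters = []  # each cluster is a list of (snippet, token set) pairs
    for snippet in snippets:
        toks = tokens(snippet, stop_words)
        for cluster in clusters:
            if any(jaccard(toks, other) >= threshold for _, other in cluster):
                cluster.append((snippet, toks))
                break
        else:
            clusters.append([(snippet, toks)])
    return [[snippet for snippet, _ in cluster] for cluster in clusters]

# Hypothetical result snippets for the query "NTU".
snippets = [
    "NTU National Taiwan University admissions",
    "National Taiwan University NTU library",
    "NTU Nanyang Technological University Singapore",
    "Nanyang Technological University NTU Singapore campus",
    "NTU Hospital outpatient services",
]
groups = cluster_snippets(snippets, "NTU")
for group in groups:
    print(group)
```

Snippets are short and noisy, so the quality of such clusters is usually judged by how useful the groups look to users rather than by precision and recall alone.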

80

Example Clustering Search Engines

• Vivisimo.com
– Clusty.com
• WebClust.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
– A hierarchical clustering engine for snippets
• Mooter.com
• …

81

Example on Vivisimo

82

Vivisimo (cont.)

83

Clusty.com

84

InfoNetWare.com

85

Thanks for Your Attention!
