Information Retrieval and Extraction (資訊檢索與擷取)
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering, National Taiwan University (台大資訊系)


Information Retrieval

• generic information retrieval system
  select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user
• functions
  – document search
    the selection of documents from an existing collection of documents
  – document routing
    the dissemination of incoming documents to appropriate users on the basis of user interest profiles

Detection Need

• Definition
  a set of criteria specified by the user which describes the kind of information desired.
  – queries in document search task
  – profiles in routing task
• forms
  – keywords
  – keywords with Boolean operators
  – free text
  – example documents
  – ...

Example

<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.

Example (Continued)

<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management - The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner - A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk - A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
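A topic in this tagged form can be split into fields mechanically. The sketch below is only an illustration: the helper name `parse_topic` is hypothetical, and it assumes each tag opens a field that runs until the next tag, since the example shows no closing tags. The sample string copies only the first fields of the topic above.

```python
import re

# Tags as they appear in the example topic. Assumption: a field runs from
# its tag to the next tag, because the format shown has no closing tags.
TAG_PATTERN = re.compile(r"<(head|num|dom|title|des|narr|con|fac|def)>")

def parse_topic(text: str) -> dict[str, str]:
    """Split a tagged topic into {tag: field text} (hypothetical helper)."""
    fields: dict[str, str] = {}
    matches = list(TAG_PATTERN.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse whitespace so a field wrapped over several lines reads as one string.
        fields[m.group(1)] = " ".join(text[m.end():end].split())
    return fields

# Only the first fields of the example topic, for brevity.
topic = """<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to
produce document management system by obtaining a turnkey-system or by
obtaining and integrating the basic components.
"""

fields = parse_topic(topic)
print(fields["num"])    # Number: 033
print(fields["title"])  # Topic: Companies Capable of Producing Document Management
```

The free-text fields (description, narrative) and the keyword-like fields (concepts) correspond to different forms of Detection Need, so a system might treat them differently once they are separated.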

search vs. routing

• The search process matches a single Detection Need against the stored corpus to return a subset of documents.

• Routing matches a single document against a group of Profiles to determine which users are interested in the document.

• Profiles stand as long-term expressions of user needs.
• Search queries are ad hoc in nature.
• A generic detection architecture can be used for both search and routing.
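The symmetry between the two tasks can be sketched with one shared matcher. This is a toy illustration, not the architecture from the slides: the function names are hypothetical, and "matching" is assumed to mean that every keyword of the Detection Need occurs in the document.

```python
def matches(need: set[str], document: set[str]) -> bool:
    """Toy matcher (assumption): a need is met when all its keywords occur."""
    return need <= document

def search(query: set[str], corpus: dict[str, set[str]]) -> list[str]:
    """One Detection Need against many stored documents."""
    return [doc_id for doc_id, terms in corpus.items() if matches(query, terms)]

def route(document: set[str], profiles: dict[str, set[str]]) -> list[str]:
    """One incoming document against many stored Profiles."""
    return [user for user, need in profiles.items() if matches(need, document)]

corpus = {
    "d1": {"optical", "disk", "storage"},
    "d2": {"image", "scanner"},
    "d3": {"optical", "character", "recognition", "scanner"},
}
profiles = {"alice": {"scanner"}, "bob": {"optical", "disk"}}

print(search({"optical"}, corpus))                      # ['d1', 'd3']
print(route({"image", "scanner", "optical"}, profiles))  # ['alice']
```

Both functions call the same `matches`; only the side that is stored and the side that arrives are swapped.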

Search

• retrieval of desired documents from an existing corpus
• Retrospective search is frequently interactive.
• Methods
  – indexing the corpus by keyword, stem and/or phrase
  – applying statistical and/or learning techniques to better understand the content of the corpus
  – analyzing free text Detection Needs to compare with the indexed corpus or a single document
  – ...

Document Detection: Search

Document Detection: Search (Continued)

• Document Corpus
  – the content of the corpus may have a significant effect on performance in some applications

• Preprocessing of Document Corpus
  – stemming
  – a list of stop words
  – phrases, multi-term items
  – ...
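A minimal sketch of two of these preprocessing steps, stop-word removal and stemming, might look as follows. The stop list and the suffix-stripping rule are illustrative assumptions only; the rule is far cruder than a real stemming algorithm, and phrase detection is left out.

```python
import re

# Illustrative stop list (assumption); real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}

# Toy suffix list, checked in this order (assumption, not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The indexing of scanned documents and images"))
# ['index', 'scann', 'document', 'imag']
```

Stems such as `scann` and `imag` are not words, which is acceptable here because stems only need to be consistent between documents and queries.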

• Building Index from Stems
  – key place for optimizing run-time performance
  – cost to build the index for a large corpus
• Document Index
  – a list of terms, stems, phrases, etc.
  – frequency of terms in the document and corpus
  – frequency of the co-occurrence of terms within the corpus
  – index may be as large as the original document corpus
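The three kinds of information listed for the Document Index can be collected in a single pass over the corpus. The sketch below is one possible layout, not the one used by any particular system; `build_index` is a hypothetical name, and co-occurrence is assumed to mean "both terms appear in the same document".

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs: dict[str, list[str]]):
    """Build postings, corpus frequencies and document-level co-occurrence counts."""
    postings = defaultdict(dict)  # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()       # term -> frequency in the whole corpus
    cooccurrence = Counter()      # (term_a, term_b) -> documents containing both
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        for term, tf in counts.items():
            postings[term][doc_id] = tf
            corpus_freq[term] += tf
        # Sorting makes each unordered pair appear under one canonical key.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence

docs = {
    "d1": ["optical", "disk", "optical", "storage"],
    "d2": ["image", "scanner", "optical"],
}
postings, corpus_freq, cooccurrence = build_index(docs)
print(postings["optical"])                # {'d1': 2, 'd2': 1}
print(corpus_freq["optical"])             # 3
print(cooccurrence[("disk", "optical")])  # 1
```

The co-occurrence table grows with the square of the number of distinct terms per document, which is one reason an index can approach the size of the corpus itself.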

• Detection Need
  – the user’s criteria for a relevant document
• Convert Detection Need to System Specific Query
  – first transformed into a detection query, and then a retrieval query.
  – detection query: specific to the retrieval engine, but independent of the corpus
  – retrieval query: specific to the retrieval engine, and to the corpus
• Compare Query with Index
• Resultant Rank Ordered List of Documents
  – Return the top ‘N’ documents
  – Rank the list of relevant documents from the most relevant to the query to the least relevant
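The comparison and ranking steps can be sketched with a simple term-weighting score. The slides do not name a scoring function, so the tf-idf weighting used here is an assumption chosen for illustration, and `rank` is a hypothetical helper.

```python
import math
from collections import Counter

def rank(query: list[str], docs: dict[str, list[str]], top_n: int = 10):
    """Score documents by summed tf * idf of query terms; return the top N.

    tf-idf is an illustrative choice, not the method prescribed by the slides.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = sum(tf[t] * math.log(n_docs / doc_freq[t]) for t in query if t in tf)
        if score > 0:
            scores[doc_id] = score
    # Most relevant first, truncated to the top N.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

docs = {
    "d1": ["optical", "disk", "optical", "storage"],
    "d2": ["image", "scanner", "optical"],
    "d3": ["text", "retrieval", "database"],
}
for doc_id, score in rank(["optical", "disk"], docs, top_n=2):
    print(doc_id, round(score, 3))
```

Because the idf weights depend on document frequencies, the weighted query is tied to one corpus, which loosely mirrors the distinction above between a corpus-independent detection query and a corpus-specific retrieval query.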

Routing

Routing (Continued)

• Profile of Multiple Detection Needs
  – A Profile is a group of individual Detection Needs that describes a user’s areas of interest.
  – All Profiles will be compared to each incoming document (via the Profile index).
  – If a document matches a Profile, the user is notified about the existence of a relevant document.

Routing (Continued)

• Convert Detection Need to System Specific Query
• Building Index from Queries
  – similar to building the corpus index for searching
  – the quantity of source data (Profiles) is usually much less than a document corpus
  – Profiles may have more specific, structured data in the form of SGML tagged fields

Routing (Continued)

• Routing Profile Index
  – The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
• Document to be routed
  – A stream of incoming documents is handled one at a time to determine where each should be directed.
  – Routing implementation may handle multiple document streams and multiple Profiles.

Routing (Continued)

• Preprocessing of Document
  – A document is preprocessed in the same manner that a query would be set up in a search
  – The document and query roles are reversed compared with the search process
• Compare Document with Index
  – Identify which Profiles are relevant to the document
  – Given a document, which of the indexed profiles match it?
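A Profile index can be sketched as an inverted index over Detection Needs instead of documents. Everything named below is hypothetical, and the sketch assumes a Detection Need is a keyword set that matches when all of its keywords occur in the incoming document.

```python
from collections import defaultdict

def build_profile_index(profiles: dict[str, list[set[str]]]):
    """Map each term to the (user, need_number) pairs whose Detection Need uses it."""
    index = defaultdict(set)
    needs = {}
    for user, detection_needs in profiles.items():
        for i, need in enumerate(detection_needs):
            needs[(user, i)] = need
            for term in need:
                index[term].add((user, i))
    return index, needs

def route(document_terms: set[str], index, needs) -> list[str]:
    """Return the users with at least one Detection Need met by the document."""
    # The document plays the role a query plays in search: its terms
    # look up candidate Detection Needs in the Profile index.
    candidates = set()
    for term in document_terms:
        candidates |= index.get(term, set())
    matched = {user for (user, i) in candidates if needs[(user, i)] <= document_terms}
    return sorted(matched)

profiles = {
    "alice": [{"image", "scanner"}, {"optical", "disk"}],
    "bob": [{"text", "retrieval"}],
}
index, needs = build_profile_index(profiles)
print(route({"optical", "disk", "storage"}, index, needs))  # ['alice']
print(route({"text", "database"}, index, needs))            # []
```

Only the Detection Needs that share a term with the document are checked, so the cost per incoming document depends on the document's vocabulary rather than on the total number of Profiles.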

Routing (Continued)

• Resultant List of Profiles
  – The list of Profiles identifies which users should receive the document

Summary

• Generate a representation of the meaning or content of each object based on its description.

• Generate a representation of the meaning of the information need.

• Compare these two representations to select those objects that are most likely to match the information need.

[Figure: Basic Architecture of an Information Retrieval System. Documents are mapped to a Document Representation and Queries to a Query Representation; the two representations are passed to a Comparison step.]

Research Issues

• Given a set of descriptions for objects in the collection and a description of an information need, we must consider the following issues.
• Issue 1
  – What makes a good document representation?
  – What are retrievable units and how are they organized?
  – How can a representation be generated from a description of the document?

Research Issues (Continued)

• Issue 2
  How can we represent the information need and how can we acquire this representation either from a description of the information need or through interaction with the user?
• Issue 3
  How can we compare representations to judge likelihood that a document matches an information need?

Research Issues (Continued)

• Issue 4
  How can we evaluate the effectiveness of the retrieval process?
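The slide leaves the evaluation question open. Two standard set-based measures, precision and recall, give one common starting point; they are added here as an illustration and are not named in the slides. The sketch assumes the retrieved list has no duplicates and that relevance judgments are available as a set of document ids.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant.

    Assumes `retrieved` contains no duplicate ids.
    """
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents returned, 3 documents judged relevant.
p, r = precision_recall(["d1", "d2", "d5", "d7"], {"d1", "d5", "d9"})
print(p)            # 0.5
print(round(r, 3))  # 0.667
```

Returning more documents tends to raise recall and lower precision, so the two values are usually reported together.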

Information Extraction

• Generic Information Extraction System
  An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.

Information Extraction (Continued)

• What are the transducers or modules?

• What are their input and output?

• What structure is added?

• What information is lost?

• What is the form of the rules?

• How are the rules applied?

• How are the rules acquired?

Example: Parser

• transducer: parser
• input: the sequence of words or lexical items
• output: a parse tree
• information added: predicate-argument and modification relations
• information lost: none
• rule form: unification grammars
• application method: chart parser
• acquisition method: manually

Modules

• Text Zoner
  turn a text into a set of text segments
• Preprocessor
  turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
• Filter
  turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
• Preparser
  take a sequence of lexical items and try to identify various reliably determinable, small-scale structures

Modules (Continued)

• Parser
  input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete
• Fragment Combiner
  turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
• Semantic Interpreter
  generate a semantic structure or logical form from a parse tree or from parse tree fragments

Modules (Continued)

• Lexical Disambiguation
  turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates
• Coreference Resolution, or Discourse Processing
  turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text
• Template Generator
  derive the templates from the semantic structures
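The cascade idea can be sketched as a list of functions applied in order, each consuming the previous module's output. The sketch is heavily simplified and every rule in it is an assumption: only four of the modules are represented, the parsing and semantic stages are omitted, and the toy template generator works directly on token lists around the single keyword `acquired` rather than on semantic structures.

```python
import re
from functools import reduce

def text_zoner(text: str) -> list[str]:
    """Text -> text segments (assumption: blank-line separated paragraphs)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segments: list[str]) -> list[list[str]]:
    """Segments -> sentences, each a list of word tokens (punctuation is lost)."""
    sentences = []
    for seg in segments:
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            tokens = re.findall(r"[A-Za-z]+", sent)
            if tokens:
                sentences.append(tokens)
    return sentences

def sentence_filter(sentences: list[list[str]]) -> list[list[str]]:
    """Keep only sentences mentioning the toy trigger word 'acquired'."""
    return [s for s in sentences if "acquired" in (t.lower() for t in s)]

def template_generator(sentences: list[list[str]]) -> list[dict[str, str]]:
    """Toy rule: words before 'acquired' are the buyer, words after are the target."""
    templates = []
    for s in sentences:
        i = [t.lower() for t in s].index("acquired")
        templates.append({"buyer": " ".join(s[:i]), "target": " ".join(s[i + 1:])})
    return templates

def run_cascade(text, modules):
    """Feed each module's output to the next one."""
    return reduce(lambda data, module: module(data), modules, text)

text = "Acme acquired Widgets.\n\nThe weather was fine. Shares rose."
cascade = [text_zoner, preprocessor, sentence_filter, template_generator]
print(run_cascade(text, cascade))  # [{'buyer': 'Acme', 'target': 'Widgets'}]
```

Each stage adds structure (segments, token lists, slots) and discards something (layout, punctuation, irrelevant sentences), which is the pattern the generic definition describes.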

Topics

1. Introduction to Information Retrieval and Extraction
2. Conventional Text-Retrieval Systems (Salton, Chapter 8)
   - Database Management and Information Retrieval
   - Text Retrieval Using Inverted Indexing Methods
   - Extensions of the Inverted Index Operations
   - Typical File Organization
   - Text-Scanning Systems
3. Automatic Indexing (Salton, Chapter 9)
   - Indexing Environment
   - Indexing Aims
   - Single-Term Indexing Theories
   - Term Relationships in Indexing
   - Term-Phrase Formulation
   - Thesaurus-Group Generation

Topics (Continued)

4. Advanced Information-Retrieval Models (Salton, Chapter 10)
   - The Vector Space Model
   - Automatic Document Classification
   - Probabilistic Retrieval Model
   - Extended Boolean Retrieval Model
5. File Structures (Frakes & Baeza-Yates, Chapters 3-5)
   - Inverted Files
   - Signature Files
   - PAT trees
6. Term and Query Operations (Frakes & Baeza-Yates, Chapters 7-9, 10)
   - Lexical Analysis and Stoplists
   - Stemming Algorithms
   - Thesaurus Construction
   - Relevance Feedback
7. Evaluation Metrics (Jones & Willett, Chapter 4)
   - The Pragmatics of Information Retrieval Experimentation, Revisited
   - The TREC Conferences

Topics (Continued)

8. IR on the World Wide Web (Cheong, Chapter 4)
   - Spiders for Indexing the Web
   - Web Indexing Spiders
   - WebCrawler: Finding What People Want
   - Lycos: Hunting WWW Information
   - Harvest: Gathering and Brokering Information
   - WebAnts: Hunting in Packs
   - Issues of Web Indexing
   - Spiders of the Future
9. Cross-Language Information Retrieval (Hsin-Hsi Chen)
10. Information Extraction (Jerry R. Hobbs)
   - What information extraction is
   - What is involved in building information extraction systems, and some how-tos
   - What kinds of resources and tools are needed, and how to access them

Information Sources

• Books
  – Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
  – Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  – Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
  – Sparck Jones, K. and Willett, P. (Eds.) (1997) Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann.

Information Sources

• Conference Proceedings
  – ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
• Journals
  – ACM Transactions on Information Systems
  – Information Processing and Management (formerly Information Storage and Retrieval)
  – Journal of the American Society for Information Science (formerly American Documentation)
  – Journal of Documentation

