Information Retrieval course ::: Information Management Technologies
Kalliopi Zervanou
k.zervanou@uvt.nl

Overview

The need for information processing

Structured vs. unstructured data (text)

The challenges of text

Textual information processing technologies

The need for info processing

Large amounts of data in electronic form

Need for large scale & fast info processing

Most information to be found in text

Types of Data

Structured data

Semi-structured data

Unstructured, free-text data

Structured Data: e.g. Databases

Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …
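Because every value in a structured record sits under a named field, selecting records by field is straightforward. The following is a minimal Python sketch, assuming an in-memory list of records; the field names follow the record above, while the helper `find_by` is hypothetical and only the two keywords actually listed are included.

```python
# A catalogue record as structured data: every value sits under a named field.
catalogue = [
    {
        "title": "Introduction to Information Retrieval",
        "author": ["C.D.Manning", "P.Raghavan", "H.Schütze"],
        "doc_type": "Book",
        "publisher": "Cambridge University Press",
        "pub_date": 2008,
        "id": "CM20B",
        "location": "Computer Science section",
        # Only the keywords spelled out in the record; the elided ones are unknown.
        "keywords": ["Information Retrieval", "Indexing"],
    },
]

def find_by(records, field, value):
    """Return records whose field equals value, or contains it if the field is a list."""
    hits = []
    for record in records:
        stored = record.get(field)
        if stored == value or (isinstance(stored, list) and value in stored):
            hits.append(record)
    return hits

matches = find_by(catalogue, "keywords", "Indexing")
print([record["id"] for record in matches])
```

The same question asked of free text would first require working out which words are keywords at all, which is the contrast the following slides draw.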

Semi-Structured Data (e.g. XML)

<?xml version="1.0" encoding="utf-8" ?>
<cmsbwsa_iisg_nl>
  <bwsa>
    <path> bios/bymholt.html </path>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <geboortedatum>07-09</geboortedatum>
    <sterfjaar>1947</sterfjaar>
    <sterfdatum>05-27</sterfdatum>
    <extrainfo> socialistisch en anarchistisch publicist en auteur van de Geschiedenis der Arbeidersbeweging in Nederland</extrainfo>
    <id>77</id>
  </bwsa>
  ...
</cmsbwsa_iisg_nl>
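Semi-structured data can be read field by field because the markup names each piece of content. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the record above; the output dictionary keys (`name`, `born`, `died`) are names chosen for this example, not part of the original data.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown above; only a few elements are reproduced.
xml_text = """<cmsbwsa_iisg_nl>
  <bwsa>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <sterfjaar>1947</sterfjaar>
    <id>77</id>
  </bwsa>
</cmsbwsa_iisg_nl>"""

root = ET.fromstring(xml_text)

people = []
for entry in root.findall("bwsa"):
    # findtext returns the element text; strip() removes the padding spaces.
    first = entry.findtext("voornaam").strip()
    last = entry.findtext("achternaam").strip()
    people.append({
        "id": int(entry.findtext("id").strip()),
        "name": first + " " + last,
        "born": int(entry.findtext("geboortejaar").strip()),
        "died": int(entry.findtext("sterfjaar").strip()),
    })

print(people)
```

Note that the `<extrainfo>` element still holds free text: the tags say where the description is, not what it means.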

Free-Text/ Unstructured data

Bertelsmann 9-mth profit slips on start-up losses

FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.

Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.

Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.

Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Data Mining

analysis of structured data

detection of unknown interesting patterns:

  groups of data records (cluster analysis)

  unusual records (anomaly detection)

  data dependencies (association rule mining)

Text Mining / Text Analytics

analysis of text (semi-/unstructured data)

detection of unknown, interesting information:

  group documents (classification/clustering)

  extract information (content descriptors, concepts of interest)

  associate/link information (e.g. concept relations)

  discover previously unknown facts

The challenges of text

Full text understanding beyond current technology

Human understanding based on context

Context: text, but also world knowledge

Text: ambiguity (syntactic, semantic, lexical, pragmatic)

[Figure: overview of textual information processing technologies. Labels: Doc Collection, IR, Relevant Docs, (Indexing), Index Terms, ATR, Terminology, Summarisation (or Abstracting), Important Info, IE, Relevant Info (NE …, EVENT …), Structured Info, Data Bases, Thesauri, Lexicons, Ontologies, Gazetteers, Data Mining, Reasoning, etc…, Derived Info. Regions: UNSTRUCTURED DATA, STRUCTURED DATA. Legend: Process, Resource.]

IR: Select relevant documents

Query: “query term”

Relevant: Documents containing the “term”

Methods: Indexing or Automatic Term Recognition
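Selecting the documents that contain a query term is commonly done with an inverted index, which maps each term to the documents it occurs in. The sketch below is one simple way to do this in Python; the three toy documents and the function names are assumptions made for the example, and a multi-word query is treated as "all words must occur".

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Toy collection, invented for this example.
docs = {
    1: "Bertelsmann posted a slight decline in operating profit",
    2: "The trade union met the executive committee",
    3: "Operating profit of the media group rose",
}
index = build_index(docs)
print(sorted(search(index, "operating profit")))
print(sorted(search(index, "trade union")))
```

Indexing single words treats "trade union" as two unrelated tokens; recognising it as one term is the job of Automatic Term Recognition.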

Automatic Term Recognition

supervised/unsupervised task

Methods: rule-based, statistics-based, machine learning, hybrid

Objective: detect words or phrases denoting specialised concepts, i.e. terms

ATR: example

C-value    Candidate term [variants]

338.13958  trade union [trade union, Trades Union,…]
213.127    ernst papanek [Ernst Papanek]
200.55471  new york [New York]
143.48147  press clipping [Press clippings, press -clippings,…]
139.07053  world war [world war, world wars, World Wars,…]
134.47055  print material [printed materials, Printed material,…]
131.19386  executive committee [executive committee, …]
124.91502  communist party [Communist party,…]
94.48066   second world war [Second World War, …]
91.18482   spanish civil war [Spanish Civil War, …]
90.80228   great britain [Great Britain, Great -Britain]
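The slide shows C-value scores without the formula behind them. As published by Frantzi, Ananiadou and Mima, the C-value of a candidate term a is log2|a| · f(a) when a is not nested in a longer candidate, and log2|a| · (f(a) − the mean frequency of the longer candidates containing a) otherwise, where |a| is the length in words and f the frequency. The Python sketch below is a simplified reading of that formula, not necessarily the variant that produced the scores above, and the frequencies are invented for illustration.

```python
import math

def contains(longer, shorter):
    """True if the token tuple `shorter` occurs contiguously inside the strictly longer tuple."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_value(freq):
    """Score multi-word candidate terms; `freq` maps token tuples to frequencies."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates in which this candidate is nested.
        longer = [f_b for other, f_b in freq.items() if contains(other, term)]
        if longer:
            f = f - sum(longer) / len(longer)
        # log2 of the term length: single-word candidates therefore score 0.
        scores[term] = math.log2(len(term)) * f
    return scores

# Illustrative frequencies only, not taken from the corpus behind the slide.
freq = {
    ("world", "war"): 10,
    ("second", "world", "war"): 4,
    ("first", "world", "war"): 2,
    ("trade", "union"): 7,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:8.3f}  {' '.join(term)}")
```

The nesting penalty is why a frequent two-word string that mostly occurs inside longer terms can rank below those longer terms.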

Document clustering

unsupervised task: “clusters”, group categories unknown

machine learning and statistics-based approaches

Objective: group documents based on their content / semantic similarities
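To make the idea of grouping by content similarity concrete, here is a small Python sketch. It uses bag-of-words vectors, cosine similarity and a single-pass "leader" clustering, chosen here only because it fits in a few lines; the slides do not prescribe a particular algorithm, and the threshold and toy documents are assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def leader_clustering(texts, threshold=0.3):
    """Put each document in the first cluster whose leader is similar enough,
    otherwise start a new cluster with the document as its leader."""
    leaders, clusters = [], []
    for i, text in enumerate(texts):
        vec = vectorize(text)
        for k, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[k].append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters

# Toy documents, invented for this example.
texts = [
    "operating profit declined at the media group",
    "the trade union called a strike",
    "media group reports operating profit",
    "strike organised by the trade union",
]
print(leader_clustering(texts))
```

No category labels are given anywhere in the input: the two groups emerge from word overlap alone, which is what makes the task unsupervised.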

Document classification

supervised task: we know the classes/categories

use of machine learning, or statistics-based methods

Objective: classify documents based on their content / semantics
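As an illustration of a supervised, statistics-based classifier, the sketch below implements a multinomial Naive Bayes model with add-one smoothing in plain Python. Naive Bayes is picked here as a familiar example, not because the slides name it; the training sentences and the class labels `finance` and `labour` are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for tok in tokens(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data with known classes, invented for this example.
training = [
    ("operating profit declined", "finance"),
    ("sales and earnings rose", "finance"),
    ("the union called a strike", "labour"),
    ("workers joined the trade union", "labour"),
]
model = train(training)
print(classify("profit and sales rose", model))
print(classify("trade union strike", model))
```

Unlike clustering, the set of categories is fixed in advance by the labelled examples.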

[Figure: overview diagram of textual information processing technologies (Doc Collection, IR, Relevant Docs, IE, Relevant Info, Summarisation, Indexing, ATR, Terminology, Data Bases, Data Mining, Reasoning, Derived Info). Legend: Process, Resource.]

Summarisation or Abstracting

Bertelsmann 9-mth profit slips on start-up losses

FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.

Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.

Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.

Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Information Extraction

supervised, or unsupervised/generic task

Methods: rule-based, machine learning

Objective: detect specific types of info in documents, e.g. names, events, relations

IE tasks

Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times

Co-reference (CO): recognise mentions of the same entity

Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest

IE Tasks

Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros.

Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.

Entity types: ORGANISATION, PERCENT, DATE, AMOUNT

ORGANISATION=“Bertelsmann” DATE=“2011-11-10”
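A rule-based recogniser for some of these entity types can be sketched with regular expressions and a small gazetteer. The Python example below is only an illustration: the patterns and the three-name organisation list are assumptions made for this text, and DATE is left out because resolving "Thursday" to 2011-11-10 needs the article's dateline rather than a local pattern.

```python
import re

# Tiny gazetteer of organisation names; an assumption made for this example.
ORGANISATIONS = ["Bertelsmann", "RTL Group", "Random House"]

# One hand-written pattern per entity type.
PATTERNS = [
    ("ORGANISATION", re.compile("|".join(re.escape(n) for n in ORGANISATIONS))),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)? percent")),
    ("AMOUNT", re.compile(r"\$?\d+(?:\.\d+)? (?:million|billion)(?: euros)?")),
]

def recognise_entities(text):
    """Return (start offset, label, matched text) triples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return sorted(found)

text = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
        "and sales were up 17.3 percent at 4.5 billion euros.")

entities = [(label, span) for _, label, span in recognise_entities(text)]
for label, span in entities:
    print(f"{label}: {span}")
```

Patterns like these are brittle outside the text they were written for, which is one reason machine learning methods are listed alongside rule-based ones.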

IE Tasks

Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros.

Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.

SALES_of

Event_type: sales

Organisation_type: Company

Organisation_name: Bertelsmann

Sector: media

Sales_mode: increase

Sales_amount: 4.500.000.000

Currency: euros

Period: ??

Date: ??
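Filling a scenario template means mapping a text span onto named slots. The Python sketch below fills a few slots of the sales template from a single hand-written pattern; the `SalesEvent` class, the pattern and the helper name are hypothetical, and slots the sentence does not state (period, date) are left empty, as on the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SalesEvent:
    organisation_name: Optional[str] = None
    sales_mode: Optional[str] = None
    sales_amount: Optional[float] = None
    currency: Optional[str] = None
    period: Optional[str] = None  # not stated in the sentence, stays empty
    date: Optional[str] = None    # not stated in the sentence, stays empty

MULTIPLIER = {"million": 1e6, "billion": 1e9}

# Matches e.g. "sales were up 17.3 percent at 4.5 billion euros".
SALES = re.compile(
    r"sales were (up|down) [\d.]+ percent at ([\d.]+) (million|billion) (\w+)"
)

def fill_sales_template(text, organisation):
    """Fill a SalesEvent from one pattern; return None if the pattern does not match."""
    match = SALES.search(text)
    if match is None:
        return None
    direction, number, scale, currency = match.groups()
    return SalesEvent(
        organisation_name=organisation,
        sales_mode="increase" if direction == "up" else "decrease",
        sales_amount=float(number) * MULTIPLIER[scale],
        currency=currency,
    )

text = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
        "and sales were up 17.3 percent at 4.5 billion euros.")
event = fill_sales_template(text, organisation="Bertelsmann")
print(event)
```

The organisation is passed in by hand here; in a full pipeline it would come from named entity recognition and co-reference resolution.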

Sentiment analysis/Opinion mining

Polarity classification (positive/negative)

Objectivity/Subjectivity detection
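One simple way to approach polarity classification is to count words from a polarity lexicon. The Python sketch below does exactly that; the two word lists are toy assumptions, and a score of zero is reported as "neutral", which only loosely corresponds to detecting objective text.

```python
import re

# Toy polarity lexicon; the word lists are assumptions made for this example.
POSITIVE = {"good", "great", "excellent", "rose", "gain", "strong"}
NEGATIVE = {"bad", "poor", "terrible", "decline", "losses", "weak"}

def polarity(text):
    """Return 'positive', 'negative' or 'neutral' from lexicon word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("Strong sales and a great quarter"))
print(polarity("Profit slips on start-up losses"))
print(polarity("Bertelsmann owns Random House"))
```

Word counting ignores negation and context ("not good"), so it illustrates the task rather than solving it.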

[Figure: overview diagram of textual information processing technologies (Doc Collection, IR, Relevant Docs, IE, Relevant Info, Summarisation, Indexing, ATR, Terminology, Data Bases, Data Mining, Reasoning, Derived Info). Legend: Process, Resource.]

Structured Data: e.g. Databases

Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …

Structured Data: Ontologies

Structure of concepts:

  Entities (concepts, objects)

  Properties (concept properties)

  Relations (links between concepts)

  Domain specific relations, e.g., “has_capital”

Objective: describe domain knowledge and reason about concepts & relations

Einstein's riddle

we have five houses in a row

each house is painted a different colour

each house has a single inhabitant

each inhabitant:

  is of a different nationality

  drinks a different beverage

  owns a different pet

  smokes a different brand of cigarettes

Source: http://en.wikipedia.org/wiki/Zebra_puzzle

Einstein's riddle

1. There are five houses.

2. The Englishman lives in the red house.

3. The Spaniard owns the dog.

4. Coffee is drunk in the green house.

5. The Ukrainian drinks tea.

Einstein's riddle

6. The green house is immediately to the right of the ivory house.

7. The Old Gold smoker owns snails.

8. Kools are smoked in the yellow house.

9. Milk is drunk in the middle house.

10. The Norwegian lives in the first house.

Einstein's riddle

11. The man who smokes Chesterfields lives in the house next to the man with the fox.

12. Kools are smoked in a house next to the house where the horse is kept.

13. The Lucky Strike smoker drinks orange juice.

14. The Japanese smokes Parliaments.

15. The Norwegian lives next to the blue house.

Einstein's riddle

Who drinks water?

Who owns a zebra?
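The riddle can be answered mechanically by treating each clue as a constraint on house positions. The Python sketch below is a brute-force search with early pruning, written for this text rather than taken from the slides; each variable holds the house number (1 to 5) of the thing it names, and the clue numbers in the comments refer to the list above.

```python
from itertools import permutations

def right_of(a, b):
    """House a is immediately to the right of house b."""
    return a - b == 1

def next_to(a, b):
    """Houses a and b are neighbours."""
    return abs(a - b) == 1

def solve():
    houses = (1, 2, 3, 4, 5)                                    # clue 1
    first, middle = 1, 3
    orderings = list(permutations(houses))
    for red, green, ivory, yellow, blue in orderings:
        if not right_of(green, ivory):                          # clue 6
            continue
        for english, spaniard, ukrainian, japanese, norwegian in orderings:
            if (english != red or norwegian != first            # clues 2, 10
                    or not next_to(norwegian, blue)):           # clue 15
                continue
            for coffee, tea, milk, juice, water in orderings:
                if (coffee != green or ukrainian != tea         # clues 4, 5
                        or milk != middle):                     # clue 9
                    continue
                for old_gold, kools, chesterfield, lucky, parliament in orderings:
                    if (kools != yellow or lucky != juice       # clues 8, 13
                            or japanese != parliament):         # clue 14
                        continue
                    for dog, snails, fox, horse, zebra in orderings:
                        if (spaniard == dog and old_gold == snails   # clues 3, 7
                                and next_to(chesterfield, fox)       # clue 11
                                and next_to(kools, horse)):          # clue 12
                            who = {english: "Englishman", spaniard: "Spaniard",
                                   ukrainian: "Ukrainian", japanese: "Japanese",
                                   norwegian: "Norwegian"}
                            return who[water], who[zebra]
    return None

water_drinker, zebra_owner = solve()
print(water_drinker, "drinks water;", zebra_owner, "owns the zebra")
```

The search works only because the clues have been turned into explicit relations between entities, which is the kind of structure an ontology makes available for reasoning.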

Ontology: hierarchical structure

Thing/Root
  House: House-1, House-2, House-3, House-4, House-5
  Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
  Colour: Red, Green, Blue, Ivory, Yellow
  Pet: Dog, Horse, Snails, Fox, Zebra
  Beverage

Ontology

“is-a” or taxonomic relationships

Denote the “kind” of a concept

But ontologies: more than taxonomic relationships!

Thing/Root
  House: House-1, House-2, ...
  Inhabitant: Englishman, Spaniard, ...
  Colour: Red, Green, ...
  Pet: Dog, Horse, ...
  Brand
  Beverage

Ontology: properties

Thing/Root
  House, Inhabitant, Colour, Pet, Brand, Beverage

Properties of House (e.g. House-1):

  Has_colour: [Colour]  (Colour > Is_ColourOf: [House])

  Has_inhabitant: [Inhabitant]  (Inhabitant > LivesIn: [House])

  Is_rightTo: [House]

Ontology: properties

Thing/Root
  House, Inhabitant, Colour, Pet, Brand, Beverage

Properties of Inhabitant (e.g. Spaniard):

  LivesIn: [House]  (House > Has_inhabitant: [Inhabitant])

  Has_pet: [Pet]  (Pet > Has_owner: [Inhabitant])

  Drinks: [Beverage]  (Beverage > Drunk_by: [Inhabitant])

  Uses_brand: [Brand]  (Brand > Used_by: [Inhabitant])
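The property slides pair each property with an inverse, e.g. Has_pet on Inhabitant with Has_owner on Pet. A minimal Python sketch of how such pairs support simple reasoning follows; the `Ontology` class and its method names are hypothetical, the root is called "Thing" for short, and real ontology languages offer far more than this.

```python
from collections import defaultdict

# Property name -> inverse property name, as paired on the slides.
INVERSE = {
    "Has_colour": "Is_ColourOf",
    "Has_inhabitant": "LivesIn",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
}
# Make the mapping symmetric so either direction can be asserted.
INVERSE.update({inv: prop for prop, inv in list(INVERSE.items())})

class Ontology:
    def __init__(self):
        self.is_a = {}                   # concept -> parent concept
        self.facts = defaultdict(set)    # (subject, property) -> set of objects

    def add_is_a(self, child, parent):
        """Record a taxonomic ("is-a") link."""
        self.is_a[child] = parent

    def assert_fact(self, subject, prop, obj):
        """Store a fact and, when the property has an inverse, the inverse fact too."""
        self.facts[(subject, prop)].add(obj)
        if prop in INVERSE:
            self.facts[(obj, INVERSE[prop])].add(subject)

    def ancestors(self, node):
        """Follow is-a links up to the root."""
        chain = []
        while node in self.is_a:
            node = self.is_a[node]
            chain.append(node)
        return chain

onto = Ontology()
for cls in ("House", "Inhabitant", "Colour", "Pet", "Brand", "Beverage"):
    onto.add_is_a(cls, "Thing")
onto.add_is_a("Spaniard", "Inhabitant")
onto.add_is_a("Dog", "Pet")
onto.assert_fact("Spaniard", "Has_pet", "Dog")   # clue 3 of the riddle

print(onto.facts[("Dog", "Has_owner")])
print(onto.ancestors("Spaniard"))
```

Asserting one fact makes its inverse available for free, a small instance of reasoning about concepts and relations.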