
08.01.2009

Computer-gestützte Interaktion. Lecture: Information Retrieval 2.

Florian Metze, Fachbereich Usability (Usability Department), WS 2008/2009


Schedule: Thursdays 10:15 – 11:45; TEL20, Auditorium

No.  Date        Topic (time slot / remark)
--   16.10.2008  Introduction (Q&U Lab)
1    23.10.2008  Statistics
2    30.10.2008  Classification
3    06.11.2008  Fundamentals and ASR
4    13.11.2008  ASR applications and systems
5    20.11.2008  Future ASR
6    27.11.2008  Fundamentals and rule-based translation
7    04.12.2008  Statistical translation (10:15-11:45)
8    04.12.2008  Speech translation systems (12:15-13:45)
9    11.12.2008  (Spoken) dialogue systems (10:15-11:45)
10   11.12.2008  Multimodal interfaces (12:15-13:45)
11   18.12.2008  Fusion/fission: audio, video, keyboard, touch, … (10:15-11:45)
12   18.12.2008  Applications & review (12:15-13:45)
13   08.01.2009  Information retrieval, document search (10:15-11:45)
14   08.01.2009  Information Retrieval 2, expert search (12:15-13:45)


Human Computer Interfaces: Example Information Retrieval.

Introduction

Conceptual model

Relationship between IR, HCI, and HCC

Latent Semantic Indexing

The ESP Game

Assessing the retrieval

Future Directions


HCI: Information Retrieval Model.

Content-Centered Retrieval as Matching

Matching document representations to query representations

A powerful paradigm that has driven IR R&D for half a century.

Evaluation metric is the effectiveness of the match (e.g., recall and precision); a minimal worked example follows.
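To make these match-effectiveness metrics concrete, here is a minimal Python sketch (not from the slides; the document ids and relevance judgements are hypothetical) that computes precision and recall for a single query:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one result list against relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgements: 3 of 4 retrieved documents are relevant (P = 0.75),
# but only 3 of 6 relevant documents were found (R = 0.5).
print(precision_recall({"d1", "d2", "d3", "d7"},
                       {"d1", "d2", "d3", "d4", "d5", "d6"}))
```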


HCI-IR: Content Trend.

Content Features (queries too)

– Not only text

– Statistics, images, music, code, streams, bio-chemical

– Multimedia, multilingual

Dynamic

– Temporal (e.g., blogs, wikis, sensor streams)

– Conditional (e.g., computed links, recommendations)

Content Relationships

– Hyperlinks, new metadata, aggregations

– Digital libraries, personal collections

Content acquires history → context retrieval


HCI-IR: Responses to Content Trend.

Link analysis

Multiple sources of evidence (fusion)

– Authors’ words (e.g., full text IR)

– Indexer/ abstractor words (e.g., OPACs)

– Authors’ citations/links (e.g., Google)

– Readers’ search paths (e.g., recommenders, opinion miners: „collaborative filtering“)

– Machine generated features and relationships („mining“)

Three key challenges:

– How do we generate references?

– What new relationships can we leverage (human and machine)?

– How can we integrate multiple sources of evidence?


HCI-IR: User Trend.

Technical advances and technical literacy allow us to leverage information-seeker intelligence

– Rather than sole dependence on matching algorithms, focus on the flow of representations and actions in situ as people think with these new tools and information resources

– To leverage human intelligence and effort, people must assume responsibilities: beyond the two-word, single query

Web and TV remotes have legitimized browsing as human-controlled information seeking

Aim at understanding rather than retrieval

Responses to User Trend:

Adapt techniques to WWW

– Relevance feedback

– Query expansion

– User modeling/profiles, SDI services

Recommender systems: explicit and implicit models

Capture everything (e.g., Lifebits)

User Interfaces: dynamic queries, agile views, tuning of IR systems


HCI: HCC Model of HCI.

A user-oriented model that has driven R&D. Evaluation based on user time, accuracy, and satisfaction.


HCI: WWW Trends.

First decade of WWW as great equalizer (we all get impoverished, but we admit MANY more people)

Universal access

Platform independence (lots of devices)

Enhanced browsers, specialized browsers

Interface Servers

Social awareness (user is not alone)


HCI-IR: An Expanded Model.

Think of IR from the perspective of an active human with

information needs,

information skills,

powerful IR resources (that include other humans), and

situated in global and local connected communities,

all of which evolve over time.

Get people closer to the information they need

– Closer to the backend

– Closer to the meaning

Involve information professionals as integral to the IR system

– Increase responsibility as well as control

– Leverage more demanding and knowledgeable installed base

– Consider ubiquity, digital libraries, e-commerce as extended memories and tools (personal and shared)


HCI-IR: Key Challenges.

Linking conceptual interface to system backend

– Metadata generation

– Alternative representations and control mechanisms

Raising user literacy and involvement

– Engaging without insulting or annoying

Adding human intelligence to the system

Moving beyond retrieval to understanding

– Context


HCI Example 1: WordNet.

WordNet® is a large lexical database of English.

Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Synsets are interlinked by means of conceptual-semantic and lexical relations.

The resulting network of meaningfully related words and concepts can be navigated with the browser.

WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

WordNet relations can be expressed in OWL, RDFS, or other ontology markup languages. A small usage sketch follows.
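As a hedged illustration (not part of the original slides), the WordNet database can be browsed programmatically, for example through NLTK's WordNet interface; the example words below are arbitrary:

```python
# Assumes NLTK is installed and the WordNet corpus has been downloaded
# (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank"):            # polysemy: one word, many synsets
    print(synset.name(), "-", synset.definition())

car = wn.synset("car.n.01")
print(car.lemma_names())                     # synonymy within one synset
print(car.hypernyms())                       # conceptual-semantic (is-a) relation
```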


HCI Example 2: The ESP Game.

How to label images? On the web?

Clever way to automate meta-data generation

Image annotation/ recognition very difficult

„Labeling the Web“ using „Human Computation“: a two-player game on the web

– Players get points for generating keywords describing a picture, if the other player agrees (a toy sketch of this rule appears below)

– Taboo words exist, too

– Accuracy assured by over-sampling

– Social aspect („become top labeler“) and fun as motivation

Funded by NSA, conceived by Luis von Ahn at CMU.

Now sold to Google.
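Purely as an illustration (this is not von Ahn's implementation; the labels and taboo list are made up), the core agreement rule can be sketched in a few lines:

```python
def score_round(labels_a, labels_b, taboo):
    """Toy agreement rule: a label counts only if both players typed it
    and it is not on the image's taboo list (illustrative only)."""
    return (set(labels_a) & set(labels_b)) - set(taboo)

# Both players typed "dog"; "puppy" is taboo, so only "dog" becomes a label.
print(score_round(["dog", "grass", "puppy"], ["dog", "puppy"], ["puppy"]))
```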


HCI Example 3: Latent Semantic Indexing (LSA).

How LSA works: LSA uses a term-document matrix which describes the occurrences of terms in documents. It is a sparse matrix whose rows correspond to terms (typically stemmed words) and whose columns correspond to documents; the matrix elements are tf-idf weights. LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents.

– Thus the terms and documents are now indirectly related through the concepts.

– LSA finds a low-rank approximation to the term-document matrix.

The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:

{(car), (truck), (flower)} →{(1.3452 * car + 0.2828 * truck), (flower)}

Principal Component Analysis (PCA) in term space

The new concept space can typically be used to:

– Compare the documents in the concept space (data clustering, document classification)

– Find similar documents across languages, after analyzing a base set of translated documents (cross-language retrieval)

– Find relations between terms (synonymy and polysemy)

– Given a query of terms, translate it into the concept space and find matching documents (information retrieval); a minimal code sketch follows below

Synonymy and polysemy are fundamental problems in natural language processing:

Synonymy is the phenomenon where different words describe the same idea. Polysemy is the phenomenon where the same word has multiple meanings.
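As a hedged illustration of the low-rank approximation described above (not from the original slides; the toy corpus and k = 2 are arbitrary), LSA can be sketched with scikit-learn's truncated SVD. Note that scikit-learn arranges the matrix as documents x terms, i.e. the transpose of the term-document matrix on this slide:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the car drives on the road",
        "the truck drives on the highway",
        "a flower grows in the garden"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # documents x terms, tf-idf weights

svd = TruncatedSVD(n_components=2)        # keep k = 2 latent concepts
doc_concepts = svd.fit_transform(X)       # documents represented in concept space

# Each concept is a weighted combination of terms, so e.g. car and truck can
# load on the same dimension while flower loads on another.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = np.argsort(component)[::-1][:3]
    print(f"concept {i}:", [(terms[j], round(float(component[j]), 3)) for j in top])
```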


HCI and Computer Aided Interaction.

“Automatic” classification works best when its application is supported by humans with knowledge of the domain and the techniques at hand.

(Gary Marchionini)

Computers should learn!

The „Relation Browser“ tool for metadata mining:


HCI: The Relation Browser.

A general purpose dynamic query interface for databases with a small number of facets (~10) and a small number of categories in each facet (~10).

Easy to look ahead (overviews and previews)

Couples interactive partitioning/exploration with string query

Semi-automatic category generation and web-page classification

Mousing over „Coal“ reveals the distribution of „coal“-related web pages in the other categories


HCI: The Relation Browser.

1) Acquire data:

Crawl sites/ Internet

– Formats?

– Mirror locally?

Clean data

– Remove non-alphabeticals

– Lower-case all

– Validate words against WordNet

– Stem or not stem

2) Build Representation:

Select data to include

– Pages to include/exclude

– ASCII text from:

Titles

Link anchors

Metadata tags

Build raw term-document matrix

– Pages as rows (observations)

– Terms as columns (variables)

– Frequencies or TF-IDF weights in cells (a sketch of steps 1-2 follows below)
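A hedged sketch of steps 1 and 2 (illustrative only; the page contents, the cleaning regex, and the use of scikit-learn are assumptions, not the course's actual tooling):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(text):
    """Step 1: remove non-alphabetical characters and lower-case everything."""
    return re.sub(r"[^a-zA-Z\s]", " ", text).lower()

# Hypothetical pages kept after the crawl (include/exclude decision).
pages = {"energy/coal.html":   "Coal mining statistics for 2008!",
         "energy/wind.html":   "Wind & solar energy subsidies expanded.",
         "about/contact.html": "Contact the webmaster here."}

cleaned = [clean(text) for text in pages.values()]

# Step 2: raw term-document matrix with pages as rows and terms as columns,
# frequencies in the cells (swap in TfidfVectorizer for TF-IDF weights).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)
print(X.shape, vectorizer.get_feature_names_out())
```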


HCI: The Relation Browser.

3) Filter data:

Stop word lists

– General terms

– Domain-specific terms

– Web and navigation terms

– Iteratively developed/refined

Term discrimination filters (various)

– 0.01-0.1 document-frequency interval

– Interval augmented by the 100 top-frequency terms

– Empirical threshold (e.g., > 5 docs)

4) Project data onto a lower-dimensional space

First N principal components

– 50-100 latent semantic dimensions

– 50-100 independent components

Reduces to ‘narrower’ term-doc matrix

– Still somewhat experimental (a sketch of steps 3-4 follows below)
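A hedged sketch of steps 3 and 4 (illustrative only; the toy corpus, the loosened document-frequency bounds, and k = 2 are assumptions chosen so the snippet runs, whereas the slide suggests a 0.01-0.1 interval and 50-100 dimensions on real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

pages = ["coal mining output rose in west virginia",
         "solar and wind energy subsidies expanded",
         "new coal plant permits were denied",
         "wind turbine installation statistics published"]

# Step 3: general stop-word list plus a document-frequency filter
# (on a real crawl the 0.01-0.1 interval would be min_df=0.01, max_df=0.1).
vectorizer = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.75)
X = vectorizer.fit_transform(pages)            # pages x filtered terms

# Step 4: project onto the first k principal/latent components.
svd = TruncatedSVD(n_components=2)             # 50-100 dimensions on real data
X_reduced = svd.fit_transform(X)               # 'narrower' page-by-concept matrix
print(X.shape, "->", X_reduced.shape)
```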


HCI: The Relation Browser.

5) Cluster documents

K-means, e.g., with k<<100

EM yields a probability distribution for each document over the clusters (so a document has some probability of belonging to each cluster)

6) Evaluate clusters and name topics

Create usable output

– A web page with the clusters and number of documents in each

– For each cluster, a list of the top 10 most frequently occurring terms; a list of the top 10 log-odds-ratio terms; and links to all the pages in that cluster

– Eyeball the terms and pick a cluster (topic) name; otherwise iterate the previous steps (a clustering sketch follows below)
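A hedged sketch of steps 5 and 6 (illustrative only; the toy pages, k = 2, and the choice of a Gaussian mixture as the EM model are assumptions): cluster the reduced page vectors, then list the most frequent terms per cluster as candidate topic names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

pages = ["coal mining output rose sharply",
         "coal plant permits were denied",
         "wind turbine installation grew",
         "solar and wind subsidies expanded"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(pages)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# Step 5: EM-trained mixture gives each page a probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_reduced)
probs = gmm.predict_proba(X_reduced)
hard = probs.argmax(axis=1)                    # provisional hard assignment

# Step 6: most frequent terms per cluster as candidate topic names.
terms = vectorizer.get_feature_names_out()
for c in range(2):
    weights = np.asarray(X[hard == c].sum(axis=0)).ravel()
    top = np.argsort(weights)[::-1][:5]
    print(f"cluster {c} ({(hard == c).sum()} pages):", [terms[i] for i in top])
```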


HCI: The Relation Browser.

7) Assign pages to topics

For every page, compute the probability distribution (using the EM model) over each cluster/topic

Select a threshold for placing pages into topics (most pages end up in only one topic); see the thresholding sketch at the end of this slide

8) Create other facets (views) and display

Use a set of heuristic rules to place pages into geographic categories

Use a set of heuristic rules to place pages into temporal categories (ad hoc at present)

Map the files onto the RB relational scheme
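A hedged sketch of step 7 (illustrative only; the probability values and the 0.35 threshold are made up, not values from the course): each page is assigned to every topic whose EM-model probability exceeds the threshold.

```python
import numpy as np

# Hypothetical page-by-topic probabilities from the EM model.
probs = np.array([[0.92, 0.05, 0.03],
                  [0.40, 0.45, 0.15],
                  [0.10, 0.10, 0.80]])
threshold = 0.35                       # assumed cut-off

assignments = [np.flatnonzero(p >= threshold).tolist() for p in probs]
print(assignments)                     # [[0], [0, 1], [2]]: most pages get one topic
```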


HCI: Interaction Principles and Caveats (Incomplete).

Principles

– Look ahead without penalty

– Minimize scrolling and clicking

– Alternative ways to slice and dice

– Closely couple search, browse, and examine

– Continuous engagement: useful attractors

– Treasures to surface

Caveats

– Scalability (getting metadata to client side)

– Metadata crucial: e.g. working on automatically creating partitions

– Increasing expectations about useful results (answers!)


HCI: Long-term IR paradigm.

Information interaction as core life cycle process:

Examples represent early ways to get the information seeker more involved in the information-seeking process – there is plenty more to do.

Like eating, we have varying expectations, invest different levels of effort, and use diverse and ubiquitous infrastructures.

Key challenge is to span boundaries between cyberinfrastructure and the „real“ world.

Coda:

Our hopes that we can create systems (solutions) that ‘do’ IR for us are unreasonable.

Our expectations that people can find and understand information without thinking and investing effort are unreasonable.

Aim to develop ‘systems’ that involve people and machines continuously learning and changing together.

– Google would not work as well next month if there were not a large group of employees tuning the system, adding new spam filters, and crawlers checking out pages and links continuously.

08.01.2009

Backup.

Recommended