08.01.2009
Computer-Aided Interaction. Lecture: Information Retrieval 2.
Florian Metze, Usability Department, Winter Semester 2008/2009
VL CGI FMe 13 - IR2.ppt · X 1
Computer-Aided Interaction. Schedule: Thursdays 10:15 – 11:45; TEL20, Auditorium
Date        Remark        Topic
16.10.2008  Introduction  Q&U Lab
23.10.2008  1             Statistics
30.10.2008  2             Classification
06.11.2008  3             Fundamentals and ASR
13.11.2008  4             ASR applications and systems
20.11.2008  5             Future ASR
27.11.2008  6             Fundamentals and rule-based translation
04.12.2008  7             Statistical translation (10:15-11:45)
04.12.2008  8             Speech translation systems (12:15-13:45)
11.12.2008  9             (Spoken) dialog systems (10:15-11:45)
11.12.2008  10            Multimodal interfaces (12:15-13:45)
18.12.2008  11            Fusion/Fission: Audio, Video, Keyboard, Touch, … (10:15-11:45)
18.12.2008  12            Applications & review (12:15-13:45)
08.01.2009  13            Information Retrieval, document search (10:15-11:45)
08.01.2009  14            Information Retrieval 2, expert search (12:15-13:45)
Human Computer Interfaces: Example Information Retrieval.
Introduction
Conceptual model
Relationship of IR and HCI and HCC
Latent Semantic Indexing
The ESP Game
Assessing the retrieval
Future Directions
HCI: Information Retrieval Model.
Content-centered retrieval as matching document representations to query representations
A powerful paradigm that has driven IR R&D for half a century.
The evaluation metric is the effectiveness of the match (e.g., recall and precision).
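As a minimal sketch of the two metrics named above: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids are illustrative assumptions, not part of the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result list of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids for one query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so precision is 2/4 and recall is 2/3.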
HCI-IR: Content Trend.
Content Features (queries too)
– Not only text
– Statistics, images, music, code, streams, bio-chemical
– Multimedia, multilingual
Dynamic
– Temporal (e.g., blogs, wikis, sensor streams)
– Conditional (e.g., computed links, recommendations)
Content Relationships
– Hyperlinks, new metadata, aggregations
– Digital libraries, personal collections
Content acquires history → context retrieval
HCI-IR: Responses to Content Trend.
Link analysis
Multiple sources of evidence (fusion)
– Authors’ words (e.g., full text IR)
– Indexer/ abstractor words (e.g., OPACs)
– Authors’ citations/links (e.g., Google)
– Readers’ search paths (e.g., recommenders, opinion miners: „collaborative filtering“)
– Machine generated features and relationships („mining“)
Three key challenges:
– How do we generate references?
– What new relationships can we leverage (human and machine)?
– How can we integrate multiple sources of evidence?
HCI-IR: User Trend.
Technical advances and technical literacy allow us to leverage the intelligence of information seekers
– Rather than depending solely on matching algorithms, focus on the flow of representations and actions in situ as people think with these new tools and information resources
– To leverage human intelligence and effort, people must assume responsibilities beyond the two-word, single query
Web and TV remotes have legitimized browsing as human-controlled information seeking
Aim at understanding rather than retrieval
Responses to User Trend:
Adapt techniques to WWW
– Relevance feedback
– Query expansion
– User modeling/profiles, SDI services
Recommender systems: explicit and implicit models
Capture everything (e.g., MyLifeBits)
User Interfaces: dynamic queries, agile views, tuning of IR systems
HCI: HCC Model of HCI.
A user-oriented model that has driven R&D. Evaluation based on user time, accuracy, and satisfaction.
HCI: WWW Trends.
First decade of WWW as great equalizer (we all get impoverished, but we admit MANY more people)
Universal access
Platform independence (lots of devices)
Enhanced browsers, specialized browsers
Interface Servers
Social awareness (user is not alone)
HCI-IR: An Expanded Model.
Think of IR from the perspective of an active human with
information needs,
information skills,
powerful IR resources (that include other humans), and
situated in global and local connected communities,
all of which evolve over time.
Get people closer to the information they need
– Closer to the backend
– Closer to the meaning
Involve information professionals as integral to the IR system
– Increase responsibility as well as control
– Leverage more demanding and knowledgeable installed base
– Consider ubiquity, digital libraries, e-commerce as extended memories and tools (personal and shared)
HCI-IR: Key Challenges.
Linking conceptual interface to system backend
– Metadata generation
– Alternative representations and control mechanisms
Raising user literacy and involvement
– Engaging without insulting or annoying
Adding human intelligence to the system
Moving beyond retrieval to understanding
– Context
HCI Example 1: WordNet.
WordNet® is a large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and lexical relations.
The resulting network of meaningfully related words and concepts can be navigated with the browser.
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
WordNet relations can be expressed in OWL, RDFS, or other ontology markup languages.
HCI Example 2: The ESP Game.
How to label images? On the web?
Clever way to automate meta-data generation
Image annotation/ recognition very difficult
„Labeling the Web“ using „Human Computation“
Two-player game on the web
– Players get points for generating keywords describing a picture, if the other player agrees
– Taboo words exist, too
– Accuracy assured by over-sampling
– Social aspect („become top labeler“) and fun as motivation
Funded by NSF, conceived by Luis von Ahn at CMU.
Now sold to Google.
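The agreement rule on this slide can be sketched as follows. This is a simplified illustration, not the actual game logic: the real game ends a round at the first match, whereas this sketch collects every word both players typed. The function names, the `min_pairs` parameter, and the sample guesses are assumptions made for the example.

```python
from collections import Counter


def agreed_labels(guesses_a, guesses_b, taboo):
    """Labels typed by both players for one image, excluding taboo words."""
    def normalize(words):
        return {w.strip().lower() for w in words}

    return (normalize(guesses_a) & normalize(guesses_b)) - normalize(taboo)


def confirmed_labels(rounds, taboo, min_pairs=2):
    """Over-sampling: keep a label only if at least `min_pairs` player pairs agreed on it."""
    counts = Counter()
    for guesses_a, guesses_b in rounds:
        counts.update(agreed_labels(guesses_a, guesses_b, taboo))
    return {label for label, n in counts.items() if n >= min_pairs}


# Three hypothetical player pairs labeling the same image.
rounds = [
    (["dog", "grass", "ball"], ["puppy", "ball", "dog"]),
    (["dog", "park"], ["Dog", "tree"]),
    (["ball", "sky"], ["sky", "cloud"]),
]
labels = confirmed_labels(rounds, taboo=["sky"])
print(labels)
```

Only "dog" is agreed on by two independent pairs; "ball" is agreed on once, and "sky" is blocked as a taboo word.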
HCI Example 3: Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA).
How LSA works:
– LSA uses a term-document matrix which describes the occurrences of terms in documents.
– It is a sparse matrix whose rows correspond to terms (typically stemmed words) and whose columns correspond to documents; the matrix elements are tf-idf weights.
– LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents.
– Thus the terms and documents are now indirectly related through the concepts.
– LSA finds a low-rank approximation to the term-document matrix.
The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
{(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}
Principal Component Analysis (PCA) in term space
The new concept space typically can be used to:
– Compare the documents in the concept space (data clustering, document classification).
– Find similar documents across languages, after analyzing a base set of translated documents (cross-language retrieval).
– Find relations between terms (synonymy and polysemy).
– Given a query of terms, translate it into the concept space, and find matching documents (information retrieval).
Synonymy and polysemy are fundamental problems in natural language processing:
Synonymy is the phenomenon where different words describe the same idea. Polysemy is the phenomenon where the same word has multiple meanings.
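The rank lowering described on this slide can be sketched with a truncated singular value decomposition. The sketch assumes NumPy is available, uses raw counts instead of tf-idf weights for brevity, and works on a made-up three-term, four-document matrix; all names are illustrative.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (made-up counts).
terms = ["car", "truck", "flower"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],  # car
    [1.0, 2.0, 0.0, 0.0],  # truck
    [0.0, 0.0, 3.0, 1.0],  # flower
])


def truncated_svd(matrix, k):
    """Keep the k largest singular triplets of the matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


U_k, s_k, Vt_k = truncated_svd(X, k=2)
X_k = U_k @ np.diag(s_k) @ Vt_k  # rank-2 approximation of X
doc_concepts = (U_k.T @ X).T     # one row per document, in concept space


def to_concepts(term_vector):
    """Project a term-space vector (query or document) onto the k concepts."""
    return U_k.T @ np.asarray(term_vector, dtype=float)


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


query = to_concepts([0, 1, 0])  # a query containing only "truck"
scores = [cosine(query, d) for d in doc_concepts]
print(np.round(scores, 3))
```

With two concepts, "car" and "truck" collapse into a single dimension, so the "truck" query matches both vehicle documents about equally well and the flower documents not at all. This is how the concept space addresses synonymy.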
HCI and Computer Aided Interaction.
“Automatic” classification works best when its application is supported by humans with knowledge of the domain and the techniques at hand.
(Gary Marchionini)
Computers should learn!
The „Relation Browser“ tool for metadata mining:
HCI: The Relation Browser.
A general purpose dynamic query interface for databases with a small number of facets (~10) and a small number of categories in each facet (~10).
Easy to look ahead (overviews and previews)
Couples interactive partitioning/exploration with string query
Semi-automatic category generation and webpage classification
Mousing over „Coal“ reveals the distribution of „coal“-related web pages in the other categories
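The mouse-over preview can be thought of as a cross-tabulation: select the pages in one category and count how they spread over the categories of every other facet. The sketch below is not the Relation Browser's implementation; the facet names, categories, and function name are hypothetical.

```python
from collections import Counter


def facet_preview(pages, facet, category):
    """For pages where page[facet] == category, count the categories of all other facets."""
    distribution = {}
    for page in pages:
        if page.get(facet) != category:
            continue
        for other_facet, value in page.items():
            if other_facet != facet:
                distribution.setdefault(other_facet, Counter())[value] += 1
    return distribution


# Hypothetical pages, each already assigned one category per facet.
pages = [
    {"topic": "Coal", "region": "US", "year": "2003"},
    {"topic": "Coal", "region": "World", "year": "2003"},
    {"topic": "Gas", "region": "US", "year": "2004"},
    {"topic": "Coal", "region": "US", "year": "2004"},
]
preview = facet_preview(pages, facet="topic", category="Coal")
print(preview)
```

With roughly ten facets of roughly ten categories each, these counts are small enough to precompute or to recompute on every mouse movement.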
HCI: The Relation Browser.
1) Acquire data:
Crawl sites/Internet
– Formats?
– Mirror locally?
Clean data
– Remove non-alphabetic characters
– Lower-case everything
– Validate words against WordNet
– Stem or not stem
2) Build representation:
Select data to include
– Pages to include/exclude
– ASCII text from titles, link anchors, and metadata tags
Build raw term-document matrix
– Pages as rows (observations)
– Terms as columns (variables)
– Frequencies or TF-IDF weights in cells
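Steps 1 and 2 can be sketched in a few lines: clean each page by lower-casing and keeping alphabetic tokens, then build a matrix with pages as rows and terms as columns. The sketch skips WordNet validation and stemming, and it assumes the common idf variant log(N / df); the sample pages and function names are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(pages):
    """Return (vocabulary, rows): one row per page, one column per term."""
    counts = [Counter(tokenize(page)) for page in pages]
    vocab = sorted(set().union(*counts))
    n_pages = len(pages)
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log(n_pages / doc_freq[t]) for t in vocab}
    rows = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, rows


# Hypothetical page texts (e.g., titles or link anchors).
pages = ["Coal prices rise", "Coal and gas production", "Gas prices fall"]
vocab, rows = tfidf_matrix(pages)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

A term that occurs on every page gets weight 0 under this idf variant, which already removes some navigation-style terms before the explicit filtering step.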
HCI: The Relation Browser.
3) Filter data:
Stop word lists
– General terms
– Domain-specific terms
– Web and navigation terms
– Iteratively developed/refined
Term discrimination filters (various)
– .01-.1 doc frequency interval
– Interval augmented by the 100 top-frequency terms
– Empirical threshold (e.g., > 5 docs)
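The filtering step can be sketched as a stop-word check followed by a document-frequency test. The function name and parameters are assumptions. The example uses a wider interval than the slide's .01-.1 because a four-page toy corpus cannot produce fractions that small.

```python
from collections import Counter


def filter_terms(doc_terms, stop_words, low=0.01, high=0.1, min_docs=None):
    """Keep terms that are not stop words and pass a document-frequency filter.

    If `min_docs` is given, keep terms that occur in more than `min_docs` documents;
    otherwise keep terms whose document-frequency fraction lies in [low, high].
    """
    n_docs = len(doc_terms)
    doc_freq = Counter()
    for terms in doc_terms:
        doc_freq.update(set(terms))  # count each term once per document
    kept = set()
    for term, count in doc_freq.items():
        if term in stop_words:
            continue
        if min_docs is not None:
            passes = count > min_docs
        else:
            passes = low <= count / n_docs <= high
        if passes:
            kept.add(term)
    return kept


# Hypothetical tokenized pages; "home" stands in for a web/navigation term.
docs = [
    ["coal", "energy", "home"],
    ["coal", "gas", "home"],
    ["gas", "energy", "home"],
    ["wind", "energy", "home"],
]
by_interval = filter_terms(docs, stop_words={"home"}, low=0.25, high=0.5)
by_threshold = filter_terms(docs, stop_words={"home"}, min_docs=2)
print(sorted(by_interval), sorted(by_threshold))
```

The interval filter drops "energy" because it appears in three of the four pages, while the empirical threshold keeps only "energy" because it is the only non-stop term found in more than two pages.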
4) Project data onto lower dimensional space
First N principal components
– 50-100 latent semantic dimensions
– 50-100 independent components
Reduces to ‘narrower’ term-doc matrix
– Still kind of experimental
HCI: The Relation Browser.
5) Cluster documents
K-means, e.g., with k<<100
EM yields a probability distribution for each document over the clusters (so a document has some probability of belonging to each cluster)
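The K-means option can be sketched as below, on points that stand for documents in the reduced space. The sketch makes hard assignments only; an EM-fitted mixture model would return a probability per cluster instead. Taking the first k points as initial centroids is an assumption made to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
    return labels, centroids


# Hypothetical 2-D document coordinates after dimensionality reduction.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice the result depends on the initialization, so several restarts with different starting centroids are usually compared.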
6) Evaluate clusters and name topics
Create usable output
– A web page with the clusters and number of documents in each
– For each cluster, a list of the top 10 most frequently occurring terms; a list of the top 10 log-odds ratio terms; and links to all the pages in that cluster
– Eyeball the terms, pick a cluster (topic) name (names); else iterate previous steps
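The two term lists per cluster can be sketched as follows. The log-odds ratio compares how often a term occurs in documents inside the cluster with how often it occurs outside. Adding 0.5 to each cell to avoid division by zero is an assumption, as are the function names and the sample documents.

```python
import math
from collections import Counter


def log_odds_ratio(term, cluster_docs, other_docs, smoothing=0.5):
    """Log-odds of a document containing `term` inside vs. outside the cluster."""
    a = sum(term in doc for doc in cluster_docs)  # in cluster, has term
    b = len(cluster_docs) - a                     # in cluster, lacks term
    c = sum(term in doc for doc in other_docs)    # outside, has term
    d = len(other_docs) - c                       # outside, lacks term
    a, b, c, d = (x + smoothing for x in (a, b, c, d))
    return math.log((a * d) / (b * c))


def describe_cluster(cluster_docs, other_docs, top_n=10):
    """Return (most frequent terms, highest log-odds terms) for one cluster."""
    freq = Counter(t for doc in cluster_docs for t in doc)
    by_freq = [t for t, _ in freq.most_common(top_n)]
    by_odds = sorted(
        freq, key=lambda t: log_odds_ratio(t, cluster_docs, other_docs), reverse=True
    )[:top_n]
    return by_freq, by_odds


# Hypothetical tokenized pages: one cluster and the rest of the collection.
cluster = [["coal", "mine", "energy"], ["coal", "energy", "price"]]
others = [["wind", "energy", "turbine"], ["solar", "energy", "price"]]
by_freq, by_odds = describe_cluster(cluster, others, top_n=2)
print(by_freq, by_odds)
```

The frequency list contains "energy" even though it appears everywhere, while the log-odds list prefers "coal" and "mine", which are distinctive for the cluster. That difference is why both lists help when naming a topic.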
HCI: The Relation Browser.
7) Assign pages to topics
For every page, compute the probability distribution (using the EM model) over the clusters/topics
Select a threshold for placing pages into topics (most pages easily go into only one topic)
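The thresholding step can be sketched as below, assuming each page already has a probability per topic from the clustering model. The threshold value, page names, and probabilities are made up for illustration.

```python
def assign_topics(page_probs, threshold=0.5):
    """Place each page into every topic whose probability reaches the threshold."""
    assignments = {}
    for page, probs in page_probs.items():
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        assignments[page] = [topic for topic, p in ranked if p >= threshold]
    return assignments


# Hypothetical per-page topic probabilities.
page_probs = {
    "a.html": {"coal": 0.90, "gas": 0.10},
    "b.html": {"coal": 0.45, "gas": 0.55},
    "c.html": {"coal": 0.40, "gas": 0.35, "wind": 0.25},
}
assignments = assign_topics(page_probs, threshold=0.4)
print(assignments)
```

A lower threshold lets ambiguous pages such as b.html appear under two topics, while a higher one leaves some pages unassigned, so the value is a trade-off between coverage and clean partitions.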
8) Create other facets (views) and display
Use a set of heuristic rules to place pages into geographic categories
Use a set of heuristic rules to place pages into temporal categories (ad hoc at present)
Map the files onto the RB relational scheme
HCI: Interaction Principles and Caveats (Incomplete).
Principles
– Look ahead without penalty
– Minimize scrolling and clicking
– Alternative ways to slice and dice
– Closely couple search, browse, and examine
– Continuous engagement—useful attractors
– Treasures to surface
Caveats
– Scalability (getting metadata to client side)
– Metadata crucial: e.g. working on automatically creating partitions
– Increasing expectations about useful results (answers!)
HCI: Long-term IR paradigm.
Information interaction as core life cycle process:
Examples represent early ways to get the information seeker more involved in the information-seeking process – there is plenty more to do.
As with eating, we have varying expectations, invest different levels of effort, and use diverse and ubiquitous infrastructures.
Key challenge is to span boundaries between cyberinfrastructure and the „real“ world.
Coda:
Our hopes that we can create systems (solutions) that ‘do’ IR for us are unreasonable.
Our expectations that people can find and understand information without thinking and investing effort are unreasonable.
Aim to develop ‘systems’ that involve people and machines continuously learning and changing together.
– Google would not work as well next month if there were not a large group of employees tuning the system, adding new spam filters, and crawlers checking out pages and links continuously.
08.01.2009
Backup.