CS276A: Text Information Retrieval, Mining, and Exploitation
Lecture 16, 3 Dec 2002
Today’s Topics
Recap: XQuery
Course evaluations
XML indexing and search II
  Systems supporting relevance ranking
  UC Davis system (XQuery extension)
  XIRQL: Relevance ranking / XML search
Summary, discussion
Metadata
Recap: XQuery
Møller and Schwartzbach
Queries Supported by XQuery
Location/position (“chapter no. 3”)
Attribute/value: /play/title contains “hamlet”
Path queries: title contains “hamlet”, /play//title contains “hamlet”
Complex graphs: ingredients occurring in two recipes (subsumes hyperlinks)
What about relevance ranking?
XQuery 1.0 Standard on Order
Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model.
… if a node in document A is before a node in document B, then every node in document A is before every node in document B.
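The depth-first, left-to-right traversal can be sketched as follows. This is a toy model where a node is a (name, children) tuple; the real XQuery Data Model's document order also covers attributes and namespace nodes, and extends across documents as described above.

```python
def document_order(node, order=None):
    """Collect node names in document order: visit a node first, then
    recurse into its children left to right (depth-first traversal).
    A node is modeled as a (name, children) pair -- a toy model,
    not a full XQuery Data Model implementation."""
    if order is None:
        order = []
    name, children = node
    order.append(name)            # the node precedes everything inside it
    for child in children:        # children in left-to-right order
        document_order(child, order)
    return order

# <play><title/><act><scene/></act></play>
tree = ("play", [("title", []), ("act", [("scene", [])])])
print(document_order(tree))  # ['play', 'title', 'act', 'scene']
```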
Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation>
<MedlineID>91060009</MedlineID>
<DateCreated><Year>1991</Year><Month>01</Month><Day>10</Day></DateCreated>
<Article>some content</Article>
</MedlineCitation>
</MedlineCitationSet>
Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation> (content) </MedlineCitation>
<MedlineCitation> (content) </MedlineCitation>
<MedlineCitation> (content) </MedlineCitation>
…
</MedlineCitationSet>
How XQuery makes ranking difficult
All documents in collection A must be ranked before all documents in collection B.
Fragments must be ordered in depth-first, left-to-right order.
XQuery: Order By Clause
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
  <big-dept>
    { $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal> }
  </big-dept>
XQuery Order By Clause
The order by clause only allows ordering by an “overt” criterion.
Relevance ranking:
  is often proprietary
  can’t easily be expressed as a function of the set to be ranked
  is better abstracted out of query formulation (cf. the web)
Next: XML IR System with Relevance Ranking
Thursday (12/5): Presentation of Projects
Copy slides into your subdirectory in the cs276a submit/ directory
  Slides must be named slides.{ppt,pdf}
  Slides must be submitted by noon tomorrow (12/4)
You have 3 minutes to present on Thursday
  Use all of it, or leave 1 minute for 1 question
Projects are due today
Evaluations
UC Davis System
Extension of XQuery for relevance ranking
Term counts are stored with only one related node (usually a leaf)
Term weights are computed at query time
Solves the “granularity problem”
Reasonable index size: 30%–60% of the collection
Integrated Query Processing
Abbreviations
DF = Document Fragment
DFS = Document Fragment Sequence
SSD = sequence of sets of DFs
Integrated Query Processing
Element vs Attribute
Search for element /play[title=“hamlet”]
Search for attribute /play[@title=“hamlet”]
Both are valid encodings -> need to know schema
UC Davis system treats elements/attributes identically for text search
XIRQL supports /play[~title=“hamlet”]
Proposed XQuery Extension
RankBy ::= Expr "rankby" "(" QuerySpecList ")"
           ["basedon" "(" TargetSpecList ")"]
           ["using" MethodFunctCall]
QuerySpecList ::= Expr ("," QuerySpecList)?
TargetSpecList ::= PathExpr ("," TargetSpecList)?
Proposed XQuery Extension: Example
document("news.xml")//article//paragraph rankby ("New York")

for $c in document("newsmeta.xml")//category
return <category>
  <name> {$c/name/text()} </name>
  { for $a in document("news.xml")//article[@icd=$c/@id]
    return <title> {$a/title/text()} </title>
    rankby ($c/keywords) }
</category>
Proposed XQuery Extension: Example
for $a in document("bib.xml")//article
return <paper> {$a/title} </paper>
rankby ("albert", "einstein") basedon (references)
Observations
Possible: ranked text != returned text
  Rank on references, return titles
The query can be user-supplied text or a path expression (the $c/keywords example)
Implementation
Term weighting Term weights are computed on the fly
Index construction
Term Weighting
Parameters we need for computing term weights:
  tf – term frequency in the DF
  df – number of DFs containing the term
  dlen – length of the DF in terms
  slen – number of terms in the DFS
Premise: the first step of query evaluation is the p query (the path query), so term weights can be computed on the fly while evaluating it.
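As a rough sketch of such on-the-fly weighting — a generic tf.idf-style formula, not the system's actual weighting function — assuming the total number of DFs (n_dfs) is also available from the index:

```python
import math

def term_weight(tf, df, dlen, n_dfs):
    """Generic tf.idf-style weight for a term in a document fragment (DF),
    computed at query time from stored counts.

    tf    - term frequency within the DF
    df    - number of DFs containing the term
    dlen  - length of the DF in terms (for length normalization)
    n_dfs - total number of DFs (an assumed extra parameter, not in
            the slide's list)
    """
    if tf == 0 or df == 0 or dlen == 0:
        return 0.0
    norm_tf = tf / dlen            # longer fragments don't win by length alone
    idf = math.log(n_dfs / df)     # rarer terms count more
    return norm_tf * idf
```

slen (the number of terms in the DFS) would enter an analogous sequence-level normalization; it is omitted here to keep the sketch minimal.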
Integrated Query Processing
Index Requirements
Inverted index
Non-replication of statistics
DF inclusion
DF intersection
Worst case complexity: quadratic
  (number of DFs in the current result set) × (number of DFs contributing to term statistics)
Document Guide
Node Encoding
Use the Document Guide (DG)
Encode each node with a path identifier (PID)
PID = (DG-index, [path number, path number, …])
Example: /db/article[1]/text/para[3] is encoded as (5, (1, 3))
Encode the DG-index with log(n) bits; encode each path number with log(ki) bits
Node Encoding
Term Counter Accumulation
Worst case: quadratic
Trick: order PIDs to reduce the set of DFs to check
  (n1, N1) < (n2, N2) iff n1 < n2, or n1 = n2 and N1 < N2
To find all descendants of (n1, N1):
  start with (n1, N1)
  proceed to the first node (n2, N2) that is not a descendant
  stop
Index Parameters
Performance
Limitations
Need to start with a “p query” (a path)
  Often inconvenient / not possible
  User must know the schema
No element/attribute distinction for r queries
XQuery allows construction of new fragments from subfragments
XQuery allows construction of completely new documents
XIRQL
University of Dortmund
Goal: open-source XML search engine
Motivation:
  “Returnable” fragments are special “atomic units”
    E.g., don’t return a <bold> some text </bold> fragment
  Structured Document Retrieval Principle
  Empower users who don’t know the schema
    Enable search for any person_name no matter how the schema refers to it
Atomic Units
Specified in the schema
Only atomic units can be returned as the result of a search (unless a unit is specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units
XIRQL Indexing
Structured Document Retrieval Principle
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Document:
  <chapter> 0.3 XQL
    <section> 0.5 example </section>
    <section> 0.8 XQL 0.7 syntax </section>
  </chapter>
Return the section, not the chapter
Augmentation weights
Ensure that the Structured Document Retrieval Principle is respected.
Assume different query conditions are independent events.
P(chapter, XQL) = P(XQL|chapter) + P(sec.|chapter)*P(XQL|sec.) - P(XQL|chapter)*P(sec.|chapter)*P(XQL|sec.)
               = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
(Here 0.6 = P(sec.|chapter) is the augmentation weight, which down-weights evidence propagated from the section to the chapter.)
Section ranked ahead of chapter
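The computation can be checked in a few lines; 0.6 is taken to be the augmentation weight applied to evidence propagated from the section (an assumption, since the slide does not label it explicitly):

```python
def combine(p_own, aug_weight, p_child):
    """Score of an element from its own evidence plus evidence propagated
    from a child, down-weighted by the augmentation weight and combined
    as independent events (inclusion-exclusion)."""
    propagated = aug_weight * p_child
    return p_own + propagated - p_own * propagated

chapter_score = combine(0.3, 0.6, 0.8)   # 0.636
section_score = 0.8
# section_score > chapter_score, so the section is ranked first,
# respecting the Structured Document Retrieval Principle
```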
Datatypes
Example: person_name
Assign all elements and attributes with person semantics to this datatype
Allow the user to search for “person” without specifying a path
XIRQL: Summary
Relevance ranking
Fragment/context selection
Datatypes (person_name)
Probabilistic combination of evidence
XML Summary
Why you don’t want to use a DB: relevance ranking
Why you can’t use a standard IR engine:
  term statistics / indexing granularity
  issues with fragments (granularity, coherence, …)
XML Summary: Schemas
Ideally: there is one schema and the user understands it
In practice: rare
  Many schemas
  Schemas not known in advance
  Schemas change
  Users don’t understand schemas
Need to identify similar elements in different schemas (example: employee)
XML Summary: UI Challenges
Help the user find relevant nodes in the schema
  Author, editor, contributor, “from:”/sender
What query language do you expose to the user?
  XQuery? No. Forms? Parametric search? A text box?
In general: design a layer between XML and the user
Metadata
Dublin Core Element Set
  Title (e.g., Dublin Core Element Set)
  Creator (e.g., Hinrich Schuetze)
  Subject (e.g., keywords)
  Description (e.g., an abstract)
  Publisher (e.g., Stanford University)
  Contributor (e.g., Chris Manning)
  Date (e.g., 2002.12.03)
  Type (e.g., presentation)
  Format (e.g., ppt)
  Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
  Source (e.g., http://dublincore.org/documents/dces/)
  Language (e.g., English)
  Coverage (e.g., San Francisco Bay Area)
  Rights (e.g., Copyright Stanford University)
Why metadata?
“Normalized” semantics
Enables searches otherwise not possible:
  time, author, URL / filename
Non-text content: images, audio, video
For Effective Metadata We Need:
Semantics: commonly understood terms to describe information resources
Syntax: a standard grammar for connecting terms into meaningful “sentences”
Exchange framework: so we can recombine and exchange metadata across applications and subjects
Weibel and Miller
Dublin Core: Goals
Metadata standard
Framework for interoperability
Facilitate development of subject-specific metadata
More general goals of the DC community:
  Foster development of metadata tools (creating, editing, managing, navigating metadata)
  Semantic registry for metadata (search declared meanings and their relationships)
  Registry for metadata vocabularies (maybe somebody has already done the work)
RDF = Resource Description Framework
Engineering standard for metadata
  W3C standard, part of W3C’s metadata framework
  Specialized for the WWW
Desiderata:
  Combine different metadata modules (e.g., different subject areas)
  Syndication, aggregation, threading
Metadata: Who creates them?
Authors
Editors
Automatic creation
Automatic context capture
Automatic Context Capture
Metadata Pros and Cons
CONS
  Most authors are unwilling to spend time and energy on learning a metadata standard and annotating the documents they author.
  Authors are unable to foresee all the reasons why a document may be interesting.
  Authors may be motivated to sabotage metadata (patents).
PROS
  Information retrieval often does not work: words poorly approximate meaning.
  For truly valuable content, it pays to add metadata.
Synthesis
  In reality, most documents have some valuable metadata.
  If metadata is available, it improves relevance and the user experience.
  But the most interesting content will always have inconsistent and spotty metadata coverage.
Advanced Research Issues
Cross-lingual IR
Topic detection & tracking
Information filtering and collaborative filtering
Data fusion
Automatic classification
Merging database and IR technologies
Digital libraries
Multimedia information retrieval
Resources
Jan-Marco Bremer’s publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
www.w3.org/XML – XML resources at W3C
Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm
Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.
http://www.sciam.com/2001/0501issue/0501berners-lee.html
http://www.xml.com/pub/a/2001/01/24/rdf.html
Additional Resources
http://www.hyperorg.com/backissues/joho-jun26-02.html#semantic
http://www.xml.com/pub/a/2001/05/23/jena.html
http://www710.univ-lyon1.fr/%7Echampin/rdf-tutorial/
http://www-106.ibm.com/developerworks/library/w-rdf/