WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11

Overview

Monday: XML; Clustering 1
Tuesday: Clustering 2; Clustering 3, Interactive Retrieval
Wednesday: Classification 1; Classification 2
Thursday: Classification 3; Information Extraction
Friday: Bioinformatics; Projects
Joker: Active learning in Text Mining

Today’s Topics

Quick XML intro
XML indexing and search
    Database approach
    XQuery
    2 IR approaches

What is XML?

eXtensible Markup Language
A framework for defining markup languages
No fixed collection of markup tags
Each XML language is targeted at a particular application
All XML languages share features
Enables building of generic tools

Basic Structure

An XML document is an ordered, labeled tree
Character data at the leaf nodes contain the actual data (text strings)
Element nodes are each labeled with
    a name (often called the element type), and
    a set of attributes, each consisting of a name and a value
Element nodes can have child nodes

XML Example

<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
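
The ordered, labeled tree view can be made concrete with a small sketch. It parses the chapter example with Python's standard `xml.etree.ElementTree` and lists the element nodes in document order. The helper name `describe` is an assumption made for this illustration, not something from the lecture.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<chapter id="cmds">'
    '<chaptitle>FileCab</chaptitle>'
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para>'
    '</chapter>'
)

def describe(node, depth=0):
    """Return (depth, element name, attributes, leading text) rows in document order."""
    rows = [(depth, node.tag, dict(node.attrib), (node.text or "").strip())]
    for child in node:  # children are stored in document order
        rows.extend(describe(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
rows = describe(root)
for depth, name, attrs, text in rows:
    print("  " * depth + name, attrs, repr(text))
```

Note that ElementTree does not model character data as separate leaf nodes: text before the first child is kept in `.text`, and text following an element (here "inet application." after `<tm>`) is kept in that element's `.tail`.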

Elements

Elements are denoted by markup tags
<foo attr1="value" … > thetext </foo>
    Element start tag: foo
    Attribute: attr1
    The character data: thetext
    Matching element end tag: </foo>

XML vs HTML

Relationship?

HTML is a markup language for a specific purpose (display in browsers)
XML is a framework for defining markup languages
HTML can be formalized as an XML language (XHTML)
XML defines logical structure only
HTML: same intention, but has evolved into a presentation language

XML: Design Goals

Separate syntax from semantics to provide a common framework for structuring information

Allow tailor-made markup for any imaginable application domain

Support internationalization (Unicode) and platform independence

Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?

Represent semi-structured data (data that are structured, but don’t fit relational model)
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free

Applications of XML

XHTML
CML – chemical markup language
WML – wireless markup language
ThML – theological markup language

<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
  <note place="foot" id="One.2.p0.4">
    <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6">
      <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1.
    </added></p>
  </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
  <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
    <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p>
  </note></added>
</p>

XML Schemas

Schema = syntax definition of XML language
Schema language = formal language for expressing XML schemas
Examples
    DTD
    XML Schema (W3C)
Relevance for XML information retrieval
    Our job is much easier if we have a (one) schema

XML Tutorial

http://www.brics.dk/~amoeller/XML/index.html
(Anders Møller and Michael Schwartzbach)
Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database

Uses XML document as logical unit
Should support
    Elements
    Attributes
    PCDATA (parsed character data)
    Document order
Contrast with
    DB modified for XML
    Generic IR system modified for XML

XML Indexing and Search

Most native XML databases have taken a DB approach
    Exact match
    Evaluate path expressions
    No IR-type relevance ranking
Only a few focus on relevance ranking
Many types of XML don’t need relevance ranking
If there is a lot of text data, relevance ranking is usually needed.

Timber: DB extension for XML

DB: search tuples
Timber: search trees
Main focus
    Complex and variable structure of trees (vs. tuples)
    Ordering
Non-native XML database
    without relevance ranking
    without “IR-type” handling of text

Three Native XML Databases

ToXin
XIRQL
IBM Haifa system

ToXin

Exploits overall path structure
Supports any general path query
Query evaluation in three stages
    Preselection stage
    Selection stage
    Postselection stage

ToXin: Motivation

Strawman (Dataguides)
    Index all paths occurring in database
    Sufficient for simple queries: Find all authors with last name Smith
    Does not allow backward navigation
    Example query: find all the titles of articles authored by Smith

Query Evaluation Stages for Backward Navigation

Pre-selection
    First navigation down the tree
Selection
    Value selection according to filter
Post-selection
    Navigation up and down again
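
The three stages can be sketched for the query "find all the titles of articles authored by Smith". This is only an illustration on an in-memory tree: the sample document, the element names and the function `titles_by_author` are assumptions, and it does not reproduce ToXin's actual index structures.

```python
import xml.etree.ElementTree as ET

DOC = """
<bib>
  <article><title>XML Indexing</title><author><last>Smith</last></author></article>
  <article><title>Tree Search</title><author><last>Jones</last></author></article>
  <article><title>Path Queries</title><author><last>Smith</last></author></article>
</bib>
"""

def titles_by_author(root, last_name):
    # ElementTree has no parent pointers, so build a child -> parent map
    # to make backward (upward) navigation possible.
    parent = {child: node for node in root.iter() for child in node}

    # Pre-selection: navigate down the tree to the candidate nodes.
    candidates = root.findall("./article/author/last")

    # Selection: keep the candidates whose value passes the filter.
    selected = [n for n in candidates if (n.text or "").strip() == last_name]

    # Post-selection: navigate up to the enclosing article, then down to its title.
    titles = []
    for node in selected:
        article = parent[parent[node]]  # last -> author -> article
        titles.append(article.findtext("title"))
    return titles

root = ET.fromstring(DOC)
print(titles_by_author(root, "Smith"))
```

A forward-only path index can answer the pre-selection and selection steps; the upward step in post-selection is the part that needs backward navigation.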

Evaluation: Factors Impacting Performance

Data source (collection) specific
    Document size
    Number of XML nodes and values
    Path complexity (degree of nesting)
    Average value size
Query specific
    Selectivity of path constraint
    Size of query answer
    Number of elements selected by filter

Test Collections

Query Classification

Evaluation

ToXin: Summary

Efficient system supporting structured queries
All paths are indexed (not just from root)
Path index linear in corpus size
Shortcomings
    Order of nodes ignored
    No IR-type relevance

IR/Relevance Ranking for XML

Why is this difficult?

IR XML Challenge 1: Term Statistics

There is no document unit in XML
How do we compute tf and idf?
Global tf/idf over all text contexts is problematic
Consider medical collection
    “new” not a discriminative term in general
    Very discriminative for journal titles
        New England Journal of Medicine
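
A toy calculation shows why a single global idf can mislead. The text units below are invented for illustration, and the plain `log(N/df)` formula is an assumption, since the slides do not fix a particular idf variant.

```python
import math
from collections import defaultdict

# Hypothetical text units, each tagged with the XML context it came from.
UNITS = [
    ("journal_title", "new england journal of medicine"),
    ("journal_title", "journal of clinical oncology"),
    ("journal_title", "annals of internal medicine"),
    ("journal_title", "the lancet"),
    ("abstract", "a new treatment for asthma"),
    ("abstract", "new results on a new vaccine"),
    ("abstract", "we report new findings in oncology"),
    ("abstract", "a new look at internal medicine"),
]

def idf(term, texts):
    """Plain idf = log(N / df); returns 0.0 if the term never occurs."""
    df = sum(1 for text in texts if term in text.split())
    return math.log(len(texts) / df) if df else 0.0

by_context = defaultdict(list)
for context, text in UNITS:
    by_context[context].append(text)

global_idf = idf("new", [text for _, text in UNITS])
title_idf = idf("new", by_context["journal_title"])
abstract_idf = idf("new", by_context["abstract"])
print(round(global_idf, 3), round(title_idf, 3), round(abstract_idf, 3))
```

In this toy collection "new" appears in every abstract, so it carries no weight there, while it singles out one journal title; the global value sits in between and describes neither context well.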

IR XML Challenge 2: Fragments

Which fragments are legitimate to return?
    Paragraph, abstract, title
    Bold, italic
IR systems don’t store content (only index)
    Need to go to document for displaying fragment
    Problematic if fragment is not simply a node

Remainder of Lecture

Queries for semi-structured text
    How they differ from regular IR queries
    XQuery
Two XML search systems with relevance ranking
    XIRQL
    IBM Haifa system

Types of (Semi)Structured Queries

Location/position (“chapter no.3”)
Simple attribute/value
    /play/title contains “hamlet”
Path queries
    title contains “hamlet”
    /play//title contains “hamlet”
Complex graphs
    Employees with two managers
All of the above: mixed structure/content

XPath

Declarative language for
    Addressing (used in XLink/XPointer and in XSLT)
    Pattern matching (used in XSLT and in XQuery)
Location path
    a sequence of location steps separated by /
    Example: child::section[position()<6] / descendant::cite / attribute::href

Axes in XPath

ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

Location steps

A single location step has the form: axis :: node-test [ predicate ]

The axis selects a set of candidate nodes (e.g. the child nodes of the context node).

The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.), or names (e.g. element name).

The predicates (zero or more) cause a further, more complex, filtration

child::section[position()<6]
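
To show how axis, node test and predicate cooperate, the sketch below evaluates the location path `child::section[position()<6] / descendant::cite / attribute::href` by hand, one step at a time. Python's standard `xml.etree.ElementTree` supports only a subset of XPath (no explicit axes or `position()` comparisons), so each step is written out explicitly. The sample document and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: seven sections, each containing one cite element.
DOC = "<doc>" + "".join(
    f'<section><p>text <cite href="ref{i}.html">[{i}]</cite></p></section>'
    for i in range(1, 8)
) + "</doc>"

def first_five_cite_hrefs(context):
    # Step 1, child::section[position()<6]: the child axis with a name test,
    # then a predicate that keeps the first five matches.
    sections = [c for c in context if c.tag == "section"][:5]

    hrefs = []
    for section in sections:
        # Step 2, descendant::cite: iter() walks the subtree in document order;
        # the tag filter acts as the node test and skips the section itself.
        for cite in section.iter("cite"):
            # Step 3, attribute::href: select the href attribute if present.
            if "href" in cite.attrib:
                hrefs.append(cite.attrib["href"])
    return hrefs

root = ET.fromstring(DOC)
print(first_five_cite_hrefs(root))
```

Each step takes the node set produced by the previous step as its set of context nodes, which is why the path reads left to right.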

XQuery

SQL for XML
Usage scenarios
    Human-readable documents
    Data-oriented documents
    Mixed documents (e.g., patient records)
Based on XPath

XQuery Expressions

path expressions
element constructors
list expressions
conditional expressions
quantified expressions
datatype expressions

FLWR Expressions

FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p

FOR generates an ordered list of bindings of publisher names to $p

LET associates to each binding a further binding of the list of book elements with that publisher to $b

at this stage, we have an ordered list of tuples of bindings: ($p,$b)

WHERE filters that list to retain only the desired tuples

RETURN constructs for each tuple a resulting value
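
The FOR/LET/WHERE/RETURN pipeline can be imitated in a few lines of Python to make the tuple stream visible. This is a sketch, not an XQuery engine: the inline data stands in for bib.xml, the function name is hypothetical, and, unlike the query as written on the slide, it binds $p to distinct publisher names so that each publisher is returned once.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = """
<bib>
  <book><title>A</title><publisher>Acme</publisher></book>
  <book><title>B</title><publisher>Acme</publisher></book>
  <book><title>C</title><publisher>Birch</publisher></book>
</bib>
"""

def publishers_with_more_than(root, threshold):
    # FOR: an ordered list of bindings for $p. Assumption: distinct publisher
    # names in first-occurrence order, so each publisher is returned once.
    names = []
    for pub in root.iter("publisher"):
        if pub.text not in names:
            names.append(pub.text)

    result = []
    for p in names:
        # LET: bind $b to the list of books with that publisher.
        b = [book for book in root.iter("book")
             if book.findtext("publisher") == p]
        # WHERE: keep only the ($p, $b) tuples that pass the filter.
        if len(b) > threshold:
            # RETURN: construct the result value for this tuple.
            result.append(p)
    return result

root = ET.fromstring(BIB)
print(publishers_with_more_than(root, 1))
```

The threshold is a parameter here only so the toy data can pass the filter; the slide's query uses 100.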

XQuery vs SQL

Order matters!
    document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]
XQuery is Turing complete, SQL is not.

XQuery Example

Møller and Schwartzbach

XQuery 1.0 Standard on Order

Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model.

… if a node in document A is before a node in document B, then every node in document A is before every node in document B.

This structure-oriented ordering can have undesirable effects.

Example: Medline

Document collection = 100s of XML docs, each with thousands of abstracts

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation>
  <MedlineID>91060009</MedlineID>
  <DateCreated><Year>1991</Year><Month>01</Month><Day>10</Day></DateCreated>
  <Article>some content</Article>
</MedlineCitation>

Document collection = 100s of XML docs, each with thousands of abstracts

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation>
  (content)
</MedlineCitation>
<MedlineCitation>
  (content)
</MedlineCitation>
…
</MedlineCitationSet>

How XQuery makes ranking difficult

All documents in collection A must be ranked before all documents in collection B.

Fragments must be ordered in depth-first, left-to-right order.
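
The conflict between document order and relevance order is easy to see on a tiny example. The document and the relevance scores below are made up for illustration; no real ranking function is implied.

```python
import xml.etree.ElementTree as ET

DOC = """
<chapter id="c1">
  <section id="s1"><para id="p1">intro</para></section>
  <section id="s2"><para id="p2">xql syntax</para></section>
</chapter>
"""

root = ET.fromstring(DOC)

# iter() visits nodes depth-first, left to right, i.e. in document order.
document_order = [node.get("id") for node in root.iter()]

# Hypothetical relevance scores for a query; not produced by any real system.
scores = {"c1": 0.4, "s1": 0.1, "p1": 0.1, "s2": 0.8, "p2": 0.9}
relevance_order = sorted(scores, key=scores.get, reverse=True)

print(document_order)
print(relevance_order)
```

A result list that must follow document order would present the chapter and its first section before the best-scoring paragraph, which is the opposite of what a ranked result list should do.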

Semi-Structured Queries

More complex than “unstructured” queries
XQuery standard

XIRQL

University of Dortmund
    Goal: open source XML search engine
Motivation
    “Returnable” fragments are special
        “atomic units”
        E.g., don’t return a <bold> some text </bold> fragment
    Structured Document Retrieval Principle
    Empower users who don’t know the schema
        Enable search for any person_name no matter how schema refers to it

Atomic Units

Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units

XIRQL Indexing

Structured Document Retrieval Principle

A system should always retrieve the most specific part of a document answering a query.

Example query: xql
Document:

<chapter> 0.3 XQL
  <section> 0.5 example </section>
  <section> 0.8 XQL 0.7 syntax </section>
</chapter>

Return section, not chapter

Augmentation weights

Ensure that Structured Document Retrieval Principle is respected.

Assume different query conditions are disjoint events -> independence.

P(XQL|chapter-N) = P(XQL|chapter-F) + P(sec.|chapter-N)*P(XQL|sec.) - P(XQL|chapter-F)*P(sec.|chapter-N)*P(XQL|sec.)
                 = 0.3 + 0.6*0.8 - 0.3*0.6*0.8
                 = 0.636

P(XQL|sec.) = 0.8 > 0.636 = P(XQL|chapter-N)
Section ranked ahead of chapter
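
The arithmetic above is a probabilistic OR of two independent pieces of evidence, with the section's evidence scaled by the augmentation weight 0.6 before it reaches the chapter. The sketch below reproduces the numbers. The function name `combine`, and the reading of chapter-F as the chapter's own text and chapter-N as the chapter node including its sections, are assumptions.

```python
def combine(p_own, p_child, augmentation):
    """P(A or B) for independent events, with the child's evidence
    down-weighted by the augmentation weight."""
    propagated = augmentation * p_child
    return p_own + propagated - p_own * propagated

p_chapter_own = 0.3   # P(XQL|chapter-F): weight of XQL in the chapter's own text
p_section = 0.8       # P(XQL|sec.): weight of XQL in the section
augmentation = 0.6    # P(sec.|chapter-N): augmentation weight

p_chapter = combine(p_chapter_own, p_section, augmentation)
ranking = sorted([("chapter", p_chapter), ("section", p_section)],
                 key=lambda pair: pair[1], reverse=True)
print(round(p_chapter, 3), [name for name, _ in ranking])
```

With an augmentation weight of 1.0 the chapter would score 0.86 and outrank the section, so the down-weighting is what keeps the more specific element on top.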

Datatypes

Example: person_name
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path

XIRQL: Summary

Relevance ranking
Fragment/context selection
Datatypes (person_name)
Probabilistic combination of evidence

IBM Haifa Approach

Reject XQuery
    Willing to give up some expressiveness
    No joins and backward navigation
        Find all the titles of articles authored by Smith
Simpler & more efficient approach
Represent queries as XML fragments

Query Examples

Extended Weighting Formula

Direct extension of tf.idf
cr = context resemblance measure

Context Resemblance Measures

Flat: Perfect match
    cr(ci,cj) := 1 if i == j
    cr(ci,cj) := 0 otherwise
Partial match
    cr(ci,cj) := (1+|ci|)/(1+|cj|) if ci is a subsequence of cj
    cr(ci,cj) := 0 otherwise
Fuzzy match
    For example, string similarity of paths
    Example?
Ignore context
    cr(ci,cj) := 1 in all cases
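
The measures with explicit formulas are simple enough to write down directly. In this sketch a context is a tuple of path steps and |c| is taken to be the number of steps; both choices, and all function names, are assumptions. The fuzzy match is left out because the slide does not define it.

```python
def is_subsequence(short, long):
    """True if the steps of `short` appear in `long` in the same order
    (not necessarily contiguously)."""
    remaining = iter(long)
    return all(step in remaining for step in short)

def cr_perfect(ci, cj):
    """1 if the two contexts are identical, 0 otherwise."""
    return 1.0 if ci == cj else 0.0

def cr_partial(ci, cj):
    """(1+|ci|)/(1+|cj|) if ci is a subsequence of cj, 0 otherwise."""
    if is_subsequence(ci, cj):
        return (1 + len(ci)) / (1 + len(cj))
    return 0.0

def cr_ignore(ci, cj):
    """Context is ignored: every pair of contexts matches fully."""
    return 1.0

def steps(path):
    """Split '/book/chapter/title' into ('book', 'chapter', 'title')."""
    return tuple(s for s in path.split("/") if s)

query = steps("/book/title")
doc = steps("/book/chapter/title")
print(cr_perfect(query, doc), cr_partial(query, doc), cr_ignore(query, doc))
```

The partial match rewards document contexts that contain the query context with few extra steps: an identical context scores 1, and the score shrinks as the document path grows longer.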

Implementation

Indexing
    Index term/context pairs: t#c
    Example: istanbul#/country/capital
Retrieval
    Fetch all contexts of a term
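
A dictionary keyed by `t#c` strings is enough to illustrate the indexing scheme. The input format, the sample units and the function names are assumptions for this sketch; a real index would store weights and use sorted keys rather than a linear scan.

```python
from collections import defaultdict

def build_index(units):
    """Map each 'term#context' key to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, context, text in units:
        for term in text.lower().split():
            index[f"{term}#{context}"].add(doc_id)
    return index

def contexts_of(index, term):
    """Fetch every context in which `term` was indexed."""
    prefix = term + "#"
    return sorted(key[len(prefix):] for key in index if key.startswith(prefix))

# Hypothetical (doc id, context path, text) units.
UNITS = [
    (1, "/play/title", "Hamlet"),
    (1, "/play/act/scene/line", "Hamlet thou art slain"),
    (2, "/play/title", "Macbeth"),
]

index = build_index(UNITS)
print(contexts_of(index, "hamlet"))
```

At query time, the contexts returned for a term can each be compared with the query context using a context resemblance measure.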

Weighting

IDF per context: compute inverse document frequency for each context separately
    Problem: not enough data
Global IDF: compute a single global idf weighting
Merge-idf
    Compute idf for context ci by looking at all contexts with similarity > 0
Merge-all
    Compute tf in analogy to merge-idf
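
One possible reading of merge-idf is sketched below: the document frequency of a term for context ci is pooled over all indexed contexts cj with cr(ci,cj) > 0, here using the partial-match measure. The slides do not give the exact formula, so the pooling rule, the `log(N/df)` form, the postings layout and all names are assumptions.

```python
import math

def is_subsequence(short, long):
    remaining = iter(long)
    return all(step in remaining for step in short)

def cr_partial(ci, cj):
    """Partial-match context resemblance; contexts are tuples of path steps."""
    return (1 + len(ci)) / (1 + len(cj)) if is_subsequence(ci, cj) else 0.0

def merge_idf(term, ci, postings, n_docs):
    """idf of `term` for query context `ci`, pooling the documents of every
    indexed context cj with cr(ci, cj) > 0.

    `postings` maps (term, context) to a set of document ids."""
    docs = set()
    for (t, cj), ids in postings.items():
        if t == term and cr_partial(ci, cj) > 0:
            docs |= ids
    return math.log(n_docs / len(docs)) if docs else 0.0

# Hypothetical postings over a 4-document collection.
POSTINGS = {
    ("new", ("journal", "title")): {1},
    ("new", ("journal", "article", "title")): {2},
    ("new", ("journal", "article", "abstract")): {1, 2, 3},
}

print(merge_idf("new", ("journal", "title"), POSTINGS, n_docs=4))
```

Pooling similar contexts gives each idf estimate more data than a strict per-context idf, while still keeping unrelated contexts such as the abstracts out of the count.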

Results

Discussion

Flat best
    But it depends on the query. Average hides individual differences.
    Which queries will flat do well on?
Semantics of XML structure
    Best case
        XML structure corresponds to unit/subunit structure of documents
    Worst case (except for flat)
        Semantics of terms different in different structural units

IBM Haifa: Summary

Goal: information discovery vs. data exchange, data access via API etc.
Queries are XML fragments
    No separate query language
One of the best performers in INEX bakeoff
Extension of vector space model
Works well for
    Specific context, vague information need
Doesn’t work well for
    Non-specific context, DB-type information need

XML Summary

DB approach
    Good for DB-type queries
    But no relevance ranking
    And no ordering
Why you can’t use a standard IR engine
    Term statistics / indexing granularity
    Issues with fragments (granularity, coherence …)
Different approaches to relevance-ranked XML IR

XML IR challenge: Schemas

Ideally:
    There is one schema
    User understands schema
In practice: rare
    Many schemas
    Schemas not known in advance
    Schemas change
    Users don’t understand schemas
Need to identify similar elements in different schemas
    Example: employee

XML IR challenge: UI

Help user find relevant nodes in schema
    Author, editor, contributor, “from:”/sender
What is the query language you expose to user?
    XQuery? No.
    Forms? Parametric search?
    A textbox?
In general: design layer between XML and user

Project Suggestions

XML information retrieval using Lucene
Address some of the XML IR challenges
    Automatic creation of datatypes
    Weighting
    Others?

Resources

XQuery full-text requirements
    http://www.w3.org/TR/xquery-full-text-requirements/
Other approaches
    http://www.cs.cornell.edu/database/publications/2003/sigmod2003-xrank.pdf
    http://www.ercim.org/publication/ws-proceedings/DelNoe01/22_Schlieder.pdf
XML classification
    http://citeseer.nj.nec.com/583672.html