
Querying for relations from the semi-structured Web

Sunita Sarawagi

IIT Bombay

http://www.cse.iitb.ac.in/~sunita

Contributors

Rahul Gupta Girija Limaye Prashant Borole

Rakesh Pimplikar Aditya Somani

Web Search

Mainstream web search: the user issues keyword queries, the search engine returns a ranked list of documents.

15 glorious years of serving all of the user's search needs through this least common denominator.

Structured web search: the user issues natural-language or structured queries, the search engine returns point answers or record sets.

Many challenges in understanding both the query and the content.

15 years of slow but steady progress.

2

The Quest for Structure: Vertical structured search engines

Structure = a domain-specific schema

Shopping: Shopbot (Etzioni+ 1997): product name, manufacturer, price

Publications: CiteSeer (Lawrence, Giles+ 1998): paper title, author name, email, conference, year

Jobs: FlipDog (WhizBang! Labs, Mitchell+ 2000): company name, job title, location, requirements

People: DBLife (Doan 2007): name, affiliations, committees served, talks delivered

Triggered much research on extraction and IR-style search of structured data (BANKS '02).

3

Horizontal Structured Search

Domain-independent structure: a small, generic set of structured primitives over entities, types, relationships, and properties.

<Entity> IsA <Type>: Mysore is a city

<Entity> Has <Property>: <City> average rainfall <Value>

<Entity1> <related-to> <Entity2>: <Person> born-in <City>, <Person> CEO-of <Company>

4

Types of Structured Search

Structured databases (Ontologies), built from the Web + people: created manually (Cyc) or semi-automatically (Yago); True Knowledge (2009), Wolfram Alpha (2009)

Web annotated with structured elements: queries are keywords + structured annotations, e.g. <Physicist> +cosmos; open-domain structure extraction and annotation of web docs (2005—)

5

Users, Ontologies, and the Web

• Users are from Venus: bi-syllabic, impatient, believe in mind-reading

• Ontologies are from Mars: one structure to fit all

• Web content creators are from some other galaxy
– ignore ontologies
– let the search engines bring them the users

What is missed in Ontologies

The trivial, the transient, and the textual

Procedural knowledge
• What do I do on an error?

A huge body of invaluable text of various types: reviews, literature, commentaries, videos

Context

By stripping knowledge down to its skeletal form, the context that is so valuable for search is lost.

As long as queries are unstructured, the redundancy and variety of unstructured sources is invaluable.

Structured annotations in HTML

IsA annotations: KnowItAll (2004)

Open-domain relationships: TextRunner (Banko 2007)

Ontological annotations: SemTag and Seeker (2003)

Wikipedia annotations: Wikify! (2007), CSAW (2009)

8

All view documents as a sequence of tokens

Challenging to ensure high accuracy

WWT: Table queries over the semi-structured web

9

Queries in WWT

Query by content

Query by description

10

Query by content (example rows):

  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases

  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon

Query by description (column headers only):

  Inventor | Computer science concept | Year
  Indian states | Airport | City

11

Answer: Table with ranked rows

Person | Concept/Invention

Alan Turing | Turing Machine
Seymour Cray | Supercomputer
E. F. Codd | Relational Databases
Tim Berners-Lee | WWW
Charles Babbage | Babbage Engine

12

Attempt 1: Keyword search to find structured records

Query: computer science concept inventor year

The results are verbose articles, not structured tables.

The only document with an unstructured list of some desired records.

Desired records are spread across many documents.

The correct answer is not one click away.

13

The only list in one of the retrieved pages

14

Highly relevant Wikipedia table not retrieved in the top-k

Ideal answer should be integrated from these incomplete sources

15

Attempt 2: Include sample records in the query

Known examples: alan turing machine, codd relational database

Retrieves documents relevant only to the keywords.

The ideal answer is still spread across many documents.

16

WWT Architecture

[Architecture diagram]
Offline: the Web is crawled, record sources are extracted, annotated against an Ontology, and stored in a content+context index.
Query time: the Query Table goes to the Index Query Builder, which issues a Keyword Query and retrieves sources L1,…,Lk.
Type Inference (over a Typesystem Hierarchy) drives the Resolver builder, which produces a Resolver.
The Extractor (a Record labeler backed by CRF models) turns the sources into tables T1,…,Tk.
The Consolidator (cell resolver, row resolver, using corpus statistics) merges them into a Consolidated Table.
The Ranker computes row and cell scores and returns the final consolidated table to the User.

18

Offline: Annotating to an Ontology

Annotate table cells with entity nodes and table columns with type nodes.

[Example: a fragment of the ontology with type nodes All, People, Entertainers, movies, Indian_films, English_films, 2008_films, Terrorism_films, Indian_directors and entity nodes A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc); a sample table column contains A_Wednesday, Black&White, Coffee_house, Wednesday.]

Challenges

Ambiguity of entity names: "Coffee house" is both a movie name and a place name.

Noisy mentions of entity names: Black&White versus Black and White.

Multiple labels: the Yago ontology has on average 2.2 types per entity.

Missing type links in the Ontology, so we cannot use the least common ancestor.
  Missing link: Black&White to 2008_films. Not a missing link: 1920 to Terrorism_films.

Scale: Yago has 1.9 million entities and 200,000 types.

19

A unified approach

A graphical model jointly labels cells and columns to maximize a sum of scores, where

  y_cj = entity label of cell c in column j
  y_j  = type label of column j

Score(y_cj): string similarity between cell c and y_cj.

Score(y_j): string similarity between the header of column j and y_j.

Score(y_j, y_cj):
  Subsumed entity (y_cj lies under y_j): inversely proportional to the distance between them.
  Outside entity: fraction of overlapping entities between y_j and the immediate parent of y_cj.

This handles missing links: the overlap of 2008_films with 2007_films is zero, but its overlap with Indian_films is non-zero.

[Figure: a column label y_j in the type hierarchy (movies, Indian_films, English_films, Terrorism_films) with subsumed entities y_1j, y_2j and an outside entity y_3j.]

21
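A minimal, self-contained sketch of these node and edge scores, assuming a toy ontology and difflib string similarity; the helper names and the exact similarity and overlap measures are illustrative assumptions, not the WWT implementation.

```python
# Toy sketch of the joint-labeling scores; ontology, entities, and measures are assumptions.
from difflib import SequenceMatcher

def string_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

parent = {"Indian_films": "movies", "English_films": "movies", "2008_films": "movies"}
entity_type = {"A_Wednesday": "Indian_films", "Black&White": "2008_films"}

def type_distance(t, ancestor):
    d = 1                                # entity -> its immediate type counts as one hop
    while t != ancestor:
        if t not in parent:
            return None                  # the entity is not subsumed by `ancestor`
        t, d = parent[t], d + 1
    return d

def overlap(t1, t2):
    e1 = {e for e, t in entity_type.items() if t == t1}
    e2 = {e for e, t in entity_type.items() if t == t2}
    return len(e1 & e2) / max(1, len(e1 | e2))

def score(header, cells, y_col, y_cells):
    total = string_sim(header, y_col)                       # Score(y_j)
    for cell, y_cell in zip(cells, y_cells):
        total += string_sim(cell, y_cell)                   # Score(y_cj)
        d = type_distance(entity_type[y_cell], y_col)
        if d is not None:                                   # subsumed entity
            total += 1.0 / d
        else:                                               # outside entity: overlap with its parent type
            total += overlap(y_col, entity_type[y_cell])
    return total

print(round(score("Movie", ["A Wednesday", "Black and White"],
                  "Indian_films", ["A_Wednesday", "Black&White"]), 3))
```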

WWT Architecture

[The architecture diagram from slide 18, repeated as a roadmap; the next slides cover the Extractor.]

23

Extraction: Content queries

Extracting the query columns from list records.

Query Q:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York

A source Li:
  New York University (NYU), New York City, founded in 1831.
  Columbia University, founded in 1754 as King's College.
  Binghamton University, Binghamton, established in 1946.
  State University of New York, Stony Brook, New York, founded in 1957
  Syracuse University, Syracuse, New York, established in 1870
  State University of New York, Buffalo, established in 1846
  Rensselaer Polytechnic Institute (RPI) at Troy.

Lists are often human generated.

24

Extraction

(Source list Li as above; the goal is to extract its records into the query's columns.)

A rule-based extractor is insufficient; a statistical extractor needs training data.
Generating that training data is also not easy!

Query Q:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York

[The slide also shows the table columns extracted from Li.]

25

Extraction: Labeled data generation

Lists are unlabeled, but labeled records are needed to train a CRF.
A fast but naïve approach for generating labeled records.

Query (about colleges in NY):
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook

Fragment of a relevant list source:
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.

26

Extraction: Labeled data generation

A fast but naïve approach (query and list as above).

In the list, look for matches of every query cell.
Note: there is another match for "New York University" and another match for "New York".

27

Extraction: Labeled data generation

A fast but naïve approach (query and list as above).

In the list, look for matches of every query cell.
Greedily map each query row to the best match in the list.

28

Extraction: Labeled data generation

Problems with the naïve approach (query and list as above):

Hard matching criteria have significantly low recall: segments are missed, and natural clues like Univ = University are not used.

Greedy matching can lead to really bad mappings.

In the figure: some list segments stay unmapped (hurting recall), some are wrongly mapped, and the rest are assumed to be 'Other'. A concrete sketch follows below.

29
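To make the naïve scheme concrete, here is a tiny illustrative sketch: exact substring matches plus a greedy per-row assignment. The scoring is an assumption, not WWT's code; running it exposes the recall problem, since "New York University" never matches the abbreviated "New York Univ.".

```python
# Hard-matching baseline on the running example (illustrative only).
query = [("New York University", "New York"),
         ("Monroe College", "Brighton"),
         ("State University of New York", "Stony Brook")]
lines = ["New York Univ. in NYC",
         "Columbia University in NYC",
         "Monroe Community College in Brighton",
         "State University of New York in Stony Brook, New York."]

def hard_score(row, line):
    # number of query cells that occur verbatim in the list line
    return sum(cell in line for cell in row)

for row in query:                                  # greedy: pick the best line per query row
    best = max(lines, key=lambda line: hard_score(row, line))
    print(row, "->", best, "| matched cells:", hard_score(row, best))
```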

Generating labeled data: Soft approach

(Query and list as above.)

30

Generating labeled data: Soft approach

Compute a match score between each query row and each source row: the score of the best segmentation of the source row into the query columns.

The score of a segment s for column c is the probability that cell c of the query row is the same as segment s, computed by the Resolver module based on the type of the column.

[Figure: example match scores such as 0.9, 0.3, and 1.8 between a query row and the source rows.]

31

Generating labeled data: Soft approach

(Same scoring, shown for another query row; example scores 2.0, 0.7, 0.3.)

32

Generating labeled data: Soft approach

Compute the maximum-weight matching between query rows and source rows; this is better than greedily choosing the best match for each row (greedy matching shown in red in the figure).

Soft string matching increases the number of labeled candidates significantly.

This vastly improves recall and leads to better extraction models; a sketch follows below.

[Figure: bipartite match scores (e.g. 0.9, 0.3, 1.8, 0.7, 2.0) between query rows and source rows.]

33
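A sketch of the soft variant under these assumptions: a crude window-based string similarity stands in for the Resolver's segment scores, and SciPy's assignment solver computes the maximum-weight matching between query rows and source rows. This is illustrative, not the actual WWT scoring.

```python
# Soft row-match scores + maximum-weight bipartite matching (illustrative sketch).
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

query = [["New York University", "New York"],
         ["Monroe College", "Brighton"],
         ["State University of New York", "Stony Brook"]]
lines = ["New York Univ. in NYC",
         "Columbia University in NYC",
         "Monroe Community College in Brighton",
         "State University of New York in Stony Brook, New York."]

def soft_cell_score(cell, line):
    # best similarity of the cell against any token window of the same length in the line
    toks, n = line.split(), len(cell.split())
    windows = [" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))]
    return max(SequenceMatcher(None, cell.lower(), w.lower()).ratio() for w in windows)

scores = np.array([[sum(soft_cell_score(cell, line) for cell in row) for line in lines]
                   for row in query])
rows, cols = linear_sum_assignment(-scores)        # negate: the solver minimizes cost
for r, c in zip(rows, cols):
    print(query[r], "->", lines[c], "| score", round(scores[r, c], 2))
```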

Extractor

Train a CRF on the generated labeled data.

Feature set: delimiters and HTML tokens in a window around labeled segments; alignment features.

Collective training over multiple sources.

34
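A hedged sketch of this step using the third-party sklearn-crfsuite package; the feature names, labels, and the single toy training record are illustrative assumptions (the actual WWT feature set also includes alignment features and collective training across sources).

```python
# Train a CRF segmenter on soft-generated labels (illustrative sketch).
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_delim": t in {",", ".", ";", "in", "at"},       # delimiter-style clues
        "is_html": t.startswith("<") and t.endswith(">"),   # HTML tokens around segments
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One record labeled by the soft matcher; labels are query columns or "Other".
tokens = "State University of New York in Stony Brook , New York .".split()
labels = ["College"] * 5 + ["Other"] + ["City"] * 2 + ["Other"] * 4

X, y = [[token_features(tokens, i) for i in range(len(tokens))]], [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```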

Experiments

Aim: Reconstruct Wikipedia tables from only a few sample rows.

Sample queries:
  TV series: character name, actor name, season
  Oil spills: tanker, region, time
  Golden Globe Awards: actor, movie, year
  Dadasaheb Phalke Awards: person, year
  Parrots: common name, scientific name, family

35

Experiments: Dataset

Corpus: 16M lists from 500M pages of a web crawl; 45% of the lists retrieved by an index probe are irrelevant.

Query workload: 65 queries, with ground truth hand-labeled by 10 users over 1300 lists.

27% of the queries are not answerable with one list (the "difficult" queries).

The true consolidated table = 75% of the Wikipedia table, plus 25% new rows not present in Wikipedia.

36

Extraction performance

Benefits of soft training-data generation, alignment features, and staged extraction, measured by F1 score.

More than 80% F1 with just three query records.

Queries in WWT

Query by content (covered above)

Query by description

37

Query by description (column headers only):
  Inventor | Computer science concept | Year
  Indian states | Airport | City

Extraction: Description queries

Example source table (non-informative headers, or no headers at all):
  Lithium | 3
  Sodium | 11
  Beryllium | 4

Use context to get at relevant tables, together with ontological annotations.

Context is the union of: the text around the table, its headers, and ontology labels when present.

39

[Figure: a fragment of the ontology for chemical elements — All, People, Chemical_elements, Metals/Non_Metals, Alkali/Non-alkali, Gas/Non-gas — with entities Lithium, Sodium, Carbon, Hydrogen, Aluminium; the first column of the example table is annotated with the types Alkali and Chemical element.]

Joint labeling of table columns

Given: candidate tables T1, T2, …, Tn and query columns q1, q2, …, qm.

Task: label the columns of each Ti with {q1, q2, …, qm, none} so as to maximize the sum of these scores:

  Score(T, j, qk) = ontology type match + header string match of column j with qk
  Score(T, *, qk) = match of the description of T with qk
  Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk

Inference is done in a graphical model, solved via belief propagation. 40
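A toy sketch of this objective under simplifying assumptions: two small candidate tables, node scores from header similarity only, and pairwise scores from content overlap. For readability it enumerates labelings exhaustively instead of running belief propagation.

```python
# Joint column labeling as a scoring problem (illustrative; brute force instead of BP).
from itertools import product
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Each candidate table is a list of (header, set of cell values) columns.
tables = [[("Element", {"Lithium", "Sodium"}), ("No.", {"3", "11"})],
          [("Name", {"Lithium", "Beryllium"}), ("At. number", {"3", "4"})]]
columns = [col for t in tables for col in t]
labels = ["Chemical element", "Atomic number", "none"]

def node_score(col, label):                       # Score(T, j, qk): header match only here
    return 0.0 if label == "none" else sim(col[0], label)

def pair_score(c1, c2):                           # Score(T, j, T', j', qk): content overlap
    inter, union = len(c1[1] & c2[1]), len(c1[1] | c2[1])
    return inter / union if union else 0.0

best = max(product(labels, repeat=len(columns)), key=lambda assign:
           sum(node_score(c, a) for c, a in zip(columns, assign)) +
           sum(pair_score(columns[i], columns[j])
               for i in range(len(columns)) for j in range(i + 1, len(columns))
               if assign[i] == assign[j] != "none"))
for col, lab in zip(columns, best):
    print(col[0], "->", lab)
```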


WWT Architecture

[The architecture diagram from slide 18, repeated as a roadmap; the next slides cover the Consolidator and the Ranker.]

42

Step 3: Consolidation

Merging the extracted tables into one.

Table 1:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York City
  Binghamton University | Binghamton
+
Table 2:
  SUNY | Stony Brook
  New York University (NYU) | New York
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse
=
Merged table (duplicates merged):
  Cornell University | Ithaca
  State University of New York OR SUNY | Stony Brook
  New York University OR New York University (NYU) | New York City OR New York
  Binghamton University | Binghamton
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse

Consolidation

Challenge: resolving whether two rows are the same in the face of extraction errors and missing columns, in an open domain, with no training data.

Our approach: a specially designed Bayesian network with interpretable and generalizable parameters.

45

Resolver

A Bayesian network computes P(RowMatch | rows q, r) from the cell-level match probabilities P(i-th cell match | q_i, r_i) for each column i.

The parameters are set automatically using list statistics, and are derived from user-supplied, type-specific similarity functions.

47
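One plausible way to turn the cell-level probabilities into a row-match probability is sketched below; the slide does not spell out the network structure or its parameters (which WWT sets from list statistics), so the combination rule and the per-cell error probability here are assumptions.

```python
# Combine per-cell match probabilities into P(RowMatch | q, r) (illustrative sketch).
def row_match_prob(cell_match_probs, per_cell_error=0.1):
    """cell_match_probs[i] = P(i-th cell of q matches i-th cell of r); missing cells omitted."""
    p = 1.0
    for cp in cell_match_probs:
        # allow some probability that a true match was garbled by extraction errors
        p *= (1 - per_cell_error) * cp + per_cell_error
    return p

print(row_match_prob([0.95, 0.80]))   # two columns compared
print(row_match_prob([0.95]))         # a missing column: only one cell compared
```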

Ranking

• Factors for ranking
– Relevance: membership in overlapping sources; support from multiple sources
– Completeness: importance of the columns present; penalize records with only common 'spam' columns like City and State
– Correctness: extraction confidence

School | Location | State | Merged-row confidence | Support
- | - | NY | 0.99 | 9
- | NYC | New York | 0.95 | 7
New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4
University of Rochester OR Univ. of Rochester | Rochester | New York | 0.50 | 2
University of Buffalo | Buffalo | New York | 0.70 | 2
Cornell University | Ithaca | New York | 0.76 | 1

Relevance ranking on set membership

Weighted-sum approach:
  score of a set (source table) t: s(t) = fraction of query rows in t
  relevance of a consolidated row r: sum of s(t) over the tables t that contain r

Graph-walk approach:
  random walk from row nodes to table nodes, starting from the query rows, with random restarts to the query rows

48

[Figure: a bipartite graph linking query rows and consolidated rows to the tables they appear in.]

49
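A sketch of the graph-walk relevance on a toy version of that bipartite graph: a random walk with restarts at the query-row nodes, computed by power iteration. The graph, edge weights, and restart probability below are illustrative assumptions.

```python
# Random walk with restart over a toy rows-vs-tables bipartite graph (illustrative).
import numpy as np

nodes = ["q_row1", "q_row2", "rowA", "rowB", "table1", "table2"]
edges = [("q_row1", "table1"), ("q_row2", "table1"), ("q_row2", "table2"),
         ("rowA", "table1"), ("rowB", "table2")]
idx = {n: i for i, n in enumerate(nodes)}

A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:                                  # undirected membership edges
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
P = A / A.sum(axis=1, keepdims=True)                # row-stochastic transition matrix

restart = np.zeros(len(nodes))
restart[[idx["q_row1"], idx["q_row2"]]] = 0.5       # restarts go back to the query rows
alpha = 0.15                                        # restart probability (assumed)
pi = restart.copy()
for _ in range(100):                                # power iteration to convergence
    pi = alpha * restart + (1 - alpha) * pi @ P
print({n: round(pi[idx[n]], 3) for n in ["rowA", "rowB"]})   # graph relevance of rows
```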

Ranking Criteria

• Score(Row r) =
  × graph relevance of r
  × importance of the columns C present in r (high if C functionally determines the other columns)
  × combined cell extraction confidence: a noisy-OR of the cell extraction confidences from the individual CRFs

School | Location | State | Merged-row confidence | Support
New York Univ. OR New York University (0.90) | New York City OR New York (0.95) | New York (0.98) | 0.85 | 4
University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.70 | 2
Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.76 | 1
University of Rochester OR Univ. of Rochester (0.80) | Rochester (0.95) | New York (0.99) | 0.50 | 2
- | - | NY (0.99) | 0.99 | 9
- | NYC (0.98) | New York (0.98) | 0.95 | 7

50
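A small arithmetic sketch of this score: the noisy-OR follows the slide, while the graph-relevance and column-importance weights plugged in below are made-up example numbers.

```python
# Score(row) = graph relevance x column importance x noisy-OR of cell confidences (sketch).
def noisy_or(confidences):
    p_none = 1.0
    for c in confidences:
        p_none *= (1.0 - c)
    return 1.0 - p_none

def row_score(graph_relevance, column_importance, cell_confidences):
    return graph_relevance * column_importance * noisy_or(cell_confidences)

# e.g. the "New York Univ." row above, with cell confidences 0.90, 0.95, 0.98
print(round(row_score(0.8, 1.0, [0.90, 0.95, 0.98]), 3))
```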

Overall performance

To justify sophisticated consolidation and resolution, compare against:
  processing only the (magically known) single best list, so no consolidation or resolution is required;
  simple consolidation, with no merging of approximate duplicates.

WWT has more than 55% recall and beats both baselines; the gain is bigger for the difficult queries.

[Figure: results on all queries and on difficult queries.]

51

Running time

Under 30 seconds with 3 query records.

This can be improved by processing the sources in parallel.

The variance is high because the running time depends on the number of columns, record length, etc.

52

Related Work

Google Squared: developed independently, launched in May 2009. The user provides a keyword query, e.g. "list of Italian joints in Manhattan", and the schema is inferred. Technical details are not public.

Prior methods for extraction and resolution assume labeled data or pre-trained parameters. We generate the labeled data and automatically train the resolver parameters from the list sources.

53

Summary

Structured web search and the role of non-text, partially structured web sources.

The WWT system:
  domain-independent;
  online: structure interpretation at query time;
  relies heavily on unsupervised statistical learning;
  a graphical model for table annotation;
  a soft approach for generating labeled data;
  collective column labeling for descriptive queries;
  a Bayesian network for resolution and consolidation;
  PageRank-style graph relevance plus confidence from a probabilistic extractor for ranking.

What next?

Designing plans for non-trivial ways of combining sources.
Better ranking and user-interaction models.
Expanding the query set: aggregate queries (tables are rich in quantities); point queries (attribute-value and relationship queries).
Interplay between the semi-structured web and Ontologies: augmenting one with the other.
Quantify the information in structured sources vis-à-vis text sources on typical query workloads.

54
