
Center for E-Business Technology, Seoul National University

Seoul, Korea

WebTables: Exploring the Power of Tables on the Web

Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang

VLDB 2008

2009. 01. 08.

Summarized and Presented by {Name}, IDS Lab., Seoul National University

Copyright 2009 by CEBT

Introduction

Web is a corpus of unstructured data

Some structure is imposed by

Hierarchical URLs

Hyperlink Graph

Web pages generally contain

Text as paragraphs

Tabular data (Relations)

Text and tables have different characteristics

Tables have more structured data than raw text



Introduction (2)

Tables can give some hints about semantics

Headers

Tuples

Regular keyword query techniques are not very effective for tables



Motivation

Enable analysis and integration of data on the web

User demand for structured data

For 30 million queries, users clicked on results containing tables

This paper focuses on two fundamental questions

What are effective methods for searching within large collections of tables?

Is there additional power that can be derived by analyzing a large corpus of tables?



WebTables - Data

WebTables system considers HTML tables that are already surfaced and crawlable

Deep Web refers to the content that is made available through filling HTML forms

Corpus

14.1 Billion raw HTML tables

154 Million distinct relational databases

Relational databases form 1.1% of raw HTML tables

60% of data from non-deep-web sources

40% of data from parameterized URLs



Extracting Relations

Most HTML tables are used for page layouts

To separate relational from non-relational tables

Handwritten detectors

Statistically trained classifiers

Training & Test data generated by two independent judges

Scale of relational quality 1-5

Tables that received average score of 4 or above were considered as relational



Data Model


ℛ: Corpus of databases, where each database is a relation

R: A relation, R ∈ ℛ; Ru and Ri together uniquely define R

Ru: URL of the page from which the relation was extracted

Ri: Offset of the relation within the page

Rs: Schema of the relation

Rt: A list of tuples

A: Attribute Correlation Statistics Database (ACSDb)


Attribute Correlation Statistics Database (ACSDb)

For each unique schema Rs, ACSDb contains a frequency count

A = {(Rs1,C1), (Rs2,C2), (Rs3,C3) … }

If a schema appears multiple times under the same domain name, it is counted only once

ACSDb contains

5.4M unique attribute names

2.6M unique schemas

ACSDb is simple but can be used to compute probabilities

For example, conditional probability of finding attribute ‘Address’ in a schema given attribute ‘Name’

P(address | name) = (count of schemas containing both address and name) / (count of schemas containing name)
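The counting rule and the conditional probability above can be sketched in a few lines of Python. This is a minimal illustration, not the WebTables implementation: the function names `build_acsdb` and `cond_prob`, the sample URLs, and the choices to lowercase attribute names, ignore attribute order, and weight each schema by its frequency count are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def build_acsdb(relations):
    """relations: iterable of (url, schema); schema is a list of attribute names."""
    seen = set()        # (domain, schema) pairs that were already counted
    acsdb = Counter()   # schema (as a frozenset of attributes) -> frequency count
    for url, schema in relations:
        domain = urlparse(url).netloc
        key = frozenset(a.strip().lower() for a in schema)  # assumption: order ignored
        if (domain, key) not in seen:   # same schema on same domain counts once
            seen.add((domain, key))
            acsdb[key] += 1
    return acsdb

def cond_prob(acsdb, target, given):
    """P(target | given), weighting each schema by its frequency count."""
    both = sum(c for s, c in acsdb.items() if target in s and given in s)
    base = sum(c for s, c in acsdb.items() if given in s)
    return both / base if base else 0.0

# Hypothetical sample data
relations = [
    ("http://a.example/p1", ["name", "address"]),
    ("http://a.example/p2", ["name", "address"]),  # same domain: counted once
    ("http://b.example/x", ["name", "address"]),
    ("http://c.example/y", ["name", "phone"]),
    ("http://d.example/z", ["make", "model"]),
]
acsdb = build_acsdb(relations)
p = cond_prob(acsdb, "address", "name")
print(p)
```

In this toy corpus, three counted schemas contain `name` and two of them also contain `address`, so the estimate is 2/3.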



ACSDb



Relation Search

WebTables search engine takes a keyword query and returns relations ranked by relevance

Query-appropriate visualizations can be created

Columns containing place names can be displayed on a map

Graphs can be generated from table data

Traditional structured operations can be applied over search results

Selection

Projection



Ranking

Keyword ranking for databases is a novel problem

Challenges

Relations do not exist in a domain-specific schema graph

Word frequencies apply ambiguously to tables (Ex: which table in the page is described by which frequent word)

Attribute labels are extremely important

Attributes provide good summaries of the subject matter

Tuples may have a key-like element that summarizes the row

Ranking Functions

naïveRank

filterRank

featureRank

schemaRank



Ranking Function (1)

Naïve Rank

It simply uses the top k search engine result pages to generate relations.

If there are no relations in the top k search results, naïve Rank will emit no relations.

Roughly simulates modern search engine user




Filter Rank

Similar to naïve rank

It will go as far down the search result pages as necessary to find ‘k’ relations
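The difference between naïveRank and filterRank can be sketched as follows. This is only an illustration under stated assumptions: `ranked_pages` is a hypothetical list of search-engine results (best first), each carrying the relations already extracted from that page, and capping the naïve output at k relations is a choice made for this example.

```python
def naive_rank(ranked_pages, k):
    """Look only at the top-k pages; emit whatever relations they contain."""
    out = []
    for page in ranked_pages[:k]:
        out.extend(page["relations"])  # relations kept in page order
    return out[:k]                     # assumption: return at most k relations

def filter_rank(ranked_pages, k):
    """Walk down the result list as far as needed to collect k relations."""
    out = []
    for page in ranked_pages:
        for rel in page["relations"]:
            if len(out) >= k:
                return out
            out.append(rel)
    return out

# Hypothetical search results; the top page contains no relation
pages = [
    {"url": "u1", "relations": []},
    {"url": "u2", "relations": ["r1"]},
    {"url": "u3", "relations": ["r2", "r3"]},
    {"url": "u4", "relations": ["r4"]},
]
print(naive_rank(pages, 2))   # only pages u1 and u2 are inspected
print(filter_rank(pages, 2))  # continues to u3 to reach two relations
```

With k = 1 the naïve variant returns nothing here, because the single inspected page holds no relation, while the filtering variant still returns one.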




Feature Rank

Does not rely on an existing search engine

Uses relation specific features to score each extracted relation in the Corpus

Sorts results by score

Different feature scores were combined using linear regression estimator

– trained by a thousand (q, relation) pairs each scored by two human judges
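A linear combination of relation-specific features might look like the sketch below. The feature names, the weights in `WEIGHTS`, and the relation layout (a dict with `header` and `rows`) are hypothetical; in the approach described above the weights would come from a regression fitted on the human-judged (q, relation) pairs rather than being set by hand.

```python
def extract_features(query, relation):
    """Compute a few illustrative features for one (query, relation) pair."""
    terms = query.lower().split()
    header = [h.lower() for h in relation["header"]]
    rows = relation["rows"]
    leftmost = [str(r[0]).lower() for r in rows if r]
    return {
        "num_rows": len(rows),
        "num_cols": len(header),
        # number of query terms that appear in some attribute label
        "header_hits": sum(any(t in h for h in header) for t in terms),
        # number of query terms that appear in the leftmost column
        "leftmost_hits": sum(any(t in c for c in leftmost) for t in terms),
    }

# Hypothetical weights standing in for a trained linear regression model
WEIGHTS = {"num_rows": 0.01, "num_cols": 0.01, "header_hits": 1.0, "leftmost_hits": 0.5}

def feature_rank(query, relations, weights=WEIGHTS, bias=0.0):
    """Score every relation with the linear model and sort by score, best first."""
    def score(rel):
        feats = extract_features(query, rel)
        return bias + sum(weights[name] * value for name, value in feats.items())
    return sorted(relations, key=score, reverse=True)

cities = {"header": ["city", "population"],
          "rows": [["Seoul", "9700000"], ["Busan", "3400000"]]}
cars = {"header": ["make", "model"],
        "rows": [["Honda", "Civic"], ["Toyota", "Corolla"]]}
ranked = feature_rank("city population", [cars, cities])
print([r["header"] for r in ranked])
```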




Schema Rank

Same as feature Rank

Additionally uses ACSDb based Schema coherence score

Coherent Schema is one where attributes are strongly related

Make, Model (coherent)

Make, Zipcode (not coherent)

PMI - Pointwise Mutual Information

Gives a sense of how strongly two items are related

Coherence score for a schema is the average of all possible attribute-pairwise PMI scores for the schema
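A sketch of the coherence score, assuming the usual definition PMI(a, b) = log(p(a, b) / (p(a) p(b))) with probabilities estimated from ACSDb counts. The toy `acsdb` dictionary, the helper names, and the `floor` value returned when a pair never co-occurs are assumptions of this example, not details taken from the paper.

```python
import math
from itertools import combinations

def prob(acsdb, attrs):
    """Fraction of (count-weighted) schemas that contain every attribute in attrs."""
    total = sum(acsdb.values())
    hits = sum(c for s, c in acsdb.items() if attrs <= s)
    return hits / total if total else 0.0

def pmi(acsdb, a, b, floor=-10.0):
    p_ab = prob(acsdb, {a, b})
    p_a, p_b = prob(acsdb, {a}), prob(acsdb, {b})
    if p_ab == 0.0 or p_a == 0.0 or p_b == 0.0:
        return floor  # assumption: fixed penalty instead of log(0)
    return math.log(p_ab / (p_a * p_b))

def coherence(acsdb, schema):
    """Average PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(set(schema)), 2))
    if not pairs:
        return 0.0
    return sum(pmi(acsdb, a, b) for a, b in pairs) / len(pairs)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"make", "model"}): 8,
    frozenset({"make", "model", "year"}): 4,
    frozenset({"name", "zipcode"}): 6,
    frozenset({"make", "zipcode"}): 1,
    frozenset({"name", "address"}): 1,
}
print(coherence(acsdb, ["make", "model"]))    # positive: the pair co-occurs often
print(coherence(acsdb, ["make", "zipcode"]))  # negative: the pair rarely co-occurs
```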



Indexing

Traditional Search Engines use Inverted Index

Inverted Index cannot retrieve relational features

Inverted Index

Term -> (docid, offset)

WebTables data exists in two dimensions

Term -> (docid, offset-X, offset-Y)
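The two-dimensional posting format can be illustrated with a small in-memory index. This is a sketch under assumptions: a table id stands in for the docid, offset-X is taken to be the column and offset-Y the row, and row 0 is assumed to hold the attribute labels.

```python
from collections import defaultdict

def build_index(tables):
    """tables: table_id -> list of rows (row 0 = header). Returns term -> postings."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for y, row in enumerate(rows):        # y: row offset
            for x, cell in enumerate(row):    # x: column offset
                for term in str(cell).lower().split():
                    index[term].append((table_id, x, y))
    return index

def header_hits(index, term):
    """Postings where the term occurs in the header row (y == 0)."""
    return [p for p in index.get(term.lower(), []) if p[2] == 0]

def leftmost_column_hits(index, term):
    """Postings where the term occurs in the leftmost column (x == 0)."""
    return [p for p in index.get(term.lower(), []) if p[1] == 0]

# Hypothetical table
tables = {
    "t1": [["city", "population"],
           ["Seoul", "9700000"],
           ["Busan", "3400000"]],
}
index = build_index(tables)
print(index["seoul"])
print(header_hits(index, "population"))
print(leftmost_column_hits(index, "busan"))
```

Keeping both offsets lets a ranker ask positional questions, such as whether a query term hit an attribute label or a key-like leftmost cell, that a one-dimensional offset cannot answer.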



ACSDb Application (1)

Schema Auto Complete

Designed to assist novice database designers when creating a relational schema

Schemas consisting of Single Relations

The user enters one or more domain-specific attributes and the auto-completer guesses the rest of the attributes
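One way to realize such an auto-completer on top of ACSDb counts is a greedy loop: keep adding the attribute that makes the suggested set most probable given the user's input, and stop when that probability drops to a threshold. The sketch below is an assumption-laden illustration; the function names, the `threshold` default, the alphabetical tie-breaking, and the toy `acsdb` are all hypothetical.

```python
def cond_prob(acsdb, attrs, given):
    """P(all attrs present | all given present), weighted by schema counts."""
    base = sum(c for s, c in acsdb.items() if given <= s)
    both = sum(c for s, c in acsdb.items() if (given | attrs) <= s)
    return both / base if base else 0.0

def autocomplete(acsdb, seed, threshold=0.5):
    """Greedily extend the user's seed attributes; returns suggestions in order."""
    seed = frozenset(seed)
    suggested = []
    vocab = set().union(*acsdb.keys()) if acsdb else set()
    while True:
        candidates = vocab - seed - set(suggested)
        best, best_p = None, 0.0
        for a in sorted(candidates):  # sorted for deterministic tie-breaking
            p = cond_prob(acsdb, frozenset(suggested) | {a}, seed)
            if p > best_p:
                best, best_p = a, p
        if best is None or best_p <= threshold:
            return suggested
        suggested.append(best)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"name", "address", "phone"}): 5,
    frozenset({"name", "address"}): 3,
    frozenset({"name", "email"}): 1,
    frozenset({"make", "model"}): 4,
}
print(autocomplete(acsdb, {"name"}))
print(autocomplete(acsdb, {"name"}, threshold=0.6))
```

A lower threshold yields longer but less certain schemas; a higher one stops earlier.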




Attribute Synonym-Finding

Automatically find synonyms between arbitrary attribute strings

Given a set of context attributes, generates pairs of candidate synonym attributes

Assumptions

– Synonymous attributes will never appear together in the same schema

– Odds of synonymy are higher if p(a,b) = 0 despite a large value for p(a)p(b)

– Two synonyms will appear in similar contexts
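The three assumptions can be folded into a single score, for example: zero if the pair ever co-occurs, otherwise p(a)p(b) divided by a measure of how differently a and b behave with respect to every other attribute. The formula below, the smoothing constant `eps`, and the toy data are assumptions of this sketch and should not be read as the paper's exact definition.

```python
def prob(acsdb, attrs):
    """Fraction of (count-weighted) schemas containing every attribute in attrs."""
    total = sum(acsdb.values())
    hits = sum(c for s, c in acsdb.items() if attrs <= s)
    return hits / total if total else 0.0

def cond(acsdb, z, given):
    """P(z present | all attributes in given present)."""
    base = sum(c for s, c in acsdb.items() if given <= s)
    both = sum(c for s, c in acsdb.items() if given <= s and z in s)
    return both / base if base else 0.0

def synonym_score(acsdb, a, b, context=frozenset(), eps=0.01):
    if prob(acsdb, {a, b}) > 0:
        return 0.0  # assumption 1: synonyms never share a schema
    vocab = set().union(*acsdb.keys()) - {a, b}
    # assumption 3: synonyms co-occur with other attributes in similar ways
    dist = sum((cond(acsdb, z, context | {a}) - cond(acsdb, z, context | {b})) ** 2
               for z in vocab)
    # assumption 2: a large p(a)p(b) with no co-occurrence is stronger evidence
    return prob(acsdb, {a}) * prob(acsdb, {b}) / (eps + dist)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"name", "phone", "address"}): 4,
    frozenset({"name", "telephone", "address"}): 3,
    frozenset({"name", "email"}): 2,
    frozenset({"make", "model"}): 3,
}
for pair in [("phone", "telephone"), ("phone", "email"), ("phone", "address")]:
    print(pair, synonym_score(acsdb, *pair))
```

Sorting candidate pairs by this score gives the kind of ranked synonym list evaluated later in the deck.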




Join Graph Traversal

Provides a useful way of navigating the huge graph of 2.6M schemas

Basic join graph

– Contains a node ‘N’ for each unique schema

– Undirected join link between any two schemas that share an attribute

Every schema that contains ‘name’ field is linked to every other schema that contains ‘name’

Cluster together similar schemas to minimize graph clutter
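The basic join graph described above (before any clustering) can be built by grouping schemas per attribute and linking every pair within a group. This is a minimal sketch; the function name, the use of list positions as node ids, and the sample schemas are assumptions, and the clustering step is not implemented here.

```python
from collections import defaultdict
from itertools import combinations

def build_join_graph(schemas):
    """schemas: list of attribute sets. Returns (i, j) -> set of shared attributes."""
    by_attr = defaultdict(set)  # attribute -> ids of schemas containing it
    for idx, schema in enumerate(schemas):
        for attr in schema:
            by_attr[attr].add(idx)
    edges = defaultdict(set)    # undirected edge stored once, with i < j
    for attr, nodes in by_attr.items():
        for i, j in combinations(sorted(nodes), 2):
            edges[(i, j)].add(attr)
    return edges

# Hypothetical schemas
schemas = [
    {"name", "address"},
    {"name", "phone"},
    {"make", "model"},
    {"phone", "email"},
]
edges = build_join_graph(schemas)
for (i, j), shared in sorted(edges.items()):
    print(i, j, sorted(shared))
```

Because a popular attribute such as `name` links every schema that contains it to every other one, the edge count grows quadratically per attribute, which is why the unclustered graph becomes cluttered.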

Schema: X,Y

Shared Attribute: D



Exp. Results – Relation Ranking

Rank-ACSDb (schemaRank) beats Naïve (simulates search engine users) by 78-100%

All of the non-Naïve solutions improve as k (number of results) increases



Exp. Results – Schema Auto Complete

Test Scenario

6 Humans designed schemas using given attributes

Auto-Complete tool got three tries

By the 3rd output, Auto-Complete was able to reproduce a large number of schemas

No test designer recognized ‘ab’ as an abbreviation for ‘at-bats’, baseball terminology



Exp. Results – Synonym Finding

Ranked by quality

An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones

Poor ranking will mix them together



Exp. Results – Join Graph Traversal



Conclusion

WebTables is the first large-scale attempt to extract relational information embedded in HTML tables

Relation Ranking

ACSDb uses

Schema auto complete

Attribute Synonym Finding

Join Graph Traversing

Adding a signal for source page quality, such as PageRank, should improve overall quality



Discussion

Pros

Handling tables separately for search is a good idea

Cons

Most of the paper is focused on uses of ACSDb
