DIADEM 1.0


DIADEM: Domain-centric, Intelligent, Automated Data Extraction

Tim Furche, Georg Gottlob, Giorgio Orsi

May 11th, 2011 @ Oxford University Computing Laboratory

Joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simaris, Cheng Wang

1 Web Data Extraction

Data on the Web

there is more of it than we can use

the challenge is no longer availability, but finding, integrating, analysing, …


Surface vs. Deep Web

deep web estimated at 500 × the size of the surface web

estimated 400 000 deep web databases

What?

Products (stores)

Directories (yellow pages)

Catalogs (libraries)

Public DBs (publications, census, data.gov,…)

Public services (weather, location, …)

And it’s not just one haystack …



The Web is more than HTML


Overview

Introducing Web Data Extraction

Scenarios

Why now?

Supervised Web Data Extraction

Unsupervised Web Data Extraction

DIADEM

OPAL

AMBER

OXPath

IVLIA

Datalog±


1.1 Web Data Extraction: Scenarios

The Need for Web Data Extraction

information

drives business (decision making, trend analysis, …)

available in troves on the internet

but: available as HTML made for humans, not as structured data

companies need

product specifications

pricing information

market trends

regulatory information


keyword search fails


Scenario ➀: Electronics retailer

electronics retailer: online market intelligence

comprehensive overview of the market

daily information on price, shipping costs, trends, product mix

by product, geographical region, or competitor

thousands of products

hundreds of competitors

nowadays: specialised companies

mostly manual, interpolation

large cost


Scenario ➁: Supermarket chain

supermarket chain

competitors’ product prices

special offers and promotions (time-sensitive)

new products, product formats & packaging


Scenario ➂: Hotel Agency

online travel agency

best price guarantee

prices of competing agencies

average market price


Scenario ➃: Hedge Fund

house price index

published at regular intervals by the national statistics agency

affects share values of various industries

hedge fund

online market intelligence to predict the house price index


And a lot more …

monitor blogs and forums

market intelligence, e.g., complaints, common problems

customer opinions

ranking and analysing product reviews

financial analysts

monitor trends and stats for products of a certain company / category

interest rates from financial institutions

press releases and financial reports

patent search & analysis


1.2 Web Data Extraction: Why Now?

Scale


Applications


How to book a flight?


Structured Data


Why Web Data Extraction Now?

Why now? Trends

Trend ➊: scale—every business is online

automation at scale

Trend ➋: web applications rather than web documents

automated form filling (deep web navigation)

Trend ➌: structured, common-sense data available

allows more sophisticated automated analysis

also a tool for improved data extraction?


2 Web Data Extraction: Supervised

manual (e.g., Web Harvest):

user writes the wrapper, sometimes using wrapping libraries

supervised (e.g., Lixto):

user provides examples and refines the wrapper

semi-supervised:

user provides examples (per site), wrapper is automatically learned

unsupervised: entirely automated (e.g., DIADEM)

some systems omit examples and run analysis directly on all pages

some systems automatically guess examples


Supervised Web Data Extraction

User interaction is needed to

record interaction sequences (such as form fillings)

visually select examples for data

rather than manually writing wrappers in a programming language

Current gold standard for high-accuracy extraction

Examples:

Lixto

Automation Anywhere

Web Harvest


Lixto: Extraction & Analysis

Lixto: sophisticated, visual semi-automated extraction tool

visually select examples, automatically derive patterns, verify results

highly scalable extraction and processing with Lixto server

but also: data integration & business analytics suite

data cleaning

data flow scenarios: merge & filter from different web sites

market intelligence & analytics

3 Web Data Extraction: Unsupervised

17 000 real estate sites in the UK alone

Why Automate Data Extraction?

Too many fish in the pond

> 17 000 UK real estate sites

similar for restaurants, travel, airlines, pharmacies, retail shops, …

aggregators cover only a fraction

updated slowly

⇒ per-site manual work infeasible

wrapper construction too expensive

tracking changes

⇒ excludes manual & (semi-)supervised approaches

Why Automate Data Extraction?

All the fish are different

large, modern aggregators (> 100 000)

nation-wide agencies (> 10 000)

agencies covering a single quarter (< 15)

⇒ no single unsupervised wrapper can do this today


… and we really need it!

search engine providers (Google, Microsoft, Yahoo!) all work on information and data extraction for “vertical”, “object”, and “semantic” search

turn search engines into knowledge bases for decision support


“No one really has done this successfully at scale yet.”

Raghu Ramakrishnan, Yahoo!, March 2009

“Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.”

Alon Halevy, Google, Feb. 2009


Unsupervised: The Story so Far

Key observation:

“database” web sites are generated using templates

wrapper generators need to automatically identify these templates

Two major approaches

machine learning from a few hand-labeled examples

similar to semi-supervised, but only one set of examples for an entire domain

high precision only for simple domains (single entity type, few attributes)

fully automatically exploit the repeated structure of result pages

good precision needs a lot of data (many records per page, many pages)

doesn’t work for forms (no repetition)


4 DIADEM

Domain-Centric Data Extraction

Blackbox analyser that turns any of the thousands of websites of a domain into structured data

host of domain-specific annotators


domain ontology & phenomenology


+ everything the others are doing

machine learning for classification

template discovery


DIADEM: Overview

DIADEM combines

a host of domain-specific annotators

gives us a first “guess” to automatically generate examples

with a high-level ontology about domain entities and their phenomenology on web sites of the domain

allows us to verify & refine examples

+ advances in existing techniques for

repeated structure analysis

page & block classification

bottom-up understanding & top-down reasoning


4.1 DEMO

DIADEM 0.1: First prototype


4.2 OPAL: Ontologies for Form Analysis

Diversity


OPAL: Overview

Three-step process:

browser extraction and annotation

labelling & segmentation

classification (phenomenological mapping)

Model-based, knowledge-driven

latter two steps are model transformations

thin layer of domain-dependent concepts

field types and labels

triggers for field & form creation (sketched below)
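To make the last two bullets concrete, here is a purely illustrative sketch of what such domain-dependent rules could look like, written in the rule notation used later in the Datalog± part of this deck; all predicate names (inputField, textNode, precedes, sameSegment, annotatedAs) are invented for illustration and are not OPAL’s actual model:

inputField(F), textNode(T), precedes(T, F), sameSegment(F, T) → fieldLabel(F, T)

(a text node directly preceding an input field in the same segment is taken as that field’s label)

fieldLabel(F, T), annotatedAs(T, price) → fieldType(F, price)

(a field whose label carries the domain annotation “price” is classified as a price field)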


ICQ Data Set: Application to Other Domains


4.3 AMBER: Ontologies for Record Extraction


just the opposite order to OPAL

AMBER: Overview

Three-step process, like OPAL

browser extraction and annotation

classification (phenomenological mapping)

record segmentation (much harder than in OPAL)

Model-based, knowledge-driven

latter two steps are model transformations

thin layer of domain-dependent concepts

record and attribute types

triggers for record & attribute creation


Repeating


Similarity


4.4 OXPath: Scalable, Memory-Efficient Web Extraction

How to book a flight?


How to find a flat?


How to find a flat with OXPath

Start at rightmove.co.uk: doc("rightmove.co.uk")

Fill “oxford” into the first visible field: /descendant::field()[1]/{"oxford"}

Click on the next button (the second following field): /following::field()[2]/{click /}

On the refinement form, just continue by clicking on the last field: /descendant::field()[last()]/{click /}

Grab all the prices: //p.price (assembled into a full expression below)
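Chaining these five steps gives a single OXPath expression. The following is a minimal sketch assuming the rightmove.co.uk pages behave as described above; to actually output values, an extraction marker such as :<price=string(.)> would typically be attached to the last step:

doc("rightmove.co.uk")
  /descendant::field()[1]/{"oxford"}
  /following::field()[2]/{click /}
  /descendant::field()[last()]/{click /}
  //p.price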


State of Web Extraction

No interaction with rich, scripted interfaces

no actions other than form filling and submission

➀ Imperative extraction scripts

explicit variable assignments, flow control, etc.

either proprietary selection language or mix of XPath & external flow control

➁ Focus on automation and visual interfaces

no or very limited extraction language, only ad-hoc extractions

no multiway navigation, no optimization


Summary of Complexity

Fragment                        Time                Space
OXPath w/o Actions & Kleene     O(n^6 · q^2)        O(n^5 · q^2)
OXPath w/o Kleene               O((p·n)^6 · q^3)    O(n^5 · q^3)
OXPath w/o unbounded Kleene     O((p·n)^6 · q^3)    O(n^5 · q_Σ^3)
OXPath (full)                   O((p·n)^6 · q^3)    O(n^5 · (q+d)^3)
                                O(n^4 · q^2)        O(n^3 · q^2)
Combined complexity             PTime-hard          PTime-hard
Data complexity                 NLogSpace           LogSpace

Notes:

extraction markers = n-ary, nested queries

contextual actions (action-free prefix)

actions = multiple pages

buffer bounded by page depth

Constant Memory


even faster


4.5 IVLIA: Ontologies for PDF Extraction

PDF Analysis


Semantic Analysis and Annotation


4.6 Datalog±: Ontological Reasoning at Web Scale

Ontological Databases

Relational Schema:

person(ssn, name, birthdate)
employee(ssn, empID, name, birthdate, department)
department(depName, building)
project(projID, startDate, duration)
supervision(supervisor, supervised)
assignment(employee, project)

E/R Schema

Object Relational Schema

Taxonomy Definitions:

employee(X,Y,Z,W) → ∃V person(V,Y,Z)

project(X,Y,Z) → activity(X,Y,Z)

Concept Definitions:

employee(X1,Y1,Z1,W1,U1), supervision(Y1,Y2), employee(X2,Y2,Z2,W2,U2) → supervisor(X1,Y1,Z1,W1,U1)

(an employee who supervises another employee is a supervisor)

Ontological Constraints:

generalManager(X1,Y1,Z1,W1,U1) → supervision(Y1,Y1)

(a general manager supervises him/herself)
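As a small worked example of how such rules are applied (the employee fact below is invented purely for illustration): the first taxonomy rule is a tuple-generating dependency, so whenever an employee fact exists a matching person fact must exist as well, and the existentially quantified V is satisfied by a fresh labelled null:

employee(ssn1, ada, 1985, rnd)

→ (first taxonomy rule, with X = ssn1, Y = ada, Z = 1985, W = rnd)

person(ν1, ada, 1985), where ν1 is a fresh labelled null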


Big Picture

[Diagram: KR (expressiveness) vs. DB (efficiency)]

Q&A

diadem-project.info
