
AMBER WWW 2012 (Demonstration)


DESCRIPTION

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records, rather than relying on the noisy structure of the DOM. With this approach AMBER identifies records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations: starting from a small seed sample, it reaches 94.4% accuracy in recognizing UK locations after four learning iterations.


Page 1: AMBER WWW 2012 (Demonstration)

DIADEM: domain-centric intelligent automated data extraction methodology

Automatically Learning Gazetteers from the Deep Web

Christian Schallhart

April 19th, 2012 @ WWW in Lyon

joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang

Pages 2–10: AMBER WWW 2012 (Demonstration)

AMBER: Extraction from Result Pages

<offer>
  <price>
    <currency>GBP</currency>
    <amount>4000000</amount>
  </price>
  <bedrooms>5</bedrooms>
  <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>7</bedrooms>
  <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>6</bedrooms>
  <location>Old Boars Hill, Oxford</location>
</offer>

[Chart: precision and recall for data areas, records, and attributes, all between 97.5% and 100%]

>98.5% F1 score

Domain Knowledge (no per-site training):
• Little Ontology (mandatory): attribute types. Quite easy.
• Gazetteers: term lists. A lot of work!

Page 11: AMBER WWW 2012 (Demonstration)

AMBER: From Extraction to Learning

Gazetteers: term lists. A lot of work!

Leverage the repeated structure in result pages to learn new terms.

Pages 12–17: AMBER WWW 2012 (Demonstration)

AMBER: Automatically Learning Gazetteers

Gazetteers: term lists. A lot of work!

>98.5% F1 score

• Page Segmentation: clusters attribute instances, analyses repeated structures
• Attribute Alignment: matches knowledge with observations
• Gazetteer Learning: turns phrases into terms

AMBER annotates first to integrate semantic information into its repeated structure analysis.
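To make the three steps concrete, here is a minimal sketch of the overall learning loop as described on these slides. All names (annotate, segment, align, learn) are hypothetical placeholders passed in as callables, not AMBER's actual API.

```python
from typing import Callable, Iterable, Set

def learning_cycle(
    pages: Iterable,
    gazetteer: Set[str],
    annotate: Callable,  # (page, gazetteer) -> annotations
    segment: Callable,   # (page, annotations) -> records
    align: Callable,     # (records, annotations) -> attribute instances
    learn: Callable,     # (attributes, gazetteer) -> newly learned terms
) -> Set[str]:
    """Repeat annotate -> segment -> align -> learn until a round adds no new terms."""
    while True:
        new_terms: Set[str] = set()
        for page in pages:
            annotations = annotate(page, gazetteer)
            records = segment(page, annotations)
            attributes = align(records, annotations)
            new_terms |= learn(attributes, gazetteer)
        if not new_terms:                     # saturated: nothing learned in this round
            return gazetteer
        gazetteer = gazetteer | new_terms     # feed learned terms back into the gazetteer
```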

Pages 18–20: AMBER WWW 2012 (Demonstration)

AMBER: Page Segmentation

Page Retrieval: Mozilla via XULRunner, GATE annotations

Data Area Identification: pivot node clustering

A data area is a maximal DOM subtree, which
• contains ≥2 pivot nodes, which are
• depth consistent (depth(n) = k ± ε),
• distance consistent (pathlen(n, n') = k ± δ),
• continuous, such that
• their least common ancestor is the data area's root.

[Figure: example DOM tree with pivot nodes (P), attribute nodes (A), and identified data areas (D)]
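A rough sketch of the depth/distance consistency checks from this definition, assuming hypothetical depth and path_length helpers and example tolerances; continuity and maximality of the subtree are not modelled here.

```python
def consistent_pivot_cluster(pivots, depth, path_length, eps=1, delta=2):
    """Check the consistency conditions on a candidate cluster of pivot nodes:
    at least two pivots, depths within a common value +/- eps, and path lengths
    between consecutive pivots within a common value +/- delta."""
    if len(pivots) < 2:
        return False
    depths = [depth(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:        # depth(n) = k ± ε
        return False
    dists = [path_length(a, b) for a, b in zip(pivots, pivots[1:])]
    if max(dists) - min(dists) > 2 * delta:        # pathlen(n, n') = k ± δ
        return False
    return True
```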

Pages 21–23: AMBER WWW 2012 (Demonstration)

AMBER: Page Segmentation

Record Segmentation: head/tail cut-off, segmentation boundary shifting

A result record is a sequence of children of the data area root.

A result record segmentation divides a data area
• into non-overlapping records,
• containing the same number of siblings,
• each based on a single selected pivot node.
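As an illustration of these constraints (non-overlapping records, equal sibling counts, one pivot per record), a naive segmentation sketch; the head/tail cut-off and boundary shifting mentioned above are only hinted at in comments, and all names are hypothetical.

```python
def segment_data_area(children, pivot_indices):
    """Naively divide the children of a data-area root into records: each record
    starts at a pivot child, all records have the same number of siblings, and
    records do not overlap."""
    if len(pivot_indices) < 2:
        return None
    width = pivot_indices[1] - pivot_indices[0]
    if any(b - a != width for a, b in zip(pivot_indices, pivot_indices[1:])):
        return None                                 # pivots not evenly spaced
    start = pivot_indices[0]                        # head cut-off would trim children[:start]
    records = [children[i:i + width] for i in range(start, len(children), width)]
    return [r for r in records if len(r) == width]  # drop an incomplete tail

# Example: 10 children, pivots at positions 1, 4, 7 -> three records of width 3.
print(segment_data_area(list("abcdefghij"), [1, 4, 7]))
# [['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']]
```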

Pages 24–25: AMBER WWW 2012 (Demonstration)

AMBER: Attribute Alignment

The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.

The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
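A small sketch of the support computation, under the assumption that each record is represented as a set of (type, tag path) pairs; this representation is illustrative, not AMBER's data model.

```python
from collections import Counter

def support(records):
    """Support of a type/tag-path pair (t, p): the fraction of records that carry
    an annotation of type t at tag path p. Tag paths are written as plain strings."""
    counts = Counter(pair for record in records for pair in set(record))
    return {pair: count / len(records) for pair, count in counts.items()}

recs = [
    {("price", "li/div/span"), ("location", "li/div/h3/a")},
    {("price", "li/div/span"), ("location", "li/div/h3/a")},
    {("price", "li/div/span")},
    {("price", "li/div/span"), ("location", "li/div/p")},
]
sup = support(recs)
# sup[("price", "li/div/span")] == 1.0
# sup[("location", "li/div/h3/a")] == 0.5, sup[("location", "li/div/p")] == 0.25
```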

Pages 26–28: AMBER WWW 2012 (Demonstration)

AMBER: Attribute Alignment

• Attribute Cleanup: discard attributes with low support
• Attribute Disambiguation: discard ambiguous attributes with lower support
• Attribute Generalisation: add new un-annotated attributes with sufficient support
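The three alignment steps can be pictured as filters over such a support table; the thresholds and the per-type tie-breaking rule below are illustrative assumptions, not AMBER's exact procedure.

```python
def align_attributes(support_table, min_support=0.5, gen_support=0.5):
    """Illustrative version of the three alignment steps over a support table
    {(type, tag_path): support}:
      cleanup        - drop pairs whose support is below min_support;
      disambiguation - if one type is annotated at several paths, keep only the
                       best-supported path;
      generalisation - every surviving pair with support >= gen_support is applied
                       to all records, i.e. un-annotated nodes at that path gain
                       the annotation."""
    kept = {pair: s for pair, s in support_table.items() if s >= min_support}
    best = {}
    for (attr_type, path), s in kept.items():
        if attr_type not in best or s > best[attr_type][1]:
            best[attr_type] = (path, s)
    return {t: path for t, (path, s) in best.items() if s >= gen_support}
```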

Pages 29–32: AMBER WWW 2012 (Demonstration)

AMBER: Gazetteer Learning

• Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes
• Term Validation: track term relevance; discard irrelevant terms

Example: the attribute value "Oxford, Walton Street, top-floor apartment" is split into the terms "Oxford", "Walton Street", and "top-floor apartment".
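A sketch of term formulation with a simple blacklist check; the delimiter set and the blacklist contents are assumptions for illustration.

```python
import re

BLACKLIST = {"top-floor apartment"}   # hypothetical blacklist of non-location phrases

def formulate_terms(attribute_value, blacklist=BLACKLIST):
    """Split a newly annotated attribute value into candidate terms and drop
    blacklisted ones; terms from non-overlapping attributes would be filtered
    out in a later validation step."""
    candidates = [t.strip() for t in re.split(r"[,;/]", attribute_value) if t.strip()]
    return [t for t in candidates if t.lower() not in blacklist]

print(formulate_terms("Oxford, Walton Street, top-floor apartment"))
# ['Oxford', 'Walton Street']
```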

Pages 33–35: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

Learning locations from 250 pages from 150 sites (UK real estate market)

Starting with a 25% sample of our full gazetteer (which contains 33,243 terms)

• initially failed to annotate 328 locations
• after 3 learning rounds, learned 265 of those (recall: 80.6%, precision: 95.1%)

Page 36: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: learning and total precision/recall over learning rounds 1–3]

Page 37: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

!"##$%

&'(#)%

!"#$%&' !"#$%&( !"#$%&)

*

+*

'**

'+*

(**

(+*

)**

,-./$0&,"1.02"$3 4"//-10&,"1.02"$3

Friday, May 11, 2012

Page 38: AMBER WWW 2012 (Demonstration)

DEMO

Page 39: AMBER WWW 2012 (Demonstration)

AMBER GUI

1. Webpage with identified records and attributes
2. Domain schema concepts
3. Learned terms with confidence values
4. URLs for analysis
5. Seed gazetteer

AMBER Architecture

• Reasoning: Attribute Alignment, Record Segmentation, Data Area Identification
• Annotation: GATE, Domain Gazetteers
• Web Access: Browser Common API (WebKit, Mozilla)

AMBER Learning Evaluation

[Charts: learning accuracy (precision/recall per round) and terms learnt per round]

• Learning locations from 250 pages from 150 sites (UK real estate)

• Starting with 25% of the full gazetteer (containing 33,243 terms)

                       unannotated instances (328)   total instances (1484)
  rnd.  aligned  corr.   prec.      rec.              prec.      rec.
  1     226      196     86.7%      59.2%             84.7%      81.6%
  2     261      248     95.0%      74.9%             93.2%      91.0%
  3     271      265     97.8%      80.6%             95.1%      93.8%
  4     271      265     97.8%      80.6%             95.1%      93.8%

Table 1: Total learned instances

  rnd.  unannot.  recog.  corr.   prec.     rec.    terms
  1     331       225     196     86.7%     59.2%   262
  2     118       34      32      94.1%     27.1%   29
  3     79        16      16      100.0%    20.3%   4
  4     63        0       0       100.0%    0%      0

Table 2: Incrementally recognized instances and learned terms

… miles around Oxford. AMBER uses the content of the seed gazetteer to identify the position of known terms such as locations. In particular, AMBER identifies three potential new locations, namely “Oxford”, “Witney” and “Wallingford”, with confidences of 0.70, 0.62, and 0.61 respectively. Since the acceptance threshold for new items is 50%, all three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBER identifies new locations with increasing confidence as the number of analyzed websites grows. We then leave AMBER to run over 250 result pages from 150 sites of the UK real estate domain, in a configuration for fully automated learning, i.e., g = l = u = 50%, and we visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer), AMBER performs four learning iterations before it saturates, as it does not learn any new terms. Table 1 shows the outcome of each of the four rounds. Using the incomplete gazetteer, we initially fail to annotate 328 out of 1484 attribute instances. In the first round, the gazetteer learning step identifies 226 unannotated instances. 196 of those instances are correctly identified, which yields a precision and recall of 86.7% and 59.2% on the unannotated instances, and 84.7% and 81.6% on all instances. The increase in precision is stable in all the learning rounds so that, at the end of the fourth iteration, AMBER achieves a precision of 97.8% and a recall of 80.6% on the unannotated instances, and an overall precision and recall of 95.1% and 93.8%, respectively.

Table 2 shows the incremental improvements made in each round. For each round, we report the number of unannotated instances, the number of instances recognized through attribute alignment, and the number of correctly identified instances. For each round we also show the corresponding precision and recall metrics, as well as the number of new terms added to the gazetteer. Note that the number of learned terms is larger than the number of instances in round 1, as splitting them yields multiple terms. Conversely, in rounds 2 to 4, the number of terms is smaller than the number of instances, due to terms occurring in multiple instances simultaneously or already being blacklisted.

We also show the behavior of AMBER with different settings for the threshold g. In particular, increasing the value of g (i.e., the support for the discovered attributes) leads to higher precision of the learned terms at the cost of lower recall. The learning algorithm also converges faster for higher values of g.
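The acceptance step described above amounts to a threshold test on term confidence; a minimal sketch, assuming candidate terms come with precomputed confidences:

```python
def accept_terms(candidates, threshold=0.5):
    """Add candidate terms to the gazetteer when their confidence reaches the
    acceptance threshold (50% in the fully automated configuration)."""
    return {term for term, confidence in candidates.items() if confidence >= threshold}

new_locations = accept_terms({"Oxford": 0.70, "Witney": 0.62, "Wallingford": 0.61})
# all three candidates pass the 50% threshold and are added to the gazetteer
```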

Figure 4 illustrates our evaluation of AMBER on the real-estate domain. We evaluate AMBER on 150 UK real-estate web sites, randomly selected among 2810 web sites named in the yellow pages. For each site, we submit its main form with a fixed sequence of fillings to obtain one, or if possible, two result pages with at least two result records and compare AMBER’s results with a manually annotated gold standard. Using a full gazetteer, AMBER extracts data area, records, price, detailed page link, location, legal status, postcode and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number and bathroom number, precision remains at 98%, but recall drops to 94%. The result of our evaluation proves that AMBER is able to generate human-quality examples for any web site in a given domain.

Figure 4: AMBER Evaluation on Real-Estate Domain [chart: per-attribute precision and recall, 94%–100%]


• Fails to annotate 328 of 1,484 locations
• Saturated after 3 rounds

[Chart: precision and recall for data areas, records, and attributes (97.5%–100%)]

Real Estate (overall attributes, large scale)

[Chart: per-attribute precision and recall (reception, price, bathroom, legal status, detailed page, bedroom, location, postcode, property type), 94%–100%]

Used Cars (overall attributes)

[Charts: precision and recall for data areas, records, and attributes (97.5%–100%); per-attribute precision and recall (door num, detail page, make, fuel type, color, price, mileage, car type, trans, model, engine size, location), 90%–100%]

[Chart: per-attribute results (price, location, detailed page, bedroom, legal status, postcode, property type, bathroom, reception), 0%–100%]

[Table: evaluation corpus sizes; real estate (250 pages, manual) and used car, large scale (2,215 pages, automatic), with counts of pages, records, and attributes]

extracts attributes with >99% precision and >98% recall

Automatically Learning Gazetteers from the Deep Web

AMBER Learning Cycle

Page Segmentation
• Page Retrieval: Mozilla, GATE annotations
• Data Area Identification: pivot node (mandatory fields) clustering
• Record Segmentation: head/tail cut-off, segment boundary shifting

Attribute Alignment
• Attribute Cleanup: discard attributes of low support
• Attribute Disambiguation: discard redundant attributes of lower support
• Attribute Generalization: add new attributes of sufficient support

Gazetteer Learning
• Term Formulation: split new attributes into terms
• Term Validation: track term relevance, discard irrelevant terms

[Figure: annotated example records with pivot (P), attribute (A), and other (L, X) nodes feeding the gazetteers]

Example: "Oxford, Walton Street, top-floor apartment" splits into the terms "Oxford", "Walton Street", and "top-floor apartment".


Remove terms which occur
• in black lists,
• in other gazetteers.

Compute a term's confidence based on
• the support of its type/tag path pair,
• the relative size of the term within the entire attribute.
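One plausible way to combine these two factors, purely for illustration; the poster does not specify the exact combination AMBER uses.

```python
def term_confidence(term, attribute_value, pair_support):
    """Illustrative confidence: support of the attribute's type/tag-path pair,
    scaled by how much of the attribute's text the term covers."""
    coverage = len(term) / max(len(attribute_value), 1)
    return pair_support * coverage

term_confidence("Oxford", "Oxford, Walton Street, top-floor apartment", 0.9)
# a term covering most of its attribute gets a higher confidence than a short fragment
```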

Authors: Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang

Digital Home: diadem.cs.ox.ac.uk/[email protected]

Reasoning in Datalog (DLV) rules
• stratified negation
• finite domains
• non-recursive aggregation

Easy integration with domain knowledge.

Number of rules
• Data Area Identification: 11
• Record Segmentation: 32
• Attribute Alignment: 34

AMBER Applications

• Result Page Analysis: data extraction, example generation for wrapper induction
• Gazetteer Learning: gazetteer, ontology

Part of DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology): analyzing the pages reached via OPAL to generate OXPath expressions for efficient extraction.

... but usable independently of DIADEM as well ...

Sponsors

DIADEM: domain-centric intelligent automated data extraction methodology
diadem.cs.ox.ac.uk

P is only allowed to appear once, thus the second P with less support is dropped.

X only occurs once and has too low support to be kept.

A has a support of 3/4 at this node and hence we add the annotation.

We inferred that this node is of type A -- hence we learn its terms.

Demo: Friday all day, otherwise during breaks! Booth Number 1, well-hidden in the corner.


Page 41: AMBER WWW 2012 (Demonstration)

Backups


Page 42: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: learning and total precision/recall over learning rounds 1–3; see Tables 1 and 2 above]

unannotated instances (328) total instances (1484)

rnd. aligned corr. prec. rec. prec. rec.

1 226 196 86.7% 59.2% 84.7% 81.6%2 261 248 95.0% 74.9% 93.2% 91.0%3 271 265 97.8% 80.6% 95.1% 93.8%4 271 265 97.8% 80.6% 95.1% 93.8%

Table 1: Total learned instances

rnd. unannot. recog. corr. prec. rec. terms

1 331 225 196 86.7% 59.2% 2622 118 34 32 94.1% 27.1% 293 79 16 16 100.0% 20.3% 44 63 0 0 100.0% 0% 0

Table 2: Incrementally recognized instances and learned terms

miles around Oxford. AMBER uses the content of the seed gazetteerto identify the position of the known terms such as locations. In par-ticular, AMBER identifies three potential new locations, videlicet.“Oxford”, “Witney” and “Wallingford” with confidence of 0.70,0.62, and 0.61 respectively. Since the acceptance threshold for newitems is 50%, all the three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBERidentifies new locations with increasing confidence as the numberof analyzed websites grows. We then leave AMBER to run over250 result pages from 150 sites of the UK real estate domain, in aconfiguration for fully automated learning, i.e., g = l = u = 50%,and we visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer),AMBER performs four learning iterations, before it saturates, as itdoes not learn any new terms. Table 1 shows the outcome of each ofthe four rounds. Using the incomplete gazetteer, we initially fail toannotate 328 out of 1484 attribute instances. In the first round, thegazetteer learning step identifies 226 unannotated instances. 196of those instances are correctly identified, which yields a precisionand recall of 86.7% and 59.2% of the unannotated instances, 84.7%and 81.6% of all instances. The increase in precision is stable inall the learning rounds so that, at the end of the fourth iteration,AMBER achieves a precision of 97.8% and a recall of 80.6% of theunannotated instances, and an overall precision and recall of 95.1%and 93.8%, respectively.

Table 2 shows the incremental improvements made in eachround. For each round, we report the number of unannotated in-stances, the number of instances recognized through attribute align-ment, and the number of correctly identified instances. For eachround we also show the corresponding precision and recall met-rics, as well as the number of new terms added to the gazetteer.Note that the number of learned terms is larger than the number ofinstances in round 1, as splitting them yields multiple terms. Con-versely, in rounds 2 to 4, the number of terms is smaller than thenumber of instances, due to terms occurring in multiple instancessimultaneously or already blacklisted.

We also show the behavior of AMBER with different settings forthe threshold g. In particular, increasing the value of g (i.e., thesupport for the discovered attributes) leads to higher precision ofthe learned terms at the cost of lower recall. The learning algorithmalso converges faster for higher values of g.

Figure 4 illustrates our evaluation of AMBER on the real-estatedomain. We evaluate AMBER on 150 UK real-estate web sites, ran-domly selected among 2810 web sites named in the yellow pages.For each site, we submit its main form with a fixed sequence of

94.0%!

96.0%!

98.0%!

100.0%!

data

are

a!re

cord

s!pr

ice!

deta

ils U

RL!

locat

ion!

legal!

postc

ode!

bedr

oom!

prop

erty

type!

rece

ption!

bath!

precision! recall!

Figure 4: AMBER Evaluation on Real-Estate Domain

fillings to obtain one, or if possible, two result pages with at leasttwo result records and compare AMBER’s results with a manuallyannotated gold standard. Using a full gazetteer, AMBER extractsdata area, records, price, detailed page link, location, legal status,postcode and bedroom number with more than 98% precision andrecall. For less regular attributes such as property type, receptionnumber and bathroom number, precision remains at 98%, but re-call drops to 94%. The result of our evaluation proves that AMBERis able to generate human-quality examples for any web site in agiven domain.

4. REFERENCES
[1] V. Crescenzi and G. Mecca. Automatic Information Extraction from Large Websites. Journal of the ACM, 51(5):731–779, 2004.
[2] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). The University of Sheffield, Department of Computer Science, 2011.
[3] N. N. Dalvi, P. Bohannon, and F. Sha. Robust web extraction: an approach based on a probabilistic tree-edit model. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 335–348, 2009.
[4] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. Proceedings of the VLDB Endowment, 4(4):219–230, 2011.
[5] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Systems. Autonomous Agents and Multi-Agent Systems, 4:93–114, 2001.
[6] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge. In Proc. of WIDM, pages 9–16, 2008.
[7] K. Simon and G. Lausen. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In Proc. of the 14th ACM Conference on Information and Knowledge Management, pages 381–388, 2005.
[8] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-Assisted Data Extraction. ACM Transactions on Database Systems, 34(2), 2009.
[9] Y. Zhai and B. Liu. Structured Data Extraction from the Web Based on Partial Tree Alignment. IEEE Transactions on Knowledge and Data Engineering, 18(12):1614–1628, 2006.


Page 43: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: number of learned locations vs. correctly learned locations per learning round (Round 1 to Round 3), on a scale from 0 to 300.]

                              unannotated instances (328)    total instances (1484)
rnd.   aligned   corr.        prec.       rec.               prec.       rec.
1      226       196          86.7%       59.2%              84.7%       81.6%
2      261       248          95.0%       74.9%              93.2%       91.0%
3      271       265          97.8%       80.6%              95.1%       93.8%
4      271       265          97.8%       80.6%              95.1%       93.8%

Table 1: Total learned instances

rnd.   unannot.   recog.   corr.   prec.     rec.     terms
1      331        225      196     86.7%     59.2%    262
2      118        34       32      94.1%     27.1%    29
3      79         16       16      100.0%    20.3%    4
4      63         0        0       100.0%    0.0%     0

Table 2: Incrementally recognized instances and learned terms
