
AMBER WWW 2012 (Demonstration)


DESCRIPTION

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records, rather than relying on the noisy structure of the DOM. With this approach AMBER identifies records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations: starting from a small seed sample, it reaches 94.4% accuracy in recognizing UK locations after four learning iterations.


Page 1: AMBER WWW 2012 (Demonstration)

DIADEM: domain-centric intelligent automated data extraction methodology

Automatically Learning Gazetteers from the Deep Web

Christian Schallhart

April 19th, 2012 @ WWW in Lyon

joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang

Pages 2–10: AMBER WWW 2012 (Demonstration)

AMBER: Extraction from Result Pages

<offer>
  <price>
    <currency>GBP</currency>
    <amount>4000000</amount>
  </price>
  <bedrooms>5</bedrooms>
  <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>7</bedrooms>
  <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>6</bedrooms>
  <location>Old Boars Hill, Oxford</location>
</offer>

[Chart: precision and recall for data areas, records, and attributes, all between 97.5% and 100%]

>98.5% F1 score

Domain Knowledge (no per-site training):
• Little Ontology (mandatory): attribute types. Quite easy.
• Gazetteers: term lists. A lot of work!

Page 11: AMBER WWW 2012 (Demonstration)

AMBER: From Extraction to Learning

Gazetteers: term lists. A lot of work!

Leverage the repeated structure in result pages to learn new terms.

Pages 12–17: AMBER WWW 2012 (Demonstration)

AMBER: Automatically Learning Gazetteers

Gazetteers: term lists. A lot of work!

>98.5% F1 score

• Page Segmentation: clusters attribute instances, analyses repeated structures
• Attribute Alignment: matches knowledge with observations
• Gazetteer Learning: turns phrases into terms

AMBER annotates first to integrate semantic information into its repeated structure analysis.
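To make the three steps concrete, here is a minimal sketch of the overall learning loop as described on these slides. All names (annotate, segment, align, learn) are hypothetical placeholders passed in as callables, not AMBER's actual API.

```python
from typing import Callable, Iterable, Set

def learning_cycle(
    pages: Iterable,
    gazetteer: Set[str],
    annotate: Callable,  # (page, gazetteer) -> annotations
    segment: Callable,   # (page, annotations) -> records
    align: Callable,     # (records, annotations) -> attribute instances
    learn: Callable,     # (attributes, gazetteer) -> newly learned terms
) -> Set[str]:
    """Repeat annotate -> segment -> align -> learn until a round adds no new terms."""
    while True:
        new_terms: Set[str] = set()
        for page in pages:
            annotations = annotate(page, gazetteer)
            records = segment(page, annotations)
            attributes = align(records, annotations)
            new_terms |= learn(attributes, gazetteer)
        if not new_terms:                     # saturated: nothing learned in this round
            return gazetteer
        gazetteer = gazetteer | new_terms     # feed learned terms back into the gazetteer
```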

Pages 18–20: AMBER WWW 2012 (Demonstration)

AMBER: Page Segmentation

Page Retrieval: Mozilla via XULRunner, GATE annotations

Data Area Identification: pivot node clustering

A data area is a maximal DOM subtree, which
• contains ≥2 pivot nodes, which are
• depth consistent (depth(n) = k ± ε),
• distance consistent (pathlen(n, n') = k ± δ),
• continuous, such that
• their least common ancestor is the data area's root.

[Figure: example DOM tree with pivot nodes (P), attribute nodes (A), and identified data areas (D)]
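A rough sketch of the depth/distance consistency checks from this definition, assuming hypothetical depth and path_length helpers and example tolerances; continuity and maximality of the subtree are not modelled here.

```python
def consistent_pivot_cluster(pivots, depth, path_length, eps=1, delta=2):
    """Check the consistency conditions on a candidate cluster of pivot nodes:
    at least two pivots, depths within a common value +/- eps, and path lengths
    between consecutive pivots within a common value +/- delta."""
    if len(pivots) < 2:
        return False
    depths = [depth(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:        # depth(n) = k ± ε
        return False
    dists = [path_length(a, b) for a, b in zip(pivots, pivots[1:])]
    if max(dists) - min(dists) > 2 * delta:        # pathlen(n, n') = k ± δ
        return False
    return True
```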

Pages 21–23: AMBER WWW 2012 (Demonstration)

AMBER: Page Segmentation

Record Segmentation: head/tail cut-off, segmentation boundary shifting

A result record is a sequence of children of the data area root.

A result record segmentation divides a data area
• into non-overlapping records,
• containing the same number of siblings,
• each based on a single selected pivot node.
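As an illustration of these constraints (non-overlapping records, equal sibling counts, one pivot per record), a naive segmentation sketch; the head/tail cut-off and boundary shifting mentioned above are only hinted at in comments, and all names are hypothetical.

```python
def segment_data_area(children, pivot_indices):
    """Naively divide the children of a data-area root into records: each record
    starts at a pivot child, all records have the same number of siblings, and
    records do not overlap."""
    if len(pivot_indices) < 2:
        return None
    width = pivot_indices[1] - pivot_indices[0]
    if any(b - a != width for a, b in zip(pivot_indices, pivot_indices[1:])):
        return None                                 # pivots not evenly spaced
    start = pivot_indices[0]                        # head cut-off would trim children[:start]
    records = [children[i:i + width] for i in range(start, len(children), width)]
    return [r for r in records if len(r) == width]  # drop an incomplete tail

# Example: 10 children, pivots at positions 1, 4, 7 -> three records of width 3.
print(segment_data_area(list("abcdefghij"), [1, 4, 7]))
# [['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']]
```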

Pages 24–25: AMBER WWW 2012 (Demonstration)

AMBER: Attribute Alignment

The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.

The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
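A small sketch of the support computation, under the assumption that each record is represented as a set of (type, tag path) pairs; this representation is illustrative, not AMBER's data model.

```python
from collections import Counter

def support(records):
    """Support of a type/tag-path pair (t, p): the fraction of records that carry
    an annotation of type t at tag path p. Tag paths are written as plain strings."""
    counts = Counter(pair for record in records for pair in set(record))
    return {pair: count / len(records) for pair, count in counts.items()}

recs = [
    {("price", "li/div/span"), ("location", "li/div/h3/a")},
    {("price", "li/div/span"), ("location", "li/div/h3/a")},
    {("price", "li/div/span")},
    {("price", "li/div/span"), ("location", "li/div/p")},
]
sup = support(recs)
# sup[("price", "li/div/span")] == 1.0
# sup[("location", "li/div/h3/a")] == 0.5, sup[("location", "li/div/p")] == 0.25
```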

Pages 26–28: AMBER WWW 2012 (Demonstration)

AMBER: Attribute Alignment

• Attribute Cleanup: discard attributes with low support
• Attribute Disambiguation: discard ambiguous attributes with lower support
• Attribute Generalisation: add new un-annotated attributes with sufficient support
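The three alignment steps can be pictured as filters over such a support table; the thresholds and the per-type tie-breaking rule below are illustrative assumptions, not AMBER's exact procedure.

```python
def align_attributes(support_table, min_support=0.5, gen_support=0.5):
    """Illustrative version of the three alignment steps over a support table
    {(type, tag_path): support}:
      cleanup        - drop pairs whose support is below min_support;
      disambiguation - if one type is annotated at several paths, keep only the
                       best-supported path;
      generalisation - every surviving pair with support >= gen_support is applied
                       to all records, i.e. un-annotated nodes at that path gain
                       the annotation."""
    kept = {pair: s for pair, s in support_table.items() if s >= min_support}
    best = {}
    for (attr_type, path), s in kept.items():
        if attr_type not in best or s > best[attr_type][1]:
            best[attr_type] = (path, s)
    return {t: path for t, (path, s) in best.items() if s >= gen_support}
```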

Pages 29–32: AMBER WWW 2012 (Demonstration)

AMBER: Gazetteer Learning

• Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes
• Term Validation: track term relevance; discard irrelevant terms

Example: the attribute value "Oxford, Walton Street, top-floor apartment" is split into the terms "Oxford", "Walton Street", and "top-floor apartment".
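A sketch of term formulation with a simple blacklist check; the delimiter set and the blacklist contents are assumptions for illustration.

```python
import re

BLACKLIST = {"top-floor apartment"}   # hypothetical blacklist of non-location phrases

def formulate_terms(attribute_value, blacklist=BLACKLIST):
    """Split a newly annotated attribute value into candidate terms and drop
    blacklisted ones; terms from non-overlapping attributes would be filtered
    out in a later validation step."""
    candidates = [t.strip() for t in re.split(r"[,;/]", attribute_value) if t.strip()]
    return [t for t in candidates if t.lower() not in blacklist]

print(formulate_terms("Oxford, Walton Street, top-floor apartment"))
# ['Oxford', 'Walton Street']
```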

Pages 33–35: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

Learning locations from 250 pages from 150 sites (UK real estate market)

Starting with a 25% sample of our full gazetteer (which contains 33,243 terms)

• initially failed to annotate 328 locations
• after 3 learning rounds, learned 265 of those (recall: 80.6%, precision: 95.1%)

Page 36: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: learning and total precision/recall over learning rounds 1–3]

Page 37: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

!"##$%

&'(#)%

!"#$%&' !"#$%&( !"#$%&)

*

+*

'**

'+*

(**

(+*

)**

,-./$0&,"1.02"$3 4"//-10&,"1.02"$3

Friday, May 11, 2012

Page 38: AMBER WWW 2012 (Demonstration)

DEMO

Page 39: AMBER WWW 2012 (Demonstration)

AMBER GUI

1. Webpage with identified records and attributes
2. Domain schema concepts
3. Learned terms with confidence values
4. URLs for analysis
5. Seed gazetteer

AMBER Architecture

• Reasoning: Attribute Alignment, Record Segmentation, Data Area Identification
• Annotation: GATE, Domain Gazetteers
• Web Access: Browser Common API (WebKit, Mozilla)

AMBER Learning Evaluation

[Charts: learning accuracy (precision/recall per round) and terms learnt per round]

• Learning locations from 250 pages from 150 sites (UK real estate)

• Starting with 25% of the full gazetteer (containing 33,243 terms)

                       unannotated instances (328)   total instances (1484)
  rnd.  aligned  corr.   prec.      rec.              prec.      rec.
  1     226      196     86.7%      59.2%             84.7%      81.6%
  2     261      248     95.0%      74.9%             93.2%      91.0%
  3     271      265     97.8%      80.6%             95.1%      93.8%
  4     271      265     97.8%      80.6%             95.1%      93.8%

Table 1: Total learned instances

  rnd.  unannot.  recog.  corr.   prec.     rec.    terms
  1     331       225     196     86.7%     59.2%   262
  2     118       34      32      94.1%     27.1%   29
  3     79        16      16      100.0%    20.3%   4
  4     63        0       0       100.0%    0%      0

Table 2: Incrementally recognized instances and learned terms

… miles around Oxford. AMBER uses the content of the seed gazetteer to identify the position of known terms such as locations. In particular, AMBER identifies three potential new locations, namely “Oxford”, “Witney” and “Wallingford”, with confidences of 0.70, 0.62, and 0.61 respectively. Since the acceptance threshold for new items is 50%, all three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBER identifies new locations with increasing confidence as the number of analyzed websites grows. We then leave AMBER to run over 250 result pages from 150 sites of the UK real estate domain, in a configuration for fully automated learning, i.e., g = l = u = 50%, and we visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer), AMBER performs four learning iterations before it saturates, as it does not learn any new terms. Table 1 shows the outcome of each of the four rounds. Using the incomplete gazetteer, we initially fail to annotate 328 out of 1484 attribute instances. In the first round, the gazetteer learning step identifies 226 unannotated instances. 196 of those instances are correctly identified, which yields a precision and recall of 86.7% and 59.2% on the unannotated instances, and 84.7% and 81.6% on all instances. The increase in precision is stable in all the learning rounds so that, at the end of the fourth iteration, AMBER achieves a precision of 97.8% and a recall of 80.6% on the unannotated instances, and an overall precision and recall of 95.1% and 93.8%, respectively.

Table 2 shows the incremental improvements made in each round. For each round, we report the number of unannotated instances, the number of instances recognized through attribute alignment, and the number of correctly identified instances. For each round we also show the corresponding precision and recall metrics, as well as the number of new terms added to the gazetteer. Note that the number of learned terms is larger than the number of instances in round 1, as splitting them yields multiple terms. Conversely, in rounds 2 to 4, the number of terms is smaller than the number of instances, due to terms occurring in multiple instances simultaneously or already being blacklisted.

We also show the behavior of AMBER with different settings for the threshold g. In particular, increasing the value of g (i.e., the support for the discovered attributes) leads to higher precision of the learned terms at the cost of lower recall. The learning algorithm also converges faster for higher values of g.
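The acceptance step described above amounts to a threshold test on term confidence; a minimal sketch, assuming candidate terms come with precomputed confidences:

```python
def accept_terms(candidates, threshold=0.5):
    """Add candidate terms to the gazetteer when their confidence reaches the
    acceptance threshold (50% in the fully automated configuration)."""
    return {term for term, confidence in candidates.items() if confidence >= threshold}

new_locations = accept_terms({"Oxford": 0.70, "Witney": 0.62, "Wallingford": 0.61})
# all three candidates pass the 50% threshold and are added to the gazetteer
```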

Figure 4 illustrates our evaluation of AMBER on the real-estate domain. We evaluate AMBER on 150 UK real-estate web sites, randomly selected among 2810 web sites named in the yellow pages. For each site, we submit its main form with a fixed sequence of fillings to obtain one, or if possible, two result pages with at least two result records and compare AMBER’s results with a manually annotated gold standard. Using a full gazetteer, AMBER extracts data area, records, price, detailed page link, location, legal status, postcode and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number and bathroom number, precision remains at 98%, but recall drops to 94%. The result of our evaluation proves that AMBER is able to generate human-quality examples for any web site in a given domain.

Figure 4: AMBER Evaluation on Real-Estate Domain [chart: per-attribute precision and recall, 94%–100%]


• Fails to annotate 328 of 1,484 locations
• Saturated after 3 rounds

[Chart: precision and recall for data areas, records, and attributes (97.5%–100%)]

Real Estate (overall attributes, large scale)

[Chart: per-attribute precision and recall (reception, price, bathroom, legal status, detailed page, bedroom, location, postcode, property type), 94%–100%]

Used Cars (overall attributes)

[Charts: precision and recall for data areas, records, and attributes (97.5%–100%); per-attribute precision and recall (door num, detail page, make, fuel type, color, price, mileage, car type, trans, model, engine size, location), 90%–100%]

[Chart: per-attribute results (price, location, detailed page, bedroom, legal status, postcode, property type, bathroom, reception), 0%–100%]

[Table: evaluation corpus sizes; real estate (250 pages, manual) and used car, large scale (2,215 pages, automatic), with counts of pages, records, and attributes]

extracts attributes with >99% precision and >98% recall

Automatically Learning Gazetteers from the Deep Web

AMBER Learning Cycle

Page Segmentation
• Page Retrieval: Mozilla, GATE annotations
• Data Area Identification: pivot node (mandatory fields) clustering
• Record Segmentation: head/tail cut-off, segment boundary shifting

Attribute Alignment
• Attribute Cleanup: discard attributes of low support
• Attribute Disambiguation: discard redundant attributes of lower support
• Attribute Generalization: add new attributes of sufficient support

Gazetteer Learning
• Term Formulation: split new attributes into terms
• Term Validation: track term relevance, discard irrelevant terms

[Figure: annotated example records with pivot (P), attribute (A), and other (L, X) nodes feeding the gazetteers]

Example: "Oxford, Walton Street, top-floor apartment" splits into the terms "Oxford", "Walton Street", and "top-floor apartment".


Remove terms which occur
• in black lists,
• in other gazetteers.

Compute a term's confidence based on
• the support of its type/tag path pair,
• the relative size of the term within the entire attribute.
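One plausible way to combine these two factors, purely for illustration; the poster does not specify the exact combination AMBER uses.

```python
def term_confidence(term, attribute_value, pair_support):
    """Illustrative confidence: support of the attribute's type/tag-path pair,
    scaled by how much of the attribute's text the term covers."""
    coverage = len(term) / max(len(attribute_value), 1)
    return pair_support * coverage

term_confidence("Oxford", "Oxford, Walton Street, top-floor apartment", 0.9)
# a term covering most of its attribute gets a higher confidence than a short fragment
```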

Authors: Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang

Digital Home: diadem.cs.ox.ac.uk/[email protected]

Reasoning in Datalog (DLV) rules
• stratified negation
• finite domains
• non-recursive aggregation

Easy integration with domain knowledge.

Number of rules
• Data Area Identification: 11
• Record Segmentation: 32
• Attribute Alignment: 34

AMBER Applications

• Result Page Analysis: data extraction, example generation for wrapper induction
• Gazetteer Learning: gazetteer, ontology

Part of DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology): analyzing the pages reached via OPAL to generate OXPath expressions for efficient extraction.

... but usable independently of DIADEM as well ...

Sponsors

DIADEM: domain-centric intelligent automated data extraction methodology
diadem.cs.ox.ac.uk

P is only allowed to appear once, thus the second P with less support is dropped.

X only occurs once and has too low support to be kept.

A has a support of 3/4 at this node and hence we add the annotation.

We inferred that this node is of type A -- hence we learn its terms.

Demo: Friday all day, otherwise during breaks! Booth Number 1, well-hidden in the corner.


Page 41: AMBER WWW 2012 (Demonstration)

Backups


Page 42: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: learning and total precision/recall over learning rounds 1–3; see Tables 1 and 2 above]

unannotated instances (328) total instances (1484)

rnd. aligned corr. prec. rec. prec. rec.

1 226 196 86.7% 59.2% 84.7% 81.6%2 261 248 95.0% 74.9% 93.2% 91.0%3 271 265 97.8% 80.6% 95.1% 93.8%4 271 265 97.8% 80.6% 95.1% 93.8%

Table 1: Total learned instances

rnd. unannot. recog. corr. prec. rec. terms

1 331 225 196 86.7% 59.2% 2622 118 34 32 94.1% 27.1% 293 79 16 16 100.0% 20.3% 44 63 0 0 100.0% 0% 0

Table 2: Incrementally recognized instances and learned terms

miles around Oxford. AMBER uses the content of the seed gazetteerto identify the position of the known terms such as locations. In par-ticular, AMBER identifies three potential new locations, videlicet.“Oxford”, “Witney” and “Wallingford” with confidence of 0.70,0.62, and 0.61 respectively. Since the acceptance threshold for newitems is 50%, all the three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBERidentifies new locations with increasing confidence as the numberof analyzed websites grows. We then leave AMBER to run over250 result pages from 150 sites of the UK real estate domain, in aconfiguration for fully automated learning, i.e., g = l = u = 50%,and we visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer),AMBER performs four learning iterations, before it saturates, as itdoes not learn any new terms. Table 1 shows the outcome of each ofthe four rounds. Using the incomplete gazetteer, we initially fail toannotate 328 out of 1484 attribute instances. In the first round, thegazetteer learning step identifies 226 unannotated instances. 196of those instances are correctly identified, which yields a precisionand recall of 86.7% and 59.2% of the unannotated instances, 84.7%and 81.6% of all instances. The increase in precision is stable inall the learning rounds so that, at the end of the fourth iteration,AMBER achieves a precision of 97.8% and a recall of 80.6% of theunannotated instances, and an overall precision and recall of 95.1%and 93.8%, respectively.

Table 2 shows the incremental improvements made in eachround. For each round, we report the number of unannotated in-stances, the number of instances recognized through attribute align-ment, and the number of correctly identified instances. For eachround we also show the corresponding precision and recall met-rics, as well as the number of new terms added to the gazetteer.Note that the number of learned terms is larger than the number ofinstances in round 1, as splitting them yields multiple terms. Con-versely, in rounds 2 to 4, the number of terms is smaller than thenumber of instances, due to terms occurring in multiple instancessimultaneously or already blacklisted.

We also show the behavior of AMBER with different settings forthe threshold g. In particular, increasing the value of g (i.e., thesupport for the discovered attributes) leads to higher precision ofthe learned terms at the cost of lower recall. The learning algorithmalso converges faster for higher values of g.

Figure 4 illustrates our evaluation of AMBER on the real-estatedomain. We evaluate AMBER on 150 UK real-estate web sites, ran-domly selected among 2810 web sites named in the yellow pages.For each site, we submit its main form with a fixed sequence of

94.0%!

96.0%!

98.0%!

100.0%!

data

are

a!re

cord

s!pr

ice!

deta

ils U

RL!

locat

ion!

legal!

postc

ode!

bedr

oom!

prop

erty

type!

rece

ption!

bath!

precision! recall!

Figure 4: AMBER Evaluation on Real-Estate Domain

fillings to obtain one, or if possible, two result pages with at leasttwo result records and compare AMBER’s results with a manuallyannotated gold standard. Using a full gazetteer, AMBER extractsdata area, records, price, detailed page link, location, legal status,postcode and bedroom number with more than 98% precision andrecall. For less regular attributes such as property type, receptionnumber and bathroom number, precision remains at 98%, but re-call drops to 94%. The result of our evaluation proves that AMBERis able to generate human-quality examples for any web site in agiven domain.

4. REFERENCES
[1] V. Crescenzi and G. Mecca. Automatic Information Extraction from Large Websites. Journal of the ACM, 51(5):731–779, 2004.
[2] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). The University of Sheffield, Department of Computer Science, 2011.
[3] N. N. Dalvi, P. Bohannon, and F. Sha. Robust web extraction: an approach based on a probabilistic tree-edit model. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 335–348, 2009.
[4] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. Proceedings of the VLDB Endowment, 4(4):219–230, 2011.
[5] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Systems. Autonomous Agents and Multi-Agent Systems, 4:93–114, 2001.
[6] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge. In Proc. of WIDM, pages 9–16, 2008.
[7] K. Simon and G. Lausen. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In Proc. of the 14th ACM Conference on Information and Knowledge Management, pages 381–388, 2005.
[8] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-Assisted Data Extraction. ACM Transactions on Database Systems, 34(2), 2009.
[9] Y. Zhai and B. Liu. Structured Data Extraction from the Web Based on Partial Tree Alignment. IEEE Transactions on Knowledge and Data Engineering, 18(12):1614–1628, 2006.


Page 43: AMBER WWW 2012 (Demonstration)

AMBER: Evaluation

[Chart: number of learned locations vs. correctly learned locations per learning round (Round 1 to Round 3), on a scale from 0 to 300.]

                              unannotated instances (328)    total instances (1484)
rnd.   aligned   corr.        prec.       rec.               prec.       rec.
1      226       196          86.7%       59.2%              84.7%       81.6%
2      261       248          95.0%       74.9%              93.2%       91.0%
3      271       265          97.8%       80.6%              95.1%       93.8%
4      271       265          97.8%       80.6%              95.1%       93.8%

Table 1: Total learned instances

rnd.   unannot.   recog.   corr.   prec.     rec.     terms
1      331        225      196     86.7%     59.2%    262
2      118        34       32      94.1%     27.1%    29
3      79         16       16      100.0%    20.3%    4
4      63         0        0       100.0%    0.0%     0

Table 2: Incrementally recognized instances and learned terms
