
Diadem DBOnto Kick Off meeting


http://diadem.cs.ox.ac.uk/

Page 1: Diadem DBOnto Kick Off meeting

WELCOME

DIADEM: domain-centric, intelligent, automated data extraction methodology

Web data as you want it

Page 2: Diadem DBOnto Kick Off meeting

TEAM

Georg Gottlob, Professor, FRS: project lead, scientific director
Tim Furche, Postdoc: technical director
Giovanni Grasso, Postdoc: extraction infrastructure
Giorgio Orsi, Postdoc: knowledge modelling
Christian Schallhart, Postdoc: software engineering
Xiaonan Guo, Postdoc: forms and interaction

Page 3: Diadem DBOnto Kick Off meeting

TEAM

Omer Gunes, D.Phil. student
Jinsong Guo, D.Phil. student
Andrew Sellers, Captain USAF, former D.Phil. student
Andrey Kravchenko, D.Phil. student
Stefano Ortona, D.Phil. student
Cheng Wang, D.Phil. student

Page 4: Diadem DBOnto Kick Off meeting

FUNDING

$3.4M

CONCLUSION: ~$5M, equity-free investment in basic, unique technology

Page 5: Diadem DBOnto Kick Off meeting

DIADEM helps you collect the right data

Page 6: Diadem DBOnto Kick Off meeting

DIADEM shovel for the data science rush

Page 7: Diadem DBOnto Kick Off meeting

"Data scientists […] spend 50 to 80 percent of their time […] collecting and preparing […] digital data […] from sensors, documents, the web and conventional databases."

Steve Lohr, New York Times, Aug. 2014

50–80%

Page 8: Diadem DBOnto Kick Off meeting

INTRODUCTION: Data … is still a pain

○ Data exists, but getting and using it is hard

◗ For example, when you are making decisions

○ Tipping point: tech leaders leverage data to striking effect

◗ Amazon, Walmart, Google

○ What about the rest of the world?

Page 9: Diadem DBOnto Kick Off meeting

"You can't do this manually, you're never going to find enough data scientists and analysts."

Sharmila Shahani-Mulligan, CEO of ClearStory (New York Times, Aug. 2014)

collect & prepare data

Page 10: Diadem DBOnto Kick Off meeting

INTRODUCTION: … but there is a remedy

○ We can get you the data you need in the form you need

◗ from competitors

◗ from open sources

◗ from your intranet

○ At any scale, covering popular as well as long tail sources

○ Far more comprehensive than manual solutions

○ Far cheaper even than a partial, manual solution

Page 11: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): What? Data Extraction

ref-code | postcode | bedrooms | bathrooms | available  | price
33453    | OX2 6AR  | 3        | 2         | 15/10/2013 | £1280 pcm
33433    | OX4 7DG  | 2        | 1         | 18/04/2013 | £995 pcm

Page 12: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): What? Data Extraction

ref-code | postcode | bedrooms | bathrooms | available  | price
33453    | OX2 6AR  | 3        | 2         | 15/10/2013 | £1280 pcm
33433    | OX4 7DG  | 2        | 1         | 18/04/2013 | £995 pcm

>10,000

Page 13: Diadem DBOnto Kick Off meeting

Scale — what it’s all about

Page 14: Diadem DBOnto Kick Off meeting

"For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database."

Nilesh Dalvi, Yahoo!

Page 15: Diadem DBOnto Kick Off meeting

"No one really has done this successfully at scale yet."

Raghu Ramakrishnan, Yahoo!

Page 16: Diadem DBOnto Kick Off meeting

"Current technologies are not good enough yet."

Alon Halevy, Google

Page 17: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): Technology: Our Strength

10,493 sites from the real-estate and used-car domains
92%: effective wrappers for more than 92% of sites on average
97%: precision of extracted primary attributes
20 days on a 45-node Amazon EC2 cluster
2.1 days (one expert) to adjust the system to a new domain

Page 18: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): Technology: Our Strength

[Chart: extraction time in seconds (0–2000) against number of records (0–1000)]

Page 19: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM)

Phenomenology: appearance of objects on the web; the reason for DIADEM's high accuracy; easily adapted to new domains

Rule-based AI: declarative rules instead of heuristics; uniform query of pages, phenomenology, …; all domain-independent

Self-organising: adjusts itself to observations on the pages; different sequence of tasks for every site; strong isolation of components

Page 20: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): http://diadem.cs.ox.ac.uk/demo

Page 21: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): Data extraction isn't new …

Manual: very common; scaling costly
Supervised (human + algorithm): most commercial products
Automatic (fully algorithmic): active research

+ magic

Page 22: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): Competitors

Mozenda, Lixto, Connotate, BlackLocus, import.io, scrapinghub.com, promptcloud.com: massive human effort, continuously; low cost efficiency; low scale (one or few sources)

DIADEM (domain-centric, intelligent, automated data extraction methodology): small human effort, once; high cost efficiency; massive scale (thousands of sources)

Page 23: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): What about Google & Co.

○ Verticals are becoming ever more relevant for search

◗ the major change to Google’s result page in the last decade

◗ crucial for intelligent personal assistants (Siri, Google Now)

○ Revived interest in large-scale extraction of structured data

◗ as part of knowledge graph

◗ currently only good for common sense facts

○ Recent AI/deep learning acquisitions by Google, Facebook

Page 24: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): Data science, a huge market

Data science market 2017: $50 billion (according to Forbes, Wikibon forecast)

Data collection & cleaning: $25 billion (according to the New York Times)

Page 25: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): Clients

Page 26: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): Strategic Partners

Price intelligence & analytics

Recommendations & reviews

Price comparison & catalogs

Page 27: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): DIADEM Vision

Short term: deep data for products

Page 28: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): DIADEM Vision

Long term: deep data for everyone

Page 29: Diadem DBOnto Kick Off meeting

HOW (INCUBATION PLAN): DIADEM Vision

"Suggest a great evening out!"
"Suggest a cheap headphone with great bass!"
"Suggest a great hotel in an area with lots of bars and close to my conference!"
"Suggest the best smart watch for my preferences!"

Page 30: Diadem DBOnto Kick Off meeting

HOW (TECHNOLOGY & TEAM): WWW 2014: Fallacies in Data Extraction

Kevin C. Chang, co-founder of Cazoodle, move.com, UIUC:

#1: Cannot start with "given a set of result pages"
#2: Must not stop at 70% accuracy
#3: Must be scalable to more than thousands of sources
#4: Must leverage human feedback

DIADEM: ✓ on all four

Page 31: Diadem DBOnto Kick Off meeting

DIADEM ANALYSIS: Wrapper quality

For "presentational" attributes such as the main image or description, AMBER combines template discovery with domain-independent, mainly visual, heuristics, as used, e.g., in http://diffbot.com. AMBER also exploits the domain knowledge to segment multi-attribute text nodes, which are ignored by other template-discovery approaches: it looks for a regular segmentation, e.g., with a common separator or shared prefix, and uses background knowledge (e.g., "period unit modifies price") to refine these segmentations. Figure 3 shows the resulting record and attribute segmentation for four result pages on the example site. The described approach works on single result pages in isolation. However, DIADEM identifies pagination links and follows these to collect several examples of such pages. The final wrapper induction step, discussed in the next section, generalizes the results of the template discovery on individual result pages.
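The segmentation idea, splitting a multi-attribute text node on its most regular separator and then applying background knowledge, can be sketched in a few lines. The function, separators, and tagging patterns below are invented for illustration; they are not AMBER's actual rules.

```python
import re

def segment_listing(text, separators=("·", "|", ",")):
    """Split a multi-attribute text node on the most frequent separator
    and tag each piece with simple background knowledge (a toy sketch of
    the regular-segmentation idea; patterns are illustrative only)."""
    sep = max(separators, key=text.count)
    pieces = [p.strip() for p in text.split(sep) if p.strip()]
    tagged = []
    for p in pieces:
        if re.search(r"£\s*\d", p):
            # "period unit modifies price": pcm/pw appearing next to a price
            kind = "price+period" if re.search(r"\b(pcm|pw)\b", p) else "price"
        elif re.search(r"\bbed(room)?s?\b", p, re.I):
            kind = "bedrooms"
        elif re.search(r"\bbath(room)?s?\b", p, re.I):
            kind = "bathrooms"
        else:
            kind = "other"
        tagged.append((kind, p))
    return tagged

print(segment_listing("3 beds · 2 baths · £1280 pcm"))
```

A shared prefix instead of a separator would need an analogous pass over common leading tokens; this sketch covers only the separator case.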

Full-site wrapper induction (500w)

The final step of the exploration and induction core of DIADEM, the wrapper induction, turns the collected information into an OXPath [6] wrapper. OXPath is DIADEM's wrapper language, designed for efficient extraction with minimal resources. It has been shown to use constant memory even for extractions over millions of pages and to outperform both commercial and open-source systems by a good margin [6]. The wrapper induction makes use of OXPath's ability to combine navigation with extraction to generalize not only the extraction part but also the form filling and navigation parts, where possible. It exploits XPath's text-matching expressions to segment multi-attribute text nodes. Figure 4 shows the wrapper generated by DIADEM for the site from Figure 3. The first two lines cover the form on the start page, first selecting an option from the "Buy/Rent" toggle, then clicking the submit button. The remaining three parts identify (1) the data area with the number of results ("171 properties") that are collected to verify the wrapper execution; (2) the next link that is used to iterate over all result pages (indicated through the Kleene star ( )* operator); and (3) the record and attribute extraction expressions that are evaluated on each such page. The latter are divided into the initial expression selecting the record and the attribute extraction expressions, each nested into a predicate [ ], that may also be absent (indicated by the ? after the opening predicate bracket). Each attribute is identified by a standard XPath expression and extracted by an extraction marker of the form <name=e>, where e is typically an XPath string expression. The wrapper induction algorithm is tailored to produce robust XPath expressions that use unique features of the target node or its siblings. Such expressions tend to be affected only by changes to the actual record template and oblivious to changes in the formatting of advertisements, navigation elements, and similar parts of the page. In Figure 3, e.g., many expressions use preceding text labels ("Bedrooms:") or CSS classes unique to the postcode node (@class="orange").
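The robustness of label-anchored and class-anchored expressions can be tried with any standard XPath engine. The sketch below uses Python's lxml on an invented HTML fragment shaped like the records above; the class names and labels mirror the example wrapper but are otherwise assumptions.

```python
from lxml import html

# Invented listing fragment, shaped like the records in the example wrapper.
FRAGMENT = """
<div class="proplist_wrap">
  <span class="prop_price">£1280 pcm</span>
  <p><span>Bedrooms:</span> <strong>3</strong>
     <span>Bathrooms:</span> <strong>2</strong></p>
  <p>North Oxford, <strong class="orange">OX2 6AR</strong></p>
</div>"""

record = html.fromstring(FRAGMENT)

# Label-anchored: keyed to the visible text "Bedrooms:", so it survives
# changes to surrounding advertisements, navigation, or styling.
beds = record.xpath("string(.//span[.='Bedrooms:']/following-sibling::strong[1])")

# Class-anchored: 'orange' is unique to the postcode node in this template.
postcode = record.xpath("string(.//strong[@class='orange'])")

print(beds, postcode)
```

Both expressions keep working if the record is wrapped in new layout divs, which is the point of anchoring on features unique to the record template.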

Key Findings/Results (750w):
- Table w/ datasets
- 2 figures, 1 table w/ results
- Overall results
- ViNTs (Giovanni): roughly 10%, due to modern forms
- Effect of noise
- Comparative results (OPAL, AMBER)

Discussion (200w):
- Missing: comparison with commercial tools such as diffbot etc. Maybe a diagram where DIADEM is at the "top" for both research and commercial?
- While we focus here on the core induction process, source discovery, integration, …

Acknowledgement. This work is supported by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858. Giorgio Orsi and Christian Schallhart have been supported by the Oxford Martin School, Institute for the Future of Computing.

COMMUNICATIONS OF THE ACM 5

doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
//form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /}
/.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>]
/(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})*
//div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>]!
  [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ]!
  [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ]!
  [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ]!
  [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ]!
  [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ]!
  [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ]!
  [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ]!
  [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ]!
  [? .//@src:<image=normalize-space(.)> ]!
  [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ]!
  [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]

Figure 4: Wrapper for wwagency.com

[Figure 4 annotations: FORM, DATA AREA, NEXT LINK, RECORD & ATTRIBUTES, LOCATION, META, ROOMS]

dataset            | sites | size: records (σ) | size: attributes (σ) | search form: used | search form: #fields
UK real estate     | 3404  | 85 (123σ)         | 869 (1353σ)          | 2092              | 7.4
Oxford real estate | 172   | 108 (162σ)        | 1151 (1752σ)         | 137               | 7.5
UK used cars       | 7089  | 77 (122σ)         | 1250 (2046σ)         | 4254              | 5.3
US real estate     | 111   | 110 (161σ)        | 1000 (1745σ)         | 78                | 10.1
ICQ (5 domains)    | 100   | —                 | —                    | 100               | 6.5

Table 2: Datasets

wrapper            | effective | wrong or missing data | no data
UK real estate     | 91%       | 7%                    | 2%
Oxford real estate | 90%       | 6%                    | 4%
ViNTs [10]         | 4%        | 5%                    | 91%
UK used cars       | 93%       | 4%                    | 3%
US real estate     | 90%       | 5%                    | 5%

Table 3: Wrapper quality

What about the name: OXDuce?

Page 32: Diadem DBOnto Kick Off meeting

DIADEM ANALYSIS: Competition?

[Chart: precision and recall for record extraction on RE-RND and UC-RND, comparing DIADEM, ViNTs, DEPTA, and MDR. DIADEM reaches 98–99% precision and 97–99% recall; the competitors range from 38% to 95% precision and 48% to 81% recall.]

CONCLUSION: Do only a part of the job, and poorly

Page 33: Diadem DBOnto Kick Off meeting

DIADEM ANALYSIS: Competition?

[Chart: precision and recall for attribute extraction on RE-RND and UC-RND, comparing DIADEM, DEPTA, and RoadRunner. DIADEM reaches 95–97% precision and recall; DEPTA and RoadRunner range between 42% and 84%.]

CONCLUSION: Do only a part of the job, and poorly

Page 34: Diadem DBOnto Kick Off meeting

DIADEM ANALYSIS: Competition?

CONCLUSION: Do only a part of the job, and poorly

[Figure 9: Attribute and form type frequencies. (a) RE attribute type frequency (price, description, location, url, image, beds, town, status, type, street) for RE.FULL, RE.OXF, RE.RND; (b) UC attribute type frequency (price, description, make, image, model, mileage, fuel, transmission, colour, doors) for UC.FULL, UC.RND; (c) form field types per form (button, select, text, numeric.min, numeric.max, numeric, range, multi-select, text.min, text.max) for RE.FULL, RE.OXF, RE.RND, UC.FULL, UC.RND.]

[Figure 11: Attribute quality. Fraction of exact, contained, and wrong attribute values per attribute type, for RE-OXF and RE-RND (price, location, postcode, property_type, property_status, furnishing, period_unit, beds, baths, receptions) and UC-RND (price, location, postcode, model, make, transmission, colour, body_type, fuel_type, age, engine_size, registration, door_number, mileage).]

space reasons, but refer to [20]. Table 2b shows the assessment of the wrapper effectiveness induced by DIADEM for RE-RND, UC-RND, RE-US, and RE-OXF. For the latter, we also show the corresponding numbers for ViNTs [44]. We assume that the random samples are representative for the full datasets, as indicated by the highly correlated characteristics. The primary result is that over 90% of the wrappers are effective in each dataset, with 91% average effectiveness. To avoid bias, we use a two-step verification of the wrappers: each wrapper is manually verified by one person. If a wrapper is considered effective, the actual extracted records are automatically compared to the SEARCH_RESULTS_NUMBER identified on the first listings page, if present. If not present, we use uniqueness of URLs and images and identical record numbers from different fillings. If this automatic verification fails, two more persons are asked to verify the wrapper and the aggregated result is reported.
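The automatic half of this two-step verification can be sketched as follows. The function and its representation of records are hypothetical; in particular, the exact consistency criteria are one reading of the description above, not DIADEM's code.

```python
def auto_verify(records, results_number=None, other_fillings=()):
    """Sketch of the automatic wrapper check described above.

    `records` is a list of dicts with 'url' and 'image' keys;
    `results_number` is the count announced on the first listings page,
    if any; `other_fillings` holds record lists obtained from different
    form fillings. Illustrative only."""
    if results_number is not None:
        # The page announced a result count: the wrapper is plausible
        # iff it extracted exactly that many records.
        return len(records) == results_number
    # Otherwise fall back to uniqueness of URLs and images, plus
    # identical record numbers across different fillings.
    urls = [r["url"] for r in records]
    images = [r["image"] for r in records]
    unique = len(set(urls)) == len(urls) and len(set(images)) == len(images)
    consistent = all(len(f) == len(records) for f in other_fillings)
    return unique and consistent
```

Only wrappers that fail this automatic stage go to the two additional human verifiers.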

Contrast this to ViNTs, a system for fully automatically generating wrappers for search engines. It provides only few attributes that are common to many search engines, namely the title of the result and its textual description; thus we consider a wrapper effective if it extracts the right records (whereas for DIADEM we also inspect the attributes). ViNTs performs quite well if provided with a few result pages and a non-result (control) page. Like DIADEM, ViNTs is also able to identify and fill forms and generate wrappers from just a site URL. However, this part of ViNTs has been specifically engineered for simple search forms. We remove all sites with no search forms, iframes, or too few properties (about 17% of the sites in the RE-OXF dataset) from the evaluation. Even in this case, ViNTs still only manages to produce a wrapper in 9% of the cases. Only for 4% of the sites does it produce an effective wrapper. In all other cases, ViNTs only extracts part of the data, e.g., no rentals.

Among the most common causes for a DIADEM wrapper to be non-effective are misaligned attributes, e.g., in the presence of multiple pivot attributes or rare optional attributes, and sites that list related products more prominently. E.g., on a few sites that also offer new cars, DIADEM may extract those rather than the used cars, if neither listing contains many used-car-specific attributes and the new cars are more prominently placed on the site. There are about 3% of sites where no wrapper can be induced, typically as they contain no properties, all properties are on aggregators, or they contain no pivot attribute. For these sites, DIADEM correctly detects that there is no effective wrapper. The final case is that DIADEM fails to produce an effective wrapper, yet one exists. The most common reasons for these failures are dynamic forms (15%), result pages with dynamically rendered prices (12%), forms located in sidebar iframes (15%), prices without currencies (6%), or sites which contain only a single property (6%).

ICQ dataset     | HA [14] | ExQ [41] | StatParser [36] | DIADEM [17]
F1 for labeling | 92%     | 96%      | 96%             | 98%

Table 3: Form labeling accuracy

To demonstrate that DIADEM does not produce a wrapper for sites that are not in the target domain, we also run DIADEM on the set of top UK shopping websites, UK-100. On this set, DIADEM induces a wrapper for only 5 sites, confusing toy cars on Amazon and Toys-R-Us for used cars.

To determine the attribute quality of the extracted data, we perform a manual evaluation on the RE-OXF, RE-RND, and UC-RND datasets. Again, we use a two-step verification, both manual and automatic, with DIADEM's LNER or Bing (for locations). Attributes are either exact matches, contain the intended value, or are wrong. Figure 11 summarises the results. Overall, the attribute quality is > 97% in the two random datasets (and even higher in RE-OXF). The attribute with the highest error rate is the location in UC-RND. In the real-estate cases this attribute has a rather low error rate. The reason is that in UC-RND, location is not a very common attribute (below 20% of the records). It typically appears only on sites of dealers with multiple offices, indicating the car's position.

In addition to the overall performance of the full-site extraction considered so far, we also evaluated two components separately: DIADEM's form labeling and its record and attribute identification. First, we focus on the form labeling accuracy produced as part of DIADEM's form understanding, see Section 4.3. Unfortunately,

Page 35: Diadem DBOnto Kick Off meeting

DIADEM: DIADEM's Components

1. ROSeAnn (VLDB'14): world-best entity extraction from text and structure

Page 36: Diadem DBOnto Kick Off meeting

DIADEM: DIADEM's Components

1. ROSeAnn (VLDB'14): world-best entity extraction from text and structure
2. OPAL (WWW'12, VLDBJ'13): world-most-effective form understanding & filling

Range widget ⟸ two fields + connected by “to” or other range connector + some clues in the annotations or classifications
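The range-widget clue above can be made concrete in a few lines. The representation below (field pairs, annotation sets, the connector list) is invented for illustration and is not OPAL's internal model.

```python
RANGE_CONNECTORS = {"to", "-", "from"}

def is_range_widget(fields, annotations):
    """Sketch of the clue above: exactly two fields joined by a range
    connector, plus a min/max hint in the annotations or classifications.

    `fields` is a list of (name, text_after_field) pairs; `annotations`
    maps a field name to its set of annotation labels. Both are invented
    representations for illustration."""
    if len(fields) != 2:
        return False
    _, between = fields[0]
    connected = between.strip().lower() in RANGE_CONNECTORS
    hinted = any({"min", "max"} & annotations.get(name, set())
                 for name, _ in fields)
    return connected and hinted

print(is_range_widget(
    [("price_min", "to"), ("price_max", "")],
    {"price_min": {"min"}, "price_max": {"max"}}))
```

Two adjacent price fields joined by "to" with min/max annotations are recognised; two unrelated fields joined by "and" are not.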

The Ontological Key: Automatically Understanding and Integrating Forms to Access the Deep Web

1  TEMPLATE field_by_proper<C,A>  { field<C>(N) ← N@A{d,e,p} }
2
3  TEMPLATE field_by_segment<C,A> { field<C>(N) ← N@A{e,p} }
4
5  TEMPLATE field_by_value<C,A>   { field<C>(N) ← N@A{m},
6      ¬(A1 ≠ A, N@A1{d,e,p} ∨ N@A1{e,p}) }
7
8  TEMPLATE field_minmax<C,CM,A> {
9      field<CM>(N1) ← child(N1,G), child(N2,G), adjacent(N1,N2),
10         N1@A{e,d}, (field<C>(N2) ∨ N2@A{e,d})
11     field<C_range>(N2) ← child(N1,G), child(N2,G), next(N2,N1),
12         field<C>(N1), N2@range_connector{e,d}, ¬(A1 ≻ C, N2@A1{d})
13     field<CM>(N1) ← child(N1,G), child(N2,G), adjacent(N1,N2),
14         N1@A{e,p}, N2@A{e,p},
15         ((N1@min{e,p}, N2@max{e,p}) ∨ (N1@max{e,p}, N2@min{e,p})) }

Fig. 10: OPAL-TL field classification templates

Figure 10 shows the field classification templates for real estate and used cars: (1) Field by proper label. The first template captures direct classification of a field N with type C, if N matches N@A{d,e,p}, i.e., has more proper labels of type A than of any other type A′ (with respect to the type precedence order). This template is used by far the most frequently, primarily for types with unambiguous proper labels. (2) Field by segment label. The second template relaxes the requirement by also considering indirect labels (i.e., labels of the parent segment). In the real estate and used car domains, this template is instantiated primarily for control fields such as ORDER_BY or DISPLAY_METHOD (grid, list, map), where the possible values of the field are often misleading (e.g., an ORDER_BY field may contain "price", "location", etc. as values). (3) Field by value label. The third template also considers value labels, but only if neither the first nor the second template can match. In that case, we infer that a field has type C if the majority of its direct or indirect, value or proper labels are annotated with A. (4) Min-max field. Web forms often show pairs of fields representing min-max values for a feature (e.g., the number of bedrooms of a property). We specify this template with three simple rules (Lines 5–12) that describe three configurations of segments with fields associated with value labels only (proper labels are captured by the first two templates). It is the only template with two template parameters, C and CM, where CM is the "minmax" variant of C. The first rule locates adjacent pairs of such nodes, or a single such node and one that is already classified as C. The second rule locates nodes where the second directly follows the first (already classified with C), has a range_connector (e.g., "from" or "to"), and is not annotated with an annotation type with precedence over A. The last rule also locates adjacent pairs of such nodes and classifies them with CM if they carry a combination of min and max annotations.
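The cascade of templates (1)–(3) amounts to a precedence among label kinds: proper labels beat segment labels, which beat value labels, and within a kind the majority annotation type wins. A toy version, ignoring the type-precedence conditions of the real templates, might look like:

```python
from collections import Counter

def classify_field(proper, segment, value):
    """Toy cascade over label kinds, as in templates (1)-(3) above:
    prefer proper labels, then segment labels, then value labels; within
    a kind, take the majority annotation type. Names and the flattened
    representation are illustrative, not OPAL-TL's semantics."""
    for labels in (proper, segment, value):
        if labels:
            (best, _), = Counter(labels).most_common(1)
            return best
    return None

# An ORDER_BY field often carries misleading value labels ("price", ...)
# but a clear segment label, so the cascade classifies it correctly.
print(classify_field([], ["ORDER_BY"], ["PRICE"]))
```

The real templates additionally compare competing types through the precedence order rather than by simple majority.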

In addition to these templates, there is also a small number of specific rules. In the real estate domain, e.g., we use

1  TEMPLATE segment<C> {
2      segment<C>(G) ← lone<C>(G), child(N1,G),
3          ¬(child(N2,G), ¬(C1 ⊑ C, field<C1>(N2) ∨ segment<C1>(N2))) }
4
5  TEMPLATE segment_range<C,CM> {
6      segment<C>(G) ← lone<C>(G), field<CM>(N1),
7          field<CM>(N2), N1 ≠ N2, child(N1,G), child(N2,G) }
8
9  TEMPLATE segment_with_unique<C,U> {
10     segment<C>(G) ← lone<C>(G), child(N1,G), field<U>(N1),
11         ¬(C1 ⊑ C, child(N2,G), N1 ≠ N2,
12             ¬(field<C1>(N2) ∨ segment<C1>(N2))) }
13
14 TEMPLATE lone<C> {
15     lone<C>(G) ← child(N,G), (segment<C>(N) ∨ field<C>(N)),
16         ¬(adjacent(G,G′), segment<C>(G′)) }

Fig. 11: OPAL-TL segment constraints

the following rule to describe forms that use links (a elements) for submission (rather than submit buttons). Identifying such a link (without probing and analysis of JavaScript event handlers) is performed based on an annotation type for typical content, title (i.e., tooltip), or alt attribute of contained images. This is mostly, but not entirely, domain independent (e.g., in real estate there is a "rent" link).

field<LINK_BUTTON>(N1) ← form(F), descendant(N1,F), link(N1), N1@LINK_BUTTON{d},
    ¬(descendant(N2,F), (field<BUTTON>(N2) ∨ next(N1,N2)))

5.3 Segment Classification

As for field constraints, we use OPAL-TL to specify the segment constraints. The segment constraints and templates in the real estate and used car domains are shown in Figure 11 (omitting only the instantiation, as in the field case). All segment templates require that the segment has at least one C child and is the lone C segment among its siblings (see lone<C>(G)). (1) Basic segment. A segment is a C segment if its children are only other segments or fields typed with C. This is the dominant segmentation rule, used, e.g., for ROOM, PRICE, or PROPERTY_TYPE in the real estate domain. (2) Min-max segment. A segment is a C segment if it has at least two field children typed with CM, where CM is the minmax type for C. This is used, e.g., for PRICE and BEDROOM range segments. (3) Segment with mandatory unique. A segment is a C segment if its children are only segments or fields typed with C, except for one (mandatory) field child typed with U, where U is not a subtype of C. This is used, e.g., for GEOGRAPHY segments where only one RADIUS may occur.

Page 37: Diadem DBOnto Kick Off meeting

DIADEM: DIADEM's Components

1. ROSeAnn (VLDB'14): world-best entity extraction from text and structure
2. OPAL (WWW'12, VLDBJ'13): world-most-effective form understanding & filling
3. AMBER (TWeb'14): world-most-accurate record identification for listing pages

Algorithm 3: OptimalAttributeAlignment(Extraction instance E, DOM P, domain schema S)

1   foreach n ∈ P with µ(A,n,v) do
2       if ∃r ∈ E, m ∈ f(r) : descendant-or-self(m,n) then
3           add new attribute a to record r in E;
4           k(a) ← v, t(a) ← A, f(a) ← {n};
5   M ← mandatory attribute types from S;
6   foreach data area d in E do
7       Mandats ← {(r,a) ∈ E : (d,r) ∈ E ∧ t(a) ∈ M};
8       while ∃(r,a) ∈ Mandats : supp_r(a,t(a)) < s1 do delete a from E;
9       while ∃(d,r) ∈ E : ∃A ∈ M : ∄(r,a) ∈ Mandats : t(a) = A do
10          if ∃n ∈ P : text(n) ∧ desc(n,r) ∧ supp_r(n,A) > s1 then
11              add new attribute a to record r in E;
12              k(a) ← contents(n), t(a) ← A, f(a) ← {n};
13      while ∃(d,r) ∈ E : ∃A ∈ M : ∄(r,a) ∈ Mandats : t(a) = A do
14          delete r from E;
15      while ∃(d,r),(r,a) ∈ E : supp_r(a,t(a)) < s0 do delete a from E;
16      while ∃(d,r),(r,a),(r,a′) ∈ E : t(a) = t(a′) = A do
17          if supp_r(a,A) ≤ supp_r(a′,A) then delete a from E;
18      if |{(d,r) ∈ E}| < 2 then delete d from E;

both false positives (nodes that are annotated with type A but are in fact not of type A) and false negatives (nodes that are not annotated by type A but ought to be).

Thus, AMBER improves the quality of the attributes with a set of reconciliation techniques for detecting and repairing extractions appearing inconsistent with the annotations observed in other records.

In the repair we perform essentially two steps for each attribute type. Validation, to detect false positives: remove all attributes that are not well-supported, i.e., where in too few other records an attribute of the same type occurs at the same position. Inference, to detect false negatives: infer missing attributes in those positions of a record where there are sufficient other records that have an attribute of the same type at that position.

In practice, inference of optional attributes turns out to be rather unreliable and to reduce precision notably at little gain in recall. Therefore, we limit inference to mandatory attributes only.

The final repair step ensures that the produced extraction instance is consistent with S by deleting all records that lack one or more mandatory attributes.

Algorithm 3 shows how AMBER enforces these repairs: initially, in lines 1–4, we add all annotated nodes in a record as potential attributes of that record.

In lines 6–18, we iterate over all data areas d in E to repair the attributes of the records in d. First, in lines 7–12 we validate and infer mandatory attributes. Then, we remove records which lack at least one mandatory attribute in lines 13–14. In line 15 we validate optional attributes and in lines 16–17 we enforce the multiplicity constraints. In line 8, we immediately remove all attributes which have a support below threshold s1, addressing attribute validation for mandatory attributes. Conversely, in lines 9–12, we infer attributes where there are nodes that are not annotated but are well-supported. Finally, line 18 removes all data areas which have fewer than 2 records after cleanup.

THEOREM 2. AMBER's attribute alignment (Algorithm 3) computes an optimal attribute alignment for a page with DOM P in O(|A| · n³), where n is the size of P.

PROOF. Definition 11 requires three conditions on the extraction instance E′ constructed from a partial instance E, namely that (1) E′ contains a subset of the data areas and records of E, (2) E′

[Figure: a data area of four records (div and a nodes), with annotated attribute nodes such as span→PRICE, b→LOCATION, strong→PRICE, em→LOCATION, i→BEDS under the record nodes.]

Figure 5: Attribute alignment

is well-supported and consistent for S, and (3) E′ is maximal in the number of nodes among all such extraction instances. Condition (1) is trivially satisfied, since Algorithm 3 might add attributes but only removes records and data areas. Condition (2) is also satisfied, since Algorithm 3 assures (i) well-supportedness: in line 8, AMBER deletes all mandatory attributes without sufficient support, and in lines 9–12, it introduces only mandatory attributes with sufficient support, such that all obtained mandatory attributes are well-supported. In line 15, we also remove all optional attributes without sufficient support. (ii) consistency: first, in lines 13–14, AMBER removes all records lacking (well-supported) mandatory attributes. Thus, the extraction instance can only be inconsistent by having multiple attributes of the same type. AMBER takes care of precisely this case in lines 16–17. It remains to show that Condition (3) holds as well. To this end, we argue that we obtain the maximum possible set of mandatory attributes before line 13, possibly even violating the multiplicity constraints: but this is the case, since lines 9–12 add all mandatory attributes which are well-supported. This is sufficient, since afterwards, we only remove records lacking mandatory attributes, data areas lacking a sufficient number of records, and surplus attributes violating the multiplicity constraints.

Algorithm 3 is dominated by the calculation of the support. To calculate the support, we need to compute, for each node in a record, the node at the corresponding position in every other record. This may need to be computed for every attribute type. Since there are at most n nodes, the number of records is also bounded by n, and computing as well as resolving a tag path requires at most n time, the calculation of the support is bounded by O(|A| · n³). This is precomputed (to allow for constant access to the support) and the remainder of Algorithm 3 is then bounded by O(n²).

In Figure 5 we illustrate attribute alignment in AMBER for s1 = 40% and s0 = 20%, with mandatory types PRICE and LOCATION and optional type BEDS: the data area has four records, each spanning two of the children of the data area (shown as blue diamonds). Red triangles represent attributes, with the attribute type written below. Other labels are HTML tags. A filled triangle is an attribute directly derived from an annotation, an empty triangle one inferred by the algorithm in lines 9–12. In this example, the second record has no PRICE annotation. However, there is a span with tag path a/first-child::p/first-child::span, and there are two other records (the first and third) with a span with the same tag path from their record. Therefore that span has support > 40% for PRICE and is added as an attribute of that type to the second record. Similarly, for the b element in record 1 we infer type LOCATION from the support in records 2 and 4. Record 3 has a LOCATION annotation, but in an em. This has not enough support and is therefore removed. Since record 3 thus has no LOCATION attribute and LOCATION is mandatory, it is removed (indicated by the dotted, dimmed record nodes). This contrasts with the i in record 1, which is annotated as BEDS and is retained as an attribute, as optional attributes only need 20% support (and thus the support from record 1 suffices). Record 4 is also removed

Page 38: Diadem DBOnto Kick Off meeting

DIADEM’s Components
D I A D E M 38

4. OXPath (VLDB’11, VLDBJ’13): World’s most efficient extraction language

Bitemporal Complex Event Processing of Web Event Advertisements⋆

Tim Furche1, Giovanni Grasso1, Michael Huemer2, Christian Schallhart1, and Michael Schrefl2

1 Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD
[email protected]
2 Department of Business Informatics – Data & Knowledge Engineering, Johannes Kepler University, Altenberger Str. 69, Linz, Austria
[email protected]

OXPath wrapper for scottfraser.co.uk:

doc('http://www.scottfraser.co.uk/')//select[@id='search-type']/{1 /}
//input/{click /}/(//div[1]/table//td[4]/a/{click /})*{0,500}
//div[@class='property-wrapper']:<record>
  [? .:<ORIGIN_URL=current-url()>]
  [? .//div[@class='propertyPrice']/text()[last()-1]:<PRICE=normalize-space(.)> ]
  [? .//li[@class='rec']/span[@class='value']/text():<RECEPTION_ROOM_NUMBER=string(.)> ]
  [? .//div[@class='propertyTitle']//@href:<URL=string(.)> ]
  [? .//span[@class='priceQualifier']/text():<PERIOD_UNIT=string(.)> ]
  [? .//div[@class='propertyDescription']/text()[1]:<DESCRIPTION=string(.)> ]
  [? .//li[@class='bed']/span[@class='value']/text():<BEDROOM_NUMBER=string(.)> ]
  [? .//li[@class='bath']/span[@class='value']/text():<BATHROOM_NUMBER=string(.)> ]
  [? .//div[@class='propertyThumbnail']/a//@src:<IMAGE=string(.)> ]
  [? .//div[@class='propertyTitleWrapper']//a/text():<LOCATION=string(.)> ]

OXPath wrapper for timruss.co.uk:

doc('http://www.timruss.co.uk/')//input[@value='cntrlListingType_Sales']/{click /}
//input[@name='ctl00$ctl14$btnSearch$ctl00']/{click /}/
(//div[5]//td/following-sibling::td[contains(string(.),'>')]/a/{click /})*{0,500}
//div[@id='ctl00_cntrlCenterRegion_ctl01_pnlPagingFooter']/preceding-sibling::div/div[1]/div:<record>
  [? .:<ORIGIN_URL=current-url()>]
  [? .//div/following-sibling::h2//text():<PRICE=substring(normalize-space(.),string-length(substring-before(normalize-space(.)," "))+1)> ]
  [? .//div[@class='ListResultsRooms']/div[last()]/span/text():<RECEPTION_ROOM_NUMBER=substring-after(normalize-space(.),"Receptions: ")> ]
  [? .//a[.='Full Details >']/@href:<URL=string(.)> ]
  [? .//div[contains(@class,'SearchText')]:<DESCRIPTION=string(.)> ]
  [? .//div[contains(string(.),'Bedrooms:')]/span/text():<BEDROOM_NUMBER=substring-after(normalize-space(.),"Bedrooms: ")> ]
  [? .//div[contains(string(.),'Bathrooms:')]/span/text():<BATHROOM_NUMBER=substring-after(normalize-space(.),"Bathrooms: ")> ]
  [? .//a[@class='propAdd']/text():<TOWN=string(.)> ]
  [? .//img[@class='fulldetails-photo-item']/@src:<IMAGE=string(.)> ]
  [? .//a[@class='propAdd']/text():<LOCATION=string(.)> ]

⋆ The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858. Michael Huemer has been supported by a Marietta Blau Scholarship granted by the Austrian Federal Ministry of Science and Research (BMWF) for a research stay at Oxford University’s Department of Computer Science.

1. ROSeAnn (VLDB’14): World’s best entity extraction from text and structure

2. OPAL (WWW’12, VLDBJ’13): World’s most effective form understanding & filling

3. AMBER (TWeb’14): World’s most accurate record identification for listing pages

Page 39: Diadem DBOnto Kick Off meeting

DIADEM’s Components
D I A D E M 39

1. ROSeAnn (VLDB’14): World’s best entity extraction from text and structure

2. OPAL (WWW’12, VLDBJ’13): World’s most effective form understanding & filling

3. AMBER (TWeb’14): World’s most accurate record identification for listing pages

4. OXPath (VLDB’11, VLDBJ’13): World’s most efficient extraction language

5. DIADEM (VLDB’14): World’s first accurate, automatic full-site extraction system

Page 40: Diadem DBOnto Kick Off meeting

Example 1: Form
F O R M P H E N O M E N O L O G Y 40

○ Task: classify and group form fields into semantic segments

◗ Problem: HTML structure is only an approximation

○ Phenomenology: Detect semantic segments, e.g.,

◗ if there is a continuous list of option fields (🔘, ☑️)

◗ with the same type

◗ and a parent that can’t be classified

Page 41: Diadem DBOnto Kick Off meeting

Example 1: Form
F O R M P H E N O M E N O L O G Y 41

segment<C>(∃X) :- html-child(N1, P), html-child(N2, P), N1 ≠ N2, ¬segment(P),
                  option-field(N1), option-field(N2),
                  concept<C>(N1), concept<C>(N2),
                  max-cont-list-of-fields-with-type<C>(N1, N2).

parent cannot be classified

both option fields

same type C

end points of largest continuous list of type C
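The Datalog± rule above can be read procedurally. A minimal Python sketch of the same idea, assuming a flat list of form fields under one parent (the function name `find_segments` and the tuple encoding are hypothetical, not from the source):

```python
def find_segments(children, parent_classified):
    """Detect semantic segments: maximal continuous runs of option fields
    (radio buttons / checkboxes) sharing the same concept type, under a
    parent that could not itself be classified.

    children: list of (is_option_field, concept_type) in document order.
    Returns a list of (start, end, type) half-open index ranges.
    """
    if parent_classified:          # rule requires ¬segment(P) on the parent
        return []
    segments, i = [], 0
    while i < len(children):
        is_option, ctype = children[i]
        if not is_option:
            i += 1
            continue
        j = i                      # extend the run while type stays the same
        while j < len(children) and children[j][0] and children[j][1] == ctype:
            j += 1
        if j - i >= 2:             # rule needs at least two distinct fields
            segments.append((i, j, ctype))
        i = j
    return segments
```

As in the rule, a run of a single field does not trigger a segment, since the rule demands two distinct option fields N1 ≠ N2 of the same type.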

Page 42: Diadem DBOnto Kick Off meeting

Example 2: Data areas
R E S U L T P A G E P H E N O M E N O L O G Y 42

○ Task: Finding areas on a page that contain relevant data

○ Idea: Use the regularity resulting from the DB templates

○ Problem: Distinguishing regular noise, e.g., featured properties

○ Solution: Maximisation problem over pivot elements

◗ occurrences of mandatory attributes such as price

Page 43: Diadem DBOnto Kick Off meeting

R E S U L T P A G E P H E N O M E N O L O G Y 43

Figure 3: Data area identification (pivot clusters M1,1–M1,4 in data area D1; further data areas D2, D3; additional pivot nodes in E)

[…] the limits of order dominance: The pivot nodes in E are organized rather regularly, whereas the pivot nodes in D1 vary quite notably. However, their variation is small enough that M1,1 to M1,4 are depth and distance consistent (for d = e = 3). The two lower pivot nodes in E, however, are neither depth (due to M1,1) nor distance consistent (due to M1,2 and M1,3) and therefore cannot be added to this cluster. They form a separate cluster together with the rightmost pivot node in E. This cluster, however, is not order dominant and is therefore dropped in lines 24–28. Thus ψ(D1), the support of D1, is only {M1,1, . . . , M1,4}, and the three remaining pivot nodes in E are not used further.

The latter shows that in some cases order dominance may not identify the “best” data area. The primary reason is that depth and distance consistency are defined using absolute thresholds for the entire page, rather than allowing data areas with different levels of consistency on a page. Pages with such a structure occur very infrequently in practice (as demonstrated by the evaluation in Section 5) and could be addressed by a slight extension of the current identification algorithm (see Section 6).

4.2 Record Segmentation

AMBER is tailored to result pages with multiple “records”, i.e., representations of domain entities. During data area identification, we identify areas of a page with sufficient repeated structure in the relevant data that we can assume that records in such a data area are instantiations of the same template and thus have a similar structure. Despite this assumption, AMBER can deal with a large degree of noise: (1) AMBER tolerates inter-record noise, such as advertisements, by focusing on relevant data. (2) AMBER tolerates most intra-record variances due to, e.g., optional attributes or multiple entity types by segmenting records based only on mandatory, usually highly regular attributes. (3) AMBER also addresses multi-template pages, where records on the same page are generated from different templates, by considering each data area separately for record segmentation. AMBER approximates relevant data and structural similarity of records through occurrences of mandatory attribute types only, as in the data area case. This allows AMBER to scale to large and complex pages with ease.

DEFINITION 7. A record is a set r of children of a data area d such that r is continuous (w.r.t. the sibling order) and r contains at least one pivot node from ψ(d). A record segmentation of d is a set of uniform, non-overlapping records R, i.e., all records in R have the same size and no child of d occurs in more than one record.

For example generation, we are interested in record segmentations that expose the regular structure of the page. We formalize this as the following dual-objective optimization problem:

(1) Maximize the length of an evenly segmented sequence of pivot nodes. A sequence of pivot nodes p1, . . . , pn is evenly segmented in a data area d if the subtrees containing the pi occur in distinct records and all have the same distance from each other, i.e., if there is a k such that li −sibl li+1 = k for all i, where li is the child of the data area d that contains pi.

(2) Minimize the irregularity of the record segmentation. The irregularity of a record segmentation R is the sum of the relative tree edit distances between all pairs of nodes in different records of R: irregularity(R) = Σ_{n ∈ r, n′ ∈ r′, r ≠ r′ ∈ R} editDist(n, n′), where editDist(n, n′) is the standard tree edit distance normalized by the size of the subtrees rooted at n and n′ (their “maximum” edit distance).
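The irregularity objective can be sketched in a few lines of Python. This is an illustrative simplification, not AMBER's implementation: nodes are represented as flat tag sequences and a Levenshtein distance stands in for the tree edit distance, normalized by the larger sequence length.

```python
from itertools import combinations

def edit_dist(a, b):
    """Levenshtein distance between two tag sequences (a stand-in for the
    tree edit distance used in the paper)."""
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1,          # deletion
                                       row[j - 1] + 1,      # insertion
                                       prev + (a[i - 1] != b[j - 1]))  # substitution
    return row[n]

def irregularity(records):
    """Sum of normalized pairwise distances between nodes in *different*
    records. records: list of records, each a list of tag sequences."""
    total = 0.0
    for r1, r2 in combinations(records, 2):
        for n1 in r1:
            for n2 in r2:
                denom = max(len(n1), len(n2)) or 1   # normalize by larger "subtree"
                total += edit_dist(n1, n2) / denom
    return total
```

Two structurally identical records yield irregularity 0, so the optimizer prefers segmentations whose records repeat the same template.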

In AMBER we approximate such a record segmentation using Algorithm 2. It computes a record segmentation in two steps such that the record segmentation contains a large sequence of evenly segmented pivot nodes and has minimal irregularity among all record segmentations with those pivot nodes and the same record size. In a pre-processing step, all children of the data area that contain no text or attributes (“empty” nodes) are collapsed and excluded from the further discussion, under the assumption that these are separator nodes such as br.

First, we determine the sequence of pivot nodes underlying the segmentation. We identify the pivot nodes by their “leading node”, i.e., the child of the data area that contains the pivot node (line 1, L). In lines 3–4 we estimate the distance Len between leading nodes that yields the largest evenly segmented sequence: The children of the data area are partitioned at each leading node, and Len becomes the minimum partition size that occurs with maximal frequency in the resulting partition (line 4). In lines 5–8 we drop all leading nodes from L that are less than Len from their previous leading node, except for the start (line 5) and end (line 6) of the sequence, where we remove the outer leading nodes under the assumption that they are noise in the header or trailer of the data area.
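The Len estimation of lines 3–4 is easy to sketch. A minimal version, assuming leading nodes are given as child indices of the data area (the function name `estimate_len` is hypothetical):

```python
from collections import Counter

def estimate_len(leads):
    """Estimate the record size Len: partition the data area's children at
    each leading node, then take the smallest partition size that occurs
    with maximal frequency (lines 3-4 of Algorithm 2)."""
    sizes = [b - a for a, b in zip(leads, leads[1:])]  # gaps between leads
    counts = Counter(sizes)
    best = max(counts.values())                        # maximal frequency
    return min(s for s, c in counts.items() if c == best)
```

For leading nodes at positions 0, 4, 8, 12, 14, the gaps are 4, 4, 4, 2; size 4 occurs most often, so Len = 4, and the lead at 14 would later be dropped for being closer than Len to its predecessor.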

Second, we use the remaining leading nodes to compute all segmentations with record size Len such that each record contains at least one leading node from L. To that end, line 9 computes the start points of these records by shifting to the left from the nodes in L. We then iterate over all sequences of such start points in the loop of lines 12–18 and compute the actual segmentations as the records of length Len from each starting point (line 14). By construction these are records, as they are continuous and contain at least one leading node (and thus at least one pivot node). The whole Segmentation is a record segmentation, as its records are non-overlapping (due to lines 5–8) and of uniform size Len (line 15). Among all these record segmentations we then return the one with the lowest irregularity (lines 15–18).

PROPOSITION 1. Algorithm 2 runs in O(b · n³) on a data area d, where b is the degree of d and n the size of d.

PROOF. Lines 1–8 are clearly in O(b²). Line 9 generates at most b + 1 segmentations (as Len ≤ b) of size at most b. The loop is executed once for each such segmentation and is dominated by the computation of irregularity(), which is bounded by O(n³) using a standard tree edit distance algorithm. Since b ≤ n, the overall bound is O(b · n³).

In Figure 2, the record segmentation is fairly straightforward, since both data areas are rather regular. We eliminate the separator nodes (the white diamonds) and then segment the children of the data areas. The first f of the e data area is omitted, as it does not form a record of size 2 like all others in e.

consistent_cluster_members(C, N1, N2, N3) :- pivot(N1), pivot(N2), ...

similar_depth(N1, N2), similar_depth(N2, N3), similar_depth(N1,N3),

similar_tree_distance(N1, N2, N3).

cluster(C,N) :- ... continuous, lca, contains at least one of all mandatories

Page 44: Diadem DBOnto Kick Off meeting

Example 2: Record alignment
R E S U L T P A G E P H E N O M E N O L O G Y 44

○ set of uniform, non-overlapping records

○ maximise regularity, minimise outliers

◗ pairwise edit distance with bias towards pivot nodes

Figure 4: Record Segmentation (a data area whose children are div and p nodes containing img and a nodes; pivot price nodes £860, £900, £500, £900, £900)

Algorithm 2: Segmentation(DOM P, Data Area d)

1  L ← {n : child(f(d), n) ∈ P ∧ ∃n′ ∈ ψ(d) : desc-or-self(n, n′)};
2  sort L in document order;
3  foreach 1 ≤ k ≤ |L|−1 do Partition[k] ← {n : L[k] ⪯ n ≺ L[k+1]};
4  Len ← min{|Partition[i]| : |{j : |Partition[j]| = |Partition[i]|}| maximal};
5  while L[1] −sibl L[2] < Len do delete L[1];
6  while L[|L|−1] −sibl L[|L|] < Len do delete L[|L|];
7  while 1 < k < |L| do
8      if L[k] −sibl L[k+1] < Len then delete L[k+1] else k++;
9  StartCandidates ← {L} ∪ {{n : ∃l ∈ L : n −sibl l = i} : i ≤ Len};
10 OptimalSegmentation ← ∅; OptimalSim ← ∞;
11 foreach S ∈ StartCandidates do
12     sort S in document order;
13     foreach 1 ≤ k ≤ |L|−1 do
14         Segmentation[k] ← {n : n −sibl S[k] < Len};
15     if ∀P ∈ Segmentation : |P| = Len then
16         if irregularity(Segmentation) < OptimalSim then
17             OptimalSegmentation ← Segmentation;
18             OptimalSim ← irregularity(Segmentation);
19 return OptimalSegmentation;

In the example of Figure 4, Algorithm 2 generates five segmentations, as Len is 4 due to the distance between the first and the second (red) div. Note how the first and last leading nodes (p elements) are eliminated (in lines 5–6 of Algorithm 2), as they are too close to other leading nodes. Of the five segmentations (shown at the bottom of Figure 4), the first and the last are invalid, as they contain records of a length other than 4. The middle three segmentations are proper record segmentations. The middle (solid line) one has the lowest irregularity among those and is thus selected by AMBER.

4.3 Attribute Alignment

After segmenting the data area into records, AMBER aligns the contained attributes to complete the extraction instance. Recall that we limit our discussion to single-valued attributes, i.e., attribute types which occur at most once in each record.

When aligning attributes, AMBER must compare the positions of attribute occurrences in different records to detect repeated structures. To encode the position of an attribute in a record, we use the path from the record node to the attribute:

DEFINITION 8. For DOM nodes a and n with descendant(a, n), we define the characteristic tag path tag-path_a(n) as the sequence of HTML tags occurring on the path from a to n, including those of a and n itself, taking only first-child and next-sibl steps while skipping all text nodes. With the exception of a’s tag, all HTML tags are annotated by the type of step.

For the leftmost a and its i descendant in Figure 5, e.g., the tag path is a/first-child::p/first-child::span/next-sibl::i.
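Definition 8 can be sketched directly. A minimal implementation, assuming a toy DOM of dicts with "tag" and "children" keys (this encoding and the function names are illustrative, not AMBER's data model):

```python
def _steps(cur, target):
    """Navigation steps (first-child / next-sibl) from cur down to target,
    skipping text nodes; returns None if target is not in cur's subtree."""
    if cur is target:
        return []
    kids = [c for c in cur["children"] if c["tag"] != "#text"]  # skip text nodes
    for i, child in enumerate(kids):
        sub = _steps(child, target)
        if sub is not None:
            # reach the i-th child: first-child, then i next-sibl hops,
            # recording the tag of every node visited on the way
            hop = [("first-child", kids[0]["tag"])]
            hop += [("next-sibl", kids[j]["tag"]) for j in range(1, i + 1)]
            return hop + sub
    return None

def tag_path(ancestor, target):
    """Characteristic tag path of target relative to ancestor (Definition 8)."""
    steps = _steps(ancestor, target)
    return ancestor["tag"] + "".join(f"/{s}::{t}" for s, t in steps)
```

On the example from the text (an a whose p child contains a span followed by an i), this reproduces a/first-child::p/first-child::span/next-sibl::i.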

Based on the tag path, AMBER quantifies the fraction of records that support the assumption that a node n is an attribute of type A within record r with the support supp_r(n, A).

DEFINITION 9. Let E be an extraction instance on DOM P, containing a node n within record r belonging to data area d, and let A ∈ 𝒜 be an attribute type. Then supp_r(n, A) denotes the support of n as an attribute of type A within r, defined as the fraction of records r′ ≠ r in d that contain a node n′ with tag-path_r(n) = tag-path_{r′}(n′) that is annotated with A.

Consider a data area with 10 records, containing 1 PRICE-annotated node n1 with tag path div/. . . /next-sibl::span within record r1, and 3 PRICE-annotated nodes n2, . . . , n4 with tag path div/. . . /first-child::p within records r2, . . . , r4, respectively. Then supp_{r1}(n1, PRICE) = 0.1 and supp_{ri}(ni, PRICE) = 0.3 for 2 ≤ i ≤ 4.
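This worked example can be reproduced with a few lines of Python. A sketch under simplifying assumptions: each record is a dict from tag path to annotation, the paths "pa"/"pb" are hypothetical stand-ins for the elided tag paths above, and (matching the worked numbers) the fraction is taken over all records of the data area.

```python
def support(records, path, atype):
    """Fraction of records containing a node with the given characteristic
    tag path annotated with `atype` (support, in the sense of Definition 9,
    computed over all records to match the worked example)."""
    hits = sum(1 for r in records if r.get(path) == atype)
    return hits / len(records)

# hypothetical 10-record data area: 1 record annotates PRICE at path "pa",
# 3 records at path "pb", 6 records carry no PRICE annotation
records = [{"pa": "PRICE"}] + [{"pb": "PRICE"}] * 3 + [{}] * 6
```

support(records, "pa", "PRICE") then evaluates to 0.1 and support(records, "pb", "PRICE") to 0.3, matching the example.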

The support of an attribute allows us to characterize extraction instances where all attributes are supported by enough evidence on the web page, as well as the notion of an extraction instance that is optimal w.r.t. the alignment of attributes to records.

DEFINITION 10. Let E be an extraction instance under schema S on DOM P, let M be the set of mandatory attribute types in S, and let σ0 ≤ σ1 be two thresholds. Then E is well-supported if for all attributes (leaves) a ∈ E with (r, a) ∈ E, supp_r(a, τ(a)) > σ_a holds, with σ_a = σ0 for τ(a) ∉ M and σ_a = σ1 otherwise.

Mandatory attributes require a higher support than optional attributes, and we typically use σ1 = 50% and σ0 = 20%.

DEFINITION 11. Let E be a partial extraction instance under schema S on DOM P containing only data areas and records (and no attributes). Then the optimal attribute alignment for E is an extraction instance E′ such that (1) E′ contains a subset of the data areas and records of E, (2) E′ is well-supported and consistent for S, and (3) E′ is maximal in the number of mandatory attributes among all such extraction instances.

To construct an optimal attribute alignment from a partial extraction instance constructed as in Section 4.2, AMBER first associates all annotated nodes as attributes with the record they are contained in. However, these annotations are inherently noisy and there may be

Page 45: Diadem DBOnto Kick Off meeting

Example 3: Pagination links
B L O C K P H E N O M E N O L O G Y 45

Domain      | Website       | n   | n1 | n2 | P | R
Real estate | FindAProperty | 370 | 1  | 1  | 1 | 1
            | Zoopla        | 332 | 1  | 1  | 1 | 1
            | Savills       | 234 | 2  | 2  | 1 | 1
Cars        | Autotrader    | 262 | 2  | 2  | 1 | 1
            | Motors        | 472 | 2  | 2  | 1 | 1
            | Autoweb       | 103 | 2  | 2  | 1 | 1
Retail      | Amazon        | 448 | 1  | 1  | 1 | 1
            | Ikea          | 290 | 2  | 0  | 1 | 1
            | Lands’ End    | 527 | 2  | 2  | 1 | 1
Forums      | TechCrunch    | 279 | 0  | 1  | 1 | 1
            | TMZ           | 200 | 2  | 2  | 1 | 1
            | Ars Technica  | 341 | 2  | 2  | 1 | 1

Table 1: Sample pages (screenshot column not reproduced)

n is the number of links on the result page, n1 (n2) the number of immediate numeric (non-numeric) pagination links on the page, and P, R are precision and recall for our approach.¹ For each website we also present a screenshot of either its pagination links or a potential false positive. Even in this small sample of webpages, we can observe the diversity of pagination links: Only six of the twelve websites have a typical pagination link layout (non-numeric link containing a NEXT keyword and a list of numeric links with the current page represented as a non-link). Some of the challenges evident from this table are:

1. For FindAProperty and IKEA, the index of the current page is a link and thus we need to consider, e.g., its style to distinguish it from the other links.
2. For Zoopla, the “50” for the results per page can be easily mistaken for an immediate numeric pagination link.
3. For Savills, numeric links come as intervals. However, our NUMBER annotations also cover numeric ranges (as well as “2k” or “two”).
4. For Amazon, the result page contains a confusing scrollbar for navigation through the related products (right screenshot).
5. For Lands’ End, the non-numeric pagination link is an image. However, our approach classifies it correctly, based on the context and attribute values.
6. TechCrunch contains a single isolated non-numeric pagination link, which we are able to identify due to the keyword present in its text and the proximity to “Page 1”.
7. TMZ has a pagination link that carries both a NEXT and a NUMBER annotation. From the context, we nevertheless identify it correctly as non-numeric.

¹ Precision is the percentage of true positives among the nodes identified as pagination links, recall the percentage of identified pagination links among all pagination links (and thus lower recall means more false negatives).

Page 46: Diadem DBOnto Kick Off meeting

Example 3: Pagination links
B L O C K P H E N O M E N O L O G Y 46

○ Machine learning on top of derived features

No. | Description | Type | Predicate

Content
1  | Annotated as NEXT                          | bool | plm::annotated_by<NEXT>
2  | Annotated as PAGINATION                    | bool | plm::annotated_by<PAGINATION>
3  | Annotated as NUMBER                        | bool | plm::annotated_by<NUMBER>
4  | Number of characters                       | int  | plm::char_num

Page position
5  | Relative position on page                  | int2 | plm::relative_position<css::page>
6  | Relative position in first screen          | int2 | plm::relative_position<std::first_screen>
7  | In first screen                            | bool | plm::contained_in<std::first_screen>
8  | In last screen                             | bool | plm::contained_in<std::last_screen>

Visual proximity
9  | Pagination annotation close to node        | bool | plm::in_proximity<plm::annotated_by<PAGINATION>>
10 | Number of close numeric nodes              | int  | plm::num_in_proximity<numeric>
11 | Closest numeric node is a link             | bool | plm::closest<std::left_proximity>_with<numeric>_is<non_link>
12 | Closest numeric node has different style   | bool | <numeric>_is<different_style>
13 | Closest link annotated with NEXT           | bool | <dom::clickable>_is<plm::annotated_by<NEXT>>
14 | Ascending w. closest numeric left, right   | bool | plm::ascending-numerics

Structural
15 | Preceding numeric node is a link           | bool | plm::closest<std::preceding>_with<numeric>_is<non_link>
16 | Preceding numeric node has different style | bool | <numeric>_is<different_style>
17 | Preceding link annotated with NEXT         | bool | <dom::clickable>_is<plm::annotated_by<NEXT>>

Table 3: PLM: Pagination Link Model

The page position features are the relative position on the page and in the first screen, as well as whether a node is in the first or last screen. They are defined by two instantiations in Figure 4, one for relative positions (using css::page and std::first_screen, resp.) and one for the presence in the first or last screen.

The visual proximity features are the most involved ones. They include a feature on whether there is a node in visual proximity that is annotated with PAGINATION (9), a feature on the number of numeric nodes in the proximity (10), and a feature that specifies whether the node and the closest numeric nodes in its proximity to the left and to the right form an ascending sequence (14). Features 11–13 ask whether the closest node with a certain property passes a given test, e.g., whether the closest numeric node is a link. Accordingly, 11–13 are instantiations of closest, while 9 and 10 are instantiations of in_proximity and num_in_proximity. Feature 14 is the only feature in this model that is defined entirely from scratch.
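The ascending-numerics feature (14) boils down to a three-way comparison. A minimal sketch, under the simplifying assumption that visual proximity has already been reduced to adjacency in a list of numeric-node values in left-to-right order (the function name is hypothetical):

```python
def ascending_numerics(values, idx):
    """Feature 14 sketch: does the numeric value at position idx form an
    ascending run with the closest numeric values to its left and right?
    values: numbers of the numeric nodes in left-to-right order."""
    left = values[idx - 1] if idx > 0 else None
    right = values[idx + 1] if idx + 1 < len(values) else None
    return left is not None and right is not None and left < values[idx] < right
```

A node labeled "2" between "1" and "3" fires the feature; a stray "50" (results per page) between unrelated numbers typically does not, which is exactly why the feature helps separate pagination numbers from other numerics.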

The structural features are similar to 11–13, but use XPath’s preceding axis instead of visual proximity. E.g., 15 tests whether the numeric node immediately preceding the given node is a link. They are omitted from Figure 5 as they are similar to 11–13.

5.1 Training the Classifier

With this feature model, BERyL derives a classifier based on a small training set. Forpagination link classification, a training corpus of only two dozen pages suffices to
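The excerpt does not say which learner BERyL uses, so as a stand-in here is a tiny decision-stump learner over boolean feature vectors: with features as discriminative as those in Table 3, even this trivial model separates the training set (all names in the snippet are hypothetical).

```python
def train_stump(examples, labels):
    """Pick the single boolean feature whose raw value best predicts the
    label on the training set (a one-feature decision stump; illustrative
    only, not BERyL's actual learner)."""
    features = examples[0].keys()
    def accuracy(f):
        return sum(x[f] == y for x, y in zip(examples, labels)) / len(labels)
    return max(features, key=accuracy)

# hypothetical training set: two of Table 3's features per candidate link
X = [{"annotated_next": True,  "in_last_screen": False},
     {"annotated_next": False, "in_last_screen": False},
     {"annotated_next": True,  "in_last_screen": True},
     {"annotated_next": False, "in_last_screen": True}]
y = [True, False, True, False]   # is the node a pagination link?
```

train_stump(X, y) selects "annotated_next", the feature perfectly correlated with the label, while the uninformative "in_last_screen" scores only 50%.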

Page 47: Diadem DBOnto Kick Off meeting

Example 3: Pagination links
B L O C K P H E N O M E N O L O G Y 47

○ Datalog± rules for deriving features

○ Lots of visual reasoning on the page

○ Rich template language to avoid duplication

TEMPLATE annotated_by<Model,AType> {
  <Model>::annotated_by<AType>(X) ← node_of_interest(X),
    gate::annotation(X, <AType>, _). }

TEMPLATE in_proximity<Model,Property(Close)> {
  <Model>::in_proximity<Property>(X) ← node_of_interest(X),
    std::proximity(Close,X), <Property(Close)>. }

TEMPLATE num_in_proximity<Model,Property(Close)> {
  <Model>::num_in_proximity<Property>(X,Num) ← node_of_interest(X),
    std::proximity(Close,X), Num = #count(N: <Property(Close)>). }

TEMPLATE relative_position<Model,Within(Height,Width)> {
  <Model>::relative_position<Within>(X, (PosH, PosV)) ← node_of_interest(X),
    css::box(X, LeftX, TopX, _, _), <Within(Height,Width)>,
    PosH = 100·LeftX/Width, PosV = 100·TopX/Height. }

TEMPLATE contained_in<Model,Container(Left,Top,Bottom,Right)> {
  <Model>::contained_in<Container>(X) ← node_of_interest(X),
    css::box(X,LeftX,TopX,RightX,BottomX), <Container(Left,Top,Right,Bottom)>,
    Left < LeftX < RightX < Right, Top < TopX < BottomX < Bottom. }

TEMPLATE closest<Model,Relation(Closest,X),Property(Closest),Test(Closest)> {
  <Model>::closest<Relation>_with<Property>_is<Test>(X) ← node_of_interest(X),
    <Relation(Closest,X)>, <Property(Closest)>, <Test(Closest)>,
    ¬(<Relation(Y,X)>, <Property(Y)>, <Relation(Y,Closest)>). }

Fig. 4: BERyL feature templates

In a similar way, the second template defines a boolean feature that holds for nodes of interest if there is another node in their proximity for which Property(Close) is true. To instantiate it to nodes that are annotated with PAGINATION, we write

INSTANTIATE in_proximity<Model,Property(Close)>
USING <plm, plm::annotated_by<PAGINATION(Closest)>>

Observe that BERyL templates thus allow for two forms of template parameters: variables and predicates. More formally,

Definition 3. A BERyL template is an expression TEMPLATE N<D1, . . . , Dk> {p ← expr} such that N is the template name, D1, . . . , Dk are template parameters, p is a template atom, and expr is a conjunction of template atoms and annotation queries. A template parameter is either a variable or an expression of the shape p(V1, . . . , Vl) where p is a predicate variable and V1, . . . , Vl are names of required first-order variables in bindings of p.

A template atom p<C1, . . . , Ck>(X1, . . . , Xn) consists of a first-order predicate name or predicate variable p, template variables C1, . . . , Ck, and first-order variables X1, . . . , Xn. If p(V1, . . . , Vl) is a parameter for N, then {V1, . . . , Vl} ⊆ {X1, . . . , Xn}.

An instantiation always has to provide bindings for all template parameters. We extend the usual safety and stratification definitions in the obvious way to BERyL template programs. Then it is easy to see that the rules derived by instantiating a safe and stratified template program are always a safe, stratified Datalog¬,Agg program.
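At its core, instantiation is macro expansion: template parameters are substituted by concrete names to obtain plain rules. A toy sketch of this idea (illustrative only; BERyL's real instantiation also handles predicate parameters with variable bindings, which this string substitution does not):

```python
import re

def instantiate(template, bindings):
    """Replace every <Param> placeholder in a template rule with its
    concrete binding, yielding a plain Datalog rule (toy macro expansion)."""
    return re.sub(r"<(\w+)>", lambda m: bindings[m.group(1)], template)

# hypothetical flattened form of the annotated_by template from Fig. 4
rule = "<Model>::annotated_by_<AType>(X) :- node_of_interest(X), annotation(X, <AType>)."
```

instantiate(rule, {"Model": "plm", "AType": "NEXT"}) expands both occurrences of <AType> consistently, mirroring how one template yields features 1–3 of Table 3 for NEXT, PAGINATION, and NUMBER.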

Page 48: Diadem DBOnto Kick Off meeting

Discussion
Q U E S T I O N S 48

?