SXPath - Extending XPath towards Spatial Querying on Web ...staab/Research/... · SXPath - Extending XPath towards Spatial Querying on Web Documents Ermelinda Oro Department of Electronics,

SXPath - Extending XPath towards Spatial Querying onWeb Documents

Ermelinda OroDepartment of Electronics,Informatics, and Systems

University of CalabriaAltilia s.r.l.

Via. P. Bucci, 41/C, 87036,Rende (CS), Italy

[email protected]

Massimo RuffoloInstitute of High PerformanceComputing and Networking

of the Italian CNRAltilia s.r.l.

Via. P. Bucci, 41/C, 87036,Rende (CS), Italy

[email protected]

Steffen StaabInstitute for Computer Science

University of KoblenzUniversitatsstraße 1,

PO Box 201 602 56016Koblenz, Germany

[email protected]

ABSTRACTQuerying data from presentation formats like HTML, for purposessuch as information extraction, requires the consideration of treestructures as well as the consideration of spatial relationships be-tween laid out elements. The underlying rationale is that frequentlythe rendering of tree structures is very involved and undergoingmore frequent updates than the resulting layout structure. There-fore, in this paper, we present Spatial XPath (SXPath), an extensionof XPath 1.0 that allows for inclusion of spatial navigation primi-tives into the language resulting in conceptually simpler queries onWeb documents. The SXPath language is based on a combinationof a spatial algebra with formal descriptions of XPath navigation,and maintains polynomial time combined complexity. Practical ex-periments demonstrate the usability of SXPath.

1. INTRODUCTIONHTML is a presentation-oriented document format conceived for

presenting Web documents on screen. Querying data from such aformat requires a versatile machinery to abstract from presentationstructure into conceptual relationships. Germane to such abstrac-tion is the transformation into structural languages such as XML.Indeed quite often XML languages are used as presentation for-mats, e.g. HTML 5. Based on these structural languages, one mayuse well founded and known query formalisms such as XPath 1.0[25] and XQuery 1.0 [24] in order to turn presentation structureinto meaningful relational or logical representations. Typical prob-lems are incurred by the separation of document structure and theensued spatial layout — whereby the layout often indicates the se-mantics of data items, e.g. the meaning of a table cell entry is mosteasily defined by the leftmost cell of the same row and the topmostcell of the same column (in Western languages) [9]. In tree struc-tures such as those arising on real-world Web pages, such spatialarrangements are rarely explicit and frequently hidden in complexnestings of layout elements — corresponding to intricate tree struc-tures that are conceptually difficult to query.

Permission to copy without fee all or part of this material is granted providedthat the copies are not made or distributed for direct commercial advantage,the VLDB copyright notice and the title of the publication and its date appear,and notice is given that copying is by permission of the Very Large DataBase Endowment. To copy otherwise, or to republish, to post on serversor to redistribute to lists, requires a fee and/or special permission from thepublisher, ACM.VLDB ‘10, September 13-17, 2010, SingaporeCopyright 2010 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

In the literature, approaches aimed at manipulating Web pagesby leveraging the visual arrangement of page contents [13, 22], andframeworks for representing and querying multimedia and presen-tation databases [2, 14] have been proposed. However such ap-proaches and frameworks provide limited capabilities in navigatingand querying Web documents for information extraction purposes.Therefore, we propose to extend XPath 1.0 by new spatial navi-gation primitives that allow for querying Web documents, namely:(i) Spatial Axes, based on topological [21] and rectangular cardinalrelations [18], that allow for selecting document nodes that havea specific spatial relation w.r.t. the context node. (ii) Spatial Po-sition Functions that exploit some spatial orderings among docu-ment nodes, and allow for selecting nodes that are in a given spatialposition w.r.t. the context node.

The main contributions of this paper are the following: (i) Wehave defined the novel Spatial Document Object Model (SDOM)that constitutes a mixed spatial and structural model of HTML pages.In our model, we represent the relationships between DOM nodesresulting from the XML tree structures on the one hand and fromthe spatial relationships between the rendered XML elements onthe other hand. (ii) We have defined and implemented the SpatialXPath (SXPath) language that allows for navigating the SDOM.It can be used on its own, but also as building block for moreexpressive languages such as XQuery. (iii) We have defined twofragments of SXPath language, namely: Core SXPath that repre-sents the navigational core of SXPath, and Spatial Wadler Frag-ment (SWF ) that is a fragment with very interesting practical value,since the vast majority of useful queries fall into it, and that allowsfor optimized evaluation. (iv) We have defined the formal seman-tics of the language and the evaluation algorithms that allow formaintaining combined polynomial time complexity [12]. (v) Wehave performed several empirical evaluations showing that (a) pro-cessing time needed is polynomial as predicted by worst case com-plexity, and that the SXPath language is (b) substantially indepen-dent from the adopted browser and visualization parameters varia-tions, (c) easy to learn and easier to use than pure XPath on Webpages, (d) more tolerant to modifications of the internal structure ofWeb pages, and (e) capable to provide benefit to some current Webcontents manipulation and wrapper learning approaches.

The paper is organized as follows. In Sec. 2 we provide moti-vations that drove us to extend XPath 1.0 with spatial navigationprimitives. In Sec. 3 we describe SXPath data model, formal syn-tax, semantics and complexity issues. Sec. 4 reports notes on theimplementation and results of experiments aimed at evaluating pro-cessing performances, SXPath usability, and qualitative enhance-

1

ment in real applications. Sec. 5 briefly describes related works.

2. WHY SPATIAL XPATH?Web designers plan Web pages contents in order to provide vi-

sual patterns that help human readers to make sense of documentcontents. This aspect is particularly evident in Deep Web pages[16], where designers always arrange data records and data itemswith visual regularity to meet the reading habits of humans.

Figure 1: A Page of the http://www.lastfm.it/ Web Site

In Fig. 1 we show a deep Web page coming from the last.fm so-cial network. For instance, information about the two bands ‘Cold-play’ and ‘Radiohead’ in Fig. 1 is given using similar layout. In thepast, manual wrapper construction (e.g. [22]), or wrapper induc-tion approaches (e.g. [7, 19, 27]) have exploited regularities in theunderlying document structures, which led to such similar layout,to translate such information into relational or logical structures.However, surveying a large number of real (Deep) Web pages, wehave observed that the document structure of current Web pageshas become more complicted than ever implying a large concep-tual gap between document structure and layout structure. Thus, ithas become very difficult: (i) for human and applications aimingat manipulating Web contents (e.g. [10, 13, 22]), to query the Webby language such as XPath 1.0; (ii) for existing wrapper inductionapproaches (e.g. [19, 27]) to infer the regularity of the structure ofdeep Web pages by only analyzing the tag structure. Hence, theeffectiveness of manual and automated wrapper construction arelimited by the requirement to analyse HTML documents with in-creasing structural complexity. The SXPath language allows fornavigating DOM structures like XPath 1.0 as well as for exploitingthe spatial layout of DOM nodes after Web pages rendering. In thefollowing examples we give an intuition about novel features andcapabilities of the SXPath language.

Layout engines of Web browsers consider the area of the screenaimed at visualizing a Web page, as a 2-dimensional Cartesianplane. They adopt rendering rules that take into account the pageDOM structure and the associated cascade style sheets (CSS). Inthe rendered page each DOM node is visualized in a rectangle hav-ing sides parallel to the axes of the Cartesian plane. Fig. 2 depictsrectangles created by a layout engine for the page in Fig. 1. Somerectangles are annotated by fragments of XPath 1.0 location pathsthat characterize related DOM nodes.

A human reader can relate information on the page in Fig. 1 byconsidering the spatial arrangement of laid out content elements.He can interpret the spatial proximity of images and nearby stringsas a corresponding aggregation of information, namely as the com-plete record describing the details of a music band profile and oneof its photos. Fig. 3 depicts a sketch of the DOM related to the

part of the document containing the profile of the Radiohead musicband. Grey circles and boxes are used for the nodes that are visu-alized on screen, whereas white circles and boxes depict nodes thatbelong to the DOM but are not visible on screen.

../ul/li[2]../ul/li[2]/a[2]../ul/li[2]/a[2]/text()

../ul/li[2]/a[1]/strong../ul/li[2]/a[1]/strong/text()

../ul/li[2]/p[1]../ul/li[2]/p[1]/text()

../ul/li[2]/p[2]

../ul/li[2]/p[2]/text()[1]

../ul/li[2]/p[2]/a[2]../ul/li[2]/p[2]/a[2]/text()

../ul/li/a[1]../ul/li[2]/a[1]/span../ul/li[2]/a[1]/span/span

../ul/li[2]/a[1]/span/span/img

../ul/li[2]/p[3] ../ul/li[2]/p[3]/a../ul/li[2]/p[3]/a/span

../ul/li[2]/p[3]/a/span/text()

../ul/li[2]/p[2]/text()[4]

Figure 2: Rectangles that Bound Visualized DOM Nodes

Figure 3: A DOM Portion of the Web Page in Fig. 1

The Web page in Fig. 1 allows us to show how to extract dataabout music bands by a corresponding conventional query usingexisting XQuery 1.0 [24] and XPath 1.0 on the DOM partially de-picted in Fig. 3.

Example 1. XQuery 1.0 and XPath 1.0for $li in document ("last-fm.htm")(1.1) //div[@id=’content’]//ul/li return

<music-band>(1.2) <name> {$li/a/strong/text()} </name>

<similar-bands>(1.3) {$li/p[2]//text()}

</similar-bands></music-band>

For writing the query in Example 1 above, a human user mustknow the intricate DOM structure of the input Web page (a sketchof which is given in Fig. 3). The location path (1.1) is the shortestrelative path found by performing many attempts. In fact, shorterlocation paths like //li or //div//ul/li (that does not use at-tributes) make the query unsound. In contrast, the following queryexploits SXPath that allows for extracting details of music bands byexploiting only the DOM nodes of type img and text, and theirspatial relationships.

Example 2. XQuery 1.0 and SXPathfor $img in document ("last-fm.htm")(2.1) /CD::img[N|S::img] return

<music-band>(2.2) <name> {$img/E::text[W,1][N,1]} </name>

<similar-bands>(2.3) {$img/E::*[W,1][N,3][max]/CD::text}

</similar-bands></music-band>

The spatial location path (2.1) returns images that form a ver-tical sequence. These are exactly the photos of music bands in thepage. Such kind of visual patterns (i.e. alignment in a given direc-tion) are very frequent in Deep Web pages, but often hard to recog-nize in the DOM structure. The spatial location path $img/E::text in (2.2) returns all nodes labeled by text that lie on east(spatial axis E) of the context node represented by the variable$img (photos of music bands). Among these nodes the predi-cates [W,1] and [N,1] select the node that is the first from westand from north, i.e. the name of the bands (e.g. Coldplay and

2

Radiohead). The spatial location path (2.3) selects the nodesthat are first from west, third from north, and are not containedin other nodes (predicate [max]) among all nodes on east of eachphoto of music bands (see the node ..ul/li[2]/p[2] in Fig. 2).So, all similar bands are selected.

Example 2 demonstrates that human users can exploit visual pat-terns that they recognize on the screen. They can define spatial lo-cation paths based on layout and DOM structure1. We argue thatin the future machines will be able to perform a likewise step ofexploiting visual patterns for learning extraction rules, because thespatial location paths are conceptually simpler than their XPath 1.0based counterparts.

3. THE SXPATH LANGUAGEThe SXPath language extends the W3C’s XPath 1.0 [25] with

spatial capabilities. Intuitive navigational features and queryingcapabilities of XPath 1.0 are central to most XML-related tech-nologies. For this reason XPath 1.0 has attracted great attentionin the computer science research community. Even though XPath2.0 reached recommendation status, it is not well suited for our ob-jectives. In fact, the XPath Core 2.0 query evaluation is PSpace-complete [23] in contrast to the polynomial time complexity ofXPath 1.0 [12, 20] (combined complexity). Furthermore, a strongtheoretical background on XPath 1.0 is currently available, and spe-cific subsets of the language with attractive properties and essentiallanguage features have been characterized [6]. These investiga-tions lay the foundations for understanding the formal properties ofSXPath. The SXPath language adopts the path notation of XPath1.0 augmented by a user-friendly syntax having a natural semanticsthat enables spatial querying. In particular, SXPath provides: (i) Anew set of spatial axes that allow for selecting nodes that have aspecific spatial relation w.r.t. context nodes; (ii) New node set func-tions, namely spatial position functions, that allow for expressingpredicates working on positions of nodes in the plane. As alreadyshown in Example 2, such extensions are very useful in practice.In this section, we first define the SXPath data model and describeits new spatial capabilities. Then we provide the formal syntax andsemantics, and the computational complexity of the language.

3.1 Data ModelIn this section we present the SXPath data model, namely Spatial

DOM (SDOM), and some auxiliary functions used in the rest of thepaper. The SDOM considers relations existing among the visualrepresentation of DOM nodes defined as follows.

Definition 1. Let n be a node in the DOM of a Web page, theminimum bounding rectangle (MBR) of n is the minimum rect-angle r that surrounds the contents of n (see Fig. 2) and has sidesparallel to the axes (x and y) of the Cartesian plane. The functionmbr(n) returns the rectangle r assigned to a DOM node by thelayout engine of a Web browser. We call rx and ry the segmentsthat are obtained as the projection of r on the x-axis and the y-axisrespectively. Then, each side of the rectangle is represented by thesegments (r−x, r

+x) and (r−y , r

+y ), where r−x (resp. r−y ) denote the infi-

mum on the x-axis (y-axis) and r+x (resp. r+y ) denote the supremumon the x-axis (y-axis) of the segments rx and ry .

Considering the function mbr(n) given in Def. 1, a Web pagecan be modeled as a DOM enriched by spatial relations existingbetween MBRs. For representing such spatial relations we adoptthe Rectangle Algebra (RA) qualitative spatial model [4], which al-lows for representing all possible relations between rectangles thesides of which are parallel to the axes of some orthogonal basis in a1In this example, the DOM structure in use was restricted to XMLnode types.

2-dimensional Euclidean space. RA is a straightforward extensionof the standard model for temporal reasoning, the Interval Algebra(IA) [3], to the 2-dimensional case. IA models the relative positionbetween pairs of segments by a set of 13 atomic relations (Rint),namely before (b), meet (m), overlap (o), start (s), during (d),finish (f ), together with theirs inverses {bi, mi, oi, si, di, fi}and the relation equal (e). Let s and s1 be two segments the IArelation s b s1 represents that the segment s is preceded by the seg-ment s1. For a pictorial representation of IA relations see Table 4 inApp. A. Let a and b be two rectangles, a RA relation between themis written as a ρ b where ρ = (ρx, ρy) is a pair of IA relations. TheRA relation holds iff the IA relations ax ρx bx and ay ρy by hold forsegments that are obtained as projections of rectangle sides along x(i.e. ax, bx) and y (i.e. ay, by) respectively. The expressiveness ofRA covers the modeling of all qualitative spatial relations betweentwo MBRs. For a pictorial representation of RA relations see Fig. 7in App. A. Furthermore, since the RA model is an algebra, it holdsmany important properties (e.g. invertibility) that allow optimizedquery evaluation.

Definition 2. SDOM is a node labeled sibling ordered tree [11,15, 17] enriched by RA relations. It is described by the following5-tuple:

SDOM = ⟨V,R⇓,R⇒,A, fs⟩where:

● V is the set of labeled DOM nodes. V = Vv ∪Vnv , where Vvis the set of nodes visualized on screen, and Vnv is the set ofnodes that are not visualized.

● R⇓ is the firstchild relation. Let n and n′ be two nodes inV , nR⇓n′ holds iff n′ is the first child of n.

● R⇒ is the nextsibling relation. Let n and n′ be two nodesin V , nR⇒n′ holds iff n′ is the next sibling of n.

● A ⊆ Vv × Vv is the set of arcs that represent spatial relationsbetween pairs of nodes visualized on screen.

● Let Rrec be the set of RA relations, fs ∶ A → Rrec is thefunction that assigns to each element in A a RA relation inRrec. So, let n and n′ be two nodes in Vv , we have a =

(n,n′) ∈ A holds iff mbr(n) fs(a)mbr(n′).

Definition 3. Let Σ be the labeling alphabet (i.e., “tags”). Wedefine the node test function T ∶ (Σ∪ {∗}∪ {text})→ 2V (“nodetest”) which assigns to each label (XML tag or textual leaf node)the set of nodes labeled with it. Furthemore, T (∗) ∶= V .

Figure 4: A SDOM Fragment of the Web Page Portion in Fig. 2

Fig. 4 depicts the SDOM for a portion of Web page shown inFig. 2. Solid arrows represent the classical DOM tree structure thatis equivalent to the DOM depicted in Fig. 3. Dashed labeled arcsrepresent the RA relations between pairs of nodes. For instance,the arc (d, d) between nodes ul and li represents the RA relationmbr(ul)(d, d)mbr(li), i.e. both IA relationsmbr(ul)x d mbr(li)xand mbr(ul)y d mbr(li)y hold, and means that the rectanglembr(li) is spatially contained in the rectangle mbr(ul). In orderto improve readability of the figure, spatial relations are representedonly for a subset of nodes.

3

The SDOM model provides orders among nodes defined as follows.

Definition 4. Document order relations for a SDOM are:● the document total order <doc, already defined in [25]. Letu,w ∈ V be two nodes, then u <doc w iff the opening tag ofu precedes the opening tag of w in the (well-formed versionof the) document.

● the document directional total orders ⩽↑ (from south), ⩽→(from west), ⩽↓ (from north), and ⩽← (from east). Let u,w ∈

Vv be two nodes with MBRs a = mbr(u) and b = mbr(w)

respectively, we have u < w ( or u = w) iff:– a−y < b−y (a−y = b−y ) for ⩽↑, – a−x < b−x (a−x = b−x) for ⩽→,

– a+y > b+y (a+y = b+y ) for ⩽↓, – a+x > b+x (a+x = b+x) for ⩽←.

● the containment partial order ⩽t. Let u,w ∈ Vv be two nodeswith a =mbr(u) and b =mbr(w) respectively, we have u ≤w iff a−x ⩽ b

−x ⩽ b

+x ⩽ a

+x and a−y ⩽ b

−y ⩽ b

+y ⩽ a

+y . In particular,

u = w if a−x = b−x = b

+x = a

+x and a−y = b

−y = b

+y = a

+y .

In the following we introduce the concept of layering adoptedfor computing spatial order among nodes of a given node set.

Definition 5. Let Γ ⊆ Vv be a set of SDOM nodes, for eachspatial order ⩽z , where ⩽z∈ {⩽↑,⩽→,⩽↓,⩽←,⩽t}, the spatial lay-ering of nodes in Γ w.r.t. ⩽z is an ordered list of non-empty listsL⩽z(Γ) = {l1⩽z

,⋯, lh⩽z}, where let u,w be two nodes in Γ:

● if u = w w.r.t. ⩽z , then u,w ∈ li⩽zfor some i⩽z ∈ {1,⋯, h⩽z}.

● if u < w w.r.t. ⩽z , then u ∈ li⩽zand w ∈ lj⩽z

where i⩽z andj⩽z are the smallest numbers such that i⩽z , j⩽z ∈ {1,⋯, h⩽z}

and i⩽z < j⩽z .

Example 3. Considering Fig. 5 and the spatial order ⩽↑, we ob-tain the layeringL⩽↑({n1, . . . , n6}) = {l1⩽↑ , l2⩽↑ , l3⩽↑ } where l1⩽↑ ={n1}, l2⩽↑ = {n2, n4, n6}, and l3⩽↑ = {n3, n5}. The layeringcorresponds to the following directional total order from south:n1 ⩽↑ n2 =↑ n4 =↑ n6 ⩽↑ n3 =↑ n5.

In order to facilitate the definition of SXPath language semanticswe introduce the concepts of spatial index function and last indexfunction as follows.

Definition 6. Let Γ ⊆ Vv be a set of SDOM nodes, and letL⩽z(Γ)

= {l1⩽z, ⋯, lh⩽z

} be the spatial layering of nodes in Γ w.r.t. ⩽z ,where ⩽z∈ {⩽↑,⩽→,⩽↓,⩽←,⩽t}. The spatial index function pidx⩽z

(u,Γ) returns the index of the layer which the node u belongs to(where 1 is the smallest index). The last directional index functionplast⩽z(Γ) returns the index h⩽z .

mbr(n2)

mbr(n3) mbr(n5)

mbr(n6)

From East to West

From South to North

From North to SouthFrom West to East

mbr(n1)

mbr(n4)

Figure 5: Ordering directions

3.2 Spatial AxesRA relations, stored in the SDOM, represent all qualitative spa-

tial relations between MBRs, but they are too fine grained, verboseand not intuitive for querying. Therefore, for defining SXPath spa-tial axes we consider the more synthetic and intuitive RectangularCardinal Relation (RCR) [18] and Rectangular Connection Calcu-lus (RCC) [21] models. In particular, RCRs express spatial axesthat represent directional relations between MBRs. RCRs are com-puted by analyzing the 9 regions (cardinal tiles) formed, as shown

in Fig. 6, by the projections of the sides of the reference MBR (i.e.r). The atomic RCRs are: belongs to (B), South (S), SouthWest(SW), West (W), NorthWest (NW), North (N), NorthEast(NE), East (E), and SouthEast (SE). Using the symbol “:” itis possible to express conjunction of atomic RCRs. For instance,by considering Fig. 6, r E:NE r1 means that the rectangle r1 lieson east and (symbol “:”) north-east of the rectangle r. Moreover,the RCR model allows for expressing uncertain (disjunction of) di-rectional relations: for example r E|E:NE r1 means that r1 lieson E or (symbol “∣”) on E:NE of r. Furthermore, the three rela-tions inspired by the RCC calculus, namely: contained (CD), con-tainer (CR), and equivalent (EQ), allow for expressing spatial axesthat represent topological relations between MBRs. For instance,r CD r2 means that the rectangle r2 is spatially contained in therectangle r.

r2

r1

r B

N

S

W E

SE

NENW

SW

Figure 6: Cardinal tilesEach spatial axes (expressed by a RCR or a topological relation)

corresponds to a set of RA relations expressed by the Cartesianproduct shown in tables 1 and 2. Formally:

Definition 7. Let Rcard, Rtopo and Rrec be the set of (atomicand conjunctive) RCRs, topological and RA relations respectively,then the mapping function µ ∶ 2Rcard ∪Rtopo → 2Rrec that mapsRCRs and topological relations in terms of RA relations is:

µ(R) =

⎧⎪⎪⎪⎨⎪⎪⎪⎩

as in table 1 if R ∈ Rcardas in table 2 if R ∈ Rtopoµ(R1) ∪ ... ∪ µ(Rk) if R = R1∣...∣Rk

Example 4. For the relation NW reported in Tab. 1 we haveµ(NW ) = {b,m} × {bi,mi} = {(b, bi), (b,mi), (m,bi), (m,mi)}.

× {b,m} {o,fi} {di} {e,s,d,f} {si,oi} {bi,mi}{b,m} SW W:SW SW:W:NW W W:NW NW{o,fi} S:SW B:S:SW:W B:S:SW:W:NW:N B:W B:W:NW:N N:NW{di} S:SW:SE B:S:SW: All B:W:E B:W:NW: NW:N:NE

W:E:SE N:NE:E{e,s,d,f} S B:S B:S:N B B:N N{si,oi} S:SE B:S:E:SE B:S:N:NE:E:SE B:E B:N:NE:E N:NE{bi,mi} SE E:SE NE:E:SE E E:NE NE

Table 1: RCRs as cartesian products of RA RelationsTop. rel. RA-relations

EQ := { (e,e) }CR := {{ e, si, di, fi } × { e, si, di, fi }} − {(e,e)}CD := {{ e, s, d, f } × { e, s, d, f }} − {(e,e)}

Table 2: Topological relations mapped into RA RelationsLike in XPath 1.0, SXPath axes are interpreted binary relations

χ ⊆ V × V . Let self ∶= {⟨u,u⟩∣u ∈ V } be the reflexive axis,remaining SXPath axes are partitioned in two sets: ∆t and ∆s.The set ∆t contains traditional XPath 1.0 axes (forward, e.g. child,descendant, and reverse, e.g. parent, ancestor) that allow for navi-gating along the tree structure. They are encoded in terms of theirprimitive relations (i.e. firstchild, nextsibling and their inverses), asshown in [11, 12]. The set ∆s contains the novel (directional andtopological) spatial axes corresponding to the RCRs and Topolog-ical Relations that allow for navigating along the spatial RA rela-tions. In the following we formally define spatial axes in terms oftheir primitive RA relations stored in the SDOM.

Definition 8. SXPath spatial axes are interpreted binary rela-tions χs ⊆ Vv × Vv of the following form χs = {⟨u,w⟩∣u,w ∈

Vv ∧ mbr(u) ρ mbr(w) ∧ ρ ∈ µ(R)}. Here, R is the RCR ortopological relation that names the spatial axis relation and µ is themapping function.

In order to define the semantics of SXPath we give a trivial gen-eralization of the document total order w.r.t. an axis [25] and definethe relative index function.

4

Definition 9. The document total order w.r.t. the axis χ writtenas <doc,χ is: (i) the reverse document order >doc when χ is a tradi-tional reverse axis [25]; (ii) the document total order <doc (Def. 4)otherwise.

Definition 10. Let Γ ⊆ V be a set of nodes, the index functionw.r.t. the axis χ written as idxχ(u,Γ) returns the index of the nodeu in Γ w.r.t. <doc,χ (where 1 is the smallest index).

3.3 SyntaxIn this section we present the formal syntax of two important

fragments of SXPath, namely: Core SXPath and Spatial WadlerFragment. The grammar of their unabbreviated version is incre-mentally provided. Like XPath 1.0, SXPath has also a number ofsyntactic abbreviations, used in Example 2 of Sec. 2. But they arejust syntactic sugar and are not considered in the following.

We define the fragment of Core SXPath as the navigational coreof SXPath. It is obtained extending Core XPath [11] (the naviga-tional core of XPath 1.0) by spatial axes introduced in Sec. 3.2.

Definition 11. The EBNF grammar of Core SXPath is:locpath ::= ‘/’ locpath | locpath ‘/’ locpath |

locpath ‘|’ locpath | locsteplocstep ::= axis ‘::’ t | locstep ‘[’ bexpr ‘]’bexpr ::= bexpr ‘and’ bexpr | bexpr ‘or’ bexpr|

‘not(’ bexpr ‘)’ | locpath.axis ::= xpathAxis | spatialAxisxpathAxis ::= ‘self’ | ‘child’ | ‘parent’ | ⋯

spatialAxis::= topAxis | dirAxistopAxis ::= ‘EQ’ | ‘CD’ | ‘CR’dirAxis ::= ‘B’ | ⋯ | ‘U’

where:● locpath is the start symbol.● axis denotes axis relations that are traditional XPath axis

(xpathAxis) and atomic directional and topological axes(spatialAxis).

● t is the node test.Since Core SXPath lacks the ability to exploit spatial position of

nodes, we define the Spatial Wadler Fragment (SWF). SWF is thespatial extension of the Extended Wadler Fragment (EWF) [12]. Itallows positional, logical and arithmetic features.

Definition 12. The syntax of the SWF-Queries is defined by theCore SXPath grammar with the following extensions.expr ::= locpath | bexpr | nexprdirAxis ::= ‘B’ | ⋯ | ‘U’ | disjDirAxisbexpr ::= bexpr ‘and’ bexpr | bexpr ‘or’ bexpr|

‘not(’ bexpr ‘)’ | nexpr relop nexpr |sexpr relop sexpr | locpath |locpath relop sexpr | locpath relop number

nexpr ::= number | nexpr arithop nexpr |‘position()’ | ‘last()’ | ‘posFromS()’ |‘lastFromS()’ | ‘posFromN()’ | ‘lastFromN()’|‘posFromW()’ | ‘lastFromW()’ | ‘posFromE()’ |‘lastFromE()’ | ‘posSpatialNesting()’

sexpr ::= stringarithop ::= ‘+’ | ‘-’ | ‘*’ | ‘div’ | ‘mod’relop ::= ‘=’ | ‘!=’ | ‘<’ | ‘<=’ | ‘>’ | ‘>=’

where:● expr (instead of locpath) is the start symbol.● dirAxis considers disjunctive directional relation disjDirAxis2.

● nexpr extends traditional XPath numerical expressions withspatial position functions. number and string denoteconstant real-valued numbers and strings respectively.

We don’t give here the syntax of the full SXPath language. Thereason for this is lack of space. However, by considering [25] ex-tending the syntax is an easy exercise.2Core SXPath does not allow for querying spatial orders, so spatialaxes corresponding to disjunction ofRCRs do not add expressive-ness.

3.4 SemanticsIn this section we define the semantics of SXPath by adopting

the denotational semantics proposed in [26]. Like in XPath 1.0the main structural feature of SXPath are expressions, that return avalue from one of the following types: node set, number, string, orBoolean. Every expression evaluates relative to a context. Contextand domain of context are defined in the following as extension ofthe definition given in [26].

Definition 13. The Context is the following 12-tuple:c = ⟨n, p<doc , s<doc , p⩽↑ , s⩽↑ , p⩽→ , s⩽→ , p⩽↓ , s⩽↓ , p⩽← , s⩽← , p⩽t ⟩

where: (i) n is a context node. (ii) p⩽z are natural numbers thatindicate the context positions w.r.t. orders ⩽z ∈ {<doc,⩽↑,⩽→,⩽↓,⩽←,⩽t}. (iii) s⩽z are natural numbers that indicate the contextsizes w.r.t. orders in {<doc,⩽↑,⩽→,⩽↓,⩽←}.

Definition 14. The Domain of Contexts is the following set:C = {⟨n, p<doc , s<doc , p⩽↑ , s⩽↑ , p⩽→ , s⩽→ , p⩽↓ , s⩽↓ , p⩽← , s⩽← , p⩽t ⟩

∣n ∈ V ∧ 1 ⩽ p<doc⩽ s<doc

⩽ ∣V ∣ ∧ 1 ⩽ p⩽z ⩽ s⩽z ⩽ ∣Vv ∣ ∧ 1 ⩽ p⩽t ⩽ ∣Vv ∣}

where ⩽z∈ {⩽↑,⩽→,⩽↓,⩽←}.

In the following we first define the auxiliary semantic functionfor location paths then we give the semantics of SXPath.

Definition 15. Location path semantics. Considering the gram-mar defined in the Sec. 3.3. Let π,π1, π2 be location paths, letlocstep be a location step over an axis χ, let bexpr be a booleanexpression and let n be a context node, then the semantics functionof SXPath location paths P ∶ LocationPath→ node→ nodeset

is defined as follows:P J/πK(n) := P JπK(root)P Jπ1/π2K(n) := {n2∣n1 ∈ P Jπ1K(n) ∧ n2 ∈ P Jπ2K(n1)}

P Jπ1∣π2K(n) := P Jπ1K(n) ∪P Jπ2K(n)P Jaxis ∶∶ tK(n) :={n′ ∣ JaxisK(n,n′)} ∩ T (t)

P Jlocstep[bexpr]K(n) := {n′ ∣ W=P JlocstepK(n) ∧ n′∈W ∧

εJbexprK( cn′) = true ∧ cn′ ∶= ⟨w, idxχ(n′, W ), ∣W ∣, pidx⩽↑(n′, W )

plast⩽↑(W ), pidx⩽→(n′, W ), plast⩽→(W ), pidx⩽↓(n′, W ),

plast⩽↓(W ), pidx⩽←(n′, W ), plast⩽←(W ),pidx⩽t(n′, W )⟩}

where T is the node test function defined in Def. 3, and the se-mantics of an axis (JaxisK) is: (i) JspatialAxisK:={(n,n′) ∣

mbr(n) ρ mbr(n′) ∧ ρ = µ(spatialAxis)} for spatial axes, (ii)the semantic of traditional axis already described in [17].

Definition 16. Given an SXPath expression e and a context c,the semantics function ε ∶ SXPathExpression→C→ SXPathType

returns number, string, boolean, or node set values:εJπK(c) ∶= P JπK(n) εJposSpatialNesting()K(c) ∶= ptεJposition()K(c) ∶= p<doc εJlast()K(c) ∶= s<doc

εJposFromN()K(c) ∶= p⩽↓ εJlastFromN()K(c) ∶= s⩽↓εJposFromS()K(c) ∶= p⩽↑ εJlastFromS()K(c) ∶= s⩽↑εJposFromW ()K(c) ∶= p⩽→ εJlastFromW ()K(c) ∶= s⩽→εJposFromE()K(c) ∶= p⩽← εJlastFromE()K(c) ∶= s⩽←εJOp(e1, ..., em)K(c) ∶= F JOpK(εJe1K(c), . . . , εJemK(c))

where: (i) P is the function in Def. 15. (ii) F JOpK ∶ O1×...×Om →O, is defined in Tab. 3 (for a complete list of functions we refer to[11, 25]). The meaning of each expression is given by the meaningof its subexpressions.F J const. number i:→ num K() ::= iF J ArithOp: num × num→ num K (i1, i2 ) ::= i1 ArithOp i2 , where ArithOp∈ {+,-,*,/,%}F J constant string s:→ str K() ::= s

F J RelOp: num × num→ bool K (i1, i2 ) ::= i1 RelOp i2 , where RelOp ∈ {=,≠,⩽,<,⩾,>}F J RelOp: str × str→ bool K (s1, s2 ) ::= s1 RelOp s2F J RelOp: nset × const. string→ bool K (Γ, s)::= ∃u∈Γ:strval(u) RelOp sF J RelOp: nset × bool→ bool K (Γ, b) ::= F J boolean K(Γ)RelOp b

. . .

Table 3: Semantics of Functions in SXPath

5

3.5 Complexity IssuesThis section summarizes the computational complexity results of

the SXPath query evaluation problem. We show that Core SXpath,Spatial Wadler Fragment (SWF ), and Full SXpath allow polyno-mial time combined complexity query evaluation with increasingdegree of the polynomial. In the following theorems we denote byD the XML document, which has size Θ(∣V ∣), where ∣V ∣ is thenumber of nodes of its SDOM representation. It is noteworthy thatthe SDOM (see Sec. 3.1) has size O(∣V ∣

2).

Theorem 1. Core SXPath queries can be evaluated in timeO(∣D∣2

∗ ∣Q∣), where ∣D∣ is the size of the XML document, and ∣Q∣ is thesize of the query Q.

Proof Sketch 1. In [12] it has been shown that Core XPath 1.0has linear time combined complexity. We extend Core XPath 1.0evaluation algorithm, proposed in [12], by spatial axes function.For evaluating spatial axes, spatial relations between any pair ofvisualized nodes must be considered. Since in the SDOM any vis-ible node is spatially related to any other visible node, there areO(∣Vv ∣

2) many spatial relations to be considered in addition to the

O(∣V ∣) many relations of the DOM incurring a higher polynomialworst case complexity. The detailed description of the proof isfound in the Appendix B.1.

Considering the SWF syntax presented in Sec. 3.3, we can givethe following complexity result.

Theorem 2. SXPath queries that fall in the Spatial Wadler Frag-ment can be evaluated in time O(max(∣D∣

3∗ ∣Q∣, ∣D∣

2∗ ∣Q∣

2)),

where D is the XML document, and Q is a SWF query.Proof Sketch 2. The basic idea for proving this theorem is to

adopt the top-down/bottom-up evaluation strategy proposed in [12]for EWF queries. The most costly operation in SWF is the evalu-ation of spatial location paths that require the construction of thefull context. This operation, which requires the computation of thelayering introduced in Def. 5, generates a higher polynomial worstcase complexity w.r.t. EWF [12]. The detailed description of theproof is found in the Appendix B.2.

Theorem 3. Full SXPath queries can be evaluated in timeO(∣D∣4∗

∣Q∣2) and space O(∣D∣

2∗ ∣Q∣

2), where D is the XML document,

and Q is a Full SXPath query.Proof Sketch 3. Full SXPath combined complexity bound can

be proved by adopting a seamless extension of Full XPath queriesevaluation strategy proposed in [12] by spatial operators. The mostcostly operation in SXPath still remains the evaluation of XPath op-erations on any input context [12] resulting in the same combinedcomplexity bound of XPath 1.0. The detailed description of theproof is found in the Appendix B.2.

4. IMPLEMENTATION AND EXPERIMENTSWe have implemented the language in a system that embeds the

Mozilla browser and computes the SDOM in real time at each vari-ation of visualization parameters (i.e. screen resolution, browserwindow size, font type and dimension). In this way for each Webpage and visualization condition there is a unique correspondingSDOM that enables the user to query the Web page by consider-ing what s/he sees on the screen. For computing the SDOM thesystem exploits the Mozilla XULRunner3 application frameworkthat allows for implementing the function mbr (see Def. 1) that ac-quires coordinates of the MBRs assigned by the layout engine toeach visualized DOM node. Queries posed by users are parsed andevaluated on the SDOM by the query evaluator that returns queryanswers, and visualizes returned SDOM nodes on the Web page byrectangles with highlighted borders.3https://developer.mozilla.org/en/XULRunner 1.9.2 Release Notes

4.1 System EfficiencyFor evaluating the SXPath system efficiency we have performed

experiments that evidence the practical system behavior for both in-creasing document and query sizes. Due to lack of space, we illus-trate and compare results only for queries falling in SWF. For eval-uating data efficiency of the query evaluator, which is independentfrom SDOM construction, we have used a fixed SWF query hav-ing size ∣Q∣=167. Then, we have computed the needed query timefor increasing documents sizes from ∣D∣=50 to double the maxi-mum size we found on real-world Web pages, i.e. ∣D∣=7500 nodes.Fig. 8, in the App. C.1, depicts obtained curves (in log log scale)indicating polynomial time efficiency w.r.t. document size. In par-ticular, curves slopes are all 1, which indicates linear time behav-ior. The time needed for constructing the SDOM is depicted as astraight line on the log log scale (see Fig. 9 in the App. C.1). In par-ticular, the line slope is 2 indicating quadratic runtime behavior —as expected. Thus, the runtime requirements for the integrated sys-tem with SDOM construction and query evaluation remains poly-nomial. For evaluating query efficiency we tested the whole systemwith fixed document sizes (∣D∣=1000, ∣D∣=3000, ∣D∣=6000) andincreasing query sizes. Fig. 10, in the App. C.1, depicts obtainedcurves on a log log scale. For this experiment time grows linearlywith the query size (curves slopes are all 1). Thus, empirical behav-ior of the system is better than expected from the SWF worst caseupper bound. Details and rationales about Web pages and queriesadopted in the experiments are given in App. C.1.

4.2 Language UsabilityIn this section we present experiments aimed at assessing the

usability of our approach and the enhancements provided by SX-Path language over XPath 1.0. For the evaluation we have con-sidered the situation of an expert user aiming at manually devel-oping Web wrappers for Deep and Social Web sites. As for thisexpert we assume that s/he has a good command of XPath. S/hevisualizes and explores Web pages by using the SXPath systemwhere the embedded Mozilla browser is configured using a typicalsetting for which Web pages are generally developed (i.e. screenresolution of 1280x800 pixels, browser window size of 1024x768pixels, standard font type and dimension). We investigate the fol-lowing questions in our evaluation: (1) How robust is the SXPathlanguage w.r.t. Web browser and visualization parameters varia-tions? (2) How “easy” is it for this expert to understand and ap-ply SXPath? (3) How suitable is SXPath for manual wrapper con-struction, giving the expert the possibility to look only at the vi-sualized Web page, in comparison to XPath? (4) How suitable isSXPath for manual wrapper construction, giving the expert full in-sight into the SDOM/DOM, when compared against XPath? (5)How transportable is SXPath across multiple Web sites in com-parison to XPath? Experiments described in the following explorethese questions. In the experiments we have used a dataset of 125pages obtained by collecting 5 pages per site from 25 Deep Websites already exploited for testing wrapper learning approaches [10,19, 27] (see Tab. 7 in App. C.2). Experiments have involved tenusers who were students well trained in XPath with no experiencein SXPath.

Experiment 1. In this experiment we have explored question(1) by evaluating the robustness of SXPath w.r.t. Web browser andvisualization parameters variations. Firstly, considering the datasetof 25 Web sites listed in Tab. 7 in App. C.2, we have varied valuesfor screen resolution (i.e. 1280x800, 1024x768, 800x600 pixels),browser window size (i.e. 1024x768, 800x600, 640x480 pixels),font dimension (i.e. small, standard, large, huge), and font type(i.e. times, times new roman, tahoma, arial) and tested if the spatialrelations between node change. We have observed that modifica-

6

tions in: (i) Screen resolution and font type do not cause changes inthe spatial arrangement of visualized nodes. (ii) Browser windowsize does not affect 20 Web sites (highlighted by “†” in Tab. 7 inApp. C.2) because they are based on absolute positioning proper-ties, whereas for 5 sites we have changes in the spatial arrangementof visualized nodes. (iii) Font dimension affects the spatial arrange-ment of visualized nodes. In particular, by using large and hugefonts, visual appearance of Web pages strongly changes w.r.t. stan-dard font dimension. Thus, modifications in screen resolution andfont type do not affect query results, whereas changes in browserwindow size and font dimension could affect the query result. How-ever, this aspect does not impact SXPath usability because the SX-Path system: (i) Embeds the browser and computes the SDOM realtime (i.e. at each changing of visualization parameters). So, userscan query what they see on the screen at each moment. (ii) SXPathqueries are stored with visualization parameter settings adopted bythe user during the query design process. Thus, when a query isreused on a Web page the embedded browser is set with visualiza-tion parameters for which the query has been designed.

Secondly, we have constructed a dataset of 200 Deep Web pages,presenting either data records or tables, randomly selected fromthe www.completeplanet.com Web site (the largest portal of DeepWeb sites). We have visualized pages by the four most used Webbrowsers, i.e. Microsoft Internet Explorer, Mozilla Firefox, GoogleChrome, Safari set with default visualization parameters and stan-dard window size (i.e. 1024x768 pixels). We have observed that,even if coordinates returned by different browsers for rectanglesbounding visualized nodes were slightly different, the qualitativespatial arrangement of visualized SDOM nodes remained stable.Such a result was expected because rendering rules of most diffusedlayout engines4 have been strongly standardized as it is possible toverify by tests available in [1]. So, SXPath queries do not dependon the browser for fixed visualization parameters.

Experiment 2. In this experiment we have investigated ques-tion (2) by evaluating the effort needed for learning SXPath and thefeeling perceived by users in applying the language. We have de-fined the user task “identify product data records and extract prod-uct names and prices” from the Web site www.bol.de. We haveasked users to learn the SXPath language and complete the taskby writing a sound and complete SXPath query looking only at thevisualized Web page. For learning the language, we have providedusers with a short manual explaining spatial axis and position func-tion behavior. We have computed both the number of attempts andthe time needed by users for defining the query by using the SX-Path system. We have observed that users have required an averagenumber of 4.3 and 3.6 attempts for recognizing name and price re-spectively, and an average time equal to 57 minutes for learning theSXPath language and accomplishing the assigned task. For evalu-ating the level of “easiness” and “satisfaction” perceived by usersin learning and applying the SXPath language, at the end of theexperiment, we have asked users to answer a questionnaire basedon the seven-item Likert scale (described in App. C.2) where -3corresponds to “very difficult/unsatisfactory”, and +3 correspondsto “very easy/satisfactory”. The language was assessed as easy tolearn and quite satisfactory to use.

Experiment 3. In this experiment we have explored question(3) by assessing whether spatial information is helpful for a humanaiming at manually writing Web wrappers. We have asked users toperform the same extraction task of Experiment 2 by identifying,for each Web site in the dataset and only by looking at the displayedWeb pages, the best XPath and SXPath queries they could achieve4Trident for Microsoft Internet Explorer, Gecko for Firefox, We-bKit for Safari and Google Chrome

by using at the most 5 attempts. Results (shown in Tab. 7 in theApp. C.2) indicate average precision of 99% and recall of 100%for SXPath and average precision of 42% and recall of 99% forpure XPath 1.0. Users were able to define a good SXPath query byusing 2 attempts on average, whereas all the 5 available attemptswere not enough for finding a good pure XPath query for all Webpages in the dataset. The experiment clearly shows the advantageof SXPath over pure XPath. In fact, only SXPath makes spatiallayout information explicitly available for querying.

Experiment 4. In this experiment we have investigated ques-tion (4). We have asked users to perform the same extraction taskof Experiment 2 by identifying, for each Web site in the datasetand looking at both visualized Web pages and internal page struc-tures (i.e. DOM and SDOM), sound and coplete XPath and SXPathqueries by using at the most 10 minutes. Results (shown in Tab. 7in the App. C.2) indicate that: (i) For writing queries users needed4.2 and 2.7 average attempts for pure XPath and SXPath respec-tively. Since the number of attempts is proportional to needed time,a greater number of attempts indicate that to write queries in pureXPath requires more time in comparison to SXPath. (ii) In defin-ing pure XPath queries all users have shown the tendency to writeabsolute location paths by deeply analyzing the DOM structure,hence the average number of location steps for pure XPath and SX-Path has been 18.9, and 5.3 respectively. Hence, we asked users towrite queries by using relative pure XPath location paths by using atmost 5 minutes. We observed that further attempts needed by usersare proportional to the complexity of the DOM structure. Suchan experiment clearly shows that manually writing queries in pureXPath is more “complex” in comparison to SXPath because XPathrequires the navigation of very intricate DOM structures, whereasSXPath mainly requires to look at the displayed Web page.

Experiment 5. In this experiment we have aimed at answer-ing question (5) by evaluating whether SXPath queries are moregeneral and abstract than pure XPath queries given different Webpages and the same extraction task. Firstly, we have chosen thesubset of Web sites in the dataset having product names presentedby the same visual pattern (such sites, highlighted by “⊙” in Tab. 7of App. C.2, have different internal representations). We have ob-served that it is possible to use the same sound and complete spatiallocation path for extracting product names on all these Web sites.Instead, different XPath location paths are needed. Secondly, wehave asked users to write a SXPath query aimed at extracting thelist of friends in a social network randomly chosen among thoselisted in Tab. 8 in App. C.2. Then we have asked users to try toapply the same query to other remaining social networks in thelist. Even though the internal tag structure of various social net-works differ strongly (so different pure XPath queries are needed),all users have been able to use almost the same SXPath query for allsocial network Web sites. This experiment points out that SXPathallows for more general and abstract queries, that are independentfrom the internal structure of Web pages, in comparison to XPath.

Experiments provide a strong evidence for believing that humansaiming at manually defining Web wrappers and manipulating Webpages, may benefit from using SXPath navigation instead of pureXPath navigation. Moreover, the transportability of SXPath queriesfrom one Web site to the next simplify manual definition of Webwrappers and can also support wrapper induction from sparsely an-notated data, while the lack of such transportability observed forpure XPath is detrimental for both manual wrapper definition andwrapper induction. A full investigation of SXPath implications forwrapper induction effectiveness, however, goes beyond space andscope of this paper. Details and rationales about all previous exper-iments are given in App. C.2.

7

5. RELATED WORKRelated work broadly falls in the following three areas. Lan-

guages aimed at manipulating visual information. Kong et al. pre-sented [13] a formalism that uses the visual appearance of Webpages in order to extract relevant information from them. Such aformalism, named spatial graph grammars (SGGs), extends graphgrammars by incorporating spatial notions into language constructs.SGG productions are used to describe parts of Web pages by agraph representation. The representation includes spatial relationsof the following types: direction relations that describe an order inthe space, topological relations that express neighborhood and in-cidence, and distance relations such as near and far. The systemallows for extracting page contents of interest for the user, but it isquite complex in term of both usability and efficiency. Web wrap-per induction exploiting visual interfaces. Sahuguet et al presentedW4F system [22] that uses a SQL-like language named HEL. InW4F parts of the query in HEL can be generated using a visualwizard which returns the full XPath location path of DOM nodes.So the HEL language is unable to recognize when information ina Web page is presented by the same visual pattern but is repre-sented by different tag structures. So, the W4F system may benefitfrom using SXPath as more expressive basis for defining extrac-tion rules. Query languages for multimedia database and presen-tation. Lee et al. [14] and Adah et al. [2] represent multimedia andpresentation objects as direct acyclic graphs with spatiotemporalrelations. They adopt algebraic language augmented with ad hocoperators for querying spatial relations among objects. Unlike SX-Path, these languages are too complex and not suitable for queryingWeb pages.

6. CONCLUSION AND FUTURE WORKIn this paper we have extended XPath to include spatial navi-

gation into the query mechanism. We have used spatial algebrasto define new navigational primitives and mapped them for queryevaluation onto an extension of the XML document object model(DOM), i.e. the SDOM. Thus, we have given a formal model of theextended query language and have evaluated theoretical complex-ity. The theory has been implemented in a SXPath tool. Empiri-cal evaluation has been performed for testing its performances andfunctionality. Results on representative real Web pages have evi-denced practical applicability of SXPath. The language can still behandled efficiently, yet it is easier to use and allows for more gen-eral queries than pure XPath. The exploitation of spatial relationsamong data items perceived from the visual rendering allows forshifting parts of the information extraction problem from low levelinternal tag structures to the more abstract levels of visual patterns.

The SXPath query language that we have defined will be a step-ping stone for our future work on extracting information from Webpages. We argue that SXPath could improve both human and au-tomatic capabilities in querying and extracting information. Forinstance, the popular LIXTO system [10], which allows users tovisually define Web wrappers, uses XPath patterns and depends oninternal HTML tag structure. It could exploit SXPath in order toleverage visual patterns as desired by authors of [5].

7. REFERENCES[1] Acid Tests, http://www.acidtests.org. Web Standards Project.[2] S. Adali, M. L. Sapino, and V. S. Subrahmanian. An algebra

for creating and querying multimedia presentations.Multimedia Syst., 8(3):212–230, 2000.

[3] J. F. Allen. Maintaining knowledge about temporal intervals.Commun. ACM, 26(11):832–843, 1983.

[4] P. Balbiani, J.-F. Condotta, and L. F. d. Cerro. A newtractable subclass of the rectangle algebra. In IJCAI, pages

442–447, 1999.[5] R. Baumgartner, G. Gottlob, and M. Herzog. Scalable web

data extraction for online market intelligence. VLDB,2(2):1512–1523, 2009.

[6] M. Benedikt and C. Koch. Xpath leashed. ACM Comput.Surv., 41(1):1–54, 2008.

[7] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. Asurvey of web information extraction systems. TKDE,18(10):1411–1428, 2006.

[8] P. Eades and K. Sugiyama. How to draw a directed graph. J.Inf. Process., 13(4):424–437, 1990.

[9] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krupl, andB. Pollak. Towards domain-independent informationextraction from web tables. In WWW, pages 71–80, 2007.

[10] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, andS. Flesca. The lixto data extraction project: back and forthbetween theory and practice. In PODS, pages 1–12, 2004.

[11] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms forprocessing xpath queries. In VLDB, pages 95–106, 2002.

[12] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms forprocessing xpath queries. ACM Trans. Database Syst.,30(2):444–491, 2005.

[13] J. Kong, K. Zhang, and X. Zeng. Spatial graph grammars forgraphical user interfaces. ACM Trans. Comput.-Hum.Interact., 13(2):268–307, 2006.

[14] T. Lee, L. Sheng, T. Bozkaya, N. H. Balkir, Z. M.Ozsoyoglu, and G. Ozsoyoglu. Querying multimediapresentations based on content. TKDE, 11(3):361–385, 1999.

[15] L. Libkin. Elements Of Finite Model Theory. SpringerVerlag,2004.

[16] J. Madhavan, S. R. Jeffery, S. Cohen, X. . Dong, D. Ko,C. Yu, A. Halevy, and G. Inc. Web-scale data integration:You can only afford to pay as you go. In CIDR, 2007.

[17] M. Marx and M. de Rijke. Semantic characterizations ofnavigational xpath. SIGMOD Rec., 34(2):41–46, 2005.

[18] I. Navarrete and G. Sciavicco. Spatial reasoning withrectangular cardinal direction relations. In ECAI, pages 1–9,2006.

[19] N. K. Papadakis, D. Skoutas, K. Raftopoulos, and T. A.Varvarigou. Stavies: A system for information extractionfrom unknown web data sources through automatic webwrapper generation using clustering techniques. TKDE,17(12):1638–1652, 2005.

[20] P. Parys. Xpath evaluation in linear time with polynomialcombined complexity. In PODS, pages 55–64. ACM, 2009.

[21] J. Renz. Qualitative spatial reasoning with topologicalinformation. Springer, 2002.

[22] A. Sahuguet and F. Azavant. Building intelligent webapplications using lightweight wrappers. DKE,36(3):283–316, 2001.

[23] B. ten Cate and M. Marx. Axiomatizing the logical core ofxpath 2.0. Theor. Comp. Sys., 44(4):561–589, 2009.

[24] W3C, http://www.w3.org/XML/Query/. XML Query(XQuery), 1.0 edition.

[25] W3C, http://www.w3.org/TR/xpath. XML Path Language(XPath) Version 1.0, 1.0 edition, November 1999.

[26] P. Wadler. Two semantics for xpath. Draft: http://homepages.inf.ed.ac.uk/∼wadler/papers/xpath-semantics, 2000.

[27] Y. Zhai and B. Liu. Structured data extraction from the webbased on partial tree alignment. TKDE, 18(12):1614–1628,2006.

8

APPENDIXA. SPATIAL REASONING MODELS

The rectangle algebra model (RA), introduced in [4], has been obtainedby extending Allen’s temporal interval algebra (IA) [3] to the 2-dimensionalcase. IA models the relative spatial position between two segments and it isbased on 13 atomic spatial relationsRint = {b, m, o, s, d, f, bi,mi, oi,si, di, fi, e}, which are represented in Tab. 4. For instance, in the firstrow of Tab. 4 the IA relation s b s1 means that the segment s is precededby the segment s1.

Relation Symbol Meaning Inversebefore b s1zÐx zÐxs bi

meets m s1zÐxzÐxs mi

overlaps o s1zÐxzÐxs oi

starts s s1zxzÐxs si

during d s1zxzÐxs di

finish f s1zxzÐxs fi

equals e s1zÐxzÐxs e

Table 4: IA Relations.The set of possible atomic and conjuntive RA relations is called Rrec.

It is obtained by combining Rint relation on x and y-axes. Thus, Rrec =R2int contains 132 = 169 elements. Each cell in Fig. 7 depicts a RA re-

lation and contains, as strings, the corresponding RCRs and TopologicalRelations. Fig. 7 also shows that RA relations are invertible and representsa pictorial representation of the Mapping Function µ (Def. 7).

Figure 7: Pictorial Representation of the RA relations

B. PROOFSB.1 Proof of Theorem 1

In this section we prove that Core SXPath queries Q can be evaluated inO(∣D∣2 ∗ ∣Q∣) combined complexity (Theorem 1), where D is the XMLdocument. In order to prove the Theorem 1 we first: (i) define the axisfunction by extending the definition presented in [11] to both traditional andspatial SXPath axes; (ii) define the RA function; (iii) introduce the algorithmwhich compute the axis function.

Definition 17. Let χ denote an axis in ∆. The related axis function,which overloads the relation name χ ∶ 2V → 2V is defined as χ(Γ) ∶=

{u∣∃c ∈ Γ ∶ cχu} , where Γ ⊆ V is a set of nodes. Moreover, the inverseaxis function χ−1 ∶ 2V → 2V is defined as χ−1(Γ) ∶= {c ∈ V ∣χ({c}) ∩Γ ≠ ∅}.

Definition 18. Let ρ denote a RA relation inRrec. The related RA func-tion fρ ∶ Vv → 2Vv is defined as fρ(u) ∶= {v ∣ uρv ∧ u, v ∈ Vv ∧ ρ ∈Rrec}

The function fρ works on the SDOM (Def. 2) represented as a nesteddirect-access data structure. Such data structure for SDOM can be com-puted in a preprocessing step, which runs inO(∣Vv ∣2) and requiresO(∣Vv ∣2)more space than a standard DOM. It is noteworthy that fρ takes as input asingle node u ∈ Vv , accesses in constant time the hash set representing theSDOM, and returns the node set associated to u by the RA relation ρ inconstant time.

Lemma 1. Let ρ be an RA relation. For each pair of nodes u, v ∈ Vv ,u ρ v iff v ρ−1 u.

Let Γ ⊆ V be a set of nodes of a SDOM, let χt ∈ ∆t be an axis fromthe XPath language other than self, let χs ∈ ∆s be a spatial axis, let % bea set of RA relations, and let E(χt) denote a regular expression based onthe primitives firstchild and nextsibling as presented in [11, 12] that definesχt ∈ ∆t.

Algorithm 1. (Axis evaluation)Input: A set of nodes Γ and an axis χ ∈ ∆Output: χ(Γ)

Method: evalχ(Γ)

(1.1) function evalself (Γ) ∶= Γ.(1.2) function evalχt(Γ) ∶= evalE(χt)(Γ).(1.3) function evalχs(Γ) ∶= eval{ρi ∣ρi∈µ(χs)}(Γ).(1.4) function evalχ−1

s(Γ) ∶= eval{ρ−1

i∣ρi∈µ(χs)}(Γ).

(1.5) function eval%(Γ) begin(1.6) Γ′ ∶= ∅;(1.7) foreach u ∈ Γ ∩ u ∈ Vv do(1.8) foreach ρi ∈ % do(1.9) Γ′ ∶=Γ′ ∪set fρi(u)

odod

(1.10) return Γ′end.

The Algorithm 1 computes the set of nodes reached from a set of nodes Γby means of an axis χ. We distinguish between traditional and spatial axis.So, for traditional axes in (1.1) and (1.2) the algorithm requires timeO(∣V ∣)

[11]. Whereas, for spatial axis in (1.3) and (1.4), we have to consider themapping function µ that runs in constant time (see Def. 7) and returns a setof atomic/conjunctive RA relations % representing the axis given as input.The function eval% (1.5)-(1.10), for each input node u and for each RArelation ρ ∈ % takes the reached nodes using the function fρ . The union ofthe resulting set of nodes (1.9) runs in time O(∣Vv ∣), hence the total timerequired is O(∣V ∣2). Now, we are ready to prove the theorem 1.

PROOF. In [11] it is was shown that Core XPath 1.0 fragment and itsextensions with id axes, have linear combined complexity by mapping itsqueries to an algebraic expression. Likewise for Core XPath 1.0 and itsextension, every query that falls in the Core SXPath can be mapped in timeO(∣Q∣) to an algebraic expression Φ over the set of operations ∩,∪,¬, χ(the axis function) and its inverse χ−1 and V

root(Γ) ∶= {x ∈ V ∣root ∈ Γ}

(that is, returns V if root ∈ Γ and ∅ otherwise). In the query Q there areat most O(∣Q∣) of such operations. Each operation in our algebra can becomputed in time O(∣V ∣) except for the axis function that is computable inO(∣Vv ∣2) time bound in the spatial case as shown in Algorithm 1. Hence,the computation has time bound O(∣V ∣2 ∗ ∣Q∣).

B.2 Proof of Theorem 2 and 3In this section we prove first that Full SXPath query can be evaluated

in time O(∣D∣4 ∗ ∣Q∣2) and space O(∣D∣2 ∗ ∣Q∣2), where D is the XMLdocument, and Q is a Full SXPath query (Theorem 3). Then we exploitthe prof of Theorem 3 in order to prove that queries falling in the SpatialWadler Fragment can be evaluated in better time than full SXPath queries.

In order to obtain a polynomial-time combined complexity bound forFull SXPath query evaluation we use dynamic programming adopting theContext-Value Table (CV-Table) principle proposed in [11]. Given an ex-pression e belonging to an input query, the CV-Table of e specifies whichvalue is obtained given a valid context c: (c, εJeK(c)). The CV-Table ofeach expression is obtained combining the values of its subexpressions.Moreover, we adopt the simple idea, first presented in [26], that for evaluat-ing each expression just the necessary information of the context (relevantcontext) can be taken into account. Thus, the CV-Tables can be restricted.The relevant contexts of any expression e associated to a node q of Q canbe computed in a preprocessing step as follows: (i) If q is leaf node of thequery parse tree, the relevant context depends on e. If e is a constant, then

9

its evaluation does not depends on the input context and thus the relevantcontext is empty. If the expression is a location path or a positional func-tion, then the relevant context corresponds to the ith value of the context5that defines the semantics of the expression as defined in Def. 16. (ii) Oth-erwise, if e is a location path, then the relevant context of q is the contextnode. In the other cases, the relevant context of q is given by the union ofthe relevant context of children parse tree nodes (q1, ..., qk) of q. That is,RelevantContext(q) ∶= ∪ki=1RelevantContext(qi).

The context-value table principle avoids exponential time complexity be-cause it guarantees that no evaluation of the same subexpression for thesame context is done more than once. So it allows simultaneous evaluationfor all possible contexts (node). As in [12] where position and size are com-puted on demand, we compute all spatial position functions in a loop for allpairs of previous/current nodes. For evalauting SXPath location steps weuse the min context algorithm presented in [12] with the substantial differ-ence being the computation of location step and spatial position functions.Spatial position functions are effectively used only in predicates of locationsteps, so the complete context will be needed only for predicates of loca-tion steps that use spatial position functions. In the following, for lack ofspace, we describe only the most costly part of the location step evaluationalgorithm that computes complete contexts.

Algorithm 2. (Location step evaluation algorithm)Input: A set of nodes Γ and a location step

e = χ ∶∶ τ[e1] . . . [em]

Output: P JeK(Γ)Method: eval(e,Γ)

begin(2.1) Res ∶= ∅(2.2) W ∶= χ(Γ) ∩ T (τ);(2.3) for each u ∈ Γ do(2.4) W ′ ∶= {w ∣w ∈W ∧ u χ w}

(2.5) for each ei with 1 ⩽ i ⩽m (in ascending order) do(2.6) W ∶= layering(W ′)(2.7) W ′ ∶= {w ∣ w∈W ∧εJeiK(cw) = true ∧

cw ∶= ⟨w, idxχ(w, W ), ∣W ∣, pidx⩽↑(w, W ),

plast⩽↑(W ), pidx⩽→(w, W ), plast⩽→(W ),

pidx⩽↓(w, W ), plast⩽↓(W ), pidx⩽←(w, W ),

plast⩽←(W ), pidx⩽t(w, W )⟩}

od(2.8) Res ∶= Res ∪W ′

od(2.9) return Res

end;The Algorithm 2 corresponds to the semantic definition of location step

presented in Def. 15. But it considers only the most expensive case thatrequires the computation of the complete context. Such a case is sufficientfor proving the combined complexity of Full SXPath. We need to computecomplete context when all expressions e1 . . . em, in the input location step,require the evaluation of positions w.r.t. document and spatial orders. Asdetailed in [12], when expressions do not require evaluation of positionswe can pre-compute the CV-Tables because the relevant context is empty orcorresponds to a single node. In the instruction (2.2) we obtain the nodesreachable via χ ∶∶ τ . In (2.3)-(2.7) we select nodes that satisfy the predi-cates e1 . . . em. We have to compute the complete context c correspondingto any pair of previous and current nodes u and w, respectively. The instruc-tion (2.7) builds the context c that considers the position of the node w inW w.r.t. the document order and all spatial orders (see definition 13). Forobtaining spatial ordering we need to apply the layering function (instruc-tion (2.6)) to W (i.e. the set of nodes reached from u). Such a functionassigns to each node a layer in order to allow the functions pidx⩽z (u,Γ)

and plast⩽z (u,Γ) (see Def. 6) for computing spatial position of w in W .The layering for each spatial order (directional or topological) can be ob-tained by applying a topological layering algorithm [8]. For lack of spaceand because the layering for the other orders can be obtained in similar way,only the layering for the topological order is shown in the following.

Definition 19. Let Γ ⊆ Vv be a set of SDOM nodes, the topologicaldirected acyclic graph (TDAG) Gt = (Γ,At) is the graph obtained fromthe SDOM by considering RA relations that express containment amongnodes. So for each pair of nodes n,n′ in Γ, an arc is added to At iff n′ isspatially contained in n.5Given a context c then c[i] represents the ith value of the context(e.g., c[3] represents the context size w.r.t. the document order).

Definition 20. Let Γ ⊆ Vv be a set of SDOM nodes, and let Gt =

(Γ,At) be the corresponding TDAG, the topological layering Lt(Gt) =

{l1,⋯, lht} (Def. 5) is obtained applying to Gt a topological layering al-gorithm [8].

Layering algorithm runs in time O(∣Γ∣+ ∣At∣) by using appropriate datastructures with minor modifications to the standard topological sorting algo-rithm. Topological layers allow for defining a topological ordering amongnodes in Γ based on their spatial nesting. So, for example, the first layerrepresents nodes in Γ that are not contained in other nodes. The secondlayer represents nodes that are directly contained in nodes in the first layerat first level of nesting, and so on. The layering for each directional ordercan be obtained also by using optimized methods that work on a pre-layeredversion of the SDOM, not explained here for lack of space.

Having obtained the complete context c, the instruction (2.7) allows forcomputing the set of nodesW ′ reached from the current node u and that sat-isfy the predicates e1 . . . ei. For the current nodew, the value of εJeiK(cw)

is looked up from the table if cw exists in the CV-Table of ei, computedotherwise. The resulting nodeset (instruction (2.9)) is the union of nodesreached from each current node u in Γ and satisfying e1 . . . em predicates.Now, we are ready to prove the Theorem 3.

PROOF. Space complexity. In the preprocessing step we create the SDOMstructure, then in order to save the spatial relations O(∣Vv ∣2) additionalspace is required w.r.t. the XML document. During the query computation,we know that an input Full SXPath query Q has at most ∣Q∣ subexpres-sions, thus at most ∣Q∣ CV-Tables are required. We explicitly set up theCV-table for a subexpression e only if the relevant context of e correspondsto a node (i.e. {c[1]}), or to the empty set. Hence, the CV-Table has atmost ∣V ∣ rows (O(∣D∣)). Moreover, since Full SXPath and Full XPath 1.0have the same set of operations and return the same result types, then themost costly operation w.r.t. space size is concatenation on string [12]. WehaveO(∣D∣∗ ∣Q∣) maximal size for one entry value in the CV-Table. Hencewe obtain the bound O(∣D∣2 ∗ ∣Q∣2) space combined complexity.Time complexity. The SDOM computation, which costs O(∣V ∣2), precedesthe query evaluation step. An SXPath expression Q has at most ∣Q∣ subex-pressions that have to be evaluated. Each subexpression e of the input queryQ has to be evaluated for at most O(∣V ∣2) different contexts that can becomputed in a loop over all possible values c corresponding to pairs previ-ous/current of context-nodes (see algorithm 2). Moreover, it was shown in[12] that time required for computing each XPath operation on any contextc is bounded by O(∣D∣2 ∗ ∣Q∣). In our case, we add only spatial relationsand the position functions. The former ones, given in input only one contextc can be computed in time O(∣Vv ∣) (In fact, in this case the method eval%of the algorithm 1, has to run only for a node u (see instruction (1.7)). Givena context c, a positional function returns the value corresponding to the ithvalue of the context that defines the semantics of the expression (see Def.16). As shown in Algorithm 2 topological layering is needed (see instruc-tion (2.2) in the Algorithm 2) for computing the complete context, but thisoperation does not impact the worst case bound. Hence we have the timecombined complexity O(∣D∣4 ∗ ∣Q∣2).

In order to prove that queries falling in the Spatial Wadler Fragment canbe evaluated in better time than full SXPath queries (Theorem 2), we: (i)Consider the syntax defined by the grammar presented in Def. 12. (ii)Adopt restrictions described in [12] for EWF. Such restrictions imply thatfunctions which select data from the XML document, and boolean expres-sions that compare node sets are not allowed. Furthermore, boolean expres-sions of the type locpath relop sexpr or locpath relop nu-mber (see the formal grammar of SWF in Defnition 12) consider only num-bers and strings which size do not depend on the XML document (i.e. arevalues fixed in the query). (iii) Adopt the bottom-up/top-down query eval-uation strategy proposed in [12]. Such an evaluation strategy distinguishbetween outer and inner location paths, where an inner path appears withina predicate, whereas an outer path does not.

For showing computational complexity of SWF and for lack of space, wedescribe how the node set resulting for a SWF query can be computed byexploiting the Algorithm 2. In particular, outer location paths are computedby the Algorithm 2 as it is (top-down and forward evaluation). Whereas,inner location paths are bottom-up and backward computed. In this casethe evaluation algorithm starts from the last location step in a inner loca-tion path, and takes as initial input node set Γ either: all nodes in V (whenthe considered boolean expression coincides with the inner location pathitself), or the set of nodes that satisfy (taking into account above consid-erations) boolean expressions of the form locpath relop sexpr or

10

locpath relop number (see the formal grammar of SWF in Defni-tion 12). Then evaluation algorithm computes node sets as in the Algorithm2 where: in instruction (2.2) the axis is χ−1 instead of χ (backward evalu-ation), instruction (2.3) is for each previous u ∈W , and in instruction (2.4)current node w is in Γ. Having in mind above discussion we can prove theTheorem 2.

PROOF. Time complexity. An SWF query Q has at most ∣Q∣ subexpres-sions that have to be evaluated. Such subexpressions are computed by theAlgorithm 2 used as described in the above discussion for enabling bothtop-down and bottom-up evaluation. Location path evaluation is performedin time O(∣V ∣2) in instruction (2.2) of the Algorithm 2 (see Algorithm 1).For any expression e, the computation of the result value of e for a singlecontext c takes at mostO(∣Q∣) time (see restrictions presented above). Fur-thermore, we have O(∣V ∣2) contexts (instructions (2.3) and (2.7)) and atmost ∣Q∣ sub expressions, hence the total time required by each expressione ofQ is limited byO(∣V ∣2∗ ∣Q∣). However, let u be the node under evalu-ation, we have to compute the layering for reached nodes in order to obtainthe complete context (instruction (2.6) in the Algorithm 2). The layeringmethod costs O(∣V ∣2) and has to be computed for reach node under evalu-ation. So, we needO(∣V ∣3) time for computing all contexts (see operations(2.3)-(2.6) in Algorithm 2). Hence, the total computational complexity isbounded by O(max(∣V ∣3 ∗ ∣Q∣, ∣V ∣2 ∗ ∣Q∣2)). Since normally, V >> Qthen, the complexity bound follows.

B.3 Complexity ComparisonThe complexity results are summarized in Table B.3. These results are

compared with the fragment of XPath 1.0 that they extend.

XPath 1.0 SXPathSpace Core[11] O(∣D∣ ∗ ∣Q∣) Spatial O(∣D∣2 ∗ ∣Q∣)

Time O(∣D∣ ∗ ∣Q∣) Core O(∣D∣2 ∗ ∣Q∣)

Space EWF[12] O(∣D∣ ∗ ∣Q∣2) SWF O(∣D∣2 ∗ ∣Q∣2)

Time O(∣D∣2 ∗ ∣Q∣2) O(max(∣D∣3 ∗ ∣Q∣, ∣D∣2 ∗ ∣Q∣2))

Space Full[11] O(∣D∣2 ∗ ∣Q∣2) Full O(∣D∣2 ∗ ∣Q∣2)

Time Xpath 1.0 O(∣D∣4 ∗ ∣Q∣2) SXPath O(∣D∣4 ∗ ∣Q∣2)

Table 5: Comparison between complexity bound of SXPathand XPath 1.0 for a XML document D and a query Q.

C. EXPERIMENT RESULTSC.1 System Efficiency

For testing system efficiency we have considered: (i) a dummy Webpage that presents a table with 3 columns and an increasing number ofrows; (ii) 3 types of queries, falling in the SWF, based on: (a) traditionalaxes and position functions, (b) spatial axes (both directional and topologi-cal) and spatial position functions; (c) a mix of traditional and spatial axesand position functions. Queries were constructed using the following sim-ple patterns. For query types (a) and (c), the first query was given by:“/descendant::TD/following-sibling::TD[1]”. For querytype (b), the first query was given by: “/CD::TD/E::TD[W,1]”. The(i+1)-th query was constructed by appending the following location paths tothe i-th query: (a) “/following-sibling::TD[1]/preceding-sibling::TD[1]”; (b) “/E::TD[W,1]/W::TD[E,1]”, and (c) “/following-sibling::TD[1]/W::TD[E,1]”. For instance, the thirdquery (i = 2) for the spatial query type (b) was “/CD::TD/E::TD[W,1]/E::TD[W,1]/W::TD[E,1]/E::TD[W,1]/W::TD[E,1]”. All qu-eries aim at extracting the central column of the table in the page. They arebased on opposite axes (i.e. “at-east” and “at-west”), so they redundantlyjump back and forward within the input documents. Our rationale was thatby this way the query processor must handle many different spatial pathsin parallel coping thus with an intuitively “difficult” query. Fig. 8 and 10show that time needed for query evaluation is linear w.r.t. the document sizeand the query size respectively (curves are all straight lines in log log scaleand curves slopes are all 1). Fig. 9 shows the curve obtained for the SDOMconstruction. The obtained curve is a straight line in log log scale with slopeequal to 2 indicating the polynomial complexity of the whole system.

C.2 Evaluation of UsabilityExperiment 2. Tab. 6 reports results of the experiment aimed at as-

sessing the effort needed for learning SXPath and the user feeling in ap-plying the language. In particular, Tab. 6 gives: (i) in column 1 the users

102 103 104

101

102

103

Docs, Log |D|

Tim

e, L

og m

illise

c

(a) Traditional (c) Mix (b) Spatial

Figure 8: Linear-time Data Efficiency of SXPath Query Evaluation

102 103 104

101

102

103

104

Docs, Log |D|

Tim

e, L

og m

illise

c

SDOM

Figure 9: Quadratic-time Complexity of SDOM Construction

100 101

102

103

Query Log #repetitions

Tim

e, L

og m

illise

c

|D|=1000 |D|=3000 |D|=6000

Figure 10: Linear-time Query Efficiency of SXPath Query Evaluation

identifiers; (ii) in column 2 the time needed for learning SXPath and man-ually writing the assigned query; (iii) in columns 3 and 4 answers pro-vided by users to the questions“How easy is the SXPath language?” and“What is your level of satisfaction in using the SXPath language?” re-spectively. Possible answers have been designed as the following seven-item Likert scale: very easy/satisfactory (3), easy/satisfactory (2), quiteeasy/satisfactory (1), neutral (0), quite difficult/unsatisfactory (-1), diffi-cult/unsatisfactory (-2), very difficult/unsatisfactory (-3); (iv) in columns5 and 6 the number of attempts that each user needed for writing spa-tial location paths for names and prices respectively; (v) in the last tworows the average values and the standard deviations for all observed val-ues. Since the system prototype highlights nodes selected by a given SX-Path query on the screen, users were able to refine the query by visuallyverifying results for each attempt. For instance, user #9 has written the fol-lowing spatial location paths /CD::img/E::text[W,1][N,2], and/CD::img/E::text[N,1][W,2] for names and prices respectively,by using two attempts for both.

#user Time (min) Easiness/ Satisfaction/ #attemptsDifficulty Unsatisfaction name price

1 75 2 0 7 62 45 3 2 4 23 65 1 1 5 44 40 2 1 2 35 50 3 2 4 46 30 3 3 2 17 125 -1 -1 9 88 50 2 1 3 49 35 3 2 2 2

10 55 2 1 5 2Average 57 2 1.2 4.3 3.6σ 26 1.18 1.1 2.2 2

Table 6: Evaluation of the Effort Needed for Learn and ApplySXPath

11

Experiment 3. Table 7 reports results of the experiment aimed at find-ing out whether it is easier to specify an SXPath query than an XPath querygiven only the rendered Web page, but no information about the Web pageinternal structure. Tab. 7 gives: (i) the average number of pairs (productnames and prices) correctly extracted (“Cr.”) using SXPath and XPath incolumns 2 and 5 respectively; (ii) in columns 3 and 6 the correspondingaverage number of pairs wrongly (Wr.) extracted (false positive – FP/ falsenegative – FN); (iii) in columns 4, and 7, the average number of attemptsperformed by users for obtaining the most accurate results (the maximalnumber of attempts was fixed to 5); (iv) in the last two rows the average re-call and precision respectively. We have investigated the reasons underlyingsuch different levels of effectiveness and have come up with the followingobservations: (i) SXPath queries were constructed by using only spatialconstructs that allow for quering visual patterns existing among XML ele-ments that are leaf nodes in the DOM (e.g. img and text tags) and thathelp human readers to make sense of document contents; (ii) few errorsarise in SXPath caused by noise nodes that are visually very similar to thecorrect ones; (iii) data records in the dataset are constructed by using mainlythe tags: table, ul and div. When the tag used in the Web site is div,it is very difficult to guess a good pure XPath query; (iv) the average num-ber of attempts performed by users for obtaining the most accurate SXPathqueries (i.e. 2) has been much smaller than attempts required for XPathqueries (i.e. 5).

Querying QueryingWithout DOM/SDOM With DOM/SDOM

Deep Web Sites SXPath XPath SXPath Abs. XPath Rel. XPathCr. Wr. Att. Cr. Wr. Att. Att. Steps Att. Steps Att. Steps

amazon.com ⊙◇ 58 0/0 2.5 43 32.5/15 5 2.5 5.5 6.5 24 13 10bestbuy.com † 68 0/2 2.2 70 825/0 5 4.2 4.5 3 12.5 2.5 6bigtray.com ⊙◇ 125 5/0 2 125 130/0 5 4 4.5 3 10 6 7

bol.de † 60 0/0 2 60 15/0 5 2 4.5 3 16.5 3.5 6.5buy.com † 100 2/0 3.2 100 30/0 5 4.2 5.5 3 19 4.5 8ebay.com ◇ 258 0/0 3 258 60/0 5 3 5.5 4.5 21 9 7

mediaworld.it †⊙ 125 0/0 2 125 30/0 5 2 5 3.5 21.5 2 5shopzilla.co.uk †⊙◇ 100 0/0 1.5 100 937/0 5 1.5 4 6.5 23 4 9

apple.com †⊙ 50 0/0 1.4 50 0/0 5 1.4 4 4 14 2 4venere.com †⊙ 75 0/0 3.2 75 75/0 5 3.2 4.5 3 9 2.5 4powells.com ⊙ 125 0/0 1.5 125 247.5/0 5 1.5 5 3 11.5 2 4

barnesandnoble.com †⊙ 50 0/0 2.3 50 255/0 5 2.3 4.5 3 15.5 2.5 7shopping.yahoo.com †⊙ 75 0/0 1.7 75 187.5/0 5 1.7 4 3 21 3 5

cooking.com † ◇ 100 7.5/0 3.2 100 180/0 5 7.2 7.5 8 36 6 9cameraworld.com †⊙ 125 0/0 2 125 10/0 5 2 5.5 3 11.5 2 4.5

drugstore.com † ◇ 45 0/4 1.5 41 5/0 5 5.5 8 6.5 16.5 2.5 4.5magazineoutlet.com †⊙ 45 0/0 1 45 125/0 5 1 5 3 21.5 5 9

dealtime.com †⊙ 150 0/0 1 150 20/0 5 1 5.5 3 16 3.5 7borders.com †⊙ 125 0/0 1.6 125 40/0 5 1.6 6 3.5 18 4 6

google.com/products ⊙ 50 0/0 1.5 50 130/0 5 1.5 5.5 3.5 9.5 2.5 5.5nothingbutsoftware.com †⊙ 80 0.8/0 2.2 80 55/0 5 4.2 6.5 3.5 28 3 8

abt.com †⊙◇ 200 5/0 1.3 200 0/0 5 2.3 6.5 4.5 27 4 6cutleryandmore.com †⊙◇ 150 1/0 1.5 150 68/0 5 2.5 5 10 36 4 10

cnet.com †⊙◇ 150 0/0 2 130 0/20 5 2 4 4 17.5 5.5 8target.com † 50 0/0 3 50 2/0 5 3 5.5 3.5 16.5 2.5 5

Average 2 5 2.7 5.3 4.2 18.9 4 6.6Total 2535 27.3/6 2506 3459.5/35Recall 100% 99%

Precision 99% 42%

Table 7: Usability Evaluation of SXPath on Deep Web Pages.

Experiment 4. In this experiment we have evaluated the effort neededby users for obtaining sound and complete SXPath and pure XPath queries,giving to users the possibility to look at both DOM/SDOM and visualizedWeb page. For writing XPath queries we have provided users by the DOMExplorer of the SXPath System. DOM explorer visualizes the DOM of aWeb page as a tree and allows for selecting a node in the DOM by clickingon an element in the rendered Web page and vice versa. This way userswere provided with a visual facility for navigating DOMs of Web pages.Tab. 7 gives: (i) in column 8 and 10 the average number of attempts neededby all users for writing sound and complete XPath and SXPath queries re-spectively. For SXPath queries already sound and complete in Experiment3 users have not defined new queries, so we have reported in column 8 thevalue of attempts already observed in column 4. (ii) in column 9, and 11the average number of location steps in SXPath and pure XPath queriesrespectively. These numbers has been computed as the average betweenthe number of location steps in location paths that identify product namesand prices; (iii) in column 12, and 13 the average number of further at-tempts and the average number of location steps needed for expressing purerelative XPath queries respectively. We have observed that all users havewritten sound and complete: (i) XPath queries by using more attempts thanfor SXPath queries; (ii) SXPath queries by adding a pure XPath locationpath to not sound and complete queries obtained in Experiment 3. XPathLocation paths to add have been selected by users looking at the DOM

of the pages and choosing the best node to use as data record container.Furthermore, all users have noticed that for some Web sites (highlightedby “◇” in Tab. 7) it has been very difficult to find a sound and completepure XPath query because of the very intricate DOM structures that makenecessary long disjunctions of location paths. Such difficulties are due tothe fact that data records in these Web sites are contained in discontinuouspieces of the DOM (e.g. www.amazon.com), and that the tag structure rep-resenting a given data item can be different from a record to another (e.g.www.ebay.com), even though records have the same spatial arrangement inall selected pages.

The experiment shows that, nevertheless both languages have the sameeffectiveness, writing wrappers for Web pages by pure XPath is more com-plex than by SXPath. For manually writing pure XPath queries users haveneeded to deeply analyze the intricate DOM structure for all Web pages,whereas for writing SXPath queries users have mainly looked at the dis-played Web pages. Furthermore, tag structure variety and discontinuitiesobserved in pages DOM are phenomena already known in wrapper induc-tion literature [27, 10]. Such phenomena requires to induce disjunctions, sothey hinder both manual definition of Web wrappers and automatic learn-ing of extraction patterns based on XPath. At the contrary, SXPath allowsfor exploiting the regularity of visual patterns, this way data records hav-ing complex internal structures can be easily and completely described byvisually defined manual wrappers.

Experiment 5. In this experiment we aimed at qualitatively evaluatingwhether SXPath queries are more general and abstract than XPath queriesgiven different Web pages and the same extraction task. So, given standardbrowser settings for all Web sites, we have evaluated wether it is possi-ble to reuse a SXPath query written for a given Web site on other Websites that present the same information arranged by uniform visual pat-terns. Firstly, we have considered the subset of Web sites in the datasetthat show the same visual pattern for product names. In such Web pagesrecords are vertically listed. Each record is represented by the productimage that has at east more than one product attribute, the first attributefrom north is the product name. Such Web sites are highlighted by ”⊙”in column 1 of Tab. 7. We have observed that the spatial location path/CD::img[GS|GN::img][GE::*[W,1][N,2][self::text]]/GE::text[W,1][N,1] is able to identify product names in all theseWeb sites in a sound and complete way. In contrast, pure XPath loca-tion paths for the same subset of sites were all completely different andno reuse of code was possible. Secondly, we have asked users to extractby a SXPath query lists of friends in social networks listed in column 1 ofTab. 8. Table 8, also, gives: (i) in column 2 sound and complete queries de-fined by user #6 without looking at the (S)DOM; (ii) in column 3 the pureXPath location paths that allow for extracting friend names produced by theuser looking at the DOM. It is worthwhile noting that SXPath queries havethe following simple basic structure: /CD::text()[.=‘‘Amici" or.=‘‘Friends"]/CR::*[CR,1]/CD::img/GS::*[N,1]. The S-DOM node containing friends list is recognized as the container that eithercontains the string ‘‘Amici’’ or ‘‘Friends’’. Then friend names arespatially recognized as the first nodes that lie at least in the south tile (spatialaxis GS, and spatial predicates [N,1]) of an image. The variety of locationpaths produced by users, looking at the DOM, and listed in Tab. 8 indicatesthe large heterogeneity of internal tag structures observed for different so-cial networks. Hence, SXPath queries are reusable when information arepresented by the same visual pattern.

Social QueriesNetworks SXPath Pure XPathfacebook /CD::text()[.="Amici"]/ //div[@id=’profile friends box

CR::*[CR,1]/CD::img/GS::*[N,1] inner content’]/div//div/div/div/ayoutube /CD::text()[.="Amici"]/ //div[@id=’user friends’]//

CR::*[CR,2]/CD::img/GS::*[N,1] div[1]/div/center/anetlog /CD::text()[.="Amici"]/ //div[@id=’nicknameFriends’]/

CR::*[CR,2]/CD::img/GS::*[N,1] div/div/a[2]care /CD::text()[.="Friends"]/ //td[@id=’col right’]//

CR::*[CR,1]/CD::img/GS::*[N,1] table[1]/tbody/tr[1]/td/a[2]bebo /CD::text()[.="Friends"]/ //div[@id=’content Friend’]/

CR::*[CR,2]/CD::img/GS::*[N,1] ul[2]/li/span[2]/a

Table 8: Generality of SXPath Queries on Social Network Sites

AcknowledgementsWork done by Steffen Staab was partially funded by German National Sci-ence Foundation (DFG) in the project ’Multipla’. Work done by ErmelindaOro and Massimo Ruffolo was partially funded by Regione Calabria in theprojects ’Voucher Ricercatori’ and ’EasyPA - ID: 1220000277’.

12